
Hive (II)

Overview: Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to database tables and offers simple SQL-style querying, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: simple MapReduce-style statistics can be expressed with SQL-like statements, with no need to develop dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.

1. Using MySQL as Hive's Metastore

In practice, MySQL is usually used as Hive's metadata store. Hive ships with an embedded Derby database, but Derby only supports a single connection at a time. The following steps configure Hive to use MySQL instead.

  • -- create the hive user
  • mysql> create user 'hive' identified by 'hive';
  • Query OK, 0 rows affected (0.00 sec)
  • -- grant privileges to the user
  • mysql> grant all privileges on *.* to 'hive'@'%' with grant option;
  • Query OK, 0 rows affected (0.00 sec)
  • -- flush privileges
  • mysql> flush privileges;
  • Query OK, 0 rows affected (0.00 sec)
  • mysql> create database myhive default character set latin1;
  • Query OK, 1 row affected (0.00 sec)

1.1. Modifying the Configuration File

To connect Hive to MySQL, Hive must be told about the MySQL connection settings. They all live in hive-site.xml and are configured as follows:

  • <property>
  • <name>javax.jdo.option.ConnectionDriverName</name>
  • <value>com.mysql.jdbc.Driver</value>
  • </property>
  • <property>
  • <name>javax.jdo.option.ConnectionURL</name>
  • <value>jdbc:mysql://localhost:3306/myhive</value>
  • </property>
  • <property>
  • <name>javax.jdo.option.ConnectionUserName</name>
  • <value>hive</value>
  • </property>
  • <property>
  • <name>javax.jdo.option.ConnectionPassword</name>
  • <value>hive</value>
  • </property>

The four properties above set the JDBC driver class, the connection URL (which names the metastore database), and the username and password for the connection. After updating these four entries in hive-site.xml, we also need to place the MySQL JDBC driver jar under ${HIVE_HOME}/lib; here we use mysql-connector-java-5.0.8.jar.

1.2. Initializing the Schema

With the configuration in place, run schematool -initSchema -dbType mysql to initialize the schema. Before doing so, the myhive database must already exist in MySQL; we created it above, so this is not repeated here.

Note: the initialization attempt may fail with the following error:

  • ubuntu@s100:~$ schematool -initSchema -dbType mysql
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • Metastore connection URL: jdbc:mysql://192.168.127.1:3306/myhive
  • Metastore Connection Driver : com.mysql.jdbc.Driver
  • Metastore connection User: root
  • org.apache.hadoop.hive.metastore.HiveMetaException: Failed to get schema version.
  • Underlying cause: com.mysql.jdbc.CommunicationsException : Communications link failure due to underlying exception:
  • ** BEGIN NESTED EXCEPTION **
  • java.net.ConnectException
  • MESSAGE: Connection timed out
  • STACKTRACE:
  • java.net.ConnectException: Connection timed out
  • at java.net.PlainSocketImpl.socketConnect(Native Method)
  • at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
  • at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
  • at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
  • at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  • at java.net.Socket.connect(Socket.java:589)
  • at java.net.Socket.connect(Socket.java:538)
  • at java.net.Socket.<init>(Socket.java:434)
  • at java.net.Socket.<init>(Socket.java:244)
  • at com.mysql.jdbc.StandardSocketFactory.connect(StandardSocketFactory.java:256)
  • at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:271)
  • at com.mysql.jdbc.Connection.createNewIO(Connection.java:2771)
  • at com.mysql.jdbc.Connection.<init>(Connection.java:1555)
  • at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:285)
  • at java.sql.DriverManager.getConnection(DriverManager.java:664)
  • at java.sql.DriverManager.getConnection(DriverManager.java:247)
  • at org.apache.hive.beeline.HiveSchemaHelper.getConnectionToMetastore(HiveSchemaHelper.java:73)
  • at org.apache.hive.beeline.HiveSchemaTool.getConnectionToMetastore(HiveSchemaTool.java:120)
  • at org.apache.hive.beeline.HiveSchemaTool.testConnectionToMetastore(HiveSchemaTool.java:174)
  • at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:278)
  • at org.apache.hive.beeline.HiveSchemaTool.doInit(HiveSchemaTool.java:264)
  • at org.apache.hive.beeline.HiveSchemaTool.main(HiveSchemaTool.java:505)
  • at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  • at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  • at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  • at java.lang.reflect.Method.invoke(Method.java:497)
  • at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
  • at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
  • ** END NESTED EXCEPTION **
  • Last packet sent to the server was 2 ms ago.
  • SQL Error code: 0
  • Use --verbose for detailed stacktrace.
  • *** schemaTool failed ***

This is usually because remote login for the MySQL root user is not enabled. It can be enabled with the following statements (run against the mysql system database):

  • mysql> update user set host="%" where user='root';
  • Query OK, 0 rows affected (0.00 sec)
  • Rows matched: 1 Changed: 0 Warnings: 0
  • mysql> select host,user from user;
  • +------+------+
  • | host | user |
  • +------+------+
  • | % | root |
  • +------+------+
  • 1 row in set (0.00 sec)
  • mysql> flush privileges;
  • Query OK, 0 rows affected (0.00 sec)

If the connection still fails after the privileges have been flushed, the Windows firewall may be blocking port 3306; the following command opens it:

  • C:\Windows\system32>netsh advfirewall firewall add rule name="MySQL Server" action=allow protocol=TCP dir=in localport=3306
  • 确定

Note that this command must be run from an administrator prompt. Once connectivity is restored, rerun the initialization; it should now complete successfully:

  • ubuntu@s100:~$ schematool -initSchema -dbType mysql
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • Metastore connection URL: jdbc:mysql://192.168.127.1:3306/myhive
  • Metastore Connection Driver : com.mysql.jdbc.Driver
  • Metastore connection User: root
  • Starting metastore schema initialization to 2.1.0
  • Initialization script hive-schema-2.1.0.mysql.sql
  • Initialization script completed
  • schemaTool completed

After the schema has been initialized, the myhive database in MySQL contains a large number of new tables:

  • mysql> show databases;
  • +--------------------+
  • | Database |
  • +--------------------+
  • | information_schema |
  • | bigdata |
  • | ee19_crmdb |
  • | myhive |
  • | mysql |
  • | performance_schema |
  • | test |
  • +--------------------+
  • 7 rows in set (0.00 sec)
  • mysql> use myhive;
  • Database changed
  • mysql> show tables;
  • +---------------------------+
  • | Tables_in_myhive |
  • +---------------------------+
  • | aux_table |
  • | bucketing_cols |
  • | cds |
  • | columns_v2 |
  • | compaction_queue |
  • | completed_compactions |
  • | completed_txn_components |
  • | database_params |
  • | db_privs |
  • | dbs |
  • | delegation_tokens |
  • | func_ru |
  • | funcs |
  • | global_privs |
  • | hive_locks |
  • | idxs |
  • | index_params |
  • | key_constraints |
  • | master_keys |
  • | next_compaction_queue_id |
  • | next_lock_id |
  • | next_txn_id |
  • | notification_log |
  • | notification_sequence |
  • | nucleus_tables |
  • | part_col_privs |
  • | part_col_stats |
  • | part_privs |
  • | partition_events |
  • | partition_key_vals |
  • | partition_keys |
  • | partition_params |
  • | partitions |
  • | role_map |
  • | roles |
  • | sd_params |
  • | sds |
  • | sequence_table |
  • | serde_params |
  • | serdes |
  • | skewed_col_names |
  • | skewed_col_value_loc_map |
  • | skewed_string_list |
  • | skewed_string_list_values |
  • | skewed_values |
  • | sort_cols |
  • | tab_col_stats |
  • | table_params |
  • | tbl_col_privs |
  • | tbl_privs |
  • | tbls |
  • | txn_components |
  • | txns |
  • | type_fields |
  • | types |
  • | version |
  • | write_set |
  • +---------------------------+
  • 57 rows in set (0.00 sec)

Looking at the dbs table, it contains no data yet:

  • mysql> select * from dbs;
  • Empty set (0.00 sec)

Now create a database named hive1 in Hive:

  • ubuntu@s100:~$ hive
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • Logging initialized using configuration in jar:file:/soft/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
  • Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • hive> create database hive1;
  • OK
  • Time taken: 0.274 seconds

Querying dbs again shows a row for the hive1 database we just created:

  • mysql> select * from dbs;
  • +-------+-----------------------+------------------------------------------+---------+------------+------------+
  • | DB_ID | DESC | DB_LOCATION_URI | NAME | OWNER_NAME | OWNER_TYPE |
  • +-------+-----------------------+------------------------------------------+---------+------------+------------+
  • | 1 | Default Hive database | hdfs://s100/user/hive/warehouse | default | public | ROLE |
  • | 2 | NULL | hdfs://s100/user/hive/warehouse/hive1.db | hive1 | ubuntu | USER |
  • +-------+-----------------------+------------------------------------------+---------+------------+------------+
  • 2 rows in set (0.00 sec)

Next, create a table named mytable in the hive1 database:

  • hive> create table mytable(id int, name string, age int);
  • OK
  • Time taken: 0.507 seconds

The metadata for Hive tables is stored in the tbls table in MySQL, which records each table's basic information:

  • mysql> select * from tbls;
  • +--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
  • | TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
  • +--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
  • | 1 | 1498139809 | 1 | 0 | ubuntu | 0 | 1 | mytable | MANAGED_TABLE | NULL | NULL |
  • +--------+-------------+-------+------------------+--------+-----------+-------+----------+---------------+--------------------+--------------------+
  • 1 row in set (0.00 sec)

The table's column definitions are stored in MySQL's columns_v2 table:

  • mysql> select * from columns_v2;
  • +-------+---------+-------------+-----------+-------------+
  • | CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
  • +-------+---------+-------------+-----------+-------------+
  • | 1 | NULL | age | int | 2 |
  • | 1 | NULL | id | int | 0 |
  • | 1 | NULL | name | string | 1 |
  • +-------+---------+-------------+-----------+-------------+
  • 3 rows in set (0.00 sec)

Note that the CD_ID value ties these column definitions back to the mytable row in tbls (in the metastore schema the link goes through the sds table via SD_ID and CD_ID), so it effectively acts as a foreign key.
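Because of this link, the metastore tables can be joined directly in MySQL. Below is a minimal sketch, assuming the standard Hive 2.1 metastore layout in which tbls joins sds on SD_ID and sds joins columns_v2 on CD_ID, that lists every Hive table together with its columns:

  • -- list all Hive tables and their columns straight from the metastore
  • select t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME, c.INTEGER_IDX
  • from TBLS t
  • join SDS s on t.SD_ID = s.SD_ID
  • join COLUMNS_V2 c on s.CD_ID = c.CD_ID
  • order by t.TBL_NAME, c.INTEGER_IDX;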

2. API Operations

2.1. Basic Configuration and Testing

We can also operate on Hive from Java as a client. Before the demo, a few settings in Hive's hive-site.xml need to be changed, as follows:

  • <property>
  • <name>hive.server2.enable.doAs</name>
  • <value>false</value>
  • <description>Setting this property to true will have HiveServer2 execute Hive operations as the user making the calls to it.</description>
  • </property>
  • <property>
  • <name>hive.metastore.sasl.enabled</name>
  • <value>false</value>
  • <description>If true, the metastore Thrift interface will be secured with SASL. Clients must authenticate with Kerberos.</description>
  • </property>
  • <property>
  • <name>hive.server2.authentication</name>
  • <value>NONE</value>
  • <description>Expects one of [nosasl, none, ldap, kerberos, pam, custom].Client authentication types.NONE: no authentication check LDAP: LDAP/AD based authentication
  • KERBEROS: Kerberos/GSSAPI authentication CUSTOM: Custom authentication provider (Use with property hive.server2.custom.authentication.class) PAM: Pluggable authentication module NOSASL: Raw transport</description>
  • </property>

Then start Hive's hiveserver2 service, which listens on port 10000; the Java API talks to Hive through this port:

  • ubuntu@s100:~$ hive --service hiveserver2 start
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • ^Z
  • [1]+ Stopped hive --service hiveserver2 start
  • ubuntu@s100:~$ bg %1
  • [1]+ hive --service hiveserver2 start &
  • ubuntu@s100:~$ netstat -ano | grep 10000
  • tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN off (0.00/0/0)

Insert a row into the mytable table in the hive1 database:

  • hive> insert into mytable(id,name,age) values(1,'Tom',18);
  • ...
  • Time taken: 32.753 seconds

With the above in place, create a Maven project and add the Hive JDBC dependency to pom.xml:

  • <!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc -->
  • <dependency>
  • <groupId>org.apache.hive</groupId>
  • <artifactId>hive-jdbc</artifactId>
  • <version>2.1.0</version>
  • </dependency>

Then write the following Java code:

  • package com.coderap.hive;
  • import java.sql.Connection;
  • import java.sql.DriverManager;
  • import java.sql.PreparedStatement;
  • import java.sql.ResultSet;
  • public class App {
  • public static void main(String[] args) throws Exception {
  • Class.forName("org.apache.hive.jdbc.HiveDriver");
  • Connection connection = DriverManager.getConnection("jdbc:hive2://192.168.127.100:10000/hive1", "ubuntu",
  • "123456");
  • PreparedStatement prepareStatement = connection.prepareStatement("select * from mytable");
  • ResultSet resultSet = prepareStatement.executeQuery();
  • while (resultSet.next()) {
  • int id = resultSet.getInt("id");
  • String name = resultSet.getString("name");
  • int age = resultSet.getInt("age");
  • System.out.println("name : " + name);
  • System.out.println("age : " + age);
  • System.out.println("id : " + id);
  • }
  • resultSet.close();
  • prepareStatement.close();
  • connection.close();
  • }
  • }

Running this code prints:

  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/D:/Maven-Repository/org/apache/logging/log4j/log4j-slf4j-impl/2.4.1/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/D:/Maven-Repository/org/slf4j/slf4j-log4j12/1.6.1/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.
  • name : Tom
  • age : 18
  • id : 1

Note: the message ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console. appears because no log4j configuration file is on the classpath; a suitable configuration file can be found under ${HIVE_HOME}/conf.

2.2. CRUD Tests

Building on the experiment above, we can now exercise Hive through the Java API. Create a TestCRUD test class; since every test needs a connection, open it in an @Before method:

Note: the test class generated by the Maven archetype extends TestCase by default; remove that inheritance, otherwise the tests will fail.

  • package com.coderap.hive;
  • import java.sql.Connection;
  • import java.sql.DriverManager;
  • import java.sql.PreparedStatement;
  • import org.junit.Before;
  • public class TestCRUD {
  • private Connection connection;
  • @Before
  • public void initConn() throws Exception {
  • Class.forName("org.apache.hive.jdbc.HiveDriver");
  • connection = DriverManager.getConnection("jdbc:hive2://192.168.127.100:10000/hive1", "ubuntu",
  • "123456");
  • }
  • }

This way a connection is ready before each test runs.

  1. Creating a table

The code to create a table through the API is as follows:

  • @Test
  • public void insert() throws Exception {
  • PreparedStatement prepareStatement = connection.prepareStatement("create table hive1.users(id int, name string, age int)");
  • prepareStatement.execute();
  • prepareStatement.close();
  • connection.close();
  • }

After running the test, the result can be checked from the command line:

  • hive> show tables;
  • OK
  • mytable
  • users
  • values__tmp__table__1
  • Time taken: 0.119 seconds, Fetched: 3 row(s)
  • hive> describe hive1.users;
  • OK
  • id int
  • name string
  • age int
  • Time taken: 0.093 seconds, Fetched: 3 row(s)
  2. Batch-inserting data:
  • @Test
  • public void batchInsert() throws Exception {
  • PreparedStatement prepareStatement = connection.prepareStatement("insert into hive1.users(id,name,age) values(?,?,?)");
  • prepareStatement.setInt(1, 0);
  • prepareStatement.setString(2, "Tom");
  • prepareStatement.setInt(3, 20);
  • prepareStatement.executeUpdate();
  • prepareStatement.setInt(1, 1);
  • prepareStatement.setString(2, "Jack");
  • prepareStatement.setInt(3, 22);
  • prepareStatement.executeUpdate();
  • prepareStatement.setInt(1, 2);
  • prepareStatement.setString(2, "Rose");
  • prepareStatement.setInt(3, 21);
  • prepareStatement.executeUpdate();
  • prepareStatement.close();
  • connection.close();
  • }

Check the inserted rows:

  • hive> select * from hive1.users;
  • OK
  • 0 Tom 20
  • 1 Jack 22
  • 2 Rose 21
  • Time taken: 0.08 seconds, Fetched: 3 row(s)

The inserted data is actually written to HDFS:

  • ubuntu@s100:~$ hdfs dfs -ls -R /
  • ...
  • -rwxr-xr-x 3 ubuntu supergroup 9 2017-06-22 09:21 /user/hive/warehouse/hive1.db/users/000000_0
  • -rwxr-xr-x 3 ubuntu supergroup 10 2017-06-22 09:22 /user/hive/warehouse/hive1.db/users/000000_0_copy_1
  • -rwxr-xr-x 3 ubuntu supergroup 10 2017-06-22 09:22 /user/hive/warehouse/hive1.db/users/000000_0_copy_2
  • ...
  • ubuntu@s100:~$ hdfs dfs -cat /user/hive/warehouse/hive1.db/users/000000_0
  • 0Tom20
  • ubuntu@s100:~$ hdfs dfs -cat /user/hive/warehouse/hive1.db/users/000000_0_copy_1
  • 1Jack22
  • ubuntu@s100:~$ hdfs dfs -cat /user/hive/warehouse/hive1.db/users/000000_0_copy_2
  • 2Rose21

The fields are separated by Hive's default delimiter, which is not visible here.
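Hive's default field delimiter is the non-printing character \001 (Ctrl-A), which is why the values appear to run together in the cat output above. As a sketch, assuming we wanted human-readable files instead, the delimiters could be declared explicitly when creating the table:

  • -- hypothetical variant of the users table stored as tab-separated text
  • create table hive1.users_readable(id int, name string, age int)
  • row format delimited
  • fields terminated by '\t'
  • lines terminated by '\n'
  • stored as textfile;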

  3. Counting rows with count(*):
  • @Test
  • public void count() throws Exception {
  • PreparedStatement prepareStatement = connection.prepareStatement("select count(*) from hive1.users");
  • ResultSet resultSet = prepareStatement.executeQuery();
  • resultSet.next();
  • System.out.println(resultSet.getInt(1));
  • prepareStatement.close();
  • connection.close();
  • }

Running this code prints 3.

  4. Dropping the table:
  • @Test
  • public void dropTable() throws Exception {
  • PreparedStatement prepareStatement = connection.prepareStatement("drop table hive1.users");
  • prepareStatement.execute();
  • prepareStatement.close();
  • connection.close();
  • }

3. Beeline

Beeline relies on hiveserver2, so the hiveserver2 service must be started first. In the session below, the first attempt fails because the URL scheme was mistyped as jdbd instead of jdbc; the second session connects from within Beeline using the !connect command:

  • ubuntu@s100:/soft/hive/bin$ hive --service beeline -u jdbd:hive2://s100:10000/hive1
  • ls: cannot access /soft/hive/lib/hive-jdbc-*-standalone.jar: No such file or directory
  • scan complete in 1ms
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • scan complete in 1888ms
  • No known driver to handle "jdbd:hive2://s100:10000/hive1"
  • Beeline version 2.1.0 by Apache Hive
  • beeline>
  • ubuntu@s100:~$ beeline
  • ls: cannot access /soft/hive/lib/hive-jdbc-*-standalone.jar: No such file or directory
  • Beeline version 2.1.0 by Apache Hive
  • beeline> show databases;
  • No current connection
  • beeline> !connect jdbc:hive2://localhost:10000/hive1
  • Connecting to jdbc:hive2://localhost:10000/hive1
  • Enter username for jdbc:hive2://localhost:10000/hive1: ubuntu
  • Enter password for jdbc:hive2://localhost:10000/hive1: ******
  • Connected to: Apache Hive (version 2.1.0)
  • Driver: Hive JDBC (version 2.1.0)
  • 17/06/24 00:41:14 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
  • Transaction isolation: TRANSACTION_REPEATABLE_READ
  • 0: jdbc:hive2://localhost:10000/hive1>

Beeline can then be used to work with the databases:

  • 0: jdbc:hive2://localhost:10000/hive1> show databases;
  • +----------------+--+
  • | database_name |
  • +----------------+--+
  • | default |
  • | hive1 |
  • +----------------+--+
  • 2 rows selected (2.615 seconds)
  • 0: jdbc:hive2://localhost:10000/hive1> use hive1;
  • No rows affected (0.057 seconds)
  • 0: jdbc:hive2://localhost:10000/hive1> show tables;
  • +-----------+--+
  • | tab_name |
  • +-----------+--+
  • | mytable |
  • | users |
  • +-----------+--+
  • 2 rows selected (0.08 seconds)

4. Data Types in Hive

| Type | Description | Literal syntax examples | Java |
|------|-------------|-------------------------|------|
| TINYINT | 1-byte signed integer | 20 | byte |
| SMALLINT | 2-byte signed integer | 20 | short |
| INT | 4-byte signed integer | 20 | int |
| BIGINT | 8-byte signed integer | 20 | long |
| BOOLEAN | Boolean true or false | TRUE | boolean |
| FLOAT | Single-precision floating point | 3.14159 | float |
| DOUBLE | Double-precision floating point | 3.14159 | double |
| STRING | Sequence of characters; the character set can be specified; single or double quotes may be used | 'Now is the time', "for all good men" | String |
| TIMESTAMP (v0.8.0+) | Integer, float, or string | 1327882394 (Unix epoch seconds), 1327882394.123456789 (epoch seconds plus nanoseconds), '2012-02-03 12:34:56.123456789' (JDBC-compliant java.sql.Timestamp format) | timestamp |
| BINARY (v0.8.0+) | Array of bytes (see discussion below) | | byte array |
| STRUCT | Analogous to a C struct or an "object"; fields are accessed with dot notation, e.g. name.last | struct('John', 'Doe') | struct |
| MAP | Collection of key-value pairs; values are accessed with array notation, e.g. name['last'] | map('first', 'John', 'last', 'Doe') | Map |
| ARRAY | Ordered sequence of values of the same type, indexed from zero, e.g. name[1] | array('John', 'Doe') | array |
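To make the collection types concrete, here is a small sketch (the employees table and its columns are hypothetical, not taken from the text above) showing how STRUCT, MAP and ARRAY columns are declared and accessed:

  • -- hypothetical table using the collection types listed above
  • create table hive1.employees(
  •   name string,
  •   subordinates array<string>,
  •   deductions map<string, float>,
  •   address struct<street:string, city:string, zip:int>
  • );
  • -- arrays are indexed from zero, maps by key, structs with dot notation
  • select subordinates[0], deductions['tax'], address.city from hive1.employees;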

5. External Tables

By default, every table created is a managed (internal) table. Hive controls the lifecycle of a managed table, so operating on the table through Hive also affects its data; in particular, dropping a managed table deletes both its metadata and its data, which can be dangerous.

External tables behave differently: dropping an external table removes only its metadata, and the underlying data is left untouched.

  1. Creating an external table:
  • hive> create external table hive1.table3 like hive1.table2;
  • OK
  • Time taken: 0.078 seconds
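The statement above copies the schema of an existing table. More commonly, an external table is declared over data that already sits at a known HDFS path; a hedged sketch (the /data/persons path and the ext_persons name are made up for illustration):

  • -- point an external table at pre-existing data; dropping it later leaves the files in place
  • create external table hive1.ext_persons(id int, name string, age int)
  • row format delimited fields terminated by '\t'
  • location '/data/persons';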

Back in MySQL, we can inspect the metadata of the tables created so far:

  • mysql> select * from tbls;
  • +--------+-------------+-------+------------------+--------+-----------+-------+----------+----------------+--------------------+--------------------+
  • | TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
  • +--------+-------------+-------+------------------+--------+-----------+-------+----------+----------------+--------------------+--------------------+
  • | 1 | 1498139809 | 1 | 0 | ubuntu | 0 | 1 | mytable | MANAGED_TABLE | NULL | NULL |
  • | 18 | 1498297815 | 11 | 0 | ubuntu | 0 | 18 | table1 | MANAGED_TABLE | NULL | NULL |
  • | 19 | 1498299532 | 11 | 0 | ubuntu | 0 | 19 | table2 | MANAGED_TABLE | NULL | NULL |
  • | 20 | 1498300345 | 11 | 0 | ubuntu | 0 | 20 | table3 | EXTERNAL_TABLE | NULL | NULL |
  • +--------+-------------+-------+------------------+--------+-----------+-------+----------+----------------+--------------------+--------------------+
  • 4 rows in set (0.00 sec)

The TBL_TYPE value EXTERNAL_TABLE indicates that the table is an external table.

Note: MANAGED_TABLE denotes a managed (internal) table.

Insert some data into the external table:

  • hive> insert into hive1.table3(id,name,age) values(0,'Tom',20);
  • ...
  • hive> select * from hive1.table3;
  • OK
  • table3.id table3.name table3.age
  • 0 Tom 20
  • Time taken: 0.104 seconds, Fetched: 1 row(s)

Looking at HDFS, we can see table3's data files:

  • hive> dfs -ls -R /;
  • ...
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 04:00 /user/ubuntu/data/hive1.db/table3
  • -rwxr-xr-x 3 ubuntu supergroup 9 2017-06-24 04:00 /user/ubuntu/data/hive1.db/table3/000000_0
  • hive> dfs -cat /user/ubuntu/data/hive1.db/table3/000000_0;
  • 0Tom20

Now drop table3 and check HDFS again. Because table3 is an external table, its data has not been deleted; Hive removed only its metadata:

  • hive> drop table hive1.table3;
  • OK
  • Time taken: 0.125 seconds
  • hive> show tables;
  • OK
  • tab_name
  • table1
  • table2
  • values__tmp__table__1
  • values__tmp__table__2
  • Time taken: 0.018 seconds, Fetched: 4 row(s)
  • hive> dfs -ls -R /;
  • ...
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 04:00 /user/ubuntu/data/hive1.db/table3
  • -rwxr-xr-x 3 ubuntu supergroup 9 2017-06-24 04:00 /user/ubuntu/data/hive1.db/table3/000000_0
  • hive> dfs -cat /user/ubuntu/data/hive1.db/table3/000000_0;
  • 0Tom20

6. Partitioned Tables

A partitioned table in Hive stores its data split out by the values of one or more partition columns, such as date, city, or department. Partitioning makes it easy to query only part of the data: each partition is a separate directory, so partitions add extra storage structure that queries can exploit. For example, suppose a table holds person records with id, name, age, province, and city. To find people from a given province, a non-partitioned table would have to be scanned in full; if the data is partitioned by province and stored in separate files, the query only has to read the matching partition and processing time drops accordingly. The examples below show how partitions organize files and data.

6.1. Creating a Partitioned Table

Create a table partitioned by province and city:

  • hive> create table hive1.table3(id int, name string, age int) partitioned by (province string, city string) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
  • OK
  • Time taken: 0.101 seconds

We can view the detailed information of this partitioned table:

  • hive> desc formatted hive1.table3;
  • OK
  • col_name data_type comment
  • # col_name data_type comment
  • id int
  • name string
  • age int
  • # Partition Information
  • # col_name data_type comment
  • province string
  • city string
  • # Detailed Table Information
  • Database: hive1
  • Owner: ubuntu
  • CreateTime: Sat Jun 24 04:31:51 PDT 2017
  • LastAccessTime: UNKNOWN
  • Retention: 0
  • Location: hdfs://s100/user/ubuntu/data/hive1.db/table3
  • Table Type: MANAGED_TABLE
  • Table Parameters:
  • transient_lastDdlTime 1498303911
  • # Storage Information
  • SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  • InputFormat: org.apache.hadoop.mapred.TextInputFormat
  • OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  • Compressed: No
  • Num Buckets: -1
  • Bucket Columns: []
  • Sort Columns: []
  • Storage Desc Params:
  • field.delim \t
  • line.delim \n
  • serialization.format \t
  • Time taken: 0.066 seconds, Fetched: 35 row(s)

6.2. Loading Data into a Partitioned Table

Create a persons file under ~/software with the following contents:

  • ubuntu@s100:~/software$ cat persons
  • 0 Tom 18
  • 1 Jack 20
  • 2 Jerry 22

Then use LOAD DATA to load it, specifying the target partition:

  • hive> load data local inpath 'software/persons' into table hive1.table3 partition(province='CA', city='Sacramento');
  • Loading data to table hive1.table3 partition (province=CA, city=Sacramento)
  • OK
  • Time taken: 0.571 seconds
  • hive> dfs -ls -R /;
  • ...
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 04:44 /user/ubuntu/data/hive1.db/table3
  • -rwxr-xr-x 3 ubuntu supergroup 9 2017-06-24 04:00 /user/ubuntu/data/hive1.db/table3/000000_0
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 04:44 /user/ubuntu/data/hive1.db/table3/province=CA
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 04:44 /user/ubuntu/data/hive1.db/table3/province=CA/city=Sacramento
  • -rwxr-xr-x 3 ubuntu supergroup 30 2017-06-24 04:44 /user/ubuntu/data/hive1.db/table3/province=CA/city=Sacramento/persons

Note that the loaded data is stored in HDFS in one directory per partition. Querying the table now returns:

  • hive> select * from hive1.table3;
  • OK
  • table3.id table3.name table3.age table3.province table3.city
  • 0 Tom 18 CA Sacramento
  • 1 Jack 20 CA Sacramento
  • 2 Jerry 22 CA Sacramento
  • Time taken: 0.139 seconds, Fetched: 3 row(s)

The partition columns show up in the result set as ordinary columns.
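Because the partition columns behave like regular columns in queries, they can also be used for filtering and aggregation. For example, a quick sketch that counts the rows in each partition:

  • -- count rows per (province, city) partition
  • select province, city, count(*) from hive1.table3 group by province, city;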

6.3. Querying Partitioned Data

Load another batch of data into a second partition:

  • hive> load data local inpath 'software/persons2' into table hive1.table3 partition(province='WA', city='Olympia');
  • Loading data to table hive1.table3 partition (province=WA, city=Olympia)
  • OK
  • Time taken: 0.498 seconds
  • hive> select * from hive1.table3;
  • OK
  • table3.id table3.name table3.age table3.province table3.city
  • 0 Tom 18 CA Sacramento
  • 1 Jack 20 CA Sacramento
  • 2 Jerry 22 CA Sacramento
  • 3 Polo 21 WA Olympia
  • 4 Mick 29 WA Olympia
  • 5 John 32 WA Olympia
  • Time taken: 0.11 seconds, Fetched: 6 row(s)

We can then query by partition:

  • hive> select * from hive1.table3 where province='CA';
  • OK
  • table3.id table3.name table3.age table3.province table3.city
  • 0 Tom 18 CA Sacramento
  • 1 Jack 20 CA Sacramento
  • 2 Jerry 22 CA Sacramento
  • Time taken: 0.407 seconds, Fetched: 3 row(s)

Although the query filters with what looks like an ordinary column predicate, Hive actually prunes by partition directory, skipping directories that cannot match and thereby avoiding useless scanning, which greatly reduces the cost of the query. The same queries can also be issued through the Java API:

  • @Test
  • public void selectAll() throws Exception {
  • PreparedStatement prepareStatement = connection.prepareStatement("select * from hive1.table3");
  • ResultSet resultSet = prepareStatement.executeQuery();
  • while (resultSet.next()) {
  • int id = resultSet.getInt(1);
  • String name = resultSet.getString(2);
  • int age = resultSet.getInt(3);
  • String province = resultSet.getString(4);
  • String city = resultSet.getString(5);
  • System.out.println(id + "\t" + name + "\t" + age + "\t" + province + "\t" + city);
  • }
  • prepareStatement.close();
  • connection.close();
  • }

Running this code prints:

  • 0 Tom 18 CA Sacramento
  • 1 Jack 20 CA Sacramento
  • 2 Jerry 22 CA Sacramento
  • 3 Polo 21 WA Olympia
  • 4 Mick 29 WA Olympia
  • 5 John 32 WA Olympia

Adding a partition predicate to the Java API query:

  • @Test
  • public void selectPartiton() throws Exception {
  • PreparedStatement prepareStatement = connection.prepareStatement("select * from hive1.table3 where province='CA'");
  • ResultSet resultSet = prepareStatement.executeQuery();
  • while (resultSet.next()) {
  • int id = resultSet.getInt(1);
  • String name = resultSet.getString(2);
  • int age = resultSet.getInt(3);
  • String province = resultSet.getString(4);
  • String city = resultSet.getString(5);
  • System.out.println(id + "\t" + name + "\t" + age + "\t" + province + "\t" + city);
  • }
  • prepareStatement.close();
  • connection.close();
  • }

Running this code prints:

  • 0 Tom 18 CA Sacramento
  • 1 Jack 20 CA Sacramento
  • 2 Jerry 22 CA Sacramento

6.4. Strict and Nonstrict Modes

When the amount of data in Hive is very large, a query that does not filter on a partition will spawn a very expensive MapReduce job. To guard against this, Hive can be switched to strict mode, in which queries over partitioned tables without a partition filter are not allowed:

  • hive> set hive.mapred.mode=strict;
  • hive> select * from hive1.table3;
  • FAILED: SemanticException Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please make sure that hive.strict.checks.large.query is set to false and that hive.mapred.mode is not set to 'strict' to enable them. No partition predicate for Alias "table3" Table "table3"

As shown, in strict mode a query without a partition filter is rejected, while a query that filters on a partition still works:

  • hive> select * from hive1.table3 where province='WA';
  • OK
  • table3.id table3.name table3.age table3.province table3.city
  • 3 Polo 21 WA Olympia
  • 4 Mick 29 WA Olympia
  • 5 John 32 WA Olympia
  • Time taken: 0.119 seconds, Fetched: 3 row(s)

6.5. Viewing Partition Information

All partitions of a table can be listed with:

  • hive> show partitions hive1.table3;
  • OK
  • partition
  • province=CA/city=Sacramento
  • province=WA/city=Olympia
  • Time taken: 0.106 seconds, Fetched: 2 row(s)

We can also list the child partitions under a specific partition:

  • hive> show partitions hive1.table3 partition(province='CA');
  • OK
  • partition
  • province=CA/city=Sacramento
  • Time taken: 0.089 seconds, Fetched: 1 row(s)

6.6. Adding Partitions

A partition can be added to an existing table with:

  • hive> alter table hive1.table3 add partition(province='VA', city='Richmond');
  • OK
  • Time taken: 0.118 seconds
  • hive> dfs -ls -R /;
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 04:44 /user/ubuntu/data/hive1.db/table3/province=CA
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 04:44 /user/ubuntu/data/hive1.db/table3/province=CA/city=Sacramento
  • -rwxr-xr-x 3 ubuntu supergroup 30 2017-06-24 04:44 /user/ubuntu/data/hive1.db/table3/province=CA/city=Sacramento/persons
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 05:51 /user/ubuntu/data/hive1.db/table3/province=VA
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 05:51 /user/ubuntu/data/hive1.db/table3/province=VA/city=Richmond
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 05:17 /user/ubuntu/data/hive1.db/table3/province=WA
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 05:17 /user/ubuntu/data/hive1.db/table3/province=WA/city=Olympia
  • -rwxr-xr-x 3 ubuntu supergroup 30 2017-06-24 05:17 /user/ubuntu/data/hive1.db/table3/province=WA/city=Olympia/persons2

6.7. Copying Data into a Partitioned Table

Rows from another table can be copied into a specified partition of the current table:

  • hive> select * from table2;
  • OK
  • table2.id table2.name table2.age
  • 0 Tom 20
  • 1 Tom1 22
  • 1 Tom1 24
  • 3 Tom2 26
  • Time taken: 0.048 seconds, Fetched: 4 row(s)
  • hive> insert into hive1.table3 partition(province='WA',city='Olympia') select * from hive1.table2 where age>22;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170624062259_322f1c13-2642-4566-9ebf-9a0d019c3fce
  • Total jobs = 3
  • Launching Job 1 out of 3
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498289214510_0007, Tracking URL = http://s100:8088/proxy/application_1498289214510_0007/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498289214510_0007
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
  • 2017-06-24 06:23:06,881 Stage-1 map = 0%, reduce = 0%
  • 2017-06-24 06:23:18,288 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.11 sec
  • MapReduce Total cumulative CPU time: 3 seconds 110 msec
  • Ended Job = job_1498289214510_0007
  • Stage-4 is filtered out by condition resolver.
  • Stage-3 is selected by condition resolver.
  • Stage-5 is filtered out by condition resolver.
  • Launching Job 3 out of 3
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498289214510_0008, Tracking URL = http://s100:8088/proxy/application_1498289214510_0008/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498289214510_0008
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-24 06:23:28,323 Stage-3 map = 0%, reduce = 0%
  • 2017-06-24 06:23:33,792 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.97 sec
  • MapReduce Total cumulative CPU time: 970 msec
  • Ended Job = job_1498289214510_0008
  • Loading data to table hive1.table3 partition (province=WA, city=Olympia)
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Cumulative CPU: 3.11 sec HDFS Read: 8794 HDFS Write: 206 SUCCESS
  • Stage-Stage-3: Map: 1 Cumulative CPU: 0.97 sec HDFS Read: 2536 HDFS Write: 20 SUCCESS
  • Total MapReduce CPU Time Spent: 4 seconds 80 msec
  • OK
  • table2.id table2.name table2.age
  • Time taken: 35.918 seconds
  • hive> select * from hive1.table3 where province="WA";
  • OK
  • table3.id table3.name table3.age table3.province table3.city
  • 1 Tom1 24 WA Olympia
  • 3 Tom2 26 WA Olympia
  • 3 Polo 21 WA Olympia
  • 4 Mick 29 WA Olympia
  • 5 John 32 WA Olympia
  • Time taken: 0.094 seconds, Fetched: 5 row(s)

Note: if the two tables have a different number of columns, use a projection query that selects exactly the columns the target table expects, as sketched below.
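For example, a sketch that mirrors the copy above but projects only the columns table3 defines, ignoring any extra columns in the source:

  • -- select exactly the columns table3 expects (id, name, age)
  • insert into hive1.table3 partition(province='WA', city='Olympia')
  • select id, name, age from hive1.table2 where age > 22;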

In the statements above we still had to name the target partition by hand. With dynamic partitioning, the partition values are taken from the selected data and the partitions are added automatically:

  • hive> insert into hive1.table3 partition(province,city) select id,name,age,'FL' as province,'Tallahassee' as city from hive1.table2;
  • FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

This fails, however, because hive.exec.dynamic.partition.mode defaults to strict; we first need to switch it to nonstrict:

  • hive> set hive.exec.dynamic.partition.mode=nonstrict;
  • hive> insert into hive1.table3 partition(province,city) select id,name,age,'FL' as province,'Tallahassee' as city from hive1.table2;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625010042_02365247-cd8d-4c37-abc0-9c83f10f5c52
  • Total jobs = 3
  • Launching Job 1 out of 3
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0001, Tracking URL = http://s100:8088/proxy/application_1498376867650_0001/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0001
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
  • 2017-06-25 01:00:57,704 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 01:01:06,455 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.19 sec
  • 2017-06-25 01:01:07,551 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.44 sec
  • MapReduce Total cumulative CPU time: 2 seconds 440 msec
  • Ended Job = job_1498376867650_0001
  • Stage-4 is filtered out by condition resolver.
  • Stage-3 is selected by condition resolver.
  • Stage-5 is filtered out by condition resolver.
  • Launching Job 3 out of 3
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0002, Tracking URL = http://s100:8088/proxy/application_1498376867650_0002/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0002
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 01:01:19,012 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 01:01:25,386 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.95 sec
  • MapReduce Total cumulative CPU time: 950 msec
  • Ended Job = job_1498376867650_0002
  • Loading data to table hive1.table3 partition (province=null, city=null)
  • Loaded : 1/1 partitions.
  • Time taken to load dynamic partitions: 0.487 seconds
  • Time taken for adding to write entity : 0.002 seconds
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Cumulative CPU: 2.44 sec HDFS Read: 8988 HDFS Write: 233 SUCCESS
  • Stage-Stage-3: Map: 1 Cumulative CPU: 0.95 sec HDFS Read: 2999 HDFS Write: 39 SUCCESS
  • Total MapReduce CPU Time Spent: 3 seconds 390 msec
  • OK
  • Time taken: 44.483 seconds
  • hive> select * from hive1.table3;
  • OK
  • 0 Tom 18 CA Sacramento
  • 1 Jack 20 CA Sacramento
  • 2 Jerry 22 CA Sacramento
  • 0 Tom 20 FL Tallahassee
  • 1 Tom1 22 FL Tallahassee
  • 1 Tom1 24 FL Tallahassee
  • 3 Tom2 26 FL Tallahassee
  • 1 Tom1 24 WA Olympia
  • 3 Tom2 26 WA Olympia
  • 3 Polo 21 WA Olympia
  • 4 Mick 29 WA Olympia
  • 5 John 32 WA Olympia
  • Time taken: 0.145 seconds, Fetched: 12 row(s)

Notice that we never created the corresponding partition on table3: with dynamic partitioning, missing partitions are created automatically while the data is copied.

Note: a mixed static/dynamic form also works, but it can only add new child partitions under an existing, statically specified parent partition:

  • insert into hive1.table3 partition(province='CA',city) select id,name,age,'CA' as province,'Hollywood' as city from hive1.table2;

We can also create a new table directly from an existing one, copying the rows of a particular partition (CREATE TABLE ... AS SELECT):

  • hive> create table hive1.table4 as select id,name,age from hive1.table3 where province='CA' and city='Sacramento';
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625013153_0f26269a-2826-481a-b3bd-de57d07dfbd0
  • Total jobs = 3
  • Launching Job 1 out of 3
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0003, Tracking URL = http://s100:8088/proxy/application_1498376867650_0003/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0003
  • Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
  • 2017-06-25 01:32:03,988 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 01:32:11,504 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.31 sec
  • MapReduce Total cumulative CPU time: 1 seconds 310 msec
  • Ended Job = job_1498376867650_0003
  • Stage-4 is selected by condition resolver.
  • Stage-3 is filtered out by condition resolver.
  • Stage-5 is filtered out by condition resolver.
  • Moving data to directory hdfs://s100/user/ubuntu/data/hive1.db/.hive-staging_hive_2017-06-25_01-31-53_381_1554904174075424403-1/-ext-10002
  • Moving data to directory hdfs://s100/user/ubuntu/data/hive1.db/table4
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 1 Cumulative CPU: 1.31 sec HDFS Read: 3838 HDFS Write: 98 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 310 msec
  • OK
  • Time taken: 19.689 seconds
  • hive> select * from hive1.table4;
  • OK
  • 0 Tom 18
  • 1 Jack 20
  • 2 Jerry 22
  • Time taken: 0.064 seconds, Fetched: 3 row(s)
