
Hive (Part 1)


1. Hive

Hive is a data warehouse tool built on top of Hadoop. It maps structured data files onto database tables and provides a simple SQL query capability by translating SQL statements into MapReduce jobs. Its main advantage is a low learning curve: simple MapReduce-style statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.

1.1. Installing Hive

We take Hive 2.1.0 as an example. The installation consists of the following steps:

  1. Download the .tar.gz package and extract it to the installation directory:
  • ubuntu@s100:~/software$ tar -zxf apache-hive-2.1.0-bin.tar.gz -C /soft/
  • ubuntu@s100:~/software$ cd /soft/
  • ubuntu@s100:/soft$ ln -s apache-hive-2.1.0-bin hive
  • ubuntu@s100:/soft$ ll
  • total 24
  • drwxr-xr-x 6 ubuntu ubuntu 4096 Jun 21 06:13 ./
  • drwxr-xr-x 24 root root 4096 Jun 20 04:43 ../
  • drwxrwxr-x 9 ubuntu ubuntu 4096 Jun 21 06:13 apache-hive-2.1.0-bin/
  • lrwxrwxrwx 1 ubuntu ubuntu 18 Jun 21 05:50 hadoop -> /soft/hadoop-2.7.2/
  • drwxr-xr-x 10 ubuntu ubuntu 4096 Jun 21 05:59 hadoop-2.7.2/
  • lrwxrwxrwx 1 ubuntu ubuntu 21 Jun 21 06:13 hive -> apache-hive-2.1.0-bin/
  • lrwxrwxrwx 1 ubuntu ubuntu 11 Jun 20 04:44 jdk -> jdk1.8.0_65/
  • drwxr-xr-x 8 ubuntu ubuntu 4096 Oct 6 2015 jdk1.8.0_65/

Note: for convenience, we create a symbolic link named hive to work with.

  2. Configure the environment variables by editing the /etc/environment file with the following contents:
  • JAVA_HOME=/soft/jdk
  • HADOOP_HOME=/soft/hadoop
  • HIVE_HOME=/soft/hive
  • PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/soft/jdk/bin:/soft/hadoop/bin:/soft/hadoop$

After the environment variables are configured, we can check the Hive version:

  • ubuntu@s100:~$ hive --version
  • Hive 2.1.0
  • Subversion git://jcamachguezrMBP/Users/jcamachorodriguez/src/workspaces/hive/HIVE-release2/hive -r 9265bc24d75ac945bde9ce1a0999fddd8f2aae29
  • Compiled by jcamachorodriguez on Fri Jun 17 01:03:25 BST 2016
  • From source with checksum 1f896b8fae57fbd29b047d6d67b75f3c
  3. Modify Hive's configuration files. Two files need to be changed:

The first is the hive-env.sh file under the ${hive}/conf directory. A template for it is provided, so we only need to copy the template, set the Hadoop environment variable in the copy, and give it execute permission:

  • ubuntu@s100:/soft/hive$ cd conf/
  • ubuntu@s100:/soft/hive/conf$ cp hive-env.sh.template hive-env.sh
  • ubuntu@s100:/soft/hive/conf$ nano hive-env.sh

Add the Hadoop environment variable to hive-env.sh:

  • ...
  • HADOOP_HOME=/soft/hadoop
  • ...

Then copy hive-default.xml.template to hive-site.xml:

  • ubuntu@s100:/soft/hive/conf$ cp hive-default.xml.template hive-site.xml

After copying it, try running the hive command:

  • ubuntu@s100:/soft/hive/conf$ cd ~
  • ubuntu@s100:~$ hive
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • Logging initialized using configuration in jar:file:/soft/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
  • Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Hive metastore database is not initialized. Please use schematool (e.g. ./schematool -initSchema -dbType ...) to create the schema. If needed, don't forget to include the option to auto-create the underlying database in your JDBC connection string (e.g. ?createDatabaseIfNotExist=true for mysql))
  • at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:578)
  • at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:518)
  • at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
  • at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
  • at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  • at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  • at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  • at java.lang.reflect.Method.invoke(Method.java:497)
  • at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
  • at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
  • ...

As we can see, the command fails. The key part of the error is please use schematool (e.g. ./schematool -initSchema -dbType ...) to create the schema., which means the schema must be initialized before Hive can run. By default Hive keeps its metadata in an RDBMS, and it ships with an embedded Derby database that we can use directly:

  • ubuntu@s100:~$ schematool -initSchema -dbType derby
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
  • Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
  • Metastore connection User: APP
  • Starting metastore schema initialization to 2.1.0
  • Initialization script hive-schema-2.1.0.derby.sql
  • Initialization script completed
  • schemaTool completed

Note: the initialization may fail with the following error:

  • ubuntu@s100:~$ schematool -initSchema -dbType derby
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • Metastore connection URL: jdbc:derby:;databaseName=metastore_db;create=true
  • Metastore Connection Driver : org.apache.derby.jdbc.EmbeddedDriver
  • Metastore connection User: APP
  • Starting metastore schema initialization to 2.1.0
  • Initialization script hive-schema-2.1.0.derby.sql
  • Error: FUNCTION 'NUCLEUS_ASCII' already exists. (state=X0Y68,code=30000)
  • org.apache.hadoop.hive.metastore.HiveMetaException: Schema initialization FAILED! Metastore state would be inconsistent !!
  • Underlying cause: java.io.IOException : Schema script failed, errorcode 2
  • Use --verbose for detailed stacktrace.
  • *** schemaTool failed ***

In that case, try deleting the metastore_db directory under the directory where the initialization was run and re-initialize, for example:
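A minimal recovery sketch; since schematool was run from the home directory above, the embedded Derby files were created in ~/metastore_db:

  • ubuntu@s100:~$ rm -rf ~/metastore_db
  • ubuntu@s100:~$ schematool -initSchema -dbType derby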

Run the hive command again:

  • ubuntu@s100:~$ hive
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • Logging initialized using configuration in jar:file:/soft/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
  • Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
  • at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  • at org.apache.hadoop.fs.Path.<init>(Path.java:171)
  • at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:631)
  • at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:550)
  • at org.apache.hadoop.hive.ql.session.SessionState.beginStart(SessionState.java:518)
  • at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
  • at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
  • at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  • at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  • at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  • at java.lang.reflect.Method.invoke(Method.java:497)
  • at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
  • at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
  • Caused by: java.net.URISyntaxException: Relative path in absolute URI: ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D
  • at java.net.URI.checkPath(URI.java:1823)
  • at java.net.URI.<init>(URI.java:745)
  • at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  • ... 12 more

There is still an error. The fragment ${system:java.io.tmpdir%7D/$%7Bsystem:user.name%7D in the message points to invalid configuration values: several properties in hive-site.xml still need to be adjusted. Replace ${system:java.io.tmpdir} with /home/ubuntu/hive and ${system:user.name} with ubuntu:

  • ubuntu@s100:~$ grep system:java.io /soft/hive/conf/hive-site.xml
  • <value>${system:java.io.tmpdir}/${system:user.name}</value>
  • <value>${system:java.io.tmpdir}/${hive.session.id}_resources</value>
  • <value>${system:java.io.tmpdir}/${system:user.name}</value>
  • <value>${system:java.io.tmpdir}/${system:user.name}/operation_logs</value>
  • ubuntu@s100:~$ sed -i s#'${system:java.io.tmpdir}'#/home/ubuntu/hive#g /soft/hive/conf/hive-site.xml
  • ubuntu@s100:~$ grep system:java.io /soft/hive/conf/hive-site.xml
  • ubuntu@s100:~$ sed -i s#'${system:user.name}'#ubuntu#g /soft/hive/conf/hive-site.xml
  • ubuntu@s100:~$ grep '/home/ubuntu/hive' /soft/hive/conf/hive-site.xml
  • <value>/home/ubuntu/hive/ubuntu</value>
  • <value>/home/ubuntu/hive/${hive.session.id}_resources</value>
  • <value>/home/ubuntu/hive/ubuntu</value>
  • <value>/home/ubuntu/hive/ubuntu/operation_logs</value>

We therefore also need to create the corresponding directory:

  • ubuntu@s100:~$ mkdir -p /home/ubuntu/hive/ubuntu

Then try the hive command once more:

  • ubuntu@s100:~$ hive
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • Logging initialized using configuration in jar:file:/soft/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
  • Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • hive>

This time we enter the Hive command-line environment, which means Hive has been set up successfully.
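Note that the embedded Derby metastore used above allows only one active session and is mainly suited to local testing. Hive can instead keep its metadata in MySQL (as the earlier error message hints); the hive-site.xml fragment below is only an illustrative sketch, with host, database name, and credentials all assumed, after which the schema would be initialized with schematool -initSchema -dbType mysql and the MySQL JDBC driver jar placed under ${hive}/lib:

  • <property>
  •   <name>javax.jdo.option.ConnectionURL</name>
  •   <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  • </property>
  • <property>
  •   <name>javax.jdo.option.ConnectionDriverName</name>
  •   <value>com.mysql.jdbc.Driver</value>
  • </property>
  • <property>
  •   <name>javax.jdo.option.ConnectionUserName</name>
  •   <value>hive</value>
  • </property>
  • <property>
  •   <name>javax.jdo.option.ConnectionPassword</name>
  •   <value>hive</value>
  • </property>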

1.2. Using the Hive CLI

Note: CLI stands for Command Line Interface.

Using Hive is very similar to using a relational database, as the following operations show:

  1. List databases:
  • hive> show databases;
  • OK
  • default
  • Time taken: 1.013 seconds, Fetched: 1 row(s)
  2. List tables:
  • hive> show tables;
  • OK
  • Time taken: 0.07 seconds

Or list the tables in a specific database:

  • hive> show tables in hive1;
  • OK
  • tab_name
  • table1
  • table2
  • Time taken: 0.034 seconds, Fetched: 2 row(s)
  3. Create a database:
  • hive> create database myhive;
  • OK
  • Time taken: 0.343 seconds
  • hive> show databases;
  • OK
  • default
  • myhive
  • Time taken: 0.018 seconds, Fetched: 2 row(s)

After creating the database, we can look at the files on HDFS:

  • ubuntu@s100:~$ hdfs dfs -ls -R /
  • drwx-wx-wx - ubuntu supergroup 0 2017-06-21 07:05 /tmp
  • drwx-wx-wx - ubuntu supergroup 0 2017-06-21 07:05 /tmp/hive
  • drwx------ - ubuntu supergroup 0 2017-06-21 07:12 /tmp/hive/ubuntu
  • drwx------ - ubuntu supergroup 0 2017-06-21 07:12 /tmp/hive/ubuntu/78a34626-5363-4042-b33e-b36754da900d
  • drwx------ - ubuntu supergroup 0 2017-06-21 07:12 /tmp/hive/ubuntu/78a34626-5363-4042-b33e-b36754da900d/_tmp_space.db
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user/hive
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user/hive/warehouse
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user/hive/warehouse/myhive.db

As you can see, in Hive both databases and tables are stored as directories.
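The root of this directory tree is determined by the hive.metastore.warehouse.dir property in hive-site.xml, whose default is /user/hive/warehouse; shown here only as a reference sketch of the relevant fragment:

  • <property>
  •   <name>hive.metastore.warehouse.dir</name>
  •   <value>/user/hive/warehouse</value>
  • </property>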

When creating a database we can specify where its data is stored, for example placing the hive1 database under the /user/ubuntu/data/hive1 directory:

  • hive> dfs -ls -R /;
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 01:59 /user
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-22 06:53 /user/hive
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 01:56 /user/hive/warehouse
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-22 06:56 /user/hive/warehouse/mytable
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 02:01 /user/ubuntu
  • hive> create database hive1 location "/user/ubuntu//data/hive1";
  • OK
  • Time taken: 0.04 seconds
  • hive> dfs -ls -R /;
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 01:59 /user
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-22 06:53 /user/hive
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 01:56 /user/hive/warehouse
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-22 06:56 /user/hive/warehouse/mytable
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 02:01 /user/ubuntu
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 02:01 /user/ubuntu/data
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-24 02:01 /user/ubuntu/data/hive1

The database's details can be viewed as follows:

  • hive> desc database hive1;
  • OK
  • db_name comment location owner_name owner_type parameters
  • hive1 hdfs://s100/user/ubuntu/data/hive1 ubuntu USER
  • Time taken: 0.037 seconds, Fetched: 1 row(s)

When creating a database we can also attach extra properties, which can be viewed later by adding the extended keyword:

  • hive> create database hive2 with dbproperties('author'='LimyronChin','createtime'='20170624');
  • OK
  • Time taken: 0.035 seconds
  • hive> desc database extended hive2;
  • OK
  • db_name comment location owner_name owner_type parameters
  • hive2 hdfs://s100/user/hive/warehouse/hive2.db ubuntu USER {createtime=20170624, author=LimyronChin}
  • Time taken: 0.021 seconds, Fetched: 1 row(s)

Modify the database properties:

  • hive> alter database hive2 set dbproperties('author'='Tom');
  • OK
  • Time taken: 0.036 seconds
  • hive> desc database extended hive2;
  • OK
  • db_name comment location owner_name owner_type parameters
  • hive2 hdfs://s100/user/hive/warehouse/hive2.db ubuntu USER {createtime=20170624, author=Tom}
  • Time taken: 0.014 seconds, Fetched: 1 row(s)
  4. Drop a database:
  • hive> drop database myhive;
  • OK
  • Time taken: 0.311 seconds
  5. Create a table:
  • hive> create table if not exists employee (
  • > eid int, name String, salary String, destination String)
  • > comment 'Employee details'
  • > row format delimited
  • > fields terminated by '\t'
  • > lines terminated by '\n'
  • > stored as textfile;
  • OK
  • Time taken: 0.454 seconds

Extra metadata can be attached when creating a table:

  • create table hive1.table1(id int, name string, age int) TBLPROPERTIES('author'='Tom', 'create_time'='20170624');

Or the table's storage location can be specified:

  • hive> create table table1(id int, name string, age int) LOCATION '/user/ubuntu/data/hive1.db/';
  • OK
  • Time taken: 0.067 seconds

A table can also be copied from an existing one, but only the table structure is copied:

  • hive> create table hive1.table2 like hive1.table1;
  • OK
  • Time taken: 0.097 seconds
  • hive> select * from table1;
  • OK
  • table1.id table1.name table1.age
  • Time taken: 0.103 seconds
  • hive> select * from table2;
  • OK
  • table2.id table2.name table2.age
  • Time taken: 0.085 seconds

To copy the data as well, use the as select form:

  • hive> select * from hive1.table1;
  • OK
  • table1.id table1.name table1.age
  • 0 Tom 20
  • Time taken: 0.082 seconds, Fetched: 1 row(s)
  • hive> create table hive1.table2 as select * from hive1.table1;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170624035616_04032d49-77d9-48fd-a6c0-aaf8a21db7e4
  • Total jobs = 3
  • Launching Job 1 out of 3
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498289214510_0002, Tracking URL = http://s100:8088/proxy/application_1498289214510_0002/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498289214510_0002
  • Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
  • 2017-06-24 03:56:26,445 Stage-1 map = 0%, reduce = 0%
  • 2017-06-24 03:56:36,072 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.38 sec
  • MapReduce Total cumulative CPU time: 1 seconds 380 msec
  • Ended Job = job_1498289214510_0002
  • Stage-4 is selected by condition resolver.
  • Stage-3 is filtered out by condition resolver.
  • Stage-5 is filtered out by condition resolver.
  • Moving data to directory hdfs://s100/user/ubuntu/data/hive1.db/.hive-staging_hive_2017-06-24_03-56-16_076_5059966805775399523-1/-ext-10002
  • Moving data to directory hdfs://s100/user/ubuntu/data/hive1.db/table2
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 1 Cumulative CPU: 1.38 sec HDFS Read: 3464 HDFS Write: 77 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 380 msec
  • OK
  • table1.id table1.name table1.age
  • Time taken: 21.337 seconds
  • hive> select * from table2;
  • OK
  • table2.id table2.name table2.age
  • 0 Tom 20
  • Time taken: 0.065 seconds, Fetched: 1 row(s)
  6. Show tables and table structure:
  • hive> show tables;
  • OK
  • employee
  • Time taken: 0.03 seconds, Fetched: 1 row(s)
  • hive> desc employee;
  • OK
  • eid int
  • name string
  • salary string
  • destination string
  • Time taken: 0.282 seconds, Fetched: 4 row(s)

Detailed, formatted table information can be viewed as well:

  • hive> desc formatted hive1.table1;
  • OK
  • # col_name data_type comment
  • id int
  • name string
  • age int
  • # Detailed Table Information
  • Database: hive1
  • Owner: ubuntu
  • CreateTime: Sat Jun 24 02:50:15 PDT 2017
  • LastAccessTime: UNKNOWN
  • Retention: 0
  • Location: hdfs://s100/user/ubuntu/data/hive1.db
  • Table Type: MANAGED_TABLE
  • Table Parameters:
  • transient_lastDdlTime 1498297815
  • # Storage Information
  • SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  • InputFormat: org.apache.hadoop.mapred.TextInputFormat
  • OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  • Compressed: No
  • Num Buckets: -1
  • Bucket Columns: []
  • Sort Columns: []
  • Storage Desc Params:
  • serialization.format 1
  • Time taken: 0.08 seconds, Fetched: 27 row(s)
  7. Alter a table

Use the following statement to rename a table:

  • hive> ALter TABLE hive1.table1 RENAME TO hive1.new_table1;
  • OK
  • Time taken: 0.124 seconds
  • hive> show tables;
  • OK
  • tab_name
  • new_table1
  • table2
  • table3
  • values__tmp__table__1
  • values__tmp__table__2
  • Time taken: 0.022 seconds, Fetched: 5 row(s)
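Renaming is not the only possible alteration; columns can be added to an existing table as well. A small sketch against the table renamed above (the new column is only for illustration):

  • hive> ALTER TABLE hive1.new_table1 ADD COLUMNS (address string COMMENT 'illustrative column');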
  8. Load data:

Hive loads data with the LOAD DATA command, which imports a local data file into a Hive table. Note that the data file must be written to match the table structure exactly:

  • hive> load data local inpath '~/software/employee' overwrite into table employee;
  • FAILED: SemanticException Line 1:23 Invalid path ''~/software/employee'': No files matching path file:/home/ubuntu/~/software/employee
  • hive> load data local inpath '/home/ubuntu/software/employee' overwrite into table employee;
  • Loading data to table myhive.employee
  • OK
  • Time taken: 1.751 seconds

Note: Hive does not resolve paths such as ~.

  9. Query data:
  • hive> select * from employee;
  • OK
  • 1201 Gopal 45000 Technical manager
  • 1202 Manisha 45000 Proof reader
  • 1203 Masthanvali 40000 Technical writer
  • 1204 Krian 40000 Hr Admin
  • 1205 Kranthi 30000 Op Admin
  • Time taken: 1.372 seconds, Fetched: 5 row(s)

The loaded data is in fact stored directly as files on HDFS, which we can see by browsing the file system:

  • ubuntu@s100:~$ hdfs dfs -ls -R /
  • drwx-wx-wx - ubuntu supergroup 0 2017-06-21 07:05 /tmp
  • drwx-wx-wx - ubuntu supergroup 0 2017-06-21 07:05 /tmp/hive
  • drwx------ - ubuntu supergroup 0 2017-06-21 07:12 /tmp/hive/ubuntu
  • drwx------ - ubuntu supergroup 0 2017-06-21 08:16 /tmp/hive/ubuntu/78a34626-5363-4042-b33e-b36754da900d
  • drwx------ - ubuntu supergroup 0 2017-06-21 07:12 /tmp/hive/ubuntu/78a34626-5363-4042-b33e-b36754da900d/_tmp_space.db
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user/hive
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:26 /user/hive/warehouse
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:56 /user/hive/warehouse/myhive.db
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 08:15 /user/hive/warehouse/myhive.db/employee
  • -rwxr-xr-x 3 ubuntu supergroup 164 2017-06-21 08:15 /user/hive/warehouse/myhive.db/employee/employee

Here /user/hive/warehouse/myhive.db/employee/employee is the file that holds our data; Hive essentially put it onto HDFS, and we can read it directly:

  • ubuntu@s100:~$ hdfs dfs -cat /user/hive/warehouse/myhive.db/employee/employee
  • 1201 Gopal 45000 Technical manager
  • 1202 Manisha 45000 Proof reader
  • 1203 Masthanvali 40000 Technical writer
  • 1204 Krian 40000 Hr Admin
  • 1205 Kranthi 30000 Op Adminubuntu@s100:~$

Therefore, if we write another file in the same data format and place it directly into the corresponding directory, its contents effectively become data stored in Hive:

Suppose a file named data has the following contents:

  • 1206 Tom 20000 Manager
  • 1207 Jack 50000 Saler
  • 1208 John 40000 CTO

Upload it to the employee table's directory on HDFS:

  • ubuntu@s100:~$ hdfs dfs -put ~/software/data /user/hive/warehouse/myhive.db/employee/
  • ubuntu@s100:~$ hdfs dfs -ls -R /
  • drwx-wx-wx - ubuntu supergroup 0 2017-06-21 07:05 /tmp
  • drwx-wx-wx - ubuntu supergroup 0 2017-06-21 07:05 /tmp/hive
  • drwx------ - ubuntu supergroup 0 2017-06-21 07:12 /tmp/hive/ubuntu
  • drwx------ - ubuntu supergroup 0 2017-06-21 08:16 /tmp/hive/ubuntu/78a34626-5363-4042-b33e-b36754da900d
  • drwx------ - ubuntu supergroup 0 2017-06-21 07:12 /tmp/hive/ubuntu/78a34626-5363-4042-b33e-b36754da900d/_tmp_space.db
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user/hive
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:26 /user/hive/warehouse
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:56 /user/hive/warehouse/myhive.db
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 08:34 /user/hive/warehouse/myhive.db/employee
  • -rw-r--r-- 3 ubuntu supergroup 66 2017-06-21 08:34 /user/hive/warehouse/myhive.db/employee/data
  • -rwxr-xr-x 3 ubuntu supergroup 164 2017-06-21 08:15 /user/hive/warehouse/myhive.db/employee/employee

Then query from Hive:

  • hive> select * from employee;
  • OK
  • 1206 Tom 20000 Manager
  • 1207 Jack 50000 Saler
  • 1208 John 40000 CTO
  • 1201 Gopal 45000 Technical manager
  • 1202 Manisha 45000 Proof reader
  • 1203 Masthanvali 40000 Technical writer
  • 1204 Krian 40000 Hr Admin
  • 1205 Kranthi 30000 Op Admin
  • Time taken: 0.144 seconds, Fetched: 8 row(s)

As expected, the uploaded file is parsed as table data as well.

Conditional queries also work:

  • hive> select eid,name from employee where eid>1205;
  • OK
  • 1206 Tom
  • 1207 Jack
  • 1208 John
  • Time taken: 0.544 seconds, Fetched: 3 row(s)
  10. Insert data:

Hive also supports inserting data with insert statements, but each insert runs as a MapReduce job:

  • hive> insert into employee(eid,name,salary) values(1209,'Tomas',55000);
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170621084021_fa7c8a0e-bb48-4d65-a0a1-8e1f4842c7ef
  • Total jobs = 3
  • Launching Job 1 out of 3
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498049976060_0001, Tracking URL = http://s100:8088/proxy/application_1498049976060_0001/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498049976060_0001
  • Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
  • 2017-06-21 08:40:35,476 Stage-1 map = 0%, reduce = 0%
  • 2017-06-21 08:40:42,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.61 sec
  • MapReduce Total cumulative CPU time: 1 seconds 610 msec
  • Ended Job = job_1498049976060_0001
  • Stage-4 is selected by condition resolver.
  • Stage-3 is filtered out by condition resolver.
  • Stage-5 is filtered out by condition resolver.
  • Moving data to directory hdfs://s100/user/hive/warehouse/myhive.db/employee/.hive-staging_hive_2017-06-21_08-40-21_585_6763523576159484150-1/-ext-10000
  • Loading data to table myhive.employee
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 1 Cumulative CPU: 1.61 sec HDFS Read: 4507 HDFS Write: 91 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 610 msec
  • OK
  • Time taken: 23.841 seconds
  • hive> select * from employee;
  • OK
  • 1209 Tomas 55000 NULL
  • 1206 Tom 20000 Manager
  • 1207 Jack 50000 Saler
  • 1208 John 40000 CTO
  • 1201 Gopal 45000 Technical manager
  • 1202 Manisha 45000 Proof reader
  • 1203 Masthanvali 40000 Technical writer
  • 1204 Krian 40000 Hr Admin
  • 1205 Kranthi 30000 Op Admin
  • Time taken: 0.086 seconds, Fetched: 9 row(s)

Looking at HDFS again, the new row has been written to a separate file; the destination column we did not supply is stored as \N, Hive's default text representation of NULL:

  • ubuntu@s100:~$ hdfs dfs -ls -R /
  • ...
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:14 /user/hive
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:26 /user/hive/warehouse
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 07:56 /user/hive/warehouse/myhive.db
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-21 08:40 /user/hive/warehouse/myhive.db/employee
  • -rwxr-xr-x 3 ubuntu supergroup 20 2017-06-21 08:40 /user/hive/warehouse/myhive.db/employee/000000_0
  • -rw-r--r-- 3 ubuntu supergroup 66 2017-06-21 08:34 /user/hive/warehouse/myhive.db/employee/data
  • -rwxr-xr-x 3 ubuntu supergroup 164 2017-06-21 08:15 /user/hive/warehouse/myhive.db/employee/employee
  • ubuntu@s100:~$ hdfs dfs -cat /user/hive/warehouse/myhive.db/employee/000000_0
  • 1209 Tomas 55000 \N

We can also do more work in a query, which likewise triggers Hive to run it as a MapReduce job:

  • hive> select * from employee orderby eid limit 3,4;
  • FAILED: ParseException line 1:31 missing EOF at 'eid' near 'orderby'
  • hive> select * from employee order by eid limit 3,4;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170621084248_a619e1e6-3eca-4e43-8a1d-50ebf2c32fc7
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks determined at compile time: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498049976060_0002, Tracking URL = http://s100:8088/proxy/application_1498049976060_0002/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498049976060_0002
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-21 08:42:59,324 Stage-1 map = 0%, reduce = 0%
  • 2017-06-21 08:43:09,765 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 0.89 sec
  • 2017-06-21 08:43:10,843 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.71 sec
  • 2017-06-21 08:43:18,311 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.96 sec
  • MapReduce Total cumulative CPU time: 2 seconds 960 msec
  • Ended Job = job_1498049976060_0002
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 2.96 sec HDFS Read: 12106 HDFS Write: 234 SUCCESS
  • Total MapReduce CPU Time Spent: 2 seconds 960 msec
  • OK
  • 1204 Krian 40000 Hr Admin
  • 1205 Kranthi 30000 Op Admin
  • 1206 Tom 20000 Manager
  • 1207 Jack 50000 Saler
  • Time taken: 30.518 seconds, Fetched: 4 row(s)
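Aggregations behave the same way; for example, a group-by query such as the following (only a sketch, output omitted) is also compiled into a MapReduce job:

  • hive> select salary, count(*) from employee group by salary;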
  11. Ordinary shell commands and HDFS commands can also be executed from within Hive:

Prefix an ordinary shell command with an exclamation mark to run it:

  • hive> !clear;

Use dfs to run HDFS commands:

  • hive> dfs -ls -R /;
  12. The -e option executes Hive statements directly from the shell command line:
  • ubuntu@s100:~$ hive -e "show databases;"
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • Logging initialized using configuration in jar:file:/soft/apache-hive-2.1.0-bin/lib/hive-common-2.1.0.jar!/hive-log4j2.properties Async: true
  • OK
  • default
  • hive1
  • Time taken: 1.315 seconds, Fetched: 2 row(s)
  13. The -S option runs the hive command in silent mode, suppressing the extra log output.
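For example, combined with -e (a sketch; in silent mode only the result rows are printed, without the log lines shown above):

  • ubuntu@s100:~$ hive -S -e "show databases;"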

  14. The -f option executes, in batch, HiveQL statements stored in a file from the shell command line:

  • ubuntu@s100:~$ hive -f "<path to script file>"
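A small sketch, assuming a script file named show_dbs.hql that contains a single statement:

  • ubuntu@s100:~$ echo "show databases;" > ~/show_dbs.hql
  • ubuntu@s100:~$ hive -f ~/show_dbs.hql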
  15. Pressing the Tab key twice at the hive prompt lists all available statements, keywords, and functions:
  • hive> Display all 528 possibilities? (y or n)
  • ! != $ELEM$ $KEY$
  • $VALUE$ $elem$ $key$ $sum0
  • $value$ % & (
  • ) ); * +
  • , - . /
  • : < <= <=>
  • <> = == >
  • >= ABORT ADD ALL
  • ALTER AND ARRAY AS
  • ASC BIGINT BINARY BOOLEAN
  • BUCKET BUCKETS BY CAST
  • CLUSTER CLUSTERED COLLECTION COLUMNS
  • COMMENT CONSTRAINT CREATE DATA
  • DATE DATETIME DEFINED DELIMITED
  • DESC DESCRIBE DIRECTORY DISABLE
  • DISTINCT DISTRIBUTE DOUBLE DROP
  • ENABLE EXPLAIN EXTENDED EXTERNAL
  • FALSE FIELDS FLOAT FOREIGN
  • FORMAT FROM FULL FUNCTION
  • GROUP INPATH INPUTFORMAT INSERT
  • INT INTO IS ITEMS
  • JOIN KEY KEYS LAST
  • LEFT LIKE LIMIT LINES
  • LOAD LOCAL LOCATION MAP
  • MSCK NONE NORELY NOT
  • NOVALIDATE NULL NULLS OF
  • OFFSET ON OR ORDER
  • OUT OUTER OUTPUTFORMAT OVERWRITE
  • PARTITION PARTITIONED PARTITIONS PRIMARY
  • PURGE REDUCE REFERENCES REGEXP
  • RELY RENAME REPLACE REWRITE
  • RIGHT RLIKE ROW SELECT
  • SEQUENCEFILE SERDE SERDEPROPERTIES SET
  • SHOW SMALLINT SORT SORTED
  • STORED STRING SUBQUERY TABLE
  • TABLES TABLESAMPLE TBLPROPERTIES TEMPORARY
  • TERMINATED TEXTFILE TIMESTAMP TINYINT
  • TO TRANSACTIONS TRANSFORM TRUE
  • UNION UPDATE USING VALIDATE
  • VALUES WHERE WITH [
  • \' ] ^ abort
  • abs( acos( add add_months(
  • aes_decrypt( aes_encrypt( all alter
  • and and( array array(
  • array_contains( as asc ascii(
  • asin( assert_true( atan( avg(
  • base64 between( bigint bigint(
  • bin( binary binary( boolean
  • boolean( bround( bucket buckets
  • by case( cast cbrt(
  • ceil( ceiling( char( chr(
  • cluster clustered coalesce( collect_list(
  • collect_set( collection columns comment
  • compute_stats( concat( concat_ws( constraint
  • context_ngrams( conv( corr( cos(
  • count( covar_pop( covar_samp( crc32
  • create create_union( cume_dist( current_database(
  • current_date( current_timestamp( current_user( data
  • date date( date_add( date_format(
  • date_sub( datediff( datetime day(
  • dayofmonth( decimal( decode( defined
  • degrees( delimited dense_rank( desc
  • describe directory disable distinct
  • distribute div( double double(
  • drop e( elt( enable
  • encode( ewah_bitmap( ewah_bitmap_and( ewah_bitmap_empty(
  • ewah_bitmap_or( exp( explain explode(
  • extended external factorial( false
  • field( fields find_in_set( first_value(
  • float float( floor( foreign
  • format format_number( from from_unixtime(
  • from_utc_timestamp( full function get_json_object(
  • get_splits( greatest( group hash(
  • hex( histogram_numeric( hour( if(
  • in( in_file( index( initcap(
  • inline( inpath inputformat insert
  • instr( int int( interval_day_time(
  • interval_year_month( into is isnotnull(
  • isnull( items java_method( join
  • json_tuple( key keys lag(
  • last last_day( last_value( lcase(
  • lead( least( left length(
  • levenshtein( like like( limit
  • lines ln( load local
  • locate( location log( log10
  • log2 lower( lpad( ltrim(
  • map map( map_keys( map_values(
  • mask( mask_first_n( mask_hash( mask_last_n(
  • mask_show_first_n( mask_show_last_n( matchpath( max(
  • md5 min( minute( month(
  • months_between( msck named_struct( negative(
  • next_day( ngrams( none noop(
  • noopstreaming( noopwithmap( noopwithmapstreaming( norely
  • not not( novalidate ntile(
  • null nulls nvl( of
  • offset on or or(
  • order out outer outputformat
  • overwrite parse_url( parse_url_tuple( partition
  • partitioned partitions percent_rank( percentile(
  • percentile_approx( pi( pmod( posexplode(
  • positive( pow( power( primary
  • printf( purge quarter( radians(
  • rand( rank( reduce references
  • reflect( reflect2 regexp regexp(
  • regexp_extract( regexp_replace( rely rename
  • repeat( replace replace( reverse(
  • rewrite right rlike rlike(
  • round( row row_number( rpad(
  • rtrim( second( select sentences(
  • sequencefile serde serdeproperties set
  • sha( sha1 sha2 shiftleft(
  • shiftright( shiftrightunsigned( show sign(
  • sin( size( smallint smallint(
  • sort sort_array( sorted soundex(
  • space( split( sqrt( stack(
  • std( stddev( stddev_pop( stddev_samp(
  • stored str_to_map( string string(
  • struct( subquery substr( substring(
  • substring_index( sum( table tables
  • tablesample tan( tblproperties temporary
  • terminated textfile timestamp timestamp(
  • tinyint tinyint( to to_date(
  • to_unix_timestamp( to_utc_timestamp( transactions transform
  • translate( trim( true trunc(
  • ucase( unbase64 unhex( union
  • unix_timestamp( update upper( using
  • validate values var_pop( var_samp(
  • varchar( variance( version( weekofyear(
  • when( where windowingtablefunction( with
  • xpath( xpath_boolean( xpath_double( xpath_float(
  • xpath_int( xpath_long( xpath_number( xpath_short(
  • xpath_string( year( | ~
  16. Comments in Hive use the same syntax as in MySQL:
  • hive> -- this is a comments !
  17. Show column headers in query results. By default the Hive CLI does not print column headers with query results; they can be enabled as follows:
  • hive> select * from mytable;
  • OK
  • 1 Tom 18
  • Time taken: 0.973 seconds, Fetched: 1 row(s)
  • hive> set hive.cli.print.header=true;
  • hive> select * from mytable;
  • OK
  • mytable.id mytable.name mytable.age
  • 1 Tom 18
  • Time taken: 0.11 seconds, Fetched: 1 row(s)
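The set command only lasts for the current session. To make the setting persistent, it can be placed in a ~/.hiverc file, which the Hive CLI executes at startup (a minimal sketch):

  • ubuntu@s100:~$ cat ~/.hiverc
  • set hive.cli.print.header=true;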
  18. Cascading delete. When we drop a database that still contains tables, the drop fails; in that case use cascade to drop the database together with its tables:
  • hive> drop database hive1;
  • FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. InvalidOperationException(message:Database hive1 is not empty. One or more tables exist.)
  • hive> drop database if exists hive1 cascade;
  • OK
  • Time taken: 2.525 seconds