
Hive 03 - Queries, Views, Indexes, and Bucketing

Overview: Hive is a data warehouse tool built on top of Hadoop. It maps structured data files onto database tables, offers a simple SQL-style query language, and translates SQL statements into MapReduce jobs for execution. Its main advantage is the low learning curve: simple MapReduce-style aggregations can be expressed quickly in SQL-like statements without developing dedicated MapReduce applications, which makes Hive well suited to data warehouse analytics.

1. Exporting Tables

The insert overwrite local directory statement exports a Hive table's data to local files:

  • hive> insert overwrite local directory '/home/ubuntu/hive1.table3.db' select * from hive1.table3 where id>1;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625013933_cc866566-60ce-46b7-8140-f70515381b35
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0004, Tracking URL = http://s100:8088/proxy/application_1498376867650_0004/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0004
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
  • 2017-06-25 01:39:42,544 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 01:39:57,325 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.21 sec
  • MapReduce Total cumulative CPU time: 4 seconds 210 msec
  • Ended Job = job_1498376867650_0004
  • Moving data to local directory /home/ubuntu/hive1.table3.db
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Cumulative CPU: 4.21 sec HDFS Read: 11169 HDFS Write: 134 SUCCESS
  • Total MapReduce CPU Time Spent: 4 seconds 210 msec
  • OK
  • Time taken: 26.24 seconds
  • hive> !ls -al /home/ubuntu/hive1.table3.db/;
  • total 24
  • drwxrwxr-x 2 ubuntu ubuntu 4096 Jun 25 01:39 .
  • drwxr-xr-x 10 ubuntu ubuntu 4096 Jun 25 01:39 ..
  • -rw-r--r-- 1 ubuntu ubuntu 109 Jun 25 01:39 000000_0
  • -rw-r--r-- 1 ubuntu ubuntu 12 Jun 25 01:39 .000000_0.crc
  • -rw-r--r-- 1 ubuntu ubuntu 25 Jun 25 01:39 000001_0
  • -rw-r--r-- 1 ubuntu ubuntu 12 Jun 25 01:39 .000001_0.crc
  • hive> !cat /home/ubuntu/hive1.table3.db/000000_0;
  • 3Tom226FLTallahassee
  • 3Tom226WAOlympia
  • 3Polo21WAOlympia
  • 4Mick29WAOlympia
  • 5John32WAOlympia
  • hive> !cat /home/ubuntu/hive1.table3.db/000001_0;
  • 2Jerry22CASacramento

We can also export table data to HDFS:

  • hive> insert overwrite directory 'hdfs://s100:8020/user/ubuntu/hive1.table3.db' select * from hive1.table3 where id>2;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625014320_108c2084-f716-4cf3-8450-e9a63dd4bff0
  • Total jobs = 3
  • Launching Job 1 out of 3
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0005, Tracking URL = http://s100:8088/proxy/application_1498376867650_0005/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0005
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
  • 2017-06-25 01:43:28,059 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 01:43:40,537 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.93 sec
  • MapReduce Total cumulative CPU time: 3 seconds 930 msec
  • Ended Job = job_1498376867650_0005
  • Stage-3 is filtered out by condition resolver.
  • Stage-2 is selected by condition resolver.
  • Stage-4 is filtered out by condition resolver.
  • Launching Job 3 out of 3
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0006, Tracking URL = http://s100:8088/proxy/application_1498376867650_0006/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0006
  • Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 0
  • 2017-06-25 01:43:50,849 Stage-2 map = 0%, reduce = 0%
  • 2017-06-25 01:43:57,098 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.86 sec
  • MapReduce Total cumulative CPU time: 860 msec
  • Ended Job = job_1498376867650_0006
  • Moving data to directory hdfs://s100:8020/user/ubuntu/hive1.table3.db
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Cumulative CPU: 3.93 sec HDFS Read: 11019 HDFS Write: 109 SUCCESS
  • Stage-Stage-2: Map: 1 Cumulative CPU: 0.86 sec HDFS Read: 1953 HDFS Write: 109 SUCCESS
  • Total MapReduce CPU Time Spent: 4 seconds 790 msec
  • OK
  • Time taken: 38.14 seconds
  • hive> dfs -ls -R /user/ubuntu/hive1.table3.db;
  • -rwxr-xr-x 3 ubuntu supergroup 109 2017-06-25 01:43 /user/ubuntu/hive1.table3.db/000000_0
  • hive> dfs -cat /user/ubuntu/hive1.table3.db/000000_0;
  • 3Tom226FLTallahassee
  • 3Tom226WAOlympia
  • 3Polo21WAOlympia
  • 4Mick29WAOlympia
  • 5John32WAOlympia

During an export, the data can also be split and written to several different directories in a single statement:

  • hive> from hive1.table3 tb3
  • > insert overwrite directory 'hdfs://s100:8020/user/ubuntu/hive1.table3.db.CA'
  • > select * where tb3.province='CA'
  • > insert overwrite directory 'hdfs://s100:8020/user/ubuntu/hive1.table3.db.FL'
  • > select * where tb3.province='FL'
  • > insert overwrite directory 'hdfs://s100:8020/user/ubuntu/hive1.table3.db.WA'
  • > select * where tb3.province='WA';
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625015436_25429f8b-8834-4231-b24c-09b82b5335f1
  • Total jobs = 7
  • Launching Job 1 out of 7
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0007, Tracking URL = http://s100:8088/proxy/application_1498376867650_0007/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0007
  • Hadoop job information for Stage-3: number of mappers: 2; number of reducers: 0
  • 2017-06-25 01:54:43,426 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 01:54:50,756 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 3.23 sec
  • MapReduce Total cumulative CPU time: 3 seconds 230 msec
  • Ended Job = job_1498376867650_0007
  • Stage-5 is filtered out by condition resolver.
  • Stage-4 is selected by condition resolver.
  • Stage-6 is filtered out by condition resolver.
  • Stage-10 is filtered out by condition resolver.
  • Stage-9 is selected by condition resolver.
  • Stage-11 is filtered out by condition resolver.
  • Stage-15 is filtered out by condition resolver.
  • Stage-14 is selected by condition resolver.
  • Stage-16 is filtered out by condition resolver.
  • Launching Job 5 out of 7
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0008, Tracking URL = http://s100:8088/proxy/application_1498376867650_0008/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0008
  • Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
  • 2017-06-25 01:54:59,567 Stage-4 map = 0%, reduce = 0%
  • 2017-06-25 01:55:05,845 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 1.05 sec
  • MapReduce Total cumulative CPU time: 1 seconds 50 msec
  • Ended Job = job_1498376867650_0008
  • Launching Job 6 out of 7
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0009, Tracking URL = http://s100:8088/proxy/application_1498376867650_0009/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0009
  • Hadoop job information for Stage-9: number of mappers: 1; number of reducers: 0
  • 2017-06-25 01:55:15,003 Stage-9 map = 0%, reduce = 0%
  • 2017-06-25 01:55:22,322 Stage-9 map = 100%, reduce = 0%, Cumulative CPU 1.08 sec
  • MapReduce Total cumulative CPU time: 1 seconds 80 msec
  • Ended Job = job_1498376867650_0009
  • Launching Job 7 out of 7
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0010, Tracking URL = http://s100:8088/proxy/application_1498376867650_0010/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0010
  • Hadoop job information for Stage-14: number of mappers: 1; number of reducers: 0
  • 2017-06-25 01:55:31,361 Stage-14 map = 0%, reduce = 0%
  • 2017-06-25 01:55:37,704 Stage-14 map = 100%, reduce = 0%
  • Ended Job = job_1498376867650_0010
  • Moving data to directory hdfs://s100:8020/user/ubuntu/hive1.table3.db.CA
  • Moving data to directory hdfs://s100:8020/user/ubuntu/hive1.table3.db.FL
  • Moving data to directory hdfs://s100:8020/user/ubuntu/hive1.table3.db.WA
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 2 Cumulative CPU: 3.23 sec HDFS Read: 14623 HDFS Write: 276 SUCCESS
  • Stage-Stage-4: Map: 1 Cumulative CPU: 1.05 sec HDFS Read: 1927 HDFS Write: 72 SUCCESS
  • Stage-Stage-9: Map: 1 Cumulative CPU: 1.08 sec HDFS Read: 2013 HDFS Write: 99 SUCCESS
  • Stage-Stage-14: Map: 1 Cumulative CPU: 1.25 sec HDFS Read: 2019 HDFS Write: 105 SUCCESS
  • Total MapReduce CPU Time Spent: 6 seconds 610 msec
  • OK
  • Time taken: 63.67 seconds

The statement above exports the data into three different directories according to the value of the province column; inspect the exported data:

  • hive> dfs -ls -R /user/ubuntu/;
  • ...
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-25 01:55 /user/ubuntu/hive1.table3.db.CA
  • -rwxr-xr-x 3 ubuntu supergroup 72 2017-06-25 01:55 /user/ubuntu/hive1.table3.db.CA/000000_0
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-25 01:55 /user/ubuntu/hive1.table3.db.FL
  • -rwxr-xr-x 3 ubuntu supergroup 99 2017-06-25 01:55 /user/ubuntu/hive1.table3.db.FL/000000_0
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-25 01:55 /user/ubuntu/hive1.table3.db.WA
  • -rwxr-xr-x 3 ubuntu supergroup 105 2017-06-25 01:55 /user/ubuntu/hive1.table3.db.WA/000000_0
  • hive> dfs -cat /user/ubuntu/hive1.table3.db.CA/000000_0;
  • 0Tom18CASacramento
  • 1Jack20CASacramento
  • 2Jerry22CASacramento
  • hive> dfs -cat /user/ubuntu/hive1.table3.db.FL/000000_0;
  • 0Tom20FLTallahassee
  • 1Tom122FLTallahassee
  • 1Tom124FLTallahassee
  • 3Tom226FLTallahassee
  • hive> dfs -cat /user/ubuntu/hive1.table3.db.WA/000000_0;
  • 1Tom124WAOlympia
  • 3Tom226WAOlympia
  • 3Polo21WAOlympia
  • 4Mick29WAOlympia
  • 5John32WAOlympia

2. Queries

  1. Processing query output with functions.
  • Converting selected fields to upper or lower case:
  • hive> select id,upper(name) from hive1.table3;
  • OK
  • 0 TOM
  • 1 JACK
  • 2 JERRY
  • 0 TOM
  • 1 TOM1
  • 1 TOM1
  • 3 TOM2
  • 1 TOM1
  • 3 TOM2
  • 3 POLO
  • 4 MICK
  • 5 JOHN
  • Time taken: 0.14 seconds, Fetched: 12 row(s)
  • hive> select id,name,lower(province) from hive1.table3;
  • OK
  • 0 Tom ca
  • 1 Jack ca
  • 2 Jerry ca
  • 0 Tom fl
  • 1 Tom1 fl
  • 1 Tom1 fl
  • 3 Tom2 fl
  • 1 Tom1 wa
  • 3 Tom2 wa
  • 3 Polo wa
  • 4 Mick wa
  • 5 John wa
  • Time taken: 0.107 seconds, Fetched: 12 row(s)

upper and lower are two of Hive's built-in functions.

  • Applying arithmetic to selected fields:

Simple arithmetic:

  • hive> select id,age+10 from hive1.table3;
  • OK
  • 0 28
  • 1 30
  • 2 32
  • 0 30
  • 1 32
  • 1 34
  • 3 36
  • 1 34
  • 3 36
  • 3 31
  • 4 39
  • 5 42
  • Time taken: 0.157 seconds, Fetched: 12 row(s)

Rounding:

  • hive> select round(12.3);
  • OK
  • 12.0
  • Time taken: 0.075 seconds, Fetched: 1 row(s)
  • hive> select round(12.5);
  • OK
  • 13.0
  • Time taken: 0.042 seconds, Fetched: 1 row(s)

Floor and ceiling:

  • hive> select floor(12.3),ceil(12.3);
  • OK
  • 12 13
  • Time taken: 0.055 seconds, Fetched: 1 row(s)
  • Aggregate functions:

Row count:

  • hive> select count(*) from hive1.table3;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625022450_2cb74c55-118f-4bb6-b290-4be8ae300c76
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks determined at compile time: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0011, Tracking URL = http://s100:8088/proxy/application_1498376867650_0011/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0011
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 02:24:59,383 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 02:25:09,753 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec
  • 2017-06-25 02:25:18,264 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.71 sec
  • MapReduce Total cumulative CPU time: 3 seconds 710 msec
  • Ended Job = job_1498376867650_0011
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.71 sec HDFS Read: 14446 HDFS Write: 102 SUCCESS
  • Total MapReduce CPU Time Spent: 3 seconds 710 msec
  • OK
  • 12
  • Time taken: 29.06 seconds, Fetched: 1 row(s)

Average:

  • hive> select avg(age) from hive1.table3;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625022638_e344657b-6e7b-4e02-943f-0396bbb83aee
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks determined at compile time: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0013, Tracking URL = http://s100:8088/proxy/application_1498376867650_0013/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0013
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 02:26:47,985 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 02:26:58,630 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.04 sec
  • 2017-06-25 02:27:05,894 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.52 sec
  • MapReduce Total cumulative CPU time: 3 seconds 520 msec
  • Ended Job = job_1498376867650_0013
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.52 sec HDFS Read: 15613 HDFS Write: 118 SUCCESS
  • Total MapReduce CPU Time Spent: 3 seconds 520 msec
  • OK
  • 23.666666666666668
  • Time taken: 28.74 seconds, Fetched: 1 row(s)

Sum:

  • hive> select sum(age) from hive1.table3;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625022545_d011df1b-000f-41d4-a756-665b39d30d83
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks determined at compile time: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0012, Tracking URL = http://s100:8088/proxy/application_1498376867650_0012/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0012
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 02:25:56,282 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 02:26:06,707 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.12 sec
  • 2017-06-25 02:26:13,951 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.32 sec
  • MapReduce Total cumulative CPU time: 3 seconds 320 msec
  • Ended Job = job_1498376867650_0012
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.32 sec HDFS Read: 14693 HDFS Write: 103 SUCCESS
  • Total MapReduce CPU Time Spent: 3 seconds 320 msec
  • OK
  • 284
  • Time taken: 29.803 seconds, Fetched: 1 row(s)

Maximum:

  • hive> select max(age) from hive1.table3;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625022734_2c306c02-bf56-42a4-94da-4b871f75afab
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks determined at compile time: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0014, Tracking URL = http://s100:8088/proxy/application_1498376867650_0014/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0014
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 02:27:40,995 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 02:27:47,232 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.84 sec
  • 2017-06-25 02:27:54,585 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.1 sec
  • MapReduce Total cumulative CPU time: 3 seconds 100 msec
  • Ended Job = job_1498376867650_0014
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.1 sec HDFS Read: 14697 HDFS Write: 102 SUCCESS
  • Total MapReduce CPU Time Spent: 3 seconds 100 msec
  • OK
  • 32
  • Time taken: 21.595 seconds, Fetched: 1 row(s)

Minimum:

  • hive> select min(age) from hive1.table3;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625022811_a5986421-bbaf-49b6-8dd9-abcf5940cbc2
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks determined at compile time: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0015, Tracking URL = http://s100:8088/proxy/application_1498376867650_0015/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0015
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 02:28:21,947 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 02:28:33,447 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.12 sec
  • 2017-06-25 02:28:41,919 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.35 sec
  • MapReduce Total cumulative CPU time: 4 seconds 350 msec
  • Ended Job = job_1498376867650_0015
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 4.35 sec HDFS Read: 14703 HDFS Write: 102 SUCCESS
  • Total MapReduce CPU Time Spent: 4 seconds 350 msec
  • OK
  • 18
  • Time taken: 32.129 seconds, Fetched: 1 row(s)
  • Deduplicating a field:

The distinct keyword removes duplicate values from the selected field:

  • hive> select distinct(name) from hive1.table3;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625023048_e9be82a0-8220-4aaf-8079-83e33ad1e025
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks not specified. Estimated from input data size: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0016, Tracking URL = http://s100:8088/proxy/application_1498376867650_0016/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0016
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 02:30:57,799 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 02:31:07,364 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec
  • 2017-06-25 02:31:14,667 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.12 sec
  • MapReduce Total cumulative CPU time: 3 seconds 120 msec
  • Ended Job = job_1498376867650_0016
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.12 sec HDFS Read: 14135 HDFS Write: 223 SUCCESS
  • Total MapReduce CPU Time Spent: 3 seconds 120 msec
  • OK
  • Jack
  • Jerry
  • John
  • Mick
  • Polo
  • Tom
  • Tom1
  • Tom2
  • Time taken: 27.034 seconds, Fetched: 8 row(s)
  • The explode function:
  • hive> select explode(array('Tom','Jack','Rose'));
  • OK
  • Tom
  • Jack
  • Rose
  • Time taken: 0.089 seconds, Fetched: 3 row(s)
  • Base64 encoding:
  • hive> select base64(binary('http://s100:8020/'));
  • OK
  • aHR0cDovL3MxMDA6ODAyMC8=
  • Time taken: 0.063 seconds, Fetched: 1 row(s)
  • Type conversion with cast:
  • hive> select cast("120" as bigint);
  • OK
  • 120
  • Time taken: 0.067 seconds, Fetched: 1 row(s)
  • Simple concatenation with concat:
  • hive> select concat(120,'200');
  • OK
  • 120200
  • Time taken: 0.045 seconds, Fetched: 1 row(s)
  • Nested queries:

The result of one query can be used as the data source for a second, nested query:

  • hive> select * from hive1.table3;
  • OK
  • 0 Tom 18 CA Sacramento
  • 1 Jack 20 CA Sacramento
  • 2 Jerry 22 CA Sacramento
  • 0 Tom 20 FL Tallahassee
  • 1 Tom1 22 FL Tallahassee
  • 1 Tom1 24 FL Tallahassee
  • 3 Tom2 26 FL Tallahassee
  • 1 Tom1 24 WA Olympia
  • 3 Tom2 26 WA Olympia
  • 3 Polo 21 WA Olympia
  • 4 Mick 29 WA Olympia
  • 5 John 32 WA Olympia
  • Time taken: 0.099 seconds, Fetched: 12 row(s)
  • hive> from (select * from hive1.table3 where province='FL') e select e.id,e.name,e.age where e.age>22;
  • OK
  • 1 Tom1 24
  • 3 Tom2 26
  • Time taken: 0.122 seconds, Fetched: 2 row(s)

This query is equivalent to:

  • hive> select e.id,e.name,e.age from (select * from hive1.table3 where province='FL') e where e.age>22;
  • OK
  • 1 Tom1 24
  • 3 Tom2 26
  • Time taken: 0.117 seconds, Fetched: 2 row(s)
  • CASE...WHEN...THEN syntax:
  • hive> select id,name,age,province,city,
  • > case
  • > when age<=20 then 'young'
  • > when age>20 and age<=30 then 'middle'
  • > when age>30 then 'old'
  • > end as bracket from hive1.table3;
  • OK
  • 0 Tom 18 CA Sacramento young
  • 1 Jack 20 CA Sacramento young
  • 2 Jerry 22 CA Sacramento middle
  • 0 Tom 20 FL Tallahassee young
  • 1 Tom1 22 FL Tallahassee middle
  • 1 Tom1 24 FL Tallahassee middle
  • 3 Tom2 26 FL Tallahassee middle
  • 1 Tom1 24 WA Olympia middle
  • 3 Tom2 26 WA Olympia middle
  • 3 Polo 21 WA Olympia middle
  • 4 Mick 29 WA Olympia middle
  • 5 John 32 WA Olympia old
  • Time taken: 0.109 seconds, Fetched: 12 row(s)

Note: a column alias cannot be referenced in the WHERE clause (workarounds are sketched after the error below):

  • hive> select id i,name from hive1.table3 where i>1;
  • FAILED: SemanticException [Error 10004]: Line 1:41 Invalid table alias or column reference 'i': (possible column names are: id, name, age, province, city)
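
Two common workarounds, sketched here against the same table (not part of the original transcript): either filter on the underlying column name, or wrap the aliased query in a subquery so that the alias becomes a real column of the derived table:

  • hive> select id i,name from hive1.table3 where id>1;
  • hive> select t.i,t.name from (select id i,name from hive1.table3) t where t.i>1;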

Note: Hive can avoid running a query as a MapReduce job in the following cases:

  1. A full-table scan, such as select * from table;

  2. A WHERE clause that filters only on partition columns;

  3. hive.exec.mode.local.auto=true is set, in which case Hive runs the query in local mode whenever it can (see the sketch after this list).

All other queries are converted into MapReduce jobs.
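
A minimal local-mode sketch; the two threshold properties and the values shown are Hive's commonly documented defaults rather than settings taken from this article:

  • hive> set hive.exec.mode.local.auto=true;
  • hive> set hive.exec.mode.local.auto.inputbytes.max=134217728;
  • hive> set hive.exec.mode.local.auto.input.files.max=4;
  • hive> select count(*) from hive1.table3 where province='CA';

If the input stays under these thresholds, Hive runs the query as an in-process local job instead of submitting it to the cluster.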

  • GROUP BY aggregation

Group the rows by province and count the number of people in each:

  • hive> select province,count(*) from hive1.table3 group by province;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625035032_e9263334-f66e-4684-bd7e-0cad7cc72dbe
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks not specified. Estimated from input data size: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0017, Tracking URL = http://s100:8088/proxy/application_1498376867650_0017/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0017
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 03:50:40,025 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 03:50:45,387 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.0 sec
  • 2017-06-25 03:50:46,473 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.97 sec
  • 2017-06-25 03:50:52,742 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.14 sec
  • MapReduce Total cumulative CPU time: 3 seconds 140 msec
  • Ended Job = job_1498376867650_0017
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.14 sec HDFS Read: 15081 HDFS Write: 138 SUCCESS
  • Total MapReduce CPU Time Spent: 3 seconds 140 msec
  • OK
  • CA 3
  • FL 4
  • WA 5
  • Time taken: 21.704 seconds, Fetched: 3 row(s)
  • HAVING clause

Filter the GROUP BY result above so that only groups with a count greater than 3 are returned:

  • hive> select province,count(*) as e from hive1.table3 group by province having e>3;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625035605_53f1381c-0af4-450e-a974-666012e1b85d
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks not specified. Estimated from input data size: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0019, Tracking URL = http://s100:8088/proxy/application_1498376867650_0019/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0019
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 03:56:13,070 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 03:56:23,528 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.25 sec
  • 2017-06-25 03:56:31,901 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.69 sec
  • MapReduce Total cumulative CPU time: 4 seconds 690 msec
  • Ended Job = job_1498376867650_0019
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 4.69 sec HDFS Read: 15420 HDFS Write: 121 SUCCESS
  • Total MapReduce CPU Time Spent: 4 seconds 690 msec
  • OK
  • FL 4
  • WA 5
  • Time taken: 27.96 seconds, Fetched: 2 row(s)

3. Join Queries

Next we demonstrate join queries in Hive. Before doing so, we need two tables and some test data; the tables are created as follows:

  • hive> create table hive1.customers(id int,name string,age int)
  • > row format delimited
  • > fields terminated by '\t'
  • > lines terminated by '\n'
  • > stored as textfile;
  • OK
  • Time taken: 0.095 seconds
  • hive> create table hive1.orders(id int,orderno string,price float,cid int)
  • > row format delimited
  • > fields terminated by '\t'
  • > lines terminated by '\n'
  • > stored as textfile;
  • hive> show tables;
  • OK
  • customers
  • orders
  • table2
  • table3
  • table4
  • Time taken: 0.02 seconds, Fetched: 6 row(s)

Then prepare the test data and load it into the two tables (a reconstruction of the raw data files is sketched after the output below):

  • hive> load data local inpath '/home/ubuntu/customers.data' into table hive1.customers;
  • Loading data to table hive1.customers
  • OK
  • Time taken: 0.405 seconds
  • hive> load data local inpath '/home/ubuntu/orders.data' into table hive1.orders;
  • Loading data to table hive1.orders
  • OK
  • Time taken: 0.217 seconds
  • hive> select * from hive1.customers;
  • OK
  • 0 Tom 20
  • 1 Jack 19
  • 2 Rose 24
  • Time taken: 0.053 seconds, Fetched: 3 row(s)
  • hive> select * from hive1.orders;
  • OK
  • 0 No000 12.99 0
  • 1 No001 22.1 2
  • 2 No002 13.7 0
  • 3 No003 23.0 1
  • 4 No004 65.99 2
  • 5 No005 78.2 3
  • 6 No006 4.3 1
  • 7 No007 1.99 2
  • 8 No008 29.99 2
  • 9 No009 100.99 NULL
  • Time taken: 0.05 seconds, Fetched: 10 row(s)
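
The raw data files themselves are not reproduced in the article. Inferred from the select * output above, they would look roughly as follows (columns separated by tabs, with \N as Hive's default text representation of NULL):

  • customers.data:
  • 0    Tom    20
  • 1    Jack    19
  • 2    Rose    24
  • orders.data:
  • 0    No000    12.99    0
  • 1    No001    22.1    2
  • 2    No002    13.7    0
  • 3    No003    23.0    1
  • 4    No004    65.99    2
  • 5    No005    78.2    3
  • 6    No006    4.3    1
  • 7    No007    1.99    2
  • 8    No008    29.99    2
  • 9    No009    100.99    \N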

3.1. Inner Join: INNER JOIN ... ON

  • hive> select c.id,c.name,c.age,o.id,o.orderno,o.price from customers c join orders o on c.id=o.cid;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625043551_e4fff907-c396-465a-a0df-40f88ae7cace
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 04:36:01 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 04:36:03 Dump the side-table for tag: 0 with group count: 3 into file: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_04-35-51_006_569512484172616105-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
  • 2017-06-25 04:36:03 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_04-35-51_006_569512484172616105-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (334 bytes)
  • 2017-06-25 04:36:03 End of local task; Time Taken: 2.005 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0020, Tracking URL = http://s100:8088/proxy/application_1498376867650_0020/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0020
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 04:36:14,225 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 04:36:21,757 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec
  • MapReduce Total cumulative CPU time: 1 seconds 870 msec
  • Ended Job = job_1498376867650_0020
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 1.87 sec HDFS Read: 7325 HDFS Write: 367 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 870 msec
  • OK
  • c.id c.name c.age o.id o.orderno o.price
  • 0 Tom 20 0 No000 12.99
  • 2 Rose 24 1 No001 22.1
  • 0 Tom 20 2 No002 13.7
  • 1 Jack 19 3 No003 23.0
  • 2 Rose 24 4 No004 65.99
  • 1 Jack 19 6 No006 4.3
  • 2 Rose 24 7 No007 1.99
  • 2 Rose 24 8 No008 29.99
  • Time taken: 31.85 seconds, Fetched: 8 row(s)

Note: the inner join above omits the optional INNER keyword.
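
For comparison, the fully spelled-out form is equivalent in Hive:

  • hive> select c.id,c.name,c.age,o.id,o.orderno,o.price from customers c inner join orders o on c.id=o.cid;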

Join queries can be optimized in the following ways:

  1. List the tables from smallest to largest, left to right: Hive caches the smaller tables and streams the largest one, and the /*+STREAMTABLE(...)*/ hint can be used in the query to mark the large table explicitly. In the example above, orders is the larger table, so placing orders first can hurt query performance:
  • hive> select c.id,c.name,c.age,o.id,o.orderno,o.price from orders o join customers c on o.cid=c.id;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625045134_90a9eaaa-c005-415c-95e8-75ee85c7c82a
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 04:51:41 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 04:51:44 Dump the side-table for tag: 1 with group count: 3 into file: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_04-51-34_250_2946658903456255805-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
  • 2017-06-25 04:51:44 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_04-51-34_250_2946658903456255805-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (334 bytes)
  • 2017-06-25 04:51:44 End of local task; Time Taken: 2.301 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0021, Tracking URL = http://s100:8088/proxy/application_1498376867650_0021/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0021
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 04:51:52,995 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 04:51:59,296 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.53 sec
  • MapReduce Total cumulative CPU time: 1 seconds 530 msec
  • Ended Job = job_1498376867650_0021
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 1.53 sec HDFS Read: 7686 HDFS Write: 367 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 530 msec
  • OK
  • c.id c.name c.age o.id o.orderno o.price
  • 0 Tom 20 0 No000 12.99
  • 2 Rose 24 1 No001 22.1
  • 0 Tom 20 2 No002 13.7
  • 1 Jack 19 3 No003 23.0
  • 2 Rose 24 4 No004 65.99
  • 1 Jack 19 6 No006 4.3
  • 2 Rose 24 7 No007 1.99
  • 2 Rose 24 8 No008 29.99
  • Time taken: 26.11 seconds, Fetched: 8 row(s)

Adding the /*+streamtable(o)*/ hint tells Hive that the orders table is the larger one:

  • hive> select /*+streamtable(o)*/ c.id,c.name,c.age,o.id,o.orderno,o.price from orders o join customers c on o.cid=c.id;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625045559_f8746922-9d19-48c2-bffb-f3db1605b761
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 04:56:08 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 04:56:11 Dump the side-table for tag: 1 with group count: 3 into file: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_04-55-59_042_7809462805281604536-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable
  • 2017-06-25 04:56:11 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_04-55-59_042_7809462805281604536-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable (334 bytes)
  • 2017-06-25 04:56:11 End of local task; Time Taken: 2.924 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0023, Tracking URL = http://s100:8088/proxy/application_1498376867650_0023/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0023
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 04:56:19,105 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 04:56:26,482 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.7 sec
  • MapReduce Total cumulative CPU time: 1 seconds 700 msec
  • Ended Job = job_1498376867650_0023
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 1.7 sec HDFS Read: 7370 HDFS Write: 367 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 700 msec
  • OK
  • c.id c.name c.age o.id o.orderno o.price
  • 0 Tom 20 0 No000 12.99
  • 2 Rose 24 1 No001 22.1
  • 0 Tom 20 2 No002 13.7
  • 1 Jack 19 3 No003 23.0
  • 2 Rose 24 4 No004 65.99
  • 1 Jack 19 6 No006 4.3
  • 2 Rose 24 7 No007 1.99
  • 2 Rose 24 8 No008 29.99
  • Time taken: 28.529 seconds, Fetched: 8 row(s)
  2. When joining more than two tables, if every join uses the same join key, Hive launches only a single MapReduce job for all of them (see the sketch below).
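
A sketch of this rule, using a hypothetical third table addresses(cid int, city string) that is not part of the article's data set; both joins key on the customer id:

  • hive> select c.id,c.name,o.orderno,a.city
  • > from customers c
  • > join orders o on c.id=o.cid
  • > join addresses a on c.id=a.cid;

Because both ON clauses use c.id as the key, Hive can evaluate the two joins in a single MapReduce job; joining on different keys would require an additional job per key.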

3.2. Left Outer Join: LEFT OUTER JOIN ... ON

  • hive> select c.id,c.name,c.age,o.id,o.orderno,o.price from customers c left outer join orders o on c.id=o.cid;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625045954_e503ede7-e959-47c5-abef-0927ed7b1d7f
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 05:00:01 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 05:00:04 Dump the side-table for tag: 1 with group count: 5 into file: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_04-59-54_523_4945188933179624876-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile41--.hashtable
  • 2017-06-25 05:00:04 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_04-59-54_523_4945188933179624876-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile41--.hashtable (489 bytes)
  • 2017-06-25 05:00:04 End of local task; Time Taken: 2.513 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0024, Tracking URL = http://s100:8088/proxy/application_1498376867650_0024/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0024
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 05:00:13,073 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 05:00:19,387 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.06 sec
  • MapReduce Total cumulative CPU time: 1 seconds 60 msec
  • Ended Job = job_1498376867650_0024
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 1.06 sec HDFS Read: 6832 HDFS Write: 367 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 60 msec
  • OK
  • c.id c.name c.age o.id o.orderno o.price
  • 0 Tom 20 0 No000 12.99
  • 0 Tom 20 2 No002 13.7
  • 1 Jack 19 3 No003 23.0
  • 1 Jack 19 6 No006 4.3
  • 2 Rose 24 1 No001 22.1
  • 2 Rose 24 4 No004 65.99
  • 2 Rose 24 7 No007 1.99
  • 2 Rose 24 8 No008 29.99
  • Time taken: 25.987 seconds, Fetched: 8 row(s)

3.3. Right Outer Join: RIGHT OUTER JOIN ... ON

  • hive> select c.id,c.name,c.age,o.id,o.orderno,o.price from customers c right outer join orders o on c.id=o.cid;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625050156_a5df56ca-e839-45b9-9966-1fd48edf2275
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 05:02:03 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 05:02:05 Dump the side-table for tag: 0 with group count: 3 into file: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_05-01-56_297_7254382012370064664-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile50--.hashtable
  • 2017-06-25 05:02:05 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_05-01-56_297_7254382012370064664-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile50--.hashtable (334 bytes)
  • 2017-06-25 05:02:05 End of local task; Time Taken: 1.361 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0025, Tracking URL = http://s100:8088/proxy/application_1498376867650_0025/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0025
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 05:02:13,853 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 05:02:20,103 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 0.94 sec
  • MapReduce Total cumulative CPU time: 940 msec
  • Ended Job = job_1498376867650_0025
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 0.94 sec HDFS Read: 6994 HDFS Write: 437 SUCCESS
  • Total MapReduce CPU Time Spent: 940 msec
  • OK
  • c.id c.name c.age o.id o.orderno o.price
  • 0 Tom 20 0 No000 12.99
  • 2 Rose 24 1 No001 22.1
  • 0 Tom 20 2 No002 13.7
  • 1 Jack 19 3 No003 23.0
  • 2 Rose 24 4 No004 65.99
  • NULL NULL NULL 5 No005 78.2
  • 1 Jack 19 6 No006 4.3
  • 2 Rose 24 7 No007 1.99
  • 2 Rose 24 8 No008 29.99
  • NULL NULL NULL 9 No009 100.99
  • Time taken: 24.925 seconds, Fetched: 10 row(s)

3.4. Full Outer Join: FULL OUTER JOIN ... ON

  • hive> select c.id,c.name,c.age,o.id,o.orderno,o.price from customers c full outer join orders o on c.id=o.cid;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625050323_e0ac061d-a5b8-4dbe-863b-2bb21fdd8f4a
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks not specified. Estimated from input data size: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0026, Tracking URL = http://s100:8088/proxy/application_1498376867650_0026/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0026
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 05:03:30,933 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 05:03:41,300 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.1 sec
  • 2017-06-25 05:03:48,647 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.29 sec
  • MapReduce Total cumulative CPU time: 3 seconds 290 msec
  • Ended Job = job_1498376867650_0026
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.29 sec HDFS Read: 16027 HDFS Write: 437 SUCCESS
  • Total MapReduce CPU Time Spent: 3 seconds 290 msec
  • OK
  • c.id c.name c.age o.id o.orderno o.price
  • NULL NULL NULL 9 No009 100.99
  • 0 Tom 20 2 No002 13.7
  • 0 Tom 20 0 No000 12.99
  • 1 Jack 19 6 No006 4.3
  • 1 Jack 19 3 No003 23.0
  • 2 Rose 24 8 No008 29.99
  • 2 Rose 24 7 No007 1.99
  • 2 Rose 24 4 No004 65.99
  • 2 Rose 24 1 No001 22.1
  • NULL NULL NULL 5 No005 78.2
  • Time taken: 25.804 seconds, Fetched: 10 row(s)

3.5. Left Semi Join: LEFT SEMI JOIN ... ON

Hive does not support the following query:

  • hive> select s.ymd,s.symbol,s.price from stocks s where s.ymd,s.symbol in (select d.ymd,d.symbol from dividends d);

However, a left semi join achieves the same result:

  • hive> select s.ymd,s.symbol,s.price from stocks s LEFT SEMI JOIN dividends d ON s.ymd=d.ymd AND s.symbol=d.symbol;

The SELECT and WHERE clauses must not reference columns from the right-hand table. As an example, let us list only the customers that have orders; customers without any associated order should not appear.

First, insert two more rows into the customers table:

  • hive> insert into hive1.customers(id,name,age) values(3,'John',22);
  • hive> insert into hive1.customers(id,name,age) values(4,'Jane',23);

As the data above shows, the customer with id 4 has no corresponding order in the orders table. Now run the left semi join query:

  • hive> select c.id,c.name,c.age from customers c left semi join orders o on c.id=o.cid;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625062405_61839ea3-c229-4042-b93c-c0cf367e3d3f
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 06:24:15 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 06:24:17 Dump the side-table for tag: 1 with group count: 4 into file: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_06-24-05_550_6876205285174063918-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile81--.hashtable
  • 2017-06-25 06:24:17 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_06-24-05_550_6876205285174063918-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile81--.hashtable (332 bytes)
  • 2017-06-25 06:24:17 End of local task; Time Taken: 2.191 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0031, Tracking URL = http://s100:8088/proxy/application_1498376867650_0031/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0031
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 06:24:26,505 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 06:24:33,144 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.75 sec
  • MapReduce Total cumulative CPU time: 1 seconds 750 msec
  • Ended Job = job_1498376867650_0031
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 1.75 sec HDFS Read: 6838 HDFS Write: 174 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 750 msec
  • OK
  • c.id c.name c.age
  • 3 John 22
  • 0 Tom 20
  • 1 Jack 19
  • 2 Rose 24
  • Time taken: 28.697 seconds, Fetched: 4 row(s)

As expected, the customer with id 4 does not appear in the result.

A left semi join stands in for an IN or EXISTS operation; the statement above is equivalent to the following query:

select id,name,age from hive1.customers where id in (select cid from hive1.orders);

However, Hive has traditionally not supported this kind of IN subquery, so the LEFT SEMI JOIN form is the usual workaround.

3.6. Cartesian Product

  • hive> select c.id,c.name,c.age,o.id,o.orderno,o.price,o.cid from customers c join orders o;
  • Warning: Map Join MAPJOIN[9][bigTable=?] in task 'Stage-3:MAPRED' is a cross product
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625063151_9a546aba-54fe-4d86-a0df-af21e0a709e2
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 06:31:59 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 06:32:01 Dump the side-table for tag: 0 with group count: 1 into file: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_06-31-51_075_2528914133778780088-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile90--.hashtable
  • 2017-06-25 06:32:01 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_06-31-51_075_2528914133778780088-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile90--.hashtable (331 bytes)
  • 2017-06-25 06:32:01 End of local task; Time Taken: 1.971 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0032, Tracking URL = http://s100:8088/proxy/application_1498376867650_0032/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0032
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 06:32:10,305 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 06:32:16,630 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.31 sec
  • MapReduce Total cumulative CPU time: 1 seconds 310 msec
  • Ended Job = job_1498376867650_0032
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 1.31 sec HDFS Read: 7098 HDFS Write: 1952 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 310 msec
  • OK
  • c.id c.name c.age o.id o.orderno o.price o.cid
  • 3 John 22 0 No000 12.99 0
  • 4 Jane 23 0 No000 12.99 0
  • 0 Tom 20 0 No000 12.99 0
  • 1 Jack 19 0 No000 12.99 0
  • 2 Rose 24 0 No000 12.99 0
  • 3 John 22 1 No001 22.1 2
  • 4 Jane 23 1 No001 22.1 2
  • 0 Tom 20 1 No001 22.1 2
  • 1 Jack 19 1 No001 22.1 2
  • 2 Rose 24 1 No001 22.1 2
  • 3 John 22 2 No002 13.7 0
  • 4 Jane 23 2 No002 13.7 0
  • 0 Tom 20 2 No002 13.7 0
  • 1 Jack 19 2 No002 13.7 0
  • 2 Rose 24 2 No002 13.7 0
  • 3 John 22 3 No003 23.0 1
  • 4 Jane 23 3 No003 23.0 1
  • 0 Tom 20 3 No003 23.0 1
  • 1 Jack 19 3 No003 23.0 1
  • 2 Rose 24 3 No003 23.0 1
  • 3 John 22 4 No004 65.99 2
  • 4 Jane 23 4 No004 65.99 2
  • 0 Tom 20 4 No004 65.99 2
  • 1 Jack 19 4 No004 65.99 2
  • 2 Rose 24 4 No004 65.99 2
  • 3 John 22 5 No005 78.2 3
  • 4 Jane 23 5 No005 78.2 3
  • 0 Tom 20 5 No005 78.2 3
  • 1 Jack 19 5 No005 78.2 3
  • 2 Rose 24 5 No005 78.2 3
  • 3 John 22 6 No006 4.3 1
  • 4 Jane 23 6 No006 4.3 1
  • 0 Tom 20 6 No006 4.3 1
  • 1 Jack 19 6 No006 4.3 1
  • 2 Rose 24 6 No006 4.3 1
  • 3 John 22 7 No007 1.99 2
  • 4 Jane 23 7 No007 1.99 2
  • 0 Tom 20 7 No007 1.99 2
  • 1 Jack 19 7 No007 1.99 2
  • 2 Rose 24 7 No007 1.99 2
  • 3 John 22 8 No008 29.99 2
  • 4 Jane 23 8 No008 29.99 2
  • 0 Tom 20 8 No008 29.99 2
  • 1 Jack 19 8 No008 29.99 2
  • 2 Rose 24 8 No008 29.99 2
  • 3 John 22 9 No009 100.99 NULL
  • 4 Jane 23 9 No009 100.99 NULL
  • 0 Tom 20 9 No009 100.99 NULL
  • 1 Jack 19 9 No009 100.99 NULL
  • 2 Rose 24 9 No009 100.99 NULL
  • Time taken: 27.78 seconds, Fetched: 50 row(s)

A Cartesian product can be useful when you need to compute a weight or score for every pairing of rows between two related tables.

3.7. Map-Side Join

The mapjoin hint tells Hive to handle the join for a given table on the map side, eliminating the reduce-side join (in effect a cached, map-side join). The following query keeps the customers table entirely in memory to improve performance:

  • hive> select /*+MAPJOIN(c)*/ c.id,c.name,o.id,o.orderno,o.price from customers c left outer join orders o where c.id=o.cid;
  • Warning: Map Join MAPJOIN[9][bigTable=c] in task 'Stage-3:MAPRED' is a cross product
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625064512_7a68085e-10b7-4981-a5be-404181893e7d
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 06:45:21 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 06:45:23 Dump the side-table for tag: 1 with group count: 1 into file: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_06-45-12_743_2922489995435085182-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile101--.hashtable
  • 2017-06-25 06:45:23 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_06-45-12_743_2922489995435085182-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile101--.hashtable (441 bytes)
  • 2017-06-25 06:45:23 End of local task; Time Taken: 2.563 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0033, Tracking URL = http://s100:8088/proxy/application_1498376867650_0033/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0033
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 06:45:32,683 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 06:45:39,924 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.63 sec
  • MapReduce Total cumulative CPU time: 1 seconds 630 msec
  • Ended Job = job_1498376867650_0033
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 1.63 sec HDFS Read: 7475 HDFS Write: 375 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 630 msec
  • OK
  • c.id c.name o.id o.orderno o.price
  • 3 John 5 No005 78.2
  • 0 Tom 0 No000 12.99
  • 0 Tom 2 No002 13.7
  • 1 Jack 3 No003 23.0
  • 1 Jack 6 No006 4.3
  • 2 Rose 1 No001 22.1
  • 2 Rose 4 No004 65.99
  • 2 Rose 7 No007 1.99
  • 2 Rose 8 No008 29.99
  • Time taken: 28.255 seconds, Fetched: 9 row(s)

In the output above, "number of mappers: 1; number of reducers: 0" confirms that no reduce phase was run.

Since Hive 0.7, this hint has been deprecated in favor of a configuration property that controls whether map-join optimization is applied automatically:

  • hive> set hive.auto.convert.join=true;

In addition, hive.mapjoin.smalltable.filesize can be set in the configuration to cap the size of the small table; when the small table exceeds this threshold, Hive skips the map-side join and falls back to the ordinary reduce-side join.
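
A minimal configuration sketch; the 25 MB value shown for the threshold is Hive's commonly documented default, used here only as an illustration rather than a figure from the article:

  • hive> set hive.auto.convert.join=true;
  • hive> set hive.mapjoin.smalltable.filesize=25000000;
  • hive> select c.id,c.name,o.orderno,o.price from customers c join orders o on c.id=o.cid;

With these settings Hive checks whether the smaller side of the join fits under the threshold and, if so, converts it into a map join automatically.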

3.8. Sorting with ORDER BY

ORDER BY works the same way as in standard SQL, but on a large data set a global ORDER BY is very expensive in both time and resources, so we generally avoid applying it directly to the full data set. Hive also provides SORT BY, which sorts the data locally within each reducer; running a global ORDER BY over data that has already been locally sorted greatly reduces the cost. Here is an ORDER BY example:

  • hive> select * from hive1.orders order by cid asc,price desc;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625070212_27c2fb50-4eef-41b0-b423-5246db6dfaca
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks determined at compile time: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0034, Tracking URL = http://s100:8088/proxy/application_1498376867650_0034/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0034
  • Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
  • 2017-06-25 07:02:23,496 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 07:02:28,756 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.83 sec
  • 2017-06-25 07:02:36,018 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.06 sec
  • MapReduce Total cumulative CPU time: 2 seconds 60 msec
  • Ended Job = job_1498376867650_0034
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.06 sec HDFS Read: 8066 HDFS Write: 362 SUCCESS
  • Total MapReduce CPU Time Spent: 2 seconds 60 msec
  • OK
  • orders.id orders.orderno orders.price orders.cid
  • 9 No009 100.99 NULL
  • 2 No002 13.7 0
  • 0 No000 12.99 0
  • 3 No003 23.0 1
  • 6 No006 4.3 1
  • 4 No004 65.99 2
  • 8 No008 29.99 2
  • 1 No001 22.1 2
  • 7 No007 1.99 2
  • 5 No005 78.2 3
  • Time taken: 25.571 seconds, Fetched: 10 row(s)

Hive provides the hive.strict.checks.* family of properties to restrict how Order By may be executed (the older switch was hive.mapred.mode=strict|nonstrict); in strict mode an Order By query must carry a LIMIT clause.
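
A sketch of how the older switch behaves: in strict mode Hive rejects an Order By query that has no LIMIT, so a limited query such as the following is required before switching back.

  • hive> set hive.mapred.mode=strict;
  • hive> select * from hive1.orders order by cid asc,price desc limit 5;
  • hive> set hive.mapred.mode=nonstrict;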

3.9. Sorting with Sort By

  • hive> select * from hive1.orders sort by cid asc,price desc;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625070738_6059c920-982e-4a7f-aa5a-eade52b0f6eb
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks not specified. Estimated from input data size: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0035, Tracking URL = http://s100:8088/proxy/application_1498376867650_0035/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0035
  • Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
  • 2017-06-25 07:07:45,124 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 07:07:51,367 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.82 sec
  • 2017-06-25 07:07:58,785 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.02 sec
  • MapReduce Total cumulative CPU time: 2 seconds 20 msec
  • Ended Job = job_1498376867650_0035
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.02 sec HDFS Read: 7958 HDFS Write: 362 SUCCESS
  • Total MapReduce CPU Time Spent: 2 seconds 20 msec
  • OK
  • orders.id orders.orderno orders.price orders.cid
  • 9 No009 100.99 NULL
  • 2 No002 13.7 0
  • 0 No000 12.99 0
  • 3 No003 23.0 1
  • 6 No006 4.3 1
  • 4 No004 65.99 2
  • 8 No008 29.99 2
  • 1 No001 22.1 2
  • 7 No007 1.99 2
  • 5 No005 78.2 3
  • Time taken: 21.762 seconds, Fetched: 10 row(s)
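
With only one reducer, the Sort By result above happens to look identical to the Order By result. To make the per-reducer (local) nature of Sort By visible, the number of reducers can be forced up; a minimal sketch:

  • hive> set mapreduce.job.reduces=2;
  • hive> select * from hive1.orders sort by cid asc,price desc;
  • hive> set mapreduce.job.reduces=-1;

Each of the two reducers then sorts only its own share of the rows, so the combined output is no longer globally ordered; setting mapreduce.job.reduces back to -1 restores Hive's automatic estimate.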

3.10. Distributing Rows with Distribute By

  • hive> select * from hive1.orders distribute by cid sort by cid asc,price desc ;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625071930_b5f93c42-970d-4fd6-b2b3-e3acba1a9060
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks not specified. Estimated from input data size: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0039, Tracking URL = http://s100:8088/proxy/application_1498376867650_0039/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0039
  • Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
  • 2017-06-25 07:19:37,196 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 07:19:45,550 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.23 sec
  • 2017-06-25 07:19:53,859 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.57 sec
  • MapReduce Total cumulative CPU time: 2 seconds 570 msec
  • Ended Job = job_1498376867650_0039
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.57 sec HDFS Read: 7992 HDFS Write: 362 SUCCESS
  • Total MapReduce CPU Time Spent: 2 seconds 570 msec
  • OK
  • orders.id orders.orderno orders.price orders.cid
  • 9 No009 100.99 NULL
  • 2 No002 13.7 0
  • 0 No000 12.99 0
  • 3 No003 23.0 1
  • 6 No006 4.3 1
  • 4 No004 65.99 2
  • 8 No008 29.99 2
  • 1 No001 22.1 2
  • 7 No007 1.99 2
  • 5 No005 78.2 3
  • Time taken: 24.513 seconds, Fetched: 10 row(s)

In the query above, all rows with the same cid are sent to the same reducer because of distribute by cid, so the orders of each customer come out ranked by price; for any single customer the ordering is complete, since all of that customer's rows are handled by one reducer. Note that distribute by must be written before sort by.

Note: Sort By controls the ordering inside each reducer, while Distribute By controls how rows are grouped onto reducers.
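
When the distribute column and the sort column are the same and ascending order is enough, Cluster By is shorthand for the Distribute By + Sort By pair; a minimal sketch:

  • hive> select * from hive1.orders cluster by cid;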

3.11. The UNION Operation

  • hive> select id,name from customers union select id,orderno from orders;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625072820_d5c549e5-7254-4a8a-b733-210607b0b0b0
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks not specified. Estimated from input data size: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0040, Tracking URL = http://s100:8088/proxy/application_1498376867650_0040/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0040
  • Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
  • 2017-06-25 07:28:27,930 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 07:28:38,233 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.92 sec
  • 2017-06-25 07:28:45,473 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.16 sec
  • MapReduce Total cumulative CPU time: 3 seconds 160 msec
  • Ended Job = job_1498376867650_0040
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.16 sec HDFS Read: 15499 HDFS Write: 381 SUCCESS
  • Total MapReduce CPU Time Spent: 3 seconds 160 msec
  • OK
  • u2.id u2.name
  • 0 No000
  • 0 Tom
  • 1 Jack
  • 1 No001
  • 2 No002
  • 2 Rose
  • 3 John
  • 3 No003
  • 4 Jane
  • 4 No004
  • 5 No005
  • 6 No006
  • 7 No007
  • 8 No008
  • 9 No009
  • Time taken: 26.427 seconds, Fetched: 15 row(s)
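
Note that UNION (that is, UNION DISTINCT) removes duplicate rows, which is why the job above needed a reduce phase; UNION ALL keeps duplicates and skips that work. A minimal sketch:

  • hive> select id,name from customers union all select id,orderno from orders;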

4. Views (Virtual Tables)

Create a view:

  • hive> create view view1 as select c.id cid,c.name,c.age,o.id oid,o.orderno,o.price,o.cid ocid from customers c left outer join orders o on c.id=o.cid;
  • OK
  • cid name age oid orderno price ocid
  • Time taken: 0.335 seconds

Note that when the two tables contain columns with the same name, different aliases must be assigned to avoid conflicts.

The details of view1 can be inspected:

  • hive> desc formatted view1;
  • OK
  • # col_name data_type comment
  • cid int
  • name string
  • age int
  • oid int
  • orderno string
  • price float
  • ocid int
  • # Detailed Table Information
  • Database: hive1
  • Owner: ubuntu
  • CreateTime: Sun Jun 25 07:49:10 PDT 2017
  • LastAccessTime: UNKNOWN
  • Retention: 0
  • Table Type: VIRTUAL_VIEW
  • Table Parameters:
  • transient_lastDdlTime 1498402150
  • # Storage Information
  • SerDe Library: null
  • InputFormat: org.apache.hadoop.mapred.TextInputFormat
  • OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  • Compressed: No
  • Num Buckets: -1
  • Bucket Columns: []
  • Sort Columns: []
  • # View Information
  • View Original Text: select c.id cid,c.name,c.age,o.id oid,o.orderno,o.price,o.cid ocid from customers c left outer join orders o on c.id=o.cid
  • View Expanded Text: select `c`.`id` `cid`,`c`.`name`,`c`.`age`,`o`.`id` `oid`,`o`.`orderno`,`o`.`price`,`o`.`cid` `ocid` from `hive1`.`customers` `c` left outer join `hive1`.`orders` `o` on `c`.`id`=`o`.`cid`
  • Time taken: 0.114 seconds, Fetched: 32 row(s)

Notice that the table type of view1 is VIRTUAL_VIEW.

After the view is created, its entry can also be found in the MySQL metastore:

  • mysql> select * from tbls;
  • +--------+-------------+-------+------------------+--------+-----------+-------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
  • | TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
  • +--------+-------------+-------+------------------+--------+-----------+-------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
  • | 1 | 1498139809 | 1 | 0 | ubuntu | 0 | 1 | mytable | MANAGED_TABLE | NULL | NULL |
  • | 22 | 1498301797 | 11 | 0 | ubuntu | 0 | 22 | table2 | MANAGED_TABLE | NULL | NULL |
  • | 23 | 1498303911 | 11 | 0 | ubuntu | 0 | 23 | table3 | MANAGED_TABLE | NULL | NULL |
  • | 26 | 1498379532 | 11 | 0 | ubuntu | 0 | 32 | table4 | MANAGED_TABLE | NULL | NULL |
  • | 28 | 1498389318 | 11 | 0 | ubuntu | 0 | 34 | orders | MANAGED_TABLE | NULL | NULL |
  • | 29 | 1498389369 | 11 | 0 | ubuntu | 0 | 35 | customers | MANAGED_TABLE | NULL | NULL |
  • | 30 | 1498402150 | 11 | 0 | ubuntu | 0 | 36 | view1 | VIRTUAL_VIEW | select `c`.`id` `cid`,`c`.`name`,`c`.`age`,`o`.`id` `oid`,`o`.`orderno`,`o`.`price`,`o`.`cid` `ocid` from `hive1`.`customers` `c` left outer join `hive1`.`orders` `o` on `c`.`id`=`o`.`cid` | select c.id cid,c.name,c.age,o.id oid,o.orderno,o.price,o.cid ocid from customers c left outer join orders o on c.id=o.cid |
  • +--------+-------------+-------+------------------+--------+-----------+-------+-----------+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
  • 7 rows in set (0.00 sec)

Note: MySQL can display each row vertically (one column per line) with \G:

  • mysql> select * from tbls \G;
  • *************************** 1. row ***************************
  • TBL_ID: 1
  • CREATE_TIME: 1498139809
  • DB_ID: 1
  • LAST_ACCESS_TIME: 0
  • OWNER: ubuntu
  • RETENTION: 0
  • SD_ID: 1
  • TBL_NAME: mytable
  • TBL_TYPE: MANAGED_TABLE
  • VIEW_EXPANDED_TEXT: NULL
  • VIEW_ORIGINAL_TEXT: NULL
  • *************************** 2. row ***************************
  • TBL_ID: 22
  • CREATE_TIME: 1498301797
  • DB_ID: 11
  • LAST_ACCESS_TIME: 0
  • OWNER: ubuntu
  • RETENTION: 0
  • SD_ID: 22
  • TBL_NAME: table2
  • TBL_TYPE: MANAGED_TABLE
  • VIEW_EXPANDED_TEXT: NULL
  • VIEW_ORIGINAL_TEXT: NULL
  • *************************** 3. row ***************************
  • TBL_ID: 23
  • CREATE_TIME: 1498303911
  • DB_ID: 11
  • LAST_ACCESS_TIME: 0
  • OWNER: ubuntu
  • RETENTION: 0
  • SD_ID: 23
  • TBL_NAME: table3
  • TBL_TYPE: MANAGED_TABLE
  • VIEW_EXPANDED_TEXT: NULL
  • VIEW_ORIGINAL_TEXT: NULL
  • *************************** 4. row ***************************
  • TBL_ID: 26
  • CREATE_TIME: 1498379532
  • DB_ID: 11
  • LAST_ACCESS_TIME: 0
  • OWNER: ubuntu
  • RETENTION: 0
  • SD_ID: 32
  • TBL_NAME: table4
  • TBL_TYPE: MANAGED_TABLE
  • VIEW_EXPANDED_TEXT: NULL
  • VIEW_ORIGINAL_TEXT: NULL
  • *************************** 5. row ***************************
  • TBL_ID: 28
  • CREATE_TIME: 1498389318
  • DB_ID: 11
  • LAST_ACCESS_TIME: 0
  • OWNER: ubuntu
  • RETENTION: 0
  • SD_ID: 34
  • TBL_NAME: orders
  • TBL_TYPE: MANAGED_TABLE
  • VIEW_EXPANDED_TEXT: NULL
  • VIEW_ORIGINAL_TEXT: NULL
  • *************************** 6. row ***************************
  • TBL_ID: 29
  • CREATE_TIME: 1498389369
  • DB_ID: 11
  • LAST_ACCESS_TIME: 0
  • OWNER: ubuntu
  • RETENTION: 0
  • SD_ID: 35
  • TBL_NAME: customers
  • TBL_TYPE: MANAGED_TABLE
  • VIEW_EXPANDED_TEXT: NULL
  • VIEW_ORIGINAL_TEXT: NULL
  • *************************** 7. row ***************************
  • TBL_ID: 30
  • CREATE_TIME: 1498402150
  • DB_ID: 11
  • LAST_ACCESS_TIME: 0
  • OWNER: ubuntu
  • RETENTION: 0
  • SD_ID: 36
  • TBL_NAME: view1
  • TBL_TYPE: VIRTUAL_VIEW
  • VIEW_EXPANDED_TEXT: select `c`.`id` `cid`,`c`.`name`,`c`.`age`,`o`.`id` `oid`,`o`.`orderno`,`o`.`price`,`o`.`cid` `ocid` from `hive1`.`customers` `c` left outer join `hive1`.`orders` `o` on `c`.`id`=`o`.`cid`
  • VIEW_ORIGINAL_TEXT: select c.id cid,c.name,c.age,o.id oid,o.orderno,o.price,o.cid ocid from customers c left outer join orders o on c.id=o.cid
  • 7 rows in set (0.00 sec)
  • ERROR:
  • No query specified

The view can then be queried directly, which is much more convenient:

  • hive> select * from view1;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625075154_d80a0a7f-6fcb-4295-9433-a7911210d689
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 07:52:02 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 07:52:03 Dump the side-table for tag: 1 with group count: 5 into file: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_07-51-54_237_2146709166685555479-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile111--.hashtable
  • 2017-06-25 07:52:03 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/310e1b2f-2e9d-4679-848e-0fc96fd6b5ab/hive_2017-06-25_07-51-54_237_2146709166685555479-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile111--.hashtable (489 bytes)
  • 2017-06-25 07:52:03 End of local task; Time Taken: 1.606 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0041, Tracking URL = http://s100:8088/proxy/application_1498376867650_0041/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0041
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 07:52:12,717 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 07:52:18,940 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.24 sec
  • MapReduce Total cumulative CPU time: 1 seconds 240 msec
  • Ended Job = job_1498376867650_0041
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 1.24 sec HDFS Read: 7546 HDFS Write: 454 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 240 msec
  • OK
  • view1.cid view1.name view1.age view1.oid view1.orderno view1.price view1.ocid
  • 3 John 22 5 No005 78.2 3
  • 4 Jane 23 NULL NULL NULL NULL
  • 0 Tom 20 0 No000 12.99 0
  • 0 Tom 20 2 No002 13.7 0
  • 1 Jack 19 3 No003 23.0 1
  • 1 Jack 19 6 No006 4.3 1
  • 2 Rose 24 1 No001 22.1 2
  • 2 Rose 24 4 No004 65.99 2
  • 2 Rose 24 7 No007 1.99 2
  • 2 Rose 24 8 No008 29.99 2
  • Time taken: 26.869 seconds, Fetched: 10 row(s)

A where condition can also be added when querying the view:

  • hive> select * from view1 where price>50;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625075459_e3ae353b-cb68-4094-a81f-eb9b15241fda
  • Total jobs = 1
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
  • 2017-06-25 07:55:08 Starting to launch local task to process map join; maximum memory = 518979584
  • 2017-06-25 07:55:10 Dump the side-table for tag: 0 with group count: 5 into file: file:/home/ubuntu/hive/ubuntu/ede13000-25f9-449f-8681-9b2a2c4e70f8/hive_2017-06-25_07-54-59_492_6774137290707458272-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
  • 2017-06-25 07:55:10 Uploaded 1 File to: file:/home/ubuntu/hive/ubuntu/ede13000-25f9-449f-8681-9b2a2c4e70f8/hive_2017-06-25_07-54-59_492_6774137290707458272-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (384 bytes)
  • 2017-06-25 07:55:10 End of local task; Time Taken: 1.944 sec.
  • Execution completed successfully
  • MapredLocal task succeeded
  • Launching Job 1 out of 1
  • Number of reduce tasks is set to 0 since there's no reduce operator
  • Starting Job = job_1498376867650_0042, Tracking URL = http://s100:8088/proxy/application_1498376867650_0042/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0042
  • Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
  • 2017-06-25 07:55:22,426 Stage-3 map = 0%, reduce = 0%
  • 2017-06-25 07:55:29,288 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.6 sec
  • MapReduce Total cumulative CPU time: 1 seconds 600 msec
  • Ended Job = job_1498376867650_0042
  • MapReduce Jobs Launched:
  • Stage-Stage-3: Map: 1 Cumulative CPU: 1.6 sec HDFS Read: 8061 HDFS Write: 162 SUCCESS
  • Total MapReduce CPU Time Spent: 1 seconds 600 msec
  • OK
  • 2 Rose 24 4 No004 65.99 2
  • 3 John 22 5 No005 78.2 3
  • Time taken: 30.968 seconds, Fetched: 2 row(s)
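
When a view is no longer needed, it can be dropped without touching the underlying tables; a minimal sketch:

  • hive> drop view if exists view1;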

5. Indexes

Hive has no enforced primary keys or auto-increment columns, which is a disadvantage for some query patterns; we can, however, ask Hive to build an index on a table:

  • hive> create index index_customers on table customers (id) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild idxproperties('creator'='me') in table customers_index comment 'index of table customers';
  • OK
  • Time taken: 0.725 seconds

The corresponding metadata can be checked in MySQL:

  • mysql> select TBL_ID,DB_ID,SD_ID,TBL_NAME,TBL_TYPE from tbls;
  • +--------+-------+-------+-----------------+---------------+
  • | TBL_ID | DB_ID | SD_ID | TBL_NAME | TBL_TYPE |
  • +--------+-------+-------+-----------------+---------------+
  • | 1 | 1 | 1 | mytable | MANAGED_TABLE |
  • | 22 | 11 | 22 | table2 | MANAGED_TABLE |
  • | 23 | 11 | 23 | table3 | MANAGED_TABLE |
  • | 26 | 11 | 32 | table4 | MANAGED_TABLE |
  • | 28 | 11 | 34 | orders | MANAGED_TABLE |
  • | 29 | 11 | 35 | customers | MANAGED_TABLE |
  • | 30 | 11 | 36 | view1 | VIRTUAL_VIEW |
  • | 31 | 11 | 41 | customers_index | INDEX_TABLE |
  • +--------+-------+-------+-----------------+---------------+
  • 8 rows in set (0.00 sec)
  • mysql> select * from idxs;
  • +----------+-------------+------------------+-------------------------------------------------------------+-----------------+--------------+------------------+-------------+-------+
  • | INDEX_ID | CREATE_TIME | DEFERRED_REBUILD | INDEX_HANDLER_CLASS | INDEX_NAME | INDEX_TBL_ID | LAST_ACCESS_TIME | ORIG_TBL_ID | SD_ID |
  • +----------+-------------+------------------+-------------------------------------------------------------+-----------------+--------------+------------------+-------------+-------+
  • | 1 | 1498403882 |  | org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler | index_customers | 31 | 1498403882 | 29 | 42 |
  • +----------+-------------+------------------+-------------------------------------------------------------+-----------------+--------------+------------------+-------------+-------+
  • 1 row in set (0.00 sec)

The index metadata now exists, but the index table itself still holds no data: when an index is created with deferred rebuild, the index table starts out empty, and its contents are only generated the first time the index is rebuilt with ALTER INDEX ... REBUILD (hence the keyword deferred: the build is postponed until explicitly requested):

  • hive> alter index index_customers on customers rebuild;
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625082601_396474cc-5a46-497c-b3d0-940b31103cac
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks not specified. Estimated from input data size: 1
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0043, Tracking URL = http://s100:8088/proxy/application_1498376867650_0043/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0043
  • Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
  • 2017-06-25 08:26:10,726 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 08:26:16,118 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.87 sec
  • 2017-06-25 08:26:23,520 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.07 sec
  • MapReduce Total cumulative CPU time: 2 seconds 70 msec
  • Ended Job = job_1498376867650_0043
  • Loading data to table hive1.customers_index
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.07 sec HDFS Read: 9928 HDFS Write: 409 SUCCESS
  • Total MapReduce CPU Time Spent: 2 seconds 70 msec
  • OK
  • Time taken: 24.114 seconds

Next, the contents of the newly built index table can be examined:

  • hive> select * from customers_index;
  • OK
  • 0 hdfs://s100/user/ubuntu/data/hive1.db/customers/customers.data [0]
  • 1 hdfs://s100/user/ubuntu/data/hive1.db/customers/customers.data [9]
  • 2 hdfs://s100/user/ubuntu/data/hive1.db/customers/customers.data [19]
  • 3 hdfs://s100/user/ubuntu/data/hive1.db/customers/000000_0 [0]
  • 4 hdfs://s100/user/ubuntu/data/hive1.db/customers/000000_0_copy_1 [0]
  • Time taken: 0.086 seconds, Fetched: 5 row(s)

Note: the index table itself is stored on HDFS:

  • hive> dfs -ls -R /;
  • ...
  • -rwxr-xr-x 3 ubuntu supergroup 331 2017-06-25 08:26 /user/ubuntu/data/hive1.db/customers_index/000000_0
  • ...
  • hive> dfs -cat /user/ubuntu/data/hive1.db/customers_index/000000_0;
  • 0hdfs://s100/user/ubuntu/data/hive1.db/customers/customers.data0
  • 1hdfs://s100/user/ubuntu/data/hive1.db/customers/customers.data9
  • 2hdfs://s100/user/ubuntu/data/hive1.db/customers/customers.data19
  • 3hdfs://s100/user/ubuntu/data/hive1.db/customers/000000_00
  • 4hdfs://s100/user/ubuntu/data/hive1.db/customers/000000_0_copy_10

In the index table, the first column is the id value of each row of customers, since the index was built on customers.id. The second column is the HDFS path of the data file that holds the row with that id (as seen earlier, ids 0-2 were loaded in the first batch, id 3 in the second and id 4 in the third, so they sit in different files). The third column is the byte offset of that row inside its data file (ids 0-2 share one file and therefore have increasing offsets, while ids 3 and 4 each sit alone in their own file, so their offsets are 0). This can be verified against the table data:

  • hive> dfs -cat /user/ubuntu/data/hive1.db/customers/customers.data;
  • 0 Tom 20
  • 1 Jack 19
  • 2 Rose 24
  • hive> dfs -cat /user/ubuntu/data/hive1.db/customers/000000_0;
  • 3 John 22
  • hive> dfs -cat /user/ubuntu/data/hive1.db/customers/000000_0_copy_1;
  • 4 Jane 23
The row with id 0 occupies 1 byte (the character "0") + 1 (field delimiter) + 3 ("Tom") + 1 (field delimiter) + 2 ("20") + 1 (newline) = 9 bytes, and the row with id 1 occupies 1 ("1") + 1 + 4 ("Jack") + 1 + 2 ("19") + 1 = 10 bytes. Hence the row with id 1 starts at offset 9 and the row with id 2 at offset 9 + 10 = 19, matching the index table above.
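
For Hive to actually consult this index when filtering on id, automatic index use has to be enabled (it is off by default); a sketch using the compact index built above, noting that whether the index is really used also depends on data size and other settings:

  • hive> set hive.optimize.index.filter=true;
  • hive> select * from customers where id=2;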

6. Bucketed Tables

In Hive, data can be bucketed by the values of a column; bucketing works much like the Partitioner in Hadoop MapReduce. For Hive to bucket the data it stores, the bucketing must be specified when the table is created:

  • hive> create table hive1.BucketingTable(id int,orderno string,cid int)
  • > clustered by (cid) into 3 buckets
  • > row format delimited
  • > fields terminated by '\t'
  • > lines terminated by '\n'
  • > stored as textfile;
  • OK
  • Time taken: 0.108 seconds
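
One of the main payoffs of bucketing is cheap sampling: TABLESAMPLE can read a single bucket instead of scanning the whole table. A minimal sketch, runnable once the table has been populated as shown below:

  • hive> select * from hive1.BucketingTable tablesample(bucket 1 out of 3 on cid);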

At this point HDFS contains the table directory, but no table data yet:

  • hive> dfs -ls -R /;
  • ...
  • drwxr-xr-x - ubuntu supergroup 0 2017-06-25 09:23 /user/ubuntu/data/hive1.db/bucketingtable
  • ...

Insert a row into the table:

  • hive> insert into hive1.BucketingTable(id,orderno,cid) values(1,'No001',1);
  • WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  • Query ID = ubuntu_20170625092558_46fd3ae2-07c3-40d1-81ef-794c350c948a
  • Total jobs = 1
  • Launching Job 1 out of 1
  • Number of reduce tasks determined at compile time: 3
  • In order to change the average load for a reducer (in bytes):
  • set hive.exec.reducers.bytes.per.reducer=<number>
  • In order to limit the maximum number of reducers:
  • set hive.exec.reducers.max=<number>
  • In order to set a constant number of reducers:
  • set mapreduce.job.reduces=<number>
  • Starting Job = job_1498376867650_0045, Tracking URL = http://s100:8088/proxy/application_1498376867650_0045/
  • Kill Command = /soft/hadoop/bin/hadoop job -kill job_1498376867650_0045
  • Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
  • 2017-06-25 09:26:07,726 Stage-1 map = 0%, reduce = 0%
  • 2017-06-25 09:26:15,102 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.35 sec
  • 2017-06-25 09:26:23,453 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 8.06 sec
  • MapReduce Total cumulative CPU time: 8 seconds 60 msec
  • Ended Job = job_1498376867650_0045
  • Loading data to table hive1.bucketingtable
  • MapReduce Jobs Launched:
  • Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 8.06 sec HDFS Read: 16844 HDFS Write: 178 SUCCESS
  • Total MapReduce CPU Time Spent: 8 seconds 60 msec
  • OK
  • Time taken: 26.29 seconds

Check the files in HDFS again:

  • hive> dfs -ls -R /user/ubuntu/data/hive1.db/bucketingtable;
  • -rwxr-xr-x 3 ubuntu supergroup 0 2017-06-25 09:26 /user/ubuntu/data/hive1.db/bucketingtable/000000_0
  • -rwxr-xr-x 3 ubuntu supergroup 10 2017-06-25 09:26 /user/ubuntu/data/hive1.db/bucketingtable/000001_0
  • -rwxr-xr-x 3 ubuntu supergroup 0 2017-06-25 09:26 /user/ubuntu/data/hive1.db/bucketingtable/000002_0
  • hive> dfs -cat /user/ubuntu/data/hive1.db/bucketingtable/000001_0;
  • 1 No001 1

The bucket files now exist: because the table was declared with three buckets, three files were produced. The row we inserted has cid 1, and its hash modulo the number of buckets is 1, so it went into bucket 1 (file 000001_0). Likewise, after inserting a row with cid 2, check the HDFS files once more:

  • hive> insert into hive1.BucketingTable(id,orderno,cid) values(2,'No002',2);
  • ...
  • hive> dfs -ls -R /user/ubuntu/data/hive1.db/bucketingtable;
  • -rwxr-xr-x 3 ubuntu supergroup 0 2017-06-25 09:26 /user/ubuntu/data/hive1.db/bucketingtable/000000_0
  • -rwxr-xr-x 3 ubuntu supergroup 0 2017-06-25 09:29 /user/ubuntu/data/hive1.db/bucketingtable/000000_0_copy_1
  • -rwxr-xr-x 3 ubuntu supergroup 10 2017-06-25 09:26 /user/ubuntu/data/hive1.db/bucketingtable/000001_0
  • -rwxr-xr-x 3 ubuntu supergroup 0 2017-06-25 09:29 /user/ubuntu/data/hive1.db/bucketingtable/000001_0_copy_1
  • -rwxr-xr-x 3 ubuntu supergroup 0 2017-06-25 09:26 /user/ubuntu/data/hive1.db/bucketingtable/000002_0
  • -rwxr-xr-x 3 ubuntu supergroup 10 2017-06-25 09:29 /user/ubuntu/data/hive1.db/bucketingtable/000002_0_copy_1
  • hive> dfs -cat /user/ubuntu/data/hive1.db/bucketingtable/000002_0_copy_1;