HBase
Distributed Storage

HBase 03: Region Operations and Filters

Overview: HBase is a distributed, column-oriented storage system built on top of HDFS. Modeled on Google BigTable, it is a typical key/value store and an important member of the Apache Hadoop ecosystem, used mainly for storing massive amounts of structured data. Logically, HBase organizes data into tables, rows, and columns. Like Hadoop, HBase scales horizontally: computing and storage capacity grow by adding inexpensive commodity servers.

1. Region Operations

The hbase:meta catalog stores the region information for every table, including which RegionServer hosts each region; this metadata is itself kept in a table structure:

1.HBase表的区域信息.png (region information of the HBase table)

In the output above we can see the region/RegionServer information for table t1.

1.1. Region Splitting

The regions of an HBase table can be split on demand; that is, each region of a table can be split at a point chosen by the developer. First, let's add a few rows to table t1:

  • hbase(main):004:0> put 't1','row1000','f1:name','Rose'
  • 0 row(s) in 0.0760 seconds
  • hbase(main):005:0> put 't1','row2000','f1:age','22'
  • 0 row(s) in 0.0100 seconds
  • hbase(main):006:0> put 't1','row2500','f1:name','John'
  • 0 row(s) in 0.0280 seconds
  • hbase(main):007:0> put 't1','row5000','f1:name','Jack'
  • 0 row(s) in 0.0080 seconds

Then scan the data again:

  • hbase(main):013:0> scan 't1'
  • ROW COLUMN+CELL
  • row1 column=f1:age, timestamp=1499004214424, value=\x00\x00\x00\x16
  • row1 column=f1:name, timestamp=1499004214424, value=Tom
  • row1000 column=f1:name, timestamp=1499090917690, value=Rose
  • row2 column=f1:age, timestamp=1499004601836, value=\x00\x00\x00\x16
  • row2 column=f1:id, timestamp=1499004601836, value=\x00\x00\x00\x01
  • row2 column=f1:name, timestamp=1499004601836, value=Tom
  • row2000 column=f1:age, timestamp=1499090946170, value=22
  • row2500 column=f1:name, timestamp=1499091123503, value=John
  • row5000 column=f1:name, timestamp=1499090978413, value=Jack
  • 6 row(s) in 0.0390 seconds
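The ordering in the scan above is plain lexicographic byte comparison of the row keys, which is why row1000 sorts before row2. A minimal sketch in plain Java (not HBase code) of that comparison:

```java
import java.util.Arrays;

public class RowKeyOrder {
    // Unsigned byte-by-byte comparison, the ordering HBase applies to row keys.
    public static int compare(String a, String b) {
        byte[] x = a.getBytes();
        byte[] y = b.getBytes();
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) return d;
        }
        return x.length - y.length; // shared prefix: shorter key sorts first
    }

    public static void main(String[] args) {
        String[] rows = {"row5000", "row1", "row2500", "row1000", "row2", "row2000"};
        Arrays.sort(rows, RowKeyOrder::compare);
        // Same order as the scan: row1, row1000, row2, row2000, row2500, row5000
        System.out.println(Arrays.toString(rows));
    }
}
```

This is also why production row keys are often zero-padded to a fixed width when numeric ordering is desired.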

Notice that the rows are sorted by row key in lexicographic (string/byte) order. By default, HBase uses the configuration property hbase.hregion.max.filesize to cap the size of a region's store files; when a store file exceeds this value, the region is split automatically, normally into two halves. The default value is 10737418240 bytes (10 GB). We can also split a region manually at a given row key:

  • hbase(main):014:0> split 't1','row2500'
  • 0 row(s) in 0.1850 seconds

After the split, we can inspect the metadata of table t1 again:

2.HBase表中切割后的区域信息.png (region information of the table after the split)

The highlighted information in the figure shows the following:

  1. The splitA and splitB columns of the info family describe the two daughter regions produced by the split; their values store the region information keyed by the daughter regions' row keys, including each daughter's STARTKEY and ENDKEY.
  2. The regioninfo column of the info family, under each daughter region's row key, stores that region's table information, again including its STARTKEY and ENDKEY.

The splitA and splitB entries are cleaned up after some time, leaving only the regioninfo entries of the daughter regions. The regions holding t1's data after the split can also be viewed in the WebUI:

3.HBase表中切割后在WebUI中的区域信息.png (regions of the table after the split, as shown in the WebUI)

After splitting a region, we can split one of the resulting regions again, like so:

  • hbase(main):016:0> split 't1,row2500,1499091276691.b359f008430c17cec2c8adc744be48c1.', 'row4000'
  • 0 row(s) in 0.1530 seconds

After this split completes, restarting HBase and examining the metadata stored in hbase:meta shows:

4.HBase表中再次切割后的区域信息.png (region information after the second split)

If RegionServers are left to split regions automatically based on data volume, many nodes may hit the per-region size threshold at the same time and split simultaneously, which can cause a split storm. To avoid this, we can pre-split a table by specifying its split points at creation time:

  • create 'tt','f1',SPLITS=>['row100000000','row200000000','row300000000']
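Split points like the ones passed to SPLITS above can be generated programmatically. A small sketch (evenSplits is a hypothetical helper for illustration, not an HBase API; it also assumes row keys are comparable numerically, so in practice the numeric part should be zero-padded to a fixed width given the lexicographic ordering discussed earlier):

```java
public class SplitKeys {
    // Generate (regions - 1) evenly spaced split points over row0..row(maxRow).
    public static String[] evenSplits(long maxRow, int regions) {
        String[] keys = new String[regions - 1]; // n regions need n - 1 split points
        long step = maxRow / regions;
        for (int i = 1; i < regions; i++) {
            keys[i - 1] = "row" + (step * i);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Four regions need three split points:
        // row100000000, row200000000, row300000000
        for (String k : evenSplits(400000000L, 4)) {
            System.out.println(k);
        }
    }
}
```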

1.2. Region Merging

Just as regions can be split, two regions can be merged with the merge_region command, used as follows:

  • hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'
  • hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME', true

Both commands merge two regions, taking the encoded names of the two regions as arguments; these encoded values can be looked up in hbase:meta. The trailing boolean in the second form indicates whether to force the merge. For example, to merge the last two regions produced by the splits above, first look up their encoded region names:

5.表的RegionName的encoded值.png (encoded values of the table's region names)

Then run the merge command:

  • hbase(main):001:0> merge_region 'a28479c4661483b1aed754c2a2d286fd','f01bd1d8a1f6e3deb2ef44735569b5f8',true
  • 0 row(s) in 0.4400 seconds

Once the merge succeeds, the merged region can be inspected in the WebUI:

6.合并后的Region在WebUI中的信息.png (the merged region as shown in the WebUI)

1.3. Moving Regions

We can also move a region to a particular node. At the moment both regions reside on node s103. The move command relocates a region; its usage is as follows:

  • hbase> move 'ENCODED_REGIONNAME'
  • hbase> move 'ENCODED_REGIONNAME', 'SERVER_NAME'

The command takes two arguments: the first is the encoded region name, and the second is the server name of the destination node, formed by joining the destination's host, port, and start code. A node's start code can be found in the hbase:meta information:

7.HBase的节点的StartCode.png (start code of an HBase node)

It can also be found in the Zookeeper information in the WebUI:

8.WebUI中HBase的节点的StartCode1.png (node start code in the WebUI, part 1)

8.WebUI中HBase的节点的StartCode2.png (node start code in the WebUI, part 2)

Note: the zk_dump view shown in the WebUI above can also be produced by typing zk_dump directly in the HBase shell.

Now let's move the second region to s102:

  • hbase(main):005:0> move 'e12d1a83097bf6700d02d7c5f4e80f69','s102,16020,1499092375444'
  • 0 row(s) in 0.0830 seconds

After the move, check the WebUI again:

9.移动Region后的WebUI信息.png (WebUI information after moving the region)

2. Viewing the Contents of Table Data Files

We can use hbase org.apache.hadoop.hbase.io.hfile.HFile -f <table data file> -v -m -p to view the contents of a table's data files:

  • ubuntu@s100:~$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /hbase/data/default/t1/b579864dc123839ca623dee3f01036f3/f1/506df7932cce42719dc7f66018f5eab7 -v -m -p
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  • Scanning -> /hbase/data/default/t1/b579864dc123839ca623dee3f01036f3/f1/506df7932cce42719dc7f66018f5eab7
  • 2017-07-03 08:20:36,378 INFO [main] hfile.CacheConfig: Created cacheConfig: CacheConfig:disabled
  • K: row1/f1:age/1499004214424/Put/vlen=4/seqid=4 V: \x00\x00\x00\x16
  • K: row1/f1:name/1499004214424/Put/vlen=3/seqid=4 V: Tom
  • K: row1000/f1:name/1499090917690/Put/vlen=4/seqid=14 V: Rose
  • K: row2/f1:age/1499004601836/Put/vlen=4/seqid=6 V: \x00\x00\x00\x16
  • K: row2/f1:id/1499004601836/Put/vlen=4/seqid=6 V: \x00\x00\x00\x01
  • K: row2/f1:name/1499004601836/Put/vlen=3/seqid=6 V: Tom
  • K: row2000/f1:age/1499090946170/Put/vlen=2/seqid=15 V: 22
  • Block index size as per heapsize: 392
  • ...
  • ubuntu@s100:~$ hbase org.apache.hadoop.hbase.io.hfile.HFile -f /hbase/data/default/t1/e12d1a83097bf6700d02d7c5f4e80f69/f1/61092ac0261c4af9994eeb188d62a98b -v -m -p
  • SLF4J: Class path contains multiple SLF4J bindings.
  • SLF4J: Found binding in [jar:file:/soft/hbase-1.2.4/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: Found binding in [jar:file:/soft/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  • SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  • SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
  • Scanning -> /hbase/data/default/t1/e12d1a83097bf6700d02d7c5f4e80f69/f1/61092ac0261c4af9994eeb188d62a98b
  • 2017-07-03 08:19:06,996 INFO [main] hfile.CacheConfig: Created cacheConfig: CacheConfig:disabled
  • K: row2500/f1:name/1499091123503/Put/vlen=4/seqid=19 V: John
  • K: row5000/f1:name/1499090978413/Put/vlen=4/seqid=17 V: Jack
  • Block index size as per heapsize: 400
  • ...

3. WAL

A RegionServer buffers writes in memory until enough data has accumulated to flush to disk, which avoids creating many small files. Data held only in memory is volatile, however: a power failure, for example, can lose it. The common solution is a write-ahead log (WAL): every update is written to the log first, and the client is acknowledged only after the log write succeeds; the server is then free to batch or aggregate the in-memory data as needed. When disaster strikes, the WAL is the lifeline for recovering data. Much like MySQL's binary log, the WAL records every change to the data, so that if the server crashes the log can simply be replayed.
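The log-then-apply discipline described above can be sketched in a few lines of plain Java (all names here are illustrative, not HBase classes; the list stands in for a durable log file):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WalSketch {
    final List<String[]> wal = new ArrayList<>();          // durable log (simulated)
    final Map<String, String> memstore = new HashMap<>();  // volatile in-memory store

    public void put(String key, String value) {
        wal.add(new String[]{key, value}); // 1) append to the log first
        memstore.put(key, value);          // 2) only then apply in memory
    }

    // After a crash wipes the memstore, replaying the WAL rebuilds it.
    public Map<String, String> replay() {
        Map<String, String> rebuilt = new HashMap<>();
        for (String[] e : wal) rebuilt.put(e[0], e[1]);
        return rebuilt;
    }

    public static void main(String[] args) {
        WalSketch s = new WalSketch();
        s.put("row1", "Tom");
        s.put("row2", "Rose");
        s.memstore.clear();             // simulate a crash losing memory
        System.out.println(s.replay()); // data recovered from the log
    }
}
```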

3.1. Snapshots

We can create a snapshot of a table; the following creates snapshot sp1 of table t1:

  • hbase(main):007:0> snapshot 't1','sp1'
  • 0 row(s) in 0.4170 seconds
  • hbase(main):008:0> list_snapshots
  • SNAPSHOT TABLE + CREATION TIME
  • sp1 t1 (Mon Jul 03 09:03:18 -0700 2017)
  • 1 row(s) in 0.0420 seconds
  • => ["sp1"]

A snapshot is stored as files in HDFS:

  • ubuntu@s100:~$ hdfs dfs -ls -R /hbase/
  • drwxr-xr-x - ubuntu supergroup 0 2017-07-03 09:03 /hbase/.hbase-snapshot
  • drwxr-xr-x - ubuntu supergroup 0 2017-07-03 09:03 /hbase/.hbase-snapshot/.tmp
  • drwxr-xr-x - ubuntu supergroup 0 2017-07-03 09:03 /hbase/.hbase-snapshot/sp1
  • -rw-r--r-- 3 ubuntu supergroup 0 2017-07-03 09:03 /hbase/.hbase-snapshot/sp1/.inprogress
  • -rw-r--r-- 3 ubuntu supergroup 20 2017-07-03 09:03 /hbase/.hbase-snapshot/sp1/.snapshotinfo
  • -rw-r--r-- 3 ubuntu supergroup 455 2017-07-03 09:03 /hbase/.hbase-snapshot/sp1/data.manifest

4. VERSIONS

We can add a column family to a table and, at the same time, specify its number of versions with VERSIONS:

  • hbase(main):005:0> desc 't1'
  • Table t1 is ENABLED
  • t1
  • COLUMN FAMILIES DESCRIPTION
  • {NAME => 'f1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION =>
  • 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
  • 1 row(s) in 0.0790 seconds
  • hbase(main):006:0> alter 't1',{NAME=>'f3',VERSIONS=>5}
  • Updating all regions with the new schema...
  • 0/2 regions updated.
  • 2/2 regions updated.
  • Done.
  • 0 row(s) in 3.2150 seconds
  • hbase(main):007:0> desc 't1'
  • Table t1 is ENABLED
  • t1
  • COLUMN FAMILIES DESCRIPTION
  • {NAME => 'f1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION =>
  • 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
  • {NAME => 'f3', BLOOMFILTER => 'ROW', VERSIONS => '5', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER
  • ', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
  • 2 row(s) in 0.0200 seconds

In the session above we added a column family f3 to table t1 with VERSIONS set to 5, meaning the family keeps at most 5 versions of each cell. Next, let's put several values with the same row key and column name into f3:

  • hbase(main):008:0> put 't1','row1','f3:name','Tom1'
  • 0 row(s) in 0.1010 seconds
  • hbase(main):009:0> put 't1','row1','f3:name','Tom2'
  • 0 row(s) in 0.0110 seconds
  • hbase(main):001:0> put 't1','row1','f3:name','Tom3'
  • 0 row(s) in 0.3860 seconds
  • hbase(main):002:0> put 't1','row1','f3:name','Tom4'
  • 0 row(s) in 0.0180 seconds
  • hbase(main):003:0> put 't1','row1','f3:name','Tom5'
  • 0 row(s) in 0.0170 seconds
  • hbase(main):004:0> scan 't1'
  • ROW COLUMN+CELL
  • row1 column=f1:age, timestamp=1499004214424, value=\x00\x00\x00\x16
  • row1 column=f1:name, timestamp=1499004214424, value=Tom
  • row1 column=f3:name, timestamp=1499172923583, value=Tom5
  • row1000 column=f1:name, timestamp=1499090917690, value=Rose
  • row2 column=f1:age, timestamp=1499004601836, value=\x00\x00\x00\x16
  • row2 column=f1:id, timestamp=1499004601836, value=\x00\x00\x00\x01
  • row2 column=f1:name, timestamp=1499004601836, value=Tom
  • row2000 column=f1:age, timestamp=1499090946170, value=22
  • row2500 column=f1:name, timestamp=1499091123503, value=John
  • row5000 column=f1:name, timestamp=1499090978413, value=Jack
  • 6 row(s) in 0.0890 seconds

Scanning the table shows only the latest version of the values we just put. The get command can specify how many versions to display:

  • hbase(main):005:0> get 't1','row1',{COLUMN=>'f3',VERSIONS=>4}
  • COLUMN CELL
  • f3:name timestamp=1499172923583, value=Tom5
  • f3:name timestamp=1499172914141, value=Tom4
  • f3:name timestamp=1499172908407, value=Tom3
  • f3:name timestamp=1499172863815, value=Tom2
  • 4 row(s) in 0.0450 seconds

When the requested VERSIONS is greater than or equal to the VERSIONS defined on the column family, all stored versions are returned:

  • hbase(main):006:0> get 't1','row1',{COLUMN=>'f3',VERSIONS=>7}
  • COLUMN CELL
  • f3:name timestamp=1499172923583, value=Tom5
  • f3:name timestamp=1499172914141, value=Tom4
  • f3:name timestamp=1499172908407, value=Tom3
  • f3:name timestamp=1499172863815, value=Tom2
  • f3:name timestamp=1499172858502, value=Tom1
  • 5 row(s) in 0.0160 seconds

We can also query a version by its timestamp, for example:

  • hbase(main):007:0> get 't1','row1',{COLUMN=>'f3',TIMESTAMP=>1499172908407}
  • COLUMN CELL
  • f3:name timestamp=1499172908407, value=Tom3
  • 1 row(s) in 0.0050 seconds

Or query the versions within a timestamp range:

  • hbase(main):012:0> get 't1','row1',{COLUMN=>'f3',TIMERANGE=>[1499172908408,1499172914143]}
  • COLUMN CELL
  • f3:name timestamp=1499172914141, value=Tom4
  • 1 row(s) in 0.0050 seconds
  • hbase(main):015:0> get 't1','row1',{COLUMN=>'f3',TIMERANGE=>[1499172908408,1499172999999],VERSIONS=>7}
  • COLUMN CELL
  • f3:name timestamp=1499172923583, value=Tom5
  • f3:name timestamp=1499172914141, value=Tom4
  • 2 row(s) in 0.0100 seconds

Versions can also be deleted by timestamp:

  • hbase(main):017:0> get 't1','row1',{COLUMN=>'f3',VERSIONS=>7}
  • COLUMN CELL
  • f3:name timestamp=1499172923583, value=Tom5
  • f3:name timestamp=1499172914141, value=Tom4
  • f3:name timestamp=1499172908407, value=Tom3
  • f3:name timestamp=1499172863815, value=Tom2
  • f3:name timestamp=1499172858502, value=Tom1
  • 5 row(s) in 0.0180 seconds
  • hbase(main):019:0> delete 't1','row1','f3:name',1499172863815
  • 0 row(s) in 0.0260 seconds
  • hbase(main):020:0> get 't1','row1',{COLUMN=>'f3',VERSIONS=>7}
  • COLUMN CELL
  • f3:name timestamp=1499172923583, value=Tom5
  • f3:name timestamp=1499172914141, value=Tom4
  • f3:name timestamp=1499172908407, value=Tom3
  • 3 row(s) in 0.0100 seconds

Notice that deleting the version at a given timestamp also removes every older version: HBase writes a delete marker at that timestamp, which masks all versions at or before it. The deleted data can still be seen, however, by scanning with RAW set to true:

  • hbase(main):021:0> scan 't1',{RAW=>true,VERSIONS=>7}
  • ROW COLUMN+CELL
  • row1 column=f1:age, timestamp=1499004214424, value=\x00\x00\x00\x16
  • row1 column=f1:name, timestamp=1499004214424, value=Tom
  • row1 column=f3:name, timestamp=1499172923583, value=Tom5
  • row1 column=f3:name, timestamp=1499172914141, value=Tom4
  • row1 column=f3:name, timestamp=1499172908407, value=Tom3
  • row1 column=f3:name, timestamp=1499172863815, type=DeleteColumn
  • row1 column=f3:name, timestamp=1499172863815, value=Tom2
  • row1 column=f3:name, timestamp=1499172858502, value=Tom1
  • row1000 column=f1:name, timestamp=1499090917690, value=Rose
  • row2 column=f1:age, timestamp=1499004601836, value=\x00\x00\x00\x16
  • row2 column=f1:id, timestamp=1499004601836, value=\x00\x00\x00\x01
  • row2 column=f1:name, timestamp=1499004601836, value=Tom
  • row2000 column=f1:age, timestamp=1499090946170, value=22
  • row2500 column=f1:name, timestamp=1499091123503, value=John
  • row5000 column=f1:name, timestamp=1499090978413, value=Jack
  • 6 row(s) in 0.0760 seconds
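The masking behavior visible in the raw scan above can be sketched in plain Java (illustrative, not HBase code): a DeleteColumn marker at timestamp T hides every version with timestamp <= T, which is why Tom2 and Tom1 disappear from normal reads while the raw scan still shows them alongside the marker.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class TombstoneSketch {
    // Return the versions still visible after a delete marker at deleteTs,
    // newest first, mirroring HBase's ordering.
    public static List<String> visible(TreeMap<Long, String> versions, long deleteTs) {
        return versions.descendingMap().entrySet().stream()
                .filter(e -> e.getKey() > deleteTs) // versions at/before the marker are masked
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TreeMap<Long, String> v = new TreeMap<>();
        v.put(1499172858502L, "Tom1");
        v.put(1499172863815L, "Tom2");
        v.put(1499172908407L, "Tom3");
        v.put(1499172914141L, "Tom4");
        v.put(1499172923583L, "Tom5");
        // delete 't1','row1','f3:name',1499172863815
        System.out.println(visible(v, 1499172863815L)); // [Tom5, Tom4, Tom3]
    }
}
```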

4.1. VERSIONS Java API

Similarly, we can use the Java API to fetch the versioned data of a given row and column:

  • /**
  • * Fetch data with Get
  • *
  • * @throws Exception
  • */
  • @Test
  • public void get() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // get the table
  • Table table = connection.getTable(TableName.valueOf("t1"));
  • Get get = new Get(Bytes.toBytes("row1"));
  • get.addColumn(Bytes.toBytes("f3"), Bytes.toBytes("name"));
  • get.setMaxVersions();
  • get.setTimeRange(1499172908407L, 1499172923583L);
  • Result result = table.get(get);
  • List<Cell> columnCells = result.getColumnCells(Bytes.toBytes("f3"), Bytes.toBytes("name"));
  • for (Cell cell : columnCells) {
  • System.out.println(Bytes.toString(cell.getValue()));
  • }
  • System.out.println("get table over");
  • }

Running the code above prints the following:

  • Tom4
  • Tom3
  • get table over

We can also scan a row's column data with a RAW scan:

Note: a RAW scan cannot restrict itself to individual columns (only to column families).

  • /**
  • * Scan data
  • *
  • * @throws Exception
  • */
  • @Test
  • public void rawScan() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // get the table
  • Table table = connection.getTable(TableName.valueOf("t1"));
  • // configure the scan
  • Scan scan = new Scan();
  • scan.addFamily(Bytes.toBytes("f3"));
  • scan.setRaw(true);
  • scan.setTimeRange(1499172858502L, 1499172923583L);
  • scan.setMaxVersions();
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(Bytes.toBytes("f3"));
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(Bytes.toBytes("f3"), entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = null;
  • if ("id".equals(new String(qualifier)) || "age".equals(new String(qualifier))) {
  • valueV = new Integer(Bytes.toInt(value));
  • } else {
  • valueV = new String(value);
  • }
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + ", value:" + valueV + ", timestamp:" + timestamp);
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

Running the code prints the following:

  • row:row1, family:f3, qualifier:name, value:Tom4, timestamp:1499172914141
  • row:row1, family:f3, qualifier:name, value:Tom3, timestamp:1499172908407
  • row:row1, family:f3, qualifier:name, value:, timestamp:1499172863815
  • row:row1, family:f3, qualifier:name, value:Tom2, timestamp:1499172863815
  • row:row1, family:f3, qualifier:name, value:Tom1, timestamp:1499172858502
  • scan table over

5. Batch Processing

Data can be put in batches, as in the following Java code:

  • /**
  • * Batch-insert data
  • *
  • * @throws Exception
  • */
  • @Test
  • public void putBatch() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // get the table
  • Table table = connection.getTable(TableName.valueOf("t1"));
  • Long start = System.currentTimeMillis();
  • // build the rows to insert in batch
  • ArrayList<Put> puts = new ArrayList<Put>();
  • for (int i = 0; i < 1000000; i++) {
  • // create the data to put
  • Put put = new Put(Bytes.toBytes("row" + i));
  • put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"), Bytes.toBytes("Tom" + i));
  • puts.add(put);
  • }
  • // put the data and close the table
  • table.put(puts);
  • table.close();
  • System.out.println(System.currentTimeMillis() - start);
  • System.out.println("putBatch table over");
  • }

After running the code, we can verify the row count with the count command in the HBase shell:

  • hbase(main):002:0> count 't1', INTERVAL=>100000
  • Current count: 100000, row: row189998
  • Current count: 200000, row: row279998
  • Current count: 300000, row: row369998
  • Current count: 400000, row: row459998
  • Current count: 500000, row: row549998
  • Current count: 600000, row: row639998
  • Current count: 700000, row: row729998
  • Current count: 800000, row: row819998
  • Current count: 900000, row: row909998
  • Current count: 1000000, row: row999999
  • 1000000 row(s) in 31.4940 seconds

Note: INTERVAL sets how many rows the count command processes between progress reports.

5.1. Row-Level Query Optimization with Cache

When scanning large amounts of data, we can set the scan's caching value. Each call to getScanner() applies the configured value to the scanner instance; it controls how many rows each RPC call fetches, which can significantly reduce total scan time:

  • /**
  • * Scan data
  • *
  • * @throws Exception
  • */
  • @Test
  • public void scanWithCache() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // get the table
  • Table table = connection.getTable(TableName.valueOf("t1"));
  • // configure the scan
  • Scan scan = new Scan();
  • scan.setStartRow(Bytes.toBytes("row1"));
  • scan.setStopRow(Bytes.toBytes("row999999"));
  • scan.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"));
  • scan.setCaching(50000);
  • Long start = System.currentTimeMillis();
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(Bytes.toBytes("f1"));
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(Bytes.toBytes("f1"), entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = null;
  • if ("id".equals(new String(qualifier)) || "age".equals(new String(qualifier))) {
  • valueV = new Integer(Bytes.toInt(value));
  • } else {
  • valueV = new String(value);
  • }
  • long timestamp = cell.getTimestamp();
  • // System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • // + new String(qualifier) + ", value:" + valueV + ", timestamp:" + timestamp);
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println(System.currentTimeMillis() - start);
  • System.out.println("scan table over");
  • }

The scan.setCaching(50000); call above enables caching for this scan only; the cluster-wide default is controlled by hbase.client.scanner.caching in hbase-site.xml.
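The effect of caching on round trips can be estimated with simple arithmetic. A rough model in plain Java (an approximation that assumes each RPC returns a full batch of rows; the real client also enforces result-size limits, so actual counts may differ):

```java
public class CachingMath {
    // Number of RPC calls needed to scan `rows` rows when each RPC
    // fetches up to `caching` rows (ceiling division).
    public static long rpcCount(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        // one row per RPC vs. 50,000 rows per RPC over 1,000,000 rows
        System.out.println(rpcCount(1000000, 1));
        System.out.println(rpcCount(1000000, 50000));
    }
}
```

With caching at 50000, the million-row scan above needs on the order of 20 round trips instead of a million, which is where the time saving comes from.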

5.2. Column-Level Query Optimization with Batch

Caching is a row-oriented optimization that reduces the number of RPC calls a scan makes. HBase also provides a column-oriented optimization, Batch: when Batch is set, each Result handed back contains at most the specified number of columns. For example, if a column family has 10 columns and Batch is set to 3, the row comes back in chunks of 3, 3, 3, and 1 columns.
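The 3, 3, 3, 1 chunking just described is plain ceiling-style partitioning; a small sketch in plain Java (illustrative, not HBase code):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchChunks {
    // Sizes of the Results a row of `columns` cells is chunked into
    // when the scan's batch is set to `batch`.
    public static List<Integer> chunkSizes(int columns, int batch) {
        List<Integer> sizes = new ArrayList<>();
        for (int left = columns; left > 0; left -= batch) {
            sizes.add(Math.min(batch, left));
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(chunkSizes(10, 3)); // [3, 3, 3, 1]
    }
}
```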

6. Combining Cache and Batch

First, let's prepare a table with two column families, three columns per family, and ten rows of data. The code to prepare the test data is as follows:

  • Create the table:
  • /**
  • * Create the test table for the Cache and Batch experiments
  • *
  • * @throws Exception
  • */
  • @Test
  • public void createCacheAndBatchTestTable() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // get the Admin
  • Admin admin = connection.getAdmin();
  • TableName tabName = TableName.valueOf("t4");
  • // create the table descriptor
  • HTableDescriptor tabd = new HTableDescriptor(tabName);
  • // create the column family descriptors
  • HColumnDescriptor cld1 = new HColumnDescriptor("f1");
  • HColumnDescriptor cld2 = new HColumnDescriptor("f2");
  • tabd.addFamily(cld1);
  • tabd.addFamily(cld2);
  • admin.createTable(tabd);
  • System.out.println("create table over");
  • }

The structure of the resulting table is:

  • hbase(main):003:0> desc 't4'
  • Table t4 is ENABLED
  • t4
  • COLUMN FAMILIES DESCRIPTION
  • {NAME => 'f1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION =>
  • 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
  • {NAME => 'f2', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION =>
  • 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
  • 2 row(s) in 0.2480 seconds
  • Test code to insert the test data into the table:
  • /**
  • * Insert test data into the Cache and Batch test table
  • *
  • * @throws Exception
  • */
  • @Test
  • public void putCacheAndBatchTestData() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // get the table
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // build the rows to insert in batch
  • ArrayList<Put> puts = new ArrayList<Put>();
  • for (int i = 0; i < 10; i++) {
  • // create the data to put
  • Put put = new Put(Bytes.toBytes("row" + i));
  • put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("id"), Bytes.toBytes("id1-" + i));
  • put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("name"), Bytes.toBytes("name1-" + i));
  • put.addColumn(Bytes.toBytes("f1"), Bytes.toBytes("age"), Bytes.toBytes("age1-" + i));
  • put.addColumn(Bytes.toBytes("f2"), Bytes.toBytes("id"), Bytes.toBytes("id2-" + i));
  • put.addColumn(Bytes.toBytes("f2"), Bytes.toBytes("name"), Bytes.toBytes("name2-" + i));
  • put.addColumn(Bytes.toBytes("f2"), Bytes.toBytes("age"), Bytes.toBytes("age2-" + i));
  • puts.add(put);
  • }
  • // put the data and close the table
  • table.put(puts);
  • table.close();
  • System.out.println("putBatch table over");
  • }

The inserted test data looks like this:

  • hbase(main):004:0> scan 't4'
  • ROW COLUMN+CELL
  • row0 column=f1:age, timestamp=1499184025037, value=age1-0
  • row0 column=f1:id, timestamp=1499184025037, value=id1-0
  • row0 column=f1:name, timestamp=1499184025037, value=name1-0
  • row0 column=f2:age, timestamp=1499184025037, value=age2-0
  • row0 column=f2:id, timestamp=1499184025037, value=id2-0
  • row0 column=f2:name, timestamp=1499184025037, value=name2-0
  • row1 column=f1:age, timestamp=1499184025037, value=age1-1
  • row1 column=f1:id, timestamp=1499184025037, value=id1-1
  • row1 column=f1:name, timestamp=1499184025037, value=name1-1
  • row1 column=f2:age, timestamp=1499184025037, value=age2-1
  • row1 column=f2:id, timestamp=1499184025037, value=id2-1
  • row1 column=f2:name, timestamp=1499184025037, value=name2-1
  • row2 column=f1:age, timestamp=1499184025037, value=age1-2
  • row2 column=f1:id, timestamp=1499184025037, value=id1-2
  • row2 column=f1:name, timestamp=1499184025037, value=name1-2
  • row2 column=f2:age, timestamp=1499184025037, value=age2-2
  • row2 column=f2:id, timestamp=1499184025037, value=id2-2
  • row2 column=f2:name, timestamp=1499184025037, value=name2-2
  • row3 column=f1:age, timestamp=1499184025037, value=age1-3
  • row3 column=f1:id, timestamp=1499184025037, value=id1-3
  • row3 column=f1:name, timestamp=1499184025037, value=name1-3
  • row3 column=f2:age, timestamp=1499184025037, value=age2-3
  • row3 column=f2:id, timestamp=1499184025037, value=id2-3
  • row3 column=f2:name, timestamp=1499184025037, value=name2-3
  • row4 column=f1:age, timestamp=1499184025037, value=age1-4
  • row4 column=f1:id, timestamp=1499184025037, value=id1-4
  • row4 column=f1:name, timestamp=1499184025037, value=name1-4
  • row4 column=f2:age, timestamp=1499184025037, value=age2-4
  • row4 column=f2:id, timestamp=1499184025037, value=id2-4
  • row4 column=f2:name, timestamp=1499184025037, value=name2-4
  • 10 row(s) in 0.2050 seconds

The test code that scans with both Cache and Batch set is as follows:

  • /**
  • * Scan data
  • *
  • * @throws Exception
  • */
  • @Test
  • public void cacheAndBatchScan() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // get the table
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // configure the scan
  • Scan scan = new Scan();
  • scan.setCaching(5);
  • scan.setBatch(3);
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • Integer counter = 1;
  • while (iterator.hasNext()) {
  • System.out.println("***** Round " + counter + " *****");
  • counter++;
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // the set of all column families in this Result
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // iterate over the data of each column family
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

With Cache set to 5 and Batch set to 3, the program prints the following:

  • ***** Round 1 *****
  • row:row0, family:f1, qualifier:age , value:age1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • ***** Round 2 *****
  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:age , value:age2-0 , timestamp:1499259417717
  • ***** Round 3 *****
  • row:row0, family:f2, qualifier:id , value:id2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • ***** Round 4 *****
  • row:row1, family:f1, qualifier:age , value:age1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:id , value:id1-1 , timestamp:1499259417717
  • ***** Round 5 *****
  • row:row1, family:f1, qualifier:name , value:name1-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:age , value:age2-1 , timestamp:1499259417717
  • ***** Round 6 *****
  • row:row1, family:f2, qualifier:id , value:id2-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:name , value:name2-1 , timestamp:1499259417717
  • ***** Round 7 *****
  • row:row2, family:f1, qualifier:age , value:age1-2 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:id , value:id1-2 , timestamp:1499259417717
  • ***** Round 8 *****
  • row:row2, family:f1, qualifier:name , value:name1-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:age , value:age2-2 , timestamp:1499259417717
  • ***** Round 9 *****
  • row:row2, family:f2, qualifier:id , value:id2-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:name , value:name2-2 , timestamp:1499259417717
  • ***** Round 10 *****
  • row:row3, family:f1, qualifier:age , value:age1-3 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:id , value:id1-3 , timestamp:1499259417717
  • ***** Round 11 *****
  • row:row3, family:f1, qualifier:name , value:name1-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:age , value:age2-3 , timestamp:1499259417717
  • ***** Round 12 *****
  • row:row3, family:f2, qualifier:id , value:id2-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:name , value:name2-3 , timestamp:1499259417717
  • ***** Round 13 *****
  • row:row4, family:f1, qualifier:age , value:age1-4 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:id , value:id1-4 , timestamp:1499259417717
  • ***** Round 14 *****
  • row:row4, family:f1, qualifier:name , value:name1-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:age , value:age2-4 , timestamp:1499259417717
  • ***** Round 15 *****
  • row:row4, family:f2, qualifier:id , value:id2-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:name , value:name2-4 , timestamp:1499259417717
  • scan table over

The scan proceeds according to the following logic:

10.cache=5,batch=3扫描情况.png (scan behavior with cache=5, batch=3)

With cache=6 and Batch=4, the output is:

  • ***** Round 1 *****
  • row:row0, family:f1, qualifier:age , value:age1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:age , value:age2-0 , timestamp:1499259417717
  • ***** Round 2 *****
  • row:row0, family:f2, qualifier:id , value:id2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • ***** Round 3 *****
  • row:row1, family:f1, qualifier:age , value:age1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:id , value:id1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:name , value:name1-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:age , value:age2-1 , timestamp:1499259417717
  • ***** Round 4 *****
  • row:row1, family:f2, qualifier:id , value:id2-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:name , value:name2-1 , timestamp:1499259417717
  • ***** Round 5 *****
  • row:row2, family:f1, qualifier:age , value:age1-2 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:id , value:id1-2 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:name , value:name1-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:age , value:age2-2 , timestamp:1499259417717
  • ***** Round 6 *****
  • row:row2, family:f2, qualifier:id , value:id2-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:name , value:name2-2 , timestamp:1499259417717
  • ***** Round 7 *****
  • row:row3, family:f1, qualifier:age , value:age1-3 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:id , value:id1-3 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:name , value:name1-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:age , value:age2-3 , timestamp:1499259417717
  • ***** Round 8 *****
  • row:row3, family:f2, qualifier:id , value:id2-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:name , value:name2-3 , timestamp:1499259417717
  • ***** Round 9 *****
  • row:row4, family:f1, qualifier:age , value:age1-4 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:id , value:id1-4 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:name , value:name1-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:age , value:age2-4 , timestamp:1499259417717
  • ***** Round 10 *****
  • row:row4, family:f2, qualifier:id , value:id2-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:name , value:name2-4 , timestamp:1499259417717
  • scan table over

Its scan logic is:

10.cache=6,batch=4扫描情况.png (scan behavior with cache=6, batch=4)
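The cache=6/batch=4 run can also be worked out numerically. A rough sketch in plain Java (an approximation, not the client's exact accounting; it assumes the five 6-cell rows shown in the output above):

```java
public class CacheBatchMath {
    // Results a row is chunked into: ceil(cells / batch).
    public static int resultsPerRow(int cells, int batch) {
        return (cells + batch - 1) / batch;
    }

    // RPCs needed: caching counts Results (not rows) per round trip.
    public static int rpcCount(int rows, int cells, int batch, int caching) {
        int totalResults = rows * resultsPerRow(cells, batch);
        return (totalResults + caching - 1) / caching;
    }

    public static void main(String[] args) {
        // batch=4 over 6-cell rows: 2 Results per row (4 cells, then 2),
        // matching the ten Result iterations printed above for five rows
        System.out.println(resultsPerRow(6, 4));
        System.out.println(rpcCount(5, 6, 4, 6));
    }
}
```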

With cache=6 and batch=5, the output is:

  • ***** Round 1 *****
  • row:row0, family:f1, qualifier:age , value:age1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:age , value:age2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:id , value:id2-0 , timestamp:1499259417717
  • ***** 第2次 *****
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • ***** 第3次 *****
  • row:row1, family:f1, qualifier:age , value:age1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:id , value:id1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:name , value:name1-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:age , value:age2-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:id , value:id2-1 , timestamp:1499259417717
  • ***** 第4次 *****
  • row:row1, family:f2, qualifier:name , value:name2-1 , timestamp:1499259417717
  • ***** 第5次 *****
  • row:row2, family:f1, qualifier:age , value:age1-2 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:id , value:id1-2 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:name , value:name1-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:age , value:age2-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:id , value:id2-2 , timestamp:1499259417717
  • ***** 第6次 *****
  • row:row2, family:f2, qualifier:name , value:name2-2 , timestamp:1499259417717
  • ***** 第7次 *****
  • row:row3, family:f1, qualifier:age , value:age1-3 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:id , value:id1-3 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:name , value:name1-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:age , value:age2-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:id , value:id2-3 , timestamp:1499259417717
  • ***** 第8次 *****
  • row:row3, family:f2, qualifier:name , value:name2-3 , timestamp:1499259417717
  • ***** 第9次 *****
  • row:row4, family:f1, qualifier:age , value:age1-4 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:id , value:id1-4 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:name , value:name1-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:age , value:age2-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:id , value:id2-4 , timestamp:1499259417717
  • ***** 第10次 *****
  • row:row4, family:f2, qualifier:name , value:name2-4 , timestamp:1499259417717
  • scan table over

其扫描逻辑为:

10.cache=6,batch=5扫描情况.png
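上面两组打印中的"第N次"数量可以用简单的整数运算来推导(纯 Java 算术示意,并非 HBase 源码;假设每行单元格数相同且不考虑跨 Region 等实现细节):每行 6 个单元格(f1、f2 各 3 列),batch 把一行拆成 ceil(6/batch) 个 Result,5 行共 10 个 Result,正好对应打印中的 10 "次";cache 只决定一次取回多少个 Result,不改变 Result 总数。

```java
// 估算 Scanner 在 batch 参数下产生的 Result 数(示意推导,非 HBase 实现)
public class ScanBatchMath {
    // 每行被 batch 拆成多少个 Result(向上取整)
    public static int resultsPerRow(int cellsPerRow, int batch) {
        return (cellsPerRow + batch - 1) / batch;
    }
    // 全表共产生多少个 Result,即打印中"第N次"的总数
    public static int totalResults(int rows, int cellsPerRow, int batch) {
        return rows * resultsPerRow(cellsPerRow, batch);
    }
    public static void main(String[] args) {
        // 5 行、每行 6 个单元格:batch=4 和 batch=5 时每行都拆成 2 个 Result
        System.out.println(totalResults(5, 6, 4)); // 10
        System.out.println(totalResults(5, 6, 5)); // 10
    }
}
```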

7. 过滤器

HBase中的过滤器可以实现类似SQL中WHERE子句的过滤功能;在了解具体过滤器前,我们需要先了解CompareFilter中定义的比较运算符CompareOp:

操作 描述
LESS 匹配小于设定值的值
LESS_OR_EQUAL 匹配小于或等于设定值的值
EQUAL 匹配等于设定值的值
NOT_EQUAL 匹配与设定值不相等的值
GREATER_OR_EQUAL 匹配大于或等于设定值的值
GREATER 匹配大于设定值的值
NO_OP 排除一切值

这些运算符可以在使用过滤器时决定数据根据何种方式进行过滤。
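这些运算符的判定规则可以用纯 Java 做一个简单模拟(示意代码,不是 HBase 源码):过滤器先用比较器得到 compareTo 结果的符号,再按运算符决定是否保留数据。例如用 LESS_OR_EQUAL 和阈值 "row1" 过滤行键 row0~row4,只会保留 row0 和 row1,与后文 7.1 节行过滤器的输出一致。

```java
import java.util.ArrayList;
import java.util.List;

// 模拟 CompareOp 的判定规则(纯 Java 示意,非 HBase 实现)
public class CompareOpDemo {
    public enum CompareOp { LESS, LESS_OR_EQUAL, EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, GREATER, NO_OP }

    // 依据 compareTo 的符号和运算符判断是否保留该值
    public static boolean keep(CompareOp op, String value, String threshold) {
        int c = value.compareTo(threshold);
        switch (op) {
            case LESS:             return c < 0;
            case LESS_OR_EQUAL:    return c <= 0;
            case EQUAL:            return c == 0;
            case NOT_EQUAL:        return c != 0;
            case GREATER_OR_EQUAL: return c >= 0;
            case GREATER:          return c > 0;
            default:               return false; // NO_OP:排除一切值
        }
    }

    public static void main(String[] args) {
        List<String> kept = new ArrayList<>();
        for (String row : new String[]{"row0", "row1", "row2", "row3", "row4"}) {
            if (keep(CompareOp.LESS_OR_EQUAL, row, "row1")) kept.add(row);
        }
        System.out.println(kept); // [row0, row1]
    }
}
```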

另外,CompareFilter在使用时还需要传入比较器,HBase提供了下列比较器:

比较器 描述
BinaryComparator 使用Bytes.compareTo()比较当前值与阈值
BinaryPrefixComparator 与BinaryComparator相似,但是是从左端开始前缀匹配
NullComparator 不做匹配,只判断当前值是不是null
BitComparator 通过BitwiseOp提供的按位与(AND)、或(OR)、异或(XOR)操作执行位级比较
RegexStringComparator 根据一个正则表达式,在实例化这个比较器的时候去匹配表中的数据
SubstringComparator 把阈值和表中数据当作String实例,通过contains()操作匹配子串
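几种常用比较器的匹配语义可以用 JDK 自带工具做近似演示(纯 Java 示意,并非 HBase 实现;其中 BinaryComparator 的字节字典序用 Arrays.compare 近似 Bytes.compareTo()):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Pattern;

// 用 JDK 工具近似几种比较器的匹配语义(示意)
public class ComparatorDemo {
    // BinaryComparator:按字节字典序比较
    public static int binaryCompare(String a, String b) {
        return Arrays.compare(a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));
    }
    // BinaryPrefixComparator:只比较阈值长度的前缀
    public static boolean prefixEquals(String value, String prefix) {
        return value.startsWith(prefix);
    }
    // SubstringComparator:contains() 匹配
    public static boolean substringMatch(String value, String sub) {
        return value.contains(sub);
    }
    // RegexStringComparator:正则匹配
    public static boolean regexMatch(String value, String regex) {
        return Pattern.matches(regex, value);
    }
    public static void main(String[] args) {
        System.out.println(binaryCompare("row0", "row1") < 0); // true
        System.out.println(prefixEquals("row2500", "row2"));   // true
        System.out.println(substringMatch("row3", "w3"));      // true
        System.out.println(regexMatch("row2", ".*[02]"));      // true
    }
}
```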

7.1. 行过滤器

行过滤器基于行键来过滤数据;下面演示基于行键的全值匹配、正则匹配和子串匹配三种过滤。

  1. 普通行过滤器
  • /**
  • * 行过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void rowFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL, new BinaryComparator(Bytes.toBytes("row1"))));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行上述代码,打印内容如下:

  • row:row0, family:f1, qualifier:age , value:age1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:age , value:age2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:id , value:id2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:age , value:age1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:id , value:id1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:name , value:name1-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:age , value:age2-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:id , value:id2-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:name , value:name2-1 , timestamp:1499259417717
  • scan table over
  2. 正则表达式过滤器
  • /**
  • * 正则匹配过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void rowFilterByRegex() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator(".*[02]")));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行上述代码,打印内容如下:

  • row:row0, family:f1, qualifier:age , value:age1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:age , value:age2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:id , value:id2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:age , value:age1-2 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:id , value:id1-2 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:name , value:name1-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:age , value:age2-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:id , value:id2-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:name , value:name2-2 , timestamp:1499259417717
  • scan table over
  3. 子串匹配过滤器
  • /**
  • * 子串匹配行过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void rowFilterBySubstring() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL, new SubstringComparator("w3")));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行上述代码,打印内容如下:

  • row:row3, family:f1, qualifier:age , value:age1-3 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:id , value:id1-3 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:name , value:name1-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:age , value:age2-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:id , value:id2-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:name , value:name2-3 , timestamp:1499259417717
  • scan table over

7.2. 列族过滤器

列族过滤器基于列族比较来过滤数据。

  • /**
  • * 列族过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void familyFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(
  • new FamilyFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("f2"))));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

上述代码运行后打印:

  • row:row0, family:f2, qualifier:age , value:age2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:id , value:id2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:age , value:age2-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:id , value:id2-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:name , value:name2-1 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:age , value:age2-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:id , value:id2-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:name , value:name2-2 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:age , value:age2-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:id , value:id2-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:name , value:name2-3 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:age , value:age2-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:id , value:id2-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:name , value:name2-4 , timestamp:1499259417717
  • scan table over

7.3. 列名过滤器

列名过滤器用于筛选特定的列。

  • /**
  • * 列过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void qualifierFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(
  • new QualifierFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("name"))));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行上述代码打印如下内容:

  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:name , value:name1-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:name , value:name2-1 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:name , value:name1-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:name , value:name2-2 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:name , value:name1-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:name , value:name2-3 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:name , value:name1-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:name , value:name2-4 , timestamp:1499259417717
  • scan table over

7.4. 值过滤器

值过滤器用于筛选某个特定值的单元格。

  • /**
  • * 值过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void valueFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(
  • new ValueFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^name.*")));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行上述代码,打印内容如下:

  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:name , value:name1-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:name , value:name2-1 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:name , value:name1-2 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:name , value:name2-2 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:name , value:name1-3 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:name , value:name2-3 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:name , value:name1-4 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:name , value:name2-4 , timestamp:1499259417717
  • scan table over

7.5. 单列值过滤器

单列值过滤器根据某一列的值来决定一行数据是否被过滤,匹配成功时返回整行数据。

  1. 普通单列值过滤器
  • /**
  • * 单列值过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void singleFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(new SingleColumnValueFilter(Bytes.toBytes("f1"), Bytes.toBytes("name"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes("name1-0")));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行上述代码,打印内容如下:

  • row:row0, family:f1, qualifier:age , value:age1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:age , value:age2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:id , value:id2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • scan table over

注:上面代码打印的是 f1:name 列值为 name1-0 的那一整行的所有列的数据。

  2. 单列值排他过滤器
  • /**
  • * 单列值排他过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void singleExcludeFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(new SingleColumnValueExcludeFilter(Bytes.toBytes("f1"), Bytes.toBytes("name"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes("name1-0")));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行上述代码,打印内容如下:

  • row:row0, family:f1, qualifier:age , value:age1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:age , value:age2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:id , value:id2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • scan table over

注:上面代码打印的是 f1:name 列值为 name1-0 的那一行,但结果中排除了用于匹配的 f1:name 列本身。

7.6. 分页过滤器

分页过滤器用于对结果按行分页,在构建该过滤器时需要指定pageSize参数控制每页返回的行数。
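分页过滤器的"最多放行 pageSize 行"逻辑可以用纯 Java 做一个简单示意(非 HBase 实现):

```java
import java.util.Arrays;
import java.util.List;

// 模拟 PageFilter 按行截断:最多放行 pageSize 行(示意实现)
public class PageFilterDemo {
    public static List<String> page(List<String> sortedRows, int pageSize) {
        return sortedRows.subList(0, Math.min(pageSize, sortedRows.size()));
    }
    public static void main(String[] args) {
        List<String> rows = Arrays.asList("row0", "row1", "row2", "row3", "row4");
        System.out.println(page(rows, 2)); // [row0, row1]
    }
}
```

需要注意的是,实际的 PageFilter 在每个 Region 上独立计数,当表跨多个 Region 时客户端可能收到多于 pageSize 行,需要自行再截断;翻下一页通常通过 setStartRow 从上一页最后一行之后的行键继续扫描来实现。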

  • /**
  • * 分页过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void pageFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(new PageFilter(2));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行上述代码打印内容如下:

  • row:row0, family:f1, qualifier:age , value:age1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:age , value:age2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:id , value:id2-0 , timestamp:1499259417717
  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:age , value:age1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:id , value:id1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:name , value:name1-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:age , value:age2-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:id , value:id2-1 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:name , value:name2-1 , timestamp:1499259417717
  • scan table over

7.7. 行键过滤器

行键过滤器(KeyOnlyFilter)可以修改扫描出的单元格,使调用只返回键(行键、列族、列名、时间戳)而不返回值,内部通过convertToKeyOnly(boolean)方法实现;该过滤器的构造方法接收一个名为lenAsVal的布尔参数(默认为false),并传入convertToKeyOnly(boolean)方法来控制KeyValue实例中值的处理:为false时,值被替换为长度为0的字节数组;为true时,值被替换为原值长度(一个int)的字节表示。
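lenAsVal 两种取值对"值"部分的改写效果可以用纯 Java 示意如下(非 HBase 源码):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// 模拟 KeyOnlyFilter 对值部分的改写(示意实现):
// lenAsVal=false 时值替换为空字节数组;
// lenAsVal=true 时值替换为原值长度(一个 int)的字节表示
public class KeyOnlyDemo {
    public static byte[] convertValue(byte[] value, boolean lenAsVal) {
        if (!lenAsVal) return new byte[0];
        return ByteBuffer.allocate(4).putInt(value.length).array();
    }
    public static void main(String[] args) {
        byte[] v = "name1-0".getBytes(StandardCharsets.UTF_8);
        System.out.println(convertValue(v, false).length);                  // 0
        System.out.println(ByteBuffer.wrap(convertValue(v, true)).getInt()); // 7
    }
}
```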

  • /**
  • * 行键过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void keyonlyFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(new KeyOnlyFilter());
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

7.8. 列分页过滤器

列分页过滤器(ColumnPaginationFilter)与分页过滤器类似,但分页的对象是每一行中的列:构造方法ColumnPaginationFilter(limit, offset)表示对每行按列的排列顺序跳过前offset列后,最多返回limit列。
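limit/offset 的截取逻辑可以用纯 Java 示意(非 HBase 实现;实际计数会跨所有列族按顺序进行,下例用单个列族的 3 个列做演示):

```java
import java.util.Arrays;
import java.util.List;

// 模拟 ColumnPaginationFilter(limit, offset) 的按列分页(示意实现):
// 对每行已排序的列,跳过前 offset 个后最多返回 limit 个
public class ColumnPageDemo {
    public static List<String> pageColumns(List<String> sortedQualifiers, int limit, int offset) {
        int from = Math.min(offset, sortedQualifiers.size());
        int to = Math.min(from + limit, sortedQualifiers.size());
        return sortedQualifiers.subList(from, to);
    }
    public static void main(String[] args) {
        // 每行的列按字典序为 age, id, name;limit=1、offset=1 时命中 id,
        // 这也解释了后面示例输出中每行只返回 f1:id 列
        System.out.println(pageColumns(Arrays.asList("age", "id", "name"), 1, 1)); // [id]
    }
}
```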

  • /**
  • * 列分页过滤器
  • *
  • * @throws Exception
  • */
  • @Test
  • public void columnPageFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • scan.setFilter(new ColumnPaginationFilter(1, 1));
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行代码,打印内容如下:

  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:id , value:id1-1 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:id , value:id1-2 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:id , value:id1-3 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:id , value:id1-4 , timestamp:1499259417717
  • scan table over

7.9. 过滤器链

可以使用FilterList将多个过滤器串联成过滤器链,过滤器链有MUST_PASS_ALL和MUST_PASS_ONE两种模式,分别对应AND和OR两种逻辑:
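这两种模式的语义可以用 java.util.function.Predicate 的组合来示意(纯 Java 模拟,非 HBase 实现):

```java
import java.util.List;
import java.util.function.Predicate;

// 用 Predicate 组合模拟 FilterList 的两种模式(示意):
// MUST_PASS_ALL 相当于 AND,MUST_PASS_ONE 相当于 OR
public class FilterListDemo {
    public static <T> Predicate<T> mustPassAll(List<Predicate<T>> filters) {
        return filters.stream().reduce(x -> true, Predicate::and);
    }
    public static <T> Predicate<T> mustPassOne(List<Predicate<T>> filters) {
        return filters.stream().reduce(x -> false, Predicate::or);
    }
    public static void main(String[] args) {
        // 以 "列族:列名" 字符串模拟 FamilyFilter 与 QualifierFilter 的组合
        List<Predicate<String>> filters = List.of(
                s -> s.startsWith("f2:"),   // 列族等于 f2
                s -> s.endsWith(":name"));  // 列名等于 name
        Predicate<String> and = mustPassAll(filters);
        System.out.println(and.test("f2:name")); // true
        System.out.println(and.test("f1:name")); // false
    }
}
```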

  • /**
  • * 过滤器链
  • *
  • * @throws Exception
  • */
  • @Test
  • public void filterList() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // 获得表
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // 指定扫描的数据
  • Scan scan = new Scan();
  • FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
  • filterList.addFilter(new FamilyFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("f2"))));
  • filterList.addFilter(new QualifierFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("name"))));
  • scan.setFilter(filterList);
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // 所有的列族的集合
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // 循环每个列族的数据
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

运行上述代码打印内容如下:

  • row:row0, family:f2, qualifier:name , value:name2-0 , timestamp:1499259417717
  • row:row1, family:f2, qualifier:name , value:name2-1 , timestamp:1499259417717
  • row:row2, family:f2, qualifier:name , value:name2-2 , timestamp:1499259417717
  • row:row3, family:f2, qualifier:name , value:name2-3 , timestamp:1499259417717
  • row:row4, family:f2, qualifier:name , value:name2-4 , timestamp:1499259417717
  • scan table over
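Both filters above compare byte[] values with BinaryComparator, which orders byte arrays lexicographically as unsigned bytes; this is the same ordering HBase applies to row keys, families and qualifiers. The following is a minimal standalone model of that comparison (an illustration, not the HBase class itself):

```java
public class UnsignedLexCompare {
    // Compare two byte arrays lexicographically, treating each byte as unsigned 0..255.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF; // mask to get the unsigned byte value
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // A shorter array sorts before any longer array it is a prefix of: "name" < "name1".
        return a.length - b.length;
    }

    public static void main(String[] args) {
        System.out.println(compare("name".getBytes(), "name1".getBytes()) < 0); // true: prefix sorts first
        System.out.println(compare("age".getBytes(), "name1".getBytes()) < 0);  // true: 'a' < 'n'
    }
}
```

This prefix rule is why, in the composite example of the next section, a `LESS_OR_EQUAL` comparison against `name1` also accepts the qualifier `name`.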

8. Composite filters

For complex conditions such as (A and B) or (C and D), FilterList instances can themselves be combined into a filter chain:

  • /**
  • * Composite filter chain
  • *
  • * @throws Exception
  • */
  • @Test
  • public void compositeFilter() throws Exception {
  • Configuration conf = HBaseConfiguration.create();
  • Connection connection = ConnectionFactory.createConnection(conf);
  • // get the table
  • Table table = connection.getTable(TableName.valueOf("t4"));
  • // configure the scan
  • Scan scan = new Scan();
  • // left branch: family matches f1 AND qualifier <= name1
  • FilterList filterListLeft = new FilterList(FilterList.Operator.MUST_PASS_ALL);
  • filterListLeft.addFilter(new FamilyFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("f1")));
  • filterListLeft.addFilter(new QualifierFilter(CompareFilter.CompareOp.LESS_OR_EQUAL, new BinaryComparator(Bytes.toBytes("name1"))));
  • // right branch: family matches f2 AND qualifier equals name3
  • FilterList filterListRight = new FilterList(FilterList.Operator.MUST_PASS_ALL);
  • filterListRight.addFilter(new FamilyFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("f2")));
  • filterListRight.addFilter(new QualifierFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("name3"))));
  • // combine the two branches; MUST_PASS_ONE gives OR semantics
  • FilterList compositeFilter = new FilterList(FilterList.Operator.MUST_PASS_ONE, filterListLeft, filterListRight);
  • scan.setFilter(compositeFilter);
  • ResultScanner scanner = table.getScanner(scan);
  • Iterator<Result> iterator = scanner.iterator();
  • while (iterator.hasNext()) {
  • Result next = iterator.next();
  • NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> bigMap = next.getMap();
  • // the set of all column families in this row
  • Set<byte[]> columnsSet = bigMap.keySet();
  • // iterate over the data of each column family
  • for (byte[] columnName : columnsSet) {
  • NavigableMap<byte[], byte[]> familyMap = next.getFamilyMap(columnName);
  • for (java.util.Map.Entry<byte[], byte[]> entry : familyMap.entrySet()) {
  • List<Cell> columnCells = next.getColumnCells(columnName, entry.getKey());
  • for (Cell cell : columnCells) {
  • byte[] row = cell.getRow();
  • byte[] family = cell.getFamily();
  • byte[] qualifier = cell.getQualifier();
  • byte[] value = cell.getValue();
  • Object valueV = new String(value);
  • long timestamp = cell.getTimestamp();
  • System.out.println("row:" + new String(row) + ", family:" + new String(family) + ", qualifier:"
  • + new String(qualifier) + "\t, value:" + valueV + "\t, timestamp:" + timestamp);
  • }
  • }
  • }
  • }
  • scanner.close();
  • table.close();
  • System.out.println("scan table over");
  • }

Running the code above prints the following. Only f1 cells match: the left branch accepts every f1 qualifier (age, id and name all sort at or before name1 in byte order), while the right branch matches nothing because no cell has the qualifier name3:

  • row:row0, family:f1, qualifier:age , value:age1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:id , value:id1-0 , timestamp:1499259417717
  • row:row0, family:f1, qualifier:name , value:name1-0 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:age , value:age1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:id , value:id1-1 , timestamp:1499259417717
  • row:row1, family:f1, qualifier:name , value:name1-1 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:age , value:age1-2 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:id , value:id1-2 , timestamp:1499259417717
  • row:row2, family:f1, qualifier:name , value:name1-2 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:age , value:age1-3 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:id , value:id1-3 , timestamp:1499259417717
  • row:row3, family:f1, qualifier:name , value:name1-3 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:age , value:age1-4 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:id , value:id1-4 , timestamp:1499259417717
  • row:row4, family:f1, qualifier:name , value:name1-4 , timestamp:1499259417717
  • scan table over
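The nesting above mirrors how FilterList evaluates its members: MUST_PASS_ALL behaves like a logical AND and MUST_PASS_ONE like a logical OR, so (A and B) or (C and D) becomes two MUST_PASS_ALL lists wrapped in a MUST_PASS_ONE list. A plain-Java sketch of that evaluation model (the class and predicate names are hypothetical, for illustration only):

```java
import java.util.List;
import java.util.function.Predicate;

public class FilterListModel {
    // MUST_PASS_ALL: every member filter must accept the cell (logical AND).
    static <T> Predicate<T> mustPassAll(List<Predicate<T>> filters) {
        return t -> filters.stream().allMatch(f -> f.test(t));
    }

    // MUST_PASS_ONE: at least one member filter must accept the cell (logical OR).
    static <T> Predicate<T> mustPassOne(List<Predicate<T>> filters) {
        return t -> filters.stream().anyMatch(f -> f.test(t));
    }

    public static void main(String[] args) {
        // Model a cell as a "family:qualifier" string.
        Predicate<String> a = s -> s.startsWith("f1:"); // stand-in for the FamilyFilter on f1
        Predicate<String> b = s -> s.endsWith(":name"); // stand-in for a QualifierFilter on name
        Predicate<String> c = s -> s.startsWith("f2:"); // stand-in for the FamilyFilter on f2
        Predicate<String> d = s -> s.endsWith(":age");  // stand-in for a QualifierFilter on age

        // (A and B) or (C and D)
        Predicate<String> composite = mustPassOne(List.of(
                mustPassAll(List.of(a, b)),
                mustPassAll(List.of(c, d))));

        System.out.println(composite.test("f1:name")); // true: left branch passes
        System.out.println(composite.test("f2:age"));  // true: right branch passes
        System.out.println(composite.test("f2:name")); // false: neither branch passes
    }
}
```

Because each FilterList is itself a Filter, branches can be nested to arbitrary depth; the same pattern expresses any AND/OR combination of the primitive filters.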