Storing Data in Hive in Practice

1-03 Big Data magicwt 3,446 views

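The storage requirement is as follows: a large volume of article data is generated every day, each article record contains multiple fields such as title, content, URL, and publication time, and the data is never updated afterwards. Hive was therefore chosen as the data warehouse for this data. The steps and caveats of storing the data in Hive are described below.

1. Create the table

Create the external table toutiao_category with the statement below. An external table is used for flexibility in data storage: if an external table is later dropped, only the table metadata is deleted, not the table data, so the data can still be read and analyzed.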

CREATE EXTERNAL TABLE `toutiao_category`(                                      
  `toutiao_has_mp4_video` int,                                                 
  `toutiao_repin_count` int,                                                   
  `abstract` string,                                                           
  `article_ptime` int,                                                         
  `toutiao_recommend` int,                                                     
  `toutiao_article_type` int,                                                  
  `category` string,                                                           
  `docid` bigint,                                                              
  `bury_count` int,                                                            
  `title` string,                                                              
  `content` string,                                                            
  `source` string,                                                             
  `comment_count` int,                                                         
  `article_url` string,                                                        
  `toutiao_middle_mode` string,                                                
  `toutiao_datetime` string,                                                   
  `toutiao_aggr_type` int,                                                     
  `toutiao_article_sub_type` int,                                              
  `toutiao_external_visit_count` int,                                          
  `ctime` int,                                                                 
  `toutiao_favorite_count` int,                                                
  `toutiao_impression_count` int,                                              
  `toutiao_keywords` string,                                                   
  `digg_count` int,                                                            
  `toutiao_more_mode` string,                                                  
  `toutiao_go_detail_count` int,                                               
  `origin_url` string)                                                         
PARTITIONED BY (                                                               
  `date` string)                                                               
ROW FORMAT DELIMITED                                                           
  FIELDS TERMINATED BY '\t'                                                    
  LINES TERMINATED BY '\n'                                                     
STORED AS INPUTFORMAT                                                          
  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'                                   
OUTPUTFORMAT                                                                   
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'                 
LOCATION                                                                       
  'hdfs://heracles/user/mediadata/hive/warehouse/news/toutiao_category';

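In the CREATE TABLE statement:
1) "PARTITIONED BY (date string)" partitions the table by day, similar to splitting a table in a relational database. Each day's data is stored in its own directory, and when a query's conditions include the day, only the directories of the matching days are read, which speeds up the query;
2) "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'" means the stored data is split into rows on "\n" and each row is split into fields on "\t"; the number, order, and types of the resulting fields should match the column definitions in the CREATE TABLE statement;
3) "STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'": because the data is compressed with the LZO algorithm, the input format is set to "com.hadoop.mapred.DeprecatedLzoTextInputFormat";
4) "LOCATION 'hdfs://heracles/user/mediadata/hive/warehouse/news/toutiao_category'" is the root directory in HDFS where the table data is stored.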
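To make points 1) and 2) concrete, here is a minimal Python sketch. It is not Hive's actual implementation: it only splits raw text into rows and fields the way the ROW FORMAT clause describes, and builds the directory that a partition value maps to. The function names and the three-field sample are hypothetical.

```python
def parse_rows(raw_text, field_sep="\t", line_sep="\n"):
    """Split raw text into rows, then split each row into fields."""
    return [line.split(field_sep) for line in raw_text.split(line_sep) if line]


def partition_dir(table_root, partition_value, partition_col="date"):
    """Return the directory that holds the files of one partition."""
    return f"{table_root.rstrip('/')}/{partition_col}={partition_value}"


# Hypothetical three-field sample; the real table has 27 columns.
sample = "1\tsome title\t1451663999\n2\tanother title\t1451663998\n"
rows = parse_rows(sample)
root = "/user/mediadata/hive/warehouse/news/toutiao_category"

print(rows[0])
print(partition_dir(root, "20160101"))
```

A query that filters on the partition column only needs to read the files under the matching directory, which is where the speed-up described in point 1) comes from.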

2. Compress the data

Export one day's article data as one record per line with fields separated by "\t", then compress it with the lzop command:

lzop toutiao_category_20160101.txt -odata.lzo

Compared with the original file, the compressed data file is about 50% smaller, which saves storage space.
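Because the table treats every tab as a field boundary and every newline as a row boundary, a tab or line break inside a title or content value would shift or break the columns. The sketch below shows one way to write the export lines, assuming such characters can simply be replaced with spaces. The function names and the sample record are hypothetical, and the source does not say how the original export handled this case.

```python
def clean_field(value):
    """Render one value as text that contains no tab or line-break characters."""
    if value is None:
        return "\\N"  # Hive's default text representation of NULL
    text = str(value)
    for ch in ("\t", "\r", "\n"):
        text = text.replace(ch, " ")
    return text


def to_tsv_line(record, columns):
    """Serialize one record as a tab-separated line in table column order."""
    return "\t".join(clean_field(record.get(col)) for col in columns) + "\n"


# Hypothetical subset of the table's columns, in declaration order.
columns = ["docid", "title", "ctime"]
record = {"docid": 1, "title": "line one\nline\ttwo", "ctime": 1451663999}

line = to_tsv_line(record, columns)
print(repr(line))
```

The column list passed to to_tsv_line should follow the order of the CREATE TABLE statement, since Hive maps fields to columns by position.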

3. Load the data

Hive on the company cluster is accessed by connecting to Hive Server with BeeLine. Use BeeLine to load the data into the toutiao_category table:

/usr/lib/hive/bin/beeline -u "jdbc:hive2://xxx.xxx.xxx.xxx:xxx/mediadata_news;principal=xxx" -e "load data local inpath '/data/rsync_dir/news/data.lzo' overwrite into table toutiao_category partition (date='20160101');"

After it runs, the local file "/data/rsync_dir/news/data.lzo" is loaded into the "20160101" partition of the toutiao_category table:

/user/mediadata/hive/warehouse/news/toutiao_category/date=20160101/data.lzo
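When the load runs every day, the nested quoting in the BeeLine command line is easy to get wrong. The sketch below is one possible way to assemble the same command as an argument list in Python, so that no shell quoting is involved. The helper name build_load_command is hypothetical, and the masked JDBC URL is kept exactly as it appears above.

```python
def build_load_command(jdbc_url, local_path, table, date,
                       beeline="/usr/lib/hive/bin/beeline"):
    """Build the BeeLine argument list that loads one day's file into its partition."""
    if not (len(date) == 8 and date.isdigit()):
        raise ValueError("date must look like 20160101")
    sql = (f"load data local inpath '{local_path}' "
           f"overwrite into table {table} partition (date='{date}');")
    return [beeline, "-u", jdbc_url, "-e", sql]


cmd = build_load_command(
    "jdbc:hive2://xxx.xxx.xxx.xxx:xxx/mediadata_news;principal=xxx",
    "/data/rsync_dir/news/data.lzo",
    "toutiao_category",
    "20160101",
)
print(cmd)
# To actually run it on a machine with BeeLine installed:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run without a shell hands each argument to BeeLine unchanged, so the single quotes inside the SQL need no extra escaping.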

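4. Index the compressed data

When a Hive table is queried, Hive actually converts the SQL query into MapReduce jobs. Because the data.lzo file loaded into toutiao_category is in LZO format, a MapReduce job reading and analyzing that file is assigned only one Mapper task. To speed up queries, create an index for the LZO file so that a MapReduce job can assign multiple Mapper tasks to a single LZO file: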

hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar com.hadoop.compression.lzo.LzoIndexer /user/mediadata/hive/warehouse/news/toutiao_category/date=20160101/data.lzo

After it runs, an index file is added:

/user/mediadata/hive/warehouse/news/toutiao_category/date=20160101/data.lzo.index
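The effect of the index on parallelism can be illustrated with a rough estimate. The sketch below is a simplification, not how hadoop-lzo computes splits: real split boundaries are aligned to the LZO block offsets recorded in the index, and the split size depends on the cluster configuration. The file size and split size used here are hypothetical.

```python
import math


def estimate_map_tasks(file_size_bytes, split_size_bytes, has_index):
    """Rough estimate of the map tasks assigned to one .lzo file."""
    if file_size_bytes <= 0:
        return 0
    if not has_index:
        # Without an index the file is read as a single, unsplittable stream.
        return 1
    # With an index the file can be cut near every split-size boundary.
    return math.ceil(file_size_bytes / split_size_bytes)


MIB = 1024 * 1024
file_size = 1024 * MIB   # hypothetical 1 GiB data.lzo
split_size = 128 * MIB   # hypothetical split size

print(estimate_map_tasks(file_size, split_size, has_index=False))
print(estimate_map_tasks(file_size, split_size, has_index=True))
```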

5. Query the data

Run a SQL query with BeeLine:

0: jdbc:hive2://xxx.xxx.xxx.xxx:xxx/mediadata_n> select ctime from toutiao_category where date=20160101 limit 10;
+-------------+
|    ctime    |
+-------------+
| 1451663999  |
| 1451663998  |
| 1451663980  |
| 1451663967  |
| 1451663956  |
| 1451663956  |
| 1451663955  |
| 1451663945  |
| 1451663933  |
| 1451663933  |
+-------------+
10 rows selected (19.087 seconds)
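The returned ctime values can be sanity-checked against the partition they came from. The sketch below assumes that ctime is a Unix timestamp in seconds and that the daily partitions follow China Standard Time (UTC+8); neither is stated in the source, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: daily partitions follow UTC+8 (China Standard Time).
CST = timezone(timedelta(hours=8))


def partition_date(ctime, tz=CST):
    """Map a Unix timestamp in seconds to a yyyymmdd partition value."""
    return datetime.fromtimestamp(ctime, tz).strftime("%Y%m%d")


def local_time(ctime, tz=CST):
    """Format a Unix timestamp in seconds as local date and time."""
    return datetime.fromtimestamp(ctime, tz).strftime("%Y-%m-%d %H:%M:%S")


# First and last ctime values from the query result above.
for ts in (1451663999, 1451663933):
    print(ts, local_time(ts), partition_date(ts))
```

Under these assumptions the first value corresponds to 23:59:59 on 2016-01-01, which is consistent with the rows belonging to the 20160101 partition.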

Copyright: 我爱我家

Original URL: http://magicwt.com/2016/01/03/%e4%bd%bf%e7%94%a8hive%e5%ad%98%e5%82%a8%e6%95%b0%e6%8d%ae%e5%ae%9e%e8%b7%b5/

When reposting, the original source and this notice must be credited in the form of a link.