Hive Case Study: Detecting Orders with Abnormal Status Timing
Requirements
An order passes through five statuses: created, picked, shipped, received, and cancelled.
Find the abnormal orders: the gap between creation and picking must not exceed 2 hours, between creation and shipping must not exceed 4 hours, and between creation and receipt must not exceed 48 hours.
Topics Covered
1) external tables
2) using desc formatted
3) virtual columns
4) ALTER ... SET FILEFORMAT
5) using COALESCE and unix_timestamp
6) Parquet
Implementation
External Tables
The order-created table:
CREATE EXTERNAL TABLE order_created (orderNumber STRING, event_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
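Since the table is EXTERNAL, dropping it later removes only the metastore entry; the data files on HDFS are left in place. The table type can be confirmed with desc formatted (output abridged; the relevant line should read as below):
desc formatted order_created;
Table Type: EXTERNAL_TABLE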
Load data into the table:
load data local inpath '/home/spark/software/data/order_created.txt' overwrite into table order_created;
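For comparison: without the LOCAL keyword the path is interpreted as an HDFS path, and the file is moved (not copied) into the table's directory. A sketch with a hypothetical HDFS path:
load data inpath '/data/order_created.txt' overwrite into table order_created;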
Another way to import data: point the table at a file directory with LOCATION when creating it.
1) The directory specified by LOCATION already contains data files at creation time:
CREATE TABLE test_load1 (id STRING, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/student';
2) The directory specified by LOCATION contains no files yet at creation time:
CREATE TABLE test_load2 (id STRING, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/load';
Upload the data file into that directory:
hadoop fs -put /home/spark/software/data/student.txt /load/
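Because the table's LOCATION already points at /load, the uploaded rows should be queryable immediately, with no LOAD statement needed:
select * from test_load2;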
Query the data:
select * from order_created;
10703007267488  2014-05-01 06:01:12.334+01
10101043505096  2014-05-01 07:28:12.342+01
10103043509747  2014-05-01 07:50:12.33+01
10103043501575  2014-05-01 09:27:12.33+01
10104043514061  2014-05-01 09:03:12.324+01
Static Partitioned Tables
CREATE TABLE order_created_partition (orderNumber STRING, event_time STRING) PARTITIONED BY (event_month string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Loading data into the partitioned table, method 1:
load data local inpath '/home/spark/software/data/order_created.txt' overwrite into table order_created_partition PARTITION (event_month='2014-05');
Query the data:
select * from order_created_partition where event_month='2014-05';   -- no MapReduce job is launched
10703007267488  2014-05-01 06:01:12.334+01  2014-05
10101043505096  2014-05-01 07:28:12.342+01  2014-05
10103043509747  2014-05-01 07:50:12.33+01   2014-05
10103043501575  2014-05-01 09:27:12.33+01   2014-05
10104043514061  2014-05-01 09:03:12.324+01  2014-05
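To confirm that the partition filter is served without MapReduce, the query plan can be inspected; for this query it should show only a fetch stage:
explain select * from order_created_partition where event_month='2014-05';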
Loading data into the partitioned table, method 2:
Step 1: create the directory event_month=2014-06 under the table's HDFS directory /user/hive/warehouse/order_created_partition:
hadoop fs -mkdir /user/hive/warehouse/order_created_partition/event_month=2014-06
Step 2: copy the data file into the newly created directory:
hadoop fs -put /home/spark/software/data/order_created.txt /user/hive/warehouse/order_created_partition/event_month=2014-06
Step 3: add the new partition to the metastore:
msck repair table order_created_partition;
The execution log shows:
Partitions not in metastore: order_created_partition:event_month=2014-06
Repair: Added partition to metastore order_created_partition:event_month=2014-06
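Alternatively, instead of msck repair, the partition can be registered explicitly:
ALTER TABLE order_created_partition ADD IF NOT EXISTS PARTITION (event_month='2014-06') LOCATION '/user/hive/warehouse/order_created_partition/event_month=2014-06';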
Query the data in the event_month=2014-06 partition:
select * from order_created_partition where event_month='2014-06';
10703007267488  2014-05-01 06:01:12.334+01  2014-06
10101043505096  2014-05-01 07:28:12.342+01  2014-06
10103043509747  2014-05-01 07:50:12.33+01   2014-06
10103043501575  2014-05-01 09:27:12.33+01   2014-06
10104043514061  2014-05-01 09:03:12.324+01  2014-06
List all partitions of the table:
show partitions order_created_partition;
Show one specific partition:
SHOW PARTITIONS order_created_partition PARTITION(event_month='2014-06');
Inspect the table's columns and metadata:
desc order_created_partition;
desc extended order_created_partition;
desc formatted order_created_partition;
desc formatted order_created_partition partition(event_month='2014-05');
Dynamic Partitioned Tables
CREATE TABLE order_created_dynamic_partition (orderNumber STRING, event_time STRING) PARTITIONED BY (event_month string);
Load the data; the last column of the SELECT supplies the dynamic partition value:
insert into table order_created_dynamic_partition PARTITION (event_month) select orderNumber, event_time, substr(event_time, 1, 7) as event_month from order_created;
This fails with:
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
Solution:
set hive.exec.dynamic.partition.mode=nonstrict;
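For reference, the settings that govern dynamic partitioning (values shown are illustrative; defaults vary across Hive versions):
set hive.exec.dynamic.partition=true;            -- dynamic partitioning must be enabled
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow every partition column to be dynamic
set hive.exec.max.dynamic.partitions=1000;       -- cap on partitions one statement may create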
Re-run the insert:
insert into table order_created_dynamic_partition PARTITION (event_month) select orderNumber, event_time, substr(event_time, 1, 7) as event_month from order_created;
Query the data:
select * from order_created_dynamic_partition;
10703007267488  2014-05-01 06:01:12.334+01  2014-05
10101043505096  2014-05-01 07:28:12.342+01  2014-05
10103043509747  2014-05-01 07:50:12.33+01   2014-05
10103043501575  2014-05-01 09:27:12.33+01   2014-05
10104043514061  2014-05-01 09:03:12.324+01  2014-05
Parquet Tables and ALTER ... SET FILEFORMAT
Create a partitioned table stored as Parquet:
CREATE TABLE order_created_dynamic_partition_parquet (orderNumber STRING, event_time STRING) PARTITIONED BY (event_month string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS parquet;
Inspect the table; pay particular attention to how the SerDe Library, InputFormat, and OutputFormat properties differ between the Parquet table and the text table:
desc formatted order_created_dynamic_partition_parquet;
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

desc formatted order_created_dynamic_partition;
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:   org.apache.hadoop.mapred.TextInputFormat
OutputFormat:  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Insert data:
insert into table order_created_dynamic_partition_parquet PARTITION (event_month) select orderNumber, event_time, substr(event_time, 1, 7) as event_month from order_created;
Query the data:
select * from order_created_dynamic_partition_parquet;
10703007267488  2014-05-01 06:01:12.334+01  2014-05
10101043505096  2014-05-01 07:28:12.342+01  2014-05
10103043509747  2014-05-01 07:50:12.33+01   2014-05
10103043501575  2014-05-01 09:27:12.33+01   2014-05
10104043514061  2014-05-01 09:03:12.324+01  2014-05
Look at the files stored on HDFS:
hadoop fs -text /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-05/000000_0
The output looks like gibberish: Parquet is a binary format, so the file contents cannot be read with the naked eye.
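To actually inspect the contents, a Parquet-aware tool is needed; assuming a parquet-tools jar is available (it ships separately from Hive), something like the following should print the leading records:
hadoop jar parquet-tools-<version>.jar head /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-05/000000_0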
Note: the following steps copy a TEXTFILE file into the Parquet table.
Create the directory event_month=2014-06 under the HDFS directory /user/hive/warehouse/order_created_dynamic_partition_parquet:
hadoop fs -mkdir /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06
Copy the data file into the newly created directory:
hadoop fs -put /home/spark/software/data/order_created.txt /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06
Add the new partition to the metastore:
msck repair table order_created_dynamic_partition_parquet;
Query the data:
select * from order_created_dynamic_partition_parquet;
This fails with the following error:
Failed with exception java.io.IOException:java.lang.RuntimeException: hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [52, 43, 48, 49]
Cause: the file under event_month=2014-06 is an ordinary text file, not a Parquet file. (A Parquet file must end with the 4-byte magic number "PAR1", i.e. bytes [80, 65, 82, 49]; the bytes actually found, [52, 43, 48, 49], spell "4+01", the tail of the text file.)
Reading a genuine Parquet partition still works fine, e.g.:
select * from order_created_dynamic_partition_parquet where event_month='2014-05';
Solution:
For Hive 0.12 and earlier:
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;
Both the SERDE and the FILEFORMAT must be changed for this to take effect; in these versions SET FILEFORMAT only updates the InputFormat/OutputFormat, not the SerDe.
After setting both manually, queries work normally again.
Note the SerDe Library (LazySimpleSerDe), InputFormat, and OutputFormat values below.
Storage Information for the event_month='2014-06' partition:
desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-06');
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:   org.apache.hadoop.mapred.TextInputFormat
OutputFormat:  org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
Storage Information for the event_month='2014-05' partition:
desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-05');
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Information for the table as a whole:
desc formatted order_created_dynamic_partition_parquet;
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
From these operations we can conclude:
1) different partitions of order_created_dynamic_partition_parquet can have different storage formats;
2) the table-level storage format remains whatever was specified at creation time.
For Hive 0.13 and later, a single statement suffices:
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT parquet;
Inserting data into the Parquet table:
insert into table order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') select orderNumber, event_time from order_created;
Listing the files under order_created_dynamic_partition_parquet shows that a single partition now holds files in two different formats:
hadoop fs -ls /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06
/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/000000_0            # Parquet
/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt   # TEXTFILE
This happened because we manually switched event_month='2014-06' to TEXTFILE, while the INSERT wrote Parquet (an INSERT always writes in the storage format specified when the table was created).
select * from order_created_dynamic_partition_parquet;   -- fails again
Check the partition info:
desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-06');
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat:   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat:  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
The SerDe Library is back to the Parquet type: the INSERT reset the partition to the table's creation-time storage format, wiping out the earlier manual settings. They need to be applied again:
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;
select * from order_created_dynamic_partition_parquet;
Now some of the rows come back, but the rows stored in Parquet format are not displayed. This may be a Hive bug.
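A possible cleanup, assuming the goal is to make the partition purely Parquet again: delete the stray text file and restore the partition format (Hive 0.13+ syntax; on 0.12 the SERDE would also have to be set back):
hadoop fs -rm /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT parquet;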
Hive Virtual Columns
INPUT__FILE__NAME: the name of the input file the row was read from
BLOCK__OFFSET__INSIDE__FILE: the byte offset of the row within that file
select INPUT__FILE__NAME, ordernumber, event_time, BLOCK__OFFSET__INSIDE__FILE from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488  2014-05-01 06:01:12.334+01  0
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096  2014-05-01 07:28:12.342+01  42
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747  2014-05-01 07:50:12.33+01  84
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575  2014-05-01 09:27:12.33+01  125
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061  2014-05-01 09:03:12.324+01  166
Deriving the line number of a row within its file: each row occupies length(ordernumber) + length(event_time) + 2 bytes (one tab, one newline), so dividing the offset by the row length approximates the position; because the timestamps vary slightly in length, the quotient drifts and needs round() to clean it up:
select INPUT__FILE__NAME, ordernumber, event_time, BLOCK__OFFSET__INSIDE__FILE / (length(ordernumber) + length(event_time) + 2) + 1 from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488  2014-05-01 06:01:12.334+01  1
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096  2014-05-01 07:28:12.342+01  2
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747  2014-05-01 07:50:12.33+01  3.0487804878
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575  2014-05-01 09:27:12.33+01  4.0487804878
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061  2014-05-01 09:03:12.324+01  4.95238095238
select INPUT__FILE__NAME, ordernumber, event_time, round(BLOCK__OFFSET__INSIDE__FILE / (length(ordernumber) + length(event_time) + 2) + 1) from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488  2014-05-01 06:01:12.334+01  1
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096  2014-05-01 07:28:12.342+01  2
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747  2014-05-01 07:50:12.33+01  3
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575  2014-05-01 09:27:12.33+01  4
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061  2014-05-01 09:03:12.324+01  5
The Order-Picked Table
Create the table:
CREATE TABLE order_picked (orderNumber STRING, event_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Load data:
load data local inpath '/home/spark/software/data/order_picked.txt' overwrite into table order_picked;
Query the data:
select * from order_picked;
10703007267488  2014-05-01 07:02:12.334+01
10101043505096  2014-05-01 08:29:12.342+01
10103043509747  2014-05-01 10:55:12.33+01
The Order-Shipped Table
Create the table:
CREATE TABLE order_shipped (orderNumber STRING, event_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Load data:
load data local inpath '/home/spark/software/data/order_shipped.txt' overwrite into table order_shipped;
Query the data:
select * from order_shipped;
10703007267488  2014-05-01 10:00:12.334+01
10101043505096  2014-05-01 18:39:12.342+01
The Order-Received Table
Create the table:
CREATE TABLE order_received (orderNumber STRING, event_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Load data:
load data local inpath '/home/spark/software/data/order_received.txt' overwrite into table order_received;
Query the data:
select * from order_received;
10703007267488  2014-05-02 12:12:12.334+01
The Order-Cancelled Table
Create the table:
CREATE TABLE order_cancelled (orderNumber STRING, event_time STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Load data:
load data local inpath '/home/spark/software/data/order_cancelled.txt' overwrite into table order_cancelled;
Query the data:
select * from order_cancelled;
10103043501575  2014-05-01 12:12:12.334+01
Pivoting rows into columns:
Method 1: UNION ALL. Tag each event table with a type_id, UNION ALL them together, and pivot with max(CASE ...) grouped by order number; a missing event becomes the string '0':
CREATE TABLE order_tracking AS
SELECT orderNumber
     , max(CASE WHEN type_id = "order_created"   THEN event_time ELSE '0' END) AS order_created_ts
     , max(CASE WHEN type_id = "order_picked"    THEN event_time ELSE '0' END) AS order_picked_ts
     , max(CASE WHEN type_id = "order_shipped"   THEN event_time ELSE '0' END) AS order_shipped_ts
     , max(CASE WHEN type_id = "order_received"  THEN event_time ELSE '0' END) AS order_received_ts
     , max(CASE WHEN type_id = "order_cancelled" THEN event_time ELSE '0' END) AS order_cancelled_ts
FROM (
  select orderNumber, "order_created"   as type_id, event_time FROM order_created
  UNION ALL
  select orderNumber, "order_picked"    as type_id, event_time FROM order_picked
  UNION ALL
  select orderNumber, "order_shipped"   as type_id, event_time FROM order_shipped
  UNION ALL
  select orderNumber, "order_received"  as type_id, event_time FROM order_received
  UNION ALL
  select orderNumber, "order_cancelled" as type_id, event_time FROM order_cancelled
) u
group by orderNumber;
select * from order_tracking order by order_created_ts limit 5;
10703007267488  2014-05-01 06:01:12.334+01  2014-05-01 07:02:12.334+01  2014-05-01 10:00:12.334+01  2014-05-02 12:12:12.334+01  0
10101043505096  2014-05-01 07:28:12.342+01  2014-05-01 08:29:12.342+01  2014-05-01 18:39:12.342+01  0  0
10103043509747  2014-05-01 07:50:12.33+01   2014-05-01 10:55:12.33+01   0  0  0
10104043514061  2014-05-01 09:03:12.324+01  0  0  0  0
10103043501575  2014-05-01 09:27:12.33+01   0  0  0  2014-05-01 12:12:12.334+01
Method 2: JOIN. Drive from order_created and LEFT OUTER JOIN the other four event tables; a missing event becomes NULL:
CREATE TABLE order_tracking_join AS
select t1.orderNumber
     , t1.event_time as order_created_ts
     , t2.event_time as order_picked_ts
     , t3.event_time as order_shipped_ts
     , t4.event_time as order_received_ts
     , t5.event_time as order_cancelled_ts
from (
  select ordernumber, max(event_time) as event_time from order_created group by ordernumber
) t1
left outer join (
  select ordernumber, max(event_time) as event_time from order_picked group by ordernumber
) t2 on t1.ordernumber = t2.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_shipped group by ordernumber
) t3 on t1.ordernumber = t3.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_received group by ordernumber
) t4 on t1.ordernumber = t4.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_cancelled group by ordernumber
) t5 on t1.ordernumber = t5.ordernumber;
select * from order_tracking_join order by order_created_ts limit 5;
10703007267488  2014-05-01 06:01:12.334+01  2014-05-01 07:02:12.334+01  2014-05-01 10:00:12.334+01  2014-05-02 12:12:12.334+01  NULL
10101043505096  2014-05-01 07:28:12.342+01  2014-05-01 08:29:12.342+01  2014-05-01 18:39:12.342+01  NULL  NULL
10103043509747  2014-05-01 07:50:12.33+01   2014-05-01 10:55:12.33+01   NULL  NULL  NULL
10104043514061  2014-05-01 09:03:12.324+01  NULL  NULL  NULL  NULL
10103043501575  2014-05-01 09:27:12.33+01   NULL  NULL  NULL  2014-05-01 12:12:12.334+01
The final statistics. Note that the two intermediate tables encode a missing event differently (the string '0' vs NULL), so the WHERE clauses below differ accordingly.
About COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0): COALESCE returns its first non-NULL argument, so when unix_timestamp cannot parse the value (or the value is NULL) and returns NULL, 0 is used instead.
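A minimal sketch of that fallback (assuming Hive 0.13+, which allows SELECT without a FROM clause):
select COALESCE(unix_timestamp('2014-05-01 07:28:12.342', 'yyyy-MM-dd HH:mm:ss.S'), 0);  -- parses, returns the epoch seconds
select COALESCE(unix_timestamp('0', 'yyyy-MM-dd HH:mm:ss.S'), 0);                        -- unparseable: unix_timestamp returns NULL, COALESCE falls back to 0
Either way, a missing event ('0' or NULL) turns the time difference below into a large negative number rather than a false positive.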
Method 1: against the UNION ALL result table
select orderNumber
     , order_created_ts
     , order_picked_ts
     , order_shipped_ts
     , order_received_ts
     , order_cancelled_ts
from order_tracking
WHERE order_created_ts != '0' AND order_cancelled_ts = '0'
AND (
     COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 2 * 60 * 60
  OR COALESCE(unix_timestamp(order_shipped_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 4 * 60 * 60
  OR COALESCE(unix_timestamp(order_received_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 48 * 60 * 60
);
10101043505096  2014-05-01 07:28:12.342+01  2014-05-01 08:29:12.342+01  2014-05-01 18:39:12.342+01  0  0
10103043509747  2014-05-01 07:50:12.33+01   2014-05-01 10:55:12.33+01   0  0  0
Method 2: against the JOIN result table
select orderNumber
     , order_created_ts
     , order_picked_ts
     , order_shipped_ts
     , order_received_ts
     , order_cancelled_ts
from order_tracking_join
WHERE order_created_ts IS NOT NULL AND order_cancelled_ts IS NULL
AND (
     COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 2 * 60 * 60
  OR COALESCE(unix_timestamp(order_shipped_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 4 * 60 * 60
  OR COALESCE(unix_timestamp(order_received_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 48 * 60 * 60
);
10101043505096  2014-05-01 07:28:12.342+01  2014-05-01 08:29:12.342+01  2014-05-01 18:39:12.342+01  NULL  NULL
10103043509747  2014-05-01 07:50:12.33+01   2014-05-01 10:55:12.33+01   NULL  NULL  NULL