Hive Case Study: Finding Orders with Abnormal State Timing

Requirements

An order goes through five states: created, picked, shipped, received, and cancelled.

Goal: flag orders where the time from creation to picking exceeds 2 hours, from creation to shipping exceeds 4 hours, or from creation to delivery exceeds 48 hours.

 

Key Points

1) external tables

2) desc formatted

3) virtual columns

4) ALTER ... SET FILEFORMAT

5) COALESCE and unix_timestamp

6) PARQUET

 

Implementation

External Tables

The order-created table:

CREATE EXTERNAL TABLE order_created (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
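
Since the table is EXTERNAL, dropping it only removes the metadata; the data files on HDFS are left intact. A quick way to confirm the table type (a suggested check, not part of the original walkthrough):

desc formatted order_created;
-- the Detailed Table Information section includes: Table Type: EXTERNAL_TABLE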

Load data into the table:

load data local inpath '/home/spark/software/data/order_created.txt' overwrite into table order_created;

Another way to import data: point the table at a directory of files via LOCATION when creating it.

1) The directory given to LOCATION already contains files at table-creation time:

CREATE TABLE test_load1 (
    id STRING
  , name  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/student';
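
Because /student already contains data files, the table can be queried right after creation, with no explicit load step (a quick sanity check, assuming tab-delimited id/name rows already sit under /student):

select * from test_load1;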

2) The directory given to LOCATION does not yet contain any files:

CREATE TABLE test_load2 (
    id STRING
  , name  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/load';

Upload the data file to that directory:

hadoop fs -put /home/spark/software/data/student.txt /load/
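
The table sees the file immediately; a quick check:

select * from test_load2;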

Query the data:

select * from order_created;
10703007267488  2014-05-01 06:01:12.334+01
10101043505096  2014-05-01 07:28:12.342+01
10103043509747  2014-05-01 07:50:12.33+01
10103043501575  2014-05-01 09:27:12.33+01
10104043514061  2014-05-01 09:03:12.324+01

 

Static Partitioned Tables

CREATE TABLE order_created_partition (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Loading a partitioned table, method 1:

load data local inpath '/home/spark/software/data/order_created.txt' overwrite into table order_created_partition PARTITION (event_month='2014-05');

Query the data:

select * from order_created_partition where event_month='2014-05';   # no MapReduce job is launched
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05
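
You can confirm that this query skips MapReduce by inspecting the plan; with the partition filter, Hive can serve select * through a simple fetch task (a suggested check, not in the original article):

explain select * from order_created_partition where event_month='2014-05';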

Loading a partitioned table, method 2:

Step 1: create the directory event_month=2014-06 under the table's HDFS directory /user/hive/warehouse/order_created_partition:

hadoop fs -mkdir /user/hive/warehouse/order_created_partition/event_month=2014-06

Step 2: copy the data file into the new directory:

hadoop fs -put /home/spark/software/data/order_created.txt /user/hive/warehouse/order_created_partition/event_month=2014-06

Step 3: register the new partition in the metastore:

msck repair table order_created_partition;

Execution log:

Partitions not in metastore:    order_created_partition:event_month=2014-06
Repair: Added partition to metastore order_created_partition:event_month=2014-06
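
An alternative to msck repair, when the partition name is already known, is to register it explicitly:

ALTER TABLE order_created_partition ADD IF NOT EXISTS PARTITION (event_month='2014-06');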

Query the event_month=2014-06 partition:

select * from order_created_partition where event_month='2014-06';
10703007267488  2014-05-01 06:01:12.334+01      2014-06
10101043505096  2014-05-01 07:28:12.342+01      2014-06
10103043509747  2014-05-01 07:50:12.33+01       2014-06
10103043501575  2014-05-01 09:27:12.33+01       2014-06
10104043514061  2014-05-01 09:03:12.324+01      2014-06

List all partitions of the table:

show partitions order_created_partition;

Show a specific partition:

SHOW PARTITIONS order_created_partition PARTITION(event_month='2014-06');

View the table's columns and metadata:

desc order_created_partition;
desc extended order_created_partition;
desc formatted order_created_partition;
desc formatted order_created_partition partition(event_month='2014-05');

 

Dynamic Partitioned Tables

CREATE TABLE order_created_dynamic_partition (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string);

Load data:

insert into table order_created_dynamic_partition PARTITION (event_month)
select orderNumber, event_time, substr(event_time, 1, 7) as event_month from order_created;

This fails with:

FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. 
To turn this off set hive.exec.dynamic.partition.mode=nonstrict

Fix:

set hive.exec.dynamic.partition.mode=nonstrict;
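
Note that on some Hive versions dynamic partitioning itself must also be switched on, and the partition-count limits may need raising for larger loads (the commented values below are the usual defaults, stated as an assumption):

set hive.exec.dynamic.partition=true;
-- optional, only needed for jobs creating many partitions:
-- set hive.exec.max.dynamic.partitions=1000;
-- set hive.exec.max.dynamic.partitions.pernode=100;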

Re-run:

insert into table order_created_dynamic_partition PARTITION (event_month)
select orderNumber, event_time, substr(event_time, 1, 7) as event_month from order_created;

Query the data:

select * from order_created_dynamic_partition;
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05

 

Parquet Tables and ALTER ... SET FILEFORMAT

Create a partitioned table stored as Parquet:

CREATE TABLE order_created_dynamic_partition_parquet (
    orderNumber STRING
  , event_time  STRING
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;

View the table metadata; pay attention to how the SerDe Library / InputFormat / OutputFormat properties differ between the two tables:

desc formatted order_created_dynamic_partition_parquet;
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

desc formatted order_created_dynamic_partition;
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Insert data:

insert into table order_created_dynamic_partition_parquet PARTITION (event_month)
select orderNumber, event_time, substr(event_time, 1, 7) as event_month from order_created;

Query the data:

select * from order_created_dynamic_partition_parquet;
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05

Look at the file stored on HDFS:

hadoop fs -text /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-05/000000_0

The output looks like gibberish: Parquet is a binary format, so the file contents cannot be read directly.

Note: the following steps copy a TEXTFILE file into the Parquet table.

Create the directory event_month=2014-06 under the table's HDFS directory /user/hive/warehouse/order_created_dynamic_partition_parquet:

hadoop fs -mkdir /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06

Copy the data file into the new directory:

hadoop fs -put /home/spark/software/data/order_created.txt /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06

Register the new partition in the metastore:

msck repair table order_created_dynamic_partition_parquet;

Query the data:

select * from order_created_dynamic_partition_parquet;

This fails with:

Failed with exception java.io.IOException:java.lang.RuntimeException: 
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt 
is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [52, 43, 48, 49]

Cause: the file in event_month=2014-06 is not a Parquet file but a plain text file.
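
The [80, 65, 82, 49] in the error message is ASCII for 'PAR1', the magic bytes that open and close every Parquet file, which suggests a quick way to tell the two kinds of files apart (a diagnostic sketch, not from the original article):

hadoop fs -cat /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-05/000000_0 | head -c 4
# prints: PAR1
hadoop fs -cat /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt | head -c 4
# prints the first 4 characters of the first order number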

Reading the Parquet partition still works fine, for example:

select * from order_created_dynamic_partition_parquet where event_month='2014-05';

Fix:

For Hive 0.12 and earlier:

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;

Both the SERDE and the FILEFORMAT must be changed for this to take effect.

After this manual change, the partition can be queried normally again.

Note the SerDe Library / InputFormat / OutputFormat values:

Storage Information for the event_month='2014-06' partition:

desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-06');
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat 

Storage Information for the event_month='2014-05' partition:

desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-05');
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

Storage Information for the table as a whole:

desc formatted order_created_dynamic_partition_parquet;
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

Conclusions from the steps above:

1) different partitions of order_created_dynamic_partition_parquet can have different storage formats;

2) the table itself keeps the storage format specified at creation time.

For Hive 0.13 and later:

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT parquet;

Inserting data into the Parquet table:

insert into table order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') select orderNumber, event_time from order_created;

Listing the partition directory now shows two different file formats inside a single partition:

hadoop fs -ls /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06
/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/000000_0            # Parquet
/user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt   # TEXTFILE

This is because event_month='2014-06' was manually switched to TEXTFILE earlier, while the insert wrote Parquet data (inserted data always uses the storage format specified at table-creation time).

select * from order_created_dynamic_partition_parquet; now fails again.

View the partition metadata:

desc formatted order_created_dynamic_partition_parquet partition(event_month='2014-06');
SerDe Library:          org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe      
InputFormat:            org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
OutputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat 

The SerDe Library is back to Parquet: the insert overrode the manual setting with the table's creation-time format, so it has to be set again:

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT textfile;

select * from order_created_dynamic_partition_parquet;

This now returns only part of the data; the rows stored in Parquet format cannot be displayed. This may be a Hive bug.
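
A practical workaround (my suggestion, not from the original article) is to keep each partition homogeneous: move the stray text file out of the partition directory and point the partition back at Parquet, so only Parquet files remain in it; the text data can then be re-inserted through Hive, which writes it as Parquet:

hadoop fs -mv /user/hive/warehouse/order_created_dynamic_partition_parquet/event_month=2014-06/order_created.txt /tmp/

ALTER TABLE order_created_dynamic_partition_parquet PARTITION (event_month='2014-06') SET FILEFORMAT parquet;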

 

Hive Virtual Columns

INPUT__FILE__NAME: the name of the input file this row came from

BLOCK__OFFSET__INSIDE__FILE: the byte offset of this row within that file

select INPUT__FILE__NAME, ordernumber, event_time, BLOCK__OFFSET__INSIDE__FILE  from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488 2014-05-01 06:01:12.334+01      0
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096 2014-05-01 07:28:12.342+01      42
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747 2014-05-01 07:50:12.33+01       84
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575 2014-05-01 09:27:12.33+01       125
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061 2014-05-01 09:03:12.324+01      166

Compute which line of the file a row is on. Dividing the byte offset by the row length gives a zero-based line index; the + 2 accounts for the tab delimiter and the trailing newline of each row. Because rows vary slightly in length, the result is only approximate, hence the round() in the second query:

select INPUT__FILE__NAME, ordernumber, event_time, BLOCK__OFFSET__INSIDE__FILE / (length(ordernumber) + length(event_time) + 2) + 1 from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488 2014-05-01 06:01:12.334+01      1
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096 2014-05-01 07:28:12.342+01      2
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747 2014-05-01 07:50:12.33+01       3.0487804878
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575 2014-05-01 09:27:12.33+01       4.0487804878
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061 2014-05-01 09:03:12.324+01      4.95238095238
select INPUT__FILE__NAME, ordernumber, event_time, round(BLOCK__OFFSET__INSIDE__FILE / (length(ordernumber) + length(event_time) + 2) + 1) from order_created_dynamic_partition;
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10703007267488 2014-05-01 06:01:12.334+01      1
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10101043505096 2014-05-01 07:28:12.342+01      2
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043509747 2014-05-01 07:50:12.33+01       3
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10103043501575 2014-05-01 09:27:12.33+01       4
hdfs://hadoop000:8020/user/hive/warehouse/order_created_dynamic_partition/event_month=2014-05/000000_0  10104043514061 2014-05-01 09:03:12.324+01      5

The order-picked table

Create the table:

CREATE TABLE order_picked (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_picked.txt' overwrite into table order_picked;

Query the data:

select * from order_picked;
10703007267488  2014-05-01 07:02:12.334+01
10101043505096  2014-05-01 08:29:12.342+01
10103043509747  2014-05-01 10:55:12.33+01

The order-shipped table

Create the table:

CREATE TABLE order_shipped (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_shipped.txt' overwrite into table order_shipped;

Query the data:

select * from order_shipped;
10703007267488  2014-05-01 10:00:12.334+01
10101043505096  2014-05-01 18:39:12.342+01

The order-received table

Create the table:

CREATE TABLE order_received (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_received.txt' overwrite into table order_received;

Query the data:

select * from order_received;
10703007267488  2014-05-02 12:12:12.334+01

The order-cancelled table

Create the table:

CREATE TABLE order_cancelled (
    orderNumber STRING
  , event_time  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load data:

load data local inpath '/home/spark/software/data/order_cancelled.txt' overwrite into table order_cancelled;

Query the data:

select * from order_cancelled;
10103043501575  2014-05-01 12:12:12.334+01

Pivoting rows into columns:

Method 1: UNION ALL

CREATE TABLE order_tracking AS
SELECT orderNumber
     , max(CASE WHEN type_id="order_created"   THEN event_time ELSE 0 END) AS order_created_ts
     , max(CASE WHEN type_id="order_picked"    THEN event_time ELSE 0 END) AS order_picked_ts
     , max(CASE WHEN type_id="order_shipped"   THEN event_time ELSE 0 END) AS order_shipped_ts
     , max(CASE WHEN type_id="order_received"  THEN event_time ELSE 0 END) AS order_received_ts
     , max(CASE WHEN type_id="order_cancelled" THEN event_time ELSE 0 END) AS order_cancelled_ts
FROM (
    select orderNumber, "order_created"   as type_id, event_time FROM order_created
  UNION ALL
    select orderNumber, "order_picked"    as type_id, event_time FROM order_picked
  UNION ALL
    select orderNumber, "order_shipped"   as type_id, event_time FROM order_shipped
  UNION ALL
    select orderNumber, "order_received"  as type_id, event_time FROM order_received
  UNION ALL
    select orderNumber, "order_cancelled" as type_id, event_time FROM order_cancelled
) u
group by orderNumber;
select * from order_tracking order by order_created_ts limit 5;
10703007267488  2014-05-01 06:01:12.334+01      2014-05-01 07:02:12.334+01      2014-05-01 10:00:12.334+01    2014-05-02 12:12:12.334+01       0
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01    0       0
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       0       0       0
10104043514061  2014-05-01 09:03:12.324+01      0       0       0       0
10103043501575  2014-05-01 09:27:12.33+01       0       0       0       2014-05-01 12:12:12.334+01

Method 2: JOIN

CREATE TABLE order_tracking_join AS
select t1.orderNumber
     , t1.event_time as order_created_ts
     , t2.event_time as order_picked_ts
     , t3.event_time as order_shipped_ts
     , t4.event_time as order_received_ts
     , t5.event_time as order_cancelled_ts
from (
  select ordernumber, max(event_time) as event_time from order_created group by ordernumber
) t1
left outer join (
  select ordernumber, max(event_time) as event_time from order_picked group by ordernumber
) t2
on t1.ordernumber = t2.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_shipped group by ordernumber
) t3
on t1.ordernumber = t3.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_received group by ordernumber
) t4
on t1.ordernumber = t4.ordernumber
left outer join (
  select ordernumber, max(event_time) as event_time from order_cancelled group by ordernumber
) t5
on t1.ordernumber = t5.ordernumber;
select * from order_tracking_join order by order_created_ts limit 5;
10703007267488  2014-05-01 06:01:12.334+01      2014-05-01 07:02:12.334+01      2014-05-01 10:00:12.334+01    2014-05-02 12:12:12.334+01       NULL
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01    NULL     NULL
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       NULL    NULL    NULL
10104043514061  2014-05-01 09:03:12.324+01      NULL    NULL    NULL    NULL
10103043501575  2014-05-01 09:27:12.33+01       NULL    NULL    NULL    2014-05-01 12:12:12.334+01

The final statistics:

How COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) works: if the first argument evaluates to NULL, COALESCE returns the second argument instead.
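
A quick illustration of the two functions (unix_timestamp returns NULL when the string cannot be parsed with the pattern, which is exactly what COALESCE guards against here; the trailing '+01' timezone suffix is simply ignored by this pattern):

select unix_timestamp('2014-05-01 06:01:12.334+01', 'yyyy-MM-dd HH:mm:ss.S');   -- seconds since the epoch
select unix_timestamp(NULL, 'yyyy-MM-dd HH:mm:ss.S');                           -- NULL
select COALESCE(NULL, 0);                                                       -- 0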

Method 1: against the UNION ALL result table

select orderNumber
     , order_created_ts
     , order_picked_ts
     , order_shipped_ts
     , order_received_ts
     , order_cancelled_ts
  from order_tracking
 WHERE order_created_ts != 0 AND order_cancelled_ts = 0
   AND (
    COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 2 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_shipped_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 4 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_received_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 48 * 60 * 60
   )
;
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01      0       0
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       0       0       0

Method 2: against the JOIN result table

select orderNumber
     , order_created_ts
     , order_picked_ts
     , order_shipped_ts
     , order_received_ts
     , order_cancelled_ts
  from order_tracking_join
 WHERE order_created_ts IS NOT NULL AND order_cancelled_ts IS NULL
   AND (
    COALESCE(unix_timestamp(order_picked_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 2 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_shipped_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 4 * 60 * 60
    OR
    COALESCE(unix_timestamp(order_received_ts, 'yyyy-MM-dd HH:mm:ss.S'), 0) - unix_timestamp(order_created_ts, 'yyyy-MM-dd HH:mm:ss.S') > 48 * 60 * 60
   )
;
10101043505096  2014-05-01 07:28:12.342+01      2014-05-01 08:29:12.342+01      2014-05-01 18:39:12.342+01      NULL    NULL
10103043509747  2014-05-01 07:50:12.33+01       2014-05-01 10:55:12.33+01       NULL    NULL    NULL

 

Source: http://www.cnblogs.com/luogankun/p/3939533.html