前言
GROUPING SETS
,GROUPING__ID
,CUBE
,ROLLUP
这几个分析函数通常用于OLAP
中,不能累加,而且需要根据不同维度上钻和下钻的指标统计,比如,分小时、天、月的UV数。
grouping sets
根据不同的维度组合进行聚合,等价于将不同维度的group by
的结果进行 union all
,简单来说就是将多个不同维度的group by
逻辑写在了 一个sql中。
数据准备:
vim /root/test.txt
2015-03,2015-03-10,cookie1
2015-03,2015-03-10,cookie5
2015-03,2015-03-12,cookie7
2015-04,2015-04-12,cookie3
2015-04,2015-04-13,cookie2
2015-04,2015-04-13,cookie4
2015-04,2015-04-16,cookie4
2015-03,2015-03-10,cookie2
2015-03,2015-03-10,cookie3
2015-04,2015-04-12,cookie5
2015-04,2015-04-13,cookie6
2015-04,2015-04-15,cookie3
2015-04,2015-04-15,cookie2
2015-04,2015-04-16,cookie1
将数据导入到hdfs目录上:
hdfs dfs -put /root/test.txt /tmp
创建表
use test;
create table cookie5(month string, day string, cookieid string)
row format delimited fields terminated by ',';
load data inpath "/tmp/test.txt" into table cookie5;
select * from cookie5;
开始使用grouping sets
来条sql语句:
select
month,day,count(cookieid)
from cookie5
group by month,day
grouping sets (month,day);
查询结果如下:
上面这个sql等同于多个group by + union all
select month,NULL as day,count(cookieid) as nums
from cookie5
group by month
union all select NULL as month,day,count(cookieid) as nums
from cookie5
group by day;
注意点:使用
union
和union all
必须保证各个select 集合的结果有相同个数的列,并且每个列的类型是一样的。union all
的表字段必须匹配,也就是上文的month 需要用NULL as month
来进行填充。
结果一致!但是grouping sets
速度要比group by + union
快!!!!!
但是有引出一个问题?为什么grouping sets
要比 group by + union
速度要快?
首先:用explain
来解释下hive
执行计划:
explain select
month,day,count(cookieid)
from cookie5
group by month,day
grouping sets (month,day);
执行计划如下:
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-0 depends on stages: Stage-1 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: cookie5 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: month (type: string), day (type: string), cookieid (type: string) |
| outputColumnNames: month, day, cookieid |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(cookieid) |
| keys: month (type: string), day (type: string), '0' (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1, _col2, _col3 |
| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| sort order: +++ |
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string), _col2 (type: string) |
| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col3 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| keys: KEY._col0 (type: string), KEY._col1 (type: string), KEY._col2 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1, _col3 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| pruneGroupingSetId: true |
| Select Operator |
| expressions: _col0 (type: string), _col1 (type: string), _col3 (type: bigint) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink
再看另外一个:
explain select month,NULL as day,count(cookieid) as nums from cookie5 group by month
union all
select NULL as month,day,count(cookieid) as nums from cookie5 group by day;
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| STAGE DEPENDENCIES: |
| Stage-1 is a root stage |
| Stage-2 depends on stages: Stage-1, Stage-3 |
| Stage-3 is a root stage |
| Stage-0 depends on stages: Stage-2 |
| |
| STAGE PLANS: |
| Stage: Stage-1 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: cookie5 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: month (type: string), cookieid (type: string) |
| outputColumnNames: month, cookieid |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(cookieid) |
| keys: month (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| keys: KEY._col0 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: _col0 (type: string), UDFToString(null) (type: string), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-2 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| Union |
| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| TableScan |
| Union |
| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| Statistics: Num rows: 2 Data size: 756 Basic stats: COMPLETE Column stats: NONE |
| table: |
| input format: org.apache.hadoop.mapred.TextInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe |
| |
| Stage: Stage-3 |
| Map Reduce |
| Map Operator Tree: |
| TableScan |
| alias: cookie5 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| Select Operator |
| expressions: day (type: string), cookieid (type: string) |
| outputColumnNames: day, cookieid |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| Group By Operator |
| aggregations: count(cookieid) |
| keys: day (type: string) |
| mode: hash |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| Reduce Output Operator |
| key expressions: _col0 (type: string) |
| sort order: + |
| Map-reduce partition columns: _col0 (type: string) |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| value expressions: _col1 (type: bigint) |
| Reduce Operator Tree: |
| Group By Operator |
| aggregations: count(VALUE._col0) |
| keys: KEY._col0 (type: string) |
| mode: mergepartial |
| outputColumnNames: _col0, _col1 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
+----------------------------------------------------+--+
| Explain |
+----------------------------------------------------+--+
| Select Operator |
| expressions: UDFToString(null) (type: string), _col0 (type: string), _col1 (type: bigint) |
| outputColumnNames: _col0, _col1, _col2 |
| Statistics: Num rows: 1 Data size: 378 Basic stats: COMPLETE Column stats: NONE |
| File Output Operator |
| compressed: false |
| table: |
| input format: org.apache.hadoop.mapred.SequenceFileInputFormat |
| output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat |
| serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe |
| |
| Stage: Stage-0 |
| Fetch Operator |
| limit: -1 |
| Processor Tree: |
| ListSink |
| |
+----------------------------------------------------+--+
比较:通过explain
可以发现
grouping sets
时,有2个stage
,一次reduce
group by + union
时,有4个stage
,两次reduce
。
那么肯定用grouping sets
时,速度会快。
GROUPING__ID
的使用:
来条sql语句:
select
month,
day,
count(distinct cookieid) as uv,
GROUPING__ID
from cookie5
group by month,day
grouping sets (month,day)
order by GROUPING__ID;
效果等价于:
SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day;
结果说明:
-
第一列是按照
month
进行分组 -
第二列是按照day进行分组
-
第三列是按照month或day分组是,统计这一组有几个不同的cookieid
-
第四列grouping_id表示这一组结果属于哪个分组集合,根据grouping sets中的分组条件month,day
再来个例子:
SELECT month, day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING__ID
FROM cookie5
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
ORDER BY GROUPING__ID;
等价于:
SELECT month,NULL as day,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING__ID FROM cookie5 GROUP BY month
UNION ALL
SELECT NULL as month,day,COUNT(DISTINCT cookieid) AS uv,2 AS GROUPING__ID FROM cookie5 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid) AS uv,3 AS GROUPING__ID FROM cookie5 GROUP BY month,day;
grouping_id计算方法
grouping sets
中的每一种粒度,都对应唯一的 grouping__id
值,其计算公式与 group by
的顺序、当前粒度的字段有关。
具体计算方法如下:
- 将每个字段都抽象成二进制,有2个字段就是2位,有5个字段就有5位二进制数。
- 靠近
group by
的字段是低位, 远离的则是高位。 - hive中参与了组合的字段是1,未参与是0。
- 这样就形成了一个二进制数,这个二进制数转为十进制,即为当前粒度对应的 grouping__id。
举例
以上述对 cookie5
的3种粒度的统计结果为例:
我们的上面的sql 为 group by month, day
, 遵循第一条原则,month
在二进制中处于低位,day
处于高位。
序号 | grouping set | 二进制值 | 对应的十进制(grouping__id 的值) |
---|---|---|---|
0 | () | 00 | 0 |
1 | (month) | 01 | 1 |
2 | (day) | 10 | 2 |
3 | (month, day) | 11 | 3 |
这就是上面 grouping sets
的结果中 grouping__id
值的由来。