Hive Checkpoint Traffic Requirements Analysis
1. Create the table and load data
CREATE TABLE learn3.veh_pass(
id STRING COMMENT "checkpoint ID",
pass_time STRING COMMENT "pass time",
pass_num int COMMENT "number of vehicles passed"
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
STORED AS TEXTFILE;
load data local inpath "/usr/local/soft/hive-3.1.2/data/veh_pass.txt" INTO TABLE learn3.veh_pass;
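2. Requirements
Requirement 1: List the devices that reported data in April, together with the total number of distinct devices
(To query the current month rather than a fixed one, use the predicate substr(pass_time,1,7) = substr(current_date,1,7), as approach 1 and the incorrect variant do; approach 2 pins the month to "2022-04".)
-- Approach 1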
SELECT
T.id
,count(*) OVER()
FROM (
SELECT
id
,pass_time
FROM learn3.veh_pass
WHERE substr(pass_time,1,7) = substr(current_date,1,7)
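) T GROUP BY T.id;
-- Incorrect approach: the window function is evaluated before DISTINCT is applied,
-- so count(*) OVER() counts every filtered row instead of every distinct id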
SELECT
DISTINCT id
,count(*) OVER()
FROM (
SELECT
id
,pass_time
FROM learn3.veh_pass
WHERE substr(pass_time,1,7) = substr(current_date,1,7)
) T;
-- Approach 2:
SELECT
T1.id
,count(*) OVER()
FROM (
SELECT
DISTINCT T.id
FROM (
SELECT
id
,pass_time
FROM learn3.veh_pass
WHERE substr(pass_time,1,7) = "2022-04"
)T )T1;
If you instead want the number of records per device in April:
select
t1.id
,count(*)
from
(
select
v.id
,v.pass_time
from learn3.veh_pass v
where substr(pass_time,1,7) = "2022-04"
) t1 group by t1.id;
Sample output of approach 2 (each device with the total number of distinct devices):
+---------------------+-----------------+
| t1.id               | count_window_0  |
+---------------------+-----------------+
| 451000000000071117  | 5               |
| 451000000000071116  | 5               |
| 451000000000071115  | 5               |
| 451000000000071114  | 5               |
| 451000000000071113  | 5               |
+---------------------+-----------------+
Sample output of the inner SELECT DISTINCT id query:
+---------------------+
| id                  |
+---------------------+
| 451000000000071113  |
| 451000000000071114  |
| 451000000000071115  |
| 451000000000071116  |
| 451000000000071117  |
+---------------------+
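The difference between the DISTINCT variant and the two working queries comes down to evaluation order: the window function runs first, and DISTINCT only deduplicates afterwards. The following is a minimal sketch of that behavior, assuming SQLite (3.25 or later, via Python's built-in sqlite3 module) as a local stand-in for Hive. The rows in SAMPLE are invented for illustration and are not the contents of veh_pass.txt; the helper name run is hypothetical.

```python
import sqlite3

# Hypothetical sample rows (id, pass_time, pass_num); not the article's data file.
SAMPLE = [
    ("A", "2022-04-01", 10),
    ("A", "2022-04-02", 20),
    ("B", "2022-04-01", 5),
    ("B", "2022-04-03", 15),
    ("C", "2022-04-02", 7),
    ("C", "2022-03-31", 9),  # outside April, removed by the WHERE clause
]

# DISTINCT is applied after the window function, so the count is the row count.
WRONG = """
SELECT DISTINCT id, count(*) OVER ()
FROM veh_pass
WHERE substr(pass_time, 1, 7) = '2022-04'
"""

# Deduplicate first, then count the remaining rows with the window function.
RIGHT = """
SELECT id, count(*) OVER ()
FROM (SELECT DISTINCT id FROM veh_pass
      WHERE substr(pass_time, 1, 7) = '2022-04') t
"""

# GROUP BY collapses the rows before the window function sees them.
GROUPED = """
SELECT id, count(*) OVER ()
FROM veh_pass
WHERE substr(pass_time, 1, 7) = '2022-04'
GROUP BY id
"""

def run(sql):
    """Load SAMPLE into an in-memory table, run sql, return sorted rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE veh_pass (id TEXT, pass_time TEXT, pass_num INTEGER)")
    conn.executemany("INSERT INTO veh_pass VALUES (?, ?, ?)", SAMPLE)
    rows = conn.execute(sql).fetchall()
    conn.close()
    return sorted(rows)

if __name__ == "__main__":
    print(run(WRONG))    # every id paired with 5, the number of April rows
    print(run(RIGHT))    # every id paired with 3, the number of distinct ids
    print(run(GROUPED))  # same result as RIGHT
```

With this sample, the April filter keeps five rows for three devices, so the DISTINCT variant reports 5 while the other two report 3. In the article's own output both numbers happen to be 5, which is why the mistake is easy to miss.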
3. Summary:
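OVER(): opens a window for every row; with no arguments the window is the entire result set.
OVER(PARTITION BY ...): partitions the rows by the given columns; the window for each row is its whole partition, and the function is computed over the rows of that partition.
OVER(PARTITION BY ... ORDER BY ...): partitions the rows and sorts each partition. When no frame is given, the window is no longer the whole partition: it defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so an aggregate such as SUM becomes a running total within the partition.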
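To make the three forms of OVER concrete, here is a small sketch that computes the same SUM under each of them. It assumes SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for Hive; both follow the standard default frame when ORDER BY is present. The sample rows and the function name window_demo are invented for illustration.

```python
import sqlite3

# Hypothetical rows (id, pass_time, pass_num), invented for this sketch.
SAMPLE = [
    ("A", "2022-04-01", 10),
    ("A", "2022-04-02", 20),
    ("A", "2022-04-03", 30),
    ("B", "2022-04-01", 5),
    ("B", "2022-04-02", 15),
]

QUERY = """
SELECT id, pass_time, pass_num,
       sum(pass_num) OVER ()                                   AS all_rows,
       sum(pass_num) OVER (PARTITION BY id)                    AS per_id,
       sum(pass_num) OVER (PARTITION BY id ORDER BY pass_time) AS running
FROM veh_pass
ORDER BY id, pass_time
"""

def window_demo():
    """Return rows of (id, pass_time, pass_num, all_rows, per_id, running)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE veh_pass (id TEXT, pass_time TEXT, pass_num INTEGER)")
    conn.executemany("INSERT INTO veh_pass VALUES (?, ?, ?)", SAMPLE)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    for row in window_demo():
        print(row)
```

In this sample all_rows is 80 on every row, per_id is 60 for device A and 20 for device B, and running grows as 10, 30, 60 for A and 5, 20 for B.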
Requirement 2: List every traffic record together with the monthly traffic total across all devices
SELECT
T1.id
,T1.pass_time
,T1.pass_num
,SUM(T1.pass_num) OVER(PARTITION BY SUBSTRING(T1.pass_time,1,7)) as total_pass
FROM learn3.veh_pass T1;
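Requirement 3: Show the detail records in device ID and date order, and compute the sums in 1) to 5)
Window frame syntax inside OVER: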
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
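OVER(): specifies the window of rows that the analytic function operates on; this window can change from row to row.
CURRENT ROW: the current row
n PRECEDING: n rows before the current row
n FOLLOWING: n rows after the current row
UNBOUNDED: no limit, that is, the edge of the partition
UNBOUNDED PRECEDING: the frame starts at the first row of the partition
UNBOUNDED FOLLOWING: the frame ends at the last row of the partition
For example, to take the current row together with the row before it and the row after it, write:
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING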
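The frame bounds can be read as index arithmetic over the sorted rows of a partition. The sketch below emulates SUM over a ROWS frame in plain Python; the function name rows_frame_sum and its parameter convention (None for UNBOUNDED, 0 for CURRENT ROW) are made up for this illustration, and it only covers frames that start at or before the current row and end at or after it.

```python
def rows_frame_sum(values, preceding, following):
    """Emulate SUM(x) OVER (ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING).

    values:    the column values, already sorted in window order
    preceding: rows to look back (None = UNBOUNDED, 0 = CURRENT ROW)
    following: rows to look ahead (None = UNBOUNDED, 0 = CURRENT ROW)
    """
    n = len(values)
    result = []
    for i in range(n):
        # Clamp the frame to the partition edges.
        start = 0 if preceding is None else max(0, i - preceding)
        end = n - 1 if following is None else min(n - 1, i + following)
        result.append(sum(values[start:end + 1]))
    return result

if __name__ == "__main__":
    traffic = [10, 20, 30, 40]  # hypothetical pass_num values in date order
    print(rows_frame_sum(traffic, None, 0))  # UNBOUNDED PRECEDING .. CURRENT ROW
    print(rows_frame_sum(traffic, 1, 0))     # 1 PRECEDING .. CURRENT ROW
    print(rows_frame_sum(traffic, 1, 1))     # 1 PRECEDING .. 1 FOLLOWING
    print(rows_frame_sum(traffic, 0, 1))     # CURRENT ROW .. 1 FOLLOWING
    print(rows_frame_sum(traffic, 0, None))  # CURRENT ROW .. UNBOUNDED FOLLOWING
```

Note that a ROWS frame counts physical rows, not calendar days: "1 PRECEDING" means the previous row in sort order, which equals "yesterday" only when there is exactly one row per day.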
1) Running total of traffic from the first day through the current day
SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM (
SELECT
*
FROM learn3.veh_pass ORDER BY pass_time
) T1;
2) Sum of traffic for the previous day and the current day
SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
FROM (
SELECT
*
FROM learn3.veh_pass ORDER BY pass_time
) T1;
3) Sum of traffic for the previous day, the current day, and the next day
SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM (
SELECT
*
FROM learn3.veh_pass ORDER BY pass_time
) T1;
4) Sum of traffic for the current day and the next day
SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)
FROM (
SELECT
*
FROM learn3.veh_pass ORDER BY pass_time
) T1;
5) Sum of traffic for the current day and all following days
SELECT
T1.*
,SUM(T1.pass_num) OVER(ORDER BY T1.pass_time ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM (
SELECT
*
FROM learn3.veh_pass ORDER BY pass_time
) T1;
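The five queries order the whole table by pass_time, so their frames run across all devices. If the sums are meant per device, which the requirement's mention of device ID suggests but the queries do not do, adding PARTITION BY id keeps each frame inside one device. The sketch below contrasts the two, assuming SQLite 3.25 or later as a stand-in for Hive; the sample rows and the name running_totals are invented, and the extra id tiebreaker in the global ORDER BY is an addition of this sketch to make the row order deterministic.

```python
import sqlite3

# Hypothetical rows (id, pass_time, pass_num), invented for this sketch.
SAMPLE = [
    ("A", "2022-04-01", 10),
    ("A", "2022-04-02", 20),
    ("B", "2022-04-01", 5),
    ("B", "2022-04-02", 15),
]

QUERY = """
SELECT id, pass_time, pass_num,
       sum(pass_num) OVER (ORDER BY pass_time, id
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS global_running,
       sum(pass_num) OVER (PARTITION BY id ORDER BY pass_time
                           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS device_running
FROM veh_pass
ORDER BY id, pass_time
"""

def running_totals():
    """Return rows of (id, pass_time, pass_num, global_running, device_running)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE veh_pass (id TEXT, pass_time TEXT, pass_num INTEGER)")
    conn.executemany("INSERT INTO veh_pass VALUES (?, ?, ?)", SAMPLE)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    for row in running_totals():
        print(row)
```

Here device A's second row has a global running total of 35 (it includes device B's first row) but a per-device running total of 30.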
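Requirement 4: For each device ID, find the previous and the next date on which it reported data
LAG(col,n,default_val): the value of col from the n-th row before the current row, or default_val if there is no such row
LEAD(col,n,default_val): the value of col from the n-th row after the current row, or default_val if there is no such row
NTILE(n): distributes the rows of an ordered window into n groups numbered from 1; for each row, NTILE returns the number of the group that the row belongs to.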
SELECT
T1.*
, LAG(T1.pass_time,1,"2022-01-01") OVER(PARTITION BY T1.id ORDER BY T1.pass_time) as before_time
, LEAD(T1.pass_time,1,"2022-12-31") OVER(PARTITION BY T1.id ORDER BY T1.pass_time) as after_time
FROM learn3.veh_pass T1;
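As a quick check of how LAG, LEAD, and NTILE behave at partition edges, the sketch below runs them on a handful of invented rows. It assumes SQLite 3.25 or later (via Python's sqlite3 module) as a stand-in for Hive; the sample data and the helper name query are hypothetical, and the NTILE bucket sizes shown are SQLite's documented behavior (larger groups first).

```python
import sqlite3

# Hypothetical rows (id, pass_time, pass_num), invented for this sketch.
SAMPLE = [
    ("A", "2022-04-01", 10),
    ("A", "2022-04-03", 20),
    ("A", "2022-04-07", 30),
    ("B", "2022-04-02", 5),
    ("B", "2022-04-09", 15),
]

# Previous and next reporting date per device, with defaults at the edges.
LAG_LEAD = """
SELECT id, pass_time,
       lag(pass_time, 1, '2022-01-01')  OVER (PARTITION BY id ORDER BY pass_time) AS before_time,
       lead(pass_time, 1, '2022-12-31') OVER (PARTITION BY id ORDER BY pass_time) AS after_time
FROM veh_pass
ORDER BY id, pass_time
"""

# Split all rows, in date order, into two numbered groups.
NTILE = """
SELECT pass_time, ntile(2) OVER (ORDER BY pass_time) AS bucket
FROM veh_pass
ORDER BY pass_time
"""

def query(sql):
    """Load SAMPLE into an in-memory table and return the rows produced by sql."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE veh_pass (id TEXT, pass_time TEXT, pass_num INTEGER)")
    conn.executemany("INSERT INTO veh_pass VALUES (?, ?, ?)", SAMPLE)
    rows = conn.execute(sql).fetchall()
    conn.close()
    return rows

if __name__ == "__main__":
    for row in query(LAG_LEAD):
        print(row)
    for row in query(NTILE):
        print(row)
```

The first row of each device gets the LAG default "2022-01-01" and the last row gets the LEAD default "2022-12-31", because no neighbor exists inside the partition. With five rows and ntile(2), the first three rows land in group 1 and the last two in group 2.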