0 Requirement
The sparse-field cumulative sum problem
1 Problem analysis
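From the data transformation shown in the figure, the task is to fill in the dates missing from the data according to the term field. term is the number of consecutive dates: a value of 12 means 12 consecutive dates starting from 2018-12-21 (with the monthly step used in the code below, that is 2018-12-21 through 2019-11-21). Once the dates are filled in, the cumulative sum of amount is computed in date order. Note that the filled-in dates carry no amount value (NULL).
This kind of problem comes up often in business work. When computing cumulative values in particular, non-consecutive dates make it easy to miss the cumulative value for some dates, leaving the data incomplete. The crux is that the dates are not consecutive and must be filled in. So how do we fill them in? Readers of my SQLBOY1000题 column will recognize similar problems; the related posts are listed here.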
Analyzing overlapping and intersecting intervals in SQL: HiveSQL interview question 30 (莫叫石榴姐's blog, CSDN)
HiveSql, one tip a day: how to construct consecutive dates (莫叫石榴姐's blog, CSDN)
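Step 1: Based on the dates in the data, fill in the required consecutive dates
For filling in consecutive dates, here is the template and its core statement: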
lateral view posexplode(split(space(term), '(?!$)')) temp as pos,val
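Here space(term) returns a string of term spaces; its only purpose is to expand one row into term rows, and the argument decides how many. The regex (?!$) in split() is a negative lookahead that matches at every position except the end of the string, so a string of n spaces is cut into exactly n elements. Splitting on a plain delimiter would produce one extra element (n+1 in total), which we need to avoid.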
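As a quick check of the split behaviour described above, the following sketch compares the element count from a plain space delimiter with the count from the lookahead pattern. It assumes a Hive version that allows select without a from clause and a Java 8 or later runtime (Hive's split follows Java's String.split); the column aliases are made up for illustration.

```sql
-- space(3) is a string of three spaces.
select size(split(space(3), ' '))     as plain_size      -- expected 4: one element too many
     , size(split(space(3), '(?!$)')) as lookahead_size  -- expected 3: one element per space
;
```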
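posexplode() emits an index pos, starting at 0, for each element. Adding pos as a step to each contract's start date, min(value_date), fills in all the dates. Note that the dates here advance by month, so we use the add_months function: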
add_months(value_date, pos)
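To see how pos drives the dates, here is a minimal sketch with a hard-coded start date and a term of 3. The term of 3 and the inline one-row subquery are illustrative assumptions, not part of the sample data.

```sql
-- One input row is expanded into `term` rows; pos runs from 0 to term - 1.
select pos
     , add_months(value_date, pos) as value_date
from (select '2018-12-21' as value_date, 3 as term) t
lateral view posexplode(split(space(term), '(?!$)')) temp as pos, val;
-- expected rows (pos, value_date):
--   0  2018-12-21
--   1  2019-01-21
--   2  2019-02-21
```

If the data advanced by day rather than by month, date_add(value_date, pos) would play the same role as add_months.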
The complete statement for generating the consecutive dates is:
with data as
(
select 'AAAA' as contract,'2018-12-21' as value_date,9439.30 as amount,12 as term
union all
select 'AAAA' as contract,'2019-03-21' as value_date,9439.30 as amount,12 as term
union all
select 'AAAA' as contract,'2019-06-21' as value_date,9439.30 as amount,12 as term
union all
select 'AAAA' as contract,'2019-09-21' as value_date,9439.30 as amount,12 as term
union all
select 'BBBB' as contract,'2018-12-02' as value_date,9439.30 as amount,10 as term
union all
select 'BBBB' as contract,'2019-02-02' as value_date,9439.30 as amount,10 as term
union all
select 'BBBB' as contract,'2019-06-02' as value_date,9439.30 as amount,10 as term
union all
select 'BBBB' as contract,'2019-09-02' as value_date,9439.30 as amount,10 as term
)
select contract
     , add_months(value_date, pos) as value_date  -- start date + pos months
     , term
from (
    -- one row per contract: start date and number of dates to generate
    select contract
         , min(value_date) as value_date
         , max(amount) as amount
         , max(term) as term
    from data
    group by contract
) t1
lateral view posexplode(split(space(term), '(?!$)')) temp as pos, val
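One way to sanity-check the generated dates is to count them per contract and look at the first and last date. The sketch below uses a cut-down input with one row per contract (start date and term copied from the sample data); the name start_rows and the output aliases are hypothetical.

```sql
with start_rows as
(
    select 'AAAA' as contract, '2018-12-21' as value_date, 12 as term
    union all
    select 'BBBB' as contract, '2018-12-02' as value_date, 10 as term
)
select contract
     , count(*)        as n_dates     -- should equal term
     , min(value_date) as first_date
     , max(value_date) as last_date
from (
    select contract
         , add_months(value_date, pos) as value_date
    from start_rows
    lateral view posexplode(split(space(term), '(?!$)')) temp as pos, val
) s
group by contract;
-- expected rows (contract, n_dates, first_date, last_date):
--   AAAA  12  2018-12-21  2019-11-21
--   BBBB  10  2018-12-02  2019-09-02
```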
Step 2: Use the filled-in consecutive dates as the main table, join the data table to it, and compute the cumulative sum.
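Note: the generated consecutive dates must be the main (left) table, and the data table must be the joined (right) table. Only then does the cumulative sum neither duplicate nor miss any date. In
sum() over(partition by order by )
the value being summed must come from the data table (the right table), while the partition by and order by columns must come from the main table.
This is the key to computing an accurate cumulative sum over non-consecutive dates: the generated date dimension table must be the main table, and the data table is joined to it.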
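The rule above can be seen on a toy case. In this sketch the three-date dim table and the two-row d table are made-up illustration data, not the article's sample. The summed column comes from d, and the partition and order keys come from dim.

```sql
with dim as
(
    select 'AAAA' as contract, '2019-01-01' as value_date
    union all
    select 'AAAA' as contract, '2019-02-01' as value_date
    union all
    select 'AAAA' as contract, '2019-03-01' as value_date
),
d as
(
    select 'AAAA' as contract, '2019-01-01' as value_date, 100 as amount
    union all
    select 'AAAA' as contract, '2019-03-01' as value_date, 50 as amount
)
select dim.contract
     , dim.value_date
     , d.amount
     , sum(d.amount) over(partition by dim.contract order by dim.value_date) as running_amount
from dim
left join d
on dim.contract = d.contract and dim.value_date = d.value_date;
-- expected rows, in any order (value_date, amount, running_amount):
--   2019-01-01  100   100
--   2019-02-01  NULL  100
--   2019-03-01  50    150
```

Reversing the join (d left join dim) would drop 2019-02-01 entirely, because the data table has no row for that date.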
The final SQL is as follows:
with data as
(
select 'AAAA' as contract,'2018-12-21' as value_date,9439.30 as amount,12 as term
union all
select 'AAAA' as contract,'2019-03-21' as value_date,9439.30 as amount,12 as term
union all
select 'AAAA' as contract,'2019-06-21' as value_date,9439.30 as amount,12 as term
union all
select 'AAAA' as contract,'2019-09-21' as value_date,9439.30 as amount,12 as term
union all
select 'BBBB' as contract,'2018-12-02' as value_date,9439.30 as amount,10 as term
union all
select 'BBBB' as contract,'2019-02-02' as value_date,9439.30 as amount,10 as term
union all
select 'BBBB' as contract,'2019-06-02' as value_date,9439.30 as amount,10 as term
union all
select 'BBBB' as contract,'2019-09-02' as value_date,9439.30 as amount,10 as term
)
select dim.contract
     , dim.value_date
     -- amount comes from the data table d; partition and order keys come from dim
     , cast(sum(d.amount) over(partition by dim.contract order by dim.value_date) as decimal(18,2)) as amount
     , dim.term
from (
    -- filled-in consecutive dates: term monthly dates per contract
    select contract
         , add_months(value_date, pos) as value_date
         , term
    from (
        select contract
             , min(value_date) as value_date
             , max(amount) as amount
             , max(term) as term
        from data
        group by contract
    ) t1
    lateral view posexplode(split(space(term), '(?!$)')) temp as pos, val
) dim
left join (
    select contract
         , value_date
         , amount
    from data
) d
on dim.contract = d.contract and dim.value_date = d.value_date
The result is as follows (shown as a figure in the original post):
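2 Summary
This article presents a general method for accurately computing cumulative values over non-consecutive dates. From it you can learn:
(1) How to construct consecutive dates
(2) How to accurately compute cumulative values over non-consecutive dates
Note that this kind of problem is also called the sparse-field cumulative sum problem.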