【离线数仓-7-数据仓库开发DIM层设计要点-拉链表同步装载脚本】

news2024/10/2 8:42:12

离线数仓-7-数据仓库开发DIM层设计要点-拉链表同步&装载脚本

  • 离线数仓-7-数据仓库开发DIM层设计要点-拉链表同步&装载脚本
    • 一、DIM层 维度模型 设计要点
      • 6.用户维度表 -拉链表
        • 1.用户维度表 前期梳理
        • 2.用户维度表 DDL表设计分析
        • 3.用户维度表 加载数据分析
          • 1.拉链表首日装载数据SQL
          • 2.拉链表每日装载数据SQL
            • 1.拉链表每日装载数据 -第一种思路
            • 2.拉链表每日装载数据 -第二种思路
      • 7.DIM层 维度模型-数据装载脚本
        • 1.首日装载脚本
        • 2.每日装载脚本

离线数仓-7-数据仓库开发DIM层设计要点-拉链表同步&装载脚本

一、DIM层 维度模型 设计要点

6.用户维度表 -拉链表

  • 拉链表回顾:
    • 拉链表是什么:记录每条信息的生命周期,一旦一条记录的生命周期结束,就重新开始一条新的记录,并把当前日期放入生效开始日期。
    • 拉链表意义:在于能够更加高效的保存维度信息的历史状态。
    • 拉链表适合的场景:缓慢变化维,业务表格数据变化不频繁。
    • 拉链表如何使用:分析的日期 大于等于 开始日期 并且 分析的日期 小于等于 结束日期。
    • 之前文档汇总: https://blog.csdn.net/weixin_38136584/article/details/129137583?spm=1001.2014.3001.5501

1.用户维度表 前期梳理

  • 从业务系统寻找用户维度表的主维表和相关维表
    • 1.获取用户相关表格如下:
      ods_user_info_inc
    • 2.分析与之关联的每个表格中的具体字段,抽离出来“用户维度表”所需字段
    • 3.确定这些表格中,谁是主维表。粒度小的是主维表,用户表user,所以此表作为主维表。
    • 4.主维表确定以后,
      • 之前规则:每日全量表 主维表对应的一行代表什么信息,维度表一行就代表对应信息;
      • 但是此处, 用户表做为主维表,同时用户表也是一张拉链表,拉链表规定,每行是一个状态记录,所以这里每行代表一个用户状态。

2.用户维度表 DDL表设计分析

  • 拉链表的开始时间和结束时间 需要注意
  • 拉链表按照什么做分区
    • hive中对表进行分区的意义:分区后,一个表中,不同数据存放在不同路径,后面查询时,使用分区作为过滤条件,极大减少了查询的数据量级,查询效率会更高,便于将来查询。
    • 对于拉链表来说,将来查询场景需要分析,拉链表也是一张维度表,
      • 维度表查询,一方面是获取全量最新,另一方面是获取历史上一份数据
      • 对于查询频繁方向来看,拉链表对于查询最新数据是比较频繁的。
      • 所以,就上面思考,设计拉链表的时候,需要思考:怎样加快查询最新一天拉链表的查询速度。
  • 分区规划的设计如下图:
    • dt最大的时间分区里面,存放 全量最新的用户数据。
    • dt之前的分区,存放当日过期的用户数据。
      在这里插入图片描述
  • 具体sql如下:
DROP TABLE IF EXISTS dim_user_zip;
CREATE EXTERNAL TABLE dim_user_zip
(
    `id`           STRING COMMENT '用户id',
    `login_name`   STRING COMMENT '用户名称',
    `nick_name`    STRING COMMENT '用户昵称',
    `name`         STRING COMMENT '用户姓名',
    `phone_num`    STRING COMMENT '手机号码',
    `email`        STRING COMMENT '邮箱',
    `user_level`   STRING COMMENT '用户等级',
    `birthday`     STRING COMMENT '生日',
    `gender`       STRING COMMENT '性别',
    `create_time`  STRING COMMENT '创建时间',
    `operate_time` STRING COMMENT '操作时间',
    `start_date`   STRING COMMENT '开始日期',
    `end_date`     STRING COMMENT '结束日期'
) COMMENT '用户表'
    PARTITIONED BY (`dt` STRING)
    STORED AS ORC
    LOCATION '/warehouse/gmall/dim/dim_user_zip/'
    TBLPROPERTIES ('orc.compress' = 'snappy');

3.用户维度表 加载数据分析

  • 对于拉链表来说,只做增量同步即可。

  • 拉链表具体装载过程如下:

    • 1.拉链表最开始加载的处理:全量同步数据到拉链表,并且添加开始时间和结束时间
      • 开始日期:以什么日期为准? 获取不到开始时间就以拉链表最初同步数据的时间为准。
      • 结束时间:添加最大时间即可
    • 2.拉链表第二天加载数据的处理:
      • 1.基于binlog来获取当天用户全量表相关更新操作日志,进行分析
      • 2.获取 当天的用户变化表,并添加上开始时间和结束时间
      • 3.当天用户变化表和前一天全量的用户拉链表进行合并,得到最新的全量用户拉链表。
    • 3.拉链表后面的加载数据跟第二天一致。
      在这里插入图片描述
  • 数据流向:基于拉链表具体装载过程的分析,需要书写两个加载过程,分别是第一天和后面的。
    在这里插入图片描述

1.拉链表首日装载数据SQL
  • 隐私信息的脱敏,手机号,用户名等等
insert overwrite table dim_user_zip partition (dt='9999-12-31')
select
    data.id,
    data.login_name,
    data.nick_name,
    md5(data.name),
    md5(data.phone_num),
    md5(data.email),
    data.user_level,
    data.birthday,
    data.gender,
    data.create_time,
    data.operate_time,
    '2020-06-14' start_date,
    '9999-12-31' end_date
from ods_user_info_inc
where dt='2020-06-14'
and type='bootstrap-insert';
2.拉链表每日装载数据SQL
1.拉链表每日装载数据 -第一种思路
  • 第一种SQL思路具体过程

    • 1.获取子查询 前一日全量数据
    • 2.获取子查询 当日新增及变化数据
    • 3.使用full join 将新表和旧表两部分数据关联
    • 4.使用CET将两个子查询连接起来,为后面共享使用提供便利
    • 5.然后使用if判断,数据更新到全量最新分区还是前一天的分区
      • 1.判断是否为当日全量最新数据,fulljoin之后,如果新表id不为空,取新表所有字段,否则取老表所有字段,并更新为最新的分区,整合完毕数据,即为需要当日全量最新数据,更新到时间最大分区
      • 2.判断是否为过期数据,fulljoin之后,新旧表格两边id都不为空,那么对应数据就是过期数据,需要更新到前一天分区里面,同时 结束时间也需要更新为前一天的时间。
    • 5.将两部分查询到的数据进行uninon all操作;
    • 6.使用hive动态分区的概念,使用一个insert命令,将不同数据发往不同分区。
      在这里插入图片描述
  • 前一日全量数据SQL

 select
            id,
            login_name,
            nick_name,
            name,
            phone_num,
            email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            start_date,
            end_date
        from dim_user_zip
        where dt='9999-12-31'
  • 当日新增及变化数据的SQL,需要将每天变化的最终数据获取到,因为是binlog数据,会记录一天中某条记录变化的各种状态,可能一天内会出现多条数据,只获取最后的状态数据即可,所以这里需要进行判断操作。分组TopN操作:开窗函数+过滤。
select
            id,
            login_name,
            nick_name,
            md5(name) name,
            md5(phone_num) phone_num,
            md5(email) email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            '2020-06-15' start_date,
            '9999-12-31' end_date
        from
        (
            select
                data.id,
                data.login_name,
                data.nick_name,
                data.name,
                data.phone_num,
                data.email,
                data.user_level,
                data.birthday,
                data.gender,
                data.create_time,
                data.operate_time,
                row_number() over (partition by data.id order by ts desc) rn
            from ods_user_info_inc
            where dt='2020-06-15'
        )t1
        where rn=1
  • 判断是否为当日全量最新数据
with
tmp as
(
    select
        old.id old_id,
        old.login_name old_login_name,
        old.nick_name old_nick_name,
        old.name old_name,
        old.phone_num old_phone_num,
        old.email old_email,
        old.user_level old_user_level,
        old.birthday old_birthday,
        old.gender old_gender,
        old.create_time old_create_time,
        old.operate_time old_operate_time,
        old.start_date old_start_date,
        old.end_date old_end_date,
        new.id new_id,
        new.login_name new_login_name,
        new.nick_name new_nick_name,
        new.name new_name,
        new.phone_num new_phone_num,
        new.email new_email,
        new.user_level new_user_level,
        new.birthday new_birthday,
        new.gender new_gender,
        new.create_time new_create_time,
        new.operate_time new_operate_time,
        new.start_date new_start_date,
        new.end_date new_end_date
    from
    (
        select
            id,
            login_name,
            nick_name,
            name,
            phone_num,
            email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            start_date,
            end_date
        from dim_user_zip
        where dt='9999-12-31'
    )old
    full outer join
    (
        select
            id,
            login_name,
            nick_name,
            md5(name) name,
            md5(phone_num) phone_num,
            md5(email) email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            '2020-06-15' start_date,
            '9999-12-31' end_date
        from
        (
            select
                data.id,
                data.login_name,
                data.nick_name,
                data.name,
                data.phone_num,
                data.email,
                data.user_level,
                data.birthday,
                data.gender,
                data.create_time,
                data.operate_time,
                row_number() over (partition by data.id order by ts desc) rn
            from ods_user_info_inc
            where dt='2020-06-15'
        )t1
        where rn=1
    )new
    on old.id=new.id
)
select
    if(new_id is not null,new_id,old_id),
    if(new_id is not null,new_login_name,old_login_name),
    if(new_id is not null,new_nick_name,old_nick_name),
    if(new_id is not null,new_name,old_name),
    if(new_id is not null,new_phone_num,old_phone_num),
    if(new_id is not null,new_email,old_email),
    if(new_id is not null,new_user_level,old_user_level),
    if(new_id is not null,new_birthday,old_birthday),
    if(new_id is not null,new_gender,old_gender),
    if(new_id is not null,new_create_time,old_create_time),
    if(new_id is not null,new_operate_time,old_operate_time),
    if(new_id is not null,new_start_date,old_start_date),
    if(new_id is not null,new_end_date,old_end_date),
    if(new_id is not null,new_end_date,old_end_date) dt
from tmp

  • 判断是否为过期数据
with
tmp as
(
    select
        old.id old_id,
        old.login_name old_login_name,
        old.nick_name old_nick_name,
        old.name old_name,
        old.phone_num old_phone_num,
        old.email old_email,
        old.user_level old_user_level,
        old.birthday old_birthday,
        old.gender old_gender,
        old.create_time old_create_time,
        old.operate_time old_operate_time,
        old.start_date old_start_date,
        old.end_date old_end_date,
        new.id new_id,
        new.login_name new_login_name,
        new.nick_name new_nick_name,
        new.name new_name,
        new.phone_num new_phone_num,
        new.email new_email,
        new.user_level new_user_level,
        new.birthday new_birthday,
        new.gender new_gender,
        new.create_time new_create_time,
        new.operate_time new_operate_time,
        new.start_date new_start_date,
        new.end_date new_end_date
    from
    (
        select
            id,
            login_name,
            nick_name,
            name,
            phone_num,
            email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            start_date,
            end_date
        from dim_user_zip
        where dt='9999-12-31'
    )old
    full outer join
    (
        select
            id,
            login_name,
            nick_name,
            md5(name) name,
            md5(phone_num) phone_num,
            md5(email) email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            '2020-06-15' start_date,
            '9999-12-31' end_date
        from
        (
            select
                data.id,
                data.login_name,
                data.nick_name,
                data.name,
                data.phone_num,
                data.email,
                data.user_level,
                data.birthday,
                data.gender,
                data.create_time,
                data.operate_time,
                row_number() over (partition by data.id order by ts desc) rn
            from ods_user_info_inc
            where dt='2020-06-15'
        )t1
        where rn=1
    )new
    on old.id=new.id
)
select
    old_id,
    old_login_name,
    old_nick_name,
    old_name,
    old_phone_num,
    old_email,
    old_user_level,
    old_birthday,
    old_gender,
    old_create_time,
    old_operate_time,
    old_start_date,
    cast(date_add('2020-06-15',-1) as string) old_end_date,
    cast(date_add('2020-06-15',-1) as string) dt
from tmp

  • 总结,第一种方式的最终SQL
with
tmp as
(
    select
        old.id old_id,
        old.login_name old_login_name,
        old.nick_name old_nick_name,
        old.name old_name,
        old.phone_num old_phone_num,
        old.email old_email,
        old.user_level old_user_level,
        old.birthday old_birthday,
        old.gender old_gender,
        old.create_time old_create_time,
        old.operate_time old_operate_time,
        old.start_date old_start_date,
        old.end_date old_end_date,
        new.id new_id,
        new.login_name new_login_name,
        new.nick_name new_nick_name,
        new.name new_name,
        new.phone_num new_phone_num,
        new.email new_email,
        new.user_level new_user_level,
        new.birthday new_birthday,
        new.gender new_gender,
        new.create_time new_create_time,
        new.operate_time new_operate_time,
        new.start_date new_start_date,
        new.end_date new_end_date
    from
    (
        select
            id,
            login_name,
            nick_name,
            name,
            phone_num,
            email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            start_date,
            end_date
        from dim_user_zip
        where dt='9999-12-31'
    )old
    full outer join
    (
        select
            id,
            login_name,
            nick_name,
            md5(name) name,
            md5(phone_num) phone_num,
            md5(email) email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            '2020-06-15' start_date,
            '9999-12-31' end_date
        from
        (
            select
                data.id,
                data.login_name,
                data.nick_name,
                data.name,
                data.phone_num,
                data.email,
                data.user_level,
                data.birthday,
                data.gender,
                data.create_time,
                data.operate_time,
                row_number() over (partition by data.id order by ts desc) rn
            from ods_user_info_inc
            where dt='2020-06-15'
        )t1
        where rn=1
    )new
    on old.id=new.id
)
insert overwrite table dim_user_zip partition(dt)
select
    if(new_id is not null,new_id,old_id),
    if(new_id is not null,new_login_name,old_login_name),
    if(new_id is not null,new_nick_name,old_nick_name),
    if(new_id is not null,new_name,old_name),
    if(new_id is not null,new_phone_num,old_phone_num),
    if(new_id is not null,new_email,old_email),
    if(new_id is not null,new_user_level,old_user_level),
    if(new_id is not null,new_birthday,old_birthday),
    if(new_id is not null,new_gender,old_gender),
    if(new_id is not null,new_create_time,old_create_time),
    if(new_id is not null,new_operate_time,old_operate_time),
    if(new_id is not null,new_start_date,old_start_date),
    if(new_id is not null,new_end_date,old_end_date),
    if(new_id is not null,new_end_date,old_end_date) dt
from tmp
union all
select
    old_id,
    old_login_name,
    old_nick_name,
    old_name,
    old_phone_num,
    old_email,
    old_user_level,
    old_birthday,
    old_gender,
    old_create_time,
    old_operate_time,
    old_start_date,
    cast(date_add('2020-06-15',-1) as string) old_end_date,
    cast(date_add('2020-06-15',-1) as string) dt
from tmp
where old_id is not null
and new_id is not null;

2.拉链表每日装载数据 -第二种思路
  • 1.将两个子查询(前一日全量数据、当日新增及变化数据)使用union all连接起来

  • 2.使用开窗函数,将整个数据进行开窗,

    • rk结果=1的就写到全量表中,结束日期使用最大日期;
    • rk结果=2的就写到前一天分区里面,结束日期处理为前一天时间
    • hive中使用动态分区来实现数据写入到不同分区。
  • 多个select 查询uinon all 后,再进行select的话,默认获取的是第一个select的所有字段。

  • 第二种方式的SQL

insert overwrite table dim_user_zip partition(dt)
select 
			id,
            login_name,
            nick_name,
            name,
            phone_num,
            email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            start_date,
            if(rk =2 ,data_sub('2020-06-15',1),end_date) end_date,
            if(rk =1 ,'9999-12-31',data_sub('2020-06-15',1)) dt
from (
select 
			id,
            login_name,
            nick_name,
            name,
            phone_num,
            email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            start_date,
            end_date,
            rank() over (partition by id order by start_date desc) rk
from 
( select
            id,
            login_name,
            nick_name,
            name,
            phone_num,
            email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            start_date,
            end_date
        from dim_user_zip
        where dt='9999-12-31'
 union all 
 select
            id,
            login_name,
            nick_name,
            md5(name) name,
            md5(phone_num) phone_num,
            md5(email) email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            '2020-06-15' start_date,
            '9999-12-31' end_date
        from
        (
            select
                data.id,
                data.login_name,
                data.nick_name,
                data.name,
                data.phone_num,
                data.email,
                data.user_level,
                data.birthday,
                data.gender,
                data.create_time,
                data.operate_time,
                row_number() over (partition by data.id order by ts desc) rn
            from ods_user_info_inc
            where dt='2020-06-15'
        )t1
        where rn=1
	) t2
) t3

7.DIM层 维度模型-数据装载脚本

1.首日装载脚本

  • 数仓上线第一天,执行一次即可
#!/bin/bash

APP=gmall

if [ -n "$2" ] ;then
   do_date=$2
else 
   echo "请传入日期参数"
   exit
fi 

dim_user_zip="
insert overwrite table ${APP}.dim_user_zip partition (dt='9999-12-31')
select
    data.id,
    data.login_name,
    data.nick_name,
    md5(data.name),
    md5(data.phone_num),
    md5(data.email),
    data.user_level,
    data.birthday,
    data.gender,
    data.create_time,
    data.operate_time,
    '$do_date' start_date,
    '9999-12-31' end_date
from ${APP}.ods_user_info_inc
where dt='$do_date'
and type='bootstrap-insert';
"

dim_sku_full="
with
sku as
(
    select
        id,
        price,
        sku_name,
        sku_desc,
        weight,
        is_sale,
        spu_id,
        category3_id,
        tm_id,
        create_time
    from ${APP}.ods_sku_info_full
    where dt='$do_date'
),
spu as
(
    select
        id,
        spu_name
    from ${APP}.ods_spu_info_full
    where dt='$do_date'
),
c3 as
(
    select
        id,
        name,
        category2_id
    from ${APP}.ods_base_category3_full
    where dt='$do_date'
),
c2 as
(
    select
        id,
        name,
        category1_id
    from ${APP}.ods_base_category2_full
    where dt='$do_date'
),
c1 as
(
    select
        id,
        name
    from ${APP}.ods_base_category1_full
    where dt='$do_date'
),
tm as
(
    select
        id,
        tm_name
    from ${APP}.ods_base_trademark_full
    where dt='$do_date'
),
attr as
(
    select
        sku_id,
        collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs
    from ${APP}.ods_sku_attr_value_full
    where dt='$do_date'
    group by sku_id
),
sale_attr as
(
    select
        sku_id,
        collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs
    from ${APP}.ods_sku_sale_attr_value_full
    where dt='$do_date'
    group by sku_id
)
insert overwrite table ${APP}.dim_sku_full partition(dt='$do_date')
select
    sku.id,
    sku.price,
    sku.sku_name,
    sku.sku_desc,
    sku.weight,
    sku.is_sale,
    sku.spu_id,
    spu.spu_name,
    sku.category3_id,
    c3.name,
    c3.category2_id,
    c2.name,
    c2.category1_id,
    c1.name,
    sku.tm_id,
    tm.tm_name,
    attr.attrs,
    sale_attr.sale_attrs,
    sku.create_time
from sku
left join spu on sku.spu_id=spu.id
left join c3 on sku.category3_id=c3.id
left join c2 on c3.category2_id=c2.id
left join c1 on c2.category1_id=c1.id
left join tm on sku.tm_id=tm.id
left join attr on sku.id=attr.sku_id
left join sale_attr on sku.id=sale_attr.sku_id;
"

dim_province_full="
insert overwrite table ${APP}.dim_province_full partition(dt='$do_date')
select
    province.id,
    province.name,
    province.area_code,
    province.iso_code,
    province.iso_3166_2,
    region_id,
    region_name
from
(
    select
        id,
        name,
        region_id,
        area_code,
        iso_code,
        iso_3166_2
    from ${APP}.ods_base_province_full
    where dt='$do_date'
)province
left join
(
    select
        id,
        region_name
    from ${APP}.ods_base_region_full
    where dt='$do_date'
)region
on province.region_id=region.id;
"

dim_coupon_full="
insert overwrite table ${APP}.dim_coupon_full partition(dt='$do_date')
select
    id,
    coupon_name,
    coupon_type,
    coupon_dic.dic_name,
    condition_amount,
    condition_num,
    activity_id,
    benefit_amount,
    benefit_discount,
    case coupon_type
        when '3201' then concat('满',condition_amount,'元减',benefit_amount,'元')
        when '3202' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
        when '3203' then concat('减',benefit_amount,'元')
    end benefit_rule,
    create_time,
    range_type,
    range_dic.dic_name,
    limit_num,
    taken_count,
    start_time,
    end_time,
    operate_time,
    expire_time
from
(
    select
        id,
        coupon_name,
        coupon_type,
        condition_amount,
        condition_num,
        activity_id,
        benefit_amount,
        benefit_discount,
        create_time,
        range_type,
        limit_num,
        taken_count,
        start_time,
        end_time,
        operate_time,
        expire_time
    from ${APP}.ods_coupon_info_full
    where dt='$do_date'
)ci
left join
(
    select
        dic_code,
        dic_name
    from ${APP}.ods_base_dic_full
    where dt='$do_date'
    and parent_code='32'
)coupon_dic
on ci.coupon_type=coupon_dic.dic_code
left join
(
    select
        dic_code,
        dic_name
    from ${APP}.ods_base_dic_full
    where dt='$do_date'
    and parent_code='33'
)range_dic
on ci.range_type=range_dic.dic_code;
"

dim_activity_full="
insert overwrite table ${APP}.dim_activity_full partition(dt='$do_date')
select
    rule.id,
    info.id,
    activity_name,
    rule.activity_type,
    dic.dic_name,
    activity_desc,
    start_time,
    end_time,
    create_time,
    condition_amount,
    condition_num,
    benefit_amount,
    benefit_discount,
    case rule.activity_type
        when '3101' then concat('满',condition_amount,'元减',benefit_amount,'元')
        when '3102' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
        when '3103' then concat('打',10*(1-benefit_discount),'折')
    end benefit_rule,
    benefit_level
from
(
    select
        id,
        activity_id,
        activity_type,
        condition_amount,
        condition_num,
        benefit_amount,
        benefit_discount,
        benefit_level
    from ${APP}.ods_activity_rule_full
    where dt='$do_date'
)rule
left join
(
    select
        id,
        activity_name,
        activity_type,
        activity_desc,
        start_time,
        end_time,
        create_time
    from ${APP}.ods_activity_info_full
    where dt='$do_date'
)info
on rule.activity_id=info.id
left join
(
    select
        dic_code,
        dic_name
    from ${APP}.ods_base_dic_full
    where dt='$do_date'
    and parent_code='31'
)dic
on rule.activity_type=dic.dic_code;
"

case $1 in
"dim_user_zip")
    hive -e "$dim_user_zip"
;;
"dim_sku_full")
    hive -e "$dim_sku_full"
;;
"dim_province_full")
    hive -e "$dim_province_full"
;;
"dim_coupon_full")
    hive -e "$dim_coupon_full"
;;
"dim_activity_full")
    hive -e "$dim_activity_full"
;;
"all")
    hive -e "$dim_user_zip$dim_sku_full$dim_province_full$dim_coupon_full$dim_activity_full"
;;
esac

2.每日装载脚本

  • 数仓上线之后,以后每天都需要执行一遍
#!/bin/bash

APP=gmall

# 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
if [ -n "$2" ] ;then
    do_date=$2
else 
    do_date=`date -d "-1 day" +%F`
fi

dim_user_zip="
set hive.exec.dynamic.partition.mode=nonstrict;
with
tmp as
(
    select
        old.id old_id,
        old.login_name old_login_name,
        old.nick_name old_nick_name,
        old.name old_name,
        old.phone_num old_phone_num,
        old.email old_email,
        old.user_level old_user_level,
        old.birthday old_birthday,
        old.gender old_gender,
        old.create_time old_create_time,
        old.operate_time old_operate_time,
        old.start_date old_start_date,
        old.end_date old_end_date,
        new.id new_id,
        new.login_name new_login_name,
        new.nick_name new_nick_name,
        new.name new_name,
        new.phone_num new_phone_num,
        new.email new_email,
        new.user_level new_user_level,
        new.birthday new_birthday,
        new.gender new_gender,
        new.create_time new_create_time,
        new.operate_time new_operate_time,
        new.start_date new_start_date,
        new.end_date new_end_date
    from
    (
        select
            id,
            login_name,
            nick_name,
            name,
            phone_num,
            email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            start_date,
            end_date
        from ${APP}.dim_user_zip
        where dt='9999-12-31'
    )old
    full outer join
    (
        select
            id,
            login_name,
            nick_name,
            md5(name) name,
            md5(phone_num) phone_num,
            md5(email) email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            '$do_date' start_date,
            '9999-12-31' end_date
        from
        (
            select
                data.id,
                data.login_name,
                data.nick_name,
                data.name,
                data.phone_num,
                data.email,
                data.user_level,
                data.birthday,
                data.gender,
                data.create_time,
                data.operate_time,
                row_number() over (partition by data.id order by ts desc) rn
            from ${APP}.ods_user_info_inc
            where dt='$do_date'
        )t1
        where rn=1
    )new
    on old.id=new.id
)
insert overwrite table ${APP}.dim_user_zip partition(dt)
select
    if(new_id is not null,new_id,old_id),
    if(new_id is not null,new_login_name,old_login_name),
    if(new_id is not null,new_nick_name,old_nick_name),
    if(new_id is not null,new_name,old_name),
    if(new_id is not null,new_phone_num,old_phone_num),
    if(new_id is not null,new_email,old_email),
    if(new_id is not null,new_user_level,old_user_level),
    if(new_id is not null,new_birthday,old_birthday),
    if(new_id is not null,new_gender,old_gender),
    if(new_id is not null,new_create_time,old_create_time),
    if(new_id is not null,new_operate_time,old_operate_time),
    if(new_id is not null,new_start_date,old_start_date),
    if(new_id is not null,new_end_date,old_end_date),
    if(new_id is not null,new_end_date,old_end_date) dt
from tmp
union all
select
    old_id,
    old_login_name,
    old_nick_name,
    old_name,
    old_phone_num,
    old_email,
    old_user_level,
    old_birthday,
    old_gender,
    old_create_time,
    old_operate_time,
    old_start_date,
    cast(date_add('$do_date',-1) as string) old_end_date,
    cast(date_add('$do_date',-1) as string) dt
from tmp
where old_id is not null
and new_id is not null;
"

dim_sku_full="
with
sku as
(
    select
        id,
        price,
        sku_name,
        sku_desc,
        weight,
        is_sale,
        spu_id,
        category3_id,
        tm_id,
        create_time
    from ${APP}.ods_sku_info_full
    where dt='$do_date'
),
spu as
(
    select
        id,
        spu_name
    from ${APP}.ods_spu_info_full
    where dt='$do_date'
),
c3 as
(
    select
        id,
        name,
        category2_id
    from ${APP}.ods_base_category3_full
    where dt='$do_date'
),
c2 as
(
    select
        id,
        name,
        category1_id
    from ${APP}.ods_base_category2_full
    where dt='$do_date'
),
c1 as
(
    select
        id,
        name
    from ${APP}.ods_base_category1_full
    where dt='$do_date'
),
tm as
(
    select
        id,
        tm_name
    from ${APP}.ods_base_trademark_full
    where dt='$do_date'
),
attr as
(
    select
        sku_id,
        collect_set(named_struct('attr_id',attr_id,'value_id',value_id,'attr_name',attr_name,'value_name',value_name)) attrs
    from ${APP}.ods_sku_attr_value_full
    where dt='$do_date'
    group by sku_id
),
sale_attr as
(
    select
        sku_id,
        collect_set(named_struct('sale_attr_id',sale_attr_id,'sale_attr_value_id',sale_attr_value_id,'sale_attr_name',sale_attr_name,'sale_attr_value_name',sale_attr_value_name)) sale_attrs
    from ${APP}.ods_sku_sale_attr_value_full
    where dt='$do_date'
    group by sku_id
)
insert overwrite table ${APP}.dim_sku_full partition(dt='$do_date')
select
    sku.id,
    sku.price,
    sku.sku_name,
    sku.sku_desc,
    sku.weight,
    sku.is_sale,
    sku.spu_id,
    spu.spu_name,
    sku.category3_id,
    c3.name,
    c3.category2_id,
    c2.name,
    c2.category1_id,
    c1.name,
    sku.tm_id,
    tm.tm_name,
    attr.attrs,
    sale_attr.sale_attrs,
    sku.create_time
from sku
left join spu on sku.spu_id=spu.id
left join c3 on sku.category3_id=c3.id
left join c2 on c3.category2_id=c2.id
left join c1 on c2.category1_id=c1.id
left join tm on sku.tm_id=tm.id
left join attr on sku.id=attr.sku_id
left join sale_attr on sku.id=sale_attr.sku_id;
"

dim_province_full="
insert overwrite table ${APP}.dim_province_full partition(dt='$do_date')
select
    province.id,
    province.name,
    province.area_code,
    province.iso_code,
    province.iso_3166_2,
    region_id,
    region_name
from
(
    select
        id,
        name,
        region_id,
        area_code,
        iso_code,
        iso_3166_2
    from ${APP}.ods_base_province_full
    where dt='$do_date'
)province
left join
(
    select
        id,
        region_name
    from ${APP}.ods_base_region_full
    where dt='$do_date'
)region
on province.region_id=region.id;
"

dim_coupon_full="
insert overwrite table ${APP}.dim_coupon_full partition(dt='$do_date')
select
    id,
    coupon_name,
    coupon_type,
    coupon_dic.dic_name,
    condition_amount,
    condition_num,
    activity_id,
    benefit_amount,
    benefit_discount,
    case coupon_type
        when '3201' then concat('满',condition_amount,'元减',benefit_amount,'元')
        when '3202' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
        when '3203' then concat('减',benefit_amount,'元')
    end benefit_rule,
    create_time,
    range_type,
    range_dic.dic_name,
    limit_num,
    taken_count,
    start_time,
    end_time,
    operate_time,
    expire_time
from
(
    select
        id,
        coupon_name,
        coupon_type,
        condition_amount,
        condition_num,
        activity_id,
        benefit_amount,
        benefit_discount,
        create_time,
        range_type,
        limit_num,
        taken_count,
        start_time,
        end_time,
        operate_time,
        expire_time
    from ${APP}.ods_coupon_info_full
    where dt='$do_date'
)ci
left join
(
    select
        dic_code,
        dic_name
    from ${APP}.ods_base_dic_full
    where dt='$do_date'
    and parent_code='32'
)coupon_dic
on ci.coupon_type=coupon_dic.dic_code
left join
(
    select
        dic_code,
        dic_name
    from ${APP}.ods_base_dic_full
    where dt='$do_date'
    and parent_code='33'
)range_dic
on ci.range_type=range_dic.dic_code;
"

dim_activity_full="
insert overwrite table ${APP}.dim_activity_full partition(dt='$do_date')
select
    rule.id,
    info.id,
    activity_name,
    rule.activity_type,
    dic.dic_name,
    activity_desc,
    start_time,
    end_time,
    create_time,
    condition_amount,
    condition_num,
    benefit_amount,
    benefit_discount,
    case rule.activity_type
        when '3101' then concat('满',condition_amount,'元减',benefit_amount,'元')
        when '3102' then concat('满',condition_num,'件打',10*(1-benefit_discount),'折')
        when '3103' then concat('打',10*(1-benefit_discount),'折')
    end benefit_rule,
    benefit_level
from
(
    select
        id,
        activity_id,
        activity_type,
        condition_amount,
        condition_num,
        benefit_amount,
        benefit_discount,
        benefit_level
    from ${APP}.ods_activity_rule_full
    where dt='$do_date'
)rule
left join
(
    select
        id,
        activity_name,
        activity_type,
        activity_desc,
        start_time,
        end_time,
        create_time
    from ${APP}.ods_activity_info_full
    where dt='$do_date'
)info
on rule.activity_id=info.id
left join
(
    select
        dic_code,
        dic_name
    from ${APP}.ods_base_dic_full
    where dt='$do_date'
    and parent_code='31'
)dic
on rule.activity_type=dic.dic_code;
"

case $1 in
"dim_user_zip")
    hive -e "$dim_user_zip"
;;
"dim_sku_full")
    hive -e "$dim_sku_full"
;;
"dim_province_full")
    hive -e "$dim_province_full"
;;
"dim_coupon_full")
    hive -e "$dim_coupon_full"
;;
"dim_activity_full")
    hive -e "$dim_activity_full"
;;
"all")
    hive -e "$dim_user_zip$dim_sku_full$dim_province_full$dim_coupon_full$dim_activity_full"
;;
esac

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/369903.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

RocketMQ 5.x新版本部署优化一览

​ RocketMQ从2022年9月份开始推出了新的5.x大版本。相比于之前的4.x版本,5.x版本向云原生前进了一大步。在增强原因功能的基础上,更是支持多语言客户端,周边生态也进行了补强和完善,明显可以看到离Kafka老大哥又近了很大一步。 …

linux网络编程-多进程实现TCP并发服务器

服务端流程步骤socket函数创建监听套接字lfdbind函数将监听套接字绑定ip和端口listen函数设置服务器为被动监听状态,同时创建一条未完成连接队列(没走完tcp三次握手流程的连接),和一条已完成连接队列(已完成tcp三次握手…

3-虚拟机篇

一.java JVM 的内存结构 内存:按线程类型分两类 线程共享: 方法区:存放类的信息堆:存放java对象的信息 线程私有: java虚拟机栈:存放java方法、方法参数和局部变量程序计数器:记录程序执行…

mars3d将当前视⻆指向北⽅且加载建筑物白膜

通过 setView⽅法实现,可以设置heading参数控制相对旋转⻆度map.scene.camera.setView({ orientation: { heading: 0, } }) 相关示例:1.http://mars3d.cn/editor-vue.html?idmap/options/scene2.功能示例(Vue版) | Mars3D三维可视化平台 | 火星科技// *…

ESP32S3 SPI发送间隔频率 驱动ADS8326

ESP32S3 SPI发送间隔频率驱动ADS8326esp32s3 spi例程 代码测试用寄存器方式实现spi发送寄存器描述驱动ADS8326 ads8326驱动时序 首先CS信号拉低,然后clk发送6个时钟,ads8326开始启动转换。 最后clk发送16个时钟,就会读取到两个字节的数据&a…

Spring扫描逻辑原码解析(带图好理解)

先上流程图 再上类图 再上代码 public static void main(String[] args) {XsmApplicationContext applicationContextnew XsmApplicationContext(AppConfig.class);System.out.println(applicationContext.getBean("userService"));System.out.println(applicationCo…

Laravel框架学习笔记——Laravel环境配置及安装(Ubuntu20.04为例)

目录引言1、安装Nginx2、安装PHP3、安装Composer4、搭建Laravel框架项目5、修改Nginx映射6、安装MySQL引言 好久没写博客了,因为个人需要, 所以要涉及到Laravel框架的学习,所以会出一系列的关于PHP的Laravel框架学习笔记,希望能够…

【平台数仓设计——2023】

平台数仓设计——2023前言一、选取大数据平台1、CDH大数据平台2、HDP大数据平台3、CDP大数据平台4、各种云数据中台二、选取调度平台1、DolphinScheduler(海豚调度)2、AzKaban3、Oozie4、Airflow5、corntab命令三、选取数仓设计方案1、离线数仓2、实时数仓3、离线实时一体化数仓…

尚医通 (二十二)预约下单

目录一、预约下单功能(一)1、需求2、搭建订单模块3、封装Feign调用获取就诊人接口4、封装Feign调用获取排班下单信息接口二、预约下单功能(二)1、实现生成订单接口三、预约下单功能(三)四、预约下单功能(四)1、生成订单后处理逻辑-封装短信接口2、生成订单后处理逻辑-更新排班数…

cracklib与libpwquality 评估密码的安全性

一、cracklib 检测密码强弱linux中采用pam pam_cracklib module来实现对密码强度的检测,可以通过配置让linux系统自动检测用户的密码是否为弱密码。yuminstall cracklib # centos apt-get install libcrack2 # ubuntu # 如果需要依赖此库做开发的话需要安装这个 y…

墨者——内部文件上传系统漏洞分析溯源 内部文件上传系统漏洞分析溯源

墨者——内部文件上传系统漏洞分析溯源 内部文件上传系统漏洞分析溯源 1.选择合适的文件上传 2.可以看到为*.asp文件 3.可以推测出此站点为IIS 4.上传shell.asp试试 5.上传报错,将其改名为shell.asp.txt上传,发现上传成功 6.有个问题就是服务器将我们所…

Node 10.0.8.6:9003 is unknown to cluster

解决方案解决方案一解决方案一 ① 概念介绍 公网ip:就是任意两台连接了互联网的电脑可以互相ping ip,能够通的ip 内网ip:只是在内网中使用无法与外网连接的ip ②问题背景 在腾讯云上搭建的一个redis集群,集群启动后 可以看到启动节点…

腾讯企微、泛微、契约锁举办的这场大会,超1000人共商数字化转型

全程数字化运营平台体验大会 2月23日下午,腾讯企业微信、泛微、契约锁在上海联合举办了全程数字化运营平台体验大会,旨在推动政企客户的数字化转型。 活动以“智能、协同、高效”为主题,吸引了上千位政府及企事单位的信息化负责人参与。 在…

Tapdata 和 Databend 数仓数据同步实战

作者:韩山杰https://github.com/hantmacDatabend Cloud 研发工程师基础架构在云计算时代也发生着翻天地覆的变化,对于业务的支持变成了如何能利用好云资源实现降本增效,同时更好的支撑业务也成为新时代技术人员的挑战。 本篇文章通过&#xf…

含泪推荐5款小体积,超实用的工具

大家好,我又来啦,今天给大家带来的5款软件,共同特点都是体积小、无广告、超实用,大家观看完可以自行搜索下载哦。 1.文字自动验证工具——LanguageTool LanguageTool是一款文字自动验证工具,不仅支持中英文&#xff…

如何将虚拟机ubuntu设置网路连接为桥接模式

当将虚拟机的网络适配器设置为NAT模式时,一切正常,可以ping通所有ip, 但是当设置为桥接模式时,ping命令会提示ubuntu connect: Network is unreachable 这是因为桥接模式没有设置正确,通过下面的步骤可以将桥接模式配置正确。 1…

iOS 生命周期 (最新最完整)

要知道APP的生命周期,首先要了解一下生命周期的5种状态,结合状态理解生命周期的使用。 一、应用的状态 应用的状态包括: 1.1 未运行(Not running) 程序没启动 1.2 未激活(Inactive) 程序在…

尚医通 (二十一)预约挂号功能

目录一、预约挂号详情1、需求2、预约挂号详情接口3、预约挂号详情前端二、预约确认1、需求2、预约确认接口3、预约确认前端一、预约挂号详情 1、需求 接口分析 (1)根据预约周期,展示可预约日期数据,按分页展示 (2&…

JAVA设计模式之工厂模式讲解

目录 前言 开始表演 前言 Java中使用工厂模式的主要原因是为了实现代码的灵活性和可维护性。工厂模式是一种创建型设计模式,它提供了一种将对象的创建和使用进行分离的方式。具体来说,工厂模式可以将对象的创建过程封装在一个独立的工厂类中&#xff…

自动微分记录

计算图(数据流图):AI系统化问题 - 计算图的提出 计算图和自动微分 单算子切分、图的切分和调度 前端->统一表示->优化层->运行时态->底层库 语言 计算网络模型中间表示 计算图 自动微分 计算图优化 内存管理、计算图调度和执行 内核代码优化与编译 多硬件支持…