Data Warehouse Interview: Hand-Writing Zipper Table SQL and Analyzing How Many Jobs It Runs


Zipper Table Definition

A zipper table (拉链表, an SCD Type 2 pattern) maintains both the historical states and the latest state of the data. Depending on the chain granularity it is effectively a snapshot table, just optimized to drop the records that did not change; with it you can conveniently reconstruct the customer records as of any point along the chain.

A zipper table records the lifecycle of every row: once a row's lifecycle ends, a new row is opened and the current date is written into its effective start date.
If a row is still valid today, its effective end date is filled with an extreme value (such as 9999-99-99; the columns are strings, so any lexicographically maximal placeholder such as 9999-12-31 works the same way).

User Table

Rows in the user table may be inserted or updated every day, but updates are infrequent, making this a classic slowly changing dimension, so a zipper table is used here to store the user dimension data.

How a Zipper Table Works

Step 1: On 2020-01-01 we initialize the user data into the warehouse user table. The table carries start_time and end_time columns that mark each row's lifecycle; the detail data is shown below.

(image: user table snapshot on 2020-01-01)

Step 2: On 2020-01-02, user 104 is deleted, users 106 and 107 are added, and user 103's name is changed to 小吴; the detail data is shown below.

(image: user table snapshot on 2020-01-02)

Step 3: On 2020-01-03, user 107 is deleted, user 106's name is changed to 八王爷, user 103's name is changed to 吴王, and user 108 is added; the detail data is shown below.

(image: user table snapshot on 2020-01-03)
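
To make this concrete, here is how user 103's rows would evolve in the zipper table through 2020-01-03 (a sketch reconstructed from the steps above; the original screenshots are not reproduced, so 103's initial name is a placeholder):

id    name        start_time    end_time
103   <initial>   2020-01-01    2020-01-01    (closed when the name changed to 小吴)
103   小吴        2020-01-02    2020-01-02    (closed when the name changed to 吴王)
103   吴王        2020-01-03    9999-99-99    (current row)

Only the row whose end_time is 9999-99-99 is the currently valid record; the history is preserved in the closed rows.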

When to Use a Zipper Table

The data volume is fairly large; some columns in the table change, but not frequently (slowly changing data), and the business still needs to report on those state changes. Keeping a full copy of the data every day is unrealistic: it wastes storage, and some statistics become awkward to compute. This is where the zipper table earns its keep, saving space while still meeting the requirement. Take a user table with 100 million users: storing a full copy of all users every day, 365 days a year, is extremely inefficient, and a user zipper table is the answer.
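
A rough back-of-the-envelope comparison makes the saving obvious (the 1% daily change rate is an assumption for illustration):

daily full snapshots: 100,000,000 rows/day × 365 days       ≈ 36.5 billion rows/year
zipper table:         100,000,000 + 100,000,000 × 1% × 365  ≈ 465 million rows/year

That is roughly a 98% reduction in stored rows, with the same ability to answer "what did this user look like on day X".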

Zipper Table Implementation Steps

(image: implementation flow diagram)

Daily update (the screenshots below are taken from 尚硅谷)

(images: daily-update flow diagrams)

Source (ODS) User Table

DROP TABLE IF EXISTS ods_user_info;
CREATE EXTERNAL TABLE ods_user_info(
    `id` STRING COMMENT 'user id',
    `login_name` STRING COMMENT 'login name',
    `nick_name` STRING COMMENT 'nickname',
    `name` STRING COMMENT 'user name',
    `phone_num` STRING COMMENT 'phone number',
    `email` STRING COMMENT 'email',
    `user_level` STRING COMMENT 'user level',
    `birthday` STRING COMMENT 'birthday',
    `gender` STRING COMMENT 'gender',
    `create_time` STRING COMMENT 'create time',
    `operate_time` STRING COMMENT 'operate time'
) COMMENT 'user table'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Mock data:

ALTER TABLE ods_user_info DROP IF EXISTS PARTITION (dt='2020-06-14');
INSERT INTO ods_user_info PARTITION (dt='2020-06-14')(id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time)
VALUES
    (1, 'user1', 'nick1', '张三', '13800138000', 'zhangsan@example.com', 1, '2000-01-01', '男', '2020-06-14', '2020-06-14'),
    (2, 'user2', 'nick2', '李四', '13900139000', 'lisi@example.com', 2, '2001-01-01', '女', '2020-06-14', '2020-06-14'),
    (3, 'user3', 'nick3', '王五', '13700137000', 'wangwu@example.com', 3, '2002-01-01', '男', '2020-06-14', '2020-06-14'),
    (4, 'user4', 'nick4', '赵六', '13600136000', 'zhaoliu@example.com', 4, '2003-01-01', '女', '2020-06-14', '2020-06-14'),
    (5, 'user5', 'nick5', '孙七', '13500135000', 'sunqi@example.com', 5, '2004-01-01', '男', '2020-06-14', '2020-06-14'),
    (6, 'user6', 'nick6', '周八', '13400134000', 'zhouba@example.com', 6, '2005-01-01', '女', '2020-06-14', '2020-06-14'),
    (7, 'user7', 'nick7', '吴九', '13300133000', 'wujiu@example.com', 7, '2006-01-01', '男', '2020-06-14', '2020-06-14'),
    (8, 'user8', 'nick8', '郑十', '13200132000', 'zhengshi@example.com', 8, '2007-01-01', '女', '2020-06-14', '2020-06-14'),
    (9, 'user9', 'nick9', '冯十一', '13100131000', 'fengshiyi@example.com', 9, '2008-01-01', '男', '2020-06-14', '2020-06-14'),
    (10, 'user10', 'nick10', '陈十二', '13000130000', 'chenshier@example.com', 10, '2009-01-01', '女', '2020-06-14', '2020-06-14');

select * from ods_user_info;

These three rows simulate updates to users 8, 9 and 10 arriving on 2020-06-15 (their names change):

ALTER TABLE ods_user_info DROP IF EXISTS PARTITION (dt='2020-06-15');
INSERT INTO ods_user_info PARTITION (dt='2020-06-15')(id, login_name, nick_name, name, phone_num, email, user_level, birthday, gender, create_time, operate_time)
VALUES
    (8, 'user8', 'nick8', '郑十san', '13200132000', 'zhengshi@example.com', 8, '2007-01-01', '女', '2020-06-15', '2020-06-15'),
    (9, 'user9', 'nick9', '冯十wu', '13100131000', 'fengshiyi@example.com', 9, '2008-01-01', '男', '2020-06-15', '2020-06-15'),
    (10, 'user10', 'nick10', '陈yi二', '13000130000', 'chenshier@example.com', 10, '2009-01-01', '女', '2020-06-15', '2020-06-15');

Create the Zipper Table

DROP TABLE IF EXISTS dim_user_info;
CREATE EXTERNAL TABLE dim_user_info(
    `id` STRING COMMENT 'user id',
    `login_name` STRING COMMENT 'login name',
    `nick_name` STRING COMMENT 'nickname',
    `name` STRING COMMENT 'user name',
    `phone_num` STRING COMMENT 'phone number',
    `email` STRING COMMENT 'email',
    `user_level` STRING COMMENT 'user level',
    `birthday` STRING COMMENT 'birthday',
    `gender` STRING COMMENT 'gender',
    `create_time` STRING COMMENT 'create time',
    `operate_time` STRING COMMENT 'operate time',
    `start_date` STRING COMMENT 'effective start date',
    `end_date` STRING COMMENT 'effective end date'
) COMMENT 'user zipper table'
PARTITIONED BY (`dt` STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
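
A note on the layout used here: start_date/end_date mark each row's validity window, while the dt partition physically separates the rows. As the merge SQL below shows, every open row is written with dt='9999-99-99' and every row closed on a given day lands in that day's partition, so "give me the latest state of all users" is a single-partition scan.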

Initialize the Zipper Table

insert overwrite table dim_user_info partition(dt='9999-99-99')
select
    id,
    login_name,
    nick_name,
    md5(name),        -- sensitive fields are masked with md5
    md5(phone_num),
    md5(email),
    user_level,
    birthday,
    gender,
    create_time,
    operate_time,
    '2020-06-14',     -- start_date: the initialization day
    '9999-99-99'      -- end_date: still valid
from ods_user_info
where dt='2020-06-14';
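
A quick sanity check; with the mock data above this should return the 10 users loaded on 2020-06-14, all still open:

select id, name, start_date, end_date
from dim_user_info
where dt = '9999-99-99';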

The Zipper Table Merge SQL
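
One prerequisite worth calling out: the INSERT below writes to a fully dynamic partition (dt), which Hive rejects in its default strict mode, so dynamic partitioning typically needs to be enabled first:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;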

with
tmp as
(
    select
        old.id old_id,
        old.login_name old_login_name,
        old.nick_name old_nick_name,
        old.name old_name,
        old.phone_num old_phone_num,
        old.email old_email,
        old.user_level old_user_level,
        old.birthday old_birthday,
        old.gender old_gender,
        old.create_time old_create_time,
        old.operate_time old_operate_time,
        old.start_date old_start_date,
        old.end_date old_end_date,
        new.id new_id,
        new.login_name new_login_name,
        new.nick_name new_nick_name,
        new.name new_name,
        new.phone_num new_phone_num,
        new.email new_email,
        new.user_level new_user_level,
        new.birthday new_birthday,
        new.gender new_gender,
        new.create_time new_create_time,
        new.operate_time new_operate_time,
        new.start_date new_start_date,
        new.end_date new_end_date
    from
    (
        select
            id,
            login_name,
            nick_name,
            name,
            phone_num,
            email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            start_date,
            end_date
        from dim_user_info
        where dt='9999-99-99'
    ) old   -- "old": the currently open rows in the zipper table
    full outer join
    (
        select
            id,
            login_name,
            nick_name,
            md5(name) name,
            md5(phone_num) phone_num,
            md5(email) email,
            user_level,
            birthday,
            gender,
            create_time,
            operate_time,
            '2020-06-15' start_date,
            '9999-99-99' end_date
        from ods_user_info
        where dt='2020-06-15'
    ) new   -- "new": today's incoming data, opened as of 2020-06-15
    on old.id=new.id
)
insert overwrite table dim_user_info partition(dt)
-- branch 1: the surviving (latest) version of every id; new-side values win via nvl
select
    nvl(new_id,old_id),
    nvl(new_login_name,old_login_name),
    nvl(new_nick_name,old_nick_name),
    nvl(new_name,old_name),
    nvl(new_phone_num,old_phone_num),
    nvl(new_email,old_email),
    nvl(new_user_level,old_user_level),
    nvl(new_birthday,old_birthday),
    nvl(new_gender,old_gender),
    nvl(new_create_time,old_create_time),
    nvl(new_operate_time,old_operate_time),
    nvl(new_start_date,old_start_date),
    nvl(new_end_date,old_end_date),
    nvl(new_end_date,old_end_date) dt
from tmp
union all
-- branch 2: re-emit and close the old version of every id that also appears today
select
    old_id,
    old_login_name,
    old_nick_name,
    old_name,
    old_phone_num,
    old_email,
    old_user_level,
    old_birthday,
    old_gender,
    old_create_time,
    old_operate_time,
    old_start_date,
    cast(date_add('2020-06-15',-1) as string),    -- close with end_date = load date minus one
    cast(date_add('2020-06-15',-1) as string) dt  -- and route the closed row into that day's partition
from tmp
where new_id is not null and old_id is not null;
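
To recap the two branches: the first SELECT emits exactly one surviving row per id. nvl() prefers the new-side values, so unchanged ids keep their old open row (dt stays '9999-99-99') while changed and brand-new ids get a fresh open row. The second SELECT re-emits the old version of every id that also arrived today and closes it with end_date = load date minus one day, routing it into the '2020-06-14' partition. Note that ids present only on the old side (deletions) simply keep their open row; closing deletions would need an extra rule that this query does not implement.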

Query the Final Zipper Table Data

select * from dim_user_info where dt = '9999-99-99';
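
With the mock data above, the dt='9999-99-99' partition should again contain 10 open rows: users 1 through 7 untouched, and users 8, 9 and 10 carrying their 2020-06-15 versions with start_date='2020-06-15'. Their superseded versions should now sit in the dt='2020-06-14' partition with end_date='2020-06-14', which is easy to eyeball:

select id, name, start_date, end_date, dt
from dim_user_info
where id in ('8', '9', '10')
order by id, start_date;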

How Stages Map to Jobs in Hive

1) explain shows the stage breakdown clearly.

2) After Hive's automatic optimizations, some stages will never execute; these are the empty stages.

3) Given the stage breakdown, counting jobs comes down to which stages actually execute: only stages that need to run are submitted to YARN and become concrete job applications. On the MapReduce engine, only stages of type "Map Reduce" turn into jobs; Move Operator, Stats-Aggr, Conditional Operator and Map Reduce Local Work stages run on the client side and do not count.
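
The two plans below were evidently captured by prefixing the merge statement from the previous section with the explain keyword. As a trivial, self-contained illustration of the mechanics:

explain
select * from dim_user_info where dt = '9999-99-99';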

First EXPLAIN output. Judging by the plan (text output format, warehouse path zmm.db), the union's second-branch inner join has been converted to an unconditional map join: Stage-13 builds the hash table locally, and Stage-2 applies it while also writing the UNION ALL result.

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-13 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-13
  Stage-8 depends on stages: Stage-2 , consists of Stage-5, Stage-4, Stage-6
  Stage-5
  Stage-0 depends on stages: Stage-5, Stage-4, Stage-7
  Stage-3 depends on stages: Stage-0
  Stage-4
  Stage-6
  Stage-7 depends on stages: Stage-6

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dim_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: id (type: string), login_name (type: string), nick_name (type: string), name (type: string), phone_num (type: string), email (type: string), user_level (type: string), birthday (type: string), gender (type: string), create_time (type: string), operate_time (type: string), start_date (type: string), end_date (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Reduce Output Operator
                key expressions: _col0 (type: string)
                sort order: +
                Map-reduce partition columns: _col0 (type: string)
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string)
          TableScan
            alias: ods_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: (dt = '2020-06-15') (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string), login_name (type: string), nick_name (type: string), md5(name) (type: string), md5(phone_num) (type: string), md5(email) (type: string), user_level (type: string), birthday (type: string), gender (type: string), create_time (type: string), operate_time (type: string), '2020-06-15' (type: string), '9999-99-99' (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 _col0 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: NVL(_col13,_col0) (type: string), NVL(_col14,_col1) (type: string), NVL(_col15,_col2) (type: string), NVL(_col16,_col3) (type: string), NVL(_col17,_col4) (type: string), NVL(_col18,_col5) (type: string), NVL(_col19,_col6) (type: string), NVL(_col20,_col7) (type: string), NVL(_col21,_col8) (type: string), NVL(_col22,_col9) (type: string), NVL(_col23,_col10) (type: string), NVL(_col24,_col11) (type: string), NVL(_col25,_col12) (type: string), NVL(_col25,_col12) (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-13
    Map Reduce Local Work
      Alias -> Map Local Tables:
        null-subquery2:$hdt$_0-subquery2:$hdt$_0:dim_user_info 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        null-subquery2:$hdt$_0-subquery2:$hdt$_0:dim_user_info 
          TableScan
            alias: dim_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string), login_name (type: string), nick_name (type: string), name (type: string), phone_num (type: string), email (type: string), user_level (type: string), birthday (type: string), gender (type: string), create_time (type: string), operate_time (type: string), start_date (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 _col0 (type: string)
                    1 _col0 (type: string)

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
              Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: zmm.dim_user_info
          TableScan
            alias: ods_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: ((dt = '2020-06-15') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: string)
                    1 _col0 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), '2020-06-14' (type: string), '2020-06-14' (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    Union
                      Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                      File Output Operator
                        compressed: false
                        Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                        table:
                            input format: org.apache.hadoop.mapred.TextInputFormat
                            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                            name: zmm.dim_user_info
      Local Work:
        Map Reduce Local Work

  Stage: Stage-8
    Conditional Operator

  Stage: Stage-5
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://192.168.0.122:9000/user/hive/warehouse/zmm.db/dim_user_info/.hive-staging_hive_2023-09-28_16-48-39_357_1852508236922488629-1/-ext-10000

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            dt 
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: zmm.dim_user_info

  Stage: Stage-3
    Stats-Aggr Operator

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: zmm.dim_user_info

  Stage: Stage-6
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: zmm.dim_user_info

  Stage: Stage-7
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://192.168.0.122:9000/user/hive/warehouse/zmm.db/dim_user_info/.hive-staging_hive_2023-09-28_16-48-39_357_1852508236922488629-1/-ext-10000
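
Reading the first plan against the rules above: the Map Reduce stages are Stage-1 (the full outer join) and Stage-2 (the union plus the folded-in map join), while Stage-4 and Stage-6 are file-merge jobs hidden behind the Stage-8 conditional. Stage-13 is local work, and Stage-5, Stage-7, Stage-0 and Stage-3 are move/stats stages, none of which become jobs. So in the common case, where the conditional resolves to a plain file move, this statement submits 2 jobs; if small-file merging is triggered, one merge job (Stage-4 or Stage-6) is added.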

Second EXPLAIN output. This one appears to come from a different setup (a Parquet-backed dim_user_info under /warehouse/gmall/dim), and here the map-join conversion is conditional: Stage-14 chooses between the map-join variants (Stage-15 feeding Stage-12, or Stage-16 feeding Stage-13), with the common-join Stage-10 kept as the backup stage.

Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-10, Stage-12, Stage-13
  Stage-8 depends on stages: Stage-2 , consists of Stage-5, Stage-4, Stage-6
  Stage-5
  Stage-0 depends on stages: Stage-5, Stage-4, Stage-7
  Stage-3 depends on stages: Stage-0
  Stage-4
  Stage-6
  Stage-7 depends on stages: Stage-6
  Stage-14 is a root stage , consists of Stage-15, Stage-16, Stage-10
  Stage-15 has a backup stage: Stage-10
  Stage-12 depends on stages: Stage-15
  Stage-16 has a backup stage: Stage-10
  Stage-13 depends on stages: Stage-16
  Stage-10

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dim_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: (dt = '9999-99-99') (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string), login_name (type: string), nick_name (type: string), name (type: string), phone_num (type: string), email (type: string), user_level (type: string), birthday (type: string), gender (type: string), create_time (type: string), operate_time (type: string), start_date (type: string), end_date (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string)
          TableScan
            alias: ods_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: (dt = '2020-06-15') (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string), login_name (type: string), nick_name (type: string), md5(name) (type: string), md5(phone_num) (type: string), md5(email) (type: string), user_level (type: string), birthday (type: string), gender (type: string), create_time (type: string), operate_time (type: string), '2020-06-15' (type: string), '9999-99-99' (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), _col12 (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 _col0 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: NVL(_col13,_col0) (type: string), NVL(_col14,_col1) (type: string), NVL(_col15,_col2) (type: string), NVL(_col16,_col3) (type: string), NVL(_col17,_col4) (type: string), NVL(_col18,_col5) (type: string), NVL(_col19,_col6) (type: string), NVL(_col20,_col7) (type: string), NVL(_col21,_col8) (type: string), NVL(_col22,_col9) (type: string), NVL(_col23,_col10) (type: string), NVL(_col24,_col11) (type: string), NVL(_col25,_col12) (type: string), NVL(_col25,_col12) (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe

  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
              Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                table:
                    input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
                    output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
                    serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
                    name: zmm.dim_user_info
          TableScan
            Union
              Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 2 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                table:
                    input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
                    output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
                    serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
                    name: zmm.dim_user_info

  Stage: Stage-8
    Conditional Operator

  Stage: Stage-5
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://192.168.0.122:9000/warehouse/gmall/dim/dim_user_info/.hive-staging_hive_2023-09-28_16-23-09_282_7816820202500856515-1/-ext-10000

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            dt 
          replace: true
          table:
              input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
              output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
              serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
              name: zmm.dim_user_info

  Stage: Stage-3
    Stats-Aggr Operator

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
                  output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
                  serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
                  name: zmm.dim_user_info

  Stage: Stage-6
    Map Reduce
      Map Operator Tree:
          TableScan
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
                  output format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
                  serde: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
                  name: zmm.dim_user_info

  Stage: Stage-7
    Move Operator
      files:
          hdfs directory: true
          destination: hdfs://192.168.0.122:9000/warehouse/gmall/dim/dim_user_info/.hive-staging_hive_2023-09-28_16-23-09_282_7816820202500856515-1/-ext-10000

  Stage: Stage-14
    Conditional Operator

  Stage: Stage-15
    Map Reduce Local Work
      Alias -> Map Local Tables:
        null-subquery2:$hdt$_0-subquery2:$hdt$_1:ods_user_info 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        null-subquery2:$hdt$_0-subquery2:$hdt$_1:ods_user_info 
          TableScan
            alias: ods_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: ((dt = '2020-06-15') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 _col0 (type: string)
                    1 _col0 (type: string)

  Stage: Stage-12
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dim_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: ((dt = '9999-99-99') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string), login_name (type: string), nick_name (type: string), name (type: string), phone_num (type: string), email (type: string), user_level (type: string), birthday (type: string), gender (type: string), create_time (type: string), operate_time (type: string), start_date (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: string)
                    1 _col0 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), '2020-06-14' (type: string), '2020-06-14' (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    File Output Operator
                      compressed: false
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-16
    Map Reduce Local Work
      Alias -> Map Local Tables:
        null-subquery2:$hdt$_0-subquery2:$hdt$_0:dim_user_info 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        null-subquery2:$hdt$_0-subquery2:$hdt$_0:dim_user_info 
          TableScan
            alias: dim_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: ((dt = '9999-99-99') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string), login_name (type: string), nick_name (type: string), name (type: string), phone_num (type: string), email (type: string), user_level (type: string), birthday (type: string), gender (type: string), create_time (type: string), operate_time (type: string), start_date (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 _col0 (type: string)
                    1 _col0 (type: string)

  Stage: Stage-13
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: ods_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: ((dt = '2020-06-15') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: string)
                    1 _col0 (type: string)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  Select Operator
                    expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), '2020-06-14' (type: string), '2020-06-14' (type: string)
                    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    File Output Operator
                      compressed: false
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-10
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dim_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: ((dt = '9999-99-99') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string), login_name (type: string), nick_name (type: string), name (type: string), phone_num (type: string), email (type: string), user_level (type: string), birthday (type: string), gender (type: string), create_time (type: string), operate_time (type: string), start_date (type: string)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string)
          TableScan
            alias: ods_user_info
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Filter Operator
              predicate: ((dt = '2020-06-15') and id is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator
                expressions: id (type: string)
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          keys:
            0 _col0 (type: string)
            1 _col0 (type: string)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col6 (type: string), _col7 (type: string), _col8 (type: string), _col9 (type: string), _col10 (type: string), _col11 (type: string), '2020-06-14' (type: string), '2020-06-14' (type: string)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            File Output Operator
              compressed: false
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
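
The second plan works out differently: Stage-1 (the full outer join) and Stage-2 (the union) always run, and the second-branch join costs one more job, either map-join Stage-12 or Stage-13 (Stage-15 and Stage-16 are only local hash-table builds) or, as the fallback, the common-join Stage-10. So this version typically submits 3 jobs, plus an optional small-file merge job (Stage-4 or Stage-6). When the statement actually runs, the Hive CLI reports the figure up front with a line like "Total jobs = 3", though the exact number can shift as conditional stages resolve. As an interview answer, then: this zipper-table merge compiles to roughly 2 to 3 MapReduce jobs depending on how the map join is folded in, plus an optional merge job.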
