1、背景
业务现状:集团使用FineBI做数据呈现及报表分析工具,经过近两年的BI建设,供应链域及营销域的BI建设已初具规模并体系化。数仓规模60TB,FineBI数据集约8000个,BI挂出报表数约1600个,报表月增幅在40左右。
技术现状:数据加工链路:业务系统数据库->Hive数仓->PG导出库->FineBI抽取数据集->BI报表。该数据链路下,遇到了一些问题。本文不展开全部问题,仅讨论PG导出库数据规模过大,导致FineBI每日数据更新任务卡住问题。
问题描述:集团供应链的物料的帐龄表、库龄表按日分区,日增量100W,且存储了所有历史数据,该账龄、库龄数据在一年内单表接近160GB,FineBI每日早晨抽取数据的更新任务卡住(FineBI只配置了4个后台更新线程),导致大部分数据更新延迟。即使开放更多的后台数据更新线程(按CPU核数推荐经验值,过多会影响BI的报表查看性能),该数据的抽取对PG导出库也有压力。
解决思路:1)从报表的逻辑优化减少数据量;2)切换查询引擎,使用StarRocks替换PG,FineBI直连StarRocks,减少BI数据抽取。本文仅讨论切换查询引擎的方案。
2、最初实验
思路:
使用Broker Load导入Hive中的按日动态分区表数据导出到StarRocks中。StarRocks中的表也按日动态分区。
相关文档:
- 参考Broker Load官方文档:Broker load研究导入配置。
- 参考动态分区官方文档:动态分区研究StarRocks按日动态分区表建表语句。
问题:
使用Brocker Load方式遇到以下问题:
- hdfs parquet文件有几年的分区数据,有的是月分区,有的是天分区
- 一些分区的字段不同,有的有A BC字段,有的有ABD字段有的ABCD字段
- StarRocks导入时,不允许StarRocks的字段比原表字段多
官方给出了以下脚本,创建一张支持动态分区的表,表名为 site_access,动态分区通过 PROPERTIES 进行配置。分区的区间为当前时间的前后 3 天,总共 6 天。
CREATE TABLE site_access(
event_day DATE,
site_id INT DEFAULT '10',
city_code VARCHAR(100),
user_name VARCHAR(32) DEFAULT '',
pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(event_day, site_id, city_code, user_name)
PARTITION BY RANGE(event_day)(
PARTITION p20200321 VALUES LESS THAN ("2020-03-22"),
PARTITION p20200322 VALUES LESS THAN ("2020-03-23"),
PARTITION p20200323 VALUES LESS THAN ("2020-03-24"),
PARTITION p20200324 VALUES LESS THAN ("2020-03-25")
)
DISTRIBUTED BY HASH(event_day, site_id) BUCKETS 32
PROPERTIES(
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.start" = "-3",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "32"
);
如果是Hive使用经验的用户,第一次使用StarRocks,参考上面的SQL逻辑设计自己的表逻辑时,可能有疑问:PARTITION BY RANGE(event_day)(这里的内容改如何填写呢)?比如,以下需要迁移的Hive表,
-- drop table if exists bda${db_para}.bda_inv_item_age_dtl;
create table if not exists bda${db_para}.bda_inv_item_age_dtl (
stat_date string comment '统计日期',
entr_date string comment '入库日期',
item_id string comment '物料id',
itm_cd string comment '物料编码',
org_id string comment '库存组织id',
org_cd string comment '库存组织编码',
sub_invtr_cd string comment '子库编码',
bch_nbr string comment '批次号',
entr_qty string comment '入库数量',
total_qty string comment '倒算总入库数',
left_qty string comment '剩余库存数',
alloc_qty string comment '分摊库存数',
stock_age string comment '库龄',
item_cost string comment 'pac成本',
actual_cost string comment '实际成本',
item_business string comment '最近三个月使用事业部',
pch_big_ctg string comment '采购品类大类',
pch_med_ctg string comment '采购品类中类',
pch_sml_ctg string comment '采购品类小类',
ship_customer_name string comment '订单收货客户',
team_bu_name string comment '战队',
real_customer_name string comment '真实客户',
item_category string comment '物料类别',
prod_model_id string comment '产品型号id',
prod_model string comment '产品型号',
job_belong_bu string comment '工单所属事业部',
item_bu string comment '近三个使用事业部',
pac_cost string comment 'pac成本单价',
item_category_bg string comment '物料大类')
comment '库龄'
partitioned by(part_dt string)
row format delimited fields terminated by '\036'
stored as parquet;
-- add by tjl 2022.03.10
-- alter table bda${db_para}.bda_inv_item_age_dtl add columns(dept_name string comment '工单所属部门') cascade;
-- 2022.04.18:xxx:新增:工单制单时间、制单人
alter table bda${db_para}.bda_inv_item_age_dtl add columns(wdj_creation_date string comment '工单制单时间') cascade;
alter table bda${db_para}.bda_inv_item_age_dtl add columns(created_by string comment '工单制单人') cascade;
请注意:这里面有表字段表更。Parquet格式的分区表使用Alter更新字段或新增字段,Parquet文件会如何变化?旧分区的历史数据要如何处理?
其hdfs文件如下:
drwxrwx--x+ - hive hive 0 2018-08-30 03:51 warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2018-08-29
#此处省略一万行
drwxrwx--x+ - hive hive 0 2022-11-26 03:55 warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-11-25
drwxrwx--x+ - hive hive 0 2022-11-27 03:52 warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-11-26
讨论:
- StarRocks有多种数据表模型,该场景应该选择什么数据表模型呢?
- 应该如何建StarRocks的表呢?
- 应该如何建Brocker load任务呢?
3、问题分析
- Parquet格式的Hive表,执行add columns的工作流程及底层执行机制如何?
1、Hive元数据表中,在表末尾,追加一列。
2、对于Hive分区表,需要cascade修饰词,管理旧分区的元数据信息。
3、如果是使用Parquet、ORC等存储格式,文件中存储的内容没有重写,则不会变动。
4、Hive如何识别就分区表中没有新增列的?
参考:[Hive] Alter Table/Partition/Column - Huawei Enterprise Support Community
简单而言,如果不使用cascade修饰词,只是修改了Hive中bda_inv_item_age_dtl表元信息(表结构)。写入新分区的数据,hive能自动识别并正常显示。对于旧分区,如果包含了新列,即使旧分区使用了INSERT OVERWRITE重写数据,如果没有使用Drop 并 recreate分区,hive查询有不能正常工作。
参考:partitioning - how to add columns to existing hive partitioned table? - Stack Overflow
- Parquet文件中的schema变化情况
查看最早分区2016-06-30的数据结构:
[cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2016-06-30/000000_0 [cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0 >> hive20160630.sql [cloud@dp-master001 ~]$ vcat hive20160630.sql -bash: vcat: command not found [cloud@dp-master001 ~]$ cat hive20160630.sql message hive_schema { optional binary stat_date (UTF8); optional binary entr_date (UTF8); optional binary item_id (UTF8); optional binary itm_cd (UTF8); optional binary org_id (UTF8); optional binary org_cd (UTF8); optional binary sub_invtr_cd (UTF8); optional binary bch_nbr (UTF8); optional binary entr_qty (UTF8); optional binary total_qty (UTF8); optional binary left_qty (UTF8); optional binary alloc_qty (UTF8); optional binary stock_age (UTF8); optional binary item_cost (UTF8); optional binary actual_cost (UTF8); optional binary item_business (UTF8); optional binary pch_big_ctg (UTF8); optional binary pch_med_ctg (UTF8); optional binary pch_sml_ctg (UTF8); optional binary ship_customer_name (UTF8); optional binary team_bu_name (UTF8); optional binary real_customer_name (UTF8); optional binary item_category (UTF8); optional binary prod_model_id (UTF8); optional binary prod_model (UTF8); optional binary job_belong_bu (UTF8); optional binary item_bu (UTF8); optional binary pac_cost (UTF8); optional binary item_category_bg (UTF8); } creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber}) file schema: hive_schema ---------------------------------------------------------------------------------------------------- stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1 entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1 item_id: OPTIONAL BINARY O:UTF8 R:0 D:1 itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 org_id: OPTIONAL BINARY O:UTF8 R:0 D:1 org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1 entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1 item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 item_business: OPTIONAL BINARY O:UTF8 R:0 D:1 pch_big_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1 pch_med_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1 pch_sml_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1 ship_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1 team_bu_name: OPTIONAL BINARY O:UTF8 R:0 D:1 real_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1 item_category: OPTIONAL BINARY O:UTF8 R:0 D:1 prod_model_id: OPTIONAL BINARY O:UTF8 R:0 D:1 prod_model: OPTIONAL BINARY O:UTF8 R:0 D:1 job_belong_bu: OPTIONAL BINARY O:UTF8 R:0 D:1 item_bu: OPTIONAL BINARY O:UTF8 R:0 D:1 pac_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 item_category_bg: OPTIONAL BINARY O:UTF8 R:0 D:1 row group 1: RC:190899 TS:8780200 ---------------------------------------------------------------------------------------------------- stat_date: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:204/204/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE entr_date: BINARY UNCOMPRESSED DO:0 FPO:208 SZ:249018/249018/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE item_id: BINARY UNCOMPRESSED DO:0 FPO:249226 SZ:245992/245992/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE itm_cd: BINARY UNCOMPRESSED DO:0 FPO:495218 SZ:409890/409890/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE org_id: BINARY UNCOMPRESSED DO:0 FPO:905108 SZ:3641/3641/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE org_cd: BINARY UNCOMPRESSED DO:0 FPO:908749 SZ:3637/3637/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE sub_invtr_cd: BINARY UNCOMPRESSED DO:0 FPO:912386 SZ:187698/187698/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE bch_nbr: BINARY UNCOMPRESSED DO:0 FPO:1100084 SZ:286227/286227/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE entr_qty: BINARY UNCOMPRESSED DO:0 FPO:1386311 SZ:495330/495330/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE total_qty: BINARY UNCOMPRESSED DO:0 FPO:1881641 SZ:59/59/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE left_qty: BINARY UNCOMPRESSED DO:0 FPO:1881700 SZ:486445/486445/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE alloc_qty: BINARY UNCOMPRESSED DO:0 FPO:2368145 SZ:536856/536856/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE stock_age: BINARY UNCOMPRESSED DO:0 FPO:2905001 SZ:243903/243903/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE item_cost: BINARY UNCOMPRESSED DO:0 FPO:3148904 SZ:2322210/2322210/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,PLAIN,RLE actual_cost: BINARY UNCOMPRESSED DO:0 FPO:5471114 SZ:2922009/2922009/1.00 VC:190899 ENC:BIT_PACKED,PLAIN,RLE item_business: BINARY UNCOMPRESSED DO:0 FPO:8393123 SZ:1817/1817/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE pch_big_ctg: BINARY UNCOMPRESSED DO:0 FPO:8394940 SZ:4757/4757/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE pch_med_ctg: BINARY UNCOMPRESSED DO:0 FPO:8399697 SZ:13669/13669/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE pch_sml_ctg: BINARY UNCOMPRESSED DO:0 FPO:8413366 SZ:27595/27595/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE ship_customer_name: BINARY UNCOMPRESSED DO:0 FPO:8440961 SZ:26641/26641/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE team_bu_name: BINARY UNCOMPRESSED DO:0 FPO:8467602 SZ:12107/12107/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE real_customer_name: BINARY UNCOMPRESSED DO:0 FPO:8479709 SZ:33/33/1.00 VC:190899 ENC:BIT_PACKED,PLAIN,RLE item_category: BINARY UNCOMPRESSED DO:0 FPO:8479742 SZ:399/399/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE prod_model_id: BINARY UNCOMPRESSED DO:0 FPO:8480141 SZ:36922/36922/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE prod_model: BINARY UNCOMPRESSED DO:0 FPO:8517063 SZ:33472/33472/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE job_belong_bu: BINARY UNCOMPRESSED DO:0 FPO:8550535 SZ:3769/3769/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE item_bu: BINARY UNCOMPRESSED DO:0 FPO:8554304 SZ:33/33/1.00 VC:190899 ENC:BIT_PACKED,PLAIN,RLE pac_cost: BINARY UNCOMPRESSED DO:0 FPO:8554337 SZ:225484/225484/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE item_category_bg: BINARY UNCOMPRESSED DO:0 FPO:8779821 SZ:383/383/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
查看最新分区2022-11-26,Parquet文件的数据结构
[cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-11-24/000000_0 [cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0 >> hive.sql [cloud@dp-master001 ~]$ cat hive.sql message hive_schema { optional binary stat_date (UTF8); optional binary entr_date (UTF8); optional binary item_id (UTF8); optional binary itm_cd (UTF8); optional binary org_id (UTF8); optional binary org_cd (UTF8); optional binary sub_invtr_cd (UTF8); optional binary bch_nbr (UTF8); optional binary entr_qty (UTF8); optional binary total_qty (UTF8); optional binary left_qty (UTF8); optional binary alloc_qty (UTF8); optional binary stock_age (UTF8); optional binary item_cost (UTF8); optional binary actual_cost (UTF8); optional binary item_business (UTF8); optional binary pch_big_ctg (UTF8); optional binary pch_med_ctg (UTF8); optional binary pch_sml_ctg (UTF8); optional binary ship_customer_name (UTF8); optional binary team_bu_name (UTF8); optional binary real_customer_name (UTF8); optional binary item_category (UTF8); optional binary prod_model_id (UTF8); optional binary prod_model (UTF8); optional binary job_belong_bu (UTF8); optional binary item_bu (UTF8); optional binary pac_cost (UTF8); optional binary item_category_bg (UTF8); optional binary dept_name (UTF8); optional binary wdj_creation_date (UTF8); optional binary created_by (UTF8); } creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber}) file schema: hive_schema ---------------------------------------------------------------------------------------------------- stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1 entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1 item_id: OPTIONAL BINARY O:UTF8 R:0 D:1 itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 org_id: OPTIONAL BINARY O:UTF8 R:0 D:1 org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1 entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1 item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 item_business: OPTIONAL BINARY O:UTF8 R:0 D:1 pch_big_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1 pch_med_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1 pch_sml_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1 ship_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1 team_bu_name: OPTIONAL BINARY O:UTF8 R:0 D:1 real_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1 item_category: OPTIONAL BINARY O:UTF8 R:0 D:1 prod_model_id: OPTIONAL BINARY O:UTF8 R:0 D:1 prod_model: OPTIONAL BINARY O:UTF8 R:0 D:1 job_belong_bu: OPTIONAL BINARY O:UTF8 R:0 D:1 item_bu: OPTIONAL BINARY O:UTF8 R:0 D:1 pac_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 item_category_bg: OPTIONAL BINARY O:UTF8 R:0 D:1 dept_name: OPTIONAL BINARY O:UTF8 R:0 D:1 wdj_creation_date: OPTIONAL BINARY O:UTF8 R:0 D:1 created_by: OPTIONAL BINARY O:UTF8 R:0 D:1 row group 1: RC:428480 TS:21535478 ---------------------------------------------------------------------------------------------------- stat_date: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:381/381/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE entr_date: BINARY UNCOMPRESSED DO:0 FPO:385 SZ:608962/608962/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE item_id: BINARY UNCOMPRESSED DO:0 FPO:609347 SZ:774164/774164/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE itm_cd: BINARY UNCOMPRESSED DO:0 FPO:1383511 SZ:1185783/1185783/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE org_id: BINARY UNCOMPRESSED DO:0 FPO:2569294 SZ:49829/49829/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE org_cd: BINARY UNCOMPRESSED DO:0 FPO:2619123 SZ:49810/49810/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE sub_invtr_cd: BINARY UNCOMPRESSED DO:0 FPO:2668933 SZ:539203/539203/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE bch_nbr: BINARY UNCOMPRESSED DO:0 FPO:3208136 SZ:582198/582198/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE entr_qty: BINARY UNCOMPRESSED DO:0 FPO:3790334 SZ:1028459/1028459/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE total_qty: BINARY UNCOMPRESSED DO:0 FPO:4818793 SZ:141/141/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE left_qty: BINARY UNCOMPRESSED DO:0 FPO:4818934 SZ:1074391/1074391/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE alloc_qty: BINARY UNCOMPRESSED DO:0 FPO:5893325 SZ:1063130/1063130/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE stock_age: BINARY UNCOMPRESSED DO:0 FPO:6956455 SZ:599509/599509/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE item_cost: BINARY UNCOMPRESSED DO:0 FPO:7555964 SZ:5625018/5625018/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,PLAIN,RLE actual_cost: BINARY UNCOMPRESSED DO:0 FPO:13180982 SZ:6410120/6410120/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE item_business: BINARY UNCOMPRESSED DO:0 FPO:19591102 SZ:114122/114122/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE pch_big_ctg: BINARY UNCOMPRESSED DO:0 FPO:19705224 SZ:33/33/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE pch_med_ctg: BINARY UNCOMPRESSED DO:0 FPO:19705257 SZ:33/33/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE pch_sml_ctg: BINARY UNCOMPRESSED DO:0 FPO:19705290 SZ:33/33/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE ship_customer_name: BINARY UNCOMPRESSED DO:0 FPO:19705323 SZ:37444/37444/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE team_bu_name: BINARY UNCOMPRESSED DO:0 FPO:19742767 SZ:17370/17370/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE real_customer_name: BINARY UNCOMPRESSED DO:0 FPO:19760137 SZ:18033/18033/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE item_category: BINARY UNCOMPRESSED DO:0 FPO:19778170 SZ:917/917/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE prod_model_id: BINARY UNCOMPRESSED DO:0 FPO:19779087 SZ:148421/148421/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE prod_model: BINARY UNCOMPRESSED DO:0 FPO:19927508 SZ:113394/113394/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE job_belong_bu: BINARY UNCOMPRESSED DO:0 FPO:20040902 SZ:16902/16902/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE item_bu: BINARY UNCOMPRESSED DO:0 FPO:20057804 SZ:91023/91023/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE pac_cost: BINARY UNCOMPRESSED DO:0 FPO:20148827 SZ:791340/791340/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE item_category_bg: BINARY UNCOMPRESSED DO:0 FPO:20940167 SZ:673/673/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE dept_name: BINARY UNCOMPRESSED DO:0 FPO:20940840 SZ:7675/7675/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE wdj_creation_date: BINARY UNCOMPRESSED DO:0 FPO:20948515 SZ:574110/574110/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE created_by: BINARY UNCOMPRESSED DO:0 FPO:21522625 SZ:12857/12857/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
4、方案验证
4.1、StarRocks建表
- 选择什么样的StarRocks表模型?
根据BI报表的分析逻辑,主要用于监控日/月维度库龄、帐龄的趋势(影响物料/产成品跌价计提),主要是明细表做分析。故选择StarRocks明细表模型。期望按照Hive的按日分区建表。考虑到Brocker load只支持忽略源表中的字段,不支持目标表多的字段忽略/或者填充默认值的机制。根据探源发现,最早分区2016-06-30与最新分区2022-11-26字段不同(DDL中显示,2022.03.10,2022.04.18增加了三个字段。)可以考虑,03.10前,建立一个StarRocks表,之后建立一个StarRocks表。数据同步到StarRocks后,做数据汇聚。为了简单0310前的存储到一个分区表。
- 如何建StarRocks的表
0310前一个分区表:
drop table if exists bda.bda_inv_item_age_dtl0310; create table if not exists bda.bda_inv_item_age_dtl0310( stat_date date, entr_date date, item_id varchar(10), itm_cd varchar(500), org_id varchar(5), org_cd varchar(500), sub_invtr_cd varchar(20), bch_nbr varchar(50), entr_qty decimal, total_qty decimal, left_qty decimal, alloc_qty decimal, stock_age decimal, item_cost decimal, actual_cost decimal, item_business varchar(500), pch_big_ctg varchar(500), pch_med_ctg varchar(500), pch_sml_ctg varchar(500), ship_customer_name varchar(500), team_bu_name varchar(500), real_customer_name varchar(500), item_category varchar(500), prod_model_id varchar(500), prod_model varchar(500), job_belong_bu varchar(500), item_bu varchar(500), pac_cost decimal, item_category_bg varchar(500) ) DUPLICATE KEY(stat_date,entr_date,item_id) DISTRIBUTED BY HASH(entr_date, item_id) BUCKETS 32;
0310后动态分区
create table if not exists bda.bda_inv_item_age_dtl_part( stat_date date, entr_date date, item_id varchar(10), itm_cd varchar(500), org_id varchar(5), org_cd varchar(500), sub_invtr_cd varchar(20), bch_nbr varchar(50), entr_qty varchar(500), total_qty varchar(500), left_qty varchar(500), alloc_qty varchar(500), stock_age varchar(500), item_cost varchar(500), actual_cost varchar(500), item_business varchar(500), pch_big_ctg varchar(500), pch_med_ctg varchar(500), pch_sml_ctg varchar(500), ship_customer_name varchar(500), team_bu_name varchar(500), real_customer_name varchar(500), item_category varchar(500), prod_model_id varchar(500), prod_model varchar(500), job_belong_bu varchar(500), item_bu varchar(500), pac_cost varchar(500), item_category_bg varchar(500), dept_name varchar(300), wdj_creation_date varchar(200), created_by varchar(200) ) DUPLICATE KEY(stat_date,entr_date,item_id) PARTITION BY RANGE(stat_date)() DISTRIBUTED BY HASH(entr_date, item_id) BUCKETS 32 PROPERTIES( "dynamic_partition.enable" = "true", "dynamic_partition.time_unit" = "DAY", "dynamic_partition.end" = "1", "dynamic_partition.prefix" = "p", "dynamic_partition.buckets" = "32" );
4.1、Brocker 同步任务
- 如何建Brocker load任务
参考文档:FileSystem (Apache Hadoop Main 3.3.4 API)
0310之前分区的抽取任务
LOAD LABEL bda.bda_inv_item_age_dtl0310_before ( DATA INFILE("hdfs://10.21.25.161:8020/user/hive/warehouse/bda.db/bda_inv_item_age_dtl/part_dt=[2016-2021]-*/*") INTO TABLE `bda_inv_item_age_dtl_part` COLUMNS TERMINATED BY "\036" FORMAT AS "parquet" where entr_date is not null ) WITH BROKER broker198 ( "hadoop.security.authentication"="kerberos", "kerberos_principal"="hdfs", "kerberos_keytab"="/opt/StarRocks/kerberos/hdfs.keytab" );
执行发现,依旧报错,字段缺少,通过二分法,逐一排查hdfs分区文件,发现2022-03-10之前的分区,有的是按月分区,有的是按日分区,一些分区的字段各不相同,例如2018年分区情况、2022-02分区情况:
[cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0_2018_12_02 >> hive20181202.sql [cloud@dp-master001 ~]$ cat hive20181202.sql message hive_schema { optional binary stat_date (UTF8); optional binary entr_date (UTF8); optional binary item_id (UTF8); optional binary itm_cd (UTF8); optional binary org_id (UTF8); optional binary org_cd (UTF8); optional binary sub_invtr_cd (UTF8); optional binary bch_nbr (UTF8); optional binary entr_qty (UTF8); optional binary total_qty (UTF8); optional binary left_qty (UTF8); optional binary alloc_qty (UTF8); optional binary stock_age (UTF8); optional binary item_cost (UTF8); optional binary actual_cost (UTF8); optional binary item_business (UTF8); } creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber}) file schema: hive_schema ---------------------------------------------------------------------------------------------------- stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1 entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1 item_id: OPTIONAL BINARY O:UTF8 R:0 D:1 itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 org_id: OPTIONAL BINARY O:UTF8 R:0 D:1 org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1 entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1 item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 item_business: OPTIONAL BINARY O:UTF8 R:0 D:1 row group 1: RC:244224 TS:12749582 ---------------------------------------------------------------------------------------------------- stat_date: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:263/263/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED entr_date: BINARY UNCOMPRESSED DO:0 FPO:267 SZ:354158/354158/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED item_id: BINARY UNCOMPRESSED DO:0 FPO:354425 SZ:746753/746753/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED itm_cd: BINARY UNCOMPRESSED DO:0 FPO:1101178 SZ:1059151/1059151/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED org_id: BINARY UNCOMPRESSED DO:0 FPO:2160329 SZ:52235/52235/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED org_cd: BINARY UNCOMPRESSED DO:0 FPO:2212564 SZ:52222/52222/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED sub_invtr_cd: BINARY UNCOMPRESSED DO:0 FPO:2264786 SZ:279733/279733/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED bch_nbr: BINARY UNCOMPRESSED DO:0 FPO:2544519 SZ:633935/633935/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED entr_qty: BINARY UNCOMPRESSED DO:0 FPO:3178454 SZ:690868/690868/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED total_qty: BINARY UNCOMPRESSED DO:0 FPO:3869322 SZ:100/100/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED left_qty: BINARY UNCOMPRESSED DO:0 FPO:3869422 SZ:686921/686921/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED alloc_qty: BINARY UNCOMPRESSED DO:0 FPO:4556343 SZ:747099/747099/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED stock_age: BINARY UNCOMPRESSED DO:0 FPO:5303442 SZ:345381/345381/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED item_cost: BINARY UNCOMPRESSED DO:0 FPO:5648823 SZ:3316772/3316772/1.00 VC:244224 ENC:RLE,PLAIN,BIT_PACKED actual_cost: BINARY UNCOMPRESSED DO:0 FPO:8965595 SZ:3783958/3783958/1.00 VC:244224 ENC:RLE,PLAIN,BIT_PACKED item_business: BINARY UNCOMPRESSED DO:0 FPO:12749553 SZ:33/33/1.00 VC:244224 ENC:RLE,PLAIN,BIT_PACKED [cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2018-12-15/000000_0 get: `warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2018-12-15/000000_0': No such file or directory [cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2020-02-26/000000_0 [cloud@dp-master001 ~]$ mv 000000_0 000000_0_2020_02_26 [cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0_2020_02_26 >> hive20200226.sql [cloud@dp-master001 ~]$ cat hive20200226.sql message hive_schema { optional binary stat_date (UTF8); optional binary entr_date (UTF8); optional binary item_id (UTF8); optional binary itm_cd (UTF8); optional binary org_id (UTF8); optional binary org_cd (UTF8); optional binary sub_invtr_cd (UTF8); optional binary bch_nbr (UTF8); optional binary entr_qty (UTF8); optional binary total_qty (UTF8); optional binary left_qty (UTF8); optional binary alloc_qty (UTF8); optional binary stock_age (UTF8); optional binary item_cost (UTF8); optional binary actual_cost (UTF8); optional binary item_business (UTF8); optional binary pch_big_ctg (UTF8); optional binary pch_med_ctg (UTF8); optional binary pch_sml_ctg (UTF8); } creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber}) file schema: hive_schema ---------------------------------------------------------------------------------------------------- stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1 entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1 item_id: OPTIONAL BINARY O:UTF8 R:0 D:1 itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 org_id: OPTIONAL BINARY O:UTF8 R:0 D:1 org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1 bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1 entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1 stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1 item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1 item_business: OPTIONAL BINARY O:UTF8 R:0 D:1 pch_big_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1 pch_med_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1 pch_sml_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1 row group 1: RC:242410 TS:9017086 ---------------------------------------------------------------------------------------------------- stat_date: BINARY UNCOMPRESSED DO:0 FPO:4 SZ:263/263/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED entr_date: BINARY UNCOMPRESSED DO:0 FPO:267 SZ:317433/317433/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED item_id: BINARY UNCOMPRESSED DO:0 FPO:317700 SZ:562585/562585/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED itm_cd: BINARY UNCOMPRESSED DO:0 FPO:880285 SZ:900712/900712/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED org_id: BINARY UNCOMPRESSED DO:0 FPO:1780997 SZ:42974/42974/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED org_cd: BINARY UNCOMPRESSED DO:0 FPO:1823971 SZ:42965/42965/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED sub_invtr_cd: BINARY UNCOMPRESSED DO:0 FPO:1866936 SZ:294784/294784/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED bch_nbr: BINARY UNCOMPRESSED DO:0 FPO:2161720 SZ:437366/437366/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED entr_qty: BINARY UNCOMPRESSED DO:0 FPO:2599086 SZ:691103/691103/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED total_qty: BINARY UNCOMPRESSED DO:0 FPO:3290189 SZ:100/100/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED left_qty: BINARY UNCOMPRESSED DO:0 FPO:3290289 SZ:707584/707584/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED alloc_qty: BINARY UNCOMPRESSED DO:0 FPO:3997873 SZ:754179/754179/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED stock_age: BINARY UNCOMPRESSED DO:0 FPO:4752052 SZ:310402/310402/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED item_cost: BINARY UNCOMPRESSED DO:0 FPO:5062454 SZ:33/33/1.00 VC:242410 ENC:RLE,PLAIN,BIT_PACKED actual_cost: BINARY UNCOMPRESSED DO:0 FPO:5062487 SZ:3631076/3631076/1.00 VC:242410 ENC:RLE,PLAIN,BIT_PACKED item_business: BINARY UNCOMPRESSED DO:0 FPO:8693563 SZ:67646/67646/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED pch_big_ctg: BINARY UNCOMPRESSED DO:0 FPO:8761209 SZ:63343/63343/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED pch_med_ctg: BINARY UNCOMPRESSED DO:0 FPO:8824552 SZ:83151/83151/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED pch_sml_ctg: BINARY UNCOMPRESSED DO:0 FPO:8907703 SZ:109387/109387/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED [cloud@dp-master001 ~]$
这就导致了同步的数据导入的难度,需要检查每个分区的schema。为什么会出现这种情况呢?出现了这种情况如何快速同步呢?
0310以后的分区抽取任务
LOAD LABEL bda.bda_inv_item_age_dtl0310_after ( DATA INFILE("hdfs://10.21.25.161:8020/user/hive/warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-[03-12]-*/*") INTO TABLE `bda_inv_item_age_dtl_part` COLUMNS TERMINATED BY "\036" FORMAT AS "parquet" where entr_date is not null ) WITH BROKER broker198 ( "hadoop.security.authentication"="kerberos", "kerberos_principal"="hdfs", "kerberos_keytab"="/opt/StarRocks/kerberos/hdfs.keytab" );
最终方案思考:
方案1:
修改StarRocks源码,支持StarRocks目标表字段多于原表字段时,填充默认值。
方案2:
- 使用python 读取hdfs 分区的schema,根据schema建立StarRocks分区临时表
- 各分区导入成功后,自动合并成StarRocks目标表
方案3:
- 使用DataX同步Hive表,每个分区配置一个同步作业。有点不用考虑表字段问题。但是分区较多时,配置同步任务也比较麻烦。可以使用脚本自动生成。如果数据量较大,同步效率也不如Brocks Load方式。