探索ClickHouse——使用Projection加速查询

news2024/11/26 20:32:14

在测试Projection之前,我们需要先创建一张表,并导入大量数据。
我们可以直接使用指令,从URL指向的文件中获取内容并导入表。但是担心网络不稳定,我们先将文件下载下来。

下载文件

wget wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv .

检查文件

wc -l pp-complete.csv 

28497127 pp-complete.csv

ll pp-complete.csv

-rw-rw-r-- 1 fangliang fangliang 4982107267 Aug 29 05:13 pp-complete.csv

即这个文件约有2850万行,占4个多G磁盘。

移动文件

su root
cp pp-complete.csv /var/lib/clickhouse/user_files/
exit

创建表

查看文件

使用下面指令查看文件内容

head -10 pp-complete.csv 
"{F887F88E-7D15-4415-804E-52EAC2F10958}","70000","1995-07-07 00:00","MK15 9HP","D","N","F","31","","ALDRICH DRIVE","WILLEN","MILTON KEYNES","MILTON KEYNES","MILTON KEYNES","A","A"
"{40FD4DF2-5362-407C-92BC-566E2CCE89E9}","44500","1995-02-03 00:00","SR6 0AQ","T","N","F","50","","HOWICK PARK","SUNDERLAND","SUNDERLAND","SUNDERLAND","TYNE AND WEAR","A","A"
"{7A99F89E-7D81-4E45-ABD5-566E49A045EA}","56500","1995-01-13 00:00","CO6 1SQ","T","N","F","19","","BRICK KILN CLOSE","COGGESHALL","COLCHESTER","BRAINTREE","ESSEX","A","A"
"{28225260-E61C-4E57-8B56-566E5285B1C1}","58000","1995-07-28 00:00","B90 4TG","T","N","F","37","","RAINSBROOK DRIVE","SHIRLEY","SOLIHULL","SOLIHULL","WEST MIDLANDS","A","A"
"{444D34D7-9BA6-43A7-B695-4F48980E0176}","51000","1995-06-28 00:00","DY5 1SA","S","N","F","59","","MERRY HILL","BRIERLEY HILL","BRIERLEY HILL","DUDLEY","WEST MIDLANDS","A","A"
"{AE76CAF1-F8CC-43F9-8F63-4F48A2857D41}","17000","1995-03-10 00:00","S65 1QJ","T","N","L","22","","DENMAN STREET","ROTHERHAM","ROTHERHAM","ROTHERHAM","SOUTH YORKSHIRE","A","A"
"{709FB471-3690-4945-A9D6-4F48CE65AAB6}","58000","1995-04-28 00:00","PE7 3AL","D","Y","F","4","","BROOK LANE","FARCET","PETERBOROUGH","PETERBOROUGH","CAMBRIDGESHIRE","A","A"
"{5FA8692E-537B-4278-8C67-5A060540506D}","19500","1995-01-27 00:00","SK10 2QW","T","N","L","38","","GARDEN STREET","MACCLESFIELD","MACCLESFIELD","MACCLESFIELD","CHESHIRE","A","A"
"{E78710AD-ED1A-4B11-AB99-5A0614D519AD}","20000","1995-01-16 00:00","SA6 5AY","D","N","F","592","","CLYDACH ROAD","YNYSTAWE","SWANSEA","SWANSEA","SWANSEA","A","A"
"{1DFBF83E-53A7-4813-A37C-5A06247A09A8}","137500","1995-03-31 00:00","NR2 2NQ","D","N","F","26","","LIME TREE ROAD","NORWICH","NORWICH","NORWICH","NORFOLK","A","A"

使用客户端连接服务端

clickhouse-client

创建表

CREATE TABLE uk_price_paid ( price UInt32, date Date, postcode1 LowCardinality(String), postcode2 LowCardinality(String), type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0), is_new UInt8, duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0), addr1 String, addr2 String, street LowCardinality(String), locality LowCardinality(String), town LowCardinality(String), district LowCardinality(String), county LowCardinality(String) ) ENGINE = MergeTree ORDER BY (postcode1, postcode2, addr1, addr2);

导入数据

INSERT INTO uk_price_paid WITH splitByChar(' ', postcode) AS p SELECT toUInt32(price_string) AS price, parseDateTimeBestEffortUS(time) AS date, p[1] AS postcode1, p[2] AS postcode2, transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type, b = 'Y' AS is_new, transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county FROM file( 'pp-complete.csv', 'CSV', 'uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String' );

在这里插入图片描述
整个处理速度大概是210 thousand rows/s,36.5MB/s。

INSERT INTO uk_price_paid WITH splitByChar(’ ', postcode) AS p
SELECT
toUInt32(price_string) AS price,
parseDateTimeBestEffortUS(time) AS date,
p[1] AS postcode1,
p[2] AS postcode2,
transform(a, [‘T’, ‘S’, ‘D’, ‘F’, ‘O’], [‘terraced’, ‘semi-detached’, ‘detached’, ‘flat’, ‘other’]) AS type,
b = ‘Y’ AS is_new,
transform(c, [‘F’, ‘L’, ‘U’], [‘freehold’, ‘leasehold’, ‘unknown’]) AS duration,
addr1,
addr2,
street,
locality,
town,
district,
county
FROM file(‘pp-complete.csv’, ‘CSV’, ‘uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String’)
Query id: 32a2a670-8417-470d-ab26-6368dd1725e5
Ok.
0 rows in set. Elapsed: 140.063 sec. Processed 28.50 million rows, 4.98 GB (203.46 thousand rows/s., 35.57 MB/s.)

检查数据

检查数据行数

SELECT count() From uk_price_paid;

SELECT count()
FROM uk_price_paid
Query id: 2d05b3f1-c683-4f2d-bcaf-e05b777eb3f8
┌──count()───┐
│ 28497127 │
└──────────┘
1 row in set. Elapsed: 0.005 sec.

一共有28,497,127行数据,和文件中行数一致。

检查所占磁盘

SELECT formatReadableSize(total_bytes) FROM system.tables WHERE name = 'uk_price_paid';

SELECT formatReadableSize(total_bytes)
FROM system.tables
WHERE name = ‘uk_price_paid’
Query id: 7cca5694-6d15-4f38-8f8d-ef8331a4caa3
┌─formatReadableSize(total_bytes)─┐
│ 308.18 MiB │
└──────────────────────┘
1 row in set. Elapsed: 0.007 sec.

和之前文件4G多大小对比,减少了9/10,这个比例是相当大的。

查询

SELECT toYear(date), district, town, avg(price), sum(price), count() FROM uk_price_paid  GROUP BY toYear(date), district, town;

80441 rows in set. Elapsed: 2.114 sec. Processed 28.50 million rows, 284.78 MB (13.48 million rows/s., 134.71 MB/s.)

新增PROJECTION

使用下面指令给toYear(date), district, town创建一个PROJECTION ,这样之后插入的数据就会被自动优化。

ALTER TABLE uk_price_paid ADD PROJECTION projection_by_year_district_town(SELECT toYear(date), district, town, avg(price), sum(price), count() GROUP BY toYear(date), district, town);

ALTER TABLE uk_price_paid
ADD PROJECTION projection_by_year_district_town
(
SELECT
toYear(date),
district,
town,
avg(price),
sum(price),
count()
GROUP BY
toYear(date),
district,
town
)
Query id: 3c5ca13e-4805-412c-845a-ab18c411261c
Ok.
0 rows in set. Elapsed: 0.007 sec.

然后使用下面指令修改现有数据

ALTER TABLE uk_price_paid MATERIALIZE PROJECTION projection_by_year_district_town SETTINGS mutations_sync = 1;

ALTER TABLE uk_price_paid
MATERIALIZE PROJECTION projection_by_year_district_town
SETTINGS mutations_sync = 1
Query id: 7bd22c05-c74c-4972-be6d-174eaf99c498
Ok.
0 rows in set. Elapsed: 0.183 sec.

优化后查询

80441 rows in set. Elapsed: 0.170 sec. Processed 92.93 thousand rows, 5.76 MB (548.06 thousand rows/s., 33.98 MB/s.)

可以看到时间也缩短到未优化的1/10。

参考资料

  • https://clickhouse.com/docs/zh/getting-started/example-datasets/uk-price-paid

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1042572.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

OpenHarmony应用模型的构成要素与Stage优势

一、应用模型的构成要素 应用模型是OpenHarmony为开发者提供的应用程序所需能力的抽象提炼,它提供了应用程序必备的组件和运行机制。有了应用模型,开发者可以基于一套统一的模型进行应用开发,使应用开发更简单、高效。 二、Stage主推模型优势…

多输入多输出 | MATLAB实现PSO-LSSVM粒子群优化最小二乘支持向量机多输入多输出

多输入多输出 | MATLAB实现PSO-LSSVM粒子群优化最小二乘支持向量机多输入多输出 目录 多输入多输出 | MATLAB实现PSO-LSSVM粒子群优化最小二乘支持向量机多输入多输出预测效果基本介绍程序设计往期精彩参考资料 预测效果 基本介绍 MATLAB实现PSO-LSSVM粒子群优化最小二乘支持向…

layuiselect设置为不可下拉选取

$("#exam").siblings(".layui-form-select").find("dl").remove(); 或 layuiSelectDisable($("#exam")); // 设置selet元素不可下拉选择function layuiSelectDisable(selectElem) {try {var dlElem selectElem.siblings(".layu…

华为云云耀云服务器L实例评测|云耀云服务器L实例部署Gitblit服务器

华为云云耀云服务器L实例评测|云耀云服务器L实例部署Gitblit服务器 一、云耀云服务器L实例介绍1.1 云耀云服务器L实例简介1.2 云耀云服务器L实例特点 二、Gitblit介绍2.1 Gitblit简介2.2 Gitblit特点 三、本次实践介绍3.1 本次实践简介3.2 本次环境规划 四、检查服务…

HarmonyOS CPU与I/O密集型任务开发指导

一、CPU密集型任务开发指导 CPU密集型任务是指需要占用系统资源处理大量计算能力的任务,需要长时间运行,这段时间会阻塞线程其它事件的处理,不适宜放在主线程进行。例如图像处理、视频编码、数据分析等。 基于多线程并发机制处理CPU密集型任务…

【北亚企安数据恢复】Ceph存储介绍Ceph数据恢复流程

Ceph存储基本架构: Ceph存储可分为块存储,对象存储和文件存储。Ceph基于对象存储,对外提供三种存储接口,故称为统一存储。 Ceph的底层是RADOS(分布式对象存储系统),RADOS由两部分组成:OSD和MON。 MON负责监…

vue项目开发环境工具-node

最近在开始接触做vue框架的前端项目,以前用的前端比如html,js,css等都是比较原生的,写好后直接浏览器打开就行。但vue跟java一样是需要编译的,和微信小程序类似。今天就先记录一下vue的开发运行搭建。所需工具如下 nod…

从小白到大咖:软件测试工作半年心得分享!总结我掉的4个坑…

从事软件测试工作已经半年多了,刚入职的时候还是一个缺乏实际经验的小白,而现在拿到需求之后也能比较快速地熟悉业务并顺利开展测试,虽然不能说掌握了很多技能,但是相比之前也是有不少收获的,在这个过程中我总结了一点…

spring-cloud-alibaba-dubbo-issues1805修复

spring-cloud-alibaba-dubbo-issues1805修复 文章目录 [toc] 1.官方信息2.版本代码对比3.修改尝试4.验证5.总结 这个issue就是我这前写了那两篇文章的那个issue Dubbo重启服务提供者或先启动服务消费者后启动服务提供者,消费者有时候会出现找不到服务的问题及解决 …

2023年第二届HiPChips解读

The 2nd International Workshop on High Performance Chiplet and Interconnnect Architectures (HiPChips) 主题 Optical and other advanced chiplet interconnect technologiesInterconnect standards of coherent and non-coherent data sharing protocols (e.g. CXL)D…

美妆护肤品经营小程序商城的作用是什么

美妆护肤品市场非常庞大,属于长期消耗品,对厂家或经营者来说,本质上市场并不缺购买的人,但由于品牌/同行众多,加之消费升级客户线上购物消费,因此传统线下商家发展困难。 但通过线上经营又受制于第三方平台…

9.25 day 2

1. 简述方法重写与方法重载的意义与区别: 方法重写: 1.参数列表必须完全与被重写方法相同 //参数列表(分为四种): (1)无参无返回值方法; (2)有参无返回…

顺风车软件搭建流程:数字化出行与社会共享的创新

随着移动互联网的快速发展,顺风车软件作为一种新型出行方式逐渐流行起来。本文将介绍顺风车软件搭建的流程,包括需求分析、技术架构设计、用户体验优化以及安全性保障。通过深入思考数字化出行与社会共享的关系,为读者呈现一个专业、有逻辑性…

【Synapse数据集】Synapse数据集介绍和预处理,数据集下载网盘链接

【Segment Anything Model】做分割的专栏链接,欢迎来学习。 【博主微信】cvxiaoyixiao 本专栏为公开数据集的介绍和预处理,持续更新中。 文章目录 1️⃣Synapse数据集介绍文件结构源文件样图文件内容 2️⃣Synapse数据集百度网盘下载链接官网下载登录下…

RocketMQ —消费进度管理

Apache RocketMQ 通过消费位点管理消费进度,本文为您介绍 Apache RocketMQ 的消费进度管理机制。 背景信息​ Apache RocketMQ 的生产者和消费者在进行消息收发时,必然会涉及以下场景,消息先生产后订阅或先订阅后生产。这两种场景下&#x…

【springboot3.x 记录】关于spring-cloud-gateway引入openfeign导致的循环依赖问题

最近升级springboot3真是一挖一个坑,又给我发现了 spring-cloud-gateway 引入 openfeign 会导致循环依赖异常,特此记录一下这个坑 一、发现问题 网关里面有一个全局的过滤器,因为要查询一些配置信息,目前是通过 feign client 的方…

中秋《乡村振兴战略下传统村落文化旅游设计》许少辉八月新书——2023学生思乡季辉少许

中秋《乡村振兴战略下传统村落文化旅游设计》许少辉八月新书——2023学生思乡季辉少许 中秋《乡村振兴战略下传统村落文化旅游设计》许少辉八月新书——2023学生思乡季辉少许

【C++ • STL】探究string的源码

文章目录 一、深浅拷贝二、传统版写法的string类(简单)三、string类的模拟实现四、现代版写法的string类五、总结 ヾ(๑╹◡╹)ノ" 人总要为过去的懒惰而付出代价ヾ(๑╹◡╹)ノ" 一、深浅拷贝 浅拷贝:也称位…

数据结构与算法之HashBitMap

一:引入 1.Hash扩容算法在多线程情况有什么问题? 2.如何在3亿个整数(0~2亿)中判断某一个数是否存在?内存限制500M,一台机器。 分治: 布隆过滤器:神器 Redis Hash: 开3亿个空间&#…

[Go 夜读 第 148 期] Excelize 构建 WebAssembly 版本跨语言支持实践

Excelize 是 Go 语言编写的用于操作电子表格文档的基础库,支持 XLAM / XLSM / XLSX / XLTM / XLTX 等多种文档格式,高度兼容带有样式、图片 (表)、透视表、切片器等复杂组件的文档,并提供流式读写支持,用于处理包含大规模数据的工…