MYSQL PARTITIONING: Partition Operations and Performance Testing

TO PARTITION OR NOT TO PARTITION IN MYSQL

Bill Karwin says: “In most circumstances, you’re better off using indexes instead of partitioning as your main method of query optimization.”
According to Rick James: “It is so tempting to believe that PARTITIONing will solve performance problems. But it is so often wrong.”
Let’s find out what’s going on by building a test case.

TWO TABLES READY

How many partitions? Rick James’s advice: “Have 20-50 partitions; no more.”
The test on this page uses 10 partitions.
Remember: always test your real case.

Partitioned table with 10 partitions

CREATE TABLE points_partition 
(id INT NOT NULL AUTO_INCREMENT,
 x FLOAT,
 y FLOAT,
 z FLOAT,
 created_time DATETIME,
 PRIMARY KEY(id, created_time))
PARTITION BY RANGE( YEAR(created_time) ) (
    PARTITION p16 VALUES LESS THAN (2016),
    PARTITION p17 VALUES LESS THAN (2017),
    PARTITION p18 VALUES LESS THAN (2018),
    PARTITION p19 VALUES LESS THAN (2019),
    PARTITION p20 VALUES LESS THAN (2020),
    PARTITION p21 VALUES LESS THAN (2021),
    PARTITION p22 VALUES LESS THAN (2022),
    PARTITION p23 VALUES LESS THAN (2023),
    PARTITION p24 VALUES LESS THAN (2024),
    PARTITION p25 VALUES LESS THAN (2025)
);
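Typing ten nearly identical partition lines by hand is error-prone, so here is a minimal sketch that generates the same PARTITION BY RANGE clause. The helper name range_partition_clause is made up for this page, and the pYY naming simply copies the scheme used in the table above.

```python
def range_partition_clause(column, first_year, last_year):
    """Build a PARTITION BY RANGE( YEAR(column) ) clause, one partition per year bound."""
    parts = [
        # Partition pYY holds rows whose year is strictly less than the bound 20YY.
        f"    PARTITION p{bound % 100:02d} VALUES LESS THAN ({bound})"
        for bound in range(first_year, last_year + 1)
    ]
    return f"PARTITION BY RANGE( YEAR({column}) ) (\n" + ",\n".join(parts) + "\n)"


clause = range_partition_clause("created_time", 2016, 2025)
print(clause)
```

The printed clause can be pasted after the column definitions of the CREATE TABLE statement. Note that no partition accepts rows from 2025 onward, exactly as in the original definition.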

Normal table

CREATE TABLE points_full_table 
(id INT NOT NULL AUTO_INCREMENT,
 x FLOAT,
 y FLOAT,
 z FLOAT,
 created_time DATETIME,
 PRIMARY KEY(id, created_time));

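Create millions of rows

For the test case, each table holds 10 million rows.
If you insert with plain SQL, example 2 is better than example 1: one multi-row INSERT saves the round trip and per-statement overhead that every single-row INSERT pays.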

-- sql example 1
INSERT INTO `table1` (`field1`, `field2`) VALUES ("data1", "data2");
INSERT INTO `table1` (`field1`, `field2`) VALUES ("data1", "data2");
INSERT INTO `table1` (`field1`, `field2`) VALUES ("data1", "data2");
-- sql example 2
INSERT INTO `table1` (`field1`, `field2`) VALUES ("data1", "data2"),
                                                 ("data1", "data2"),
                                                 ("data1", "data2");
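As a rough sketch of how example 2 could be produced from a script, the helper below builds one parameterized multi-row INSERT per chunk of rows. The names chunked and multi_row_insert are invented for illustration, and the %s placeholder style assumes a driver such as pymysql or mysql-connector.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows`, each with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def multi_row_insert(table, columns, rows):
    """Return (sql, params) for a single INSERT statement that carries every row."""
    column_list = ", ".join(f"`{name}`" for name in columns)
    one_row = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = (f"INSERT INTO `{table}` ({column_list}) VALUES "
           + ", ".join([one_row] * len(rows)))
    # Flatten the rows so the values line up with the placeholders, left to right.
    params = [value for row in rows for value in row]
    return sql, params


rows = [("data1", "data2")] * 3
sql, params = multi_row_insert("table1", ["field1", "field2"], rows)
print(sql)
```

For 10 million rows you would call multi_row_insert once per chunk (for example a few thousand rows at a time) rather than once for everything, so that a single statement does not grow past the server's packet limit.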

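Add large data with a script

import random

from faker import Faker


def insert_large_data(connection, nums=10, table="points_partition"):
    """Insert `nums` random rows through an open DB-API connection (e.g. pymysql)."""
    fake = Faker()

    # One (x, y, z, created_time) tuple per row; timestamps fall within the last 10 years.
    data = [(random.random(), random.random(), random.random(),
             str(fake.date_time_between(start_date='-10y', end_date='now')))
            for _ in range(nums)]

    # Pass table="points_full_table" to fill the normal table the same way.
    sql = f"INSERT INTO {table} (x, y, z, created_time) VALUES (%s, %s, %s, %s)"

    # executemany hands all rows to the driver in one call; swap in your own tool if you prefer.
    cursor = connection.cursor()
    cursor.executemany(sql, data)
    connection.commit()
    cursor.close()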

DB-status

The partitioned table needs extra files to store its data (one per partition with InnoDB file-per-table tablespaces), and it also takes extra disk space.

TEST RESULTS WITHOUT EXTRA INDEX (created_time)

test-1

select SQL_NO_CACHE * from sample.points_partition where created_time > '2024-01-01' limit 100;

FROM: explain

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	points_partition	p25	ALL					911625	33.33	Using where

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	points_full_table		ALL					9747207	33.33	Using where

FROM: mysqlslap

# partition_table
Benchmark
	Running for engine innodb
	Average number of seconds to run all queries: 0.156 seconds
	Minimum number of seconds to run all queries: 0.156 seconds
	Maximum number of seconds to run all queries: 0.156 seconds
	Number of clients running queries: 10
	Average number of queries per client: 10
# full_table
Benchmark
	Running for engine innodb
	Average number of seconds to run all queries: 0.172 seconds
	Minimum number of seconds to run all queries: 0.172 seconds
	Maximum number of seconds to run all queries: 0.172 seconds
	Number of clients running queries: 10
	Average number of queries per client: 10

In general, fewer touched rows should mean less time for query execution.
But this query only asks for a limited number of rows that match the condition, in no particular order, so the MySQL optimizer does a good job on both tables.
The worst case for the normal table is a full table scan, but because the matching rows are scattered randomly through the data, the scan finds 100 of them and stops long before the end.
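The page does not show the exact mysqlslap command line, so the sketch below is only a guess at an equivalent invocation. The client count, queries per client and single iteration are inferred from the benchmark output above, and the user name root is a placeholder. The options themselves (--concurrency, --number-of-queries, --iterations, --create-schema, --no-drop, --query) are standard mysqlslap options.

```python
import shlex

QUERY = ("select SQL_NO_CACHE * from sample.points_partition "
         "where created_time > '2024-01-01' limit 100;")


def mysqlslap_command(query, user="root", schema="sample",
                      clients=10, queries_per_client=10):
    """Build an argument list for mysqlslap; nothing is executed here."""
    return [
        "mysqlslap",
        f"--user={user}",          # placeholder account, adjust to your setup
        "--password",              # prompt for the password interactively
        f"--create-schema={schema}",
        "--no-drop",               # never drop the schema after the run
        f"--concurrency={clients}",
        # mysqlslap takes the total query count and splits it across the clients.
        f"--number-of-queries={clients * queries_per_client}",
        "--iterations=1",
        f"--query={query}",
    ]


command = mysqlslap_command(QUERY)
print(shlex.join(command))
```

Swap points_partition for points_full_table in QUERY to benchmark the normal table, or hand the list to subprocess.run to execute it directly.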

However, if we add an ORDER BY to the query, things change dramatically.

test-2

select SQL_NO_CACHE * from sample.points_partition where created_time > '2024-01-01' order by created_time limit 100;

FROM: explain

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	points_partition	p25	ALL					911625	33.33	Using where; Using filesort

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	points_full_table		ALL					9747207	33.33	Using where; Using filesort

FROM: mysqlslap

# partition table
Benchmark
	Running for engine innodb
	Average number of seconds to run all queries: 4.931 seconds
	Minimum number of seconds to run all queries: 4.931 seconds
	Maximum number of seconds to run all queries: 4.931 seconds
	Number of clients running queries: 10
	Average number of queries per client: 10
# full table
Benchmark
	Running for engine innodb
	Average number of seconds to run all queries: 54.652 seconds
	Minimum number of seconds to run all queries: 54.652 seconds
	Maximum number of seconds to run all queries: 54.652 seconds
	Number of clients running queries: 10
	Average number of queries per client: 10

A huge time gap between the two queries.
What’s going on?
With ORDER BY and no index on created_time,
the normal table has to scan the whole table and sort every matching row before it can return the first 100, and that costs a lot.
The partitioned table only has to scan and sort the one partition that was selected for the date range.
We always say: test your real case. This is how you find out whether your circumstances call for a partitioned table.
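To make the "locate the target partition first" step concrete, here is a small simulation of how a year maps to a partition under the VALUES LESS THAN bounds of points_partition. It only mimics the idea of partition pruning and is not MySQL's actual implementation. The function names are made up for this sketch.

```python
from bisect import bisect_right

# (name, VALUES LESS THAN bound) pairs copied from the points_partition definition.
PARTITIONS = [(f"p{year % 100:02d}", year) for year in range(2016, 2026)]


def partition_for_year(year):
    """Return the partition that stores rows whose YEAR(created_time) equals `year`."""
    bounds = [bound for _, bound in PARTITIONS]
    index = bisect_right(bounds, year)  # position of the first bound greater than year
    if index == len(PARTITIONS):
        raise ValueError(f"no partition accepts year {year}")
    return PARTITIONS[index][0]


def partitions_for_years_from(first_year):
    """Partitions that can hold rows with YEAR(created_time) >= first_year."""
    return [name for name, bound in PARTITIONS if bound > first_year]


print(partition_for_year(2024))
print(partitions_for_years_from(2024))
```

A filter such as created_time > '2024-01-01' can only match rows from 2024 onward, so only p25 has to be read and sorted. That matches the partitions column in the EXPLAIN output above.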

WHY: “In most circumstances, you’re better off using indexes instead of partitioning”

The tests are not done yet.
In the MySQL EXPLAIN output, the Extra column shows the message “Using filesort”.
Normally that is a sign you should consider an index to improve performance: MYSQL: explain-extra-information

Let’s add an index.

ALTER TABLE `points_partition` ADD INDEX `created_time_index` (`created_time`);
ALTER TABLE `points_full_table` ADD INDEX `created_time_index` (`created_time`);
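Why should an index on created_time help with ORDER BY ... LIMIT? A toy model in plain Python (not MySQL internals): a sorted list stands in for the index, random floats stand in for created_time values, and the cutoff and limit are arbitrary. Without the index, every row is examined and the matches are sorted. With it, a binary search finds the start position and the next 100 entries are already in order.

```python
import bisect
import random

random.seed(0)
rows = [random.random() for _ in range(100_000)]  # stand-in for created_time values
cutoff, limit = 0.9, 100

# Without an index: examine every row, keep the matches, then sort them
# (the work that shows up as "Using filesort").
without_index = sorted(value for value in rows if value > cutoff)[:limit]

# With an index: the values are kept in sorted order, built once like ADD INDEX.
index = sorted(rows)
start = bisect.bisect_right(index, cutoff)  # first entry greater than the cutoff
with_index = index[start:start + limit]     # already ordered, no sort needed

print(with_index == without_index)
```

Both approaches return the same 100 values, but the indexed lookup reads only 100 entries after the binary search, however large the table is.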

TEST RESULTS WITH INDEX

test-3

select SQL_NO_CACHE * from sample.points_partition where created_time > '2024-01-01' limit 100;

FROM: explain

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	points_partition	p25	range	created_time_index	created_time_index	5		455812	100.00	Using index condition

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	points_full_table		range	created_time_index	created_time_index	5		2641784	100.00	Using index condition; Using MRR

FROM: mysqlslap

# partition table
Benchmark
	Running for engine innodb
	Average number of seconds to run all queries: 0.168 seconds
	Minimum number of seconds to run all queries: 0.168 seconds
	Maximum number of seconds to run all queries: 0.168 seconds
	Number of clients running queries: 10
	Average number of queries per client: 10
# full table
Benchmark
	Running for engine innodb
	Average number of seconds to run all queries: 0.368 seconds
	Minimum number of seconds to run all queries: 0.368 seconds
	Maximum number of seconds to run all queries: 0.368 seconds
	Number of clients running queries: 10
	Average number of queries per client: 10

Again: in general, fewer touched rows should mean less time for query execution.
Both queries take longer than they did without the extra index (0.168 s vs 0.156 s for the partitioned table, 0.368 s vs 0.172 s for the normal table).
What happened? EXPLAIN shows that “Using index condition” applies here, so the new index is being used.
But stop here: this is not the kind of query indexes are introduced for.
Sometimes an index does not help when the goal is just to retrieve any 100 matching rows. This is the worst case, not the general rule.

Let’s add an “order by” to see the magic.

test-4

select SQL_NO_CACHE * from sample.points_partition where created_time > '2024-01-01' order by created_time limit 100;

FROM: explain

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	points_partition	p25	range	created_time_index	created_time_index	5		455812	100.00	Using index condition

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	points_full_table		range	created_time_index	created_time_index	5		2641784	100.00	Using index condition

FROM: mysqlslap

# partition table
Benchmark
	Running for engine innodb
	Average number of seconds to run all queries: 0.162 seconds
	Minimum number of seconds to run all queries: 0.162 seconds
	Maximum number of seconds to run all queries: 0.162 seconds
	Number of clients running queries: 10
	Average number of queries per client: 10
# full table
Benchmark
	Running for engine innodb
	Average number of seconds to run all queries: 0.185 seconds
	Minimum number of seconds to run all queries: 0.185 seconds
	Maximum number of seconds to run all queries: 0.185 seconds
	Number of clients running queries: 10
	Average number of queries per client: 10

The touched rows are the same as without “order by”,
but the query times of the two tables are now really close.
So it makes sense: “In this circumstance, you’re better off using indexes instead of partitioning”.
After all, partitioning also affects other types of queries, and maintaining partitions is a big job in itself.
For example, select count(*) is much slower on a partitioned table unless you count a single partition.

More tests?
Let’s stop here.

Comparison table (better view)

key/type	partition	normal	partition+order	normal+order	partition+index	normal+index	partition+order+index	normal+order+index
diskspace	~590m	~540m	~590m	~540m	~750m	~700m	~750m	~700m
mysqlslap-benchmark	0.156s	0.172s	4.931s	54.652s	0.168s	0.368s	0.162s	0.185s
mysql-explain-touched-rows	911625	9747207	911625	9747207	455812	2641784	455812	2641784
index	/	/	/	/	created_time_index	created_time_index	created_time_index	created_time_index
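To read the benchmark row more easily, the snippet below computes how many times slower the normal table was than the partitioned one in each scenario. The timings are copied from the table above; the labels are just shorthand.

```python
# (partitioned table seconds, normal table seconds) from the mysqlslap-benchmark row.
results = {
    "no index, no order by": (0.156, 0.172),
    "no index, order by": (4.931, 54.652),
    "index, no order by": (0.168, 0.368),
    "index, order by": (0.162, 0.185),
}

# A ratio above 1 means the partitioned table was faster.
speedup = {case: normal / partition
           for case, (partition, normal) in results.items()}

for case, ratio in speedup.items():
    print(f"{case}: normal table took {ratio:.2f}x the partitioned time")
```

The only scenario with a gap of an order of magnitude is ORDER BY without an index. Once the index exists, the two tables are within a small factor of each other.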

CONCLUSIONS FROM THE TESTS (mysqlslap & MySQL Workbench)

An index works well without partitioning, and in most cases even better
For range queries on the partitioning column, partitioned tables do work well
Dropping a partition is much more efficient than a big DELETE
Queries that target a specific partition perform better
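The last two points map to MySQL statements that were not part of the tests above: ALTER TABLE ... DROP PARTITION and explicit partition selection with PARTITION (...). The helpers below only build the SQL text for the points_partition table. Their names are invented for this sketch, and the statements have not been benchmarked here.

```python
def drop_partition_sql(table, partition):
    """Remove a whole partition and all of its rows in one metadata-level operation."""
    return f"ALTER TABLE {table} DROP PARTITION {partition};"


def delete_before_year_sql(table, year):
    """Row-by-row alternative: delete everything created before the given year."""
    return f"DELETE FROM {table} WHERE created_time < '{year}-01-01';"


def select_from_partition_sql(table, partition, what="*"):
    """Read from one named partition only (explicit partition selection)."""
    return f"SELECT {what} FROM {table} PARTITION ({partition});"


# p16 holds every row with YEAR(created_time) < 2016, so these two remove the same rows.
print(drop_partition_sql("points_partition", "p16"))
print(delete_before_year_sql("points_partition", 2016))
# Counting a single partition instead of the whole table.
print(select_from_partition_sql("points_partition", "p25", "COUNT(*)"))
```

Keep in mind that DROP PARTITION also removes the partition definition itself, so it suits retiring old data rather than emptying a partition you still need.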