文章目录
- 一、问题背景
- 二、字符集PAD_ATTRIBUTE属性(补齐属性)
- 2.2、PAD SPACE与NO PAD的具体意义
- 三、CHAR类型尾部空格的处理
- 四、其他问题
- 4.1、在PAD SPACE属性时如何实现精准查询
- 五、总结
以下内容基于MySQL8.0进行讲解
一、问题背景
一次查询中发现查询出来的内容结尾有空格,跟我预期的结果不一致,由此引发出对本篇的整理。
案例如下:
CREATE TABLE `user` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL COMMENT '姓名',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1;
insert into user values
(null,'123456789 '), -- 数字后面有9个空格
(null,'张三 '), -- 名字后面有3个空格
(null,'李四 '),-- 名字后面有3个空格
(null,'王五')
;
以上内容可以看出
- 当结尾是由于空格导致的超长插入时不会报错,会自动截取掉多余的空格。
- 当查询时尾部有空格时,也能匹配上。
以下我们就研究一下MySQL尾部空格处理与哪些设置有关
二、字符集PAD_ATTRIBUTE属性(补齐属性)
先给出结论: MySQL尾部空格处理与字符集有关
,具体的是看字符集的Pad Attributes属性
information_schema> SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4' and COLLATION_NAME='utf8mb4_general_ci';
+--------------------+--------------------+----+------------+-------------+---------+---------------+
| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE |
+--------------------+--------------------+----+------------+-------------+---------+---------------+
| utf8mb3_general_ci | utf8mb3 | 33 | Yes | Yes | 1 | PAD SPACE |
+--------------------+--------------------+----+------------+-------------+---------+---------------+
1 row in set
information_schema> SELECT * FROM INFORMATION_SCHEMA.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8mb4' and COLLATION_NAME='utf8mb4_0900_ai_ci';
+--------------------+--------------------+-----+------------+-------------+---------+---------------+
| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE |
+--------------------+--------------------+-----+------------+-------------+---------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
+--------------------+--------------------+-----+------------+-------------+---------+---------------+
1 row in set
可以看到二者在 PAD_ATTRIBUTE 的上差别。
MySQL的排序规则有一个属性Pad Attributes属性,这个属性的设置会影响数据库如何处理尾部空格(是否忽略尾部空格),如下官方文档描述
Collation Pad Attributes
Collations based on UCA 9.0.0 and higher are faster than collations based on UCA versions prior to 9.0.0. They also have a pad attribute of NO PAD, in contrast to PAD SPACE as used in collations based on UCA versions prior to 9.0.0. For comparison of nonbinary strings, NO PAD collations treat spaces at the end of strings like any other character (see Trailing Space Handling in Comparisons).
Comparison of nonbinary string values (CHAR, VARCHAR, and TEXT) that have a NO PAD collation differ from other collations with respect to trailing spaces. For example, 'a' and 'a ' compare as different strings, not the same string. This can be seen using the binary collations for utf8mb4. The pad attribute for utf8mb4_bin is PAD SPACE, whereas for utf8mb4_0900_bin it is NO PAD. Consequently, operations involving utf8mb4_0900_bin do not add trailing spaces, and comparisons involving strings with trailing spaces may differ for the two collations
官方文档,关于比较中尾部空格处理介绍如下:
Trailing Space Handling in Comparisons
MySQL collations have a pad attribute, which has a value of PAD SPACE or NO PAD:
• Most MySQL collations have a pad attribute of PAD SPACE.
• The Unicode collations based on UCA 9.0.0 and higher have a pad attribute of NO PAD; see Section 10.10.1, “Unicode Character Sets”.
For nonbinary strings (CHAR, VARCHAR, and TEXT values), the string collation pad attribute determines treatment in comparisons of trailing spaces at the end of strings:
• For PAD SPACE collations, trailing spaces are insignificant in comparisons; strings are compared without regard to trailing spaces.
• NO PAD collations treat trailing spaces as significant in comparisons, like any other character.
The differing behaviors can be demonstrated using the two utf8mb4 binary collations, one of which is PAD SPACE, the other of which is NO PAD.
2.2、PAD SPACE与NO PAD的具体意义
- PAD SPACE:在排序和比较运算中,忽略字符串尾部空格。
- NO PAD:在排序和比较运算中,字符串尾部空格当成普通字符,不能忽略。
官方文档中也要一个例子简单说明,两者比较时,如何处理尾部空格。如下所示,相当直观、明了:
mysql> SELECT 'a ' = 'a' COLLATE utf8mb4_general_ci;
+------------+
| 'a ' = 'a' |
+------------+
| 1 |
+------------+
mysql> SELECT 'a ' = 'a' COLLATE utf8mb4_0900_ai_ci;
+------------+
| 'a ' = 'a' |
+------------+
| 0 |
+------------+
以上示例可以看出不同的字符集对于尾部空格的处理方式也不同。最上面的示例中name字段我们用的是 utf8mb4_general_ci字符集,即Pad Attributes属性为PAD SPACE,所以忽略尾部空格,不管是字段存储中带空格,还是查询语句where条件后面带空格都能匹配上。
接下来我们用utf8mb4_0900_ai_ci字符集,即Pad Attributes属性为NO PAD测试一下:
CREATE TABLE `user` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL COMMENT '姓名',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1;
insert into user values
(null,'123456789 '), -- 数字后面有9个空格
(null,'张三 '), -- 名字后面有3个空格
(null,'李四 '),-- 名字后面有3个空格
(null,'王五')
;
我们发现当属性为NO PAD时是不会忽略尾部的空格的,所以where后面不带空格时查询不出来内容。
另外,我们这里测试的是VARCHAR类型,如果字段类型为CHAR呢?
三、CHAR类型尾部空格的处理
其实呢,对于CHAR类型和VARCHA类型,它们的存储略有区别:
CHAR(N):当插入的字符数小于N,它会在字符串的右边补充空格,直到总字符数达到N再进行存储;当查询返回数据时默认会将字符串尾部的空格去掉,除非SQL_MODE设置PAD_CHAR_TO_FULL_LENGTH。
VARCHAR(N):当插入的字符数小于N,它不会在字符串的右边补充空格,insert内容原封不动的进行存储;如果原本字符串右边有空格,在存储和查询返回时都会保留空格
The length of a CHAR column is fixed to the length that you declare when you create the table. The length can be any value from 0 to 255. When CHAR values are stored, they are right-padded with spaces to the specified length. When CHAR values are retrieved, trailing spaces are removed unless the PAD_CHAR_TO_FULL_LENGTH SQL mode is enabled.
PAD_CHAR_TO_FULL_LENGTH
By default, trailing spaces are trimmed from CHAR column values on retrieval. If PAD_CHAR_TO_FULL_LENGTH is enabled, trimming does not occur and retrieved CHAR values are padded to their full length. This mode does not apply to VARCHAR columns, for which trailing spaces are retained on retrieval.
Note: As of MySQL 8.0.13, PAD_CHAR_TO_FULL_LENGTH is deprecated. Expect it to be removed in a future version of MySQL.
注意事项:
- PAD_CHAR_TO_FULL_LENGTH只影响CHAR类型,不影响VARCHAR类型。
- MySQL 8.0.13后,PAD_CHAR_TO_FULL_LENGTH参数过时/废弃了。这个参数可能在后续的MySQL版本中被移除。
当前版本(MySQL 8.0.33)中,暂时还可以在SQL_MODE中设置这个参数,不过默认不会设置此参数。那么我们来测试验证一下:
drop table test;
create table test(id int not null, name char(10) );
insert into test(id , name) values(1, null);
insert into test(id , name) values(2, '');
insert into test(id , name) values(3, ' '); -- 包含一个空格
insert into test(id , name) values(4, ' ');-- 包含两个空格
mysql> show variables like 'sql_mode'\G
*************************** 1. row ***************************
Variable_name: sql_mode
Value: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
1 row in set (0.00 sec)
mysql> --如下所示,SQL_MODE没有设置PAD_CHAR_TO_FULL_LENGTH时,查询返回数据时默认会将字符串尾部的空格去掉,所以你看到长度为零
mysql> select id, length(name), char_length(name),hex(name) from test;
+----+--------------+-------------------+-----------+
| id | length(name) | char_length(name) | hex(name) |
+----+--------------+-------------------+-----------+
| 1 | NULL | NULL | NULL |
| 2 | 0 | 0 | |
| 3 | 0 | 0 | |
| 4 | 0 | 0 | |
+----+--------------+-------------------+-----------+
4 rows in set (0.00 sec)
mysql>
那么我们手工设置一下当前会话的SQL_MODE,然后对比测试一下:
mysql> set sql_mode='PAD_CHAR_TO_FULL_LENGTH';
Query OK, 0 rows affected, 1 warning (0.01 sec)
mysql> show variables like 'sql_mode';
+---------------+-------------------------+
| Variable_name | Value |
+---------------+-------------------------+
| sql_mode | PAD_CHAR_TO_FULL_LENGTH |
+---------------+-------------------------+
1 row in set (0.01 sec)
mysql> select id, length(name), char_length(name),hex(name) from test;
+----+--------------+-------------------+----------------------------------+
| id | length(name) | char_length(name) | hex(name) |
+----+--------------+-------------------+----------------------------------+
| 1 | NULL | NULL | NULL |
| 2 | 10 | 10 | 20202020202020202020 |
| 3 | 10 | 10 | 20202020202020202020 |
| 4 | 10 | 10 | 20202020202020202020 |
+----+--------------+-------------------+----------------------------------+
4 rows in set (0.00 sec)
mysql>
通过上面的分析讲述,我们知道当数据库的排序规则的Pad Attributes属性为NO PAD时,此时SQL_MODE的PAD_CHAR_TO_FULL_LENGTH设置与否将会影响查询结果。我们新建一个test2数据库,数据库排序规则为utf8mb4_0900_ai_ci,下面我们通过实验对比一下就知道了:
mysql> use test2;
Database changed
mysql> show variables like 'sql_mode'\G
*************************** 1. row ***************************
Variable_name: sql_mode
Value: ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_ENGINE_SUBSTITUTION
1 row in set (0.00 sec)
-- 对于CHAR类型,查询返回数据时默认会将字符串尾部的空格去掉,所以name=''会得到三条记录
mysql> select * from test where name='';
+----+------+
| id | name |
+----+------+
| 2 | |
| 3 | |
| 4 | |
+----+------+
3 rows in set (0.01 sec)
mysql> select * from test where name=' ';
Empty set (0.00 sec)
mysql> select * from test where name=' ';
Empty set (0.00 sec)
mysql> set sql_mode='PAD_CHAR_TO_FULL_LENGTH';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> show variables like 'sql_mode'\G
*************************** 1. row ***************************
Variable_name: sql_mode
Value: PAD_CHAR_TO_FULL_LENGTH
1 row in set (0.01 sec)
-- SQL_MODE设置PAD_CHAR_TO_FULL_LENGTH。查询返回数据时,字符串尾部的空格不会去掉,此时,name字段时10个空格,故而下面查询条件查不到数据。
mysql> select * from test where name='';
Empty set (0.01 sec)
mysql> select * from test where name=' ';
Empty set (0.00 sec)
mysql> select * from test where name=' ';
Empty set (0.00 sec)
mysql>
四、其他问题
4.1、在PAD SPACE属性时如何实现精准查询
上述案例中我们知道PAD_ATTRIBUTE属性为PAD SPACE时,在排序和比较运算中,忽略字符串尾部空格。此时where后面用等值匹配会匹配到结尾有空格的数据,那么如何实现精准查询呢?
drop table user;
CREATE TABLE `user` (
`id` int NOT NULL AUTO_INCREMENT,
`name` varchar(10) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL COMMENT '姓名',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1;
insert into user values
(null,'张三'), -- 名字后无空格
(null,'张三 '),-- 名字后1个空格
(null,'张三 ')-- 名字后2个空格
;
方法一:like
select * from user where `name` like ?
方法二:BINARY
select * from user where `name` = BINARY ?
BINARY 不是函数,是类型转换运算符,它用来强制它后面的字符串为一个二进制字符串,可以理解成精确匹配
五、总结
关于MySQL的尾部空格是否忽略,以及对查询结果的影响,既跟数据库的排序规则有关(确切来说,是跟数据库排序规则的Pad Attributes有关),其实还跟字符类型和SQL_MODE是否设置PAD_CHAR_TO_FULL_LENGTH有关。
- MySQL的CHAR、VARCHAR、TEXT等字符串字段在等值比较(“=”)时,基于PAD SPACE校对规则,会忽略掉尾部的空格;
- 如果想要精确查询就不能用等值查询(“=”),而应改用LIKE或BINARY;
参考资料:
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html
https://dev.mysql.com/doc/refman/8.0/en/charset-binary-collations.html