LightDB/PG: Sorting Out Encoding Issues
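While importing data from a SQL file, I ran into the error: ERROR: invalid byte sequence for encoding "EUC_CN".
That prompted me to sort out the encoding-related issues, which I record here. Two kinds of errors are covered:
- ERROR: invalid byte sequence for encoding "EUC_CN"
- ERROR: character with byte sequence 0xad 0xe5 in encoding "GBK" has no equivalent in encoding "UTF8"
Several encoding settings come into play when using LightDB:
- The terminal emulator's encoding (e.g. Xshell), usually set to match the host
- The host's encoding (locale, LANG)
- The LightDB client encoding, client_encoding
- The LightDB server encoding, server_encoding
Two scenarios are discussed below: executing SQL interactively in ltsql/psql, and executing a SQL file with the -f option.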
I. Preparation
create database test_c template template0 encoding 'EUC_CN' LC_CTYPE 'zh_CN' LC_COLLATE 'zh_CN';
\c test_c
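II. How ltsql/psql sets client_encoding
ltsql/psql decides client_encoding based on the following:
- The PGCLIENTENCODING environment variable. This article assumes it is not set (if it is set, client_encoding takes its value).
- Whether ltsql/psql is attached to a terminal, which is determined by
pset.notty = (!isatty(fileno(stdin)) || !isatty(fileno(stdout)));
Because the two tests are combined with ||, non-terminal mode is flagged as soon as either stdin or stdout is not a terminal.
When SQL is executed interactively, both are terminals, so client_encoding is set to auto:
(pset.notty || lt_getenv("PGCLIENTENCODING")) ? NULL : "auto";
When the connection is established, auto is resolved from the locale:
conn->client_encoding_initial = strdup(pg_encoding_to_char(pg_get_encoding_from_locale(NULL, true)));
In other words, client_encoding follows the locale encoding of the host on which ltsql/psql runs.
Running SQL scripts non-interactively is slightly more involved and is covered in section IV.
If the client does not set client_encoding (non-terminal mode), the server sets client_encoding to server_encoding.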
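The decision described in this section can be summarized as a small model. This is only a sketch of the logic as the article describes it, not ltsql/psql code; the function name and its parameters are hypothetical and exist purely for illustration.

```python
from typing import Optional


def resolve_client_encoding(
    stdin_is_tty: bool,
    stdout_is_tty: bool,
    env_encoding: Optional[str],
    locale_encoding: str,
    server_encoding: str,
) -> str:
    """Model of how client_encoding ends up being chosen (hypothetical helper)."""
    # Mirrors: pset.notty = (!isatty(stdin) || !isatty(stdout))
    notty = (not stdin_is_tty) or (not stdout_is_tty)

    # PGCLIENTENCODING, when set, decides the value.
    if env_encoding is not None:
        return env_encoding

    # Non-terminal mode: the client sends nothing, so the server
    # falls back to server_encoding.
    if notty:
        return server_encoding

    # Terminal mode: "auto" is resolved from the host locale.
    return locale_encoding


# Interactive session on a UTF8 host against an EUC_CN database.
print(resolve_client_encoding(True, True, None, "UTF8", "EUC_CN"))
# Same, but with stdout redirected to a file.
print(resolve_client_encoding(True, False, None, "UTF8", "EUC_CN"))
```

The two calls correspond to the unredirected and redirected runs shown in section IV.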
III. SQL commands
Case 1. When client_encoding does not match the terminal encoding, the following error is reported:
ERROR: invalid byte sequence for encoding "EUC_CN"
Here the terminal encoding is Unicode (UTF8) and the server encoding is also UTF8, so client_encoding starts out as UTF8. The client encoding is then changed with the set command (changing the server-side encoding and reconnecting would also work):
chuhx@postgres=# show client_en%;
      name       | setting |                description
-----------------+---------+-------------------------------------------
 client_encoding | UTF8    | Sets the client's character set encoding.
(1 row)
chuhx@postgres=# set client_encoding= EUC_CN; -- i.e. GB2312
SET
chuhx@postgres=# show client_en%;
      name       | setting |                description
-----------------+---------+-------------------------------------------
 client_encoding | EUC_CN  | Sets the client's character set encoding.
(1 row)
chuhx@postgres=# show server_en%;
      name       | setting |                    description
-----------------+---------+----------------------------------------------------
 server_encoding | UTF8    | Sets the server (database) character set encoding.
(1 row)
chuhx@postgres=# insert into test values('是打算');
ERROR: invalid byte sequence for encoding "EUC_CN": 0xe6 0x98
The error occurs because the input text is UTF8 (the terminal encoding) while client_encoding is EUC_CN (GB2312), so the input bytes cannot be parsed as valid EUC_CN.
After the terminal encoding is switched to GB2312, the same statement no longer fails:
chuhx@postgres=# insert into test values('是打算');
INSERT 0 1
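The byte-level cause of this error can be reproduced outside the database. The sketch below uses Python's gb2312 codec as a stand-in for EUC_CN. That is an assumption: Python's decoder is not the server's validator, although both reject the byte pair 0xe6 0x98. The helper name decodes_as is made up for this example.

```python
# Bytes a UTF-8 terminal sends for the string literal in the INSERT.
text = "是打算"
utf8_bytes = text.encode("utf-8")

# The first two bytes are the pair named in the error message: e6 98.
print(utf8_bytes[:2].hex(" "))


def decodes_as(data: bytes, codec: str) -> bool:
    """Return True if data is a valid byte sequence in the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# UTF-8 bytes are not a valid GB2312/EUC_CN sequence ...
print(decodes_as(utf8_bytes, "gb2312"))
# ... but the bytes a GB2312 terminal would send are.
print(decodes_as(text.encode("gb2312"), "gb2312"))
```

This matches the fix above: once the terminal sends GB2312 bytes, they agree with client_encoding and the insert succeeds.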
Case 2. A mismatch between client_encoding and the terminal encoding may also produce:
ERROR: character with byte sequence 0xad 0xe5 in encoding "GBK" has no equivalent in encoding "UTF8"
For example, with client_encoding set to GBK, server_encoding UTF8, and the terminal encoding UTF8:
chuhx@postgres=# insert into test values('中国');
ERROR: character with byte sequence 0xad 0xe5 in encoding "GBK" has no equivalent in encoding "UTF8"
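Switching the terminal encoding to GBK lets the insert succeed. In the failing attempt, the UTF8 bytes of '中国' were interpreted as GBK. That parse did not fail (the mismatch went undetected), and the error surfaced only when the text was converted to UTF8.
chuhx@postgres=# insert into test values('中国');
INSERT 0 1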
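To see where 0xad 0xe5 comes from, the UTF-8 bytes of '中国' can be split into the two-byte units a GBK reader would see. The sketch below uses Python's gbk codec as a stand-in for the server's GBK-to-UTF8 conversion table. That is an assumption, and the two tables may differ in detail, but the error above shows the server has no mapping for this pair either. The helper name gbk_char is hypothetical.

```python
from typing import Optional

utf8_bytes = "中国".encode("utf-8")  # six bytes: e4 b8 ad e5 9b bd

# GBK is a double-byte encoding, so the six bytes are read as three pairs.
pairs = [utf8_bytes[i:i + 2] for i in range(0, len(utf8_bytes), 2)]


def gbk_char(pair: bytes) -> Optional[str]:
    """Return the character a GBK byte pair maps to, or None if unmapped."""
    try:
        return pair.decode("gbk")
    except UnicodeDecodeError:
        return None


for pair in pairs:
    # Shows each pair in hex and the character it maps to (None if unmapped).
    print(pair.hex(" "), "->", gbk_char(pair))
```

The middle pair, ad e5, straddles the boundary between the two original characters. It looks like a structurally plausible GBK pair, which is consistent with the parse passing and the conversion step failing.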
Case 3. When client_encoding differs from server_encoding, the following error may occur:
character with byte sequence 0xe7 0x99 0xbc in encoding "UTF8" has no equivalent in encoding "EUC_CN"
This is because '發' cannot be represented in EUC_CN; the LightDB server encoding needs to be changed.
chuhx@test_c=# insert into test values('發');
ERROR: character with byte sequence 0xe7 0x99 0xbc in encoding "UTF8" has no equivalent in encoding "EUC_CN"
chuhx@test_c=#
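The same limitation can be checked in Python, again using the gb2312 codec as a stand-in for EUC_CN (an assumption; the helper name representable is made up for this example). The traditional character '發' has no GB2312 code point, while its simplified form '发' does.

```python
def representable(ch: str, codec: str) -> bool:
    """Return True if the character can be encoded in the given codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# UTF-8 bytes of the traditional character; these are the bytes in the error.
print("發".encode("utf-8").hex(" "))

print(representable("發", "gb2312"))  # traditional form
print(representable("发", "gb2312"))  # simplified form
print(representable("發", "utf-8"))
```

No client-side setting can work around this: the character set of the database itself has to be wide enough to hold the character.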
IV. SQL files
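The -f/-c options do not by themselves change stdin or stdout: when ltsql is launched from a terminal, both are still terminals, pset.notty is false, and client_encoding is auto (resolved from the locale). Once the result is redirected to a file, !isatty(fileno(stdout)) holds (likewise !isatty(fileno(stdin)) when the input is piped in). ltsql then runs in non-terminal mode and does not set client_encoding, and when the client does not set it, the server sets client_encoding to match server_encoding. Without redirection the output looks like this: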
[chuhx@test-host ~/citus]$ ltsql -p5432 -d test_c -c 'show %encoding;'
name | setting | description
-----------------+---------+----------------------------------------------------
client_encoding | UTF8 | Sets the client's character set encoding.
server_encoding | EUC_CN | Sets the server (database) character set encoding.
(2 rows)
After stdout is redirected, ltsql runs in non-terminal mode and client_encoding matches server_encoding:
[chuhx@test-host ~/citus]$ ltsql -p5432 -d test_c -c 'show %encoding;' > 1.txt
[chuhx@test-host ~/citus]$ cat 1.txt
name | setting | description
-----------------+---------+----------------------------------------------------
client_encoding | EUC_CN | Sets the client's character set encoding.
server_encoding | EUC_CN | Sets the server (database) character set encoding.
(2 rows)
[chuhx@test-host ~/citus]$
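The effect of redirection on the isatty() test can be observed without a database. The sketch below starts a child Python process with its stdout captured through a pipe and lets the child report what isatty() returns for its own stdout. The child program and the function name are illustrative only and have nothing to do with ltsql itself.

```python
import subprocess
import sys

# Child program: report whether its own stdout is a terminal.
CHILD = "import sys; print(sys.stdout.isatty())"


def child_stdout_is_tty_when_captured() -> bool:
    """Run the child with stdout captured (a pipe) and return its answer."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,  # stdout becomes a pipe, as with '> file' or '|'
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


print(child_stdout_is_tty_when_captured())
```

A redirected or piped stdout is not a terminal, which is the condition that flips pset.notty in the redirected run above.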
When the output file is given with -o, stdout itself is still a terminal, so client_encoding is set from the host locale:
[chuhx@test-host ~/citus]$ ltsql -p5432 -d test_c -c 'show %encoding;' -o 1.txt
[chuhx@test-host ~/citus]$ cat 1.txt
name | setting | description
-----------------+---------+----------------------------------------------------
client_encoding | UTF8 | Sets the client's character set encoding.
server_encoding | EUC_CN | Sets the server (database) character set encoding.
(2 rows)
V. Appendix (gdb trace)
r -p5432 -d test_c -f test.sql -o 1.txt    (-f and -o do not redirect stdin/stdout; both are still terminals)
pset.notty = false
Breakpoint 3, PQconnectdbParams (keywords=0x6bc180, values=0x6bc1d0, expand_dbname=1) at fe-connect.c:651
651 PGconn *conn = PQconnectStartParams(keywords, values, expand_dbname);
(gdb) p keywords[6]
$60 = 0x48f9ed "client_encoding"
(gdb) p values[6]
$61 = 0x48fa0e "auto"
(gdb) c
Continuing.
Breakpoint 5, connectOptions2 (conn=0x6bc220) at fe-connect.c:1091
1091 conn->whichhost = 0;
(gdb) p conn->client_encoding
$62 = 0
(gdb) p conn->client_encoding_initial
$63 = 0x6c6470 "auto"
(gdb) c
Continuing.
Breakpoint 6, connectOptions2 (conn=0x6bc220) at fe-connect.c:1457
1457 if (conn->client_encoding_initial &&
(gdb) n
1458 strcmp(conn->client_encoding_initial, "auto") == 0)
(gdb)
1457 if (conn->client_encoding_initial &&
(gdb)
1460 free(conn->client_encoding_initial);
(gdb)
1461 conn->client_encoding_initial = strdup(pg_encoding_to_char(pg_get_encoding_from_locale(NULL, true)));
(gdb)
1462 if (!conn->client_encoding_initial)
(gdb)
1469 if (conn->target_session_attrs)
(gdb) p conn->client_encoding_initial
$64 = 0x6c6470 "UTF8"
r -p5432 -d test_c -f test.sql > 1.txt    (stdout is redirected, so it is no longer a terminal)
pset.notty = true
Breakpoint 3, PQconnectdbParams (keywords=0x6bbf40, values=0x6bbf90, expand_dbname=1) at fe-connect.c:651
651 PGconn *conn = PQconnectStartParams(keywords, values, expand_dbname);
(gdb) p keywords[6]
$66 = 0x48f9ed "client_encoding"
(gdb) p values[6]
$67 = 0x0
(gdb) c
Continuing.
Breakpoint 5, connectOptions2 (conn=0x6bbfe0) at fe-connect.c:1091
1091 conn->whichhost = 0;
(gdb) p conn->client_encoding
$68 = 0
(gdb) p conn->client_encoding_initial
$69 = 0x0
(gdb)