数据库版本
- PG 16.1
queryid 是什么
queryid 是将 sql 规范化 (normalization) 后,通过哈希函数计算出来的 64 位整数。
以 SELECT id, data FROM tbl_a WHERE id < 300 ORDER BY data;
这条 SQL 为例。当我们在 PG 中执行这条 sql 时,内核在语义分析阶段会将其规范化后计算哈希值。这个 sql 中,如果将 id<300 改成 id<400 ,它们对应的规范化 sql 仍然是同样的,queryid 也是同样的。
如果启用了 pg_stat_statements 插件,会看到其规范化后的 sql 语句 SELECT id, data FROM tbl_a WHERE id < $1 ORDER BY data;
和计算出来的 queryid。
如何计算 queryid
众所周知,PG 在处理 SQL 时,会经过 词法分析->语法分析->语义分析->生成计划->执行计划 等阶段,而 queryid 的计算就是在语义分析阶段的 JumbleQuery
函数中完成的。
还以上文的 SQL SELECT id, data FROM tbl_a WHERE id < 300 ORDER BY data;
为例,语义分析阶段后,内核已经得到了 SQL 对应的 query 结构,如下图所示。(借用 interdb.jp [1] 文章中的图)
内核对所有需要 jumble 的结构都设定了对应 jumble 的方法,维护在 queryjumblefuncs.switch.c
文件中,下面简单列几行
case T_Alias:
_jumbleAlias(jstate, expr);
break;
case T_RangeVar:
_jumbleRangeVar(jstate, expr);
break;
case T_TableFunc:
_jumbleTableFunc(jstate, expr);
break;
static void
_jumbleAlias(JumbleState *jstate, Node *node)
{
Alias *expr = (Alias *) node;
JUMBLE_STRING(aliasname);
JUMBLE_NODE(colnames);
}
而对应到每个最终的结构上时,使用 AppendJumble
函数进行处理。将所有的输入都转成 unsigned char* 存到 jstate->jumble 变量里。如果发现长度满了就把之前的哈希一下从头开始存
static void
AppendJumble(JumbleState *jstate, const unsigned char *item, Size size)
{
unsigned char *jumble = jstate->jumble;
Size jumble_len = jstate->jumble_len;
/*
* Whenever the jumble buffer is full, we hash the current contents and
* reset the buffer to contain just that hash value, thus relying on the
* hash to summarize everything so far.
*/
while (size > 0)
{
Size part_size;
if (jumble_len >= JUMBLE_SIZE)
{
uint64 start_hash;
start_hash = DatumGetUInt64(hash_any_extended(jumble,
JUMBLE_SIZE, 0));
memcpy(jumble, &start_hash, sizeof(start_hash));
jumble_len = sizeof(start_hash);
}
part_size = Min(size, JUMBLE_SIZE - jumble_len);
memcpy(jumble + jumble_len, item, part_size);
jumble_len += part_size;
item += part_size;
size -= part_size;
}
jstate->jumble_len = jumble_len;
}
上述过程完成后,就得到了一个完整的 jumble 字符串。然后用内核的 hash_any_extend
对这个字符串进行哈希,得到的 64 位无符号整数就是 queryid(输出时会将其转为 int64)。
pg_stat_statement 生成规范化 sql
PG 内核所做的只是将一个 SQL 解析成 Query 结构,然后将所需要的部分拼起来哈希一下,得到一个 queryid,并没有规范化 sql 的概念。而 pg_stat_statement 的 generate_normalized_query
函数可以帮我们做到这一点,从
SELECT id, data FROM tbl_a WHERE id < 300 ORDER BY data;
生成
SELECT id, data FROM tbl_a WHERE id < $1 ORDER BY data;
规范化的 sql。
下面是函数的源码,代码很长,其实就是做了一件事:将 sql 中的常量替换成 $1、$2…
static char *
generate_normalized_query(JumbleState *jstate, const char *query,
int query_loc, int *query_len_p)
{
char *norm_query;
int query_len = *query_len_p;
int i,
norm_query_buflen, /* Space allowed for norm_query */
len_to_wrt, /* Length (in bytes) to write */
quer_loc = 0, /* Source query byte location */
n_quer_loc = 0, /* Normalized query byte location */
last_off = 0, /* Offset from start for previous tok */
last_tok_len = 0; /* Length (in bytes) of that tok */
/*
* Get constants' lengths (core system only gives us locations). Note
* this also ensures the items are sorted by location.
*/
fill_in_constant_lengths(jstate, query, query_loc);
/*
* Allow for $n symbols to be longer than the constants they replace.
* Constants must take at least one byte in text form, while a $n symbol
* certainly isn't more than 11 bytes, even if n reaches INT_MAX. We
* could refine that limit based on the max value of n for the current
* query, but it hardly seems worth any extra effort to do so.
*/
norm_query_buflen = query_len + jstate->clocations_count * 10;
/* Allocate result buffer */
norm_query = palloc(norm_query_buflen + 1);
for (i = 0; i < jstate->clocations_count; i++)
{
int off, /* Offset from start for cur tok */
tok_len; /* Length (in bytes) of that tok */
off = jstate->clocations[i].location;
/* Adjust recorded location if we're dealing with partial string */
off -= query_loc;
tok_len = jstate->clocations[i].length;
if (tok_len < 0)
continue; /* ignore any duplicates */
/* Copy next chunk (what precedes the next constant) */
len_to_wrt = off - last_off;
len_to_wrt -= last_tok_len;
Assert(len_to_wrt >= 0);
memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
n_quer_loc += len_to_wrt;
/* And insert a param symbol in place of the constant token */
n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d",
i + 1 + jstate->highest_extern_param_id);
quer_loc = off + tok_len;
last_off = off;
last_tok_len = tok_len;
}
/*
* We've copied up until the last ignorable constant. Copy over the
* remaining bytes of the original query string.
*/
len_to_wrt = query_len - quer_loc;
Assert(len_to_wrt >= 0);
memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
n_quer_loc += len_to_wrt;
Assert(n_quer_loc <= norm_query_buflen);
norm_query[n_quer_loc] = '\0';
*query_len_p = n_quer_loc;
return norm_query;
}
参考资料
[1] https://www.interdb.jp/pg/pgsql03/01.html