PostgreSQL计算 queryid 原理

news2025/7/13 23:36:00

数据库版本

PG 16.1

queryid 是什么

queryid 是将 sql 规范化 (normalization) 后，通过哈希函数计算出来的 64 位整数。

以 SELECT id, data FROM tbl_a WHERE id < 300 ORDER BY data; 这条 SQL 为例。当我们在 PG 中执行这条 sql 时，内核在语义分析阶段会将其规范化后计算哈希值。这个 sql 中，如果将 id<300 改成 id<400 ，它们对应的规范化 sql 仍然是同样的，queryid 也是同样的。

如果启用了 pg_stat_statements 插件，会看到其规范化后的 sql 语句 SELECT id, data FROM tbl_a WHERE id < $1 ORDER BY data; 和计算出来的 queryid。
在这里插入图片描述

如何计算 queryid

众所周知，PG 在处理 SQL 时，会经过词法分析->语法分析->语义分析->生成计划->执行计划等阶段，而 queryid 的计算就是在语义分析阶段的 JumbleQuery 函数中完成的。

还以上文的 SQL SELECT id, data FROM tbl_a WHERE id < 300 ORDER BY data; 为例，语义分析阶段后，内核已经得到了 SQL 对应的 query 结构，如下图所示。（借用 interdb.jp [1] 文章中的图）

在这里插入图片描述
内核对所有需要 jumble 的结构都设定了对应 jumble 的方法，维护在 queryjumblefuncs.switch.c 文件中，下面简单列几行

case T_Alias:
	_jumbleAlias(jstate, expr);
	break;
case T_RangeVar:
	_jumbleRangeVar(jstate, expr);
	break;
case T_TableFunc:
	_jumbleTableFunc(jstate, expr);
	break;

static void
_jumbleAlias(JumbleState *jstate, Node *node)
{
	Alias *expr = (Alias *) node;

	JUMBLE_STRING(aliasname);
	JUMBLE_NODE(colnames);
}

而对应到每个最终的结构上时，使用 AppendJumble 函数进行处理。将所有的输入都转成 unsigned char* 存到 jstate->jumble 变量里。如果发现长度满了就把之前的哈希一下从头开始存

static void
AppendJumble(JumbleState *jstate, const unsigned char *item, Size size)
{
	unsigned char *jumble = jstate->jumble;
	Size		jumble_len = jstate->jumble_len;

	/*
	 * Whenever the jumble buffer is full, we hash the current contents and
	 * reset the buffer to contain just that hash value, thus relying on the
	 * hash to summarize everything so far.
	 */
	while (size > 0)
	{
		Size		part_size;

		if (jumble_len >= JUMBLE_SIZE)
		{
			uint64		start_hash;

			start_hash = DatumGetUInt64(hash_any_extended(jumble,
														  JUMBLE_SIZE, 0));
			memcpy(jumble, &start_hash, sizeof(start_hash));
			jumble_len = sizeof(start_hash);
		}
		part_size = Min(size, JUMBLE_SIZE - jumble_len);
		memcpy(jumble + jumble_len, item, part_size);
		jumble_len += part_size;
		item += part_size;
		size -= part_size;
	}
	jstate->jumble_len = jumble_len;
}

上述过程完成后，就得到了一个完整的 jumble 字符串。然后用内核的 hash_any_extend 对这个字符串进行哈希，得到的 64 位无符号整数就是 queryid（输出时会将其转为 int64）。

pg_stat_statement 生成规范化 sql

PG 内核所做的只是将一个 SQL 解析成 Query 结构，然后将所需要的部分拼起来哈希一下，得到一个 queryid，并没有规范化 sql 的概念。而 pg_stat_statement 的 generate_normalized_query 函数可以帮我们做到这一点，从
SELECT id, data FROM tbl_a WHERE id < 300 ORDER BY data;
生成
SELECT id, data FROM tbl_a WHERE id < $1 ORDER BY data; 规范化的 sql。

下面是函数的源码，代码很长，其实就是做了一件事：将 sql 中的常量替换成 $1、$2…

static char *
generate_normalized_query(JumbleState *jstate, const char *query,
						  int query_loc, int *query_len_p)
{
	char	   *norm_query;
	int			query_len = *query_len_p;
	int			i,
				norm_query_buflen,	/* Space allowed for norm_query */
				len_to_wrt,		/* Length (in bytes) to write */
				quer_loc = 0,	/* Source query byte location */
				n_quer_loc = 0, /* Normalized query byte location */
				last_off = 0,	/* Offset from start for previous tok */
				last_tok_len = 0;	/* Length (in bytes) of that tok */

	/*
	 * Get constants' lengths (core system only gives us locations).  Note
	 * this also ensures the items are sorted by location.
	 */
	fill_in_constant_lengths(jstate, query, query_loc);

	/*
	 * Allow for $n symbols to be longer than the constants they replace.
	 * Constants must take at least one byte in text form, while a $n symbol
	 * certainly isn't more than 11 bytes, even if n reaches INT_MAX.  We
	 * could refine that limit based on the max value of n for the current
	 * query, but it hardly seems worth any extra effort to do so.
	 */
	norm_query_buflen = query_len + jstate->clocations_count * 10;

	/* Allocate result buffer */
	norm_query = palloc(norm_query_buflen + 1);

	for (i = 0; i < jstate->clocations_count; i++)
	{
		int			off,		/* Offset from start for cur tok */
					tok_len;	/* Length (in bytes) of that tok */

		off = jstate->clocations[i].location;
		/* Adjust recorded location if we're dealing with partial string */
		off -= query_loc;

		tok_len = jstate->clocations[i].length;

		if (tok_len < 0)
			continue;			/* ignore any duplicates */

		/* Copy next chunk (what precedes the next constant) */
		len_to_wrt = off - last_off;
		len_to_wrt -= last_tok_len;

		Assert(len_to_wrt >= 0);
		memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
		n_quer_loc += len_to_wrt;

		/* And insert a param symbol in place of the constant token */
		n_quer_loc += sprintf(norm_query + n_quer_loc, "$%d",
							  i + 1 + jstate->highest_extern_param_id);

		quer_loc = off + tok_len;
		last_off = off;
		last_tok_len = tok_len;
	}

	/*
	 * We've copied up until the last ignorable constant.  Copy over the
	 * remaining bytes of the original query string.
	 */
	len_to_wrt = query_len - quer_loc;

	Assert(len_to_wrt >= 0);
	memcpy(norm_query + n_quer_loc, query + quer_loc, len_to_wrt);
	n_quer_loc += len_to_wrt;

	Assert(n_quer_loc <= norm_query_buflen);
	norm_query[n_quer_loc] = '\0';

	*query_len_p = n_quer_loc;
	return norm_query;
}