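The CSN Mechanism in PolarDB

Background

Anyone familiar with PostgreSQL knows that snapshot acquisition tends to become a performance bottleneck under high concurrency. The reason is that PG keeps the state of all transactions in a global array in shared memory, and taking a snapshot requires locking to keep the data consistent. A backend taking a snapshot holds ProcArrayLock in shared mode while it walks the active transactions in ProcArray; meanwhile, every transaction that commits or rolls back must acquire ProcArrayLock in exclusive mode to clear its own entry. Under high concurrency, contention on ProcArrayLock therefore becomes a bottleneck for the database. To overcome this, PolarDB introduces a CSN (Commit Sequence Number) snapshot mechanism that avoids acquiring ProcArrayLock.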

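1 The CSN Mechanism

1.1 How CSN Works

At the transaction layer, PolarDB replaces PG's native snapshot with a CSN snapshot.
[Figure: transaction timeline showing XIDs, CSNs, and the snapshot point]
As the figure shows, every non-read-only transaction is assigned an XID while it runs, and the CSN is advanced when the transaction commits; at the same time, the mapping between the current CSN and the transaction's XID is saved.
The solid vertical line in the figure marks the moment the snapshot is taken; the snapshot obtains 4, the value following the most recently committed CSN. TX1, TX3, and TX5 have already committed, with CSNs 1, 2, and 3. TX2, TX4, and TX6 are still running, and TX7 and TX8 are future transactions that have not started yet. For this snapshot, the results of every transaction whose CSN is strictly less than 4 are visible; all other transactions have not committed and are invisible.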
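The visibility rule in this example can be written as a single comparison. The following is a minimal sketch, not PolarDB code: the `demo_` names and the `DEMO_CSN_NONE` marker for "no CSN yet" are assumptions made for illustration, and the table simply encodes TX1 to TX8 from the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker: the transaction has no CSN yet (running or not started). */
#define DEMO_CSN_NONE 0

/* A transaction is visible only if it committed with a CSN strictly below the snapshot CSN. */
static bool demo_csn_visible(uint64_t xact_csn, uint64_t snapshot_csn)
{
    if (xact_csn == DEMO_CSN_NONE)
        return false;
    return xact_csn < snapshot_csn;
}

/* CSNs of TX1..TX8 from the example: TX1, TX3, TX5 committed with CSN 1, 2, 3. */
static const uint64_t demo_tx_csn[8] = {
    1, DEMO_CSN_NONE, 2, DEMO_CSN_NONE,
    3, DEMO_CSN_NONE, DEMO_CSN_NONE, DEMO_CSN_NONE
};

/* Count how many of TX1..TX8 are visible to a snapshot with the given CSN. */
static int demo_count_visible(uint64_t snapshot_csn)
{
    int visible = 0;

    for (int i = 0; i < 8; i++)
    {
        if (demo_csn_visible(demo_tx_csn[i], snapshot_csn))
            visible++;
    }
    return visible;
}
```

With a snapshot CSN of 4, `demo_count_visible(4)` counts exactly TX1, TX3, and TX5.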

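1.2 CSN Implementation

Each CSN (Commit Sequence Number) is kept in a mapping with its XID (transaction ID) so that a transaction can be tied to its visibility; this mapping is stored in the CSNLog. For example, if transaction IDs 2048, 2049, 2050, 2051, 2052, and 2053 map to CSNs 5, 4, 7, 10, 6, and 8 respectively, the transactions committed in the order 2049, 2048, 2052, 2050, 2053, 2051.
[Figure: XID-to-CSN mapping stored in the CSNLog]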
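Since CSNs are handed out in commit order, sorting the XID-to-CSN pairs by CSN recovers the commit order. The sketch below illustrates this with the six transactions from the example; `DemoXidCsn` and the `demo_` functions are hypothetical names, not part of PolarDB.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pair type: one XID and the CSN it was assigned at commit. */
typedef struct DemoXidCsn
{
    uint32_t xid;
    uint64_t csn;
} DemoXidCsn;

/* qsort comparator: ascending CSN, written to avoid integer overflow. */
static int demo_cmp_by_csn(const void *a, const void *b)
{
    const DemoXidCsn *x = a;
    const DemoXidCsn *y = b;

    return (x->csn > y->csn) - (x->csn < y->csn);
}

/* Sort the entries in place so they appear in commit order. */
static void demo_sort_commit_order(DemoXidCsn *entries, size_t n)
{
    qsort(entries, n, sizeof(entries[0]), demo_cmp_by_csn);
}
```

Sorting the pairs (2048, 5), (2049, 4), (2050, 7), (2051, 10), (2052, 6), (2053, 8) yields the XID sequence 2049, 2048, 2052, 2050, 2053, 2051.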
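Accordingly, PolarDB assigns every transaction ID an 8-byte uint64 CSN, so one 8 kB page holds the CSNs of 1k (1024) transactions. Once the CSNLOG grows past a certain size it is split into segments, and each CSNLOG segment file is 256 kB. As with XIDs, a few special CSN values are reserved. The CSNLOG definitions are as follows:
[Figure: CSNLOG definition code]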
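The sizes above are enough to work out where a given XID's CSN would live. The following sketch derives the segment, page, and byte offset purely from the 8-byte entry, 8 kB page, and 256 kB segment figures; the `DEMO_` macros and `demo_csnlog_locate` are assumed names, and the actual PolarDB macros may differ.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ            8192u                 /* page size: 8 kB */
#define DEMO_CSN_SIZE          8u                    /* one uint64 CSN per xid */
#define DEMO_XACTS_PER_PAGE    (DEMO_BLCKSZ / DEMO_CSN_SIZE)        /* 1024 */
#define DEMO_SEGMENT_SIZE      (256u * 1024u)        /* segment file: 256 kB */
#define DEMO_PAGES_PER_SEGMENT (DEMO_SEGMENT_SIZE / DEMO_BLCKSZ)    /* 32 */

/* Hypothetical location of one CSN slot inside the CSNLOG. */
typedef struct DemoCsnLogPos
{
    uint32_t segment;   /* which 256 kB segment file */
    uint32_t page;      /* page index inside that segment */
    uint32_t offset;    /* byte offset inside that page */
} DemoCsnLogPos;

/* Map an xid to its slot, assuming slots are laid out densely by xid. */
static DemoCsnLogPos demo_csnlog_locate(uint32_t xid)
{
    DemoCsnLogPos pos;
    uint32_t pageno = xid / DEMO_XACTS_PER_PAGE;

    pos.segment = pageno / DEMO_PAGES_PER_SEGMENT;
    pos.page = pageno % DEMO_PAGES_PER_SEGMENT;
    pos.offset = (xid % DEMO_XACTS_PER_PAGE) * DEMO_CSN_SIZE;
    return pos;
}
```

Under these assumptions, XID 2049 lands in segment 0, page 2, at byte offset 8, and one segment covers 32 × 1024 = 32768 XIDs.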

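2 CSN Snapshots and Visibility Checks

2.1 CSN Data Structures

The polar_csn_mvcc_var_cache struct maintains the XID of the oldest active transaction, the next CSN to be assigned, and the XID of the most recently completed transaction.
[Figure: definition of polar_csn_mvcc_var_cache]
When another transaction needs the CSN status of a transaction that is in the middle of committing, it waits for the commit to finish by acquiring CommitSeqNoLock in exclusive mode.
CSNLogControlLock protects writes to the csnlog files.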

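2.2 Acquiring a CSN Snapshot

In PolarDB, CSN snapshots are taken by GetSnapshotDataCSN, which works as follows:
1. Read polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid and use it as snapshot->xmin.
2. Read polar_shmem_csn_mvcc_var_cache->polar_next_csn and use it as snapshot->polar_snapshot_csn.
3. Set snapshot->xmax = polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid + 1.
4. Depending on the GUC parameter old_snapshot_threshold, decide whether snapshot->lsn and snapshot->whenTaken need to be set.
5. Finally, if the GUC parameter polar_csn_xid_snapshot is enabled, generate an xid snapshot from the CSN snapshot.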

static Snapshot
GetSnapshotDataCSN(Snapshot snapshot)
{
	TransactionId xmin;
	TransactionId xmax;
	CommitSeqNo snapshotcsn;

	Assert(snapshot != NULL);

	/*
	 * The ProcArrayLock is not needed here. We only set our xmin if
	 * it's not already set. There are only a few functions that check
	 * the xmin under exclusive ProcArrayLock:
	 * 1) ProcArrayInstallRestored/ImportedXmin -- can only care about
	 * our xmin long after it has been first set.
	 * 2) ProcArrayEndTransaction is not called concurrently with
	 * GetSnapshotData.
	 */

	/* Anything older than oldestActiveXid is surely finished by now. */
	xmin = pg_atomic_read_u32(&polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid);
	/* If no performance issue, we try best to maintain RecentXmin for xid based snapshot */
	RecentXmin = xmin;

	/* Announce my xmin, to hold back GlobalXmin. */
	if (!TransactionIdIsValid(MyPgXact->xmin))
	{
		TransactionId oldest_active_xid;

		MyPgXact->xmin = xmin;
		TransactionXmin = xmin;

		/*
		 * Recheck, if oldestActiveXid advanced after we read it.
		 *
		 * This protects against a race condition with GetRecentGlobalXmin().
		 * If a transaction ends and runs GetRecentGlobalXmin(), just after we fetch
		 * polar_oldest_active_xid, but before we set MyPgXact->xmin, it's possible
		 * that GetRecentGlobalXmin() computed a new GlobalXmin that doesn't
		 * cover the xmin that we got. To fix that, check polar_oldest_active_xid
		 * again, after setting xmin. Redoing it once is enough, we don't need
		 * to loop, because the (stale) xmin that we set prevents the same
		 * race condition from advancing RecentGlobalXmin again.
		 *
		 * For a brief moment, we can have the situation that our xmin is
		 * lower than RecentGlobalXmin, but it's OK because we don't use that xmin
		 * until we've re-checked and corrected it if necessary.
		 */

		/*
		 * memory barrier to make sure that setting the xmin in our PGPROC entry
		 * is made visible to others, before the read below.
		 */
		pg_memory_barrier();

		oldest_active_xid  = pg_atomic_read_u32(&polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid);
		if (oldest_active_xid != xmin)
		{
			/*no cover begin*/
			xmin = oldest_active_xid;

			RecentXmin = xmin;
			MyPgXact->xmin = xmin;
			TransactionXmin = xmin;
			/*no cover end*/
		}
	}

	/*
	 * Get the current snapshot CSN. This
	 * serializes us with any concurrent commits.
	 */
	snapshotcsn = pg_atomic_read_u64(&polar_shmem_csn_mvcc_var_cache->polar_next_csn);
	
	/*
	 * Also get xmax. It is always latestCompletedXid + 1.
	 * Make sure to read it after CSN (see TransactionIdAsyncCommitTree())
	 */
	pg_read_barrier();
	xmax = pg_atomic_read_u32(&polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid);
	Assert(TransactionIdIsNormal(xmax));
	TransactionIdAdvance(xmax);

	snapshot->xmin = xmin;
	snapshot->xmax = xmax;
	snapshot->polar_snapshot_csn = snapshotcsn;
	snapshot->polar_csn_xid_snapshot = false;
	snapshot->xcnt = 0;
	snapshot->subxcnt = 0;
	snapshot->suboverflowed = false;
	snapshot->curcid = GetCurrentCommandId(false);

	/*
	 * This is a new snapshot, so set both refcounts are zero, and mark it as
	 * not copied in persistent memory.
	 */
	snapshot->active_count = 0;
	snapshot->regd_count = 0;
	snapshot->copied = false;

	if (old_snapshot_threshold < 0)
	{
		/*
		 * If not using "snapshot too old" feature, fill related fields with
		 * dummy values that don't require any locking.
		 */
		snapshot->lsn = InvalidXLogRecPtr;
		snapshot->whenTaken = 0;
	}
	else
	{
		/*
		 * Capture the current time and WAL stream location in case this
		 * snapshot becomes old enough to need to fall back on the special
		 * "old snapshot" logic.
		 */
		snapshot->lsn = GetXLogInsertRecPtr();
		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
	}

	/* 
	 * We get RecentGlobalXmin/RecentGlobalDataXmin lazily in polar csn.
	 * In master mode, we reset it when end transaction;
	 * In hot standby mode, WAL is replayed by the startup backend, so we have to reset
	 * it when get snapshot,
	 * because RecentGlobalXmin/RecentGlobalDataXmin are backend variables.
	 */
	if (RecoveryInProgress())
		resetGlobalXminCacheCSN();

	/* 
	 * We need xid snapshot, should generate it from csn snapshot.
	 * The logic is:
	 * 1. Scan csnlog from xmin(inclusive) to xmax(exclusive)
	 * 2. Add xids whose status are in_progress or committing or 
	 *    committed csn >= snapshotcsn to xid array
	 * Like hot standby, we don't know which xids are top-level and which are
	 * subxacts. So we use subxip to store as many xids as possible.
	 */
	if (polar_csn_xid_snapshot)
	{
		if (TransactionIdPrecedes(xmin, xmax))
			polar_csnlog_get_running_xids(xmin, xmax, snapshotcsn, GetMaxSnapshotSubxidCount(),
				&snapshot->subxcnt, snapshot->subxip, &snapshot->suboverflowed);

		snapshot->polar_csn_xid_snapshot = true;
	}

	return snapshot;
}
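The last step of the function, building an xid snapshot from the CSN snapshot, follows the rule stated in its comment: scan from xmin (inclusive) to xmax (exclusive) and keep every xid that is in progress, committing, or committed with a CSN at or above the snapshot CSN. The sketch below illustrates that rule over an in-memory array. It is not the implementation of polar_csnlog_get_running_xids: the `DEMO_CSN_*` values and the array-based lookup are assumptions, and xid wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reserved CSN values; the real constants may differ. */
#define DEMO_CSN_INPROGRESS   0
#define DEMO_CSN_ABORTED      1
#define DEMO_CSN_COMMITTING   2
#define DEMO_CSN_FIRST_NORMAL 3

/*
 * Collect the xids in [xmin, xmax) that a snapshot must treat as running.
 * csn_of[i] is the CSN slot of xid (base_xid + i). At most max_out xids are
 * stored in out; *overflowed is set if more would have been needed.
 */
static size_t demo_get_running_xids(const uint64_t *csn_of, uint32_t base_xid,
                                    uint32_t xmin, uint32_t xmax,
                                    uint64_t snapshot_csn,
                                    uint32_t *out, size_t max_out,
                                    bool *overflowed)
{
    size_t n = 0;

    *overflowed = false;
    for (uint32_t xid = xmin; xid < xmax; xid++)
    {
        uint64_t csn = csn_of[xid - base_xid];
        bool running = csn == DEMO_CSN_INPROGRESS ||
                       csn == DEMO_CSN_COMMITTING ||
                       (csn >= DEMO_CSN_FIRST_NORMAL && csn >= snapshot_csn);

        if (!running)
            continue;           /* aborted, or committed before the snapshot */
        if (n == max_out)
        {
            *overflowed = true; /* no room left: report overflow and stop */
            break;
        }
        out[n++] = xid;
    }
    return n;
}
```

Aborted transactions are skipped as well as transactions committed before the snapshot, because neither needs to appear in the running list.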

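2.3 MVCC Visibility Check Flow

Combining the tuple header (its xmin and xmax), the Clog, and the CSNLOG mapping described above gives the overall MVCC check, which is implemented in HeapTupleSatisfiesMVCC. Whether an xid is visible in a CSN snapshot is decided by XidVisibleInSnapshotCSN, whose flowchart is:
[Figure: flowchart of XidVisibleInSnapshotCSN]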
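As a rough guide to what such a check involves, the sketch below combines the snapshot's xmin, xmax, and CSN into one decision. It is an assumption-laden illustration rather than the body of XidVisibleInSnapshotCSN: the `DemoSnapshot` type and `DEMO_CSN_*` values are hypothetical, the caller is assumed to have already looked up the xid's CSN, xid wraparound is ignored, and a committing transaction is simply reported as invisible where real code would wait for the commit to finish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reserved CSN values; the real constants may differ. */
#define DEMO_CSN_INPROGRESS   0
#define DEMO_CSN_ABORTED      1
#define DEMO_CSN_COMMITTING   2
#define DEMO_CSN_FIRST_NORMAL 3

/* Hypothetical, reduced snapshot: only the fields the check needs. */
typedef struct DemoSnapshot
{
    uint32_t xmin;  /* every xid below this had finished at snapshot time */
    uint32_t xmax;  /* every xid at or above this started after the snapshot */
    uint64_t csn;   /* snapshot CSN */
} DemoSnapshot;

/* Decide whether the effects of xid (with looked-up CSN xid_csn) are visible. */
static bool demo_xid_visible(uint32_t xid, uint64_t xid_csn, const DemoSnapshot *snap)
{
    bool committed = xid_csn >= DEMO_CSN_FIRST_NORMAL;

    if (xid >= snap->xmax)
        return false;       /* not yet completed when the snapshot was taken */
    if (xid < snap->xmin)
        return committed;   /* finished earlier: visible unless aborted */
    if (!committed)
        return false;       /* in progress, committing, or aborted */
    return xid_csn < snap->csn;
}
```

The xmin and xmax tests are shortcuts; only xids between them need the CSN comparison.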

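2.4 How Transaction Commit and Abort Update the CSN

Taking a CSN snapshot relies mainly on the members maintained in polar_shmem_csn_mvcc_var_cache, as described in the snapshot acquisition section above.
This section therefore focuses on how a transaction updates those members when it commits or aborts.

AdvanceOldestActiveXidCSN advances polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid:
The value needs to be advanced when a process exits, after a transaction commits or rolls back, and when a standby replays a commit or abort record. It is advanced only when the finishing transaction's xid equals polar_shmem_csn_mvcc_var_cache->polar_oldest_active_xid; otherwise the function returns immediately.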
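The early-return behaviour can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the body of AdvanceOldestActiveXidCSN: the `DemoState` type, the in-memory CSN array, and the forward scan to the next in-progress xid are all hypothetical, and the source only confirms that the value moves when the finishing xid equals the current oldest active xid.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical reserved CSN value for "still running". */
#define DEMO_CSN_INPROGRESS 0

/* Hypothetical state: CSN slots for xids [base_xid, next_xid). */
typedef struct DemoState
{
    _Atomic uint32_t oldest_active_xid;
    uint32_t base_xid;
    uint32_t next_xid;
    uint64_t *csn_of;
} DemoState;

static void demo_state_init(DemoState *st, uint64_t *slots, uint32_t base_xid,
                            uint32_t next_xid, uint32_t oldest_active_xid)
{
    atomic_init(&st->oldest_active_xid, oldest_active_xid);
    st->base_xid = base_xid;
    st->next_xid = next_xid;
    st->csn_of = slots;
}

/* Called after my_xid has finished (its CSN slot is already set). */
static void demo_advance_oldest_active_xid(DemoState *st, uint32_t my_xid)
{
    uint32_t xid;

    /* Only the transaction that currently is the oldest active one advances it. */
    if (atomic_load(&st->oldest_active_xid) != my_xid)
        return;

    /* Assumed policy: move forward to the next xid that is still running. */
    xid = my_xid + 1;
    while (xid < st->next_xid &&
           st->csn_of[xid - st->base_xid] != DEMO_CSN_INPROGRESS)
        xid++;

    atomic_store(&st->oldest_active_xid, xid);
}
```

Because most finishing transactions are not the oldest active one, the common path is a single atomic read followed by a return.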

polar_xact_abort_tree_csn sets the transaction's CSN to POLAR_CSN_ABORTED on rollback and advances polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid.
polar_xact_commit_tree_csn sets the transaction's CSN on commit and advances both polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid and polar_shmem_csn_mvcc_var_cache->polar_next_csn.

polar_shmem_csn_mvcc_var_cache->polar_next_csn is advanced only by committing transactions; rollbacks do not advance it.
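The difference between the two paths is which counters move. The sketch below shows only that bookkeeping, using C11 atomics in place of PostgreSQL's pg_atomic API. The `DemoCsnCache` type, the `demo_` functions, and the numeric value of `DEMO_CSN_ABORTED` are assumptions; the real functions also handle subtransactions, and this sketch is not safe against concurrent callers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for POLAR_CSN_ABORTED; the real value may differ. */
#define DEMO_CSN_ABORTED 1

/* Hypothetical, reduced version of the shared cache. */
typedef struct DemoCsnCache
{
    _Atomic uint64_t next_csn;
    _Atomic uint32_t latest_completed_xid;
} DemoCsnCache;

static void demo_cache_init(DemoCsnCache *c, uint64_t next_csn, uint32_t latest_xid)
{
    atomic_init(&c->next_csn, next_csn);
    atomic_init(&c->latest_completed_xid, latest_xid);
}

/* Both commit and abort move latest_completed_xid forward, never backward. */
static void demo_bump_latest_completed(DemoCsnCache *c, uint32_t xid)
{
    if (xid > atomic_load(&c->latest_completed_xid))
        atomic_store(&c->latest_completed_xid, xid);
}

/* Commit: take the next CSN, record it for the xid, advance both counters. */
static uint64_t demo_commit(DemoCsnCache *c, uint64_t *csn_slot, uint32_t xid)
{
    uint64_t csn = atomic_fetch_add(&c->next_csn, 1);

    *csn_slot = csn;
    demo_bump_latest_completed(c, xid);
    return csn;
}

/* Abort: mark the xid aborted; next_csn is left untouched. */
static void demo_abort(DemoCsnCache *c, uint64_t *csn_slot, uint32_t xid)
{
    *csn_slot = DEMO_CSN_ABORTED;
    demo_bump_latest_completed(c, xid);
}
```

Leaving next_csn alone on abort keeps the CSN sequence dense over committed transactions only, which is what the snapshot comparison relies on.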

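With CSN enabled, only ShmemVariableCache->nextXid is still updated among the members of ShmemVariableCache, the global variable PG originally uses to manage xid allocation; it continues to be used for assigning xids. Members such as ShmemVariableCache->latestCompletedXid have been superseded by polar_shmem_csn_mvcc_var_cache->polar_latest_completed_xid, so they no longer need to be maintained when a transaction's state changes.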