postgres source code analysis 45: btree split flow _bt

B+ tree overview

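A B+ tree is a multiway balanced tree with the following properties:

A B+ tree of order m holds at most m-1 elements per node, and every node except the root holds at least ceil(m/2)-1 elements. For example, in an order-5 B+ tree each node holds at most 4 elements and each non-root node holds at least 2;
Internal nodes store only index keys, not data; all data lives in the leaf nodes. This maximizes the number of keys per internal node and so reduces the height of the tree;
Keys are kept sorted, leaf nodes are ordered relative to one another, and every lookup follows a path of the same length;
Inserts and updates run in stable logarithmic time. The leaf level contains every key that appears in the internal nodes, each lookup must descend to a leaf, and insertions always proceed bottom-up;
Adjacent leaf nodes are linked through right pointers, which makes range scans efficient;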
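
The occupancy bounds in the first property can be checked with a few lines of C. This is only a sketch of the textbook definition used above; the function names are made up for illustration and have nothing to do with the PostgreSQL source.

```c
#include <assert.h>

/* Maximum number of elements in a node of an order-m B+ tree (m - 1). */
static int bplus_max_keys(int m)
{
    assert(m >= 3);
    return m - 1;
}

/* Minimum number of elements in a non-root node: ceil(m/2) - 1. */
static int bplus_min_keys(int m)
{
    assert(m >= 3);
    return (m + 1) / 2 - 1;   /* (m + 1) / 2 equals ceil(m/2) for positive ints */
}

/* A node has to split when an insertion would push it past the maximum. */
static int bplus_needs_split(int m, int nkeys_after_insert)
{
    return nkeys_after_insert > bplus_max_keys(m);
}
```

For m = 5 the two bounds evaluate to 4 and 2, matching the example in the list, and a fifth element in a node triggers a split.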
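To stay balanced, a B+ tree performs a page split whenever an insertion would break its structural rules. This article uses the postgres source code to study how btree page splits work in PG. PG's btree differs in some respects from the common textbook btree; for background, see:
postgres source code analysis 41: creating the btree index file (1)
Postgresql source code (30): Postgresql index basics, B-linked-tree (cited)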

Key data structures

1 FindSplitData
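This struct records the working state of the split-point search: the space available on the left and right pages, the maximum and current number of candidate split points, and the array of candidates.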

typedef struct
{
	/* context data for _bt_recsplitloc */
	Relation	rel;			/* index relation */
	Page		origpage;		/* page undergoing split */
	IndexTuple	newitem;		/* new item (cause of page split) */
	Size		newitemsz;		/* size of newitem (includes line pointer) */
	bool		is_leaf;		/* T if splitting a leaf page */
	bool		is_rightmost;	/* T if splitting rightmost page on level */
	OffsetNumber newitemoff;	/* where the new item is to be inserted */
	int			leftspace;		/* space available for items on left page */
	int			rightspace;		/* space available for items on right page */
	int			olddataitemstotal;	/* space taken by old items */
	Size		minfirstrightsz;	/* smallest firstright size */

	/* candidate split point data */
	int			maxsplits;		/* maximum number of splits */
	int			nsplits;		/* current number of splits */
	SplitPoint *splits;			/* all candidate split points for page */
	int			interval;		/* current range of acceptable split points */
} FindSplitData;

2 SplitPoint
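This struct records the details of one candidate split point: the free space that would remain on the left and right pages if the page were split there (this is the basis for choosing the best split point), and whether the newly inserted tuple would go on the left page.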

typedef struct
{
	/* details of free space left by split */
	int16		curdelta;		/* current leftfree/rightfree delta */
	int16		leftfree;		/* space left on left page post-split */
	int16		rightfree;		/* space left on right page post-split */

	/* split point identifying fields (returned by _bt_findsplitloc) */
	OffsetNumber firstrightoff; /* first origpage item on rightpage */
	bool		newitemonleft;	/* new item goes on left, or right? */
} SplitPoint;
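
To make the role of leftfree, rightfree and the delta more concrete, here is a heavily simplified sketch. It assumes a page is just an array of item sizes, ignores the new item, the high key and line pointers, and uses a hypothetical DemoSplit struct in place of SplitPoint. It is not how _bt_recsplitloc is written, only the general idea of recording one candidate per legal split position and preferring the most balanced one.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for SplitPoint (not the nbtree definition). */
typedef struct
{
    int delta;          /* |leftfree - rightfree| */
    int leftfree;       /* free bytes on the left page after the split */
    int rightfree;      /* free bytes on the right page after the split */
    int firstrightoff;  /* 0-based index of the first item that goes right */
} DemoSplit;

/*
 * Record one candidate for every split position 1 .. nitems-1 where both
 * halves fit into 'space' bytes.  Returns the number of candidates stored.
 */
static int demo_collect_splits(const int *itemsz, int nitems, int space,
                               DemoSplit *out)
{
    int total = 0, leftsz = 0, n = 0;

    for (int i = 0; i < nitems; i++)
        total += itemsz[i];

    for (int first = 1; first < nitems; first++)
    {
        int leftfree, rightfree;

        leftsz += itemsz[first - 1];
        leftfree = space - leftsz;
        rightfree = space - (total - leftsz);
        if (leftfree < 0 || rightfree < 0)
            continue;           /* one half would overflow: not a legal split */

        out[n].leftfree = leftfree;
        out[n].rightfree = rightfree;
        out[n].delta = abs(leftfree - rightfree);
        out[n].firstrightoff = first;
        n++;
    }
    return n;
}

/* Index of the candidate with the smallest delta, or -1 if there is none. */
static int demo_best_split(const DemoSplit *splits, int nsplits)
{
    int best = -1;

    for (int i = 0; i < nsplits; i++)
        if (best < 0 || splits[i].delta < splits[best].delta)
            best = i;
    return best;
}
```

With item sizes {10, 20, 30, 40, 50} and 100 usable bytes per page, only the positions 3 and 4 are legal in this model, and position 3 wins with a delta of 30.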

Split strategies
typedef enum
{
	/* strategy for searching through materialized list of split points */
	SPLIT_DEFAULT,				/* give some weight to truncation */
	SPLIT_MANY_DUPLICATES,		/* find minimally distinguishing point */
	SPLIT_SINGLE_VALUE			/* leave left page almost full */
} FindSplitStrat;
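
The enum comments hint at how the strategies differ: the default aims for a balanced split, while SPLIT_SINGLE_VALUE leaves the left page almost full. One way to model that is a weighted delta, sketched below. The Demo names are hypothetical, the 0.50 and 0.96 targets are illustrative values rather than something this article states, and SPLIT_MANY_DUPLICATES is not modeled separately because it changes how candidates are searched rather than the fill target.

```c
#include <assert.h>

/* Hypothetical mirror of the strategy enum, for illustration only. */
typedef enum
{
    DEMO_DEFAULT,
    DEMO_MANY_DUPLICATES,
    DEMO_SINGLE_VALUE
} DemoStrat;

/* Target fill fraction of the left page (illustrative values). */
static double demo_fill_target(DemoStrat s)
{
    return (s == DEMO_SINGLE_VALUE) ? 0.96 : 0.50;
}

/* Weighted imbalance: zero when the left page is filled to the target. */
static double demo_weighted_delta(int leftfree, int rightfree, double fillmult)
{
    double d = fillmult * leftfree - (1.0 - fillmult) * rightfree;

    return d < 0 ? -d : d;
}

/* Index of the candidate with the smallest weighted delta, or -1 if n == 0. */
static int demo_pick(const int *leftfree, const int *rightfree, int n,
                     DemoStrat s)
{
    double fillmult = demo_fill_target(s);
    int best = -1;

    for (int i = 0; i < n; i++)
    {
        if (best < 0 ||
            demo_weighted_delta(leftfree[i], rightfree[i], fillmult) <
            demo_weighted_delta(leftfree[best], rightfree[best], fillmult))
            best = i;
    }
    return best;
}
```

Given three candidates with (leftfree, rightfree) of (90, 10), (50, 50) and (4, 96), the default target picks the middle one, while the single-value target picks the last one, where the left page is nearly full.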

_bt_split

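The execution flow is described step by step below; in the original figures the red item is the high key. The page to be split has already been exclusive-locked further up the insert path that starts at _bt_doinsert.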
1 Call _bt_findsplitloc to determine the split point of the page;

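2 Prepare a temporary left page
1) Allocate space for an index page, leftpage, in local memory and initialize its PageHeader fields;
2) Copy the flags of the page being split, opage, to leftpage, clear the BTP_ROOT/BTP_SPLIT_END/BTP_HAS_GARBAGE flag bits, and set BTP_INCOMPLETE_SPLIT (the split is not yet complete); also copy opage's level btpo_level and left-sibling link btpo_prev into the corresponding fields of leftpage;
3) Copy opage's LSN to leftpage, because XLogInsert may examine it;
4) Determine the high key for leftpage (the first item on the right page is the high key of the left page). There are two cases for the first right item:
(1) The split point equals the insert position and the new item goes on the right page: the new item is the first right item, and therefore the left page's high key;
(2) Otherwise: the existing item at the split point is the first right item, and therefore the left page's high key;
5) Apply suffix truncation to the left page's high key (leaf pages only)
(1) First identify the last item on the left page, which determines how many attributes of the first right item must be kept in the new left high key;
(2) Call _bt_truncate to perform the suffix truncation;
6) Insert the high key into the left page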
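
The choice of the first right item, the last left item, and the number of attributes to keep can be sketched as follows. This is a toy model under stated assumptions: tuples are fixed arrays of three integers, offsets are 0-based array indexes (real OffsetNumbers are 1-based), and all Demo names are hypothetical. The real _bt_truncate also has to handle heap TIDs and arbitrary datatypes, which this sketch ignores.

```c
#include <assert.h>

#define DEMO_NATTS 3

/* Hypothetical tuple: a fixed number of integer key attributes. */
typedef struct
{
    int att[DEMO_NATTS];
} DemoTuple;

/*
 * First item of the right page: the new item if it lands exactly at the
 * split point and goes right, otherwise the existing item at the split point.
 */
static const DemoTuple *demo_firstright(const DemoTuple *items, int firstrightoff,
                                        const DemoTuple *newitem, int newitemoff,
                                        int newitemonleft)
{
    if (!newitemonleft && newitemoff == firstrightoff)
        return newitem;
    return &items[firstrightoff];
}

/*
 * Last item of the left page: the new item if it lands exactly at the split
 * point and goes left, otherwise the existing item just before the split point.
 */
static const DemoTuple *demo_lastleft(const DemoTuple *items, int firstrightoff,
                                      const DemoTuple *newitem, int newitemoff,
                                      int newitemonleft)
{
    if (newitemonleft && newitemoff == firstrightoff)
        return newitem;
    return &items[firstrightoff - 1];
}

/*
 * Number of leading attributes of firstright needed to tell it apart from
 * lastleft.  DEMO_NATTS + 1 means every attribute is equal, in which case a
 * real implementation needs an extra tiebreaker.
 */
static int demo_keep_natts(const DemoTuple *lastleft, const DemoTuple *firstright)
{
    int keep = 1;

    for (int i = 0; i < DEMO_NATTS; i++)
    {
        if (lastleft->att[i] != firstright->att[i])
            break;
        keep++;
    }
    return keep;
}
```

If the last left tuple is (1, 2, 9) and the first right tuple is (3, 0, 0), a single attribute is enough to separate the two pages, so the high key can drop the other two.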
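3 Allocate the right-page buffer (a new page is obtained by extending the index file through ReadBufferExtended)

1) Acquire the buffer for the right page with a write lock held, and get the right page, rightpage;
2) Point btpo_next of the temporary left page at the right page's block number, rightpagenumber;
3) Point btpo_prev of the right page at the block number of the page being split, origpagenumber (the left page is eventually written back over the original page);
4) Set btpo_next of the right page to the original page's btpo_next;
5) Obtain the vacuum cycle ID and assign it to both the left page and the right page;
6) If the original page is not the rightmost page on its level, set the high key of rightpage to the original page's high key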
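
The sibling-link bookkeeping in this step amounts to a doubly linked list insertion. A minimal sketch, assuming pages are identified by plain block numbers and using a hypothetical DemoLinks struct in place of the page's opaque area:

```c
#include <assert.h>

/* Hypothetical stand-in for the prev/next fields of a page's opaque area. */
typedef struct
{
    unsigned prev;  /* left sibling block number */
    unsigned next;  /* right sibling block number */
} DemoLinks;

/*
 * Wire up the temporary left page and the new right page.  The left page
 * inherits the original page's left sibling and will later be written back
 * to orig_blkno, so the right page points back at orig_blkno.
 */
static void demo_link_split(const DemoLinks *orig, unsigned orig_blkno,
                            unsigned right_blkno,
                            DemoLinks *left, DemoLinks *right)
{
    assert(orig_blkno != right_blkno);

    left->prev = orig->prev;
    left->next = right_blkno;
    right->prev = orig_blkno;
    right->next = orig->next;
}
```

For an original page at block 5 with siblings 3 and 9 and a new right page at block 12, the left page ends up with links (3, 12) and the right page with (5, 9). The old right sibling's back link is not touched here; that is handled later in the flow.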

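4 Distribute and fill in the data
1) Walk all index tuples on the old page and, by each tuple's position relative to the split point, decide whether it belongs on the temporary left page or on rightpage;
2) Call _bt_pgaddtup to add each tuple to its page at the corresponding offset;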
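
The distribution loop can be sketched on plain integer arrays. The sketch assumes 0-based offsets, treats items as integers, and uses hypothetical names; in the real code each append is a _bt_pgaddtup call and the items are index tuples.

```c
#include <assert.h>

/*
 * Distribute 'n' existing items plus 'newitem' over the left and right
 * output arrays.  Items before firstrightoff go left, the rest go right;
 * the new item is placed just before the existing item at newitemoff, on
 * the side selected by newitemonleft.
 */
static void demo_distribute(const int *items, int n,
                            int newitem, int newitemoff,
                            int firstrightoff, int newitemonleft,
                            int *left, int *nleft,
                            int *right, int *nright)
{
    /* the new item can only go left if it sorts at or before the split point */
    assert(newitemonleft ? newitemoff <= firstrightoff
                         : newitemoff >= firstrightoff);

    *nleft = 0;
    *nright = 0;

    for (int i = 0; i < n; i++)
    {
        if (i == newitemoff)
        {
            if (newitemonleft)
                left[(*nleft)++] = newitem;
            else
                right[(*nright)++] = newitem;
        }

        if (i < firstrightoff)
            left[(*nleft)++] = items[i];
        else
            right[(*nright)++] = items[i];
    }

    /* new item sorts after every existing item: it must go on the right */
    if (newitemoff == n)
    {
        assert(!newitemonleft);
        right[(*nright)++] = newitem;
    }
}
```

With items {10, 20, 30, 40}, a new item 25 at offset 2 and a split point of 2, the result is left = {10, 20}, right = {25, 30, 40} when the new item goes right, and left = {10, 20, 25}, right = {30, 40} when it goes left.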
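5 If opage has a right sibling spage, take a write lock on it so that its left-sibling link can be updated to point at rightpage; if spage has a different vacuum cycle ID from rightpage, set the BTP_SPLIT_END flag on rightpage;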

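6 Enter the critical section and apply the changes
1) First copy the contents of the temporary left page over the old page opage and free the memory held by the temporary left page;
2) Mark the buffers holding opage, rightpage, and spage as dirty;
3) If the page being split is not a leaf, clear the BTP_INCOMPLETE_SPLIT flag on cpage (the page in cbuf, that is, the child page whose split this insertion completes) and mark that buffer dirty;
4) Build the XLOG record for the split; the key information includes the level of rightpage, the split point, and the opage/rightpage/spage/cpage data described above;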
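
The flag handling mentioned in steps 2 and 6 is ordinary bit manipulation. The sketch below uses DEMO_-prefixed constants; the bit positions are meant to follow nbtree.h but should be treated as illustrative and checked against the header of the version being read.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits (check nbtree.h for the authoritative values). */
#define DEMO_BTP_LEAF             (1 << 0)
#define DEMO_BTP_ROOT             (1 << 1)
#define DEMO_BTP_SPLIT_END        (1 << 5)
#define DEMO_BTP_HAS_GARBAGE      (1 << 6)
#define DEMO_BTP_INCOMPLETE_SPLIT (1 << 7)

/*
 * Step 2: flags of the new left page, derived from the original page.
 * ROOT, SPLIT_END and HAS_GARBAGE are cleared; INCOMPLETE_SPLIT is set
 * because the right page has no downlink in the parent yet.
 */
static uint16_t demo_left_flags(uint16_t origflags)
{
    uint16_t f = origflags;

    f &= (uint16_t) ~(DEMO_BTP_ROOT | DEMO_BTP_SPLIT_END | DEMO_BTP_HAS_GARBAGE);
    f |= DEMO_BTP_INCOMPLETE_SPLIT;
    return f;
}

/*
 * Step 6: once the downlink for a child's right half has been inserted into
 * the page being split, the child's split is complete.
 */
static uint16_t demo_child_flags_after_downlink(uint16_t childflags)
{
    return childflags & (uint16_t) ~DEMO_BTP_INCOMPLETE_SPLIT;
}
```

A page that was a leaf root with garbage keeps only the leaf bit and gains the incomplete-split bit; that bit stays set until the parent level receives the downlink.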

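7 Cleanup
1) If the page is not the rightmost on its level, release the write lock and pin on sbuf;
2) If the page is not a leaf, release the write lock and pin on cbuf;
3) If it is a leaf, free the lefthighkey memory allocated during the split;
4) Finally, return the right page's buffer;
(The write locks and pins on opage and rightpage are still held at this point)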

[Figure: overall flow diagram]

_bt_insert_parent

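As the steps above show, the link between rightpage and the parent page has not yet been established, and the locks held on the split pages have not been released. That work is done by _bt_insert_parent, shown below: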

/*
 * _bt_insert_parent() -- Insert downlink into parent, completing split.
 *
 * On entry, buf and rbuf are the left and right split pages, which we
 * still hold write locks on.  Both locks will be released here.  We
 * release the rbuf lock once we have a write lock on the page that we
 * intend to insert a downlink to rbuf on (i.e. buf's current parent page).
 * The lock on buf is released at the same point as the lock on the parent
 * page, since buf's INCOMPLETE_SPLIT flag must be cleared by the same
 * atomic operation that completes the split by inserting a new downlink.
 *
 * stack - stack showing how we got here.  Will be NULL when splitting true
 *			root, or during concurrent root split, where we can be inefficient
 * isroot - we split the true root
 * isonly - we split a page alone on its level (might have been fast root)
 */
static void
_bt_insert_parent(Relation rel,
				  Buffer buf,
				  Buffer rbuf,
				  BTStack stack,
				  bool isroot,
				  bool isonly)
{
	/*
	 * Here we have to do something Lehman and Yao don't talk about: deal with
	 * a root split and construction of a new root.  If our stack is empty
	 * then we have just split a node on what had been the root level when we
	 * descended the tree.  If it was still the root then we perform a
	 * new-root construction.  If it *wasn't* the root anymore, search to find
	 * the next higher level that someone constructed meanwhile, and find the
	 * right place to insert as for the normal case.
	 *
	 * If we have to search for the parent level, we do so by re-descending
	 * from the root.  This is not super-efficient, but it's rare enough not
	 * to matter.
	 */
	if (isroot)
	{
		Buffer		rootbuf;

		Assert(stack == NULL);
		Assert(isonly);
		/* create a new root node and update the metapage */
		rootbuf = _bt_newroot(rel, buf, rbuf);
		/* release the split buffers */
		_bt_relbuf(rel, rootbuf);
		_bt_relbuf(rel, rbuf);
		_bt_relbuf(rel, buf);
	}
	else
	{
		BlockNumber bknum = BufferGetBlockNumber(buf);
		BlockNumber rbknum = BufferGetBlockNumber(rbuf);
		Page		page = BufferGetPage(buf);
		IndexTuple	new_item;
		BTStackData fakestack;
		IndexTuple	ritem;
		Buffer		pbuf;

		if (stack == NULL)
		{
			BTPageOpaque opaque;

			elog(DEBUG2, "concurrent ROOT page split");
			opaque = BTPageGetOpaque(page);

			/*
			 * We should never reach here when a leaf page split takes place
			 * despite the insert of newitem being able to apply the fastpath
			 * optimization.  Make sure of that with an assertion.
			 *
			 * This is more of a performance issue than a correctness issue.
			 * The fastpath won't have a descent stack.  Using a phony stack
			 * here works, but never rely on that.  The fastpath should be
			 * rejected within _bt_search_insert() when the rightmost leaf
			 * page will split, since it's faster to go through _bt_search()
			 * and get a stack in the usual way.
			 */
			Assert(!(P_ISLEAF(opaque) &&
					 BlockNumberIsValid(RelationGetTargetBlock(rel))));

			/* Find the leftmost page at the next level up */
			pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
			/* Set up a phony stack entry pointing there */
			stack = &fakestack;
			stack->bts_blkno = BufferGetBlockNumber(pbuf);
			stack->bts_offset = InvalidOffsetNumber;
			stack->bts_parent = NULL;
			_bt_relbuf(rel, pbuf);
		}

		/* get high key from left, a strict lower bound for new right page */
		ritem = (IndexTuple) PageGetItem(page,
										 PageGetItemId(page, P_HIKEY));

		/* form an index tuple that points at the new right page */
		new_item = CopyIndexTuple(ritem);
		BTreeTupleSetDownLink(new_item, rbknum);

		/*
		 * Re-find and write lock the parent of buf.
		 *
		 * It's possible that the location of buf's downlink has changed since
		 * our initial _bt_search() descent.  _bt_getstackbuf() will detect
		 * and recover from this, updating the stack, which ensures that the
		 * new downlink will be inserted at the correct offset. Even buf's
		 * parent may have changed.
		 */
		pbuf = _bt_getstackbuf(rel, stack, bknum);

		/*
		 * Unlock the right child.  The left child will be unlocked in
		 * _bt_insertonpg().
		 *
		 * Unlocking the right child must be delayed until here to ensure that
		 * no concurrent VACUUM operation can become confused.  Page deletion
		 * cannot be allowed to fail to re-find a downlink for the rbuf page.
		 * (Actually, this is just a vestige of how things used to work.  The
		 * page deletion code is expected to check for the INCOMPLETE_SPLIT
		 * flag on the left child.  It won't attempt deletion of the right
		 * child until the split is complete.  Despite all this, we opt to
		 * conservatively delay unlocking the right child until here.)
		 */
		_bt_relbuf(rel, rbuf);

		if (pbuf == InvalidBuffer)
			ereport(ERROR,
					(errcode(ERRCODE_INDEX_CORRUPTED),
					 errmsg_internal("failed to re-find parent key in index \"%s\" for split pages %u/%u",
									 RelationGetRelationName(rel), bknum, rbknum)));

		/* Recursively insert into the parent */
		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
					   new_item, MAXALIGN(IndexTupleSize(new_item)),
					   stack->bts_offset + 1, 0, isonly);

		/* be tidy */
		pfree(new_item);
	}
}
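
The core of the non-root branch is: copy the left page's high key, point the copy at the new right page, and insert it in the parent immediately after the left page's own downlink (bts_offset + 1). A toy model of that step, assuming integer keys, a fixed-size parent array and hypothetical Demo names:

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAXITEMS 8

/* Hypothetical pivot tuple: a separator key plus a child block number. */
typedef struct
{
    int      key;
    unsigned downlink;
} DemoPivot;

/* Hypothetical parent page: a small sorted array of pivots. */
typedef struct
{
    DemoPivot items[DEMO_MAXITEMS];
    int       nitems;
} DemoParent;

/* New pivot: a copy of the left page's high key that points at the right page. */
static DemoPivot demo_make_downlink(int left_highkey, unsigned right_blkno)
{
    DemoPivot p;

    p.key = left_highkey;
    p.downlink = right_blkno;
    return p;
}

/*
 * Insert 'pivot' immediately after position left_off, the position of the
 * left page's own downlink.  Returns 0 if the parent is full; the real code
 * would split the parent and recurse instead.
 */
static int demo_insert_after(DemoParent *parent, int left_off, DemoPivot pivot)
{
    int pos = left_off + 1;

    assert(left_off >= 0 && left_off < parent->nitems);
    if (parent->nitems == DEMO_MAXITEMS)
        return 0;

    memmove(&parent->items[pos + 1], &parent->items[pos],
            (size_t) (parent->nitems - pos) * sizeof(DemoPivot));
    parent->items[pos] = pivot;
    parent->nitems++;
    return 1;
}
```

If the parent holds downlinks to blocks 5 and 9 and block 5 splits with high key 50 into blocks 5 and 12, the parent afterwards holds downlinks to 5, 12 and 9, with 50 as the separator for block 12.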