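Introduction to B+ Trees
A B+ tree is a multiway balanced search tree with the following properties:
- An order-m B+ tree holds at most m-1 elements per node, and every node except the root holds at least ceil(m/2)-1 elements. For example, in an order-5 B+ tree each node holds at most 4 elements, and every non-root node holds at least 2;
- Internal nodes store only index keys, not data; all data lives in the leaf nodes. This maximizes the number of index keys per internal node and so reduces the height of the tree;
- Keys are kept sorted and the leaf nodes are ordered, so every lookup follows a path of the same length;
- Inserts and updates have stable logarithmic time complexity. The leaf nodes hold every key that appears in their parent nodes, every lookup has to descend to a leaf, and insertions into a B+ tree always proceed bottom-up;
- Adjacent leaf nodes are linked through right pointers, which makes range scans efficient.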
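The last property can be illustrated with a minimal sketch of a range scan that walks the leaf chain through the right pointers. The `Leaf` type, `LEAF_CAPACITY`, and `range_count` are hypothetical names made up for this illustration; they are not PostgreSQL structures.

```c
#include <assert.h>
#include <stddef.h>

#define LEAF_CAPACITY 4         /* order 5: at most 4 keys per node */

/* Hypothetical in-memory leaf node, for illustration only. */
typedef struct Leaf
{
    int          nkeys;                 /* number of keys in use */
    int          keys[LEAF_CAPACITY];   /* sorted keys */
    struct Leaf *right;                 /* right-sibling pointer, NULL at the end */
} Leaf;

/*
 * Count the keys in [lo, hi], starting at the leaf where a descent for
 * lo ended and following right pointers until a key exceeds hi.
 */
int range_count(const Leaf *start, int lo, int hi)
{
    int count = 0;

    assert(lo <= hi);
    for (const Leaf *leaf = start; leaf != NULL; leaf = leaf->right)
    {
        for (int i = 0; i < leaf->nkeys; i++)
        {
            if (leaf->keys[i] > hi)
                return count;   /* keys are sorted: nothing further matches */
            if (leaf->keys[i] >= lo)
                count++;
        }
    }
    return count;
}
```

Once the first leaf is found, the scan never goes back up to an internal node, which is why the sibling links help range queries.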
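To stay balanced, a B+ tree performs a page split whenever an insertion would violate its structural rules. This article walks through the postgres source code to study how btree page splits work in PG. The btree in pg differs somewhat from the textbook btree; for background, see the following posts: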
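postgres source code analysis 41: creating the btree index file, part 1 (original title: postgres源码解析41 btree索引文件的创建–1)
Postgresql source code (30): Postgresql index basics, B-linked-tree (reference; original title: Postgresql源码(30)Postgresql索引基础B-linked-tree)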
Key Data Structures
1 FindSplitData
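This struct records the state of the split-point search: the space available on the left and right pages, and the maximum and current number of candidate split points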
typedef struct
{
    /* context data for _bt_recsplitloc */
    Relation    rel;            /* index relation */
    Page        origpage;       /* page undergoing split */
    IndexTuple  newitem;        /* new item (cause of page split) */
    Size        newitemsz;      /* size of newitem (includes line pointer) */
    bool        is_leaf;        /* T if splitting a leaf page */
    bool        is_rightmost;   /* T if splitting rightmost page on level */
    OffsetNumber newitemoff;    /* where the new item is to be inserted */
    int         leftspace;      /* space available for items on left page */
    int         rightspace;     /* space available for items on right page */
    int         olddataitemstotal;  /* space taken by old items */
    Size        minfirstrightsz;    /* smallest firstright size */

    /* candidate split point data */
    int         maxsplits;      /* maximum number of splits */
    int         nsplits;        /* current number of splits */
    SplitPoint *splits;         /* all candidate split points for page */
    int         interval;       /* current range of acceptable split points */
} FindSplitData;
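To show how the leftspace, rightspace, and olddataitemstotal fields fit together, here is a simplified sketch of the free-space arithmetic for one candidate split. `ToySplitState`, `ToyFree`, and `toy_free_after_split` are hypothetical names for this illustration, and the sketch deliberately ignores details the real code has to handle, such as the space the left page needs for its high key.

```c
#include <assert.h>

/* Hypothetical subset of the split state, for illustration only. */
typedef struct
{
    int leftspace;          /* usable bytes on the left page */
    int rightspace;         /* usable bytes on the right page */
    int olddataitemstotal;  /* bytes taken by all existing items */
} ToySplitState;

typedef struct
{
    int leftfree;           /* bytes left over on the left page */
    int rightfree;          /* bytes left over on the right page */
} ToyFree;

/*
 * Free space on both halves if the first olddataitemstoleft bytes of
 * existing items stay on the left and the rest move to the right.
 * The new item is charged to whichever side it lands on.
 */
ToyFree toy_free_after_split(const ToySplitState *state,
                             int olddataitemstoleft,
                             int newitemsz,
                             int newitemonleft)
{
    ToyFree result;

    assert(olddataitemstoleft >= 0);
    assert(olddataitemstoleft <= state->olddataitemstotal);

    result.leftfree = state->leftspace - olddataitemstoleft;
    result.rightfree = state->rightspace -
        (state->olddataitemstotal - olddataitemstoleft);

    if (newitemonleft)
        result.leftfree -= newitemsz;
    else
        result.rightfree -= newitemsz;

    return result;
}
```

A candidate is only usable when both values are non-negative; the byte counts in any call are made-up inputs, not measurements from a real page.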
2 SplitPoint
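This struct records the details of one candidate split point: the free space that would remain on the left and right pages if the page were split there (this is the basis for choosing the best split point), and whether the newly inserted tuple would land on the left page.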
typedef struct
{
    /* details of free space left by split */
    int16       curdelta;       /* current leftfree/rightfree delta */
    int16       leftfree;       /* space left on left page post-split */
    int16       rightfree;      /* space left on right page post-split */

    /* split point identifying fields (returned by _bt_findsplitloc) */
    OffsetNumber firstrightoff; /* first origpage item on rightpage */
    bool        newitemonleft;  /* new item goes on left, or right? */
} SplitPoint;
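As a rough illustration of how the free-space figures can drive the choice, the sketch below picks the candidate whose two halves are most evenly filled. This is a simplification under stated assumptions: `ToySplitPoint` and `toy_choose_split` are hypothetical, and the real search also weighs the fill factor and the split strategy rather than using the plain difference alone.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for SplitPoint. */
typedef struct
{
    int leftfree;       /* free space on the left page after the split */
    int rightfree;      /* free space on the right page after the split */
    int firstrightoff;  /* first original item that goes to the right page */
} ToySplitPoint;

/*
 * Return the index of the candidate with the smallest
 * |leftfree - rightfree|, or -1 when there are no candidates.
 * Ties go to the earliest candidate.
 */
int toy_choose_split(const ToySplitPoint *splits, int nsplits)
{
    int best = -1;
    int bestdelta = 0;

    assert(nsplits >= 0);
    for (int i = 0; i < nsplits; i++)
    {
        int delta = abs(splits[i].leftfree - splits[i].rightfree);

        if (best < 0 || delta < bestdelta)
        {
            best = i;
            bestdelta = delta;
        }
    }
    return best;
}
```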
Split strategies
typedef enum
{
    /* strategy for searching through materialized list of split points */
    SPLIT_DEFAULT,              /* give some weight to truncation */
    SPLIT_MANY_DUPLICATES,      /* find minimally distinguishing point */
    SPLIT_SINGLE_VALUE          /* leave left page almost full */
} FindSplitStrat;
_bt_split
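The walkthrough below uses the figure as an example; entries shown in red are high keys. The page to be split has already been locked exclusively in the upper-level insert function _bt_doinsert.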
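1 Call _bt_findsplitloc to determine the split point of the page;
2 Prepare a temporary left page
1) Allocate space for an index page, leftpage, in local memory and initialize its PageHeader fields;
2) Copy the flags of the page being split, opage, to leftpage, clear the BTP_ROOT/BTP_SPLIT_END/BTP_HAS_GARBAGE flag bits, and set BTP_INCOMPLETE_SPLIT (marking the split as not yet complete); also copy opage's level btpo_level and left-sibling link btpo_prev into the corresponding fields of leftpage;
3) Copy opage's LSN to leftpage, since XLogInsert may need it;
4) Determine the high key for leftpage (the first item on the right page becomes the left page's high key). There are two cases for the first item on the right page:
(1) The split point equals the insertion point and the new item goes to the right page: the new item is the first item on the right page, and therefore the left page's high key
(2) Otherwise, the existing item at the split point is the first item on the right page, and therefore the left page's high key
5) Apply suffix truncation to the left page's high key (if needed)
(1) First identify the last item on the left page, which determines how many attributes of the right page's first item must be kept in the left page's new high key
(2) Call _bt_truncate to perform the suffix truncation
6) Insert the high key into the left page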
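3 Acquire a buffer for the right page (the new page is obtained by extending the index file through ReadBufferExtended)
- Acquire the right page's buffer with a write lock held, and obtain the right page rightpage;
- Point btpo_next of the temporary left page at the right page's block number, rightpagenumber;
- Point btpo_prev of the right page at the original page's block number, origpagenumber (the left page is eventually written back over the original page);
- Point the right page's next link at whatever the original page's next link pointed to;
- Obtain the vacuum cycle ID and assign it to both the left and the right page;
- If the original page is not the rightmost page on its level, set the high key of rightpage to the original page's high key.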
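4 Distribute and fill in the data
1) Iterate over all index tuples on the old page and, based on each tuple's position relative to the split point, decide whether it belongs on the temporary left page or on rightpage;
2) Call _bt_pgaddtup to add it to the chosen page at the corresponding offset;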
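5 If opage has a right sibling spage, take a write lock on it so that its left-sibling link can be updated, and set the BTP_SPLIT_END flag on rightpage when spage's vacuum cycle ID differs from that of rightpage;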
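6 Enter the critical section and perform the writes
1) First copy the contents of the temporary left page onto the original page opage and free the memory used by the temporary left page;
2) Mark the buffers holding the old page opage, the right page rightpage, and spage as dirty;
3) If the page being split (the old page) is not a leaf, clear the BTP_INCOMPLETE_SPLIT flag on cpage, the page in cbuf, and mark that buffer dirty;
4) Build the XLOG record for the split; the key information includes the level of rightpage, the split point, and the opage/rightpage/spage/cpage data described above;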
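7 Cleanup
1) If the current page is not the rightmost page, release the write lock and pin on sbuf;
2) If the current page is not a leaf, release the write lock and pin on cbuf;
3) If it is a leaf, free the lefthighkey memory allocated during the split;
4) Finally, return the right page's buf;
(The write locks and pins on the current page opage and the right page rightpage have not been released yet.)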
[Figure: overall flow of _bt_split]
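The core of steps 2 through 6 can be sketched on a toy in-memory page. This is only an illustration under heavy assumptions: `ToyPage` and `toy_split` are hypothetical, plain integers stand in for index tuples, offsets are 0-based array indexes, and locking, flags, suffix truncation, and WAL logging are all left out.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_ITEMS 16

/* Hypothetical page: integers stand in for index tuples. */
typedef struct
{
    unsigned prev;          /* left-sibling block number */
    unsigned next;          /* right-sibling block number */
    int      has_high_key;  /* 0 on the rightmost page of a level */
    int      high_key;
    int      nitems;
    int      items[TOY_MAX_ITEMS];
} ToyPage;

/*
 * Split orig (block origblk) so that items[firstright..] move to right
 * (block rightblk), while inserting newkey at index newoff.  oldright is
 * orig's former right sibling, or NULL if orig was rightmost.
 */
void toy_split(ToyPage *orig, unsigned origblk,
               ToyPage *right, unsigned rightblk, ToyPage *oldright,
               int firstright, int newkey, int newoff, int newitemonleft)
{
    ToyPage left;

    assert(orig->nitems < TOY_MAX_ITEMS);
    assert(firstright > 0 && firstright < orig->nitems);
    assert(newoff >= 0 && newoff <= orig->nitems);

    memset(&left, 0, sizeof(left));
    memset(right, 0, sizeof(*right));

    /* Step 2: the temporary left page; its high key is the first right item. */
    left.prev = orig->prev;
    left.has_high_key = 1;
    if (!newitemonleft && newoff == firstright)
        left.high_key = newkey;                  /* case (1) */
    else
        left.high_key = orig->items[firstright]; /* case (2) */

    /* Step 3: sibling links; the right page inherits the old high key. */
    left.next = rightblk;
    right->prev = origblk;
    right->next = orig->next;
    right->has_high_key = orig->has_high_key;
    right->high_key = orig->high_key;

    /* Step 4: distribute the old items and the new item. */
    for (int i = 0; i <= orig->nitems; i++)
    {
        if (i == newoff)
        {
            ToyPage *dst = newitemonleft ? &left : right;

            dst->items[dst->nitems++] = newkey;
        }
        if (i < orig->nitems)
        {
            ToyPage *dst = (i < firstright) ? &left : right;

            dst->items[dst->nitems++] = orig->items[i];
        }
    }

    /* Steps 5 and 6: fix the old sibling's link, then overwrite orig. */
    if (oldright != NULL)
        oldright->prev = rightblk;
    *orig = left;
}
```

Building the left half in a temporary page and copying it back only at the end mirrors the reason given above: the original page stays untouched until everything that can fail has already succeeded.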
_bt_insert_parent
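The flow above leaves the link between rightpage and the parent page unestablished, and none of the locks held have been released. Both are handled by _bt_insert_parent, which works as follows: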
/*
 * _bt_insert_parent() -- Insert downlink into parent, completing split.
 *
 * On entry, buf and rbuf are the left and right split pages, which we
 * still hold write locks on. Both locks will be released here. We
 * release the rbuf lock once we have a write lock on the page that we
 * intend to insert a downlink to rbuf on (i.e. buf's current parent page).
 * The lock on buf is released at the same point as the lock on the parent
 * page, since buf's INCOMPLETE_SPLIT flag must be cleared by the same
 * atomic operation that completes the split by inserting a new downlink.
 *
 * stack - stack showing how we got here. Will be NULL when splitting true
 * root, or during concurrent root split, where we can be inefficient
 * isroot - we split the true root
 * isonly - we split a page alone on its level (might have been fast root)
 */
static void
_bt_insert_parent(Relation rel,
                  Buffer buf,
                  Buffer rbuf,
                  BTStack stack,
                  bool isroot,
                  bool isonly)
{
    /*
     * Here we have to do something Lehman and Yao don't talk about: deal with
     * a root split and construction of a new root. If our stack is empty
     * then we have just split a node on what had been the root level when we
     * descended the tree. If it was still the root then we perform a
     * new-root construction. If it *wasn't* the root anymore, search to find
     * the next higher level that someone constructed meanwhile, and find the
     * right place to insert as for the normal case.
     *
     * If we have to search for the parent level, we do so by re-descending
     * from the root. This is not super-efficient, but it's rare enough not
     * to matter.
     */
    if (isroot)
    {
        Buffer rootbuf;

        Assert(stack == NULL);
        Assert(isonly);
        /* create a new root node and update the metapage */
        rootbuf = _bt_newroot(rel, buf, rbuf);
        /* release the split buffers */
        _bt_relbuf(rel, rootbuf);
        _bt_relbuf(rel, rbuf);
        _bt_relbuf(rel, buf);
    }
    else
    {
        BlockNumber bknum = BufferGetBlockNumber(buf);
        BlockNumber rbknum = BufferGetBlockNumber(rbuf);
        Page page = BufferGetPage(buf);
        IndexTuple new_item;
        BTStackData fakestack;
        IndexTuple ritem;
        Buffer pbuf;

        if (stack == NULL)
        {
            BTPageOpaque opaque;

            elog(DEBUG2, "concurrent ROOT page split");
            opaque = BTPageGetOpaque(page);

            /*
             * We should never reach here when a leaf page split takes place
             * despite the insert of newitem being able to apply the fastpath
             * optimization. Make sure of that with an assertion.
             *
             * This is more of a performance issue than a correctness issue.
             * The fastpath won't have a descent stack. Using a phony stack
             * here works, but never rely on that. The fastpath should be
             * rejected within _bt_search_insert() when the rightmost leaf
             * page will split, since it's faster to go through _bt_search()
             * and get a stack in the usual way.
             */
            Assert(!(P_ISLEAF(opaque) &&
                     BlockNumberIsValid(RelationGetTargetBlock(rel))));

            /* Find the leftmost page at the next level up */
            pbuf = _bt_get_endpoint(rel, opaque->btpo_level + 1, false, NULL);
            /* Set up a phony stack entry pointing there */
            stack = &fakestack;
            stack->bts_blkno = BufferGetBlockNumber(pbuf);
            stack->bts_offset = InvalidOffsetNumber;
            stack->bts_parent = NULL;
            _bt_relbuf(rel, pbuf);
        }

        /* get high key from left, a strict lower bound for new right page */
        ritem = (IndexTuple) PageGetItem(page,
                                         PageGetItemId(page, P_HIKEY));

        /* form an index tuple that points at the new right page */
        new_item = CopyIndexTuple(ritem);
        BTreeTupleSetDownLink(new_item, rbknum);

        /*
         * Re-find and write lock the parent of buf.
         *
         * It's possible that the location of buf's downlink has changed since
         * our initial _bt_search() descent. _bt_getstackbuf() will detect
         * and recover from this, updating the stack, which ensures that the
         * new downlink will be inserted at the correct offset. Even buf's
         * parent may have changed.
         */
        pbuf = _bt_getstackbuf(rel, stack, bknum);

        /*
         * Unlock the right child. The left child will be unlocked in
         * _bt_insertonpg().
         *
         * Unlocking the right child must be delayed until here to ensure that
         * no concurrent VACUUM operation can become confused. Page deletion
         * cannot be allowed to fail to re-find a downlink for the rbuf page.
         * (Actually, this is just a vestige of how things used to work. The
         * page deletion code is expected to check for the INCOMPLETE_SPLIT
         * flag on the left child. It won't attempt deletion of the right
         * child until the split is complete. Despite all this, we opt to
         * conservatively delay unlocking the right child until here.)
         */
        _bt_relbuf(rel, rbuf);

        if (pbuf == InvalidBuffer)
            ereport(ERROR,
                    (errcode(ERRCODE_INDEX_CORRUPTED),
                     errmsg_internal("failed to re-find parent key in index \"%s\" for split pages %u/%u",
                                     RelationGetRelationName(rel), bknum, rbknum)));

        /* Recursively insert into the parent */
        _bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
                       new_item, MAXALIGN(IndexTupleSize(new_item)),
                       stack->bts_offset + 1, 0, isonly);

        /* be tidy */
        pfree(new_item);
    }
}
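The split routine _bt_split itself runs inside the insert routine _bt_insertonpg, and inserting the downlink into the parent after a split again goes through _bt_insertonpg. This recursion ensures that every split is carried through to completion.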
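The shape of that recursion can be seen in a deliberately tiny model that tracks only how many items sit on each page along one root-to-leaf path. Everything here is an assumption made for illustration: `ToyPath`, `TOY_CAPACITY`, and `toy_insert_on_level` are hypothetical, and the model ignores keys, locks, and the re-finding of the parent.

```c
#include <assert.h>

#define TOY_CAPACITY   4    /* items a page can hold before it must split */
#define TOY_MAX_LEVELS 8

/* Item counts of the pages on one descent path; level 0 is the leaf. */
typedef struct
{
    int nlevels;
    int nitems[TOY_MAX_LEVELS];
} ToyPath;

/*
 * Insert one item on the page at the given level and return how many
 * page splits this caused.  A full page splits and then inserts a
 * downlink one level up, which may split the parent in turn.
 */
int toy_insert_on_level(ToyPath *path, int level)
{
    assert(level >= 0 && level <= path->nlevels);

    if (level == path->nlevels)
    {
        /* The old root was split: build a new root with two downlinks. */
        assert(path->nlevels < TOY_MAX_LEVELS);
        path->nitems[level] = 2;
        path->nlevels++;
        return 0;
    }

    if (path->nitems[level] < TOY_CAPACITY)
    {
        path->nitems[level]++;      /* room available: no split */
        return 0;
    }

    /* Page is full: split it, keep half here, recurse into the parent. */
    path->nitems[level] = (TOY_CAPACITY + 1) / 2;
    return 1 + toy_insert_on_level(path, level + 1);
}
```

The branch for `level == path->nlevels` plays the role of the isroot case in _bt_insert_parent, and the recursive call plays the role of the downlink insertion through _bt_insertonpg.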