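The previous article covered the split rules and the fill strategy. This one walks through how PostgreSQL's B-tree chooses the split point; the entry point is _bt_findsplitloc. For background, see the previous article: PostgreSQL Source Code Analysis 48: B-tree Page Split Point Selection, Part 1
Execution flow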
_bt_findsplitloc
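This function chooses the split point for the page being split. It returns the offset number (OffsetNumber), within the page being split, of the first index tuple that will land on the new right page.
1 First, initialize the available space for the left and right pages of the split. If the page being split is not the rightmost page, subtract the space taken by the high key from the right page's space;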
/* Total free space available on a btree page, after fixed overhead */
leftspace = rightspace =
    PageGetPageSize(origpage) - SizeOfPageHeaderData -
    MAXALIGN(sizeof(BTPageOpaqueData));

/* The right page will have the same high key as the old page */
if (!P_RIGHTMOST(opaque))
{
    itemid = PageGetItemId(origpage, P_HIKEY);
    rightspace -= (int) (MAXALIGN(ItemIdGetLength(itemid)) +
                         sizeof(ItemIdData));
}

/* Count up total space in data items before actually scanning 'em */
olddataitemstotal = rightspace - (int) PageGetExactFreeSpace(origpage);
leaffillfactor = BTGetFillFactor(rel);
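To make the arithmetic in step 1 concrete, here is a minimal C sketch. The constants are assumptions for a default build (8 kB page, 24-byte page header, 16-byte BTPageOpaqueData, 4-byte line pointer, 8-byte MAXALIGN). The TOY_* constants and the functions max_align, left_space and right_space are hypothetical names used only for illustration, not PostgreSQL APIs.

```c
#include <assert.h>

/* Assumed sizes for a default build; real values come from PostgreSQL headers. */
enum
{
    TOY_PAGE_SIZE = 8192,       /* assumed BLCKSZ */
    TOY_PAGE_HEADER_SIZE = 24,  /* assumed SizeOfPageHeaderData */
    TOY_BT_OPAQUE_SIZE = 16,    /* assumed sizeof(BTPageOpaqueData) */
    TOY_LINE_POINTER_SIZE = 4,  /* assumed sizeof(ItemIdData) */
    TOY_ALIGN = 8               /* assumed MAXALIGN boundary */
};

/* Round len up to the next multiple of TOY_ALIGN, in the spirit of MAXALIGN(). */
static int
max_align(int len)
{
    assert(len >= 0);
    return (len + TOY_ALIGN - 1) & ~(TOY_ALIGN - 1);
}

/* Space available for items on the left half of the split. */
static int
left_space(void)
{
    return TOY_PAGE_SIZE - TOY_PAGE_HEADER_SIZE - max_align(TOY_BT_OPAQUE_SIZE);
}

/*
 * Space available on the right half. A non-rightmost page must also hold a
 * copy of the old high key (tuple plus line pointer), so that is subtracted.
 */
static int
right_space(int is_rightmost, int hikey_len)
{
    int space = left_space();

    if (!is_rightmost)
        space -= max_align(hikey_len) + TOY_LINE_POINTER_SIZE;
    return space;
}
```

Under these assumptions both halves start with 8152 bytes, and a 16-byte high key on a non-rightmost page leaves 8132 bytes for the right half.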
2 Initialize the FindSplitData struct and fill in its fields;
/* Passed-in newitemsz is MAXALIGNED but does not include line pointer */
newitemsz += sizeof(ItemIdData);
state.rel = rel;
state.origpage = origpage;
state.newitem = newitem;
state.newitemsz = newitemsz;
state.is_leaf = P_ISLEAF(opaque);
state.is_rightmost = P_RIGHTMOST(opaque);
state.leftspace = leftspace;
state.rightspace = rightspace;
state.olddataitemstotal = olddataitemstotal;
state.minfirstrightsz = SIZE_MAX;
state.newitemoff = newitemoff;
/* newitem cannot be a posting list item */
Assert(!BTreeTupleIsPosting(newitem));
/*
 * nsplits should never exceed maxoff because there will be at most as
 * many candidate split points as there are points _between_ tuples, once
 * you imagine that the new item is already on the original page (the
 * final number of splits may be slightly lower because not all points
 * between tuples will be legal).
 */
state.maxsplits = maxoff;
state.splits = palloc(sizeof(SplitPoint) * state.maxsplits);
state.nsplits = 0;
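3 Loop over every index tuple on the page being split and call _bt_recsplitloc to record each legal candidate split point; the candidates are stored in the splits array of the FindSplitData struct;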
for (offnum = P_FIRSTDATAKEY(opaque);
     offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    Size itemsz;

    itemid = PageGetItemId(origpage, offnum);
    itemsz = MAXALIGN(ItemIdGetLength(itemid)) + sizeof(ItemIdData);

    /*
     * When item offset number is not newitemoff, neither side of the
     * split can be newitem. Record a split after the previous data item
     * from original page, but before the current data item from original
     * page. (_bt_recsplitloc() will reject the split when there are no
     * previous items, which we rely on.)
     */
    if (offnum < newitemoff)
        _bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);
    else if (offnum > newitemoff)
        _bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
    else
    {
        /*
         * Record a split after all "offnum < newitemoff" original page
         * data items, but before newitem
         */
        _bt_recsplitloc(&state, offnum, false, olddataitemstoleft, itemsz);

        /*
         * Record a split after newitem, but before data item from
         * original page at offset newitemoff/current offset
         */
        _bt_recsplitloc(&state, offnum, true, olddataitemstoleft, itemsz);
    }

    olddataitemstoleft += itemsz;
}

/*
 * Record a split after all original page data items, but before newitem.
 * (Though only when it's possible that newitem will end up alone on new
 * right page.)
 */
Assert(olddataitemstoleft == olddataitemstotal);
if (newitemoff > maxoff)
    _bt_recsplitloc(&state, newitemoff, false, olddataitemstotal, 0);
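The enumeration in step 3 can be sketched on a toy page as follows. This is a simplified model, not the real _bt_recsplitloc: it ignores the high key that the left page gains after the split, it works on plain int sizes, and it rejects an empty left page with an explicit check. ToySplit, ToyState, toy_recsplitloc and toy_collect_splits are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_SPLITS 32

typedef struct
{
    int  firstrightoff;     /* 1-based offset of first old item on right page */
    bool newitemonleft;     /* does the new item go to the left page? */
    int  leftfree;          /* free space left on the left page */
    int  rightfree;         /* free space left on the right page */
} ToySplit;

typedef struct
{
    int      leftspace;
    int      rightspace;
    int      newitemsz;
    int      newitemoff;    /* 1-based offset where the new item belongs */
    int      olddataitemstotal;
    int      nsplits;
    ToySplit splits[TOY_MAX_SPLITS];
} ToyState;

/* Record one candidate split if both halves have room (toy rules only). */
static void
toy_recsplitloc(ToyState *st, int firstrightoff, bool newitemonleft,
                int olddataitemstoleft)
{
    int leftfree = st->leftspace - olddataitemstoleft;
    int rightfree = st->rightspace -
        (st->olddataitemstotal - olddataitemstoleft);
    ToySplit *split;

    if (newitemonleft)
        leftfree -= st->newitemsz;
    else
        rightfree -= st->newitemsz;

    /* Toy rule: never leave the left page without any item. */
    if (olddataitemstoleft == 0 && !newitemonleft)
        return;
    if (leftfree < 0 || rightfree < 0)
        return;

    assert(st->nsplits < TOY_MAX_SPLITS);
    split = &st->splits[st->nsplits++];
    split->firstrightoff = firstrightoff;
    split->newitemonleft = newitemonleft;
    split->leftfree = leftfree;
    split->rightfree = rightfree;
}

/* Walk the old items in order, mirroring the loop shown above. */
static int
toy_collect_splits(ToyState *st, const int *itemsz, int nitems)
{
    int olddataitemstoleft = 0;

    st->nsplits = 0;
    st->olddataitemstotal = 0;
    for (int i = 0; i < nitems; i++)
        st->olddataitemstotal += itemsz[i];

    for (int off = 1; off <= nitems; off++)
    {
        if (off < st->newitemoff)
            toy_recsplitloc(st, off, false, olddataitemstoleft);
        else if (off > st->newitemoff)
            toy_recsplitloc(st, off, true, olddataitemstoleft);
        else
        {
            /* split just before newitem, then just after it */
            toy_recsplitloc(st, off, false, olddataitemstoleft);
            toy_recsplitloc(st, off, true, olddataitemstoleft);
        }
        olddataitemstoleft += itemsz[off - 1];
    }
    if (st->newitemoff > nitems)
        toy_recsplitloc(st, st->newitemoff, false, st->olddataitemstotal);
    return st->nsplits;
}
```

For a full page of four 25-byte items, 100 bytes of space per half, and a 25-byte new item at offset 3, the sketch records four candidates, which matches the bound that the number of candidates never exceeds maxoff.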
4 Choose the fill factor according to the type of the page being split and the characteristics of the new tuple;
if (!state.is_leaf)
{
    /* fillfactormult only used on rightmost page */
    usemult = state.is_rightmost;
    fillfactormult = BTREE_NONLEAF_FILLFACTOR / 100.0;
}
else if (state.is_rightmost)
{
    /* Rightmost leaf page -- fillfactormult always used */
    usemult = true;
    fillfactormult = leaffillfactor / 100.0;
}
else if (_bt_afternewitemoff(&state, maxoff, leaffillfactor, &usemult))
{
    /*
     * New item inserted at rightmost point among a localized grouping on
     * a leaf page -- apply "split after new item" optimization, either by
     * applying leaf fillfactor multiplier, or by choosing the exact split
     * point that leaves newitem as lastleft. (usemult is set for us.)
     */
    if (usemult)
    {
        /* fillfactormult should be set based on leaf fillfactor */
        fillfactormult = leaffillfactor / 100.0;
    }
    else
    {
        /* find precise split point after newitemoff */
        for (int i = 0; i < state.nsplits; i++)
        {
            SplitPoint *split = state.splits + i;

            if (split->newitemonleft &&
                newitemoff == split->firstrightoff)
            {
                pfree(state.splits);
                *newitemonleft = true;
                return newitemoff;
            }
        }

        /*
         * Cannot legally split after newitemoff; proceed with split
         * without using fillfactor multiplier. This is defensive, and
         * should never be needed in practice.
         */
        fillfactormult = 0.50;
    }
}
else
{
    /* Other leaf page. 50:50 page split. */
    usemult = false;
    /* fillfactormult not used, but be tidy */
    fillfactormult = 0.50;
}
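The effect of usemult and fillfactormult on the choice of split point can be illustrated with a small sketch. It assumes the multiplier is later used as a weight on the two free-space figures of each candidate, so that the candidate with the smallest weighted imbalance is preferred; that step is not shown in the excerpt above. The function toy_split_delta is a hypothetical name, and it uses an integer percentage instead of a double so the arithmetic stays exact (the result is scaled by 100).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Imbalance score of one candidate split, scaled by 100.
 * Smaller is better; 0 means the split hits the target exactly.
 *
 * With usemult, the target is a left page that is fillfactor_pct percent
 * full; without it, the target is an even 50:50 split.
 */
static long
toy_split_delta(bool usemult, int fillfactor_pct, int leftfree, int rightfree)
{
    long delta;

    assert(fillfactor_pct > 0 && fillfactor_pct <= 100);
    if (usemult)
        delta = (long) fillfactor_pct * leftfree -
            (long) (100 - fillfactor_pct) * rightfree;
    else
        delta = 100L * leftfree - 100L * rightfree;
    return labs(delta);
}
```

With a fill factor of 90, a candidate that leaves 100 bytes free on the left and 900 on the right scores 0, while an even 400/400 split scores 32000. The same even split scores 0 when usemult is false.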
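5 Call _bt_strategy to decide the split strategy and the "perfect penalty" (perfectpenalty);
1) If the page being split is not a leaf page, return state->minfirstrightsz as perfectpenalty directly. This field holds the smallest size of the first tuple on the right page (firstright) observed across all candidate split points.
2) Use the default split interval to determine the range of candidates, that is, the leftmost and rightmost index tuples: the lastleft tuple of the interval's leftmost split and the firstright tuple of its rightmost split;
3) Call _bt_keep_natts_fast to find the number of the first attribute whose value differs between leftmost and rightmost; this becomes perfectpenalty;
4) If perfectpenalty is less than or equal to the number of key attributes, return it. Some split in the interval then avoids appending a heap TID to the new high key, so the default strategy is kept;
5) Otherwise, recompute leftmost and rightmost from the page's leftmost and rightmost candidate split points (the leftpage and rightpage arguments). Steps 2 to 4 are therefore a fast path for determining perfectpenalty;
6) Recompute perfectpenalty. If it is less than or equal to the number of key attributes, set the strategy to SPLIT_MANY_DUPLICATES and return the number of key attributes as perfectpenalty. If it is greater and the page is the rightmost page on the leaf level, set the strategy to SPLIT_SINGLE_VALUE. Otherwise, compare the page's high key with the new item to recompute perfectpenalty: if the result is less than or equal to the number of key attributes, the strategy is SPLIT_SINGLE_VALUE, and if not, the default strategy is kept.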
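The comparison that produces perfectpenalty in the steps above can be illustrated with a toy version that works on plain int arrays. It is only a sketch: the real _bt_keep_natts_fast operates on index tuples and also handles NULLs, and toy_keep_natts_fast is a hypothetical name. The key property is the return value: the 1-based number of the first attribute that differs, or nkeyatts + 1 when every key attribute is equal, which is the case where a heap TID would have to be appended to tell the tuples apart.

```c
#include <assert.h>

/*
 * Return how many leading attributes are needed to distinguish the two
 * tuples: 1 if the first attribute already differs, nkeyatts + 1 if all
 * key attributes are equal.
 */
static int
toy_keep_natts_fast(const int *lastleft, const int *firstright, int nkeyatts)
{
    int keepnatts = 1;

    assert(nkeyatts > 0);
    for (int i = 0; i < nkeyatts; i++)
    {
        if (lastleft[i] != firstright[i])
            break;
        keepnatts++;
    }
    return keepnatts;
}
```

A result less than or equal to nkeyatts therefore means "the key attributes alone are enough", which is exactly the condition tested in steps 4) and 6).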
/*
 * Subroutine to decide whether split should use default strategy/initial
 * split interval, or whether it should finish splitting the page using
 * alternative strategies (this is only possible with leaf pages).
 *
 * Caller uses alternative strategy (or sticks with default strategy) based
 * on how *strategy is set here. Return value is "perfect penalty", which is
 * passed to _bt_bestsplitloc() as a final constraint on how far caller is
 * willing to go to avoid appending a heap TID when using the many duplicates
 * strategy (it also saves _bt_bestsplitloc() useless cycles).
 */
static int
_bt_strategy(FindSplitData *state, SplitPoint *leftpage,
             SplitPoint *rightpage, FindSplitStrat *strategy)
{
    IndexTuple  leftmost,
                rightmost;
    SplitPoint *leftinterval,
               *rightinterval;
    int         perfectpenalty;
    int         indnkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);

    /* Assume that alternative strategy won't be used for now */
    *strategy = SPLIT_DEFAULT;

    /*
     * Use smallest observed firstright item size for entire page (actually,
     * entire imaginary version of page that includes newitem) as perfect
     * penalty on internal pages. This can save cycles in the common case
     * where most or all splits (not just splits within interval) have
     * firstright tuples that are the same size.
     */
    if (!state->is_leaf)
        return state->minfirstrightsz;

    /*
     * Use leftmost and rightmost tuples from leftmost and rightmost splits in
     * current split interval
     */
    _bt_interval_edges(state, &leftinterval, &rightinterval);
    leftmost = _bt_split_lastleft(state, leftinterval);
    rightmost = _bt_split_firstright(state, rightinterval);

    /*
     * If initial split interval can produce a split point that will at least
     * avoid appending a heap TID in new high key, we're done. Finish split
     * with default strategy and initial split interval.
     */
    perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
    if (perfectpenalty <= indnkeyatts)
        return perfectpenalty;

    /*
     * Work out how caller should finish split when even their "perfect"
     * penalty for initial/default split interval indicates that the interval
     * does not contain even a single split that avoids appending a heap TID.
     *
     * Use the leftmost split's lastleft tuple and the rightmost split's
     * firstright tuple to assess every possible split.
     */
    leftmost = _bt_split_lastleft(state, leftpage);
    rightmost = _bt_split_firstright(state, rightpage);

    /*
     * If page (including new item) has many duplicates but is not entirely
     * full of duplicates, a many duplicates strategy split will be performed.
     * If page is entirely full of duplicates, a single value strategy split
     * will be performed.
     */
    perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
    if (perfectpenalty <= indnkeyatts)
    {
        *strategy = SPLIT_MANY_DUPLICATES;

        /*
         * Many duplicates strategy should split at either side the group of
         * duplicates that enclose the delta-optimal split point. Return
         * indnkeyatts rather than the true perfect penalty to make that
         * happen. (If perfectpenalty was returned here then low cardinality
         * composite indexes could have continual unbalanced splits.)
         *
         * Note that caller won't go through with a many duplicates split in
         * rare cases where it looks like there are ever-decreasing insertions
         * to the immediate right of the split point. This must happen just
         * before a final decision is made, within _bt_bestsplitloc().
         */
        return indnkeyatts;
    }

    /*
     * Single value strategy is only appropriate with ever-increasing heap
     * TIDs; otherwise, original default strategy split should proceed to
     * avoid pathological performance. Use page high key to infer if this is
     * the rightmost page among pages that store the same duplicate value.
     * This should not prevent insertions of heap TIDs that are slightly out
     * of order from using single value strategy, since that's expected with
     * concurrent inserters of the same duplicate value.
     */
    else if (state->is_rightmost)
        *strategy = SPLIT_SINGLE_VALUE;
    else
    {
        ItemId      itemid;
        IndexTuple  hikey;

        itemid = PageGetItemId(state->origpage, P_HIKEY);
        hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
        perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
                                             state->newitem);
        if (perfectpenalty <= indnkeyatts)
            *strategy = SPLIT_SINGLE_VALUE;
        else
        {
            /*
             * Have caller finish split using default strategy, since page
             * does not appear to be the rightmost page for duplicates of the
             * value the page is filled with
             */
        }
    }

    return perfectpenalty;
}
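The decision logic of _bt_strategy for a leaf page can be condensed into a small sketch that takes the three comparison results as plain integers. This is a reading aid under that assumption, not a replacement for the function: in the real code the second and third comparisons are computed lazily, only when the earlier ones fail. ToyStrategy and toy_strategy are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    TOY_SPLIT_DEFAULT,
    TOY_SPLIT_MANY_DUPLICATES,
    TOY_SPLIT_SINGLE_VALUE
} ToyStrategy;

/*
 * interval_penalty: result for the edges of the default split interval
 * page_penalty:     result for the leftmost/rightmost splits of the page
 * hikey_penalty:    result for high key vs. new item (non-rightmost only)
 * Returns the perfect penalty and sets *strategy (leaf pages only).
 */
static int
toy_strategy(int interval_penalty, int page_penalty, bool is_rightmost,
             int hikey_penalty, int nkeyatts, ToyStrategy *strategy)
{
    assert(nkeyatts > 0);
    *strategy = TOY_SPLIT_DEFAULT;

    /* Some split in the default interval avoids a heap TID: done. */
    if (interval_penalty <= nkeyatts)
        return interval_penalty;

    /* Many duplicates, but not only duplicates, on the page. */
    if (page_penalty <= nkeyatts)
    {
        *strategy = TOY_SPLIT_MANY_DUPLICATES;
        return nkeyatts;
    }

    /* Page holds a single value; is it the last page for that value? */
    if (is_rightmost)
    {
        *strategy = TOY_SPLIT_SINGLE_VALUE;
        return page_penalty;
    }
    if (hikey_penalty <= nkeyatts)
        *strategy = TOY_SPLIT_SINGLE_VALUE;
    return hikey_penalty;
}
```

Note that only the many duplicates branch returns something other than a computed comparison result: it returns the number of key attributes.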
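6 Using the split strategy, the candidate split points and perfectpenalty from the steps above, call _bt_bestsplitloc to pick the best split point among the candidates
7 Free the memory allocated for the candidate array (state.splits)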
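How perfectpenalty bounds the final search in step 6 can be sketched as follows. The sketch assumes that the candidates are already ordered from most to least balanced, that only the first interval candidates are considered, and that each candidate's penalty has been computed in advance; none of this is shown in the excerpts above, and toy_bestsplit is a hypothetical name. It picks the lowest penalty and stops early as soon as a candidate is at least as good as perfectpenalty.

```c
#include <assert.h>
#include <limits.h>

/*
 * Return the index of the chosen candidate among the first `interval`
 * entries of penalties[]. Ties keep the earlier (better balanced) entry.
 */
static int
toy_bestsplit(const int *penalties, int nsplits, int interval,
              int perfectpenalty)
{
    int best = 0;
    int bestpenalty = INT_MAX;
    int high = interval < nsplits ? interval : nsplits;

    assert(nsplits > 0 && interval > 0);
    for (int i = 0; i < high; i++)
    {
        if (penalties[i] < bestpenalty)
        {
            bestpenalty = penalties[i];
            best = i;
        }
        /* Nothing can beat the perfect penalty, so stop looking. */
        if (penalties[i] <= perfectpenalty)
            break;
    }
    return best;
}
```

The early exit is why a tight perfectpenalty saves work: once a candidate reaches it, the remaining candidates in the interval are never examined.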