MIT6.5830 Lab1-GoDB实验记录（五）

MIT6.5830 Lab1-GoDB实验记录（五） – WhiteNight's Site

标签：Golang

完成了Exercise 1，还有四个Exercise在等着我，慢慢来吧。

实验准备
了解缓冲池

缓冲池，俗称BP。相关的概念还有数据页和缓存页。页（Pages）的概念和操作系统中“分页”的概念是一样的，指的都是把逻辑地址空间分为若干同等大小的页，并从0开始编号。

而缓冲池（Buffer Pool），简单来说它的作用就是就是读磁盘中的页，再把页写入回磁盘。下列这张图就很好的解释了缓冲池的工作流程。

实验步骤
完成getPage()

在本次实验中，我们只需要补全buffer_pool.go中的getPage()函数，就完成一个函数？还是比较简单的…吧？

先看下实验要求

2.4. BufferPool
The buffer pool (class BufferPool in GoDB) is responsible for caching pages in memory that have been recently read from disk. All operators read and write pages from various files on disk through the buffer pool. It consists of a fixed number of pages, defined by the numPages parameter to the BufferPool constructor NewBufferPool.

For this lab, you only need to implement the constructor and the BufferPool.getPage() method used by the HeapFile iterator. The buffer pool stores structs that implement the Page interface; these pages can be read from underlying database files (such as a heap file) which implement the DBFile interface using the readPage method. The BufferPool should store up to numPages pages. If more than numPages requests are made for different pages, you should evict one of them according to an eviction policy of your choice. Note that you should not evict dirty pages (pages where the Page method isDirty() returns true), for reasons we will explain when we discuss transactions later in the class. You don't need to worry about locking in lab 1.

再来看看练习2给的实验要求注释

// Retrieve the specified page from the specified DBFile (e.g., a HeapFile), on
// behalf of the specified transaction. If a page is not cached in the buffer pool,
// you can read it from disk uing [DBFile.readPage]. If the buffer pool is full (i.e.,
// already stores numPages pages), a page should be evicted.  Should not evict
// pages that are dirty, as this would violate NO STEAL. If the buffer pool is
// full of dirty pages, you should return an error. For lab 1, you do not need to
// implement locking or deadlock detection. [For future labs, before returning the page,
// attempt to lock it with the specified permission. If the lock is
// unavailable, should block until the lock is free. If a deadlock occurs, abort
// one of the transactions in the deadlock]. You will likely want to store a list
// of pages in the BufferPool in a map keyed by the [DBFile.pageKey].

写者注
uing？一眼using错字。是时候让我狠狠的提交pull request了。

好吧，我就知道没这么简单。先总结一下我们要做什么

如果缓冲池没有第x页，从磁盘中读取；如果缓冲池满了，自己写个调度算法驱逐页（但不驱逐脏页）。
如果缓冲池全是脏页（脏页和NO STEAL的概念后续开个文章仔细讲讲），返回一个error。
写一个分布式锁，而且要带特定的perm。同时还要释放死锁占用的其中某个事务。

这还不是最折磨的，更折磨的是这个：

Exercise 2
Implement the getPage() method in:

buffer_pool.go
There is a unit test buffer_pool_test.go, but you will not be able to pass this test until you implement the heap file and heap page methods below. You will also test the functionality of the buffer pool when you implement your heap file iterator.

“完成Exercise 4之前，Exercise 2的test是无法通过的”。即不像Exercise 1那样“所见即所得”，没test连自己写的对不对都不知道。而且在Exercise 4又要调用这里的GetPage，但在Exercise2和4之间还夹着个Exercise 3要完成……三个Exercise缺一不可，令人头大。

So，综合考虑之后，我决定先完成heap_page，再完成heap_file，最后再来完成GetPage()。

实验步骤
琢磨heap_page

打开heap_page.go的那一刻，我就知道——要寄了。

从接口到结构体都要自己补全，这可不是一般的难啊。heap page和heap file的关系我倒是知道，但是具体的数据结构我还真不知道。

那不知道能怎么办呢？找现有的源码例子慢慢看呗。我自行学习了InnoDB的heap page结构后，大致能明白自己该完成什么功能了。

比如这一段heap_page.go中的注释。

In addition, all pages are PageSize bytes.  They begin with a header with a 32
bit integer with the number of slots (tuples), and a second 32 bit integer with
the number of used slots.

这就是在说：每一个page的大小都是固定的，而且都有一个header（标头）。这个Header有两个成员：一个是总共的slot插槽数量，另一个是已用的slot数量。而slot则用于存放元组。

很合理嘛，一个页一共能存多少个元组，存了x个元组后还能存多少个元组？这个页的编号是什么？我们通过什么方式访问该page？（一般都是通过指针的方式访问）这些肯定是要我们自己规定的。那我们可以先把header结构写上。

同时，不要忘了元组Tuple是不带字段的，即我们还需要指明这些记录的数据是属于哪个字段的。而这部分的工作在tupledesc中就已经完成了。

type Header struct {
	TotalSlots int32
	UsedSlots  int32
}
type heapPage struct {
	// TODO: some code goes here
	hdr   Header
	tuple []*Tuple
	desc  *TupleDesc

	file   *HeapFile
	pageNo int
}

写者注
上面这些内容是我在10月份写的，而以下内容是11月写的，中间因为各种事情（实习啊投简历啊这些）耽搁了一段时间。而且在此之前我从未接触过copilot。所以我建议各位自带个copilot会比较好一些。

接下来要初始化heap page。heap page的计算公式在注释中已经给我们了。我们只需要判断字段是int还是string类型，再单独计算大小最后求和即可。

func newHeapPage(desc *TupleDesc, pageNo int, f *HeapFile) *heapPage {
	// TODO: some code goes here
	bytesPerTuple := 0
	for _, field := range desc.Fields {
		if field.Ftype == StringType {
			bytesPerTuple += int(unsafe.Sizeof(byte('a'))) * StringLength
		} else if field.Ftype == IntType {
			bytesPerTuple += int(unsafe.Sizeof(int64(0)))
		}
	}
	remPageSize := PageSize - 8
	numSlots := remPageSize / bytesPerTuple

	return &heapPage{
		Hdr:    Header{TotalSlots: int32(numSlots), UsedSlots: 0},
		Tuple:  make([]*Tuple, numSlots),
		Desc:   desc,
		File:   f,
		PageNo: pageNo,
	}
}

写者注
麻了，我敲几段copilot就给我自动补全了。有一种AIGC的美。再这样下去就要变成离开copilot就活不下去的人了

接下来是getNumSlot，说实话没看懂要补全啥，先写个return看看。

func (h *heapPage) getNumSlots() int {
	// TODO: some code goes here

	return int(h.Hdr.TotalSlots) //replace me
}

写者注
麻了，copilot给的代码太抽象了，直接跑肯定是跑不了的，唯一用得上的就是翻译和代码解释功能。

接下来就是插入tuple。哪个槽是空的就往哪里插入tuple，同时要给Tuple的rid赋值并返回。那我们先来回顾一下rid是什么。

type Tuple struct {
	Desc   TupleDesc
	Fields []DBValue
	Rid    recordID //used to track the page and position this page was read from
}
type recordID interface {
}

用于追踪这个元组在xx页的xx位置？但这里给的是interface，所以我们传个结构体进去也没问题。

type RecordId struct {
	PageNo int
	SlotNo int32
}

插入元组，如果page满了插不进去那就返回err。如果没满，那就遍历page的tuple看看哪个位置的tuple是nil。

找到空位之后除了插入元组，还需要更新该记录所在的页和位置（i）。插入完之后记得直接return，毕竟是只插入单个记录，而不是把空位全填上。

func (h *heapPage) insertTuple(t *Tuple) (recordID, error) {
	// TODO: some code goes here
	rid := RecordID{PageNo: h.PageNo, SlotNo: 0}
	if h.Hdr.UsedSlots == h.Hdr.TotalSlots {
		return rid, errors.New("no free slots")
	}
	for i, tuple := range h.Tuple {
		if tuple == nil {
			h.Tuple[i] = t
			rid.SlotNo = i
			h.Tuple[i].Rid = rid
			h.Hdr.UsedSlots += 1
			return rid, nil
		}
	}
	return rid, errors.New("Can't insert Tuple") //replace me
}

搞定之后向跑下heap_page_test.go试试，结果发现会报错。发现还有一个tupleIter函数没补全。那先补全tupleter之后跑个test看看，不然心里不踏实。

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x30 pc=0x18541a]

看看tupleter，为什么要遍历page的tuple？你如果常写SQL题就肯定能明白为什么。最经典的例子就是两个NOT EXISTS嵌套循环，最内层的NOT EXISTS就是很明显的逐条匹配。

所以tupleter返回的才是一个函数，我们要在这个函数里遍历tuple中的所有非空记录，并且每次都返回新的非空记录。

有了个思路，那就开写咯。迭代的话不用多说，跳过空插槽，直到最后一个插槽为止。不过还要注意最多有n个插槽≠最多有n个tuple。我一开始没注意老是过不了测试，一直报访问越界的错误。后面仔细看了下才发现i可能会导致数组越界。

这里根据注释还需要给返回的tuple初始化rid，rid直接取p的就好了。

func (p *heapPage) tupleIter() func() (*Tuple, error) {
	// TODO: some code goes here
	i := 0
	return func() (*Tuple, error) {
		for i < int(p.Hdr.TotalSlots) && p.Tuple[i] == nil {
			i++
		}
		if i <= int(p.Hdr.TotalSlots) && i < len(p.Tuple) {
			if p.Tuple[i] != nil {
				tuple := p.Tuple[i]
				tuple.Rid = p.Tuple[i].Rid
				i++
				return tuple, nil
			}
		}
		return nil, nil
	}
}

跑一下TestInsertHeapPage，过。

发现还有一个TestHeapPageInsertTuple可以test一下，而且刚好发现里面有我们前面提到的“不知道是干啥”的getNumSlots。

定睛一看，这不就是返回page还剩多少空插槽的函数嘛。那修正一下前面的getNumSlots

func (h *heapPage) getNumSlots() int {
	// TODO: some code goes here
	free := h.Hdr.TotalSlots - h.Hdr.UsedSlots
	return int(free) //replace me
}

写者注
这里有个…不知道算不算坑的坑。为什么这里偏偏是int，而前面我们定义插槽数也好，已用插槽数也好用的都是int32。如果你把鼠标移到int上面会看到注释说“int是一个大小至少为32的数据类型”。
至少，这就意味着int是区别于int32的。所以这里还是要做个强制类型转换。但我是没搞懂为啥前面定义slot不直接用int。省空间？不至于吧，这也就64bit。
事出有因，但我看不懂。我推测是通过int32限制了page的最大插槽数量。毕竟int64最大可取到9,223,372,036,854,775,807。这对于heap_page来说有点太大了。但这也只是我觉得比较合理的推测，人注释也没说，找个自己能信服的理由就过了吧。

跑一下TestHeapPageInsertTuple，过。