(done) MIT6.S081 2023 学习笔记 (Day7: LAB6 Multithreading)

网页：https://pdos.csail.mit.edu/6.S081/2023/labs/thread.html

(任务1教会了你如何用 C 语言调用汇编，编译后链接即可)
任务1：Uthread: switching between threads (完成)

在这个练习中，你将设计一个用户级线程系统中的上下文切换机制，并实现它。为了帮助你开始，你的xv6系统中有两个文件 user/uthread.c 和 user/uthread_switch.S，以及Makefile中的一个规则来构建uthread程序。uthread.c 包含了大部分用户级线程包的代码和三个简单测试线程的代码。线程包缺少一些创建线程和在线程之间切换的代码。

根据讲义要求，当执行 uthread 程序时，应该出现如下输出：
在这里插入图片描述

我们先来运行试试，如下：
在这里插入图片描述

可以看到没有任何输出。先来看看 uthread 的源码：

int 
main(int argc, char *argv[]) 
{
  a_started = b_started = c_started = 0;
  a_n = b_n = c_n = 0;
  thread_init();
  thread_create(thread_a);
  thread_create(thread_b);
  thread_create(thread_c);
  current_thread->state = FREE;
  thread_schedule();
  exit(0);
}

从 main 函数来看，可以看出大致逻辑和框架：初始化thread，创建三个线程，设置 main thread 状态为 FREE，让出执行流(类似于 yield 函数)。

先来看 thread_schedule() 源码：

void 
thread_schedule(void)
{
  struct thread *t, *next_thread;

  /* Find another runnable thread. */
  next_thread = 0;
  t = current_thread + 1;
  for(int i = 0; i < MAX_THREAD; i++){
    if(t >= all_thread + MAX_THREAD)
      t = all_thread;
    if(t->state == RUNNABLE) {
      next_thread = t;
      break;
    }
    t = t + 1;
  }

  if (next_thread == 0) {
    printf("thread_schedule: no runnable threads\n");
    exit(-1);
  }

  if (current_thread != next_thread) {         /* switch threads?  */
    next_thread->state = RUNNING;
    t = current_thread;
    current_thread = next_thread;
    /* YOUR CODE HERE
     * Invoke thread_switch to switch from t to next_thread:
     * thread_switch(??, ??);
     */
  } else
    next_thread = 0;
}

从函数源码来理解，是先从 thread 数组获取一个元素，随后调用 thread_switch 函数把执行流切换过去。这个 thread_switch 是需要我们自己实现的。

再看看 thread a b c 的源码：

void 
thread_a(void)
{
  int i;
  printf("thread_a started\n");
  a_started = 1;
  while(b_started == 0 || c_started == 0)
    thread_yield();
  
  for (i = 0; i < 100; i++) {
    printf("thread_a %d\n", i);
    a_n += 1;
    thread_yield();
  }
  printf("thread_a: exit after %d\n", a_n);

  current_thread->state = FREE;
  thread_schedule();
}

其实就是不断打印东西，再切换到别的线程上。

思路其实很简单，跟着 xv6 的讲义做吧：
1.切换线程的函数 thread_switch 在 user/uthread_switch.S 中实现 (doing)
2.你的任务是制定一个计划来创建线程以及保存/恢复寄存器以在线程之间进行切换，并实现该计划。完成之后，运行 make grade 应该显示你的解决方案通过了uthread测试。
3.你需要在 user/uthread.c 中的 thread_create() 和 thread_schedule() 函数以及 user/uthread_switch.S 中的 thread_switch 添加代码。一个目标是确保当 thread_schedule() 第一次运行给定的线程时，线程能够在其自己的栈上执行传递给 thread_create() 的函数。另一个目标是确保 thread_switch 保存被切换离开的线程的寄存器，恢复被切换到的线程的寄存器，并返回到后者的线程指令中上次离开的位置。你需要决定在哪里保存/恢复寄存器；修改 struct thread 以保存寄存器是一个不错的计划。你需要在 thread_schedule 中添加对 thread_switch 的调用；你可以传递任何你需要给 thread_switch 的参数，但目的是从线程 t 切换到 next_thread。
4.thread_switch 只需要保存/恢复 callee-saved 的寄存器。为什么？ (回答：编译器在编译 C 语言的时候，在编译对 thread_switch() 的函数调用时会自动保存 caller-saved 寄存器)
5.你可以在 user/uthread.asm 中查看 uthread 的汇编代码，这对于调试可能很有帮助。
6.为了测试你的代码，使用 riscv64-linux-gnu-gdb 单步执行 thread_switch 可能会有所帮助。你可以按照以下方式开始：
在这里插入图片描述

这块地方其实跟内核线程中的上下文切换很相似，我们定义一个如下的 context 结构体就可以把事情变得简单化：

struct context {
  uint64 ra;
  uint64 sp;

  // callee-saved
  uint64 s0;
  uint64 s1;
  uint64 s2;
  uint64 s3;
  uint64 s4;
  uint64 s5;
  uint64 s6;
  uint64 s7;
  uint64 s8;
  uint64 s9;
  uint64 s10;
  uint64 s11;
};

struct thread {
  char       stack[STACK_SIZE]; /* the thread's stack */
  int        state;             /* FREE, RUNNING, RUNNABLE */
  struct context context;
};

后续很简单，自己悟吧。

运行 make grade，可以看到任务1-uthread 已通过
在这里插入图片描述

任务2：Using threads (完成)

在这个作业中，你将使用哈希表探索基于线程和锁的并行编程。你应该在一个拥有多个核心的真实 Linux 或 MacOS 计算机上完成这个作业（不是 xv6，也不是 qemu）。大多数最新的笔记本电脑都配备了多核处理器。

这个作业使用了 UNIX pthread 线程库。你可以在手册页中找到关于它的信息，使用 man pthreads 命令，你也可以在网上查找，例如这里、这里和这里。

文件 notxv6/ph.c 包含一个简单的哈希表，如果从单个线程中使用它是正确的，但当从多个线程中使用时它是错误的。若运行：

make ph
./ph 1

会得到类似以下的输出：

100000 puts, 3.991 seconds, 25056 puts/second
0: 0 keys missing
100000 gets, 3.981 seconds, 25118 gets/second

你看到的数字可能与这个示例输出相差两倍或更多，这取决于你的计算机速度、是否有多个核心，以及它是否忙于执行其他任务。

ph 运行两个基准测试。首先，它通过调用 put() 向哈希表中添加大量键，并打印每秒 put 操作的速率。然后，它通过 get() 从哈希表中获取键。它打印出由于 put 操作应该存在于哈希表中的键的数量，但在这种情况下缺失的键数量（这里是零），以及它实现的每秒 get 操作的数量。

未经修改时运行 ph 2，会得到如下输出
在这里插入图片描述

这段 ph 2 输出的第一行表明，当两个线程同时向哈希表添加条目时，它们实现了每秒 53,044 次插入的总速率。这大约是运行 ph 1 单个线程速率的两倍。这是一个非常好的“并行加速”，大约是 2 倍，这是人们可能期望的最高加速（即核心数量加倍，每单位时间的工作量也加倍）。

然而，说有 16579 个键缺失的两行表明，大量应该存在于哈希表中的键却不在那里。也就是说，put 操作本应将这些键添加到哈希表中，但出了问题。请查看 notxv6/ph.c 文件，特别是 put() 和 insert() 函数。

问题：为什么在两个线程的情况下会有缺失的键，而在一个线程的情况下却没有？请识别出两个线程中的一系列事件，这些事件可能导致一个键的缺失。将你的事件序列和简短的解释提交到 answers-thread.txt 文件中。（回答看下面）

为了避免这一系列事件，请在 notxv6/ph.c 中的 put 和 get 函数中插入锁的获取和释放语句，以便在两个线程的情况下缺失的键数量始终为 0。相关的 pthread 调用如下：

pthread_mutex_t lock; // 声明一个锁 
pthread_mutex_init(&lock, NULL); // 初始化锁 
pthread_mutex_lock(&lock); // 获取锁 
pthread_mutex_unlock(&lock); // 释放锁

当你完成这些修改，并且 make grade 显示你的代码通过了 ph_safe 测试（该测试要求两个线程下缺失的键数量为零）时，你就完成了这个任务。在这个阶段，ph_fast 测试失败是可以接受的。

不要忘记调用 pthread_mutex_init()。首先使用 1 个线程测试你的代码，然后用 2 个线程测试它。代码是否正确（即你是否消除了缺失的键？）两个线程的版本相对于单线程版本是否实现了并行加速（即每单位时间完成的总工作是否更多？）

存在这样的情况，即并发的 put() 操作在哈希表中的读写内存没有重叠，因此它们不需要锁来相互保护。你能修改 ph.c 来利用这种情况，以获得某些 put() 操作的并行加速吗？提示：每个哈希桶一个锁怎么样？

修改你的代码，以便一些 put 操作可以并行运行，同时保持正确性。当你完成修改并且 make grade 显示你的代码同时通过了 ph_safe 和 ph_fast 测试时，你就完成了任务。ph_fast 测试要求两个线程的 put 操作每秒至少是单个线程的 1.25 倍。

先来稍微读读 ph.c 源码，看为什么会 miss keys，仔细看对哈希表进行插入的代码:

static void 
insert(int key, int value, struct entry **p, struct entry *n)
{
  struct entry *e = malloc(sizeof(struct entry));
  e->key = key;
  e->value = value;
  e->next = n; // 注意看这里 <--- 当两个线程同时执行到这里时，两个 key 的 next 都等于链表头
  *p = e; // 随后让链表头等于这两个 key，此时就会有一个 key 被遗漏掉
}

上面代码的注释就是对之前问题的回答

为了验证猜想，对 insert 函数调用加锁，如下：

    // the new is new.
    pthread_mutex_lock(&lock); // 获取锁 
    insert(key, value, &table[i], table[i]);
    pthread_mutex_unlock(&lock); // 释放锁

编译运行，发现已经符合 ph_safe 了，如下

100000 puts, 2.288 seconds, 43702 puts/second
0: 0 keys missing
1: 0 keys missing
200000 gets, 4.546 seconds, 43995 gets/second

运行 make grade，可以同时通过 ph_safe 和 ph_fast，我们就不管了
在这里插入图片描述

任务3：Barrier (完成)

在这个作业中，你将实现一个 barrier：在应用程序中的一个点，所有参与的线程必须等待，直到所有其他参与的线程也到达那个点。你将使用 pthread 条件变量，这是一种类似于 xv6 的睡眠和唤醒的序列协调技术。

You should do this assignment on a real computer (not xv6, not qemu).

The file notxv6/barrier.c contains a broken barrier.

$ make barrier
$ ./barrier 2
barrier: notxv6/barrier.c:42: thread: Assertion `i == t' failed.

数字 2 指定了在屏障上同步的线程数量（在 barrier.c 中的 nthread）。每个线程执行一个循环。在每次循环迭代中，一个线程调用 barrier()，然后随机休眠一定数量的微秒。断言触发的原因是一个线程在另一个线程到达屏障之前离开了屏障。期望的行为是每个线程在 barrier() 中阻塞，直到所有 nthreads 个线程都调用了 barrier()。

你的目标是实现期望的屏障行为。除了你在 ph 作业中看到的锁原语之外，你还需要以下新的 pthread 原语；详细信息请查看这里和这里。

pthread_cond_wait(&cond, &mutex);  // go to sleep on cond, releasing lock mutex, acquiring upon wake up
pthread_cond_broadcast(&cond);     // wake up every thread sleeping on cond

Make sure your solution passes make grade’s barrier test.

pthread_cond_wait releases the mutex when called, and re-acquires the mutex before returning.

We have given you barrier_init(). Your job is to implement barrier() so that the panic doesn’t occur.

We’ve defined struct barrier for you; its fields are for your use.

There are two issues that complicate your task:
1.You have to deal with a succession of barrier calls, each of which we’ll call a round. bstate.round records the current round. You should increment bstate.round each time all threads have reached the barrier.
2.You have to handle the case in which one thread races around the loop before the others have exited the barrier. In particular, you are re-using the bstate.nthread variable from one round to the next. Make sure that a thread that leaves the barrier and races around the loop doesn’t increase bstate.nthread while a previous round is still using it.

Test your code with one, two, and more than two threads.

我们先来看 barrier 的 main 源码：

int
main(int argc, char *argv[])
{
  pthread_t *tha;
  void *value;
  long i;
  double t1, t0;

  if (argc < 2) {
    fprintf(stderr, "%s: %s nthread\n", argv[0], argv[0]);
    exit(-1);
  }
  nthread = atoi(argv[1]);
  tha = malloc(sizeof(pthread_t) * nthread);
  srandom(0);

  barrier_init();

  for(i = 0; i < nthread; i++) {
    assert(pthread_create(&tha[i], NULL, thread, (void *) i) == 0);
  }
  for(i = 0; i < nthread; i++) {
    assert(pthread_join(tha[i], &value) == 0);
  }
  printf("OK; passed\n");
}

可以看到程序先根据命令行参数创建几个线程，随后让这些线程执行 thread() 命令

看 thread() 函数实现：

static void *
thread(void *xa)
{
  long n = (long) xa;
  long delay;
  int i;

  for (i = 0; i < 20000; i++) {
    int t = bstate.round;
    assert (i == t);
    barrier();
    usleep(random() % 100);
  }

  return 0;
}