Classic RCU: Principles and the Linux Kernel Implementation

How RCU Works

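RCU's first defining trait is that it targets read-mostly workloads. How does it differ from a reader-writer lock? With RCU, readers take no lock at all (multiple writers still have to serialize among themselves), whereas with a reader-writer lock (read-preferring, write-preferring, or fair) readers and writers contend for the same lock. Because RCU readers never contend for a lock, the second trait follows: readers must tolerate stale data. From the moment a writer starts modifying the data until it finishes, readers still see the old version.
To explain the principle in detail, the figure above shows several readers and one writer arriving over time. The writer arrives at time t0. Before that, reader 1 and reader 2 have already started reading data M, and reader 1 has already finished. At t1 the writer completes its modification, and object C no longer points to data M but to the new data N.
RCU operates on pointers. To modify data, the writer first Reads the original and Copies it, then Updates the copy, and finally redirects the pointer to the address of the new data. A memory barrier in that last step guarantees that a reader who sees the new pointer also sees the fully updated data.
Between t0 and t1 the writer has not yet finished, so readers that enter during this interval (reader 3, and reader 1 entering a second time) read the old data M. This is why RCU readers must tolerate stale data. Readers that enter after t1 (reader 4) read the new data N. Although the modification is complete at t1, the old data M cannot be freed yet, because every reader still accessing M must finish first. At t2 the last such reader (reader 2) finishes with M, and only then can the old data be freed. If more readers and writers arrive later, the same process simply repeats.
That is the whole idea behind RCU. It may look unremarkable, but implementing it is a different story, so let's read on.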
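The copy, update, and publish steps described above can be sketched in user space with C11 atomics. This is only an illustration, not kernel code: `struct payload`, `shared`, `reader_get`, and `writer_update` are hypothetical names, and the acquire/release orderings stand in for the kernel's barriers. The sketch deliberately leaves the old version unfreed, because deciding when it is safe to free it is exactly the grace-period problem the rest of the article deals with.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical payload; stands in for "data M" and "data N" in the text. */
struct payload {
    int value;
};

/* Hypothetical shared pointer; plays the role of object C. */
static _Atomic(struct payload *) shared;

/* Reader: take one snapshot of the pointer and use only that snapshot. */
static int reader_get(void)
{
    struct payload *snap = atomic_load_explicit(&shared, memory_order_acquire);

    return snap ? snap->value : -1;
}

/*
 * Writer: copy the current data, update the copy, then publish it.
 * Returns the old version; the caller may free it only after every
 * reader that could still hold it has finished (the grace period).
 */
static struct payload *writer_update(int new_value)
{
    struct payload *old = atomic_load_explicit(&shared, memory_order_relaxed);
    struct payload *copy = malloc(sizeof(*copy));

    assert(copy != NULL);
    if (old)
        *copy = *old;            /* Copy */
    copy->value = new_value;     /* Update */
    /* Publish: release ordering makes the initialised copy visible first. */
    atomic_store_explicit(&shared, copy, memory_order_release);
    return old;
}
```

A reader that loaded the pointer before the publish keeps using the old version, which is the stale-data behavior described above.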

RCU in the Linux Kernel (Based on Version 2.6.11.1)

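One key problem for RCU is knowing when the last reader has finished accessing the old data. Let's see what the kernel's RCU implementation does to achieve this.

Readers may only access the data within a local scope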

OUT_TYPE Func(IN_TYPE* p)
{
    // ...
    rcu_read_lock();

    IN_TYPE* pLocal = rcu_dereference(p);
    // pLocal->...

    rcu_read_unlock();
    // ...
}

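As the code above shows, the data may only be accessed (pLocal->...) between rcu_read_lock() and rcu_read_unlock() (what these two functions do is covered later). One more point: why read the pointer into the local variable pLocal through rcu_dereference(p) instead of accessing the data through p directly? Let's first look at what this macro does in the kernel.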

/**
 * rcu_dereference - fetch an RCU-protected pointer in an
 * RCU read-side critical section.  This pointer may later
 * be safely dereferenced.
 *
 * Inserts memory barriers on architectures that require them
 * (currently only the Alpha), and, more importantly, documents
 * exactly which pointers are protected by RCU.
 */

#define rcu_dereference(p)     ({ \
				typeof(p) _________p1 = p; \
				smp_read_barrier_depends(); \
				(_________p1); \
				})

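In short, on every processor architecture except Alpha, accessing the data through p directly works. On Alpha, however, the CPU does not order dependent loads: a reader can observe the new pointer value and still see stale contents of the structure it points to. The rcu_dereference macro therefore adds smp_read_barrier_depends(), a data-dependency read barrier that does nothing on other architectures, so that accesses through the local variable pLocal are ordered after the load of the pointer itself.

Determining when readers have finished accessing the data

To do this, RCU comes in two flavors, one for kernel process context and one for softirqs, because the end of a reader's access is inferred from scheduling events (context switches and interrupts). We look at the process-context flavor first; the softirq flavor differs in only a few places.
First, the two reader-side interfaces: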

#define rcu_read_lock()		preempt_disable()
#define rcu_read_unlock()	preempt_enable()

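As the definitions suggest, preemption is disabled before accessing RCU-protected data and re-enabled afterwards. Let's look more closely at how the preempt_disable() and preempt_enable() macros are implemented.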

#ifdef CONFIG_PREEMPT

asmlinkage void preempt_schedule(void);

#define preempt_disable() \
do { \
	inc_preempt_count(); \
	barrier(); \
} while (0)

#define preempt_enable_no_resched() \
do { \
	barrier(); \
	dec_preempt_count(); \
} while (0)

#define preempt_check_resched() \
do { \
	if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
		preempt_schedule(); \
} while (0)

#define preempt_enable() \
do { \
	preempt_enable_no_resched(); \
	preempt_check_resched(); \
} while (0)

#else

#define preempt_disable()		do { } while (0)
#define preempt_enable_no_resched()	do { } while (0)
#define preempt_enable()		do { } while (0)
#define preempt_check_resched()		do { } while (0)

#endif

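If the kernel does not support **kernel preemption**, these two macros do nothing. In that case, once a reader is executing inside an RCU read-side critical section it cannot be preempted by another process (interrupts aside), which guarantees that if a context switch happens, the reader's critical section has already finished. Conversely, if the kernel supports kernel preemption (CONFIG_PREEMPT is defined), preemption is disabled before the reader code runs and re-enabled after it ends.
Disabling kernel preemption increments preempt_count in the current thread's thread info; a value greater than 0 means the thread cannot be preempted. Enabling it decrements preempt_count; a value of 0 means the thread can be preempted. In addition, preempt_enable() checks whether TIF_NEED_RESCHED is set in the thread info flags and, if so, triggers a reschedule, at which point the thread is preempted (by then the RCU reader code has of course finished).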
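The counting behavior just described, including nesting, can be modeled in a few lines of user-space C. The sketch is a simplification under stated assumptions: `struct thread_model` and the `model_*` functions are hypothetical, and the reschedule is only counted rather than performed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-thread fields the text mentions. */
struct thread_model {
    int preempt_count;   /* > 0 means preemption is disabled */
    bool need_resched;   /* models TIF_NEED_RESCHED */
    int reschedules;     /* how many times the model "scheduled" */
};

static void model_preempt_disable(struct thread_model *t)
{
    t->preempt_count++;
}

static void model_preempt_enable(struct thread_model *t)
{
    assert(t->preempt_count > 0);
    t->preempt_count--;
    /* Reschedule only once the outermost section has been left. */
    if (t->preempt_count == 0 && t->need_resched) {
        t->need_resched = false;
        t->reschedules++;
    }
}

static bool model_preemptible(const struct thread_model *t)
{
    return t->preempt_count == 0;
}
```

With two nested sections and a pending reschedule request, the first enable leaves the thread non-preemptible, and only the second one lets the deferred reschedule happen.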
The reader-side interfaces are fairly simple. Now let's look at the writer-side interface:

/**
 * rcu_assign_pointer - assign (publicize) a pointer to a newly
 * initialized structure that will be dereferenced by RCU read-side
 * critical sections.  Returns the value assigned.
 *
 * Inserts memory barriers on architectures that require them
 * (pretty much all of them other than x86), and also prevents
 * the compiler from reordering the code that initializes the
 * structure after the pointer assignment.  More importantly, this
 * call documents which pointers will be dereferenced by RCU read-side
 * code.
 */
#define rcu_assign_pointer(p, v)	({ \
						smp_wmb(); \
						(p) = (v); \
					})

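The rcu_assign_pointer(p, v) macro updates the pointer value (see the writer's update step in the principles section). It inserts a write memory barrier so that the stores initializing the new data become visible before the store that publishes the pointer. The caller, for its part, must hold a lock so that only one writer updates the data at a time. An example of its use (from net/core/netfilter.c in the kernel):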

int nf_log_register(int pf, nf_logfn *logfn)
{
	int ret = -EBUSY;

	/* Any setup of logging members must be done before
	 * substituting pointer. */
	spin_lock(&nf_log_lock);
	if (!nf_logging[pf]) {
		rcu_assign_pointer(nf_logging[pf], logfn);
		ret = 0;
	}
	spin_unlock(&nf_log_lock);
	return ret;
}
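The same "lock among writers, publish with a barrier" pattern can be approximated in user space. The following is a sketch under assumptions, not the kernel's code: `log_fn`, `current_logger`, `writer_lock`, `logger_register`, `logger_get`, and `null_logger` are hypothetical names, a C11 `atomic_flag` spin loop stands in for the spinlock, and a release store stands in for rcu_assign_pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

typedef void (*log_fn)(const char *msg);

static _Atomic(log_fn) current_logger;              /* read without any lock */
static atomic_flag writer_lock = ATOMIC_FLAG_INIT;  /* serialises writers only */

/* Example callback used to exercise the registration path. */
static void null_logger(const char *msg)
{
    (void)msg;
}

/* Register a logger unless one is already installed. */
static int logger_register(log_fn fn)
{
    int ret = -EBUSY;

    while (atomic_flag_test_and_set_explicit(&writer_lock, memory_order_acquire))
        ;   /* spin until the writer lock is free */
    if (atomic_load_explicit(&current_logger, memory_order_relaxed) == NULL) {
        /* Release store: everything set up before it is visible first. */
        atomic_store_explicit(&current_logger, fn, memory_order_release);
        ret = 0;
    }
    atomic_flag_clear_explicit(&writer_lock, memory_order_release);
    return ret;
}

/* Reader side: a single acquire load, no lock taken. */
static log_fn logger_get(void)
{
    return atomic_load_explicit(&current_logger, memory_order_acquire);
}
```

Only writers touch `writer_lock`; readers pay for one atomic load, which mirrors the asymmetry the article describes.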

Finally come the interfaces that are hardest to get a grip on. Here they are:

extern void FASTCALL(call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *head)));
extern void synchronize_kernel(void);

synchronize_kernel simply calls call_rcu and adds a synchronization object to wait on, which turns it into a synchronous interface. So we only need to analyze what call_rcu does:

/**
 * call_rcu - Queue an RCU callback for invocation after a grace period.
 * @head: structure to be used for queueing the RCU updates.
 * @func: actual update function to be invoked after the grace period
 *
 * The update function will be invoked some time after a full grace
 * period elapses, in other words after all currently executing RCU
 * read-side critical sections have completed.  RCU read-side critical
 * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
 * and may be nested.
 */
void fastcall call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
{
	unsigned long flags;
	struct rcu_data *rdp;

	head->func = func;
	head->next = NULL;
	local_irq_save(flags);
	rdp = &__get_cpu_var(rcu_data);
	*rdp->nxttail = head;
	rdp->nxttail = &head->next;
	local_irq_restore(flags);
}

In the code, rdp has the type struct rcu_data:

/*
 * Per-CPU data for Read-Copy UPdate.
 * nxtlist - new callbacks are added here
 * curlist - current batch for which quiescent cycle started if any
 */
struct rcu_data {
	/* 1) quiescent state handling : */
	long		quiescbatch;     /* Batch # for grace period */
	int		passed_quiesc;	 /* User-mode/idle loop etc. */
	int		qs_pending;	 /* core waits for quiesc state */

	/* 2) batch handling */
	long  	       	batch;           /* Batch # for current RCU batch */
	struct rcu_head *nxtlist;
	struct rcu_head **nxttail;
	struct rcu_head *curlist;
	struct rcu_head **curtail;
	struct rcu_head *donelist;
	struct rcu_head **donetail;
	int cpu;
};

DECLARE_PER_CPU(struct rcu_data, rcu_data);

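It is per-CPU, so call_rcu saves the interrupt state and disables interrupts on the current CPU before modifying it, then restores the saved state afterwards, which serializes the modification. The core of call_rcu is appending a new rcu_head object to the tail of rdp's nxtlist list (tracked through nxttail). The rcu_head holds a function pointer that is invoked once all readers have finished, in order to clean up the associated resources. Next, we build a case study from the initial state to see how the kernel decides that all readers are done and then reclaims the resources.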
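The tail-pointer idiom that call_rcu relies on is worth seeing in isolation. Below is a minimal user-space sketch; `struct cb_head`, `struct cb_queue`, `cb_queue_init`, and `cb_enqueue` are hypothetical names that mirror rcu_head, nxtlist, and nxttail, and the interrupt masking is omitted because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

struct cb_head {
    struct cb_head *next;
    void (*func)(struct cb_head *head);
};

struct cb_queue {
    struct cb_head *list;    /* first element, NULL when empty */
    struct cb_head **tail;   /* address of the last next pointer */
};

static void cb_queue_init(struct cb_queue *q)
{
    q->list = NULL;
    q->tail = &q->list;      /* empty queue: tail points at the list head */
}

static void cb_enqueue(struct cb_queue *q, struct cb_head *head,
                       void (*func)(struct cb_head *head))
{
    assert(q->tail != NULL);
    head->func = func;
    head->next = NULL;
    *q->tail = head;         /* link after the current last element */
    q->tail = &head->next;   /* the new element is now the last one */
}
```

Because `tail` always holds the address of the pointer to overwrite, appending takes constant time and needs no special case for an empty list.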
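With everything in place, we can walk through the case. As the figure above shows, at boot the operating system initializes rcu_ctrlblk, rcu_state, and each CPU's rcu_data (time t0). The CPUs can then read RCU-protected resources. At t1, cpu2 calls call_rcu, so the nxtlist of cpu2's rcu_data becomes non-empty (the figure assumes it points to A). Later, when cpu2 takes its next timer interrupt, rcu_pending is called to decide whether any RCU processing is needed. The relevant functions are: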

// file: timer.c
// The timer interrupt checks whether RCU processing is needed
void update_process_times(int user_tick)
{
	// ...
	if (rcu_pending(cpu))
		rcu_check_callbacks(cpu, user_tick);
	scheduler_tick();
}

// file: rcupdate.h
// Conditions under which RCU processing is needed
static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
						struct rcu_data *rdp)
{
	/* This cpu has pending rcu entries and the grace period
	 * for them has completed.
	 */
	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch))
		return 1;

	/* This cpu has no pending entries, but there are new entries */
	if (!rdp->curlist && rdp->nxtlist)
		return 1;

	/* This cpu has finished callbacks to invoke */
	if (rdp->donelist)
		return 1;

	/* The rcu core waits for a quiescent state from the cpu */
	if (rdp->quiescbatch != rcp->cur || rdp->qs_pending)
		return 1;

	/* nothing to do */
	return 0;
}

static inline int rcu_pending(int cpu)
{
	return __rcu_pending(&rcu_ctrlblk, &per_cpu(rcu_data, cpu)) ||
		__rcu_pending(&rcu_bh_ctrlblk, &per_cpu(rcu_bh_data, cpu));
}
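The pending check above relies on rcu_batch_before, which the excerpt does not show. A wrap-safe comparison of the kind it presumably performs might look like the following sketch; `batch_before`, `batch_completed`, and `batch_self_check` are hypothetical names, and the subtraction is done on unsigned values to avoid signed overflow in portable C.

```c
#include <assert.h>
#include <limits.h>

/* Wrap-safe "a is before b" for batch counters that may overflow. */
static int batch_before(long a, long b)
{
    /* Subtract as unsigned to avoid signed-overflow undefined behaviour. */
    return (long)((unsigned long)a - (unsigned long)b) < 0;
}

/* First condition of the pending check: has this CPU's batch completed? */
static int batch_completed(long completed, long batch)
{
    return !batch_before(completed, batch);
}

/* Sanity checks using the batch numbers from the case study. */
static void batch_self_check(void)
{
    assert(batch_before(-300, -299));
    assert(batch_completed(-299, -299));
    assert(batch_before(LONG_MAX, LONG_MIN));   /* still ordered across wrap */
}
```

Comparing the difference against zero, instead of comparing the two counters directly, keeps the ordering correct even after the counter wraps around.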

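Since the curlist of cpu2's rcu_data is empty while its nxtlist is not, the function returns 1, meaning RCU processing is needed, so rcu_check_callbacks is called. Its implementation is shown below.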

void rcu_check_callbacks(int cpu, int user)
{
	if (user || 
	    (idle_cpu(cpu) && !in_softirq() && 
				hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
		rcu_qsctr_inc(cpu);
		rcu_bh_qsctr_inc(cpu);
	} else if (!in_softirq())
		rcu_bh_qsctr_inc(cpu);
	tasklet_schedule(&per_cpu(rcu_tasklet, cpu));
}

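The first condition means that if the current CPU was interrupted in user mode, or is in the idle loop and not in softirq or nested hard interrupt handling, then it has certainly finished accessing RCU-protected resources (this holds for both the process-context flavor and the softirq flavor). Otherwise, if it is not in softirq processing, the softirq flavor alone can be considered finished. In each case the handling sets the passed_quiesc field of rcu_data (rcu_bh_data for the softirq flavor) to 1.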
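The decision logic can be restated as a small pure function, which makes the cases easy to enumerate. This is a sketch with hypothetical names (`struct qs_result`, `classify_tick`); the boolean parameters are assumed stand-ins for the kernel's `user`, `idle_cpu()`, `in_softirq()`, and `hardirq_count()` checks.

```c
#include <assert.h>
#include <stdbool.h>

/* Which RCU flavours may record a quiescent state on this tick. */
struct qs_result {
    bool rcu;      /* process-context flavour */
    bool rcu_bh;   /* softirq flavour */
};

/*
 * Hypothetical boolean inputs standing in for the kernel's checks:
 * user_tick      - the tick interrupted user mode
 * idle           - the CPU was running the idle loop
 * in_softirq     - a softirq was running or bottom halves were disabled
 * nested_hardirq - the tick interrupted another hard interrupt handler
 */
static struct qs_result classify_tick(bool user_tick, bool idle,
                                      bool in_softirq, bool nested_hardirq)
{
    struct qs_result r = { false, false };

    if (user_tick || (idle && !in_softirq && !nested_hardirq)) {
        r.rcu = true;       /* no read-side section of either flavour */
        r.rcu_bh = true;
    } else if (!in_softirq) {
        r.rcu_bh = true;    /* only the softirq flavour is quiescent */
    }
    return r;
}
```

A tick that lands in kernel code outside softirq context, for example, counts as a quiescent state for the softirq flavor only.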
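At this point the case study does not depend on these two checks, because no batch has been started yet; when a batch starts, passed_quiesc is reset to 0. Finally, tasklet_schedule schedules rcu_process_callbacks to run. Its implementation is as follows: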

static void __rcu_process_callbacks(struct rcu_ctrlblk *rcp,
			struct rcu_state *rsp, struct rcu_data *rdp)
{
	if (rdp->curlist && !rcu_batch_before(rcp->completed, rdp->batch)) {
		*rdp->donetail = rdp->curlist;
		rdp->donetail = rdp->curtail;
		rdp->curlist = NULL;
		rdp->curtail = &rdp->curlist;
	}

	local_irq_disable();
	if (rdp->nxtlist && !rdp->curlist) {
		rdp->curlist = rdp->nxtlist;
		rdp->curtail = rdp->nxttail;
		rdp->nxtlist = NULL;
		rdp->nxttail = &rdp->nxtlist;
		local_irq_enable();

		/*
		 * start the next batch of callbacks
		 */

		/* determine batch number */
		rdp->batch = rcp->cur + 1;
		/* see the comment and corresponding wmb() in
		 * the rcu_start_batch()
		 */
		smp_rmb();

		if (!rcp->next_pending) {
			/* and start it/schedule start if it's a new batch */
			spin_lock(&rsp->lock);
			rcu_start_batch(rcp, rsp, 1);
			spin_unlock(&rsp->lock);
		}
	} else {
		local_irq_enable();
	}
	rcu_check_quiescent_state(rcp, rsp, rdp);
	if (rdp->donelist)
		rcu_do_batch(rdp);
}

static void rcu_process_callbacks(unsigned long unused)
{
	__rcu_process_callbacks(&rcu_ctrlblk, &rcu_state,
				&__get_cpu_var(rcu_data));
	__rcu_process_callbacks(&rcu_bh_ctrlblk, &rcu_bh_state,
				&__get_cpu_var(rcu_bh_data));
}

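Suppose cpu2 runs rcu_process_callbacks at t2. The function moves the nxtlist of cpu2's rcu_data to curlist and sets its batch field to rcp->cur + 1, that is, -299. It then calls rcu_start_batch, which increments the cur field of the global rcu_ctrlblk to -299 and sets the bit of every participating CPU in the cpumask field of rcu_state.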

static void rcu_start_batch(struct rcu_ctrlblk *rcp, struct rcu_state *rsp,
				int next_pending)
{
	if (next_pending)
		rcp->next_pending = 1;

	if (rcp->next_pending &&
			rcp->completed == rcp->cur) {
		/* Can't change, since spin lock held. */
		cpus_andnot(rsp->cpumask, cpu_online_map, nohz_cpu_mask);

		rcp->next_pending = 0;
		/* next_pending == 0 must be visible in __rcu_process_callbacks()
		 * before it can see new value of cur.
		 */
		smp_wmb();
		rcp->cur++;
	}
}

Finally, rcu_check_quiescent_state is called. It sets the quiescbatch field of cpu2's rcu_data to -299 as well, resets passed_quiesc to 0, and sets qs_pending to 1. A new batch is now under way.

static void rcu_check_quiescent_state(struct rcu_ctrlblk *rcp,
			struct rcu_state *rsp, struct rcu_data *rdp)
{
	if (rdp->quiescbatch != rcp->cur) {
		/* start new grace period: */
		rdp->qs_pending = 1;
		rdp->passed_quiesc = 0;
		rdp->quiescbatch = rcp->cur;
		return;
	}

	/* Grace period already completed for this cpu?
	 * qs_pending is checked instead of the actual bitmap to avoid
	 * cacheline trashing.
	 */
	if (!rdp->qs_pending)
		return;

	/* 
	 * Was there a quiescent state since the beginning of the grace
	 * period? If no, then exit and wait for the next call.
	 */
	if (!rdp->passed_quiesc)
		return;
	rdp->qs_pending = 0;

	spin_lock(&rsp->lock);
	/*
	 * rdp->quiescbatch/rcp->cur and the cpu bitmap can come out of sync
	 * during cpu startup. Ignore the quiescent state.
	 */
	if (likely(rdp->quiescbatch == rcp->cur))
		cpu_quiet(rdp->cpu, rcp, rsp);

	spin_unlock(&rsp->lock);
}
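To make the bookkeeping at t2 concrete, here is a user-space model of starting a batch and of a CPU joining it. It is a simplified sketch: `struct ctrl_model`, `struct cpu_model`, `model_start_batch`, and `model_join_batch` are hypothetical, locking and barriers are left out, and `online_mask` is an assumed stand-in for the CPU mask the kernel computes.

```c
#include <assert.h>

/* Hypothetical, simplified mirrors of the global and per-CPU state. */
struct ctrl_model {
    long cur;                /* batch currently in progress */
    long completed;          /* last batch known to be complete */
    int next_pending;        /* a new batch has been requested */
    unsigned long cpumask;   /* CPUs that still owe a quiescent state */
};

struct cpu_model {
    long batch;              /* batch this CPU's callbacks wait for */
    long quiescbatch;        /* batch this CPU last synchronised with */
    int qs_pending;
    int passed_quiesc;
};

/* Mirrors rcu_start_batch(): start only when no batch is in flight. */
static void model_start_batch(struct ctrl_model *c, unsigned long online_mask,
                              int next_pending)
{
    if (next_pending)
        c->next_pending = 1;
    if (c->next_pending && c->completed == c->cur) {
        c->cpumask = online_mask;
        c->next_pending = 0;
        c->cur++;
    }
}

/* Mirrors the first branch of rcu_check_quiescent_state().
 * Returns 1 when the CPU has just joined a new grace period. */
static int model_join_batch(struct cpu_model *cpu, const struct ctrl_model *c)
{
    if (cpu->quiescbatch == c->cur)
        return 0;
    cpu->qs_pending = 1;
    cpu->passed_quiesc = 0;
    cpu->quiescbatch = c->cur;
    return 1;
}
```

Starting from cur = completed = -300 with three CPUs, one call to `model_start_batch` moves cur to -299 and marks all three CPUs as owing a quiescent state; a second request while that batch is still running is only remembered in `next_pending`.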

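Next, cpu0 and cpu1 take timer interrupts again. The quiescbatch of their rcu_data is still -300 while the batch being processed globally (the cur field of rcu_ctrlblk) is -299, so rcu_pending returns 1, meaning RCU processing is needed. Following the same call chain as above, rcu_check_quiescent_state eventually updates the fields of their rcu_data (qs_pending set to 1, passed_quiesc set to 0, quiescbatch set to -299). Suppose cpu0 and cpu1 complete this at t3 and t4 respectively.
At t5, cpu2 takes another timer interrupt. Since the qs_pending field of its rcu_data is 1, rcu_pending returns 1 and rcu_check_callbacks is called. Suppose cpu2 was in user mode: it has then finished accessing RCU resources (this is called passing through a quiescent state, a point at which the CPU is known not to be inside any RCU read-side critical section), and passed_quiesc is set to 1. The call then reaches rcu_check_quiescent_state, which sets qs_pending to 0 and clears cpu2's bit in cpumask, indicating that cpu2 has passed through a quiescent state.
The same happens at t6, when cpu0 takes another timer interrupt. Since the qs_pending field of its rcu_data is 1, rcu_pending returns 1 and rcu_check_callbacks is called. Suppose cpu0 was in user mode: it has finished accessing RCU resources, and passed_quiesc is set to 1. rcu_check_quiescent_state then sets qs_pending to 0 and clears cpu0's bit in cpumask, indicating that cpu0 has passed through a quiescent state.
At t7, cpu1 also takes another timer interrupt and goes through the same steps as cpu0. Because cpu1 is the last CPU to report a quiescent state, cpumask becomes all zeros, and the completed field of rcu_ctrlblk is set to -299 (the value of its cur field, meaning the current batch has completed). Since cpu1's own rcu_data has no callbacks to process, the function then simply returns.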
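The t5 to t7 sequence boils down to clearing bits in a mask until it is empty. The sketch below models that step only; cpu_quiet is not shown in the article's excerpts, so `struct grace_model` and `model_cpu_quiet` are hypothetical and simply follow the behavior described in the text (clear the CPU's bit, and when the mask is empty set completed to cur).

```c
#include <assert.h>

struct grace_model {
    long cur;                /* batch in progress */
    long completed;          /* last completed batch */
    unsigned long cpumask;   /* bit n set: CPU n has not yet been quiet */
};

/* Clears this CPU's bit; returns 1 if that completed the grace period. */
static int model_cpu_quiet(struct grace_model *g, unsigned int cpu)
{
    assert(cpu < 8 * sizeof(g->cpumask));
    g->cpumask &= ~(1UL << cpu);
    if (g->cpumask != 0)
        return 0;            /* other CPUs still owe a quiescent state */
    g->completed = g->cur;   /* last CPU reported: the batch is complete */
    return 1;
}
```

Replaying the case study order (cpu2, then cpu0, then cpu1) leaves completed at -300 after the first two calls and moves it to -299 only on the third.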
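For cpu2, however, the next timer interrupt calls rcu_pending, which still returns 1 because the curlist of its rcu_data is non-empty and the completed field of rcu_ctrlblk is not before the batch field of its rcu_data, so callback processing continues. Suppose this happens at t8, as in the figure: rcu_process_callbacks moves the curlist of rcu_data to donelist. Since donelist is now non-empty, rcu_do_batch is called to process its callbacks. Suppose processing finishes at t9, after which donelist is empty again.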
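The final step, invoking the callbacks on the done list, can be sketched as follows. The article does not show rcu_do_batch, so this is an assumed simplification rather than the kernel's code: `struct cb_head`, `struct old_version`, `reclaim_cb`, and `run_done_list` are hypothetical, and the callback only marks the object instead of freeing it so the effect is observable.

```c
#include <assert.h>
#include <stddef.h>

struct cb_head {
    struct cb_head *next;
    void (*func)(struct cb_head *head);
};

/* Hypothetical RCU-protected object; embeds its callback head. */
struct old_version {
    int reclaimed;
    struct cb_head rcu;
};

/* Callback: recover the enclosing object from the embedded head. */
static void reclaim_cb(struct cb_head *head)
{
    struct old_version *obj =
        (struct old_version *)((char *)head - offsetof(struct old_version, rcu));

    obj->reclaimed = 1;   /* a real callback would typically free obj here */
}

/* Invoke every callback on the done list; returns how many ran. */
static int run_done_list(struct cb_head **donelist)
{
    struct cb_head *head = *donelist;
    int invoked = 0;

    while (head) {
        struct cb_head *next = head->next;   /* read before func may free head */

        assert(head->func != NULL);
        head->func(head);
        head = next;
        invoked++;
    }
    *donelist = NULL;
    return invoked;
}
```

Embedding the head inside the protected object is what lets a single generic list drive type-specific cleanup.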
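Finally, the softirq flavor of RCU differs only slightly. First, its reader-side interfaces are different: internally they disable softirqs (and therefore preemption as well).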

#define rcu_read_lock_bh()	local_bh_disable()
#define rcu_read_unlock_bh()	local_bh_enable()

Second, as already mentioned, rcu_check_callbacks uses a different condition to decide that a quiescent state has been passed.

Summary

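This article used diagrams to explain the basic principle of RCU and how it differs from a reader-writer lock. It then worked through a constructed case against the 2.6.11.1 Linux kernel to show how RCU is implemented. The case involves the global data structures rcu_ctrlblk and rcu_state and the per-CPU data structure rcu_data. It starts with their initialization, follows the changes caused by events on each CPU, and ends when every CPU has passed through a quiescent state for the batch, at which point the callbacks run and reclaim the resources. Many more complex situations are not covered here, but after a simple case like this one, analyzing harder cases, studying the algorithm in depth, or following how newer RCU implementations evolved should all be easier to approach.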