A Brief Look at Page Swap-In and Swap-Out in the Linux Kernel

0x00 Memory page categories and swap-in/swap-out rules

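Memory pages fall into two groups: user pages and kernel pages.

User pages come in the following kinds:

1. Ordinary user-space pages, including a process's code segment, data segment, stack segment, and dynamically allocated heap.

2. The contents of open files mapped into user space through the mmap() system call.

These pages involve allocation, use, and reclamation, and they are also subject to swap-out and swap-in.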

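Kernel pages come in the following kinds:

1. Pages allocated with kmalloc for short-lived data structures, such as vm_area_struct.

2. Pages the kernel allocates with alloc_page, such as the two pages that hold each process's kernel-mode stack.

These pages are never swapped out or in; once they are no longer in use they can be freed and reclaimed.

3. File-system-related structures such as dentry and inode.

These pages are not swapped out or in either, but their contents are still worth keeping after use. As long as conditions allow, the kernel keeps them around, which speeds up later operations.

4. The pages occupied by kernel code and kernel global variables.

These pages are neither allocated nor ever freed.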

0x01 Swapping in user pages

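From the kernel's point of view there are only two kinds of user pages: file-backed mappings and anonymous mappings. The former have nothing to do with swap and are written out directly to the file on disk. The latter are swapped out to swap.

1. File-backed mapping ---> written out to the file on disk

2. Anonymous mapping ---> swapped out to swap

Swap-in comes first here, even though the topic is swap-out. It follows the same two kinds of pages defined above.

1. Swapping in an executable file (file-backed mappings also cover mapping any file on disk directly, not just executables)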

	error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
				elf_prot, elf_flags, 0);

elf_map in turn calls mmap to map the executable into memory. This is a file-backed mapping.

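First, look at how the file is mapped into virtual addresses:

The virtual address range 0x00601000-0x00602000 is mapped writable. What does this region actually hold?

It holds writable sections such as .data and .bss.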
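A listing of this kind can be obtained on Linux from /proc/self/maps. The sketch below is an illustration rather than part of the original walkthrough: the helper name count_writable_mappings is hypothetical, and it assumes the usual "start-end perms offset dev inode path" line layout of that file.

```c
#include <stdio.h>

/* Counts mappings of the calling process whose permission string contains
 * 'w'. If print is non-zero, each writable line is echoed to stdout.
 * Returns -1 if /proc/self/maps cannot be opened (e.g. on non-Linux). */
int count_writable_mappings(int print)
{
    FILE *fp = fopen("/proc/self/maps", "r");
    char line[512];
    int count = 0;

    if (!fp)
        return -1;
    while (fgets(line, sizeof line, fp)) {
        unsigned long start, end;
        char perms[5];

        /* Assumed line layout: start-end perms offset dev inode [path] */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, perms) != 3)
            continue;
        if (perms[1] == 'w') {
            count++;
            if (print)
                fputs(line, stdout);
        }
    }
    fclose(fp);
    return count;
}
```

Calling count_writable_mappings(1) from a small test program prints the writable ranges of that program. A range that is writable and names the executable itself as its path is typically where .data lives.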

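This raises a question: since this is a file-backed mapping, its pages should be written out to disk when memory runs short. But surely the executable on disk cannot change every time it is run. How does the Linux kernel handle this? We come back to it later.

First, get one thing straight: the mmap system call only reserves a range of virtual addresses. It does not read data from disk into memory, and it does not set up the mapping (page directory and page table entries).

Only when the virtual address range is accessed does a page fault fire. The handler do_no_page then reads the data from disk into memory and establishes the mapping.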
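The lazy behaviour described above can be observed from user space. The following is a minimal sketch, assuming a Linux system where mincore() reports per-page residency. It uses an anonymous mapping to stay self-contained, and the function names page_resident and demand_paging_probe are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page at addr is resident, 0 if not, -1 on error. */
static int page_resident(void *addr, size_t page_size)
{
    unsigned char vec = 0;

    if (mincore(addr, page_size, &vec) != 0)
        return -1;
    return vec & 1;
}

/* Maps one page, records residency before and after the first write,
 * and returns 0 on success or -1 if the mapping could not be created. */
int demand_paging_probe(int *before, int *after)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return -1;
    *before = page_resident(p, page_size);
    p[0] = 1;                       /* first touch triggers the page fault */
    *after = page_resident(p, page_size);
    munmap(p, page_size);
    return 0;
}
```

On a typical Linux kernel one would expect before to be 0 and after to be 1, though sandboxed environments may report residency differently.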

if (!pte_present(entry)) {// the page is not in memory
        /* 
         * If it truly wasn't present, we know that kswapd 
         * and the PTE updates will not touch it later. So 
         * drop the lock. 
         */  
        spin_unlock(&mm->page_table_lock);  
        if (pte_none(entry))// the page table entry is empty
            return do_no_page(mm, vma, address, write_access, pte);  
        return do_swap_page(mm, vma, address, pte, pte_to_swp_entry(entry), write_access);// reached when the entry is not empty, i.e. it holds a swap entry
    }

static int do_no_page(struct mm_struct * mm, struct vm_area_struct * vma,  
    unsigned long address, int write_access, pte_t *page_table)  
{  
    struct page * new_page;  
    pte_t entry;  
  
    if (!vma->vm_ops || !vma->vm_ops->nopage)  
        return do_anonymous_page(mm, vma, page_table, write_access, address);  
  
    /* 
     * The third argument is "no_share", which tells the low-level code 
     * to copy, not share the page even if sharing is possible.  It's 
     * essentially an early COW detection. 
     */  
    new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, (vma->vm_flags & VM_SHARED)?0:write_access);// points to filemap_nopage
    if (new_page == NULL)   /* no page was available -- SIGBUS */  
        return 0;  
    if (new_page == NOPAGE_OOM)  
        return -1;  
    ++mm->rss;  
    /* 
     * This silly early PAGE_DIRTY setting removes a race 
     * due to the bad i386 page protection. But it's valid 
     * for other architectures too. 
     * 
     * Note that if write_access is true, we either now have 
     * an exclusive copy of the page, or this is a shared mapping, 
     * so we can make it writable and dirty to avoid having to 
     * handle that later. 
     */  
    flush_page_to_ram(new_page);  
    flush_icache_page(vma, new_page);  
    entry = mk_pte(new_page, vma->vm_page_prot);  
    if (write_access) {  
        entry = pte_mkwrite(pte_mkdirty(entry));  
    } else if (page_count(new_page) > 1 &&  
           !(vma->vm_flags & VM_SHARED))  
        entry = pte_wrprotect(entry);  
    set_pte(page_table, entry);// establish the mapping
    /* no need to invalidate: a not-present page shouldn't be cached */  
    update_mmu_cache(vma, address, entry);  
    return 2;   /* Major fault */  
}

At this point the newly allocated page has been placed on the active queue, active_list.
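The dispatch performed by the two kernel excerpts above can be condensed into a small decision function. This is a toy user-space model, not kernel code: toy_pte, toy_vma, and classify_fault are hypothetical names, and the booleans stand in for pte_present(), pte_none(), and the vma->vm_ops->nopage check.

```c
#include <stdbool.h>

/* Hypothetical, simplified views of a PTE and a VMA; not kernel types. */
struct toy_pte { bool present; bool none; };
struct toy_vma { bool has_nopage; };  /* stands in for vma->vm_ops->nopage */

enum fault_path {
    FAULT_PRESENT,      /* page is in memory; other paths handle it */
    FAULT_FILE_NOPAGE,  /* do_no_page -> vm_ops->nopage (filemap_nopage) */
    FAULT_ANONYMOUS,    /* do_no_page -> do_anonymous_page */
    FAULT_SWAP_IN       /* do_swap_page */
};

enum fault_path classify_fault(struct toy_pte pte, struct toy_vma vma)
{
    if (pte.present)
        return FAULT_PRESENT;
    if (!pte.none)                 /* non-empty entry: a swap entry */
        return FAULT_SWAP_IN;
    return vma.has_nopage ? FAULT_FILE_NOPAGE : FAULT_ANONYMOUS;
}
```

An empty PTE in a VMA that has a nopage handler takes the file-backed path. The same empty PTE in a VMA without one is treated as anonymous memory.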

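2. Swapping in anonymous memory

Anonymous memory is typically the stack and the heap. The heap is allocated with malloc, which can obtain memory from the kernel through mmap with fd set to -1, indicating an anonymous mapping.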

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

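When a page fault hits an anonymous mapping, do_no_page is called as well. It then calls do_anonymous_page, which allocates a page, establishes the mapping, and adds the page to the active queue, active_list.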

0x02 Swapping out user pages

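1. A few questions to consider:

Are pages written out directly to swap or to disk?

If they were, a page accessed again right after being swapped out would have to be swapped back in immediately, causing thrashing. The Linux kernel handles it as follows:

The Linux kernel has two kernel threads dedicated to swapping out pages: kswapd and kreclaimd. For details, see the article "Linux内核源代码情景分析-内存管理之用户页面的定期换出" (Linux Kernel Source Code Scenario Analysis: Memory Management, Periodic Swap-Out of User Pages).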

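1) The kswapd thread:

When it detects a page shortage, it picks pages according to certain rules and calls refill_inactive_scan and swap_out to turn active pages into inactive dirty pages, removing them from active_list and adding them to the inactive_dirty queue.

It then calls page_launder to turn inactive dirty pages into inactive clean pages, removing them from the inactive_dirty queue and adding them to the inactive_clean queue.

2) The kreclaimd kernel thread:

It takes inactive clean pages and clears all of their list linkage, while the use count stays at 1.

__free_page then drops the use count to 0 and returns the page to free_area[MAX_ORDER], so the next alloc_page can hand it out again.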
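The page life cycle described above is effectively a small state machine. The sketch below models only the queue transitions. It is a toy illustration with hypothetical names (toy_page, toy_deactivate, toy_launder, toy_reclaim) and leaves out aging, locking, and the actual I/O.

```c
enum page_queue { ACTIVE, INACTIVE_DIRTY, INACTIVE_CLEAN, FREE };

/* Toy page: only the fields needed for the illustration. */
struct toy_page {
    enum page_queue queue;
    int count;                     /* use count */
};

/* kswapd, step 1: refill_inactive_scan / swap_out deactivate the page. */
void toy_deactivate(struct toy_page *p)
{
    if (p->queue == ACTIVE)
        p->queue = INACTIVE_DIRTY;
}

/* kswapd, step 2: page_launder writes the page back and marks it clean. */
void toy_launder(struct toy_page *p)
{
    if (p->queue == INACTIVE_DIRTY)
        p->queue = INACTIVE_CLEAN;
}

/* kreclaimd: unlink the page, then __free_page drops the last reference. */
void toy_reclaim(struct toy_page *p)
{
    if (p->queue == INACTIVE_CLEAN) {
        p->count = 0;
        p->queue = FREE;
    }
}
```

Each transition is a no-op unless the page sits on the expected queue, so a page can only be freed after passing through both inactive queues. Those intermediate stops are what give a recently evicted page a chance to be rescued cheaply.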

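2. Turning an active page into an inactive dirty page, removing it from active_list and adding it to the inactive_dirty queue. swap_out calls try_to_swap_out:

For a page that has been moved to the inactive_dirty queue, nothing has been written yet, so its contents may not match what is on disk or in the swap partition (the write-back happens later in page_launder). The page table entry is now in one of two states:

1) For an anonymous page, the page table entry holds a newly allocated swap entry, to be used when the page is later written out to the swap partition.

2) For a file-backed page, the page table entry has been cleared. Writing the page out to disk is handled by a dedicated function.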

static int try_to_swap_out(struct mm_struct * mm, struct vm_area_struct* vma, unsigned long address, pte_t * page_table, int gfp_mask)
{
	pte_t pte;
	swp_entry_t entry;
	struct page * page;
	int onlist;
 
	pte = *page_table;
	if (!pte_present(pte))
		goto out_failed;
	page = pte_page(pte);
	if ((!VALID_PAGE(page)) || PageReserved(page))
		goto out_failed;
 
	if (!mm->swap_cnt)
		return 1;
 
	mm->swap_cnt--;
 
	onlist = PageActive(page);
	/* Don't look at this pte if it's been accessed recently. */
	if (ptep_test_and_clear_young(page_table)) {
		age_page_up(page);
		goto out_failed;
	}
	if (!onlist)
		/* The page is still mapped, so it can't be freeable... */
		age_page_down_ageonly(page);
 
	/*
	 * If the page is in active use by us, or if the page
	 * is in active use by others, don't unmap it or
	 * (worse) start unneeded IO.
	 */
	if (page->age > 0)
		goto out_failed;
 
	if (TryLockPage(page))
		goto out_failed;
 
	/* From this point on, the odds are that we're going to
	 * nuke this pte, so read and clear the pte.  This hook
	 * is needed on CPUs which update the accessed and dirty
	 * bits in hardware.
	 */
	pte = ptep_get_and_clear(page_table);
	flush_tlb_page(vma, address);
 
	/*
	 * Is the page already in the swap cache? If so, then
	 * we can just drop our reference to it without doing
	 * any IO - it's already up-to-date on disk.
	 *
	 * Return 0, as we didn't actually free any real
	 * memory, and we should just continue our scan.
	 */
	if (PageSwapCache(page)) {
		entry.val = page->index;
		if (pte_dirty(pte))
			set_page_dirty(page);
set_swap_pte:
		swap_duplicate(entry);
		set_pte(page_table, swp_entry_to_pte(entry));
drop_pte:
		UnlockPage(page);
		mm->rss--;
		deactivate_page(page);
		page_cache_release(page);
out_failed:
		return 0;
	}
 
	/*
	 * Is it a clean page? Then it must be recoverable
	 * by just paging it in again, and we can just drop
	 * it..
	 *
	 * However, this won't actually free any real
	 * memory, as the page will just be in the page cache
	 * somewhere, and as such we should just continue
	 * our scan.
	 *
	 * Basically, this just makes it possible for us to do
	 * some real work in the future in "refill_inactive()".
	 */
	flush_cache_page(vma, address);
	if (!pte_dirty(pte))
		goto drop_pte;
 
	/*
	 * Ok, it's really dirty. That means that
	 * we should either create a new swap cache
	 * entry for it, or we should write it back
	 * to its own backing store.
	 */
	if (page->mapping) {
		set_page_dirty(page);
		goto drop_pte;
	}
 
	/*
	 * This is a dirty, swappable page.  First of all,
	 * get a suitable swap entry for it, and make sure
	 * we have the swap cache set up to associate the
	 * page with that swap entry.
	 */
	entry = get_swap_page();
	if (!entry.val)
		goto out_unlock_restore; /* No swap space left */
 
	/* Add it to the swap cache and mark it dirty */
	add_to_swap_cache(page, entry);
	set_page_dirty(page);
	goto set_swap_pte;
 
out_unlock_restore:
	set_pte(page_table, pte);
	UnlockPage(page);
	return 0;
}
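The branches of try_to_swap_out can be summarised by what ends up in the page table entry. The function below is a simplified paraphrase for illustration only: toy_state and toy_try_to_swap_out are hypothetical names, and the model ignores page locking, mm->swap_cnt, and the aging side effects.

```c
#include <stdbool.h>

enum pte_result {
    PTE_KEPT,        /* page stays mapped (out_failed / out_unlock_restore) */
    PTE_CLEARED,     /* drop_pte: the entry is left empty */
    PTE_SWAP_ENTRY   /* set_swap_pte: the entry now holds a swap entry */
};

/* Inputs mirroring the checks in the kernel function. */
struct toy_state {
    bool present;        /* pte_present(pte) */
    bool young;          /* accessed bit was set */
    int  age;            /* page->age after aging */
    bool in_swap_cache;  /* PageSwapCache(page) */
    bool pte_dirty;      /* pte_dirty(pte) */
    bool has_mapping;    /* page->mapping != NULL, i.e. file-backed */
    bool swap_available; /* get_swap_page() succeeded */
};

enum pte_result toy_try_to_swap_out(struct toy_state s)
{
    if (!s.present || s.young || s.age > 0)
        return PTE_KEPT;
    if (s.in_swap_cache)
        return PTE_SWAP_ENTRY;
    if (!s.pte_dirty || s.has_mapping)
        return PTE_CLEARED;
    return s.swap_available ? PTE_SWAP_ENTRY : PTE_KEPT;
}
```

A dirty file-backed page ends with an empty entry, while a dirty anonymous page ends with a swap entry as long as swap space is available. These are the two cases listed before the kernel listing.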

0x03 Swapping user pages back in

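1. For pages that kreclaimd has not yet processed: the next time the page is accessed, the page table entry is either empty or holds a swap entry whose lowest bit is 0, so a page fault is still triggered.

For an anonymous page, the page fault calls do_swap_page.

For a file-backed page, the page fault calls do_no_page.

At this point the page has not been removed from the hash table, so both paths can find it there and simply re-establish the mapping. No real swap-in is needed, which reduces thrashing.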

static int do_swap_page(struct mm_struct * mm,  
    struct vm_area_struct * vma, unsigned long address,  
    pte_t * page_table, swp_entry_t entry, int write_access)  
{  
    struct page *page = lookup_swap_cache(entry);// look the page up in the hash table
    pte_t pte;  
  
    if (!page) {  
        lock_kernel();  
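        swapin_readahead(entry);// read ahead
        page = read_swap_cache(entry);// actually obtain a page: it may be found in the hash table thanks to the read-ahead above, or a new page is allocated and its contents are read in from disk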

PS: Although the mapping has been established, the page should also be moved back to active_list. When does that happen? See http://blog.csdn.net/jltxgcy/article/details/44055485

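2. For pages that the kreclaimd kernel thread has already processed: these have been removed from the hash table, so a page fault now has to read the data from the swap partition or from disk for real.

For an anonymous page, the page table entry already holds a swap entry. The page is read in from the swap partition, the mapping is established, and the page goes back on active_list.

For a file-backed page, the page table entry is empty. The page is read in from the disk partition, the mapping is established, and the page goes back on active_list.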
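The difference between the two cases is whether the lookup finds the page before any I/O is needed. The toy model below illustrates this with a tiny fixed-size cache. All names are hypothetical, a linear scan stands in for the kernel's hash table, and swap entry values are assumed to be non-zero.

```c
#include <stddef.h>

#define TOY_CACHE_SLOTS 8

struct toy_cache {
    unsigned long entry[TOY_CACHE_SLOTS];  /* swap entry value, 0 = empty */
    int disk_reads;                        /* counts simulated disk I/O */
};

/* Returns 1 if the entry is still cached (the page was not yet freed). */
static int toy_lookup(const struct toy_cache *c, unsigned long e)
{
    for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
        if (c->entry[i] == e)
            return 1;
    return 0;
}

/* Models the fault path: try the cache first, otherwise "read from disk"
 * and cache the result. Returns 1 on a cache hit, 0 when I/O was needed. */
int toy_swap_in(struct toy_cache *c, unsigned long e)
{
    if (toy_lookup(c, e))
        return 1;
    c->disk_reads++;
    for (size_t i = 0; i < TOY_CACHE_SLOTS; i++) {
        if (c->entry[i] == 0) {
            c->entry[i] = e;
            break;
        }
    }
    return 0;
}

/* Models kreclaimd freeing the page: the entry leaves the cache. */
void toy_reclaim_entry(struct toy_cache *c, unsigned long e)
{
    for (size_t i = 0; i < TOY_CACHE_SLOTS; i++)
        if (c->entry[i] == e)
            c->entry[i] = 0;
}
```

A fault that arrives before toy_reclaim_entry runs is served from the cache and costs no I/O. A fault that arrives afterwards increments disk_reads, which corresponds to the real read from swap or from the file.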

0x04 Answering the opening question

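Recall the writable region that holds sections such as .data and .bss.

It raised a question: since this is a file-backed mapping, its pages should be written out to disk when memory runs short. But surely the executable on disk cannot change every time it is run. How does the Linux kernel handle this?

When an executable is loaded, the following code runs: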

elf_flags = MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE;
 
		vaddr = elf_ppnt->p_vaddr;
		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
			elf_flags |= MAP_FIXED;
		} else if (loc->elf_ex.e_type == ET_DYN) {
			/* Try and get dynamic programs out of the way of the
			 * default mmap base, as well as whatever program they
			 * might try to exec.  This is because the brk will
			 * follow the loader, and is not movable.  */
#if defined(CONFIG_X86) || defined(CONFIG_ARM)
			load_bias = 0;
#else
			load_bias = ELF_PAGESTART(ELF_ET_DYN_BASE - vaddr);
#endif
		}
 
		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
				elf_prot, elf_flags, 0);

The flags passed to mmap include MAP_PRIVATE. Here is the explanation of MAP_PRIVATE:

MAP_PRIVATE

Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file. It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region.

The Linux kernel documentation says the same thing, in Documentation/filesystems/proc.txt:

"Anonymous" shows the amount of memory that does not belong to any file. Even a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE and a page is modified, the file page is replaced by a private anonymous copy.

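In other words, when the region above is written to, copy-on-write creates an anonymous copy of the page, and that copy can then be swapped out to the swap partition. The three segments seen in the figure are never written back to the executable on disk: MAP_PRIVATE keeps the modifications private, and the mapping is also declared with MAP_DENYWRITE.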
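This behaviour is easy to check from user space. The following is a minimal sketch, assuming a POSIX system with a writable /tmp. The function name private_mapping_probe and the temporary file name are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writes "hello" to a temporary file, maps it MAP_PRIVATE, overwrites the
 * first byte through the mapping, then re-reads the first byte of the file.
 * On success returns 0 and stores the mapped byte and the on-disk byte. */
int private_mapping_probe(char *mapped_byte, char *file_byte)
{
    char path[] = "/tmp/cow_demo_XXXXXX";  /* hypothetical temp-file name */
    int fd = mkstemp(path);
    char *p;
    int rc = -1;

    if (fd < 0)
        return -1;
    unlink(path);                          /* file lives until fd is closed */
    if (write(fd, "hello", 5) != 5)
        goto out;
    p = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        goto out;
    p[0] = 'J';                            /* triggers copy-on-write */
    *mapped_byte = p[0];
    if (pread(fd, file_byte, 1, 0) == 1)   /* what the file itself contains */
        rc = 0;
    munmap(p, 5);
out:
    close(fd);
    return rc;
}
```

After the call, the mapped byte is 'J' while the byte read back from the file is still 'h': the write landed in a private anonymous copy, not in the file.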