Background
In virtualization scenarios, the most direct way for a virtual machine to get optimal I/O performance from a PCI device is to expose the physical device to it, so the guest accesses the device without any intermediate translation layer and therefore without virtualization overhead. Linux, however, offers user programs no such device-access mechanism: all PCI devices are managed by the kernel. Even if we could take a PCI device out of the kernel's hands and give it directly to a userspace program, security would forbid it, because PCI devices can do DMA; once a user program controls a DMA-capable PCI device, it can use the device to reach all of system memory. Extended to virtualization, a VM that owns a DMA-capable PCI device could use it to compromise the whole host. So to give a VM a high-performance PCI device, besides minimizing the cost of each device access, we need an extra layer of protection that keeps the program driving the device from exploiting its DMA capability to attack the system. The VFIO protocol arose to meet this need: it defines a standard set of interfaces that expose a PCI device's information to user programs, replacing the traditional route of obtaining that information through a kernel PCI driver, so that userspace programs can access PCI devices directly.
PCI Device Emulation
PCI Device Abstraction
The traditional way to reach a PCI device goes through the kernel: the PCI driver framework enumerates every device on the bus, reads its configuration space and the type and size of each BAR as the PCI spec requires, records them in the kernel's PCI device structures, and maps system memory for the memory BARs. This is the natural way to access PCI hardware: write a driver to the vendor's spec and talk to the device. A userspace program cannot follow this path, since it can neither enumerate PCI devices the way the kernel driver framework does, nor read and write a chosen device to extract its information. So how should a userspace program access PCI device information? In fact the program does not care how the information is obtained, only about the end result, the device's information; it is enough for the kernel to hand that information over. To this end, the kernel's VFIO framework abstracts the PCI device, representing all of its information as a set of regions, as shown in the figure below.

All of the device's information is recorded in these regions and exposed to userspace through ioctl commands. In a virtualization setting the userspace program can be Qemu or an OVS process. We know Qemu can emulate PCI devices in software; with a VFIO device, whenever the guest accesses the PCI device, Qemu fetches the information from the kernel via ioctl and returns it to the guest directly. Qemu thus does the job of reassembling the regions into a PCI device and presenting it to the guest, as shown in the second figure.
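To make the region abstraction concrete, here is a minimal sketch, assuming device is a VFIO device fd obtained as in the demo later in this article, with error handling omitted. It reads the vendor ID out of the config-space region, which is in miniature what Qemu does when the guest reads the device's config space:

#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/vfio.h>

/* Sketch: read the vendor ID through the config-space region of a
 * vfio-pci device fd. */
static uint16_t read_vendor_id(int device)
{
    struct vfio_region_info reg = { .argsz = sizeof(reg) };
    uint16_t vendor = 0;

    /* for vfio-pci, PCI config space is exposed as region index 7 */
    reg.index = VFIO_PCI_CONFIG_REGION_INDEX;
    ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);

    /* region contents are read/written at reg.offset within the device fd */
    pread(device, &vendor, sizeof(vendor), reg.offset + 0x00);
    return vendor;
}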
VFIO Protocol
The previous section defined a set of interfaces and accompanying data structures that describe a PCI device; this set of interfaces is the so-called VFIO protocol. In the original implementation, VFIO supported passthrough of physical PCI devices, so the two ends of the protocol were a userspace program and the kernel, and the protocol was carried over ioctl commands. But the protocol can be extended to other settings: the vfio-user device is one such case, where both ends live in userspace and the protocol runs over a unix socket. We call the side that initiates a call the client, and the side that services the call and replies the server, as shown in the figure below.

In the traditional VFIO-PCI hardware passthrough case, hardware mechanisms mean that some PCI devices may do I/O through a physically related device. To guard against this kind of intrusion, the kernel implementation introduces the concepts of group and container, which define the smallest unit in which PCI devices can be isolated. This part is specific to the implementation rather than to the standard VFIO protocol, and the related commands can be considered implementation-defined, i.e. the server-specific commands in the figure above.
VFIO User   
The userspace VFIO protocol (vfio-user) does not involve PCI device isolation; it defines the following commands, each carried as a framed message over the unix socket (see the header sketch after the list):
USER_VERSION
DMA_MAP
DMA_UNMAP
DEVICE_GET_INFO
DEVICE_GET_REGION_INFO
DEVICE_GET_REGION_IO_FDS
DEVICE_GET_IRQ_INFO
DEVICE_SET_IRQS
REGION_READ
REGION_WRITE
DMA_READ
DMA_WRITE
DEVICE_RESET
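As a reference, every vfio-user command and reply begins with a fixed 16-byte header. The sketch below follows our reading of the vfio-user specification; treat the field layout as an assumption to verify against the spec:

#include <stdint.h>

/* vfio-user message header (sketch, layout per the vfio-user spec):
 * all traffic on the unix socket is framed by these 16 bytes. */
struct vfio_user_header {
    uint16_t msg_id;    /* matches a reply to its command */
    uint16_t cmd;       /* command number, e.g. DEVICE_GET_REGION_INFO */
    uint32_t msg_size;  /* size of header plus payload, in bytes */
    uint32_t flags;     /* command/reply type, no-reply and error bits */
    uint32_t error;     /* errno value on error replies */
};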
  
VFIO Kernel   
In the kernel case, a VFIO PCI device implements the standard VFIO protocol and must additionally define the commands around the smallest unit of PCI device isolation. The implementation-specific commands include:
GET_API_VERSION
CHECK_EXTENSION
SET_IOMMU
GROUP_GET_STATUS
GROUP_SET_CONTAINER
......
  
Experiment
Our experiment is simple: pick a PCI device, bind it to the vfio-pci driver, then write a userspace program that opens the device and reads out its information. The program follows the example in the kernel's VFIO documentation. We use passthrough of a graphics card as the example.
Check the graphics card:
[root@Hyman_server1 ~]# lspci -s 04:00.0
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Turks [Radeon HD 7600 Series]
  
Find the IOMMU group the card belongs to:
[root@Hyman_server1 ~]# readlink /sys/bus/pci/devices/0000\:04\:00.0/iommu_group
../../../../kernel/iommu_groups/14
  
List the devices contained in the card's group:
[root@Hyman_server1 ~]# ll /sys/kernel/iommu_groups/14/devices/
total 0
lrwxrwxrwx 1 root root 0 Feb 13 23:27 0000:04:00.0 -> ../../../../devices/pci0000:00/0000:00:07.0/0000:04:00.0
lrwxrwxrwx 1 root root 0 Feb 13 23:27 0000:04:00.1 -> ../../../../devices/pci0000:00/0000:00:07.0/0000:04:00.1
  
Unbind the kernel drivers from all devices in the group and bind them to the vfio-pci driver; here we use the tools libvirt provides.
First inspect the devices: the card's group holds two functions, the GPU itself and its HDMI audio device.
[root@Hyman_server1 ~]# virsh nodedev-dumpxml pci_0000_04_00_0
<device>
  <name>pci_0000_04_00_0</name>
  <path>/sys/devices/pci0000:00/0000:00:07.0/0000:04:00.0</path>
  <parent>pci_0000_00_07_0</parent>
  <driver>
    <name>radeon</name>
  </driver>
  <capability type='pci'>
    <class>0x030000</class>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x675b'>Turks [Radeon HD 7600 Series]</product>
    <vendor id='0x1002'>Advanced Micro Devices, Inc. [AMD/ATI]</vendor>
    <iommuGroup number='14'>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </iommuGroup>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='16'/>
      <link validity='sta' speed='5' width='16'/>
    </pci-express>
  </capability>
</device>
[root@Hyman_server1 ~]# virsh nodedev-dumpxml pci_0000_04_00_1
<device>
  <name>pci_0000_04_00_1</name>
  <path>/sys/devices/pci0000:00/0000:00:07.0/0000:04:00.1</path>
  <parent>pci_0000_00_07_0</parent>
  <driver>
    <name>snd_hda_intel</name>
  </driver>
  <capability type='pci'>
    <class>0x040300</class>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>1</function>
    <product id='0xaa90'>Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]</product>
    <vendor id='0x1002'>Advanced Micro Devices, Inc. [AMD/ATI]</vendor>
    <iommuGroup number='14'>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </iommuGroup>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='16'/>
      <link validity='sta' speed='5' width='16'/>
    </pci-express>
  </capability>
</device>
  
Detach both devices from their host drivers; libvirt rebinds them to vfio-pci:
[root@Hyman_server1 ~]# virsh nodedev-detach pci_0000_04_00_0
Device pci_0000_04_00_0 detached
[root@Hyman_server1 ~]# virsh nodedev-detach pci_0000_04_00_1
Device pci_0000_04_00_1 detached
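virsh nodedev-detach is a convenience wrapper: it unbinds the host driver and binds vfio-pci in its place. Without libvirt, the same result can be achieved through sysfs, roughly as follows (a sketch; repeat for 0000:04:00.1):

echo 0000:04:00.0 > /sys/bus/pci/devices/0000:04:00.0/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:04:00.0/driver_override
echo 0000:04:00.0 > /sys/bus/pci/drivers_probe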
  
Dump the XML again and confirm the driver is now vfio-pci:
[root@Hyman_server1 ~]# virsh nodedev-dumpxml pci_0000_04_00_0
<device>
  <name>pci_0000_04_00_0</name>
  <path>/sys/devices/pci0000:00/0000:00:07.0/0000:04:00.0</path>
  <parent>pci_0000_00_07_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <class>0x030000</class>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x675b'>Turks [Radeon HD 7600 Series]</product>
    <vendor id='0x1002'>Advanced Micro Devices, Inc. [AMD/ATI]</vendor>
    <iommuGroup number='14'>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </iommuGroup>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='16'/>
      <link validity='sta' speed='5' width='16'/>
    </pci-express>
  </capability>
</device>
[root@Hyman_server1 ~]# virsh nodedev-dumpxml pci_0000_04_00_1
<device>
  <name>pci_0000_04_00_1</name>
  <path>/sys/devices/pci0000:00/0000:00:07.0/0000:04:00.1</path>
  <parent>pci_0000_00_07_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <class>0x040300</class>
    <domain>0</domain>
    <bus>4</bus>
    <slot>0</slot>
    <function>1</function>
    <product id='0xaa90'>Turks HDMI Audio [Radeon HD 6500/6600 / 6700M Series]</product>
    <vendor id='0x1002'>Advanced Micro Devices, Inc. [AMD/ATI]</vendor>
    <iommuGroup number='14'>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
      <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </iommuGroup>
    <pci-express>
      <link validity='cap' port='0' speed='5' width='16'/>
      <link validity='sta' speed='5' width='16'/>
    </pci-express>
  </capability>
</device>
  
Now write the application that accesses the vfio-pci device; it follows the demo in the kernel VFIO documentation:
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/vfio.h>
#include <fcntl.h>
void main (int argc, char *argv[]) {
    int container, group, device, i;
    struct vfio_group_status group_status =
                                    { .argsz = sizeof(group_status) };
    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
    struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
    int ret;
    /* Create a new container */
    container = open("/dev/vfio/vfio", O_RDWR);
    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
        /* Unknown API version */
        fprintf(stderr, "unknown api version\n");
        return;
    }
    if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
        /* Doesn't support the IOMMU driver we want. */
        fprintf(stderr, "doesn't support the IOMMU driver we want\n");
        return;
    }
    /* Open the group and get group fd
     * readlink /sys/bus/pci/devices/0000\:04\:00.0/iommu_group
     * */
    group = open("/dev/vfio/14", O_RDWR);
    if (group < 0) {
        fprintf(stderr, "group is not managed by VFIO driver\n");
        return;
    }
    /* Test the group is viable and available */
    ret = ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
    if (ret) {
        fprintf(stderr, "cannot get VFIO group status, "
            "error %i (%s)\n", errno, strerror(errno));
        close(group);
        return;
    } else if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
        fprintf(stderr, "VFIO group is not viable! "
            "Not all devices in IOMMU group bound to VFIO or unbound\n");
        close(group);
        return;
    }
    if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
        /* Add the group to the container */
        ret = ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
        if (ret) {
            fprintf(stderr,
                "cannot add VFIO group to container, error "
                "%i (%s)\n", errno, strerror(errno));
            close(group);
            return;
        }
        /* Enable the IOMMU model we want */
        ret = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
        if (!ret) {
            fprintf(stderr, "using IOMMU type 1\n");
        } else {
            fprintf(stderr, "failed to select IOMMU type\n");
            return;
        }
    }
    /* Get addition IOMMU info */
    ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
    /* Allocate some space and setup a DMA mapping */
    dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    dma_map.size = 1024 * 1024;
    dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
    /* Get a file descriptor for the device */
    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:04:00.0");
    /* Test and setup the device */
    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
    for (i = 0; i < device_info.num_regions; i++) {
            struct vfio_region_info reg = { .argsz = sizeof(reg) };
            reg.index = i;
            ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
            /* Setup mappings... read/write offsets, mmaps
             * For PCI devices, config space is a region */
    }
    for (i = 0; i < device_info.num_irqs; i++) {
            struct vfio_irq_info irq = { .argsz = sizeof(irq) };
            irq.index = i;
            ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
            /* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
    }
    /* Gratuitous device reset and go... */
    ioctl(device, VFIO_DEVICE_RESET);
    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
}
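The demo, like the kernel documentation snippet it is based on, never tears anything down. A real program should undo the setup in reverse order; a minimal sketch, reusing the names from the demo:

    /* teardown sketch: unmap the DMA range, then close the fds in
     * reverse order of creation */
    struct vfio_iommu_type1_dma_unmap dma_unmap = {
        .argsz = sizeof(dma_unmap),
        .iova = 0,
        .size = 1024 * 1024,
    };
    ioctl(container, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
    munmap((void *)dma_map.vaddr, dma_map.size);
    close(device);
    close(group);
    close(container);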
  
Compile and step through it under gdb (the binary needs debug info, e.g. built with gcc -g):
 
make
gdb demo
b main
r
  
 
Breakpoint 1, main (argc=1, argv=0x7fffffffe478) at demo.c:14
14	    struct vfio_group_status group_status =
(gdb) bt
#0  main (argc=1, argv=0x7fffffffe478) at demo.c:14
(gdb) list
9	#include <linux/vfio.h>
10	#include <fcntl.h>
11	
12	void main (int argc, char *argv[]) {
13	    int container, group, device, i;
14	    struct vfio_group_status group_status =
15	                                    { .argsz = sizeof(group_status) };
16	    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
17	    struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
18	    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
(gdb) n
16	    struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(iommu_info) };
(gdb) n
17	    struct vfio_iommu_type1_dma_map dma_map = { .argsz = sizeof(dma_map) };
(gdb) n
18	    struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
(gdb) n
22	    container = open("/dev/vfio/vfio", O_RDWR);
(gdb) n
24	    if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
(gdb) n
30	    if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
(gdb) n
39	    group = open("/dev/vfio/14", O_RDWR);
(gdb) n
41	    if (group < 0) {
(gdb) p group
$1 = 8
(gdb) n
47	    ret = ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
(gdb) n
48	    if (ret) {
(gdb) p ret
$2 = 0
(gdb) p group_status
$3 = {argsz = 8, flags = 1}
(gdb) n
53	    } else if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
(gdb) n
60	    if (!(group_status.flags & VFIO_GROUP_FLAGS_CONTAINER_SET)) {
(gdb) n
62	        ret = ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
(gdb) p container
$4 = 7
(gdb) n
63	        if (ret) {
(gdb) p ret
$5 = 0
(gdb) n
72	        ret = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
(gdb) n
73	        if (!ret) {
(gdb) n
74	            fprintf(stderr, "using IOMMU type 1\n");
(gdb) n
using IOMMU type 1
82	    ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info);
(gdb) n
85	    dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
(gdb) p iommu_info
$6 = {argsz = 116, flags = 3, iova_pgsizes = 4096, cap_offset = 0}
(gdb) n
87	    dma_map.size = 1024 * 1024;
(gdb) n
88	    dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
(gdb) n
89	    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
(gdb) n
91	    ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
(gdb) n
94	    device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:04:00.0");
(gdb) n
97	    ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
(gdb) p device
$7 = 9
(gdb) n
99	    for (i = 0; i < device_info.num_regions; i++) {
(gdb) p device_info
$8 = {argsz = 20, flags = 2, num_regions = 9, num_irqs = 5, cap_offset = 0}
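The final device_info is worth decoding: for a vfio-pci device, num_regions = 9 corresponds to the region indexes defined in linux/vfio.h, i.e. BAR0 through BAR5, the expansion ROM, config space and the VGA region, and num_irqs = 5 to the INTx, MSI, MSI-X, ERR and REQ interrupt indexes.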
  
Check the files the demo program has opened.
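The output was not captured here; one way to check is to list the process's fd table, where the fd numbers should match what gdb printed above: 7 for the container (/dev/vfio/vfio), 8 for the group (/dev/vfio/14) and 9 for the device fd returned by VFIO_GROUP_GET_DEVICE_FD:

ls -l /proc/$(pidof demo)/fd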
 Q&A  
When a device is passed through with VFIO, must the host enable IOMMU passthrough for it?
Yes. If the IOMMU is not handed over, DMA requests from every device are translated according to mappings configured by the kernel, and a user process has no say in those mappings. Once IOMMU passthrough is enabled, the device's IOMMU DMA mappings are turned over to the user program, and only then can a userspace program (Qemu) use the VFIO_IOMMU_MAP_DMA interface to freely set up the memory ranges the guest uses for DMA. (This presumes the host IOMMU is enabled at all, typically via intel_iommu=on on the kernel command line of Intel hosts.)