Exploring io_uring: How Efficient Asynchronous I/O Works and How It Is Implemented

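Overview

io_uring is a high-performance asynchronous I/O framework provided by the Linux kernel, first introduced in Linux 5.1. It was designed to address the inefficiency of earlier I/O models, such as epoll-based readiness notification and POSIX AIO, when an application issues I/O at large scale.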

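Key features and advantages include:

Zero-copy submission and completion: the submission and completion queues live in memory shared between the application and the kernel, so request descriptors and results are exchanged without being copied across the user/kernel boundary on every operation.
Efficient batching: a single system call can submit many I/O requests, which lowers the number of system calls and the context-switch overhead that comes with them.
Lower latency and higher throughput: with less copying and fewer system calls, io_uring can deliver lower latency and higher throughput when handling large numbers of I/O requests, especially under high concurrency and large data volumes.
Flexible event notification: io_uring uses ring buffers as the communication channel between the kernel and user space, so completions can be delivered in batches, which is more efficient than traditional per-event notification.
Broad coverage of I/O types: io_uring supports network socket I/O as well as file I/O, which makes it useful for network applications too.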

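Shortcomings of earlier I/O interfaces

Synchronous I/O interfaces

The most basic file I/O system calls are read and write.

The read system call reads data from the open file referred to by a file descriptor. The write system call writes data to an open file.

pread and pwrite perform I/O at a specific file offset. The caller specifies the position for the operation instead of starting at the file's current offset, and neither call changes the current file offset.

readv and writev provide scatter input and gather output (scatter-gather I/O). Rather than reading or writing a single buffer, one call transfers data for multiple buffers. This avoids the overhead of multiple system calls and improves file I/O efficiency, particularly when several contiguous or non-contiguous blocks of data have to be read or written.

The mechanism uses an array iov to define the set of buffers to transfer and an integer iovcnt to give the number of elements in iov. Each element of iov is a structure of the following form.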

struct iovec {
   void  *iov_base;    /* Starting address */
   size_t iov_len;     /* Number of bytes to transfer */
};

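When reading or writing with the interfaces above, the system call blocks and returns only after the data has been read or written. The consequence of synchronous operation is that the caller cannot do anything else while blocked and can only wait for the I/O result. Storage workloads place very high demands on performance, which is why asynchronous I/O is needed.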
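
As an illustration of the scatter-gather calls described above, the sketch below writes two buffers with a single writev call and reads them back into two differently sized buffers with a single preadv call. The function name scatter_gather_demo and the temporary file path are hypothetical and chosen only for this example; both calls still block until the data has been transferred.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write two separate buffers with one writev() call, then read the data
 * back into two caller-supplied buffers with one preadv() call at offset 0.
 * Returns the number of bytes read back, or -1 on error.
 */
ssize_t scatter_gather_demo(char *first, size_t first_len,
                            char *second, size_t second_len)
{
    assert(first != NULL && second != NULL);

    char path[] = "/tmp/sg_demo_XXXXXX";    /* hypothetical temp file name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* the file goes away on close */

    /* Gather output: two source buffers, one system call. */
    char hello[] = "hello, ";
    char world[] = "world";
    struct iovec out[2] = {
        { .iov_base = hello, .iov_len = strlen(hello) },
        { .iov_base = world, .iov_len = strlen(world) },
    };
    ssize_t written = writev(fd, out, 2);

    /* Scatter input: one system call fills both destination buffers. */
    struct iovec in[2] = {
        { .iov_base = first,  .iov_len = first_len },
        { .iov_base = second, .iov_len = second_len },
    };
    ssize_t nread = (written < 0) ? -1 : preadv(fd, in, 2, 0);

    close(fd);
    return nread;
}
```

The kernel fills the buffers in array order, so the first buffer receives the first first_len bytes of the file and the remainder goes to the second buffer.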

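Asynchronous I/O interface: AIO

Linux asynchronous I/O (AIO) is a more advanced file I/O model in which an application does not have to wait for an I/O operation to finish after issuing it and can continue with other work. This differs from the traditional synchronous model, which blocks the application until the I/O operation completes.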
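
For reference, here is a minimal sketch of one asynchronous read, assuming that the AIO meant here is Linux native AIO (the io_setup, io_submit, and io_getevents system calls) invoked through syscall(2). The function name aio_read_demo and the temporary file are hypothetical. Note that creating the context, submitting, and reaping each cost a separate system call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Create a temp file containing "hello" and read it back with Linux native
 * AIO. Returns the byte count reported by the completion event, or a
 * negative error code if one of the steps fails.
 */
long aio_read_demo(char *buf, size_t len)
{
    assert(buf != NULL);

    char path[] = "/tmp/aio_demo_XXXXXX";   /* hypothetical temp file name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -errno;
    unlink(path);
    if (write(fd, "hello", 5) != 5) {
        close(fd);
        return -EIO;
    }

    aio_context_t ctx = 0;
    if (syscall(SYS_io_setup, 8, &ctx) < 0) {        /* call 1: create context */
        long err = -errno;
        close(fd);
        return err;
    }

    struct iocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = (uint32_t)fd;
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_buf = (uint64_t)(uintptr_t)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    struct iocb *batch[1] = { &cb };

    long rc;
    struct io_event ev;
    if (syscall(SYS_io_submit, ctx, 1, batch) != 1)              /* call 2: submit */
        rc = -errno;
    else if (syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL) != 1) /* call 3: reap */
        rc = -errno;
    else
        rc = (long)ev.res;                  /* bytes read, or a negated errno */

    syscall(SYS_io_destroy, ctx);
    close(fd);
    return rc;
}
```

Between io_submit and io_getevents the caller is free to do other work; the sketch reaps immediately only to stay short.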

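Design of io_uring

Addressing “high system call overhead”

For this problem, the question is whether every operation really needs its own system call. If the work of many system calls can be folded into a bounded number of calls, the system call cost no longer grows with the number of requests.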

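Addressing “high copy overhead”

Submission and completion involve a large number of memory copies because the application and the kernel exchange data by copying it. Avoiding this means rethinking how the two sides communicate. Communication does not actually require copying: with existing techniques, the application and the kernel can share memory.

The best way to achieve zero-copy between user space and the kernel is a memory-mapped region shared by both. User space writes data into this memory and then notifies the kernel to use it, or the kernel fills in the data and user space consumes it. This calls for a pair of shared ring buffers for communication between the application and the kernel.

One ring carries data from user space to the kernel and the other carries data from the kernel to user space. In each ring, one side only produces entries and the other only consumes them.

In the submission queue (SQ), the application is the producer of I/O submissions and the kernel is the consumer.
In the completion queue (CQ), the kernel is the producer of I/O completions and the application is the consumer.

The kernel controls the head of the SQ ring and the tail of the CQ ring; the application controls the tail of the SQ ring and the head of the CQ ring.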
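
The ownership rule above, where each index has exactly one writer, can be sketched in user space as a single-producer, single-consumer ring. This is a simplified illustration, not the kernel's code: struct demo_ring, ring_push, and ring_pop are hypothetical names, and C11 atomics stand in for the memory barriers that the real shared rings rely on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u
#define RING_MASK    (RING_ENTRIES - 1u)
static_assert((RING_ENTRIES & RING_MASK) == 0, "ring size must be a power of two");

/* Hypothetical user-space stand-in for one shared ring. */
struct demo_ring {
    _Atomic uint32_t head;              /* written only by the consumer */
    _Atomic uint32_t tail;              /* written only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false if the ring is full. */
bool ring_push(struct demo_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & RING_MASK] = value;
    /* Publish the slot contents before the new tail becomes visible. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_pop(struct demo_ring *r, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *value = r->slots[head & RING_MASK];
    /* Hand the slot back only after its contents have been read. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

In the SQ the application plays the ring_push role and the kernel the ring_pop role; in the CQ the roles are reversed. Because each index has a single writer, neither side needs a lock.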

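Addressing “an unfriendly API”

The problem is that several system calls are needed to complete one operation, so the question is whether they can be merged into one. Sometimes merging several similar functions and selecting the behavior through parameters is the better choice; at other times a complex function needs to be broken down into simpler parts.

If part of a function can stand on its own as a separate function, extract it first, and then consider whether further parameterization is warranted.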

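How io_uring is implemented

Overall architecture

SQE: submission queue entry, which represents an I/O request.

CQE: completion queue entry, which represents the result of an I/O request.

SQ: submission queue, the array that stores SQEs.

CQ: completion queue, the array that stores CQEs.

SQ ring: the SQ ring buffer, which holds the SQ, a head index, a tail index, the queue size, and related information.

CQ ring: the CQ ring buffer, which holds the CQ, a head index, a tail index, the queue size, and related information.

SQ thread: a kernel helper thread, used when SQ polling (IORING_SETUP_SQPOLL) is enabled, that takes SQEs from the SQ, submits them to the kernel for processing, and stores the resulting CQEs in the CQ.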

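The io_uring and io_rings structures

struct io_uring describes one ring: it holds the head and tail indices used to manage the submission queue (SQ) and the completion queue (CQ) for asynchronous I/O.

struct io_rings builds on struct io_uring: it embeds one instance for each queue and adds the fields needed to manage and control asynchronous I/O operations.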

struct io_uring {
 u32 head ____cacheline_aligned_in_smp;
 u32 tail ____cacheline_aligned_in_smp;
};
struct io_rings {
 struct io_uring  sq, cq;
 u32   sq_ring_mask, cq_ring_mask;
 u32   sq_ring_entries, cq_ring_entries;
 u32   sq_dropped;
 u32   sq_flags;
 u32   cq_flags;
 u32   cq_overflow;
 struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp;
};

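struct io_uring has two members, head and tail, which are the head and tail of the ring.

____cacheline_aligned_in_smp is a macro that aligns head and tail to cache lines on multiprocessor (SMP) systems, so that the two indices do not share a cache line.

sq and cq are variables of type struct io_uring and represent the submission queue and the completion queue respectively.

sq_ring_mask and cq_ring_mask are masks used to convert head and tail offsets into valid indices.

sq_ring_entries and cq_ring_entries are the sizes of the submission and completion ring buffers; each must be a power of two.

sq_dropped is the number of invalid entries dropped by the kernel because the application stored an invalid index.

sq_flags and cq_flags are runtime flags used for state control of asynchronous I/O operations.

cq_overflow is the number of completion events lost because the completion queue was full.

cqes[] is the ring buffer of completion events, of type struct io_uring_cqe, and is cache-line aligned.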
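
To illustrate how the masks and entry counts are used, the sketch below treats head and tail as free-running 32-bit counters and derives slot indices and queue depth from them. The helper names are hypothetical and not taken from the kernel source. The power-of-two requirement is what allows a bitwise AND to replace a modulo operation.

```c
#include <assert.h>
#include <stdint.h>

/* Mask for a ring with `entries` slots; entries must be a power of two. */
static inline uint32_t ring_mask_for(uint32_t entries)
{
    assert(entries != 0 && (entries & (entries - 1)) == 0);
    return entries - 1;
}

/* Slot index for a free-running head or tail counter. */
static inline uint32_t ring_index(uint32_t counter, uint32_t ring_mask)
{
    return counter & ring_mask;
}

/*
 * Number of entries currently queued. Unsigned subtraction keeps the result
 * correct even after the counters have wrapped past 2^32.
 */
static inline uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;
}
```

For example, with 8 entries the mask is 7, so a tail value of 13 selects slot 5, and a head of 0xFFFFFFFE with a tail of 2 still reports 4 pending entries.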

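The submission queue entry structure

The submission queue (SQ) is written by user space and read by the kernel. A submission queue entry (SQE) is one item in that queue; the queue is made up of such entries.

Describing an SQE is considerably more complex than describing a CQE, both because it has to carry more information and because it was designed with extensibility in mind.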

struct io_uring_sqe {
 __u8 opcode;  /* type of operation for this sqe */
 __u8 flags;  /* IOSQE_ flags */
 __u16 ioprio;  /* ioprio for the request */
 __s32 fd;  /* file descriptor to do IO on */
 union {
  __u64 off; /* offset into file */
  __u64 addr2;
 };
 union {
  __u64 addr; /* pointer to buffer or iovecs */
  __u64 splice_off_in;
 };
 __u32 len;  /* buffer size or number of iovecs */
 union {
  __kernel_rwf_t rw_flags;
  __u32  fsync_flags;
  __u16  poll_events; /* compatibility */
  __u32  poll32_events; /* word-reversed for BE */
  __u32  sync_range_flags;
  __u32  msg_flags;
  __u32  timeout_flags;
  __u32  accept_flags;
  __u32  cancel_flags;
  __u32  open_flags;
  __u32  statx_flags;
  __u32  fadvise_advice;
  __u32  splice_flags;
 };
 __u64 user_data; /* data to be passed back at completion time */
 union {
  struct {
   /* pack this to avoid bogus arm OABI complaints */
   union {
    /* index into fixed buffers, if used */
    __u16 buf_index;
    /* for grouped buffer selection */
    __u16 buf_group;
   } __attribute__((packed));
   /* personality to use, if used */
   __u16 personality;
   __s32 splice_fd_in;
  };
  __u64 __pad2[3];
 };
};

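opcode is the operation code; for example, IORING_OP_READV denotes a vectored read.
flags is a set of flag bits.
ioprio is the priority of the request; for ordinary reads and writes, see ioprio_set(2) for the definition.
fd is the file descriptor associated with the request.
off is the offset at which the operation takes place.
addr is the address the I/O operation works on. If the opcode describes a vectored data transfer, addr points to the start of an array of struct iovec, used in the same way as with the readv and writev calls described earlier, and len is the number of iovec elements. If the transfer is not vectored, addr must contain the buffer address directly and len is the length of that buffer.
The union that follows holds flags specific to particular opcodes.
user_data is common to all opcodes and is simply copied into the completion event.
The end of the structure is padding that brings its size to 64 bytes.

This is the submission queue entry that user space fills in for the kernel. The application prepares such a structure, writes it to the corresponding location in the sqes memory, and then notifies the kernel to fetch it from that location, which completes one handoff.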
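
As a sketch of how these fields fit together, the helper below fills one SQE for a vectored read. It assumes the struct io_uring_sqe definition from the kernel's <linux/io_uring.h> header; the function name prep_readv_sqe is hypothetical. It only prepares the entry in memory and does not submit anything.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Fill one SQE describing a vectored read of `nr_vecs` buffers from `fd`,
 * starting at file offset `offset`. `tag` is returned unchanged in the
 * user_data field of the matching completion.
 */
void prep_readv_sqe(struct io_uring_sqe *sqe, int fd,
                    const struct iovec *iov, unsigned nr_vecs,
                    uint64_t offset, uint64_t tag)
{
    assert(sqe != NULL && iov != NULL && nr_vecs > 0);

    memset(sqe, 0, sizeof(*sqe));               /* clear fields not used here */
    sqe->opcode    = IORING_OP_READV;
    sqe->fd        = fd;
    sqe->off       = offset;
    sqe->addr      = (uint64_t)(uintptr_t)iov;  /* vectored: iovec array address */
    sqe->len       = nr_vecs;                   /* vectored: number of iovecs */
    sqe->user_data = tag;
}
```

The iovec array and the buffers it points to are referenced by address, so they need to remain valid until the request has been handed to the kernel and, for the buffers, until the completion arrives.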

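The completion queue entry structure

The completion queue (CQ) is written by the kernel and read by user space. A completion queue entry (CQE) is one item in that queue; the queue is made up of such entries.

Describing a CQE is much simpler.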

/*
 * IO completion data structure (Completion Queue Entry)
 */
struct io_uring_cqe {
 __u64 user_data; /* sqe->data submission passed back */
 __s32 res;  /* result code for this event */
 __u32 flags;
};

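user_data is the value that user space filled in when the SQE was submitted; it is passed back on completion. A typical use is to treat it as a pointer to the data structure or context of the original request, so that completion handling can identify and process the matching request through user_data.
res stores the result code or return value for the event. It usually corresponds to the return value of the equivalent system call.
flags is a set of flag bits that provides additional information about the completion event.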

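The io_ring_ctx context structure

io_ring_ctx is the data structure that runs through every part of io_uring. From almost anywhere in the code, holding this structure is enough to locate any other piece of data. For example, sq_sqes is a pointer to struct io_uring_sqe that points to the first SQE.

It provides the complete context, covering management of the requests submitted to io_uring, wait queues, file management, permission control, synchronization, and more.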

struct io_ring_ctx {
 struct {
  struct percpu_ref refs;
 } ____cacheline_aligned_in_smp;

 struct {
  unsigned int  flags;
  unsigned int  compat: 1;
  unsigned int  limit_mem: 1;
  unsigned int  cq_overflow_flushed: 1;
  unsigned int  drain_next: 1;
  unsigned int  eventfd_async: 1;
  unsigned int  restricted: 1;

  /*
   * Ring buffer of indices into array of io_uring_sqe, which is
   * mmapped by the application using the IORING_OFF_SQES offset.
   *
   * This indirection could e.g. be used to assign fixed
   * io_uring_sqe entries to operations and only submit them to
   * the queue when needed.
   *
   * The kernel modifies neither the indices array nor the entries
   * array.
   */
  u32   *sq_array;
  unsigned  cached_sq_head;
  unsigned  sq_entries;
  unsigned  sq_mask;
  unsigned  sq_thread_idle;
  unsigned  cached_sq_dropped;
  unsigned  cached_cq_overflow;
  unsigned long  sq_check_overflow;

  struct list_head defer_list;
  struct list_head timeout_list;
  struct list_head cq_overflow_list;

  wait_queue_head_t inflight_wait;
  struct io_uring_sqe *sq_sqes;
 } ____cacheline_aligned_in_smp;

 struct io_rings *rings;

 /* IO offload */
 struct io_wq  *io_wq;

 /*
  * For SQPOLL usage - we hold a reference to the parent task, so we
  * have access to the ->files
  */
 struct task_struct *sqo_task;

 /* Only used for accounting purposes */
 struct mm_struct *mm_account;

#ifdef CONFIG_BLK_CGROUP
 struct cgroup_subsys_state *sqo_blkcg_css;
#endif

 struct io_sq_data *sq_data; /* if using sq thread polling */

 struct wait_queue_head sqo_sq_wait;
 struct wait_queue_entry sqo_wait_entry;
 struct list_head sqd_list;

 /*
  * If used, fixed file set. Writers must ensure that ->refs is dead,
  * readers must ensure that ->refs is alive as long as the file* is
  * used. Only updated through io_uring_register(2).
  */
 struct fixed_file_data *file_data;
 unsigned  nr_user_files;

 /* if used, fixed mapped user buffers */
 unsigned  nr_user_bufs;
 struct io_mapped_ubuf *user_bufs;

 struct user_struct *user;

 const struct cred *creds;

#ifdef CONFIG_AUDIT
 kuid_t   loginuid;
 unsigned int  sessionid;
#endif

 struct completion ref_comp;
 struct completion sq_thread_comp;

 /* if all else fails... */
 struct io_kiocb  *fallback_req;

#if defined(CONFIG_UNIX)
 struct socket  *ring_sock;
#endif

 struct idr  io_buffer_idr;

 struct idr  personality_idr;

 struct {
  unsigned  cached_cq_tail;
  unsigned  cq_entries;
  unsigned  cq_mask;
  atomic_t  cq_timeouts;
  unsigned long  cq_check_overflow;
  struct wait_queue_head cq_wait;
  struct fasync_struct *cq_fasync;
  struct eventfd_ctx *cq_ev_fd;
 } ____cacheline_aligned_in_smp;

 struct {
  struct mutex  uring_lock;
  wait_queue_head_t wait;
 } ____cacheline_aligned_in_smp;

 struct {
  spinlock_t  completion_lock;

  /*
   * ->iopoll_list is protected by the ctx->uring_lock for
   * io_uring instances that don't use IORING_SETUP_SQPOLL.
   * For SQPOLL, only the single threaded io_sq_thread() will
   * manipulate the list, hence no extra locking is needed there.
   */
  struct list_head iopoll_list;
  struct hlist_head *cancel_hash;
  unsigned  cancel_hash_bits;
  bool   poll_multi_file;

  spinlock_t  inflight_lock;
  struct list_head inflight_list;
 } ____cacheline_aligned_in_smp;

 struct delayed_work  file_put_work;
 struct llist_head  file_put_llist;

 struct work_struct  exit_work;
 struct io_restriction  restrictions;
};

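Key io_uring workflows

Usage falls broadly into three phases: preparation, submission, and reaping. The io_uring system calls are listed below. glibc does not provide wrappers for them, so applications call them through syscall(2) or through the liburing library.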

#include <linux/io_uring.h>

int io_uring_setup(u32 entries, struct io_uring_params *p);

int io_uring_enter(unsigned int fd, unsigned int to_submit,
                   unsigned int min_complete, unsigned int flags,
                   sigset_t *sig);
                   
int io_uring_register(unsigned int fd, unsigned int opcode,
                      void *arg, unsigned int nr_args);
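
The three phases can be sketched end to end with raw system calls. The example below is a minimal illustration rather than a production implementation: it assumes a kernel and headers with io_uring support, submits a single IORING_OP_NOP request, and uses GCC/Clang __atomic builtins for the index updates. The names map_region and nop_roundtrip are hypothetical. The raw io_uring_enter system call also takes a size argument after sig, which is passed as 0 here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Map one region of the io_uring fd; returns NULL on failure. */
static void *map_region(int fd, size_t len, off_t offset)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_POPULATE, fd, offset);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Submit one IORING_OP_NOP tagged with `tag` and wait for its completion.
 * Returns 0 and stores the CQE's user_data in *echoed on success, or a
 * negative error code if io_uring is unavailable or a step fails.
 */
int nop_roundtrip(uint64_t tag, uint64_t *echoed)
{
    assert(echoed != NULL);
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));

    /* Prepare: create the instance, then map both rings and the SQE array. */
    int fd = (int)syscall(__NR_io_uring_setup, 4, &p);
    if (fd < 0)
        return -errno;
    size_t sq_len = p.sq_off.array + p.sq_entries * sizeof(uint32_t);
    size_t cq_len = p.cq_off.cqes + p.cq_entries * sizeof(struct io_uring_cqe);
    size_t sqe_len = p.sq_entries * sizeof(struct io_uring_sqe);
    char *sq = map_region(fd, sq_len, IORING_OFF_SQ_RING);
    char *cq = map_region(fd, cq_len, IORING_OFF_CQ_RING);
    struct io_uring_sqe *sqes = map_region(fd, sqe_len, IORING_OFF_SQES);
    int rc = -ENOMEM;

    if (sq && cq && sqes) {
        unsigned *sq_tail = (unsigned *)(sq + p.sq_off.tail);
        unsigned *sq_array = (unsigned *)(sq + p.sq_off.array);
        unsigned sq_mask = *(unsigned *)(sq + p.sq_off.ring_mask);
        unsigned *cq_head = (unsigned *)(cq + p.cq_off.head);
        unsigned *cq_tail = (unsigned *)(cq + p.cq_off.tail);
        unsigned cq_mask = *(unsigned *)(cq + p.cq_off.ring_mask);
        struct io_uring_cqe *cqes = (struct io_uring_cqe *)(cq + p.cq_off.cqes);

        /* Submit: fill an SQE, publish its index, then advance the SQ tail. */
        unsigned tail = *sq_tail;           /* only the application writes it */
        unsigned idx = tail & sq_mask;
        memset(&sqes[idx], 0, sizeof(sqes[idx]));
        sqes[idx].opcode = IORING_OP_NOP;
        sqes[idx].user_data = tag;
        sq_array[idx] = idx;
        __atomic_store_n(sq_tail, tail + 1, __ATOMIC_RELEASE);

        /* One call both submits the request and waits for one completion. */
        long n = syscall(__NR_io_uring_enter, fd, 1, 1,
                         IORING_ENTER_GETEVENTS, NULL, 0);

        /* Reap: read the CQE at the CQ head, then advance the head. */
        unsigned head = *cq_head;           /* only the application writes it */
        if (n < 0) {
            rc = -errno;
        } else if (head == __atomic_load_n(cq_tail, __ATOMIC_ACQUIRE)) {
            rc = -EAGAIN;
        } else {
            *echoed = cqes[head & cq_mask].user_data;
            rc = cqes[head & cq_mask].res;  /* 0 for a successful NOP */
            __atomic_store_n(cq_head, head + 1, __ATOMIC_RELEASE);
        }
    }
    if (sq) munmap(sq, sq_len);
    if (cq) munmap(cq, cq_len);
    if (sqes) munmap(sqes, sqe_len);
    close(fd);
    return rc;
}
```

A call such as nop_roundtrip(42, &out) is expected to return 0 with out set to 42 where io_uring is available. On hosts where it is disabled, for example by a seccomp policy, the function returns a negative error code instead. Real applications usually let liburing handle the mapping and index bookkeeping shown here.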