Linux系统块存储子系统分析记录

1 Linux存储栈

通过网址Linux Storage Stack Diagram - Thomas-Krenn-Wiki-en，可以获取多个linux内核版本下的存储栈概略图，下面是kernel-4.0的存储栈概略图：

2 存储接口、传输速度和协议

2.1 硬盘

《深入浅出SSD：固态存储核心技术、原理与实战》第2版，1.4 SSD基本工作原理，表1-5

《深入浅出SSD：固态存储核心技术、原理与实战》第2版，9.2 NVMe综述，图9-4

2.2 闪存(Flash)

2.2.1 ONFI 接口

ONFI 2.0: 133MB/s
ONFI 2.1: 166BM/s 和 200MB/s (工作模式不同则速度不同)
ONFI 3.0: 400MB/s
ONFI 4.0: 800MB/s

《固态存储：原理、架构与数据安全》5.5 闪存接口，1. ONFI

2.2.2 Toggle接口

Toggle DDR 2.0: 400MB/s
《固态存储：原理、架构与数据安全》5.5 闪存接口，2. Toggle

2.3 SDIO

3 Linux块设备⼦系统

3.1 简介

本文涉及的内核源码版本是kernel-5.4

3.1.1 功能框图

《存储技术原理分析》5.1 概述，图5-1

《Linux设备驱动开发详解：基于最新的Linux4.0内核》13.1 块设备的I/O操作特点，图13.2

3.2 通用块层 / bio layer

3.2.1 简介

In summary, the bio layer is a thin layer that takes I/O requests in the form of bio structures and passes them directly to the appropriate make_request_fn() function.
A block layer introduction part 1: the bio layer [LWN.net]

Linux通⽤块层提供给上层的接⼝函数是submit_bio。上层在构造好bio请求之后，调⽤submit_bio提交给Linux通⽤块层处理。
《存储技术原理分析》5.4 请求处理过程

当内核⽂件⼦系统需要与块设备进⾏数据传输或者对块设备发送控制命令时，内核需要向对应块设备所属的请求队列发送请求对象。这个任务由函数submit_bio来完成。
《深⼊Linux设备驱动程序内核机制》 11.13 向队列提交请求

3.2.2 数据结构

3.3 request layer 和 I/O调度层

3.3.1 简介

接收通⽤块层发出的I/O请求，缓存请求并试图合并相邻的请求，并根据设置好的调度算法，回调驱动层提供的请求处理函数，以处理具体的I/O请求。
《存储技术原理分析》5.1 概述

3.3.2 single queue 和 Multiple queue(blk-mq)

Traditionally, most storage devices were made up of a set of spinning circular platters with magnetic coating and a single head (or set of heads, one per platter) that moved along the radius of the spinning disk to read or change the magnetic polarization at any location on any platter. Such a device can only process a single request at a time, and has a substantial cost in moving from one location on the platters to another. The single-queue implementation started out aimed at driving this sort of device and, while it has broadened in scope over the years, its structure still reflects the needs of rotating storage devices.
Block layer introduction part 2: the request layer [LWN.net]

blk-mq (Multi-Queue Block IO Queueing Mechanism) is a new framework for the Linux block layer that was introduced with Linux Kernel 3.13, and which has become feature-complete with Kernel 3.16.[1] Blk-mq allows for over 15 million IOPS with high-performance flash devices (e.g. PCIe SSDs) on 8-socket servers, though even single and dual socket servers also benefit considerably from blk-mq.[2] To use a device with blk-mq, the device must support the respective driver.
Linux Multi-Queue Block IO Queueing Mechanism (blk-mq) Details - Thomas-Krenn-Wiki-en

3.3.3 数据结构： request_queue 和 request(请求描述符)

3.3.4 Request affinity

On large, multiprocessor systems, there can be a performance benefit to ensuring that all processing of a block I/O request happens on the same CPU. In particular, data associated with a given request is most likely to be found in the cache of the CPU which originated that request, so it makes sense to perform the request postprocessing on that same CPU.

设置方式
/sys/class/block//queue/rq_affinity

If it is set to a non-zero value, CPU affinity will be turned on for that device.

Block layer: solid-state storage, timeouts, affinity, and more [LWN.net]

3.3.5 I/O调度

3.3.6 请求处理的代码流程

3.4 块设备驱动层

3.4.1 数据结构：struct blk_mq_ops;

来自上层的request最终会通过具体存储设备驱动的queue_rq()下发到存储设备上，然后存储设备会进行处理，处理完成后，存储设备会产生一个中断通知CPU，CPU在中断处理程序中进行request的完成操作。

常见存储设备驱动的queue_rq()函数：

3.4.2 request处理超时

每个request下发给存储设备后，留给存储设备的处理时间是有限的，默认是30秒，可以通过/sys/class/block/<disk>/queue/io_timeout修改。

在queue_rq()实例函数(如scsi_queue_rq())中都会调用blk_mq_start_request()，blk_mq_start_request()内核会设置定时器

blk_mq_start_request();
    -> trace_block_rq_issue(q, rq);
    -> rq->io_start_time_ns = ktime_get_ns();
    -> blk_add_timer(rq);
        -> req->timeout = q->rq_timeout;
        -> expiry = jiffies + req->timeout;
        -> mod_timer(&q->timeout, expiry);

超时处理函数为blk_mq_timeout_work，超时时间默认为30秒，超时工作项处理函数为blk_mq_timeout_work

blk_mq_init_queue();   //申请request_queue
    -> blk_alloc_queue_node();
        -> timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
    -> blk_mq_init_allocated_queue();
        -> INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
        -> blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
            -> q->rq_timeout = timeout;

超时处理流程

具体的超时处理工作留给存储设备驱动来完成。

4 不同存储设备的request处理过程

4.1 SATA、SCSI 和 SAS类存储设备

4.1.1 请求下发的流程

scsi_queue_rq();
    -> blk_mq_start_request();
    -> scsi_dispatch_cmd();
        -> scsi_log_send(cmd);
            -> scmd_printk(..., "Send: scmd 0x%p\n", cmd);
            -> scsi_print_command();
        -> host->hostt->queuecommand();

4.1.2 存储设备处理完成，产生中断，CPU处理中断的流程

在硬件中断被引发时，中断回调函数将会被调⽤，如果是对SCSI命令的响应，则将找到对应的 scsi_cmnd描述符，低层设备驱动处理完这个请求后，调⽤保存在它⾥⾯的scsi_done函数。
《存储技术原理分析》5.6.1

在scsi_queue_rq()中，scsi_done被赋值为scsi_mq_done。

scsi_queue_rq();
    -> cmd->scsi_done = scsi_mq_done;

所以中断处理流程如下：

scsi_mq_done();
    -> trace_scsi_dispatch_cmd_done(cmd);
    -> blk_mq_complete_request();
        -> __blk_mq_complete_request(rq);
            -> WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
            -> __blk_complete_request();
                -> raise_softirq_irqoff(BLOCK_SOFTIRQ);   //发出软中断

软中断BLOCK_SOFTIRQ的处理函数是blk_done_softirq

blk_done_softirq();
    -> rq->q->mq_ops->complete(rq);
        -> scsi_softirq_done();
            -> scsi_decide_disposition(cmd);
            -> scsi_log_completion();
                -> scsi_print_result(cmd, "Done", disposition);
                -> scsi_print_command();
                -> scmd_printk(..., "scsi host busy %d failed %d\n", ...);
            -> case SUCCESS:  scsi_finish_command(cmd);
                -> SCSI_LOG_MLCOMPLETE( ... "Notifying upper driver of completion " ...);

4.1.3 实际的dmesg信息

打开SCSI的日志开关：echo 0xffff > /sys/module/scsi_mod/parameters/scsi_logging_level

当系统下有硬盘操作时，就会在dmesg信息里看到如下信息：

向硬盘发送命令时的dmesg信息：
sd 2:0:0:0: [sda] tag#25 Send: scmd 0x0000000049a58ebd
sd 2:0:0:0: [sda] tag#25 CDB: Write(10) 2a 00 0a 42 60 00 00 00 40 00
......
硬盘收到命令后，对命令进行处理，处理完成后产生中断通知CPU，下面是CPU处理中断时的dmesg信息：
sd 2:0:0:0: [sda] tag#25 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
sd 2:0:0:0: [sda] tag#25 CDB: Write(10) 2a 00 0a 42 60 00 00 00 40 00
sd 2:0:0:0: [sda] tag#25 scsi host busy 1 failed 0
sd 2:0:0:0: Notifying upper driver of completion (result 0)

dmesg信息简单说明

CDB: Command Descriptor Block

Write(10)是SCSI的命令，含义如下：
(更多SCSI命令，请看《SCSI Commands Reference Manual》，下载链接：https://www.seagate.com/staticfiles/support/disc/manuals/scsi/100293068a.pdf)

4.2 NVMe

4.2.1 简介

当前有很多种NVMe的实现方式，例如：

NVMe over PCIe
NVMe over RDMA
NVMe over TCP
NVMe over FC

《深⼊浅出SSD：固态存储核⼼技术、原理与实战》第2版，9.9 NVMe over Fabrics

下面以NVMe over PCIe为例，介绍request的处理流程

4.2.2 NVMe处理命令的⼋个步骤

第⼀步，主机写命令到内存中的SQ；
第⼆步，主机写SQ的DB，通知SSD取指；
第三步，SSD收到通知后，到SQ中取指；
第四步，SSD执⾏指令；
第五步，指令执⾏完成，SSD往CQ中写指令执⾏结果；
第六步，SSD发送中断通知主机指令完成；
第七步，收到中断，主机处理CQ，查看指令完成状态；
第⼋步，主机处理完CQ中指令执⾏结果，通过DB恢复SSD。

《深⼊浅出SSD：固态存储核⼼技术、原理与实战》第2版，9.2 NVMe综述

4.2.3 请求(命令)下发

nvme_queue_rq();
    -> nvme_setup_cmd();
        -> trace_nvme_setup_cmd();
    -> blk_mq_start_request();
    -> nvme_submit_cmd();
        -> memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes), cmd, sizeof(*cmd));    //第⼀步，主机写命令到内存中的SQ；
        -> nvme_write_sq_db();    //第⼆步，主机写SQ的DB，通知SSD取指

4.2.4 存储设备处理完成，产生中断，CPU处理中断的流程

nvme_irq();             //第七步，收到中断，主机处理CQ，查看指令完成状态；
    -> nvme_process_cq();
        -> nvme_ring_cq_doorbell();
    -> nvme_complete_cqes();
        -> nvme_handle_cqe();
            -> trace_nvme_sq();
            -> nvme_end_request();
                -> blk_mq_complete_request(req);