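Introduction to RDMA
RDMA (Remote Direct Memory Access) lets a local node access the memory of a remote node "directly". "Directly" means that remote memory can be read and written much like local memory, bypassing the complex TCP/IP protocol stack of traditional Ethernet networking. The remote side is not aware of the access, and most of the work is done by hardware rather than software.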
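RDMA itself is a technology. At the protocol level it covers InfiniBand (IB), RDMA over Converged Ethernet (RoCE), and the Internet Wide Area RDMA Protocol (iWARP). All three conform to the RDMA standard and expose the same upper-layer interface, with some differences at the individual layers. RoCE generally costs less to deploy than IB and is commonly considered to perform better than iWARP.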
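In practice, RDMA relies on the NIC to do most of the work, so it normally requires a smart NIC that supports an RDMA protocol in hardware. Fortunately there is Soft-RoCE, which replaces that hardware with software: it carries IB transport-layer packets inside ordinary UDP packets, so an ordinary NIC can send RoCE packets too. This gives us a very low-cost way to learn the IB transport-layer protocol and to write and debug Verbs-based RDMA programs. The following sections describe how to install Soft-RoCE.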
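To make the "IB transport packet inside a UDP packet" idea more concrete, here is a minimal sketch in C that decodes the 12-byte Base Transport Header (BTH) found at the start of the UDP payload of a RoCEv2 packet (UDP destination port 4791). The struct `bth` and the function `bth_parse` are hypothetical names chosen for illustration, and the field layout follows my reading of the InfiniBand specification; treat it as a learning aid for reading packet captures, not as a reference parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UDP destination port used by RoCEv2 traffic. */
#define ROCE_V2_UDP_PORT 4791
/* The Base Transport Header (BTH) is 12 bytes long. */
#define BTH_LEN 12

/* Decoded view of the BTH (hypothetical helper type for this sketch). */
struct bth {
    uint8_t  opcode;    /* e.g. 0x04 = RC SEND Only */
    uint8_t  solicited; /* SE bit */
    uint8_t  pad_count; /* number of pad bytes at the end of the payload */
    uint16_t pkey;      /* partition key */
    uint32_t dest_qp;   /* 24-bit destination QP number */
    uint8_t  ack_req;   /* A bit: the responder should send an ACK */
    uint32_t psn;       /* 24-bit packet sequence number */
};

/* Parse the first 12 bytes of a UDP payload as a BTH.
 * Returns 0 on success, -1 if the payload is too short. */
int bth_parse(const uint8_t *payload, size_t len, struct bth *out)
{
    assert(payload != NULL && out != NULL);
    if (len < BTH_LEN)
        return -1;

    out->opcode    = payload[0];
    out->solicited = (payload[1] >> 7) & 0x1;   /* bit 7 of byte 1 */
    out->pad_count = (payload[1] >> 4) & 0x3;   /* bits 5..4 of byte 1 */
    out->pkey      = (uint16_t)((payload[2] << 8) | payload[3]);
    /* Byte 4 is reserved; bytes 5..7 hold the destination QP. */
    out->dest_qp   = ((uint32_t)payload[5] << 16) |
                     ((uint32_t)payload[6] << 8)  |
                      (uint32_t)payload[7];
    out->ack_req   = (payload[8] >> 7) & 0x1;   /* bit 7 of byte 8 */
    out->psn       = ((uint32_t)payload[9] << 16)  |
                     ((uint32_t)payload[10] << 8)  |
                      (uint32_t)payload[11];
    return 0;
}
```

A caller would hand `bth_parse` the UDP payload of a captured packet whose destination port equals `ROCE_V2_UDP_PORT`, then inspect `opcode`, `dest_qp`, and `psn` to follow a connection.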
Soft-RoCE compares with RoCE backed by NIC hardware as follows:
Now let's try out RoCE on Linux.
Installing Soft-RoCE
Install the required components with apt-get
$ sudo apt-get install libibverbs1 ibverbs-utils librdmacm1 libibumad3 ibverbs-providers rdma-core
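Package | Main function |
libibverbs1 | ibverbs shared library |
ibverbs-utils | ibverbs example programs |
librdmacm1 | rdmacm shared library |
libibumad3 | ibumad shared library |
ibverbs-providers | Vendor user-space drivers for ibverbs (including RXE) |
rdma-core | Documentation and user-space configuration files |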
Load the driver
$ sudo modprobe rdma_rxe
Create a logical NIC that supports the RDMA protocol
$ sudo rdma link add rxe0 type rxe netdev enp0s3
Inspect the RXE logical interface that was created
$ ibv_devices
device node GUID
------ ----------------
rxe0 0a0027fffe5ac323
$ ibv_devinfo -d rxe0
hca_id: rxe0
transport: InfiniBand (0)
fw_ver: 0.0.0
node_guid: 0a00:27ff:fe5a:c323
sys_image_guid: 0a00:27ff:fe5a:c323
vendor_id: 0xffffff
vendor_part_id: 0
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
$ sudo rdma link show
link rxe0/1 state ACTIVE physical_state LINK_UP netdev enp0s3
$ sudo ibstat
CA 'rxe0'
CA type:
Number of ports: 1
Firmware version:
Hardware version:
Node GUID: 0x0a0027fffe5ac323
System image GUID: 0x0a0027fffe5ac323
Port 1:
State: Active
Physical state: LinkUp
Rate: 2.5
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0a0027fffe5ac323
Link layer: Ethernet
Testing RDMA connectivity
Server side
$ sudo rping -s -a 192.168.31.79 -v -C 10
[sudo] password for wq:
server ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
server ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
server ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
server ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
server ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
server ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
server ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
server ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
server ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
server ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
server DISCONNECT EVENT...
wait for RDMA_READ_ADV state 10
Client side
$ rping -c -a 192.168.31.79 -v -C 10
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
client DISCONNECT EVENT...
A Wireshark packet capture confirms the result, as shown below
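RDMA Programming User Manual
CSDN download link for the manual
The APIs covered in the RDMA programming user manual fall into these groups: initialization, device operations, Verbs context operations, protection domain operations, Queue Pair bringup, active Queue Pair operations, event handling operations, and experimental APIs.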
Writing a test program
Install the user-space development library
$ sudo apt-get install libibverbs-dev
Pull the test example from git
$ git clone https://gitee.com/wq897/RDMA-EXAMPLE.git
$ cd RDMA-EXAMPLE/01/ && make
Run the server program
$ sudo ./service
------------------------------------------------
Device name : "(null)"
IB port : 1
TCP port : 19875
------------------------------------------------
waiting on port 19875 for TCP connection
TCP connection was established
searching for IB devices in host
found 1 device(s)
device not specified, using first one found: rxe0
going to send the message: 'SEND operation '
MR was registered with addr=0x55d57ea7c340, lkey=0x49b, rkey=0x49b, flags=0x7
QP was created, QP number=0x13
Local LID = 0x0
Remote address = 0x558d59dbb340
Remote rkey = 0x43d
Remote QP number = 0x13
Remote LID = 0x0
failed to modify QP state to RTR
failed to modify QP state to RTR
failed to connect QPs
test result is 1
Run the client program
$ sudo ./service 192.168.3.79
servername=192.168.3.79
------------------------------------------------
Device name : "(null)"
IB port : 1
IP : 192.168.3.79
TCP port : 19875
------------------------------------------------
TCP connection was established
searching for IB devices in host
found 1 device(s)
device not specified, using first one found: rxe0
MR was registered with addr=0x558d59dbb340, lkey=0x43d, rkey=0x43d, flags=0x7
QP was created, QP number=0x13
Local LID = 0x0
Remote address = 0x55d57ea7c340
Remote rkey = 0x49b
Remote QP number = 0x13
Remote LID = 0x0
Receive Request was posted
failed to modify QP state to RTR
failed to modify QP state to RTR
failed to connect QPs
test result is 1
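Both runs above print the error "failed to modify QP state to RTR". Debugging shows that ibv_modify_qp() returns error code 22 (EINVAL, invalid argument), which means the struct ibv_qp_attr attr passed to ibv_modify_qp() is invalid.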