A User-Space Protocol Stack Built on DPDK
- Background
- NIC vs. DPDK
- Environment Setup
- Configuring a Static ARP Table on Windows
- Code Implementation
- Summary
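Background
Once DPDK takes over the NIC, everything it receives is raw frame data. To implement a protocol stack you have to parse incoming protocol packets and build outgoing ones yourself, and DPDK provides a rich set of APIs for this.
Take UDP as an example: a UDP packet consists of an Ethernet header, an IP header, and a UDP header, followed by the payload.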
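The Ethernet header is 14 bytes long. On the wire it holds, in this order: the destination MAC address (dmac, 6 bytes), the source MAC address (smac, 6 bytes), and the protocol type (2 bytes).
   6 bytes      6 bytes    2 bytes
+------------+------------+--------+
|    dmac    |    smac    |  type  |
+------------+------------+--------+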
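The layout above can be read straight out of a raw byte buffer. The following is a minimal plain-C sketch, independent of DPDK; the names `eth_hdr_view` and `parse_eth_hdr` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ADDR_LEN 6
#define ETH_HDR_LEN  14

/* Hypothetical plain-C view of an Ethernet header (not the DPDK struct). */
struct eth_hdr_view {
    uint8_t  dmac[ETH_ADDR_LEN]; /* destination MAC, bytes 0..5  */
    uint8_t  smac[ETH_ADDR_LEN]; /* source MAC,      bytes 6..11 */
    uint16_t ether_type;         /* bytes 12..13, converted to host order */
};

/* Parse the first 14 bytes of a frame; returns 0 on success, -1 if too short. */
static int parse_eth_hdr(const uint8_t *frame, size_t len, struct eth_hdr_view *out)
{
    assert(frame != NULL && out != NULL);
    if (len < ETH_HDR_LEN)
        return -1;
    memcpy(out->dmac, frame, ETH_ADDR_LEN);
    memcpy(out->smac, frame + ETH_ADDR_LEN, ETH_ADDR_LEN);
    /* The EtherType field is big-endian on the wire. */
    out->ether_type = (uint16_t)((frame[12] << 8) | frame[13]);
    return 0;
}
```

An `ether_type` of 0x0800 marks an IPv4 packet, which is the case the sample program in this article handles.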
The IP header carries the source IP, the destination IP, and other fields. Without options it is 20 bytes long.
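Putting the sizes together: 14 bytes of Ethernet header, 20 bytes of IPv4 header (no options), and 8 bytes of UDP header give 42 bytes in front of the payload. The sketch below, in plain C with hypothetical helper names, shows how the payload offset could be computed from a raw frame. It assumes an untagged Ethernet frame (no VLAN tag).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_SIZE  14u /* Ethernet header */
#define IPV4_HDR_SIZE 20u /* IPv4 header without options */
#define UDP_HDR_SIZE  8u  /* UDP header */

/* Total frame length for a UDP payload of payload_len bytes. */
static size_t udp_frame_len(size_t payload_len)
{
    return ETH_HDR_SIZE + IPV4_HDR_SIZE + UDP_HDR_SIZE + payload_len;
}

/* IPv4 header length in bytes, taken from the low nibble (IHL) of the first byte. */
static size_t ipv4_hdr_len(uint8_t version_ihl)
{
    return (size_t)(version_ihl & 0x0f) * 4;
}

/* Pointer to the UDP payload inside a frame, or NULL if the frame is too short. */
static const uint8_t *udp_payload(const uint8_t *frame, size_t frame_len)
{
    size_t ip_len, offset;

    assert(frame != NULL);
    if (frame_len < ETH_HDR_SIZE + IPV4_HDR_SIZE + UDP_HDR_SIZE)
        return NULL;
    ip_len = ipv4_hdr_len(frame[ETH_HDR_SIZE]);
    if (ip_len < IPV4_HDR_SIZE)
        return NULL;
    offset = ETH_HDR_SIZE + ip_len + UDP_HDR_SIZE;
    return (frame_len >= offset) ? frame + offset : NULL;
}
```

With the usual `version_ihl` value of 0x45 the payload starts at byte 42, which is where the constant 42 in the sample program's send path comes from.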
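NIC vs. DPDK
Compared with the conventional path, in which the kernel driver handles the NIC (called simply "NIC" below), DPDK can get far more of the network card's performance.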
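NIC | DPDK |
---|---|
No huge page support; mainly uses 4 KB pages | Supports huge pages (2 MB or 1 GB per page on x86_64) |
Interrupt-driven receive; suits light traffic | Polling-based receive; suits heavy traffic |
The CPU takes part in copying data | Zero-copy: the NIC writes packets by DMA directly into user-space memory |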
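In short, DPDK can drive the card at full performance where the NIC path cannot, for three reasons: the NIC path has no huge page support and works with 4 KB pages, it relies on an interrupt to trigger each receive, and it needs the CPU to copy the data during transfer.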
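To make the page-size argument concrete, the sketch below counts how many pages are needed to map a buffer at different page sizes. It is an illustration only: `pages_needed` is a hypothetical helper, and the 2 MB and 1 GB sizes are the huge page sizes commonly available on x86_64, assumed here rather than taken from the comparison above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K (4ULL * 1024)
#define PAGE_2M (2ULL * 1024 * 1024)
#define PAGE_1G (1024ULL * 1024 * 1024)

/* Number of pages of page_size bytes needed to cover `bytes`, rounded up. */
static uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}
```

For a 1 GB packet buffer area, `pages_needed(PAGE_1G, PAGE_4K)` is 262144 while `pages_needed(PAGE_1G, PAGE_2M)` is 512. Fewer pages means fewer address translations to keep cached, which is one reason huge pages help at high packet rates.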
Environment Setup
(1) Export the DPDK environment variables.
cd <dpdk path>
# e.g. dpdk/dpdk-stable-19.08.2/
# switch to root
sudo su
export RTE_SDK=<dpdk path>
export RTE_TARGET=x86_64-native-linux-gcc
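(2) Configure DPDK.
./usertools/dpdk-setup.sh
Then run the following menu options in order:
43 (load the DPDK UIO module, i.e. insert the driver)
44 (load the VFIO module, which is another driver)
45 (load the KNI module, which passes some traffic back to the kernel)
46 (set up huge pages so that pages do not have to be swapped frequently; enter 512)
47 (set up huge pages; 512 works here too)
49 (bind the NIC to DPDK; eth0 must be down first, so run sudo ifconfig eth0 down beforehand); for the PCI address, enter the one that belongs to eth0 (e.g. 0000:03:00.0)
60 (exit)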
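After the huge page steps it can be useful to confirm that the pages were actually reserved. On Linux, `/proc/meminfo` reports a `HugePages_Total` line. The following plain-C sketch reads it; the helper names `meminfo_value` and `hugepages_total` are hypothetical and not part of DPDK.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one /proc/meminfo line of the form "Key:   value".
 * Returns the value for `key`, or -1 if the line is for a different key. */
static long meminfo_value(const char *line, const char *key)
{
    size_t klen;
    long value;

    assert(line != NULL && key != NULL);
    klen = strlen(key);
    if (strncmp(line, key, klen) != 0 || line[klen] != ':')
        return -1;
    if (sscanf(line + klen + 1, "%ld", &value) != 1)
        return -1;
    return value;
}

/* Scan /proc/meminfo for HugePages_Total; returns -1 if it cannot be read. */
static long hugepages_total(void)
{
    FILE *fp = fopen("/proc/meminfo", "r");
    char line[256];
    long total = -1;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        long v = meminfo_value(line, "HugePages_Total");
        if (v >= 0) {
            total = v;
            break;
        }
    }
    fclose(fp);
    return total;
}
```

If the setup above succeeded with 512 pages, `hugepages_total()` would be expected to return 512 on that machine.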
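Configuring a Static ARP Table on Windows
Administrator privileges are required.
(1) List the ARP table to find the interface that the static entry will be added to
arp -a
Sample output follows (taken from a Chinese-locale Windows, where 接口 means interface, 动态 means dynamic, and 静态 means static). Note the interface index 0x13; later steps use it.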
接口: 192.168.2.130 --- 0x13
Internet 地址 物理地址 类型
192.168.0.20 00-17-16-07-b1-14 动态
192.168.0.25 00-00-74-f8-0f-65 动态
192.168.0.60 00-1e-67-6e-d4-c8 动态
192.168.0.62 00-15-5d-00-29-01 动态
192.168.0.80 00-00-5e-00-01-82 动态
192.168.0.116 04-d4-c4-8f-03-d7 动态
192.168.0.120 90-09-d0-0a-39-8b 动态
192.168.0.128 18-c0-4d-5e-30-05 动态
192.168.0.150 90-23-b4-b8-62-63 动态
192.168.0.152 b8-cb-29-b1-82-5b 动态
192.168.0.180 0c-c4-7a-79-21-8a 动态
192.168.2.42 30-5a-3a-5a-63-cd 动态
192.168.2.154 00-0e-c6-5c-39-34 动态
192.168.2.227 18-c0-4d-de-e8-9d 动态
192.168.3.111 30-b4-9e-76-e6-60 动态
192.168.3.166 2c-56-dc-dc-d5-45 动态
192.168.4.191 d8-5e-d3-20-7a-53 动态
192.168.5.0 18-c0-4d-9b-65-fb 动态
192.168.7.31 fc-aa-14-a2-e7-4a 动态
192.168.7.98 18-c0-4d-de-dd-be 动态
192.168.7.146 00-0c-29-39-a8-c4 动态
192.168.7.234 18-c0-4d-cc-b7-da 动态
192.168.7.248 d4-5d-64-d2-b7-23 动态
192.168.7.253 50-81-40-f3-ed-90 动态
192.168.8.1 70-8c-b6-ee-02-12 动态
192.168.8.11 00-11-04-01-19-4d 动态
192.168.8.17 00-11-04-01-01-c5 动态
192.168.11.12 d4-5d-64-3c-5c-fa 动态
192.168.11.21 e0-70-ea-f1-0b-77 动态
192.168.11.45 0c-9d-92-85-52-d4 动态
192.168.11.92 40-8d-5c-a8-08-00 动态
192.168.11.95 04-42-1a-eb-b5-00 动态
192.168.11.138 00-0e-c6-80-04-fa 动态
192.168.11.202 98-29-a6-65-c9-2c 动态
192.168.11.225 18-c0-4d-57-59-58 动态
192.168.16.124 18-c0-4d-50-1e-da 动态
192.168.17.140 d8-5e-d3-2a-56-78 动态
192.168.17.174 70-5a-0f-4d-c7-e8 动态
192.168.17.196 00-24-1d-9c-f2-15 动态
192.168.17.199 38-d5-47-1c-5c-fb 动态
192.168.20.188 e4-e7-49-ff-f0-9c 动态
192.168.255.255 ff-ff-ff-ff-ff-ff 静态
224.0.0.2 01-00-5e-00-00-02 静态
224.0.0.22 01-00-5e-00-00-16 静态
224.0.0.251 01-00-5e-00-00-fb 静态
224.0.0.252 01-00-5e-00-00-fc 静态
224.0.1.60 01-00-5e-00-01-3c 静态
224.0.6.151 01-00-5e-00-06-97 静态
224.100.100.100 01-00-5e-64-64-64 静态
224.200.200.200 01-00-5e-48-c8-c8 静态
229.111.112.12 01-00-5e-6f-70-0c 静态
233.233.233.233 01-00-5e-69-e9-e9 静态
234.200.200.200 01-00-5e-48-c8-c8 静态
239.102.144.50 01-00-5e-66-90-32 静态
239.192.152.143 01-00-5e-40-98-8f 静态
239.193.3.64 01-00-5e-41-03-40 静态
239.193.4.69 01-00-5e-41-04-45 静态
239.193.5.133 01-00-5e-41-05-85 静态
239.193.21.194 01-00-5e-41-15-c2 静态
239.193.21.222 01-00-5e-41-15-de 静态
239.193.21.241 01-00-5e-41-15-f1 静态
239.255.102.18 01-00-5e-7f-66-12 静态
239.255.255.250 01-00-5e-7f-ff-fa 静态
239.255.255.251 01-00-5e-7f-ff-fb 静态
239.255.255.253 01-00-5e-7f-ff-fd 静态
239.255.255.254 01-00-5e-7f-ff-fe 静态
(2) List the network adapters
netsh i i show in
Sample output:
Idx Met MTU 状态 名称
--- ---------- ---------- ------------ ---------------------------
1 75 4294967295 connected Loopback Pseudo-Interface 1
19 35 1500 connected 以太网 2
5 35 1500 connected VMware Network Adapter VMnet1
15 35 1500 connected VMware Network Adapter VMnet8
39 35 1500 connected VMware Network Adapter VMnet2
As the output shows, 0x13 = 19 from the previous step corresponds to the interface 以太网 2 (Ethernet 2).
(3) Add the static ARP entry
netsh -c i i add neighbors 19 192.168.7.199 38-d5-47-1c-5c-fb
Make sure the MAC address is correct.
(4) Check that the entry was added
arp -a
(5) To clear the static entries, run:
netsh i i delete neighbors <interface index>
# for example 19, the Idx value found in step (2)
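The two inputs that are easiest to get wrong in the steps above are the interface index (shown in hex by `arp -a` but in decimal by `netsh`) and the MAC address. As a rough illustration, the plain-C sketch below converts the index and validates a MAC written in the Windows `xx-xx-xx-xx-xx-xx` form. The function names are hypothetical and this is not part of any Windows tooling.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Convert an interface index such as "0x13" or "19" to an integer; -1 on error. */
static long parse_if_index(const char *s)
{
    char *end = NULL;
    long v;

    assert(s != NULL);
    errno = 0;
    v = strtol(s, &end, 0); /* base 0 accepts decimal and 0x-prefixed hex */
    if (errno != 0 || end == s || *end != '\0' || v < 0)
        return -1;
    return v;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse a MAC in "xx-xx-xx-xx-xx-xx" form; 0 on success, -1 if malformed. */
static int parse_mac(const char *text, uint8_t mac[6])
{
    int i;

    assert(text != NULL && mac != NULL);
    for (i = 0; i < 6; i++) {
        int hi = hex_digit(text[0]);
        int lo = (hi < 0) ? -1 : hex_digit(text[1]);
        if (lo < 0)
            return -1;
        mac[i] = (uint8_t)((hi << 4) | lo);
        text += 2;
        /* every byte except the last must be followed by '-' */
        if (i < 5 && *text++ != '-')
            return -1;
    }
    return (*text == '\0') ? 0 : -1;
}
```

Here `parse_if_index("0x13")` yields 19, the value passed to `netsh` in step (3).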
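Code Implementation
The sample code uses DPDK to implement UDP receive and send. It is an echo program: the data it sends is exactly the data it received.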
(dpdk_udp.c)
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_udp.h>
#include <rte_memcpy.h>

#include <stdio.h>
#include <arpa/inet.h>

#define ENABLE_SEND 1

#define NUM_MBUFS (4096-1)  // number of mbufs in the pool
#define BURST_SIZE 32       // max packets fetched per rx burst

int gDpdkPortId = 0; // DPDK port to use

static const struct rte_eth_conf port_conf_default = {
    .rxmode = {.max_rx_pkt_len = RTE_ETHER_MAX_LEN }
};

#if ENABLE_SEND
// addressing of the next outgoing packet, all kept in network byte order
static uint32_t gSrcIp;
static uint32_t gDstIp;
static uint16_t gSrcPort;
static uint16_t gDstPort; // UDP ports are 16 bits wide
static uint8_t gSrcMac[RTE_ETHER_ADDR_LEN];
static uint8_t gDstMac[RTE_ETHER_ADDR_LEN];
#endif
static void ng_init_port(struct rte_mempool *mbuf_pool) {
//1 count avail
uint16_t nb_sys_ports= rte_eth_dev_count_avail(); //
if (nb_sys_ports == 0) {
rte_exit(EXIT_FAILURE, "No Supported eth found\n");
}
//1
struct rte_eth_dev_info dev_info;
rte_eth_dev_info_get(gDpdkPortId, &dev_info); //
//1
const int num_rx_queues = 1;
const int num_tx_queues = 1;
struct rte_eth_conf port_conf = port_conf_default;
rte_eth_dev_configure(gDpdkPortId, num_rx_queues, num_tx_queues, &port_conf);
//1 rx queue setup
if (rte_eth_rx_queue_setup(gDpdkPortId, 0 , 1024, rte_eth_dev_socket_id(gDpdkPortId),NULL, mbuf_pool) < 0) {
rte_exit(EXIT_FAILURE, "Could not setup RX queue\n");
}
#if ENABLE_SEND
struct rte_eth_txconf txq_conf = dev_info.default_txconf;
txq_conf.offloads = port_conf.rxmode.offloads;
if (rte_eth_tx_queue_setup(gDpdkPortId, 0 , 1024,
rte_eth_dev_socket_id(gDpdkPortId), &txq_conf) < 0) {
rte_exit(EXIT_FAILURE, "Could not setup TX queue\n");
}
#endif
//1 start
if (rte_eth_dev_start(gDpdkPortId) < 0 ) {
rte_exit(EXIT_FAILURE, "Could not start\n");
}
rte_eth_promiscuous_enable( gDpdkPortId);
}
#if ENABLE_SEND

// Build the Ethernet, IPv4 and UDP headers in msg and copy the payload after them.
// total_len is the length of the whole frame (headers plus payload).
static int ng_encode_udp_pkt(uint8_t *msg, unsigned char *data, uint16_t total_len) {

    // 1. Ethernet header
    struct rte_ether_hdr *eth = (struct rte_ether_hdr *)msg;
    rte_memcpy(eth->s_addr.addr_bytes, gSrcMac, RTE_ETHER_ADDR_LEN);
    rte_memcpy(eth->d_addr.addr_bytes, gDstMac, RTE_ETHER_ADDR_LEN);
    eth->ether_type = htons(RTE_ETHER_TYPE_IPV4);

    // 2. IPv4 header
    struct rte_ipv4_hdr *ip = (struct rte_ipv4_hdr *)(msg + sizeof(struct rte_ether_hdr));
    ip->version_ihl = 0x45; // IPv4, header length 5 * 4 = 20 bytes
    ip->type_of_service = 0;
    ip->total_length = htons(total_len - sizeof(struct rte_ether_hdr));
    ip->packet_id = 0;
    ip->fragment_offset = 0;
    ip->time_to_live = 64; // ttl = 64
    ip->next_proto_id = IPPROTO_UDP;
    ip->src_addr = gSrcIp;
    ip->dst_addr = gDstIp;

    // the checksum field must be zero while the checksum is computed
    ip->hdr_checksum = 0;
    ip->hdr_checksum = rte_ipv4_cksum(ip);

    // 3. UDP header; dgram_len covers the UDP header plus the payload
    struct rte_udp_hdr *udp = (struct rte_udp_hdr *)(msg + sizeof(struct rte_ether_hdr) + sizeof(struct rte_ipv4_hdr));
    udp->src_port = gSrcPort;
    udp->dst_port = gDstPort;
    uint16_t udplen = total_len - sizeof(struct rte_ether_hdr) - sizeof(struct rte_ipv4_hdr);
    udp->dgram_len = htons(udplen);

    // copy only the payload, i.e. udplen minus the 8-byte UDP header
    rte_memcpy((uint8_t*)(udp+1), data, udplen - sizeof(struct rte_udp_hdr));

    udp->dgram_cksum = 0;
    udp->dgram_cksum = rte_ipv4_udptcp_cksum(ip, udp);

    struct in_addr addr;
    addr.s_addr = gSrcIp;
    printf(" --> src: %s:%d, ", inet_ntoa(addr), ntohs(gSrcPort));

    addr.s_addr = gDstIp;
    printf("dst: %s:%d\n", inet_ntoa(addr), ntohs(gDstPort));

    return 0;
}

// Allocate an mbuf from the pool and fill it with a UDP packet
// that carries `length` bytes of payload taken from data.
static struct rte_mbuf * ng_send(struct rte_mempool *mbuf_pool, uint8_t *data, uint16_t length) {

    // 42 = 14 (Ethernet) + 20 (IPv4) + 8 (UDP) header bytes
    const unsigned total_len = length + 42;

    struct rte_mbuf *mbuf = rte_pktmbuf_alloc(mbuf_pool);
    if (!mbuf) {
        rte_exit(EXIT_FAILURE, "rte_pktmbuf_alloc\n");
    }
    mbuf->pkt_len = total_len;
    mbuf->data_len = total_len;

    uint8_t *pktdata = rte_pktmbuf_mtod(mbuf, uint8_t*);
    ng_encode_udp_pkt(pktdata, data, total_len);

    return mbuf;
}

#endif
int main(int argc, char *argv[]) {

    if (rte_eal_init(argc, argv) < 0) {
        rte_exit(EXIT_FAILURE, "Error with EAL init\n");
    }

    struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create("mbuf pool", NUM_MBUFS,
        0, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
    if (mbuf_pool == NULL) {
        rte_exit(EXIT_FAILURE, "Could not create mbuf pool\n");
    }

    ng_init_port(mbuf_pool);

    while (1) {

        // poll RX queue 0; at most BURST_SIZE packets are returned per call
        struct rte_mbuf *mbufs[BURST_SIZE];
        unsigned num_recvd = rte_eth_rx_burst(gDpdkPortId, 0, mbufs, BURST_SIZE);
        if (num_recvd > BURST_SIZE) {
            rte_exit(EXIT_FAILURE, "Error receiving from eth\n");
        }

        unsigned i = 0;
        for (i = 0;i < num_recvd;i ++) {

            struct rte_ether_hdr *ehdr = rte_pktmbuf_mtod(mbufs[i], struct rte_ether_hdr*);
            if (ehdr->ether_type != rte_cpu_to_be_16(RTE_ETHER_TYPE_IPV4)) {
                rte_pktmbuf_free(mbufs[i]); // not IPv4: drop it, but return the mbuf
                continue;
            }

            struct rte_ipv4_hdr *iphdr = rte_pktmbuf_mtod_offset(mbufs[i], struct rte_ipv4_hdr *,
                sizeof(struct rte_ether_hdr));

            if (iphdr->next_proto_id == IPPROTO_UDP) {

                struct rte_udp_hdr *udphdr = (struct rte_udp_hdr *)(iphdr + 1);

#if ENABLE_SEND // echo: the reply swaps source and destination
                // mac exchange
                rte_memcpy(gDstMac, ehdr->s_addr.addr_bytes, RTE_ETHER_ADDR_LEN);
                rte_memcpy(gSrcMac, ehdr->d_addr.addr_bytes, RTE_ETHER_ADDR_LEN);

                // ip exchange
                rte_memcpy(&gSrcIp, &iphdr->dst_addr, sizeof(uint32_t));
                rte_memcpy(&gDstIp, &iphdr->src_addr, sizeof(uint32_t));

                // port exchange
                rte_memcpy(&gSrcPort, &udphdr->dst_port, sizeof(uint16_t));
                rte_memcpy(&gDstPort, &udphdr->src_port, sizeof(uint16_t));
#endif

                // dgram_len = UDP header + payload; terminate the payload so it prints as a string
                uint16_t length = ntohs(udphdr->dgram_len);
                *((char*)udphdr + length) = '\0';

                struct in_addr addr;
                addr.s_addr = iphdr->src_addr;
                printf("src: %s:%d, ", inet_ntoa(addr), ntohs(udphdr->src_port));

                addr.s_addr = iphdr->dst_addr;
                printf("dst: %s:%d, %s\n", inet_ntoa(addr), ntohs(udphdr->dst_port),
                    (char *)(udphdr+1));

#if ENABLE_SEND
                // pass the payload length only (dgram_len minus the UDP header)
                struct rte_mbuf *txbuf = ng_send(mbuf_pool, (unsigned char*)(udphdr+1),
                    length - sizeof(struct rte_udp_hdr));
                if (rte_eth_tx_burst(gDpdkPortId, 0, &txbuf, 1) == 0) {
                    rte_pktmbuf_free(txbuf); // not queued, so free it ourselves
                }
#endif
            }

            // every received mbuf goes back to the pool, UDP or not
            rte_pktmbuf_free(mbufs[i]);
        }
    }
}
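The sample relies on `rte_ipv4_cksum` to fill in the IP header checksum. Conceptually this is the standard Internet checksum (RFC 1071): sum the header as 16-bit big-endian words with the checksum field zeroed, fold the carries back in, and take the one's complement. The sketch below is a plain-C illustration of that arithmetic under a hypothetical name, `inet_checksum`; it is not DPDK's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071) over len bytes.
 * Returns the checksum as a number whose high byte goes first on the wire. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL);
    /* add up 16-bit big-endian words */
    while (len > 1) {
        sum += (uint32_t)((data[0] << 8) | data[1]);
        data += 2;
        len -= 2;
    }
    /* an odd trailing byte is padded with a zero low byte */
    if (len == 1)
        sum += (uint32_t)data[0] << 8;
    /* fold the carries back into the low 16 bits */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

To store the result in a raw header, write `csum >> 8` to byte 10 and `csum & 0xff` to byte 11. Running the same function over a header that already contains a correct checksum yields 0, which is how a receiver can verify it.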
Makefile:
# binary name
APP = dpdk_udp

# all source are stored in SRCS-y
SRCS-y := dpdk_udp.c

# Build using pkg-config variables if possible
ifeq ($(shell pkg-config --exists libdpdk && echo 0),0)

all: shared
.PHONY: shared static
shared: build/$(APP)-shared
	ln -sf $(APP)-shared build/$(APP)
static: build/$(APP)-static
	ln -sf $(APP)-static build/$(APP)

PKGCONF=pkg-config --define-prefix

PC_FILE := $(shell $(PKGCONF) --path libdpdk)
CFLAGS += -O3 $(shell $(PKGCONF) --cflags libdpdk)
LDFLAGS_SHARED = $(shell $(PKGCONF) --libs libdpdk)
LDFLAGS_STATIC = -Wl,-Bstatic $(shell $(PKGCONF) --static --libs libdpdk)

build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)

build/$(APP)-static: $(SRCS-y) Makefile $(PC_FILE) | build
	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_STATIC)

build:
	@mkdir -p $@

.PHONY: clean
clean:
	rm -f build/$(APP) build/$(APP)-static build/$(APP)-shared
	test -d build && rmdir -p build || true

else

ifeq ($(RTE_SDK),)
$(error "Please define RTE_SDK environment variable")
endif

# Default target, detect a build directory, by looking for a path with a .config
RTE_TARGET ?= $(notdir $(abspath $(dir $(firstword $(wildcard $(RTE_SDK)/*/.config)))))

include $(RTE_SDK)/mk/rte.vars.mk

CFLAGS += -O3

include $(RTE_SDK)/mk/rte.extapp.mk

endif
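Summary
- DPDK does not guarantee that data is not lost. It receives whatever the NIC delivers and sends whatever the application assembles. It cannot prevent packet loss along the way (problems in links, routers, and so on); whether lost data is detected and recovered is up to the protocol.
- On memory utilization: DPDK's memory pool is designed mainly for high I/O throughput. It trades memory for packet-processing performance in order to reach high throughput. DPDK exists to get the most out of the network card.
- A memory pool design should generally come with at least two candidate schemes, because some situations force trade-offs. This is a design principle in itself: think the intended use through thoroughly.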
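The memory-for-throughput trade-off can be illustrated with a very small fixed-size buffer pool: all memory is reserved up front, whether or not it is ever used, so that taking and returning a buffer on the hot path is a constant-time stack operation with no call into the allocator. The sketch below is a hypothetical, single-threaded plain-C illustration; it is not how DPDK's `rte_mempool` is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size buffer pool: memory is reserved once and
 * buffers are recycled through a stack of free pointers. */
struct buf_pool {
    unsigned char *storage; /* one contiguous allocation for all buffers */
    void **free_list;       /* stack of currently free buffers */
    size_t free_count;      /* number of entries on the stack */
    size_t capacity;        /* total number of buffers */
    size_t buf_size;        /* size of each buffer in bytes */
};

/* Reserve count buffers of buf_size bytes; 0 on success, -1 on failure. */
static int buf_pool_init(struct buf_pool *p, size_t count, size_t buf_size)
{
    size_t i;

    assert(p != NULL && count > 0 && buf_size > 0);
    p->storage = malloc(count * buf_size);
    p->free_list = malloc(count * sizeof *p->free_list);
    if (p->storage == NULL || p->free_list == NULL) {
        free(p->storage);
        free(p->free_list);
        return -1;
    }
    for (i = 0; i < count; i++)
        p->free_list[i] = p->storage + i * buf_size;
    p->free_count = count;
    p->capacity = count;
    p->buf_size = buf_size;
    return 0;
}

/* Take a buffer from the pool, or NULL if the pool is exhausted. */
static void *buf_pool_get(struct buf_pool *p)
{
    assert(p != NULL);
    return p->free_count > 0 ? p->free_list[--p->free_count] : NULL;
}

/* Return a buffer that was obtained from this pool. */
static void buf_pool_put(struct buf_pool *p, void *buf)
{
    assert(p != NULL && buf != NULL && p->free_count < p->capacity);
    p->free_list[p->free_count++] = buf;
}

/* Release all memory held by the pool. */
static void buf_pool_destroy(struct buf_pool *p)
{
    assert(p != NULL);
    free(p->free_list);
    free(p->storage);
}
```

A pool like this wastes memory when traffic is light and simply runs out when traffic exceeds its capacity, which is the kind of trade-off the last point above asks a designer to weigh against alternatives.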