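The previous article described how to run the benchmark, but that run was over TCP. This time the goal is a result over RoCEv2. The VMs have no InfiniBand/RDMA-capable NIC, so Soft-RoCE it is.
Installing and debugging Soft-RoCE
- System version information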
admin@osu-1:~$ uname -a
Linux osu-1 5.11.0-44-generic #48~20.04.2-Ubuntu SMP Tue Dec 14 15:36:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Install rdma-core and the verbs utilities
admin@osu-1:~$ sudo apt install rdma-core ibverbs-utils -y
- Add an RDMA link (Soft-RoCE, type rxe) on top of the existing interface ens8 and name it ib5
admin@osu-1:~$ sudo rdma link add ib5 type rxe netdev ens8
admin@osu-1:~$ rdma link show
link ib5/1 state ACTIVE physical_state LINK_UP netdev ens8
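The check above can also be scripted by parsing the `rdma link show` output and testing for the ACTIVE state. The following is a minimal sketch, not part of the original setup: the function name rxe_link_active is hypothetical, and it assumes the field layout of the output line shown above.

```shell
# Succeeds (exit 0) if the named RDMA link is reported as ACTIVE.
# Reads `rdma link show` output on stdin. Assumed line layout:
#   link <name>/<port> state <STATE> physical_state <PSTATE> netdev <ifname>
rxe_link_active() {
    awk -v dev="$1" '
        $1 == "link" && index($2, dev "/") == 1 && $3 == "state" && $4 == "ACTIVE" { ok = 1 }
        END { exit !ok }
    '
}
```

Usage on the VM would look like: rdma link show | rxe_link_active ib5 && echo "ib5 is up"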
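Installing and debugging MPI
- There are many MPI implementations to choose from: openmpi/mpich/mvapich
- After various tests and setbacks I settled on mvapich2; after all, it comes from the same group as the OSU Micro Benchmark
- Install the software needed for the build beforehand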
admin@osu-1:~$ sudo apt install byacc -y
- Fetch the source
admin@osu-1:~$ wget http://mvapich.cse.ohio-state.edu/download/mvapich/mv2/mvapich2-2.3.7-1.tar.gz
- Extract it and enter the directory
admin@osu-1:~$ tar zxvf mvapich2-2.3.7-1.tar.gz
admin@osu-1:~$ cd mvapich2-2.3.7-1/
admin@osu-1:~/mvapich2-2.3.7-1$
- When running configure, pay attention to the options it needs
admin@osu-1:~/mvapich2-2.3.7-1$ ./configure --with-device=ch3:mrail --with-rdma=gen2
- Then build and install
admin@osu-1:~/mvapich2-2.3.7-1$ make -j$(nproc)
admin@osu-1:~/mvapich2-2.3.7-1$ sudo make install
- The benchmarks have been built along with it
admin@osu-1:~/mvapich2-2.3.7-1$ cd osu_benchmarks/mpi/pt2pt/
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$ ls -lt
total 320
-rwxrwxr-x 1 admin admin 6332 11月 17 10:40 osu_multi_lat
-rwxrwxr-x 1 admin admin 6342 11月 17 10:40 osu_latency_mt
-rwxrwxr-x 1 admin admin 6312 11月 17 10:40 osu_latency
-rwxrwxr-x 1 admin admin 6262 11月 17 10:40 osu_bw
-rwxrwxr-x 1 admin admin 6342 11月 17 10:40 osu_latency_mp
-rwxrwxr-x 1 admin admin 6302 11月 17 10:40 osu_mbw_mr
-rwxrwxr-x 1 admin admin 6282 11月 17 10:40 osu_bibw
-rw-rw-r-- 1 admin admin 11904 11月 17 10:40 osu_bibw.o
-rw-rw-r-- 1 admin admin 18072 11月 17 10:40 osu_mbw_mr.o
-rw-rw-r-- 1 admin admin 16872 11月 17 10:40 osu_latency_mt.o
-rw-rw-r-- 1 admin admin 11456 11月 17 10:40 osu_bw.o
-rw-rw-r-- 1 admin admin 10976 11月 17 10:40 osu_latency_mp.o
-rw-rw-r-- 1 admin admin 9688 11月 17 10:40 osu_latency.o
-rw-rw-r-- 1 admin admin 9872 11月 17 10:40 osu_multi_lat.o
-rw-rw-r-- 1 admin admin 28374 11月 17 10:23 Makefile
-rw-r--r-- 1 admin admin 28795 5月 24 01:46 Makefile.in
-rw-r--r-- 1 admin admin 1446 5月 17 2022 Makefile.am
-rw-r--r-- 1 admin admin 13925 5月 17 2022 osu_bibw.c
-rw-r--r-- 1 admin admin 13046 5月 17 2022 osu_bw.c
-rw-r--r-- 1 admin admin 9926 5月 17 2022 osu_latency.c
-rw-r--r-- 1 admin admin 7763 5月 17 2022 osu_latency_mp.c
-rw-r--r-- 1 admin admin 12654 5月 17 2022 osu_latency_mt.c
-rw-r--r-- 1 admin admin 19056 5月 17 2022 osu_mbw_mr.c
-rw-r--r-- 1 admin admin 10070 5月 17 2022 osu_multi_lat.c
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$
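- Make sure the MPI install path is on PATH (in the output below /usr/local/bin is already listed, so the final append is redundant here)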
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$ which mpirun
/usr/local/bin/mpirun
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$ PATH=$PATH:/usr/local/bin
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$
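Since /usr/local/bin is already on PATH in the output above, a blind append only duplicates the entry. A small helper that appends a directory only when it is missing avoids that. This is a sketch; the name path_append is hypothetical and not part of MVAPICH2 or the original steps.

```shell
# Append a directory to PATH only if it is not already listed.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$PATH:$1" ;;       # not present, append at the end
    esac
}
```

Usage: path_append /usr/local/bin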
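Running
- Configure the other VM identically, including the same install path; the two VMs can ping each other at 5.5.5.3 and 5.5.5.4
- Run osu_latency; it prints some WARNINGs at the start, which can be ignored for now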
admin@osu-1:~/mvapich2-2.3.7-1/osu_benchmarks/mpi/pt2pt$ mpirun_rsh -np 2 5.5.5.3 5.5.5.4 MV2_USE_RoCE=1 MV2_IBA_HCA=ib5 ./osu_latency
[osu-1:mpi_rank_0][rdma_find_network_type] Unable to find the numa process is bound to. Disabling process placement aware hca mapping.
[osu-1:mpi_rank_0][mv2_get_hca_type] **********************WARNING***********************
[osu-1:mpi_rank_0][mv2_get_hca_type] Failed to automatically detect the HCA architecture.
[osu-1:mpi_rank_0][mv2_get_hca_type] This may lead to subpar communication performance.
[osu-1:mpi_rank_0][mv2_get_hca_type] ****************************************************
[osu-1:mpi_rank_0][mv2_get_hca_type] **********************WARNING***********************
[osu-1:mpi_rank_0][mv2_get_hca_type] Failed to automatically detect the HCA architecture.
[osu-1:mpi_rank_0][mv2_get_hca_type] This may lead to subpar communication performance.
[osu-1:mpi_rank_0][mv2_get_hca_type] ****************************************************
[osu-1:mpi_rank_0][mv2_get_hca_type] **********************WARNING***********************
[osu-1:mpi_rank_0][mv2_get_hca_type] Failed to automatically detect the HCA architecture.
[osu-1:mpi_rank_0][mv2_get_hca_type] This may lead to subpar communication performance.
[osu-1:mpi_rank_0][mv2_get_hca_type] ****************************************************
[osu-1:mpi_rank_0][rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[osu-1:mpi_rank_0][mv2_new_get_hca_type] **********************WARNING***********************
[osu-1:mpi_rank_0][mv2_new_get_hca_type] Failed to automatically detect the HCA architecture.
[osu-1:mpi_rank_0][mv2_new_get_hca_type] This may lead to subpar communication performance.
[osu-1:mpi_rank_0][mv2_new_get_hca_type] ****************************************************
[osu-2:mpi_rank_1][rdma_find_network_type] Unable to find the numa process is bound to. Disabling process placement aware hca mapping.
[osu-2:mpi_rank_1][rdma_open_hca] Unknown HCA type: this build of MVAPICH2 does not fully support the HCA found on the system (try with other build options)
[osu-1:mpi_rank_0][rdma_param_handle_heterogeneity] All nodes involved in the job were detected to be homogeneous in terms of processors and interconnects. Setting MV2_HOMOGENEOUS_CLUSTER=1 can improve job startup performance on such systems. The following link has more details on enhancing job startup performance. http://mvapich.cse.ohio-state.edu/performance/job-startup/.
[osu-1:mpi_rank_0][rdma_param_handle_heterogeneity] To suppress this warning, please set MV2_SUPPRESS_JOB_STARTUP_PERFORMANCE_WARNING to 1
# OSU MPI Latency Test v5.9
# Size Latency (us)
0 139.61
1 144.72
2 141.35
4 140.04
8 139.94
16 140.42
32 139.10
64 137.50
128 142.40
256 143.07
512 140.62
1024 143.64
2048 175.03
4096 222.74
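In the output above the latency stays around 140 us up to 1024 bytes and rises at 2048 and 4096 bytes. To compare runs without reading through the warnings, the result table can be filtered for a single message size. This is a minimal sketch assuming the two-column output format shown above; the function name osu_latency_for_size is hypothetical.

```shell
# Print the latency (in us) reported for one message size.
# Reads osu_latency output on stdin. Header lines (starting with '#')
# and MVAPICH2 warning lines (starting with '[') are skipped.
osu_latency_for_size() {
    awk -v size="$1" '
        /^#/ || /^\[/ { next }
        NF == 2 && $1 == size { print $2; exit }
    '
}
```

Usage: mpirun_rsh -np 2 5.5.5.3 5.5.5.4 MV2_USE_RoCE=1 MV2_IBA_HCA=ib5 ./osu_latency | osu_latency_for_size 4096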
- Meanwhile, running tcpdump on ens8 on the other VM captures packets with UDP destination port 4791, which is the RoCEv2 port
10:51:43.782588 52:54:00:28:f8:36 > 52:54:00:3c:a8:a3, ethertype IPv4 (0x0800), length 222: 5.5.5.4.63843 > 5.5.5.3.4791: UDP, length 180
10:51:43.782725 52:54:00:3c:a8:a3 > 52:54:00:28:f8:36, ethertype IPv4 (0x0800), length 62: 5.5.5.3.63843 > 5.5.5.4.4791: UDP, length 20
10:51:43.782857 52:54:00:3c:a8:a3 > 52:54:00:28:f8:36, ethertype IPv4 (0x0800), length 222: 5.5.5.3.63843 > 5.5.5.4.4791: UDP, length 180
10:51:43.782865 52:54:00:28:f8:36 > 52:54:00:3c:a8:a3, ethertype IPv4 (0x0800), length 62: 5.5.5.4.63843 > 5.5.5.3.4791: UDP, length 20
10:51:43.782885 52:54:00:28:f8:36 > 52:54:00:3c:a8:a3, ethertype IPv4 (0x0800), length 222: 5.5.5.4.63843 > 5.5.5.3.4791: UDP, length 180
10:51:43.783040 52:54:00:3c:a8:a3 > 52:54:00:28:f8:36, ethertype IPv4 (0x0800), length 62: 5.5.5.3.63843 > 5.5.5.4.4791: UDP, length 20
10:51:43.783146 52:54:00:3c:a8:a3 > 52:54:00:28:f8:36, ethertype IPv4 (0x0800), length 222: 5.5.5.3.63843 > 5.5.5.4.4791: UDP, length 180
10:51:43.783154 52:54:00:28:f8:36 > 52:54:00:3c:a8:a3, ethertype IPv4 (0x0800), length 62: 5.5.5.4.63843 > 5.5.5.3.4791: UDP, length 20
10:51:43.783173 52:54:00:28:f8:36 > 52:54:00:3c:a8:a3, ethertype IPv4 (0x0800), length 222: 5.5.5.4.63843 > 5.5.5.3.4791: UDP, length 180
10:51:43.783312 52:54:00:3c:a8:a3 > 52:54:00:28:f8:36, ethertype IPv4 (0x0800), length 62: 5.5.5.3.63843 > 5.5.5.4.4791: UDP, length 20
10:51:43.783423 52:54:00:3c:a8:a3 > 52:54:00:28:f8:36, ethertype IPv4 (0x0800), length 222: 5.5.5.3.63843 > 5.5.5.4.4791: UDP, length 180
10:51:43.783431 52:54:00:28:f8:36 > 52:54:00:3c:a8:a3, ethertype IPv4 (0x0800), length 62: 5.5.5.4.63843 > 5.5.5.3.4791: UDP, length 20
10:51:43.783451 52:54:00:28:f8:36 > 52:54:00:3c:a8:a3, ethertype IPv4 (0x0800), length 222: 5.5.5.4.63843 > 5.5.5.3.4791: UDP, length 180
- If the packets are written to a file and opened in wireshark, they are decoded as RoCEv2 RC (Reliable Connection) packets
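To confirm from the text capture alone that the traffic is RoCEv2, the lines with UDP destination port 4791 can be counted. The original does not show the exact tcpdump command; something like sudo tcpdump -i ens8 -e -nn udp port 4791 is one plausible way to get output in this format. The sketch below assumes that line format, and the function name count_rocev2 is hypothetical.

```shell
# Count captured packets whose UDP destination port is 4791 (RoCEv2).
# Reads tcpdump text output on stdin. Assumed line format:
#   ... length 222: SRC_IP.SPORT > DST_IP.4791: UDP, length 180
count_rocev2() {
    awk '/ > [0-9.]+\.4791: UDP/ { n++ } END { print n + 0 }'
}
```

Usage: sudo tcpdump -i ens8 -e -nn -c 100 udp | count_rocev2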