Problem:
Running a parallel program with mpirun -np on a supercomputing platform or HPC cluster produces the warning "no active ports detected".
The exact commands used were:
Participant2="Solid"
Solver2="linear_elasticity"
nprocS=4 # jie notes:24
# Run
echo " Starting the ${Participant2} participant with np=${nprocS} in parallel..."
/usr/bin/time mpirun -np ${nprocS} ./${Solver2} ./${Participant2}/linear_elasticity.prm 2>&1 | tee log.solid
The following warning is printed:
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.
…………
[llms01:1783182] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[llms01:1783182] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
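In other words, an OpenFabrics device was found, but none of its ports is in the "active" state, so Open MPI ignores the openib BTL for this job. This can prevent the parallel program from running as intended, so check carefully.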
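The log above only shows the aggregated form of the message ("3 more processes have sent help message ..."). As the last log line suggests, setting the MCA parameter orte_base_help_aggregate to 0 makes every process print its own message. A minimal sketch, assuming an Open MPI version that still uses the orte_ prefix for this parameter (as the log indicates) and reusing the variable names from the run script above:

```shell
# Equivalent to passing "--mca orte_base_help_aggregate 0" on the mpirun command line:
# Open MPI reads MCA parameters from environment variables named OMPI_MCA_<name>.
export OMPI_MCA_orte_base_help_aggregate=0

Participant2="Solid"
Solver2="linear_elasticity"
nprocS=4

# Only launch when mpirun and the solver binary are actually available.
if command -v mpirun >/dev/null 2>&1 && [ -x "./${Solver2}" ]; then
    mpirun -np "${nprocS}" "./${Solver2}" "./${Participant2}/linear_elasticity.prm" 2>&1 | tee log.solid
else
    echo "mpirun or ./${Solver2} not found; skipping the run." >&2
fi
```

With aggregation disabled, the warning is repeated once per process, which shows which ranks and hosts are affected.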
Cause:
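In earlier performance tests, increasing the number of parallel processes for the code above never changed the efficiency. In hindsight, this warning was probably the reason: my guess is that although 4 processes were started, only one was actually working and the other three were not active.
Running ibstat confirmed that the ports were indeed all in the disabled state.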
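The ibstat check can be scripted so that it runs before every job. The sketch below counts the ports whose logical state is reported as "State: Active". The helper name count_active_ports is made up for illustration, and the script assumes the usual ibstat layout in which each port has a "State:" line and a separate "Physical state:" line.

```shell
# Count ports whose logical state is "Active" in ibstat-style output read from stdin.
# "Physical state: ..." lines are not counted because they do not start with "State:".
count_active_ports() {
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true" hides that exit code.
    grep -cE '^[[:space:]]*State:[[:space:]]*Active' || true
}

if command -v ibstat >/dev/null 2>&1; then
    active=$(ibstat 2>/dev/null | count_active_ports)
    echo "Active InfiniBand ports: ${active}"
    if [ "${active}" -eq 0 ]; then
        echo "No active port found; mpirun will likely print the openib warning." >&2
    fi
else
    echo "ibstat not found on this node." >&2
fi
```

A count of 0 on a node that has an InfiniBand adapter matches the situation described above.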
Solution:
Run the following commands as administrator (root):
/etc/init.d/openibd restart
/etc/init.d/opensmd restart
Then run
ibstat
again to check the port state. The ports are now in the "Active" state and everything is back to normal.
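After the services restart, a port may need a short time before it reports Active, so a single ibstat call right after the restart can be misleading. The sketch below polls a status command until a "State: Active" line appears. The function name wait_for_active_port and its arguments (number of attempts, delay in seconds, then the command to run) are hypothetical and not part of any InfiniBand tool.

```shell
# Poll a status command until its output contains a "State: Active" line.
# Usage: wait_for_active_port MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as an active port is seen, 1 if all attempts fail.
wait_for_active_port() {
    max_tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        if "$@" 2>/dev/null | grep -qE '^[[:space:]]*State:[[:space:]]*Active'; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_for_active_port 10 3 ibstat` checks up to 10 times at 3-second intervals and can be followed by `&& echo "port is active"` in a job script.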
Running the parallel program again, the earlier warning no longer appears. OH, YEAH!
Reference: "在超算平台或高性能集群上运行并行程序,出现“no active ports detected”" (Running parallel programs on a supercomputing platform or HPC cluster produces "no active ports detected"), CSDN blog