Background
A newly arrived Intel 40GbE NIC, used for data transfer, failed with this error:
tx_timeout recovery unsuccessful, device is in non-recoverable state.
NIC information
root@gz-111:~# lspci |grep -i eth
23:00.0 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
23:00.1 Ethernet controller: Intel Corporation Ethernet Controller XL710 for 40GbE QSFP+ (rev 02)
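Approach
From a quick search, this looks like a bug in the i40e driver shipped with certain Ubuntu kernel versions. The plan is to stop using the driver bundled with the kernel and install the one downloaded from Intel's website instead.
Steps
Error log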
[Wed Nov 6 15:44:36 2024] NETDEV WATCHDOG: enp35s0f0 (i40e): transmit queue 8 timed out
[Wed Nov 6 15:44:36 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout: VSI_seid: 390, Q 8, NTC: 0x57, HWB: 0x90, NTU: 0x90, TAIL: 0x90, INT: 0x0
[Wed Nov 6 15:44:36 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout recovery level 1, txqueue 8
[Wed Nov 6 15:47:35 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout: VSI_seid: 390, Q 118, NTC: 0x140, HWB: 0x62, NTU: 0x62, TAIL: 0x62, INT: 0x0
[Wed Nov 6 15:57:08 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout recovery level 2, txqueue 7
[Wed Nov 6 15:57:27 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout: VSI_seid: 390, Q 7, NTC: 0xf9, HWB: 0x15b, NTU: 0x15b, TAIL: 0x15b, INT: 0x0
[Wed Nov 6 15:57:27 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout recovery level 3, txqueue 7
[Wed Nov 6 15:57:30 2024] i40e 0000:23:00.0 enp35s0f0: NIC Link is Down
[Wed Nov 6 15:57:31 2024] i40e 0000:23:00.0 enp35s0f0: NIC Link is Up, 40 Gbps Full Duplex, Flow Control: None
[Wed Nov 6 15:57:48 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout: VSI_seid: 390, Q 37, NTC: 0x1a1, HWB: 0x68, NTU: 0x68, TAIL: 0x68, INT: 0x0
[Wed Nov 6 15:57:48 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout recovery level 1, txqueue 37
[Wed Nov 6 15:58:09 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout: VSI_seid: 390, Q 37, NTC: 0x85, HWB: 0x6f, NTU: 0x6f, TAIL: 0x6f, INT: 0x0
[Wed Nov 6 15:59:13 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout recovery level 3, txqueue 95
[Wed Nov 6 15:59:16 2024] i40e 0000:23:00.0 enp35s0f0: NIC Link is Down
[Wed Nov 6 15:59:17 2024] i40e 0000:23:00.0 enp35s0f0: NIC Link is Up, 40 Gbps Full Duplex, Flow Control: None
[Wed Nov 6 15:59:33 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout: VSI_seid: 390, Q 36, NTC: 0x13a, HWB: 0x172, NTU: 0x172, TAIL: 0x172, INT: 0x0
[Wed Nov 6 15:59:33 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout recovery level 4, txqueue 36
[Wed Nov 6 15:59:33 2024] i40e 0000:23:00.0 enp35s0f0: tx_timeout recovery unsuccessful, device is in non-recoverable state.
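The log above shows the driver escalating through recovery levels 1 to 4 before giving up. A small helper can tally that escalation from a longer log. This is only a sketch, not part of the original procedure: the function name summarize_tx_timeouts is hypothetical, and it assumes the messages keep the i40e wording shown above.

```shell
# Summarize i40e tx_timeout escalation from kernel log text on stdin.
# Prints "level <n>: <count>" for each recovery level seen, then
# "non-recoverable: yes|no".
# NOTE: summarize_tx_timeouts is a hypothetical helper name.
summarize_tx_timeouts() {
    awk '
        /tx_timeout recovery level [0-9]+/ {
            # "recovery level " is 15 characters; the digits follow it.
            if (match($0, /recovery level [0-9]+/)) {
                lvl = substr($0, RSTART + 15, RLENGTH - 15) + 0
                count[lvl]++
                if (lvl > max) max = lvl
            }
        }
        /tx_timeout recovery unsuccessful/ { dead = 1 }
        END {
            for (i = 1; i <= max; i++)
                if (i in count) printf "level %d: %d\n", i, count[i]
            printf "non-recoverable: %s\n", dead ? "yes" : "no"
        }
    '
}
```

It would typically be fed from the kernel log, for example: dmesg -T | summarize_tx_timeouts. On the excerpt above it reports level 1 twice, level 2 once, level 3 twice, level 4 once, and non-recoverable: yes.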
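Temporary workaround:
Soft-reset the NIC with ethtool. ethtool -t runs the adapter self-test (offline by default), which briefly interrupts the link and in this case brought the NIC back.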
ethtool -t enp35s0f0
Download and install the NIC driver
tar xvzf i40e-2.26.8.tar.gz ; cd i40e-2.26.8
cd src
make install
rmmod i40e
insmod i40e.ko
root@gz-111:/opt/i40e-2.26.8# lsmod |grep i40
i40e 786432 0
Verify the version
root@gz-111:/opt/i40e-2.26.8# modinfo i40e
filename: /lib/modules/6.2.0-26-generic/updates/drivers/net/ethernet/intel/i40e/i40e.ko
version: 2.26.8
license: GPL
description: Intel(R) 40-10 Gigabit Ethernet Connection Network Driver
author: Intel Corporation, <e1000-devel@lists.sourceforge.net>
srcversion: E8C54FBxxx52718C
alias: pci:v0000808xxxBsv*sd*bc*sc*i*
depends:
retpoline: Y
name: i40e
vermagic: 6.2.0-26-generic SMP preempt mod_unload modversions
parm: debug:Debug level (0=none,...,16=all) (int)
parm: l4mode:L4 cloud filter mode: 0=UDP,1=TCP,2=Both,-1=Disabled(default) (int)
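The two fields that matter in the modinfo output above are filename and version. The sketch below pulls them out and guesses whether the loaded module is the newly installed one. It rests on an assumption not stated in the original: that a module path containing /updates/ means an out-of-tree install, as in the output above. The function name check_i40e_module is hypothetical.

```shell
# Read `modinfo i40e` output on stdin and print the driver version and
# a guess at where the module came from.
# ASSUMPTION: a path containing "/updates/" is treated as an out-of-tree
# install; anything else is reported as in-tree.
# NOTE: check_i40e_module is a hypothetical helper name.
check_i40e_module() {
    awk -F': *' '
        $1 == "filename" { path = $2 }
        $1 == "version"  { ver = $2 }
        END {
            if (ver == "") ver = "unknown"
            origin = (path ~ /\/updates\//) ? "out-of-tree" : "in-tree"
            printf "version=%s origin=%s\n", ver, origin
        }
    '
}
```

For example: modinfo i40e | check_i40e_module. For the output above this prints version=2.26.8 origin=out-of-tree.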
Log
[Wed Nov 6 16:43:59 2024] i40e 0000:23:00.0 enp35s0f0: offline testing starting
[Wed Nov 6 16:43:59 2024] i40e 0000:23:00.0 enp35s0f0: testing finished
[Wed Nov 6 16:51:18 2024] i40e 0000:23:00.1: i40e_ptp_stop: removed PHC on enp35s0f1
[Wed Nov 6 16:51:23 2024] i40e 0000:23:00.0: i40e_ptp_stop: removed PHC on enp35s0f0
[Wed Nov 6 16:52:15 2024] i40e 0000:23:00.0 enp35s0f0: renamed from eth0
[Wed Nov 6 16:52:15 2024] i40e 0000:23:00.0 enp35s0f0: NIC Link is Up, 40 Gbps Full Duplex, Flow Control: None
[Wed Nov 6 16:52:15 2024] i40e 0000:23:00.1 enp35s0f1: renamed from eth0
[Wed Nov 6 16:52:16 2024] IPv6: ADDRCONF(NETDEV_CHANGE): enp35s0f0: link becomes ready
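The network drops briefly because the module is unloaded and reloaded.
After the reload, uploading data with multiple processes again produced no errors.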
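To confirm the fix after a test upload, the kernel log can be checked for any remaining transmit-timeout messages. A minimal sketch, assuming the two message patterns seen in the error log earlier are the ones worth watching; the function name assert_no_tx_timeouts is hypothetical.

```shell
# Exit 0 if the kernel log text on stdin has no transmit-timeout
# messages; otherwise print the matching lines and exit 1.
# NOTE: assert_no_tx_timeouts is a hypothetical helper name.
assert_no_tx_timeouts() {
    local hits
    # grep exits non-zero when nothing matches, so tolerate that case.
    hits="$(grep -E 'NETDEV WATCHDOG|tx_timeout' || true)"
    if [ -n "$hits" ]; then
        printf '%s\n' "$hits"
        return 1
    fi
    return 0
}
```

For example: dmesg -T | assert_no_tx_timeouts && echo clean. Note that dmesg still holds messages from before the driver reload until the ring buffer is cleared or the host reboots, so older timeouts will also match.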
References
Driver download
https://www.intel.cn/content/www/cn/zh/download/18026/intel-network-adapter-driver-for-pcie-40-gigabit-ethernet-network-connections-under-linux.html
Installation guide
https://www.intel.cn/content/www/cn/zh/docs/programmable/683040/1-1/installing-the-xl710-driver.html
Bug report
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1779756
Source code
https://lore.kernel.org/netdev/20220816182751.2534028-3-anthony.l.nguyen@intel.com/
https://sbexr.rabexc.org/latest/sources/c8/c897a10ea1d787.html