One Article to Sort Out the Many Problems of the TIME_WAIT State

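Today, let's talk about TIME_WAIT.

If, after reading this article, you can answer the following rapid-fire questions with confidence, I will be well pleased:

How many TIME_WAIT sockets on a single machine would you consider abnormal?
Once TIME_WAIT-related problems appear, how do they affect the application?
In which scenarios should the TIME_WAIT-related kernel parameters be tuned, and how?
How long is 2MSL on your target machine?
Using TIME_WAIT usage alone, how do you estimate the likely QPS ceiling of the target machine?
What resources does a connection in TIME_WAIT consume?
Which parameters bound the number of TIME_WAIT sockets?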

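0x01 Where it appears

As we know, because TCP is full-duplex, closing a socket takes a four-way handshake:

The active closer calls close(), and its protocol stack sends a FIN and enters FIN_WAIT_1.
On receiving the FIN, the passive closer's stack automatically replies with an ACK and enters CLOSE_WAIT, then notifies the application layer and waits for its "follow-up action"; the active closer, on receiving that ACK, enters FIN_WAIT_2 and keeps waiting for the peer's FIN.
Once the passive closer's application has finished its "follow-up action" (all pending data has been sent), it calls close() on its own socket; its stack then sends a FIN, enters LAST_ACK, and waits for the ACK.
The active closer receives the FIN, its stack automatically replies with an ACK, and it enters TIME_WAIT; the passive closer enters CLOSED once that ACK arrives.
A socket in TIME_WAIT must wait for 2MSL before it moves to CLOSED.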
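
The close sequence above can be written down as two small transition tables, one per side. This is only a sketch of the normal close path described in the text, not a full TCP state machine; the table layout and the helper name states_visited are made up for illustration.

```python
# Simplified model of the normal four-way close (illustrative only).
# Each entry is (current state, event, next state).
ACTIVE_CLOSE = [
    ("ESTABLISHED", "call close(), send FIN", "FIN_WAIT_1"),
    ("FIN_WAIT_1", "receive ACK of FIN", "FIN_WAIT_2"),
    ("FIN_WAIT_2", "receive FIN, send ACK", "TIME_WAIT"),
    ("TIME_WAIT", "2MSL timer expires", "CLOSED"),
]
PASSIVE_CLOSE = [
    ("ESTABLISHED", "receive FIN, send ACK", "CLOSE_WAIT"),
    ("CLOSE_WAIT", "call close(), send FIN", "LAST_ACK"),
    ("LAST_ACK", "receive ACK of FIN", "CLOSED"),
]


def states_visited(transitions):
    """Return the ordered list of states one side passes through."""
    path = [transitions[0][0]]
    for current, _event, nxt in transitions:
        if current != path[-1]:
            raise ValueError("transitions do not chain")
        path.append(nxt)
    return path


print("active :", " -> ".join(states_visited(ACTIVE_CLOSE)))
print("passive:", " -> ".join(states_visited(PASSIVE_CLOSE)))
```

Only the active side's path contains TIME_WAIT, which is why the rest of the article looks at the side that closes first.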

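The above describes a "normal close". There is also a less common "simultaneous close", in which TIME_WAIT appears on both sides of the connection.

Figure: TCP normal close

Figure: TCP simultaneous close

Since a simultaneous close is essentially a special case of a normal close and is rare in practice, we only need to look at how TIME_WAIT affects the active closer.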

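0x02 Why it exists

As noted above, "a socket in TIME_WAIT must wait for 2MSL before it moves to CLOSED." So what is MSL?

MSL is the Maximum Segment Lifetime: a segment that has not been delivered within one MSL is discarded. RFC 793 suggests two minutes for the MSL, but operating systems choose their own values. On the 3.10 kernel of CentOS 7.6.1810, for example, the TIME_WAIT duration (that is, 2MSL) is 60 seconds.

In practice you can confirm this with ss -natp -e | grep TIME-WAIT: the timer value does fall in the 0~60 range.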
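
If you want a count rather than a listing, the same information is exposed on Linux through /proc/net/tcp, where the fourth column is a hex state code and 06 means TIME_WAIT. The sketch below is an assumption-laden helper, not a replacement for ss: it covers IPv4 only (IPv6 sockets live in /proc/net/tcp6), and the function names are hypothetical.

```python
from collections import Counter

TIME_WAIT = "06"  # hex state code used for TIME_WAIT in /proc/net/tcp


def count_states(proc_net_tcp_text):
    """Count sockets per hex state code in /proc/net/tcp-formatted text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) > 3:
            counts[fields[3]] += 1  # sl, local_address, rem_address, st
    return counts


def time_wait_count(path="/proc/net/tcp"):
    """Return the IPv4 TIME_WAIT count, or None where procfs is unavailable."""
    try:
        with open(path) as f:
            return count_states(f.read())[TIME_WAIT]
    except OSError:
        return None


print("TIME_WAIT sockets (IPv4):", time_wait_count())
```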

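Why doesn't the active closer go straight to CLOSED instead of lingering in TIME_WAIT for 2MSL?

Because TCP is a "reliable" protocol built on top of an unreliable network.

Concretely, after the active closer receives the passive closer's FIN, it replies with an ACK and enters TIME_WAIT. That ACK may well be delayed or lost on the network, which causes the passive closer to retransmit its FIN. In the worst case, the ACK travelling one way and the retransmitted FIN travelling back add up to twice the MSL.

If the active closer skipped TIME_WAIT and went straight to CLOSED, or stayed in TIME_WAIT for less than 2MSL, then when a delayed segment (data or FIN) sent earlier by the passive closer finally arrives, problems like the following can occur:

The old TCP connection no longer exists, so the system can only answer the delayed segment with an RST.
A new TCP connection has been established on the same four-tuple, and the delayed segment may interfere with it.

Either case makes TCP unreliable, so the TIME_WAIT state is essential.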

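0x03 The most famous problem

Having covered why TIME_WAIT is necessary, let's look at what can go wrong when you tamper with it.

The conclusion first: until a connection has reached CLOSED, it cannot be reused!

Note

Strictly speaking, every state, TIME_WAIT included, belongs to a socket, but we often blur the difference between a "connection" and a socket;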
CLOSED is fictional because it represents the state when there is no TCB, and therefore, no connection.

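The conclusion is fine, but it deserves a little explanation:

A TCP connection is identified by the four-tuple [src ip, src port, dst ip, dst port].
While the active closer's socket for a connection is still in TIME_WAIT, that four-tuple cannot be reused.
From the client's point of view, [src ip, dst ip, dst port] are very often fixed, so what really blocks reuse is the src port.

So when asked for "the best-known TIME_WAIT problem", the first thing that should come to mind is source port exhaustion.

Here is how the official HAProxy blog describes the problem:

Note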

TCP source port exhaustion (the famous high number of sockets in TIME_WAIT).
Any system has around 64K TCP source ports available to get connected to a remote IP:port. Once a combination of "source IP:port => dst IP:port" is in use, it can't be re-used.
Since the source port is unavailable for the system for 2MSL (per 1 min), this means that over 1000 requests per seconds you're in danger of TCP source port exhaustion: 64000 (available ports) / 60 (number of seconds in 1 minute) = 1,066.67.
How to avoid TCP source port exhaustion?
Increasing source port range (net.ipv4.ip_local_port_range) , By default, on a Linux box, you have around 28K source ports available (for a single destination IP:port)
Allow usage of source port in TIME_WAIT (net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle)
Using multiple IPs to get connected to a single server
In HAProxy configuration, you can precise on the server line the source IP address to use to get connected to a server, so just add more server lines with different IPs.
Use persistant connections

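It has to be said that the HAProxy blog already lists every available remedy:

First, prefer long-lived connections over short-lived ones wherever you can.
Second, if you can use N IPs as the src ip, your pool of usable source ports grows N-fold.
Third, if neither of the above is possible, tune the usable source port range (net.ipv4.ip_local_port_range) and the maximum number of TIME_WAIT sockets (net.ipv4.tcp_max_tw_buckets) to suit your workload.
Finally, you can enable the reuse and recycle features for TIME_WAIT (net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle) to get "reuse".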
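
The arithmetic in the HAProxy quote is easy to reproduce. The sketch below assumes the simplest case: one src ip talking to one dst ip:port, every connection short-lived and closed by the client, and each port blocked for a fixed 60 seconds. The function name is hypothetical, and the second port range is just the one used in the configuration at the end of this article.

```python
def max_new_connections_per_second(first_port, last_port, time_wait_seconds=60):
    """Upper bound on new connections/s from one src ip to one dst ip:port.

    Assumes every port stays unusable for time_wait_seconds after use.
    """
    usable_ports = last_port - first_port + 1  # the range is inclusive
    return usable_ports / time_wait_seconds


# The quote's round figure of 64000 usable ports.
print(round(max_new_connections_per_second(1, 64000), 2))
# A 32768-61000 range, roughly the "28K" the quote mentions.
print(round(max_new_connections_per_second(32768, 61000), 2))
```

Using N source IPs multiplies the bound by N, which is the second remedy in the list.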

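Most people start experimenting at this point and find that the problem is indeed "solved", only to discover some time later that other problems seem to have appeared...

0x04 Other problems

As said above, under certain conditions an excess of TIME_WAIT sockets leads to source port exhaustion. The maximum number of TIME_WAIT sockets is bounded by the smaller of net.ipv4.ip_local_port_range and net.ipv4.tcp_max_tw_buckets, so there is a trade-off to discuss:

tcp_max_tw_buckets caps the number of TIME_WAIT sockets for the whole system and is usually set fairly high, for example around 150,000. ip_local_port_range governs the src port choices in the four-tuple; it defaults to about 28,000 ports and can be raised to about 60,000. "Bounded by the smaller value" applies when src port is the only variable in the four-tuple; when you look at the real TIME_WAIT distribution on a machine, src port, dst ip and dst port all vary.
When ip_local_port_range is used up, new connections fail with a "no available port" error (typically EADDRNOTAVAIL, "Cannot assign requested address"), and you are squarely in the source port exhaustion problem.
When tcp_max_tw_buckets is exceeded, the active closer skips TIME_WAIT and goes straight to CLOSED, which makes TCP "no longer reliable"; /var/log/messages will contain the line "TCP: time wait bucket table overflow". In this case you are not necessarily suffering from source port exhaustion. When a delayed segment sent earlier by the passive closer arrives, problems like these can occur:
The old TCP connection no longer exists, so the system can only reply with an RST.
A new TCP connection has been established, and the delayed segment may interfere with it.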
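
The "smaller of the two" rule can be checked against a live machine. The sketch below reads both values through procfs and falls back to this article's suggested numbers when procfs is not available (for example on a non-Linux host). The helper names are hypothetical, and the result is only meaningful for the single-destination case discussed above.

```python
def time_wait_ceiling(port_range, max_tw_buckets):
    """Ceiling on TIME_WAIT sockets toward ONE dst ip:port from one src ip."""
    low, high = port_range
    return min(high - low + 1, max_tw_buckets)


def read_sysctl(name, root="/proc/sys"):
    """Read a sysctl value as text via procfs (Linux only)."""
    with open(f"{root}/{name.replace('.', '/')}") as f:
        return f.read().strip()


try:
    low, high = map(int, read_sysctl("net.ipv4.ip_local_port_range").split())
    buckets = int(read_sysctl("net.ipv4.tcp_max_tw_buckets"))
except (OSError, ValueError):
    # Fallback: the values suggested at the end of this article.
    low, high, buckets = 32768, 61000, 150000

print("port range size :", high - low + 1)
print("tw buckets      :", buckets)
print("effective limit :", time_wait_ceiling((low, high), buckets))
```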

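From the discussion above: ip_local_port_range has limited headroom, about 60,000 ports at most; if your workload really is that pathological and you do not plan to change it, there is not much else to do. tcp_max_tw_buckets can be set higher, but if you set it very low, lower even than the span of ip_local_port_range, then frankly I do not think you know what you are doing: your pathological workload is not the only thing running on the system, other ordinary services need their share too, and if you leave them no room, how can you expect them to serve properly?

OK, suppose ip_local_port_range and tcp_max_tw_buckets have both been set to sensible values and source port exhaustion no longer troubles you, but tcp_max_tw_buckets is still being exhausted. What then?

Ahem. Time for the kernel parameters net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle to take the stage.

0x05 The reuse problem

Before discussing net.ipv4.tcp_tw_reuse and net.ipv4.tcp_tw_recycle, we first need to understand net.ipv4.tcp_timestamps.

Why tcp_timestamps was introduced

Let's see what the RFCs say.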

TCP Extensions for High Performance (RFC1323)

The timestamps are used for two distinct mechanisms: RTTM (Round Trip Time Measurement) and PAWS (Protect Against Wrapped Sequences).

PAWS uses the TCP Timestamps option to protect against old duplicates from the same connection.

PAWS operates within a single TCP connection, using state that is saved in the connection control block.

PAWS assumes that every received TCP segment (including data and ACK segments) contains a timestamp SEG.TSval whose values are monotone non-decreasing in time. The basic idea is that a segment can be discarded as an old duplicate if it is received with a timestamp SEG.TSval less than some timestamp recently received on this connection.

Reducing the TIME-WAIT State Using TCP Timestamps (RFC6191)

For the purpose of PAWS, the timestamps sent on a connection are required to be monotonically increasing. While there is no requirement that timestamps are monotonically increasing across TCP connections, the generation of timestamps such that they are monotonically increasing across connections between the same two endpoints allows the use of timestamps for improving the handling of SYN segments that are received while the corresponding four-tuple is in the TIME-WAIT state. That is, the Timestamps option could be used to perform heuristics to determine whether to allow the creation of a new incarnation of a connection that is in the TIME-WAIT state.

Reducing the TIME-WAIT State Using TCP Timestamps

In a number of scenarios, a socket pair may need to be reused while the corresponding four-tuple is still in the TIME-WAIT state in a remote TCP peer. For example, a client accessing some service on a host may try to create a new incarnation of a previous connection, while the corresponding four-tuple is still in the TIME-WAIT state at the remote TCP peer (the server). This may happen if the ephemeral port numbers are being reused too quickly, either because of a bad policy of selection of ephemeral ports, or simply because of a high connection rate to the corresponding service. In such scenarios, the establishment of new connections that reuse a four-tuple that is in the TIME-WAIT state would fail.

In order to avoid this problem, when a connection request is received with a four-tuple that is in the TIME-WAIT state, the connection request may be accepted if the sequence number of the incoming SYN segment is greater than the last sequence number seen on the previous incarnation of the connection (for that direction of the data transfer). The goal of this requirement is to prevent the overlap of the sequence number spaces of the old and new incarnations of the connection so that segments from the old incarnation are not accepted as valid by the new incarnation.

The same policy may be extrapolated to TCP timestamps. That is, when a connection request is received with a four-tuple that is in the TIME-WAIT state, the connection request could be accepted if the timestamp of the incoming SYN segment is greater than the last timestamp seen on the previous incarnation of the connection (for that direction of the data transfer).

Both the ISN (Initial Sequence Number) and the Timestamps option (if present) of the incoming SYN segment are included in the heuristics performed for allowing a high connection-establishment rate.

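From the RFC excerpts above:

Timestamps were introduced for use by RTTM and PAWS.
PAWS requires timestamps to be monotonically increasing within a single TCP connection.
Within a single TCP connection, PAWS uses timestamps to reject old duplicates from that same connection.
For deciding whether a four-tuple in the TIME-WAIT state may be reused, only the ISN used to be available; timestamps can now be used as well.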
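
The basic PAWS rule quoted from RFC 1323 (discard a segment whose SEG.TSval is less than the most recently accepted timestamp) fits in a few lines. Timestamps are 32-bit and wrap around, so "less than" has to be a signed 32-bit comparison. This is a sketch of the comparison only, with hypothetical function names; real stacks add further conditions that are omitted here.

```python
MASK32 = 0xFFFFFFFF


def ts_before(a, b):
    """True if 32-bit timestamp a is older than b, allowing for wrap-around."""
    diff = (a - b) & MASK32
    return diff >= 0x80000000  # the signed 32-bit difference is negative


def paws_accepts(seg_tsval, ts_recent):
    """Basic PAWS test: reject a segment whose TSval is older than ts_recent."""
    return not ts_before(seg_tsval, ts_recent)


print(paws_accepts(1000, 900))        # newer timestamp
print(paws_accepts(900, 1000))        # old duplicate
print(paws_accepts(5, 0xFFFFFFF0))    # newer, after the counter wrapped
```

An equal timestamp is accepted, which matches the "monotone non-decreasing" wording in the RFC excerpt.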

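For now, all you need to know is that net.ipv4.tcp_timestamps is enabled by default and that both tcp_tw_reuse and tcp_tw_recycle depend on it.

The tcp_tw_reuse + tcp_timestamps combination

What tcp_tw_reuse does: it allows a socket in TIME_WAIT to be reused by a new connection, as long as this is judged safe from the protocol's point of view.

Note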

By default, when both tcp_tw_reuse and tcp_tw_recycle are disabled, the kernel will make sure that sockets in TIME_WAIT state will remain in that state long enough -- long enough to be sure that packets belonging to future connections will not be mistaken for late packets of the old connection.

When you enable tcp_tw_reuse, sockets in TIME_WAIT state can be used before they expire, and the kernel will try to make sure that there is no collision regarding TCP sequence numbers. If you enable tcp_timestamps (a.k.a. PAWS, for Protection Against Wrapped Sequence Numbers), it will make sure that those collisions cannot happen. However, you need TCP timestamps to be enabled on both ends.

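tcp_tw_reuse sets the kernel variable sysctl_tcp_tw_reuse, which is used only in tcp_twsk_unique(). For IPv4, that function is reached through a single call path: tcp_v4_connect->inet_hash_connect->__inet_check_established->twsk_unique->tcp_twsk_unique. In other words, tcp_tw_reuse only takes effect when the host acts as a client and calls connect() to open a connection.

Where it fits: a service that keeps opening short-lived connections to other servers, always closes them first (so TIME_WAIT lands on its own side), and then keeps reconnecting to the same peers.

Summary:

It depends on both ends supporting tcp_timestamps (a condition that is almost always met).
The "reuse" optimization provided by tcp_tw_reuse + timestamps only helps outbound connections: when connecting to a server as a client, the host can safely reuse the TIME_WAIT sockets that accumulate on its own side.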
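
To make the summary concrete, here is a toy version of the decision taken when an outbound connect() runs into a TIME_WAIT socket on the same four-tuple. It is loosely modeled on my reading of tcp_twsk_unique() in 3.10-era kernels and should be treated as an assumption, not as kernel behavior: the function name, the parameter names, and in particular the one-second threshold are illustrative.

```python
def may_reuse_time_wait(tw_reuse_enabled, saw_peer_timestamp,
                        seconds_since_last_timestamp):
    """Toy model: may an outbound connect() take over a TIME_WAIT four-tuple?

    tw_reuse_enabled             -- value of net.ipv4.tcp_tw_reuse (bool)
    saw_peer_timestamp           -- the old connection used TCP timestamps
    seconds_since_last_timestamp -- age of the last timestamp seen (assumed
                                    to need to exceed one second)
    """
    if not tw_reuse_enabled:
        return False
    if not saw_peer_timestamp:
        return False  # nothing to tell old segments from new ones
    return seconds_since_last_timestamp > 1


print(may_reuse_time_wait(True, True, 5))
print(may_reuse_time_wait(True, False, 5))
print(may_reuse_time_wait(False, True, 5))
```

The point of the model is the dependency chain: without timestamps on the old connection the option has nothing to work with, and it never applies to inbound connections.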

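The tcp_tw_recycle + tcp_timestamps combination

What tcp_tw_recycle does: with this option on, in most cases the kernel quickly reclaims the data associated with sockets in TIME_WAIT.

Why "in most cases"? In short, with net.ipv4.tcp_tw_recycle=1 the time after which the kernel reclaims a TIME_WAIT socket changes from 2MSL to rto, where rto = (icsk->icsk_rto << 2) - (icsk->icsk_rto >> 1) = 3.5 * icsk->icsk_rto. This rto is computed dynamically from the RTT. On a good network it is far smaller than 2MSL (that is, TCP_TIMEWAIT_LEN), which is where the speed-up comes from; on a poor network, where the round trip between client and server is long, rto can exceed 2MSL and the option becomes counterproductive.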
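
A quick worked example of the formula above. The kernel does this arithmetic in jiffies; milliseconds are used here purely for readability, and the sample RTO values are invented.

```python
TCP_TIMEWAIT_LEN_MS = 60_000  # 2MSL on Linux, expressed in milliseconds


def recycle_timeout_ms(icsk_rto_ms):
    """(rto << 2) - (rto >> 1), i.e. 3.5 * rto in integer arithmetic."""
    return (icsk_rto_ms << 2) - (icsk_rto_ms >> 1)


for rto in (200, 1000, 20000):  # illustrative RTO values in ms
    timeout = recycle_timeout_ms(rto)
    verdict = "shorter" if timeout < TCP_TIMEWAIT_LEN_MS else "not shorter"
    print(f"rto={rto}ms -> recycle after {timeout}ms ({verdict} than 2MSL)")
```

With an RTO of 200 ms the socket would be reclaimed after 700 ms instead of 60 s; with an RTO of 20 s the computed value is 70 s, longer than the default.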

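Once a socket has been in TIME_WAIT for longer than rto, the kernel recycles its TCP four-tuple; after that, if a new packet from the same IP arrives within TCP_PAWS_MSL (60 s), it is dropped if its timestamp is earlier than the one the kernel recorded.

Note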

When you enable tcp_tw_recycle, the kernel becomes much more aggressive, and will make assumptions on the timestamps used by remote hosts. It will track the last timestamp used by each remote host having a connection in TIME_WAIT state, and allow to re-use a socket if the timestamp has correctly increased. However, if the timestamp used by the host changes (i.e. warps back in time), the SYN packet will be silently dropped, and the connection won't establish (you will see an error similar to "connect timeout").

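Where it fits: man 7 tcp explicitly recommends against enabling this option, because it causes problems.

The problem shows up like this: if clients sit behind NAT (many clients sharing one egress IP) and the server has tcp_tw_recycle enabled, then within the 60 s TCP_PAWS_MSL window possibly only one client can connect successfully. Different clients' timestamps are very likely inconsistent, so the server silently drops any SYN carrying a smaller timestamp without replying with SYN,ACK, and the clients keep retransmitting their SYNs.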
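
The NAT failure mode is easy to reproduce in a toy model. The class below keeps one "last timestamp" per source IP and drops a SYN whose timestamp is smaller than the recorded one within a 60-second window. It is a deliberately simplified illustration, not kernel code: the class and method names are hypothetical, timestamp wrap-around is ignored, and the addresses are documentation examples.

```python
class RecycleServer:
    """Toy model of per-source-IP timestamp checking under tcp_tw_recycle."""

    PAWS_MSL = 60  # seconds, the window named TCP_PAWS_MSL in the text

    def __init__(self):
        self.last_seen = {}  # src ip -> (timestamp value, arrival time in s)

    def accept_syn(self, src_ip, tsval, now):
        """Return True if the SYN is accepted, False if silently dropped."""
        previous = self.last_seen.get(src_ip)
        if previous is not None:
            prev_tsval, prev_time = previous
            if now - prev_time < self.PAWS_MSL and tsval < prev_tsval:
                return False  # timestamp appears to have gone backwards
        self.last_seen[src_ip] = (tsval, now)
        return True


server = RecycleServer()
nat_ip = "203.0.113.7"  # two clients share this egress IP
print(server.accept_syn(nat_ip, tsval=500_000, now=0))  # client A
print(server.accept_syn(nat_ip, tsval=1_000, now=1))    # client B, smaller clock
print(server.accept_syn(nat_ip, tsval=1_000, now=61))   # client B after the window
```

Client B does nothing wrong; its timestamp clock simply started later than client A's, and the server cannot tell the two apart behind the shared IP.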

In addition, Linux removed the tcp_tw_recycle option starting with kernel 4.12.

Note

The net.ipv4.tcp_tw_recycle has been removed from Linux 4.12 on 2017.

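Summary:

It depends on both ends supporting tcp_timestamps.
The option mainly affects TIME_WAIT reclamation for inbound connections: the host accepts connections from remote peers and closes them first, so the TIME_WAIT sockets end up on its own side.
Since Linux itself chose to remove tcp_tw_recycle from the kernel, we have no reason to keep using it.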

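0x06 Summary

Configuring TIME_WAIT-related kernel options involves trade-offs, most of which depend on the use case; if you customize them, make sure you understand the pros and cons. In general, the best way to handle TCP connection close is to keep the TIME_WAIT state that the TCP protocol prescribes, and then work at the application layer to reuse connections through connection pools and the like (or change the protocol usage directly, for example to https, http/1.1 or http/2.0), thereby reducing the number of TIME_WAIT sockets;
Suppose server A has tcp_max_tw_buckets=180000, 2MSL is 1 min, and the number of TIME_WAIT sockets holds at around 120,000; then single-machine QPS can be roughly estimated at 2000 (120,000 / 60 s, assuming one short-lived connection per request that this host closes first). Going further, you can break down the 120,000 TIME_WAIT sockets, find which destinations (dst ip, dst port) hold the most source ports, and compare each count with the size of ip_local_port_range to see whether it is close to the limit, so that you can act in time;
Suppose server B has tcp_max_tw_buckets=65536, and the TIME_WAIT sockets belonging to non-core service X number about 20,000~40,000, roughly 30%~60% of the total; after raising tcp_max_tw_buckets to 180000, the TIME_WAIT sockets belonging to X become about 52,000, roughly 40% of the total. So non-core service X consistently takes about 40% of the TIME_WAIT resources on server B. If server B also runs core service A, which communicates over short-lived connections, it is reasonable to recommend either optimizing how service X uses short-lived connections to reduce its TIME_WAIT share, or moving non-core service X to another machine so that the core service has enough resources;
65536 TIME_WAIT sockets is really not many: about 64,000 TIME_WAIT connections take roughly 16.4 MB of memory; generally, below 150,000 there is little need to worry about performance;
If system parameters are tuned so that TIME_WAIT is shortened, or TIME_WAIT is skipped altogether because tw buckets ran out, there is a high chance of corrupted data or transient connection failures. Concretely: suppose the last ACK of the four-way close is lost. If TIME_WAIT has been shortened, the active closer reaches CLOSED quickly; if it then opens a new connection and the random source port happens to be reused, connect() sends a SYN, but the passive side still considers that four-tuple to be waiting for an ACK, receives a SYN instead, and replies with RST;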
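
The QPS estimate for server A is a single division, shown here as a sketch. It assumes a steady state in which every request opens one short-lived connection that this host closes first, so each request leaves exactly one TIME_WAIT socket behind for 60 seconds; the function names are hypothetical.

```python
def estimate_qps(time_wait_count, time_wait_seconds=60):
    """Steady-state rate of connections closed by this host, per second."""
    return time_wait_count / time_wait_seconds


def bucket_usage(time_wait_count, max_tw_buckets):
    """Fraction of tcp_max_tw_buckets currently in use."""
    return time_wait_count / max_tw_buckets


# Server A from the text: 120,000 TIME_WAIT sockets, 180,000 buckets.
print("estimated QPS:", estimate_qps(120_000))
print("bucket usage :", round(bucket_usage(120_000, 180_000), 2))
```

If requests reuse connections, or the peer closes first, the real request rate can be far higher than this figure, so it is better read as a count of locally closed connections per second.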

If I had to offer a recommended configuration, I would suggest:

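# could be set even higher
net.ipv4.tcp_max_tw_buckets = 150000
# adjust as needed
net.ipv4.ip_local_port_range = 32768 61000
net.ipv4.tcp_timestamps = 1
# omit this line on kernels 4.12 and later, where the option no longer exists
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_tw_reuse = 1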
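
Before applying settings like these, it helps to see how they differ from what the machine is running now. The sketch below parses sysctl.conf-style text (lines starting with # or ; are comments) and looks up each key under /proc/sys. The helper names are hypothetical; a key that cannot be read, such as tcp_tw_recycle on newer kernels or any key on a non-Linux host, is reported as None.

```python
import os

RECOMMENDED = """
# values suggested in this article
net.ipv4.tcp_max_tw_buckets = 150000
net.ipv4.ip_local_port_range = 32768 61000
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_tw_reuse = 1
"""


def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blanks and # or ; comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = " ".join(value.split())
    return settings


def live_value(key, root="/proc/sys"):
    """Current value of a sysctl key, or None if it cannot be read."""
    path = os.path.join(root, *key.split("."))
    try:
        with open(path) as f:
            return " ".join(f.read().split())  # normalize tabs and newlines
    except OSError:
        return None


for key, wanted in parse_sysctl_conf(RECOMMENDED).items():
    current = live_value(key)
    marker = "ok" if current == wanted else "differs"
    print(f"{key}: wanted={wanted!r} current={current!r} ({marker})")
```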