An introduction to the SO_REUSEPORT socket option and how to configure it in nginx

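SO_REUSEPORT (reuseport) is a socket option:

It enables kernel-level load balancing of incoming connections: multiple processes or threads may bind/listen on the same IP/PORT, which makes the distribution of new connections more efficient.
reuseport is also a good kernel-level answer to the thundering herd problem: each process can bind/listen on the same IP/PORT, so each process effectively owns an independent listen socket with its own accept queue (the queue of fully established connections). This avoids contention on a shared listen socket and improves concurrent throughput. The kernel uses a hash to spread new connections fairly evenly across the sockets that have reuseport set, which takes care of load balancing.

Enabling reuseport in nginx yields an immediate performance improvement. Below we look at nginx's reuseport feature in the context of TCP.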

I. Introduction to SO_REUSEPORT and how it works:

1. What is SO_REUSEPORT:

Socket options
    The socket options listed below can be set by using setsockopt(2)
    and read with getsockopt(2) with the socket level set to
    SOL_SOCKET for all sockets.  Unless otherwise noted, optval is a
    pointer to an int.
...
    SO_REUSEPORT (since Linux 3.9)
                Permits multiple AF_INET or AF_INET6 sockets to be bound
                to an identical socket address.  This option must be set
                on each socket (including the first socket) prior to
                calling bind(2) on the socket.  To prevent port hijacking,
                all of the processes binding to the same address must have
                the same effective UID.  This option can be employed with
                both TCP and UDP sockets.

                For TCP sockets, this option allows accept(2) load
                distribution in a multi-threaded server to be improved by
                using a distinct listener socket for each thread.  This
                provides improved load distribution as compared to
                traditional techniques such using a single accept(2)ing
                thread that distributes connections, or having multiple
                threads that compete to accept(2) from the same socket.

                For UDP sockets, the use of this option can provide better
                distribution of incoming datagrams to multiple processes
                (or threads) as compared to the traditional technique of
                having multiple processes compete to receive datagrams on
                the same socket.

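In short:

It allows multiple threads/processes to bind sockets to the same ip:port socket address. The option must be set on every socket before bind(2) is called. In addition, to prevent port hijacking, all processes binding to the same address must have the same effective UID.
For TCP sockets, the option improves accept(2) load distribution in a multi-threaded server by giving each thread its own listener socket. This distributes load better than the traditional approaches, such as a single accept(2)ing thread that hands out connections, or multiple threads competing to accept(2) from the same socket.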
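
The summary above can be tried out with a minimal sketch. It assumes Linux 3.9 or later and a Python build that exposes socket.SO_REUSEPORT; the function name bind_reuseport_pair is made up for this illustration.

```python
import socket


def bind_reuseport_pair(host="127.0.0.1"):
    """Bind two listening TCP sockets to the same address using SO_REUSEPORT."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set before bind(), on every socket including the first.
    first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    first.bind((host, 0))  # port 0: let the kernel pick a free port
    first.listen(16)
    port = first.getsockname()[1]

    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    # Without SO_REUSEPORT on both sockets this bind() fails with EADDRINUSE.
    second.bind((host, port))
    second.listen(16)
    return first, second


if __name__ == "__main__":
    a, b = bind_reuseport_pair()
    print(a.getsockname(), b.getsockname())
    a.close()
    b.close()
```

Both sockets report the same address. Both are created by the same process here, so the effective UID requirement is trivially met.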

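2. What problems does SO_REUSEPORT solve?

Let's first look at the commit message of the Linux kernel patch that introduced this feature, committed in 2013 for kernel 3.9.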

soreuseport: TCP/IPv4 implementation
Allow multiple listener sockets to bind to the same port.

Motivation for soresuseport would be something like a web server
binding to port 80 running with multiple threads, where each thread
might have it's own listener socket.  This could be done as an
alternative to other models: 1) have one listener thread which
dispatches completed connections to workers. 2) accept on a single
listener socket from multiple threads.  In case #1 the listener thread
can easily become the bottleneck with high connection turn-over rate.
In case #2, the proportion of connections accepted per thread tends
to be uneven under high connection load (assuming simple event loop:
while (1) { accept(); process() }, wakeup does not promote fairness
among the sockets.  We have seen the  disproportion to be as high
as 3:1 ratio between thread accepting most connections and the one
accepting the fewest.  With so_reusport the distribution is
uniform.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert authored and davem330 committed on 24 Jan 2013
1 parent 055dc21 commit da5e36308d9f7151845018369148201a5d28b46d

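The reuseport option mainly solves two problems:

(Figure A) The performance bottleneck of a single listen socket.
(Figure B) A single listen socket with multiple threads calling accept at the same time, where connections end up unevenly distributed among the threads.

It also solves another important problem:

In a multi-threaded TCP server (Figure B), if all new connections are kept in the accept queue of a single listen socket, the threads that accept from that queue inevitably contend for one shared resource, and that contention wastes a significant amount of resources.

With multiple listeners bound to and listening on the same IP/PORT (Figure C), each process/thread has its own listener and therefore its own accept queue. Processes/threads no longer compete for a shared resource, so the server makes fuller use of multiple cores, loses less to contention, and becomes more efficient.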
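
To make the per-listener accept queues visible, here is a rough sketch (assuming Linux with SO_REUSEPORT and a free loopback port; make_listener and count_per_listener are names invented for this illustration). It opens several listeners on one port, connects a batch of clients, and counts how many connections land in each listener's own queue.

```python
import select
import socket


def make_listener(host, port):
    """Create a listening TCP socket with SO_REUSEPORT set before bind()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(128)
    return s


def count_per_listener(num_listeners=4, num_clients=40, host="127.0.0.1"):
    """Return how many connections landed in each listener's accept queue."""
    first = make_listener(host, 0)  # port 0: the kernel picks a free port
    port = first.getsockname()[1]
    listeners = [first] + [make_listener(host, port)
                           for _ in range(num_listeners - 1)]

    # Each client gets its own source port, so each has a distinct 4-tuple.
    clients = [socket.create_connection((host, port), timeout=2.0)
               for _ in range(num_clients)]

    counts = [0] * num_listeners
    accepted = []
    while sum(counts) < num_clients:
        ready, _, _ = select.select(listeners, [], [], 2.0)
        if not ready:
            break  # nothing more arrived within the timeout
        for sock in ready:
            conn, _addr = sock.accept()
            accepted.append(conn)
            counts[listeners.index(sock)] += 1

    for s in accepted + clients + listeners:
        s.close()
    return counts


if __name__ == "__main__":
    print(count_per_listener())
```

The exact split varies from run to run because it depends on the source ports the clients happen to get; only the total is fixed.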

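3. How it works:

When a TCP client connects, the server passively receives the SYN packet of the first handshake step. The kernel uses a hash to pick a listener and places the connection in the half-open (SYN) queue. Once the three-way handshake completes, the connection moves from the half-open queue to that listener's accept queue, where accept can retrieve it. As the figure shows, with the SO_REUSEPORT option set, the search for a suitable listener happens when the server receives that first SYN; see the kernel source (Linux 5.0.1) for details.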
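
The selection step can be pictured with a small simulation. This is only an illustration of hashing a connection's 4-tuple onto one of N listeners, not the kernel's code: zlib.crc32 is a stand-in for the kernel's own hash function, the multiply-and-shift scaling is an assumed way to map a 32-bit hash onto a range, and pick_listener is a hypothetical name.

```python
import zlib
from collections import Counter


def pick_listener(src_ip, src_port, dst_ip, dst_port, num_listeners):
    """Map a connection 4-tuple to a listener index in [0, num_listeners)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    h = zlib.crc32(key)  # stand-in 32-bit hash, NOT the kernel's hash
    # Scale the 32-bit value into the range with a multiply and a shift.
    return (h * num_listeners) >> 32


if __name__ == "__main__":
    # Simulate 1000 clients that differ only in source port.
    spread = Counter(
        pick_listener("10.0.0.1", port, "10.0.0.2", 80, 4)
        for port in range(20000, 21000)
    )
    print(sorted(spread.items()))
```

Because the index depends only on the 4-tuple, every packet of a given connection maps to the same listener for as long as the number of listeners stays the same.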

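II. Using the SO_REUSEPORT option in nginx:

After the Linux kernel gained reuseport in 2013, nginx added support for it in version 1.9.1 in 2015. With reuseport enabled, the official benchmark reports two to three times the original throughput, an immediate and visible gain. (See the official blog post: Socket Sharding in NGINX OSS Release 1.9.1.)

1. Enabling reuseport: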

# nginx.conf
# vim /usr/local/nginx/conf/nginx.conf
# Start 4 worker processes.
worker_processes  4;
http {
    ...
    server {
        listen 80 reuseport;
        server_name localhost;
        ...
    }
    ...
}

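Check the master and worker processes:

Check how the master and worker processes LISTEN on port 80.

Because the configuration sets worker_processes 4, four worker processes need to be started. When nginx finds the reuseport keyword after listen in the configuration, the master process first creates 4 sockets, sets the SO_REUSEPORT option on each, and then calls bind and listen.

When the workers are forked, each child inherits copies of the parent's 4 sockets, which is why every worker shows the same LISTEN socket fds (7, 8, 9, 10).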
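
The create-then-fork sequence described above can be imitated in a few lines. This is a simplified sketch and not nginx's implementation; it assumes Linux (os.fork, SO_REUSEPORT, SO_ACCEPTCONN), and the function name listeners_seen_by_children is hypothetical. The parent creates one listener per worker, forks, and each child reports which inherited fds are listening sockets.

```python
import os
import socket


def listeners_seen_by_children(num_workers=4, host="127.0.0.1"):
    """Create one SO_REUSEPORT listener per worker, fork the workers,
    and return (parent_fds, [fds reported by each child])."""
    listeners = []
    port = 0  # the first bind lets the kernel pick a free port
    for _ in range(num_workers):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        s.listen(128)
        port = s.getsockname()[1]  # later sockets reuse the same port
        listeners.append(s)
    parent_fds = [s.fileno() for s in listeners]

    child_reports = []
    for _ in range(num_workers):
        read_end, write_end = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: fork() copied every listening socket, same fd numbers.
            try:
                os.close(read_end)
                fds = [s.fileno() for s in listeners
                       if s.getsockopt(socket.SOL_SOCKET, socket.SO_ACCEPTCONN)]
                os.write(write_end, ",".join(map(str, fds)).encode())
            finally:
                os._exit(0)  # never fall through into the parent's code
        os.close(write_end)
        with os.fdopen(read_end, "rb") as pipe:
            data = pipe.read()  # returns once the child exits
        os.waitpid(pid, 0)
        child_reports.append([int(x) for x in data.decode().split(",") if x])

    for s in listeners:
        s.close()
    return parent_fds, child_reports


if __name__ == "__main__":
    print(listeners_seen_by_children())
```

Every child reports the same fd numbers as the parent, which matches the identical fd list seen in each worker.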

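Without reuseport configured, the processes and ports look like this:

2. Network diagram:

nginx uses a multi-process model, and on Linux it generally uses the epoll event mechanism.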

Reference: 探索惊群 ⑥ - nginx - reuseport (Exploring the thundering herd, part 6: nginx reuseport)

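3. Performance comparison:

3.1) nginx lock mode and shared mode (reuseport):

To make the SO_REUSEPORT socket option take effect (shared mode), add the reuseport parameter directly to the listen directive in the http or stream (TCP) context, as in this example: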

http {
    server {
      listen 80 reuseport;
      server_name localhost;
    }
}

# stream is a top-level context and must not be nested inside http.
stream {
    server {
      listen 88 reuseport;
    }
}

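Once reuseport is set on a socket, accept_mutex has no effect for that socket, because the mutex is redundant with reuseport. For ports that do not use reuseport, setting accept_mutex can still be worthwhile. accept_mutex was on by default before nginx 1.11.3 and has been off by default since then. The two mutex-related directives of the nginx core module are described below.

1) accept_mutex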

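Syntax: accept_mutex on | off;
Default: accept_mutex off; (on before version 1.11.3)
Context: events

The accept mutex is the lock that balances connection acceptance across workers (on by default before nginx 1.11.3). When it is enabled, workers take turns accepting new connections, one at a time. When it is disabled, every worker is notified of a new connection, and if connection volume is low this just wastes system resources.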

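2) lock_file

Syntax: lock_file file;
Default: lock_file logs/nginx.lock;
Context: main

With the accept mutex in use, a lock file may be needed. nginx uses locking to implement accept_mutex and to serialize access to shared memory. On most systems these locks are implemented with atomic operations and this directive is ignored; the lock file is used only on other systems.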

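3.2) Benchmark comparison:

I ran the wrk benchmarking tool against 4 NGINX worker processes on a 36-core AWS instance. To reduce the influence of the network, the client and NGINX both ran on the same host, and NGINX returned the string OK rather than a file. I compared three NGINX configurations: the default (equivalent to accept_mutex on in the version tested), accept_mutex off, and reuseport. As the figure shows, reuseport delivers two to three times the requests per second of the other two, and both the latency and the standard deviation of the latency go down.

I also ran a related benchmark with the client and NGINX on separate machines and NGINX returning an HTML file. As the table shows, the latency reduction with reuseport is similar to the previous benchmark, and the standard deviation of the latency drops even more sharply (almost ten-fold). Other results (not shown in the table) are just as encouraging. With reuseport, the load is spread evenly across the worker processes. With the default (equivalent to accept_mutex on), some workers receive a larger share of the load, and with accept_mutex off all workers experience high load.

In these benchmarks the rate of connection requests is high, but the requests need little processing. Other preliminary tests also indicate that reuseport improves performance the most when application traffic fits this profile. (The reuseport parameter is not available on the listen directive in the mail context, for example, because email traffic definitely does not fit this profile.) We encourage you to test reuseport with your own application rather than roll it out wholesale. For some tips on testing NGINX performance, see Konstantin Pavlov's talk at nginx.conf 2014.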

Reference: Nginx listen reuseport参数带来的性能提升 – 运维那点事 (Performance gains from the nginx listen reuseport parameter)