0.背景介绍
某服务domain.com.cn 之前DNS解析到服务真实地址10.1.1.11,后面需要对用户登录增加黑名单功能,于是在openresty针对服务domain.com.cn的特性完成了黑名单功能。黑名单功能已经上线几个月,但是DNS从服务真实地址10.1.1.11切换到openresty10.1.4.2 一直遇到些问题,直到半个多月前才已经完成DNS切换,openresty lua脚本运行半个多月都没有问题。直到某一天执行了openresty -s reload,发生了一个生产小事故。
1.网络架构
client------>F5(domain.com.cn DNS解析至F5)------>openresty(宿主机:docker容器,openresty部署在docker容器中)
2.事故描述
openresty上面部署了黑名单lua脚本,脚本里面有根据cookie解析用户名称的代码,有些老外用户名称比较特殊,解析失败,前期没有记录详细的日志,这次在lua脚本中加了日志,在docker容器中执行bin/openresty -s reload。事故出现了,浏览器访问域名(http://domain.com.cn),显示502bad getway。
3.分析过程
只是重新reload了一下,服务怎么不可用了,telnet 服务ip 端口是通的,问题很有可能出现在openresty网关上面。查看nginx的error日志,并没有发现lua脚本的错误,全都是下面这些报错。
2023/04/18 15:19:28 [crit] 786#0: *100069488 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:30 [crit] 786#0: *100069620 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:30 [crit] 788#0: *100069623 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:32 [crit] 787#0: *100070474 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:32 [crit] 788#0: *100070476 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:32 [crit] 789#0: *100070478 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:43 [crit] 788#0: *100075175 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:43 [crit] 787#0: *100075176 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:45 [crit] 788#0: *100076749 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:48 [crit] 787#0: *100077950 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:48 [crit] 786#0: *100078001 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:50 [crit] 789#0: *100078390 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
2023/04/18 15:19:50 [crit] 788#0: *100078392 connect() to 10.1.1.21:8888 failed (99: Cannot assign requested address) while connecting to upstream, client: 10.1.1.29, server: vip.sdns.cn,
在openresty的宿主机执行命令:netstat -ae | wc -l
发现达到了5w+,此时联系网络同事,他也反馈了同样的问题:连接数暴增,如下图。
出现这么多暴增的连接,首先想到的就是死循环访问的问题。
排查宿主机hosts文件,发现运维同事配置了本地解析地址:10.1.1.11 domain.com.cn
10.1.1.11是服务的真实地址,并且nginx的反向代理配置为:proxy_pass http://domain.com.cn;
如果domain.com.cn没有正确解析到服务真实地址10.1.1.11,而是F5地址10.1.1.21,那就刚好构成死循环访问了。于是立刻在docker容器中ping domain.com.cn发现却是10.1.1.21,于是定位到问题。
问题的原因在于:
①openresty宿主机10.1.4.2 设置了hosts文件:10.1.1.11 domain.com.cn,
domain.com.cn这个域名 dns解析到f5:10.1.1.21 , f5又会转发到10.1.4.2,这是构成死循环的前提条件。
②宿主机的hosts文件设置内容并没有同步到docker容器中,这个是通过在docker容器执行openresty -s reload 命令触发了dns缓存更新,看下图:
③至于说DNS切换F5(F5再转发到openrestry)已经上线半个多月为啥没问题,因为,openresty黑名单代码已经跑了很久了长达几个月,上次openresty容器 里面执行reload命令的时候,dns服务器解析还没有切换f5。
4.后续优化
①DNS直接解析到openresty,省去F5这一层,两台ng配置Keepalive。
②proxy_pass 域名;使用动态DNS解析,即
http {
server {
listen 80;
set $domain doamin.com.cn;
location / {
resolver 8.8.8.8 valid=600s;
proxy_pass http://$domain;
}
}
}