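Preface
The previous article, "Virtual Local Area Network (VLAN)", described what virtual NICs and virtual bridges do and used iptables to give the virtual network internet access. From there it is natural to think of today's mainstream container technology, Docker, so this article looks at how Docker's bridge networking compares with that setup.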
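Hypotheses
Docker's commonly used single-host network modes include host, bridge, and none (other drivers such as overlay and macvlan also exist); this article only analyzes bridge mode. With the previous article as background, the bridge concept should be familiar: a bridge is a virtual switch that forwards frames at the data link layer based on MAC addresses.
So we can make a bold guess: Docker builds its internal network communication on the same model.
- Hypothesis 1: when it creates a container, the Docker engine automatically creates a veth pair for it and assigns a private IP, attaching one end of the pair to the docker0 bridge and placing the other end inside the container's network namespace.
- Hypothesis 2: Docker likewise uses iptables NAT to forward container traffic to the internet.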
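The two hypotheses can be sketched as the commands one would run by hand. This is a rough illustration under assumed names (demo-br0, demo-ns, veth-host, veth-ctr and the 10.200.0.0/24 subnet are all made up), not what Docker literally executes. By default the script only prints the commands; it applies them only when DRY_RUN=0 is set and it runs as root.

```shell
# Sketch: roughly what the two hypotheses amount to when done by hand.
# Every name and address below is hypothetical.
# Commands are only printed unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_bridge_network() {
    local br=demo-br0 ns=demo-ns subnet=10.200.0.0/24
    # Hypothesis 1: a bridge plus a veth pair, one end inside the namespace
    run ip link add "$br" type bridge
    run ip addr add 10.200.0.1/24 dev "$br"
    run ip link set "$br" up
    run ip netns add "$ns"
    run ip link add veth-host type veth peer name veth-ctr
    run ip link set veth-host master "$br"
    run ip link set veth-host up
    run ip link set veth-ctr netns "$ns"
    run ip netns exec "$ns" ip addr add 10.200.0.2/24 dev veth-ctr
    run ip netns exec "$ns" ip link set veth-ctr up
    run ip netns exec "$ns" ip route add default via 10.200.0.1
    # Hypothesis 2: NAT for traffic that leaves the private subnet
    run sysctl -w net.ipv4.ip_forward=1
    run iptables -t nat -A POSTROUTING -s "$subnet" ! -o "$br" -j MASQUERADE
}

setup_bridge_network
```

Printing by default keeps the sketch safe to read through on a machine that already runs Docker.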
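Verification
Checking the host's network interfaces
List the Docker containers and the host's network interfaces, and check whether a Docker bridge and veth devices exist.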
shell
# List the Docker containers running on this host (mysql, redis, halo, debian)
[root@VM-8-10-centos ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
56ffaf39316a debian "bash" 23 hours ago Up 7 minutes debian
c8a273ce122e halohub/halo:1.5.3 "/bin/sh -c 'java -X…" 5 months ago Up 47 hours 0.0.0.0:8090->8090/tcp, :::8090->8090/tcp halo
d09fcfa7de0f redis "docker-entrypoint.s…" 12 months ago Up 5 weeks 0.0.0.0:8805->6379/tcp, :::8805->6379/tcp redis
87a2192f6db4 mysql:5.7 "docker-entrypoint.s…" 2 years ago Up 5 weeks 0.0.0.0:3306->3306/tcp, :::3306->3306/tcp, 33060/tcp mysql
# List the host's network interfaces (confirm that docker0 and veth devices exist)
[root@VM-12-15-centos ~]# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 52:54:00:b3:6f:20 brd ff:ff:ff:ff:ff:ff
altname enp0s5
altname ens5
3: br-67cf5bfe7a5c: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:c5:07:22:c7 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:38:d6:1b:ea brd ff:ff:ff:ff:ff:ff
5: br-9fd151a807e7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:35:7f:ed:76 brd ff:ff:ff:ff:ff:ff
315: vethf2afb37@if314: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-67cf5bfe7a5c state UP mode DEFAULT group default
link/ether 3a:06:f0:8d:06:f6 brd ff:ff:ff:ff:ff:ff link-netnsid 12
317: veth1ec30f9@if316: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-9fd151a807e7 state UP mode DEFAULT group default
link/ether 4a:ad:1a:b0:5a:5f brd ff:ff:ff:ff:ff:ff link-netnsid 0
319: vethc408286@if318: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-67cf5bfe7a5c state UP mode DEFAULT group default
link/ether 26:b0:3c:f4:c5:5b brd ff:ff:ff:ff:ff:ff link-netnsid 1
321: veth68fb8c6@if320: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-67cf5bfe7a5c state UP mode DEFAULT group default
link/ether 96:ca:a9:42:f8:a8 brd ff:ff:ff:ff:ff:ff link-netnsid 9
323: veth6dba394@if322: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-67cf5bfe7a5c state UP mode DEFAULT group default
link/ether 92:1c:5e:9c:a2:b3 brd ff:ff:ff:ff:ff:ff link-netnsid 4
325: veth1509ed0@if324: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-67cf5bfe7a5c state UP mode DEFAULT group default
link/ether fa:22:33:da:12:e0 brd ff:ff:ff:ff:ff:ff link-netnsid 11
329: vethef1dbac@if328: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-67cf5bfe7a5c state UP mode DEFAULT group default
link/ether aa:db:d2:10:36:60 brd ff:ff:ff:ff:ff:ff link-netnsid 3
331: veth69d3e7d@if330: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-67cf5bfe7a5c state UP mode DEFAULT group default
link/ether 86:45:d0:0e:6b:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 5
335: veth98588ae@if334: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-67cf5bfe7a5c state UP mode DEFAULT group default
link/ether 86:59:55:39:17:ad brd ff:ff:ff:ff:ff:ff link-netnsid 7
349: vetha84d717@if348: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-67cf5bfe7a5c state UP mode DEFAULT group default
link/ether ee:7f:d2:27:15:83 brd ff:ff:ff:ff:ff:ff link-netnsid 6
354: veth1@if355: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br-mybridge state UP mode DEFAULT group default qlen 1000
link/ether 72:c8:9e:24:a6:a3 brd ff:ff:ff:ff:ff:ff link-netns n1
356: br-mybridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 72:c8:9e:24:a6:a3 brd ff:ff:ff:ff:ff:ff
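ip link shows that a virtual bridge named docker0 exists, along with a number of veth devices; the master field of each veth names the bridge it is attached to. Note that this particular listing comes from a different host (VM-12-15-centos), where the veth devices are attached to user-defined bridges such as br-67cf5bfe7a5c and docker0 itself reports NO-CARRIER. On the host running the four containers (debian, halo, redis, mysql), the expectation is one veth pair per container, each attached to docker0.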
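With many veth devices it is easier to read the attachment as a table. The helper below is a small sketch (the function name veth_by_bridge is made up here) that reads one-line-per-interface output, as produced by ip -o link, and prints each veth next to the bridge named in its master field.

```shell
# Sketch: map each veth interface to the bridge named in its "master" field.
# Reads `ip -o link` style output (one interface per line) on stdin.
veth_by_bridge() {
    awk '
        $2 ~ /^veth/ {
            name = $2
            sub(/[@:].*/, "", name)       # drop the "@ifNNN:" peer suffix
            bridge = "(none)"
            for (i = 3; i < NF; i++)
                if ($i == "master") bridge = $(i + 1)
            print bridge, name
        }
    ' | LC_ALL=C sort
}

# On a live host (assumes iproute2 is installed):
#   ip -o link show | veth_by_bridge
```

A veth that is not enslaved to any bridge is reported with "(none)", which makes stray interfaces easy to spot.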
Checking the bridge IP and communication between containers
shell
# The default bridge docker0 has the IP address 172.17.0.1/16
[root@VM-8-10-centos ~]# ip addr show docker0
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:6f:d7:19:7e brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:6fff:fed7:197e/64 scope link
valid_lft forever preferred_lft forever
# Check the IP addresses of the containers on the bridge network (172.17.0.2/16, 172.17.0.3/16, 172.17.0.4/16 and 172.17.0.5/16)
[root@VM-8-10-centos ~]# docker network inspect bridge
[
{
"Name": "bridge",
"Id": "2dc75e446719be8cad37e1ea9ae7d1385fcc728b8177646a3c62929c2b289e94",
"Created": "2024-04-24T09:46:14.399901891+08:00",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.17.0.0/16",
"Gateway": "172.17.0.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"56ffaf39316ac9f776c6b3e2a8a79e9f42dfab42aa1f7de7525bd26c686defaa": {
"Name": "debian",
"EndpointID": "47dd9441d4a4c8b09afea3bca23652b80ba35e6baa13d44ec21ec89522e722a6",
"MacAddress": "02:42:ac:11:00:05",
"IPv4Address": "172.17.0.5/16",
"IPv6Address": ""
},
"87a2192f6db48c9bf2996bf25c79d4c18c3ae2975cac9d55e7fdfdcec03f896b": {
"Name": "mysql",
"EndpointID": "00b93de23c5abf2ed1349bac1c2ec93bf7ed516370dabf23348b980f19cfaa9c",
"MacAddress": "02:42:ac:11:00:02",
"IPv4Address": "172.17.0.2/16",
"IPv6Address": ""
},
"c8a273ce122ef5479583908f40898141a90933a3c41c8028dc7966b9af4c465d": {
"Name": "halo",
"EndpointID": "ba8ef83c80f3edb6e7987c95ae6d56816a1fc00d07e8bb2bfbb0f19ef543badf",
"MacAddress": "02:42:ac:11:00:04",
"IPv4Address": "172.17.0.4/16",
"IPv6Address": ""
},
"d09fcfa7de0f2a7b3ef7927a7e53a8a53fb93021b119b1376fe4616381c5a57c": {
"Name": "redis",
"EndpointID": "afbc9128f7d27becfbf64e843a92d36ce23800cd42c131e550abea7afb6a131e",
"MacAddress": "02:42:ac:11:00:03",
"IPv4Address": "172.17.0.3/16",
"IPv6Address": ""
}
},
"Options": {
"com.docker.network.bridge.default_bridge": "true",
"com.docker.network.bridge.enable_icc": "true",
"com.docker.network.bridge.enable_ip_masquerade": "true",
"com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
"com.docker.network.bridge.name": "docker0",
"com.docker.network.driver.mtu": "1500"
},
"Labels": {}
}
]
# Enter the debian container and test internal and internet connectivity
[root@VM-8-10-centos ~]# docker exec -it debian /bin/bash
root@56ffaf39316a:/# ping 172.17.0.1
PING 172.17.0.1 (172.17.0.1) 56(84) bytes of data.
64 bytes from 172.17.0.1: icmp_seq=1 ttl=64 time=0.071 ms
64 bytes from 172.17.0.1: icmp_seq=2 ttl=64 time=0.036 ms
--- 172.17.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.036/0.053/0.071/0.017 ms
root@56ffaf39316a:/# ping 172.17.0.3
PING 172.17.0.3 (172.17.0.3) 56(84) bytes of data.
64 bytes from 172.17.0.3: icmp_seq=1 ttl=64 time=0.067 ms
64 bytes from 172.17.0.3: icmp_seq=2 ttl=64 time=0.047 ms
--- 172.17.0.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.047/0.057/0.067/0.010 ms
root@56ffaf39316a:/# ping baidu.com
PING baidu.com (39.156.66.10) 56(84) bytes of data.
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=1 ttl=247 time=59.0 ms
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=2 ttl=247 time=55.4 ms
--- baidu.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 55.400/57.221/59.043/1.821 ms
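One detail in the inspect output is worth a closer look: each container's MAC address appears to be 02:42 followed by the four octets of its IPv4 address in hexadecimal. The sketch below reproduces that pattern; it is an observation about this particular output, not a guarantee for every Docker version, and the function name ip_to_mac is made up for illustration.

```shell
# Sketch: reproduce the MAC pattern visible in the inspect output above,
# i.e. 02:42 followed by the four IPv4 octets in hexadecimal.
# This is an observation about that output, not a documented guarantee.
ip_to_mac() {
    local IFS=.
    set -- $1                      # split the dotted address into octets
    printf '02:42:%02x:%02x:%02x:%02x\n' "$1" "$2" "$3" "$4"
}

ip_to_mac 172.17.0.5   # debian -> 02:42:ac:11:00:05
ip_to_mac 172.17.0.3   # redis  -> 02:42:ac:11:00:03
```

Both results match the MacAddress fields listed for debian and redis.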
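Summary
From the shell output: the docker0 bridge has the IP 172.17.0.1/16, the container can reach both the bridge and another container on the docker0 subnet, and ping baidu.com confirms that internet access works too. Docker's bridge mode therefore behaves the same way as the VLAN setup in the previous article: both rely on a virtual bridge for internal communication.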
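Communication between Docker containers and the internet
The previous article left a loose end. firewalld installs many built-in rules into iptables, which makes traffic analysis awkward, so I simply stopped firewalld. That turned out to have a side effect: stopping firewalld also clears the iptables rules. It did not seem to matter at the time, but thinking it over, the VLAN setup could reach the internet largely because of iptables NAT. With iptables cleared, the NAT rules are gone, so anything that depends on them loses connectivity. The shell commands below reproduce and analyze this.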
shell
# Stop firewalld
[root@VM-8-10-centos ~]# systemctl stop firewalld
# Inspect iptables
[root@VM-8-10-centos ~]# iptables -nvL
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
[root@VM-8-10-centos ~]# iptables -t nat -nvL
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
# Check the debian container's internet connectivity
[root@VM-8-10-centos ~]# docker exec -it debian /bin/bash
root@56ffaf39316a:/# ping baidu.com
PING baidu.com (110.242.68.66) 56(84) bytes of data.
--- baidu.com ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4000ms
# Check internal network connectivity
root@56ffaf39316a:/# ping 172.17.0.1
PING 172.17.0.1 (172.17.0.1) 56(84) bytes of data.
64 bytes from 172.17.0.1: icmp_seq=1 ttl=64 time=0.041 ms
64 bytes from 172.17.0.1: icmp_seq=2 ttl=64 time=0.046 ms
--- 172.17.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.041/0.043/0.046/0.002 ms
root@56ffaf39316a:/# ping 172.17.0.2
PING 172.17.0.2 (172.17.0.2) 56(84) bytes of data.
64 bytes from 172.17.0.2: icmp_seq=1 ttl=64 time=0.081 ms
64 bytes from 172.17.0.2: icmp_seq=2 ttl=64 time=0.055 ms
--- 172.17.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.055/0.068/0.081/0.013 ms
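Clearing iptables did cut the container off from the internet, but communication on the internal network was unaffected.
Manually adding a NAT rule to restore internet access for Docker containers
# Add an SNAT (masquerade) rule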
[root@VM-8-10-centos ~]# iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# Check the debian container's internet connectivity
[root@VM-8-10-centos ~]# docker exec -it debian /bin/bash
root@56ffaf39316a:/# ping baidu.com
PING baidu.com (39.156.66.10) 56(84) bytes of data.
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=1 ttl=247 time=55.8 ms
64 bytes from 39.156.66.10 (39.156.66.10): icmp_seq=2 ttl=247 time=55.4 ms
--- baidu.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 55.386/55.610/55.834/0.224 ms
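Personal conclusion: Docker containers do depend on iptables to reach the internet, and the behavior is almost identical to the VLAN setup, so I see Docker's bridge networking as essentially an advanced application of VLAN + iptables.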
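The rule added by hand above masquerades everything that leaves through eth0. A narrower variant limits NAT to the bridge subnet and to traffic that is not going back out through the bridge; as far as I know this is also the shape of the rule Docker normally installs for its default bridge. The sketch below only prints such a rule. The function name masq_rule is made up, and the subnet and bridge name are taken from the outputs shown earlier.

```shell
# Sketch: build a MASQUERADE rule scoped to one bridge subnet.
# 172.17.0.0/16 and docker0 come from the outputs shown earlier;
# the function only prints the command and changes nothing.
masq_rule() {
    local subnet=$1 bridge=$2
    printf 'iptables -t nat -A POSTROUTING -s %s ! -o %s -j MASQUERADE\n' \
        "$subnet" "$bridge"
}

masq_rule 172.17.0.0/16 docker0
# To apply it, run the printed command as root. To check whether an
# equivalent rule already exists, swap -A for -C.
```

Scoping by source subnet leaves the host's own outbound traffic untouched.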
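Further questions
Is communication between Docker containers also switched at layer 2?
From the earlier VLAN work we know that a bridge is a virtual switch working at the data link layer, forwarding frames by MAC address. Working at layer 2 means it has no notion of IP when switching; it only forwards frames according to MAC addresses. If so, switching should still work even after the bridge's IP address and its routes are removed. The experiment below deletes the address from docker0 and then pings another container.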
[root@VM-8-10-centos ~]# ip addr del 172.17.0.1/16 dev docker0
[root@VM-8-10-centos ~]# ip addr show docker0
3: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:6f:d7:19:7e brd ff:ff:ff:ff:ff:ff
inet6 fe80::42:6fff:fed7:197e/64 scope link
valid_lft forever preferred_lft forever
[root@VM-8-10-centos ~]# docker exec -it debian /bin/bash
root@56ffaf39316a:/# ping 172.17.0.3
PING 172.17.0.3 (172.17.0.3) 56(84) bytes of data.
64 bytes from 172.17.0.3: icmp_seq=1 ttl=64 time=0.066 ms
64 bytes from 172.17.0.3: icmp_seq=2 ttl=64 time=0.049 ms
--- 172.17.0.3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.049/0.057/0.066/0.008 ms
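The ping still succeeds because both containers sit in the same /16. The sender resolves the peer's MAC address directly via ARP and the bridge only switches frames, so the gateway address that was just deleted is never involved. The sketch below shows the same-subnet test a sender effectively applies; the helper names ip_to_int and same_subnet are made up for illustration.

```shell
# Sketch: decide whether two IPv4 addresses share a subnet. If they do, the
# sender resolves the peer's MAC via ARP and the bridge just switches frames;
# the gateway address (here the bridge's own IP) is never consulted.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

same_subnet() {
    local a b prefix=$3 mask
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    [ $(( a & mask )) -eq $(( b & mask )) ]
}

same_subnet 172.17.0.5 172.17.0.3 16 && echo "same subnet: switched at layer 2"
same_subnet 172.17.0.5 39.156.66.10 16 || echo "different subnet: sent to the gateway"
```

By the same logic, traffic to an address outside 172.17.0.0/16 would be expected to fail once the gateway IP is removed from docker0.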
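How do iptables and the routing table relate and differ? Which one decides the egress interface?
While studying VLANs I had a nagging question: **when a virtual bridge talks to the internet, who decides that traffic entering the bridge is forwarded to the egress interface?** Setting up NAT for the VLAN required FORWARD and NAT rules in iptables, so it was natural to assume iptables did it. But then what is the routing table for? **So is it iptables that forwards the traffic, or the routing table (ip route)?** More concretely: who moves traffic from the docker0 interface to the eth0 interface?
A full answer requires digging into how iptables works, which I will not go into here; below is my personal conclusion, for reference only.
Personal conclusion: the routing table does not modify traffic at all; it only determines a packet's egress interface. iptables can filter and rewrite IP packets (NAT changes addresses, and the FORWARD chain decides whether a routed packet is allowed through), but it does not choose the egress interface. The forwarding itself is done by the kernel's IP stack, which requires net.ipv4.ip_forward to be enabled, and the egress interface is ultimately determined by the routing table.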
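To make the conclusion concrete, here is a minimal sketch of a longest-prefix-match lookup, the selection rule a routing table applies. The two-entry table is made up to resemble this host (a default route via eth0 and the bridge subnet via docker0), and the function names are hypothetical. It does not model iptables at all, which is the point: the egress interface falls out of the destination address alone.

```shell
# Sketch: longest-prefix-match lookup over a tiny, made-up routing table,
# to show how the egress interface is chosen from the destination alone.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Routes are "network/prefix device" lines; 0.0.0.0/0 plays the default route.
ROUTES='0.0.0.0/0 eth0
172.17.0.0/16 docker0'

route_lookup() {
    local dst best_len=-1 best_dev="" cidr net len dev mask
    dst=$(ip_to_int "$1")
    while read -r cidr dev; do
        net=$(ip_to_int "${cidr%/*}")
        len=${cidr#*/}
        mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
        # keep the matching route with the longest prefix
        if [ $(( dst & mask )) -eq $(( net & mask )) ] && [ "$len" -gt "$best_len" ]; then
            best_len=$len
            best_dev=$dev
        fi
    done <<EOF
$ROUTES
EOF
    echo "$best_dev"
}

route_lookup 172.17.0.3     # prints docker0
route_lookup 39.156.66.10   # prints eth0
```

On a live host, ip route get 39.156.66.10 answers the same question using the kernel's real table.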
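Even without SNAT, shouldn't the packet still reach the remote network?
Every IP packet on the internet carries a source address and a destination address. The destination address determines how the packet is delivered to the remote host; the source address tells the remote host who sent it, and therefore where to reply. SNAT works by rewriting the source address. Does that mean that even without SNAT a packet would still reach the remote network, and only the reply would fail to find its way back?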
# Assume two cloud servers with public IPv4 addresses, xxx.xxx.xxx.xx1 and xxx.xxx.xxx.xx2. Another host, x10, sits on xx1's LAN.
# On host xx1
# Use SNAT to rewrite the source IP from xx1 to x10
[root@VM-8-10-centos ~]# iptables -t nat -A POSTROUTING -s xx1 -o eth0 -j SNAT --to-source xxx.xxx.xxx.x10
# Capture ICMP packets on eth0
[root@VM-8-10-centos ~]# tcpdump -i eth0 -p icmp -nv | grep x10
# On host xx2
# Capture ICMP packets on eth0
[root@VM-8-10-centos ~]# tcpdump -i eth0 -p icmp -nv
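According to the tcpdump capture, xx1 did send packets with source address x10, but the capture on xx2 shows no packets from either xx1 or x10. Perhaps the packets were dropped somewhere along the route, or perhaps my understanding is wrong. One plausible explanation (an assumption, not something verified here) is source-address filtering: many networks and cloud providers drop packets whose source address does not belong to the sending host, so a packet with a rewritten source may never leave the provider's network.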