Docker Bridge Conflicts with Host Network

Docker Bridge Conflicts with Host Network - docker

Docker seems to be creating a bridge after a container starts running that then conflicts with my host network. This is not the default bridge docker0, but rather another bridge that is created after a container has started. I am able to configure the default bridge according to the older user guide link https://docs.docker.com/v17.09/engine/userguide/networking/default_network/custom-docker0/, however, I do not know how to configure this other bridge so it does not conflict with 172.17.
This current issue is then that my container cannot access other systems on the host network when this bridge becomes active.
Any ideas?
Version of docker:
Version 18.03.1-ce-mac65 (24312)
This is the bridge that gets created. Sometimes it is not 172.17, but sometimes it is.
br-f7b50f41d024 Link encap:Ethernet HWaddr 02:42:7D:1B:05:A3
inet addr:172.17.0.1 Bcast:172.17.255.255 Mask:255.255.0.0

When docker networks are created (e.g. using docker network create or indirectly through docker-compose) without explicitly specifying a subnet range, dockerd allocates a new /16 network, starting from 172.N.0.0/16, where N is a number that is incremented (e.g. N=17, N=18, N=19, N=20, ...). A given N is skipped if a docker network (a custom one, or the default docker bridge) already exists in the range.
You can specify explicitly a safe IP range when creating a docker bridge (i.e. one that excludes the host ips in your network) on the CLI. But usually bridge networks are created automatically by docker-compose with default blocks. To exclude these IPs reliably would require modifying every docker-compose.yaml file you encounter. It's bad practice to include host-specific things inside a compose file.
Instead, you can play with the networks that docker considers allocated, to force dockerd to "skip" subnets. I'm outlining three methods below:
Method #0 -- configure the pool of ips in the daemon config
If your docker version is recent enough (TODO check minimum version), and you have permissions to configure the docker daemon's command line arguments, you can try passing --default-address-pool ARG options to the dockerd command. Ex:
# allocate /24 subnets with the given CIDR prefix only.
# note that this prefix excludes 172.17.*
--default-address-pool base=172.24.0.0/13,size=24
You can add this setting in one of the etc files: /etc/default/docker, or in /etc/sysconfig/docker, depending on your distribution. There is also a way to set this parameter in daemon.json (see syntax)
Method #1 -- create a dummy placeholder network
You can prevent the entire 172.17.0.0/16 from being used by dockerd (in future bridge networks) by creating a very small docker network anywhere inside 172.17.0.0/16.
Find 4 consecutive IPs in 172.17.* that you know are not in use in your host network, and sacrifice them in a "tombstone" docker bridge. Below, I'm assuming the ips 172.17.253.0, 172.17.253.1, 172.17.253.2, 172.17.253.3 (i.e. 172.17.253.0/30) are unused in your host network.
docker network create --driver=bridge --subnet 172.17.253.0/30 tombstone
# created: c48327b0443dc67d1b727da3385e433fdfd8710ce1cc3afd44ed820d3ae009f5
Note the /30 suffix here, which defines a block of 4 different IPs. In theory, the smallest valid network subnet should be a /31 which consists of a total of 2 IPs (network identifier + broadcast). Docker asks for a /30 minimum, probably to account for a gateway host, and another container. I picked .253.0 arbitrarily, you should pick something that's not in use in your environment. Also note that the identifier tombstone is nothing special, you can rename it to anything that will help you remember why it's there when you find it again several months later.
Docker will modify your routing table to send traffic for these 4 IPs to go through that new bridge instead of the host network:
# output of route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.5.1 0.0.0.0 UG 0 0 0 eth1
172.17.253.0 0.0.0.0 255.255.255.252 U 0 0 0 br-c48327b0443d
172.20.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
192.168.5.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
Note: Traffic for 172.17.253.{0,1,2,3} goes through the tombstone docker bridge just created (br-c4832...). Traffic for any other IP in the 172.17.* would go through the default route (host network). My docker bridge (docker0) is on 172.20.0.1, which may appear unusual -- I've modified bip in /etc/docker/daemon.json to do that. See this page for more details.
The twist: if there exists a bridge occupying even a subportion of a /16, new bridges created will skip that range. If we create new docker networks, we can see that the rest of 172.17.0.0/16 is skipped, because the range is not entirely available.
docker network create foo_test
# c9e1b01f70032b1eff08e48bac1d5e2039fdc009635bfe8ef1fd4ca60a6af143
docker network create bar_test
# 7ad5611bfa07bda462740c1dd00c5007a934b7fc77414b529d0ec2613924cc57
The resulting routing table:
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.5.1 0.0.0.0 UG 0 0 0 eth1
172.17.253.0 0.0.0.0 255.255.255.252 U 0 0 0 br-c48327b0443d
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-c9e1b01f7003
172.19.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-7ad5611bfa07
172.20.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
192.168.5.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
Notice that the rest of the IPs in 172.17.0.0/16 have not been used. The new networks reserved .18. and .19.. Sending traffic to any of your conflicting IPs outside that tombstone network would go via your host network.
You would have to keep that tombstone network around in docker, but not use it in your containers. It's a dummy placeholder network.
Method #2 -- bring down the conflicting bridge network
If you wish to temporarily avoid the IP conflict, you can bring the conflicting docker bridge down using ip: ip link set dev br-xxxxxxx down (where xxxxxx represents the name of the bridge network from route -n or ip link show). This will have the effect of removing the corresponding bridge routing entry in the routing table, without modifying any of the docker metadata.
This is arguably not as good as the method above, because you'd have to bring down the interface possibly every time dockerd starts, and it would interfere with your container networking if there was any container using that bridge.
If method 1 stops working in the future (e.g. because docker tries to be smarter and reuse unused parts of an ip block), you could combine both approaches: e.g. create a large tombstone network with the entire /16, not use it in any container, and then bring its corresponding br-x device down.
Method #3 -- reconfigure your docker bridge to occupy a subportion of the conflicting /16
As a slight variation of the above, you could make the default docker bridge overlap with a region of 172.17.*.* that is not used in your host network. You can change the default docker bridge subnet by changing the bridge ip (i.e. bip key) in /etc/docker/daemon.json (See this page for more details). Just make it a subregion of your /16, e.g. in a /24 or smaller.
I've not tested this, but I presume any new docker network would skip the remainder of 172.17.0.0/16 and allocate an entirely different /16 for each new bridge.

The bridge was created from docker-compose, which can be configured within the compose file.
Answer found here: Docker create two bridges that corrupts my internet access

Related

Docker-compose "ports": listen on multiple IP addresses / IP range

Instead of listening to a single IP address like e.g. localhost:
ports:
- "127.0.0.1:80:80"
I want the container to only listen to a local network, i.e. e.g.:
ports:
- "10.0.0.0/16:80:80"
ERROR: The Compose file './docker-compose.yml' is invalid because:
services.SERVICE.ports contains an invalid type, it should be a number, or an object
Is this possible?
I don't want to use things like swarm mode etc., yet.
If IP range is not supported, maybe at least multiple IP addresses like 10.0.0.2 and 10.0.0.3?
ERROR: for CONTAINER Cannot start service SERVICE: driver failed programming external connectivity on endpoint CONTAINER (...): Error starting userland proxy: listen tcp 10.0.0.3:80: bind: cannot assign requested address
ERROR: for SERVICE Cannot start service SERVICE: driver failed programming external connectivity on endpoint CONTAINER (...): Error starting userland proxy: listen tcp 10.0.0.3:80: bind: cannot assign requested address
Or is it not even supported to listen to 10.0.0.3 ?
The host machine is connected to 10.0.0.0/16:
> ifconfig
ens10: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.0.0.2 netmask 255.255.255.255 broadcast 10.0.0.2
inet6 f**0::8**0:ff:f**9:b**7 prefixlen 64 scopeid 0x20<link>
ether **:00:00:**:**:** txqueuelen 1000 (Ethernet)

Listening to a single IP address seems not correct. The service is listening at an IP address.
Let's say your VM has two network interfaces (ethernet cards):
Network 1 → subnet: 10.0.0.0/24 and IP 10.0.0.100
Network 2 → subnet: 10.0.1.0/24 and IP 10.0.1.200
If you set 127.0.0.1:80:80 that means that your service listening at 127.0.0.1's (localhost) port 80.
If you want to access service from 10.0.0.0/24 subnet you should set 10.0.0.100:80:80 and use http://10.0.0.100:80 address to be able connect your container from external hosts
If you want to access service from multiple networks simultaneously you can bind the container port to multiple ports, where the IP is the connection source IP):
ports:
- 10.0.0.100:80:80
- 10.0.1.200:80:80
- 127.0.0.1:80:80
And don't forget to open 80 port at VM's firewall, if a firewall exists and restricts that network

I think you misunderstood this field.
When you map 127.0.0.1:80:80 you will map interface 127.0.0.1 from your host to your container.
In the case of the 127.0.0.1 you can only access it from inside your host.
When you map 10.0.0.3:80:80 you will map interface 10.0.0.3 from your host to your container. And all ip who can access 10.0.0.3 will have acces to your docker container mapping.
But in anycase this field will not do any filtering about who access this container
EDIT: After your modification i've seen my misunderstood about your question.
You want docker to create "bridge interface" to not share the ip of your host.
I don't think this is possible when using the port mapping

If you give Compose ports: (or docker run -p) an IP address, it must be a specific known IP address of a host interface, or 0.0.0.0 for "all interfaces". The Docker daemon gives this specific IP address to a bind(2) call, which takes an address and not a network, and follows the rules in ip(7) for IPv4.
With the output you show, you can only bind containers to 10.0.0.2. If you want to use other IP addresses on the same network, you also need to assign them to the host; see for example How can I (from CLI) assign multiple IP addresses to one interface? on Ask Ubuntu, and then you can bind a container to the newly-added address.
If your system is on multiple physical networks, you can have any number of ports: so long as the host address and host port are unique. In particular you can have multiple ports: that all forward to the same container port.
ports:
# make this visible to the external load balancer on port 80
- '192.168.17.2:80:3000'
# also make this visible to the internal network also on port 80
- '10.0.0.2:80:3000'
# and the management network but on port 3000
- '10.99.0.36:3000:3000'
Again, the host must already have these IP addresses in the ifconfig output.

Is it possible to customize swarm port? If so, how to do this?

according to docker doc:
The following ports must be available. On some systems, these ports are open by default.
TCP port 2377 for cluster management communications
TCP and UDP port 7946 for communication among nodes
UDP port 4789 for overlay network traffic
so if these 3 default ports are not avaiavle on hosts, how to customized these ports?

The following options are available in 19.03 (just released):
$ docker swarm init --help
Usage: docker swarm init [OPTIONS]
Initialize a swarm
Options:
--advertise-addr string Advertised address (format: <ip|interface>[:port])
--autolock Enable manager autolocking (requiring an unlock key to start a stopped manager)
--availability string Availability of the node ("active"|"pause"|"drain") (default "active")
--cert-expiry duration Validity period for node certificates (ns|us|ms|s|m|h) (default 2160h0m0s)
--data-path-addr string Address or interface to use for data path traffic (format: <ip|interface>)
--data-path-port uint32 Port number to use for data path traffic (1024 - 49151). If no value is set or is set to 0, the default port (4789) is used.
--default-addr-pool ipNetSlice default address pool in CIDR format (default [])
--default-addr-pool-mask-length uint32 default address pool subnet mask length (default 24)
--dispatcher-heartbeat duration Dispatcher heartbeat period (ns|us|ms|s|m|h) (default 5s)
--external-ca external-ca Specifications of one or more certificate signing endpoints
--force-new-cluster Force create a new cluster from current state
--listen-addr node-addr Listen address (format: <ip|interface>[:port]) (default 0.0.0.0:2377)
--max-snapshots uint Number of additional Raft snapshots to retain
--snapshot-interval uint Number of log entries between Raft snapshots (default 10000)
--task-history-limit int Task history retention limit (default 5)
To change the listening port on 2377 and the VXLAN port on 4789, you should be able to run something like:
docker swarm init --listen-addr 0.0.0.0:3377 --data-path-port 5789
I do not believe 7946 is configurable yet.
When joining other nodes to the swarm, you have the following options:
$ docker swarm join --help
Usage: docker swarm join [OPTIONS] HOST:PORT
Join a swarm as a node and/or manager
Options:
--advertise-addr string Advertised address (format: <ip|interface>[:port])
--availability string Availability of the node ("active"|"pause"|"drain") (default "active")
--data-path-addr string Address or interface to use for data path traffic (format: <ip|interface>)
--listen-addr node-addr Listen address (format: <ip|interface>[:port]) (default 0.0.0.0:2377)
--token string Token for entry into the swarm
That lets you adjust the listener address/port. I don't know if data-path-port is a global setting in the entire swarm, that feature was only released GA an hour ago, so it will need some testing to understand how it behaves.
From your comment:
I'd like to know if the docker community will consider to make 7946 configurable
Docker is open source, so you are free to submit PR's to moby/moby, libnetwork, and/or swarmkit. Not sure which repo specifically covers this implementation detail.

ebtable NFLOG cause ARP request drop

Not sure what I'm doing wrong. I have two machines connected back to back with this ebtable rule setup:
ebtables -A OUTPUT -p ARP --arp-op Request --nflog-group 100 -j DROP
I have a process listening on netlink group 100. I have the following setup on both machines /etc/network/interface lets call this one peer1:
auto e101-001-0
iface e101-001-0 inet manual
auto bond100
iface bond100 inet manual
bond-slaves e101-001-0
bond-mode 4
bond-miimon 100
iface bond100.100 inet manual
vlan-raw-device bond100
auto br100
iface br100 inet static
bridge_ports bond100.100
address 10.1.1.1
netmask 255.255.255.0
The peer2 has the same setting except the ip address is 10.1.1.2. When I try to send a ping from peer1 to peer2, I get destination unreachable. Using arp -n I see it is not resolved. I used tcpdump and saw the ARP request was made on peer1 bridge but peer2 did not receive it on it's bridge. There are three ways I know of to get a ping across:
Erase the ebtable rule on peer1.
Don't use a bridge. This works fine if I just ping between physical interface.
Add this line to the etable rules above: --nflog-range 32. I can go as low as 22 and it will still work. It starts to fail around 17.
Anything lower than 32 seems to make it stop working. My questions are:
Is there a default nflog-range size if one is not set explicitly?
Is there a kernel config that affects the nflog-range size?
I know all this is part of the kernel so if it helps I'm using the default Debian Stretch kernel (Debian 4.9.110-3+deb9u4) (no special kernel config). peer2 is running on a modified kernel and is known to be working. If I have two systems with the same specs as peer2, everything works fine, which leads me to believe it maybe a kernel config issue.
Thanks for reading.

Docker 1.9.0 "bridge" versus a custom bridge network results in difference in hosts file and SSH_CLIENT env variable

Let me first explain what I'm trying to do, as there may be multiple ways to solve this. I have two containers in docker 1.9.0:
node001 (172.17.0.2) (sudo docker run --net=<<bridge or test>> --name=node001 -h node001 --privileged -t -i -v /sys/fs/cgroup:/sys/fs/cgroup <<image>>)
node002 (172.17.0.3) (,,)
When I launch them with --net=bridge I get the correct value for SSH_CLIENT when I ssh from one to the other:
[root#node001 ~]# ssh root#172.17.0.3
root#172.17.0.3's password:
[root#node002 ~]# env | grep SSH_CLIENT
SSH_CLIENT=172.17.0.3 56194 22
[root#node001 ~]# ping -c 1 node002
ping: unknown host node002
In docker 1.8.3 I could also use the hostnames I supply when I start them, in 1.8.3 that last ping statement works!
In docker 1.9.0 I don't see anything being added in /etc/hosts, and the ping statement fails. This is a problem for me. So I tried creating a custom network...
docker network create --driver bridge test
When I launch the two containers with --net=test I get a different value for SSH_CLIENT:
[root#node001 ~]# ssh root#172.18.0.3
root#172.18.0.3's password:
[root#node002 ~]# env | grep SSH_CLIENT
SSH_CLIENT=172.18.0.1 57388 22
[root#node001 ~]# ping -c 1 node002
PING node002 (172.18.0.3) 56(84) bytes of data.
64 bytes from node002 (172.18.0.3): icmp_seq=1 ttl=64 time=0.041 ms
Note that the ip address is not node001's, it seems to represent the docker host itself. The hosts file is correct though, containing:
172.18.0.2 node001
172.18.0.2 node001.test
172.18.0.3 node002
172.18.0.3 node002.test
My current workaround is using docker 1.8.3 with the default bridge network, but I want this to work with future docker versions.
Is there any way I can customize the test network to make it behave similarly to the default bridge network?
Alternatively:
Maybe make the default bridge network write out the /etc/hosts file in docker 1.9.0?
Any help or pointers towards different solutions will be greatly appreciated..
Edit: 21-01-2016
Apparently the problem is fixed in 1.9.1, with bridge in docker 1.8 and with a custom (--net=test) in 1.9.1, now the behaviour is correct:
[root#node001 tmp]# ip route
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.5
[root#node002 ~]# env | grep SSH_CLIENT
SSH_CLIENT=172.18.0.3 52162 22
Retried in 1.9.0 to see if I wasn't crazy, and yeah there the problem occurs:
[root#node001 tmp]# ip route
default via 172.18.0.1 dev eth0
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.3
[root#node002 ~]# env|grep SSH_CLI
SSH_CLIENT=172.18.0.1 53734 22
So after remove/stop/start-ing the instances the IP-addresses were not exactly the same, but it can be easily seen that the ssh_client source ip is not correct in the last code block. Thanks #sourcejedi for making me re-check.

Firstly, I don't think it's possible to change any settings on the default network, i.e. to write /etc/hosts. You apparently can't delete the default networks, so you can't recreate them with different options.
Secondly
Docker is careful that its host-wide iptables rules fully expose containers to each other’s raw IP addresses, so connections from one container to another should always appear to be originating from the first container’s own IP address. docs.docker.com
I tried reproducing your issue with the random containers I've been playing with. Running wireshark on the bridge interface for the network, I didn't see my ping packets. From this I conclude my containers are indeed talking directly to each other; the host was not doing routing and NAT.
You need to check the routes on your client container ip route. Do you have a route for 172.18.0.2/16? If you only have a default route, it could try to send everything through the docker host. And it might get confused and do masquerading as if it was talking with the outside world.
This might happen if you're running some network configuration in your privileged container. I don't know what's happening if you're just booting it with bash though.

Can docker containers be connected to SRIOV virtual functions?

It would be awesome to make use of existing SR-IOV capable NICs. I would like to understand if a docker containers can be attached to Virtual Functions such that they communicate over the NICs hardware bridge (instead of the virtual docker0 bridge).
To be more specific, consider this scenario:
Container A is attached to VF#1
Container B is attached to VF#2
A and B are linked together and when they exchange data it should happen over the hardware bridge on NIC (instead of docker0).
Is the above supported natively in docker?
If not, can pipework help here? (I have heard pipework can do amazing things)
Examples would be very helpful.

Well, I figured that a little modification in pipework script can let us attach VFs to containers. Containers set up this way were able to ping each other without having to create macvlan subinterface or software bridges. This indicates that the hardware bridge in adapter is doing the L2 switching for them.
The change in pipework is basically something like this:
[ "$IFTYPE" = phys ] && {
[ "$VLAN" ] && {
[ ! -d "/sys/class/net/${IFNAME}.${VLAN}" ] && {
ip link add link "$IFNAME" name "$IFNAME.$VLAN" mtu "$MTU" type vlan id "$VLAN"
}
ip link set "$IFNAME" up
IFNAME=$IFNAME.$VLAN
}
# Let's not create the macvlan subinterface
# GUEST_IFNAME=ph$NSPID$CONTAINER_IFNAME
# ip link add link "$IFNAME" dev "$GUEST_IFNAME" mtu "$MTU" type macvlan mode bridge
GUEST_IFNAME=$IFNAME
ip link set "$IFNAME" up
}
ip link set "$GUEST_IFNAME" netns "$NSPID"
ip netns exec "$NSPID" ip link set "$GUEST_IFNAME" name "$CONTAINER_IFNAME"
---
Off-course a neater way would be to add a new argument ("--direct-attach" or something) to the script to treat specified interface differently

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart