Can Docker containers be connected to SR-IOV virtual functions?

It would be awesome to make use of existing SR-IOV capable NICs. I would like to understand whether Docker containers can be attached to Virtual Functions (VFs) so that they communicate over the NIC's hardware bridge (instead of the virtual docker0 bridge).
To be more specific, consider this scenario:
Container A is attached to VF#1
Container B is attached to VF#2
A and B are linked together, and when they exchange data it should go over the hardware bridge on the NIC (instead of docker0).
Is the above supported natively in docker?
If not, can pipework help here? (I have heard pipework can do amazing things)
Examples would be very helpful.

Well, I figured out that a small modification to the pipework script lets us attach VFs to containers. Containers set up this way were able to ping each other without creating a macvlan subinterface or a software bridge, which indicates that the hardware bridge in the adapter is doing the L2 switching for them.
The change in pipework is basically something like this:
[ "$IFTYPE" = phys ] && {
    [ "$VLAN" ] && {
        [ ! -d "/sys/class/net/${IFNAME}.${VLAN}" ] && {
            ip link add link "$IFNAME" name "$IFNAME.$VLAN" mtu "$MTU" type vlan id "$VLAN"
        }
        ip link set "$IFNAME" up
        IFNAME=$IFNAME.$VLAN
    }
    # Let's not create the macvlan subinterface
    # GUEST_IFNAME=ph$NSPID$CONTAINER_IFNAME
    # ip link add link "$IFNAME" dev "$GUEST_IFNAME" mtu "$MTU" type macvlan mode bridge
    GUEST_IFNAME=$IFNAME
    ip link set "$IFNAME" up
}
ip link set "$GUEST_IFNAME" netns "$NSPID"
ip netns exec "$NSPID" ip link set "$GUEST_IFNAME" name "$CONTAINER_IFNAME"
---
Of course, a neater way would be to add a new argument ("--direct-attach" or something) to the script so that it treats the specified interface differently.
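To make the sequence in the modified script easier to follow, here is a minimal Python sketch that only builds the three ip commands the direct-attach path ends up running. The function name direct_attach_commands, the VF name eth4, the PID 12345 and the default container interface name eth1 are all made up for illustration. It also assumes, as pipework does, that the container's network namespace is already reachable under the name given by its PID.

```python
import shlex


def direct_attach_commands(vf_ifname, nspid, container_ifname="eth1"):
    """Build the ip(8) command lines that hand a VF directly to a container.

    Mirrors the modified pipework branch: no macvlan subinterface is created,
    the VF itself is brought up, moved into the namespace and renamed.
    """
    nspid = str(nspid)
    return [
        # Bring the VF up on the host side first.
        ["ip", "link", "set", vf_ifname, "up"],
        # Move the VF into the container's network namespace.
        ["ip", "link", "set", vf_ifname, "netns", nspid],
        # Rename it inside the namespace to the name the container expects.
        ["ip", "netns", "exec", nspid,
         "ip", "link", "set", vf_ifname, "name", container_ifname],
    ]


if __name__ == "__main__":
    # Hypothetical VF name and container PID, printed rather than executed.
    for cmd in direct_attach_commands("eth4", 12345):
        print(shlex.join(cmd))
```

The sketch prints the commands instead of running them, so it can be tried without root privileges or an SR-IOV NIC.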

Related

Creating a network which allows communication between containers but no internet access

How can I create a docker network using testcontainers which:
allows for all containers in the network to communicate with each other
allows for containers to map ports to the host
but does not allow containers to have access to the internet
I have tried to do this using an internal network:
private Network generateInternalNetwork() {
    // Consumer which operates on the final CreateNetworkCmd which will be run to
    // make sure the 'internal' flag is set.
    Consumer<CreateNetworkCmd> cmdModifier = (createNetworkCmd) -> {
        createNetworkCmd.withInternal(true);
    };
    return Network.builder()
            .createNetworkCmdModifier(cmdModifier)
            .build();
}
However, when I run this I cannot have my port mapped. An exception is thrown:
Caused by: java.lang.IllegalArgumentException: Requested port (8024) is not mapped
If I run it without withInternal(true) it works fine but of course the containers have internet access.
I think you can get what you want by (a) creating normal networks, and then (b) adding a DROP rule to your DOCKER-USER firewall chain:
iptables -I DOCKER-USER -j DROP
In my quick experiment just now, this let me map ports from containers, but prevented the containers from accessing the internet (because this chain is called from the FORWARD chain, so it prevents containers from forwarding traffic through the host to the outside internet).
After spending a few days trying different things I have come up with a hack of a solution that kind-of works:
/**
 * Set an invalid DNS for the given container.
 * This is done as a workaround so that the container cannot access
 * the internet.
 */
void setInvalidDns() {
    GenericContainer<?> container = getContainer();
    Consumer<CreateContainerCmd> modifier = (cmd) -> {
        // Amend the config with the garbage DNS.
        String invalidDns = "255.255.255.255";
        HostConfig hostConfig = cmd.getHostConfig();
        hostConfig.withDns(invalidDns);
        cmd.withHostConfig(hostConfig);
    };
    container.withCreateContainerCmdModifier(modifier);
}
This sets the container's DNS to an invalid IP, so when you try to make an HTTP request in the container it will throw a java.net.ConnectException.

TPROXY compatibility with Docker

I'm trying to understand how TPROXY works in an effort to build a transparent proxy for Docker containers.
After lots of research I managed to create a network namespace, inject a veth interface into it and add TPROXY rules. The following script worked on a clean Ubuntu 18.04.3:
ip netns add ns0
ip link add br1 type bridge
ip link add veth0 type veth peer name veth1
ip link set veth0 master br1
ip link set veth1 netns ns0
ip addr add 192.168.3.1/24 dev br1
ip link set br1 up
ip link set veth0 up
ip netns exec ns0 ip addr add 192.168.3.2/24 dev veth1
ip netns exec ns0 ip link set veth1 up
ip netns exec ns0 ip route add default via 192.168.3.1
iptables -t mangle -A PREROUTING -i br1 -p tcp -j TPROXY --on-ip 127.0.0.1 --on-port 1234 --tproxy-mark 0x1/0x1
ip rule add fwmark 0x1 tab 30
ip route add local default dev lo tab 30
After that I launched a toy Python server from the Cloudflare blog:
import socket
IP_TRANSPARENT = 19
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.setsockopt(socket.IPPROTO_IP, IP_TRANSPARENT, 1)
s.bind(('127.0.0.1', 1234))
s.listen(32)
print("[+] Bound to tcp://127.0.0.1:1234")
while True:
    c, (r_ip, r_port) = s.accept()
    l_ip, l_port = c.getsockname()
    print("[ ] Connection from tcp://%s:%d to tcp://%s:%d" % (r_ip, r_port, l_ip, l_port))
    c.send(b"hello world\n")
    c.close()
And finally by running ip netns exec ns0 curl 1.2.4.8 I was able to observe a connection from 192.168.3.2 to 1.2.4.8 and receive the "hello world" message.
The problem is that it seems to have compatibility issues with Docker. All worked well in a clean environment, but once I started Docker things went wrong. It seemed that the TPROXY rule was no longer working. Running ip netns exec ns0 curl 192.168.3.1 gave "Connection reset" and running ip netns exec ns0 curl 1.2.4.8 timed out (both should have produced the "hello world" message). I tried restoring all iptables rules, deleting the ip routes and rules generated by Docker, and shutting down Docker, but nothing helped, even though I had not configured any networks or containers.
What is happening behind the scenes and how can I get TPROXY working normally?
I traced all processes created by Docker using strace -f dockerd, and looked for lines containing exec. Most commands are iptables commands, which I have already excluded, and the lines with modprobe looked interesting. I loaded these modules one by one and figured out that the module causing the trouble is br_netfilter.
The module enables filtering of bridged packets through iptables, ip6tables and arptables. The iptables part can be disabled by executing echo "0" | sudo tee /proc/sys/net/bridge/bridge-nf-call-iptables. After executing the command, the script worked again without impacting Docker containers.
I am still confused though. I haven't understood the consequences of such a setting. I enabled packet tracing, and the packets seemed to match the exact same set of rules with bridge-nf-call-iptables disabled and enabled. Yet with it disabled the first TCP SYN packet was delivered to the Python server, while with it enabled the packet was dropped for unknown reasons.
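A quick way to check whether this setting is in play before digging into iptables rules is to read the sysctl file named above. The following is a minimal sketch; the helper name bridge_nf_call_iptables is made up, and it assumes the file is simply absent when br_netfilter is not loaded.

```python
from pathlib import Path

# Path taken from the answer above; present only once br_netfilter is loaded.
SYSCTL = Path("/proc/sys/net/bridge/bridge-nf-call-iptables")


def bridge_nf_call_iptables(path=SYSCTL):
    """Return True if bridged packets are passed to iptables, False if not.

    Returns None when the file does not exist, which is assumed here to mean
    that the br_netfilter module is not loaded.
    """
    try:
        return Path(path).read_text().strip() == "1"
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    state = bridge_nf_call_iptables()
    if state is None:
        print("br_netfilter does not appear to be loaded")
    else:
        print("bridge-nf-call-iptables is", "enabled" if state else "disabled")
```

The path is a parameter so the function can be exercised against a temporary file on machines that have no bridge module loaded.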
Try running docker with -p 1234
"By default, when you create a container, it does not publish any of its ports to the outside world. To make a port available to services outside of Docker, or to Docker containers which are not connected to the container’s network, use the --publish or -p flag."
https://docs.docker.com/config/containers/container-networking/

Docker Bridge Conflicts with Host Network

Docker seems to be creating a bridge after a container starts running, and that bridge then conflicts with my host network. This is not the default bridge docker0, but another bridge that is created after a container has started. I am able to configure the default bridge according to the older user guide (https://docs.docker.com/v17.09/engine/userguide/networking/default_network/custom-docker0/), but I do not know how to configure this other bridge so that it does not conflict with 172.17.
The issue is that my container cannot access other systems on the host network when this bridge becomes active.
Any ideas?
Version of docker:
Version 18.03.1-ce-mac65 (24312)
This is the bridge that gets created. Sometimes it is not 172.17, but sometimes it is.
br-f7b50f41d024 Link encap:Ethernet HWaddr 02:42:7D:1B:05:A3
inet addr:172.17.0.1 Bcast:172.17.255.255 Mask:255.255.0.0
When docker networks are created (e.g. using docker network create or indirectly through docker-compose) without explicitly specifying a subnet range, dockerd allocates a new /16 network, starting from 172.N.0.0/16, where N is a number that is incremented (e.g. N=17, N=18, N=19, N=20, ...). A given N is skipped if a docker network (a custom one, or the default docker bridge) already exists in the range.
You can specify explicitly a safe IP range when creating a docker bridge (i.e. one that excludes the host ips in your network) on the CLI. But usually bridge networks are created automatically by docker-compose with default blocks. To exclude these IPs reliably would require modifying every docker-compose.yaml file you encounter. It's bad practice to include host-specific things inside a compose file.
Instead, you can play with the networks that docker considers allocated, to force dockerd to "skip" subnets. I'm outlining four methods below:
Method #0 -- configure the pool of ips in the daemon config
If your docker version is recent enough (I have not checked the minimum version), and you have permissions to configure the docker daemon's command line arguments, you can try passing --default-address-pool ARG options to the dockerd command. For example:
# allocate /24 subnets with the given CIDR prefix only.
# note that this prefix excludes 172.17.*
--default-address-pool base=172.24.0.0/13,size=24
You can add this setting in one of the etc files, /etc/default/docker or /etc/sysconfig/docker, depending on your distribution. This parameter can also be set in daemon.json.
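As a rough illustration of what this pool gives you, the sketch below uses Python's ipaddress module to enumerate the /24 subnets inside 172.24.0.0/13 and prints an equivalent daemon.json fragment using the default-address-pools key. The variable names are arbitrary, and the fragment is only a starting point to be merged into your own daemon.json.

```python
import ipaddress
import json

# Pool from the command-line example above.
pool = {"base": "172.24.0.0/13", "size": 24}

base = ipaddress.ip_network(pool["base"])
subnets = list(base.subnets(new_prefix=pool["size"]))

# Equivalent daemon.json fragment.
print(json.dumps({"default-address-pools": [pool]}, indent=2))

# How many bridge networks the pool can hold, and the range it spans.
print(len(subnets), "subnets, from", subnets[0], "to", subnets[-1])

# The pool stays clear of the conflicting 172.17.0.0/16 range.
print(base.overlaps(ipaddress.ip_network("172.17.0.0/16")))
```

The pool spans 172.24.0.0 through 172.31.255.255, so every network dockerd allocates from it is a /24 outside 172.17.*.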
Method #1 -- create a dummy placeholder network
You can prevent the entire 172.17.0.0/16 from being used by dockerd (in future bridge networks) by creating a very small docker network anywhere inside 172.17.0.0/16.
Find 4 consecutive IPs in 172.17.* that you know are not in use in your host network, and sacrifice them in a "tombstone" docker bridge. Below, I'm assuming the ips 172.17.253.0, 172.17.253.1, 172.17.253.2, 172.17.253.3 (i.e. 172.17.253.0/30) are unused in your host network.
docker network create --driver=bridge --subnet 172.17.253.0/30 tombstone
# created: c48327b0443dc67d1b727da3385e433fdfd8710ce1cc3afd44ed820d3ae009f5
Note the /30 suffix here, which defines a block of 4 different IPs. In theory, the smallest subnet is a /31, which has only 2 addresses. Docker asks for a /30 minimum, probably because a /30 leaves two usable host addresses once the network identifier and the broadcast address are set aside: one for the gateway and one for a container. I picked .253.0 arbitrarily; you should pick something that's not in use in your environment. Also note that the identifier tombstone is nothing special; you can rename it to anything that will help you remember why it's there when you find it again several months later.
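The address arithmetic for the /30 can be checked with Python's ipaddress module. This is just a sketch of the subnet math, using the 172.17.253.0/30 block assumed above. It says nothing about how Docker itself assigns the addresses.

```python
import ipaddress

tombstone = ipaddress.ip_network("172.17.253.0/30")

# Total number of addresses in the block.
print(tombstone.num_addresses)

# Every address in the block, including network and broadcast.
print([str(ip) for ip in tombstone])

# The addresses left for hosts once network and broadcast are excluded.
print([str(ip) for ip in tombstone.hosts()])

# The netmask, as it appears in the Genmask column of route -n.
print(tombstone.netmask)
```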
Docker will modify your routing table to send traffic for these 4 IPs to go through that new bridge instead of the host network:
# output of route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.5.1     0.0.0.0         UG    0      0        0 eth1
172.17.253.0    0.0.0.0         255.255.255.252 U     0      0        0 br-c48327b0443d
172.20.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.5.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
Note: Traffic for 172.17.253.{0,1,2,3} goes through the tombstone docker bridge just created (br-c4832...). Traffic for any other IP in the 172.17.* would go through the default route (host network). My docker bridge (docker0) is on 172.20.0.1, which may appear unusual -- I've modified bip in /etc/docker/daemon.json to do that. See this page for more details.
The twist: if there exists a bridge occupying even a subportion of a /16, new bridges created will skip that range. If we create new docker networks, we can see that the rest of 172.17.0.0/16 is skipped, because the range is not entirely available.
docker network create foo_test
# c9e1b01f70032b1eff08e48bac1d5e2039fdc009635bfe8ef1fd4ca60a6af143
docker network create bar_test
# 7ad5611bfa07bda462740c1dd00c5007a934b7fc77414b529d0ec2613924cc57
The resulting routing table:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.5.1     0.0.0.0         UG    0      0        0 eth1
172.17.253.0    0.0.0.0         255.255.255.252 U     0      0        0 br-c48327b0443d
172.18.0.0      0.0.0.0         255.255.0.0     U     0      0        0 br-c9e1b01f7003
172.19.0.0      0.0.0.0         255.255.0.0     U     0      0        0 br-7ad5611bfa07
172.20.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
192.168.5.0     0.0.0.0         255.255.255.0   U     0      0        0 eth1
Notice that the rest of the IPs in 172.17.0.0/16 have not been used. The new networks reserved 172.18.* and 172.19.* instead. Traffic to any of your conflicting IPs outside that tombstone network would go via your host network.
You would have to keep that tombstone network around in docker, but not use it in your containers. It's a dummy placeholder network.
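The skipping behaviour seen in the two routing tables can be modelled in a few lines of Python. This is only a sketch of the rule as described in this answer, not dockerd's actual allocator: the function name next_free_subnet is made up, and the candidate range 172.17 to 172.31 is an assumption.

```python
import ipaddress


def next_free_subnet(existing, first_n=17, last_n=31):
    """Return the first 172.N.0.0/16 that overlaps none of the existing networks.

    A candidate /16 is skipped even if only a small part of it is taken,
    which is what makes the tiny tombstone network effective.
    """
    used = [ipaddress.ip_network(net) for net in existing]
    for n in range(first_n, last_n + 1):
        candidate = ipaddress.ip_network(f"172.{n}.0.0/16")
        if not any(candidate.overlaps(net) for net in used):
            return candidate
    return None  # nothing free in the assumed range


if __name__ == "__main__":
    # Tombstone network and docker0 from the routing tables above.
    existing = ["172.17.253.0/30", "172.20.0.0/16"]
    for _ in range(2):
        subnet = next_free_subnet(existing)
        print(subnet)
        existing.append(str(subnet))
```

With the tombstone and docker0 as the starting state, the model picks 172.18.0.0/16 and then 172.19.0.0/16, matching the foo_test and bar_test bridges in the second routing table.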
Method #2 -- bring down the conflicting bridge network
If you wish to temporarily avoid the IP conflict, you can bring the conflicting docker bridge down using ip: ip link set dev br-xxxxxxx down (where br-xxxxxxx is the name of the bridge network from route -n or ip link show). This removes the corresponding bridge entry from the routing table, without modifying any of the docker metadata.
This is arguably not as good as the method above, because you'd have to bring down the interface possibly every time dockerd starts, and it would interfere with your container networking if there was any container using that bridge.
If method 1 stops working in the future (e.g. because docker tries to be smarter and reuse unused parts of an ip block), you could combine both approaches: e.g. create a large tombstone network with the entire /16, not use it in any container, and then bring its corresponding br-x device down.
Method #3 -- reconfigure your docker bridge to occupy a subportion of the conflicting /16
As a slight variation of the above, you could make the default docker bridge overlap with a region of 172.17.*.* that is not used in your host network. You can change the default docker bridge subnet by changing the bridge ip (i.e. bip key) in /etc/docker/daemon.json (See this page for more details). Just make it a subregion of your /16, e.g. in a /24 or smaller.
I've not tested this, but I presume any new docker network would skip the remainder of 172.17.0.0/16 and allocate an entirely different /16 for each new bridge.
The bridge was created from docker-compose, which can be configured within the compose file.
Answer found here: Docker create two bridges that corrupts my internet access

Mesh network with OpenWrt: clients can not ping each other

I am building a WiFi mesh network using OpenWrt 802.11s and TP-Link WR703N mini routers for my final year project. OLSR is running as the routing protocol. I am using Linux.
Total of 4 routers:
Node    LAN IP address  MAC  Mesh IP address
Node A  192.168.10.1    AO   192.168.5.1
Node B  192.168.11.1    6E   192.168.5.2
Node C  192.168.12.1    42   192.168.5.3
Node D  192.168.13.1    54   192.168.5.4
Above you can see the LAN IP address and the mesh IP address of each router.
Client X is connected to Node A with a cable and is assigned the IP address 192.168.10.100. Client Y is connected to Node D and is assigned the IP address 192.168.13.50.
When I try to ping X from Y, it does not work. I also can't ping the mesh IP addresses from the operating system terminal. But when I am logged in to OpenWrt via a terminal, I am able to ping any IP address within the mesh.
I have captured some 802.11s beacon frames, which I am adding to the post.
If you look at the very end:
Capability: 0x01
...
.... 0... = Mesh Forwarding: No
...
I feel like that's the problem, because I have a previous thesis paper in which the student who did that project had this setting set to Yes, and it was working.
So, does anybody have any idea?
Additionally, I checked with Wireshark that OLSR is working correctly and transmits HELLO messages, TC messages, etc.
Config files from one of the routers (wireless, network, olsrd); they are all the same except for the IP addresses:
root@OpenWrt:/etc/config# cat wireless
config wifi-device 'radio0'
    option type 'mac80211'
    option macaddr '14:cf:92:3c:67:54'
    option hwmode '11ng'
    option htmode 'HT20'
    list ht_capab 'SHORT-GI-20'
    list ht_capab 'SHORT-GI-40'
    list ht_capab 'RX-STBC1'
    list ht_capab 'DSSS_CCK-40'
    option country 'IE'
    option channel '11'
    option txpower '7'
config wifi-iface
    option device 'radio0'
    option mesh_id 'mesh_OpenWrt'
    option mode 'mesh'
    option network 'mesh'
    option encryption 'none'
root@OpenWrt:/etc/config# cat network
config interface 'loopback'
    option ifname 'lo'
    option proto 'static'
    option ipaddr '127.0.0.1'
    option netmask '255.0.0.0'
config interface 'lan'
    option ifname 'eth0'
    option type 'bridge'
    option proto 'static'
    option netmask '255.255.255.0'
    option ipaddr '192.168.13.1'
    option gateway '192.168.5.4'
config interface 'mesh'
    option _orig_ifname 'wlan0'
    option _orig_bridge 'false'
    option proto 'static'
    option ipaddr '192.168.5.4'
    option netmask '255.255.255.0'
root@OpenWrt:/etc/config# cat olsrd
config olsrd
    option IpVersion '4'
    option FIBMetric 'flat'
    option LinkQualityLevel '2'
    option LinkQualityAlgorithm 'etx_ff'
    option OlsrPort '698'
    option Willingness '3'
    option NatThreshold '1.0'
config LoadPlugin
    option library 'olsrd_arprefresh.so.0.1'
config LoadPlugin
    option library 'olsrd_dyn_gw.so.0.5'
config LoadPlugin
    option library 'olsrd_httpinfo.so.0.1'
    option port '1978'
    list Net '0.0.0.0 0.0.0.0'
config LoadPlugin
    option library 'olsrd_nameservice.so.0.3'
config LoadPlugin
    option library 'olsrd_txtinfo.so.0.1'
    option accept '0.0.0.0'
config Interface
    option ignore '0'
    option Mode 'mesh'
    option interface 'mesh'
config InterfaceDefaults
    option Mode 'mesh'
I believe there will be one bridge interface, br-lan, and two interfaces, wlan0 and wlan1.
On Node A:
1. Add the two interfaces wlan0 and wlan1 to the bridge br-lan:
wlan0<----[br-lan]--->wlan1
Make wlan0 a mesh point and wlan1 an AP.
2. Make these changes in /etc/config/network:
    option type 'bridge'
    option proto 'static'
    option netmask '255.255.255.0'
    option ipaddr '192.168.13.1'
3. Run the DHCP server on br-lan of Node A.
On the other nodes, make this change in /etc/config/network:
    option proto 'dhcp'
Now Nodes B, C and D are all in the same DHCP subnet as Node A (192.168.13.x): DHCP clients run on Nodes B, C and D, and the DHCP server runs on Node A.
This should resolve your end-to-end ping issue.
Another approach, if you want all nodes to have internet access, is a setup like this:
ISP<----ETH--->wan[NodeA]-wlan0<---mesh-->wlan0-[NodeB]<---mesh-->wlan0-[NodeC]<---mesh--->wlan0-[NodeD]-wlan1 <---wifi--->sta/pc
All nodes get their IP addresses via DHCP, so a DHCP client must run on the br-lan of every node.
On Node A (WAN interface eth0.2):
- Add all the interfaces eth0.2, wlan0 and wlan1 to the bridge br-lan.
- Make these changes in /etc/config/network:
    option type 'bridge'
    option proto 'dhcp'
    # option netmask '255.255.255.0'   # comment out this line
    # option ipaddr '192.168.13.1'     # comment out this line
The rest of the nodes are configured as in the previous approach.
This will resolve your end-to-end ping issue, and every node and STA will also have internet access.

route all traffic over gre tunnel

I have an openvswitch sw1 with subnet 10.207.39.0/24 that has LXC containers attached, and the same setup on another physical server; I have successfully connected the two using a GRE tunnel. However, the LXC containers have additional ports on additional openvswitches, e.g. sw4 with subnet 192.220.39.0/24. I want to push that traffic over the single GRE tunnel on sw1, because there is only one physical interface and it is not possible to have a GRE tunnel on each openvswitch with the same physical interface IP addresses as endpoints. Is it possible to push the traffic on the other openvswitches over the GRE tunnel on sw1? Or is there a better way to connect multiple subnets in LXC containers on two physical hosts? Thanks.
I solved this "myself" - with help from two links provided below - (after sleeping on it and relentless google searches over several frustrating days).
I realize the solution is pretty simple and would be clear to a networking professional. I am an Oracle DBA and only know as much networking as I need to work with orabuntu-lxc software, LXC containers, and Oracle software, so please keep that in mind if the below is "obvious" - it wasn't obvious to me in my network ignorance.
I got the clue on how to solve the actual steps from this blog post:
http://www.cnblogs.com/popsuper1982/p/3800548.html
I confirmed that any subnet should be routable over a GRE tunnel from this blog post (which gave me hope to keep working towards a solution):
https://supportforums.adtran.com/thread/1408
In particular the author stated in the adtran comment that "GRE tunnels have no limitation on the types of traffic which can traverse it. It can route multiple subnets without multiple tunnels."
That post told me that the solution was likely a routing solution and that only one GRE tunnel would be needed for this use case.
Note that this feature of "no limitation" on the types of traffic is great for Oracle RAC because we need to be able to send multicast over the GRE tunnel for RAC.
This use case:
I am building an Oracle RAC infrastructure to run in LXC Linux containers. I have a public network 10.207.39.0/24 on openvswitch sw1 and a private RAC interconnect network 192.220.39.0/24 on openvswitch sw4. I want to be able to build the RAC in LXC linux containers that span multiple physical hosts and so I created a GRE tunnel to connect the 10.207.39.1 tunnel endpoint on colossus to 10.207.39.5 tunnel endpoint on guardian.
Here are the setup details:
Host "guardian":
LAN wireless physical network interface: wlp4s0 (IP 192.168.1.11)
sw1 10.207.39.5
sw4 192.220.39.5
Host "colossus":
LAN wireless physical network interface: wlp4s0 (IP 192.168.1.15)
sw1 10.207.39.1
sw4 192.220.39.1
Step 1:
Create GRE tunnel between sw1 openvswitches on both physical hosts with physical wireless LAN network interface end points:
Host "guardian": Create gre tunnel phys hosts (guardian --> colossus).
sudo ovs-vsctl add-port sw1 gre0 -- set interface gre0 type=gre options:remote_ip=192.168.1.15
Host "colossus": Create gre tunnel phys hosts (colossus --> guardian).
sudo ovs-vsctl add-port sw1 gre0 -- set interface gre0 type=gre options:remote_ip=192.168.1.11
Step 2:
Route the 192.220.39.0/24 network over the established GRE tunnel as shown below:
Host "guardian": route 192.220.39.0/24 openvswitch sw4 over GRE tunnel:
sudo route add -net 192.220.39.0/24 gw 10.207.39.5 dev sw1
Host "colossus": route 192.220.39.0/24 openvswitch sw4 over GRE tunnel:
sudo route add -net 192.220.39.0/24 gw 10.207.39.1 dev sw1
Note: To add additional subnets repeat step 2 for each subnet.
Note on MTU:
Also, you have to allow for the GRE encapsulation in the MTU if you want to ssh over these tunnels.
Therefore, in the above example, the MTU for the main GRE tunnel connecting the hosts needs to be set to 1420, which leaves 80 bytes for the GRE encapsulation headers.
MTU on the LXC container virtual interfaces on the sw1 switches need to be set to MTU=1420 in the LXC container config files.
MTU on the LXC container virtual interfaces on the sw4 switches need to be set to MTU=1420 in the LXC container config files.
Note that the MTU on the openvswitches sw1 and sw4 should automatically follow the MTU on the LXC interfaces as long as ALL LXC virtual interfaces are set to the new lower MTU value, so explicitly setting the MTU on the openvswitches sw1 and sw4 themselves should not be necessary.
If you still run into issues with SSH over the tunnels, while ping works across hosts and containers, then re-check all MTU settings on the virtual interfaces and the openvswitches.
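The MTU arithmetic can be written out as a small sketch. It assumes the physical interface uses the common 1500-byte MTU, which is not stated above, and the per-header sizes are the standard ones for an outer IPv4 header, a GRE header with an optional key, and an inner Ethernet header. Whether all of them apply depends on how the tunnel is configured. The names below are made up for illustration.

```python
# Assumed MTU of the physical interface (not stated in the post).
PHYSICAL_MTU = 1500

# Headroom reserved in the post for the GRE encapsulation.
RESERVED = 80

# Standard header sizes in bytes; which ones apply depends on the tunnel setup.
OVERHEAD = {
    "outer IPv4 header": 20,
    "GRE base header": 4,
    "GRE key field (optional)": 4,
    "inner Ethernet header": 14,
}


def inner_mtu(physical_mtu=PHYSICAL_MTU, reserved=RESERVED):
    """MTU to configure on the container interfaces behind the tunnel."""
    return physical_mtu - reserved


if __name__ == "__main__":
    print("configured MTU:", inner_mtu())
    print("estimated overhead:", sum(OVERHEAD.values()))
    print("spare headroom:", RESERVED - sum(OVERHEAD.values()))
```

Under these assumptions the 80-byte reservation is a conservative margin, so 1420 is a safe value rather than the largest one that would work.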

Resources