ebtables NFLOG causes ARP request drop - netfilter

Not sure what I'm doing wrong. I have two machines connected back to back with this ebtables rule setup:
ebtables -A OUTPUT -p ARP --arp-op Request --nflog-group 100 -j DROP
I have a process listening on netlink group 100. I have the following setup in /etc/network/interfaces on both machines; let's call this one peer1:
auto e101-001-0
iface e101-001-0 inet manual

auto bond100
iface bond100 inet manual
    bond-slaves e101-001-0
    bond-mode 4
    bond-miimon 100

iface bond100.100 inet manual
    vlan-raw-device bond100

auto br100
iface br100 inet static
    bridge_ports bond100.100
    address 10.1.1.1
    netmask 255.255.255.0
Peer2 has the same settings except the IP address is 10.1.1.2. When I try to send a ping from peer1 to peer2, I get destination unreachable. Using arp -n I see the address is not resolved. I used tcpdump and saw the ARP request was made on peer1's bridge, but peer2 did not receive it on its bridge. There are three ways I know of to get a ping across:
Erase the ebtable rule on peer1.
Don't use a bridge. This works fine if I just ping between the physical interfaces.
Add this option to the ebtables rule above: --nflog-range 32 (the resulting rule is sketched after the questions below). I can go as low as 22 and it will still work; it starts to fail around 17.
Anything lower than 32 seems to make it stop working. My questions are:
Is there a default nflog-range size if one is not set explicitly?
Is there a kernel config that affects the nflog-range size?
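For reference, the rule from the third workaround presumably ends up looking like this (a sketch; it is simply the original rule with the range option appended):
ebtables -A OUTPUT -p ARP --arp-op Request --nflog-group 100 --nflog-range 32 -j DROP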
I know all this is part of the kernel, so if it helps, I'm using the default Debian Stretch kernel (Debian 4.9.110-3+deb9u4) with no special kernel config. Peer2 is running on a modified kernel and is known to be working. If I have two systems with the same specs as peer2, everything works fine, which leads me to believe it may be a kernel config issue.
Thanks for reading.

Related

IPv6 binding error: Cannot assign requested address

As I understand it, IPv6 addresses are allocated in blocks. Each machine gets a range of IPv6 addresses, and any IPv6 address in that range would point to it.
Basis for this assumption:
https://stackoverflow.com/a/15266701/681671
The /64 is the prefix length. It is the number of bits in the address that is fixed. So a /64 indicates that the first 64 bits of the 128-bit IPv6 address are fixed. The remaining bits (64 in this case) are flexible, and you can use all of them. This means that when your ISP gives you a /64 they are giving you 2^64 addresses (that is 18,446,744,073,709,551,616 addresses).
Edit: I confirmed using Wireshark that the packets sent to any IP in that /64 range do get routed to my server.
Looking at this line from ifconfig output
inet6 2a01:2e8:d2c:e24c::1 prefixlen 64 scopeid 0x0<global>
I conclude that all IPv6 addresses with 2a01:2e8:d2c:e24c prefix will point to my machine.
However I am unable to bind any service to any IPv6 address other than
2a01:2e8:d2c:e24c:0000:0000:0000:0001
nc -l 2a01:2e8:d2c:e24c:0000:0000:0000:0002 80 Does not work
nc -l 2a01:2e8:d2c:e24c:0000:0000:0001:0001 80 Does not work
nc -l 2a01:2e8:d2c:e24c:1000:0000:0000:0001 80 Does not work
nc -l 2a01:2e8:d2c:e24c:0000:0000:0000:0001 80 Only this works
nc -l <IP> <PORT> opens up a simple TCP server on the specified IP and port.
The error I get is nc: Cannot assign requested address
I want to run multiple instances of a service on the same port but different IPv6 addresses. Since public IPv6 addresses are abundantly available to each machine, I thought of utilizing them.
ifconfig:
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 88.77.66.55 netmask 255.255.255.255 broadcast 88.77.66.55
inet6 fe80::9300:ff:fe33:64c1 prefixlen 64 scopeid 0x20<link>
inet6 2a01:2e8:d2c:e24c::1 prefixlen 64 scopeid 0x0<global>
ether 96:00:00:4e:31:e4 txqueuelen 1000 (Ethernet)
RX packets 26788391 bytes 21199864639 (21.1 GB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 21940989 bytes 20045216536 (20.0 GB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
OS: Ubuntu 18.04
VPS Host: Hetzner
I am actually trying to run multiple nginx docker containers mapped to port 80 on different IPv6 addresses of the host. That is when I encountered the problem. The nc -l test is just to simplify the problem description.
I conclude that all IPv6 addresses with 2a01:2e8:d2c:e24c prefix will point to my machine
That assumption is wrong. The prefix length has the same meaning as the IPv4 netmask. It determines which addresses are on your local network, not which addresses belong to your local host.
This is all you need:
ip route add local 2a01:2e8:d2c:e24c::/64 dev lo
Credit: Can I bind a (large) block of addresses to an interface?
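After adding that local route, the earlier nc tests should start working for any address in the /64 (a sketch reusing an address from the question; note the route added with ip route is not persistent across reboots unless you add it to your network configuration):
nc -l 2a01:2e8:d2c:e24c::2 80    # now binds instead of failing with "Cannot assign requested address"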
To re-iterate and expand upon Sander's answer:
You must bind each individual IP address to the NIC (network interface card) before it will accept the traffic and send it up the stack.
Wireshark sets the NIC in promiscuous mode, i.e. it captures all traffic received.
There is a practical limit to how many IP addresses can be assigned on a system, MUCH less than the 2^64 required by the OP's post. Storing the addresses alone would take more memory than any system has.
Unlike IPv4, which has 2^24 loopback addresses in 127.0.0.0/8, IPv6 only defines the single loopback address ::1/128.
The only practical solution would be to treat the entire IPv6 subnet as a "reverse" NAT using IP masquerading (NAT). This solution would require a second instance acting as the NAT "router", with rules that rewrite the destination addresses to a single address/port.

Unnecessary Network Commands

I'm running a virtual machine on GCE with CentOS 7. I've configured the machine with two network interfaces. When doing so, the user is required to enter the following commands to configure eth1 (every interface except eth0 requires this approach). On my machine, eth1's gateway is 10.140.0.1.
sudo ifconfig eth1 10.140.0.2 netmask 255.255.255.255 broadcast 10.140.0.2 mtu 1430
sudo echo "1 rt1" | sudo tee -a /etc/iproute2/rt_tables # (sudo su - first if permission denied)
sudo ip route add 10.140.0.1 src 10.140.0.2 dev eth1
sudo ip route add default via 10.140.0.1 dev eth1 table rt1
sudo ip rule add from 10.140.0.2/20 table rt1
sudo ip rule add to 10.140.0.2/20 table rt1
I have used the above with success, but the configuration is not persistent. I know it's possible to do so, but I first need to fully understand what the above is actually doing (breaking my problem into smaller parts).
sudo ifconfig eth1 10.140.0.2 netmask 255.255.255.255 broadcast 10.140.0.2 mtu 1430
This command seems to be telling eth1 at 10.140.0.2 to broadcast on the same internal IP. It's also setting MTU to 1430, which is strange because the other interfaces are set to 1460. Is this command really needed?
sudo echo "1 rt1" | sudo tee -a /etc/iproute2/rt_tables # (sudo su - first if permission denied)
From what I read, this command is appending "1 rt1" to the file rt_tables. If this is run once, does it need to be run each time the network comes up? Seems like it only needs to be run once.
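For what it's worth, a guarded variant that only appends the entry when it is missing would be safe to re-run (a sketch, keeping the same table name):
grep -q "rt1" /etc/iproute2/rt_tables || echo "1 rt1" | sudo tee -a /etc/iproute2/rt_tables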
sudo ip route add 10.140.0.1 src 10.140.0.2 dev eth1
sudo ip route add default via 10.140.0.1 dev eth1 table rt1
sudo ip rule add from 10.140.0.2/20 table rt1
sudo ip rule add to 10.140.0.2/20 table rt1
I know these commands add non-persistent rules and routes to the network configuration. Once I know the answers to the above, I will come back to the approach of making this persistent.
Referring to your question on the Google Groups thread, as I mentioned in that post:
IP routes and IP rules need to be made persistent so that they are not lost after a VM reboot or a network service restart. Depending on the operating system, the configuration files required to make the routes persistent can differ. Here is a Stack Exchange thread for CentOS 7, mentioning the files /etc/sysconfig/network-scripts/route-ethX and /etc/sysconfig/network-scripts/rule-ethX to keep the IP routes and rules persistent. Here is the CentOS documentation for persistent static routes.
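For CentOS 7, a minimal sketch of what those files might contain, reusing the addresses and table name from the commands above (the exact file names depend on the interface, and each line follows the ip route / ip rule argument format):
# /etc/sysconfig/network-scripts/route-eth1
10.140.0.1 src 10.140.0.2 dev eth1
default via 10.140.0.1 dev eth1 table rt1
# /etc/sysconfig/network-scripts/rule-eth1
from 10.140.0.2/20 table rt1
to 10.140.0.2/20 table rt1
The "1 rt1" entry in /etc/iproute2/rt_tables still has to exist, but that file is persistent once written.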

Docker Bridge Conflicts with Host Network

Docker seems to be creating a bridge after a container starts running, and that bridge then conflicts with my host network. This is not the default bridge docker0, but another bridge that is created after a container has started. I am able to configure the default bridge according to the older user guide link https://docs.docker.com/v17.09/engine/userguide/networking/default_network/custom-docker0/; however, I do not know how to configure this other bridge so that it does not conflict with 172.17.
The current issue is that my container cannot access other systems on the host network when this bridge becomes active.
Any ideas?
Version of docker:
Version 18.03.1-ce-mac65 (24312)
This is the bridge that gets created. Sometimes it is not 172.17, but sometimes it is.
br-f7b50f41d024 Link encap:Ethernet HWaddr 02:42:7D:1B:05:A3
inet addr:172.17.0.1 Bcast:172.17.255.255 Mask:255.255.0.0
When docker networks are created (e.g. using docker network create or indirectly through docker-compose) without explicitly specifying a subnet range, dockerd allocates a new /16 network, starting from 172.N.0.0/16, where N is a number that is incremented (e.g. N=17, N=18, N=19, N=20, ...). A given N is skipped if a docker network (a custom one, or the default docker bridge) already exists in the range.
You can explicitly specify a safe IP range (i.e. one that excludes the host IPs in your network) when creating a docker bridge on the CLI. But usually bridge networks are created automatically by docker-compose with default blocks, so excluding these IPs reliably would require modifying every docker-compose.yaml file you encounter, and it's bad practice to include host-specific things inside a compose file.
Instead, you can play with the networks that docker considers allocated, to force dockerd to "skip" subnets. I'm outlining four methods below:
Method #0 -- configure the pool of ips in the daemon config
If your docker version is recent enough (TODO check minimum version), and you have permissions to configure the docker daemon's command line arguments, you can try passing --default-address-pool ARG options to the dockerd command. Ex:
# allocate /24 subnets with the given CIDR prefix only.
# note that this prefix excludes 172.17.*
--default-address-pool base=172.24.0.0/13,size=24
You can add this setting in one of the etc files: /etc/default/docker, or in /etc/sysconfig/docker, depending on your distribution. There is also a way to set this parameter in daemon.json (see syntax)
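For example, a sketch of the equivalent daemon.json entry (assuming a dockerd recent enough to support default-address-pools; the pool values mirror the CLI example above):
{
  "default-address-pools": [
    { "base": "172.24.0.0/13", "size": 24 }
  ]
}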
Method #1 -- create a dummy placeholder network
You can prevent the entire 172.17.0.0/16 from being used by dockerd (in future bridge networks) by creating a very small docker network anywhere inside 172.17.0.0/16.
Find 4 consecutive IPs in 172.17.* that you know are not in use in your host network, and sacrifice them in a "tombstone" docker bridge. Below, I'm assuming the ips 172.17.253.0, 172.17.253.1, 172.17.253.2, 172.17.253.3 (i.e. 172.17.253.0/30) are unused in your host network.
docker network create --driver=bridge --subnet 172.17.253.0/30 tombstone
# created: c48327b0443dc67d1b727da3385e433fdfd8710ce1cc3afd44ed820d3ae009f5
Note the /30 suffix here, which defines a block of 4 different IPs. In theory, the smallest valid network subnet should be a /31 which consists of a total of 2 IPs (network identifier + broadcast). Docker asks for a /30 minimum, probably to account for a gateway host, and another container. I picked .253.0 arbitrarily, you should pick something that's not in use in your environment. Also note that the identifier tombstone is nothing special, you can rename it to anything that will help you remember why it's there when you find it again several months later.
Docker will modify your routing table to send traffic for these 4 IPs to go through that new bridge instead of the host network:
# output of route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.5.1 0.0.0.0 UG 0 0 0 eth1
172.17.253.0 0.0.0.0 255.255.255.252 U 0 0 0 br-c48327b0443d
172.20.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
192.168.5.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
Note: Traffic for 172.17.253.{0,1,2,3} goes through the tombstone docker bridge just created (br-c4832...). Traffic for any other IP in the 172.17.* would go through the default route (host network). My docker bridge (docker0) is on 172.20.0.1, which may appear unusual -- I've modified bip in /etc/docker/daemon.json to do that. See this page for more details.
The twist: if there exists a bridge occupying even a subportion of a /16, new bridges created will skip that range. If we create new docker networks, we can see that the rest of 172.17.0.0/16 is skipped, because the range is not entirely available.
docker network create foo_test
# c9e1b01f70032b1eff08e48bac1d5e2039fdc009635bfe8ef1fd4ca60a6af143
docker network create bar_test
# 7ad5611bfa07bda462740c1dd00c5007a934b7fc77414b529d0ec2613924cc57
The resulting routing table:
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.5.1 0.0.0.0 UG 0 0 0 eth1
172.17.253.0 0.0.0.0 255.255.255.252 U 0 0 0 br-c48327b0443d
172.18.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-c9e1b01f7003
172.19.0.0 0.0.0.0 255.255.0.0 U 0 0 0 br-7ad5611bfa07
172.20.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
192.168.5.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
Notice that the rest of the IPs in 172.17.0.0/16 have not been used. The new networks reserved 172.18.0.0/16 and 172.19.0.0/16. Sending traffic to any of your conflicting IPs outside that tombstone network would go via your host network.
You would have to keep that tombstone network around in docker, but not use it in your containers. It's a dummy placeholder network.
Method #2 -- bring down the conflicting bridge network
If you wish to temporarily avoid the IP conflict, you can bring the conflicting docker bridge down using ip: ip link set dev br-xxxxxxx down (where xxxxxx represents the name of the bridge network from route -n or ip link show). This will have the effect of removing the corresponding bridge routing entry in the routing table, without modifying any of the docker metadata.
This is arguably not as good as the method above, because you'd have to bring down the interface possibly every time dockerd starts, and it would interfere with your container networking if there was any container using that bridge.
If method 1 stops working in the future (e.g. because docker tries to be smarter and reuse unused parts of an ip block), you could combine both approaches: e.g. create a large tombstone network with the entire /16, not use it in any container, and then bring its corresponding br-x device down.
Method #3 -- reconfigure your docker bridge to occupy a subportion of the conflicting /16
As a slight variation of the above, you could make the default docker bridge overlap with a region of 172.17.*.* that is not used in your host network. You can change the default docker bridge subnet by changing the bridge ip (i.e. bip key) in /etc/docker/daemon.json (See this page for more details). Just make it a subregion of your /16, e.g. in a /24 or smaller.
I've not tested this, but I presume any new docker network would skip the remainder of 172.17.0.0/16 and allocate an entirely different /16 for each new bridge.
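A sketch of what that could look like in /etc/docker/daemon.json (the /24 here is an arbitrary example; pick a subregion of 172.17.0.0/16 that is unused in your host network):
{
  "bip": "172.17.200.1/24"
}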
The bridge was created by docker-compose, and the network it creates can be configured within the compose file (see the sketch below).
Answer found here: Docker create two bridges that corrupts my internet access
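For completeness, a compose file can pin the subnet of the network it creates; a sketch (the subnet value is an arbitrary example, and host-specific values like this arguably do not belong in a shared compose file):
networks:
  default:
    ipam:
      config:
        - subnet: 172.24.5.0/24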

route all traffic over gre tunnel

I have an openvswitch sw1 with subnet 10.207.39.0/24 that has LXC containers attached, and I have the same setup on another physical server; I have successfully connected these using a GRE tunnel. However, the LXC containers have additional ports on additional openvswitches, e.g. sw4 with subnet 192.220.39.0/24, and I want to push that traffic over the single GRE tunnel on sw1, because there is only one physical interface and it is not possible to have multiple GRE tunnels on each openvswitch with the same physical-interface IP address endpoints. Is it possible to push the traffic on the other openvswitches over the GRE tunnel on sw1? Or is there a better way to connect multiple subnets in LXC containers on two physical hosts? Thanks.
I solved this "myself" - with help from two links provided below - (after sleeping on it and relentless google searches over several frustrating days).
I realize the solution is pretty simple and would be clear to a networking professional. I am an Oracle DBA and only know as much networking as I need to work with orabuntu-lxc software, LXC containers, and Oracle software, so please keep that in mind if the below is "obvious" - it wasn't obvious to me in my network ignorance.
I got the clue on how to solve the actual steps from this blog post:
http://www.cnblogs.com/popsuper1982/p/3800548.html
I confirmed that any subnet should be routable over a GRE tunnel from this blog post (which gave me hope to keep working towards a solution):
https://supportforums.adtran.com/thread/1408
In particular the author stated in the adtran comment that "GRE tunnels have no limitation on the types of traffic which can traverse it. It can route multiple subnets without multiple tunnels."
That post told me that the solution was likely a routing solution and that only one GRE tunnel would be needed for this use case.
Note that this feature of "no limitation" on the types of traffic is great for Oracle RAC because we need to be able to send multicast over the GRE tunnel for RAC.
This use case:
I am building an Oracle RAC infrastructure to run in LXC Linux containers. I have a public network 10.207.39.0/24 on openvswitch sw1 and a private RAC interconnect network 192.220.39.0/24 on openvswitch sw4. I want to be able to build the RAC in LXC linux containers that span multiple physical hosts and so I created a GRE tunnel to connect the 10.207.39.1 tunnel endpoint on colossus to 10.207.39.5 tunnel endpoint on guardian.
Here are the setup details:
Host "guardian":
LAN wireless physical network interface: wlp4s0 (IP 192.168.1.11)
sw1 10.207.39.5
sw4 192.220.39.5
Host "colossus":
LAN wireless physical network interface: wlp4s0 (IP 192.168.1.15)
sw1 10.207.39.1
sw4 192.220.39.1
Step 1:
Create GRE tunnel between sw1 openvswitches on both physical hosts with physical wireless LAN network interface end points:
Host "guardian": Create gre tunnel phys hosts (guardian --> colossus).
sudo ovs-vsctl add-port sw1 gre0 -- set interface gre0 type=gre options:remote_ip=192.168.1.15
Host "colossus": Create gre tunnel phys hosts (colossus --> guardian).
sudo ovs-vsctl add-port sw1 gre0 -- set interface gre0 type=gre options:remote_ip=192.168.1.11
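On each host, the new tunnel port can be verified with (not part of the original write-up, just a quick check):
sudo ovs-vsctl show    # the gre0 port should now be listed under bridge sw1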
Step 2:
Route the 192.220.39.0/24 network over the established GRE tunnel as shown below:
Host "guardian": route 192.220.39.0/24 openvswitch sw4 over GRE tunnel:
sudo route add -net 192.220.39.0/24 gw 10.207.39.5 dev sw1
Host "colossus": route 192.220.39.0/24 openvswitch sw4 over GRE tunnel:
sudo route add -net 192.220.39.0/24 gw 10.207.39.1 dev sw1
Note: To add additional subnets repeat step 2 for each subnet.
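For reference, the iproute2 equivalents of those route commands (assuming the deprecated route syntax maps one-to-one) would be:
sudo ip route add 192.220.39.0/24 via 10.207.39.5 dev sw1    # on guardian
sudo ip route add 192.220.39.0/24 via 10.207.39.1 dev sw1    # on colossus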
Note on MTU:
Also, you have to allow for GRE encapsulation in MTU if you want to ssh over these tunnels.
Therefore, in the above example for the main GRE tunnel connecting the hosts, we need the MTU to be set to 1420 to allow 80 bytes for the GRE header.
The MTU on the LXC container virtual interfaces on the sw1 switches needs to be set to 1420 in the LXC container config files.
The MTU on the LXC container virtual interfaces on the sw4 switches likewise needs to be set to 1420 in the LXC container config files.
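Depending on the LXC version, the MTU is set in the container config with one of these keys (a sketch; LXC 2.x uses the lxc.network.* form, LXC 3.x and later the lxc.net.N.* form):
lxc.network.mtu = 1420
lxc.net.0.mtu = 1420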
Note that the MTU on the openvswitches sw1 and sw4 should automatically be set to the MTU of the LXC interfaces, as long as ALL the LXC virtual interfaces are set to the new lower MTU value, so explicitly setting the MTU on the openvswitches sw1 and sw4 themselves should not be necessary.
If you still run into issues with SSH over the tunnels, but ping works across hosts and containers, then re-check all the MTU settings on the virtual interfaces and openvswitches.

Docker 1.9.0 "bridge" versus a custom bridge network results in difference in hosts file and SSH_CLIENT env variable

Let me first explain what I'm trying to do, as there may be multiple ways to solve this. I have two containers in docker 1.9.0:
node001 (172.17.0.2) (sudo docker run --net=<<bridge or test>> --name=node001 -h node001 --privileged -t -i -v /sys/fs/cgroup:/sys/fs/cgroup <<image>>)
node002 (172.17.0.3) (,,)
When I launch them with --net=bridge I get the correct value for SSH_CLIENT when I ssh from one to the other:
[root@node001 ~]# ssh root@172.17.0.3
root@172.17.0.3's password:
[root@node002 ~]# env | grep SSH_CLIENT
SSH_CLIENT=172.17.0.3 56194 22
[root@node001 ~]# ping -c 1 node002
ping: unknown host node002
In docker 1.8.3 I could also use the hostnames I supply when I start them; in 1.8.3 that last ping statement works!
In docker 1.9.0 I don't see anything being added to /etc/hosts, and the ping statement fails. This is a problem for me, so I tried creating a custom network...
docker network create --driver bridge test
When I launch the two containers with --net=test I get a different value for SSH_CLIENT:
[root@node001 ~]# ssh root@172.18.0.3
root@172.18.0.3's password:
[root@node002 ~]# env | grep SSH_CLIENT
SSH_CLIENT=172.18.0.1 57388 22
[root@node001 ~]# ping -c 1 node002
PING node002 (172.18.0.3) 56(84) bytes of data.
64 bytes from node002 (172.18.0.3): icmp_seq=1 ttl=64 time=0.041 ms
Note that the IP address is not node001's; it seems to represent the docker host itself. The hosts file is correct though, containing:
172.18.0.2 node001
172.18.0.2 node001.test
172.18.0.3 node002
172.18.0.3 node002.test
My current workaround is using docker 1.8.3 with the default bridge network, but I want this to work with future docker versions.
Is there any way I can customize the test network to make it behave similarly to the default bridge network?
Alternatively:
Maybe make the default bridge network write out the /etc/hosts file in docker 1.9.0?
Any help or pointers towards different solutions will be greatly appreciated.
Edit: 21-01-2016
Apparently the problem is fixed in 1.9.1; with bridge in docker 1.8 and with a custom network (--net=test) in 1.9.1, the behaviour is now correct:
[root@node001 tmp]# ip route
default via 172.17.0.1 dev eth0
172.17.0.0/16 dev eth0 proto kernel scope link src 172.17.0.5
[root@node002 ~]# env | grep SSH_CLIENT
SSH_CLIENT=172.18.0.3 52162 22
Retried in 1.9.0 to see if I wasn't crazy, and yeah there the problem occurs:
[root@node001 tmp]# ip route
default via 172.18.0.1 dev eth0
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.3
[root@node002 ~]# env|grep SSH_CLI
SSH_CLIENT=172.18.0.1 53734 22
So after removing/stopping/starting the instances the IP addresses were not exactly the same, but it can easily be seen that the SSH_CLIENT source IP is not correct in the last code block. Thanks @sourcejedi for making me re-check.
Firstly, I don't think it's possible to change any settings on the default network, i.e. to write /etc/hosts. You apparently can't delete the default networks, so you can't recreate them with different options.
Secondly
Docker is careful that its host-wide iptables rules fully expose containers to each other’s raw IP addresses, so connections from one container to another should always appear to be originating from the first container’s own IP address. docs.docker.com
I tried reproducing your issue with the random containers I've been playing with. Running wireshark on the bridge interface for the network, I didn't see my ping packets. From this I conclude my containers are indeed talking directly to each other; the host was not doing routing and NAT.
You need to check the routes in your client container with ip route. Do you have a route for 172.18.0.0/16? If you only have a default route, it could try to send everything through the docker host, and it might get confused and do masquerading as if it were talking with the outside world.
This might happen if you're running some network configuration in your privileged container. I don't know what's happening if you're just booting it with bash though.
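A quick way to check the container's routes and what docker actually assigned to the network (a sketch, assuming the container and network names from the question):
docker exec node001 ip route        # is there a 172.18.0.0/16 scope link route, or only a default route?
docker network inspect test         # shows the subnet and gateway docker assigned to the custom network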
