Docker-CE 19.03.8
Swarm init
Setup: 1 manager node, nothing more.
We deploy many new stacks per day, and sometimes I see the following line:
level=error msg="Failed to allocate network resources for node sdlk0t6pyfb7lxa2ie3w7fdzr" error="could not find network allocator state for network qnkxurc5etd2xrkb53ry0fu59" module=node node.id=yp0u6n9c31yh3xyekondzr4jc
After 2 to 3 days, no new services can be started because there are no free VIPs.
I see the following lines in my logs:
level=error msg="Could not parse VIP address while releasing"
level=error msg="error deallocating vip" error="invalid CIDR address: " vip.addr= vip.network=oqcsj99taftdu3b0t3nrgbgy1
level=error msg="Event api.EventUpdateTask: Failed to get service idid0u7vjuxf2itpv8n31da57 for task 6vnc8jdkgxwxqbs3ixly2i6u4 state NEW: could not find service idid0u7vjuxf2itpv8n31da57" module=node ...
level=error msg="Event api.EventUpdateTask: Failed to get service sbjb7nk0wk31c2ayg8x898fhr for task noo21whnbwkyijnqavseirfg0 state NEW: could not find service sbjb7nk0wk31c2ayg8x898fhr" module=node ...
level=error msg="Failed to find network y73pnq85mjpn1pon38pdbtaw2 on node sdlk0t6pyfb7lxa2ie3w7fdzr" module=node node.id=yp0u6n9c31yh3xyekondzr4jc
We tried to investigate this using debug mode.
Here are some lines that stand out to me:
level=debug msg="Remove interface veth84e7185 failed: Link not found"
level=debug msg="Remove interface veth64c3a65 failed: Link not found"
level=debug msg="Remove interface vethf1703f1 failed: Link not found"
level=debug msg="Remove interface vethe069254 failed: Link not found"
level=debug msg="Remove interface veth2b81763 failed: Link not found"
level=debug msg="Remove interface veth0bf3390 failed: Link not found"
level=debug msg="Remove interface veth2ed04cc failed: Link not found"
level=debug msg="Remove interface veth0bc27ef failed: Link not found"
level=debug msg="Remove interface veth444343f failed: Link not found"
level=debug msg="Remove interface veth036acf9 failed: Link not found"
level=debug msg="Remove interface veth62d7977 failed: Link not found"
and
level=debug msg="Request address PoolID:10.0.0.0/24 App: ipam/default/data, ID: GlobalDefault/10.0.0.0/24, DBIndex: 0x0, Bits: 256, Unselected: 60, Sequence: (0xf7dfeeee, 1)->(0xedddddb7, 1)->(0x77777777, 3)->(0x77777775, 1)->(0x77ffffff, 1)->(0xffd55555, 1)->end Curr:233 Serial:true PrefAddress:<
When the Unselected count reaches 0, no new containers can be deployed; they are stuck in the NEW state.
Has anyone experienced something like this? Or can someone help me?
We believe the problem has something to do with the release of 10.0.0.0/24 (our ingress) addresses.
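For reference, this is roughly how we check how much of the ingress pool is still free (a sketch; it assumes the default ingress network name and relies on what Docker itself reports as allocated):
docker network inspect ingress --format '{{json .IPAM.Config}}'
# addresses currently held by task containers and the ingress sandbox
docker network inspect ingress --format '{{range .Containers}}{{.IPv4Address}} {{end}}'
# VIPs still held by services
docker service ls -q | xargs -n 1 docker service inspect --format '{{.Spec.Name}} {{range .Endpoint.VirtualIPs}}{{.Addr}} {{end}}'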
Did you try to stop and restart the Docker daemon?
sudo service docker stop
sudo service docker start
Also, you may find it useful to have a look at the magnificent documentation on https://dockerswarm.rocks/
I usually use this sequence to update a service
export DOMAIN=xxxx.xxxxx.xxx
docker stack rm $service_name
export NODE_ID=$(docker info -f '{{.Swarm.NodeID}}')
# export environment vars if needed
# update data if needed
docker node update --label-add $service_name.$service_name-data=true $NODE_ID
docker stack deploy -c $service_name.yml $service_name
If you see your containers stuck in the NEW state, you are probably affected by this problem: https://github.com/moby/moby/issues/37338, reported by cintiadr:
Docker stack fails to allocate IP on an overlay network, and gets stuck in NEW current state #37338
Reproducing it:
Create a swarm cluster (1 manager, 1 worker). I created AWS t2.large Amazon Linux instances and installed Docker using their docs, version 18.06.1-ce.
# Deploy a new overlay network from a stack (docker-network.yml)
$ ./deploy-network.sh
Deploy 60 identical services attached to that network - 3 replicas each - from stacks (docker-network.yml)
$ ./deploy-services.sh
You can verify that all services are happily running.
Now let's bring the worker down.
Run:
docker node update --availability drain <node id> && docker node rm --force <node id>
Note: drain is an async operation (something I wasn't aware of), so to reproduce this use case you shouldn't wait for the drain to complete.
Create a new worker (completely new node/machine), and join the cluster.
You are going to see that very few services are actually able to start. All others will be continuously rejected because no IP is available.
In past versions (17, I believe), the containers wouldn't be rejected (but rather get stuck in NEW).
How to avoid that problem?
If you drain and patiently wait for all the containers to be terminated before removing the node, it appears that this problem is completely avoided.
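A minimal sketch of that safer sequence (the node ID is a placeholder; the loop just waits until the node no longer reports running tasks):
NODE_ID=<node id>
docker node update --availability drain $NODE_ID
# wait for the drain to actually finish before removing the node
while [ -n "$(docker node ps --filter desired-state=running -q $NODE_ID)" ]; do sleep 5; done
docker node rm $NODE_ID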
Related
I have a Docker image that I can run properly in my local VM. Everything runs fine.
I save the image and load it on the prod server.
I can see the image by using docker images.
Next, I try to run it with docker run -p 9191:9191 myservice.
It hangs and eventually times out.
The log shows the following:
time="2018-08-15T16:14:35.058232400-07:00" level=debug msg="HCSShim::CreateContainer succeeded id=b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e4152 handle=
39044064"
time="2018-08-15T16:14:35.058232400-07:00" level=debug msg="libcontainerd: Create() id=b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e4152, Calling start()"
time="2018-08-15T16:14:35.058232400-07:00" level=debug msg="libcontainerd: starting container b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e4152"
time="2018-08-15T16:14:35.058232400-07:00" level=debug msg="HCSShim::Container::Start id=b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e4152"
time="2018-08-15T16:18:25.393050900-07:00" level=debug msg="Result: {\"Error\":-2147023436,\"ErrorMessage\":\"This operation returned because the timeout period expired.\
"}"
time="2018-08-15T16:18:25.394050800-07:00" level=error msg="libcontainerd: failed to start container: container b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db7
9e4152 encountered an error during Start: failure in a Windows system call: This operation returned because the timeout period expired. (0x5b4)"
time="2018-08-15T16:18:25.394050800-07:00" level=debug msg="HCSShim::Container::Terminate id=b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e4152"
time="2018-08-15T16:18:25.394050800-07:00" level=debug msg="libcontainerd: cleaned up after failed Start by calling Terminate"
time="2018-08-15T16:18:25.394050800-07:00" level=error msg="Create container failed with error: container b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e4152
encountered an error during Start: failure in a Windows system call: This operation returned because the timeout period expired. (0x5b4)"
time="2018-08-15T16:18:25.424053800-07:00" level=debug msg="attach: stdout: end"
time="2018-08-15T16:18:25.425055000-07:00" level=debug msg="attach: stderr: end"
time="2018-08-15T16:18:25.427054100-07:00" level=debug msg="Revoking external connectivity on endpoint boring_babbage (b20f403df0ed25ede9152f77eb0f8e049677f1279b68862a25b
b9e2ab94babfb)"
time="2018-08-15T16:18:25.459087300-07:00" level=debug msg="[DELETE]=>[/endpoints/31e66619-5b57-47f2-9256-bbba54510e3b] Request : "
time="2018-08-15T16:18:25.548068700-07:00" level=debug msg="Releasing addresses for endpoint boring_babbage's interface on network nat"
time="2018-08-15T16:18:25.548068700-07:00" level=debug msg="ReleaseAddress(172.25.224.0/20, 172.25.229.142)"
time="2018-08-15T16:18:25.561064000-07:00" level=debug msg="WindowsGraphDriver Put() id b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e4152"
time="2018-08-15T16:18:25.561064000-07:00" level=debug msg="hcsshim::UnprepareLayer flavour 1 layerId b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e4152"
time="2018-08-15T16:18:25.566074800-07:00" level=debug msg="hcsshim::UnprepareLayer succeeded flavour 1 layerId=b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db7
9e4152"
time="2018-08-15T16:18:25.566074800-07:00" level=debug msg="hcsshim::DeactivateLayer Flavour 1 ID b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e4152"
time="2018-08-15T16:18:25.668075600-07:00" level=debug msg="hcsshim::DeactivateLayer succeeded flavour=1 id=b928213e42b2103cd33b676ed08a15529be10fffcfd7e709af86df5db79e41
52"
I can see it trying to create the container, and then it fails.
But why?
Added more information:
I finally found out how to check the server status for running containers, and I am getting this error message:
So it means the server doesn't have a network gateway?
How can I fix this problem?
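One way to check whether the nat network actually has a gateway assigned is something like this (a sketch; it assumes the default network is still named nat):
docker network ls
docker network inspect nat --format '{{range .IPAM.Config}}subnet={{.Subnet}} gateway={{.Gateway}}{{end}}'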
Still looking.
More information
I deleted all the NAT networks and created a new one, so the online check passes now.
However, I still encounter other errors and can't run the image.
Something in the virtual network is wrong; I just can't find the right information to fix it. :(
When executing a docker swarm join command (as manager), I face the following error:
Error response from daemon: manager stopped: can't initialize raft node: rpc error: code = Internal desc = connection error: desc = "transport: x509: certificate is not valid for any names, but wanted to match swarm-manager"
Joining the same swarm as a worker works flawlessly.
The log files show the following:
kmo@GETSTdock-app01 ~ $ sudo tail -f /var/log/upstart/docker.log
time="2018-07-06T09:18:17.890620199+02:00" level=info msg="Listening for connections" addr="[::]:2377" module=node node.id=7j75bmugpf8k2o0onta1yp4zy proto=tcp
time="2018-07-06T09:18:17.892234469+02:00" level=info msg="manager selected by agent for new session: { 10.130.223.107:2377}" module=node/agent node.id=7j75bmugpf8k2o0onta1yp4zy
time="2018-07-06T09:18:17.892364019+02:00" level=info msg="waiting 0s before registering session" module=node/agent node.id=7j75bmugpf8k2o0onta1yp4zy
time="2018-07-06T09:18:18.161362606+02:00" level=error msg="fatal task error" error="cannot create a swarm scoped network when swarm is not active" module=node/agent/taskmanager node.id=7j75bmugpf8k2o0onta1yp4zy service.id=p3ng4om2m8rl7ygoef18ayohp task.id=weaubf3qj5goctlh2039sjvdg
time="2018-07-06T09:18:18.162182077+02:00" level=error msg="fatal task error" error="cannot create a swarm scoped network when swarm is not active" module=node/agent/taskmanager node.id=7j75bmugpf8k2o0onta1yp4zy service.id=6sl9y5rcov6htwbyvm504ewh2 task.id=j3foc6rjszuqszj41qyqb6mpe
time="2018-07-06T09:18:18.184847516+02:00" level=info msg="Stopping manager" module=node node.id=7j75bmugpf8k2o0onta1yp4zy
time="2018-07-06T09:18:18.184993569+02:00" level=info msg="Manager shut down" module=node node.id=7j75bmugpf8k2o0onta1yp4zy
time="2018-07-06T09:18:18.185020917+02:00" level=info msg="shutting down certificate renewal routine" module=node/tls node.id=7j75bmugpf8k2o0onta1yp4zy node.role=swarm-manager
time="2018-07-06T09:18:18.185163663+02:00" level=error msg="cluster exited with error: manager stopped: can't initialize raft node: rpc error: code = Internal desc = connection error: desc = \"transport: x509: certificate is not valid for any names, but wanted to match swarm-manager\""
time="2018-07-06T09:18:18.185492995+02:00" level=error msg="Handler for POST /v1.37/swarm/join returned error: manager stopped: can't initialize raft node: rpc error: code = Internal desc = connection error: desc = \"transport: x509: certificate is not valid for any names, but wanted to match swarm-manager\""
I face similar problems when I join as worker, and then attempt to promote the node to a manager node.
Docker version = 18.03.1
OS = Ubuntu 14.04 LTS
Does anybody have an idea how to resolve this?
For me, I had to open port 2377 in the joining manager node's firewall; that seemed to do the trick. I'm not sure if this is best practice, as I'm still a noob with Docker Swarm, but add it to the list of things to try if you have this issue.
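For what it's worth, with a ufw-based firewall that amounts to something like this (a sketch; 2377 is the swarm management port, and 7946/4789 are the other standard swarm ports you may also need):
sudo ufw allow 2377/tcp   # cluster management traffic
sudo ufw allow 7946/tcp   # node-to-node communication
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp   # overlay (VXLAN) traffic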
This may or may not work, but you can try the following.
On the manager, run:
docker swarm leave --force
Recreate the swarm using:
docker swarm init --advertise-addr [ip-address for initial manager]
Then try to add managers using the advertised address.
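A minimal sketch of that flow (the IP address is an example):
# on the initial manager
docker swarm init --advertise-addr 10.0.0.10
docker swarm join-token manager   # prints the join command for new managers
# on the node being added as a manager, paste the printed command, e.g.:
docker swarm join --token <manager-token> 10.0.0.10:2377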
Also you can try:
Comment out the proxy settings in the Docker proxy configuration file /etc/systemd/system/docker.service.d/docker.conf or /etc/systemd/system/docker.service.d/docker_proxy.conf.
Reload the daemon with
systemctl daemon-reload
Re-execute docker swarm join with the manager token.
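For reference, such a drop-in typically looks something like this (a sketch; the proxy URL is an example), with the Environment lines commented out before reloading:
# /etc/systemd/system/docker.service.d/docker_proxy.conf
[Service]
# Environment="HTTP_PROXY=http://proxy.example.com:3128"
# Environment="HTTPS_PROXY=http://proxy.example.com:3128"
# Environment="NO_PROXY=localhost,127.0.0.1"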
Docker for Windows Server
Windows Server version 1709, with containers
Docker version 17.06.2-ee-6, build e75fdb8
Swarm mode (worker node, part of swarm with ubuntu masters)
After containers connected to an overlay network started intermittently losing their network adapters, I restarted the machine. Now the daemon will not start. Below are the last lines of output from running docker -D.
Please let me know how to fix this.
time="2018-05-15T15:10:06.731160000Z" level=debug msg="Option Experimental: false"
time="2018-05-15T15:10:06.731160000Z" level=debug msg="Option DefaultDriver: nat"
time="2018-05-15T15:10:06.731160000Z" level=debug msg="Option DefaultNetwork: nat"
time="2018-05-15T15:10:06.734183700Z" level=info msg="Restoring existing overlay networks from HNS into docker"
time="2018-05-15T15:10:06.735174400Z" level=debug msg="[GET]=>[/networks/] Request : "
time="2018-05-15T15:12:06.789120400Z" level=debug msg="Network (d4d37ce) restored"
time="2018-05-15T15:12:06.796122200Z" level=debug msg="Endpoint (4114b6e) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.796122200Z" level=debug msg="Endpoint (819eb70) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.797124900Z" level=debug msg="Endpoint (ade55ea) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.798125600Z" level=debug msg="Endpoint (d0054fc) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.798125600Z" level=debug msg="Endpoint (e2af8d8) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.854118500Z" level=debug msg="[GET]=>[/networks/] Request : "
time="2018-05-15T15:14:06.860654000Z" level=debug msg="start clean shutdown of all containers with a 15 seconds timeout..."
Error starting daemon: Error initializing network controller: hnsCall failed in Win32: Server execution failed (0x80080005)
Here is the complete set of steps to fully rebuild Docker networking on a Swarm host. Sometimes only some of the steps are sufficient (specifically the HNS part), so you can try those first.
Remove all Docker services and user-defined networks (so all Docker networks except `nat` and `none`); see the sketch after this list.
Leave the swarm cluster (docker swarm leave --force)
Stop the docker service (PS C:\> stop-service docker)
Stop the HNS service (PS C:\> stop-service hns)
In regedit, delete all of the registry keys under these paths:
HKLM:\SYSTEM\CurrentControlSet\Services\vmsmp\parameters\SwitchList
HKLM:\SYSTEM\CurrentControlSet\Services\vmsmp\parameters\NicList
Now go to Device Manager, and disable then remove all network adapters that are “Hyper-V Virtual Ethernet…” adapters
Now rename your HNS.data file (the goal is to effectively “delete” it by renaming it):
C:\ProgramData\Microsoft\Windows\HNS\HNS.data
Also rename C:\ProgramData\docker folder (the goal is to effectively “delete” it by renaming it)
C:\ProgramData\docker
Now reboot your machine
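A rough sketch of the first step (removing services and user-defined networks) with the docker CLI; it assumes you really do want every service and every user-defined network removed:
# from a manager node: remove all services
docker service rm $(docker service ls -q)
# on the affected host: remove all user-defined (non-default) networks
docker network rm $(docker network ls --filter type=custom -q)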
After stopping Docker, it refused to start again. It complained that another bridge called docker0 already exists:
level=warning msg="devmapper: Base device already exists and has filesystem xfs on it. User specified filesystem will be ignored."
level=info msg="[graphdriver] using prior storage driver \"devicemapper\""
level=info msg="Graph migration to content-addressability took 0.00 seconds"
level=info msg="Firewalld running: false"
level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: cannot create network fa74b0de61a17ffe68b9a8f7c1cd698692fb56f6151a7898d66a30350ca0085f (docker0): conflicts with network bb9e0aab24dd1f4e61f8e7a46d4801875ade36af79d7d868c9a6ddf55070d4d7 (docker0): networks have same bridge name"
docker.service: Main process exited, code=exited, status=1/FAILURE
Failed to start Docker Application Container Engine.
docker.service: Unit entered failed state.
docker.service: Failed with result 'exit-code'.
Deleting the bridge with ip link del docker0 and then starting docker leads to the same result with another id.
For me, I downgraded my OS (CentOS Atomic Host in this case) and came across this error message. The Docker version on the older CentOS Atomic was 1.9.1. I did not have any running containers or pulled images before running the downgrade.
I simply ran the commands below and Docker was happy again:
sudo rm -rf /var/lib/docker/network
sudo systemctl start docker
More info.
The problem seems to be in /var/lib/docker/network/. A lot of sockets are stored there that reference the bridge by its old ID. To solve the problem you can delete all the sockets, delete the interface, and then start Docker, but all your containers will refuse to work since their sockets are gone. In my case I did not care about my stateless containers anyway, so this fixed the problem:
ip link del docker0
rm -rf /var/lib/docker/network/*
mkdir -p /var/lib/docker/network/files
systemctl start docker
# delete all containers
docker ps -aq | xargs -n 1 docker rm -f
# recreate all containers
It may sound obvious, but you may want to consider rebooting, especially if there was some major system update recently.
Worked for me, since I didn't reboot my VM after installing some kernel updates, which probably left many network modules in an undefined state.
I want to fire up two Docker containers on the same interface, so I tried the following from the Docker docs:
First container:
bash-4.1$ docker run -ti --name=target ubuntu /bin/bash
root@45edefd42404:/#
Second container:
bash-4.1$ docker run -ti --rm --net=container:target ubuntu /bin/bash
setup networking failed to setns current network namespace: invalid argument
FATA[0002] Error response from daemon: Cannot start container ba28e4f14f4b3c2d7b94aa4b0cca8f5b70e6b354842818fe77b31885acc77461: setup networking failed to setns current network namespace: invalid argument
I've googled for failures related to setns and can't find anything relevant. Is there anywhere else I can look to debug this?
My docker daemon log contains this, related to the failure (full log: https://gist.github.com/paulweb515/990a1a9edeef1e73b752):
time="2015-04-23T09:17:59-04:00" level="error" msg="Warning: error unmounting device ba28e4f14f4b3c2d7b94aa4b0cca8f5b70e6b354842818fe77b31885acc77461: UnmountDevice: device not-mounted id ba28e4f14f4b3c2d7b94aa4b0cca8f5b70e6b354842818fe77b31885acc77461\n"
time="2015-04-23T09:17:59-04:00" level="info" msg="+job log(die, ba28e4f14f4b3c2d7b94aa4b0cca8f5b70e6b354842818fe77b31885acc77461, ubuntu:14.04)"
time="2015-04-23T09:17:59-04:00" level="info" msg="-job log(die, ba28e4f14f4b3c2d7b94aa4b0cca8f5b70e6b354842818fe77b31885acc77461, ubuntu:14.04) = OK (0)"
Cannot start container ba28e4f14f4b3c2d7b94aa4b0cca8f5b70e6b354842818fe77b31885acc77461: setup networking failed to setns current network namespace: invalid argument