The Divio app or CLI "doctor" reports a DNS failure - docker

The Divio app shows an error when setting up the local Docker container because (according to 'divio doctor') DNS resolution inside Docker doesn't work.
I've set up an Ubuntu 18.1 VirtualBox VM on a Windows 10 host to serve as a Divio local development box. DNS resolution was never a problem when running Docker on the host, though.
I added "dns": ["8.8.8.8"] to /etc/docker/daemon.json to get DNS working from the terminal.
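For reference, a minimal /etc/docker/daemon.json carrying that setting might look like this (note that the addresses must be quoted strings for the JSON to be valid, and the daemon needs a restart afterwards, e.g. sudo systemctl restart docker; the secondary resolver is an optional addition):

```json
{
  "dns": ["8.8.8.8", "8.8.4.4"]
}
```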
The following command returns the correct answer:
docker run busybox nslookup control.divio.com
Server: 8.8.8.8
Address: 8.8.8.8:53
Non-authoritative answer:
Name: control.divio.com
Address: 217.150.252.173
Does anyone have an idea how to fix this?

What's happening is this: the command executed inside the container to test DNS resolution (nslookup control.divio.com) has a 5-second timeout.
Your command (docker run busybox nslookup control.divio.com) does just the same thing, but without the timeout.
For whatever reason, it's taking longer than 5 seconds to get a response, hence the failure in the first case.
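You can approximate the doctor check yourself by bounding the same lookup at five seconds. Using coreutils timeout here is an illustration of the behaviour, not necessarily how divio doctor implements it internally:

```shell
# Same lookup as the doctor's docker-server-dns check, but with an explicit 5s cap.
if command -v docker >/dev/null 2>&1; then
    if timeout 5 docker run --rm busybox nslookup control.divio.com; then
        status=resolved_within_5s
    else
        status=timed_out_or_failed
    fi
else
    status=no_docker
fi
echo "DNS probe: $status"
```

If this regularly prints timed_out_or_failed while the untimed command succeeds, you are seeing exactly the slow resolution described above.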
It's not entirely clear why this sometimes happens, with no obvious reason - DNS resolution should not take so long.
You can disable this test, though, by adding docker-server-dns to skip_doctor_checks in the ~/.aldryn file; see the Divio Cloud documentation for details.
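Assuming the ~/.aldryn file uses a JSON layout (check your own file; the surrounding keys may differ), the skip entry would look something like this:

```json
{
  "skip_doctor_checks": ["docker-server-dns"]
}
```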
Update 8th January 2019
The Divio app has been updated to version 0.13.1, which you will be offered when you next launch it, along with the Divio CLI (version 3.3.10), which, if you use it outside the Divio Shell, can be upgraded with pip install --upgrade divio-cli.
In this update the way the lookup works has been changed to mitigate the effects of network problems when it does the DNS check (it now does a more restricted check).
You should now be able to re-enable the disabled docker-server-dns test in the ~/.aldryn file.
Update 8th March 2019
To complicate matters, it turns out that the Busybox image used to run these tests has changed in recent versions, and it's quite difficult to ensure that the commands used in the test work with whichever version of Busybox the user happens to have.
Running docker pull busybox will update the image, and for many users this has solved the issue; they should then be able to reinstate the tests described above that previously failed.

Related

Docker connectivity issues (to Azure DevOps Services from self hosted Linux Docker agent)

I am looking for some advice on debugging some extremely painful Docker connectivity issues.
In particular, for an Azure DevOps Services Git repository, I am running a self-hosted (locally) dockerized Linux CI (setup according to https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops#linux), which has been working fine for a few months now.
All this runs on a company network, and since last week the network connection of my docker container became highly unstable:
Specifically, it intermittently loses its network connection, which is also visible in the logs of the Azure DevOps agent, which then keeps trying to reconnect.
This especially happens while downloading Git LFS objects. Enabling extra traces via GIT_TRACE=1 highlights a lot of connection failures and retries:
trace git-lfs: xfer: failed to resume download for "SHA" from byte N: expected status code 206, received 200. Re-downloading from start
During such an LFS pull / fetch, sometimes the container even stops responding, and a docker container list command only returns:
Error response from daemon: i/o timeout
As a result the daemon cannot recover on its own, and needs a manual restart (to get back up the CI).
Also I see remarkable differences in network performance:
Manually cloning the same Git repository (including LFS objects, all from scratch) in container instances created from the same image on different machines takes less than 2 minutes on my dev laptop (connected from home via VPN), while the same operation easily takes up to 20 minutes (!) in containers running on two different Windows 10 machines (company network, physically located in offices, hence no VPN).
Clearly this is not about the host network connection itself, since cloning on the same Windows 10 hosts (company network/offices) outside of the containers takes only 14 seconds!
Hence I suspect some network configuration issue (e.g. something with the Hyper-V vEthernet adapter? Firewall? Proxy? Or some other watchdog going astray?), but after three days of debugging I am not sure how to investigate further, as I am running out of ideas and expertise. Any thoughts / advice / hints?
I should add that LFS configuration options (such as lfs.concurrenttransfers and lfs.basictransfersonly) did not really help, and neither did git config http.version (or just removing some larger files).
UPDATE
It does not actually seem to be about the self-hosted agent, but a more general Docker network configuration issue within my corporate network.
Running the following works consistently fast on my VPN machine (running from home):
docker run -it ubuntu bash -c "apt-get update; apt-get install -y wget;
  start=$SECONDS;
  wget http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso;
  echo Duration: $(( SECONDS - start )) seconds"
Comparison with a PowerShell download (on the host):
$start = Get-Date
$(New-Object Net.WebClient).DownloadFile("http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso", "e:/temp/lubuntu-18.04-alternate-amd64.iso")
'Duration: {0:mm} min {0:ss} sec' -f ($(Get-Date) - $start)
Corporate network
Docker: 1560 seconds (=26 min!)
Windows host sys: Duration: 00 min 15 sec
Dev laptop (VPN, from home):
Docker: 144 seconds (=2min 24sec)
Windows host sys: Duration: 02 min 16 sec
Looking at the issues discussed in https://github.com/docker/for-win/issues/698 (and the proposed workaround, which didn't work for me), it seems to be a non-trivial problem with Windows / Hyper-V.
The whole issue "solved itself" when my company finally decided to upgrade from Win10 1803 to 1909 (which comes with WSL, replacing Hyper-V) 😂
Now everything runs super smoothly (I've run these tests almost 20 times).

Docker build hangs immediately (with minikube and ubuntu, with not many files in the directory)

Upon a fresh start of my Ubuntu (on a VirtualBox VM), I can build my images normally. Then, very inconsistently (it can be the next time I try to build, or the 10th time), it will hang forever after running the command docker build .
The Dockerfiles are in directories with 5-10 other files (which rules out the issue of a massive number of files slowing Docker down while it locates the Dockerfile, as seen in other posts).
If I try a build with a new, very simple Dockerfile (to eliminate any syntax error), it also hangs whenever it hangs with my project's Dockerfiles.
Besides, I am running minikube --driver=none and my images are used for deployments in Kubernetes. (With the none driver it's not required to run eval $(minikube docker-env).)
The only reliable fix is to stop the vm on which my ubuntu is running, start it again, and it will consistently allow me to build my images at least one time, then the issue comes back inconsistently.
This fix is quite inconvenient as I need to stop everything I am doing and it takes a bit of time.
I have tried to run docker system prune and to delete all the images already built.
What log could I check to find an issue going on when the build hangs ?
Any idea of the origin of the issue ?
Thanks a lot !
OK, this bug was vicious.
Sometimes when I need to check how my nginx server behaves in one of the containers, I open the VM's graphical interface and open Firefox to have a look.
I only figured out today that Firefox prompts a pop-up after a while, asking for the admin password in order to access the keychain. It turns out Docker does not build anything while this pop-up is open. Closing it or entering the password fixed my issue...
In another terminal window, check the Docker daemon log using sudo journalctl -fu docker.service at the time the build command hangs. You can also check the list of running processes during the build.

Docker fails on changed GCP virtual machine?

I have a problem with Docker that seems to happen when I change the machine type of a Google Compute Platform VM instance. Images that were fine fail to run, fail to delete, and fail to pull, all with various obscure messages about missing keys (this on Linux), duplicate or missing layers, and others I don't recall.
The errors don't always happen. One that occurred just now, with an image that ran a couple hundred times yesterday on the same setup, though before a restart, was:
$ docker run --rm -it mbloore/model:conda4.3.1-aq0.1.9
docker: Error response from daemon: layer does not exist.
$ docker pull mbloore/model:conda4.3.1-aq0.1.9
conda4.3.1-aq0.1.9: Pulling from mbloore/model
Digest: sha256:4d203b18fd57f9d867086cc0c97476750b42a86f32d8a9f55976afa59e699b28
Status: Image is up to date for mbloore/model:conda4.3.1-aq0.1.9
$ docker rmi mbloore/model:conda4.3.1-aq0.1.9
Error response from daemon: unrecognized image ID sha256:8315bb7add4fea22d760097bc377dbc6d9f5572bd71e98911e8080924724554e
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
$
So it thinks it has no images, but the Docker folders are full of files, and it does know some hashes. It looks like some index has been damaged.
I restarted that instance, and then Docker seemed to be normal again without any special action on my part.
The only workarounds I have found so far are to restart and hope, or to delete several large Docker directories and recreate them empty; after a restart, pull and run work again. But I'm now not sure that they always will.
I am running with Docker version 17.05.0-ce on Debian 9. My images were built with Docker version 17.03.2-ce on Amazon Linux, and are based on the official Ubuntu image.
Has anyone had this kind of problem, or know a way to reset the state of Docker without deleting almost everything?
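The "delete several large Docker directories" workaround amounts to resetting the daemon's local state. A sketch, assuming a default Linux install managed by systemd; note that this removes all local images, containers and volumes, so it is strictly a last resort:

```shell
# Reset Docker's local state by moving /var/lib/docker aside.
# Guarded so it only runs where systemd reports an active Docker daemon.
if command -v systemctl >/dev/null 2>&1 && systemctl is-active --quiet docker; then
    sudo systemctl stop docker
    sudo mv /var/lib/docker "/var/lib/docker.bak.$(date +%s)"  # keep a copy rather than deleting outright
    sudo systemctl start docker   # the daemon recreates /var/lib/docker empty
    status=reset
else
    status=skipped
fi
echo "docker state reset: $status"
```

Moving the directory aside rather than deleting it means you can still recover images from the backup copy if the reset was unnecessary.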
Two points:
1) It seems that changing the VM had nothing to do with it. On some boots Docker worked, on others not, with no change in configuration or contents.
2) At Google's suggestion I installed Stackdriver monitoring and logging agents, and I haven't had a problem through seven restarts so far.
My first guess is that there is a race condition on startup, and adding those agents altered it in my favour. Of course, I'd like to have a real fix, but for now I don't have the time to pursue the problem.

Re-running docker-compose in Windows says network configuration changed

I have docker-compose version 1.11.2 on Windows and am using a version 2.1 docker-compose.yml, but whenever I run something like docker-compose up or docker-compose run a subsequent time, I get an error that the network needs to be recreated because configuration options changed (even if I didn't change anything). I can use docker network rm to remove the network, but from other documentation and posts about docker-compose on Linux this seems to be unnecessary.
I can reproduce this reliably but can't really find any further information. Can anyone explain why I keep getting errors to recreate the network (using a transparent driver to download some stuff when building the image, but even using the nat driver gives me a similar error) or at least how to work around it? One of my scenarios is to be able to use docker-compose run on one of the services a couple of times on the same machine as part of cloud build/test.
Turns out this was a bug and was fixed in a subsequent update several weeks ago. I was told by one of the Docker developers that Windows 10 Creators Update was required as well.
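For anyone stuck on an older docker-compose, the manual workaround mentioned in the question (removing the network before re-running) can be scripted. The project name myproject below is a placeholder for whatever prefix docker network ls shows for your stack:

```shell
# Remove the stale compose network, then bring the stack up again.
if command -v docker >/dev/null 2>&1; then
    docker network rm myproject_default 2>/dev/null || true
    docker-compose up -d || true
    status=attempted
else
    status=no_docker
fi
echo "network recreate workaround: $status"
```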

Docker swarm mode load balancing

I've set up a docker swarm mode cluster, with two managers and one worker. This is on Centos 7. They're on machines dkr1, dkr2, dkr3. dkr3 is the worker.
I was upgrading to v1.13 the other day and wanted zero downtime, but it didn't work exactly as expected. I'm trying to work out the correct way to do it, since this is one of the main goals of having a cluster.
The swarm is in 'global' mode. That is, one replica per machine. My method for upgrading was to drain the node, stop the daemon, yum upgrade, start daemon. (Note that this wiped out my daemon config settings for ExecStart=...! Be careful if you upgrade.)
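The drain / upgrade / reactivate cycle can be sketched as follows, run from a manager node (node name dkr3 as in my setup; the yum package name may differ on your system):

```shell
NODE=dkr3
if command -v docker >/dev/null 2>&1; then
    # Stop scheduling new tasks on the node and move its current tasks elsewhere.
    docker node update --availability drain "$NODE" || true
    # ...then, on the node itself:
    #   sudo systemctl stop docker
    #   sudo yum upgrade -y docker-engine
    #   sudo systemctl start docker
    # Finally, let the node accept tasks again.
    docker node update --availability active "$NODE" || true
    status=done
else
    status=no_docker
fi
echo "drain/upgrade sketch status: $status"
```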
Our client/ESB hits dkr2, which does its load-balancing magic over the swarm. dkr2 is the leader; dkr1 is 'reachable'.
I brought down dkr3. No issues. Upgraded docker. Brought it back up. No downtime from bringing down the worker.
Brought down dkr1. No issue at first. Still working when I brought it down. Upgraded docker. Brought it back up.
But during startup, it 404'ed. Once up, it was OK.
Brought down dkr2. I didn't actually record what happened then, sorry.
Anyway, while my app was starting up on dkr1, it 404'ed, since the server hadn't started yet.
Any idea what I might be doing wrong? I would suppose I need a health check of some sort, because the container is obviously ok, but the server isn't responding yet. So that's when I get downtime.
You are correct: you need to specify a healthcheck to run against your app inside the container in order to make sure it is ready. Your container will not receive traffic until this healthcheck has passed.
A simple curl to an endpoint should suffice. Use the HEALTHCHECK instruction in your Dockerfile to specify a healthcheck to perform.
An example of the healthcheck line in a Dockerfile to check if an endpoint returned 200 OK would be:
HEALTHCHECK CMD curl -f 'http://localhost:8443/somepath' || exit 1
If you can't modify your Dockerfile, then you can also specify your healthcheck manually at deployment time using the compose file healthcheck format.
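In a compose file, a healthcheck equivalent to the Dockerfile line above might look like this (the service name, image, and endpoint are placeholders for your own values):

```yaml
services:
  web:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8443/somepath"]
      interval: 30s
      timeout: 10s
      retries: 3
```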
If that's not possible either and you need to update a running service, you can do a service update and use a combination of the health flags to specify your healthcheck.
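As a sketch, the service-update route might look like the following; my_service and the endpoint are placeholders, while the flags shown are the standard docker service update health options:

```shell
# Attach or replace a healthcheck on a running service without touching the image.
if command -v docker >/dev/null 2>&1; then
    docker service update \
        --health-cmd "curl -f http://localhost:8443/somepath || exit 1" \
        --health-interval 30s \
        --health-timeout 10s \
        --health-retries 3 \
        my_service && status=updated || status=failed
else
    status=no_docker
fi
echo "service update status: $status"
```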
