Docker connectivity issues (to Azure DevOps Services from self-hosted Linux Docker agent)

I am looking for advice on debugging some extremely painful Docker connectivity issues.
In particular, for an Azure DevOps Services Git repository, I am running a self-hosted (local) dockerized Linux CI agent (set up according to https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops#linux), which had been working fine for a few months.
All of this runs on a company network, and since last week the network connection of my Docker container has become highly unstable:
Specifically, it intermittently loses its network connection, which is also visible in the logs of the Azure DevOps agent, which then keeps trying to reconnect.
This happens especially while downloading Git LFS objects. Enabling extra traces via GIT_TRACE=1 reveals a lot of connection failures and retries:
trace git-lfs: xfer: failed to resume download for "SHA" from byte N: expected status code 206, received 200. Re-downloading from start
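To get a feel for how often the resume failure above occurs, the trace can be written to a file and the matching lines counted. The following is a minimal shell sketch: the function name count_resume_failures and the log file name lfs-trace.log are made up for illustration, and the match string is simply the message quoted above (other git-lfs versions may word it differently).

```shell
#!/bin/sh
# Count git-lfs "failed to resume download" trace lines read from stdin.
# The pattern is copied from the trace message quoted in the question.
count_resume_failures() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so mask the exit status to keep the function usable under `set -e`.
    grep -c 'expected status code 206, received 200' || true
}

# Example (hypothetical file name): capture the trace, then count.
#   GIT_TRACE=1 git lfs pull 2> lfs-trace.log
#   count_resume_failures < lfs-trace.log
```

A count that grows quickly during a single pull would suggest that something between the container and the server is dropping or rewriting range requests, rather than a one-off hiccup.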
During such an LFS pull / fetch, the container sometimes even stops responding, and a docker container list command only returns:
Error response from daemon: i/o timeout
The daemon then cannot recover on its own and needs a manual restart to bring the CI back up.
I also see remarkable differences in network performance:
Manually cloning the same Git repository (including LFS objects, all from scratch) in container instances created from the same image on different machines takes less than 2 minutes on my dev laptop (connected from home via VPN), while the same operation easily takes up to 20 minutes (!) in containers running on two different Win10 machines (company network, physically located in offices, hence no VPN).
Clearly this is not about the host network connection itself, since cloning on the same Win10 hosts (company network/offices) outside of the containers takes only 14 seconds!
Hence I suspect a network configuration issue (e.g. something with the Hyper-V vEthernet adapter? Firewall? Proxy? Some other watchdog going astray?), but after three days of debugging I am running out of ideas and expertise and am not sure how to investigate further. Any thoughts / advice / hints?
I should add that LFS configuration options (such as lfs.concurrenttransfers and lfs.basictransfersonly) did not really help, and neither did changing git config http.version or removing some of the larger files.
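For reference, the settings mentioned above can be applied per repository roughly as follows. This is only a sketch: the values shown (a single concurrent transfer, basic transfers only, HTTP/1.1) are assumptions, since the question does not say which values were tried, and the GIT variable and the function name apply_lfs_tuning exist only for this example.

```shell
#!/bin/sh
# Set GIT=echo to print the commands instead of running them (dry run).
GIT=${GIT:-git}

# Apply conservative transfer settings to the current repository.
apply_lfs_tuning() {
    # Limit git-lfs to one transfer at a time (value is an assumption).
    $GIT config lfs.concurrenttransfers 1
    # Use only the basic HTTP transfer adapter.
    $GIT config lfs.basictransfersonly true
    # Force HTTP/1.1 instead of HTTP/2 (value is an assumption).
    $GIT config http.version HTTP/1.1
}
```

Run apply_lfs_tuning inside the clone before the next git lfs pull; with GIT=echo it only prints the three git config invocations.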
UPDATE
It does not actually seem to be about the self-hosted agent, but rather a more general Docker network configuration issue within my corporate network.
Running the following works consistently fast on my VPN machine (running from home):
docker run -it ubuntu bash -c "apt-get update; apt-get install -y wget; start=$SECONDS;
wget http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso;
echo Duration: $(( SECONDS - start )) seconds"
Comparison with a PowerShell download (on the host):
$start = Get-Date
(New-Object Net.WebClient).DownloadFile("http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso", "e:/temp/lubuntu-18.04-alternate-amd64.iso")
'Duration: {0:mm} min {0:ss} sec' -f ((Get-Date) - $start)
Corporate network:
Docker: 1560 seconds (= 26 min!)
Windows host: 00 min 15 sec
Dev laptop (VPN, from home):
Docker: 144 seconds (= 2 min 24 sec)
Windows host: 02 min 16 sec
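To make the gap easier to compare, the raw seconds can be converted and the slowdown factor computed. A small sketch using only shell integer arithmetic; the function names fmt_duration and slowdown are invented for this example, and the inputs are the measurements listed above.

```shell
#!/bin/sh
# Format a duration given in whole seconds as "<m>min <s>sec".
fmt_duration() {
    echo "$(( $1 / 60 ))min $(( $1 % 60 ))sec"
}

# Integer slowdown factor: how many times longer $1 took than $2.
slowdown() {
    echo "$(( $1 / $2 ))"
}

fmt_duration 1560   # corporate network, inside Docker
fmt_duration 144    # dev laptop over VPN, inside Docker
slowdown 1560 15    # Docker vs. Windows host on the corporate network
```

On the corporate network the download inside the container took about 104 times as long as on the host, while over VPN the two were nearly identical (144 vs. 136 seconds).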
Looking at the issues discussed in https://github.com/docker/for-win/issues/698 (and the proposed workaround, which didn't work for me), it seems to be a non-trivial problem with Windows / Hyper-V.

The whole issue "solved itself" when my company finally decided to upgrade from Win10 1803 to 1909 (which comes with WSL, replacing Hyper-V) 😂
Now everything runs very smoothly (I repeated these tests almost 20 times).


Docker service showing no such image when trying to upgrade service

First of all, sorry if my English is bad.
We have a service that we were able to upgrade until 26 September 2022, via Portainer or via the terminal in Docker. The image is hosted on the GitLab registry.
We did not make any changes, but we are not able to upgrade it anymore!
How can we debug why this message is appearing?
No such image: registry.gitlab.com/xxxx/xxx/api:1.1.18#sha256:xxxx
Some additional information:
- We run docker login before trying to do the service update.
- We can do docker pull registry.gitlab.com/etc/etc (the version).
The problem only occurs when we try to upgrade it as a service.
Is there some kind of debug output for the service upgrade that can provide additional information, for example whether a firewall is blocking something?
docker service update nameofservice
nameofservice
overall progress: 0 out of 1 tasks
overall progress: 0 out of 1 tasks
overall progress: 0 out of 1 tasks
overall progress: 0 out of 1 tasks
overall progress: 0 out of 1 tasks
overall progress: 0 out of 1 tasks
1/1: preparing [=================================> ]
Until it returns the error 'no such image'!
I am pretty sure the image exists.
If you are experiencing the same problem, check whether you have more nodes (physical machines or VMs) connected to your Docker node (docker node ls).
If that is your case, run docker pull gitlabaddressetcetc on the other nodes and check whether everything is fine.
I found the message 'No space left on device', so I ran 'df -h', but plenty of space was available on the VM. Anyway, I decided to run 'docker system prune -f' to see what would happen.
Running 'docker system prune -f' seems to have solved my problem, and everything is fine now.
After that, I just needed to change the version in Portainer to an invalid one before trying again.
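The checks described in this answer can be scripted roughly like this. It is a sketch, not a definitive procedure: show_task_errors and check_node_image are hypothetical helper names, the DOCKER variable only exists to allow a dry run, and the second helper has to be run on each node listed by docker node ls.

```shell
#!/bin/sh
# Set DOCKER=echo to print the commands instead of running them (dry run).
DOCKER=${DOCKER:-docker}

# Show the untruncated error of every task of a service, with its node.
show_task_errors() {
    $DOCKER service ps --no-trunc --format '{{.Node}}: {{.Error}}' "$1"
}

# Run on each node: show Docker's own disk usage, then try the pull there.
check_node_image() {
    $DOCKER system df
    $DOCKER pull "$1"
}
```

For example, show_task_errors nameofservice on a manager lists the full error per node, and check_node_image registry.gitlab.com/xxxx/xxx/api:1.1.18 on a worker shows whether the pull fails there, for instance with 'No space left on device'.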

Docker install of AZCore results in authserver+worldserver doesn't exist error

I'm trying to spin up a fresh server using the AzerothCore Docker installation guide. I have completed all of the early installation steps, up until running the containers. Upon running the containers (for worldserver and authserver) I see the following output from the containers. It appears that the world and auth server binaries are missing from dist/bin. How can I resolve this issue?
Check your Docker settings and make sure you have enough memory. If the containers have too little memory, they will not finish the compile. Also check whether you have build issues.

The Divio app or CLI "doctor" reports a DNS failure

The Divio app shows an error when setting up the local Docker container because (according to 'divio doctor') DNS resolution inside Docker doesn't work.
I've set up an Ubuntu 18.1 VBox VM on a W10 host to serve as a Divio local development box. DNS resolution was never a problem when running Docker on the host, though.
I added "dns": ["8.8.8.8"] to /etc/docker/daemon.json to get DNS to work from the terminal.
The following command returns the correct answer:
docker run busybox nslookup control.divio.com
Server: 8.8.8.8
Address: 8.8.8.8:53
Non-authoritative answer:
Name: control.divio.com
Address: 217.150.252.173
Does anyone have an idea how to fix this?
What's happening is this: the command executed inside the container to test for DNS resolution (nslookup control.divio.com) has a 5 second timeout.
Your command (docker run busybox nslookup control.divio.com) does just the same thing, but without the timeout.
For whatever reason, it's taking longer than 5 seconds to get a response, hence the failure in the first case.
It's not entirely clear why this sometimes happens, with no obvious reason - DNS resolution should not take so long.
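To see whether the lookup really exceeds the 5 second limit on a given machine, it can be timed by hand. The sketch below only approximates the check and is not the code used by divio doctor: time_lookup is a hypothetical helper, the DOCKER variable exists only so the function can be tried without a Docker daemon, and timing has whole-second resolution.

```shell
#!/bin/sh
# Set DOCKER to another command (for example "true") to try this without Docker.
DOCKER=${DOCKER:-docker}

# Time a DNS lookup inside a busybox container.
# $1 = host name, $2 = limit in seconds (default 5, as in the doctor check).
# Only the duration is judged; the lookup's own exit status is ignored.
time_lookup() {
    host=$1
    limit=${2:-5}
    start=$(date +%s)
    $DOCKER run --rm busybox nslookup "$host" >/dev/null 2>&1
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$limit" ]; then
        echo "slow: ${elapsed}s (limit ${limit}s)"
        return 1
    fi
    echo "ok: ${elapsed}s"
}
```

Calling time_lookup control.divio.com a few times in a row shows whether the slow responses are occasional or consistent. Note that the measured time includes container startup, so it slightly overstates the lookup itself.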
You can disable this test though, by adding docker-server-dns to skip_doctor_checks in the ~/.aldryn file. See the Divio Cloud documentation for details.
Update 8th January 2019
The Divio App has been updated to version 0.13.1, which you will be offered when you next launch it, along with the Divio CLI (version 3.3.10), which, if you use it outside the Divio Shell, can be installed with pip install --upgrade divio-cli.
In this update the way the lookup works has been changed to mitigate the effects of network problems when it does the DNS check (it now does a more restricted check).
You should now be able to re-enable the disabled docker-server-dns test in the ~/.aldryn file.
Update 8th March 2019
To complicate matters, it turns out that the Busybox image used to run these tests has changed in recent versions and it's quite difficult to ensure that the commands used in the test will work with whatever version of Busybox the user happens to have.
Running docker pull busybox will update the image, and for many users this has solved the issues. Some users will be able to reinstate the tests described above that would fail.

Slow install / upgrade through Helm (for Kubernetes)

Our application consists of circa 20 modules. Each module contains a (Helm) chart with several deployments, services and jobs. Some of those jobs are defined as Helm pre-install and pre-upgrade hooks. Altogether there are probably about 120 yaml files, which eventually result in about 50 running pods.
During development we are running Docker for Windows version 2.0.0.0-beta-1-win75 with Docker 18.09.0-ce-beta1 and Kubernetes 1.10.3. To simplify management of our Kubernetes yaml files we use Helm 2.11.0. Docker for Windows is configured to use 2 CPU cores (of 4) and 8GB RAM (of 24GB).
When creating the application environment for the first time, it takes more than 20 minutes to become available. This seems far too slow; we are probably making an important mistake somewhere. We have tried to improve the (re)start time, but to no avail. Any help or insights to improve the situation would be greatly appreciated.
A simplified version of our startup script:
#!/bin/bash
# Start some infrastructure
helm upgrade --force --install modules/infrastructure/chart
# Start ~20 modules in parallel
helm upgrade --force --install modules/module01/chart &
[...]
helm upgrade --force --install modules/module20/chart &
await_modules()
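The "start everything in the background, then wait" pattern in the script above can be written so that failed commands are not silently lost. A sketch in plain shell: run_parallel is a hypothetical helper, and since the question does not show what await_modules does, this is only one possible reading of it.

```shell
#!/bin/sh
# Run each argument as a shell command in the background, wait for all of
# them, and return the number of commands that exited with a failure.
run_parallel() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        # `wait <pid>` returns that process's exit status.
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}
```

Each module's helm upgrade command line would be passed as one quoted argument, for example run_parallel "helm upgrade --force --install modules/module01/chart" followed by the remaining modules. A non-zero return value then tells the caller how many of them failed.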
Executing the same startup script again later to 'restart' the application still takes about 5 minutes. As far as I know, unchanged objects are not modified at all by Kubernetes. Only the circa 40 hooks are run by Helm.
Running a single hook manually with docker run is fast (~3 seconds). Running that same hook through Helm and Kubernetes regularly takes 15 seconds or more.
Some things we have discovered and tried are listed below.
Linux staging environment
Our staging environment consists of Ubuntu with native Docker. Kubernetes is installed through minikube with --vm-driver none.
Contrary to our local development environment, the staging environment retrieves the application code through a (deprecated) gitRepo volume for almost every deployment and job. Understandably, this only seems to worsen the problem. Starting the environment for the first time takes over 25 minutes, restarting it takes about 20 minutes.
We tried replacing the gitRepo volume with a sidecar container that retrieves the application code as a TAR. Although we have not modified the whole application, initial tests indicate this is not particularly faster than the gitRepo volume.
This situation can probably be improved with an alternative type of volume that enables sharing of code between deployments and jobs. We would rather not introduce more complexity, though, so we have not explored this avenue any further.
Docker run time
Executing a single empty alpine container through docker run alpine echo "test" takes roughly 2 seconds. This seems to be overhead of the setup on Windows. That same command takes less than 0.5 seconds on our Linux staging environment.
Docker volume sharing
Most of the containers - including the hooks - share code with the host through a hostPath. The command docker run -v <host path>:<container path> alpine echo "test" takes 3 seconds to run. Using volumes seems to increase the runtime by approximately 1 second.
Parallel or sequential
Sequential execution of the commands in the startup script does not improve startup time, nor does it drastically worsen it.
IO bound?
Windows Task Manager indicates that IO is at 100% when executing the startup script. Our hooks and application code are not IO intensive at all, so the IO load seems to originate from Docker, Kubernetes or Helm. We have tried to find the bottleneck, but were unable to pinpoint the cause.
Reducing IO through ramdisk
To test the premise of being IO bound further, we exchanged /var/lib/docker with a ramdisk in our Linux staging environment. Starting the application with this configuration was not significantly faster.
To compare Kubernetes with Docker, you need to consider that Kubernetes will run more or less the same Docker command as a final step. Before that, many other things happen:
authentication and authorization, creating objects in etcd, locating suitable nodes for the pods and scheduling them, provisioning storage, and more.
Helm itself also adds overhead to the process, depending on the size of the chart.
I recommend reading One year using Kubernetes in production: Lessons learned. The author explains what they achieved by switching to Kubernetes, as well as the differences in overhead:
Cost calculation
Looking at costs, there are two sides to the story. To run Kubernetes, an etcd cluster is required, as well as a master node. While these are not necessarily expensive components to run, this overhead can be relatively expensive when it comes to very small deployments. For these types of deployments, it’s probably best to use a hosted solution such as Google's Container Service.
For larger deployments, it’s easy to save a lot on server costs. The overhead of running etcd and a master node aren’t significant in these deployments. Kubernetes makes it very easy to run many containers on the same hosts, making maximum use of the available resources. This reduces the number of required servers, which directly saves you money. When running Kubernetes sounds great, but the ops side of running such a cluster seems less attractive, there are a number of hosted services to look at, including Cloud RTI, which is what my team is working on.

Windows 10 Docker Network DNS doesn't work after reboot

I'm not sure if this is an issue with the current version of Windows Docker network or poor configuration and misunderstanding on my part, but I have the following setup:
2 Docker containers (built using the Microsoft/ASP.NET image as a base) running a .NET MVC application in each.
1 Docker container running SQL server (built using the Microsoft/mssql-server-windows image)
When I create all 3 containers everything works great: I can attach and ping all the other containers using their names without any issue. The applications run and can communicate with each other as I hoped.
However, when I reboot my machine and start all the containers again they can no longer ping/communicate with each other using their names (using IP addresses is fine).
I've tried this on the default NAT network and also tried replacing the NAT network with my own custom NAT network.
To resolve the issue I have to run the force network disconnect command for each container as such:
docker network disconnect nat <containername> --force
And then I have to reconnect each container to the network before starting them up. All containers can then ping/communicate with each other using their names as well as their IP addresses.
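The workaround above can be wrapped in a small loop, for example from a POSIX shell such as Git Bash (an assumption; the question does not say which shell is used). reconnect_containers is a hypothetical helper name, and the DOCKER variable only exists to allow a dry run.

```shell
#!/bin/sh
# Set DOCKER=echo to print the commands instead of running them (dry run).
DOCKER=${DOCKER:-docker}

# For each named container: force-disconnect it from the network,
# reconnect it, then start it.
# $1 = network name, remaining arguments = container names.
reconnect_containers() {
    network=$1
    shift
    for name in "$@"; do
        $DOCKER network disconnect --force "$network" "$name"
        $DOCKER network connect "$network" "$name"
        $DOCKER start "$name"
    done
}
```

After a reboot, reconnect_containers nat followed by the three container names would replace the manual steps. This automates the workaround only and does not address why name resolution breaks in the first place.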
FYI, this is a development environment but I was hoping to do something similar in Azure using a Windows Server 2016 VM, although I don't quite know what the best network configuration is for live production yet as I need to have multiple applications (in separate containers) on the same node accessed via their own subdomains.
Any help or guidance would be great.
I'm not sure, in part because this question was asked several months before any other example I've run into, but this sounds very similar to the problem described at https://github.com/docker/for-win/issues/1038.
Basically, there appears to be a problem introduced with the 1709 update to Windows 10 which results in a scenario where Hyper-V networking doesn't work the way it ought to.
There appear to be two common ways of working around this problem: Turning off "Fast Start" in the Control Panel => Power Options => System Settings, or restarting Docker for Windows and any containers after booting. I also thought I saw something on a Microsoft blog post indicating that the underlying problem has now been resolved and will be included in an update to Windows 10, but alas I can no longer find that information or the specific version number in which the problem was (theoretically) resolved. It may well be the delayed 1803 "Spring Creators Update" release.
