I had this experience on 2021-09-07: a freshly created Docker image downloads very slowly via docker pull (from hub.docker.com)...
The last layer was the obstacle – it took 40–50 minutes to finish. What could be the reason?
e249e58386a8: Downloading [===> ] 83.73MB/303.3MB
Check your internet connectivity, especially if you are behind a proxy or firewall. (If the firewall has restrictive rules, ask your admin to whitelist hub.docker.com - this could be the reason.)
Check your PC's firewall, antivirus, etc.
Restart the node and check.
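If you want to rule out a proxy or DNS problem on the Docker side first, a minimal check could look like the sketch below (assuming a Linux host with the Docker CLI and curl; alpine is just an arbitrary small test image):

# Show whether the daemon is configured to use an HTTP/HTTPS proxy
docker info | grep -i proxy
# Time the pull of a small image to see whether the slowness is specific to large layers
time docker pull alpine:latest
# Check raw reachability and latency of the Docker Hub registry endpoint
curl -s -o /dev/null -w 'HTTP %{http_code}, %{time_total}s total\n' https://registry-1.docker.io/v2/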
Related
Sometimes my internet connection gets slow and unstable. When I use docker pull image_name_here for large images, I see that they get stuck in the middle of the download process, or the internet connection is lost and the pull times out or exits with other errors.
But when I execute the pull command again, the layers that were already downloaded are not kept on my drive; they get downloaded all over again.
This means that for large images on an unstable network I practically can't pull the image at all.
Is there a way for me to somehow resume the pull process from where it was interrupted?
Is there a third party app that does that?
I'm on Linux (Ubuntu & Debian)
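One workaround that is sometimes suggested (not a true resume, but it keeps already-fetched layers nearby) is to run a local pull-through cache registry, so that a retried pull can fetch layers the cache has already completed from the local network instead of going back to Docker Hub. A minimal sketch, assuming Docker Hub as the upstream and localhost:5000 for the cache:

# Start a registry acting as a pull-through cache for Docker Hub
docker run -d -p 5000:5000 --restart=always --name registry-mirror \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  registry:2

# Point the Docker daemon at the mirror by merging this into /etc/docker/daemon.json:
# { "registry-mirrors": ["http://localhost:5000"] }
sudo systemctl restart docker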
I am looking for some advice on debugging some extremely painful Docker connectivity issues.
In particular, for an Azure DevOps Services Git repository, I am running a self-hosted (locally) dockerized Linux CI (setup according to https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/docker?view=azure-devops#linux), which has been working fine for a few months now.
All this runs on a company network, and since last week the network connection of my Docker container has become highly unstable:
Specifically, it intermittently loses the network connection, which is also visible in the logs of the Azure DevOps agent, which then keeps trying to reconnect.
This especially happens while downloading Git LFS objects. Enabling extra traces via GIT_TRACE=1 highlights a lot of connection failures and retries:
trace git-lfs: xfer: failed to resume download for "SHA" from byte N: expected status code 206, received 200. Re-downloading from start
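If I read that message correctly, the "expected status code 206, received 200" part means the server (or something in between, e.g. a proxy) is ignoring the Range header that git-lfs sends to resume a partial download. A quick way to check whether range requests survive the network path at all (the URL is just a placeholder; substitute any large file reachable through the same proxy):

curl -s -o /dev/null -D - -H "Range: bytes=0-1023" https://example.com/somelargefile.bin
# A "206 Partial Content" response means ranges are honoured end to end;
# a plain "200 OK" means something in the path forces full re-downloads.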
During such an LFS pull/fetch, the container sometimes even stops responding, and a docker container list command only returns:
Error response from daemon: i/o timeout
As a result the daemon cannot recover on its own and needs a manual restart (to bring the CI back up).
Also I see remarkable differences in network performance:
Manually cloning the same Git repository (including LFS objects, all from scratch) in container instances (created from the same image) on different machines takes less than 2 minutes on my dev laptop (connected from home via VPN), while the same operation easily takes up to 20 minutes (!) in containers running on two different Win10 machines (company network, physically located in offices, hence no VPN).
Clearly this is not about the host network connection itself, since cloning on the same Win10 hosts (company network/offices) outside of the containers takes only 14 seconds!
Hence I am suspecting some network configuration issue (e.g. something with the Hyper-V vEthernet adapter? Firewall? Proxy? Or some other watchdog gone astray?), but after three days of debugging I am not quite sure how to investigate this further, as I am running out of ideas and expertise. Any thoughts / advice / hints?
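One thing I have seen suggested for Hyper-V/VPN setups like this (purely a diagnostic sketch, not a confirmed cause in my case) is to compare the MTU inside the container with the MTU of the host's adapters, since a mismatch can cause exactly this kind of throughput collapse:

# Inside a Linux container on the affected host (look at the "mtu" value):
docker run --rm alpine ip link show eth0
# On the Windows 10 host (compare the MTU of the vEthernet adapters):
netsh interface ipv4 show subinterfaces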
I should add that LFS configuration options (such as lfs.concurrenttransfers and lfs.basictransfersonly) did not really help, and neither did git config http.version (or simply removing some of the larger files).
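For reference, the kind of settings meant above look like this (the values are only illustrative, not recommendations):

# Reduce parallelism / force the basic transfer adapter
git config lfs.concurrenttransfers 1
git config lfs.basictransfersonly true
# Force HTTP/1.1 instead of HTTP/2
git config http.version HTTP/1.1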
UPDATE
It does not actually seem to be about the self-hosted agent, but a more general Docker network configuration issue within my corporate network.
Running the following works consistently fast on my VPN machine (running from home):
docker run -it ubuntu bash -c 'apt-get update; apt-get install -y wget; start=$SECONDS;
  wget http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso;
  echo Duration: $(( SECONDS - start )) seconds'
Comparison with a PowerShell download (on the host):
$start = Get-Date
(New-Object net.webclient).DownloadFile("http://cdimage.ubuntu.com/lubuntu/releases/18.04/release/lubuntu-18.04-alternate-amd64.iso", "e:/temp/lubuntu-18.04-alternate-amd64.iso")
'Duration: {0:mm} min {0:ss} sec' -f ($(Get-Date) - $start)
Corporate network:
Docker: 1560 seconds (= 26 min!)
Windows host: Duration: 00 min 15 sec
Dev laptop (VPN, from home):
Docker: 144 seconds (= 2 min 24 sec)
Windows host: Duration: 02 min 16 sec
Looking at the issues discussed in https://github.com/docker/for-win/issues/698 (and the proposed workaround, which didn't work for me), this seems to be a non-trivial problem with Windows / Hyper-V.
The whole issue "solved itself" when my company finally decided to upgrade from Win10 1803 to 1909 (which supports the WSL 2 backend, replacing Hyper-V).. 😂
Now everything runs super smoothly (I re-ran these tests almost 20 times).
If I run Docker (Docker Desktop 2.0.0.3 on Windows 10), access to the internal infrastructure and containers is fine. I can easily do
docker pull internal.registry:5005/container:latest
But once I enable Kubernetes there, I completely lose access to the internal infrastructure: [Errno 113] Host is unreachable appears in Kubernetes itself, or connect: no route to host from Docker.
I have tried several things, including switching the NAT from DockerNAT to the Default Switch. That doesn't take effect without a restart, and a restart changes it back to DockerNAT, so no luck there. This option also doesn't seem to work.
Let's start with the basics from the official documentation:
Please make sure you meet all the prerequisites and have followed all the other instructions.
You can also use this guide; it has more details on what might have gone wrong in your case.
If the above doesn't help, there are a few other things to consider:
In case you are using a virtual machine, make sure that the IP you are referring to is the one of the Docker engine's host and not the one on which the client is running.
Try to add tmpnginx in docker-compose.
Try to delete the pki directory in C:\programdata\DockerDesktop (first stop Docker, delete the dir and then start Docker). The directory will be recreated and the k8s-app=kube-dns labels should work fine (a quick way to verify this is shown below).
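To check that DNS inside Kubernetes is healthy again after the pki directory is recreated (assuming kubectl is pointed at the Docker Desktop cluster), you can run:

kubectl get pods -n kube-system -l k8s-app=kube-dns
# The coredns/kube-dns pods should be Running; if not, describe them to see why:
# kubectl describe pod <pod-name> -n kube-system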
Please let me know if that helped.
I can deploy to my local Service Fabric cluster and it works fine. When I attempt to deploy it to my Azure Service Fabric cluster, it errors out with
Error event: SourceId='System.Hosting', Property='Download:1.0:1.0:5fb96531-7b75-42d0-8f23-6a9e42f0bda4'.
There was an error during download.System.Fabric.FabricException (-2147017731)
Container image download failed for ImageName=microsoft/aspnet with unexpected error. Exception=System.Exception: Container image history check failed after successful download. ImageName=microsoft/aspnet.
at Hosting.ContainerActivatorService.ContainerImageDownloader.d__6.MoveNext().
When googling this error, the common answers are that the VM hard drive is full (I checked one of my nodes: over 100 GB available) or that the VM operating system is wrong (I verified on the VM scale set that it is running 2016-Datacenter-with-Containers). I have also seen some people mention not having enough resources on the VMs, so I bumped them up to Standard_D3_v2, which should be plenty.
I did see some people mention increasing the container download timeout. The image is over 5 GB, so this is potentially an issue, and it could work locally because the image comes from the local Docker cache. Unfortunately, I'm not sure how to increase that timeout easily.
What else could cause this issue?
Make sure you target the correct version of the (base) image.
There are a few to choose from.
The version of the image must be compatible with the version of Windows you're running on the host.
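A quick way to see what the host is actually running, plus an example of pinning the base image to a matching tag (the tag below is only illustrative; check the repository for the tags that actually exist):

docker info    # the "Operating System" line shows the host OS version/build
# In the Dockerfile, prefer an explicitly versioned tag that matches the host, e.g.:
# FROM microsoft/aspnet:4.7.2-windowsservercore-ltsc2016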
For an image of this size, it is likely that the download is timing out.
You could try:
Use a private registry in the same region as your cluster, like Azure Container Registry; you might get higher download speeds.
If the bottleneck is your network, increase the VM size; bigger VMs have more bandwidth.
Configure the cluster to wait longer to download the image. You can try setting the ContainerImageDownloadTimeout as described here
This is set in the cluster configuration, and your cluster manifest will have a section like this:
{
  "name": "Hosting",
  "parameters": [
    {
      "name": "ContainerImageDownloadTimeout",
      "value": "1200"
    }
  ]
}
To change the settings on an existing cluster, you can follow the instructions found here and here; a sketch of doing it with PowerShell is shown below.
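As a sketch of what changing that setting on a running Azure cluster can look like (using the Az.ServiceFabric PowerShell module; the resource group and cluster names are placeholders):

# Requires the Az.ServiceFabric module and an authenticated session (Connect-AzAccount)
Set-AzServiceFabricSetting -ResourceGroupName "myResourceGroup" `
    -Name "myCluster" `
    -Section "Hosting" `
    -Parameter "ContainerImageDownloadTimeout" `
    -Value "1200"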
We're setting up a server to host Windows containers.
This server gets the images from an internal Docker registry we have setup.
The issue is that the server is unable to pull down images because it's trying to get a base image from the internet, and the server has no internet connection.
I found a troubleshooting script from Microsoft and noticed one passage:
At least one of 'microsoft/windowsservercore' or
'microsoft/nanoserver' should be installed
Try docker pull microsoft/nanoserver or docker pull
microsoft/windowsservercore to pull a Windows container image
Since my PC has an internet connection, I downloaded these images and pushed them to the registry, but pulling the images on the new server fails:
The description for Event ID '1' in Source 'docker' cannot be found. The local computer may not have the necessary registry information or message DLL files to display the message, or you may not have permission to access them. The following information is part of the event:'Error initiating layer download: Get https://go.microsoft.com/fwlink/?linkid=860052: dial tcp 23.207.173.222:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'
The link it's trying to reach is a base image on the internet, but I thought the registry stored the complete image, so what gives? Is it really not possible to store the base images in a registry?
Doing some reading I found this: https://docs.docker.com/registry/deploying/#considerations-for-air-gapped-registries
Certain images, such as the official Microsoft Windows base images,
are not distributable. This means that when you push an image based on
one of these images to your private registry, the non-distributable
layers are not pushed, but are always fetched from their authorized
location. This is fine for internet-connected hosts, but will not work
in an air-gapped set-up.
The doc then details how to set up the registry to store non-distributable layers, but it also says to be mindful of the terms of use for non-distributable layers.
So two possible solutions are:
Make sure the terms of use allow you to store the non-distributable layers, then reconfigure the Docker daemon that pushes to the registry so that those layers get pushed too (see the sketch after this list)
Connect the server to the internet, download the base images, then use those images
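For the first option, a minimal sketch of the client-side part (the registry hostname is a placeholder, and the daemon.json path is /etc/docker/daemon.json on Linux or C:\ProgramData\docker\config\daemon.json on Windows Server): the machine that pushes the Windows base images needs allow-nondistributable-artifacts set for your registry, after which the foreign layers are uploaded along with the rest of the image.

# On the internet-connected machine that does the push, merge this into daemon.json:
# {
#   "allow-nondistributable-artifacts": ["registry.internal.example:5000"]
# }
# Restart the Docker daemon, then push again:
docker push registry.internal.example:5000/microsoft/nanoserver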