Docker pull fails for some layers from self-hosted Registry - docker

I have encountered an issue with a self-hosted Docker Registry. When pulling a certain image, the pull fails for some of its layers. As soon as I start the docker pull command, some layers report:
9894d615bbeb: Retrying in 5 seconds
dfc282427f6f: Retrying in 5 seconds
8dbb865cf7b1: Retrying in 1 second
This happens immediately after the command starts, and the layers are not large.
The registry has been working flawlessly for about a year.
Recently we adopted a more Docker-centric CI/CD flow, and the need to continuously clean up the registry arose. For that reason we run a nightly cleanup job of the registry, roughly as sketched below:
1. Delete all manifests on the filesystem (except for a few persistent images we want to keep).
2. Run the bin/registry garbage-collect command to delete all unreferenced blobs.
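For illustration, a minimal sketch of that job, assuming the stock registry:2 filesystem layout and a registry container named registry (the keep-list and paths here are placeholders, not our actual script):

# repositories whose manifests we never delete (hypothetical keep-list)
KEEP="applications/base applications/infra"
ROOT=/var/lib/registry/docker/registry/v2/repositories

# 1. drop the _manifests directory (revisions + tags) of every repo not on the keep-list
for mf in $(find "$ROOT" -type d -name _manifests); do
  repo=${mf#$ROOT/}; repo=${repo%/_manifests}
  case " $KEEP " in *" $repo "*) continue ;; esac
  rm -rf "$mf"
done

# 2. delete blobs that are no longer referenced by any manifest
docker exec registry bin/registry garbage-collect /etc/docker/registry/config.yml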
I have verified that the blobs for these failing layers do indeed exist on the registry; I can navigate to them on the filesystem.
The issue reproduces both on my local system and on a remote server.
The docker registry logs show that the HTTP request for the blob was successful:
time="2019-05-03T13:09:21.714123801Z" level=info msg="response completed" go.version=go1.7.6 http.request.host=registry.example.com http.request.id=e5c01bee-a755-48a5-b87a-716560dd0e25 http.request.method=GET http.request.remoteaddr=94.237.28.78 http.request.uri="/v2/applications/foo/dist-staging/blobs/sha256:9894d615bbebb5b235bb5a7aed17e9b2ba35c95c9fc8c0c763476c057536842f" http.request.useragent="docker/17.05.0-ce go/go1.7.5 git-commit/89658be kernel/4.4.0-145-generic os/linux arch/amd64 UpstreamClient(Swipely/Docker-API 1.33.6)" http.response.contenttype="application/octet-stream" http.response.duration=8.154072ms http.response.status=200 http.response.written=0 instance.id=92dfad5e-bf76-4db6-a8de-07901539d36e service=registry version=v2.6.2
172.17.0.1 - - [03/May/2019:13:09:21 +0000] "GET /v2/applications/foo/dist-staging/blobs/sha256:9894d615bbebb5b235bb5a7aed17e9b2ba35c95c9fc8c0c763476c057536842f HTTP/1.0" 200 0 "" "docker/17.05.0-ce go/go1.7.5 git-commit/89658be kernel/4.4.0-145-generic os/linux arch/amd64 UpstreamClient(Swipely/Docker-API 1.33.6)"
The docker registry is behind an nginx proxy, but its settings have not been changed recently. While debugging, I tried the following, with no luck:
proxy_buffering off;
proxy_max_temp_file_size 0;
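For context, the rest of the proxy block is the usual registry-behind-nginx shape, roughly like this (the upstream name and port are placeholders; our real config may differ slightly):

location /v2/ {
    client_max_body_size      0;    # layer blobs can be large, don't cap request bodies
    chunked_transfer_encoding on;
    proxy_set_header Host              $http_host;
    proxy_set_header X-Real-IP         $remote_addr;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_read_timeout 900;
    proxy_pass http://registry:5000;
}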
Is there anything else I should check? Could this be caused by the registry cleanup somehow? If so, why and how?
Edit
It seems that something was stale in the Registry cache, because after a restart it started throwing unknown blob errors.
After rebuilding and re-pushing the image, this error went away and the Registry was able to serve the image to clients again.
I think that means something got messed up during the registry cleanup, but why? garbage-collect should only remove unreferenced blobs and should be safe to use, if I understand it correctly?
This question took a turn but I will leave it here in its entirety for others.

Related

docker push to nexus registry (behind proxy) ends with EOF

I have tried a lot, but I can't find a solution to this problem.
I am running a Sonatype Nexus (3.21.1-01) Docker image on a CentOS 7 server behind an A10 vThunder proxy.
docker login and docker pull work great, but docker push fails with EOF after some retrying.
Here are the relevant routes:
docker image port 8081 > my.server:8081
docker image port 8443 > my.server:8443
proxy.domain.local:443 > my.server:8081
proxy.domain.local:8443 > my.server:8443
I have created a Docker repository in Nexus which has the HTTP connector exposed on 8443.
The proxy is exposed over SSL with a self-signed certificate.
The client's /etc/docker/daemon.json file contains the insecure-registries option:
"insecure-registries": ["proxy.domain.local:8443","proxy.domain.local"]
Here is the situation:
If I try to push from the client an image whose layers all already exist on the remote server (but are missing from the Nexus repository), it works.
If I try the same but add some difference to the image (such as a new LABEL), it fails in this way:
(9c27e219663c: Layer already exists
Patch https://proxy.domain.local:8443/v2/test4/blobs/uploads/6862fe60-d63b-4942-bbb6-f403307e677a: EOF)
If I push directly from the my.server machine, pointing to localhost:8443, it works.
If I push an image with new layers from the client machine, it fails in this way after some retrying (the same behavior occurs with smaller images):
docker push proxy.domain.local:8443/ara
The push refers to repository [proxy.domain.local:8443/ara]
edb7a4f74e22: Retrying in 8 seconds
de421654540d: Retrying in 8 seconds
-------------
The push refers to repository [proxy.domain.local:8443/ara]
edb7a4f74e22: Pushing [==================================================>] 172.6MB/172.6MB
de421654540d: Pushing [==================================================>] 200.8MB/200.8MB
EOF
This is a summary of what happens in Wireshark:
the.client my.server HTTP 316 GET /v2/ HTTP/1.1
...
my.server the.client HTTP 654 HTTP/1.1 401 Unauthorized (application/json)
...
the.client my.server HTTP 442 HEAD /v2/alpine-test/blobs/sha256:95f5ecd24e438e09033c8e69ec136079f8774ab8284f1431f5433a829054b5e7 HTTP/
(asking Nexus whether the layer is already uploaded)
my.server the.client HTTP 493 HTTP/1.1 404 Not Found
(it isn't)
the.client my.server HTTP 437 POST /v2/alpine-test/blobs/uploads/ HTTP/1.1
(so it starts to upload the layer)
my.server the.client HTTP 584 HTTP/1.1 202 Accepted
...
the.client my.server HTTP 437 POST /v2/alpine-test/blobs/uploads/ HTTP/1.1
...
my.server the.client HTTP 584 HTTP/1.1 202 Accepted
..
and so on, with some FIN/ACKs in between, until the client stops sending...
** There is absolutely no trace of this in the Nexus server log **
This is the Nexus docker-compose file:
services:
  nexus:
    build:
      context: .
      args:
        DOCKER_GID: ${DOCKER_GID}
        NEXUS_UID: ${NEXUS_UID}
        NEXUS_GID: ${NEXUS_GID}
    restart: always
    environment:
      - NEXUS_UID_GID=${NEXUS_UID_GID}
      - HOSTNAME_DOCKER_NEXUS=${HOSTNAME_DOCKER_NEXUS}
    ports:
      - "8081:8081"
      - "8443:8443"
    user: ${NEXUS_UID_GID}
    hostname: ${HOSTNAME_DOCKER_NEXUS}
    volumes:
      - /var/nexus-data:/nexus-data
      - /etc/hosts:/etc/hosts
      - /var/run/docker.sock:/var/run/docker.sock
Can you help me?
I was thinking about a possible Nexus/Docker user permission issue on the local machine, or with the Docker binary permissions (if I push from localhost it works, yes, but the image is already stored on that system of course) - but I think that is not very probable.
I was also thinking about a proxy configuration issue (more probable), but I don't know much about the proxy.
[Workaround]
Because I could not figure out the problem, I ended up making the proxy transparent and configuring Nexus to serve HTTPS directly through its jetty.xml, jetty-https.xml and nexus.properties.
Serving HTTPS directly from Jetty, instead of letting the proxy upgrade the connection, solved the above problem.
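For anyone doing the same, the change boils down to something like this in $NEXUS_DATA/etc/nexus.properties (a rough sketch based on the Sonatype docs; file names and keystore locations can differ between versions):

# enable the HTTPS port and include the Jetty HTTPS config
application-port-ssl=8443
nexus-args=${jetty.etc}/jetty.xml,${jetty.etc}/jetty-http.xml,${jetty.etc}/jetty-requestlog.xml,${jetty.etc}/jetty-https.xml
# jetty-https.xml then points at the keystore (etc/ssl/keystore.jks by default)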

Docker push intermittent failure to private docker registry on kubernetes (docker-desktop)

I'm running a kubernetes cluster on docker-desktop (mac).
It has a local docker registry inside it.
I'm able to query the registry without problems via API calls to get the list of tags.
I was able to push an image before, but it took multiple attempts to push.
I can't push new changes now. It looks like the layers push successfully, but then the push doesn't acknowledge that a layer has been uploaded and retries it.
The repo is called localhost:5000 and I am port forwarding correctly, as per the instructions at https://blog.hasura.io/sharing-a-local-registry-for-minikube-37c7240d0615/
I'm not using SSL certs as this is for development on my local machine.
(The port forwarding is proven to work; otherwise the API calls would fail.)
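For reference, the forwarding is the standard kubectl port-forward from that article, roughly like this (the pod name and namespace are placeholders for whatever the registry runs as here):

kubectl port-forward --namespace kube-system <registry-pod> 5000:5000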
e086a4af6e6b: Retrying in 1 second
35c20f26d188: Layer already exists
c3fe59dd9556: Pushing [========================> ] 169.3MB/351.5MB
6ed1a81ba5b6: Layer already exists
a3483ce177ce: Retrying in 16 seconds
ce6c8756685b: Layer already exists
30339f20ced0: Retrying in 1 second
0eb22bfb707d: Pushing [==================================================>] 45.18MB
a2ae92ffcd29: Waiting
received unexpected HTTP status: 502 Bad Gateway
Workaround (this will suffice but is not ideal, as I have to build each container locally):
apiVersion: v1
kind: Pod
metadata:
  name: producer
  namespace: aetasa
spec:
  containers:
    - name: kafkaproducer
      image: localhost:5000/aetasa/cta-user-create-app
      imagePullPolicy: Never   # this line uses the locally built container in docker
      ports:
        - containerPort: 5005
kubectl logs for the registry:
10.1.0.1 - - [20/Feb/2019:19:18:03 +0000] "POST /v2/aetasa/cta-user-create-app/blobs/uploads/ HTTP/1.1" 202 0 "-" "docker/18.09.2 go/go1.10.6 git-commit/6247962 kernel/4.9.125-linuxkit os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.2 \x5C(darwin\x5C))" "-"
2019/02/20 19:18:03 [warn] 12#12: *293 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000011, client: 10.1.0.1, server: localhost, request: "PATCH /v2/aetasa/cta-user-create-app/blobs/uploads/16ad0e41-9af3-48c8-bdbe-e19e2b478278?_state=qjngrtaLCTal-7-hLwL9mvkmhOTHu4xvOv12gxYfgPx7Ik5hbWUiOiJhZXRhc2EvY3RhLXVzZXItY3JlYXRlLWFwcCIsIlVVSUQiOiIxNmFkMGU0MS05YWYzLTQ4YzgtYmRiZS1lMTllMmI0NzgyNzgiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMTktMDItMjBUMTk6MTg6MDMuMTU2ODYxNloifQ%3D%3D HTTP/1.1", host: "localhost:5000"
2019/02/20 19:18:03 [error] 12#12: *293 connect() failed (111: Connection refused) while connecting to upstream, client: 10.1.0.1, server: localhost, request: "PATCH /v2/aetasa/cta-user-create-app/blobs/uploads/16ad0e41-9af3-48c8-bdbe-e19e2b478278?_state=qjngrtaLCTal-7-hLwL9mvkmhOTHu4xvOv12gxYfgPx7Ik5hbWUiOiJhZXRhc2EvY3RhLXVzZXItY3JlYXRlLWFwcCIsIlVVSUQiOiIxNmFkMGU0MS05YWYzLTQ4YzgtYmRiZS1lMTllMmI0NzgyNzgiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMTktMDItMjBUMTk6MTg6MDMuMTU2ODYxNloifQ%3D%3D HTTP/1.1", upstream: "http://10.104.68.90:5000/v2/aetasa/cta-user-create-app/blobs/uploads/16ad0e41-9af3-48c8-bdbe-e19e2b478278?_state=qjngrtaLCTal-7-hLwL9mvkmhOTHu4xvOv12gxYfgPx7Ik5hbWUiOiJhZXRhc2EvY3RhLXVzZXItY3JlYXRlLWFwcCIsIlVVSUQiOiIxNmFkMGU0MS05YWYzLTQ4YzgtYmRiZS1lMTllMmI0NzgyNzgiLCJPZmZzZXQiOjAsIlN0YXJ0ZWRBdCI6IjIwMTktMDItMjBUMTk6MTg6MDMuMTU2ODYxNloifQ%3D%3D", host: "localhost:5000"
Try configuring --max-concurrent-uploads=1 for your Docker daemon. You are pushing quite large layers (350MB), so you are probably hitting some limits (request sizes, timeouts) somewhere. A single concurrent upload may help you, but it is only a workaround. The real solution will eventually be configuration (buffer sizes, timeouts, ...) of the registry plus the reverse proxy in front of it.
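For reference, on Linux the daemon-side way to set this is /etc/docker/daemon.json, followed by a daemon restart (Docker Desktop exposes the same daemon JSON in its Preferences):

{
  "max-concurrent-uploads": 1
}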
It may be a disk space issue. If you store Docker images inside the Docker VM, you can fill up the disk space quite fast.
By default, the docker-desktop VM disk space is limited to 64 GB. You can increase it up to 112 GB on the "Disk" tab in Docker Preferences.
I have encountered this issue quite a few times and unfortunately couldn't get to a permanent fix.
Most likely the image has become corrupted in the registry. As a workaround, I suggest you delete the image from the registry and do a fresh push (see the sketch below). That works, and subsequent pushes work too.
This issue is probably related to missing layers of the image: sometimes we delete an image using the --force option, and in that case it is possible that some of the common layers get deleted, which affects other images that share the deleted layers.
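A sketch of the delete-and-repush workaround against the plain registry v2 HTTP API, assuming deletes are enabled in the registry config (storage.delete.enabled: true); the repository name is taken from the logs above and the tag latest is only an example:

# 1. look up the manifest digest for the tag
curl -sI -H 'Accept: application/vnd.docker.distribution.manifest.v2+json' \
  http://localhost:5000/v2/aetasa/cta-user-create-app/manifests/latest | grep Docker-Content-Digest

# 2. delete the manifest by digest (the registry answers 202 Accepted)
curl -X DELETE http://localhost:5000/v2/aetasa/cta-user-create-app/manifests/sha256:<digest>

# 3. re-push the image
docker push localhost:5000/aetasa/cta-user-create-app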

Unable to push image to Docker Hub registry

I am brand new to Docker and I am trying to follow the Getting Started tutorial from Docker. I am using Docker 17.05-ce under Ubuntu 17.04. The problem appears to be network related. When I try to push I get the following results:
jonathan@poseidon:~/DockerTest$ sudo docker push jgossage/get-started:part1
The push refers to a repository [docker.io/jgossage/get-started]
1770f1c9a8cf: Pushed
61fd1d8cd138: Pushed
e0f735a5e86f: Layer already exists
1de570a07fb5: Pushed
b3640b6d4ac2: Layer already exists
08d4c9ccebfd: Pushed
007ab444b234: Retrying in 1 second
dial tcp: lookup registry-1.docker.io on 127.0.0.53:53: dial udp 127.0.0.53:53: i/o timeout
jonathan@poseidon:~/DockerTest$ sudo docker logs 58e8df0a7426
* Running on http://0.0.0.0:80/ (Press CTRL+C to quit)
172.17.0.1 - - [20/Jun/2017 15:12:24] "GET / HTTP/1.1" 200 -
172.17.0.1 - - [20/Jun/2017 15:13:17] "GET / HTTP/1.1" 200 -
The push runs for some time with several retries before timing out.
This is on a home network with one computer connected to the router via WiFi and then normal TCP to my ISP and the Internet. What steps can I take to make Docker run reliably?
It looks like a DNS issue similar to this one: https://forums.docker.com/t/fata-0025-io-timeout-on-docker-image-push/1742/9
The suggestion is to replace your current DNS (127.0.0.53) with Google DNS (8.8.8.8).
I'm not sure if there is an open issue concerning this problem. I couldn't find one.
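On Ubuntu 17.04 the 127.0.0.53 stub resolver belongs to systemd-resolved, so one way to try this (a sketch, not the only way) is to point it at Google DNS and restart it:

# /etc/systemd/resolved.conf
[Resolve]
DNS=8.8.8.8 8.8.4.4

sudo systemctl restart systemd-resolved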
I resolved this issue by replacing the standard DNS caching and resolving server with a third-party implementation, Unbound. The following web page contains complete instructions for doing this at the end of the document. As also suggested by others, it is a good idea to switch to the public Google DNS servers.

Push docker image to Google Container Registry failure on Mac

I was trying to upload my image to Google Container Registry, but it returned an error and I don't know how to troubleshoot it.
$> gcloud docker -- push asia.gcr.io/dtapi-1314/web
The push refers to a repository [asia.gcr.io/dtapi-1314/web]
53ccd4e59f47: Retrying in 1 second
32ca8635750d: Retrying in 1 second
e5363ba7dd4d: Retrying in 1 second
d575d439624a: Retrying in 1 second
5c1cba20b78d: Retrying in 1 second
7198e99c156d: Waiting
6ca37046de16: Waiting
b8f2f07b3eab: Waiting
16681562a534: Waiting
92ea1d98cb79: Waiting
97ca462ad9ee: Waiting
unable to decode token response: read tcp 10.0.2.10:54718->74.125.23.82:443: read: connection reset by peer
I checked permissions on my Mac.
$> gsutil acl get gs://asia.artifacts.dtapi-1314.appspot.com
It returned a list of correct permissions.
I had tested the push from the Cloud Console, and it works.
Does anyone have a clue?
Thanks a lot if anyone could help. :)
Other troubleshooting
gcloud auth login
gcloud docker -- login -p $(gcloud auth print-access-token) -u _token https://asia.gcr.io
gsutil acl get gs://asia.artifacts.{%PROJECT_ID}.appspot.com
Add insecure-registry to dockerd startup command.
--insecure-registry asia.gcr.io
This might have the same cause:
gcloud docker -- pull google/python
The error was
Error response from daemon: Get https://registry-1.docker.io/v2/google/python/manifests/latest: read tcp 10.0.2.15:37762->52.45.33.149:443: read: connection reset by peer
docker server log
DEBU[0499] Increasing token expiration to: 60 seconds
ERRO[0500] Error trying v2 registry: Get https://registry-1.docker.io/....../python/manifests/latest: read tcp 10.0.2.15:37762->52.45.33.149:443: read: connection reset by peer
ERRO[0500] Attempting next endpoint for pull after error: Get https://registry-1.docker.io/....../python/manifests/latest: read tcp 10.0.2.15:37762->52.45.33.149:443: read: connection reset by peer
DEBU[0500] Skipping v1 endpoint https://index.docker.io because v2 registry was detected
ERRO[0500] Handler for POST /v1.24/images/create returned error: Get https://registry-1.docker.io/....../python/manifests/latest: read tcp 10.0.2.15:37762->52.45.33.149:443: read: connection reset by peer
Environment
macOS: 10.11.6
Docker Toolbox (on Mac)
Docker 1.12.3 (Git commit: 6b644ec, Built: Wed Oct 26 23:26:11 2016)
The root cause was silly, but I'd like to update this for anyone who sees this question. I found that when I attached my computer to the company's WiFi, it would work (still with some resets). My company's wired network is mysteriously broken with respect to Google Container Registry: it works for all other services we use (Google/YouTube/mobile services) but is broken for Google Container Registry.
Seems like a permission issue. Try running
gcloud auth login
I remember running into a similar issue and this helped.

HTTP status: 500 error on docker pull using docker-machine

After removing and reinstalling the default machine using Docker Quickstart and VirtualBox, any docker pull fails. Restarting docker-machine doesn't help.
For example:
~$ docker pull ubuntu:14.04
Error response from daemon: Get https://registry-1.docker.io/v2/library/ubuntu/manifests/14.04:
Received unexpected HTTP status: 500 Internal Server Error
This was likely caused by an intermittent error at docker.io (500 should've been a red flag ;). Next time, double-check from another machine if possible.
Earlier thoughts:
With the default docker-machine running, regenerating the TLS certs:
docker-machine regenerate-certs
sometimes fixes the problem, for one pull, but it has also yielded:
Error response from daemon: Get <url omitted>: Get <url omitted>:
net/http: request canceled (Client.Timeout exceeded while awaiting headers)
