Irregular TIMEOUTs in Docker Operations between GCP GKE and AWS ECR

In my company, we have a GitLab CI/CD pipeline running on GitLab Runner on Kubernetes (GKE). As part of that process, we push Docker images to a private repository in AWS ECR. The runners access the Internet through Cloud NAT.
This worked for several months with no issues until two weeks ago, when we started getting timeout errors such as the one below:
Error response from daemon: Get https://<account>.dkr.ecr.<region>.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
The failures are not consistent; sometimes the process finishes successfully. The messages also vary, because this sometimes happens during docker login and other times during docker push. Finally, retries during push like the one below never used to happen, but now they happen a lot.
Retrying in 5 seconds ... Retrying in 1 second
Has anyone ever faced something similar to this situation?
I've already tried some solutions I found online, such as reducing --max-concurrent-uploads and setting --dns 8.8.8.8 on the Docker daemon. I've also tried increasing Cloud NAT's Minimum ports per VM instance option and changing the TCP TIME_WAIT timeout. No success with any of these.
We have other pipelines in the company, but only this one accesses ECR. There are no issues accessing any other endpoint.
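For reference, the daemon flags mentioned above can also be set persistently in /etc/docker/daemon.json. A minimal sketch, with illustrative values rather than recommendations (written to a local file here; on a real host it would be /etc/docker/daemon.json):

```shell
# Sketch of the daemon settings tried above, written to a local file for
# illustration; on a real host this would be /etc/docker/daemon.json,
# followed by a daemon restart (e.g. sudo systemctl restart docker).
cat > daemon.json <<'EOF'
{
  "max-concurrent-uploads": 2,
  "dns": ["8.8.8.8"]
}
EOF
cat daemon.json
```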

Related

Hitting docker rate limit without pulling at all

I have a computer that is running Docker. Now I get the error toomanyrequests when I try to pull an image. The twist is that I get this error even when Docker is just running and I do not pull anything. So by waiting I never get to pull anything, unless I change my IP. With a fresh IP, I can pull without a problem, but after a few hours I can no longer pull from the IP that the computer running Docker is using.
To my knowledge, I do not have any other software running that should trigger a pull. Is there anything in Docker itself that contacts Docker Hub and is causing the rate limit to kick in? I just have 3 simple services running in Docker: a web proxy, a database, and Keycloak. This is on a VM running Ubuntu 22.04.
There are no other machines on my network running Docker. If I start other machines and start Docker there, this problem does not occur. For example, I can start Docker Desktop on another machine, pull lots of stuff, and leave it running, and I do not get toomanyrequests.
Can anyone offer an explanation of what is causing this? How can I fix this?
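One way to see whether the limit is actually being consumed from your IP is Docker's documented rate-limit check: request an anonymous token, then read the ratelimit headers from a HEAD request. A hedged sketch (requires network access; ratelimitpreview/test is the image Docker provides for exactly this check):

```shell
# Fetch an anonymous pull token and inspect the ratelimit-* response
# headers. Degrades to a message when offline or when no headers come back.
check_rate_limit() {
  TOKEN=$(curl -sf "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
    | grep -o '"token":"[^"]*"' | cut -d'"' -f4)
  if [ -z "$TOKEN" ]; then
    echo "could not fetch token (offline?)"
    return 0
  fi
  curl -sf --head -H "Authorization: Bearer $TOKEN" \
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
    | grep -i '^ratelimit' || echo "no ratelimit headers returned"
}
result=$(check_rate_limit)
echo "$result"
```

Running this periodically would show whether the remaining count drops while you are not pulling anything.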

Concourse Can't Connect to Docker Repository

I'm new to Concourse and trying to set it up in my environment. I'm running Ubuntu 18.04 on VirtualBox 6.1.4 r136177 on a Windows machine. I managed to get the node running and the Concourse worker set up, and I was able to access my Concourse dashboard successfully. The problem occurred when I tried to run the simple hello-world pipeline outlined on this page: https://concourse-ci.org/hello-world-example.html
The error says:
ERRO[0004] check failed: get remote image: Get https://index.docker.io/v2/: dial tcp: lookup index.docker.io on [::1]:53: read udp [::1]:55989->[::1]:53: read: connection refused
Googling for similar errors suggested that VirtualBox might not be able to connect to the Docker repository. So I installed Docker on my system and ran the following command:
sudo docker run hello-world
But this time Docker successfully pulled the image, so I don't think it is an issue with my VirtualBox. Has anyone experienced the same issue and found a solution?
UPDATES
The following question inspired me to build my own registry:
How to use a local docker image as resource in concourse-docker
I have configured my local Docker registry and verified that it works by pulling my image from it. So I configured a simple Concourse pipeline to use my registry by modifying the hello-world example:
---
jobs:
- name: job
  public: true
  plan:
  - task: simple-task
    config:
      platform: linux
      image_resource:
        type: docker-image
        source:
          repository: 127.0.0.1:5000/busybox
          tag: latest
          insecure_registries: [ "127.0.0.1:5000" ]
      run:
        path: echo
        args: ["Hello, world!"]
But then I run into the following error :
resource script '/opt/resource/check []' failed: exit status 1
stderr:
failed to ping registry: 2 error(s) occurred:
* ping https: Get https://127.0.0.1:5000/v2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers
* ping http: Get http://127.0.0.1:5000/v2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers
That 127.0.0.1 is likely referring to the IP of the check container, not the machine where Concourse is running as a worker (unless you are using houdini as the container strategy). Try using the actual IP of the machine running Docker instead.
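As a sketch of that suggestion, this is one way to find the host's LAN IP to substitute for 127.0.0.1 in the repository and insecure_registries fields (hostname -I is Linux-specific, and the :5000 port comes from the example above):

```shell
# Find the machine's first LAN IP (Linux-specific); fall back to a
# placeholder if detection fails.
HOST_IP=$(hostname -I 2>/dev/null | awk '{print $1}')
REGISTRY="${HOST_IP:-<host-ip>}:5000"
echo "use this instead of 127.0.0.1:5000 -> $REGISTRY"
```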
I faced the same problem. In my case, the Concourse worker was installed on a QEMU VM inside Proxmox.
When starting a job with the fly -t tutorials trigger-job --job hello-world/hello-world-job --watch command (given in the tutorial), the worker answered ERRO[0030] checking origin busybox failed: initialize transport: Get "https://index.docker.io/v2/": dial tcp xx.xx.xx.xx:443: i/o timeout.
This means the worker can't reach any DNS server.
There are two ways to solve this problem.
First option: run everything through docker-compose. docker-compose.yml has a setting for the worker, CONCOURSE_GARDEN_DNS_PROXY_ENABLE: "true", and with it everything works fine. However, I tried specifying the same setting when running the worker directly inside the VM (without Docker), and it did not fix the problem.
Second option (without docker):
Use these settings for your worker:
CONCOURSE_RUNTIME=containerd
CONCOURSE_CONTAINERD_EXTERNAL_IP=192.168.1.106
CONCOURSE_CONTAINERD_DNS_SERVER=192.168.1.1
CONCOURSE_CONTAINERD_ALLOW_HOST_ACCESS=true
CONCOURSE_CONTAINERD_DNS_PROXY_ENABLE=true
After setting these parameters, my worker could see the DNS server and access the Docker registry.
Replace 192.168.1.106 with your machine's address on your local network, and 192.168.1.1 with your DNS server.
These parameters are documented here. You can also get their descriptions with the concourse worker --help command.
Containerd Container Networking:
--containerd-external-ip= IP address to use to reach container's mapped ports. Autodetected if not specified. [$CONCOURSE_CONTAINERD_EXTERNAL_IP]
--containerd-dns-server= DNS server IP address to use instead of automatically determined servers. Can be specified multiple times. [$CONCOURSE_CONTAINERD_DNS_SERVER]
--containerd-restricted-network= Network ranges to which traffic from containers will be restricted. Can be specified multiple times. [$CONCOURSE_CONTAINERD_RESTRICTED_NETWORK]
--containerd-network-pool= Network range to use for dynamically allocated container subnets. (default: 10.80.0.0/16) [$CONCOURSE_CONTAINERD_NETWORK_POOL]
--containerd-mtu= MTU size for container network interfaces. Defaults to the MTU of the interface used for outbound access by the host. [$CONCOURSE_CONTAINERD_MTU]
--containerd-allow-host-access Allow containers to reach the host's network. This is turned off by default. [$CONCOURSE_CONTAINERD_ALLOW_HOST_ACCESS]
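Put together, starting the worker directly on the VM with those settings might look like the sketch below (the IPs are the answer's examples and should be replaced with your own; the work dir path is a placeholder):

```shell
# Export the worker settings from the answer above (example IPs, replace
# with your machine's LAN IP and your DNS server).
export CONCOURSE_RUNTIME=containerd
export CONCOURSE_CONTAINERD_EXTERNAL_IP=192.168.1.106
export CONCOURSE_CONTAINERD_DNS_SERVER=192.168.1.1
export CONCOURSE_CONTAINERD_ALLOW_HOST_ACCESS=true
export CONCOURSE_CONTAINERD_DNS_PROXY_ENABLE=true
# Then start the worker as usual, e.g. (path is a placeholder):
# concourse worker --work-dir /opt/concourse/worker
env | grep '^CONCOURSE_CONTAINERD_' | sort
```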
I had the same issue. I cloned this repo - https://github.com/concourse/concourse-docker - followed the directions in the readme to generate the keys, and then used the docker-compose.yml file from the clone to spin up the Docker container.

Docker pull stops working on amazon ec2 instance after ACL was applied to the subnet which this instance belongs to

I have an amazon-ec2 instance running Amazon Linux. Docker installed on that instance used to work just fine until I created a network ACL and applied it to the subnet that my instance with Docker belongs to.
The ACL restricts inbound traffic to certain IP addresses and allows all outbound traffic.
After the ACL was applied to the subnet, pulling images from https://hub.docker.com/ (the docker pull command) stopped working and fails with the error:
Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
I tried looking into the Flow Logs and saw some incoming requests with status REJECTED. I suspect that docker pull causes some incoming connections which are blocked by the ACL. These connections come from different IP addresses, so I could not find any fixed set of IPs to add to the allowed list in the ACL.
Can anybody suggest a way to configure this properly and fix pulling Docker images?
The ACL configuration was attached as screenshots of the Outbound and Inbound rules.
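The likely cause (stated here as an assumption, not part of the original question): network ACLs are stateless, so the REJECTED flows are probably the registry's return traffic arriving on ephemeral ports, which the restrictive inbound rules drop. A sketch of the kind of inbound rule that would allow it, printed rather than executed since the ACL id is a placeholder:

```shell
# NACLs are stateless: return traffic from the registry comes back on
# ephemeral ports (1024-65535), so an inbound allow rule is needed.
ACL_ID="acl-0123456789abcdef0"   # placeholder, not a real ACL id
cmd="aws ec2 create-network-acl-entry \
  --network-acl-id $ACL_ID --ingress --rule-number 200 \
  --protocol 6 --port-range From=1024,To=65535 \
  --cidr-block 0.0.0.0/0 --rule-action allow"
echo "$cmd"
```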

GKE nodes can't reach external IP hosted on the same GKE cluster

Running 1.11.2-gke.9 (COS image).
I have installed gitlab-ci (including the container registry) via the Helm chart. Everything is green.
A simple CI/CD pipeline was pushing new images to the GitLab Docker registry. Push works.
On deployment, there is Error: ErrImagePull with
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
After SSHing in (GCP console SSH within the browser), I noticed that the node can reach virtually the entire Internet but not the very ingress the cluster is hosting.
Hence docker login/pull hang.
How come the GitLab runner running inside GKE can push to the registry, but the node that starts application pods cannot pull or log in?
All firewall rules were created by GKE itself, and they allow 80/443.
Routing quirk/bug?
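A hedged diagnostic for this kind of situation, run from the node over SSH: first confirm the registry hostname resolves, then check whether the ingress answers on 443. registry.invalid below is a deliberately non-resolving placeholder; substitute your real registry host.

```shell
# Placeholder host; the .invalid TLD never resolves, so as written this
# demonstrates the failure branch. Substitute your GitLab registry hostname.
REGISTRY_HOST="registry.invalid"
if getent hosts "$REGISTRY_HOST" >/dev/null; then
  echo "DNS ok for $REGISTRY_HOST"
else
  echo "DNS lookup failed for $REGISTRY_HOST"
fi
# With the real host, also check the ingress port, e.g.:
# nc -z -w 5 "$REGISTRY_HOST" 443 || echo "ingress unreachable from node"
```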

IBM Cloud Private node appears to be running but services are unresponsive

One of my ICP nodes appears to be running, but the services on that node are unresponsive and at times return a 504 Gateway Timeout.
When I SSH into the unresponsive node and run journalctl -u kubelet -f, I see error messages such as transport: dial unix /var/run/docker/containerd/docker-containerd.sock: connect: connection refused
Furthermore, when I run top, I see dockerd using an unusually high percentage of my CPU.
What is causing this behavior, and how can I return my node to its normal working condition?
These errors might be due to a known issue with Docker where an old containerd reference is used even after the containerd daemon is restarted. This defect causes the Docker daemon to go into an internal error loop that uses a high amount of CPU resources and logs a high number of errors. For more information about this error, please see the "Refresh containerd remotes on containerd restarted" pull request against the Moby project.
To work around this issue, use the host operating system's command to restart the docker service on the node. After some time, the services should resume.
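The workaround above as concrete commands, with a quick check for the symptom (a systemd-based host is assumed; the restart lines are commented out since they require root):

```shell
# Check whether dockerd is running and what CPU it is using (the symptom
# described above was dockerd pinned at unusually high CPU in top).
if ps -C dockerd -o pid,pcpu,comm; then
  echo "dockerd found"
else
  echo "dockerd not running on this host"
fi
# The actual workaround, on a systemd host (requires root):
# sudo systemctl restart docker
# journalctl -u docker -f   # watch the daemon settle
```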
