Docker: too many open files

I have Docker installed via Snap on Ubuntu 20.04. From time to time, Portainer (which I'm using as a graphical UI to manage the containers) stops responding - in the sense that the UI still accepts interaction, but the list of containers or volumes won't load, I can't set up a new container, and so on.
When I ran snap logs docker just now, I got the following:
2022-10-07T09:38:59+03:00 docker.dockerd[770]: time="2022-10-07T09:38:59.500434820+03:00" level=error msg="Error replicating health state for container 5e7ed995ca45945035048596539293a9bb11ee0b4e30e7c4956eec953077806b: open /var/snap/docker/common/var-lib-docker/containers/5e7ed995ca45945035048596539293a9bb11ee0b4e30e7c4956eec953077806b/.tmp-config.v2.json632473701: too many open files"
2022-10-07T09:38:59+03:00 docker.dockerd[770]: time="2022-10-07T09:38:59.542152208+03:00" level=error msg="Error replicating health state for container f876ad961153cbd2815cc7715983987bc072c631a378cf3d2ba7e248aae27423: open /var/snap/docker/common/var-lib-docker/containers/f876ad961153cbd2815cc7715983987bc072c631a378cf3d2ba7e248aae27423/.tmp-config.v2.json471368832: too many open files"
2022-10-07T09:38:59+03:00 docker.dockerd[770]: time="2022-10-07T09:38:59.542155861+03:00" level=error msg="Error replicating health state for container dabe3abe216046ef5b84e3169f7faf90b6e3050cef75a8ca23e940e516ecff20: open /var/snap/docker/common/var-lib-docker/containers/dabe3abe216046ef5b84e3169f7faf90b6e3050cef75a8ca23e940e516ecff20/.tmp-config.v2.json801890271: too many open files"
2022-10-07T09:39:00+03:00 dockerd[770]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2022-10-07T09:39:01+03:00 dockerd[770]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2022-10-07T09:39:02+03:00 dockerd[770]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2022-10-07T09:39:02+03:00 docker.dockerd[770]: time="2022-10-07T09:39:02.822036736+03:00" level=error msg="Error replicating health state for container 98bee8456d114bff5ee423f46e7e0892dfdf5af694c8edc522841ce5e5976b1f: open /var/snap/docker/common/var-lib-docker/containers/98bee8456d114bff5ee423f46e7e0892dfdf5af694c8edc522841ce5e5976b1f/.tmp-config.v2.json624827314: too many open files"
2022-10-07T09:39:03+03:00 dockerd[770]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
2022-10-07T09:39:04+03:00 dockerd[770]: http: Accept error: accept unix /var/run/docker.sock: accept4: too many open files; retrying in 1s
Could you help me figure out what's causing this and how to avoid it in the future, please?
Thank you!
EDIT (31 Oct): the same problem has been occurring numerous times since I last wrote this post. I've tried the suggestions by IamK below, but they don't seem to work.
More specifically, even though I edited /etc/sysctl.conf, the current soft and hard limits remain 1024 and 1048576, respectively.
I'd greatly appreciate further help. Thank you!
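For reference, a minimal sketch of how the daemon's file-descriptor limit can be checked and raised for the Snap package. Note that /etc/sysctl.conf controls the kernel-wide fs.file-max, not the per-process limit, and the unit name snap.docker.dockerd.service is an assumption:
# Show the per-process limit of the running daemon
grep 'open files' /proc/$(pidof dockerd)/limits
# Raise it with a systemd drop-in for the Snap service, then restart
sudo mkdir -p /etc/systemd/system/snap.docker.dockerd.service.d
printf '[Service]\nLimitNOFILE=1048576\n' | sudo tee /etc/systemd/system/snap.docker.dockerd.service.d/override.conf
sudo systemctl daemon-reload
sudo snap restart docker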

Related

Failed to pull image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1" (missing "/")

Executive summary
For several weeks we have sporadically been seeing the following error on all of our AKS Kubernetes clusters:
Failed to pull image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1
Obviously there is a missing "/" after "mcr.microsoft.com".
The problem started after upgrading the clusters from 1.17 to 1.20.
Where does this spelling error come from? Is there anything WE can do about it?
Some details
The full error is:
Failed to pull image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1": rpc error: code = Unknown desc = failed to pull and unpack image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1": failed to resolve reference "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1": failed to do request: Head https://mcr.microsoft.comoss/v2/calico/pod2daemon-flexvol/manifests/v3.18.1: dial tcp: lookup mcr.microsoft.comoss on 168.63.129.16:53: no such host
In 50% of the cases, the following is also logged:
Pod 'calico-system/calico-typha-685d454c58-pdqkh' triggered a Warning-Event: 'FailedMount'. Warning Message: Unable to attach or mount volumes: unmounted volumes=[typha-ca typha-certs calico-typha-token-424k6], unattached volumes=[typha-ca typha-certs calico-typha-token-424k6]: timed out waiting for the condition
There seems to be no measurable effect on cluster health apart from the warnings - I see no correlating errors in any services.
We have not found a trigger that causes the behavior. It does not seem to be correlated with any change we make on our side (deployments, scaling, ...).
There also seems to be no pattern to the frequency: sometimes there is no problem for several days, and then the error pops up 10 times in a day.
Another observation is that the calico-kube-controller and several other pods were restarted, while the ReplicaSets and Deployments did not change.
Restart time
Since all the pods of the DaemonSet are eventually running again, the problem seems to resolve itself after some time.
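One way to see where the mangled image reference is coming from is to read it back from the objects involved; a sketch, assuming the typha pods come from a Deployment named calico-typha in the calico-system namespace, as the event above suggests:
# Which image string does the typha Deployment currently carry?
kubectl get deployment calico-typha -n calico-system -o jsonpath='{.spec.template.spec.containers[*].image}'
# Any recent events still referencing the broken registry name?
kubectl get events -A | grep comoss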
Are you behind a firewall, and did you use this link to set it up?
https://learn.microsoft.com/en-us/azure/aks/limit-egress-traffic
If so, add plain HTTP for mcr.microsoft.com to your egress rules; it looks like MS missed the 's' (https) in a recent update.
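For example, a rough sketch of adding such a rule on Azure Firewall (the resource group, firewall, and collection names are placeholders for your own environment):
az network firewall application-rule create \
  --resource-group myResourceGroup \
  --firewall-name myFirewall \
  --collection-name 'aksfwar' \
  --name 'mcr-http' \
  --protocols 'http=80' 'https=443' \
  --target-fqdns mcr.microsoft.com \
  --source-addresses '*' \
  --action Allow \
  --priority 100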
Paul

OpenShift 4 error: Error reading manifest

During an OpenShift installation from a local mirror registry, after I started the bootstrap machine I see the following error in the journal log:
release-image-download.sh[1270]:
Error: error pulling image "quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129":
unable to pull quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129: unable to pull image:
Error initializing source docker://quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129:
(Mirrors also failed: [my registry:5000/ocp4/openshift4@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129: Error reading manifest
sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129 in my registry:5000/ocp4/openshift4: manifest unknown: manifest unknown]):
quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129: error pinging docker registry quay.io:
Get "https://quay.io/v2/": dial tcp 50.16.140.223:443: i/o timeout
Does anyone have any idea what it can be?
The answer is here in the error:
... dial tcp 50.16.140.223:443: i/o timeout
Try this on the command line:
$ podman pull quay.io/openshift-release-dev/ocp-release@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129
You'll need to be authenticated to actually download the content (this is what the pull secret does). However, if you can't even get the "unauthenticated" error and instead hit the same timeout, that points more solidly to a network configuration issue.
That IP resolves to a quay host (you can verify that with "curl -k https://50.16.140.223"). Perhaps you have an internet filter or firewall in place that's blocking egress?
Resolutions:
fix your network issue, if you have one
look at doing a disconnected/air-gapped install -- https://docs.openshift.com/container-platform/4.7/installing/installing-mirroring-installation-images.html has more details on that
(If you're already doing an air-gapped install and it's your local mirror that's failing, then the local mirror itself is what needs fixing.)
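For reference, a quick way to test both paths from the host running the install (the registry hostname and pull-secret path here are placeholders):
# Does the local mirror actually serve that release digest?
skopeo inspect --authfile pull-secret.json docker://myregistry:5000/ocp4/openshift4@sha256:999a6a4bd731075e389ae601b373194c6cb2c7b4dadd1ad06ef607e86476b129
# Can the host reach quay.io at all? A 401 response here means connectivity is fine.
curl -v https://quay.io/v2/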

Permission denied when trying to scrape targets with Prometheus

I have Prometheus set up as a standalone service in a VM. I am able to go to localhost:9090 and move freely through the UI - graph, status, alerts, etc. I am also able to curl the metrics page and get a list of metrics. The issue I'm having is that I get a permission denied error when trying to scrape metrics from my targets (localhost:9090/targets), including the Prometheus instance itself.
The specific error message I get is "Get "http://localhost:9090/metrics": dial tcp [::1]:9090: connect: permission denied". The same error shows up for all of the surrounding services - node exporter and Alertmanager.
Prometheus target error: Get "http://localhost:9090/metrics": dial tcp [::1]:9090: connect: permission denied
Node target error: Get "http://localhost:9100/metrics": dial tcp [::1]:9100: connect: permission denied
Alertmanager error: msg="Error sending alert" err="Post "http://localhost:9093/api/v2/alerts": dial tcp [::1]:9093: connect: permission denied"
Does anyone know where this error message originates from? What permissions do I not have?
Edit: I've tried opening all ports listed above.
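One common cause of "connect: permission denied" on plain localhost TCP connections is SELinux on RHEL/CentOS-based VMs; whether that applies here is an assumption, but it is quick to rule out:
# Is SELinux enforcing, and has it logged denials for Prometheus?
getenforce
sudo ausearch -m avc -ts recent | grep -i prometheus
# Temporarily switch to permissive mode as a diagnostic only, then revert
sudo setenforce 0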

minio on AKS - unable to mount the volume

I am trying to spin up MinIO pods on the AKS service; the pods run but keep crashing. Here are the detailed logs:
Waiting for a minimum of 2 disks to come online (elapsed 0s)
Waiting for a minimum of 2 disks to come online (elapsed 0s)
Waiting for a minimum of 2 disks to come online (elapsed 0s)
Waiting for a minimum of 2 disks to come online (elapsed 2s)
Waiting for all other servers to be online to format the disks.
Unable to connect to http://minio-1.minio.default.svc.cluster.local:9000/data: volume not found
Unable to connect to http://minio-2.minio.default.svc.cluster.local:9000/data: volume not found
Unable to connect to http://minio-3.minio.default.svc.cluster.local:9000/data: Post http://minio-3.minio.default.svc.cluster.local:9000/minio/storage/v8/data/getinstanceid?: dial tcp: lookup minio-3.minio.default.svc.cluster.local on 10.0.0.10:53: no such host
I suspected this is due to the volumes that I attached: I was using Azure Managed Disks, but since they do not support ReadWriteMany, I am now using Azure Files, and still I am getting the same error.
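A couple of checks that might narrow this down (a sketch; the pod and PVC names are assumed from the log lines above):
# Are the PersistentVolumeClaims bound, and do all expected pods exist?
kubectl get pvc
kubectl get pods -o wide | grep minio
kubectl describe pod minio-3
# Does the headless-service DNS name resolve from inside the cluster?
kubectl run -it --rm dnstest --image=busybox --restart=Never -- nslookup minio-3.minio.default.svc.cluster.local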
my deployment file:
Any leads on this will be appreciated.
Thanks,
Abhishek

installing dashboard on Kubernetes

Hello, world.
Trying to install the dashboard in Kubernetes with command:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-beta4/aio/deploy/recommended.yaml
The reply looks like this:
Failed to pull image "kubernetesui/dashboard:v2.0.0-beta4": rpc error: code = Unknown desc = error pulling image configuration: Get https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/68/6802d83967b995c2c2499645ede60aa234727afc06079e635285f54c82acbceb/data?verify=1568998309-bQcnrEV6vQpN4irzUtO2FEIv%2FkE%3D: dial tcp: lookup production.cloudflare.docker.com on 192.168.73.1:53: read udp 192.168.73.91:35778->192.168.73.1:53: i/o timeout
And a simple ping command said:
ping: unknown host https://production.cloudflare.docker.com
After that I checked the domain with the downforeveryoneorjustme service, and it told me that the server is down.
It's not just you! production.cloudflare.docker.com is down.
Googling the problem showed that I need to configure the Docker proxy, but I have no proxy in my setup.
https://docs.docker.com/network/proxy/#configure-the-docker-client
Any thoughts? Thank you in advance.
Check the Cloudflare status page first:
There were multiple "DNS delays" and "Cloudflare API service issues" in the past few hours, which might have had an effect on your installation.
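If it happens again, a quick way to tell whether the problem is the upstream host or the local resolver (192.168.73.1 in the error above) is to compare lookups, for example:
# Query the resolver from the error message, then a public resolver, for comparison
nslookup production.cloudflare.docker.com 192.168.73.1
nslookup production.cloudflare.docker.com 8.8.8.8
# ping takes a hostname, not a URL, so drop the https:// prefix
ping -c 3 production.cloudflare.docker.com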
