Let me try to describe my situation here, capturing as much of the information I have as possible.
We have a production-level service consisting of many Docker containers, running multiple services, on a cloud (Azure) VM.
If we keep it running for a long time (>= 5 days) as part of longevity testing, we sometimes see services start failing and clients being denied service. It does not happen every time after 5 days, only sometimes.
ERROR: for health-checker Cannot start service health-checker: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:297: applying cgroup configuration for process caused \"failed to write 1 to memory.kmem.limit_in_bytes: write /sys/fs/cgroup/memory/docker/ad4926b8e5b583ce3ae30d4e3d1f1379ee89fc2735d83a87b127ef4e1e7089db/memory.kmem.limit_in_bytes: cannot allocate memory\"": unknown {}
ERROR: for credentials Cannot start service credentials: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:297: applying cgroup configuration for process caused \"failed to write 1 to memory.kmem.limit_in_bytes: write /sys/fs/cgroup/memory/docker/5b2cef0997776af7265fcc41bd640059a29fc723375e43acde63514f58ec6055/memory.kmem.limit_in_bytes: cannot allocate memory\"": unknown {}
ERROR: for occm Cannot start service occm: runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:297: applying cgroup configuration for process caused \"failed to write 1 to memory.kmem.limit_in_bytes: write /sys/fs/cgroup/memory/docker/9d5912c7459a514c6f9bdaa3a170b1bf0ba4fa3189b482b72c2013a85cf5b8ba/memory.kmem.limit_in_bytes: cannot allocate memory\"": unknown {}
failed to perform container upgrade task. java.lang.RuntimeException: Failed to deploy containers {akkaAddress=akka://some-manager, akkaSource=akka://some-manager/user/service-deployer, sourceActorSystem=some-manager}
As a consequence, none of our services are accessible and all HTTPS calls are denied:
Name does not resolve {}\n","stream":"stdout","time":"2021-07-02T03:38:29.720361925Z"}
Name does not resolve {}\n","stream":"stdout","time":"2021-07-02T03:38:29.744298675Z"}
I have done a lot of googling, trying to find something actionable and meaningful to start from.
Any pointer / insight / clue will be highly appreciated.
(I understand I may not be very detailed or pinpointing the issue precisely - actually I am a bit clueless, as it only fails sometimes after about 5 days of running.)
Seeking guidance.
Pradip
Rebuild your docker & containerd after upgrading the kernel.
This happened to me after upgrading 5.4.6 -> 5.18.5 in one go. Rebuilding docker & containerd packages solved it.
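For reference, a rough sketch of what that looks like on an Arch-style system (the package manager and package names here are assumptions; on a source-based distro you would rebuild the packages instead of reinstalling them):

uname -r                                   # confirm you are actually booted into the new kernel
sudo pacman -S docker containerd           # reinstall/rebuild docker and containerd against it
sudo systemctl restart containerd docker   # restart the daemons
docker run --rm hello-world                # sanity check that containers start again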
Related
I attempted to use sysbox-runc as the runtime for Docker on Ubuntu. sysbox-runc itself is running, but an error occurred when I tried to create a container using Docker.
The command I was using: docker run --runtime=sysbox-runc nginx
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: container_linux.go:425: starting container process caused: process_linux.go:607: container init caused: process_linux.go:578: handleReqOp caused: rootfs_init_linux.go:366: failed to mkdirall /var/lib/sysbox/shiftfs/2e6d4302-28cd-4d9d-827e-6088b8b34e89/var/lib/kubelet: mkdir /var/lib/sysbox/shiftfs/2e6d4302-28cd-4d9d-827e-6088b8b34e89/var/lib/kubelet: value too large for defined data type caused: mkdir /var/lib/sysbox/shiftfs/2e6d4302-28cd-4d9d-827e-6088b8b34e89/var/lib/kubelet: value too large for defined data type: unknown.
ERRO[0000] error waiting for container: context canceled
Notes:
The same works fine with the default runtime runc.
Running docker and sysbox-runc as root.
Has anyone come across this before, please?
Is it Ubuntu 22.04? Are you using kernel 5.15.(>=48)? Please take a look at this:
Unfortunately there isn't much we can do with Ubuntu kernels 5.15.(>=48), as they are apparently missing an Ubuntu patch on overlayfs, which breaks the interaction with shiftfs.
If you can, please upgrade to newer kernels (e.g., 5.19, 6.0, etc.).
If you must use kernel 5.15, try using 5.15.47 or earlier.
If you must use kernel 5.15.(>=48), you can work around the problem by either:
Removing the shiftfs module from the kernel (e.g., rmmod), or
Configuring Sysbox not to use shiftfs. You do this by editing the systemd service unit for sysbox-mgr and passing the --disable-shiftfs flag to Sysbox; see the sketch below the link for more.
https://github.com/nestybox/sysbox/issues/596#issuecomment-1291235140
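If it helps, a minimal sketch of the two workarounds (the unit name sysbox-mgr.service and the binary path are assumptions; check your own install):

# Workaround 1: remove the shiftfs module (will fail if shiftfs is currently in use)
sudo rmmod shiftfs

# Workaround 2: pass --disable-shiftfs to sysbox-mgr via a systemd override
sudo systemctl edit sysbox-mgr.service
# in the editor, add:
#   [Service]
#   ExecStart=
#   ExecStart=/usr/bin/sysbox-mgr --disable-shiftfs
sudo systemctl daemon-reload
sudo systemctl restart sysbox-mgr.service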
I'm using Manjaro Linux and kernel 5.10.13.
I'm not sure what happened, maybe something was updated, but Docker stopped working for me.
When I try to do docker run hello-world, I see the following message:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367:
starting container process caused: process_linux.go:495: container init caused: apply apparmor
profile: apparmor failed to apply profile: write /proc/self/attr/exec: invalid argument: unknown.
ERRO[0000] error waiting for container: context canceled
If I switch to kernel 5.9.16, it seems to be fine. Am I missing something here?
You may need to enable apparmor in your kernel parameters (apparmor=1 lsm=lockdown,yama,apparmor,bpf)
See https://www.reddit.com/r/archlinux/comments/ldhx0v/cant_start_docker_containers_on_latest_kernel/
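On a GRUB-based Manjaro install that roughly means the following (the GRUB paths are assumptions; adjust for your bootloader):

# append the parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... apparmor=1 lsm=lockdown,yama,apparmor,bpf"
sudo grub-mkconfig -o /boot/grub/grub.cfg   # regenerate the GRUB config
sudo reboot
cat /sys/kernel/security/lsm                # after reboot, apparmor should appear in this list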
I'm not sure what happened there, but the next morning (around 7 hours after I posted this) there was an update on my system, which seems to have resolved the issue.
Can anyone help me make sense of the error below and others like it? I've googled around, but nothing makes sense for my context. My Docker image downloads fine, but the container refuses to start. The namespace referenced is not always 26; it can be anything from 20-29. I am launching my Docker container onto an EC2 instance and pulling the image from AWS ECR. The error persists whether I re-launch the instance completely or just restart Docker.
docker: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:334: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: time=\\\"2017-05-11T21:00:18Z\\\" level=fatal msg=\\\"failed to create a netlink handle: failed to set into network namespace 26 while creating netlink socket: invalid argument\\\" \\n\"".
Update from my GitHub issue: https://github.com/moby/moby/issues/33656
It seems the DeepSecurity agent (ds_agent) running alongside Docker can reliably cause this issue. A number of other users reported the same problem, which prompted me to investigate. I had previously installed ds_agent on these boxes before replacing it with other software as a business decision, and that is when the problem went away. If you are hitting this, it might be worthwhile to check (e.g., with 'htop', as the user in the issue above did) whether you are running the ds_agent process or another similar service that could be causing a conflict.
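A quick way to check for it (nothing here is specific to my setup):

ps aux | grep -i ds_agent   # is the DeepSecurity agent running?
htop                        # or search for ds_agent interactively, as in the linked issue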
Did you try running it with the --privileged option?
If it still doesn't run, try adding --security-opt seccomp=unconfined plus either --security-opt apparmor=unconfined or --security-opt selinux=unconfined, depending on whether you're running Ubuntu or a distribution with SELinux enabled, respectively.
If that works, try substituting the --privileged option with --cap-add=NET_ADMIN instead, as running containers in privileged mode is discouraged for security reasons.
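Put together, the escalating checks above look roughly like this (the image name is a placeholder):

docker run --privileged my-image
# if that still fails, relax seccomp and apparmor explicitly:
docker run --security-opt seccomp=unconfined --security-opt apparmor=unconfined my-image
# if one of the above works, prefer the narrower capability over full privileged mode:
docker run --cap-add=NET_ADMIN my-image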
I'm running a bunch of services in Docker containers on Mesos (v0.22.1) via Marathon (v0.9.0), and sometimes Mesos kills tasks. It usually happens to multiple services at once.
Log line related to this issue from mesos-slave.ERROR log:
Failed to update resources for container 949b1491-2677-43c6-bfcf-bae6b40534fc
of executor production-app-emails.15437359-a95e-11e5-a046-e24e30c7374f running task production-app-emails.15437359-a95e-11e5-a046-e24e30c7374f
on status update for terminal task,
destroying container: Failed to determine cgroup for the 'cpu' subsystem:
Failed to read /proc/21292/cgroup:
Failed to open file '/proc/21292/cgroup': No such file or directory
I'd strongly suggest updating your stack. Mesos 0.22.1 and Marathon 0.9.0 are quite outdated as of today; Mesos 0.26.0 and Marathon 0.13.0 are out.
Concerning your problem, have a look at
https://issues.apache.org/jira/browse/MESOS-1837
https://github.com/mesosphere/marathon/issues/994
The first one suggests fixes on the Mesos side (post 0.22.1), and the second indicates a lack of resources for the started containers.
Maybe try increasing the RAM for the specific containers, and if that doesn't help, update the Mesos stack.
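As a rough sketch, bumping the memory for one app can be done through the Marathon REST API (the host, port, and app id are placeholders; mem is in MiB):

curl -X PUT http://marathon.example.com:8080/v2/apps/production-app-emails \
     -H "Content-Type: application/json" \
     -d '{"mem": 1024}'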
I am unable to run RabbitMQ using the Marathon/Mesos framework. I have tried it with the rabbitmq images available on Docker Hub as well as a custom-built rabbitmq Docker image. In the Mesos slave log I see the following error:
E0222 12:38:37.225500 15984 slave.cpp:2344] Failed to update resources for container c02b0067-89c1-4fc1-80b0-0f653b909777 of executor rabbitmq.9ebfc76f-ba61-11e4-85c9-56847afe9799 running task rabbitmq.9ebfc76f-ba61-11e4-85c9-56847afe9799 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/13197/cgroup: Failed to open file '/proc/13197/cgroup': No such file or directory
On googling I could find one hit, as follows:
https://github.com/mesosphere/marathon/issues/632
Not sure if this is the same issue I am facing. Has anyone tried running RabbitMQ using Marathon/Mesos/Docker?
Looks like the process went away (likely crashed) before the container was set up. You should check stdout and stderr to see what happened, and fix the root issue.
"cmd": "", is the like'y culprit. I'd look at couchbase docker containers for a few clues on how to get it working.