Running container fails with failed to add the host cannot allocate memory - docker

OS: Red Hat Enterprise Linux release 8.7 (Ootpa)
Version:
$ sudo yum list installed | grep docker
containerd.io.x86_64 1.6.9-3.1.el8 @docker-ce-stable
docker-ce.x86_64 3:20.10.21-3.el8 @docker-ce-stable
docker-ce-cli.x86_64 1:20.10.21-3.el8 @docker-ce-stable
docker-ce-rootless-extras.x86_64 20.10.21-3.el8 @docker-ce-stable
docker-scan-plugin.x86_64 0.21.0-3.el8 @docker-ce-stable
Out of hundreds of docker calls made over days, a few of them fail. This is the schema of the command line:
/usr/bin/docker
run
-u
1771:1771
-a
stdout
-a
stderr
-v
/my_path:/data
--rm
my_image:latest
my_entry
--my_args
The failure:
docker: Error response from daemon: failed to create endpoint recursing_aryabhata on network bridge: failed to add the host (veth6ad97f8) <=> sandbox (veth23b66ce) pair interfaces: cannot allocate memory.
It is not reproducible; the failure rate is less than one percent. At the time this error happens the system has plenty of free memory. Around the time of the failure, the application is making around 5 docker calls per second, and each call takes about 5 to 10 seconds to complete.
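Not part of the question, but since the failure is transient and rare (<1%), one pragmatic mitigation is to retry the docker run a few times before giving up. A minimal sketch; the retry count and sleep are arbitrary assumptions, and the flags mirror the placeholder command line above:

```shell
#!/bin/sh
# Sketch: retry a command a few times to paper over the rare transient
# "cannot allocate memory" endpoint-creation failure described above.
retry() {
  attempts="$1"; shift
  n=1
  until "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep 1   # small pause before retrying
  done
}

# Usage with the question's placeholder command line:
# retry 3 /usr/bin/docker run -u 1771:1771 -a stdout -a stderr \
#   -v /my_path:/data --rm my_image:latest my_entry --my_args
```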

Related

Cannot run nodetool commands and cqlsh to Scylla in Docker

I am new to Scylla and I am following the instructions to try it in a container as per this page: https://hub.docker.com/r/scylladb/scylla/.
The following command ran fine.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla
I see the container is running.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e6c4e19ff1bd scylladb/scylla "/docker-entrypoint.…" 14 seconds ago Up 13 seconds 22/tcp, 7000-7001/tcp, 9042/tcp, 9160/tcp, 9180/tcp, 10000/tcp some-scylla
However, I'm unable to use nodetool or cqlsh. I get the following output.
$ docker exec -it some-scylla nodetool status
Using /etc/scylla/scylla.yaml as the config file
nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)
See 'nodetool help' or 'nodetool help <command>'.
and
$ docker exec -it some-scylla cqlsh
Connection error: ('Unable to connect to any servers', {'172.17.0.2': error(111, "Tried connecting to [('172.17.0.2', 9042)]. Last error: Connection refused")})
Any ideas?
Update
Looking at docker logs some-scylla I see some errors in the logs, the last one is as follows.
2021-10-03 07:51:04,771 INFO spawned: 'scylla' with pid 167
Scylla version 4.4.4-0.20210801.69daa9fd0 with build-id eb11cddd30e88ef39c32c847e70181b5cf786355 starting ...
command used: "/usr/bin/scylla --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --developer-mode=1 --overprovisioned --listen-address 172.17.0.2 --rpc-address 172.17.0.2 --seed-provider-parameters seeds=172.17.0.2 --blocked-reactor-notify-ms 999999999"
parsed command line options: [log-to-syslog: 0, log-to-stdout: 1, default-log-level: info, network-stack: posix, developer-mode: 1, overprovisioned, listen-address: 172.17.0.2, rpc-address: 172.17.0.2, seed-provider-parameters: seeds=172.17.0.2, blocked-reactor-notify-ms: 999999999]
ERROR 2021-10-03 07:51:05,203 [shard 6] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application
2021-10-03 07:51:05,316 INFO exited: scylla (exit status 1; not expected)
2021-10-03 07:51:06,318 INFO gave up: scylla entered FATAL state, too many start retries too quickly
Update 2
The reason for the error was described on the Docker Hub page linked above. I had to start the container specifying the number of CPUs with --smp 1, as follows.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla --smp 1
According to the above page:
This command will start a Scylla single-node cluster in developer mode
(see --developer-mode 1) limited by a single CPU core (see --smp).
Production grade configuration requires tuning a few kernel parameters
such that limiting number of available cores (with --smp 1) is the
simplest way to go.
Multiple cores requires setting a proper value to the
/proc/sys/fs/aio-max-nr. On many non production systems it will be
equal to 65K. ...
As you have found out, in order to use additional CPU cores you'll need to increase the fs.aio-max-nr kernel parameter.
You may run as root:
# sysctl -w fs.aio-max-nr=65535
This should be enough for most systems. Should you still see an error preventing Scylla from using all of your CPU cores, increase the value further.
Note that the above setting is not persistent. Edit /etc/sysctl.conf to make it persistent across reboots.
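A sketch of persisting the value, assuming a standard sysctl.d layout (the file name 99-aio.conf is my own choice, not from the answer; the value 65535 is the one quoted above):

```shell
#!/bin/sh
# Sketch: build the persistent sysctl entry for fs.aio-max-nr.
# Writing the file and reloading requires root.
AIO_MAX=65535
CONF_LINE="fs.aio-max-nr = ${AIO_MAX}"
echo "$CONF_LINE"
# As root, you would then persist and apply it:
#   echo "$CONF_LINE" > /etc/sysctl.d/99-aio.conf
#   sysctl --system
```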

cannot run gazebo9 on docker with privileged on ubuntu 18.04

I have been stuck on this for quite a while now. I have tried searching and experimenting, but I am getting nowhere.
My setup is as follows:
Host
linux Distro: Archlinux
kernel version: 5.14.2
docker version: 20.10.8, build 3967b7d28e
nvidia driver version: 470.63.01-1
nvidia container toolkit version: 1.5.0-2, cgroups disabled.
amd gpu driver: xf86-video-amdgpu 21.0.0-1
Container
base image: ubuntu:18.04
command line : docker run -it --rm --privileged --gpus all -e DISPLAY=$DISPLAY -e XAUTHORITY=~/.Xauthority --network host --volume /tmp/.X11-unix/:/tmp/.X11-unix --volume $XAUTHORITY:/root/.Xauthority gazebo:libgazebo9-bionic gazebo
Expected results
expected gazebo window to open with hardware acceleration, using privileged access.
Actual results
On using --privileged:
si_init_perfcounters: max_sh_per_se = 2 not supported (inaccurate performance counters)
X Error of failed request: BadAlloc (insufficient resources for operation)
Major opcode of failed request: 149 ()
Minor opcode of failed request: 2
Serial number of failed request: 35
Current serial number in output stream: 36
Without --privileged and specifying graphic cards in --device manually:
gazebo window opens up with hardware acceleration and works smoothly as expected.
Detailed description
I was actually trying to run Gazebo version 9 in a custom image I had created using ubuntu:18.04 as the base image. I referred to gazebo:libgazebo9-bionic, nvidia/cuda:11.4.1-cudnn8-devel-ubuntu18.04 and ros:melodic-desktop while writing the Dockerfile. I even tried the same thing with Gazebo 11 on the same base image and got the same issue as above, whereas the exact same setup for Foxy works smoothly. I really need --privileged because I am going to be working with hardware a lot of the time. Please help me figure out how to fix this. Thanks a lot.
P.S. Other GUI applications (rviz, moveit, etc.) are running without any issues. I'm getting this issue with Gazebo only.
Ok found the solution!
Gazebo was working on osrf/ros:noetic-desktop-full but not on osrf/ros:melodic-desktop-full.
I got the exact same error:
X Error of failed request: BadAlloc (insufficient resources for operation)
X Error of failed request: BadAlloc (insufficient resources for operation)
Major opcode of failed request: 149 ()
The solution was to update the MESA drivers on the ros:melodic image from version Mesa 20.0.8 to Mesa 22.0.2.
sudo add-apt-repository ppa:kisak/kisak-mesa -y
sudo apt update
sudo apt upgrade -y
If you want to check your current Mesa version:
sudo apt install mesa-utils
glxinfo | grep Mesa

query.js crash on Hyperledger fabric fabcar example

The fabcar example of the hyperledger tutorial crashes for me at the step of attempting to run query.js.
I have removed all hyperledger related docker images (with docker rmi), so all required content was downloaded automatically when running startFabric.sh. The output on startup looks slightly "clouded" but not very suspicious (I skipped the lengthy output about the images being downloaded):
# wait for Hyperledger Fabric to start
# incase of errors when running later commands, issue export FABRIC_START_TIMEOUT=<larger number>
export FABRIC_START_TIMEOUT=10
#echo ${FABRIC_START_TIMEOUT}
sleep ${FABRIC_START_TIMEOUT}
# Create the channel
docker exec -e "CORE_PEER_LOCALMSPID=Org1MSP" -e "CORE_PEER_MSPCONFIGPATH=/etc/hyperledger/msp/users/Admin@org1.example.com/msp" peer0.org1.example.com peer channel create -o orderer.example.com:7050 -c mychannel -f /etc/hyperledger/configtx/channel.tx
flag provided but not defined: -e
See 'docker exec --help'.
The next step, as instructed, is
npm install
It also delivers mostly OK output, just one warning:
npm WARN fabcar@1.0.0 No repository field.
I have verified the images are running (this also shows that the user is authorized to use Docker, even though the user is otherwise not root):
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9acf0dd8a2e2 hyperledger/fabric-peer:x86_64-1.0.0 "peer node start" 20 seconds ago Up 19 seconds 0.0.0.0:7051->7051/tcp, 0.0.0.0:7053->7053/tcp peer0.org1.example.com
da42dca3cbda hyperledger/fabric-orderer:x86_64-1.0.0 "orderer" 20 seconds ago Up 19 seconds 0.0.0.0:7050->7050/tcp orderer.example.com
0265c3cd86f2 hyperledger/fabric-ca:x86_64-1.0.0 "sh -c 'fabric-ca-ser" 20 seconds ago Up 20 seconds 0.0.0.0:7054->7054/tcp ca.example.com
4f71895a78c0 hyperledger/fabric-couchdb:x86_64-1.0.0 "tini -- /docker-entr" 20 seconds ago Up 19 seconds 4369/tcp, 9100/tcp, 0.0.0.0:5984->5984/tcp couchdb
When I finally try to run
node query.js
I observe the following errors:
Create a client and set the wallet location
Set wallet path, and associate user PeerAdmin with application
Check user is enrolled, and set a query URL in the network
Make query
Assigning transaction_id: eb03c5e69259b880433861daf57a5ac2d33e41d93cebe80a7a478a1aa2cba711
error: [client-utils.js]: sendPeersProposal - Promise is rejected: Error: Endpoint read failed
at /home/hla/fabric-samples/fabcar/node_modules/grpc/src/node/src/client.js:434:17
returned from query
Query result count = 1
error from query = { Error: Endpoint read failed
at /home/hla/fabric-samples/fabcar/node_modules/grpc/src/node/src/client.js:434:17 code: 14, metadata: Metadata { _internal_repr: {} } }
Response is Error: Endpoint read failed
My OS:
uname -a
Linux uhost 4.4.0-92-generic #115-Ubuntu SMP Thu Aug 10 09:04:33 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
I have checked this but my node.js version is correct:
node --version
v6.11.2
npm version
{ fabcar: '1.0.0',
npm: '3.10.10',
ares: '1.10.1-DEV',
http_parser: '2.7.0',
icu: '56.1',
modules: '48',
node: '6.11.2',
openssl: '1.0.2l',
uv: '1.11.0',
v8: '5.1.281.103',
zlib: '1.2.11' }
Also, the error message is completely different. The machine has ports 8080 and 8443 in use, but shutting down the applications using them did not help.
That's because you didn't follow the steps. As the tutorial says, before running query.js you should enroll the admin and register a user; then it works properly. And don't pay attention to "npm WARN fabcar@1.0.0 No repository field"; it is harmless. Try the following:
$ docker stop $(docker ps -a -q)
$ docker ps -qa | xargs docker rm
$ ./startFabric.sh
$ cd fabric-samples/fabcar/javascript
$ node enrollAdmin.js
$ npm install
$ node registerUser.js
$ node query.js

docker - driver "devicemapper" failed to remove root filesystem after process in container killed

I am using Docker version 17.06.0-ce on Red Hat with devicemapper storage. I am launching a container running a long-running service. The master process inside the container sometimes dies for whatever reason, and I get the following error message:
/bin/bash: line 1: 40 Killed python -u scripts/server.py start go
I would like the container to exit and be restarted by Docker. However, the container never exits. If I remove it manually I get the following error:
Error response from daemon: driver "devicemapper" failed to remove root filesystem.
After googling, I tried a bunch of things:
docker rm -f <container>
rm -f <pth to mount>
umount <pth to mount>
All result in "device is busy". The only remedy right now is to reboot the host system, which is obviously not a long-term solution.
Any ideas?
I had the same problem and the solution was a real surprise.
So here is the error on docker rm:
$ docker rm 08d51aad0e74
Error response from daemon: driver "devicemapper" failed to remove root filesystem for 08d51aad0e74060f54bba36268386fe991eff74570e7ee29b7c4d74047d809aa: remove /var/lib/docker/devicemapper/mnt/670cdbd30a3627ae4801044d32a423284b540c5057002dd010186c69b6cc7eea: device or resource busy
Then I did the following (basically going through all processes and looking for the container's ID in their mountinfo):
$ grep docker /proc/*/mountinfo | grep 958722d105f8586978361409c9d70aff17c0af3a1970cb3c2fb7908fe5a310ac
/proc/20416/mountinfo:629 574 253:15 / /var/lib/docker/devicemapper/mnt/958722d105f8586978361409c9d70aff17c0af3a1970cb3c2fb7908fe5a310ac rw,relatime shared:288 - xfs /dev/mapper/docker-253:5-786536-958722d105f8586978361409c9d70aff17c0af3a1970cb3c2fb7908fe5a310ac rw,nouuid,attr2,inode64,logbsize=64k,sunit=128,swidth=128,noquota
This got me the PID of the offending process keeping it busy: 20416 (the item after /proc/).
So I did a ps -p and, to my surprise, found:
[devops@dp01app5030 SeGrid]$ ps -p 20416
PID TTY TIME CMD
20416 ? 00:00:19 ntpd
A true WTF moment. So I took the problem to Google and found this: https://github.com/docker/for-linux/issues/124
Turns out I had to restart the ntp daemon, and that fixed the issue!
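The manual grep above can be wrapped into a small helper. A sketch; the function name and the sed extraction are my own, not from the answer:

```shell
#!/bin/sh
# Sketch: list PIDs whose mount namespace still references a given
# container/devicemapper hash, by scanning /proc/<pid>/mountinfo.
pids_holding_mount() {
  hash="$1"
  grep -l "$hash" /proc/[0-9]*/mountinfo 2>/dev/null \
    | sed -E 's|^/proc/([0-9]+)/mountinfo$|\1|'
}

# Usage: pids_holding_mount 958722d105f858...   # then inspect each PID with ps -p
```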

lxc-kill: failed to get the init pid

I've got a problem: it seems that the container has already stopped, since I ping the container's IP and get no answer.
lxc-info indicate that the process is STOP
[root@matrix-node04 mnt]# lxc-info -n f3a939113d6e12450829a2dc76be3c761b818e63fbd33df513772e6e4485565e
state: STOPPED
pid: -1
But docker ps indicate the process is still running
[root@matrix-node04 mnt]# docker ps | grep f3a939113d6e
f3a939113d6e c69436ea2169 /bin/sh -c '/usr/loc 4 weeks ago Up 2 weeks d-mcl-354_lisx_test_kr22-n-3
Can I use lxc-start to manually start the container? I tried the following command:
[root@matrix-node04 mnt]# lxc-start -n f3a939113d6e12450829a2dc76be3c761b818e63fbd33df513772e6e4485565e -f /srv/docker/containers/f3a939113d6e12450829a2dc76be3c761b818e63fbd33df513772e6e4485565e/config.lxc
lxc-start: No such file or directory - failed to get real path for '/srv/docker/devicemapper/mnt/f3a939113d6e12450829a2dc76be3c761b818e63fbd33df513772e6e4485565e/rootfs'
lxc-start: failed to pin the container's rootfs
lxc-start: failed to spawn 'f3a939113d6e12450829a2dc76be3c761b818e63fbd33df513772e6e4485565e'
Has anyone encountered this?