How to troubleshoot failed docker tasks - docker

I'm just getting started with Docker, and many basic principles of how everything is organized are still unclear to me. Please help me understand how to approach troubleshooting failed Docker tasks.
My Docker service doesn't work, but that is a secondary problem. The primary issue is that it's totally unclear how to troubleshoot it.
This is my docker-compose file. It consists of an application service and a mongodb service. The application service writes logs to /opt/myapp/log/app.log. Complete sources can be found here. I also built a corresponding Docker image and uploaded it to Docker Hub.
Let's start the stack:
docker swarm init
Swarm initialized: current node (xpkngdn0vpr73nioalzbkem1k) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-6109nv6pn7eb9gtam8bq4m198k5sk7ztzf7hy7yfv5c47kcrmq-9fbrmmccd977kx22mivs7segn 192.168.65.3:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
docker stack deploy -c docker-compose.yml myapp
Creating network myapp_default
Creating service myapp_web
Creating service myapp_db
After that, let's wait for a little while (~1 minute) and proceed:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
31e9f3a8f5aa deniszhdanov/docker-swarm-troobleshoot-service:1 "java -jar /opt/myap…" 31 seconds ago Up 25 seconds 8090/tcp myapp_web.1.sij3z7cbbynsxos6608ru2f8a
9fc6a5868c12 mongo:latest "docker-entrypoint.s…" About a minute ago Up About a minute 27017/tcp myapp_db.1.gxl5xwj1tg80nr16clbskk2oc
a3ff2ba0c8c5 deniszhdanov/docker-swarm-troobleshoot-service:1 "java -jar /opt/myap…" About a minute ago Exited (137) 32 seconds ago myapp_web.1.3dv8x2dx6kig4qkf1wc2axro8
We see that there is a failed task. Let's try to understand what went wrong:
docker commit a3ff2ba0c8c5 snapshot
sha256:bec4756cadebbada400b4d1037cac671168396bf73b7d3e875c6f98f63522afd
docker run --rm -it snapshot /bin/sh
/opt/myapp # cat /opt/myapp/log/app.log
2018-10-05 15:34:20 - Starting Start on a3ff2ba0c8c5 with PID 1 (/opt/myapp/lib/myapp.jar started by root in /opt/myapp)
2018-10-05 15:34:21 - No active profile set, falling back to default profiles: default
The task's log doesn't contain enough information to troubleshoot the problem. However, that information is available when we run the application image standalone:
docker run --rm -d deniszhdanov/docker-swarm-troobleshoot-service:1
825a818b425feb7ed1f593c14a411efb68457aee9c6bfcf27f745fd58cfa0001
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
825a818b425f deniszhdanov/docker-swarm-troobleshoot-service:1 "java -jar /opt/myap…" 46 seconds ago Up 44 seconds 8090/tcp zen_bardeen
docker exec -it 825a818b425f /bin/sh
/opt/myapp # cat /opt/myapp/log/app.log
2018-10-05 15:30:09 - Starting Start on 825a818b425f with PID 1 (/opt/myapp/lib/myapp.jar started by root in /opt/myapp)
2018-10-05 15:30:09 - No active profile set, falling back to default profiles: default
2018-10-05 15:30:09 - Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext#18ef96: startup date <Fri Oct 05 15:30:09 GMT 2018>; root of context hierarchy
2018-10-05 15:30:10 - Cluster created with settings {hosts=<db:27017>, mode=MULTIPLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
2018-10-05 15:30:10 - Adding discovered server db:27017 to client view of cluster
2018-10-05 15:30:11 - Exception in monitor thread while connecting to server db:27017
com.mongodb.MongoSocketException: db: Name does not resolve
at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:188)
at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:59)
at com.mongodb.connection.SocketStream.open(SocketStream.java:57)
at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:126)
at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:114)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: db: Name does not resolve
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at com.mongodb.ServerAddress.getSocketAddress(ServerAddress.java:186)
... 5 common frames omitted
Huh, it took a while to describe all of that, thanks to everyone who managed to get to this point :)
Questions:
Why do we have different states for standalone containers and tasks?
What is the recommended way to troubleshoot failing tasks?
Regards, Denis

Eventually, I found this docker troubleshooting page, checked docker logs and found this:
2018-10-06 11:58:29.662461+0800 localhost com.docker.hyperkit[583]: [91168.810550] CPU: 3 PID: 50578 Comm: java Not tainted 4.9.93-linuxkit-aufs #1
2018-10-06 11:58:29.663013+0800 localhost com.docker.hyperkit[583]: [91168.811356] Hardware name: BHYVE, BIOS 1.00 03/14/2014
2018-10-06 11:58:29.663984+0800 localhost com.docker.hyperkit[583]: [91168.811909] 0000000000000000 ffffffffa243922a ffffbb9dc0763de8 ffff937eb67aed00
2018-10-06 11:58:29.664792+0800 localhost com.docker.hyperkit[583]: [91168.812878] ffffffffa21f5d85 0000000000000000 0000000000000000 ffffbb9dc0763de8
2018-10-06 11:58:29.665681+0800 localhost com.docker.hyperkit[583]: [91168.813694] ffff937e6df576a0 0000000000000202 ffffffffa27f9dae ffff937eb67aed00
2018-10-06 11:58:29.666082+0800 localhost com.docker.hyperkit[583]: [91168.814585] Call Trace:
2018-10-06 11:58:29.666638+0800 localhost com.docker.hyperkit[583]: [91168.814984] [<ffffffffa243922a>] ? dump_stack+0x5a/0x6f
2018-10-06 11:58:29.667238+0800 localhost com.docker.hyperkit[583]: [91168.815534] [<ffffffffa21f5d85>] ? dump_header+0x78/0x1ed
2018-10-06 11:58:29.667980+0800 localhost com.docker.hyperkit[583]: [91168.816144] [<ffffffffa27f9dae>] ? _raw_spin_unlock_irqrestore+0x16/0x18
2018-10-06 11:58:29.668686+0800 localhost com.docker.hyperkit[583]: [91168.816878] [<ffffffffa21a1f90>] ? oom_kill_process+0x83/0x324
2018-10-06 11:58:29.669273+0800 localhost com.docker.hyperkit[583]: [91168.817583] [<ffffffffa21a25b7>] ? out_of_memory+0x239/0x267
2018-10-06 11:58:29.669944+0800 localhost com.docker.hyperkit[583]: [91168.818162] [<ffffffffa21ef2cd>] ? mem_cgroup_out_of_memory+0x4b/0x79
2018-10-06 11:58:29.670652+0800 localhost com.docker.hyperkit[583]: [91168.818834] [<ffffffffa21f34a6>] ? mem_cgroup_oom_synchronize+0x26b/0x294
2018-10-06 11:58:29.671338+0800 localhost com.docker.hyperkit[583]: [91168.819560] [<ffffffffa21ef650>] ? mem_cgroup_is_descendant+0x48/0x48
2018-10-06 11:58:29.671982+0800 localhost com.docker.hyperkit[583]: [91168.820253] [<ffffffffa21a2612>] ? pagefault_out_of_memory+0x2d/0x6f
2018-10-06 11:58:29.672615+0800 localhost com.docker.hyperkit[583]: [91168.820886] [<ffffffffa20459b0>] ? __do_page_fault+0x3c6/0x45f
2018-10-06 11:58:29.673158+0800 localhost com.docker.hyperkit[583]: [91168.821516] [<ffffffffa27fb3c8>] ? page_fault+0x28/0x30
2018-10-06 11:58:29.677517+0800 localhost com.docker.hyperkit[583]: [91168.822159] Task in /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235 killed as a result of limit of /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235
2018-10-06 11:58:29.678200+0800 localhost com.docker.hyperkit[583]: [91168.826457] memory: usage 51188kB, limit 51200kB, failcnt 8764
2018-10-06 11:58:29.678913+0800 localhost com.docker.hyperkit[583]: [91168.827160] memory+swap: usage 102400kB, limit 102400kB, failcnt 78
2018-10-06 11:58:29.679469+0800 localhost com.docker.hyperkit[583]: [91168.827815] kmem: usage 884kB, limit 9007199254740988kB, failcnt 0
2018-10-06 11:58:29.681923+0800 localhost com.docker.hyperkit[583]: [91168.828391] Memory cgroup stats for /docker/bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235: cache:20KB rss:50284KB rss_huge:0KB mapped_file:4KB dirty:8KB writeback:0KB swap:51212KB inactive_anon:25264KB active_anon:25020KB inactive_file:8KB active_file:8KB unevictable:0KB
2018-10-06 11:58:29.682630+0800 localhost com.docker.hyperkit[583]: [91168.830820] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
2018-10-06 11:58:29.683423+0800 localhost com.docker.hyperkit[583]: [91168.831586] [50168] 0 50168 499844 14545 94 5 12964 0 java
2018-10-06 11:58:29.684186+0800 localhost com.docker.hyperkit[583]: [91168.832329] Memory cgroup out of memory: Kill process 50168 (java) score 1078 or sacrifice child
2018-10-06 11:58:29.685451+0800 localhost com.docker.hyperkit[583]: [91168.833172] Killed process 50168 (java) total-vm:1999376kB, anon-rss:49108kB, file-rss:9072kB, shmem-rss:0kB
2018-10-06 11:58:29.970702+0800 localhost com.docker.hyperkit[583]: [91169.119073] oom_reaper: reaped process 50168 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2018-10-06 11:58:30.206071+0800 localhost com.docker.driver.amd64-linux[579]: osxfs: die event: de-registering container bf5d7ef29816596e58e25b05e5bde1f57531e02fe31317d6a1dbad477580b235
In other words, I had inadvertently copied the resources/limits/memory setting from one of the docker-compose tutorials, and Docker kept killing my app whenever it hit that limit (the log above shows a usage of 51188kB against a limit of 51200kB).
At the end of the day, the problem was trivial and the troubleshooting was not hard either: it was only necessary to look into the Docker daemon logs. Because of my inexperience with Docker, I spent an evening trying to find the root cause from the container/swarm side (service logs, container logs etc). Well, practice makes perfect :)
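For what it's worth, the Exited (137) status in the docker ps -a output above already points in this direction. By convention, an exit status above 128 means the process was terminated by signal (status - 128), and 137 - 128 = 9 is SIGKILL, which is the signal the kernel OOM killer uses. A minimal sketch of that arithmetic follows; describe_exit_code is a hypothetical helper name, not something Docker provides.

```shell
# Hypothetical helper: turn a container exit status into a readable hint.
# Convention: status > 128 means "terminated by signal (status - 128)".
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # `kill -l N` prints the name of signal number N (KILL for 9).
    echo "killed by signal $sig (SIG$(kill -l "$sig"))"
  elif [ "$code" -eq 0 ]; then
    echo "exited normally"
  else
    echo "exited with application error $code"
  fi
}

describe_exit_code 137   # prints: killed by signal 9 (SIGKILL)
```

A SIGKILL alone does not prove an out-of-memory kill. Assuming the dead container still exists, docker inspect --format '{{.State.OOMKilled}}' <container-id> should report true in that case, and docker service ps --no-trunc myapp_web lists the error recorded for each task.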

Related

Cannot run nodetool commands and cqlsh to Scylla in Docker

I am new to Scylla and I am following the instructions to try it in a container as per this page: https://hub.docker.com/r/scylladb/scylla/.
The following command ran fine.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla
I see the container is running.
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e6c4e19ff1bd scylladb/scylla "/docker-entrypoint.…" 14 seconds ago Up 13 seconds 22/tcp, 7000-7001/tcp, 9042/tcp, 9160/tcp, 9180/tcp, 10000/tcp some-scylla
However, I'm unable to use nodetool or cqlsh. I get the following output.
$ docker exec -it some-scylla nodetool status
Using /etc/scylla/scylla.yaml as the config file
nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)
See 'nodetool help' or 'nodetool help <command>'.
and
$ docker exec -it some-scylla cqlsh
Connection error: ('Unable to connect to any servers', {'172.17.0.2': error(111, "Tried connecting to [('172.17.0.2', 9042)]. Last error: Connection refused")})
Any ideas?
Update
Looking at docker logs some-scylla I see some errors in the logs, the last one is as follows.
2021-10-03 07:51:04,771 INFO spawned: 'scylla' with pid 167
Scylla version 4.4.4-0.20210801.69daa9fd0 with build-id eb11cddd30e88ef39c32c847e70181b5cf786355 starting ...
command used: "/usr/bin/scylla --log-to-syslog 0 --log-to-stdout 1 --default-log-level info --network-stack posix --developer-mode=1 --overprovisioned --listen-address 172.17.0.2 --rpc-address 172.17.0.2 --seed-provider-parameters seeds=172.17.0.2 --blocked-reactor-notify-ms 999999999"
parsed command line options: [log-to-syslog: 0, log-to-stdout: 1, default-log-level: info, network-stack: posix, developer-mode: 1, overprovisioned, listen-address: 172.17.0.2, rpc-address: 172.17.0.2, seed-provider-parameters: seeds=172.17.0.2, blocked-reactor-notify-ms: 999999999]
ERROR 2021-10-03 07:51:05,203 [shard 6] seastar - Could not setup Async I/O: Resource temporarily unavailable. The most common cause is not enough request capacity in /proc/sys/fs/aio-max-nr. Try increasing that number or reducing the amount of logical CPUs available for your application
2021-10-03 07:51:05,316 INFO exited: scylla (exit status 1; not expected)
2021-10-03 07:51:06,318 INFO gave up: scylla entered FATAL state, too many start retries too quickly
Update 2
The reason for the error was described on the Docker Hub page linked above. I had to start the container with the number of CPUs limited via --smp 1, as follows.
docker run --name some-scylla --hostname some-scylla -d scylladb/scylla --smp 1
According to the above page:
This command will start a Scylla single-node cluster in developer mode
(see --developer-mode 1) limited by a single CPU core (see --smp).
Production grade configuration requires tuning a few kernel parameters
such that limiting number of available cores (with --smp 1) is the
simplest way to go.
Multiple cores requires setting a proper value to the
/proc/sys/fs/aio-max-nr. On many non production systems it will be
equal to 65K. ...
As you have found out, to use additional CPU cores you need to increase the fs.aio-max-nr kernel parameter.
As root, you can run:
# sysctl -w fs.aio-max-nr=65535
This should be enough for most systems. If you still get errors that prevent Scylla from using all of your CPU cores, increase the value further.
Note that the above setting is not persistent. Add it to /etc/sysctl.conf to make it persist across reboots.
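To see where a host currently stands before changing anything, the value can be read straight from /proc. This is only a sketch: the 65535 threshold is simply the figure used in this answer, and aio_check is a made-up helper name.

```shell
# Hypothetical helper: compare a current fs.aio-max-nr value with a wanted one.
# Usage: aio_check CURRENT WANTED
aio_check() {
  if [ "$1" -ge "$2" ]; then
    echo "ok: fs.aio-max-nr=$1 (wanted >= $2)"
  else
    echo "too low: fs.aio-max-nr=$1 (wanted >= $2)"
  fi
}

# Read the live value when the file exists (Linux hosts only).
if [ -r /proc/sys/fs/aio-max-nr ]; then
  aio_check "$(cat /proc/sys/fs/aio-max-nr)" 65535
fi
```

After editing /etc/sysctl.conf, running sysctl -p as root reloads that file without a reboot.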

SCADA LTS - HTTP Status 404

After starting a SCADA LTS Docker container as suggested on https://github.com/SCADA-LTS/Scada-LTS with the following command:
docker run -it -e DOCKER_HOST_IP=`docker-machine ip` -p 81:8080 scadalts/scadalts /root/start.sh
The container works well for some time, and then suddenly an "HTTP Status 404" error is shown, like the following:
http://[IP]/ScadaBR/
HTTP Status 404 - /ScadaBR/
type Status report
message /ScadaBR/
description The requested resource is not available.
Apache Tomcat/7.0.85
Here [IP] is the default Docker IP address and port, which most of the time is localhost:81.
Any idea how to solve it?
Thank you in advance!
TL;DR
After running for some time, the MySQL service dies. It has to be restarted manually with:
docker exec scada service mysql restart
docker exec scada killall tail
DETAILED REPORT
When the error is shown, you can check if all the services are running on the container (in this case named 'scada'):
>docker exec scada ps -A
PID TTY TIME CMD
1 ? 00:00:00 start.sh
790 ? 01:00:22 java
791 ? 00:01:27 tail
858 ? 00:00:00 ps
As can be seen, no MySQL service is running. This explains why Tomcat is running but SCADA-LTS is not.
You can restart MySQL service inside the container with:
docker exec scada service mysql restart
After that, SCADA-LTS is still down and you have to restart Tomcat, which can be done this way:
docker exec scada killall tail
After a minute or less, all the services are running:
>docker exec scada ps -A
PID TTY TIME CMD
1 ? 00:00:00 start.sh
43 ? 00:00:00 mysqld_safe
398 ? 00:00:00 mysqld
481 ? 00:00:31 java
482 ? 00:00:00 sleep
618 ? 00:00:00 ps
Now SCADA-LTS is running!
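The manual check above can be scripted. A rough sketch, assuming the container is named scada as in this answer; has_process is a hypothetical helper that reads ps -A output on stdin, and the sample text is the failing state shown earlier.

```shell
# Hypothetical helper: succeed if a process named exactly $1 appears in the
# CMD (last) column of `ps -A` output read from stdin. Line 1 is the header.
has_process() {
  awk -v name="$1" 'NR > 1 && $NF == name { found = 1 } END { exit !found }'
}

# Sample copied from the failing state above (no mysqld listed).
sample='PID TTY TIME CMD
1 ? 00:00:00 start.sh
790 ? 01:00:22 java
791 ? 00:01:27 tail
858 ? 00:00:00 ps'

if printf '%s\n' "$sample" | has_process mysqld; then
  echo "mysqld is running"
else
  echo "mysqld is missing"
fi
```

Against the live container, the same check would be fed from docker exec scada ps -A, followed by the two restart commands from the TL;DR when mysqld is missing.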

How to view docker image in localhost

I'm trying to view a docker image in local host.
After running all the setup I run docker ps and get this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
73e73358ad00 talk-example_app "./docker/php/init.sh" 17 hours ago Created 9000/tcp, 0.0.0.0:9000->3000/tcp dazzling_galois
66acb4dbebdf nginx:latest "nginx -g 'daemon of…" 17 hours ago Up 17 hours 0.0.0.0:8111->80/tcp talk-example_nginx_1
e8cf5884ad4a talk-example_app "./docker/php/init.sh" 17 hours ago Up 17 hours 9000/tcp talk-example_app_1
e34e2574db56 talk-example_redis "docker-entrypoint.s…" 17 hours ago Up 17 hours 0.0.0.0:63791->6379/tcp talk-example_redis_1
ec44ad9b1c1f mysql:5.7 "docker-entrypoint.s…" 17 hours ago Up 17 hours 33060/tcp, 0.0.0.0:33061->3306/tcp talk-example_database_1
I'm trying to run the image talk-example_app. According to the documentation for this package, I was simply supposed to clone the git repo and run a command, and it would work. I've tried several different docker commands that I read about online and couldn't get any of them to work on localhost.
EDIT: Documentation here
How can I achieve this?
Adding logs:
2019-09-13 17:01:27,528 INFO exited: talk-worker_01 (exit status 255; not expected)
2019-09-13 17:01:30,540 INFO spawned: 'talk-worker_02' with pid 31
2019-09-13 17:01:30,552 INFO spawned: 'talk-worker_00' with pid 32
2019-09-13 17:01:30,566 INFO spawned: 'talk-worker_01' with pid 33
2019-09-13 17:01:30,645 INFO exited: talk-worker_02 (exit status 255; not expected)
2019-09-13 17:01:30,657 INFO gave up: talk-worker_02 entered FATAL state, too many start retries too quickly
2019-09-13 17:01:30,664 INFO exited: talk-worker_00 (exit status 255; not expected)
The linked project is supposed to run 4 containers: nginx, php, db and redis.
In your docker ps output, Nginx, which should be listening on port 8088, is missing.
Run docker ps -a, which also shows stopped containers; you may see nginx there. Check the Nginx logs to find out why it stopped. I assume host port 8088 may already be occupied.
To check images, run docker images; the nginx image will certainly be listed.
You can check logs using
docker logs
or
docker compose logs nginx

Trouble connecting to my docker app via VM IP

Solved at bottom
But why do I have to append :4000?
I'm following the docker get-started Guide here, https://docs.docker.com/get-started/part4/
I'm fairly certain I've done everything correctly, but am wondering why I can't connect to view the app after deploying it.
My shell environment is set to my VM, myvm1, for all of the following commands.
docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
099e16249604 beresj/getting-started:part2 "python app.py" 12 seconds ago Up 12 seconds 80/tcp getstartedlab_web.5.y0e2k1r1ev47u24e5iufkyn3i
6f9a24b343a7 beresj/getting-started:part2 "python app.py" 12 seconds ago Up 12 seconds 80/tcp getstartedlab_web.3.1pls3osj3uhsb5dyqtt4ts8j6
docker image ls -a
REPOSITORY TAG IMAGE ID CREATED SIZE
beresj/getting-started <none> e290b6208c21 22 hours ago 131MB
docker stack ls
NAME SERVICES ORCHESTRATOR
getstartedlab 1 Swarm
docker-machine ls
NAME ACTIVE DRIVER STATE URL SWARM DOCKER ERRORS
myvm1 * virtualbox Running tcp://192.168.99.100:2376 v18.09.6
myvm2 - virtualbox Running tcp://192.168.99.101:2376 v18.09.6
docker stack ps getstartedlab
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
vkxx79fh3h85 getstartedlab_web.1 beresj/getting-started:part2 myvm2 Running Running 3 minutes ago
qexbaa3wz0pd getstartedlab_web.2 beresj/getting-started:part2 myvm2 Running Running 3 minutes ago
1pls3osj3uhs getstartedlab_web.3 beresj/getting-started:part2 myvm1 Running Running 3 minutes ago
ucuwen1jrncf getstartedlab_web.4 beresj/getting-started:part2 myvm2 Running Running 3 minutes ago
y0e2k1r1ev47 getstartedlab_web.5 beresj/getting-started:part2 myvm1 Running Running 3 minutes ago
curl 192.168.99.100
curl: (7) Failed to connect to 192.168.99.100 port 80: Connection refused
docker info
Containers: 2
Running: 2
Paused: 0
Stopped: 0
Images: 1
Server Version: 18.09.6
...
Swarm: active
NodeID: 0p9qrax9h3by0fupat8ufkfbq
Is Manager: true
ClusterID: 7vnqdk85n8jx6fqck9k7dv2ka
Managers: 1
Nodes: 2
Default Address Pool: 10.0.0.0/8
...
Node Address: 192.168.99.100
Manager Addresses:
192.168.99.100:2377
...
Kernel Version: 4.14.116-boot2docker
Operating System: Boot2Docker 18.09.6 (TCL 8.2.1)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 989.4MiB
Name: myvm1
I would expect to see what I was able to see when I just ran it on my local machine instead of on a VM in a swarm (I think I have the lingo correct?)
Not sure how to check open ports.
Again: this works if I simply remove the stack, unset the docker-machine environment, and just run:
docker stack deploy -c docker-compose.yml getstartedlab
not on the vm.
Thank you in advance. (Also, I'm new hence the get-started guide so I appreciate any help)
Edit
It works if I append :4000 to the VM IP in my URL, e.g. 192.168.99.100:4000 or 192.168.99.101:4000. The page shows the two container IDs listed by 'docker container ls' for myvm1, and the other three are from myvm2. Could anyone tell me why I have to append :4000? Is it because I have ports: "4000:80" in my docker-compose.yml?
Not sure if this will help but if you use docker inspect <instance_id_here>, you can see what ports are exposed.
Exposed ports aren't open ports. You need to bind a host port to a container port in the docker-compose.yml in order for it to be open.
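To spell out the "4000:80" entry from the question: the short ports syntax in a compose file reads HOST:CONTAINER, so host port 4000 is bound to port 80 inside the container. A tiny sketch of how such an entry splits; split_port_mapping is a hypothetical helper, not a Docker command.

```shell
# Hypothetical helper: split a short-syntax compose port entry HOST:CONTAINER.
split_port_mapping() {
  host=${1%%:*}        # text before the first colon
  container=${1##*:}   # text after the last colon
  echo "host=$host container=$container"
}

split_port_mapping "4000:80"   # prints: host=4000 container=80
```

That is why curl 192.168.99.100:4000 reaches the app, while a plain curl 192.168.99.100 tries host port 80, where nothing is published. For a swarm service, docker service inspect --format '{{json .Endpoint.Ports}}' getstartedlab_web should list the published ports.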

Docker error at higher core counts on a multi core machine

I am running a CentOS container using Docker on a RHEL 6.5 machine. I am trying to run an MPI application (MILC) on 16 cores.
My server has 20 cores and 128 GB of memory.
My application runs fine with up to 15 cores, but with 16 cores or more it fails with the error APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7). At 16 cores and up, these are the messages I see in the logs.
Jul 16 11:29:17 localhost abrt[100668]: Can't open /proc/413/status: No such file or directory
Jul 16 11:29:17 localhost abrt[100669]: Can't open /proc/414/status: No such file or directory
Jul 16 11:29:17 localhost abrt[100670]: Can't open /proc/417/status: No such file or directory
A few details on the container:
kernel 2.6.32-431.el6.x86_64
Official centos from docker hub
Started container as:
docker run -t -i -c 20 -m 125g --name=test --net=host centos /bin/bash
I would greatly appreciate any and all feedback regarding this. Please do let me know if I can provide any further information.
Regards

Resources