All docker containers suddenly stopped at once. How to debug docker itself?

It has happened twice now that all my containers stopped all of a sudden.
I suspect my VPS test server restarted somehow, possibly because of RAM issues.
$ free -m
              total        used        free      shared  buff/cache   available
Mem:            987         572          67          59         347         175
Swap:          1975         551        1424
I cannot seem to find a similar question on here; most of them are concerned with how to stop containers.
Furthermore, what would be the best way to get notified when a container stops?
Edit:
All processes had this status:
Exited (255)
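One way to narrow this down is to check whether the host itself rebooted or the Docker daemon was restarted (all containers showing Exited (255) at once is consistent with the daemon or host going down rather than the apps exiting on their own), and to subscribe to the daemon's event stream for notifications. A rough sketch, assuming a systemd-based host and a placeholder container name web:

# Did the host reboot around the time the containers stopped?
last -x reboot | head

# Did the Docker daemon restart or log errors? (systemd hosts)
journalctl -u docker.service --since "1 hour ago" | tail -n 50

# Exit code, finish time and OOM flag of one stopped container (web is a placeholder)
docker inspect --format '{{.State.ExitCode}} {{.State.FinishedAt}} {{.State.OOMKilled}}' web

# Simple notification hook: prints a line for every container that dies
docker events --filter event=die --format '{{.Time}} {{.Actor.Attributes.name}} exited ({{.Actor.Attributes.exitCode}})'

The docker events command blocks and emits one line per stopped container, so piping it into a small script that sends mail or a webhook is a simple way to get notified.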

Related

docker container hanging on run - how to debug

I am trying to run Screaming Frog in a Docker container. As a starting point, I used this GitHub project:
https://github.com/iihnordic/screamingfrog-docker
After building the image, I ran the container with the following command:
docker run -v /<my-path>/screamingfrog-crawls:/home/crawls screamingfrog --crawl https://<my-domain> --headless --save-crawl --output-folder /home/crawls
It worked the first time, but after multiple attempts it seems that the process hangs 8 out of 10 times with no error, always at a different stage in the process.
I assumed the most likely reason was memory, but despite significantly increasing the Docker memory and also increasing the Screaming Frog memory to 16GB, the same issue persists.
How can I go about debugging my container when no errors are thrown and the container just hangs indefinitely?
As suggested by Ralle, I checked docker stats, and while memory usage stays well below 10%, the CPU is always at 100%.
Try docker stats:
It returns something like the output below, so at least you can see the behaviour of memory and CPU.
CONTAINER ID   NAME         CPU %   MEM USAGE / LIMIT     MEM %   NET I/O           BLOCK I/O         PIDS
9949a4ee1238   nest-api-1   0.87%   290MiB / 3.725GiB     7.60%   2.14MB / 37.2kB   156kB / 2.06MB    33
96fe43dba2b0   postgres     0.00%   29MiB / 3.725GiB      0.76%   7.46kB / 6.03kB   1.17MB / 67.8MB   7
ff570659e917   redis        0.30%   3.004MiB / 3.725GiB   0.08%   2.99kB / 0B       614kB / 4.1kB     5
Also, docker top shows you the PIDs.
I don't know your application, but also check whether the issue could be related to the volumes themselves each time the container restarts.
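When a container hangs with the CPU pinned at 100%, it can also help to look at what the processes inside it are actually doing. A rough sketch, with screamingfrog standing in for whatever your container is actually named:

# One-off snapshot of resource usage (no live refresh)
docker stats --no-stream screamingfrog

# Processes inside the container as Docker sees them
docker top screamingfrog

# Run top once inside the container (assumes the image ships a top binary)
docker exec screamingfrog top -b -n 1

# Recent output, in case something was printed just before the hang
docker logs --tail 100 --timestamps screamingfrog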

tutum node always at full memory consumption

I have a tutum node and it is always near full memory consumption. If I upgrade the node to more memory, it fills up again.
To be more specific: once I have started a node with, say, 6 services, I can't add another one because the node is always full. I have to restart from scratch or buy another node:
> watch -n 5 free
             total       used       free     shared    buffers     cached
Mem:       4048280    3902288     145992      22796     310708    1334052
-/+ buffers/cache:    2257528    1790752
Swap:            0          0          0
The interesting thing is that tutum itself does not know about this and thinks the node still has more than 50% of its memory free:
What could be the reason for that? There is nothing else running on the node; it has been completely provisioned via tutum.
The underlying cloud provider is DigitalOcean, and the Docker version is 1.8.3.
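Note that in the free output above most of the "used" memory is Linux buffers and page cache, which the kernel gives back on demand; the -/+ buffers/cache row already shows roughly 1.7 GB genuinely available, which would match what tutum reports. A small sketch of how to check this on the node (the awk field numbers assume the older free layout shown above):

# Memory actually available to new processes = free + buffers + cached (in MB)
free -m | awk '/^Mem:/ { print $4 + $6 + $7 " MB really available" }'

# Or just read the row free already computes for you
free -m | grep -- '-/+'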

Spark: what's the advantages of having multiple executors per node for a Job?

I am running my job on AWS-EMR cluster. It is a 40 nodes cluster using cr1.8xlarge instances. Each cr1.8xlarge has 240G memory and 32 cores. I can run with the following config:
--driver-memory 180g --driver-cores 26 --executor-memory 180g --executor-cores 26 --num-executors 40 --conf spark.default.parallelism=4000
or
--driver-memory 180g --driver-cores 26 --executor-memory 90g --executor-cores 13 --num-executors 80 --conf spark.default.parallelism=4000
From the job-tracker web UI, the number of tasks running simultaneously seems to be mainly just the number of CPU cores available. So I am wondering: are there any advantages, or specific scenarios, where we would want more than one executor per node?
Thanks!
Yes, there are advantages of running multiple executors per node - especially on large instances like yours. I recommend that you read this blog post from Cloudera.
A snippet of the post that would be of particular interest to you:
To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers.
The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node.
15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
This config results in three executors on all nodes except for the one with the AM, which will have two executors.
--executor-memory was derived as (63 GB / 3 executors per node) = 21 GB. 21 * 0.07 ≈ 1.47, and 21 - 1.47 ≈ 19.
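To make the arithmetic explicit, here is a small sketch that reproduces the sizing rule from the quoted post, using the example's figures (6 nodes, 15 usable cores and 63 GB usable memory per node) and the ~7% memory-overhead heuristic mentioned above:

NODES=6                # nodes running NodeManagers
NODE_CORES=15          # usable cores per node after OS/daemons
NODE_MEM_GB=63         # usable memory per node after OS/daemons
EXECUTORS_PER_NODE=3   # keeps executor-cores around 5 for good HDFS throughput

EXEC_CORES=$(( NODE_CORES / EXECUTORS_PER_NODE ))     # 15 / 3 = 5
RAW_MEM_GB=$(( NODE_MEM_GB / EXECUTORS_PER_NODE ))    # 63 / 3 = 21
# subtract the ~7% overhead YARN reserves on top of the executor heap
EXEC_MEM_GB=$(awk -v m="$RAW_MEM_GB" 'BEGIN { printf "%d", m - m * 0.07 }')   # ~19
NUM_EXECUTORS=$(( NODES * EXECUTORS_PER_NODE - 1 ))   # leave one slot for the AM: 17

echo "--num-executors $NUM_EXECUTORS --executor-cores $EXEC_CORES --executor-memory ${EXEC_MEM_GB}G"

Running it prints --num-executors 17 --executor-cores 5 --executor-memory 19G, i.e. the configuration recommended above.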

Testing free memory on Debian

On my virtual server running Debian, I have the impression that the memory is configured incorrectly, even though my provider claims everything works as it should.
Even with 3 GB of RAM, I keep running out of memory, even though top claims there is still enough memory available.
Is there a way to test that the free memory is actually usable? For instance, if I had 1.5 GB of memory, I would like to allocate a block of 1 GB and see that everything still works correctly.
Thanks,
Which applications are you running? There must be a reason that you are running out of memory.
Try the free command:
$ free -m
             total       used       free     shared    buffers     cached
Mem:          3022       2973         48          0        235       1948
-/+ buffers/cache:        790       2232
Swap:         3907          0       3907
This will show you something like the above (that is a 3 GB machine of my own).
Always check the system log of your machine if you have memory issues.
# tail /var/log/syslog
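For the "allocate a block and see if it is usable" part of the original question, one option is the stress utility (an extra package, e.g. apt-get install stress), which pins a chunk of memory for a while so you can watch what happens:

# Allocate and hold 1 GB of RAM for 60 seconds
stress --vm 1 --vm-bytes 1G --vm-keep --timeout 60s

# In another terminal, watch how the numbers move
watch -n 2 free -m

# If the allocation fails or the OOM killer fires, syslog will say so
grep -iE 'oom|out of memory' /var/log/syslog | tail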

Phusion-Passenger seems to create MANY (140+) orphaned processes

We're running 3 Apache Passenger servers sharing the same file system, each running 11 Rails applications.
We've set
PassengerPoolIdleTime = 0 to ensure none of the applications ever die out completely, and
PassengerMaxPoolSize = 20 to ensure we have enough processes to run all the applications.
The problem is that when I run passenger-memory-stats on one of the servers I see 210 VM's!
And when I run passenger-status I see 20 application instances (as expected)!
Anyone know what's going on? How do I figure out which of those 210 instances are still being used, and how do I kill those on a regular basis? Would PassengerMaxInstancesPerApp do anything to reduce those seemingly orphaned instances?
Turns out we actually have that many Apache worker processes running, and only 24 of them are Passenger processes (asked someone a little more experienced than myself). We are actually hosting many more websites and shared hosting accounts than I thought.
Thanks for all the replies though!
You can get a definitive answer as to how many Rails processes you have by running this command:
ps auxw | grep Rails | wc -l
I doubt you really do have over 100 processes running, though; at about 50 MB each they would collectively be consuming over 5 GB of RAM and/or swap, and your system would likely have slowed to a crawl.
Not so much an answer, but some useful diagnostic advice.
Try adding the Rack::ProcTitle middleware to your stack. We have been using it in production for years. Running ps aux should then tell you what each worker is up to (idle, handling a specific request, etc.).
I would not assume that these processes are being spawned directly by Passenger. It could be something forking deeper inside your app.
Maybe the processes are stuck during shutdown. Try obtaining their backtraces to see what they're doing.
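For completeness, a quick way to separate plain Apache workers from actual Rails/Passenger processes (a sketch; the process names assume a standard Apache plus Passenger setup like the one described):

# Count plain Apache workers (the [a] trick keeps grep from matching itself)
ps -eo pid,rss,args | grep -c '[a]pache2\|[h]ttpd'

# Count Rails / Passenger application processes
ps -eo pid,rss,args | grep -c '[R]ails\|[P]assenger'

# Per-application memory breakdown straight from Passenger
sudo passenger-memory-stats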
