Docker container hanging on run - how to debug

I am trying to run Screaming Frog in a Docker container. As a starting point, I used this GitHub project:
https://github.com/iihnordic/screamingfrog-docker
After building, I ran the container with the following command:
docker run -v /<my-path>/screamingfrog-crawls:/home/crawls screamingfrog --crawl https://<my-domain> --headless --save-crawl --output-folder /home/crawls
It worked the first time, but across multiple attempts the process hangs 8 out of 10 times with no error, always at a different stage of the crawl.
I assumed the most likely cause was memory, but despite significantly increasing the Docker memory and also raising the Screaming Frog memory to 16GB, the same issue persists.
How can I go about debugging my container when no errors are thrown and the container simply hangs indefinitely?
As suggested by @Ralle, I checked docker stats, and while memory usage stays well below 10%, the CPU is always at 100%.

Try docker stats. It returns something like the output below, so at least you can see the behaviour of memory and CPU.
CONTAINER ID   NAME         CPU %   MEM USAGE / LIMIT     MEM %   NET I/O           BLOCK I/O         PIDS
9949a4ee1238   nest-api-1   0.87%   290MiB / 3.725GiB     7.60%   2.14MB / 37.2kB   156kB / 2.06MB    33
96fe43dba2b0   postgres     0.00%   29MiB / 3.725GiB      0.76%   7.46kB / 6.03kB   1.17MB / 67.8MB   7
ff570659e917   redis        0.30%   3.004MiB / 3.725GiB   0.08%   2.99kB / 0B       614kB / 4.1kB     5
Also, docker top shows you the PIDs of the processes inside the container.
I don't know your application, but also check whether the issue could be related to the volumes themselves each time the container restarts.
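If docker stats and docker top don't reveal anything, here is a rough checklist of further commands for narrowing down where a hung container is stuck (the <container> placeholder is whatever name or ID docker ps shows for your Screaming Frog container):
docker ps                          # confirm the container is still listed as running
docker logs -f <container>         # follow stdout/stderr to see the last output before the hang
docker top <container>             # list the processes and PIDs inside the container
docker exec -it <container> top    # check per-process CPU usage inside the container
docker inspect <container>         # look at State (e.g. OOMKilled), restart count and resource limits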

Related

docker load from tar is failing - no space left on device

I am using Docker over HTTPS: https://x.x.198.38:2376/v1.40/images/load
I started getting this error when running Docker on CentOS; it was not an issue on Ubuntu.
The image in question is 1.1GB in size.
Error Message:
Error processing tar file(exit status 1): open /root/.cache/node-gyp/12.21.0/include/node/v8-testing.h: no space left on device
I ran into a similar issue some time back.
The image might contain a lot of small files, and you might be falling short on disk space or inodes.
I was only able to catch it with watch df -hi, which showed inodes spiking to 100% before Docker cleaned up and they dropped back to 3%.
Further analysis showed that the attached volume was very small: just 5GB, of which 2.9GB was already used by unused images and stopped or exited containers.
Hence as a quick fix
sudo docker system prune -a
This brought the free inodes from 96k up to 2.5m.
As a long-term fix, I increased the AWS EBS volume to 50GB, as we had plans to use Windows images too in the future.
HTH
@bjethwan, you found a very good command; it solved my problem, thank you. I am using Red Hat and want to add something.
By default, watch runs at a 2-second interval. At that default it couldn't catch the problematic inodes.
Running watch with a 0.5-second interval caught the guilty volume :)
watch -n 0.5 df -hi
After identifying the culprit volume, you need to increase its size.
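Putting the pieces together, here is a short sequence you can adapt to watch inode usage and reclaim space (the 0.5-second interval and the -a flag are simply what worked above):
watch -n 0.5 df -hi           # watch disk space and inode usage at a 0.5s interval
docker system df              # see how much space images, containers and volumes consume
sudo docker system prune -a   # remove stopped containers, unused images and dangling data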

Nifi 1.6.0 memory leak

We're running Docker containers of NiFi 1.6.0 in production and have come across a memory leak.
Once started, the app runs just fine; however, after a period of 4-5 days, the memory consumption on the host keeps increasing. When checked in the NiFi cluster UI, JVM heap usage is hardly around 30%, but memory at the OS level goes to 80-90%.
On running the docker stats command, we found that the NiFi Docker container is consuming the memory.
After collecting the JMX metrics, we found that the RSS memory keeps growing. What could be the potential cause of this? In the JVM tab of the cluster dialog, young GC also seems to be happening in a timely manner, with old GC counts shown as 0.
How do we go about identifying what's causing the RSS memory to grow?
You need to reproduce that in a non-Docker environment, because with Docker the reported memory is known to rise.
As I explained in "Difference between Resident Set Size (RSS) and Java total committed memory (NMT) for a JVM running in Docker container", docker has some bugs (like issue 10824 and issue 15020) which prevent an accurate report of the memory consumed by a Java process within a Docker container.
That is why a plugin like signalfx/docker-collectd-plugin mentions (two weeks ago) in its PR -- Pull Request -- 35 to "deduct the cache figure from the memory usage percentage metric":
Currently the calculation for memory usage of a container/cgroup being returned to SignalFX includes the Linux page cache.
This is generally considered to be incorrect, and may lead people to chase phantom memory leaks in their application.
For a demonstration on why the current calculation is incorrect, you can run the following to see how I/O usage influences the overall memory usage in a cgroup:
docker run --rm -ti alpine
cat /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
dd if=/dev/zero of=/tmp/myfile bs=1M count=100
cat /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
You should see that the usage_in_bytes value rises by 100MB just from creating a 100MB file. That file hasn't been loaded into anonymous memory by an application, but because it's now in the page cache, the container memory usage is appearing to be higher.
Deducting the cache figure in memory.stat from the usage_in_bytes shows that the genuine use of anonymous memory hasn't risen.
The signalFX metric now differs from what is seen when you run docker stats which uses the calculation I have here.
It seems like knowing the page cache use for a container could be useful (though I am struggling to think of when), but knowing it as part of an overall percentage usage of the cgroup isn't useful, since it then disguises your actual RSS memory use.
In a garbage-collected application with a max heap size as large as, or larger than, the cgroup memory limit (e.g. the -Xmx parameter for Java, or .NET Core in server mode), the tendency will be for the percentage to get close to 100% and then just hover there, assuming the runtime can see the cgroup memory limit properly.
If you are using the Smart Agent, I would recommend using the docker-container-stats monitor (to which I will make the same modification to exclude cache memory).
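As a quick way to check this yourself, here is a minimal sketch of the "usage minus cache" calculation described above, assuming cgroup v1 paths inside the container:
# run inside the container (cgroup v1 assumed)
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
cache=$(awk '/^cache /{print $2}' /sys/fs/cgroup/memory/memory.stat)
echo "usage_in_bytes: $usage"
echo "page cache:     $cache"
echo "non-cache use:  $((usage - cache)) bytes"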
Yes, the NiFi Docker image has memory issues: usage shoots up after a while and the container restarts on its own. The non-Docker installation, on the other hand, works absolutely fine.
Details:
Docker:
Run it with a 3GB heap size and immediately after startup it consumes around 2GB. Run some processors and the machine's fan spins heavily; it restarts after a while.
Non-Docker:
Run it with a 3GB heap size and it takes 900MB and runs smoothly (measured with jconsole).

Docker: resource issues inside containers

I have a centos7.4 container running. Inside this container I am doing a variety of things. I am noticing very frequent issues such as
clang-5.0: error: unable to execute command: posix_spawn failed: Resource temporarily unavailable.
And
There is insufficient memory for the Java Runtime Environment to continue.
Cannot create GC thread. Out of system resources.
Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create new native thread
And
make[1]: vfork: Resource temporarily unavailable
And
fork() failed: Command could not be run:
If I execute the same "variety of things" on the same Docker host but outside of the container, I don't run into these issues. I am trying to understand whether this is a Docker-specific thing I am not aware of. Docker by default inherits the host's resources, correct? i.e. number of cores and memory.
This is what I see when I run docker stats:
CONTAINER ID   NAME     CPU %    MEM USAGE / LIMIT     MEM %   NET I/O           BLOCK I/O         PIDS
8cbc44773def   my-rh6   1.98%    48.59MiB / 251.6GiB   0.02%   33.8GB / 31.6GB   65.9MB / 11.5MB   26
683313a4e70e   my-rh7   85.99%   73.26MiB / 251.6GiB   0.03%   21.1GB / 1.31GB   269MB / 26.6MB    6
My host has 72 cores and 250GB of memory, and the container appears to see the same amount. I have no idea how Java would run out of memory with 250GB available...
When I execute the "variety of things", I do see CPU % go above 1000% at times, which I am not sure is normal. If I run the same workload outside the container, sar shows CPU consumption nowhere near 1000% (70% max at times).
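For reference, one way to compare what the container actually sees against the host is something like the sketch below; the centos:7 tag and the limit values are only examples, not a known fix:
# inside a throwaway container
docker run --rm centos:7 sh -c 'nproc; free -g; ulimit -u'
# the same checks on the host for comparison
nproc; free -g; ulimit -u
# if the container's max user processes (ulimit -u) is much lower, it can be raised explicitly
docker run --rm --ulimit nproc=65535 --pids-limit=-1 centos:7 sh -c 'ulimit -u'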

Airbnb Airflow using all system resources

We've set up Airbnb/Apache Airflow for our ETL using LocalExecutor, and as we've started building more complex DAGs, we've noticed that Airflow has started using up incredible amounts of system resources. This is surprising to us because we mostly use Airflow to orchestrate tasks that happen on other servers, so Airflow DAGs spend most of their time waiting for them to complete; there is no actual execution happening locally.
The biggest issue is that Airflow seems to use 100% of CPU at all times (on an AWS t2.medium) and uses over 2GB of memory with the default airflow.cfg settings.
If relevant, we're running Airflow using docker-compose running the container twice; once as scheduler and once as webserver.
What are we doing wrong here? Is this normal?
EDIT:
Here is the output from htop, ordered by % Memory used (since that seems to be the main issue now, I got CPU down):
I suppose in theory I could reduce the number of gunicorn workers (it's at the default of 4), but I'm not sure what all the /usr/bin/dockerd processes are. If Docker is complicating things I could remove it, but it's made deployment of changes really easy and I'd rather not remove it if possible.
I have also tried everything I could to get the CPU usage down and Matthew Housley's advice regarding MIN_FILE_PROCESS_INTERVAL was what did the trick.
At least until airflow 1.10 came around... then the CPU usage went through the roof again.
So here is everything I had to do to get Airflow to work well on a standard DigitalOcean droplet with 2GB of RAM and 1 vCPU:
1. Scheduler File Processing
Prevent Airflow from reloading the DAGs all the time by setting:
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=60
2. Fix airflow 1.10 scheduler bug
The AIRFLOW-2895 bug in Airflow 1.10 causes high CPU load because the scheduler keeps looping without a break.
It's already fixed in master and will hopefully be included in Airflow 1.10.1, but it could take weeks or months until it's released. In the meantime this patch solves the issue:
--- jobs.py.orig    2018-09-08 15:55:03.448834310 +0000
+++ jobs.py         2018-09-08 15:57:02.847751035 +0000
@@ -564,6 +564,7 @@
         self.num_runs = num_runs
         self.run_duration = run_duration
+        self._processor_poll_interval = 1.0
         self.do_pickle = do_pickle
         super(SchedulerJob, self).__init__(*args, **kwargs)
@@ -1724,6 +1725,8 @@
             loop_end_time = time.time()
             self.log.debug("Ran scheduling loop in %.2f seconds",
                            loop_end_time - loop_start_time)
+            self.log.debug("Sleeping for %.2f seconds", self._processor_poll_interval)
+            time.sleep(self._processor_poll_interval)
             # Exit early for a test mode
             if processor_manager.max_runs_reached():
Apply it with patch -d /usr/local/lib/python3.6/site-packages/airflow/ < af_1.10_high_cpu.patch;
3. RBAC webserver high CPU load
If you upgraded to use the new RBAC webserver UI, you may also notice that the webserver is using a lot of CPU persistently.
For some reason the RBAC interface uses a lot of CPU on startup. If you are running on a low powered server, this can cause a very slow webserver startup and permanently high CPU usage.
I have documented this bug as AIRFLOW-3037. To solve it you can adjust the config:
AIRFLOW__WEBSERVER__WORKERS=2                        # 2 * NUM_CPU_CORES + 1
AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800     # Restart workers every 30 min instead of 30 s
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300    # Kill workers if they don't start within 5 min instead of 2 min
With all of these tweaks, my Airflow uses only a few percent of CPU during idle time on a DigitalOcean standard droplet with 1 vCPU and 2GB of RAM.
I just ran into an issue like this. Airflow was consuming roughly a full vCPU in a t2.xlarge instance, with the vast majority of this coming from the scheduler container. Checking the scheduler logs, I could see that it was processing my single DAG more than once a second even though it only runs once a day.
I found that the MIN_FILE_PROCESS_INTERVAL was set to the default value of 0, so the scheduler was looping over the DAG. I changed the process interval to 65 seconds, and Airflow now uses less than 10 percent of a vCPU in a t2.medium instance.
Try changing the config below in airflow.cfg:
# after how much time new DAGs should be picked up from the filesystem
min_file_process_interval = 0
# how many seconds to wait between file-parsing loops to prevent the logs from being spammed
min_file_parsing_loop_time = 1
The key point is HOW the DAG files are processed. To reduce the scheduler's CPU usage from 80%+ to 30% on an 8-core server, I updated 2 config keys:
min_file_process_interval from 0 to 60
max_threads from 1000 to 50
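In airflow.cfg form, that change would look roughly like this (the section name and the exact numbers are assumptions based on the answers above, so tune them for your setup):
[scheduler]
# parse each DAG file at most once per minute instead of continuously
min_file_process_interval = 60
# cap the number of scheduler threads
max_threads = 50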
For starters, you can use htop to monitor and debug your CPU usage.
I would suggest that you run the webserver and scheduler processes in the same Docker container, which would reduce the resources required to run two containers on an EC2 t2.medium. Airflow workers need resources for downloading data and reading it into memory, but the webserver and scheduler are pretty lightweight processes. Make sure that when you run the webserver you control the number of workers running on the instance using the CLI.
airflow webserver [-h] [-p PORT] [-w WORKERS]
[-k {sync,eventlet,gevent,tornado}]
[-t WORKER_TIMEOUT] [-hn HOSTNAME] [--pid [PID]] [-D]
[--stdout STDOUT] [--stderr STDERR]
[-A ACCESS_LOGFILE] [-E ERROR_LOGFILE] [-l LOG_FILE]
[-d]
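For example, a minimal invocation that caps the workers might look like this (the port, worker count and timeout are only illustrative):
airflow webserver -p 8080 -w 2 -t 300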
I have faced the same issue deploying Airflow on EKS. It was resolved by updating max_threads to 128 in the Airflow config.
max_threads: the scheduler will spawn multiple threads in parallel to schedule DAGs. This is controlled by max_threads, with a default value of 2. The user should increase this value to a larger one (e.g. the number of CPUs where the scheduler runs, minus 1) in production.
From here
https://airflow.apache.org/docs/stable/faq.html
I tried to run Airflow on an AWS t2.micro instance (1 vCPU, 1GB of memory, eligible for the free tier) and had the same issue: the worker consumed 100% of the CPU and all available memory.
The EC2 instance was totally stuck and unusable, and of course Airflow wasn't working.
So I created a 4GB swap file using the method described here. With the swap, no more issues: Airflow was fully functional.
Of course, with only one vCPU you cannot expect incredible performance, but it runs.
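For what it's worth, a typical sequence for creating such a swap file on Linux looks roughly like this (not necessarily the exact method the author followed; the size and path are only examples):
sudo fallocate -l 4G /swapfile   # or: sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h                          # verify the swap is active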

CoreOS Single Container High Memory Usage

So I have a simple Go web app I deployed as a Docker container. I am running a t2.small instance on AWS with CoreOS AMI.
The container is very small, only using about 10MB of memory according to docker stats:
CONTAINER      CPU %   MEM USAGE / LIMIT     MEM %   NET I/O               BLOCK I/O
8e230506e99a   0.00%   11.11 MB / 2.101 GB   0.53%   49.01 MB / 16.39 MB   1.622 MB / 0 B
However the CoreOS instance seems to be using a lot of memory:
$ free
             total       used       free     shared    buffers     cached
Mem:       2051772    1686012     365760      25388     253096    1031836
-/+ buffers/cache:     401080    1650692
Swap:            0          0          0
As you can see, it's using almost 1.7GB of its 2GB total memory, with only about 300MB left, and this seems to be slowly getting worse.
I've had the instance running for about 3 days now, and the free memory started at around 400MB after a fresh launch and starting a single Docker container.
Is this something I should worry about? Or is CoreOS supposed to use this much memory when my little Go app in a container only uses a tiny 10MB?
Because a lot of that memory usage is buffers and cache. The better indicators are your application's usage as reported by Docker (which is likely close to accurate for a small Go app) and the OS total usage minus buffers and cache on the second line (which is closer to 400 MB used).
See https://unix.stackexchange.com/a/152301/6515 for a decent explanation.
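To make the arithmetic explicit, subtracting buffers and cache from the "used" column of the free output above gives roughly the real usage:
# values in KiB, taken from the free output above
echo $(( 1686012 - 253096 - 1031836 ))   # = 401080 KiB, i.e. about 400 MB actually used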
