I have a Spark job with spark.executor.memory=4G and a Docker memory limit of 5G.
I monitor memory usage with docker stats:
$ docker stats --format="{{.MemUsage}}"
1.973GiB / 5GiB
Memory usage = RSS + cache = 930MB + 1.xGB = 1.97GB
The cache size keeps increasing until the OOM killer is triggered and my job fails.
Currently, I release the cache manually by running:
$ sync && echo 3 > /proc/sys/vm/drop_caches
It works for me, but is there a better way to limit the Docker memory cache size, or to drop the memory cache automatically?
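In the meantime, a slightly less blunt version of the manual workaround is sketched below; the 2 GB threshold is arbitrary, the path assumes cgroup v1, and from the host you would read /sys/fs/cgroup/memory/docker/<container-id>/memory.stat instead:
# read the page-cache portion of the cgroup's memory usage
CACHE=$(awk '/^cache /{print $2}' /sys/fs/cgroup/memory/memory.stat)
# only flush when the cache is actually the problem (threshold is arbitrary)
if [ "$CACHE" -gt $((2 * 1024 * 1024 * 1024)) ]; then
  sync && echo 3 > /proc/sys/vm/drop_caches
fi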
I'm trying to build an APK on Google Cloud Run using Flutter, but I get "Memory limit of 2048 exceeded",
and Google Cloud Run does not have a larger memory option.
Is there any way to limit memory usage without killing the process on Cloud Run,
or a way to reduce/limit the memory used by the flutter build apk command (it runs inside a Dockerfile)?
I tested it on my machine and it uses something like 2.2 GB of RAM, so it is only about 0.2 GB over the limit.
According to the official documentation on memory limits for Cloud Run:
"Maximum memory size is 2 GB per container instance and the limit cannot be increased."
Memory: Maximum memory size, in GB, per container instance. Limit: 2. Can be increased: No.
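If the build is really only a couple hundred MB over the limit, one thing that sometimes helps (a generic Gradle tweak rather than anything Cloud Run specific, so treat it as a suggestion to try) is to cap the Gradle JVM heap before building, so the Android toolchain does not grab more memory than it strictly needs:
# cap the Gradle daemon heap used by the Android build (1024m is a guess; tune as needed)
echo "org.gradle.jvmargs=-Xmx1024m" >> android/gradle.properties
flutter build apk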
I am working on a Java service that basically creates files in a network file system to store data. It runs in a k8s cluster on Ubuntu 18.04 LTS.
When we began to limit the memory in Kubernetes (limits: memory: 3Gi), the pods began to be OOMKilled by Kubernetes.
At the beginning we thought it was a memory leak in the Java process, but analyzing more deeply we noticed that the problem is the kernel memory.
We validated that by looking at the file /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes.
We isolated the case to just creating files (without Java) with the dd command, like this:
for i in {1..50000}; do dd if=/dev/urandom bs=4096 count=1 of=file$i; done
And with the dd command we saw that the same thing happened (the kernel memory grew until the OOM).
After k8s restarted the pod, a describe on the pod showed:
Last State: Terminated
Reason: OOMKilled
Exit Code: 143
Creating files causes the kernel memory to grow, and deleting those files causes it to decrease. But our service stores data, so it creates files continuously until the pod is OOMKilled and restarted.
We tested limiting the kernel memory with standalone Docker using the --kernel-memory parameter and it worked as expected: the kernel memory grew to the limit and did not rise any more. But we did not find any way to do that in a Kubernetes cluster.
Is there a way to limit the kernel memory in a K8s environment?
Why does creating files make the kernel memory grow, and why is it not released?
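Something along these lines can be used to watch it happen while the dd loop above runs (assuming cgroup v1, which the memory.kmem.* path implies, and that watch is available in the container):
# watch kernel memory and page cache grow while files are being created
watch -n1 'cat /sys/fs/cgroup/memory/memory.kmem.usage_in_bytes; grep -E "^(cache|rss) " /sys/fs/cgroup/memory/memory.stat'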
Thanks for all this info, it was very useful!
On my app, I solved this by creating a new side container that runs a cron job every 5 minutes with the following command:
echo 3 > /proc/sys/vm/drop_caches
(note that you need the side container to run in privileged mode)
It works nicely and has the advantage of being predictable: every 5 minutes, your memory cache will be cleared.
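In case it helps, the sidecar's entrypoint can be as simple as a loop like the one below (privileged mode is still required, and the 300-second sleep mirrors the 5-minute cron; adjust to taste):
# entrypoint of the privileged sidecar: drop caches every 5 minutes
while true; do
  sync && echo 3 > /proc/sys/vm/drop_caches
  sleep 300
done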
We're running Docker containers of NiFi 1.6.0 in production and have come across a memory leak.
Once started, the app runs just fine; however, after a period of 4-5 days, the memory consumption on the host keeps increasing. When checked in the NiFi cluster UI, the JVM heap usage is hardly around 30%, but memory at the OS level goes to 80-90%.
On running the docker stats command, we found that the NiFi Docker container is consuming the memory.
After collecting the JMX metrics, we found that the RSS memory keeps growing. What could be the potential cause of this? In the JVM tab of cluster dialog, young GC also seems to be happening in a timely manner with old GC counts shown as 0.
How do we go about identifying what's causing the RSS memory to grow?
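One generic way to narrow this down (standard JVM tooling, nothing NiFi specific, and it requires restarting the JVM with NMT enabled) is to compare what the JVM thinks it has committed with the RSS the OS reports:
# restart the NiFi JVM with native memory tracking enabled by adding this JVM argument:
#   -XX:NativeMemoryTracking=summary
# then compare the JVM's committed memory with the process RSS (<nifi-pid> is a placeholder)
jcmd <nifi-pid> VM.native_memory summary
ps -o rss= -p <nifi-pid>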
You need to replicate that in a non-Docker environment, because with Docker, memory is known to rise.
As I explained in "Difference between Resident Set Size (RSS) and Java total committed memory (NMT) for a JVM running in Docker container", docker has some bugs (like issue 10824 and issue 15020) which prevent an accurate report of the memory consumed by a Java process within a Docker container.
That is why a plugin like signalfx/docker-collectd-plugin mentions (two weeks ago) in its PR (Pull Request) 35 that it will "deduct the cache figure from the memory usage percentage metric":
Currently the calculation for memory usage of a container/cgroup being returned to SignalFX includes the Linux page cache.
This is generally considered to be incorrect, and may lead people to chase phantom memory leaks in their application.
For a demonstration on why the current calculation is incorrect, you can run the following to see how I/O usage influences the overall memory usage in a cgroup:
docker run --rm -ti alpine
cat /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
dd if=/dev/zero of=/tmp/myfile bs=1M count=100
cat /sys/fs/cgroup/memory/memory.stat
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
You should see that the usage_in_bytes value rises by 100MB just from creating a 100MB file. That file hasn't been loaded into anonymous memory by an application, but because it's now in the page cache, the container memory usage appears to be higher.
Deducting the cache figure in memory.stat from the usage_in_bytes shows that the genuine use of anonymous memory hasn't risen.
The signalFX metric now differs from what is seen when you run docker stats which uses the calculation I have here.
It seems like knowing the page cache use for a container could be useful (though I am struggling to think of when), but knowing it as part of an overall percentage usage of the cgroup isn't useful, since it then disguises your actual RSS memory use.
In a garbage-collected application with a max heap size as large as, or larger than, the cgroup memory limit (e.g. the -Xmx parameter for Java, or .NET Core in server mode), the tendency will be for the percentage to get close to 100% and then just hover there, assuming the runtime can see the cgroup memory limit properly.
If you are using the Smart Agent, I would recommend using the docker-container-stats monitor (to which I will make the same modification to exclude cache memory).
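If you just want the corrected figure by hand inside a container, the subtraction the PR describes is a one-liner (cgroup v1 paths, as in the demonstration above):
# genuine (non-cache) usage = usage_in_bytes minus the page cache line in memory.stat
USAGE=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
CACHE=$(grep '^cache ' /sys/fs/cgroup/memory/memory.stat | awk '{print $2}')
echo "$((USAGE - CACHE)) bytes in use excluding page cache"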
Yes, NiFi in Docker has memory issues; it shoots up after a while and restarts on its own. On the other hand, the non-Docker setup works absolutely fine.
Details:
Docker:
Run it with a 3 GB heap size and immediately after startup it consumes around 2 GB. Run some processors and the machine's fan spins heavily, and it restarts after a while.
Non-Docker:
Run it with a 3 GB heap size and it takes 900 MB and runs smoothly (observed with jconsole).
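One thing worth doing when comparing the two is to pin both the container memory limit and the heap explicitly, so the JVM and the cgroup agree on the budget. This assumes the apache/nifi image's NIFI_JVM_HEAP_* environment variables, so verify they exist for your image version:
# pin the container limit and the NiFi heap so the comparison is apples to apples
# (NIFI_JVM_HEAP_* support should be confirmed in the apache/nifi image docs for 1.6.0)
docker run -d -m 4g \
  -e NIFI_JVM_HEAP_INIT=3g -e NIFI_JVM_HEAP_MAX=3g \
  apache/nifi:1.6.0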
Steps to reproduce
Tell us about your environment:
Puppeteer version: 1.6.1
Platform / OS version: linux
URLs (if applicable):
Node.js version: 8
What steps will reproduce the problem?
We deployed Docker on Linux. We run a health check that takes a screenshot every minute. The problem is that the Docker cache memory keeps increasing even though we have disabled nearly all caching, although the RSS does not increase.
This is part of the code:
const browser = await puppeteer.launch({ args: ['--no-sandbox','--disable-dev-shm-usage','--media-cache-size=1','--disk-cache-size=1','--disable-application-cache','--disable-session-storage','--user-data-dir=/dev/null'] });
const page = await browser.newPage();
await page.setCacheEnabled(false);
But if we execute "# sync; echo 2 > /proc/sys/vm/drop_caches" to clear dentries and inodes, the cache memory decreases rapidly. But we have disabled Chrome's caching, so we don't know what makes the cache memory grow.
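To see what that cache actually consists of instead of just dropping it, the cgroup stats and the slab counters can be watched while the health check runs (plain Linux inspection, nothing Puppeteer specific; cgroup v1 paths assumed):
# inside the container: page cache vs. anonymous memory in this cgroup
grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/memory.stat
# on the host: which kernel slab caches (dentry, inode, ...) are growing
sudo slabtop -o | head -n 15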
I think BMitch is right. Linux will grow the disk cache from unused RAM, and docker stats includes the disk cache. The reason docker stats keeps growing is that the disk cache is growing, and EC2 monitors docker stats. I think this is EC2's problem. Thanks, guys.
After executing show disk details, I am able to see "Disk rebuild speed low" on a Solace appliance 3560.
What does "Disk rebuild speed low" mean in Solace?
The CLI command
disk rebuild speed <low | high >
changes the RAID rebuild speed of a physical appliance. The higher the rebuild speed, the faster the RAID 1 mirroring completes, but more system resources are consumed by the rebuild task.
You can check disk rebuild status in the support shell with:
[support@solace ~]$ cat /proc/mdstat
and the corresponding speed limits set after changing the speeds with the above CLI command:
[support@solace ~]$ cat /proc/sys/dev/raid/speed_limit_max
[support@solace ~]$ cat /proc/sys/dev/raid/speed_limit_min
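During an actual rebuild, /proc/mdstat also reports the current sync speed, so a simple way to confirm the effect of switching between low and high is to keep an eye on it (this is generic Linux md behavior and assumes watch is available in the support shell):
[support@solace ~]$ watch -n 5 cat /proc/mdstat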