We have an Image Service written in Golang.
It supports image operations like resize, crop, and blur.
The RPS is around 400.
Pod config: 16 GB RAM and 8 cores.
We deployed the application and observed it for a day; it showed high CPU (core) utilization.
We introduced a 4 GB ballast (https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i-learnt-to-stop-worrying-and-love-the-heap-26c2462549a2/) and sync.Pool (https://medium.com/a-journey-with-go/go-understand-the-design-of-sync-pool-2dde3024e277) to contain the CPU issue.
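For context, this is roughly the shape of both changes (an illustrative sketch, not our production code; buffer sizes and handler names are made up):

```go
package main

import (
	"net/http"
	"runtime"
	"sync"
)

// bufPool reuses scratch buffers across requests instead of allocating a
// fresh slice for every resize/crop/blur. The 4 MiB starting capacity is
// only a placeholder.
var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 4<<20)
		return &b
	},
}

func handleResize(w http.ResponseWriter, r *http.Request) {
	buf := bufPool.Get().(*[]byte)
	defer bufPool.Put(buf)
	*buf = (*buf)[:0]
	// ... decode the upload, resize into *buf, encode the result ...
	w.Write(*buf)
}

func main() {
	// Ballast as in the Twitch article: a large allocation that is never
	// touched. It inflates the live heap so the GC target grows and GC runs
	// less often, trading memory for CPU. 4<<30 = 4 GiB.
	ballast := make([]byte, 4<<30)

	http.HandleFunc("/resize", handleResize)
	http.ListenAndServe(":8080", nil)

	runtime.KeepAlive(ballast) // keep the ballast reachable for the process lifetime
}
```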
Next, we started observing high memory utilization.
We then reduced the ballast to 1 GB, but memory utilization remained high.
According to this article, https://www.bwplotka.dev/2019/golang-memory-monitoring/, Go 1.12+ reports higher RSS. As the article puts it: "This does not mean that they require more memory, it’s just optimization for cases where there is no other memory pressure."
To verify this, we ran a small POC on a local machine, and it behaved as described.
Local setup: container memory limit of 500 MB.
Memory would climb steadily and sit at around 450 MB until memory pressure increased; as soon as pressure increased, usage dropped to about 4 MB.
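A minimal sketch of the kind of check that makes this visible (illustrative, not the exact POC code): print the Go runtime's own heap stats next to RSS as the kernel sees it.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
	"time"
)

// vmRSS returns the "VmRSS:" line from /proc/self/status, i.e. what the
// kernel (and the container memory limit) actually sees.
func vmRSS() string {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "unknown"
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(s.Text(), "VmRSS:"))
		}
	}
	return "unknown"
}

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		// HeapReleased can grow while RSS stays flat: since Go 1.12 freed pages
		// are handed back with MADV_FREE, so the kernel only reclaims them
		// under memory pressure.
		fmt.Printf("HeapInuse=%dMiB HeapIdle=%dMiB HeapReleased=%dMiB RSS=%s\n",
			m.HeapInuse>>20, m.HeapIdle>>20, m.HeapReleased>>20, vmRSS())
		time.Sleep(5 * time.Second)
	}
}
```

On Go versions in that range, running with GODEBUG=madvdontneed=1 makes the runtime return freed pages with MADV_DONTNEED instead of MADV_FREE, so RSS drops immediately when the heap shrinks; that makes the two numbers easier to compare.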
But the same POC failed on the Kubernetes cluster: at high RPS (around 400) the pods started crashing and restarting once memory reached the ~16 GB limit.
Can someone suggest how we can contain this memory issue, and why the POC failed on the cluster?
Let me know if more detail is required.
Related
When running a simple Linux container on ACI there is a huge discrepancy between the 'graphed' CPU usage in the portal and what 'top' reports inside the container itself.
I can see my process running in 'top': CPU usage stays at around 5% and the load average is below 0.10, but the portal reports around 60% usage. It's a single-processor container.
Under heavier loads I have seen CPU usage of 300-400%, which feels like an issue related to the number of processors, but even that does not add up since, as stated, it's a single-processor container.
Any thoughts?
The ACI CPU Usage metric seems to be in millicores, not in percent. So when you see 300-400, it is in fact 0.3 to 0.4 of a CPU, which for a single CPU represents 30-40%.
https://learn.microsoft.com/en-us/azure/container-instances/container-instances-monitor#available-metrics
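If it helps, the conversion is just millicores over allocated vCPUs (a tiny illustrative sketch; the function name is made up):

```go
package main

import "fmt"

// millicoresToPercent converts an ACI "CPU Usage" reading (millicores) into a
// percentage of the container group's allocated vCPUs (1000 millicores = 1 vCPU).
func millicoresToPercent(millicores, vcpus float64) float64 {
	return millicores / (vcpus * 1000) * 100
}

func main() {
	fmt.Println(millicoresToPercent(350, 1)) // 35, i.e. 35% of a single vCPU
	fmt.Println(millicoresToPercent(60, 1))  // 6, which would line up with the ~5% seen in top
}
```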
Hope this helps.
For Cloud Run's memory usage, from the docs (https://cloud.google.com/run/docs/configuring/memory-limits):
Cloud Run applications that exceed their allowed memory limit are terminated.
When you configure memory limit settings, the memory allocation you are specifying is used for:
Operating your service
Writing files to disk
Running binaries or other processes in your container, such as the nginx web server.
Does the size of the container image count towards "operating your service", and therefore towards the memory limit?
We're intending to use images that could already approach the memory limit, so we would like to know if the service itself will only have access to what is left after subtracting the image size from the limit.
Cloud Run PM here.
Only what you load into memory counts toward your memory usage. So for example, if you have a 2 GB container but only execute a very small binary inside it, then only that binary will count as used memory.
This means that if your image contains a lot of OS packages that will never be loaded (for example because you inherited from a big base image), that is fine.
Size of the container image you deploy to Cloud Run does not count towards the memory limit. For example, if your container image is 3 GiB, you can still run on a 256 MiB memory environment.
Writing new files to the local filesystem, or (obviously) allocating more memory within your app, will count towards the memory usage of your container. (Perhaps also obvious, but worth mentioning) the operating system will "load" your container's entrypoint executable into memory in order to execute it. That will count towards the available memory as well.
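One nuance worth illustrating for the "writing files to disk" part: on Cloud Run the writable filesystem is an in-memory filesystem, so file writes come out of the same memory limit. A small sketch (the path and size are made up):

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// On Cloud Run the writable filesystem is backed by memory, so this 100 MiB
	// temp file consumes 100 MiB of the instance's memory limit in addition to
	// whatever the process itself allocates. The image size in the registry,
	// by contrast, does not count at all.
	data := make([]byte, 100<<20)
	if err := os.WriteFile("/tmp/scratch.bin", data, 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote 100 MiB to /tmp; that space counts against the memory limit")
}
```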
I'm doing an internship focused on Docker and I have to load-balance an application which has a client, a server, and a database. My goal is to dynamically scale the number of server containers according to their CPU usage. For instance, if CPU usage goes over 60% I add a new container on the fly to spread the load. My problem is that my simulation never pushes CPU usage above 20%; it is a very simple simulation where random users register and visit random pages.
Question: How can I lower the CPU capacity of my server containers in my docker-compose file, in order to artificially drive CPU utilization higher? I tried the cpu_quota and cpu_shares options, but they are not well documented and I don't understand how they work or how they affect my containers.
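In case it is useful, here is a hedged sketch of those knobs in a compose file (format 2.x syntax; the service name and values are illustrative). `cpus` is usually the simplest way to cap a service so that the same load shows up as higher utilisation against the limit:

```yaml
# docker-compose.yml (compose file format 2.x; values are only illustrative)
version: "2.4"
services:
  server:
    build: ./server
    # Simplest option: cap the service at half a CPU.
    cpus: 0.5
    # Equivalent lower-level knobs: the container gets cpu_quota microseconds
    # of CPU time per cpu_period microseconds (50000/100000 = 0.5 CPU).
    # cpu_period: 100000
    # cpu_quota: 50000
    # cpu_shares is only a relative weight under contention (default 1024);
    # it does not cap CPU when the host is otherwise idle.
    # cpu_shares: 512
```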
I have a t2.micro EC2 instance, running at about 2% CPU. I know from other posts that the CPU usage shown in TOP is different to CPU reported in CloudWatch, and the CloudWatch value should be trusted.
However, I'm seeing very different values for Memory usage between TOP, CloudWatch, and NewRelic.
There's 1 GB of RAM on the instance, and TOP shows ~300 MB of Apache processes, plus ~100 MB of other processes. The overall memory usage reported by TOP is 800 MB. I guess there's 400 MB of OS/system overhead?
However, CloudWatch reports 700 MB of usage, and NewRelic reports 200 MB of usage (even though NewRelic reports 300 MB of Apache processes elsewhere, so I'm ignoring NewRelic).
The CloudWatch memory metric often goes over 80%, and I'd like to know what the actual value is, so I know when to scale if necessary, or how to reduce memory usage.
Here's the recent memory profile; it seems something is using more memory over time (the big dips are either Apache restarts, or perhaps GC?).
Screenshot of memory usage over last 12 days
AWS doesn't provide memory metrics for EC2 instances out of the box. Since Amazon does all of its monitoring from outside the EC2 instance, it cannot capture memory metrics from inside the instance. But for complete monitoring of an instance you do need memory utilisation statistics, along with CPU utilisation and network I/O.
However, you can use CloudWatch's custom metrics feature to push any instance-level data to CloudWatch and monitor it using Amazon's tools.
You can follow this blog for more details: http://upaang-saxena.strikingly.com/blog/adding-ec2-memory-metrics-to-aws-cloudwatch
You can set up a cron job at a 5-minute interval on the instance, and the data points will then be visible in CloudWatch.
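Whatever the blog does under the hood, the idea is a small script or binary on the instance that reads memory usage and pushes it as a custom metric. A minimal sketch in Go with aws-sdk-go (the namespace and metric name are made up; the blog itself may use a shell script instead):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

// meminfoKB reads a field such as "MemTotal" or "MemAvailable" from
// /proc/meminfo and returns its value in kilobytes.
func meminfoKB(field string) (float64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) >= 2 && strings.TrimSuffix(fields[0], ":") == field {
			return strconv.ParseFloat(fields[1], 64)
		}
	}
	return 0, fmt.Errorf("%s not found", field)
}

func main() {
	total, err := meminfoKB("MemTotal")
	if err != nil {
		log.Fatal(err)
	}
	avail, err := meminfoKB("MemAvailable")
	if err != nil {
		log.Fatal(err)
	}
	usedPct := (total - avail) / total * 100

	sess := session.Must(session.NewSession())
	cw := cloudwatch.New(sess)
	// Push a single datapoint; run this from cron (e.g. every 5 minutes).
	// Namespace and metric name here are illustrative, not AWS-defined.
	_, err = cw.PutMetricData(&cloudwatch.PutMetricDataInput{
		Namespace: aws.String("Custom/System"),
		MetricData: []*cloudwatch.MetricDatum{{
			MetricName: aws.String("MemoryUtilization"),
			Unit:       aws.String(cloudwatch.StandardUnitPercent),
			Value:      aws.Float64(usedPct),
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```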
CloudWatch doesn't actually provide metrics for EC2 memory usage; you can confirm this here.
As a result, the MemoryUtilization metric that you are referring to is obviously a custom metric that is being pushed by something you have configured or some application running on your instance.
So you need to determine what is actually pushing the data for this metric. The data source is evidently pushing the wrong thing, or is unreliable.
The behavior you are seeing is not a CloudWatch problem.
This is odd behavior, but I have been able to reproduce it 100% of the time. I'm currently testing Neo4j 2.0.1 Enterprise on my laptop and desktop machines. The laptop has 8 GB RAM and an i7-4600U; the desktop has 16 GB RAM and an i7-4770K. Both machines are running Windows 8.1 x64 Enterprise and the same version of Java (the latest as of Feb 19, 2014).
On first boot of each, when I run an expensive (or not so expensive) query, I can see the memory allocation go up (as expected for the cache). When starting the server, the initial allocation is around 200-250 MB, give or take. After a few expensive queries, it goes up to about 2 GB, which is fine; I want this memory allocation. However, I have a batch script that stops the service, clears out the database, and restarts the service (to start fresh when testing different development methods).
After 3 or 4 restarts, I noticed that the memory will NEVER climb above 400 MB. Processor usage sits around 30-40% during the expensive query, but memory never increases. I then get Unknown Error messages in the console when running other expensive queries. This is the same query that, after a full reboot of the system, would bring memory usage up to 2 GB.
I'm not sure what could be causing this, or if there is a way to make sure that memory usage continues to be allocated, even on service restart. Rebooting a production server doesn't seem like a viable option, unless running in HA.