Excessive memory use with pyspark

I have set up JupyterHub and configured a pyspark kernel for it. When I open a pyspark notebook (under username Jeroen), two processes are added: a Python process and a Java process. The Java process is assigned 12g of virtual memory. When running a test script on a range of 1B numbers, it grows to 22g. Is that something to worry about when we work on this server with multiple users? And if it is, how can I prevent Java from allocating so much memory?

You don't need to worry about virtual memory usage; resident memory (the RES column) is much more important here.
You can control the JVM heap size with the --driver-memory option passed to Spark (if you use the pyspark kernel on JupyterHub, you can find it in the kernel's environment under the PYSPARK_SUBMIT_ARGS key). This is not exactly the memory limit for your application (there are other memory regions on the JVM), but it is very close.
So, when you have a multi-user setup, you should teach your users to set an appropriate driver memory (the minimum they need for their processing) and to shut down notebooks after they finish their work.
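As a minimal sketch, the environment entry might look like this (the 2g value is illustrative, not a recommendation):

    # In the pyspark kernel's environment (e.g. the "env" section of kernel.json),
    # cap the driver JVM heap at 2 GB; the trailing "pyspark-shell" is required.
    export PYSPARK_SUBMIT_ARGS="--driver-memory 2g pyspark-shell"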

Related

Understanding JVM Memory on Containers Environment

I'm running a regular JVM application on containers in GCP.
The container_memory_working_set_bytes metric returns 4 GB, while sum(jvm_memory_bytes_used) returns 2 GB.
I'm trying to understand what is using the remaining 2 GB.
In theory, what can consume this memory? And how can I inspect it via Prometheus, or from a Linux shell via kubectl exec?
According to these docs, jvm_memory_bytes_used covers only heap memory.
The JVM can consume much more than that, and this has already been asked and answered several times.
One of the best answers that I know of is here: Java using much more memory than heap size (or size correctly Docker memory limit)
I managed to find the difference: it was the JVM committed memory, which can be found via jvm_memory_bytes_committed.
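If you need to dig further than the exporter metrics, the JVM's Native Memory Tracking can break committed memory down from inside the pod. A sketch, assuming NMT was enabled at JVM startup and that the pod name is hypothetical:

    # Start the JVM with native memory tracking enabled:
    #   JAVA_OPTS="-XX:NativeMemoryTracking=summary"
    # Then ask the JVM (usually PID 1 inside the container) for a breakdown of
    # heap, metaspace, thread stacks, code cache, etc.:
    kubectl exec -it my-jvm-pod -- jcmd 1 VM.native_memory summary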

Docker: set a memory limit for a group of containers

I'm running a Docker environment with around 25 containers for my private use. My system has 32 GB of RAM.
RStudio Server and JupyterLab in particular often need a lot of memory.
So I limited the memory for both of these containers to 26 GB each.
This works well as long as the two applications aren't both holding dataframes in memory. But if RStudio Server holds several GB and Jupyter also fills its memory up to the limit, my system crashes.
Is there any way to configure these two containers so that together they are allowed to use at most 26 GB of RAM?
Or a relative limit, like allowing Jupyter to use 90% of free memory.
Since I'm now working with large datasets, it can happen at any time (because I forget to close a kernel, or something else) that memory grows to the limit, and I want just the container to crash, not the whole system.
And I don't want to lower the limit for Jupyter any further, since the biggest dataset on its own needs 15 GB of memory.
Any ideas?
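For reference, a sketch of how the per-container caps described above might have been set (image names and the 26g value are illustrative):

    # Each container gets its own hard cap; nothing coordinates the two,
    # so together they can still exceed physical RAM.
    docker run -d --name rstudio --memory=26g rocker/rstudio
    docker run -d --name jupyter --memory=26g jupyter/base-notebook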

Does the Google Cloud Run memory limit apply to the container size?

From the Cloud Run docs on memory limits (https://cloud.google.com/run/docs/configuring/memory-limits):
Cloud Run applications that exceed their allowed memory limit are terminated.
When you configure memory limit settings, the memory allocation you are specifying is used for:
Operating your service
Writing files to disk
Running binaries or other processes in your container, such as the nginx web server.
Does the size of the container image count towards "operating your service", and therefore towards the memory limit?
We're intending to use images that could already approach the memory limit, so we would like to know if the service itself will only have access to what is left after subtracting the container size from the limit.
Cloud Run PM here.
Only what you load in memory counts toward your memory usage. So, for example, if you have a 2 GB container but only execute a very small binary inside it, then only that binary will count as used memory.
This means that if your image contains a lot of OS packages that will never be loaded (for example because you inherited from a big base image), that is fine.
The size of the container image you deploy to Cloud Run does not count towards the memory limit. For example, if your container image is 3 GiB, you can still run in a 256 MiB memory environment.
Writing new files to the local filesystem, or (obviously) allocating more memory within your app, will count towards the memory usage of your container. Perhaps also obvious, but worth mentioning: the operating system will load your container's entrypoint executable into memory in order to execute it. That counts towards the available memory as well.
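For completeness, the memory limit itself is set at deploy time; a sketch, with hypothetical service and image names:

    # Deploy a Cloud Run service with a 512 MiB memory limit; a 3 GiB image
    # size is irrelevant to this limit, only what is loaded at runtime counts.
    gcloud run deploy my-service \
        --image gcr.io/my-project/my-image \
        --memory 512Mi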

Docker 'killing' my program

I am running data analysis code with pandas in Docker on macOS.
However, the program gets killed on a high memory allocation into a dataframe (I know because it gets killed while the program is loading a huge dataset).
Without the container, my program runs fine on my laptop.
Why is this happening, and how can I change it?
Docker on macOS runs inside a Linux VM which has an explicit memory allocation. From the docs:
MEMORY
By default, Docker for Mac is set to use 2 GB runtime memory, allocated from the total available memory on your Mac. You can increase the RAM on the app to get faster performance by setting this number higher (for example to 3) or lower (to 1) if you want Docker for Mac to use less memory.
Those instructions are referring to the Preferences dialog.
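One way to confirm this from the command line (a sketch; the image name is hypothetical, and --memory only helps once the VM itself has enough RAM):

    # Show the total memory the Docker VM exposes, in bytes:
    docker info --format '{{.MemTotal}}'
    # After raising the VM's allocation in Preferences, you can also give the
    # analysis container an explicit cap of its own:
    docker run --memory=8g my-analysis-image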

Make full use of 24 GB of memory for JBoss

We have a 64-bit Solaris SPARC machine running JBoss. It has 24 GB of memory, but because of a JVM limitation I can only set JAVA_OPTS="-server -Xms256m -Xmx3600m -XX:MaxPermSize=3600m".
I don't know the exact cap, but if I set it to 4000m, Java won't accept it.
Is there any way to use this 24 GB of memory fully, or at least more efficiently?
If I run a cluster on one machine, is it stable? I heard it requires rewriting some parts of the code.
All 32-bit processes are limited to 4 gigabytes of addressable memory: 2^32 bytes == 4 GiB.
If you can run JBoss as a 64-bit process (usually just by adding "-d64" to JAVA_OPTS), then you can tune JBoss with increased thread and object pools to make use of that memory. As others have mentioned, you might see horrific garbage collection pauses, but you may be able to avoid those with some load testing and the right choice of garbage collection algorithm.
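A minimal sketch of the 64-bit switch (the heap sizes are illustrative, not tuned values):

    # -d64 selects the 64-bit JVM on Solaris, lifting the 4 GiB address limit;
    # the heap can then be sized well beyond the old ~3.6 GB ceiling.
    JAVA_OPTS="-server -d64 -Xms2g -Xmx16g"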
If you have to run as a 32-bit process, then yes, you can at least run multiple instances of JBoss on the same server. In your case, there are three options: zones, virtual interfaces, and port binding.
Solaris Zones
Since you're running Solaris, it's relatively easy to create virtual servers ("non-global zones") and install JBoss in each one just as you would on a real server.
Multi-homing
Configure an extra IP address for each JBoss instance you want to run (usually by adding virtual interfaces, but you could also install extra NICs) and bind each instance of JBoss to its own IP address with the "-b" startup option (sketched after this list).
Service Binding Manager
This is the most complicated / brittle option, but at least it requires no OS changes.
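A sketch of the multi-homing option mentioned above (the addresses are hypothetical):

    # Two JBoss instances, each bound to its own virtual IP via -b:
    ./run.sh -b 192.168.1.10 &
    ./run.sh -b 192.168.1.11 &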
Whether or not to actually configure the JBoss instances as a cluster depends on your application. One benefit is the ability to use HTTP session failover. One downside is cluster merge issues if your application is unstable or tends to become unresponsive for periods of time.
You mention your application is slow; before you go too far down any of these paths, try to understand where your bottlenecks are. Use jvisualvm or jstat to observe whether you're doing frequent garbage collections. If you're not, then adding heap memory or extra JBoss instances may not resolve your performance woes.
You can't use the full physical memory: a 32-bit JVM requires its maximum heap to fit in one contiguous chunk of address space. Try java -Xmx<n>m -version to test the maximum heap size available on your box.
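A quick probe along those lines (the candidate sizes are arbitrary):

    # The largest value that still prints a version banner is your
    # effective -Xmx ceiling on this machine.
    for m in 3000 3400 3600 3800 4000; do
        echo "trying -Xmx${m}m"
        java -Xmx${m}m -version && echo "OK: ${m}m"
    done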