How to monitor the size of Spark's "state"? - memory

How do I monitor the size of the state of a Spark Streaming application? The Storage tab in the driver GUI only shows the result of the mapWithState operation (MapWithStateRDD), not the actual Spark state RDD!
From Grafana, we observed that the overall memory usage of the Spark Streaming application grows with each batch of incoming stream processing. The memory usage of the worker nodes (the overall cluster) shown in Grafana is much higher than the size of the MapWithStateRDDs (the results of the mapWithState operation) under the Storage tab in the driver GUI.
I stopped feeding input data for about 30 minutes, but the memory usage never came down. I suspect the bulk of the memory is consumed by the Spark 'state'. Is there a way I can monitor the size of the Spark 'state'?

It seems this cannot be checked directly.
From the "Storage" page in the Spark UI we can get the "Size in Memory" of each "MapWithStateRDD", but that figure contains both the input data of that batch and the state.
As explained in Understanding Spark Caching, the in-memory cost will be roughly 3 times the raw data size.
And by default, Spark keeps 2 * 10 (the checkpoint duration, in batches) MapWithStateRDDs cached.
So the total memory cost is large.
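There is no dedicated metric for the state size, but the Spark monitoring REST API exposes the same information as the Storage tab, so you can at least track the cached MapWithStateRDDs over time. A minimal sketch, assuming the driver UI is reachable on the default port 4040 and that the cached RDDs are named "MapWithStateRDD" as they appear on the Storage tab:

```python
import requests

# Assumption: the Spark driver UI is reachable at localhost:4040 (the default).
base = "http://localhost:4040/api/v1"

app_id = requests.get(base + "/applications").json()[0]["id"]

total_bytes = 0
for rdd in requests.get(base + "/applications/" + app_id + "/storage/rdd").json():
    if "MapWithStateRDD" in rdd["name"]:
        # Note: as mentioned above, this size includes the batch input data
        # cached alongside the state, not the state alone.
        print(rdd["id"], rdd["name"], rdd["memoryUsed"])
        total_bytes += rdd["memoryUsed"]

print("Total memory used by cached MapWithStateRDDs:", total_bytes, "bytes")
```

Polling this from a small cron job gives a rough trend line of the cached state over time, even though it still overestimates the pure state size.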

Related

Dask: number of CPUs used for same computation drops over time

While using dask for decimating timeseries, I noticed that sometimes only 1-2 CPUs are used for the computation, even though at other times CPU utilization is high (suggesting the algorithm is not the problem). I would like to know why the drop in CPU use occurs and how I can make sure that the CPUs are fully utilized every time.
The set up is the following: we have a (n,t) array (n=~400 and t=~50M timepoints) stored in a binary file that we wrap in a delayed memory-mapped dask array following the example here. We then use map_overlap to apply scipy.signal.decimate to partially overlapping chunks of the data.
Next, I spin up a dask-scheduler and a single dask-worker (nprocs=1 and nthreads=10) in a jupyterhub instance on a single machine that has 32 cores and 256GB memory. In a jupyter notebook, I connect a client to the scheduler and run the same decimation computation multiple times.
In some instances, all works as expected and the CPU utilization is high. See snapshot of the CPU utilization below that shows high utilization for 5 executions of the same dask graph.
However, in other instances, e.g. when I restart the scheduler/worker, CPU utilization is initially high but drops such that only 1-2 CPUs are used, as seen in the snapshot with 3 executions of the same dask graph.
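For reference, a rough sketch of the setup described above. File name, array shape, chunk size, decimation factor, and scheduler address are all placeholders; the original wraps the memory map via delayed as in the linked example, whereas this sketch uses da.from_array for brevity:

```python
import numpy as np
import dask.array as da
from scipy.signal import decimate

# from dask.distributed import Client
# client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

# Placeholder file name and shape; the question describes roughly (400, 50M) samples.
n, t = 400, 50_000_000
mm = np.memmap("timeseries.bin", dtype=np.float32, mode="r", shape=(n, t))

# Wrap the memory map in a dask array, chunked along the time axis only.
x = da.from_array(mm, chunks=(n, 1_000_000))

q = 10          # decimation factor (placeholder)
depth = 10 * q  # overlap on each side of the time axis; kept a multiple of q

def decimate_block(block):
    # Decimate the overlapped block, then trim the (decimated) overlap so that
    # neighbouring blocks line up again; cast back to the input dtype.
    out = decimate(block, q, axis=-1)
    pad = depth // q
    return out[..., pad:-pad].astype(block.dtype)

y = x.map_overlap(
    decimate_block,
    depth={0: 0, 1: depth},
    boundary="reflect",
    trim=False,  # trimming is done inside decimate_block
    dtype=x.dtype,
    chunks=(x.chunks[0], tuple(c // q for c in x.chunks[1])),
)

result = y.compute()
```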

Why does Spark use so much memory when a shuffle occurs?

I see very high memory usage when a shuffle occurs in my Spark job.
The following figure shows the memory metric when I use 700 MB of data and just three rdd.map operations.
(I use Ganglia as the monitoring tool, and show just three nodes of my cluster. The x-axis is the time series, the y-axis is memory usage.)
[Figure: memory usage with three rdd.map operations]
The following figure shows the memory metric when I use the same data with three rdd.groupBy and three rdd.flatMap operations (order: groupBy1 -> flatMap1 -> groupBy2 -> flatMap2 -> groupBy3 -> flatMap3).
[Figure: memory usage with three rdd.groupBy / rdd.flatMap operations]
As you can see, the memory on all three nodes increases considerably (by several GB) even though I use only 700 MB of data. In fact I have 8 worker nodes, and the memory on all 8 workers increases considerably.
I think the main cause is the shuffle, since rdd.map involves no shuffle while rdd.groupBy does.
In this situation, I wonder about the three points below:
Why is so much memory used? (More than 15 GB is used across my worker nodes even though I only feed in 700 MB of data.)
Why does the memory used by old shuffles appear not to be released before the Spark application finishes?
Is there any way to reduce the memory usage, or to free the memory generated by old shuffles?
P.S. - My environment:
Cloud platform: MS Azure (8 worker nodes)
Spec of one worker: 8-core CPU, 16 GB RAM
Language: Java
Spark version: 1.6.2
Java version: 1.7 (development), 1.8 (execution)
Run mode: Spark standalone (not YARN or Mesos)
In Spark, the operating system decides whether the data can stay in its buffer cache or should be spilled to disk. Each map task creates as many shuffle spill files as there are reducers. Spark does not merge and partition the shuffle spill files into one big file, which is what Apache Hadoop does.
Example: if there are 6000 (R) reducers and 2000 (M) map tasks, there will be M*R = 2000*6000 = 12 million shuffle files, because in Spark each map task creates one shuffle spill file per reducer. This causes performance degradation.
Please refer to this post, which explains this in detail, in continuation of the explanation above.
You can also refer to the Optimizing Shuffle Performance in Spark paper.
~Kedar
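On the question's third point (reducing the memory generated by shuffles), which the answer above doesn't cover: a common mitigation is to shrink what gets shuffled in the first place, for example by preferring reduceByKey (which combines values on the map side) over groupByKey. A minimal PySpark sketch with illustrative data; the job in the question is written in Java, so this is only a shape of the idea:

```python
from pyspark import SparkContext

sc = SparkContext(appName="shuffle-size-sketch")  # placeholder application name

# Illustrative key/value data.
pairs = sc.parallelize([("k1", 1), ("k2", 2), ("k1", 3)] * 1000)

# groupByKey ships every value for a key across the network before combining:
sums_grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values on the map side first, so far less data is shuffled:
sums_reduced = pairs.reduceByKey(lambda a, b: a + b)

print(sums_reduced.collect())
```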

EC2 CloudWatch memory metrics don't match what Top shows

I have a t2.micro EC2 instance, running at about 2% CPU. I know from other posts that the CPU usage shown in TOP is different to CPU reported in CloudWatch, and the CloudWatch value should be trusted.
However, I'm seeing very different values for Memory usage between TOP, CloudWatch, and NewRelic.
There's 1 GB of RAM on the instance, and TOP shows ~300 MB of Apache processes, plus ~100 MB of other processes. The overall memory usage reported by TOP is 800 MB. I guess there's 400 MB of OS/system overhead?
However, CloudWatch reports 700 MB of usage, and NewRelic reports 200 MB of usage (even though NewRelic reports 300 MB of Apache processes elsewhere, so I'm ignoring it).
The CloudWatch memory metric often goes over 80%, and I'd like to know what the actual value is, so I know when to scale if necessary, or how to reduce memory usage.
Here's the recent memory profile; it seems something is using more memory over time (the big dips are either Apache restarts, or perhaps GC?).
Screenshot of memory usage over last 12 days
AWS doesn't provide memory metrics for EC2 instances. Because Amazon does all of its monitoring from outside the EC2 instance (the servers), it cannot capture memory metrics from inside the instance. But for complete monitoring of an instance you also need memory utilisation statistics, along with CPU utilisation and network I/O operations.
However, we can use the custom metrics feature of CloudWatch to send any app-level data to CloudWatch and monitor it using Amazon's tools.
You can follow this blog for more details: http://upaang-saxena.strikingly.com/blog/adding-ec2-memory-metrics-to-aws-cloudwatch
You can set up a cron job at a 5-minute interval on that instance, and all the data points can then be seen in CloudWatch.
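As a rough illustration of that custom-metric approach (this is not the script from the linked blog post; the namespace, metric name, and instance ID below are placeholders), a cron-driven Python script could look like this:

```python
import boto3
import psutil  # third-party; used here just to read local memory usage

# Assumptions: AWS credentials/region are configured on the instance, and the
# namespace, metric name, and instance ID are placeholder values for this sketch.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

mem = psutil.virtual_memory()

cloudwatch.put_metric_data(
    Namespace="Custom/Memory",
    MetricData=[
        {
            "MetricName": "MemoryUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
            "Unit": "Percent",
            "Value": mem.percent,  # percentage of RAM in use on this instance
        }
    ],
)
```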
CloudWatch doesn't actually provide metrics regarding memory usage for EC2 instances; you can confirm this here.
As a result, the MemoryUtilization metric that you are referring to is obviously a custom metric that is being pushed by something you have configured or some application running on your instance.
You therefore need to determine what is actually pushing the data for this metric. The data source is evidently pushing the wrong values, or is unreliable.
The behavior you are seeing is not a CloudWatch problem.

Apache Spark - Memory management

So assume I've got a cluster with 100 GB of memory for Spark to use. I have a dataset of 2000 GB and want to run an iterative application on this dataset, for 200 iterations.
My question is: when using .cache(), will Spark keep the first 100 GB in memory and perform the 200 iterations before automatically reading the next 100 GB?
When working within the memory limit, Spark's advantages are very clear, but when working with larger datasets I'm not entirely sure how Spark and YARN manage the data.
This is not the behaviour you will see. Spark's caching is done using LRU eviction, so if you cache a dataset that is too big for memory, only the most recently used partitions will be kept in memory. However, Spark also has a MEMORY_AND_DISK persistence mode (described in more detail at https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence), which sounds like it could be a good fit for your case.
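A minimal PySpark sketch of the MEMORY_AND_DISK persistence mentioned above, with a placeholder dataset path and a stand-in iterative step:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="iterative-cache-sketch")  # placeholder application name

# Placeholder path and parsing; the real dataset in the question is ~2000 GB.
data = sc.textFile("hdfs:///path/to/dataset").map(lambda line: float(line))

# MEMORY_AND_DISK keeps the partitions that fit in memory and spills the rest
# to local disk, instead of recomputing evicted partitions on every iteration.
data.persist(StorageLevel.MEMORY_AND_DISK)

result = 0.0
for _ in range(200):  # the 200 iterations from the question
    result = data.map(lambda x: x * 1.0001).sum()  # stand-in for the real iterative step

print(result)
```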

Monitoring CPU Core Usage on Terminal Servers

I have Windows 2003 terminal servers, multi-core. I'm looking for a way to monitor individual CPU core usage on these servers. It is possible for an end-user to have a runaway process (e.g. Internet Explorer or Outlook). The core for that process may spike to near 100% while the other cores stay 'normal'. The overall CPU usage on the server is thus just the average across all the cores: if 7 of the cores on an 8-core server are idle and the 8th is running at 100%, then 1/8 = 12.5% usage.
What utility can I use to monitor multiple servers? If the CPU usage for a core is 'high', what would I use to determine the offending process, and how could I then automatically kill that process if it is on the 'approved kill process' list?
A product from http://www.packettrap.com/ called PT360 would be perfect, except they use SNMP to get data, and SNMP appears to only give total CPU usage, not broken out by individual core. Take a look at their Dashboard option with the CPU gauge 'gadget'. That's exactly what I need, if only it worked at the core level.
Any ideas?
Individual CPU usage is available through the standard Windows performance counters. You can monitor this in perfmon.
However, it won't give you the result you are looking for. Unless a thread/process has been explicitly bound to a single CPU, a runaway process will not spike one core to 100% while all the others idle. The runaway process will bounce around between all the processors. I don't know why Windows schedules threads this way, presumably because there is no gain from forcing affinity and some loss due to having to handle interrupts on particular cores.
You can see this easily enough just in task manager. Watch the individual CPU graphs when you have a single compute bound process running.
You can give Spotlight on Windows a try. You can graphically drill into all sorts of performance and load indicators. It's freeware.
perfmon from Microsoft can monitor each individual CPU. perfmon also works remotely, and you can monitor various aspects of Windows.
I'm not sure it helps to find runaway processes, because the Windows scheduler does not always execute a process on the same CPU -> on your 8-CPU machine you will see 12.5% usage on all CPUs if one process runs away.
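The answers above point to perfmon; if a scripted check is acceptable instead, here is a rough sketch using the third-party psutil library. The threshold and the 'approved kill' process names are placeholders, and the scheduling caveat above still applies:

```python
import time
import psutil  # third-party library; a scripted alternative to perfmon counters

APPROVED_KILL_LIST = {"iexplore.exe", "outlook.exe"}  # example names from the question
CORE_THRESHOLD = 90.0                                 # percent; arbitrary choice

# Per-core utilization, roughly what the per-CPU perfmon counters report.
per_core = psutil.cpu_percent(interval=1, percpu=True)
print("Per-core usage:", per_core)

if any(core > CORE_THRESHOLD for core in per_core):
    procs = list(psutil.process_iter(["name"]))
    for p in procs:
        try:
            p.cpu_percent(None)  # prime the per-process CPU counters
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    time.sleep(1)

    for p in procs:
        try:
            usage = p.cpu_percent(None)
            name = (p.info["name"] or "").lower()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        # Without CPU affinity a runaway process hops between cores (see above),
        # so this checks per-process usage rather than matching a specific core.
        if usage > CORE_THRESHOLD and name in APPROVED_KILL_LIST:
            print(f"Killing {name} at {usage:.1f}% CPU")
            p.terminate()  # or p.kill() for a hard kill
```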
