So assume I've got a cluster with 100 GB of memory for Spark to utilize. I've got a dataset of 2000 GB and want to run an iterative application on this dataset, for 200 iterations.
My question is: when using .cache(), will Spark keep the first 100 GB in memory and perform the 200 iterations before automatically reading the next 100 GB?
When working within the memory limit, Spark's advantages are very clear, but when working with larger datasets I'm not entirely sure how Spark and YARN manage the data.
This is not the behaviour you will see. Spark's caching is done using LRU eviction, so if you cache a dataset that is too big for memory, only the most recently used part will be kept in memory. However, Spark also has a MEMORY_AND_DISK persistence mode (described in more detail at https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence ), which sounds like it could be a good fit for your case.
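For illustration, here is a minimal PySpark sketch of that approach, assuming a hypothetical HDFS path and a placeholder iteration body:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="iterative-cache-sketch")
data = sc.textFile("hdfs:///path/to/2tb-dataset")  # hypothetical path

# MEMORY_AND_DISK keeps as many partitions in memory as fit and spills
# the rest to local disk instead of recomputing them on every pass.
data.persist(StorageLevel.MEMORY_AND_DISK)

for i in range(200):
    # placeholder iteration body: each pass reuses the persisted partitions,
    # reading spilled ones back from disk rather than recomputing the lineage
    total = data.map(lambda line: len(line)).reduce(lambda a, b: a + b)
```

The trade-off is that disk-backed partitions are slower to read than in-memory ones, but they avoid recomputing the whole lineage on every iteration.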
I am new to the RAPIDS AI world and I decided to try cuML and cuDF out for the first time.
I am running Ubuntu 18.04 on WSL 2. My main OS is Windows 11. I have 64 GB of RAM and a laptop RTX 3060 GPU with 6 GB.
At the time I am writing this post, I am running a TSNE fitting calculation over a cuDF dataframe composed of approximately 26 thousand values stored in 7 columns (all the values are numerical or binary, since the categorical ones have been one-hot encoded).
While classifiers like LogisticRegression or SVM were really fast, TSNE seems to be taking a while to output results (it's been more than an hour now, and it is still going on even though the DataFrame is not that big). The Task Manager is telling me that 100% of the GPU is being used for the calculations, even if, by running "nvidia-smi" in the Windows PowerShell, the command returns that only 1.94 GB out of a total of 6 GB are currently in use. This seems odd to me, since I have read papers describing RAPIDS AI's TSNE algorithm as being 20x faster than the standard scikit-learn one.
I wonder if there is a way of increasing the percentage of dedicated GPU memory to perform faster computations, or if it is just an issue related to WSL 2 (which probably limits GPU usage to just 2 GB).
Any suggestions or thoughts?
Many thanks
The Task Manager is telling me that 100% of the GPU is being used for the calculations
I'm not sure the Windows Task Manager is able to tell you the GPU throughput that is being achieved for computations.
"nvidia-smi" on the windows powershell, the command returns that only 1.94 GB out of a total of 6 GB are currently in use
Memory utilisation is a different measurement from GPU throughput. Any GPU application will only use as much memory as it requests, and there is no correlation between higher memory usage and higher throughput, unless the application specifically provides a way to achieve higher throughput by using more memory (for example, a different algorithm for the same computation may use more memory).
TSNE seems to be taking a while to output results (it's been more than an hour now, and it is still going on even though the DataFrame is not that big).
This definitely seems odd, and not the expected behavior for a small dataset. What version of cuML are you using, and what is your method argument for the fit task? Could you also open an issue at www.github.com/rapidsai/cuml/issues with a way to access your dataset so the issue can be reproduced?
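For reference, here is a hedged sketch of the kind of minimal reproducer that helps here, with the cuML version printed and the method argument made explicit; the random dataframe is a hypothetical stand-in for the real one-hot-encoded data, and the parameter values are illustrative only:

```python
import cudf
import cuml
import numpy as np

print(cuml.__version__)  # worth including in the issue report

# hypothetical stand-in for the ~26,000-row, 7-column encoded dataframe
df = cudf.DataFrame({f"f{i}": np.random.rand(26000) for i in range(7)})

# making the method explicit; the available choices and defaults can differ across releases
tsne = cuml.manifold.TSNE(n_components=2, method="barnes_hut")
embedding = tsne.fit_transform(df)
print(embedding.shape)
```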
I'm running Hadoop with 3 datanodes on a single machine using Docker containers. I've run a KMeans algorithm on a small simulated dataset with 200 data points.
Because of the Hadoop overhead, the process takes a long time, about 2 or 3 minutes, while running k-means locally in R takes a few seconds.
I wonder how big my dataset has to be for Hadoop to outperform the non-distributed approach, and whether that's even possible given that I'm running all the nodes on a single machine.
It's the number of cores and the RAM available to process the data that matters more than the amount of data itself. Limiting Hadoop jobs inside containers means you are actually running little JVMs within those containers, so it's expected that one full machine given access to the same amount of data will process it much more quickly, and I'm sure there's a way to write the same distributed algorithm without Hadoop.
Besides that, if the data itself isn't splittable, or is smaller than the HDFS block size, then it can only be processed by a single MapReduce task anyway. You didn't mention the size, but I suspect 200 data points is only a few MB at most.
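As a rough back-of-the-envelope check, assuming a hypothetical handful of double-precision features per point:

```python
# 200 points with a handful of numeric features is orders of magnitude
# smaller than a typical HDFS block (128 MB), so the file lands in one
# input split and is handled by a single map task.
points = 200
features = 10          # hypothetical feature count
bytes_per_value = 8    # double precision
data_size = points * features * bytes_per_value
hdfs_block = 128 * 1024 * 1024

print(data_size, "bytes")                                        # 16000 bytes, ~16 KB
print(hdfs_block // data_size, "x smaller than one HDFS block")  # ~8388x
```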
I find that there is too much memory usage when a shuffle occurs in my Spark process.
The following figure shows the memory metrics when I use 700 MB of data and just three rdd.map operations.
(I use Ganglia as the monitoring tool, and show just three nodes of my cluster. The x-axis is the time series, the y-axis is memory usage.)
[figure: memory usage of the three nodes during the three rdd.map operations]
The following figure shows the memory metrics when I use the same data and three rdd.groupBy plus three rdd.flatMap operations (order: groupBy1 -> flatMap1 -> groupBy2 -> flatMap2 -> groupBy3 -> flatMap3).
[figure: memory usage of the three nodes during the groupBy/flatMap operations]
As you can see, the memory usage of all three nodes increases considerably (by several GB) even though I use just 700 MB of data. In fact I have 8 worker nodes, and the memory of all 8 workers increases considerably.
I think the main cause is the shuffle, since rdd.map involves no shuffle but rdd.groupBy does.
In this situation, I wonder about the three points below:
Why is so much memory used? (More than 15 GB is used across my worker nodes when the input is only 700 MB.)
Why does it seem that the memory used for old shuffles is not released before the Spark application finishes?
Is there any way to reduce the memory usage, or to release the memory generated by old shuffles?
P.S. My environment:
Cloud platform: MS Azure (8 worker nodes)
Spec of one worker: 8-core CPU, 16 GB RAM
Language: Java
Spark version: 1.6.2
Java version: 1.7 (development), 1.8 (execution)
Run in Spark standalone mode (not using YARN or Mesos)
In Spark, the operating system decides whether the data can stay in its buffer cache or should be spilled to disk. Each map task creates as many shuffle spill files as there are reducers. Spark doesn't merge and partition the shuffle spill files into one big file, which is what Apache Hadoop does.
Example: if there are 6000 (R) reducers and 2000 (M) map tasks, there will be M*R = 2000*6000 = 12 million shuffle files. This is because, in Spark, each map task creates as many shuffle spill files as there are reducers, and this causes performance degradation.
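As a hedged illustration of how that file count can be bounded in practice, one common lever is to pass an explicit (smaller) partition count to the wide operation, which caps R in the M*R product above. The sketch below uses PySpark even though the original application uses the Java API, and the numbers are illustrative:

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("shuffle-files-sketch"))

pairs = sc.parallelize(range(700000)).map(lambda x: (x % 1000, x))

# an explicit partition count caps the number of reducers (R), and with it
# the number of map-side spill files created for the shuffle
grouped = pairs.groupByKey(numPartitions=64)
print(grouped.getNumPartitions())  # 64
```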
Please refer to this post, which explains this in detail and continues the explanation above.
You can also refer to the Optimizing Shuffle Performance in Spark paper.
~Kedar
I have been using Neo4j recently. My data size is only moderate: a little less than 5 million nodes, around 24 million edges and 30 million properties. This data size is not huge by the standards of traditional relational databases such as MySQL or Oracle, but when I run Neo4j, it seems quite memory-demanding. To me, a database should not be memory-demanding: if you have sufficient memory and allow it to be used, it will perform faster, but if you don't have much memory, it should still work. For Neo4j, however, it is sometimes interrupted due to low memory (not consistently, but often enough to be annoying, as I expect a database to be much more reliable).
To be more specific, I have a Linux machine with 8 GB of memory, and I only allow an initial and max heap size of 2 GB for running the graph database.
Anyone experiencing something similar? Any solutions?
Neo4j uses off-heap RAM to cache the graph to speed up reading nodes, relationships and properties.
You can tweak the amount of memory being used for caching by setting dbms.memory.pagecache.size.
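For example, a hedged neo4j.conf excerpt (the property names follow Neo4j 3.x naming, and the values are only illustrative for an 8 GB machine, not tuned recommendations):

```
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=2g
dbms.memory.pagecache.size=3g
```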
I have built a Spark and a Flink k-means application.
My test case is clustering 1 million points on a 3-node cluster.
When in-memory bottlenecks begin, Flink starts to spill to disk and works slowly, but it works.
However, Spark loses executors if the memory is full and starts again (an infinite loop?).
I tried to customize the memory settings with help from the mailing list here, thanks, but Spark still does not work.
Is it necessary to have any configuration set? I mean, Flink works with low memory, so Spark must also be able to, or not?
I am not a Spark expert (I am a Flink contributor). As far as I know, Spark is not able to spill to disk if there is not enough main memory. This is one advantage of Flink over Spark. However, Spark announced a new project called "Tungsten" to enable managed memory similar to Flink's. I don't know if this feature is already available: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
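Since the question asks whether any configuration needs to be set, here is a hedged PySpark-style sketch of the memory-related settings people typically try first; the property names are standard Spark settings, but the values are illustrative and this is not a guaranteed fix for the executor losses:

```python
from pyspark import SparkConf, SparkContext, StorageLevel

conf = (SparkConf()
        .setAppName("kmeans-memory-sketch")
        .set("spark.executor.memory", "4g"))   # per-executor heap, adjust to the cluster

sc = SparkContext(conf=conf)

points = sc.textFile("hdfs:///path/to/points")  # hypothetical input path
# MEMORY_AND_DISK lets cached partitions spill to local disk instead of
# being evicted and recomputed when the cache does not fit in memory
points.persist(StorageLevel.MEMORY_AND_DISK)
```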
There are a couple of SO questions about Spark out-of-memory problems (an Internet search for "spark out of memory" yields many results, too):
spark java.lang.OutOfMemoryError: Java heap space
Spark runs out of memory when grouping by key
Spark out of memory
Maybe one of those helps.