I've set up a spark-jobserver to enable complex queries on a reduced dataset.
The jobserver executes two operations:
Sync with the main remote database, it makes a dump of some of the server's tables, reduce and aggregates the data, save the result as a parquet file and cache it as a sql table in memory. This operation will be done every day;
Queries, when the sync operation is finished, users can perform SQL complex queries on the aggregated dataset, (eventually) exporting the result as csv file. Every user can do only one query at time, and wait for its completion.
The biggest table (before and after the reduction, which include also some joins) has almost 30M of rows, with at least 30 fields.
Actually I'm working on a dev machine with 32GB of ram dedicated to the job server, and everything runs smoothly. Problem is that in the production one we have the same amount of ram shared with a PredictionIO server.
I'm asking how determine the memory configuration to avoid memory leaks or crashes for spark.
I'm new to this, so every reference or suggestion is accepted.
Thank you
Take an example,
if you have a server with 32g ram.
set the following parameters :
spark.executor.memory = 32g
Take a note:
The likely first impulse would be to use --num-executors 6
--executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won’t fit within the 63GB capacity
of the NodeManagers. The application master will take up a core on one
of the nodes, meaning that there won’t be room for a 15-core executor
on that node. 15 cores per executor can lead to bad HDFS I/O
throughput.
A better option would be to use --num-executors 17 --executor-cores 5
--executor-memory 19G. Why?
This config results in three executors on all nodes except for the one
with the AM, which will have two executors. --executor-memory was
derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 – 1.47
~ 19.
This is explained here if you want to know more :
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Related
I've been tasked with tuning the snapshotting process. I'm dealing with 3 master node instances and 9 data node instances. We are using S3 as the store for the repositories.
There are 28 indices and total of 189 shards in this particular cluster. The only snapshot tuning parameters I can find are chunk_size and max_snapshot_bytes_per_sec.
I've left chunk_size at the default (unlimited) and modified the bytes_per_sec and from the default (40mb) to 100mb and 500mb respectively.
Baseline it takes 3 hours to snapshot the entire cluster (using the default settings), after two experiments with changes in just bytes_per_sec, the snapshot process is still 3 hours.
To me this sounds like the process is either CPU or network bound or am I missing something? Not sure what other parameters I can change.
This is a follow-up to this question.
I'm experiencing problems with persisting a large dataset in distributed memory. I have a scheduler running on one machine and 8 workers each running on their own machines connected by 40 gigabit ethernet and a backing Lustre filesystem.
Problem 1:
ds = DataSlicer(dataset) # ~600 GB dataset
dask_array = dask.array.from_array(ds, chunks=(13507, -1, -1), name=False) # ~22 GB chunks
dask_array = client.persist(dask_array)
When inspecting the Dask status dashboard, I see all 28 tasks being assigned to and processed by one worker while the other workers do nothing. Additionally, when every task has finished processing and the tasks are all in the "In memory" state, only 22 GB of RAM (i.e. the first chunk of the dataset) is actually stored on the cluster. Access to the indices within the first chunk are fast, but any other indices force a new round of reading and loading the data before the result returns. This seems contrary to my belief that .persist() should pin the complete dataset across the memory of the workers once it finishes execution. In addition, when I increase the chunk size, one worker often runs out of memory and restarts due to it being assigned multiple huge chunks of data.
Is there a way to manually assign chunks to workers instead of the scheduler piling all of the tasks on one process? Or is this abnormal scheduler behavior? Is there a way to ensure that the entire dataset is loaded into RAM?
Problem 2
I found a temporary workaround by treating each chunk of the dataset as its own separate dask array and persisting each one individually.
dask_arrays = [da.from_delayed(lazy_slice, shape, dtype, name=False) for \
lazy_slice, shape in zip(lazy_slices, shapes)]
for i in range(len(dask_arrays)):
dask_arrays[i] = client.persist(dask_arrays[i])
I tested the bandwidth from persisted and published dask arrays to several parallel readers by calling .compute() on different chunks of the dataset in parallel. I could never achieve more than 2 GB/s aggregate bandwidth from the dask cluster, far below our network's capabilities.
Is the scheduler the bottleneck in this situation, i.e. is all data being funneled through the scheduler to my readers? If this is the case, is there a way to get in-memory data directly from each worker? If this is not the case, what are some other areas in dask I may be able to investigate?
I have written a variety of queries using cypher that take no less than 200ms per query. They're very straightforward, so I'm having trouble identifying where the bottleneck is.
Simple Match with Parameters, 2200ms:
Simple Distinct Match with Parameters, 200ms:
Pathing, 2500ms:
At first I thought the issue was a lack of resources, because I was running neo4j and my application on the same box. While the performance monitor indicated that CPU and memory were largely free'd up and available, I moved the neo4j server to another local box and observed similar latency. Both servers are workstations with fairly new Xeon processors, 12GB memory and SSDs for the data storage. All of the above leads me to believe that the latency isn't due to my hardware. OS is Windows 7.
The graph has less than 200 nodes and less than 200 relationships.
I've attached some queries that I send to neo4j along with the configuration for the server, database, and JVM. No plugins or extensions are loaded.
Pastebin Links:
Database Configuration
Server Configuration
JVM Configuration
[Expanding a bit on a comment I made earlier.]
#TFerrell: Your comments state that "all nodes have labels", and that you tried applying indexes. However, it is not clear if you actually specified the labels in your slow Cypher queries. I noticed from your original question statement that neither of your slower queries actually specified a node label (which presumably should have been "Project").
If your Cypher query does not specify the label for a node, then the DB engine has to test every node, and it also cannot apply an index.
So, please try specifying the correct node label(s) in your slow queries.
Is that the first run or a subsequent run of these queries?
You probably don't have a label on your nodes and no index or unique constraint.
So Neo4j has to scan the whole store for your node pulling everything into memory, loading the properties and checking.
try this:
run until count returns 0:
match (n) where not n:Entity set n:Entity return count(*);
add the constraint
create constraint on (e:Entity) assert e.Id is unique;
run your query again:
match (n:Element {Id:{Id}}) return n
etc.
It seems there is something wrong with the automatic memory mapping calculation when you are on Windows (memory mapping on heap).
I just looked at your messages.log and added up some numbers, so it seems the mmio alone is enough to fill your java heap space (old-gen) leaving no room for the database, caches etc.
Please try to amend that by fixing the mmio config in your conf/neo4j.properties to more sensible values (than the auto-calculation).
For your small store just uncommenting the values starting with #neostore. (i.e. remove the #) should work fine.
Otherwise something like this (fitting for a 3GB heap) for a larger graph (2M nodes, 10M rels, 20M props,10M long strings):
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M
neostore.propertystore.db.mapped_memory=250M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M
Here are the added numbers:
auto mmio: 134217728 + 134217728 + 536870912 + 536870912 + 1073741824 = 2.3GB
stores sizes: 1073920 + 1073664 + 3221698 + 3221460 + 1073786 = 9MB
JVM max: 3.11 RAM : 13.98 SWAP: 27.97 GB
max heaps: Eden: 1.16, oldgen: 2.33
taken from:
neostore.propertystore.db.strings] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073920b)
neostore.propertystore.db.arrays] brickCount=8 brickSize=134144b mappedMem=134217728b (storeSize=1073664b)
neostore.propertystore.db] brickCount=6 brickSize=536854b mappedMem=536870912b (storeSize=3221698b)
neostore.relationshipstore.db] brickCount=6 brickSize=536844b mappedMem=536870912b (storeSize=3221460b)
neostore.nodestore.db] brickCount=1 brickSize=1073730b mappedMem=1073741824b (storeSize=1073786b)
Can you please explain the best way to add relationship indexes to a Neo4j database created using the BatchInserter?
Our database contains about 30 million nodes and about 300 million relationships. If we build this without any indexes then it takes about 10 hours (just calls to BatchInserter.createNode and BatchInserter.createRelationship).
However if we also try to create relationship indexes using LuceneBatchInserterIndexProvider with repeated calls to index.add then the process takes 12 hours to add everything but then gets stuck on indexProvider.shutdown and doesn't complete. The longest I have left it is 3 days. Can you please explain what it is doing at this point? I expected the work to be done during the calls to index.add. What is going on during shutdown that is taking so long?
Our PC has 64GB RAM and we have allocated 40GB to the JVM. During this shutdown step, Windows reports that 99% of the memory is in use (far more than allocated to the JVM) and the computer becomes almost unusable.
The configuration settings I am using are:
neostore.nodestore.db.mapped_memory = 1G
neostore.propertystore.db.mapped_memory = 1G
neostore.propertystore.db.index.mapped_memory = 1M
neostore.propertystore.db.index.keys.mapped_memory = 1M
neostore.propertystore.db.strings.mapped_memory = 1G
neostore.propertystore.db.arrays.mapped_memory = 1M
neostore.relationshipstore.db.mapped_memory = 10G
We've tried changing some of these but it didn't appear to make any difference.
We have also tried adding the relationship indexes as a separate step after first building the database without any indexes. In this case we used GraphDatabaseFactory.newEmbeddedDatabaseBuilder and GraphDatabaseService.index().forRelationships. Doing it this way seems to work although it was estimated that it would take around 6 days to complete. We have tried invoking commit at various different intervals which makes some difference but not significant. Most of the time seems to be spent just iterating over the relationships.
The only thing I can think of that may be abnormal about our data is that the relationships have about 20 properties on them. But even creating an index on just 1 of these properties doesn't work.
The file sizes without any indexes are:
neostore.nodestore.db 400MB
neostore.propertystore.db 100GB
neostore.propertystore.db.strings 2GB
neostore.relationshipstore.db 10GB
Can you please give us some advice on how to get this working either during the BatchInserter process or as a separate step?
We are using version 2.0.1 of the Neo4j jars.
Thanks, Damon
Simple question:
What is the memory limitation of the Pig LOAD statement?
More detailed question:
Is there any way to relate available physical resources (disk, RAM, CPU) to the maximum size of a directory that a Pig LOAD statement can handle?
Scenario:
A research project is using a Pig script that is trying to load a directory containing 12,000+ files with a total size of 891GB in a single Pig LOAD statement, copied below. The files are gzipped WAT files which describe, in raw text, a collection of web pages. When run, the job appears to crash/hang/freeze our cluster every time. Since we are all new to Hadoop, the suspicion has been on resources and configuration until I finally was able to review the code.
-- load data from I_WATS_DIR
Orig = LOAD '$I_WATS_DIR' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header- Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4} as
(src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);
Details:
CLUSTER
1 front end node, 16 cores, 64GB RAM, 128GB swap, NameNode
3 compute nodes, 16 cores, 128GB RAM, 128GB swap, DataNode
TEST JOB 1
Same script referenced above, loading a directory with 1 file
Resident memory reported 1.2GB
Input: 138MB
Output: 207MB
Reduce input records: 1,630,477
Duration: 4m 11s
TEST JOB 2
Same script, 17 files
Resident memory: 16.4GB
Input: 3.5GB
Output: 1.3GB
Reduce input records: 10,648,807
Duration: 6m 48s
TEST JOB 3
Same script, 51 files
Resident memory: 41.4GB
Input: 10.9GB
Output: not recorded
Reduce input records: 31,968,331
Duration: 6m 18s
Final thoughts:
This is a 4 node cluster with nothing else running on it, fully dedicated to Cloudera Hadoop CDH4, running this 1 job only. Hoping this is all the info people need to answer my original question! I strongly suspect that some sort of file parsing loop that loads 1 file at a time is the solution, but I know even less about Pig than I do about Hadoop. I do have a programming/development background, but in this case I am the sys admin, not the researcher or programmer.
Based on your description of your cluster and the amount of data your pushing through it, it sounds like you are running out of space during the map/shuffle phase of the job. The temporary data is sent over the network, uncompressed, and then written to disk on the reducer before being processed in the reduce phase. One thing you can try is to compress the output of the mappers by setting mapred.map.compress.output to true (and specifying the desired codec).
But with only four nodes, I suspect you're just trying to do too much at once. If you can, try splitting up your job into multiple steps. For example, if you are doing the standard word count example, do the word count over small portions of your data, and then run a second MR program that sums those counts.