I've been tasked with tuning the snapshotting process. I'm dealing with 3 master node instances and 9 data node instances. We are using S3 as the store for the repositories.
There are 28 indices and a total of 189 shards in this particular cluster. The only snapshot tuning parameters I can find are chunk_size and max_snapshot_bytes_per_sec.
I've left chunk_size at the default (unlimited) and changed max_snapshot_bytes_per_sec from the default (40mb) to 100mb and then to 500mb.
As a baseline, snapshotting the entire cluster takes 3 hours with the default settings; after two experiments changing only max_snapshot_bytes_per_sec, the snapshot still takes 3 hours.
To me this sounds like the process is either CPU or network bound, but am I missing something? I'm not sure what other parameters I can change.
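For reference, throttling settings like max_snapshot_bytes_per_sec are repository-level settings applied when the repository is registered (or re-registered). A rough sketch, assuming a 5.x-era Java transport client and placeholder repository/bucket names, not my actual setup:

// Register (or update) the S3 repository with a per-node snapshot throttle.
client.admin().cluster().preparePutRepository("s3_backup")        // placeholder repository name
    .setType("s3")
    .setSettings(Settings.builder()
        .put("bucket", "my-snapshot-bucket")                      // placeholder bucket
        .put("max_snapshot_bytes_per_sec", "100mb")               // throttle applied on each data node
        .put("max_restore_bytes_per_sec", "100mb")
        .build())
    .get();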
We have a 1 GB list that was created using the View.asList() method in Beam SDK 2.0. We are trying to iterate through every member of the list and, for now, do nothing significant with it (we just sum up a value). Just reading this 1 GB list takes about 8 minutes (and that was after we set workerCacheMb=8000, which we think means the worker cache is 8 GB). (If we don't set workerCacheMb to 8000, it takes over 50 minutes before we just kill the job.) We're using an n1-standard-32 instance, which should have more than enough RAM. There is ONLY a single thread reading this 1 GB list. We know this because we create a dummy PCollection of one integer and use it to then read this 1 GB ViewList side input.
It should not take 8 minutes to read a 1 GB list, especially if there's enough RAM. Even if the list were materialized to disk (which it shouldn't be), an ordinary non-SSD disk can read data at about 100 MB/s, so even this absolute worst-case scenario should take only ~10 seconds.
What are we doing wrong? Did we discover a Dataflow bug? Or is workerCacheMb really in KB instead of MB? We're tearing our hair out here...
Try using setWorkerCacheMb(1000). 1000 MB is around 1 GB. The side input will then be served from each worker node's cache, which is fast.
DataflowWorkerHarnessOptions options = PipelineOptionsFactory.create().cloneAs(DataflowWorkerHarnessOptions.class);
options.setWorkerCacheMb(1000);
Is it really required to iterate over the whole 1 GB of side input data every time, or do you only need some specific data during the iteration?
If you only need specific data, you should fetch it by passing the specific index into the list. Getting data by index is a much faster operation than iterating over the whole 1 GB.
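If the data can be keyed, a map-style side input lets each element be looked up directly instead of scanning the whole list. A rough sketch in Beam 2.x Java (the key/value types and names such as keyed, mainInput and lookupView are placeholders, not taken from your pipeline):

// Build an indexed map view instead of a list view.
PCollection<KV<String, Long>> keyed = ...;                        // your data, keyed by some id
final PCollectionView<Map<String, Long>> lookupView =
    keyed.apply("As map side input", View.asMap());

mainInput.apply("Look up by key", ParDo.of(new DoFn<String, Long>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        Map<String, Long> lookup = c.sideInput(lookupView);       // indexed access, no full scan
        Long value = lookup.get(c.element());                     // fetch only what is needed
        if (value != null) {
            c.output(value);
        }
    }
}).withSideInputs(lookupView));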
After checking with the Dataflow team, the rate of 1GB in 8 minutes sounds about right.
Side inputs in Dataflow are always serialized. This is because to have a side input, a view of a PCollection must be generated. Dataflow does this by serializing it to a special indexed file.
If you give more information about your use case, we can help you think of ways of doing it in a faster manner.
I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2 GB large (10 GB uncompressed), parse them, and write them into a date-partitioned table in BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing; I have tried running the pipeline with and without it. Parsing JSON works via a Jackson ObjectMapper and corresponding classes, as suggested here. The TablePartWindowFun is taken from here; it is used to assign a partition to each entry in the PCollection.
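For reference, a generic reshuffle of this kind can be built roughly as follows. This is a simplified Beam-style sketch, not my actual Repartition code; the key range of 512 is arbitrary, and lines stands for the PCollection<String> produced by the read step:

// Attach random keys, group, then flatten the groups again to break fusion
// and redistribute the decompressed elements across workers.
PCollection<String> reshuffled = lines
    .apply("Add random key", WithKeys.of((String s) -> ThreadLocalRandom.current().nextInt(512))
        .withKeyType(TypeDescriptors.integers()))
    .apply("Group by random key", GroupByKey.<Integer, String>create())
    .apply("Drop keys", Values.<Iterable<String>>create())
    .apply("Flatten groups", Flatten.<String>iterables());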
The pipeline works for smaller files and not too many of them, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling with up to 100 n1-highmem-16 machines. I've tried streaming and batch mode and diskSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS first, thereby enabling dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework, is there another solution?
Addendum:
With Repartition after reading the gzipped JSON files in batch mode with a maximum of 100 workers (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done the graph looks like this:
Interestingly, when reaching this stage, it first processes up to 1.5 million elements/s, then the progress goes down to 0. The size of the OutputCollection of the GroupByKey step in the picture first goes up and then back down, from about 300 million to 0 (there are about 1.8 billion elements in total), as if it were discarding something. Also, the run-time of ExpandIterable and ParDo(Streaming Write) ends up at 0. The picture shows the graph slightly before it starts running "backwards".
In the logs of the workers I see some "exception thrown while executing request" messages coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after reading, the pipeline fails on n1-highmem-2 instances with out-of-memory errors at exactly the same step (everything after GroupByKey); using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in 175-day (25-week) chunks, running one pipeline after the other, so as not to overwhelm the system (see the sketch after this list). In the loop, make sure the last files of the previous iteration are re-processed and that the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of a chunk are overwritten with correct, complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
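Schematically, the driver loop from the first point looks something like this (a simplified sketch; buildAndRunPipeline and the dates are placeholders, not my actual code):

// Run one pipeline per 175-day chunk, re-reading the boundary files of the previous chunk.
LocalDate startDate = LocalDate.parse("2015-01-01");              // placeholder start of the data
LocalDate endDate = LocalDate.now();
int chunkDays = 175;                                              // 25 weeks per pipeline run
for (LocalDate chunkStart = startDate; chunkStart.isBefore(endDate);
     chunkStart = chunkStart.plusDays(chunkDays)) {
    LocalDate chunkEnd = chunkStart.plusDays(chunkDays);
    // buildAndRunPipeline selects the input files overlapping [chunkStart, chunkEnd),
    // including the last files of the previous chunk; their incomplete days are
    // overwritten because WRITE_TRUNCATE is used per partition.
    buildAndRunPipeline(chunkStart, chunkEnd).waitUntilFinish();
}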
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
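A rough sketch of what that looks like in Beam 2.x, routing each row to a day partition via a partition decorator (the "partition_date" field and the table names are placeholders, and rows stands for the PCollection<TableRow> produced earlier in the pipeline):

// Each TableRow is routed to "table$yyyyMMdd", i.e. one BigQuery partition per day.
rows.apply("Write to BigQuery", BigQueryIO.writeTableRows()
    .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
        @Override
        public TableDestination apply(ValueInSingleWindow<TableRow> input) {
            String day = (String) input.getValue().get("partition_date");   // e.g. "20170101"
            return new TableDestination(
                "someproject:somedataset.sometable$" + day, "daily partition");
        }
    })
    .withSchema(TableRowConverter.getSchema())
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

With WRITE_TRUNCATE and a partition decorator, only the targeted day partition is replaced, which fits the chunked reload strategy described above.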
It may be worthwhile to try allocating more resources to your pipeline by setting --numWorkers to a higher value when you run your pipeline. This is one of the possible solutions discussed in the "Troubleshooting Your Pipeline" online document, in the "Common Errors and Courses of Action" section.
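For example (a sketch; the worker counts are arbitrary), --numWorkers=50 can be passed on the command line, or the equivalent can be set programmatically on the Dataflow options:

DataflowPipelineOptions options = PipelineOptionsFactory
    .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
options.setNumWorkers(50);          // start with 50 workers instead of the default
options.setMaxNumWorkers(100);      // upper bound for autoscaling
Pipeline p = Pipeline.create(options);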
I've set up a spark-jobserver to enable complex queries on a reduced dataset.
The jobserver executes two operations:
Sync with the main remote database: it dumps some of the server's tables, reduces and aggregates the data, saves the result as a Parquet file and caches it as a SQL table in memory. This operation will be done every day;
Queries: when the sync operation is finished, users can perform complex SQL queries on the aggregated dataset, optionally exporting the result as a CSV file. Every user can run only one query at a time and must wait for its completion.
The biggest table (before and after the reduction, which also includes some joins) has almost 30M rows, with at least 30 fields.
Currently I'm working on a dev machine with 32GB of RAM dedicated to the job server, and everything runs smoothly. The problem is that in production we have the same amount of RAM, shared with a PredictionIO server.
I'm asking how to determine the memory configuration so as to avoid memory leaks or crashes in Spark.
I'm new to this, so every reference or suggestion is accepted.
Thank you
Take an example: if you have a server with 32 GB of RAM, set the following parameter:
spark.executor.memory = 32g
Take note:
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers.
The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node.
15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
This config results in three executors on all nodes except for the one with the AM, which will have two executors. --executor-memory was derived as (63GB / 3 executors per node) = 21. 21 * 0.07 = 1.47. 21 - 1.47 ~ 19.
This is explained here if you want to know more :
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
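If you create the SparkContext yourself rather than through the jobserver's configuration, the same sizing can be expressed programmatically. A rough sketch using the example values above (placeholder app name; adjust the numbers to your own cluster and to the RAM you must leave for the PredictionIO server):

// Executor sizing along the lines of the Cloudera example above.
SparkConf conf = new SparkConf()
    .setAppName("jobserver-aggregations")            // placeholder app name
    .set("spark.executor.instances", "17")           // equivalent of --num-executors on YARN
    .set("spark.executor.cores", "5")
    .set("spark.executor.memory", "19g");
JavaSparkContext sc = new JavaSparkContext(conf);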
Can you please explain the best way to add relationship indexes to a Neo4j database created using the BatchInserter?
Our database contains about 30 million nodes and about 300 million relationships. If we build this without any indexes then it takes about 10 hours (just calls to BatchInserter.createNode and BatchInserter.createRelationship).
However, if we also try to create relationship indexes using LuceneBatchInserterIndexProvider, with repeated calls to index.add, then the process takes 12 hours to add everything but gets stuck on indexProvider.shutdown and doesn't complete. The longest I have left it is 3 days. Can you please explain what it is doing at this point? I expected the work to be done during the calls to index.add. What is going on during shutdown that is taking so long?
Our PC has 64GB RAM and we have allocated 40GB to the JVM. During this shutdown step, Windows reports that 99% of the memory is in use (far more than allocated to the JVM) and the computer becomes almost unusable.
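For context, the batch-insert indexing flow described above has roughly the following shape. This is a simplified sketch with placeholder names and values, not our actual code:

// Batch insertion with a Lucene relationship index; flush periodically to bound memory.
BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db");
BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter);
BatchInserterIndex relIndex =
    indexProvider.relationshipIndex("relationships", MapUtil.stringMap("type", "exact"));
relIndex.setCacheCapacity("someProperty", 1000000);   // cache the most frequently indexed key

long a = inserter.createNode(MapUtil.map("name", "A"));
long b = inserter.createNode(MapUtil.map("name", "B"));
long rel = inserter.createRelationship(a, b,
    DynamicRelationshipType.withName("CONNECTS"), MapUtil.map("someProperty", 42));
relIndex.add(rel, MapUtil.map("someProperty", 42));

relIndex.flush();            // flushing at intervals keeps Lucene's in-memory buffers bounded

indexProvider.shutdown();    // must be shut down before the inserter
inserter.shutdown();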
The configuration settings I am using are:
neostore.nodestore.db.mapped_memory = 1G
neostore.propertystore.db.mapped_memory = 1G
neostore.propertystore.db.index.mapped_memory = 1M
neostore.propertystore.db.index.keys.mapped_memory = 1M
neostore.propertystore.db.strings.mapped_memory = 1G
neostore.propertystore.db.arrays.mapped_memory = 1M
neostore.relationshipstore.db.mapped_memory = 10G
We've tried changing some of these but it didn't appear to make any difference.
We have also tried adding the relationship indexes as a separate step, after first building the database without any indexes. In this case we used GraphDatabaseFactory.newEmbeddedDatabaseBuilder and GraphDatabaseService.index().forRelationships. Doing it this way seems to work, although it was estimated that it would take around 6 days to complete. We have tried invoking commit at various intervals, which makes some difference, but nothing significant. Most of the time seems to be spent just iterating over the relationships.
The only thing I can think of that may be abnormal about our data is that the relationships have about 20 properties on them. But even creating an index on just 1 of these properties doesn't work.
The file sizes without any indexes are:
neostore.nodestore.db 400MB
neostore.propertystore.db 100GB
neostore.propertystore.db.strings 2GB
neostore.relationshipstore.db 10GB
Can you please give us some advice on how to get this working either during the BatchInserter process or as a separate step?
We are using version 2.0.1 of the Neo4j jars.
Thanks, Damon
Simple question:
What is the memory limitation of the Pig LOAD statement?
More detailed question:
Is there any way to relate available physical resources (disk, RAM, CPU) to the maximum size of a directory that a Pig LOAD statement can handle?
Scenario:
A research project is using a Pig script that tries to load a directory containing 12,000+ files with a total size of 891 GB in a single Pig LOAD statement, copied below. The files are gzipped WAT files which describe, in raw text, a collection of web pages. When run, the job appears to crash/hang/freeze our cluster every time. Since we are all new to Hadoop, the suspicion has been on resources and configuration, until I was finally able to review the code.
-- load data from I_WATS_DIR
Orig = LOAD '$I_WATS_DIR' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4}') AS
(src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);
Details:
CLUSTER
1 front end node, 16 cores, 64GB RAM, 128GB swap, NameNode
3 compute nodes, 16 cores, 128GB RAM, 128GB swap, DataNode
TEST JOB 1
Same script referenced above, loading a directory with 1 file
Resident memory reported 1.2GB
Input: 138MB
Output: 207MB
Reduce input records: 1,630,477
Duration: 4m 11s
TEST JOB 2
Same script, 17 files
Resident memory: 16.4GB
Input: 3.5GB
Output: 1.3GB
Reduce input records: 10,648,807
Duration: 6m 48s
TEST JOB 3
Same script, 51 files
Resident memory: 41.4GB
Input: 10.9GB
Output: not recorded
Reduce input records: 31,968,331
Duration: 6m 18s
Final thoughts:
This is a 4 node cluster with nothing else running on it, fully dedicated to Cloudera Hadoop CDH4, running this 1 job only. Hoping this is all the info people need to answer my original question! I strongly suspect that some sort of file parsing loop that loads 1 file at a time is the solution, but I know even less about Pig than I do about Hadoop. I do have a programming/development background, but in this case I am the sys admin, not the researcher or programmer.
Based on your description of your cluster and the amount of data you're pushing through it, it sounds like you are running out of space during the map/shuffle phase of the job. The temporary data is sent over the network, uncompressed, and then written to disk on the reducer before being processed in the reduce phase. One thing you can try is to compress the output of the mappers by setting mapred.compress.map.output to true (and specifying the desired codec).
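For example, in plain MapReduce terms the relevant settings look roughly like this (a sketch; the Gzip codec is just an example, and in a Pig script the same properties can be passed with its set command):

// Enable compression of the intermediate map output (classic property names shown).
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
              org.apache.hadoop.io.compress.GzipCodec.class,
              org.apache.hadoop.io.compress.CompressionCodec.class);
Job job = Job.getInstance(conf, "wat-processing");    // placeholder job name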
But with only four nodes, I suspect you're just trying to do too much at once. If you can, try splitting up your job into multiple steps. For example, if you are doing the standard word count example, do the word count over small portions of your data, and then run a second MR program that sums those counts.