By default during an EMR job, instances are configured to have fewer reducers than mappers. But the reducers aren't given any extra memory so it seems like they should be able to have the same amount. (For instance, extra-large high-cpu instances have 7 mappers, but only 2 reducers, but both mappers and reducers are configured with 512 MB of memory available).
Does anyone know why this is and is there some way I can specify to use as many reducers as mappers?
EDIT: I had the amount wrong, it's 512 MB
Mappers extract data from their input stream (the mapper's STDIN), and what they emit is much more compact. That outbound stream (the mapper's STDOUT) is also then sorted by the key. Therefore, the reducers have smaller, sorted data in their incoming.
That is pretty much the reason why the default configuration for any Hadoop MapReduce cluster, not just EMR, is to have more mappers than reducers, proportional to the number of cores available to the jobtracker.
You have the ability to control the number of mappers and reducers through the jobconf parameter. The configuration variables are mapred.map.tasks and mapred.reduce.tasks.
Related
Assuming I have an unbounded dataset with extremely high cardinity > 1,000,000,000 unique keys, lets say I want to count by key, lets say over fixed windows
My understanding the combine function will essentially maintain an accumulator on each machine in memory for each key.
Question 1
Is the above assumption correct or can workers flush out keys and accumulators to disk when under memory pressure
Question 2 (assuming above correct)
Assuming the data is not naturally partitioned (e.g reading from pubsub) would we run out of memory on each worker since every machine may in theory see every key and have to maintain an in memory structure for each key?
Question 3 (assuming above correct)
If we store the data on kafka and split up the data into partitions based on the key we are counting on. Assuming you have 1 beam worker reading from 1 partition then each worker only see a consistent subset of the keyspace. In this scenario would the memory use of the workers be any different?
Beam is meant to be highly scalable; there are Beam pipelines that run on Dataflow with many trillions of unique keys.
When running a combining operation in Beam a table of keys and aggregated values is kept in memory, but when the table becomes full it is flushed to disk (well, technically, to shuffle) so it will not run out of memory. Another worker will read this data out of shuffle, one value at a time, to compute the final aggregate over all upstream worker outputs.
As for your other two questions, if your input is naturally partitioned by key such that each worker only sees a subset of keys it is possible that more combining could happen before the shuffle, leading to less data being shuffled, but this is by no means certain and the effects would likely be small. In particular, memory considerations won't change.
We have DWUs option in Azure synapse for Dedicated SQL Pool, which start from 100 DWUs which as per document consists of Compute +Memory +IO.
but how to check what type of compute node it is ? because in document it says 100 DWU consists of 1 compute with 60 distributions and 60 GB of Memory.
but here what is the configuration of Compute Node ?
or if we can't find the configuration, how to calculate the required DWUs to process 10GB of Data.
A dedicated SQL pool (formerly SQL DW) represents a collection of analytic resources that are being provisioned. Analytic resources are defined as a combination of CPU, memory, and IO.
These three resources are bundled into units of compute scale called Data Warehouse Units (DWUs). A DWU represents an abstract, normalized measure of compute resources and performance.
What is the configuration of Compute Node ? And how to check what type of compute node it is?
Each Compute node has a node ID that is visible in system views. You can see the Compute node ID by looking for the node_id column in system views whose names begin with sys.pdw_nodes.
For a list of these system views, see Synapse SQL system views.
How to calculate the required DWUs to process 10GB of Data.
Dedicated SQL pool (formerly SQL DW) is a scale-out system that can provision vast amounts of compute and query sizeable quantities of data.
To see its true capabilities for scaling, especially at larger DWUs, we recommend scaling the data set as you scale to ensure that you have enough data to feed the CPUs.
For more information, refer to Data Warehouse Units (DWUs) for dedicated SQL pool (formerly SQL DW) in Azure Synapse Analytics.
How many cores do you think you need to process 10GB data? It's a pretty complicated question to answer.
What are your queries doing? Are you doing 20 self joins? How much tempdb space is needed?
Breast way to find out will be to run experiments for your particular workload and resize such that you use 80% of available resources.
I'm using Apache Beam Java SDK to process events and write them to the Clickhouse Database.
Luckily there is ready to use ClickhouseIO.
ClickhouseIO accumulates elements and inserts them in batch, but because of the parallel nature of the pipeline it still results in a lot of inserts per second in my case. I'm frequently receiving "DB::Exception: Too many parts" or "DB::Exception: Too much simultaneous queries" in Clickhouse.
Clickhouse documentation recommends doing 1 insert per second.
Is there a way I can ensure this with ClickhouseIO?
Maybe some KV grouping before ClickhouseIO.Write or something?
It looks like you interpret these errors not quite correct:
DB::Exception: Too many parts
It means that insert affect more partitions than allowed (by default this value is 100, it is managed by parameter max_partitions_per_insert_block).
So either the count of affected partition is really large or the PARTITION BY-key was defined pretty granular.
How to fix it:
try to group the INSERT-batch such way it contains data related to less than 100 partitions
try to reduce the size of insert-block (if it quite huge) - withMaxInsertBlockSize
increase the limit max_partitions_per_insert_block in SQL-query (like this, INSERT .. SETTINGS max_partitions_per_insert_block=300 (I think ClickhouseIO should have the ability to set custom options on query level)) or on server-side by modifying userprofile-settings
DB::Exception: Too much simultaneous queries
This one managed by param max_concurrent_queries.
How to fix it:
reduce the count of concurrent queries by Beam means
increase the limit on the server-side in userprofile- or server-settings (see https://github.com/ClickHouse/ClickHouse/issues/7765)
I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2GB large (10 GB uncompressed), parse them, and write them into a date-partitioned table to BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing, I have tried running the pipeline with and without it. Parsing JSON works via a Jackon ObjectMapper and corresponding classes as suggested here. The TablePartWindowFun is taken from here, it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files and not too many, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 of n1-highmem-16 machines. I've tried streaming and batch mode and disSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, and so enabling the dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processsing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework, is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done the graph looks like this:
Interestingly, when reaching this stage, first it's processing up to 1.5 million element/s, then the progress goes down to 0. The size of OutputCollection of the GroupByKey step in the picture first goes up and then down from about 300 million to 0 (there are about 1.8 billion elements in total). Like it is discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-time in the end is 0. The picture shows it slightly before running "backwards".
In the logs of the workers I see some exception thrown while executing request messages that are coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading the pipeline fails using n1-highmem-2 instances with out of memory errors at exactly the same step (everything after GroupByKey) - using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in 175 = (25 weeks) large chunks, running one pipeline after the other, to not overwhelm the system. In the loop make sure the last files of the previous iteration are re-processed and the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of the chunks are overwritten with correct complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers with a higher value when you run your pipeline. This is one of the possible solutions discussed in the “Troubleshooting Your Pipeline” online document, at the "Common Errors and Courses of Action" sub-chapter.
We have a large data set which needs to be partition into 1,000 separate files, and the simplest implementation we wanted to use is to apply PartitionFn which, given an element of the data set, returns a random integer between 1 and 1,000.
The problem with this approach is it ends up creating 1,000 PCollections and the pipeline does not launch as there seems to be a hard limit on the number of 'steps' (which correspond to the boxes shown on the job monitoring UI in execution graph).
Is there a way to increase this limit (and what is the limit)?
The solution we are using to get around this issue is to partition the data into a smaller subsets first (say 50 subsets), and for each subset we run another layer of partitioning pipelines to produce 20 subsets of each subset (so the end result is 1000 subsets), but it'll be nice if we can avoid this extra layer (as ends up creating 1 + 50 pipelines, and incurs the extra cost of writing and reading the intermediate data).
Rather than using the Partition transform and introducing many steps in the pipeline consider using either of the following approaches:
Many sinks support the option to specify the number of output shards. For example, TextIO has a withNumShards method. If you pass this 1000 it will produce 1000 separate shards in the specified directory.
Using the shard number as a key and using a GroupByKey + a DoFn to write the results.