Apach Beam / Dataflow transform advice - google-cloud-dataflow

I have a batch data parsing job where the inputs is a list of zip files and each zip file has numerous small text files to parse. In the order of 100Gb compressed across 50 zip files, each zip has 1 million text files.
I am using Apache Beam's package in Python and running the job through Dataflow.
I wrote it as
Create collection from the list of zip file paths
FlatMap with a function that yields for every text file inside the zip (one output is a bytes string for all the bytes read from the text file)
ParDo with a method that yields for every row in the data from the text file / bytes read
...do other stuff like insert each row in the relevant table of some database
I notice this is too slow - CPU resources are only a few % utilised. I suspect that each node is getting a zip file, but work is not distributed among local CPUs - so it's just one CPU working per node. I don't understand why that is the case considering I used FlatMap.

The Dataflow runner makes use of Fusion optimisation:
'...Such optimizations can include fusing multiple steps or transforms in your pipeline's execution graph into single steps.'
If you have a transform which in its DoFn has a large fan-out, which I suspect the Create transform in your description does, then you may want to manually break fusion by introducing a shuffle stage to your pipeline as described in the linked documentation.

Related

apache beam streaming and processing multiple files at same time and windowed joins?

I just read this article
https://medium.com/bb-tutorials-and-thoughts/how-to-create-a-streaming-job-on-gcp-dataflow-a71b9a28e432
What I am truly missing here though is if I drop 50 files and this is a streaming job like the article says(always live), then won't the output be a windowed join of all the files?
If not, what would it look like and how would it change to be a windowed join? I am trying to get a picture of my head of both worlds of
A windowed join in a streaming job(outputting 1 file for all the files input)
A not windowed join in a streaming job(outputting 1 file PER input file)
Can anyone shed light on that article and what would change?
I also read something about 'Bounded PCollections'. In that case, perhaps windowing is not needed as inside the stream it is sort of like a batch of until we have the entire Pcollection processed, we do not move to the next stage? Perhaps if the article is using bounded pcollcation, then all input files map 1 to 1 with output files?
How can one tell from inside a function if I am receiving data from a bounded or unbounded collection? Is there some other way I can tell that? Is bounded collections even possible in apache beam streaming job?
I'll try to answer some of your questions.
What I am truly missing here though is if I drop 50 files and this is
a streaming job like the article says(always live), then won't the
output be a windowed join of all the files?
Input (source) and output (sink) are not directly linked. So this depends on what you do in your pipeline. TextIO.watchForNewFiles is an streaming source transform that keeps observing a given file location and keeps reading news files and outputting lines read from such files. Hence the output from this step will be a PCollection<String> that stream lines of text read from such files.
Windowing is set next, this decides how your data will be bundled into Windows. For this pipeline, they choose to use FixedWindows of 1 minute. Timestamp will be the time the file was observed.
Sink transform is applied at the end of your pipeline (sometimes sinks also produce outputs, so it might not really be the end). In this case they choose TextIO.write() which writes lines of Strings from an input PCollection<String> to output text files.
So whether the output will include data from all input files or not depends on how your input files are processed and how they are bundled into Windows within the pipeline.
I also read something about 'Bounded PCollections'. In that case,
perhaps windowing is not needed as inside the stream it is sort of
like a batch of until we have the entire Pcollection processed, we do
not move to the next stage? Perhaps if the article is using bounded
pcollcation, then all input files map 1 to 1 with output files?
You could use bounded inputs in a streaming pipeline. In a streaming pipeline, the progression is tracked through a watermark function. If you use a bounded input (for example, a bounded source) the watermark will just go from 0 to infinity instead of progressing gradually. Hence your pipeline might just end instead of waiting for more data.
How can one tell from inside a function if I am receiving data from a
bounded or unbounded collection? Is there some other way I can tell
that? Is bounded collections even possible in apache beam streaming
job?
It is definitely possible as I mentioned above. If you have access to the input PCollection, you can use the isBounded function to determine if it is bounded. See here for an example. You have access to input PCollections when expanding PTransforms (hence during job submission). I don't believe you have access to this at runtime.

Apache Beam - Parallelize Google Cloud Storage Blob Downloads While Maintaining Grouping of Blobs

I’d like to be able to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS). i.e.PCollection<Iterable<String>> --> PCollection<Iterable<String>> where the starting PCollection is an Iterable of file paths and the resulting PCollection is Iterable of file contents. Alternatively, PCollection<String> --> PCollection<Iterable<String>> would also work and perhaps even be preferable, where the starting PCollection is a glob pattern, and the resulting PCollection is an iterable of file contents which matched the glob.
My use-case is that at a point in my pipeline I have as input PCollection<String>. Each element of the PCollection is a GCS glob pattern. It’s important that files which match the glob be grouped together because the content of the files–once all files in a group are read–need to be grouped downstream in the pipeline. I originally tried using FileIO.matchAll and a subsequently GroupByKey . However, the matchAll, window, and GroupByKey combination lacked any guarantee that all files matching the glob would be read and in the same window before performing the GroupByKey transform (though I may be misunderstanding Windowing). It’s possible to achieve the desired results if a large time span WindowFn is applied, but it’s still probabilistic rather than a guarantee that all files will be read before grouping. It’s also the main goal of my pipeline to maintain the lowest possible latency.
So my next, and currently operational, plan was to use an AsyncHttpClient to fan out fetching file contents via GCS HTTP API. I feel like this goes against the grain in Beam and is likely sub-optimal in terms of parallelization.
So I’ve started investigating SplittableDoFn . My current plan is to allow splitting such that each entity in the input Iterable (i.e. each matched file from the glob pattern) could be processed separately. I've been able to modify FileIO#MatchFn (defined here in the Java SDK) to provide mechanics for PCollection<String> -> PCollection<Iterable<String>> transform between input of GCS glob patterns and output of Iterable of matches for the glob.
The challenge I’ve encountered is: how do I go about grouping/gathering the split invocations back into a single output value in my DoFn? I’ve tried using stateful processing and using a BagState to collect file contents along the way, but I realized part way along that the ProcessElement method of a splittable DoFn may only accept ProcessContext and Restriction tuples, and no other args therefore no StateId args referring to a StateSpec (throws an invalid argument error at runtime).
I noticed in the FilePatternWatcher example in the official SDF proposal doc that a custom tracker was created wherein FilePath Objects kept in a set and presumably added to the set via tryClaim. This seems as though it could work for my use-case, but I don’t see/understand how to go about implementing a #SplitRestriction method using a custom RestrictionTracker.
I would be very appreciative if anyone were able to offer advice. I have no preference for any particular solution, only that I want to achieve the ability to maintain a grouping of entities within a single PCollection element, but parallelize the fetching of those entities from Google Cloud Storage (GCS).
Would joining the output PCollections help you?
PCollectionList
.of(collectionOne)
.and(collectionTwo)
.and(collectionThree)
...
.apply(Flatten.pCollections())

Apache Beam: Reading in PCollection as PBegin for a pipeline

I'm debugging this beam pipeline and my end goal is to write all of the strings in a PCollection to a text file.
I've set a breakpoint at the point after the the PCollection I want to inspect is created and what I've been trying to do is create a new Pipeline that
Reads in this output PCollection as the inital input
Prints it to a file (using `TextIO.write().to("/Users/my/local/fp"))
I'm struggling with #1 of how to read in the PCollection as initial input.
The skeleton of what I've been trying:
Pipeline p2 = Pipeline.create();
p2.apply(// READ IN THE PCOLLECTION HERE)
.apply(TextIO.write().to("/Users/my/local/fp")));
p2.run()
Any thoughts or suggestions would be appreciated
In order to read a pcollection into input, you need to read it from a source. I.e. some data stored in BigQuery, Google Cloud Storage, etc. There are specific source transforms you can use to read from each of these locations. Depending on where you have stored your data you will need to use the correct source and pass in the relevant parameters (i.e. the GCS path, BigQuery table)
Please take a look at the Minimal Word Count Example on the apache beam website (Full source on github). I suggest starting from this code and iterating on it until you build the pipeline you need.
In this example files are read from GCS
p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
Please also see this guide on using IOs and also this list of beam IO transforms. If you just want a basic example working, you can use Create.of to read from variables in your program.

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2GB large (10 GB uncompressed), parse them, and write them into a date-partitioned table to BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I've built in while trying to make the pipeline reshuffle after decompressing, I have tried running the pipeline with and without it. Parsing JSON works via a Jackon ObjectMapper and corresponding classes as suggested here. The TablePartWindowFun is taken from here, it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files and not too many, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 of n1-highmem-16 machines. I've tried streaming and batch mode and disSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, and so enabling the dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processsing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework, is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done the graph looks like this:
Interestingly, when reaching this stage, first it's processing up to 1.5 million element/s, then the progress goes down to 0. The size of OutputCollection of the GroupByKey step in the picture first goes up and then down from about 300 million to 0 (there are about 1.8 billion elements in total). Like it is discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-time in the end is 0. The picture shows it slightly before running "backwards".
In the logs of the workers I see some exception thrown while executing request messages that are coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading the pipeline fails using n1-highmem-2 instances with out of memory errors at exactly the same step (everything after GroupByKey) - using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in 175 = (25 weeks) large chunks, running one pipeline after the other, to not overwhelm the system. In the loop make sure the last files of the previous iteration are re-processed and the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of the chunks are overwritten with correct complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers with a higher value when you run your pipeline. This is one of the possible solutions discussed in the “Troubleshooting Your Pipeline” online document, at the "Common Errors and Courses of Action" sub-chapter.

Comparing using Map Reduce(Cloudera Hadoop 0.20.2) two text files of size of almost 3GB

I'm trying to do the following in hadoop map/reduce( written in java, linux kernel OS)
Text files 'rules-1' and 'rules-2' (total 3GB in size) contains some rules, each rule are separated by endline character, so the files can be read using readLine() function.
These files 'rules-1' and 'rules-2' needs to be imported as a whole from hdfs in every map function in my cluster i.e. these file are not splittable across different map function.
Input to the mapper's map function is a text file called 'record' (each line is terminated by endline character), so from the 'record' file we get the (key, value) pair. The file is splittable and can be given as input to different map function used in the whole map/reduce process.
What needs to be done is compare each value(i.e. lines from record file) with the rules inside 'rules-1' and 'rules-2'
Problem is, if I pull out each line of rules-1 and rules-2 files to a static arraylist only once, so that each mapper can share the same arraylint and try to compare elements in the arraylist with the each input value from the record file, I get a memory overflow error, since 3GB cannot be stored at a time in the arraylist.
Alternatively, if I import only few lines from the rules-1 and rules-2 files at a time and compare them to each value, map/reduce is taking a lot time to finish its job.
Could you guys provide me any other alternative ideas how can this be done without the memory overflow error? Will it help if I put those file-1 and file-2 inside a hdfs supporting database or something? I'm going out of ideas actually.Would really appreciate if some of you guys could provide me your valuable suggestions.
Iif you input files are small - you can load them into static variables and use rules as an input.
If above is not a case I can suggest the following ways:
a) To give rule-1 and rule-2 high replication factor close to the number of nodes you have. Then you can read from HDFS rule=1 and rule-2 for each record in the input relatively efficient - because it will be sequential read from the local datanode.
b) If you can consider some hash function which, when applied to the rule and to the input string will predict without false negatives that they can match - then you can emit this hash for rules, input record and resolve all possible matches in the reducer. It will be very similar to the way how a join is done using MR
c) I would consider some other optimization techniques like building search trees, or sorting since otherwise the problem looks computationally expensive and will took forever...
On this page find Real-World Cluster Configurations
it will cover file size configuration
You could use the param "mapred.child.java.opts" in conf/mapred-site.xml to increase the memory for your mappers. You might not be able to run as many map slots per server but with more servers in your cluster you could still parallelize your job.
Read the content text file from the MapReduce function and read the keyword text file from the mapper function (for reading your HDFS) and split using StringTokenizer value.toString reading from MapReduce and in your mapper function write HDFS read text file code it will read line-by-line so use two while loops here you compare. Whenever you want data send it to reducer.
Split the 3gb text file into several text files and apply that all text files as usual MapReduce your previous program.
For splitting text file I written Java program and you decide how many lines you want write in each text file.

Resources