Optimizing repeated transformations in Apache Beam/Dataflow

I wonder if Apache Beam/Google Dataflow is smart enough to recognize repeated transformations in the dataflow graph and run them only once. For example, if I have 2 branches:
p | GroupByKey() | FlatMap(...)
p | combiners.Top.PerKey(...) | FlatMap(...)
both will involve grouping elements by key under the hood. Will the execution engine recognize that GroupByKey() has the same input in both cases and run it only once? Or do I need to manually ensure that GroupByKey() in this case precedes all branches where it gets used?

As you may have inferred, this behavior is runner-dependent. Each runner implements its own optimization logic.
The Dataflow Runner does not currently support this optimization.
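To make the manual approach from the question concrete, here is a minimal plain-Python sketch. It is not Beam code and says nothing about how any runner works internally. It only shows the idea of computing the grouping once and letting both branches consume it. The helper names `group_by_key` and `top_per_key` and the sample data are hypothetical. In a real pipeline, deriving the top-N branch from a shared GroupByKey() may give up any pre-shuffle combining the runner would otherwise apply to a combiner, so measure before restructuring.

```python
from collections import defaultdict
import heapq

def group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]} in a single pass."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

def top_per_key(grouped, n):
    """Second branch: the n largest values per key, read from the shared grouping."""
    return {key: heapq.nlargest(n, values) for key, values in grouped.items()}

pairs = [("a", 3), ("b", 7), ("a", 9), ("a", 1), ("b", 2)]

grouped = group_by_key(pairs)                        # the expensive step, done once
branch_1 = {k: len(v) for k, v in grouped.items()}   # first branch: any per-group logic
branch_2 = top_per_key(grouped, n=2)                 # second branch reuses `grouped`
```

Both branches read `grouped`, so the grouping work is not repeated.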

GNU parallel saturates one server instead of distributing jobs equally

I am using GNU parallel 20160222. I have four servers configured in my ~/.parallel/sshloginfile:
48/big1
48/big2
8/small1
8/small2
When I run, say, 32 jobs, I'd expect parallel to start eight on each server. Or even better, two or three each on small1 and small2, and twelve or so each on big1 and big2. Instead, it starts 8 jobs on small2 and the remaining jobs locally.
Here is my invocation (I actually use a --profile but I removed it for simplicity):
parallel --verbose --workdir . --sshdelay 0.2 --controlmaster --sshloginfile .. \
"my_cmd {} | gzip > {}.gz" ::: $(seq 1 32)
Here is the main question:
Is there an option missing that would do a more equal allocation of jobs?
Here is another related question:
Is there a way to specify --memfree, --load, etc. per server? Especially --memfree.
I recall GNU Parallel used to fill job slots "from one end". This did not matter if you had way more jobs than job slots: All job slots (both local and remote) would fill up.
It did, however, matter if you had fewer jobs. So it was changed, so GNU Parallel today gives jobs to sshlogins in a round robin fashion - thus spreading it more evenly.
Unfortunately I do not recall in which version this change was made. But you can tell whether your version does it by running:
parallel -vv -t
and look at which sshlogin is being used.
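To illustrate the difference between the two allocation policies described above, here is a small plain-Python simulation. It is only a sketch of the two ideas ("fill from one end" versus round robin), not GNU Parallel's actual scheduling code, and the function names are made up for this example. The slot counts are the ones from the question's sshloginfile.

```python
def fill_from_one_end(slots, n_jobs):
    """Give each host as many jobs as it has slots, in list order."""
    assigned = {host: 0 for host in slots}
    remaining = n_jobs
    for host, capacity in slots.items():
        take = min(capacity, remaining)
        assigned[host] = take
        remaining -= take
    return assigned

def round_robin(slots, n_jobs):
    """Hand out one job at a time, cycling over hosts that still have a free slot."""
    assigned = {host: 0 for host in slots}
    remaining = n_jobs
    while remaining > 0:
        progressed = False
        for host, capacity in slots.items():
            if remaining > 0 and assigned[host] < capacity:
                assigned[host] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # every slot on every host is taken
    return assigned

slots = {"big1": 48, "big2": 48, "small1": 8, "small2": 8}

one_end = fill_from_one_end(slots, 32)
spread = round_robin(slots, 32)
```

With 32 jobs, the first policy puts everything on the first host in the list, while round robin gives each of the four hosts eight jobs.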
Re: --memfree
You can build your own using --limit.
I am curious why you want different limits for different servers. The idea behind --memfree is that it is set to the amount of RAM that a single job takes. So if there is enough RAM for a single job, a new job should be started, no matter the server.
You clearly have a different situation, so please explain it.
Re: upgrading
Look into parallel --embed.

Prevent fusion in Apache Beam / Dataflow streaming (python) pipelines to remove pipeline bottleneck

We are currently working on a streaming pipeline on Apache Beam with DataflowRunner. We read messages from Pub/Sub, do some processing on them, and afterwards window them in sliding windows (currently the window size is 3 seconds and the interval is 3 seconds as well). Once the window is fired we do some post-processing on the elements inside the window. This post-processing step takes significantly longer than the window size: about 15 seconds.
The apache beam code of the pipeline:
import apache_beam as beam
from apache_beam.transforms.trigger import AccumulationMode, AfterCount

input = (pipeline
         | beam.io.ReadFromPubSub(subscription=<subscription_path>)
         | beam.Map(process_fn))
windows = input | beam.WindowInto(
    beam.window.SlidingWindows(3, 3),
    trigger=AfterCount(30),
    accumulation_mode=AccumulationMode.DISCARDING)
group = windows | beam.GroupByKey()
group | beam.Map(post_processing_fn)
As you know, Dataflow tries to perform some optimizations on your pipeline steps. In our case it fuses everything together from the windowing onwards (clustered operations: 1/ processing 2/ windowing + post-processing), which causes slow sequential post-processing of all the windows by just 1 worker. We see logs every 15 seconds that the pipeline is processing the next window. However, we would like to have multiple workers picking up separate windows instead of the workload going to a single worker.
Therefore we were looking for ways to prevent this fusion from happening so Dataflow separates the window from post-processing of the windows. In that way we would expect Dataflow to be able to assign multiple workers again to the post-processing of fired windows.
What we have tried so far:
Increase the number of workers to 20, 30 or even 40 but without effect. Only the steps before the windowing get assigned to multiple workers
Running the pipeline for 5 or 10 minutes but we noticed no worker re-allocation to help on this larger post-processing step after the windowing
After the windowing, put them back into a global window
Simulate another GroupByKey with a dummy key (as mentioned in https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#preventing-fusion) but without any success.
The last two actions indeed created a third clustered operation (1/ processing 2/ windowing 3/ post-processing ) but we noticed that still the same worker is executing everything after the windowing.
Is there any solution that can resolve this problem statement?
The current workaround we are now considering is to build another streaming pipeline which receives the windows so that its workers can process the windows in parallel, but that is cumbersome.
You have done the right thing to break fusion in your pipeline. I suspect a different issue is getting you into trouble.
For streaming, a single key always gets processed in the same worker. By any chance, are all or most of your records assigned to a single key? If so, your processing will be done in a single worker.
Something that you can do to prevent this is to make the window a part of the key, so that the elements for multiple windows can be processed in different workers even though they have the same key:
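from apache_beam.transforms import core

class KeyIntoKeyPlusWindow(core.DoFn):
    def process(self, element, window=core.DoFn.WindowParam):
        key, values = element
        # Re-key by (key, window) so that each window can go to a different worker.
        yield ((key, window), element)

group = windows | beam.ParDo(KeyIntoKeyPlusWindow()) | beam.GroupByKey()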
And once you've done that, you can apply your post-processing:
group | beam.Map(post_processing_fn)
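To see why adding the window to the key helps, here is a toy plain-Python model of key-based work assignment. This is only a sketch under loud assumptions: `worker_for` is a made-up stand-in hash, not Dataflow's real sharding, and the key name `device-1` and the worker count are invented. The one property it shares with the behaviour described above is that the same key always maps to the same worker.

```python
from collections import Counter

def worker_for(key, n_workers):
    """Toy stand-in for key-based sharding: the same key always gets the same worker."""
    return sum(repr(key).encode("utf-8")) % n_workers

def workers_used(keys, n_workers):
    """Count how many elements each worker receives for the given keys."""
    return Counter(worker_for(key, n_workers) for key in keys)

n_workers = 4
window_starts = range(0, 30, 3)  # ten 3-second windows

# Every element carries the same key, whatever window it belongs to.
hot_key = ["device-1" for _ in window_starts]
# The window (here just its start time) becomes part of the key.
key_plus_window = [("device-1", start) for start in window_starts]

single = workers_used(hot_key, n_workers)
spread = workers_used(key_plus_window, n_workers)
```

With one shared key, `single` contains exactly one worker. With (key, window) keys, `spread` contains several, so different windows can be processed in parallel.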

Dataflow working: Combine functions

I have multiple custom combine functions which I call as such:
e.g. I have 'data' calculated previously in the pipeline.
cd1 = data | customCombFn1()
cd2 = data | customCombFn2()
cd3 = data | customCombFn3()
How does the pipeline work in the above case? Is 'data' evaluated again and again? Or are cd1, cd2, and cd3 evaluated as a by-product of the pipeline?
Your data object is a PCollection. Applying a combine transformation to a PCollection creates another PCollection, most often containing far fewer elements.
There would be no 're-evaluation', as you call it. A PCollection is typically produced on multiple workers and immediately consumed by the transformations that need it. If that is not possible in a given case, the PCollection will typically be stored for processing at a later point.
Generally speaking, the Cloud Dataflow service automatically applies optimizations to users' pipelines. In most cases, including this one, this lets users focus on their business logic instead of the underlying execution considerations.

Iterative processing in Dataflow

As shown here Dataflow pipelines are represented by a fixed DAG. I'm wondering if it's possible to implement a pipeline where the processing proceeds until a dynamically evaluated condition is satisfied based on the data computed so far.
Here's some pseudo code to illustrate what I'd like to implement:
PCollection pco = null
while(true):
    pco = pco.apply(someTransform())
    if (conditionSatisfied(pco)):
        break
pco.Write()
It seems like you really want iterative computations. Right now Dataflow does not provide support for that, but we are aware that it is a very important use case and we are working on finding the right set of APIs to express it.
For now your workarounds are:
Iteratively run whole pipelines (run pipeline, inspect output, run again if the condition is not satisfied, etc). This has the obvious downside of pipeline setup and teardown overhead.
Build a pipeline with a hard-coded number of iterations by .apply()'ing in a loop unconditionally, then run the whole pipeline.
A combination of the two, e.g. run fixed 5-iteration pipelines until you're satisfied with the result.
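The workarounds above can be sketched in plain Python. This is only an outline of the control flow, assuming a hypothetical `run_pipeline` callable that stands for "build, run, and read back the output of one whole pipeline". None of these names come from the Dataflow SDK, and the halving step is a toy placeholder for a real transform.

```python
def run_until_satisfied(run_pipeline, condition_satisfied, state, max_runs=10):
    """Workaround 1: re-run a whole pipeline until the condition holds (or give up)."""
    for runs in range(1, max_runs + 1):
        state = run_pipeline(state)          # one full pipeline execution
        if condition_satisfied(state):
            return state, runs
    return state, max_runs

def make_fixed_iteration_pipeline(step, iterations):
    """Workaround 2: chain `step` a hard-coded number of times into a single run."""
    def pipeline(state):
        for _ in range(iterations):
            state = step(state)
        return state
    return pipeline

def halve(x):
    """Toy transform: each iteration halves the value."""
    return x / 2

# Workaround 3: run fixed 5-iteration pipelines until the result is below 1.
five_steps = make_fixed_iteration_pipeline(halve, iterations=5)
result, runs = run_until_satisfied(five_steps, lambda x: x < 1.0, state=1000.0)
```

Here the condition is only checked between runs, so the combined approach pays the setup and teardown cost twice instead of once per iteration.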

Your advice on a Hadoop MapReduce job

I have 2 files stored on a HDFS filesystem:
tbl_userlog: <website url (non canonical)> <tab> <username> <tab> <timestamp>
example: www.website.com, foobar87, 201101251456
tbl_websites: <website url (canonical)> <tab> <total hits>
example: website.com, 25889
I have written a sequence of Hadoop jobs which joins the 2 files on the website, filters on total hits > n per website, and then counts for each user the number of websites he has visited which have > n total hits. The details of the sequence are as follows:
A Map-only job which canonicizes the url in tbl_userlog (i.e. removes www, http:// and https:// from the url field)
A Map-only job which sorts tbl_websites on the url
An identity Map-Reduce job which takes the output of the 2 previous jobs as KeyValueTextInput and feeds them to a CompositeInput in order to make use of Hadoop's native join feature defined with jobConf.set("mapred.join.expr", CompositeInputFormat.compose("inner" (...))
A Map and Reduce job which filters the result of the previous job on total hits > n in its Map phase, groups the results by user in the shuffling phase, and counts the number of websites for each user in the Reduce phase.
In order to chain these steps, I just call the jobs sequentially in the described order. Each individual job outputs its results into HDFS which the following job in the chain then retrieves and processes in turn.
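As a reference for what the whole chain is meant to compute, here is a small in-memory Python sketch (not Hadoop code). It assumes the count is over distinct websites per user, which is one reading of the description. The extra log rows, the site other.org and tiny.net, and their hit counts are made-up sample data. Only the first userlog row and the website.com row follow the examples above.

```python
from collections import defaultdict

def canonicalize(url):
    """Strip a leading scheme and 'www.' (the normalization described for the first job)."""
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    if url.startswith("www."):
        url = url[len("www."):]
    return url

def popular_sites_per_user(userlog, websites, n):
    """Per user, count the distinct visited websites that have more than n total hits."""
    popular = {url for url, hits in websites if hits > n}
    visited = defaultdict(set)
    for url, user, _timestamp in userlog:
        site = canonicalize(url)
        if site in popular:
            visited[user].add(site)
    return {user: len(sites) for user, sites in visited.items()}

userlog = [
    ("www.website.com", "foobar87", "201101251456"),
    ("http://website.com", "foobar87", "201101251500"),
    ("https://www.other.org", "foobar87", "201101251510"),
    ("tiny.net", "alice", "201101251520"),
]
websites = [("website.com", 25889), ("other.org", 12000), ("tiny.net", 40)]

counts = popular_sites_per_user(userlog, websites, n=10000)
```

A reference like this is handy for checking the output of the chained jobs on a small sample.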
As I am new to Hadoop, I would like to ask for your counseling:
Is there a better way to chain these jobs? In this configuration all intermediate results are written to HDFS and then read back.
Do you see any design flaw in this job, or could it be written more elegantly by making use of some Hadoop feature that I have missed?
I am using Apache Hadoop 0.20.2 and using higher-level frameworks such as Pig or Hive is not possible in the scope of the project.
Thanks in advance for your replies!
I think what you have will work with a couple of caveats. Before I start listing them, I want to make two definitions clear. A map-only job is a job that has a defined Mapper and runs with 0 reducers. If the job is running with > 0 IdentityReducers, then the job is not a map-only job. A reduce-only job is a job that has a defined Reducer and runs with an IdentityMapper.
Your first job can be a map-only job, since all you're doing is canonicalizing URLs. But if you want to use CompositeInputFormat, you should run with an IdentityReducer with more than 0 reducers.
For your second job, I don't know what you mean by a map-only job that sorts. Sorting by its very nature is a reduce-side task. You probably mean that it has a defined Mapper but no Reducer. But in order for the URLs to be sorted, you should run with an IdentityReducer with more than 0 reducers.
Your third job is an interesting idea, but you have to be careful with CompositeInputFormat. There are two conditions that must be met for you to be able to use this input format. The first is that there has to be the same number of files in both input directories. This can be achieved by setting the same number of reducers for Job1 and Job2. The second condition is that the input files CANNOT be splittable. This can be achieved by using a non-splittable compression codec such as gzip.
Your fourth job sounds good, although you can filter out websites that do not have > n hits in the reducer of the previous job and save yourself some I/O.
There's obviously more than one solution to a problem in software, so while your solution would work, I wouldn't recommend it. Having 4 MapReduce jobs for this task is a bit expensive IMHO. The implementation I have in mind is an M-R-R workflow that uses Secondary Sort.
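The answer does not spell out its M-R-R design, so the following is only one possible reading, simulated in plain Python rather than written against the Hadoop API. The map step tags each record so that a single sort (standing in for the shuffle with secondary sort) delivers a site's hit count before its visits. The first reduce joins and filters, and the second reduce counts joined rows per user. All function names and the sample rows are hypothetical, and the URLs are assumed to be canonical already.

```python
from collections import Counter
from itertools import groupby

SITE, VISIT = 0, 1  # sort tags: a site's hit count sorts before its visits

def map_records(websites, userlog):
    """Map: emit ((url, tag), payload) so that one sort handles both inputs."""
    for url, hits in websites:
        yield (url, SITE), hits
    for url, user in userlog:
        yield (url, VISIT), user

def join_reduce(records, n):
    """Reduce 1: per url, read the hit count first, then emit (user, url) if hits > n."""
    ordered = sorted(records, key=lambda kv: kv[0])  # stands in for the shuffle sort
    for url, group in groupby(ordered, key=lambda kv: kv[0][0]):
        hits = None
        for (_, tag), payload in group:
            if tag == SITE:
                hits = payload
            elif hits is not None and hits > n:
                yield payload, url

def count_reduce(user_urls):
    """Reduce 2: count the joined rows per user."""
    return dict(Counter(user for user, _ in user_urls))

websites = [("website.com", 25889), ("tiny.net", 40)]
userlog = [("website.com", "foobar87"), ("tiny.net", "foobar87"),
           ("website.com", "alice"), ("website.com", "foobar87")]

joined = list(join_reduce(map_records(websites, userlog), n=1000))
counts = count_reduce(joined)
```

Note that this version counts visits to popular sites. Counting distinct sites instead would need a de-duplication step before the second reduce.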
As far as chaining jobs is concerned, you should have a look at Oozie, which is a workflow manager. I have yet to use it, but that's where I'd start.
