I'm trying to get a sample of the items in a PCollection using the Python SDK on Dataflow / Beam.
While it's not documented, Sample.FixedSizeGlobally(n) exists.
When testing, it seems to return a PCollection with a single item: a list containing the samples, rather than a PCollection with the samples. Is that correct?
Is doing this the best way of turning that single-item PCollection into a PCollection of the items?
| Sample.FixedSizeGlobally(sample_size)
| beam.FlatMap(lambda x: x)
Currently, yes. The Sample.FixedSizeGlobally() transform returns a PCollection with a single list element. You can turn it into a PCollection of single elements like you said:
Sample.FixedSizeGlobally(sample_size)
| beam.FlatMap(lambda x: x)
We'll make sure to add a PCollection-to-PCollection transform, and we also welcome your contributions to Beam :) In the meantime, that's what we've got.
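A plain-Python model of the shapes involved may help here. It does not use Beam at all: `random.sample` stands in for the sampling combiner and ordinary lists stand in for PCollections, both of which are assumptions made purely for illustration.

```python
import random

def fixed_size_sample(elements, n):
    """Model of a global fixed-size sample: ONE output element, a list of up to n items."""
    items = list(elements)
    return [random.sample(items, min(n, len(items)))]

def flat_map(fn, collection):
    """Model of FlatMap: apply fn to each element and flatten the results."""
    return [out for element in collection for out in fn(element)]

source = list(range(100))
sampled = fixed_size_sample(source, 5)       # a "collection" holding a single list
flattened = flat_map(lambda x: x, sampled)   # a "collection" of the 5 sampled items

print(len(sampled), len(flattened))  # prints: 1 5
```

The identity function passed to `flat_map` is what unpacks the single list into individual elements, which is the role `beam.FlatMap(lambda x: x)` plays in the snippet above.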
Let's say I have a pipeline, and I have a series of ParDo operations where element keys change. How can I ensure that elements for the same key go to the same worker without having to do a GroupByKey with windowing?
input_pcoll = p | beam.ReadFromXYZ(...)
rekeyed_pcoll = (input_pcoll
| beam.FlatMap(some_operation)
| beam.Map(lambda x: (compute_new_key(x), x['value'])))
After this, I would like to have elements of the same key go to the same worker without having to run a GroupByKey that uses windowing or triggering.
There are two ways to accomplish this.
The first one is by doing a GroupByKey, and having a trigger that triggers after every single element. Something like so:
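from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, Repeatedly
keys_together_pcoll = (rekeyed_pcoll
    # Fire after every element; a non-default trigger also needs an accumulation mode.
    | beam.WindowInto(window.GlobalWindows(),
                      trigger=Repeatedly(AfterCount(1)),
                      accumulation_mode=AccumulationMode.DISCARDING)
    | beam.GroupByKey()
    | beam.FlatMap(lambda x: x[1]))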
result_pcoll = (keys_together_pcoll
| beam.ParDo(DoFnWithElementsInCorrespondingWorkers()))
Granted, this is a little awkward.
Another way to do this is to make your DoFn stateful. This will force the runner to shuffle the elements into their corresponding workers by key. Something like this:
class DoFnWithElementsInCorrespondingWorkers(beam.DoFn):
    UNUSED_STATE = BagStateSpec('unused', VarIntCoder())

    def process(self,
                element,
                unused=beam.DoFn.StateParam(UNUSED_STATE)):
        # .. My processing
        pass
result_pcoll = (rekeyed_pcoll
| beam.ParDo(DoFnWithElementsInCorrespondingWorkers()))
Why does this happen?
Remember that in Beam (and Flink, and similar systems), state is organized by key, so if you insert a stateful DoFn, Beam will recognize that elements need to be shuffled into the correct workers according to their keys.
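The idea that keyed state forces same-key elements onto the same worker can be modelled in a few lines of plain Python. This is only a sketch: the worker count and the use of a CRC32 hash are assumptions for illustration, and real runners choose their own partitioning scheme.

```python
import zlib

NUM_WORKERS = 4  # hypothetical worker count

def worker_for(key):
    """Pick a worker from a stable hash of the key."""
    return zlib.crc32(repr(key).encode("utf-8")) % NUM_WORKERS

def shuffle_by_key(keyed_elements):
    """Route each (key, value) pair to the worker that owns its key."""
    workers = {i: [] for i in range(NUM_WORKERS)}
    for key, value in keyed_elements:
        workers[worker_for(key)].append((key, value))
    return workers

elements = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
assignment = shuffle_by_key(elements)

for worker, items in assignment.items():
    print(worker, items)
```

Because the worker is a pure function of the key, every element with key "a" lands in the same bucket, and per-key state can live next to the elements that use it.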
I have a PCollection[str] and I want to generate random pairs.
Coming from Apache Spark, my strategy was to:
copy the original PCollection
randomly shuffle it
zip it with the original PCollection
However I can't seem to find a way to zip 2 PCollections...
This is an interesting and uncommon use case because, as @chamikara says, there is no order guarantee in Dataflow. However, I thought about implementing a solution where you shuffle the input PCollection and then pair consecutive elements based on state. I have found some caveats along the way but I thought it might be worth sharing anyway.
First, I have used the Python SDK but the Dataflow Runner does not support stateful DoFn's yet. It works with the Direct Runner but: 1) it is not scalable and 2) it's difficult to shuffle the records without multi-threading. Of course, an easy solution for the latter is to feed an already shuffled PCollection to the pipeline (we can use a different job to pre-process the data). Otherwise, we can adapt this example to the Java SDK.
For now, I decided to try to shuffle and pair it with a single pipeline. I don't really know if this helps or makes things more complicated but code can be found here.
Briefly, the stateful DoFn looks at the buffer and if it is empty it puts in the current element. Otherwise, it pops out the previous element from the buffer and outputs a tuple of (previous_element, current_element):
class PairRecordsFn(beam.DoFn):
    """Pairs two consecutive elements after shuffle"""
    BUFFER = BagStateSpec('buffer', PickleCoder())

    def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
        try:
            previous_element = list(buffer.read())[0]
        except IndexError:
            # Nothing buffered yet for this key
            previous_element = []
        unused_key, value = element
        if previous_element:
            yield (previous_element, value)
            buffer.clear()
        else:
            buffer.add(value)
The pipeline adds keys to the input elements, as required to use a stateful DoFn. There is a trade-off here: you can assign the same key to all elements with beam.Map(lambda x: (1, x)). This would not parallelize well, but that is not a problem as we are using the Direct Runner anyway (keep it in mind if using the Java SDK). However, it will not shuffle the records. If, instead, we shuffle to a large number of keys, we'll get a larger number of "orphaned" elements that can't be paired (state is preserved per key and keys are assigned randomly, so a key can end up with an odd number of records):
pairs = (p
| 'Create Events' >> beam.Create(data)
| 'Add Keys' >> beam.Map(lambda x: (randint(1,4), x))
| 'Pair Records' >> beam.ParDo(PairRecordsFn())
| 'Check Results' >> beam.ParDo(LogFn()))
In my case I got something like:
INFO:root:('one', 'three')
INFO:root:('two', 'five')
INFO:root:('zero', 'six')
INFO:root:('four', 'seven')
INFO:root:('ten', 'twelve')
INFO:root:('nine', 'thirteen')
INFO:root:('eight', 'fourteen')
INFO:root:('eleven', 'sixteen')
...
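The buffering logic and the orphan effect described above can be modelled without Beam. The sketch below is an assumption-laden stand-in: a dictionary of lists plays the role of per-key state, and the data and seed are made up for the example.

```python
import random
from collections import defaultdict

def pair_records(keyed_elements):
    """Model of a per-key buffer: hold one value per key, emit a pair on the next one."""
    buffers = defaultdict(list)  # key -> at most one buffered value
    pairs = []
    for key, value in keyed_elements:
        if buffers[key]:
            pairs.append((buffers[key].pop(), value))
        else:
            buffers[key].append(value)
    # Whatever is still buffered at the end never found a partner.
    orphans = [values[0] for values in buffers.values() if values]
    return pairs, orphans

data = ["w%d" % i for i in range(17)]
rng = random.Random(0)  # fixed seed so the run is repeatable
keyed = [(rng.randint(1, 4), word) for word in data]
pairs, orphans = pair_records(keyed)

print(len(pairs), "pairs,", len(orphans), "orphans")
```

With k distinct keys there can be up to k orphans, one per key that received an odd number of elements, which is why spreading elements over more keys leaves more of them unpaired.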
EDIT: I thought of another way to do so using the Sample.FixedSizeGlobally combiner. The good thing is that it shuffles the data better but you need to know the number of elements a priori (otherwise we'd need an initial pass on the data) and it seems to return all elements together. Briefly, I initialize the same PCollection twice but apply different shuffle orders and assign indexes in a stateful DoFn. This will guarantee that indexes are unique across elements in the same PCollection (even if no order is guaranteed). In my case, both PCollections will have exactly one record for each key in the range [0, 31]. A CoGroupByKey transform will join both PCollections on the same index thus having random pairs of elements:
pc1 = (p
| 'Create Events 1' >> beam.Create(data)
| 'Sample 1' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
| 'Split Sample 1' >> beam.ParDo(SplitFn())
| 'Add Dummy Key 1' >> beam.Map(lambda x: (1, x))
| 'Assign Index 1' >> beam.ParDo(IndexAssigningStatefulDoFn()))
pc2 = (p
| 'Create Events 2' >> beam.Create(data)
| 'Sample 2' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
| 'Split Sample 2' >> beam.ParDo(SplitFn())
| 'Add Dummy Key 2' >> beam.Map(lambda x: (2, x))
| 'Assign Index 2' >> beam.ParDo(IndexAssigningStatefulDoFn()))
zipped = ((pc1, pc2)
| 'Zip Shuffled PCollections' >> beam.CoGroupByKey()
| 'Drop Index' >> beam.Map(lambda kv: kv[1])
| 'Check Results' >> beam.ParDo(LogFn()))
Full code here
Results:
INFO:root:(['ten'], ['nineteen'])
INFO:root:(['twenty-three'], ['seven'])
INFO:root:(['twenty-five'], ['twenty'])
INFO:root:(['twelve'], ['twenty-one'])
INFO:root:(['twenty-six'], ['twenty-five'])
INFO:root:(['zero'], ['twenty-three'])
...
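The shape of those results, a tuple of two one-element lists, follows from joining on a unique index. Here is a minimal plain-Python model of that index-and-join step; the helper names and seeds are hypothetical, and it is a sketch of the idea rather than the linked code.

```python
import random
from collections import defaultdict

def index_after_shuffle(values, seed):
    """Shuffle a copy of values and attach a unique index to each one."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    return list(enumerate(shuffled))

def co_group_by_key(left, right):
    """Model of a two-input join: key -> (left values, right values)."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)

data = ["zero", "one", "two", "three", "four", "five"]
joined = co_group_by_key(index_after_shuffle(data, seed=1),
                         index_after_shuffle(data, seed=2))
zipped = list(joined.values())  # drop the index, keep the pairs of lists

for left_values, right_values in zipped:
    print(left_values, right_values)
```

Each index occurs exactly once on each side, so every group holds one value from each input. Note that nothing in this scheme prevents an element from being paired with itself.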
How about applying a ParDo transform to both PCollections that attaches keys to elements, and then running the two PCollections through a CoGroupByKey transform?
Please note that Beam does not guarantee the order of elements in a PCollection, so output elements might get reordered after any step. That seems like it should be OK for your use case, since you just need some random order.
Using Apache Beam I am doing computations - and if they succeed I'd like to write the output to one sink, and if there is a failure I'd like to write that to another sink.
Is there any way to handle metadata or content based routing in Apache Beam?
I've used Apache Camel extensively, and so in my mind based on the outcome of a previous transform, I should route a message to a different sink using a router (perhaps determined by a metadata flag I set on the message header). Is there an analogous capability with Apache Beam, or would I instead just have a sequential transform that inspects the PCollection and handles writing to sinks within the transform?
Ideally I'd like this logic (written verbosely for attempted clarity)
result = my_pcollections | 'compute_stuff' >> beam.Map(lambda pcollection: my_compute_func(pcollection))
result | ([success_failure_router]
| 'success_sink' >> beam.io.WriteToText('/path/to/file')
| 'failure_sink' >> beam.io.WriteStringsToPubSub('mytopic'))
However, I suspect the 'Beam' way of handling this is
result = my_pcollections | 'compute_stuff' >> beam.Map(lambda pcollection: my_compute_func(pcollection))
result | 'write_results_appropriately' >> write_results_appropriately(result)
...
def write_results_appropriately(result):
    if result == ..:
        # success, write to file
    else:
        # failure, write to topic
Thanks,
Kevin
High-level:
I am not sure of the specifics of the Python API in this case, but at a high level it looks like this:
ParDos support multiple outputs;
outputs are identified by the tag you give them (e.g. "correct-elements", "invalid-elements");
in your main ParDo you write to multiple outputs, choosing the output using your criteria;
each output is represented by a separate PCollection;
then you get the separate PCollections representing the tagged outputs from your ParDo;
then you apply a different sink to each of the tagged PCollections.
For details, see this section:
https://beam.apache.org/documentation/programming-guide/#additional-outputs
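The steps listed above can be modelled in plain Python to show the data flow. This sketch does not use Beam: the success criterion (an integer division) and the helper names are hypothetical. In the Python SDK the corresponding pieces are `beam.pvalue.TaggedOutput` and `ParDo(...).with_outputs(...)`, as covered in the linked section.

```python
def route(element):
    """Tag each result as 'success' or 'failure' (hypothetical criterion)."""
    try:
        return ("success", 100 // element)
    except ZeroDivisionError as error:
        return ("failure", "%r: %s" % (element, error))

def split_by_tag(tagged):
    """Collect tagged values into one list per tag, like one collection per output."""
    outputs = {"success": [], "failure": []}
    for tag, value in tagged:
        outputs[tag].append(value)
    return outputs

inputs = [1, 0, 5, 0, 20]
outputs = split_by_tag(route(x) for x in inputs)

print(outputs["success"])  # would go to the file sink
print(outputs["failure"])  # would go to the topic sink
```

The routing decision is made once, per element, inside the transform; the sinks are then attached to the separate outputs rather than chosen inside a single write step.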
From: https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
You can insert a GroupByKey and ungroup after your first ParDo. The Dataflow service never fuses ParDo operations across an aggregation.
This is what I came up with in python - is this reasonable / is there a simpler way?
def prevent_fuse(collection):
    return (
        collection
        | beam.Map(lambda x: (x, 1))
        | beam.GroupByKey()
        | beam.FlatMap(lambda x: (x[0] for v in x[1]))
    )
EDIT, in response to Ben Chambers' question
We want to prevent fusion because we have a collection which generates a much larger collection, and we need parallelization across the larger collection. If it fuses, I only get one worker across the larger collection.
Apache Beam SDK 2.3.0 adds the experimental Reshuffle transform, which is the Python alternative to the Reshuffle.viaRandomKey operation mentioned by @BenChambers. You can use it in place of your custom prevent_fuse code.
That should work. There are other ways, but they partly depend on what you are trying to do and why you want to prevent fusion. Keep in mind that fusion is an important optimization to improve the performance of your pipeline.
Could you elaborate on why you want to prevent fusion?
A small adjustment to my original proposal: if each item is too large, that will fail. You need to force them into multiple items, so using a constant key doesn't work. So here, you can supply a key function, which needs to differentiate the objects and be small, like a hash.
That said, still not sure this is the best way, or whether something simpler (beam.Partition?) would work. And would be good for Beam to supply an explicit primitive.
def prevent_fuse(collection, key=None):
    """
    prevent a dataflow PCol fusing with the next PCol
    supply a key function if the items are too big to use as keys
    """
    key = key or (lambda x: x)
    return (
        collection
        | beam.Map(lambda v: (key(v), v))
        | beam.GroupByKey()
        | beam.FlatMap(lambda kv: (v for v in kv[1]))
    )
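To check that the group-then-ungroup round trip loses nothing, including duplicates, here is a small plain-Python model. The digest-based `small_key` is a hypothetical example of the kind of key function described above, and dictionaries stand in for GroupByKey.

```python
import hashlib
from collections import Counter, defaultdict

def small_key(item):
    """Hypothetical key function: a short, stable digest of a large item."""
    return hashlib.sha256(item.encode("utf-8")).hexdigest()[:8]

def group_then_ungroup(items, key):
    """Model of Map -> GroupByKey -> FlatMap with a caller-supplied key function."""
    grouped = defaultdict(list)
    for item in items:
        grouped[key(item)].append(item)  # models Map + GroupByKey
    return [v for values in grouped.values() for v in values]  # models FlatMap

items = ["x" * 1000, "y" * 1000, "x" * 1000]
result = group_then_ungroup(items, small_key)

print(Counter(result) == Counter(items))  # prints: True
```

The output may come back in a different order, but the multiset of elements is unchanged, and only the 8-character keys need to be compared during grouping.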
Is there any way to check if a PCollection is empty?
I haven't found anything relevant in the documentation of Dataflow and Apache Beam.
You didn't specify which SDK you're using, so I assumed Python. The code is easily portable to Java.
You can count the elements globally and then map the numeric value to a boolean with a simple comparison. You can then pass this value as a side input using the pvalue.AsSingleton function, like this:
import apache_beam as beam
from apache_beam import pvalue
is_empty_check = (your_pcollection
| "Count" >> beam.combiners.Count.Globally()
| "Is empty?" >> beam.Map(lambda n: n == 0)
)
another_pipeline_branch = (
p
| beam.Map(do_something, is_empty=pvalue.AsSingleton(is_empty_check))
)
Usage of the side input is the following:
def do_something(element, is_empty):
    if is_empty:
        # yes
        pass
    else:
        # no
        pass
There is no way to check the size of a PCollection without applying a PTransform to it (such as Count.globally() or a Combine with a CombineFn), because a PCollection is not like a typical Collection in Java.
It is an abstraction of a bounded or unbounded collection of data, where data is fed into the collection for an operation applied to it (e.g. a PTransform). It is also parallelized (as the P at the beginning of the class name suggests).
Therefore you need a mechanism to get the counts of elements from each worker/node and combine them into a single value. Whether it is 0 or n cannot be known until the end of that transformation.
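That combine step can be pictured with a small plain-Python model, in which each inner list is a hypothetical partition held by one worker. It is only a sketch of the reasoning, not Beam code.

```python
def count_partition(partition):
    """Partial count computed independently on one worker."""
    return sum(1 for _ in partition)

def is_empty(partitions):
    """Combine the partial counts; the answer is only known once all are in."""
    return sum(count_partition(p) for p in partitions) == 0

print(is_empty([[], [], []]))     # prints: True
print(is_empty([[], ["x"], []]))  # prints: False
```

No single worker can answer the question on its own: an empty partition says nothing about the others, so the partial counts have to be combined first.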