Apache Beam Python - SQL Transform with named PCollection Issue - google-cloud-dataflow

I am trying to execute the code below, in which I use a NamedTuple-typed PCollection and SqlTransform to do a simple select.
According to this video (at 4:06): https://www.youtube.com/watch?v=zx4p-UNSmrA, named PCollections can be provided to the SqlTransform query instead of the default PCOLLECTION, as shown below.
Code Block
import typing

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

class EmployeeType(typing.NamedTuple):
    name: str
    age: int

beam.coders.registry.register_coder(EmployeeType, beam.coders.RowCoder)

pcol = p | "Create" >> beam.Create([EmployeeType(name="ABC", age=10)]).with_output_types(EmployeeType)

(
    {'a': pcol} | SqlTransform(""" SELECT age FROM a """)
    | "Map" >> beam.Map(lambda row: row.age)
    | "Print" >> beam.Map(print)
)

p.run()
However, the code block above errors out with:
Caused by: org.apache.beam.vendor.calcite.v1_28_0.org.apache.calcite.sql.validate.SqlValidatorException: Object 'a' not found
The Apache Beam SDK used is 2.35.0. Are there any known limitations in using named PCollections?
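For comparison, when a single unnamed PCollection is applied to SqlTransform, the documented default table name is PCOLLECTION. A minimal sketch of that form (my own, not part of the original question):

# Sketch only: the single-PCollection form, where SqlTransform exposes the
# input under the implicit table name PCOLLECTION.
(
    pcol
    | SqlTransform(""" SELECT age FROM PCOLLECTION """)
    | "MapAge" >> beam.Map(lambda row: row.age)
    | "PrintAge" >> beam.Map(print)
)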

Related

DataFlowRunner + Beam in streaming mode with a SideInput AsDict hangs

I have a simple graph that reads from a Pub/Sub message (currently just a single string key), creates a very short window, generates 3 integers that use this key via a beam.ParDo, and has a simple Map that creates a single "config" with this as a key.
Ultimately, there are 2 PCollections:
items: [('key', 0), ('key', 1), ...]
infos: [('key', 'the value is key')]
I want a final beam.Map over items that uses infos as a dictionary side input so I can look up the value in the dictionary.
Using the local runner, the final print works with the side input.
On Dataflow the first two steps print, but the final Map with the side input is never called, presumably because it somehow ends up in an unbounded window (despite the earlier window function).
I am using Runner v2, Dataflow Prime, and Streaming Engine.
p = beam.Pipeline(options=pipeline_options)

pubsub_message = (
    p
    | beam.io.gcp.pubsub.ReadFromPubSub(
        subscription='projects/myproject/testsubscription')
    | 'SourceWindow' >> beam.WindowInto(
        beam.transforms.window.FixedWindows(1e-6),
        trigger=beam.transforms.trigger.Repeatedly(
            beam.transforms.trigger.AfterCount(1)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))

def _create_items(pubsub_key: bytes) -> Iterable[tuple[str, int]]:
    for i in range(3):
        yield pubsub_key.decode(), i

def _create_info(pubsub_key: bytes) -> tuple[str, str]:
    return pubsub_key.decode(), f'the value is {pubsub_key.decode()}'

items = pubsub_message | 'CreateItems' >> beam.ParDo(_create_items) | beam.Reshuffle()
info = pubsub_message | 'CreateInfo' >> beam.Map(_create_info)

def _print_item(keyed_item: tuple[str, int], info_dict: dict[str, str]) -> None:
    key, _ = keyed_item
    log(key + '::' + info_dict[key])

_ = items | 'MapWithSideInput' >> beam.Map(_print_item, info_dict=beam.pvalue.AsDict(info))
Here is the output in local runner:
Creating item 0
Creating item 1
Creating item 2
Creating info b'key'
key::the value is key
key::the value is key
key::the value is key
Here is the DataFlow graph:
I've tried various windowing functions over the AsDict, but I can never get it to be exactly the same window as my input.
Thoughts on what I might be doing wrong here?
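For reference, one way to express "windowing the side input" that the question mentions trying is to apply the same FixedWindows strategy to the info PCollection before wrapping it in AsDict, so the main input and the side input share a windowing strategy. A minimal sketch (my own, reusing the variable names from the snippet above; not a confirmed fix for the Dataflow behavior):

# Sketch only: window the side-input PCollection with the same strategy as
# the main input before building the AsDict view.
windowed_info = info | 'InfoWindow' >> beam.WindowInto(
    beam.transforms.window.FixedWindows(1e-6),
    trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(1)),
    accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)

_ = items | 'MapWithSideInput' >> beam.Map(
    _print_item, info_dict=beam.pvalue.AsDict(windowed_info))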

Dataflow stream python windowing

I am new to using Dataflow. I have the following logic:
An event is added to Pub/Sub.
Dataflow reads Pub/Sub and gets the event.
From the event I look in MySQL to find which segments the event is related to; this step returns the list of related segments. These segments are independent of one another.
Each segment can be divided into two MySQL result tables, for email and mobile, and these are independent as well.
Each segment has 1 to n rules. I would like to process this step in parallel and collect all the results. I have tried to use windows, but I am not sure how to write the logic so that, when I get the combined results from all rules inside one segment, they are all collected in a final function that writes the outcome back to MySQL depending on the rule results (booleans).
Here is what I have so far:
testP = beam.Pipeline(options=options)

ReadData = (
    testP
    | 'ReadData' >> beam.io.ReadFromPubSub(
        subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
    | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
    | 'GetSegments' >> beam.ParDo(getsegments(options))
)

processEmails = (
    ReadData
    | 'GetSubscribersWithRulesForEmails' >> beam.ParDo(GetSubscribersWithRules(options, 'email'))
    | 'ProcessSubscribersSegmentsForEmails' >> beam.ParDo(ProcessSubscribersSegments(options, 'email'))
)

processMobiles = (
    ReadData
    | 'GetSubscribersWithRulesForMobiles' >> beam.ParDo(GetSubscribersWithRules(options, 'mobile'))
    | 'ProcessSubscribersSegmentsForMobiles' >> beam.ParDo(ProcessSubscribersSegments(options, 'mobile'))
)

# For the sake of testing, only the window for email is written.
windowThis = (
    processEmails
    | beam.WindowInto(
        beam.window.FixedWindows(1),
        trigger=beam.transforms.trigger.Repeatedly(
            beam.transforms.trigger.AfterProcessingTime(1 * 10)),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)
    | beam.CombinePerKey(beam.combiners.ToListCombineFn())
    | beam.ParDo(print_windows)
)
In this case, because all of your elements have the exact same timestamp, I would use their message ID and their timestamp to group them with session windows. It would be something like this:
testP = beam.Pipeline(options=options)

ReadData = (
    testP
    | 'ReadData' >> beam.io.ReadFromPubSub(
        subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
    | 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
    | 'GetSegments' >> beam.ParDo(getsegments(options))
)

# At this point, ReadData contains (key, value) pairs with a timestamp.
# (Now we perform all of the processing.)
processEmails = (ReadData | ....)
processMobiles = (ReadData | .....)

# Now we window by sessions with a 1-second gap. This is okay because all of
# the elements for any given key have the exact same timestamp.
windowThis = (
    processEmails
    | beam.WindowInto(beam.window.Sessions(1))  # Default trigger is fine
    | beam.CombinePerKey(beam.combiners.ToListCombineFn())
    | beam.ParDo(print_windows)
)
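To make the grouping behaviour concrete, here is a small self-contained sketch (my own illustration with made-up data, not code from the answer) showing how Sessions(1) plus CombinePerKey collects elements that share a key and a timestamp:

# Illustrative only: elements that share a key and (nearly) the same
# timestamp fall into one session window and are collected into one list.
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([('evt-1', 'rule-a'), ('evt-1', 'rule-b'), ('evt-1', 'rule-c')])
        # Give every element the same timestamp, as they would have when they
        # all derive from a single Pub/Sub message.
        | beam.Map(lambda kv: window.TimestampedValue(kv, 1000))
        | beam.WindowInto(window.Sessions(1))
        | beam.CombinePerKey(beam.combiners.ToListCombineFn())
        | beam.Map(print)  # e.g. ('evt-1', ['rule-a', 'rule-b', 'rule-c']); list order not guaranteed
    )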

How to set coder for Google Dataflow Pipeline in Python?

I am creating a custom Dataflow job in Python to ingest data from Pub/Sub to BigQuery. The table has many nested fields.
Where can I set the coder in this pipeline?
avail_schema = parse_table_schema_from_json(bg_out_schema)
coder = TableRowJsonCoder(table_schema=avail_schema)

with beam.Pipeline(options=options) as p:
    # Read the text from PubSub messages.
    lines = (p | beam.io.ReadFromPubSub(subscription="projects/project_name/subscriptions/subscription_name")
             | 'Map' >> beam.Map(coder))

    # transformed = lines | 'Parse JSON to Dict' >> beam.Map(json.loads)

    transformed | 'Write to BigQuery' >> beam.io.WriteToBigQuery("Project:DataSet.Table", schema=avail_schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
Error: Map can be used only with callable objects. Received TableRowJsonCoder instead.
In the code above, the coder is applied to the message read from Pub/Sub, which is text.
WriteToBigQuery works with both dictionaries and TableRows. json.loads emits a dict, so you can simply write its output to BigQuery without applying any coder. Note that the fields in the dictionary have to match the table schema.
To avoid the coder issue, I would suggest using the following code.
avail_schema = parse_table_schema_from_json(bg_out_schema)

with beam.Pipeline(options=options) as p:
    # Read the text from PubSub messages.
    lines = p | beam.io.ReadFromPubSub(subscription="projects/project_name/subscriptions/subscription_name")

    transformed = lines | 'Parse JSON to Dict' >> beam.Map(json.loads)

    transformed | 'Write to BigQuery' >> beam.io.WriteToBigQuery("Project:DataSet.Table", schema=avail_schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
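To illustrate the point about matching the schema, here is a small hypothetical example (the payload and field names are mine, not from the question) of what json.loads hands to WriteToBigQuery:

import json

# Hypothetical Pub/Sub payload; top-level and nested keys must match the
# BigQuery schema field names (nested fields correspond to RECORD columns).
payload = b'{"name": "ABC", "address": {"city": "NYC", "zip": "10001"}}'
row = json.loads(payload)
# row == {'name': 'ABC', 'address': {'city': 'NYC', 'zip': '10001'}}
# WriteToBigQuery accepts this dict directly; no TableRowJsonCoder is needed.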

Is it possible to do a zip operation in apache beam on two PCollections?

I have a PCollection[str] and I want to generate random pairs.
Coming from Apache Spark, my strategy was to:
copy the original PCollection
randomly shuffle it
zip it with the original PCollection
However I can't seem to find a way to zip 2 PCollections...
This is interesting and not a very common use case because, as @chamikara says, there is no order guarantee in Dataflow. However, I thought about implementing a solution where you shuffle the input PCollection and then pair consecutive elements based on state. I have found some caveats along the way, but I thought it might be worth sharing anyway.
First, I have used the Python SDK, but the Dataflow Runner does not support stateful DoFns yet. It works with the Direct Runner but: 1) it is not scalable and 2) it's difficult to shuffle the records without multi-threading. Of course, an easy solution for the latter is to feed an already shuffled PCollection to the pipeline (we can use a different job to pre-process the data). Otherwise, we can adapt this example to the Java SDK.
For now, I decided to try to shuffle and pair it with a single pipeline. I don't really know if this helps or makes things more complicated, but the code can be found here.
Briefly, the stateful DoFn looks at the buffer and, if it is empty, puts in the current element. Otherwise, it pops the previous element out of the buffer and outputs a tuple of (previous_element, current_element):
import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import BagStateSpec

class PairRecordsFn(beam.DoFn):
    """Pairs two consecutive elements after shuffle."""
    BUFFER = BagStateSpec('buffer', PickleCoder())

    def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
        try:
            previous_element = list(buffer.read())[0]
        except IndexError:
            previous_element = []

        unused_key, value = element

        if previous_element:
            yield (previous_element, value)
            buffer.clear()
        else:
            buffer.add(value)
The pipeline adds keys to the input elements, as required to use a stateful DoFn. Here there is a trade-off: you can assign the same key to all elements with beam.Map(lambda x: (1, x)). This would not parallelize well, but that's not a problem as we are using the Direct Runner anyway (keep it in mind if using the Java SDK); however, it will not shuffle the records. If, instead, we shuffle across a large number of keys, we'll get a larger number of "orphaned" elements that can't be paired (since state is preserved per key and keys are assigned randomly, we can have an odd number of records per key):
pairs = (p
         | 'Create Events' >> beam.Create(data)
         | 'Add Keys' >> beam.Map(lambda x: (randint(1, 4), x))
         | 'Pair Records' >> beam.ParDo(PairRecordsFn())
         | 'Check Results' >> beam.ParDo(LogFn()))
In my case I got something like:
INFO:root:('one', 'three')
INFO:root:('two', 'five')
INFO:root:('zero', 'six')
INFO:root:('four', 'seven')
INFO:root:('ten', 'twelve')
INFO:root:('nine', 'thirteen')
INFO:root:('eight', 'fourteen')
INFO:root:('eleven', 'sixteen')
...
EDIT: I thought of another way to do so using the Sample.FixedSizeGlobally combiner. The good thing is that it shuffles the data better but you need to know the number of elements a priori (otherwise we'd need an initial pass on the data) and it seems to return all elements together. Briefly, I initialize the same PCollection twice but apply different shuffle orders and assign indexes in a stateful DoFn. This will guarantee that indexes are unique across elements in the same PCollection (even if no order is guaranteed). In my case, both PCollections will have exactly one record for each key in the range [0, 31]. A CoGroupByKey transform will join both PCollections on the same index thus having random pairs of elements:
pc1 = (p
       | 'Create Events 1' >> beam.Create(data)
       | 'Sample 1' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
       | 'Split Sample 1' >> beam.ParDo(SplitFn())
       | 'Add Dummy Key 1' >> beam.Map(lambda x: (1, x))
       | 'Assign Index 1' >> beam.ParDo(IndexAssigningStatefulDoFn()))

pc2 = (p
       | 'Create Events 2' >> beam.Create(data)
       | 'Sample 2' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
       | 'Split Sample 2' >> beam.ParDo(SplitFn())
       | 'Add Dummy Key 2' >> beam.Map(lambda x: (2, x))
       | 'Assign Index 2' >> beam.ParDo(IndexAssigningStatefulDoFn()))

zipped = ((pc1, pc2)
          | 'Zip Shuffled PCollections' >> beam.CoGroupByKey()
          | 'Drop Index' >> beam.Map(lambda kv: kv[1])  # keep only the grouped values, drop the index key
          | 'Check Results' >> beam.ParDo(LogFn()))
Full code here
Results:
INFO:root:(['ten'], ['nineteen'])
INFO:root:(['twenty-three'], ['seven'])
INFO:root:(['twenty-five'], ['twenty'])
INFO:root:(['twelve'], ['twenty-one'])
INFO:root:(['twenty-six'], ['twenty-five'])
INFO:root:(['zero'], ['twenty-three'])
...
How about applying a ParDo transform to both PCollections that attaches keys to the elements, and then running the two PCollections through a CoGroupByKey transform?
Please note that Beam does not guarantee the order of elements in a PCollection, so output elements might get reordered after any step, but it seems like this should be OK for your use case since you just need some random order.
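A minimal sketch of that suggestion (my own illustration with made-up data, not code from the answer); note that with random keys some indices may collide or go unmatched, which is the same "orphaned element" caveat discussed above:

import random
import apache_beam as beam

NUM_ELEMENTS = 4  # assumed to be known up front

def add_random_index(element):
    # Attach a random index as the join key; collisions and gaps are possible.
    return random.randint(0, NUM_ELEMENTS - 1), element

with beam.Pipeline() as p:
    left = p | 'Left' >> beam.Create(['a', 'b', 'c', 'd'])
    right = p | 'Right' >> beam.Create(['w', 'x', 'y', 'z'])
    _ = (
        {'left': left | 'KeyLeft' >> beam.Map(add_random_index),
         'right': right | 'KeyRight' >> beam.Map(add_random_index)}
        | 'Join' >> beam.CoGroupByKey()
        | 'Print' >> beam.Map(print)  # (index, {'left': [...], 'right': [...]})
    )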

How to parallelize apache-beam (Dataflow) pipeline DAGs

I am using the apache-beam 2.5.0 Python SDK.
In the pipeline below, I take input from a Pub/Sub topic, parse it, and want to perform some operations on it. When I run it with the DataflowRunner it works, but it seems that "data processing fun1", "data processing fun2" and "data processing fun3" run sequentially; I need them to run in parallel.
I am new to Dataflow.
Is there a way to parallelize them?
def run():
    parser = argparse.ArgumentParser()
    args, pipeline_args = parser.parse_known_args()
    options = PipelineOptions(pipeline_args)

    with beam.Pipeline(options=options) as p:
        data = (p
                | "Read Pubsub Messages" >> beam.io.ReadFromPubSub(topic=config.pub_sub_topic)
                | "Parse messages " >> beam.Map(parse_pub_sub_message_with_bq_data))

        data | "data processing fun1 " >> beam.ParDo(Fun1())
        data | "data processing fun2" >> beam.ParDo(Fun2())
        data | "data processing fun3" >> beam.ParDo(Fun3())

if __name__ == '__main__':
    run()
Why do you need these functions to run at the same time?
Beam / Dataflow take your graph and try to optimize things that can run in the same thread. This is called fusion optimization, and it's mentioned in the FlumeJava paper.
The point is that it will usually be more efficient to run those functions one by one on the same thread than to interchange data between multiple processing threads or VMs in order to parallelize the processing.
If your functions must run more or less in parallel, you can add a beam.Reshuffle transform before the downstream transforms:
data = (p
        | beam.io.ReadFromPubSub(topic)
        | beam.Map(parse_messages))

# After the data has been shuffled, it may be consumed by multiple workers.
data | beam.Reshuffle() | beam.ParDo(Fun1())
data | beam.Reshuffle() | beam.ParDo(Fun2())
data | beam.Reshuffle() | beam.ParDo(Fun3())
Let me know if I can add some detail to this.
