Dataflow stream python windowing - google-cloud-dataflow

i am new in using dataflow. I have following logic :
Event is added to pubsub
Dataflow reads pubsub and gets the event
From event i am looking into MySQL to find relations in which segments this event have relation and list of relations is returned with this step. This segments are independent from one another.
Each segment can be divided to two tables in MySQL results for email and mobile and they are independent as well.
Each segment have rules that can be 1 to n . I would like to process this step in parallel and collect all results. I have tried to use Windows but i am not sure how to write the logic so when i get the combined results from all rules inside one segment all of them will be collected at end function and write the final logic inside MySQL depending from rule results ( boolean ).
Here is so far what i have :
testP = beam.Pipeline(options=options)
ReadData = (
testP | 'ReadData' >> beam.io.ReadFromPubSub(subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
| 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
| 'GetSegments' >> beam.ParDo(getsegments(options))
)
processEmails = (ReadData
| 'GetSubscribersWithRulesForEmails' >> beam.ParDo(GetSubscribersWithRules(options, 'email'))
| 'ProcessSubscribersSegmentsForEmails' >> beam.ParDo(ProcessSubscribersSegments(options, 'email'))
)
processMobiles = (ReadData
| 'GetSubscribersWithRulesForMobiles' >> beam.ParDo(GetSubscribersWithRules(options, 'mobile'))
| 'ProcessSubscribersSegmentsForMobiles' >> beam.ParDo(ProcessSubscribersSegments(options, 'mobile'))
)
#for sake of testing only window for email is written
windowThis = (processEmails
| beam.WindowInto(
beam.window.FixedWindows(1),
trigger=beam.transforms.trigger.Repeatedly(
beam.transforms.trigger.AfterProcessingTime(1 * 10)),
accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)
| beam.CombinePerKey(beam.combiners.ToListCombineFn())
| beam.ParDo(print_windows)
)

In this case, because all of your elements have the exact same timestamp, I would use their message ID, and their timestamp to group them with Session windows. It would be something like this:
testP = beam.Pipeline(options=options)
ReadData = (
testP | 'ReadData' >> beam.io.ReadFromPubSub(subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
| 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
| 'GetSegments' >> beam.ParDo(getsegments(options))
)
# At this point, ReadData contains (key, value) pairs with a timestamp.
# (Now we perform all of the processing
processEmails = (ReadData | ....)
processMobiles = (ReadData | .....)
# Now we window by sessions with a 1-second gap. This is okay because all of
# the elements for any given key have the exact same timestamp.
windowThis = (processEmails
| beam.WindowInto(beam.window.Sessions(1)) # Default trigger is fine
| beam.CombinePerKey(beam.combiners.ToListCombineFn())
| beam.ParDo(print_windows)
)

Related

DataFlowRunner + Beam in streaming mode with a SideInput AsDict hangs

I have a simple graph that reads from a pubsub message (currently just a single string key), creates a very short window, generates 3 integers that use this key via a beam.ParDo, and a simple Map that creates a single "config" with this as a key.
Ultimately, there are 2 PCollections:
items: [('key', 0), ('key', 1), ...]
infos: [('key', 'the value is key')]
I want a final beam.Map over items that uses infos as a dictionary side input so I can look up the value in the dictionary.
Using the LocalRunner, the final print works with the side input.
On DataFlow the first two steps print, but the final Map with the side input never is called, presumably because it somehow is an unbounded window (despite the earlier window function).
I am using runner_v2, dataflow prime, and streaming engine.
p = beam.Pipeline(options=pipeline_options)
pubsub_message = (
p | beam.io.gcp.pubsub.ReadFromPubSub(
subscription='projects/myproject/testsubscription') |
'SourceWindow' >> beam.WindowInto(
beam.transforms.window.FixedWindows(1e-6),
trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(1)),
accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))
def _create_items(pubsub_key: bytes) -> Iterable[tuple[str, int]]:
for i in range(3):
yield pubsub_key.decode(), i
def _create_info(pubsub_key: bytes) -> tuple[str, str]:
return pubsub_key.decode(), f'the value is {pubsub_key.decode()}'
items = pubsub_message | 'CreateItems' >> beam.ParDo(_create_items) | beam.Reshuffle()
info = pubsub_message | 'CreateInfo' >> beam.Map(_create_info)
def _print_item(keyed_item: tuple[str, int], info_dict: dict[str, str]) -> None:
key, _ = keyed_item
log(key + '::' + info_dict[key])
_ = items | 'MapWithSideInput' >> beam.Map(_print_item, info_dict=beam.pvalue.AsDict(info))
Here is the output in local runner:
Creating item 0
Creating item 1
Creating item 2
Creating info b'key'
key::the value is key
key::the value is key
key::the value is key
Here is the DataFlow graph:
I've tried various windowing functions over the AsDict, but I can never get it to be exactly the same window as my input.
Thoughts on what I might be doing wrong here?

Apache Beam Python - SQL Transform with named PCollection Issue

I am trying to execute the below code in which I am using Named Tuple for PCollection and SQL transform for doing a simple select.
As per the video link (4:06) : https://www.youtube.com/watch?v=zx4p-UNSmrA.
Instead of using PCOLLECTION in SQLTransform query, named PCollections can also be provided as below.
Code Block
class EmployeeType(typing.NamedTuple):
name:str
age:int
beam.coders.registry.register_coder(EmployeeType, beam.coders.RowCoder)
pcol = p | "Create" >> beam.Create([EmployeeType(name="ABC", age=10)]).with_output_types(EmployeeType)
(
{'a':pcol} | SqlTransform(
""" SELECT age FROM a """)
| "Map" >> beam.Map(lambda row: row.age)
| "Print" >> beam.Map(print)
)
p.run()
However the below code block errors out with error
Caused by: org.apache.beam.vendor.calcite.v1_28_0.org.apache.calcite.sql.validate.SqlValidatorException: Object 'a' not found
Apache Beam SDK used is 2.35.0, are there any known limitation in using named PCollection

Google Dataflow Python Apache Beam Windowing delay issue

I have a simple pipeline that receives data from PubSub, prints it and then at every 10 seconds fires a window into a GroupByKey and prints that message again.
However this window seems to be delaying sometimes. Is this a google limitation or is there something wrong with my code:
with beam.Pipeline(options=pipeline_options) as pipe:
messages = (
pipe
| beam.io.ReadFromPubSub(subscription=known_args.input_subscription).with_output_types(bytes)
| 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
| 'Ex' >> beam.ParDo(ExtractorAndPrinter())
| beam.WindowInto(window.FixedWindows(10), allowed_lateness=0, accumulation_mode=AccumulationMode.DISCARDING, trigger=AfterProcessingTime(10) )
| 'group' >> beam.GroupByKey()
| 'PRINTER' >> beam.ParDo(PrinterWorker()))
Edit for the most recent code. I removed the triggers however the problem persists:
class ExtractorAndCounter(beam.DoFn):
def __init__(self):
beam.DoFn.__init__(self)
def process(self, element, *args, **kwargs):
import logging
logging.info(element)
return [("Message", json.loads(element)["Message"])]
class PrinterWorker(beam.DoFn):
def __init__(self):
beam.DoFn.__init__(self)
def process(self, element, *args, **kwargs):
import logging
logging.info(element)
return [str(element)]
class DefineTimestamp(beam.DoFn):
def process(self, element, *args, **kwargs):
from datetime import datetime
return [(str(datetime.now()), element)]
def run(argv=None, save_main_session=True):
"""Build and run the pipeline."""
parser = argparse.ArgumentParser()
parser.add_argument(
'--output_topic',
required=True,
help=(
'Output PubSub topic of the form '
'"projects/<PROJECT>/topics/<TOPIC>".'))
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument(
'--input_topic',
help=(
'Input PubSub topic of the form '
'"projects/<PROJECT>/topics/<TOPIC>".'))
group.add_argument(
'--input_subscription',
help=(
'Input PubSub subscription of the form '
'"projects/<PROJECT>/subscriptions/<SUBSCRIPTION>."'))
known_args, pipeline_args = parser.parse_known_args(argv)
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
pipeline_options.view_as(StandardOptions).streaming = True
with beam.Pipeline(options=pipeline_options) as pipe:
messages = (
pipe
| beam.io.ReadFromPubSub(subscription=known_args.input_subscription).with_output_types(bytes)
| 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
| 'Ex' >> beam.ParDo(ExtractorAndCounter())
| beam.WindowInto(window.FixedWindows(10))
| 'group' >> beam.GroupByKey()
| 'PRINTER' >> beam.ParDo(PrinterWorker())
| 'encode' >> beam.Map(lambda x: x.encode('utf-8'))
| beam.io.WriteToPubSub(known_args.output_topic))
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
run()
So what this basically ask the pipeline to do is to group elements to 10 second windows and fire every window after 10 seconds have passed since the first element was received for each window (and discard the rest of the data for that window). Was that your intention ?
Assuming this was the case, note that triggering depends on the time elements were received by the system as well as the time the first element is received for each window. Probably this is why you are seeing some variation in your results.
I think if you need more consistent grouping for your elements you should use event time triggers instead of processing time triggers.
All the triggers are based on best effort means they will be fired sometime after the specified duration, 10 sec in this case. Generally it happen close to the time specified but a few seconds delay is possible.
Also, the triggers are set for Key+Window. The window is derived from event time.
It is possible that the first GBK pint at 10:30:04 is due to the first element which as at 10:29:52
The 2nd GBK print at 10:30:07 is due to the first element at 10:29:56
So it will be good to print the window and event timestamp for each element and then co-relate the data.

Can I use apoc.periodic.updates to return a query output?

I have a query that reads a set of ID's from a csv file, searches for those nodes in the database and writes the results to a csv. I'm trying to get this query to run as quickly as possible and was wondering if I could parallelise the read operation using apoc.periodic.iterate:
http://neo4j-contrib.github.io/neo4j-apoc-procedures/3.5/cypher-execution/commit-batching/
I've written a query that does what I need but really I just want to find out how to run this query as quickly as possible.
Here's the current, version of the query:
CALL apoc.export.csv.query('CALL apoc.load.csv(\'file:///edge.csv\') YIELD map as edge
MATCH (n:paper)
WHERE n.paper_id = edge.`From` OR n.paper_id = edge.`To`
RETURN n.paper_title',
'node.csv', {});
This query creates the resulting node.csv file that I want but as edge.csv grows in size the operation can slow down considerably.
What I was hoping to do was something like this:
CALL apoc.periodic.iterate(
'LOAD CSV WITH HEADERS FROM \'file:///edge.csv\' as row RETURN row',
'CALL apoc.export.csv.query(\'MATCH (n:paper) WHERE n.paper_id = row.`From` OR n.paper_id = row.`To` RETURN DISTINCT(n.paper_id) AS paper_id\', \'nodePar.csv\', {})'
, {batchSize:10, iterateList:true, parallel:true, failedParams:0})
;
This query will run but produce no output except for the following message:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| batches | total | timeTaken | committedOperations | failedOperations | failedBatches | retries | errorMessages | batch | operations | wasTerminated | failedParams |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 14463 | 144629 | 0 | 144629 | 0 | 0 | 0 | {} | {total: 14463, committed: 14463, failed: 0, errors: {}} | {total: 144629, committed: 144629, failed: 0, errors: {}} | FALSE | {} |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
My main question is: can apoc.periodic.iterate be used in this way to accelerate this query and if so, how?
And then alternatively, is there any other way to speedup this query as the edge.csv file grows in size?

How do I perform a "diff" on two Sources given a key using Apache Beam Python SDK?

I posed the question generically, because maybe it is a generic answer. But a specific example is comparing 2 BigQuery tables with the same schema, but potentially different data. I want a diff, i.e. what was added, deleted, modified, with respect to a composite key, e.g. the first 2 columns.
Table A
C1 C2 C3
-----------
a a 1
a b 1
a c 1
Table B
C1 C2 C3 # Notes if comparing B to A
-------------------------------------
a a 1 # No Change to the key a + a
a b 2 # Key a + b Changed from 1 to 2
# Deleted key a + c with value 1
a d 1 # Added key a + d
I basically want to be able to make/report the comparison notes.
Or from a Beam perspective I may want to Just output up to 4 labeled PCollections: Unchanged, Changed, Added, Deleted. How do I do this and what would the PCollections look like?
What you want to do here, basically, is join two tables and compare the result of that, right? You can look at my answer to this question, to see the two ways in which you can join two tables (Side inputs, or CoGroupByKey).
I'll also code a solution for your problem using CoGroupByKey. I'm writing the code in Python because I'm more familiar with the Python SDK, but you'd implement similar logic in Java:
def make_kv_pair(x):
""" Output the record with the x[0]+x[1] key added."""
return ((x[0], x[1]), x)
table_a = (p | 'ReadTableA' >> beam.Read(beam.io.BigQuerySource(....))
| 'SetKeysA' >> beam.Map(make_kv_pair)
table_b = (p | 'ReadTableB' >> beam.Read(beam.io.BigQuerySource(....))
| 'SetKeysB' >> beam.Map(make_kv_pair))
joined_tables = ({'table_a': table_a, 'table_b': table_b}
| beam.CoGroupByKey())
output_types = ['changed', 'added', 'deleted', 'unchanged']
class FilterDoFn(beam.DoFn):
def process((key, values)):
table_a_value = list(values['table_a'])
table_b_value = list(values['table_b'])
if table_a_value == table_b_value:
yield pvalue.TaggedOutput('unchanged', key)
elif len(table_a_value) < len(table_b_value):
yield pvalue.TaggedOutput('added', key)
elif len(table_a_value) > len(table_b_value):
yield pvalue.TaggedOutput('removed', key)
elif table_a_value != table_b_value:
yield pvalue.TaggedOutput('changed', key)
key_collections = (joined_tables
| beam.ParDo(FilterDoFn()).with_outputs(*output_types))
# Now you can handle each output
key_collections.unchanged | WriteToText(...)
key_collections.changed | WriteToText(...)
key_collections.added | WriteToText(...)
key_collections.removed | WriteToText(...)

Resources