DataFlowRunner + Beam in streaming mode with a SideInput AsDict hangs - google-cloud-dataflow

I have a simple graph that reads from a pubsub message (currently just a single string key), creates a very short window, generates 3 integers that use this key via a beam.ParDo, and a simple Map that creates a single "config" with this as a key.
Ultimately, there are 2 PCollections:
items: [('key', 0), ('key', 1), ...]
infos: [('key', 'the value is key')]
I want a final beam.Map over items that uses infos as a dictionary side input so I can look up the value in the dictionary.
Using the LocalRunner, the final print works with the side input.
On DataFlow the first two steps print, but the final Map with the side input never is called, presumably because it somehow is an unbounded window (despite the earlier window function).
I am using runner_v2, dataflow prime, and streaming engine.
p = beam.Pipeline(options=pipeline_options)
pubsub_message = (
p | beam.io.gcp.pubsub.ReadFromPubSub(
subscription='projects/myproject/testsubscription') |
'SourceWindow' >> beam.WindowInto(
beam.transforms.window.FixedWindows(1e-6),
trigger=beam.transforms.trigger.Repeatedly(beam.transforms.trigger.AfterCount(1)),
accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING))
def _create_items(pubsub_key: bytes) -> Iterable[tuple[str, int]]:
for i in range(3):
yield pubsub_key.decode(), i
def _create_info(pubsub_key: bytes) -> tuple[str, str]:
return pubsub_key.decode(), f'the value is {pubsub_key.decode()}'
items = pubsub_message | 'CreateItems' >> beam.ParDo(_create_items) | beam.Reshuffle()
info = pubsub_message | 'CreateInfo' >> beam.Map(_create_info)
def _print_item(keyed_item: tuple[str, int], info_dict: dict[str, str]) -> None:
key, _ = keyed_item
log(key + '::' + info_dict[key])
_ = items | 'MapWithSideInput' >> beam.Map(_print_item, info_dict=beam.pvalue.AsDict(info))
Here is the output in local runner:
Creating item 0
Creating item 1
Creating item 2
Creating info b'key'
key::the value is key
key::the value is key
key::the value is key
Here is the DataFlow graph:
I've tried various windowing functions over the AsDict, but I can never get it to be exactly the same window as my input.
Thoughts on what I might be doing wrong here?

Related

Splitting file processing by initial keys

Use Case
I have some terabytes of US property data to merge. It is spread across two distinct file formats and thousands of files. The source data is split geographically.
I can't find a way to branch a single pipeline into many independent processing flows.
This is especially difficult because the Dataframe API doesn't seem to support a PTransform on a collection of filenames.
Detailed Background
The distribution of files is like this:
StateData - 51 total files (US states + DC)
CountyData - ~2000 total files (county specific, grouped by state)
The ideal pipeline would split into thousands of independent processing steps and complete in minutes.
1 -> 51 (each US state + DC starts processing)
51 -> thousands (each US state then spawns a process that merges the counties, combining at the end for the whole state)
The directory structure is like this:
๐Ÿ“‚state-data/
|-๐Ÿ“œAL.zip
|-๐Ÿ“œAK.zip
|-๐Ÿ“œ...
|-๐Ÿ“œWY.zip
๐Ÿ“‚county-data/
|-๐Ÿ“‚AL/
|-๐Ÿ“œCOUNTY1.csv
|-๐Ÿ“œCOUNTY2.csv
|-๐Ÿ“œ...
|-๐Ÿ“œCOUNTY68.csv
|-๐Ÿ“‚AK/
|-๐Ÿ“œ...
|-๐Ÿ“‚.../
|-๐Ÿ“‚WY/
|-๐Ÿ“œ...
Sample Data
This is extremely abbreviated, but imagine something like this:
State Level Data - 51 of these (~200 cols wide)
uid
census_plot
flood_zone
abc121
ACVB-1249575
R50
abc122
ACVB-1249575
R50
abc123
ACVB-1249575
R51
abc124
ACVB-1249599
R51
abc125
ACVB-1249599
R50
...
...
...
County Level Data - thousands of these (~300 cols wide)
uid
county
subdivision
tax_id
abc121
04021
Roland Heights
3t4g
abc122
04021
Roland Heights
3g444
abc123
04021
Roland Heights
09udd
...
...
...
...
So we join many county-level to a single state level, and thus have an aggregated, more-complete state-level data set.
Then we aggregate all the states, and we have a national level data set.
Desired Outcome
I can successfully merge one state at a time (many county to one state). I built a pipeline to do that, but the pipeline starts with a single CountyData CSV and a single StateData CSV. The issue is getting to the point where I can load the CountyData and StateData.
In other words:
#
# I need to find a way to generalize this flow to
# dynamically created COUNTY and STATE variables.
#
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv
COUNTY = "county-data/AL/*.csv"
STATE = "state-data/AL.zip"
def key_by_uid(elem):
return (elem.uid, elem)
with beam.Pipeline() as p:
county_df = p | read_csv(COUNTY)
county_rows_keyed = to_pcollection(county_df) | beam.Map(key_by_uid)
state_df = pd.read_csv(STATE, compression="zip")
state_rows_keys = to_pcollection(state_df, pipeline=p) | beam.Map(key_by_uid)
merged = ({ "state": state_rows_keys, "county": county_rows_keyed } ) | beam.CoGroupByKey() | beam.Map(merge_logic)
merged | WriteToParquet()
Starting with a list of states
By state, generate filepatterns to the source data
By state, load and merge the filenames
Flatten the output from each state into a US data set.
Write to Parquet file.
with beam.Pipeline(options=pipeline_options) as p:
merged_data = (
p
| beam.Create(cx.STATES)
| "PathsKeyedByState" >> tx.PathsKeyedByState()
# ('AL', {'county-data': 'gs://data/county-data/AL/COUNTY*.csv', 'state-data': 'gs://data/state-data/AL.zip'})
| "MergeSourceDataByState" >> tx.MergeSourceDataByState()
| "MergeAllStateData" >> beam.Flatten()
)
merged_data | "WriteParquet" >> tx.WriteParquet()
The issue I'm having is something like this:
I have two filepatterns in a dictionary, per state. To access those I need to use a DoFn to get at the element.
To communicate the way the data flows, I need access to Pipeline, which is a PTransform. Ex: df = p | read_csv(...)
These appear to be incompatible needs.
Here's an alternative answer.
Read the state data one at a time and flatten them, e.g.
state_dataframe = None
for state in STATES:
df = p | read_csv('/path/to/state')
df['state'] = state
if state_dataframe is None:
state_dataframe = df
else:
state_dataframe = state_dataframe.append(df)
Similarly for county data. Now join them using dataframe operations.
I'm not sure exactly what kind of merging you're doing here, but one way to structure this pipeline might be to have a DoFn that takes the county data in as a filename as an input element (i.e. you'd have a PCollection of county data filenames), opens it up using "normal" Python (e.g. pandas), and then reads the relevant state data in as a side input to do the merge.

Dataflow stream python windowing

i am new in using dataflow. I have following logic :
Event is added to pubsub
Dataflow reads pubsub and gets the event
From event i am looking into MySQL to find relations in which segments this event have relation and list of relations is returned with this step. This segments are independent from one another.
Each segment can be divided to two tables in MySQL results for email and mobile and they are independent as well.
Each segment have rules that can be 1 to n . I would like to process this step in parallel and collect all results. I have tried to use Windows but i am not sure how to write the logic so when i get the combined results from all rules inside one segment all of them will be collected at end function and write the final logic inside MySQL depending from rule results ( boolean ).
Here is so far what i have :
testP = beam.Pipeline(options=options)
ReadData = (
testP | 'ReadData' >> beam.io.ReadFromPubSub(subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
| 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
| 'GetSegments' >> beam.ParDo(getsegments(options))
)
processEmails = (ReadData
| 'GetSubscribersWithRulesForEmails' >> beam.ParDo(GetSubscribersWithRules(options, 'email'))
| 'ProcessSubscribersSegmentsForEmails' >> beam.ParDo(ProcessSubscribersSegments(options, 'email'))
)
processMobiles = (ReadData
| 'GetSubscribersWithRulesForMobiles' >> beam.ParDo(GetSubscribersWithRules(options, 'mobile'))
| 'ProcessSubscribersSegmentsForMobiles' >> beam.ParDo(ProcessSubscribersSegments(options, 'mobile'))
)
#for sake of testing only window for email is written
windowThis = (processEmails
| beam.WindowInto(
beam.window.FixedWindows(1),
trigger=beam.transforms.trigger.Repeatedly(
beam.transforms.trigger.AfterProcessingTime(1 * 10)),
accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)
| beam.CombinePerKey(beam.combiners.ToListCombineFn())
| beam.ParDo(print_windows)
)
In this case, because all of your elements have the exact same timestamp, I would use their message ID, and their timestamp to group them with Session windows. It would be something like this:
testP = beam.Pipeline(options=options)
ReadData = (
testP | 'ReadData' >> beam.io.ReadFromPubSub(subscription=str(options.pubsubsubscriber.get())).with_output_types(bytes)
| 'Decode' >> beam.Map(lambda x: x.decode('utf-8'))
| 'GetSegments' >> beam.ParDo(getsegments(options))
)
# At this point, ReadData contains (key, value) pairs with a timestamp.
# (Now we perform all of the processing
processEmails = (ReadData | ....)
processMobiles = (ReadData | .....)
# Now we window by sessions with a 1-second gap. This is okay because all of
# the elements for any given key have the exact same timestamp.
windowThis = (processEmails
| beam.WindowInto(beam.window.Sessions(1)) # Default trigger is fine
| beam.CombinePerKey(beam.combiners.ToListCombineFn())
| beam.ParDo(print_windows)
)

How to set coder for Google Dataflow Pipeline in Python?

I am creating a custom Dataflow job in Python to ingest data from PubSub to BigQuery. Table has many nested fields.
Where Can I set Coder in this pipeline?
avail_schema = parse_table_schema_from_json(bg_out_schema)
coder = TableRowJsonCoder(table_schema=avail_schema)
with beam.Pipeline(options=options) as p:
# Read the text from PubSub messages.
lines = (p | beam.io.ReadFromPubSub(subscription="projects/project_name/subscriptions/subscription_name")
| 'Map' >> beam.Map(coder))
# transformed = lines| 'Parse JSON to Dict' >> beam.Map(json.loads)
transformed | 'Write to BigQuery' >> beam.io.WriteToBigQuery("Project:DataSet.Table", schema=avail_schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
Error: Map can be used only with callable objects. Received TableRowJsonCoder instead.
In the code above, the coder is applied to the message read from the PubSub which is text.
WriteToBigQuery works with both, dictionary and TableRow. json.load emits dict so you can simply use the output from it to write to BigQuery without apply any coder. Note, the field in dictionary has to match Table Schema.
To avoid coder issue I would suggest using following code.
avail_schema = parse_table_schema_from_json(bg_out_schema)
with beam.Pipeline(options=options) as p:
# Read the text from PubSub messages.
lines = (p | beam.io.ReadFromPubSub(subscription="projects/project_name/subscriptions/subscription_name"))
transformed = lines| 'Parse JSON to Dict' >> beam.Map(json.loads)
transformed | 'Write to BigQuery' >> beam.io.WriteToBigQuery("Project:DataSet.Table", schema=avail_schema, write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

Is it possible to do a zip operation in apache beam on two PCollections?

I have a PCollection[str] and I want to generate random pairs.
Coming from Apache Spark, my strategy was to:
copy the original PCollection
randomly shuffle it
zip it with the original PCollection
However I can't seem to find a way to zip 2 PCollections...
This is interesting and a not very common use case because, as #chamikara says, there is no order guarantee in Dataflow. However, I thought about implementing a solution where you shuffle the input PCollection and then pair consecutive elements based on state . I have found some caveats in the way but I thought it might be worth sharing anyway.
First, I have used the Python SDK but the Dataflow Runner does not support stateful DoFn's yet. It works with the Direct Runner but: 1) it is not scalable and 2) it's difficult to shuffle the records without multi-threading. Of course, an easy solution for the latter is to feed an already shuffled PCollection to the pipeline (we can use a different job to pre-process the data). Otherwise, we can adapt this example to the Java SDK.
For now, I decided to try to shuffle and pair it with a single pipeline. I don't really know if this helps or makes things more complicated but code can be found here.
Briefly, the stateful DoFn looks at the buffer and if it is empty it puts in the current element. Otherwise, it pops out the previous element from the buffer and outputs a tuple of (previous_element, current_element):
class PairRecordsFn(beam.DoFn):
"""Pairs two consecutive elements after shuffle"""
BUFFER = BagStateSpec('buffer', PickleCoder())
def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
try:
previous_element = list(buffer.read())[0]
except:
previous_element = []
unused_key, value = element
if previous_element:
yield (previous_element, value)
buffer.clear()
else:
buffer.add(value)
The pipeline adds keys to the input elements as required to use a stateful DoFn. Here there will be a trade-off because you can potentially assign the same key to all elements with beam.Map(lambda x: (1, x)). This would not parallelize well but it's not a problem as we are using the Direct Runner anyway (keep it in mind if using the Java SDK). However, it will not shuffle the records. If, instead, we shuffle to a large amount of keys we'll get a larger number of "orphaned" elements that can't be paired (as state is preserved per key and we assign them randomly we can have an odd number of records per key):
pairs = (p
| 'Create Events' >> beam.Create(data)
| 'Add Keys' >> beam.Map(lambda x: (randint(1,4), x))
| 'Pair Records' >> beam.ParDo(PairRecordsFn())
| 'Check Results' >> beam.ParDo(LogFn()))
In my case I got something like:
INFO:root:('one', 'three')
INFO:root:('two', 'five')
INFO:root:('zero', 'six')
INFO:root:('four', 'seven')
INFO:root:('ten', 'twelve')
INFO:root:('nine', 'thirteen')
INFO:root:('eight', 'fourteen')
INFO:root:('eleven', 'sixteen')
...
EDIT: I thought of another way to do so using the Sample.FixedSizeGlobally combiner. The good thing is that it shuffles the data better but you need to know the number of elements a priori (otherwise we'd need an initial pass on the data) and it seems to return all elements together. Briefly, I initialize the same PCollection twice but apply different shuffle orders and assign indexes in a stateful DoFn. This will guarantee that indexes are unique across elements in the same PCollection (even if no order is guaranteed). In my case, both PCollections will have exactly one record for each key in the range [0, 31]. A CoGroupByKey transform will join both PCollections on the same index thus having random pairs of elements:
pc1 = (p
| 'Create Events 1' >> beam.Create(data)
| 'Sample 1' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
| 'Split Sample 1' >> beam.ParDo(SplitFn())
| 'Add Dummy Key 1' >> beam.Map(lambda x: (1, x))
| 'Assign Index 1' >> beam.ParDo(IndexAssigningStatefulDoFn()))
pc2 = (p
| 'Create Events 2' >> beam.Create(data)
| 'Sample 2' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
| 'Split Sample 2' >> beam.ParDo(SplitFn())
| 'Add Dummy Key 2' >> beam.Map(lambda x: (2, x))
| 'Assign Index 2' >> beam.ParDo(IndexAssigningStatefulDoFn()))
zipped = ((pc1, pc2)
| 'Zip Shuffled PCollections' >> beam.CoGroupByKey()
| 'Drop Index' >> beam.Map(lambda (x, y):y)
| 'Check Results' >> beam.ParDo(LogFn()))
Full code here
Results:
INFO:root:(['ten'], ['nineteen'])
INFO:root:(['twenty-three'], ['seven'])
INFO:root:(['twenty-five'], ['twenty'])
INFO:root:(['twelve'], ['twenty-one'])
INFO:root:(['twenty-six'], ['twenty-five'])
INFO:root:(['zero'], ['twenty-three'])
...
How about applying a ParDo transform to both PCollections that attach keys to elements and running the two PCollections through a CoGroupByKey transform ?
Please note that Beam does not guarantee order of elements in a PCollection so output elements might get reordered after any step but seems like this should be OK for your use-case since you just need some random order.

How to do MapReduce on Riak with Erlang to get even values from all the keys storing the numbers from 1 to 1000

I am trying to do mapreduce on Riak with Erlang. I am having data like the following:
Bucket = "Numbers"
{Keys,values} = {Random key,1},{Random key,2}........{Random key,1000}.
Now, I am storing 1000 values from 1 to 1000, where all the keys are autogenerated by the term undefined given as a parameter, so all the keys will have values starting from 1 to 1000.
So I want the data from only the values that are even numbers. Using mapreduce, how can I achieve this?
You would construct phase functions as described in http://docs.basho.com/riak/latest/dev/advanced/mapreduce/
One possible map function:
Mapfun = fun(Object, _KeyData, _Arg) ->
%% get the object value, convert to integer and check if even
Value = list_to_integer(binary_to_term(riak_object:get_value(Object))),
case Value rem 2 of
0 -> [Value];
1 -> []
end
end.
Although you probably want to not completely fail in the event you encounter a sibling:
Mapfun = fun(Object, _KeyData, _Arg) ->
Values = riak_object:get_values(Object),
case length(Values) of %% checking for siblings
1 -> %% only 1 value == no siblings
I = list_to_integer(binary_to_term(hd(Values))),
case I rem 2 of
0 -> [I]; %% value is even
1 -> [] %% value is odd
end;
_ -> [] %% What should happen with siblings?
end
end.
There may also be other cases you need to either prevent or check for: the value containing non-numeric characters, empty value, deleted values(tombsones), just to name a few.
Edit:
A word of caution: Doing a full-bucket MapReduce job will require Riak to read every value from the disk, this could cause extreme latency and timeout on a sizeable data set. Probably not something you want to do in production.
A full example of peforming MapReduce (limited to the numbers 1 to 200 for space considerations):
Assuming that you have cloned and built the riak-erlang-client
Using the second Mapfun from above
erl -pa {path-to-riak-erlang-client}/ebin
Define a reduce function to sort the list
Reducefun = fun(List,_) ->
lists:sort(List)
end.
Attach to the local Riak server
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087).
Generate some test data
[ riakc_pb_socket:put(
Pid,
riakc_obj:new(
<<"numbers">>,
list_to_binary("Key" ++ V),V
)
) || V <- [ integer_to_list(Itr) || Itr <- lists:seq(1,200)]],
The function to perform a MapReduce with this client is
mapred(pid(), mapred_inputs(), [mapred_queryterm()])
The mapred_queryterm is a list of phase specification of the form {Type, FunTerm, Arg, Keep} as defined in the readme. For this example, there are 2 phases:
a map phase that selects only even numbers
{map, Mapfun, none, true}
a reduce phase that sorts the result
{reduce, Reducefun, none, true}
Perform the MapReduce query
{ok,Results} = riakc_pb_socket:mapred(
Pid, %% The socket pid from above
<<"numbers">>, %% Input is the bucket
[{map,{qfun,Mapfun},none,true},
{reduce,{qfun,Reducefun},none,true}]
),
Results will be a list of [{_Phase Index_, _Phase Output_}] with a separate entry for each phase for which Keep was true, in this example both phases are marked keep, so in this example Results will be
[{0,[_map phase result_]},{1,[_reduce phase result_]}]
Print out the result of each phase:
[ io:format("MapReduce Result of phase ~p:~n~P~n",[P,Result,500])
|| {P,Result} <- Results ].
When I ran this, my output was:
MapReduce Result of phase 0:
[182,132,174,128,8,146,18,168,70,98,186,118,50,28,22,112,82,160,114,106,12,26,
124,14,194,64,122,144,172,96,126,162,58,170,108,44,90,104,6,196,40,154,94,
120,76,48,150,52,4,62,140,178,2,142,100,166,192,66,16,36,38,88,102,68,34,32,
30,164,110,42,92,138,86,54,152,116,156,72,134,200,148,46,10,176,198,84,56,78,
130,136,74,190,158,24,184,180,80,60,20,188]
MapReduce Result of phase 1:
[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,
56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,
104,106,108,110,112,114,116,118,120,122,124,126,128,130,132,134,136,138,140,
142,144,146,148,150,152,154,156,158,160,162,164,166,168,170,172,174,176,178,
180,182,184,186,188,190,192,194,196,198,200]
[ok,ok]

Resources