Apache Beam - Mean Aggregation for each key in a PCollection - google-cloud-dataflow

I have a PCollection which consists of an ID column and seven value columns. There are several rows for each ID.
I would like to compute the mean of the seven columns per unique ID.
Is there a way to achieve this without going through each element programmatically and creating a key/value pair for each element?

table_pcoll = ....

def per_column_average(rows, ignore_elms=[ID_INDEX]):
    # Sum each column across all rows (treating the ID column as 0) and divide by the row count.
    return [sum(row[idx] if idx not in ignore_elms else 0 for row in rows) / len(rows)
            for idx, _ in enumerate(rows[0])]

keyed_averaged_elm = (table_pcoll
                      | beam.Map(lambda x: (x[ID_INDEX], x))
                      | beam.GroupByKey()
                      | beam.Map(lambda kv: (kv[0], per_column_average(list(kv[1])))))
Sorry about the nasty one-liner. I hope that helps.
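If you would rather not materialize whole groups with GroupByKey, another option is CombinePerKey with a custom CombineFn that only keeps running per-column sums and a row count. A minimal sketch under the same assumptions as above (ID_INDEX marks the ID column and every other column is numeric):

import apache_beam as beam

class ColumnMeanFn(beam.CombineFn):
    """Keeps running per-column sums and a row count, then emits per-column means."""

    def create_accumulator(self):
        return ([], 0)  # (per-column sums, number of rows seen)

    def add_input(self, accumulator, row):
        sums, count = accumulator
        values = [v for i, v in enumerate(row) if i != ID_INDEX]
        if not sums:
            sums = [0] * len(values)
        return ([s + v for s, v in zip(sums, values)], count + 1)

    def merge_accumulators(self, accumulators):
        merged_sums, merged_count = [], 0
        for sums, count in accumulators:
            merged_count += count
            if not sums:
                continue
            if not merged_sums:
                merged_sums = list(sums)
            else:
                merged_sums = [a + b for a, b in zip(merged_sums, sums)]
        return (merged_sums, merged_count)

    def extract_output(self, accumulator):
        sums, count = accumulator
        return [s / count for s in sums] if count else []

keyed_means = (table_pcoll
               | beam.Map(lambda row: (row[ID_INDEX], row))
               | beam.CombinePerKey(ColumnMeanFn()))

The output is one (id, [mean_1, ..., mean_7]) pair per key, and the combiner can be partially applied before the shuffle, which helps when there are many rows per ID.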

Related

Need help explaining this formula provided to me

I recently posted on here to get help with a formula; here is the link: https://stackoverflow.com/questions/75068029/vlook-up-style-forumla-but-range-is-2-cells. A user called rockinfreakshow was really awesome and provided a great solution for me. I'm not very experienced and don't understand the formula at all, but I'd love to be able to add more attributes to it. Is anyone able to help break it down for me?
I haven't tried anything here, it's totally out of my realm of understanding.
=MAKEARRAY(COUNTA(B2:B),COUNTA(D1:O1),LAMBDA(r,c,IF(REGEXMATCH(LAMBDA(ax,bx,IFS(REGEXMATCH(ax,"Mixed")*REGEXMATCH(INDEX(C2:C,r),"Blend")*REGEXMATCH(INDEX(C2:C,r),"Filter"),"BLEND-"&bx&"|FILTER-"&bx,REGEXMATCH(ax,"Mixed")*NOT(REGEXMATCH(INDEX(C2:C,r),"Blend"))*REGEXMATCH(INDEX(C2:C,r),"Filter"),"ESP-"&bx&"|FILTER-"&bx,REGEXMATCH(ax,"Mixed")*NOT(REGEXMATCH(INDEX(C2:C,r),"Filter")),"BLEND-"&bx&"|ESP-"&bx,LEN(ax),SUBSTITUTE(ax&"-"&bx,"Espresso","ESP")))(regexextract(INDEX(B2:B,r),"([^\s]*?) Subscription"),IFNA(SWITCH(REGEXEXTRACT(INDEX(C2:C,r),"Small|Medium|Large"),"Small",250,"Medium",450,"Large",900),SWITCH(REGEXEXTRACT(INDEX(B2:B,r),"Medium|Large"),"Medium",225,"Large",450))),"(?i)"&INDEX(D1:O1,,c)),1,)))
see the WHY LAMBDA? part of this answer to understand the LAMBDA
the formula contains 2x LAMBDA and there are a total of 4 placeholders which translate to:
r - COUNTA(B2:B)
c - COUNTA(D1:O1)
ax - REGEXEXTRACT(INDEX(B2:B, r), "([^\s]*?) Subscription")
bx - IFNA(SWITCH(REGEXEXTRACT(INDEX(C2:C, r), "Small|Medium|Large"),
"Small", 250, "Medium", 450, "Large", 900),
SWITCH(REGEXEXTRACT(INDEX(B2:B, r), "Medium|Large"),
"Medium", 225, "Large", 450))
r counts how many items are in B column
c counts how many items are in row 1 of range D1:O1
ax extracts the word from B column that precedes the word Subscription
bx is a bit more complex but essentially it extracts the word Small, Medium or Large from column C and replaces it with 250, 450 or 900 respectively. then, if column C does not contain one of those 3 words, it checks column B for Medium or Large and assigns 225 or 450 respectively
what we are left with is the core of the formula:
IFS( REGEXMATCH(ax, "Mixed")*
REGEXMATCH(INDEX(C2:C, r), "Blend")*
REGEXMATCH(INDEX(C2:C, r), "Filter"), "BLEND-"&bx&"|FILTER-"&bx,
___________________________________________________________________________
REGEXMATCH(ax, "Mixed")*
NOT(REGEXMATCH(INDEX(C2:C, r), "Blend"))*
REGEXMATCH(INDEX(C2:C, r), "Filter"), "ESP-"&bx&"|FILTER-"&bx,
___________________________________________________________________________
REGEXMATCH(ax, "Mixed")*
NOT(REGEXMATCH(INDEX(C2:C, r), "Filter")), "BLEND-"&bx&"|ESP-"&bx,
___________________________________________________________________________
LEN(ax), SUBSTITUTE(ax&"-"&bx, "Espresso", "ESP"))
the separators above are just for better visualization - the IFS formula contains only 4 elements. each of these 4 elements acts as a switch - if there is a match x we get output y. for example, let's dissect the first element...
REGEXMATCH(ax, "Mixed")*
REGEXMATCH(INDEX(C2:C, r), "Blend")*
REGEXMATCH(INDEX(C2:C, r), "Filter"), "BLEND-"&bx&"|FILTER-"&bx
there are 3x REGEXMATCHes multiplied by each other. whenever there is such multiplication in array formulae it translates to an AND logic gate (if there were a + it would mean an OR logic gate), eg.:
1 * 1 = 1
1 * 0 = 0
0 * 1 = 0
0 * 0 = 0
REGEXMATCH outputs TRUE or FALSE so if we get 3x TRUE the whole argument is considered as TRUE (because 1 * 1 * 1 = 1) so we proceed to output our first switch
therefore if column B contains Mixed and column C contains both Blend and Filter, we output BLEND-000|FILTER-000, where 000 stands for the specific number determined by the bx placeholder/formula. you can also notice the | (which btw stands for OR logic within regex) but in this case it's just a unique symbol used to join stuff for REGEXMATCH. which REGEXMATCH is this for, you may ask? ...this one:
so the output of the IFS formula is the input for the outermost REGEXMATCH, and we check if the IFS output matches something within the D1:O1 range. if yes, then output 1, otherwise output nothing. shortened:
IF(REGEXMATCH(IFS(...), "(?i)"&INDEX(D1:O1,,c)), 1, )
(?i) in regex means "case insensitive". it is there just for safety reasons because regex is by default case sensitive.
and we have reached the MAKEARRAY formula, which creates an array across the whole range with height r and width c, where each cell is the result of the IF, eg. either 1 or an empty cell
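if it helps to see the same nested structure outside of sheet syntax, here is a rough Python analogue of the MAKEARRAY / IF / REGEXMATCH skeleton (the function and variable names are purely illustrative, not part of the formula): for every data row r and every header column c, build the label the IFS logic would produce for that row, then put a 1 in the cell if the column header is found inside that label (case-insensitively), otherwise leave the cell empty.

import re

def make_array(num_rows, num_cols, build_label, headers):
    grid = []
    for r in range(num_rows):
        label = build_label(r)            # plays the role of IFS(...) for row r
        row = []
        for c in range(num_cols):
            # headers[c] is the D1:O1 header used as the pattern; "(?i)" = case-insensitive
            row.append(1 if re.search(headers[c], label, re.IGNORECASE) else "")
        grid.append(row)
    return grid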

Is it possible to do a zip operation in apache beam on two PCollections?

I have a PCollection[str] and I want to generate random pairs.
Coming from Apache Spark, my strategy was to:
copy the original PCollection
randomly shuffle it
zip it with the original PCollection
However I can't seem to find a way to zip 2 PCollections...
This is interesting and a not very common use case because, as @chamikara says, there is no order guarantee in Dataflow. However, I thought about implementing a solution where you shuffle the input PCollection and then pair consecutive elements based on state. I have found some caveats along the way, but I thought it might be worth sharing anyway.
First, I have used the Python SDK but the Dataflow Runner does not support stateful DoFn's yet. It works with the Direct Runner but: 1) it is not scalable and 2) it's difficult to shuffle the records without multi-threading. Of course, an easy solution for the latter is to feed an already shuffled PCollection to the pipeline (we can use a different job to pre-process the data). Otherwise, we can adapt this example to the Java SDK.
For now, I decided to try to shuffle and pair it with a single pipeline. I don't really know if this helps or makes things more complicated but code can be found here.
Briefly, the stateful DoFn looks at the buffer and if it is empty it puts in the current element. Otherwise, it pops out the previous element from the buffer and outputs a tuple of (previous_element, current_element):
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import BagStateSpec

class PairRecordsFn(beam.DoFn):
    """Pairs two consecutive elements after shuffle."""
    BUFFER = BagStateSpec('buffer', PickleCoder())

    def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
        try:
            previous_element = list(buffer.read())[0]
        except IndexError:
            previous_element = []
        unused_key, value = element
        if previous_element:
            yield (previous_element, value)
            buffer.clear()
        else:
            buffer.add(value)
The pipeline adds keys to the input elements as required to use a stateful DoFn. Here there will be a trade-off because you can potentially assign the same key to all elements with beam.Map(lambda x: (1, x)). This would not parallelize well, but it's not a problem as we are using the Direct Runner anyway (keep it in mind if using the Java SDK). However, it will not shuffle the records. If, instead, we shuffle over a large number of keys, we'll get a larger number of "orphaned" elements that can't be paired (as state is preserved per key and we assign keys randomly, we can end up with an odd number of records per key):
pairs = (p
         | 'Create Events' >> beam.Create(data)
         | 'Add Keys' >> beam.Map(lambda x: (randint(1, 4), x))
         | 'Pair Records' >> beam.ParDo(PairRecordsFn())
         | 'Check Results' >> beam.ParDo(LogFn()))
In my case I got something like:
INFO:root:('one', 'three')
INFO:root:('two', 'five')
INFO:root:('zero', 'six')
INFO:root:('four', 'seven')
INFO:root:('ten', 'twelve')
INFO:root:('nine', 'thirteen')
INFO:root:('eight', 'fourteen')
INFO:root:('eleven', 'sixteen')
...
EDIT: I thought of another way to do so using the Sample.FixedSizeGlobally combiner. The good thing is that it shuffles the data better but you need to know the number of elements a priori (otherwise we'd need an initial pass on the data) and it seems to return all elements together. Briefly, I initialize the same PCollection twice but apply different shuffle orders and assign indexes in a stateful DoFn. This will guarantee that indexes are unique across elements in the same PCollection (even if no order is guaranteed). In my case, both PCollections will have exactly one record for each key in the range [0, 31]. A CoGroupByKey transform will join both PCollections on the same index thus having random pairs of elements:
pc1 = (p
       | 'Create Events 1' >> beam.Create(data)
       | 'Sample 1' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
       | 'Split Sample 1' >> beam.ParDo(SplitFn())
       | 'Add Dummy Key 1' >> beam.Map(lambda x: (1, x))
       | 'Assign Index 1' >> beam.ParDo(IndexAssigningStatefulDoFn()))

pc2 = (p
       | 'Create Events 2' >> beam.Create(data)
       | 'Sample 2' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
       | 'Split Sample 2' >> beam.ParDo(SplitFn())
       | 'Add Dummy Key 2' >> beam.Map(lambda x: (2, x))
       | 'Assign Index 2' >> beam.ParDo(IndexAssigningStatefulDoFn()))

zipped = ((pc1, pc2)
          | 'Zip Shuffled PCollections' >> beam.CoGroupByKey()
          | 'Drop Index' >> beam.Map(lambda kv: kv[1])
          | 'Check Results' >> beam.ParDo(LogFn()))
Full code here
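The snippets above reference three helpers that only appear in the linked code. A rough sketch of what they could look like (the bodies here are assumptions; the index-assigning DoFn follows the usual Beam stateful-processing pattern):

import logging

import apache_beam as beam
from apache_beam.transforms.userstate import CombiningValueStateSpec

class SplitFn(beam.DoFn):
    """Sample.FixedSizeGlobally emits a single list with the sampled elements;
    re-emit them one by one."""
    def process(self, element):
        for item in element:
            yield item

class IndexAssigningStatefulDoFn(beam.DoFn):
    """Assigns a per-key increasing index to each element, so the two shuffled
    PCollections can later be joined on matching indexes."""
    INDEX_STATE = CombiningValueStateSpec('index', sum)

    def process(self, element, state=beam.DoFn.StateParam(INDEX_STATE)):
        unused_key, value = element
        current_index = state.read()
        state.add(1)
        yield (current_index, value)

class LogFn(beam.DoFn):
    """Logs each element so results show up as INFO messages."""
    def process(self, element):
        logging.info(element)
        yield element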
Results:
INFO:root:(['ten'], ['nineteen'])
INFO:root:(['twenty-three'], ['seven'])
INFO:root:(['twenty-five'], ['twenty'])
INFO:root:(['twelve'], ['twenty-one'])
INFO:root:(['twenty-six'], ['twenty-five'])
INFO:root:(['zero'], ['twenty-three'])
...
How about applying a ParDo transform to both PCollections that attaches keys to elements, and then running the two PCollections through a CoGroupByKey transform?
Please note that Beam does not guarantee the order of elements in a PCollection, so output elements might get reordered after any step, but it seems like this should be OK for your use case since you just need some random order.
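A minimal sketch of that suggestion (the element values and the index-based keys are purely illustrative; any keying ParDo that gives corresponding elements the same key would work):

import apache_beam as beam

letters = ['a', 'b', 'c']
numbers = ['one', 'two', 'three']

with beam.Pipeline() as p:
    # Key both collections with a shared index; in a real pipeline the keys
    # could come from a ParDo that derives them from the elements themselves.
    first = p | 'First' >> beam.Create(list(enumerate(letters)))
    second = p | 'Second' >> beam.Create(list(enumerate(numbers)))

    zipped = ({'first': first, 'second': second}
              | 'Zip' >> beam.CoGroupByKey()
              | 'Pairs' >> beam.Map(lambda kv: (list(kv[1]['first'])[0],
                                                list(kv[1]['second'])[0]))
              | 'Print' >> beam.Map(print))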

How do I perform a "diff" on two Sources given a key using Apache Beam Python SDK?

I posed the question generically, because maybe there is a generic answer. But a specific example is comparing 2 BigQuery tables with the same schema, but potentially different data. I want a diff, i.e. what was added, deleted, or modified, with respect to a composite key, e.g. the first 2 columns.
Table A
C1 C2 C3
-----------
a a 1
a b 1
a c 1
Table B
C1 C2 C3 # Notes if comparing B to A
-------------------------------------
a a 1 # No Change to the key a + a
a b 2 # Key a + b Changed from 1 to 2
# Deleted key a + c with value 1
a d 1 # Added key a + d
I basically want to be able to make/report the comparison notes.
Or from a Beam perspective I may want to Just output up to 4 labeled PCollections: Unchanged, Changed, Added, Deleted. How do I do this and what would the PCollections look like?
What you want to do here, basically, is join two tables and compare the result of that, right? You can look at my answer to this question, to see the two ways in which you can join two tables (Side inputs, or CoGroupByKey).
I'll also code a solution for your problem using CoGroupByKey. I'm writing the code in Python because I'm more familiar with the Python SDK, but you'd implement similar logic in Java:
from apache_beam import pvalue

def make_kv_pair(x):
    """Output the record with the (x[0], x[1]) composite key added."""
    return ((x[0], x[1]), x)

table_a = (p | 'ReadTableA' >> beam.io.Read(beam.io.BigQuerySource(....))
             | 'SetKeysA' >> beam.Map(make_kv_pair))

table_b = (p | 'ReadTableB' >> beam.io.Read(beam.io.BigQuerySource(....))
             | 'SetKeysB' >> beam.Map(make_kv_pair))

joined_tables = ({'table_a': table_a, 'table_b': table_b}
                 | beam.CoGroupByKey())

output_types = ['changed', 'added', 'deleted', 'unchanged']

class FilterDoFn(beam.DoFn):
    def process(self, element):
        key, values = element
        table_a_value = list(values['table_a'])
        table_b_value = list(values['table_b'])
        if table_a_value == table_b_value:
            yield pvalue.TaggedOutput('unchanged', key)
        elif len(table_a_value) < len(table_b_value):
            yield pvalue.TaggedOutput('added', key)
        elif len(table_a_value) > len(table_b_value):
            yield pvalue.TaggedOutput('deleted', key)
        elif table_a_value != table_b_value:
            yield pvalue.TaggedOutput('changed', key)

key_collections = (joined_tables
                   | beam.ParDo(FilterDoFn()).with_outputs(*output_types))

# Now you can handle each output
key_collections.unchanged | WriteToText(...)
key_collections.changed | WriteToText(...)
key_collections.added | WriteToText(...)
key_collections.deleted | WriteToText(...)

Traversing the graphdb to the nth degree

I'm at a loss.
Scenario: get depth of 2 from Joe (#'s represent 'Person'. Letters represent 'Position')
        0                  E
        |                  |
        1                  B
      / | \              / | \
     2 JOE 3            C  A  D
       /|\                /|\
      0 0 0              F G H
     /\ | |             /\ | |
    0 0 0 0            I J K L
Catch is, a person is tied to a position. Positions have relationships to each other, but a person doesn't have a relationship to another person. So it goes something like:
Joe<-[:occupied_by]-(PositionA)-[:authority_held_by]->
(PositionB)-[:occupied_by]->Sam
This query:
Match (:Identity {value:"1234"})-[:IDENTIFIES]->(posStart:Position)
-[:IS_OCCUPIED_BY]->(perStart:Person)
Optional Match p=(perStart)<-[:IS_OCCUPIED_BY]-(posStart)
-[r:AUTHORITY_HELD_BY*..1]-(posEnd:Position)-[:IS_OCCUPIED_BY]->
(perEnd:Person) Return p
does get me what I need, but it always returns the first column as the original node it started with (perStart). I want to return it in a way where the first column always represents the start node and the second column represents the end node.
PositionA, PositionB (and we can infer this means A-[:authority_held_by]->B)
If we had a bi-directional relationship, such as A-[:authority_held_by]->B and B-[:manages]->A,
I wouldn't mind what's in the first or the second column as we can have the third column represent the relationship
PositionB, PositionA, [:manages]
but we are trying to stay away from bi-directional relationship
Ultimately I want something like:
PositionA, PositionB (inferring, A-[:A_H_B]->B)
PositionB, PositionE (inferring, B-[:A_H_B]->E)
PositionF, PositionA (inferring, F-[:A_H_B]->A)
PositionG, PositionA (inferring, G-[:A_H_B]->A)
Is this possible with cypher or do I have to do some black magic? :)
I hope I explained thoroughly and understandably.. Thank you so much in advance!
would replacing Return p with:
RETURN nodes(p)[0] AS START, LAST(nodes(p)) as last
work?

Erlang list comprehension, once again

I'm trying to get a list comprehension working, whose intention is to verify that each element X in List is followed by X+Incr (or by an empty list). Later, I shall use that list and compare it with a list generated with lists:seq(From,To,Incr).
The purpose is to practice writing test cases and finding test properties.
I've done the following steps:
1> List.
[1,3,5,8,9,11,13]
2> Incr.
2
3> List2=[X || X <- List, (tl(List) == []) orelse (hd(tl(List)) == X + Incr)].
[1]
To me, it seems that my list comprehension only takes the first element in List, runs it through the filter/guards, and stops, but it should do the same for EACH element in List, right?
I would like line 3 returning a list, looking like: [1,2,9,11,13].
Any ideas of how to modify current comprehension, or change my approach totally?
PS. I'm using eqc-quickcheck, distributed via Quviq's webpage, if that might change how to solve this.
The problem with your list comprehension is that List always refers to the entire list. Thus this condition allows only those X that are equal to the second element of List minus Incr:
(hd(tl(List)) == X + Incr)
The second element is always 3, so this condition only holds for X = 1.
A list comprehension cannot "look ahead" to other list elements, so this should probably be written as a recursive function:
check_incr([], _Incr) ->
    true;
check_incr([_], _Incr) ->
    true;
check_incr([A, B | Rest], Incr) ->
    A + Incr == B andalso check_incr([B | Rest], Incr).
Maybe I'm misunderstanding you, but a list comprehension is supposed to be "creating a list based on existing lists". Here's one way to generate your list using a list comprehension without using lists:seq:
> Start = 1, Inc = 2, N = 6.
6
> [Start + X*Inc || X <- lists:seq(0,N)].
[1,3,5,7,9,11,13]
You could do something like this:
> lists:zipwith(fun (X, Y) -> Y - X end, [0 | List], List ++ [0]).
[1,2,2,2,2,2,2,-13]
Then check that all elements are equal to Incr, except the first, which should be equal to From, and the last, which should be greater than or equal to -To.
One quick comment: the value List does NOT change when the comprehension is evaluated, it always refers to the initial list. It is X which steps over all the elements of the list. This means that your tests will always refer to the first elements of the list. As a list comprehension gives you one element of a list at a time, it is generally not a good tool to use when you want to compare elements within the list.
There is no way with a list comprehension to look at successive sublists which is what you would need (like MAPLIST in Common Lisp).
