I have a PCollection[str] and I want to generate random pairs.
Coming from Apache Spark, my strategy was to:
copy the original PCollection
randomly shuffle it
zip it with the original PCollection
However I can't seem to find a way to zip 2 PCollections...
This is interesting and a not very common use case because, as #chamikara says, there is no order guarantee in Dataflow. However, I thought about implementing a solution where you shuffle the input PCollection and then pair consecutive elements based on state . I have found some caveats in the way but I thought it might be worth sharing anyway.
First, I have used the Python SDK but the Dataflow Runner does not support stateful DoFn's yet. It works with the Direct Runner but: 1) it is not scalable and 2) it's difficult to shuffle the records without multi-threading. Of course, an easy solution for the latter is to feed an already shuffled PCollection to the pipeline (we can use a different job to pre-process the data). Otherwise, we can adapt this example to the Java SDK.
For now, I decided to try to shuffle and pair it with a single pipeline. I don't really know if this helps or makes things more complicated but code can be found here.
Briefly, the stateful DoFn looks at the buffer and if it is empty it puts in the current element. Otherwise, it pops out the previous element from the buffer and outputs a tuple of (previous_element, current_element):
class PairRecordsFn(beam.DoFn):
"""Pairs two consecutive elements after shuffle"""
BUFFER = BagStateSpec('buffer', PickleCoder())
def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
try:
previous_element = list(buffer.read())[0]
except:
previous_element = []
unused_key, value = element
if previous_element:
yield (previous_element, value)
buffer.clear()
else:
buffer.add(value)
The pipeline adds keys to the input elements as required to use a stateful DoFn. Here there will be a trade-off because you can potentially assign the same key to all elements with beam.Map(lambda x: (1, x)). This would not parallelize well but it's not a problem as we are using the Direct Runner anyway (keep it in mind if using the Java SDK). However, it will not shuffle the records. If, instead, we shuffle to a large amount of keys we'll get a larger number of "orphaned" elements that can't be paired (as state is preserved per key and we assign them randomly we can have an odd number of records per key):
pairs = (p
| 'Create Events' >> beam.Create(data)
| 'Add Keys' >> beam.Map(lambda x: (randint(1,4), x))
| 'Pair Records' >> beam.ParDo(PairRecordsFn())
| 'Check Results' >> beam.ParDo(LogFn()))
In my case I got something like:
INFO:root:('one', 'three')
INFO:root:('two', 'five')
INFO:root:('zero', 'six')
INFO:root:('four', 'seven')
INFO:root:('ten', 'twelve')
INFO:root:('nine', 'thirteen')
INFO:root:('eight', 'fourteen')
INFO:root:('eleven', 'sixteen')
...
EDIT: I thought of another way to do so using the Sample.FixedSizeGlobally combiner. The good thing is that it shuffles the data better but you need to know the number of elements a priori (otherwise we'd need an initial pass on the data) and it seems to return all elements together. Briefly, I initialize the same PCollection twice but apply different shuffle orders and assign indexes in a stateful DoFn. This will guarantee that indexes are unique across elements in the same PCollection (even if no order is guaranteed). In my case, both PCollections will have exactly one record for each key in the range [0, 31]. A CoGroupByKey transform will join both PCollections on the same index thus having random pairs of elements:
pc1 = (p
| 'Create Events 1' >> beam.Create(data)
| 'Sample 1' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
| 'Split Sample 1' >> beam.ParDo(SplitFn())
| 'Add Dummy Key 1' >> beam.Map(lambda x: (1, x))
| 'Assign Index 1' >> beam.ParDo(IndexAssigningStatefulDoFn()))
pc2 = (p
| 'Create Events 2' >> beam.Create(data)
| 'Sample 2' >> combine.Sample.FixedSizeGlobally(NUM_ELEMENTS)
| 'Split Sample 2' >> beam.ParDo(SplitFn())
| 'Add Dummy Key 2' >> beam.Map(lambda x: (2, x))
| 'Assign Index 2' >> beam.ParDo(IndexAssigningStatefulDoFn()))
zipped = ((pc1, pc2)
| 'Zip Shuffled PCollections' >> beam.CoGroupByKey()
| 'Drop Index' >> beam.Map(lambda (x, y):y)
| 'Check Results' >> beam.ParDo(LogFn()))
Full code here
Results:
INFO:root:(['ten'], ['nineteen'])
INFO:root:(['twenty-three'], ['seven'])
INFO:root:(['twenty-five'], ['twenty'])
INFO:root:(['twelve'], ['twenty-one'])
INFO:root:(['twenty-six'], ['twenty-five'])
INFO:root:(['zero'], ['twenty-three'])
...
How about applying a ParDo transform to both PCollections that attach keys to elements and running the two PCollections through a CoGroupByKey transform ?
Please note that Beam does not guarantee order of elements in a PCollection so output elements might get reordered after any step but seems like this should be OK for your use-case since you just need some random order.
Is there a generic way, given a complex object in Erlang, to come up with a valid function declaration for it besides eyeballing it? I'm maintaining some code previously written by someone who was a big fan of giant structures, and it's proving to be error prone doing it manually.
I don't need to iterate the whole thing, just grab the top level, per se.
For example, I'm working on this right now -
[[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"],
[{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.5","5060"},
[{'via-branch',"z9hG4bKb561e4f03a40c4439ba375b2ac3c9f91.0"}]}]},
{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.15","5060"},
[{'via-branch',"12dee0b2f48309f40b7857b9c73be9ac"}]}]},
{'From',
{'from-spec',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"003018CFE4EF"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"b7226ffa86c46af7bf6e32969ad16940"}]}},
{'To',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3966"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"a830c764"}]},
{'Call-ID',"90df0e4968c9a4545a009b1adf268605#172.20.10.15"},
{'CSeq',1358286,"SUBSCRIBE"},
["date",'HCOLON',
["Mon",44,32,["13",32,"Jun",32,"2011"],32,["17",58,"03",58,"55"],32,"GMT"]],
{'Contact',
[[{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3ComCallProcessor"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[]],
[]]},
["expires",'HCOLON',3600],
["user-agent",'HCOLON',
["3Com",[]],
[['LWS',["VCX",[]]],
['LWS',["7210",[]]],
['LWS',["IP",[]]],
['LWS',["CallProcessor",[['SLASH',"v10.0.8"]]]]]],
["proxy-authenticate",'HCOLON',
["Digest",'LWS',
["realm",'EQUAL',['SWS',34,"3Com",34]],
[['COMMA',["domain",'EQUAL',['SWS',34,"3Com",34]]],
['COMMA',
["nonce",'EQUAL',
['SWS',34,"btbvbsbzbBbAbwbybvbxbCbtbzbubqbubsbqbtbsbqbtbxbCbxbsbybs",
34]]],
['COMMA',["stale",'EQUAL',"FALSE"]],
['COMMA',["algorithm",'EQUAL',"MD5"]]]]],
{'Content-Length',0}],
"\r\n",
["\n"]]
Maybe https://github.com/etrepum/kvc
I noticed your clarifying comment. I'd prefer to add a comment myself, but don't have enough karma. Anyway, the trick I use for that is to experiment in the shell. I'll iterate a pattern against a sample data structure until I've found the simplest form. You can use the _ match-all variable. I use an erlang shell inside an emacs shell window.
First, bind a sample to a variable:
A = [{a,b},[{c,d}, {e,f}]].
Now set the original structure against the variable:
[{a,b},[{c,d},{e,f}]] = A.
If you hit enter, you'll see they match. Hit alt-p (forget what emacs calls alt, but it's alt on my keyboard) to bring back the previous line. Replace some tuple or list item with an underscore:
[_,[{c,d},{e,f}]].
Hit enter to make sure you did it right and they still match. This example is trivial, but for deeply nested, multiline structures it's trickier, so it's handy to be able to just quickly match to test. Sometimes you'll want to try to guess at whole huge swaths, like using an underscore to match a tuple list inside a tuple that's the third element of a list. If you place it right, you can match the whole thing at once, but it's easy to misread it.
Anyway, repeat to explore the essential shape of the structure and place real variables where you want to pull out values:
[_, [_, _]] = A.
[_, _] = A.
[_, MyTupleList] = A. %% let's grab this tuple list
[{MyAtom,b}, [{c,d}, MyTuple]] = A. %% or maybe we want this atom and tuple
That's how I efficiently dissect and pattern match complex data structures.
However, I don't know what you're doing. I'd be inclined to have a wrapper function that uses KVC to pull out exactly what you need and then distributes to helper functions from there for each type of structure.
If I understand you correctly you want to pattern match some large datastructures of unknown formatting.
Example:
Input: {a, b} {a,b,c,d} {a,[],{},{b,c}}
function({A, B}) -> do_something;
function({A, B, C, D}) when is_atom(B) -> do_something_else;
function({A, B, C, D}) when is_list(B) -> more_doing.
The generic answer is of course that it is undecidable from just data to know how to categorize that data.
First you should probably be aware of iolists. They are created by functions such as io_lib:format/2 and in many other places in the code.
One example is that
[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"]
will print as
SIP/2.0 407 Proxy Authentication Required
So, I'd start with flattening all those lists, using a function such as
flatten_io(List) when is_list(List) ->
Flat = lists:map(fun flatten_io/1, List),
maybe_flatten(Flat);
flatten_io(Tuple) when is_tuple(Tuple) ->
list_to_tuple([flatten_io(Element) || Element <- tuple_to_list(Tuple)];
flatten_io(Other) -> Other.
maybe_flatten(L) when is_list(L) ->
case lists:all(fun(Ch) when Ch > 0 andalso Ch < 256 -> true;
(List) when is_list(List) ->
lists:all(fun(X) -> X > 0 andalso X < 256 end, List);
(_) -> false
end, L) of
true -> lists:flatten(L);
false -> L
end.
(Caveat: completely untested and quite inefficient. Will also crash for inproper lists, but you shouldn't have those in your data structures anyway.)
On second thought, I can't help you. Any data structure that uses the atom 'COMMA' for a comma in a string should be taken out and shot.
You should be able to flatten those things as well and start to get a view of what you are looking at.
I know that this is not a complete answer. Hope it helps.
Its hard to recommend something for handling this.
Transforming all the structures in a more sane and also more minimal format looks like its worth it. This depends mainly on the similarities in these structures.
Rather than having a special function for each of the 100 there must be some automatic reformatting that can be done, maybe even put the parts in records.
Once you have records its much easier to write functions for it since you don't need to know the actual number of elements in the record. More important: your code won't break when the number of elements changes.
To summarize: make a barrier between your code and the insanity of these structures by somehow sanitizing them by the most generic code possible. It will be probably a mix of generic reformatting with structure speicific stuff.
As an example already visible in this struct: the 'name-addr' tuples look like they have a uniform structure. So you can recurse over your structures (over all elements of tuples and lists) and match for "things" that have a common structure like 'name-addr' and replace these with nice records.
In order to help you eyeballing you can write yourself helper functions along this example:
eyeball(List) when is_list(List) ->
io:format("List with length ~b\n", [length(List)]);
eyeball(Tuple) when is_tuple(Tuple) ->
io:format("Tuple with ~b elements\n", [tuple_size(Tuple)]).
So you would get output like this:
2> eyeball({a,b,c}).
Tuple with 3 elements
ok
3> eyeball([a,b,c]).
List with length 3
ok
expansion of this in a useful tool for your use is left as an exercise. You could handle multiple levels by recursing over the elements and indenting the output.
Use pattern matching and functions that work on lists to extract only what you need.
Look at http://www.erlang.org/doc/man/lists.html:
keyfind, keyreplace, L = [H|T], ...