How to access "key" in combine.perKey in beam

How to access "key" in combine.perKey in beam - google-cloud-dataflow

In How to create custom Combine.PerKey in beam sdk 2.0, I asked and got a correct answer on how to create a custom Combine.PerKey in the new beam sdk 2.0. However, I now need to create a custom combinePerKey such that within my custom CombinePerKey logic, I need to be able to access the contents of the key. This was easily possible in dataflow 1.x, but in the new beam sdk 2.0, I'm unsure how to do so. Any little code snippet/example would be extremely useful.
EDIT #1 (per Ben Chambers's request)
The real use case is hard to explain, but I'm going to try:
We have a 3d space composed of millions of little hills. We try to determine the apex of these millions of hills as follows: we create billions of "rectangular probes" for the whole 3d space, and then we ask each of these billions of probes to "move" in a greedy way to the apex. Once it hits the apex, it stops. The probe then returns the apex and itself. The apex is the KEY for which we'll do a custom combine by key.
Now, the custom combine function is going to finally return a final object (called a feature) which is derived from all the probes that reach the same apex (ie the same key). When generating this "feature" object, we need to know infomration about the final apex/key (ie the top of the hill). Hence, I need this key info.
One way to solve this is using a group by key, but that was slow (at least in df 1.x); we got it to be fast (in df 1.x) using a custom combine fn. So, we'd like the key. That said, groupbykey works in beam skd 2.0.
Alternatively, we could stick the "apex" information into the "probe" objects itself, but this means that each of our billions of probe objects now needs to be tripled in size just to hold this apex information (and this apex information repeats itself, since there are only say 1 million apexes but 1 billion probes, so this intuitively feels highly inefficient.)

Rather than relying on the CombineFn to compute the entire result, could you instead have the ComibneFn compute some partial result based only on information about the probes? Then your Combine.perKey(...) returns a PCollection<KV<Apex, InfoAboutProbes>> and you can use a ParDo to combine the information about the apex with the summary information about the probes. This allows you to use the CombineFn for efficiently combining information about many probes, while using a ParDo to access the key.

Related

'GetPositionLowerLimits' and 'GetPositionUpperLimits' does not support get by ModelInstance as other functions

What I find in Drake API is that 'GetPositionLowerLimits' and 'GetPositionUpperLimits' does not support get by ModelInstance as other functions.
Anybody knows how to query this when I have multiple robots and I am interested in the Lower and Upper Limits for every robot?

I believe you can take the result of GetPositionLowerLimits (or GetPositionUpperLimits) and feed it through GetPositionsFromArray to select just one model instance at a time.
The model instances documentation has some more information.

General principle to implement node-based workflow as seen in Unreal, Blender, Alteryx and the like?

This topic is difficult to Google, because of "node" (not node.js), and "graph" (no, I'm not trying to make charts).
Despite being a pretty well rounded and experienced developer, I can't piece together a mental model of how these sorts of editors get data in a sensible way, in a sensible order, from node to node. Especially in the Alteryx example, because a Sort module, for example, needs its entire upstream dataset before proceeding. And some nodes can send a single output to multiple downstream consumers.
I was able to understand trees and what not in my old data structures course back in the day, and successfully understand and adapt the basic graph concepts from https://www.python.org/doc/essays/graphs/ in a real project. But that was a static structure and data weren't being passed from node to node.
Where should I be starting and/or what concept am I missing that I could use implement something like this? Something to let users chain together some boxes to slice and dice text files or data records with some basic operations like sort and join? I'm using C#, but the answer ought to be language independent.

This paradigm is called Dataflow Programming, it works with stream of data which is passed from instruction to instruction to be processed.
Dataflow programs can be programmed in textual or visual form, and besides the software you have mentioned there are a lot of programs that include some sort of dataflow language.
To create your own dataflow language you have to:
Create program modules or objects that represent your processing nodes realizing different sort of data processing. Processing nodes usually have one or multiple data inputs and one or multiple data output and implement some data processing algorithm inside them. Nodes also may have control inputs that control how given node process data. A typical dataflow algorithm calculates output data sample from one or many input data stream values as for example FIR filters do. However processing algorithm also can have data values feedback (output values in some way are mixed with input values) as in IIR filters, or accumulate values in some way to calculate output value
Create standard API for passing data between processing nodes. It can be different for different kinds of data and controlling signals, but it must be standard because processing nodes should 'understand' each other. Data usually is passed as plain values. Controlling signals can be plain values, events, or more advanced controlling language - depending of your needs.
Create arrangement to link your nodes and to pass data between them. You can create your own program machinery or use some standard things like pipes, message queues, etc. For example this functional can be implemented as a tree-like structure whose nodes are your processing nodes, and have references to next nodes and its appropriate input that process data coming from the output of the current node.
Create some kind of nodes iterator that starts from begin of the dataflow graph and iterates over each processing node where it:
provides next data input values
invokes node data processing methods
updates data output value
pass updated data output values to inputs of downstream processing nodes
Create a tool for configuring nodes parameters and links between them. It can be just a simple text file edited with text editor or a sophisticated visual editor with GUI to draw dataflow graph.
Regarding your note about Sort module in Alteryx - perhaps data values are just accumulated inside this module and then sorted.
here you can find even more detailed description of Dataflow programming languages.

Beam/Dataflow design pattern to enrich documents based on database queries

Evaluating Dataflow, and am trying to figure out if/how to do the following.
My apologies if anything in the above is trivial--trying to wrap our heads around Dataflow before we make a decision on using Beam, or something else like Spark, etc.
General use case is for machine learning:
Ingesting documents which are individually processed.
In addition to easy-to-write transforms, we'd like to enrich each document based on queries against databases (that are largely key-value stores).
A simple example would be a gazetteer: decompose the text into ngrams, and then check if those ngrams reside in some database, and record (within a transformed version of the original doc) the entity identifier given phrases map to.
How to do this efficiently?
NAIVE (although possibly tricky with the serialization requirement?):
Each document could simply query the database individually (similar to Querying a relational database through Google DataFlow Transformer), but, given that most of these are simple key-value stores, it seems like there should be a more efficient way to do this (given the real problems with database query latency).
SCENARIO #1: Improved?:
Current strawman is to store the tables in Bigquery, pull them down (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py), and then use them as side inputs, that are used as key-value lookups within the per-doc function(s).
Key-value tables range from generally very small to not-huge (100s of MBs, maybe low GBs). Multiple CoGroupByKey with same key apache beam ("Side inputs can be arbitrarily large - there is no limit; we have seen pipelines successfully run using side inputs of 1+TB in size") suggests this is reasonable, at least from a size POV.
1) Does this make sense? Is this the "correct" design pattern for this scenario?
2) If this is a good design pattern...how do I actually implement this?
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L53 shows feeding the result to the document function as an AsList.
i) Presumably, AsDict is more appropriate here, for the above use case? So I'd probably need to run some transformations first on the Bigquery output to separate it into key, value tuple; and make sure that the keys are unique; and then use it as a side input.
ii) Then I need to use the side input in the function.
What I'm not clear on:
for both of these, how to manipulate the output coming off of the Bigquery pull is murky to me. How would I accomplish (i) (assuming it is necessary)? Meaning, what does the data format look like (raw bytes? strings? is there a good example I can look into?)
Similarly, if AsDict is the correct way to pass it into the func, can I just reference things like a dict normally is used in python? e.g., side_input.get('blah') ?
SCENARIO #2: Even more improved? (for specific cases):
The above scenario--if achievable--definitely does seem like it is superior continuous remote calls (given the simple key-value lookup), and would be very helpful for some of our scenarios. But if I take a scenario like a gazetteer lookup (like above)...is there an even more optimized solution?
Something like, for every doc, writing our all the ngrams as keys, with values as the underlying indices (docid+indices within the doc), and then doing some sort of join between these ngrams and the phrases in our gazeteer...and then doing another set of transforms to recover the original docs (now w/ their new annotations).
I.e., let Beam handle all of the joins/lookups directly?
Theoretical advantage is that Beam may be a lot quicker in doing this than, for each doc, looping over all of the ngrams and doing a check if the ngram is in the side_input.
Other key issues:
3) If this is a good way to do things, is there any trick to making this work well in the streaming scenario? Text elsewhere suggests that the side input caching works more poorly outside the batch scenario. Right now, we're focused on batch, but streaming will become relevant in serving live predictions.
4) Any Beam-related reason to prefer Java>Python for any of the above? We've got a good amount of existing Python code to move to Dataflow, so would heavily prefer Python...but not sure if there are any hidden issues with Python in the above (e.g., I've noticed Python doesn't support certain features or I/O).
EDIT: Strawman? for the example ngram lookup scenario (should generalize strongly to general K:V lookup)
Phrases = get from bigquery
Docs (indexed by docid) (direct input from text or protobufs, e.g.)
Transform: phrases -> (phrase, entity) tuples
Transform: docs -> ngrams (phrase, docid, coordinates [in document])
CoGroupByKey key=phrase: (phrase, entity, docid, coords)
CoGroupByKey key=docid, group((phrase, entity, docid, coords), Docs)
Then we can iteratively finalize each doc, using the set of (phrase, entity, docid, coords) and each Doc

Regarding the scenarios for your pipeline:
Naive scenario
You are right that per-element querying of a database is undesirable.
If your key-value store is able to support low-latency lookups by reusing an open connection, you can define a global connection that is initialized once per worker instead of once per bundle. This should be acceptable your k-v store supports efficient lookups over existing connections.
Improved scenario
If that's not feasible, then BQ is a great way to keep and pull in your data.
You can definitely use AsDict side inputs, and simply go side_input[my_key] or side_input.get(my_key).
Your pipeline could look something like so:
kv_query = "SELECT key, value FROM my:table.name"
p = beam.Pipeline()
documents_pcoll = p | ReadDocuments()
additional_data_pcoll = (p
| beam.io.BigQuerySource(query=kv_query)
# Make row a key-value tuple.
| 'format bq' >> beam.Map(lambda row: (row['key'], row['value'])))
enriched_docs = (documents_pcoll
| 'join' >> beam.Map(lambda doc, query: enrich_doc(doc, query[doc['key']]),
query=AsDict(additional_data_pcoll)))
Unfortunately, this has one shortcoming, and that's the fact that Python does not currently support arbitrarily large side inputs (it currently loads all of the K-V into a single Python dictionary). If your side-input data is large, then you'll want to avoid this option.
Note This will change in the future, but we can't be sure ATM.
Further Improved
Another way of joining two datasets is to use CoGroupByKey. The loading of documents, and of K-V additional data should not change, but when joining, you'd do something like so:
# Turn the documents into key-value tuples as well[
documents_kv_pcoll = (documents_pcoll
| 'format docs' >> beam.Map(lambda doc: (doc['key'], doc)))
enriched_docs = ({'docs': documents_kv_pcoll, 'additional_data': additional_data_pcoll}
| beam.CoGroupByKey()
| 'enrich' >> beam.Map(lambda x: enrich_doc(x['docs'][0], x['additional_data'][0]))
CoGroupByKey will allow you to use arbitrarily large collections on either side.
Answering your questions
You can see an example of using BigQuery as a side input in the cookbook. As you can see there, the data comes parsed (I believe that it comes in their original data types, but it may come in string/unicode). Check the docs (or feel free to ask) if you need to know more.
Currently, Python streaming is in alpha, and it does not support side inputs; but it does support shuffle features such as CoGroupByKey. Your pipeline using CoGroupByKey should work well in streaming.
A reason to prefer Java over Python is that all these features work in Java (unlimited-size side inputs, streaming side inputs). But it seems that for your use case, Python may have all you need.
Note: The code snippets are approximate, but you should be able to debug them using the DirectRunner.
Feel free to ask for clarification, or to ask about other aspects if you feel like it'd help.

Best Practice ETL with Dataflow and Lookup

What's the best practice to implement a standard streaming ETL process which writes fact and some smaller dimensional tables to BigQuery?
I'm trying to understand how to handle the following things:
How to do a simple dimension lookup in a streaming pipeline?
In case the answer is sideInput - how to handle lookups for values that don't exist yet in the dimension? How to update the sideInput?

When side inputs receive late data on a specific window, they will be recomputed. If you do the lookup after this, then you'll be able to see the element in the side input.
Currently, the Beam model does not include semantics for re-triggering of the ParDo that consumes the side input, so you'd need to somehow make sure to (re)do de lookup after the side input has been computed.

using butterworth filter in a case structure

I'm trying to use butterworth filter. The input data comes from an "index array" module (the data is acquired through DAQ and I want to process the voltage signal which is in an array of waveforms). when I use this filter in a case structure, it doesn't work. yet, when I use the filters in the "waveform conditioning" section, there is no problem. what exactly is the difference between these two types of filters?
a little add on to my problem: the second picture is from when i tried to reassemble the initial combination, and the error happened

You are comparing offline filtering to online filtering.
In LabVIEW, the PtbyPt-VIs are intended to be used in an online setting, that is - iteratively.
For each new sample that is obtained, it would be input directly into the VI. The VI stores the states of the previous iterations to perform the filtering.
The "normal" filter VIs are intended for offline analysis and expects an array containing the full data of the signal.
The following whitepaper explains Point-by-Point-VIs. Note that this paper is quite old, so it should explain the concepts - but might be otherwise outdated.
http://www.ni.com/pdf/manuals/370152b.pdf

If VoltageBuf is an array of consecutive values of the same signal (the one that you want to filter) you only need to connect VoltageBuf directly to the filter.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart