Beam/Dataflow design pattern to enrich documents based on database queries - google-cloud-dataflow

Evaluating Dataflow, and am trying to figure out if/how to do the following.
My apologies if anything in the above is trivial--trying to wrap our heads around Dataflow before we make a decision on using Beam, or something else like Spark, etc.
General use case is for machine learning:
Ingesting documents which are individually processed.
In addition to easy-to-write transforms, we'd like to enrich each document based on queries against databases (that are largely key-value stores).
A simple example would be a gazetteer: decompose the text into ngrams, and then check if those ngrams reside in some database, and record (within a transformed version of the original doc) the entity identifier given phrases map to.
How to do this efficiently?
NAIVE (although possibly tricky with the serialization requirement?):
Each document could simply query the database individually (similar to Querying a relational database through Google DataFlow Transformer), but, given that most of these are simple key-value stores, it seems like there should be a more efficient way to do this (given the real problems with database query latency).
SCENARIO #1: Improved?:
Current strawman is to store the tables in Bigquery, pull them down (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py), and then use them as side inputs, that are used as key-value lookups within the per-doc function(s).
Key-value tables range from generally very small to not-huge (100s of MBs, maybe low GBs). Multiple CoGroupByKey with same key apache beam ("Side inputs can be arbitrarily large - there is no limit; we have seen pipelines successfully run using side inputs of 1+TB in size") suggests this is reasonable, at least from a size POV.
1) Does this make sense? Is this the "correct" design pattern for this scenario?
2) If this is a good design pattern...how do I actually implement this?
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L53 shows feeding the result to the document function as an AsList.
i) Presumably, AsDict is more appropriate here, for the above use case? So I'd probably need to run some transformations first on the Bigquery output to separate it into key, value tuple; and make sure that the keys are unique; and then use it as a side input.
ii) Then I need to use the side input in the function.
What I'm not clear on:
for both of these, how to manipulate the output coming off of the Bigquery pull is murky to me. How would I accomplish (i) (assuming it is necessary)? Meaning, what does the data format look like (raw bytes? strings? is there a good example I can look into?)
Similarly, if AsDict is the correct way to pass it into the func, can I just reference things like a dict normally is used in python? e.g., side_input.get('blah') ?
SCENARIO #2: Even more improved? (for specific cases):
The above scenario--if achievable--definitely does seem like it is superior continuous remote calls (given the simple key-value lookup), and would be very helpful for some of our scenarios. But if I take a scenario like a gazetteer lookup (like above)...is there an even more optimized solution?
Something like, for every doc, writing our all the ngrams as keys, with values as the underlying indices (docid+indices within the doc), and then doing some sort of join between these ngrams and the phrases in our gazeteer...and then doing another set of transforms to recover the original docs (now w/ their new annotations).
I.e., let Beam handle all of the joins/lookups directly?
Theoretical advantage is that Beam may be a lot quicker in doing this than, for each doc, looping over all of the ngrams and doing a check if the ngram is in the side_input.
Other key issues:
3) If this is a good way to do things, is there any trick to making this work well in the streaming scenario? Text elsewhere suggests that the side input caching works more poorly outside the batch scenario. Right now, we're focused on batch, but streaming will become relevant in serving live predictions.
4) Any Beam-related reason to prefer Java>Python for any of the above? We've got a good amount of existing Python code to move to Dataflow, so would heavily prefer Python...but not sure if there are any hidden issues with Python in the above (e.g., I've noticed Python doesn't support certain features or I/O).
EDIT: Strawman? for the example ngram lookup scenario (should generalize strongly to general K:V lookup)
Phrases = get from bigquery
Docs (indexed by docid) (direct input from text or protobufs, e.g.)
Transform: phrases -> (phrase, entity) tuples
Transform: docs -> ngrams (phrase, docid, coordinates [in document])
CoGroupByKey key=phrase: (phrase, entity, docid, coords)
CoGroupByKey key=docid, group((phrase, entity, docid, coords), Docs)
Then we can iteratively finalize each doc, using the set of (phrase, entity, docid, coords) and each Doc

Regarding the scenarios for your pipeline:
Naive scenario
You are right that per-element querying of a database is undesirable.
If your key-value store is able to support low-latency lookups by reusing an open connection, you can define a global connection that is initialized once per worker instead of once per bundle. This should be acceptable your k-v store supports efficient lookups over existing connections.
Improved scenario
If that's not feasible, then BQ is a great way to keep and pull in your data.
You can definitely use AsDict side inputs, and simply go side_input[my_key] or side_input.get(my_key).
Your pipeline could look something like so:
kv_query = "SELECT key, value FROM my:table.name"
p = beam.Pipeline()
documents_pcoll = p | ReadDocuments()
additional_data_pcoll = (p
| beam.io.BigQuerySource(query=kv_query)
# Make row a key-value tuple.
| 'format bq' >> beam.Map(lambda row: (row['key'], row['value'])))
enriched_docs = (documents_pcoll
| 'join' >> beam.Map(lambda doc, query: enrich_doc(doc, query[doc['key']]),
query=AsDict(additional_data_pcoll)))
Unfortunately, this has one shortcoming, and that's the fact that Python does not currently support arbitrarily large side inputs (it currently loads all of the K-V into a single Python dictionary). If your side-input data is large, then you'll want to avoid this option.
Note This will change in the future, but we can't be sure ATM.
Further Improved
Another way of joining two datasets is to use CoGroupByKey. The loading of documents, and of K-V additional data should not change, but when joining, you'd do something like so:
# Turn the documents into key-value tuples as well[
documents_kv_pcoll = (documents_pcoll
| 'format docs' >> beam.Map(lambda doc: (doc['key'], doc)))
enriched_docs = ({'docs': documents_kv_pcoll, 'additional_data': additional_data_pcoll}
| beam.CoGroupByKey()
| 'enrich' >> beam.Map(lambda x: enrich_doc(x['docs'][0], x['additional_data'][0]))
CoGroupByKey will allow you to use arbitrarily large collections on either side.
Answering your questions
You can see an example of using BigQuery as a side input in the cookbook. As you can see there, the data comes parsed (I believe that it comes in their original data types, but it may come in string/unicode). Check the docs (or feel free to ask) if you need to know more.
Currently, Python streaming is in alpha, and it does not support side inputs; but it does support shuffle features such as CoGroupByKey. Your pipeline using CoGroupByKey should work well in streaming.
A reason to prefer Java over Python is that all these features work in Java (unlimited-size side inputs, streaming side inputs). But it seems that for your use case, Python may have all you need.
Note: The code snippets are approximate, but you should be able to debug them using the DirectRunner.
Feel free to ask for clarification, or to ask about other aspects if you feel like it'd help.

Related

TensorFlowFederated: Passing tensor to tff.federated_computation

I have trialled TFF tutorial (MNIST) on my single machine and now I am trying to perform a multi-machine process using MNIST data.
Clearly, I cannot use create_tf_dataset_for_client so I have used GRPC to learn how to pass data from one machine to another.
My scenario is that Server will dispatch the initial model (with zeroes) to all the participating clients where the model will run on local data. Each client will dispatch the new weights to the server that will perform federated_mean.
I was thinking of using tff.learning.build_federated_averaging_process where I could hopefully customise the next function (2nd argument) but I failed... I am not even sure if we use this approach to send the model and get the weights back from remote clients.
Then I thought I could use tff.federated_mean under #tff.federated_computation decorator. However, since weights are arrays and I have a list of them (as I have a number of clients), I am unable to understand how do I create a tff.FederatedType that points to that a list of lists. Any help from someone who has modelled federation on distributed dataset will be handy to understand.
Regards,
Dev.
TFF computations are designed to be platform/runtime agnostic; a single computation can be executed by several different backends.
TFF's type system can be helpful here in reasoning about how data is expected to flow in you computation. See the custom federated algorithms part 1 tutorial for an intro to how TFF thinks about types.
The result of build_federated_averaging_process expects an argument of datasets which are placed at clients; for a dataset of element type T, in TFF's usual notation this would be denoted {T*}#C. This signature particular is agnostic with respect to how the datasets arrive at the clients, or indeed how the clients themselves are represented.
Materializing the data and representing the clients is really the job of the runtime. TFF provides a few so-called native options here.
For example, in the local Python runtime clients are represented by threads on your local machine. Datasets are simply eager tf.data.Dataset objects, and the threads pull data from the datasets during training.
In the remote Python runtime, clients are represented by (threads on) remote workers, so that a single remote worker could be running more than one client. In this case, as you note, data must be materialized on the remote worker in order to train.
There are several options for accomplishing this.
One, TFF will actually handle serialization and deserialization of eager datasets across this RPC connection for you, so you could use the identical pattern of specifying data as in the local runtime, and it should "just work". This pattern actually got significantly better in March of 2021, via the use of tf.raw_ops.DatasetToGraphV2.
Perhaps better mapping to the concepts of federated computation, however, is the use of some library functions to simply instantiate the datasets on the workers.
Suppose you have an iterative process ip, which accepts a state and data argument, where data is of type {T*}#C. Suppose further we have a TFF computation get_dataset_for_client_id, which accepts a string and returns a dataset of appropriate type (IE, its TFF type signature is tf.str -> T*).
Then we can compose these two computations into another:
#tff.federated_computation(STATE_TYPE, tff.FederatedType(tf.string, tff.CLIENTS))
def new_next(state, client_ids):
datasets_on_clients = tff.federated_map(get_dataset_for_client_id, client_ids)
return ip.next(state, datasets_on_clients)
new_next now requires the controller to only specify the ids of clients on which to train, and delegates responsibility for pointing to a data store to whoever is representing the clients.
This pattern I think is likely what you want; TFF provides some helper s like the dataset_computation attribute on tff.simulation.ClientData and tff.simulation.compose_dataset_computation_with_iterative_process, which will more or less perform the wiring we did above for you.
let's do this step by step. Please let us know if the explanation below answers your question.
Let's start with an example of TF (non-federated, just local) code that takes a dataset and does something with it, say add numbers:
#tff.tf_computation(tff.SequenceType(tf.int32))
def process_data(ds):
return ds.reduce(np.int32(0), lambda x, y: x + y)
This code takes a dataset of integer numbers at input, and returns a single integer with the sum at output.
You can confirm this by lookin at the type signature, like this:
str(process_data.type_signature)
You should see this:
(int32* -> int32)
So, process_data takes a set of integers, and returns an integer.
Now, using TFF's federated operators we can create a federated computation that does this on multiple clients, like this:
#tff.federated_computation(tff.FederatedType(tff.SequenceType(tf.int32), tff.CLIENTS))
def process_data_on_clients(federated_ds):
return tff.federated_map(process_data, federated_ds)
If you look at the type signature of this new computation (just like above), you will see this:
({int32*}#CLIENTS -> {int32}#CLIENTS)
It means process_data_on_clients takes a federated set of integers (one set per client), and returns a federated integer (one integer with the sum on each client).
What happens in the above is that, the TF logic in process_data will be executed once on each client. This is how the federated_map operator works.
Now, process_data_on_clients is a little bit like the the iterative process you are working with. It wants you to provide a federated dataset as an argument.
Let's see how we can make one by following the same pattern as above.
Here's some TF code that creates a single dataset with integers, say you supply an integer n and want to create a dataset with numbers from 1 up to n, i..e, {1, 2, ..., n}:
#tff.tf_computation(tf.int32)
def make_data(n):
return tf.data.Dataset.range(tf.cast(n, tf.int64)).map(lambda x: tf.cast(x + 1, tf.int32))
This is obviously a silly example, you could do something more along the lines of what you need (e.g., read data from a file specified by a name, etc.).
And here's what its type signature looks like:
(int32 -> int32*)
You can see the similarity to process_data.
And, just like with processing data, here's now we can make data on all clients by using the federated_map operator:
#tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_data_on_clients(federated_n):
return tff.federated_map(make_data, federated_n)
This is the type signature:
({int32}#CLIENTS -> {int32*}#CLIENTS)
Great, so make_data_on_clients takes a federated integer (that tells us how many data items to produce on each client), and returns a federated dataset, just like what process_data_on_clients wants.
You can check that the two work together as intended:
federated_n = [2, 3, 4]
federated_ds = make_data_on_clients(federated_n)
result = process_data_on_clients(federated_ds)
result
You should get the sums 1+2, 1+2+3, and 1+2+3+4 on the 3 clients involved in this computation (note there were 3 numbers in the federated integer above, so there are 3 clients here):
[<tf.Tensor: shape=(), dtype=int32, numpy=3>,
<tf.Tensor: shape=(), dtype=int32, numpy=6>,
<tf.Tensor: shape=(), dtype=int32, numpy=10>]
Note that all TF code you have seen so far, including both dataset creation and dataset reduce, were being executed on the clients (using federated_map).
Now, you can put the two together:
#tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_and_process_data_on_clients(federated_n):
federated_ds = make_data_on_clients(federated_n)
return process_data_on_clients(federated_ds)
And now, you can invoke the make and process data combo in one shot:
make_and_process_data_on_clients(federated_n)
Again, all TF code here is executing on clients, just like in the above.
So where does this leave you?
Going back to Keith's explanation, the iterative process you got from TFF wants a federated dataset at input, just like process_data_on_clients.
The function get_dataset_for_client_id in Keith's example is like our make_data in that it is assumed to contain TensorFlow code that you want to run on each client to physically construct a dataset on that client.
In out silly example, dataset construction logic used range, but it can be anything. For example, you could load data on each client from the same local file my_data, or using a custom TF op, or by whatever other means. Just like in our example, you can pass parameters to that function to give you more centralized control (similarly to whatever did above with the federated integer).
The code snipper new_next in Keith's example is just like our make_and_process_data_on_clients, in that it combines two federated computations: one that makes federated data on clients (supplied by you, just as discussed here), and one that processes that data (from tff.learning, the iterative process).
Does this help?
If still unclear, I would recommend to try the examples I included above on your distributed setup, since you already have one. You could inject some TF print ops to that code to confirm that the TF code you wrote is executing on the client machines in your system.
Once you get that part, it's simple tweak to replace the silly data set construction logic in make_data with one that loads a dataset on each client from whatever local data source you are using.
EDITS:
Re: how to print, any TensorFlow code that appears in the body of a #tff.tf_computation is executed in eager mode, and you can use standard TensorFlow mechanisms such as tf.print to print from within TensorFlow.
tensorflow.org/api_docs/python/tf/print
On how to configure a multi-machine system with multiple worker nodes, see the Kubernetes tutorial. Note that the machine that drives the process connects to worker nodes, not the other way round.
https://www.tensorflow.org/federated/tutorials/high_performance_simulation_with_kubernetes

General principle to implement node-based workflow as seen in Unreal, Blender, Alteryx and the like?

This topic is difficult to Google, because of "node" (not node.js), and "graph" (no, I'm not trying to make charts).
Despite being a pretty well rounded and experienced developer, I can't piece together a mental model of how these sorts of editors get data in a sensible way, in a sensible order, from node to node. Especially in the Alteryx example, because a Sort module, for example, needs its entire upstream dataset before proceeding. And some nodes can send a single output to multiple downstream consumers.
I was able to understand trees and what not in my old data structures course back in the day, and successfully understand and adapt the basic graph concepts from https://www.python.org/doc/essays/graphs/ in a real project. But that was a static structure and data weren't being passed from node to node.
Where should I be starting and/or what concept am I missing that I could use implement something like this? Something to let users chain together some boxes to slice and dice text files or data records with some basic operations like sort and join? I'm using C#, but the answer ought to be language independent.
This paradigm is called Dataflow Programming, it works with stream of data which is passed from instruction to instruction to be processed.
Dataflow programs can be programmed in textual or visual form, and besides the software you have mentioned there are a lot of programs that include some sort of dataflow language.
To create your own dataflow language you have to:
Create program modules or objects that represent your processing nodes realizing different sort of data processing. Processing nodes usually have one or multiple data inputs and one or multiple data output and implement some data processing algorithm inside them. Nodes also may have control inputs that control how given node process data. A typical dataflow algorithm calculates output data sample from one or many input data stream values as for example FIR filters do. However processing algorithm also can have data values feedback (output values in some way are mixed with input values) as in IIR filters, or accumulate values in some way to calculate output value
Create standard API for passing data between processing nodes. It can be different for different kinds of data and controlling signals, but it must be standard because processing nodes should 'understand' each other. Data usually is passed as plain values. Controlling signals can be plain values, events, or more advanced controlling language - depending of your needs.
Create arrangement to link your nodes and to pass data between them. You can create your own program machinery or use some standard things like pipes, message queues, etc. For example this functional can be implemented as a tree-like structure whose nodes are your processing nodes, and have references to next nodes and its appropriate input that process data coming from the output of the current node.
Create some kind of nodes iterator that starts from begin of the dataflow graph and iterates over each processing node where it:
provides next data input values
invokes node data processing methods
updates data output value
pass updated data output values to inputs of downstream processing nodes
Create a tool for configuring nodes parameters and links between them. It can be just a simple text file edited with text editor or a sophisticated visual editor with GUI to draw dataflow graph.
Regarding your note about Sort module in Alteryx - perhaps data values are just accumulated inside this module and then sorted.
here you can find even more detailed description of Dataflow programming languages.

How to access "key" in combine.perKey in beam

In How to create custom Combine.PerKey in beam sdk 2.0, I asked and got a correct answer on how to create a custom Combine.PerKey in the new beam sdk 2.0. However, I now need to create a custom combinePerKey such that within my custom CombinePerKey logic, I need to be able to access the contents of the key. This was easily possible in dataflow 1.x, but in the new beam sdk 2.0, I'm unsure how to do so. Any little code snippet/example would be extremely useful.
EDIT #1 (per Ben Chambers's request)
The real use case is hard to explain, but I'm going to try:
We have a 3d space composed of millions of little hills. We try to determine the apex of these millions of hills as follows: we create billions of "rectangular probes" for the whole 3d space, and then we ask each of these billions of probes to "move" in a greedy way to the apex. Once it hits the apex, it stops. The probe then returns the apex and itself. The apex is the KEY for which we'll do a custom combine by key.
Now, the custom combine function is going to finally return a final object (called a feature) which is derived from all the probes that reach the same apex (ie the same key). When generating this "feature" object, we need to know infomration about the final apex/key (ie the top of the hill). Hence, I need this key info.
One way to solve this is using a group by key, but that was slow (at least in df 1.x); we got it to be fast (in df 1.x) using a custom combine fn. So, we'd like the key. That said, groupbykey works in beam skd 2.0.
Alternatively, we could stick the "apex" information into the "probe" objects itself, but this means that each of our billions of probe objects now needs to be tripled in size just to hold this apex information (and this apex information repeats itself, since there are only say 1 million apexes but 1 billion probes, so this intuitively feels highly inefficient.)
Rather than relying on the CombineFn to compute the entire result, could you instead have the ComibneFn compute some partial result based only on information about the probes? Then your Combine.perKey(...) returns a PCollection<KV<Apex, InfoAboutProbes>> and you can use a ParDo to combine the information about the apex with the summary information about the probes. This allows you to use the CombineFn for efficiently combining information about many probes, while using a ParDo to access the key.

Best Practice ETL with Dataflow and Lookup

What's the best practice to implement a standard streaming ETL process which writes fact and some smaller dimensional tables to BigQuery?
I'm trying to understand how to handle the following things:
How to do a simple dimension lookup in a streaming pipeline?
In case the answer is sideInput - how to handle lookups for values that don't exist yet in the dimension? How to update the sideInput?
When side inputs receive late data on a specific window, they will be recomputed. If you do the lookup after this, then you'll be able to see the element in the side input.
Currently, the Beam model does not include semantics for re-triggering of the ParDo that consumes the side input, so you'd need to somehow make sure to (re)do de lookup after the side input has been computed.

How does "DHT search engine" work?

I'm interested in the Btdigg.org which is called a "DHT search engine". According to this article, it doesn't store any content and even has no database. Then how does it work? Doesn't it need to gather meta infos and store them in database like other normal search engines? After a user submit a query, it scans the DHT network and return the results in "real time"? Is this possible?
I don't have specific insight into BTDigg, but I believe the claim that there is not database (or something that acts like a database) is a false statement. The author of that article might have been referring to something more specific that you might encounter in a traditional torrent site, where actual .torrent files are stored for instance.
This is how a BTDigg-like site works:
You run a bunch of DHT nodes, specifically with the purpose of "eaves dropping" on DHT traffic, to be introduced to info-hashes that people talk about.
join those swarms and download the metadata (.torrent file) by using the ut_metadata extension
index the information you find in there, map it to the info-hash
Provide a front-end for that index
If you want to luxury it up a bit you can also periodically scrape the info-hashes you know about to gather stats over time and maybe also figure out when swarms die out and should be removed from the index.
So, the claim that you don't store .torrent files nor any content is true.
It is not realistic to search the DHT in real-time, because the DHT is not organized around keyword searches, you need to build and maintain the index continuously, "in the background".
EDIT:
Since this answer, an optimization (BEP 51) has been implemented in some DHT clients that lets you query which info-hashes they are hosting, significantly reducing the cost of indexing.
For a deep understanding of DHT and its applications, see Scott Wolchok's paper and presentation "Crawling BitTorrent DHTs for Fun and Profit". He presents the autonomous search engine idea as a sidenote to his study of DHT's security:
PDF of his paper:
https://www.usenix.org/legacy/event/woot10/tech/full_papers/Wolchok.pdf
His presentation at DEFCON 18 (parts 1 & 2)
http://www.youtube.com/watch?v=v4Q_F4XmNEc
http://www.youtube.com/watch?v=mO3DfLtKPGs
https://www.usenix.org/legacy/event/woot10/tech/full_papers/Wolchok.pdf
The method used in Section 3 seems to suggest a database to store all the torrent data is required. While performance is better, it may not be a true DHT search engine.
Section 8, while less efficient, seems to be a DHT search engine as long as the keywords are the store values.
From Section 3, Bootstrapping Bittorent Search:
"The system handles user queries by treating the
concatenation of each torrent's filenames and description as a
document in the typical information retrieval model and using an
inverted index to match keywords to torrents. This has the advantage
of being well supported by popular open-source relational DBMSs. We
rank the search results according to the popularity of the torrent,
which we can infer from the number of peers listed in the DHT"
From Section 8, Related Work:
the usual approach to distributing search using a DHT is
with an inverted index, by storing each (keyword, list of matching
documents) pair as a key-value pair in the DHT. Joung et al. [17]
describe this approach and point out its performance problems: the
Zipf distribution of keywords among files results in very skewed load
balance, document information is replicated once for each keyword in
the document, and it is difficult to rank documents in a distributed
environment
It is divided into two steps.
To achieve bep_0005 protocol got infohash, you do not need to implement all protocol requires only now find_node (request), get_peers (response), announce_peer (response). Here's one of my open source dhtspider.
To achieve bep_0009 protocol got metainfo index it, here are my own a bittorrent search engine, every day can get unique infohash 300w +, effective metainfo 50w +.

Resources