When are placeholders necessary? - machine-learning

Every TensorFlow example I've seen uses placeholders to feed data into the graph. But my applications work fine without placeholders. According to the documentation, using placeholders is the "best practice", but they seem to make the code unnecessarily complex.
Are there any occasions when placeholders are absolutely necessary?

According to the documentation, using placeholders is the "best practice"
Hold on, this quote is out-of-context and could be misinterpreted. Placeholders are the best practice when feeding data through feed_dict.
Using a placeholder makes the intent clear: this is an input node that needs feeding. TensorFlow even provides placeholder_with_default, which does not need feeding, but again the intent of such a node is clear. For all practical purposes, a placeholder_with_default does the same thing as a constant. You can indeed feed a constant to change its value, but would the intent be clear? Would that not be confusing? I doubt it would be clear.
There are other ways to input data than feeding and AFAICS all have their uses.

A placeholder is a promise to provide a value later.
A simple example is to define two placeholders, a and b, and then an operation on them, as below.
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b # + provides a shortcut for tf.add(a, b)
a and b are not initialized and contain no data, because they were defined as placeholders.
Another approach is to define variables with tf.Variable. In this case you have to provide an initial value when you declare the variable, and you have to run an initializer op before using it, such as:
tf.global_variables_initializer()
or, in older versions (now deprecated):
tf.initialize_all_variables()
This solution has two drawbacks:
Performance-wise, you need one extra step to call the initializer (however, these variables are updatable).
In some cases you do not know the initial values for these variables, so you have to define them as placeholders.
Conclusion:
Use tf.Variable for trainable variables such as the weights (W) and biases (B) of your model, or in general when initial values are required.
tf.placeholder allows you to create operations and build the computation graph without needing the data. In TensorFlow terminology, we then feed data into the graph through these placeholders.

I really like Ahmed's answer and I upvoted it, but I would like to provide an alternative explanation that might or might not make things a bit clearer.
One of the significant features of TensorFlow is that its operation graphs are compiled and then executed outside of the original environment used to build them. This allows TensorFlow to do all sorts of tricks and optimizations, like distributed, platform-independent calculations, graph interoperability, GPU computations, etc. But all of this comes at the price of complexity. Since your graph is being executed inside its own VM of some sort, you have to have a special way of feeding data into it from the outside, for example from your Python program.
This is where placeholders come in. One way of feeding data into your model is to supply it via a feed dictionary when you execute a graph op. To indicate where inside the graph this data is supposed to go, you use placeholders. This way, as Ahmed said, a placeholder is a sort of promise for data supplied in the future. It is literally a placeholder for things you will supply later. To use an example similar to Ahmed's:
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

# define graph to do matrix multiplication
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# this is the actual operation we want to do,
# but since we want to supply x and y at runtime
# we will use placeholders
model = tf.matmul(x, y)
# now let's supply the data and run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    # generate some data for our graph
    data_x = np.random.randint(0, 10, size=[5, 5])
    data_y = np.random.randint(0, 10, size=[5, 5])
    # do the work
    result = session.run(model, feed_dict={x: data_x, y: data_y})
There are other ways of supplying data to the graph, but arguably placeholders and feed_dict are the most comprehensible way, and they provide the most flexibility.
If you want to avoid placeholders, other ways of supplying data are either loading the whole dataset into constants on graph build or moving the whole process of loading and pre-processing the data into the graph by using input pipelines. You can read up on all of this in the TF documentation.
https://www.tensorflow.org/programmers_guide/reading_data

Related

How to change clipping and noise parameters during differentially private training with Tensorflow Federated

I'm using Tensorflow Federated (TFF) to train with differential privacy. Currently I am creating a Tensorflow Privacy NormalizedQuery and then passing it into a TFF DifferentiallyPrivateFactory to create an AggregationProcess:
_weights_type = tff.learning.framework.weights_type_from_model(placeholder_model)
query = tensorflow_privacy.GaussianSumQuery(l2_norm_clip=10.0, stddev=0.1)
query = tensorflow_privacy.NormalizedQuery(query, 20)
agg_proc = tff.aggregators.DifferentiallyPrivateFactory(query)
agg_proc = agg_proc.create(_weights_type.trainable)
After broadcasting the server state to clients I run a client update function and then use the AggregationProcess like this:
agg_output = agg_proc.next(
server_state.delta_aggregate_state,
client_outputs.weights_delta)
This works great, however I want to experiment with changing the l2_norm_clip and stddev several times during training (making clipping bigger and smaller at various training rounds) but it seems I can only set these parameters when I create the AggregationProcess.
Is it possible to change these parameters during training somehow?
I can think of two ways to do what you want: the easy way and the right way.
The right way is to make a new type of DPQuery that keeps track of the training round in its global state and adjusts the clip and stddev the way you want in its get_noised_result function. Then you can pass this new DPQuery to tff.aggregators.DifferentiallyPrivateFactory and use it like normal.
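The round-dependent logic that such a custom query would carry can be sketched in plain Python, independently of TFF. This only illustrates the schedule itself and is not the tensorflow_privacy API: `RoundSchedule`, its fields, and the milestone values are hypothetical names chosen for this example. A real DPQuery would keep the round counter in its global state and consult something like this when producing the noised result.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RoundSchedule:
    """Hypothetical piecewise-constant schedule for (l2_norm_clip, stddev).

    milestones[i] is the first round at which params[i] applies;
    milestones must be sorted and start at 0.
    """
    milestones: List[int]
    params: List[Tuple[float, float]]

    def at(self, round_num: int) -> Tuple[float, float]:
        # bisect_right counts how many milestones are <= round_num.
        idx = bisect_right(self.milestones, round_num) - 1
        return self.params[idx]


# Assumed example values: tighten the clip at round 100, loosen it at round 300.
schedule = RoundSchedule(
    milestones=[0, 100, 300],
    params=[(10.0, 0.1), (5.0, 0.05), (20.0, 0.2)],
)

for r in (0, 99, 100, 299, 300):
    clip, stddev = schedule.at(r)
    print(f"round {r}: l2_norm_clip={clip}, stddev={stddev}")
```

Keeping the schedule as data rather than as branching code makes it easy to log which parameters were in effect for each round.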
The easy way is to directly hack into the server_state.delta_aggregate_state. Somewhere in there you should find the global state of the DPQuery which should contain the l2_norm_clip and stddev which you can manipulate directly between rounds. This approach may be brittle because the aggregator state and the DPQuery state representations may be subject to change.

TensorFlowFederated: Passing tensor to tff.federated_computation

I have trialled TFF tutorial (MNIST) on my single machine and now I am trying to perform a multi-machine process using MNIST data.
Clearly, I cannot use create_tf_dataset_for_client so I have used GRPC to learn how to pass data from one machine to another.
My scenario is that Server will dispatch the initial model (with zeroes) to all the participating clients where the model will run on local data. Each client will dispatch the new weights to the server that will perform federated_mean.
I was thinking of using tff.learning.build_federated_averaging_process where I could hopefully customise the next function (2nd argument) but I failed... I am not even sure if we use this approach to send the model and get the weights back from remote clients.
Then I thought I could use tff.federated_mean under the @tff.federated_computation decorator. However, since the weights are arrays and I have a list of them (as I have a number of clients), I am unable to understand how to create a tff.FederatedType that points to a list of lists. Any help from someone who has modelled federation on a distributed dataset would be handy.
Regards,
Dev.
TFF computations are designed to be platform/runtime agnostic; a single computation can be executed by several different backends.
TFF's type system can be helpful here in reasoning about how data is expected to flow in your computation. See the custom federated algorithms part 1 tutorial for an intro to how TFF thinks about types.
The result of build_federated_averaging_process expects an argument of datasets which are placed at clients; for a dataset of element type T, in TFF's usual notation this would be denoted {T*}@C. This particular signature is agnostic with respect to how the datasets arrive at the clients, or indeed how the clients themselves are represented.
Materializing the data and representing the clients is really the job of the runtime. TFF provides a few so-called native options here.
For example, in the local Python runtime clients are represented by threads on your local machine. Datasets are simply eager tf.data.Dataset objects, and the threads pull data from the datasets during training.
In the remote Python runtime, clients are represented by (threads on) remote workers, so that a single remote worker could be running more than one client. In this case, as you note, data must be materialized on the remote worker in order to train.
There are several options for accomplishing this.
One, TFF will actually handle serialization and deserialization of eager datasets across this RPC connection for you, so you could use the identical pattern of specifying data as in the local runtime, and it should "just work". This pattern actually got significantly better in March of 2021, via the use of tf.raw_ops.DatasetToGraphV2.
Perhaps better mapping to the concepts of federated computation, however, is the use of some library functions to simply instantiate the datasets on the workers.
Suppose you have an iterative process ip, which accepts a state and data argument, where data is of type {T*}@C. Suppose further we have a TFF computation get_dataset_for_client_id, which accepts a string and returns a dataset of appropriate type (i.e., its TFF type signature is tf.string -> T*).
Then we can compose these two computations into another:
@tff.federated_computation(STATE_TYPE, tff.FederatedType(tf.string, tff.CLIENTS))
def new_next(state, client_ids):
    datasets_on_clients = tff.federated_map(get_dataset_for_client_id, client_ids)
    return ip.next(state, datasets_on_clients)
new_next now requires the controller to only specify the ids of clients on which to train, and delegates responsibility for pointing to a data store to whoever is representing the clients.
I think this pattern is likely what you want; TFF provides some helpers like the dataset_computation attribute on tff.simulation.ClientData and tff.simulation.compose_dataset_computation_with_iterative_process, which will more or less perform the wiring we did above for you.
Let's do this step by step. Please let us know if the explanation below answers your question.
Let's start with an example of TF (non-federated, just local) code that takes a dataset and does something with it, say add numbers:
@tff.tf_computation(tff.SequenceType(tf.int32))
def process_data(ds):
    return ds.reduce(np.int32(0), lambda x, y: x + y)
This code takes a dataset of integer numbers at input, and returns a single integer with the sum at output.
You can confirm this by looking at the type signature, like this:
str(process_data.type_signature)
You should see this:
(int32* -> int32)
So, process_data takes a set of integers, and returns an integer.
Now, using TFF's federated operators we can create a federated computation that does this on multiple clients, like this:
@tff.federated_computation(tff.FederatedType(tff.SequenceType(tf.int32), tff.CLIENTS))
def process_data_on_clients(federated_ds):
    return tff.federated_map(process_data, federated_ds)
If you look at the type signature of this new computation (just like above), you will see this:
({int32*}@CLIENTS -> {int32}@CLIENTS)
It means process_data_on_clients takes a federated set of integers (one set per client), and returns a federated integer (one integer with the sum on each client).
What happens above is that the TF logic in process_data will be executed once on each client. This is how the federated_map operator works.
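As a rough mental model only, federated_map behaves like applying the same function independently to each client's value. The sketch below is plain Python, not TFF, and it ignores placement and remote execution; `federated_map_analogy` is a hypothetical name, and `process_data` here is an ordinary Python stand-in for the TF computation of the same name.

```python
from functools import reduce
from typing import Callable, Iterable, List


def process_data(ds: Iterable[int]) -> int:
    # Same logic as the TF computation: sum the dataset, starting from 0.
    return reduce(lambda x, y: x + y, ds, 0)


def federated_map_analogy(fn: Callable, per_client_values: List) -> List:
    # One independent call per client; in TFF each call would run on that client.
    return [fn(value) for value in per_client_values]


# Three clients, each holding its own local dataset.
client_datasets = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
sums = federated_map_analogy(process_data, client_datasets)
print(sums)  # [3, 6, 10]
```

The result has one entry per client, which is what the curly braces in the federated type signature express.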
Now, process_data_on_clients is a little bit like the iterative process you are working with. It wants you to provide a federated dataset as an argument.
Let's see how we can make one by following the same pattern as above.
Here's some TF code that creates a single dataset with integers. Say you supply an integer n and want to create a dataset with numbers from 1 up to n, i.e., {1, 2, ..., n}:
@tff.tf_computation(tf.int32)
def make_data(n):
    return tf.data.Dataset.range(tf.cast(n, tf.int64)).map(lambda x: tf.cast(x + 1, tf.int32))
This is obviously a silly example, you could do something more along the lines of what you need (e.g., read data from a file specified by a name, etc.).
And here's what its type signature looks like:
(int32 -> int32*)
You can see the similarity to process_data.
And, just like with processing data, here's how we can make data on all clients by using the federated_map operator:
@tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_data_on_clients(federated_n):
    return tff.federated_map(make_data, federated_n)
This is the type signature:
({int32}@CLIENTS -> {int32*}@CLIENTS)
Great, so make_data_on_clients takes a federated integer (that tells us how many data items to produce on each client), and returns a federated dataset, just like what process_data_on_clients wants.
You can check that the two work together as intended:
federated_n = [2, 3, 4]
federated_ds = make_data_on_clients(federated_n)
result = process_data_on_clients(federated_ds)
result
You should get the sums 1+2, 1+2+3, and 1+2+3+4 on the 3 clients involved in this computation (note there were 3 numbers in the federated integer above, so there are 3 clients here):
[<tf.Tensor: shape=(), dtype=int32, numpy=3>,
<tf.Tensor: shape=(), dtype=int32, numpy=6>,
<tf.Tensor: shape=(), dtype=int32, numpy=10>]
Note that all TF code you have seen so far, including both dataset creation and dataset reduce, were being executed on the clients (using federated_map).
Now, you can put the two together:
@tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_and_process_data_on_clients(federated_n):
    federated_ds = make_data_on_clients(federated_n)
    return process_data_on_clients(federated_ds)
And now, you can invoke the make and process data combo in one shot:
make_and_process_data_on_clients(federated_n)
Again, all TF code here is executing on clients, just like in the above.
So where does this leave you?
Going back to Keith's explanation, the iterative process you got from TFF wants a federated dataset at input, just like process_data_on_clients.
The function get_dataset_for_client_id in Keith's example is like our make_data in that it is assumed to contain TensorFlow code that you want to run on each client to physically construct a dataset on that client.
In our silly example, the dataset construction logic used range, but it can be anything. For example, you could load data on each client from the same local file my_data, or using a custom TF op, or by whatever other means. Just like in our example, you can pass parameters to that function to give you more centralized control (similar to what we did above with the federated integer).
The code snippet new_next in Keith's example is just like our make_and_process_data_on_clients, in that it combines two federated computations: one that makes federated data on clients (supplied by you, just as discussed here), and one that processes that data (from tff.learning, the iterative process).
Does this help?
If still unclear, I would recommend to try the examples I included above on your distributed setup, since you already have one. You could inject some TF print ops to that code to confirm that the TF code you wrote is executing on the client machines in your system.
Once you get that part, it's a simple tweak to replace the silly dataset construction logic in make_data with one that loads a dataset on each client from whatever local data source you are using.
EDITS:
Re: how to print, any TensorFlow code that appears in the body of a @tff.tf_computation is executed in eager mode, and you can use standard TensorFlow mechanisms such as tf.print to print from within TensorFlow.
tensorflow.org/api_docs/python/tf/print
On how to configure a multi-machine system with multiple worker nodes, see the Kubernetes tutorial. Note that the machine that drives the process connects to worker nodes, not the other way round.
https://www.tensorflow.org/federated/tutorials/high_performance_simulation_with_kubernetes

How does one dynamically add new parameters to optimizers in Pytorch?

I was going through this post in the PyTorch forum, and I also wanted to do this. The original post removes and adds layers, but I think my situation is not that different. I also want to add layers or more filters or word embeddings. My main motivation is that the AI agent does not know the whole vocabulary/dictionary in advance because it's large. I strongly prefer (for the moment) not to do character-by-character RNNs.
So what will happen for me is when the agent starts a forward pass it might find new words it has never seen and will need to add them to the embedding table (or perhaps add new filters before it starts the forward pass).
So what I want to make sure is:
embeddings are added correctly (at the right time, when a new computation graph is made) so that they are updatable by the optimizer
no issues with stored info of past parameters, e.g. if it's using some sort of momentum
How does one do this? Any sample code that works?
Just to add an answer to the title of your question: "How does one dynamically add new parameters to optimizers in Pytorch?"
You can append params at any time to the optimizer:
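import torch
import torch.optim as optim

model = torch.nn.Linear(2, 2)
# Initialize optimizer (Adam takes betas rather than a momentum argument)
optimizer = optim.Adam(model.parameters(), lr=0.001)
# requires_grad=True so the optimizer has gradients to apply
extra_params = torch.randn(2, 2, requires_grad=True)
# add_param_group fills in the optimizer defaults (lr, betas, ...) for the new group
optimizer.add_param_group({'params': extra_params})
# then you can print your `extra_params`
print("extra params", extra_params)
print("optimizer params", optimizer.param_groups)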
That is a tricky question, as I would argue that the answer is "depends", in particular on how you want to deal with the optimizer.
Let's start with your specific problem - an embedding. In particular, you are asking on how to add embeddings to allow for a larger vocabulary dynamically. My first advice is, that if you have a good sense of an upper boundary of your vocabulary size, make the embedding large enough to cope with it from the beginning, as this is more efficient, and as you will need the memory eventually anyway. But this is not what you asked. So - to dynamically change your embedding, you'll need to overwrite your old one with a new one, and inform your optimizer of the change. You can simply do that whenever you run into an exception with your old embedding, in a try ... except block. This should roughly follow this idea:
# from within whichever module owns the embedding
# remember the already trained weights
old_embedding_weights = self.embedding.weight.data
# create a new embedding of the new size
self.embedding = nn.Embedding(new_vocab_size, embedding_dim)
# initialize the values for the new embedding. this does random, but you might want to use something like GloVe
new_weights = torch.randn(new_vocab_size, embedding_dim)
# as your old values may have been updated, you want to retrieve these updates values
new_weights[:old_vocab_size] = old_embedding_weights
self.embedding.weight.data.copy_(new_weights)
However, you should not do this for every single new word you receive, as this copying takes time (and a whole lot of memory, as the embedding exists twice for a short time; if you're nearly out of memory, just make your embedding large enough from the start). So instead increase the size dynamically by a couple of hundred slots at a time.
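Growing in chunks could look roughly like the sketch below. It is a minimal illustration, not a drop-in implementation: `grow_embedding` and the chunk size of 200 are assumptions made for this example, the new rows simply keep the default random initialization of nn.Embedding, and device placement and padding indices are ignored.

```python
import torch
import torch.nn as nn


def grow_embedding(old: nn.Embedding, extra_slots: int = 200) -> nn.Embedding:
    """Return a larger embedding whose first rows are copied from `old`."""
    old_size, dim = old.weight.shape
    # The extra rows keep nn.Embedding's default random initialization.
    new = nn.Embedding(old_size + extra_slots, dim)
    with torch.no_grad():
        # Keep the already-trained vectors at their original indices.
        new.weight[:old_size] = old.weight
    return new


embedding = nn.Embedding(1000, 16)
bigger = grow_embedding(embedding, extra_slots=200)
print(bigger.weight.shape)  # torch.Size([1200, 16])
```

The returned module holds a brand-new parameter object, so the owning module has to be pointed at it and the optimizer has to be told about it.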
Additionally, this first step already raises some questions:
How does my respective nn.Module know about the new embedding parameter?
The __setattr__ method of nn.Module takes care of that (see here)
Second, why don't I simply change my parameter? That's already pointing towards some of the problems of changing the optimizer: pytorch internally keeps references by object ID. This means that if you change your object, all these references will point towards a potentially incompatible object, as its properties have changed. So we should simply create a new parameter instead.
What about other nn.Parameters or nn.Modules that are not embeddings? These you treat the same. You basically just instantiate them, and attach them to their parent module. The __setattr__ method will take care of the rest. So you can do so completely dynamically ...
Except, of course, the optimizer. The optimizer is the only other thing that "knows" about your parameters except for your main model-module. So you need to let the optimizer know of any change.
And this is tricky, if you want to be sophisticated about it, and very easy if you don't care about keeping the optimizer state. However, even if you want to be sophisticated about it, there is a very good reason why you probably should not do this anyways. More about that below.
Anyways, if you don't care, a simple
# simply overwrite your old optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)
will do. If you do care, however, and want to transfer your old state, you can do so the same way that you can store, and later load, parameters and optimizer states from disk: using the .state_dict() and .load_state_dict() methods. This, however, only works with a twist:
# extract the state dict from your old optimizer
old_state_dict = optimizer.state_dict()
# create a new optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)
new_state_dict = optimizer.state_dict()
# the old state dict will have references to the old parameters, in state_dict['param_groups'][xyz]['params'] and in state_dict['state']
# you now need to find the parameter mismatches between the old and new state dicts
# if your optimizer has multiple param groups, you need to loop over them, too (I use xyz as a placeholder here; mostly you'll only have 1 anyway, so just replace xyz with 0)
new_pars = [p for p in new_state_dict['param_groups'][xyz]['params'] if p not in old_state_dict['param_groups'][xyz]['params']]
old_pars = [p for p in old_state_dict['param_groups'][xyz]['params'] if p not in new_state_dict['param_groups'][xyz]['params']]
# then you remove all the outdated ones from the state dict
for pid in old_pars:
    # pop with a default, since not every parameter has an entry in 'state'
    old_state_dict['state'].pop(pid, None)
# and add a new state for each new parameter to the state:
for pid in new_pars:
    old_state_dict['param_groups'][xyz]['params'].append(pid)
    old_state_dict['state'][pid] = { ... } # your new state def here, depending on your optimizer
However, here's the reason why you should probably never update your optimizer like this, but should instead re-initialize from scratch, and just accept the loss of state information: When you change your computation graph, you change forward and backward computation for all parameters along your computation path (if you do not have a branching architecture, this path will be your entire graph). This more specifically means, that the input to your functions (=layer/nn.Module) will be different if you change some function (=layer/nn.Module) applied earlier, and the gradients will change if you change some function (=layer/nn.Module) applied later. That in turn invalidates the entire state of your optimizer. So if you keep your optimizer's state around, it will be a state computed for a different computation graph, and will probably end up in catastrophic behavior on part of your optimizer, if you try to apply it to a new computation graph. (I've been there ...)
So - to sum it up: I'd really recommend to try to keep it simple, and to only change a parameter as conservatively as possible, and not to touch the optimizer.
If you want to customize initial params:
from itertools import chain

import torch.nn as nn
import torch.optim as optim

l1 = nn.Linear(3, 3)
l2 = nn.Linear(2, 3)
optimizer = optim.SGD(chain(l1.parameters(), l2.parameters()), lr=0.01, momentum=0.9)
The key is that the first argument of the constructor accepts an iterable of parameters.

Beam/Dataflow design pattern to enrich documents based on database queries

Evaluating Dataflow, and am trying to figure out if/how to do the following.
My apologies if anything below is trivial; we are trying to wrap our heads around Dataflow before we make a decision on using Beam, or something else like Spark, etc.
General use case is for machine learning:
Ingesting documents which are individually processed.
In addition to easy-to-write transforms, we'd like to enrich each document based on queries against databases (that are largely key-value stores).
A simple example would be a gazetteer: decompose the text into ngrams, and then check if those ngrams reside in some database, and record (within a transformed version of the original doc) the entity identifier given phrases map to.
How to do this efficiently?
NAIVE (although possibly tricky with the serialization requirement?):
Each document could simply query the database individually (similar to Querying a relational database through Google DataFlow Transformer), but, given that most of these are simple key-value stores, it seems like there should be a more efficient way to do this (given the real problems with database query latency).
SCENARIO #1: Improved?:
Current strawman is to store the tables in Bigquery, pull them down (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py), and then use them as side inputs, that are used as key-value lookups within the per-doc function(s).
Key-value tables range from generally very small to not-huge (100s of MBs, maybe low GBs). Multiple CoGroupByKey with same key apache beam ("Side inputs can be arbitrarily large - there is no limit; we have seen pipelines successfully run using side inputs of 1+TB in size") suggests this is reasonable, at least from a size POV.
1) Does this make sense? Is this the "correct" design pattern for this scenario?
2) If this is a good design pattern...how do I actually implement this?
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L53 shows feeding the result to the document function as an AsList.
i) Presumably, AsDict is more appropriate here, for the above use case? So I'd probably need to run some transformations first on the Bigquery output to separate it into key, value tuple; and make sure that the keys are unique; and then use it as a side input.
ii) Then I need to use the side input in the function.
What I'm not clear on:
for both of these, how to manipulate the output coming off of the Bigquery pull is murky to me. How would I accomplish (i) (assuming it is necessary)? Meaning, what does the data format look like (raw bytes? strings? is there a good example I can look into?)
Similarly, if AsDict is the correct way to pass it into the func, can I just reference things like a dict normally is used in python? e.g., side_input.get('blah') ?
SCENARIO #2: Even more improved? (for specific cases):
The above scenario, if achievable, definitely does seem superior to continuous remote calls (given the simple key-value lookup), and would be very helpful for some of our scenarios. But if I take a scenario like a gazetteer lookup (like above)...is there an even more optimized solution?
Something like, for every doc, writing out all the ngrams as keys, with values as the underlying indices (docid+indices within the doc), and then doing some sort of join between these ngrams and the phrases in our gazetteer...and then doing another set of transforms to recover the original docs (now w/ their new annotations).
I.e., let Beam handle all of the joins/lookups directly?
Theoretical advantage is that Beam may be a lot quicker in doing this than, for each doc, looping over all of the ngrams and doing a check if the ngram is in the side_input.
Other key issues:
3) If this is a good way to do things, is there any trick to making this work well in the streaming scenario? Text elsewhere suggests that the side input caching works more poorly outside the batch scenario. Right now, we're focused on batch, but streaming will become relevant in serving live predictions.
4) Any Beam-related reason to prefer Java>Python for any of the above? We've got a good amount of existing Python code to move to Dataflow, so would heavily prefer Python...but not sure if there are any hidden issues with Python in the above (e.g., I've noticed Python doesn't support certain features or I/O).
EDIT: Strawman? for the example ngram lookup scenario (should generalize strongly to general K:V lookup)
Phrases = get from bigquery
Docs (indexed by docid) (direct input from text or protobufs, e.g.)
Transform: phrases -> (phrase, entity) tuples
Transform: docs -> ngrams (phrase, docid, coordinates [in document])
CoGroupByKey key=phrase: (phrase, entity, docid, coords)
CoGroupByKey key=docid, group((phrase, entity, docid, coords), Docs)
Then we can iteratively finalize each doc, using the set of (phrase, entity, docid, coords) and each Doc
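The strawman steps above can be traced on toy data in plain Python before committing to a pipeline. This is only a sketch of the data flow, not Beam code: the gazetteer contents, the documents, and the `ngrams` helper are made up for illustration, and the two dictionaries stand in for the two grouping steps.

```python
from collections import defaultdict

# Toy gazetteer: phrase -> entity id (stands in for the BigQuery table).
gazetteer = {"new york": "E1", "york": "E2"}

docs = {"d1": "i love new york", "d2": "york is old"}


def ngrams(text, max_n=2):
    """Yield (phrase, start_token_index) for all 1..max_n-grams."""
    tokens = text.split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]), i


# Step 1: docs -> (phrase, docid, coords), grouped by phrase.
by_phrase = defaultdict(list)
for docid, text in docs.items():
    for phrase, start in ngrams(text):
        by_phrase[phrase].append((docid, start))

# Step 2: join on phrase with the gazetteer, then regroup by docid.
annotations = defaultdict(list)
for phrase, entity in gazetteer.items():
    for docid, start in by_phrase.get(phrase, []):
        annotations[docid].append((phrase, entity, start))

for docid in docs:
    print(docid, sorted(annotations[docid], key=lambda a: a[2]))
```

Only phrases that actually occur in the gazetteer survive the join, so the second grouping is much smaller than the full set of ngrams.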
Regarding the scenarios for your pipeline:
Naive scenario
You are right that per-element querying of a database is undesirable.
If your key-value store is able to support low-latency lookups by reusing an open connection, you can define a global connection that is initialized once per worker instead of once per bundle. This should be acceptable if your k-v store supports efficient lookups over existing connections.
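The reuse logic behind such a per-worker connection can be sketched in plain Python. `FakeKVClient` is a hypothetical stand-in for whatever client your store provides, and the sketch ignores thread safety and reconnects. In a Beam DoFn the same idea is usually placed in the DoFn's setup hook; this example only shows the "create once, reuse for every element" part.

```python
class FakeKVClient:
    """Hypothetical stand-in for a real key-value store client."""
    instances_created = 0

    def __init__(self):
        FakeKVClient.instances_created += 1
        self._data = {"new york": "E1"}

    def get(self, key):
        return self._data.get(key)


_client = None  # one per process, shared by everything that worker runs


def get_client():
    """Create the client on first use and reuse it afterwards."""
    global _client
    if _client is None:
        _client = FakeKVClient()
    return _client


def enrich(doc):
    # Every element reuses the same open client instead of reconnecting.
    return {**doc, "entity": get_client().get(doc["key"])}


docs = [{"key": "new york"}, {"key": "paris"}, {"key": "new york"}]
enriched = [enrich(d) for d in docs]
print(enriched)
print("clients created:", FakeKVClient.instances_created)  # 1
```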
Improved scenario
If that's not feasible, then BQ is a great way to keep and pull in your data.
You can definitely use AsDict side inputs, and simply go side_input[my_key] or side_input.get(my_key).
Your pipeline could look something like so:
import apache_beam as beam

kv_query = "SELECT key, value FROM my:table.name"
p = beam.Pipeline()
documents_pcoll = p | ReadDocuments()
additional_data_pcoll = (p
                         | beam.io.Read(beam.io.BigQuerySource(query=kv_query))
                         # Make row a key-value tuple.
                         | 'format bq' >> beam.Map(lambda row: (row['key'], row['value'])))
enriched_docs = (documents_pcoll
                 | 'join' >> beam.Map(lambda doc, query: enrich_doc(doc, query[doc['key']]),
                                      query=beam.pvalue.AsDict(additional_data_pcoll)))
Unfortunately, this has one shortcoming, and that's the fact that Python does not currently support arbitrarily large side inputs (it currently loads all of the K-V into a single Python dictionary). If your side-input data is large, then you'll want to avoid this option.
Note: This will change in the future, but we can't be sure at the moment.
Further Improved
Another way of joining two datasets is to use CoGroupByKey. The loading of documents, and of K-V additional data should not change, but when joining, you'd do something like so:
# Turn the documents into key-value tuples as well
documents_kv_pcoll = (documents_pcoll
                      | 'format docs' >> beam.Map(lambda doc: (doc['key'], doc)))
enriched_docs = ({'docs': documents_kv_pcoll, 'additional_data': additional_data_pcoll}
                 | beam.CoGroupByKey()
                 # Each element is a (key, {'docs': [...], 'additional_data': [...]}) pair.
                 | 'enrich' >> beam.Map(lambda kv: enrich_doc(kv[1]['docs'][0], kv[1]['additional_data'][0])))
CoGroupByKey will allow you to use arbitrarily large collections on either side.
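To see what shape of data the enrich step receives, here is a plain-Python imitation of the grouping. It is not Beam code and `co_group_by_key` is a hypothetical helper; it only mimics the result, where each key maps to one list of values per tagged input (in Beam each output element is a pair of the key and that grouped dictionary).

```python
from collections import defaultdict


def co_group_by_key(**tagged_kvs):
    """Plain-Python imitation of the grouped shape: key -> {tag: [values]}."""
    grouped = defaultdict(lambda: {tag: [] for tag in tagged_kvs})
    for tag, kvs in tagged_kvs.items():
        for key, value in kvs:
            grouped[key][tag].append(value)
    return dict(grouped)


docs = [("k1", {"text": "doc one"}), ("k2", {"text": "doc two"})]
additional_data = [("k1", "E1")]

result = co_group_by_key(docs=docs, additional_data=additional_data)
for key, groups in result.items():
    print(key, groups)
# k2 has an empty 'additional_data' list, so indexing [0] on it would fail.
```

Keys that appear on only one side still show up, with an empty list for the other tag, so an enrich function that indexes the first element should check for missing matches first.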
Answering your questions
You can see an example of using BigQuery as a side input in the cookbook. As you can see there, the data comes parsed (I believe that it comes in their original data types, but it may come in string/unicode). Check the docs (or feel free to ask) if you need to know more.
Currently, Python streaming is in alpha, and it does not support side inputs; but it does support shuffle features such as CoGroupByKey. Your pipeline using CoGroupByKey should work well in streaming.
A reason to prefer Java over Python is that all these features work in Java (unlimited-size side inputs, streaming side inputs). But it seems that for your use case, Python may have all you need.
Note: The code snippets are approximate, but you should be able to debug them using the DirectRunner.
Feel free to ask for clarification, or to ask about other aspects if you feel like it'd help.

Why do we need to explicitly update the moving_mean and moving_variance in TensorFlow's Batch normalization in tf.contrib.layers.batch_norm?

Too Long To Read: How can I use Batch Normalization with tf.contrib.layers.batch_norm without having to explicitly tell the session whether or not to update the moving statistics (moving_mean and moving_variance)?
A few months ago I provided an answer to How could I use Batch Normalization in TensorFlow? and noticed a few weird details that I wanted to address. First, the implementation I provided seems repetitive with respect to the is_training variable. Recall my suggested code:
import tensorflow as tf
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm

def batch_norm_layer(x, train_phase, scope_bn):
    # Training branch: normalizes with batch statistics and updates the moving averages in place.
    bn_train = batch_norm(x, decay=0.999, center=True, scale=True,
                          updates_collections=None,
                          is_training=True,
                          reuse=None,  # is this right?
                          trainable=True,
                          scope=scope_bn)
    # Inference branch: reuses the same variables and normalizes with the moving averages.
    bn_inference = batch_norm(x, decay=0.999, center=True, scale=True,
                              updates_collections=None,
                              is_training=False,
                              reuse=True,  # is this right?
                              trainable=True,
                              scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
    return z
In it, train_phase is just a boolean placeholder, tf.placeholder(tf.bool, name='phase_train'). As you can see, it decides whether the batch norm layer should be in inference mode or not. However, the variable seems a little redundant, since the same thing is specified twice: once in train_phase and again in is_training. Is that really necessary?
I thought about it a bit, and it seems I might be able to remove the hard-coded is_training=True/False with the (pseudo)code:
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm

def batch_norm_layer(x, train_phase, scope_bn):
    bn = batch_norm(x, decay=0.999, center=True, scale=True,
                    updates_collections=None,
                    is_training=get_bool(train_phase),
                    reuse=None,  # is this right?
                    trainable=True,
                    scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn, lambda: bn)
    return z
which seems to make the train_phase variable completely redundant/silly. This highlights my most important point: are the train_phase variable and tf.cond(train_phase, lambda: bn_train, lambda: bn_inference) even necessary? That brings up my biggest complaint about the code (though I think this code might not even run, because the placeholder train_phase might not have a value while the graph is being defined, but you get the idea).
Honestly, I find having to explicitly define train_phase very dangerous, because it seems unnecessary for users to handle the inference/training mode of Batch Norm this explicitly. "Normal" users of Batch Norm should always update moving_mean and moving_variance with the training data, and no standard user should be updating them with test statistics at any time. Since the user is required to do:
sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys, phase_train: True})
it can cause really bad bugs that shouldn't even exist in the first place (at least in my opinion). Furthermore, it seems weird to have to say explicitly what phase_train is: whenever one trains, one uses an optimizer, so it should be entirely clear when that code is called that the flag should be true. Maybe this is a terrible idea, but it feels like the optimizer or the session should be setting it to true automatically rather than relying on the user to do it right.
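The kind of bug being worried about here is easier to see in a stripped-down sketch. The class below is a pure-Python toy operating on scalars. It is not the tf.contrib.layers.batch_norm implementation, and TinyBatchNorm and its arguments are illustrative names. It only shows that the training path updates the moving statistics while the inference path reads them.

```python
# Minimal pure-Python batch norm on scalars; names are illustrative and this
# is not the tf.contrib.layers.batch_norm implementation.

class TinyBatchNorm:
    def __init__(self, decay=0.999, eps=1e-3):
        self.decay = decay
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_variance = 1.0

    def __call__(self, batch, is_training):
        if is_training:
            mean = sum(batch) / len(batch)
            variance = sum((v - mean) ** 2 for v in batch) / len(batch)
            # Only the training path touches the moving statistics.
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_variance = (self.decay * self.moving_variance
                                    + (1 - self.decay) * variance)
        else:
            mean, variance = self.moving_mean, self.moving_variance
        return [(v - mean) / (variance + self.eps) ** 0.5 for v in batch]

bn = TinyBatchNorm(decay=0.5)
bn([1.0, 3.0], is_training=True)        # updates the moving statistics
stats_after_train = (bn.moving_mean, bn.moving_variance)
bn([100.0, 300.0], is_training=False)   # leaves them untouched
stats_after_eval = (bn.moving_mean, bn.moving_variance)
```

Passing is_training=True on the second call would have folded the test batch into the moving averages, which is exactly the mistake a wrongly fed flag makes possible.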
I understand that sometimes users are given more flexibility so they can be more creative, but I can't really see how this could be a good feature (even for a researcher). Maybe I am just using the library incorrectly or being paranoid, but should the user really be forced to be so explicit when using batch norm? Is there some way around this?
As a side point, having phase_train be part of the model also makes the code uglier and more confusing than seems necessary, because it seems unavoidable to have a line of code around the session call that checks whether the batch norm flag is on or not. The code I am trying to avoid writing is the logic:
if batch_norm:
    # during training
    sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys, phase_train: True})
else:
    # with no batch norm
    sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
It just feels totally unnecessary. It feels like during training the model should know whether it should be updating the variables or not.
As a quick (really ugly) solution to the last problem with the if condition around the session call, one can always define phase_train as part of the model (or at least as part of the graph) and set it to true or false as appropriate. When the batch norm layer is not used, though, the phase_train placeholder is not actually used by the model even though we give it a value in session.run. In other words, the session sets it to true or false, but when BN is not being used it doesn't matter what one sets it to, since it's not actually being used. Obviously, this makes the code really confusing (since one is defining a variable one doesn't even need), but I can't seem to find a way to hide the phase_train variable. For the moment this is what I am going with, because it seems really ugly to have to split (or duplicate) my code between lines that have:
sess.run(fetches=..., feed_dict={..., phase_train: False})
and the ones that don't have it at all:
sess.run(fetches=..., feed_dict={...})
Ideally I want the second solution, with batch norm working regardless of whether I use the silly phase_train variable.
I don't really have a complete answer to your question, but I have a few observations:
The standard practice seems to be to build slightly different graphs for training and for inference, each built with or without is_training enabled.
The batch_norm layer is designed so that you can use an arg_scope to set is_training=True for all layers in your model. For example, take a look at how the Inceptionv3 model is defined here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/nets/inception_v3.py#L571 . This at least makes it much more convenient to set is_training once in your Python code that builds a model and to have it apply everywhere.
TensorFlow's underlying infrastructure doesn't distinguish between training and inference time; it's just running graphs of operators. tf.Session doesn't really know anything about neural networks, training, or inference, so it isn't the right place for this kind of logic.
One could imagine that an Optimizer should rewrite the graph to enable is_training for those operators that support it. I don't have a strong opinion about this; you might try filing a TensorFlow GitHub issue making that feature request to see what others think about it. Such a rewrite might seem a bit too "magical".
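The first two observations can be illustrated without TensorFlow. In the sketch below, make_model is a hypothetical stand-in for a graph-building function: the mode is chosen once, when the model is built, and both models share the same statistics, so nothing has to be fed at run time. It is a plain-Python analogy under those assumptions, not how TensorFlow builds graphs.

```python
# Illustrative only: plain-Python stand-ins for "graphs" that share state.

def make_model(shared_stats, is_training, decay=0.9):
    """Return a callable that centres a batch; the mode is fixed at build time."""
    def run(batch):
        if is_training:
            mean = sum(batch) / len(batch)
            shared_stats['moving_mean'] = (
                decay * shared_stats['moving_mean'] + (1 - decay) * mean)
        else:
            mean = shared_stats['moving_mean']
        return [v - mean for v in batch]
    return run

stats = {'moving_mean': 0.0}
train_model = make_model(stats, is_training=True, decay=0.5)
eval_model = make_model(stats, is_training=False)

train_model([2.0, 4.0])                # moving_mean becomes 1.5
eval_output = eval_model([1.5, 2.5])   # centred with the shared moving_mean
```

Because the flag lives in the builder call rather than in every run call, the training loop and the evaluation loop each use their own model object and cannot pass the wrong mode by accident.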
Hope that helps!

Resources