What is the difference between variable_op_scope and variable_scope? - machine-learning

In TensorFlow, there are two scope functions: variable_op_scope and variable_scope. The first one has the following signature:
variable_op_scope(values, name_or_scope, default_name, initializer,
                  regularizer, caching_device, partitioner, reuse)
What does the first parameter, values, mean? default_name is only used when name_or_scope is None, so why does this function need to take both parameters? One parameter should be enough.
In general, what is the difference between these two scopes?

variable_op_scope is a wrapper around variable_scope. It behaves just like tf.variable_scope, but does two more things:
It validates that the tensors in values come from the same graph.
If name_or_scope is None, default_name is used and uniquified if needed. Note that if name_or_scope is not None, it is used as-is and is not uniquified, and default_name is ignored.
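A minimal sketch of the usual pattern, assuming the legacy TF 0.x API in which tf.variable_op_scope is a context manager (my_op, a, b and the "MyOp" default name are illustrative):

import tensorflow as tf

def my_op(a, b, scope=None):
    # values=[a, b] lets TF check that a and b belong to the same graph.
    # If scope is None, the default name "MyOp" is used and uniquified
    # ("MyOp", "MyOp_1", ...); otherwise scope is used verbatim.
    with tf.variable_op_scope([a, b], scope, "MyOp"):
        w = tf.get_variable("w", shape=[], initializer=tf.constant_initializer(1.0))
        return a * w + b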

Related

TensorFlowFederated: Passing tensor to tff.federated_computation

I have trialled the TFF tutorial (MNIST) on my single machine and now I am trying to perform a multi-machine process using MNIST data.
Clearly, I cannot use create_tf_dataset_for_client so I have used GRPC to learn how to pass data from one machine to another.
My scenario is that Server will dispatch the initial model (with zeroes) to all the participating clients where the model will run on local data. Each client will dispatch the new weights to the server that will perform federated_mean.
I was thinking of using tff.learning.build_federated_averaging_process where I could hopefully customise the next function (2nd argument) but I failed... I am not even sure if we use this approach to send the model and get the weights back from remote clients.
Then I thought I could use tff.federated_mean under a @tff.federated_computation decorator. However, since the weights are arrays and I have a list of them (as I have a number of clients), I am unable to understand how to create a tff.FederatedType that points to that list of lists. Any help from someone who has modelled federation on a distributed dataset would be handy.
Regards,
Dev.
TFF computations are designed to be platform/runtime agnostic; a single computation can be executed by several different backends.
TFF's type system can be helpful here in reasoning about how data is expected to flow in your computation. See the custom federated algorithms, part 1 tutorial for an intro to how TFF thinks about types.
The result of build_federated_averaging_process expects an argument of datasets placed at clients; for a dataset of element type T, in TFF's usual notation this would be denoted {T*}@C. This signature in particular is agnostic with respect to how the datasets arrive at the clients, or indeed how the clients themselves are represented.
Materializing the data and representing the clients is really the job of the runtime. TFF provides a few so-called native options here.
For example, in the local Python runtime clients are represented by threads on your local machine. Datasets are simply eager tf.data.Dataset objects, and the threads pull data from the datasets during training.
In the remote Python runtime, clients are represented by (threads on) remote workers, so that a single remote worker could be running more than one client. In this case, as you note, data must be materialized on the remote worker in order to train.
There are several options for accomplishing this.
One, TFF will actually handle serialization and deserialization of eager datasets across this RPC connection for you, so you could use the identical pattern of specifying data as in the local runtime, and it should "just work". This pattern actually got significantly better in March of 2021, via the use of tf.raw_ops.DatasetToGraphV2.
Perhaps better mapping to the concepts of federated computation, however, is the use of some library functions to simply instantiate the datasets on the workers.
Suppose you have an iterative process ip, which accepts a state and a data argument, where data is of type {T*}@C. Suppose further that we have a TFF computation get_dataset_for_client_id, which accepts a string and returns a dataset of the appropriate type (i.e., its TFF type signature is tf.string -> T*).
Then we can compose these two computations into another:
@tff.federated_computation(STATE_TYPE, tff.FederatedType(tf.string, tff.CLIENTS))
def new_next(state, client_ids):
    datasets_on_clients = tff.federated_map(get_dataset_for_client_id, client_ids)
    return ip.next(state, datasets_on_clients)
new_next now requires the controller to only specify the ids of clients on which to train, and delegates responsibility for pointing to a data store to whoever is representing the clients.
This pattern I think is likely what you want; TFF provides some helpers, like the dataset_computation attribute on tff.simulation.ClientData and tff.simulation.compose_dataset_computation_with_iterative_process, which will more or less perform the wiring we did above for you.
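A hedged sketch of that helper-based wiring, assuming client_data is a tff.simulation.ClientData and iterative_process is the process returned by tff.learning.build_federated_averaging_process (the client id strings are illustrative):

trainable_process = tff.simulation.compose_dataset_computation_with_iterative_process(
    client_data.dataset_computation,  # TFF computation: client id string -> T*
    iterative_process)

# The composed process takes client ids instead of datasets:
state = trainable_process.initialize()
state, metrics = trainable_process.next(state, ["client_0", "client_1"])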
Let's do this step by step. Please let us know if the explanation below answers your question.
Let's start with an example of TF (non-federated, just local) code that takes a dataset and does something with it, say add numbers:
import numpy as np
import tensorflow as tf
import tensorflow_federated as tff

@tff.tf_computation(tff.SequenceType(tf.int32))
def process_data(ds):
    return ds.reduce(np.int32(0), lambda x, y: x + y)
This code takes a dataset of integer numbers at input, and returns a single integer with the sum at output.
You can confirm this by looking at the type signature, like this:
str(process_data.type_signature)
You should see this:
(int32* -> int32)
So, process_data takes a set of integers, and returns an integer.
Now, using TFF's federated operators we can create a federated computation that does this on multiple clients, like this:
@tff.federated_computation(tff.FederatedType(tff.SequenceType(tf.int32), tff.CLIENTS))
def process_data_on_clients(federated_ds):
    return tff.federated_map(process_data, federated_ds)
If you look at the type signature of this new computation (just like above), you will see this:
({int32*}@CLIENTS -> {int32}@CLIENTS)
It means process_data_on_clients takes a federated set of integers (one set per client), and returns a federated integer (one integer with the sum on each client).
What happens in the above is that, the TF logic in process_data will be executed once on each client. This is how the federated_map operator works.
Now, process_data_on_clients is a little bit like the iterative process you are working with. It wants you to provide a federated dataset as an argument.
Let's see how we can make one by following the same pattern as above.
Here's some TF code that creates a single dataset of integers; say you supply an integer n and want to create a dataset with the numbers from 1 up to n, i.e., {1, 2, ..., n}:
@tff.tf_computation(tf.int32)
def make_data(n):
    return tf.data.Dataset.range(tf.cast(n, tf.int64)).map(lambda x: tf.cast(x + 1, tf.int32))
This is obviously a silly example, you could do something more along the lines of what you need (e.g., read data from a file specified by a name, etc.).
And here's what its type signature looks like:
(int32 -> int32*)
You can see the similarity to process_data.
And, just like with processing data, here's how we can make data on all clients by using the federated_map operator:
@tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_data_on_clients(federated_n):
    return tff.federated_map(make_data, federated_n)
This is the type signature:
({int32}@CLIENTS -> {int32*}@CLIENTS)
Great, so make_data_on_clients takes a federated integer (that tells us how many data items to produce on each client), and returns a federated dataset, just like what process_data_on_clients wants.
You can check that the two work together as intended:
federated_n = [2, 3, 4]
federated_ds = make_data_on_clients(federated_n)
result = process_data_on_clients(federated_ds)
result
You should get the sums 1+2, 1+2+3, and 1+2+3+4 on the 3 clients involved in this computation (note there were 3 numbers in the federated integer above, so there are 3 clients here):
[<tf.Tensor: shape=(), dtype=int32, numpy=3>,
<tf.Tensor: shape=(), dtype=int32, numpy=6>,
<tf.Tensor: shape=(), dtype=int32, numpy=10>]
Note that all TF code you have seen so far, including both dataset creation and dataset reduce, were being executed on the clients (using federated_map).
Now, you can put the two together:
@tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_and_process_data_on_clients(federated_n):
    federated_ds = make_data_on_clients(federated_n)
    return process_data_on_clients(federated_ds)
And now, you can invoke the make and process data combo in one shot:
make_and_process_data_on_clients(federated_n)
Again, all TF code here is executing on clients, just like in the above.
So where does this leave you?
Going back to Keith's explanation, the iterative process you got from TFF wants a federated dataset at input, just like process_data_on_clients.
The function get_dataset_for_client_id in Keith's example is like our make_data in that it is assumed to contain TensorFlow code that you want to run on each client to physically construct a dataset on that client.
In our silly example, the dataset construction logic used range, but it can be anything. For example, you could load data on each client from the same local file my_data, or use a custom TF op, or whatever other means. Just like in our example, you can pass parameters to that function to give you more centralized control (similar to what we did above with the federated integer).
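For instance, a hedged sketch of a file-based variant, assuming each client stores newline-delimited integers in a local text file (make_data_from_file and its path argument are illustrative):

@tff.tf_computation(tf.string)
def make_data_from_file(path):
    # Type signature (string -> int32*): each line of the file becomes one element.
    return tf.data.TextLineDataset(path).map(
        lambda line: tf.strings.to_number(line, out_type=tf.int32))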
The code snippet new_next in Keith's example is just like our make_and_process_data_on_clients, in that it combines two federated computations: one that makes federated data on clients (supplied by you, just as discussed here), and one that processes that data (from tff.learning, the iterative process).
Does this help?
If it's still unclear, I would recommend trying the examples I included above on your distributed setup, since you already have one. You could inject some TF print ops into that code to confirm that the TF code you wrote is executing on the client machines in your system.
Once you get that part, it's a simple tweak to replace the silly dataset construction logic in make_data with one that loads a dataset on each client from whatever local data source you are using.
EDITS:
Re: how to print, any TensorFlow code that appears in the body of a @tff.tf_computation is executed in eager mode, and you can use standard TensorFlow mechanisms such as tf.print to print from within TensorFlow.
tensorflow.org/api_docs/python/tf/print
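For example, a minimal sketch that adds a print to the earlier reduction (the message text is illustrative; the output should appear on whichever worker runs that client):

@tff.tf_computation(tff.SequenceType(tf.int32))
def process_data_with_print(ds):
    total = ds.reduce(np.int32(0), lambda x, y: x + y)
    tf.print("client-side sum:", total)
    return total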
On how to configure a multi-machine system with multiple worker nodes, see the Kubernetes tutorial. Note that the machine that drives the process connects to worker nodes, not the other way round.
https://www.tensorflow.org/federated/tutorials/high_performance_simulation_with_kubernetes

Defining dask worker resources for a dataframe operation

I am applying multiple operations to a dask dataframe. Can I define distributed worker resource requirements for a particular operation?
e.g. I call something like:
df.fillna(value="").map_partitions(...).map(...)
I want to specify a resource requirement for map_partitions() (potentially different from the one for map()), but it seems the method does not accept a resources parameter.
PS. Alternatively, I figured out that I can call client.persist() after map_partitions() and specify resources in this call, but this immediately triggers the computation.
You can specify resource constraints on particular parts of your computation when you call compute or persist by providing the intermediate collection.
import dask.dataframe as dd

x = dd.read_csv(...)
y = x.map_partitions(func)
z = y.map(func2)
z.compute(resources={tuple(y._keys()): {'GPU': 1}})
Thank you for the question, I went to include a link to the documentation about this feature and found that it was undocumented. I'll fix shortly.
It looks like today there is a bug where the intermediate keys may be optimized out in some situations (though this is less likely for dataframe operations), so you may also want to pass the optimize_graph=False keyword.
z.compute(resources={tuple(y._keys()): {'GPU': 1}}, optimize_graph=False)
See https://github.com/dask/distributed/pull/1362
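Note that the resources constraint only matches workers that actually advertise the named resource; a hedged sketch of the worker side, with an illustrative scheduler address:

# Start each GPU worker so that it advertises the resource, e.g.:
#   dask-worker tcp://scheduler:8786 --resources "GPU=1"
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical scheduler address
# build x, y, z as above, then restrict y's tasks to GPU workers:
# z.compute(resources={tuple(y._keys()): {'GPU': 1}}, optimize_graph=False)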

Parameter sharing in network with nn.SpatialBatchNormalization

I have a network with three parallel branches, and I want to share all their parameters so that they are identical at the end of the training.
Let some_model be a standard nn.Sequential module made of cudnn.SpatialConvolution, nn.PReLU, nn.SpatialBatchNormalization. Additionally, there is a nn.SpatialDropout, but its probability is set to 0, so it has no effect.
ptb=nn.ParallelTable()
ptb:add(some_model)
ptb:add(some_model:clone('weight','bias', 'gradWeight','gradBias'))
ptb:add(some_model:clone('weight','bias', 'gradWeight','gradBias'))
triplet=nn.Sequential()
triplet:add(ptb)
I don't think the loss function is relevant, but just in case, I use nn.DistanceRatioCriterion. To check that all weights are correctly shared, I pass a table of three identical examples {A,A,A} to the network. Obviously, if the weights are correctly shared, then the output of all three branches should be the same. This holds at the moment of network initialization, but once the parameters have been updated (say, after one mini-batch iteration), the results of the three branches become different.
Through layer-by-layer inspection, I have noticed that this discrepancy in the output comes from the nn.SpatialBatchNormalization layers in some_model. Therefore, it seems that the parameters from those layers are not properly shared. Following this, I have tried calling clone with the additional parameters running_mean and running_std, but the output of the batchnorm layers still differs. Moreover, this seems to be cancelling the sharing of all other network parameters as well. What is the proper way of sharing parameters between nn.SpatialBatchNormalization modules?
OK, I found the solution! It seems that the parameter running_std has been renamed to running_var since the discussion I had linked to in the question. Calling clone with
ptb:add(some_model:clone('weight','bias', 'gradWeight','gradBias','running_mean','running_var'))
solves the problem.

When are placeholders necessary?

Every TensorFlow example I've seen uses placeholders to feed data into the graph. But my applications work fine without placeholders. According to the documentation, using placeholders is the "best practice", but they seem to make the code unnecessarily complex.
Are there any occasions when placeholders are absolutely necessary?
According to the documentation, using placeholders is the "best practice"
Hold on, this quote is out of context and could be misinterpreted. Placeholders are the best practice when feeding data through feed_dict.
Using a placeholder makes the intent clear: this is an input node that needs feeding. TensorFlow even provides placeholder_with_default, which does not need feeding; but again, the intent of such a node is clear. For all practical purposes, a placeholder_with_default does the same thing as a constant. You can indeed feed a constant to change its value, but is the intent clear? Would that not be confusing? I doubt it.
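A minimal sketch of placeholder_with_default (TF 1.x API): the node behaves like a constant unless it is fed, but its name signals that it is an intended input.

import tensorflow as tf

x = tf.placeholder_with_default(tf.constant(1.0), shape=[], name="x")
y = x * 2.0

with tf.Session() as sess:
    print(sess.run(y))                      # 2.0, default value used
    print(sess.run(y, feed_dict={x: 5.0}))  # 10.0, fed value overrides the default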
There are other ways to input data than feeding and AFAICS all have their uses.
A placeholder is a promise to provide a value later.
A simple example is to define two placeholders a and b and then an operation on them, like below.
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b # + provides a shortcut for tf.add(a, b)
a and b are not initialized and contain no data, because they were defined as placeholders.
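Their values are supplied only when the graph is run; a minimal sketch of feeding them (TF 1.x API):

with tf.Session() as sess:
    print(sess.run(adder_node, feed_dict={a: 3.0, b: 4.5}))  # 7.5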
Another approach is to define variables with tf.Variable; in this case you have to provide an initial value when you declare them, and then run an initializer such as:
tf.global_variables_initializer()
or
tf.initialize_all_variables()
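A minimal sketch of that approach (TF 1.x API):

W = tf.Variable(tf.zeros([1]), name="W")   # initial value required at declaration
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)        # the extra initialization step mentioned below
    print(sess.run(W))    # [0.]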
And this solution has two drawbacks:
1. Performance-wise, you need one extra step of calling the initializer; however, these variables are updatable.
2. In some cases you do not know the initial values for these variables, so you have to define them as placeholders.
Conclusion:
Use tf.Variable for trainable variables such as weights (W) and biases (B) of your model, or in general when initial values are required.
tf.placeholder allows you to create operations and build the computation graph without needing the data. In TensorFlow terminology, we then feed data into the graph through these placeholders.
I really like Ahmed's answer and I upvoted it, but I would like to provide an alternative explanation that might or might not make things a bit clearer.
One of the significant features of TensorFlow is that its operation graphs are compiled and then executed outside of the original environment used to build them. This allows TensorFlow to do all sorts of tricks and optimizations, like distributed, platform-independent calculations, graph interoperability, GPU computation, etc. But all of this comes at the price of complexity. Since your graph is being executed inside its own VM of some sort, you have to have a special way of feeding data into it from the outside, for example from your Python program.
This is where placeholders come in. One way of feeding data into your model is to supply it via a feed dictionary when you execute a graph op. To indicate where inside the graph this data is supposed to go, you use placeholders. This way, as Ahmed said, a placeholder is a sort of promise for data supplied in the future. It is literally a placeholder for things you will supply later. To use an example similar to Ahmed's:
import numpy as np
import tensorflow as tf

# define graph to do matrix multiplication
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# this is the actual operation we want to do,
# but since we want to supply x and y at runtime
# we will use placeholders
model = tf.matmul(x, y)
# now let's supply the data and run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    # generate some data for our graph
    data_x = np.random.randint(0, 10, size=[5, 5])
    data_y = np.random.randint(0, 10, size=[5, 5])
    # do the work
    result = session.run(model, feed_dict={x: data_x, y: data_y})
There are other ways of supplying data to the graph, but arguably placeholders with feed_dict are the most comprehensible way, and they provide the most flexibility.
If you want to avoid placeholders, other ways of supplying data are either loading the whole dataset into constants at graph-build time, or moving the whole process of loading and pre-processing the data into the graph by using input pipelines. You can read up on all of this in the TF documentation.
https://www.tensorflow.org/programmers_guide/reading_data
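As a rough sketch of the input-pipeline alternative (TF 1.x tf.data API; the in-memory array is illustrative), the data loading lives inside the graph, so no placeholder or feed_dict is needed:

import numpy as np
import tensorflow as tf

features = np.random.rand(100, 5).astype(np.float32)  # hypothetical in-memory data
dataset = tf.data.Dataset.from_tensor_slices(features).batch(10)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    print(sess.run(next_batch).shape)  # (10, 5)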

Why do we need to explicitly update the moving_mean and moving_variance in TensorFlow's Batch normalization in tf.contrib.layers.batch_norm?

Too Long To Read: How can I use Batch Normalization with tf.contrib.layers.batch_norm without having to explicitly tell the session whether or not to update the moving statistics (moving_mean and moving_variance)?
A few months ago I provided an answer to How could I use Batch Normalization in TensorFlow? and noticed a few weird details that I wanted to address. First, the implementation I provided seems repetitive with respect to the is_training variable. Recall my suggested code:
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm

def batch_norm_layer(x, train_phase, scope_bn):
    bn_train = batch_norm(x, decay=0.999, center=True, scale=True,
                          updates_collections=None,
                          is_training=True,
                          reuse=None,  # is this right?
                          trainable=True,
                          scope=scope_bn)
    bn_inference = batch_norm(x, decay=0.999, center=True, scale=True,
                              updates_collections=None,
                              is_training=False,
                              reuse=True,  # is this right?
                              trainable=True,
                              scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
    return z
In it I have a train_phase variable that just holds a tf boolean tf.placeholder(tf.bool, name='phase_train'). As you can see, it is used to decide whether the batch norm layer should be in inference mode or not. However, the variable seems a little redundant, since I have two variables specifying the same thing twice, i.e. once in train_phase and again in is_training. Is that really necessary?
I thought about it a bit and it seems I might be able to remove the hard-coded is_training=True/False with the (pseudo)code:
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm

def batch_norm_layer(x, train_phase, scope_bn):
    bn = batch_norm(x, decay=0.999, center=True, scale=True,
                    updates_collections=None,
                    is_training=get_bool(train_phase),
                    reuse=None,  # is this right?
                    trainable=True,
                    scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn, lambda: bn)
    return z
which seems to make the train_phase variable completely redundant/silly. This actually highlights my most important point: are the train_phase variable and tf.cond(train_phase, lambda: bn_train, lambda: bn_inference) even necessary? Which actually brings up my biggest complaint about the code (though I think this code might not even run, because when defining the graph the placeholder train_phase might not even have a value, but you get the idea).
Honestly I find having to explicitly define train_phase very dangerous, because it seems very unnecessary for users to have to handle the inference/training mode of Batch Norm this explicitly. "Normal" users of Batch Norm should always update moving_mean and moving_variance with the training data, and no standard user of Batch Norm should ever update moving_mean and moving_variance with test statistics. Since the user is required to do:
sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys, phase_train: True})
it can cause really bad bugs for users, bugs that shouldn't even exist in the first place (at least in my opinion). Furthermore, it seems weird to have to explicitly say what phase_train is, because whenever one trains, one uses an optimizer, so it should be incredibly clear when that code is called that it should be true. Maybe this is a terrible idea, but it feels like the optimizer or the session should be setting that to true automatically rather than relying on the user to do it right.
I understand that sometimes users are allowed more flexibility to be more creative, but I can't really see how this (even for a researcher) would be a good feature. Maybe I am just using the library incorrectly or being paranoid, but should the user really be forced to be so explicit when using batch norm? Is there some way around this?
As a side point, having phase_train be part of the model also makes the code a bit more ugly and confusing than feels necessary, because it seems unavoidable to have a line of code where the session is being used to check whether the batch norm flag is on or not. The code I am trying to avoid writing is the logic:
if batch_norm:
    # during training
    sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys, phase_train: True})
else:
    # with no batch norm
    sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
It just feels totally unnecessary. It feels like during training the model should know whether it should be updating the variables or not.
As a quick (really ugly) solution to the last problem with the if condition in the session, one can always define phase_train as part of the model (or at least as part of the graph) and set it to true and/or false as appropriate; but when one doesn't actually use the batch norm layer, one does not actually use the phase_train placeholder in the model, even if we set it to have a value in session.run. That is, the session sets it to true or false, but when BN is not being used, it doesn't even matter what one sets it to, since it's not actually being used. Obviously, this makes the code really confusing (since one is defining a variable one doesn't even need), but I can't seem to find a way to hide the phase_train variable. For the moment this is what I am going for, because it seems really ugly to have to split (or duplicate) my code between lines that have:
sess.run(fetches=..., feed_dict={..., phase_train: False})
and the ones that don't have it at all:
sess.run(fetches=..., feed_dict={...})
Ideally I want the second solution and have batch norm work regardless if I use the silly phase_train variable.
I don't really have a complete answer to your question, but I have a few observations:
The standard practice seems to be to build slightly different graphs for training and for inference, each built with or without is_training enabled.
The batch_norm layer is designed so that you can use an arg_scope to set is_training=True for all layers in your model. For example, take a look at how the Inception v3 model is defined here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/nets/inception_v3.py#L571. This at least makes it much more convenient to set is_training once in the Python code that builds your model and have it apply everywhere.
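A rough sketch of that arg_scope pattern, assuming the tf.contrib.slim layers (the layer sizes and names are illustrative):

import tensorflow as tf
slim = tf.contrib.slim

def build_model(inputs, is_training):
    # is_training is set once here and applies to every batch_norm in the scope
    with slim.arg_scope([slim.batch_norm], is_training=is_training):
        net = slim.conv2d(inputs, 32, [3, 3], normalizer_fn=slim.batch_norm)
        net = slim.conv2d(net, 64, [3, 3], normalizer_fn=slim.batch_norm)
        return slim.fully_connected(slim.flatten(net), 10, activation_fn=None)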
Tensorflow's underlying infrastructure doesn't distinguish between training and inference time—it's just running graphs of operators. tf.Session doesn't really know anything about Neural Networks, training, or inference, so it isn't the right place for this kind of logic.
One could imagine that an Optimizer should rewrite the graph to enable is_training for those operators that support it. I don't have a strong opinion about this; you might try filing a Tensorflow Github issue making that feature request to see what others think about it. It might seem a bit too "magical".
Hope that helps!
