How does one dynamically add new parameters to optimizers in Pytorch? - machine-learning

I was going through this post in the pytorch forum, and I also wanted to do this. The original post removes and adds layers but I think my situation is not that different. I also want to add layers or more filters or word embeddings. My main motivation is that the AI agent does not know the whole vocabulary/dictionary in advance because its large. I prefer strongly (for the moment) to not do character by character RNNs.
So what will happen for me is when the agent starts a forward pass it might find new words it has never seen and will need to add them to the embedding table (or perhaps add new filters before it starts the forward pass).
So what I want to make sure is:
embeddings are added correctly (at the right time, when a new computation graph is made) so that they are updatable by the optimizer
no issues with stored info of past parameters e.g. if its using some sort of momentum
How does one do this? Any sample code that works?

Just to add an answer to the title of your question: "How does one dynamically add new parameters to optimizers in Pytorch?"
You can append params at any time to the optimizer:
import torch
import torch.optim as optim
model = torch.nn.Linear(2, 2)
# Initialize optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001, momentum=0.9)
extra_params = torch.randn(2, 2)
optimizer.param_groups.append({'params': extra_params })
#then you can print your `extra_params`
print("extra params", extra_params)
print("optimizer params", optimizer.param_groups)

That is a tricky question, as I would argue that the answer is "depends", in particular on how you want to deal with the optimizer.
Let's start with your specific problem - an embedding. In particular, you are asking on how to add embeddings to allow for a larger vocabulary dynamically. My first advice is, that if you have a good sense of an upper boundary of your vocabulary size, make the embedding large enough to cope with it from the beginning, as this is more efficient, and as you will need the memory eventually anyway. But this is not what you asked. So - to dynamically change your embedding, you'll need to overwrite your old one with a new one, and inform your optimizer of the change. You can simply do that whenever you run into an exception with your old embedding, in a try ... except block. This should roughly follow this idea:
# from within whichever module owns the embedding
# remember the already trained weights
old_embedding_weights = self.embedding.weight.data
# create a new embedding of the new size
self.embedding = nn.Embedding(new_vocab_size, embedding_dim)
# initialize the values for the new embedding. this does random, but you might want to use something like GloVe
new_weights = torch.randn(new_vocab_size, embedding_dim)
# as your old values may have been updated, you want to retrieve these updates values
new_weights[:old_vocab_size] = old_embedding_weights
self.embedding.weights.data.copy_(new_weights)
However, you should not do this for every single new word you receive, as this copying takes time (and a whole lot of memory, as the embedding exists twice for a short time - if you're nearly out memory, just make your embedding large enough from the start). So instead increase the size dynamically by a couple of hundred slots at a time.
Additionally, this first step already raises some questions:
How does my respective nn.Module know about the new embedding parameter?
The __setattr__ method of nn.Module takes care of that (see here)
Second, why don't I simply change my parameter? That's already pointing towards some of the problems of changing the optimizer: pytorch internally keeps references by object ID. This means that if you change your object, all these references will point towards a potentially incompatible object, as its properties have changed. So we should simply create a new parameter instead.
What about other nn.Parameters or nn.Modules that are not embeddings? These you treat the same. You basically just instantiate them, and attach them to their parent module. The __setattr__ method will take care of the rest. So you can do so completely dyncamically ...
Except, of course, the optimizer. The optimizer is the only other thing that "knows" about your parameters except for your main model-module. So you need to let the optimizer know of any change.
And this is tricky, if you want to be sophisticated about it, and very easy if you don't care about keeping the optimizer state. However, even if you want to be sophisticated about it, there is a very good reason why you probably should not do this anyways. More about that below.
Anyways, if you don't care, a simple
# simply overwrite your old optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)
will do. If you care, however, you want to transfer your old state, you can do so the same way that you can store, and later load parameters and optimizer states from disk: using the .state_dict() and .load_state_dict() methods. This, however, does only work with a twist:
# extract the state dict from your old optimizer
old_state_dict = optimizer.state_dict()
# create a new optimizer
optimizer = optim.SGD(model.parameters())
new_state_dict = optimizer.state_dict()
# the old state dict will have references to the old parameters, in state_dict['param_groups'][xyz]['params'] and in state_dict['state']
# you now need to find the parameter mismatches between the old and new statedicts
# if your optimizer has multiple param groups, you need to loop over them, too (I use xyz as a placeholder here. mostly, you'll only have 1 anyways, so just replace xyz with 0
new_pars = [p for p in new_state_dict['param_groups'][xyz]['params'] if not p in old_state_dict['param_groups'][xyz]['params']]
old_pars = [p for p in old_state_dict['param_groups'][xyz]['params'] if not p in new_state_dict['param_groups'][xyz]['params']]
# then you remove all the outdated ones from the state dict
for pid in old_pars:
old_state_dict['state'].pop(pid)
# and add a new state for each new parameter to the state:
for pid in new_pars:
old_state_dict['param_groups'][xyz]['params'].append(pid)
old_state_dict['state'][pid] = { ... } # your new state def here, depending on your optimizer
However, here's the reason why you should probably never update your optimizer like this, but should instead re-initialize from scratch, and just accept the loss of state information: When you change your computation graph, you change forward and backward computation for all parameters along your computation path (if you do not have a branching architecture, this path will be your entire graph). This more specifically means, that the input to your functions (=layer/nn.Module) will be different if you change some function (=layer/nn.Module) applied earlier, and the gradients will change if you change some function (=layer/nn.Module) applied later. That in turn invalidates the entire state of your optimizer. So if you keep your optimizer's state around, it will be a state computed for a different computation graph, and will probably end up in catastrophic behavior on part of your optimizer, if you try to apply it to a new computation graph. (I've been there ...)
So - to sum it up: I'd really recommend to try to keep it simple, and to only change a parameter as conservatively as possible, and not to touch the optimizer.

If you want to customize initial params:
from itertools import chain
l1 = nn.Linear(3,3)
l2 = nn.Linear(2,3)
optimizer = optim.SGD(chain(l1.parameters(), l2.parameters()), lr=0.01, momentum=0.9)
The key is that the first param of constructor receives iterator.

Related

Can I assume the running order of LeafSystem's CalcOutput function?

I am working on a LeafSystem like this:
class exampleLeafSystem(LeafSystem):
def __init__(self, plant):
self._plant = plant
self._plant_context = plant.CreateDefaultContext()
self.DeclareVectorIutputPort("q_v", BasicVector(6))
self.DeclareVectorOutputPort("tau", BasicVector(3), self.TauCalcOutput)
self.DeclareVectorOutputPort("xe", BasicVector(3), self.xeCalcOutput)
def TauCalcOutput(self, context, output):
q_v = self.get_input_port(0).Eval(context)
self._plant.SetPositionsAndVelocities(self._plant_context, q_v)
# Do some calculation with self._plant and self._plant_context to get the output
output.SetFromVector(tau)
def xeCalcOutput(self, context, output):
q_v = self.get_input_port(0).Eval(context)
self._plant.SetPositionsAndVelocities(self._plant_context, q_v)
# Do some calculation with self._plant and self._plant_context to get the output
output.SetFromVector(xe)
In the two methods TauCalcOutput and xeCalcOutput here, I need frist update the MultibodyPlant's state information and then do the calculation to compute the output. However, since I do not know the order of the two CalcOutput method being called, in order for this two method to use the newest state information, I have to write
q_v = self.get_input_port(0).Eval(context)
self._plant.SetPositionsAndVelocities(self._plant_context, q_v)
in both methods, which seems a bit unnecessary. If I can assume the order of the CalcOutput methods being runned, for example, say that TauCalcOutput always runs first, then I can only have
q_v = self.get_input_port(0).Eval(context)
self._plant.SetPositionsAndVelocities(self._plant_context, q_v)
in TauCalcOutput method, without worrying that XeCalcOutput method will use MultibodyPlant's state information that is one step lagged.
So my question is that is there a specific order for the CalcOutput methods being called?
No. The contract is that output ports can be called in any order at any time, and are only called when they are evaluated (e.g. by a downstream system). The order that they will be called will depend on the other systems in the Diagram; they might not all get called during a single simulation step (e.g. if one system is consumed by a discrete time system with time step 0.1, and another by time step 0.2), or may not be called at all if they are not connected. Users can even call get_output_port().Eval() manually.
For a general approach to avoiding duplicate computation in output ports, you should store the result of the shared computation in the Context, either as state or as a "cache entry". See DeclareCacheEntry for more details.
For this workflow, specifically, perhaps the simplest solution is to check whether the positions and velocities are already set to the same value, just to avoid repeating the kinematics evaluation, for instance as you see here:
https://github.com/RobotLocomotion/drake/blob/6e6e37ffa677362245773f13c0628f0042b47414/multibody/inverse_kinematics/kinematic_constraint_utilities.cc#L47-L54

How to change clipping and noise parameters during differentially private training with Tensorflow Federated

I'm using Tensorflow Federated (TFF) to train with differential privacy. Currently I am creating a Tensorflow Privacy NormalizedQuery and then passing it into a TFF DifferentiallyPrivateFactory to create an AggregationProcess:
_weights_type = tff.learning.framework.weights_type_from_model(placeholder_model)
query = tensorflow_privacy.GaussianSumQuery(l2_norm_clip=10.0, stddev=0.1)
query = tensorflow_privacy.NormalizedQuery(query, 20)
agg_proc = tff.aggregators.DifferentiallyPrivateFactory(query)
agg_proc = agg_proc.create(_weights_type.trainable)
After broadcasting the server state to clients I run a client update function and then use the AggregationProcess like this:
agg_output = agg_proc.next(
server_state.delta_aggregate_state,
client_outputs.weights_delta)
This works great, however I want to experiment with changing the l2_norm_clip and stddev several times during training (making clipping bigger and smaller at various training rounds) but it seems I can only set these parameters when I create the AggregationProcess.
Is is possible to change these parameters during training somehow?
I can think of two ways to do what you want: the easy way and the right way.
The right way is to make a new type of DPQuery that keeps track of the training round in its global state and adjusts the clip and stddev the way you want in its get_noised_result function. Then you can pass this new DPQuery to tff.aggregators.DifferentiallyPrivateFactory and use it like normal.
The easy way is to directly hack into the server_state.delta_aggregate_state. Somewhere in there you should find the global state of the DPQuery which should contain the l2_norm_clip and stddev which you can manipulate directly between rounds. This approach may be brittle because the aggregator state and the DPQuery state representations may be subject to change.

How does Machine Learning algorithm retain learning from previous execution?

I am reading Hands on Machine Learning book and author talks about random seed during train and test split, and at one point of time, the author says over the period Machine will see your whole dataset.
Author is using following function for dividing Tran and Test split,
def split_train_test(data, test_ratio):
shuffled_indices = np.random.permutation(len(data))
test_set_size = int(len(data) * test_ratio)
test_indices = shuffled_indices[:test_set_size]
train_indices = shuffled_indices[test_set_size:]
return data.iloc[train_indices], data.iloc[test_indices]
Usage of the function like this:
>>>train_set, test_set = split_train_test(housing, 0.2)
>>> len(train_set)
16512
>>> len(test_set)
4128
Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.
Sachin Rastogi: Why and how will this impact my model performance? I understand that my model accuracy will vary on each run as Train set will always be different. How my model will see the whole dataset over a time ?
The author is also providing a few solutions,
One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator’s seed (e.g., np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices.
But both these solutions will break next time you fetch an updated dataset. A common solution is to use each instance’s identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier).
Sachin Rastogi: Will it be a good train/test division? I think No, Train and Test should contain elements from across dataset to avoid any bias from the Train set.
The author is giving an example,
You could compute a hash of each instance’s identifier and put that instance in the test set if the hash is lower or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset.
The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
Sachin Rastogi: I am not able to understand this solution. Could you please help?
For me, these are the answers:
The point here is that you should better put aside part of your data (which will constitute your test set) before training the model. Indeed, what you want to achieve is to be able to generalize well on unseen examples. By running the code that you have shown, you'll get different test sets through time; in other words, you'll always train your model on different subsets of your data (and possibly on data that you've previously marked as test data). This in turn will affect training and - going to the limit - there will be nothing to generalize to.
This will be indeed a solution satisfying the previous requirement (of having a stable test set) provided that new data are not added.
As said in the comments to your question, by hashing each instance's identifier you can be sure that old instances always get assigned to the same subsets.
Instances that were put in the training set before the update of the dataset will remain there (as their hash value won't change - and so their left-most bit - and it will remain higher than 0.2*max_hash_value);
Instances that were put in the test set before the update of the dataset will remain there (as their hash value won't change and it will remain lower than 0.2*max_hash_value).
The updated test set will contain 20% of the new instances and all of the instances associated to the old test set, letting it remain stable.
I would also suggest to see here for an explanation from the author: https://github.com/ageron/handson-ml/issues/71.

When are placeholders necessary?

Every TensorFlow example I've seen uses placeholders to feed data into the graph. But my applications work fine without placeholders. According to the documentation, using placeholders is the "best practice", but they seem to make the code unnecessarily complex.
Are there any occasions when placeholders are absolutely necessary?
According to the documentation, using placeholders is the "best practice"
Hold on, this quote is out-of-context and could be misinterpreted. Placeholders are the best practice when feeding data through feed_dict.
Using a placeholder makes the intent clear: this is an input node that needs feeding. Tensorflow even provides a placeholder_with_default that does not need feeding — but again, the intent of such a node is clear. For all purposes, a placeholder_with_default does the same thing as a constant — you can indeed feed the constant to change its value, but is the intent clear, would that not be confusing? I doubt so.
There are other ways to input data than feeding and AFAICS all have their uses.
A placeholder is a promise to provide a value later.
Simple example is to define two placeholders a,b and then an operation on them like below .
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b # + provides a shortcut for tf.add(a, b)
a,b are not initialized and contains no data Because they were defined as placeholders.
Other approach to do same is to define variables tf.Variable and in this case you have to provide an initial value when you declare it.
like :
tf.global_variables_initializer()
or
tf.initialize_all_variables()
And this solution has two drawbacks
Performance wise that you need to do one extra step with calling
initializer however these variables are updatable .
in some cases you do not know the initial values for these variables
so you have to define it as a placeholder
Conclusion :
use tf.Variable for trainable variables such as weights (W) and biases (B) for your model or when Initial values are required in
general.
tf.placeholder allows you to create operations and build computation graph, without needing the data. In TensorFlow
terminology, we then feed data into the graph through these
placeholders.
I really like Ahmed's answer and I upvoted it, but I would like to provide an alternative explanation that might or might not make things a bit clearer.
One of the significant features of Tensorflow is that its operation graphs are compiled and then executed outside of the original environment used to build them. This allows Tensorflow do all sorts of tricks and optimizations, like distributed, platform independent calculations, graph interoperability, GPU computations etc. But all of this comes at the price of complexity. Since your graph is being executed inside its own VM of some sort, you have to have a special way of feeding data into it from the outside, for example from your python program.
This is where placeholders come in. One way of feeding data into your model is to supply it via a feed dictionary when you execute a graph op. And to indicate where inside the graph this data is supposed to go you use placeholders. This way, as Ahmed said, placeholder is a sort of a promise for data supplied in the future. It is literally a placeholder for things you will supply later. To use an example similar to Ahmed's
# define graph to do matrix muliplication
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# this is the actual operation we want to do,
# but since we want to supply x and y at runtime
# we will use placeholders
model = tf.matmul(x, y)
# now lets supply the data and run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
session.run(init)
# generate some data for our graph
data_x = np.random.randint(0, 10, size=[5, 5])
data_y = np.random.randint(0, 10, size=[5, 5])
# do the work
result = session.run(model, feed_dict={x: data_x, y: data_y}
There are other ways of supplying data into the graph, but arguably, placeholders and feed_dict is the most comprehensible way and it provides most flexibility.
If you want to avoid placeholders, other ways of supplying data are either loading the whole dataset into constants on graph build or moving the whole process of loading and pre-processing the data into the graph by using input pipelines. You can read up on all of this in the TF documentation.
https://www.tensorflow.org/programmers_guide/reading_data

Why do we need to explicitly update the moving_mean and moving_variance in TensorFlow's Batch normalization in tf.contrib.layers.batch_norm?

To Long To Read: How can I use Batch Normalization with tf.contrib.layers.batch_norm without having to explicitly tell session to update the moving_statistics (moving_mean and moving_variance) or not?
A few months ago I provided an answer to How could I use Batch Normalization in TensorFlow? and noticed a few weird details that I wanted to address. First it seems that the implementation that I provide seems repetitive with respect to the is_training variable. Recall my suggested code:
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm
def batch_norm_layer(x,train_phase,scope_bn):
bn_train = batch_norm(x, decay=0.999, center=True, scale=True,
updates_collections=None,
is_training=True,
reuse=None, # is this right?
trainable=True,
scope=scope_bn)
bn_inference = batch_norm(x, decay=0.999, center=True, scale=True,
updates_collections=None,
is_training=False,
reuse=True, # is this right?
trainable=True,
scope=scope_bn)
z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
return z
in it I have a train_phase variable that just holds a tf boolean tf.placeholder(tf.bool, name='phase_train'). As you can see, it is used to decide if the batch norm layer should be in inference mode or not. However, the variable seemed a little redundant, since it seems I have two variables that specify the same thing twice. i.e. once in train_phase and another in is_training. Is that really necessary?
I thought about it a bit and it seems I might to be able to remove the hard coded (is_training=True/False) with the (pseudo)code:
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm
def batch_norm_layer(x,train_phase,scope_bn):
bn = batch_norm(x, decay=0.999, center=True, scale=True,
updates_collections=None,
is_training=get_bool(train_phase),
reuse=None, # is this right?
trainable=True,
scope=scope_bn)
z = tf.cond(train_phase, lambda: bn, lambda: bn)
return z
which seems to make the train_phase variable completely redundant/silly. This actually highlights my most important point, is the train_phase variable and tf.cond(train_phase, lambda: bn_train, lambda: bn_inference) even necessary? Which actually brings up my biggest complaint about the code (though I think this code might not even run because when defining the graph the placeholder train_phase might not even have a value but you get the idea).
Honestly I find having to even explicitly define train_phase very dangerous because it seems very unnecessary for users to have to handle the inference/training mode of Batch Norm this explicitly. Though, "normal" users of Batch Norm should always update the moving_mean,moving_variance with the train data and any standard user of Batch Norm should not be updating moving_mean,moving_variance with test statistics at any time. Since the user is required to do:
sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys, phase_train=True})
it can bring cause really bad bugs for users that shouldn't even exist in the first place (at least in my opinion). Furthermore, it seems weird to have to explicitly say what the phase_train is because whenever one trains, one uses an optimizer, so it should be incredibly clear when that code is called that it should be true. Maybe this is a terrible idea but it feels like the optimizer or the session should be setting that to true automatically rather than relying on the user to do it right.
I understand that sometimes users are allowed more flexibility to be more creative but I can't really appreciate how this (even for a researcher) be a good feature. Maybe I am just using the library incorrectly or being paranoic, but should the user really be forced to be so explicit when using batch norm? Is there some way around this?
As a side point, having the phase_train be part of the model also makes the code be a bit more ugly and confusing than it feels necessary because it seems to me that its unavoidable to have a line of code where the session is being used to check if the batch norm flag is on or not. The code I am trying to avoid writing is the logic:
if batch_norm:
# during training
sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys, phase_train=True})
else:
# with no batch norm
sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
it just feels totally unnecessary. It feels the during training the model should know if it should be updating the variables or not.
As quick (really ugly) solution to the last problem with the if condition in the session, one can always define phase_train as part of the model (or at least as part of the graph) and accordingly set it equal to true and/or false when appropriate but when one doesn't actually use the batch norm layer, one actually does not use the phase_train placeholder in the model even if we set it have a value in the session.run. i.e. the sessions sets it to true or false, but when BN is not being used, it doesn't even matter what one sets it equal to since its not actually being used. Obviously, this makes the code really confusing (since one is defining some variable one doesn't even need), but I can't seem to find a way to hide the phase_train variable. For the moment this is what I am going for because it seems really ugly to have to split (or duplicate) my code between lines that have:
sess.run(fetches=..., feed_dict={...,phase_train=False})
and the ones that don't have it all:
sess.run(fetches=..., feed_dict={...})
Ideally I want the second solution and have batch norm work regardless if I use the silly phase_train variable.
I don't really have a complete answer to your question, but I have a few observations:
The standard practice seems to be to build slightly different graphs for training and for inference, each built with or without is_training enabled.
The batch_norm layer is designed so that you can use an arg_scope to set is_training=True for all layers in your model. For example, take a look at how the Inceptionv3 model is defined here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/nets/inception_v3.py#L571 . This at least makes it much more convenient to set is_training once in your Python code that builds a model and to have it apply everywhere.
Tensorflow's underlying infrastructure doesn't distinguish between training and inference time—it's just running graphs of operators. tf.Session doesn't really know anything about Neural Networks, training, or inference, so it isn't the right place for this kind of logic.
One could imagine that an Optimizer should rewrite the graph to enable is_training for those operators that support it. I don't have a strong opinion about this; you might try filing a Tensorflow Github issue making that feature request to see what others think about it. It might seem a bit too "magical".
Hope that helps!

Resources