I have a graph as follows, where the input x has two paths to reach y. They are combined with a gModule that uses cMulTable. Now if I do gModule:backward(x,y), I get a table of two values. Do they correspond to the error derivative derived from the two paths?
But since path2 contains other nn layers, I suppose I need to derive the derivatives along this path in a stepwise fashion. But why did I get a table of two values for dy/dx?
To make things clearer, code to test this is as follows:
input1 = nn.Identity()()
input2 = nn.Identity()()
score = nn.CAddTable()({nn.Linear(3, 5)(input1), nn.Linear(3, 5)(input2)})
g = nn.gModule({input1, input2}, {score}) -- gModule
mlp = nn.Linear(3, 3) -- path2 layer

x = torch.rand(3, 3)
x_p = mlp:forward(x)
result = g:forward({x, x_p})
error = torch.rand(result:size())
gradient1 = g:backward(x, error) -- this is a table of 2 tensors
gradient2 = g:backward(x_p, error) -- this is also a table of 2 tensors
So what is wrong with my steps?
P.S. Perhaps I have found the reason: g:backward({x, x_p}, error) returns the same table, so I guess the two values stand for dy/dx and dy/dx_p respectively.
I think you simply made a mistake constructing your gModule. gradInput of every nn.Module has to have exactly the same structure as its input - that is the way backprop works.
Here's an example of how to create a module like yours using nngraph:
require 'torch'
require 'nn'
require 'nngraph'
function CreateModule(input_size)
local input = nn.Identity()() -- network input
local nn_module_1 = nn.Linear(input_size, 100)(input)
local nn_module_2 = nn.Linear(100, input_size)(nn_module_1)
local output = nn.CMulTable()({input, nn_module_2})
-- pack a graph into a convenient module with standard API (:forward(), :backward())
return nn.gModule({input}, {output})
end
input = torch.rand(30)
my_module = CreateModule(input:size(1))
output = my_module:forward(input)
criterion_err = torch.rand(output:size())
gradInput = my_module:backward(input, criterion_err)
print(gradInput)
UPDATE
As I said, the gradInput of every nn.Module has to have exactly the same structure as its input. So, if you define your module as nn.gModule({input1, input2}, {score}), your gradInput (the result of the backward pass) will be a table of gradients w.r.t. input1 and input2, which in your case are x and x_p.
The only question that remains: why on earth don't you get an error when you call:
gradient1 = g:backward(x, error)
gradient2 = g:backward(x_p, error)
An exception should be raised, because the first argument must be a table of two tensors, not a single tensor. Well, most (perhaps all) Torch modules don't use the input argument when computing :backward(input, gradOutput); they usually rely on the copy of the input stored during the last :forward(input) call. In fact, this argument is used so rarely that modules don't even bother to validate it.
I am trying to create a physics-informed neural network (PINN) in JAX. I want to differentiate the defined model (a neural network) with respect to its input (x). If I call jax.grad(params), I get an error.
If I call jax.grad(model), I don't get an error, but I don't know whether that actually differentiates the network with respect to x.
import jax
import jax.numpy as jnp
import optax
import flax.linen as fnn
from flax.training.train_state import TrainState

class MLP(fnn.Module):
    @fnn.compact
    def __call__(self, x):
        x = fnn.Dense(128)(x)
        x = fnn.relu(x)
        x = fnn.Dense(256)(x)
        x = fnn.relu(x)
        x = fnn.Dense(10)(x)
        return x

model = MLP()
params = model.init(jax.random.PRNGKey(0), jnp.ones([1]))['params']
tx = optax.adam(0.001)
state = TrainState.create(apply_fn=model.apply, params=params, tx=tx)
You can differentiate a model in JAX by (1) defining a function that you want to differentiate, (2) transforming it with jax.grad, jax.jacrev, jax.jacfwd, etc. as appropriate for your application, and (3) passing data to the transformed function.
It's not entirely clear from your question what operation you're hoping to differentiate, but here is an example of computing a forward-mode jacobian of the training state creation with respect to the params:
def f(params):
    return TrainState.create(apply_fn=model.apply, params=params, tx=tx)

result = jax.jacfwd(f)(params)
If that doesn't help, I'd suggest editing your question to make clear what operation you're interested in differentiating.
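If the goal is the usual PINN setup of differentiating the network output with respect to the input x, a minimal sketch might look like the following. This assumes the MLP, model, and params defined in the question; the .sum() is only there to produce the scalar that jax.grad requires, and for a full Jacobian you would use jax.jacrev or jax.jacfwd instead:
import jax
import jax.numpy as jnp

def u(x):
    # Forward pass of the network at fixed params.
    return model.apply({'params': params}, x)

# Derivative of the (summed) network output w.r.t. the input x.
du_dx = jax.grad(lambda x: u(x).sum())

x = jnp.ones([1])
print(du_dx(x))  # same shape as x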
So my goal is basically to implement global top-k subsampling. Gradient sparsification is quite simple, and I have already done this building on the stateful clients example, but now I would like to use encoders as you have recommended here at page 28. Additionally, I would like to average only the non-zero gradients: say we have 10 clients, but only 4 have non-zero gradients at a given position for a communication round; then I would like to divide the sum of these gradients by 4, not 10. I am hoping to achieve this by summing gradients in the numerator and masks (1s and 0s) in the denominator. Also, moving forward I will add randomness to the gradient selection, so it is imperative that I create those masks concurrently with the gradient selection. The code I have right now is:
import tensorflow as tf
from tensorflow_model_optimization.python.core.internal import tensor_encoding as te


@te.core.tf_style_adaptive_encoding_stage
class GradientSparsificationEncodingStage(te.core.AdaptiveEncodingStageInterface):
  """A custom `AdaptiveEncodingStageInterface` for gradient sparsification.

  This encoding stage is expected to be run in an iterative manner. It
  flattens and concatenates the input structure, adds the error compensation
  carried over in the state, and keeps the top 10% of entries by magnitude.
  The decode method scatters the kept values back into a flat tensor,
  reshapes it to the original structure, and also returns a 0/1 mask of the
  non-zero locations.
  """

  ENCODED_VALUES_KEY = 'stateful_topk_values'
  INDICES_KEY = 'indices'
  SHAPES_KEY = 'shapes'
  ERROR_COMPENSATION_KEY = 'error_compensation'

  def encode(self, x, encode_params):
    shapes_list = [tf.shape(y) for y in x]
    flattened = tf.nest.map_structure(lambda y: tf.reshape(y, [-1]), x)
    gradients = tf.concat(flattened, axis=0)
    error_compensation = encode_params[self.ERROR_COMPENSATION_KEY]
    gradients_and_error_compensation = tf.math.add(gradients, error_compensation)
    percentage = tf.constant(0.1, dtype=tf.float32)
    k_float = tf.multiply(percentage, tf.cast(tf.size(gradients_and_error_compensation), tf.float32))
    k_int = tf.cast(tf.math.round(k_float), dtype=tf.int32)
    values, indices = tf.math.top_k(tf.math.abs(gradients_and_error_compensation), k=k_int, sorted=False)
    indices = tf.expand_dims(indices, 1)
    sparse_gradients_and_error_compensation = tf.scatter_nd(indices, values, tf.shape(gradients_and_error_compensation))
    new_error_compensation = tf.math.subtract(gradients_and_error_compensation, sparse_gradients_and_error_compensation)
    state_update_tensors = {self.ERROR_COMPENSATION_KEY: new_error_compensation}
    encoded_x = {self.ENCODED_VALUES_KEY: values,
                 self.INDICES_KEY: indices,
                 self.SHAPES_KEY: shapes_list}
    return encoded_x, state_update_tensors

  def decode(self,
             encoded_tensors,
             decode_params,
             num_summands=None,
             shape=None):
    del num_summands, decode_params, shape  # Unused.
    flat_shape = tf.math.reduce_sum([tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]])
    sizes_list = [tf.math.reduce_prod(shape) for shape in encoded_tensors[self.SHAPES_KEY]]
    scatter_tensor = tf.scatter_nd(
        indices=encoded_tensors[self.INDICES_KEY],
        updates=encoded_tensors[self.ENCODED_VALUES_KEY],
        shape=[flat_shape])
    nonzero_locations = tf.nest.map_structure(lambda x: tf.cast(tf.where(tf.math.greater(x, 0), 1, 0), tf.float32), scatter_tensor)
    reshaped_tensor = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
                       zip(tf.split(scatter_tensor, sizes_list), encoded_tensors[self.SHAPES_KEY])]
    reshaped_nonzero = [tf.reshape(flat_tensor, shape=shape) for flat_tensor, shape in
                        zip(tf.split(nonzero_locations, sizes_list), encoded_tensors[self.SHAPES_KEY])]
    return reshaped_tensor, reshaped_nonzero

  def initial_state(self):
    return {self.ERROR_COMPENSATION_KEY: tf.constant(0, dtype=tf.float32)}

  def update_state(self, state, state_update_tensors):
    return {self.ERROR_COMPENSATION_KEY: state_update_tensors[self.ERROR_COMPENSATION_KEY]}

  def get_params(self, state):
    encode_params = {self.ERROR_COMPENSATION_KEY: state[self.ERROR_COMPENSATION_KEY]}
    decode_params = {}
    return encode_params, decode_params

  @property
  def name(self):
    return 'gradient_sparsification_encoding_stage'

  @property
  def compressible_tensors_keys(self):
    return False

  @property
  def commutes_with_sum(self):
    return False

  @property
  def decode_needs_input_shape(self):
    return False

  @property
  def state_update_aggregation_modes(self):
    return {}
I have run some simple tests manually, following the steps you outlined here at page 45. It works, but I have some questions/problems.
1. When I use a list of tensors of the same shape (e.g. two 2x25 tensors) as the input x of encode, it works without any issues, but when I try a list of tensors of different shapes (2x20 and 6x10) it gives an error saying
InvalidArgumentError: Shapes of all inputs must match: values[0].shape = [2,20] != values[1].shape = [6,10] [Op:Pack] name: packed
How can I resolve this issue? As I said, I want to use global top-k, so it is essential that I encode the entire set of trainable model weights at once. Take the CNN model used here: all the tensors have different shapes.
2. How can I do the averaging I described at the beginning? For example, here you have done
mean_factory = tff.aggregators.MeanFactory(
    tff.aggregators.EncodedSumFactory(mean_encoder_fn),  # numerator
    tff.aggregators.EncodedSumFactory(mean_encoder_fn))  # denominator
Is there a way to repeat this with one output of decode going to the numerator and the other going to the denominator? How can I handle dividing 0 by 0? TensorFlow has a divide_no_nan function; can I use it somehow, or do I need to add eps to each?
3. How is partitioning handled when I use encoders? Does each client get a unique encoder holding a unique state for it? As you have discussed here at page 6, client states are used in cross-silo settings, yet what happens if the client ordering changes?
4. Here you have recommended using the stateful clients example. Can you explain this a bit further? I mean, in run_one_round, where exactly do the encoders go, and how are they used/combined with the client update and aggregation?
5. I have some additional information, such as sparsity, that I want to pass to encode. What is the suggested method for doing that?
Here are some answers; hope they help:
1. If you want to treat the whole aggregated structure just as a single tensor, use concat_factory as the outermost aggregator. It will concatenate the entire structure into a rank-1 tensor at the clients, and then unpack it back to the original structure at the end. Example use: tff.aggregators.concat_factory(tff.aggregators.MeanFactory(...))
Note that the encoding stage objects are meant to work with a single tensor, so what you describe with identical tensors probably works only accidentally.
2. There are two options.
a. Modify the client training code so that the weights being passed to the weighted aggregator are already what you want them to be (the zero/one mask). In the stateful clients example you link, that would be here. You will then get what you need by default (by summing the numerator).
b. Modify UnweightedMeanFactory to do exactly the variant of averaging you describe and use that. A start would be modifying this; see the divide_no_nan sketch below for the 0/0 handling.
3. (and 4.) I think that is what you would need to implement. The same way existing client states are initialized in the example here, you would need to extend it to contain the aggregator states and make sure those are sampled together with the clients, as done here. Then, to integrate the aggregators in the example, you would need to replace this hard-coded tff.federated_mean. An example of such integration is in the implementation of tff.learning.build_federated_averaging_process, primarily here.
5. I am not sure what the question is. Perhaps get the previous parts working (that seems like a prerequisite to me), and then clarify and ask in a new post?
I have trained a model and deployed it to tensorflow-serving for inference.
I am getting this error when making a request:
<Response [400]>
{'error': 'transpose expects a vector of size 5. But input(1) is a vector of size 3\n\t [[{{node bidirectional_1/transpose}} = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _class=["loc:#bidirectional_1/TensorArrayUnstack/TensorArrayScatter/TensorArrayScatterV3"], _output_shapes=[[50,?,512]], _device="/job:localhost/replica:0/task:0/device:CPU:0"](embedding_1/embedding_lookup, Attention/transpose/perm)]]'}
The notable difference between this model and the first one I deployed, which worked without issue, is that it contains a custom Keras layer, whereas my successful attempt contained only standard Keras layers.
This is how I am testing the POST request to my tf-serving model:
import json
import pickle
import requests

with open("CNN_last_test_set.pkl", "rb") as fp:
    x_arr_test, y_test = pickle.load(fp)

out = x_arr_test[:1, :]
out = out.tolist()
payload = {
    "instances": [{'input': [out]}]
}
r = requests.post('http://localhost:9000/v1/models/prod_mod:predict', json=payload)
pred = json.loads(r.content.decode('utf-8'))
To create the tensorflow model object to use with tf-serving I am using this function:
import os

import keras
import tensorflow as tf
from keras import backend as K

def export_model_custom_layer(filename, export_path_base):
    # Set the learning phase to test time.
    K.set_learning_phase(0)
    model = keras.models.load_model(filename, custom_objects={"Attention": Attention})
    sess = K.get_session()
    # Set the path to save the model and the model version.
    export_version = 1
    export_path = os.path.join(
        tf.compat.as_bytes(export_path_base),
        tf.compat.as_bytes(str(export_version)))
    tf.saved_model.simple_save(
        sess,
        export_path,
        inputs={'input': model.input},
        outputs={t.name.split(':')[0]: t for t in model.outputs},
        legacy_init_op=tf.tables_initializer())
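For reference, a hypothetical call to this function (both paths are placeholders):
export_model_custom_layer("Data/modelCNN.model", "serving/prod_mod")
tf-serving then picks up the numbered version directory created under the base path.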
Here I've defined my custom layer as a custom object. In order for this to work, I've added this function to my custom layer:
def get_config(self):
    config = {
        'name': "Attention"
    }
    base_config = super(Attention, self).get_config()
    return dict(list(base_config.items()) + list(config.items()))
When I predict with the model using the standard Keras model.predict(), with the same data format the tf-serving model receives, it works as intended:
class Attention(Layer):
    ...

with open("CNN_last_test_set.pkl", "rb") as fp:
    x_arr_test, y_test = pickle.load(fp)

model = keras.models.load_model(r"Data/modelCNN.model", custom_objects={"Attention": Attention})
out = x_arr_test[:1, :]
test1 = out.shape
out = out.tolist()
test = model.predict([out])
>> print(test)
>> [[0.21351092]]
This leads me to believe that the issue is happening either when I export the model from Keras to the .pb file, or in the way the model is being run in the Docker container.
I am not sure what to make of this error, but I'm assuming it is related to my custom layer object, considering that it worked with my previous model that contained only standard Keras layers.
Any help is greatly appreciated, thanks!
EDIT: I solved the issue. The problem was that my input data had two more dimensions than necessary. I realized this when removing the brackets from around the variable "out" changed the error from 'transpose expects a vector of size 5' to 'transpose expects a vector of size 4'. So I reshaped "out" from (1, 50) to (50,), removed the brackets, and the problem resolved itself.
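A minimal sketch of that fix as applied to the request code above (shapes as described in the edit):
out = x_arr_test[:1, :].reshape(-1)  # (1, 50) -> (50,)
payload = {
    "instances": [{'input': out.tolist()}]  # no extra brackets around the input
}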
I'm trying to use a placeholder in my graph as a Variable (so I can later optimise something with respect to it), but I don't know the best way to do that. I've tried this:
x = tf.placeholder(tf.float32, shape = [None,1])
x_as_variable = tf.Variable(x, validate_shape = False)
but every time I build my graph, I get an error when I try to add my loss function:
train = tf.train.AdamOptimizer().minimize(MSEloss)
The error is:
ValueError: as_list() is not defined on an unknown TensorShape.
Even if you're not entirely familiar with the error, I'd really appreciate it if you could guide me on how to build a replica Variable that takes on the value of my placeholder.
Thanks!
As you've noticed, TensorFlow optimizers (i.e. subclasses of tf.train.Optimizer) operate on tf.Variable objects because they need to be able to assign new values to those objects, and in TensorFlow only variables support an assign operation. If you use a tf.placeholder(), there's nothing to update, because the value of a placeholder is immutable within each step.
So how do you optimize with respect to a fed-in value? I can think of two options:
Instead of feeding a tf.placeholder(), you could first assign a fed-in value to a variable and then optimize with respect to it:
var = tf.Variable(...)
set_var_placeholder = tf.placeholder(tf.float32, ...)
set_var_op = var.assign(set_var_placeholder)
# ...
train_op = tf.train.AdamOptimizer(...).minimize(mse_loss, var_list=[var, ...])
# ...
initial_val = ... # A NumPy array.
sess.run(set_var_op, feed_dict={set_var_placeholder: initial_val})
sess.run(train_op)
updated_val = sess.run(var)
You could use the lower-level tf.gradients() function to get the gradient of the loss with respect to a placeholder in a single step. You could then use that gradient in Python:
var = tf.placeholder(tf.float32, ...)
# ...
mse_loss = ...
var_grad, = tf.gradients(mse_loss, [var])
var_grad_val = sess.run(var_grad, feed_dict={var: ...})
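For instance, a hypothetical manual gradient-descent loop in Python that uses this gradient (the learning rate and iteration count are arbitrary):
learning_rate = 0.1
current_val = ...  # A NumPy array holding the initial value.
for _ in range(100):
    grad_val = sess.run(var_grad, feed_dict={var: current_val})
    current_val = current_val - learning_rate * grad_val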
PS. The code in your question, where you define a tf.Variable(tf.placeholder(...), ...) is just defining a variable whose initial value is fed by the placeholder. This probably isn't what you want, because the training op that the optimizer creates will only use the value assigned to the variable, and ignore whatever you feed to the placeholder (after the initialization step).
I am working on a feed-forward network in PyBrain. To let me compare the effects of varying certain parameters, I have initialised the network weights myself. I have done this under the assumption that if the weights are always the same, then the output should always be the same. Is this assumption incorrect? Below is the code used to set up the network.
from pybrain.structure import FeedForwardNetwork, LinearLayer, SigmoidLayer, FullConnection
from pybrain.supervised.trainers import BackpropTrainer

n = FeedForwardNetwork()
inLayer = LinearLayer(7, name="in")
hiddenLayer = SigmoidLayer(1, name="hidden")
outLayer = LinearLayer(1, name="out")
n.addInputModule(inLayer)
n.addModule(hiddenLayer)
n.addOutputModule(outLayer)
in_to_hidden = FullConnection(inLayer, hiddenLayer, name="in-to-hidden")
hidden_to_out = FullConnection(hiddenLayer, outLayer, name="hidden-to-out")
n.addConnection(in_to_hidden)
n.addConnection(hidden_to_out)
n.sortModules()
in_to_hidden_params = [
0.27160018, -0.30659429, 0.13443352, 0.4509613,
0.2539234, -0.8756649, 1.25660715
]
hidden_to_out_params = [0.89784474]
net_params = in_to_hidden_params + hidden_to_out_params
n._setParameters(net_params)
trainer = BackpropTrainer(n, ds, learningrate=0.01, momentum=0.8)  # ds is the training dataset, defined elsewhere
UPDATE
It looks like even after seeding the random number generator, reproducibility is still an issue. See the GitHub issue here.
I have done this under the assumption that if the weights are always the same then the output should always be the same
The assumption is correct, but your code is not doing that. You are training your weights, so they do not end up being the same. Stochastic training methods often permute the training samples, and this permutation leads to different results; in particular, BackpropTrainer does so:
def train(self):
    """Train the associated module for one epoch."""
    assert len(self.ds) > 0, "Dataset cannot be empty."
    self.module.resetDerivatives()
    errors = 0
    ponderation = 0.
    shuffledSequences = []
    for seq in self.ds._provideSequences():
        shuffledSequences.append(seq)
    shuffle(shuffledSequences)
    # ... (rest of the method omitted)
If you want repeatable results, seed your random number generators.
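A minimal sketch of what that could look like, assuming PyBrain's shuffling and weight initialization draw from Python's random module and NumPy (seed these before building and training the network):
import random
import numpy as np

random.seed(0)     # BackpropTrainer shuffles sequences with random.shuffle
np.random.seed(0)  # covers NumPy/SciPy-based randomness such as weight init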