Is there a way to share weights across parallel streams of a torch-model?
For example, I have the following model.
mlp = nn.Sequential();
c = nn.Parallel(1,2) -- Parallel container will associate a module to each slice of dimension 1
-- (row space), and concatenate the outputs over the 2nd dimension.
for i=1,10 do -- Add 10 Linear+Reshape modules in parallel (input = 3, output = 2x1)
local t=nn.Sequential()
t:add(nn.Linear(3,2)) -- Linear module (input = 3, output = 2)
t:add(nn.Reshape(2,1)) -- Reshape 1D Tensor of size 2 to 2D Tensor of size 2x1
c:add(t)
end
mlp:add(c)
And now I want to share the weight (including everything, weights, bias, gradients), of the nn.Linear layer above across different numbers of i (so, e.g. nn.Linear(3,2)[1] with nn.Linear(3,2)[9]). What options do I have to share those?
Or is it rather recommended to use a different container/the module-approach?
You can create the module that will be repeated:
t = nn.Sequential()
t:add(nn.Linear(3,2))
t:add(nn.Reshape(2,1))
Then you can use the clone function of torch with additional parameters to share the weights (https://github.com/torch/nn/blob/master/doc/module.md#clonemlp)
mlp = nn.Sequential()
c = nn.Parallel(1,2)
for i = 1, 10 do
c:add(t:clone('weight', 'bias'))
end
mlp:add(c)
Related
hello guys i am new in machine learning. I am implementing federated learning on with LSTM to predict the next label in a sequence. my sequence looks like this [2,3,5,1,4,2,5,7]. for example, the intention is predict the 7 in this sequence. So I tried a simple federated learning with keras. I used this approach for another model(Not LSTM) and it worked for me, but here it always overfits on 2. it always predict 2 for any input. I made the input data so balance, means there are almost equal number for each label in last index (here is 7).I tested this data on simple deep learning and greatly works. so it seems to me this data mybe is not suitable for LSTM or any other issue. Please help me. This is my Code for my federated learning. Please let me know if more information is needed, I really need it. Thanks
def get_lstm(units):
"""LSTM(Long Short-Term Memory)
Build LSTM Model.
# Arguments
units: List(int), number of input, output and hidden units.
# Returns
model: Model, nn model.
"""
model = Sequential()
inp = layers.Input((units[0],1))
x = layers.LSTM(units[1], return_sequences=True)(inp)
x = layers.LSTM(units[2])(x)
x = layers.Dropout(0.2)(x)
out = layers.Dense(units[3], activation='softmax')(x)
model = Model(inp, out)
optimizer = keras.optimizers.Adam(lr=0.01)
seqLen=8 -1;
global_model = Mymodel.get_lstm([seqLen, 64, 64, 15]) # 14 categories we have , array start from 0 but never can predict zero class
global_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1))
def main(argv):
for comm_round in range(comms_round):
print("round_%d" %( comm_round))
scaled_local_weight_list = list()
global_weights = global_model.get_weights()
np.random.shuffle(train)
temp_data = train[:]
# data divided among ten users and shuffled
for user in range(10):
user_data = temp_data[user * userDataSize: (user+1)*userDataSize]
X_train = user_data[:, 0:seqLen]
X_train = np.asarray(X_train).astype(np.float32)
Y_train = user_data[:, seqLen]
Y_train = np.asarray(Y_train).astype(np.float32)
local_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
local_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1))
local_model.set_weights(global_weights)
local_model.fit(X_train, Y_train)
scaling_factor = 1 / 10 # 10 is number of users
scaled_weights = scale_model_weights(local_model.get_weights(), scaling_factor)
scaled_local_weight_list.append(scaled_weights)
K.clear_session()
average_weights = sum_scaled_weights(scaled_local_weight_list)
global_model.set_weights(average_weights)
predictions=global_model.predict(X_test)
for i in range(len(X_test)):
print('%d,%d' % ((np.argmax(predictions[i])), Y_test[i]),file=f2 )
I could find some reasons for my problem, so I thought I can share it with you:
1- the proportion of different items in sequences are not balanced. I mean for example I have 1000 of "2" and 100 of other numbers, so after a few rounds the model fitted on 2 because there are much more data for specific numbers.
2- I changed my sequences as there are not any two items in a sequence while both have same value. so I could remove some repetitive data from the sequences and make them more balance. maybe it is not the whole presentation of activities but in my case it makes sense.
I am trying to create a list based on my neural network outputs and use it in Tensorflow as a loss function.
Assume that results is list of size [1, batch_size] that is output by a neural network. I check to see whether the first value of this list is in a specific range passed in as a placeholder called valid_range, and if it is add 1 to a list. If it is not, add -1. The goal is to make all predictions of the network in the range, so the correct predictions is a tensor of all 1, which I call correct_predictions.
values_list = []
for j in range(batch_size):
a = results[0, j] >= valid_range[0]
b = result[0, j] <= valid_range[1]
c = tf.logical_and(a, b)
if (c == 1):
values_list.append(1)
else:
values_list.append(-1.)
values_list_tensor = tf.convert_to_tensor(values_list)
correct_predictions = tf.ones([batch_size, ], tf.float32)
Now, I want to use this as a loss function in my network, so that I can force all the predictions to be in the specified range. I try to train like this:
loss = tf.reduce_mean(tf.squared_difference(values_list_tensor, correct_predictions))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
gradients, variables = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip_threshold)
optimize = optimizer.apply_gradients(zip(gradients, variables))
This, however, has a problem and throws an error on the last optimize line, saying:
ValueError: No gradients provided for any variable: ['<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d4afd0>',
'<tensorflow.python.training.optimizer._RefVariableProcessor object at 0x7f0245d66050>'
...
I tried to debug this in Tensorboard, and I notice that the list I am creating does not appear in the graph, so basically the x part of the loss function is not part of the network itself. Is there some way to accurately create a list based on the predictions of a neural network and use it in the loss function in Tensorflow to train the network?
Please help, I have been stuck on this for a few days now.
Edit:
Following what was suggested in the comments, I decided to use a l2 loss function, multiplying it by the binary vector I had from before values_list_tensor. The binary vector now has values 1 and 0 instead of 1 and -1. This way when the prediction is in the range the loss is 0, else it is the normal l2 loss. As I am unable to see the values of the tensors, I am not sure if this is correct. However, I can view the final loss and it is always 0, so something is wrong here. I am unsure if the multiplication is being done correctly and if values_list_tensor is calculated accurately? Can someone help and tell me what could be wrong?
loss = tf.reduce_mean(tf.nn.l2_loss(tf.matmul(tf.transpose(tf.expand_dims(values_list_tensor, 1)), tf.expand_dims(result[0, :], 1))))
Thanks
To answer the question in the comment. One way to write a piece-wise function is using tf.cond. For example, here is a function that returns 0 in [-1, 1] and x everywhere else:
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32)
y = tf.cond(tf.logical_or(tf.greater(x, 1.0), tf.less(x, -1.0)), lambda : x, lambda : 0.0)
y.eval({x: 1.5}) # prints 1.5
y.eval({x: 0.5}) # prints 0.0
I am trying to compute a weighted output from multiple parallel models using Keras' Merge layer. I'm using Theano backend.
I have L parallel models (Ci). Each of their output layer is a k-sized softmax.
There is one model (N), its output is a L-sized softmax.
Here is what I have so far:
Parallel models (Ci) each with k dimension in the output layer:
model.add(Dense(K, activation='softmax', W_regularizer=l2(0.001),init='normal'))
The weighing model (N), output layer:
model.add(Dense(L, activation='softmax', W_regularizer=l2(0.001), init='normal'))
The merger is as follows:
model.add(Merge(layers=model_group,
mode=lambda model_group: self.merge_fun(model_group, L),
output_shape = (None, k)))
where "model_group" is a (L+1)-length list [N, C1, C2, ..., CL], and merge_fun's signature is:
def merge_fun(self, model_group, L):
Mathematically, I would like the output of the merged layer to be a weighted sum:
out = N[1]x([C11, C12, C13, .., C1k]) + N[2]x([C21, C22, C23, ..., C2k]) + ... + N[L]x([CL1, CL2, CL3, ..., CLk]),
where out is a vector of size k
How can I use the Merge layer to achieve this ?
I know that the magic would probably have to happen in the 'merge_fun', but I am not sure how to perform matrix algebra in Keras. The tensor parameters don't have a "shape" parameter - they have a keras_shape = (None, K or L) - but I am not sure how to combine parallel models' output into a matrix.
I tried using a local evaluation of the following expressions:
K.concatenate([model_group[1], model_group[2]], axis=0)*model_group[0]
and
model_group[0] * K.concatenate([model_group[1], model_group[2]], axis=0)
both of which didn't throw an error, so I can't use this as a guide. After the multiplication, the result returned did not have the keras_shape variable, so I'm not sure what the shape of the result is.
Any suggestions ?
What I advise you is to use a functional API and use this is in a following manner:
Define the L output models:
softmax_1 = Dense(K, activation='softmax', ...))(input_to_softmax_1)
softmax_2 = Dense(K, activation='softmax', ...))(input_to_softmax_2)
...
softmax_L = Dense(K, activation='softmax', ...))(input_to_softmax_L)
Define the merge softmax:
merge_softmax= Dense(L, activation='softmax', ...)(input_to_merge_softmax)
merge_softmax = Reshape((1, L))(merge_softmax)
Merge and reshape the bag of L models:
bag_of_models = merge([softmax_1, ..., softmax_L], mode = 'concat')
bag_of_models = Reshape((L, K))(bag_of_models)
Compute the final merged softmax:
final_result = merge([bag_of_models, merge_softmax], mode = 'dot', dot_axes = [1, 2])
final_result = Reshape((K, ))(final_result)
Of course - depending on your topology - different tensor might be the same (e.g. input to different softmaxes). I tested this on my machine but due to extensive refactoring - I might made mistake - so if you fin one - please inform me.
The solution with Sequential is much less clear and a little bit cumbersome - but if you want one - please write in the comment so I will update my answer.
I have a convolutional neural network whose output is a 4-channel 2D image. I want to apply sigmoid activation function to the first two channels and then use BCECriterion to computer the loss of the produced images with the ground truth ones. I want to apply squared loss function to the last two channels and finally computer the gradients and do backprop. I would also like to multiply the cost of the squared loss for each of the two last channels by a desired scalar.
So the cost has the following form:
cost = crossEntropyCh[{1, 2}] + l1 * squaredLossCh_3 + l2 * squaredLossCh_4
The way I'm thinking about doing this is as follow:
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
error = criterion1:forward(model.output[{{}, {1, 2}}], groundTruth1) + l1 * criterion2:forward(model.output[{{}, {3}}], groundTruth2) + l2 * criterion2:forward(model.output[{{}, {4}}], groundTruth3)
However, I don't think this is the correct way of doing it since I will have to do 3 separate backprop steps, one for each of the cost terms. So I wonder, can anyone give me a better solution to do this in Torch?
SplitTable and ParallelCriterion might be helpful for your problem.
Your current output layer is followed by nn.SplitTable that splits your output channels and converts your output tensor into a table. You can also combine different functions by using ParallelCriterion so that each criterion is applied on the corresponding entry of output table.
For details, I suggest you read documentation of Torch about tables.
After comments, I added the following code segment solving the original question.
M = 100
C = 4
H = 64
W = 64
dataIn = torch.rand(M, C, H, W)
layerOfTables = nn.Sequential()
-- Because SplitTable discards the dimension it is applied on, we insert
-- an additional dimension.
layerOfTables:add(nn.Reshape(M,C,1,H,W))
-- We want to split over the second dimension (i.e. channels).
layerOfTables:add(nn.SplitTable(2, 5))
-- We use ConcatTable in order to create paths accessing to the data for
-- numereous number of criterions. Each branch from the ConcatTable will
-- have access to the data (i.e. the output table).
criterionPath = nn.ConcatTable()
-- Starting from offset 1, NarrowTable will select 2 elements. Since you
-- want to use this portion as a 2 dimensional channel, we need to combine
-- then by using JoinTable. Without JoinTable, the output will be again a
-- table with 2 elements.
criterionPath:add(nn.Sequential():add(nn.NarrowTable(1, 2)):add(nn.JoinTable(2)))
-- SelectTable is simplified version of NarrowTable, and it fetches the desired element.
criterionPath:add(nn.SelectTable(3))
criterionPath:add(nn.SelectTable(4))
layerOfTables:add(criterionPath)
-- Here goes the criterion container. You can use this as if it is a regular
-- criterion function (Please see the examples on documentation page).
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())
criterionContainer:add(nn.MSECriterion())
criterionContainer:add(nn.MSECriterion())
Since I used almost every possible table operation, it looks a little bit nasty. However, this is the only way I could solve this problem. I hope that it helps you and others suffering from the same problem. This is how the result looks like:
dataOut = layerOfTables:forward(dataIn)
print(dataOut)
{
1 : DoubleTensor - size: 100x2x64x64
2 : DoubleTensor - size: 100x1x64x64
3 : DoubleTensor - size: 100x1x64x64
}
I need to perform a custom spatial convolution in Torch. Rather than simply multiplying each input pixel by a weight for that pixel and adding them together with the filter's bias to form each output pixel, I need to do a more complex mathematical function to the input pixels before adding them together.
I know how to do this, but I do not know a GOOD way to do this. The best way I've come up with is to take the full input tensor, create a bunch of secondary tensors that are "views" of the original without allocating additional memory, putting those into a Replicate layer (the output filter count being the replication count), and feeding that into a ParallelTable layer containing a bunch of regular layers that have their parameters shared between filters.
The trouble is, even though this is fine memory-wise with a very manageable overhead, we're talking inputwidth^inputheight^inputdepth^outputdepth mini-networks, here. Maybe there's some way to create massive "long and tall" networks that work on the entire replicated input set at once, but how do I create layers that are partially-connected (like convolutions) instead of fully-connected?
I would have liked to just use inheritance to create a special copy of the regular SpatialConvolution "class" and modify it, but I can't even try because it's implemented in an external C library. I can't just use regular layers before a regular SpatialConvolution layer because I need to do my math with different weights and biases for each filter (shared between applications of the same filter to different input coordinates).
Good question. You made me give some serious thought.
Your approach has a flaw: it does not allow to take advantage of vectorized computations since each mini-network works independently.
My idea is as follows:
Suppose network's input and output are 2D tensors. We can produce (efficiently, without memory copying) an auxiliary 4D tensor
rf_input (kernel_size x kernel_size x output_h x output_w)
such that rf_input[:, :, k, l] is a 2D tensor of size kernel_size x kernel_size containing a receptive field which output[k, l] will be gotten from. Then we iterate over positions inside the kernel rf_input[i, j, :, :] getting pixels at position (i, j) inside all receptive fields and computing their contribution to each output[k, l] at once using vectorization.
Example:
Let our "convolving" function be, for example, a product of tangents of sums:
Then its partial derivative w.r.t. the input pixel at position (s,t) in its receptive field is
Derivative w.r.t. weight is the same.
At the end, of course, we must sum up gradients from different output[k,l] points. For example, each input[m, n] contributes to at most kernel_size^2 outputs as a part of their receptive fields, and each weight[i, j] contributes to all output_h x output_w outputs.
Simple implementation may look like this:
require 'nn'
local CustomConv, parent = torch.class('nn.CustomConv', 'nn.Module')
-- This module takes and produces a 2D map.
-- To work with multiple input/output feature maps and batches,
-- you have to iterate over them or further vectorize computations inside the loops.
function CustomConv:__init(ker_size)
parent.__init(self)
self.ker_size = ker_size
self.weight = torch.rand(self.ker_size, self.ker_size):add(-0.5)
self.gradWeight = torch.Tensor(self.weight:size()):zero()
end
function CustomConv:_get_recfield_input(input)
local rf_input = {}
for i = 1, self.ker_size do
rf_input[i] = {}
for j = 1, self.ker_size do
rf_input[i][j] = input[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]
end
end
return rf_input
end
function CustomConv:updateOutput(_)
local output = torch.Tensor(self.rf_input[1][1]:size())
-- Kernel-specific: our kernel is multiplicative, so we start with ones
output:fill(1)
--
for i = 1, self.ker_size do
for j = 1, self.ker_size do
local ker_pt = self.rf_input[i][j]:clone()
local w = self.weight[i][j]
-- Kernel-specific
output:cmul(ker_pt:add(w):tan())
--
end
end
return output
end
function CustomConv:updateGradInput_and_accGradParameters(_, gradOutput)
local gradInput = torch.Tensor(self.input:size()):zero()
for i = 1, self.ker_size do
for j = 1, self.ker_size do
local ker_pt = self.rf_input[i][j]:clone()
local w = self.weight[i][j]
-- Kernel-specific
local subGradInput = torch.cmul(gradOutput, torch.cdiv(self.output, ker_pt:add(w):tan():cmul(ker_pt:add(w):cos():pow(2))))
local subGradWeight = subGradInput
--
gradInput[{{i, i - self.ker_size - 1}, {j, j - self.ker_size - 1}}]:add(subGradInput)
self.gradWeight[{i, j}] = self.gradWeight[{i, j}] + torch.sum(subGradWeight)
end
end
return gradInput
end
function CustomConv:forward(input)
self.input = input
self.rf_input = self:_get_recfield_input(input)
self.output = self:updateOutput(_)
return self.output
end
function CustomConv:backward(input, gradOutput)
gradInput = self:updateGradInput_and_accGradParameters(_, gradOutput)
return gradInput
end
If you change this code a bit:
updateOutput:
output:fill(0)
[...]
output:add(ker_pt:mul(w))
updateGradInput_and_accGradParameters:
local subGradInput = torch.mul(gradOutput, w)
local subGradWeight = torch.cmul(gradOutput, ker_pt)
then it will work exactly as nn.SpatialConvolutionMM with zero bias (I've tested it).