Variational Autoencoder cross-entropy loss (xent_loss) with 3D convolutional layers

I am adapting this VAE implementation https://github.com/keras-team/keras/blob/master/examples/variational_autoencoder.py, which I found here: https://blog.keras.io/building-autoencoders-in-keras.html
This implementation does not use convolutional layers, so everything happens in 1D, so to speak. My goal is to add 3D convolutional layers to this model.
However, I run into a shape mismatch in the loss function when running the batches (which contain 128 samples each):
def vae_loss(self, x, x_decoded_mean):
    xent_loss = original_dim * metrics.binary_crossentropy(x, x_decoded_mean)
    # xent_loss.shape >> [128, 40, 20, 40, 1]
    kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
    # kl_loss.shape >> [128]
    return K.mean(xent_loss + kl_loss)  # >> error: shape mismatch
Almost the same question is already answered here: Keras - Variational Autoencoder Incompatible shape for a model with 1D convolutional layers, but I can't really understand how to extrapolate that answer to my case, which has a more complex input shape.
I have tried this solution:
xent_loss = original_dim * metrics.binary_crossentropy(K.flatten(x), K.flatten(x_decoded_mean))
But I don't know whether this is a valid solution from a mathematical point of view, although the model now runs.

Your approach is right, but it depends heavily on the K.binary_crossentropy implementation. The TensorFlow and Theano versions should work for you (as far as I know). To make it cleaner and implementation-independent, I suggest the following:
xent_loss_vec = original_dim * metrics.binary_crossentropy(x, x_decoded_mean)
xent_loss = K.mean(xent_loss_vec, axis=[1, 2, 3, 4])
# xent_loss.shape = (128,)
Now you are taking the mean of the per-voxel losses, and thanks to that any valid implementation of binary_crossentropy should work fine for you.
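For reference, here is a minimal sketch of the full loss with that per-voxel mean folded in, assuming z_mean, z_log_var and original_dim are defined as in the question:

def vae_loss(self, x, x_decoded_mean):
    # mean binary cross-entropy per sample: collapse the spatial/channel axes -> shape (128,)
    xent_loss_vec = original_dim * metrics.binary_crossentropy(x, x_decoded_mean)
    xent_loss = K.mean(xent_loss_vec, axis=[1, 2, 3, 4])
    # KL divergence term, also shape (128,)
    kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
    # both terms now have shape (128,), so adding them is well defined
    return K.mean(xent_loss + kl_loss)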

Related

Neural Net - trying to predict that 5 + 5 = 10

I'm learning about Neural Networks and I recently had this idea: trying to give a NN training data of the function $f(x) = 2x$. The question is, can the NN accurately predict that it has to double the input number to give the correct output?
This is just a "mental exercise", to better my understanding of how NNs work.
My Python code doesn't work, here's what I've tried:
Neural Network class:
import numpy as np

class NeuralNetwork:
    def __init__(self, inputnodes, hiddennodes, outputnodes, learningrate):
        self.inodes = inputnodes
        self.hnodes = hiddennodes
        self.onodes = outputnodes
        self.lr = learningrate
        self.wih = np.random.normal(0.0, pow(self.inodes, -0.5), (self.hnodes, self.inodes))
        self.who = np.random.normal(0.0, pow(self.hnodes, -0.5), (self.onodes, self.hnodes))

    def train(self, inputs_list, targets_list):
        inputs = np.array(inputs_list, ndmin=2).T
        targets = np.array(targets_list, ndmin=2).T
        hidden_outputs = np.dot(self.wih, inputs)
        final_outputs = np.dot(self.who, hidden_outputs)
        output_errors = targets - final_outputs
        hidden_errors = np.dot(self.who.T, output_errors)
        self.who += self.lr * np.dot(
            (output_errors * final_outputs * (1.0 - final_outputs)),
            np.transpose(hidden_outputs)
        )
        self.wih += self.lr * np.dot(
            (hidden_errors * hidden_outputs * (1.0 - hidden_outputs)),
            np.transpose(inputs)
        )

    def query(self, inputs_list):
        inputs = np.array(inputs_list, ndmin=2).T
        hidden_outputs = np.dot(self.wih, inputs)
        final_outputs = np.dot(self.who, hidden_outputs)
        return final_outputs
Training the network and predicting a value:
input_nodes = 1
hidden_nodes = 20
output_nodes = 1
learning_rate = 0.3

nn = NeuralNetwork(input_nodes, hidden_nodes, output_nodes, learning_rate)
for i in range(10):
    i += 1
    inputs = np.log(i)
    targets = np.log(2*i)
    nn.train(inputs, targets)

print(nn.query(np.asfarray([4])))
Here's the output I'm getting trying to run this code:
x.py:26: RuntimeWarning: overflow encountered in multiply
(output_errors * final_outputs * (1.0 - final_outputs)),
x.py:31: RuntimeWarning: overflow encountered in multiply
(hidden_errors * hidden_outputs * (1.0 - hidden_outputs)),
[[nan]]
I don't really know how to interpret this, and if my design is correct for this application. Any help would be appreciated.
Thanks.
Some suggestions:
Since the function of interest (f(x) = 2x) is linear and requires only one weight, we can vastly simplify the network by using a single weight and no hidden layers. We're trying to debug a problem, so we should simplify as much as possible to eliminate sources of error. Using a hidden layer with multiple hidden nodes means we need to find matrices such that W1.dot(W2) = 2, because we seek the function x.dot(W1).dot(W2); this is harder because changing one weight changes the entire product, so finding the correct answer requires aligning all of those weights. (A minimal sketch of this simplified setup follows these suggestions.)
Because the function of interest is linear, we know that any use of nonlinear functions is a distraction. Also, saturation of the sigmoid and tanh functions, or the dying ReLU phenomenon, could introduce additional problems to the optimization dynamics and prevent us from making progress. See: https://stats.stackexchange.com/questions/301285/what-is-vanishing-gradient
The learning rate is probably too large. I believe this is the problem because you're having numerical overflow; this can happen when the optimizer consistently overshoots the minimum. See: https://stats.stackexchange.com/questions/364360/how-can-change-in-cost-function-be-positive
Scaling the inputs and the targets of a regression problem can dramatically improve the optimizer dynamics. For an example, see https://stats.stackexchange.com/questions/432707/alternating-negative-and-positive-value-of-slope-and-y-intercept-in-gradient-des/432714#432714
Additional tips for training neural networks are here: https://stats.stackexchange.com/questions/352036/what-should-i-do-when-my-neural-network-doesnt-learn/352037#352037
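Here is a minimal sketch of the first suggestion (not the asker's code, and the names are illustrative): a single weight, no hidden layer, no activation, trained by gradient descent on the mean squared error. The weight should converge toward 2 for the target function f(x) = 2x.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)   # scaled inputs
y = 2 * x                         # targets for f(x) = 2x

w = rng.normal()                  # single weight, no hidden layer
lr = 0.1                          # modest learning rate to avoid overshooting

for _ in range(200):
    y_hat = w * x
    grad = np.mean(2 * (y_hat - y) * x)   # d/dw of the mean squared error
    w -= lr * grad

print(w)   # converges to roughly 2.0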
I think you are missing a very important building block of artificial neural network architectures: the activation function, which squashes each layer's output into a range such as [0, 1] or [-1, 1].
So attaching an activation function after computing each hidden layer's outputs (which is very important) should solve this problem: as the data propagates through the network it keeps normalized values, for example in [0, 1], so the overflow should not happen. A sketch is given after the notes below.
Notes:
sigmoid and tanh activations are the most popular and are suitable for your problem
your learning rate may be slightly too large; try 0.01
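As a hedged sketch (not the asker's exact code), here is how the forward pass could apply a sigmoid on the hidden layer; note that the weight updates in the question already contain the sigmoid-derivative term hidden_outputs * (1.0 - hidden_outputs), so adding the activation also makes that term consistent. The output layer stays linear to suit the regression targets.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def query(wih, who, inputs_list):
    inputs = np.array(inputs_list, ndmin=2).T
    hidden_outputs = sigmoid(np.dot(wih, inputs))   # activation on the hidden layer
    final_outputs = np.dot(who, hidden_outputs)     # linear output
    return final_outputs

# Example with random weights shaped like the asker's network (1 -> 20 -> 1):
wih = np.random.normal(0.0, 1.0, (20, 1))
who = np.random.normal(0.0, 20 ** -0.5, (1, 20))
print(query(wih, who, np.array([4.0])))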

Weight initialization in neural networks

Hi, I am developing a neural network model using Keras.
code
def base_model():
    # Initialising the ANN
    regressor = Sequential()
    # Adding the input layer and the first hidden layer
    regressor.add(Dense(units=4, kernel_initializer='he_normal', activation='relu', input_dim=7))
    # Adding the second hidden layer
    regressor.add(Dense(units=2, kernel_initializer='he_normal', activation='relu'))
    # Adding the output layer
    regressor.add(Dense(units=1, kernel_initializer='he_normal'))
    # Compiling the ANN
    regressor.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return regressor
I have been reading about which kernel_initializer to use and came across this link: https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404
It talks about Glorot and He initializations. I have tried different initializations for the weights, but all of them give the same results. I want to understand how important it is to do a proper initialization.
Thanks
I'll give you an explanation of how important weight initialisation is.
Let's suppose our NN has an input layer with 1000 neurons, and suppose we initialise the weights from a normal distribution with mean 0 and variance 1, i.e. w ~ N(0, 1).
At the second layer, we assume that only 500 of the first layer's neurons are activated (output 1), while the other 500 are not.
The input z of a neuron in the second layer will be the sum of the 500 weights attached to the active neurons, plus the bias:
z = w_1 + w_2 + ... + w_500 + b
so z will again be normally distributed, but with variance 501, i.e. standard deviation sqrt(501) ≈ 22.4.
This means it is very likely that z >> 1 or z << -1, so the neuron will saturate and the network will learn very slowly, if at all.
A solution is to initialise the weights as w ~ N(0, 1/n_in), where n_in is the number of inputs of the layer. In this way z will be distributed roughly as N(0, 3/2), which is far less spread out, so the neurons are much less prone to saturate.
This trick can help as a start, but in deep neural networks, due to the presence of many hidden layers, the weight initialisation should be done at each layer. One method that helps here is batch normalization.
Besides this, from your code I can see you've chosen MSE as the cost function, i.e. a quadratic cost. I don't know whether your problem is a classification one, but if it is, I suggest using a cross-entropy cost function instead, which speeds up the learning of your network.
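For what it's worth, here is a small numerical sketch of the saturation argument above (illustrative only, with sizes matching the example):

import numpy as np

rng = np.random.default_rng(0)
n_in = 1000
x = np.zeros(n_in)
x[:500] = 1.0                     # 500 active inputs, 500 inactive

# weights ~ N(0, 1): the pre-activation z spreads out with std ~ sqrt(501) ~ 22
w_wide = rng.normal(0.0, 1.0, size=(n_in, 10000))
z_wide = x @ w_wide + rng.normal(0.0, 1.0, size=10000)
print(z_wide.std())               # roughly 22

# weights ~ N(0, 1/n_in), i.e. std 1/sqrt(n_in): z has std ~ sqrt(1.5) ~ 1.2
w_scaled = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, 10000))
z_scaled = x @ w_scaled + rng.normal(0.0, 1.0, size=10000)
print(z_scaled.std())             # roughly 1.2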

Keras: Is there any workaround to split the output of an intermediate layer without using Lambda layer?

Say, I have a 10x10x4 intermediate output of a convolution layer, which I need to split into 100 1x1x4 volume and apply softmax on each to get 100 outputs from the network. Is there any way to accomplish this without using the Lambda layer? The issue with the Lambda layer in this case is this simple task of splitting takes 100 passes through the lambda layer during forward pass, which makes the network performance very slow for my practical use. Please suggest a quicker way of doing this.
Edit: I had already tried the Softmax+Reshape approach before asking the question. With that approach, the 10x10x4 output would be reshaped with Reshape into a 100x4 Tensor as the output. What I really need is a multi-output network with 100 different outputs. In my application, it is not possible to jointly optimize over the 10x10 matrix, but I get good results by using a network with 100 different outputs with the Lambda layer.
Here are code snippets of my approach using the Keras functional API:
With Lambda layer (slow, gives 100 Tensors of shape (None, 4) as desired):
# Assume conv_output is the output from a convolutional layer with shape (None, 10, 10, 4)
preds = []
for i in range(10):
    for j in range(10):
        y = Lambda(lambda x, i, j: x[:, i, j, :], arguments={'i': i, 'j': j})(conv_output)
        preds.append(Activation('softmax', name='predictions_' + str(i*10 + j))(y))

model = Model(inputs=img, outputs=preds, name='model')
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])
With Softmax+Reshape (fast, but gives Tensor of shape (None, 100, 4))
# Assume conv_output is the output from a convolutional layer with shape (None, 10, 10, 4)
y = Softmax(name='softmax', axis=-1)(conv_output)
preds = Reshape([100, 4])(y)
model = Model(inputs=img, outputs=preds, name='model')
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])
I don't think it is possible in the second case to individually optimize over each of the 100 outputs (one can probably think of it as learning the joint distribution, whereas I need to learn the marginals as in the first case). Please let me know if there is any way to accomplish what the Lambda layer does in the first code snippet, but faster.
You can use the Softmax layer and set the axis argument to the last axis (i.e. -1) to apply softmax over that axis:
from keras.layers import Softmax
soft_out = Softmax(axis=-1)(conv_out)
Note that the axis argument by default is set to -1, so you may not even need to pass that.

Cross Entropy Loss for Semantic Segmentation Keras

I'm pretty sure this is a silly question but I can't find it anywhere else so I'm going to ask it here.
I'm doing semantic image segmentation using a cnn (unet) in keras with 7 labels. So my label for each image is (7,n_rows,n_cols) using the theano backend. So across the 7 layers for each pixel, it's one-hot encoded. In this case, is the correct error function to use categorical cross-entropy? It seems that way to me but the network seems to learn better with binary cross-entropy loss. Can someone shed some light on why that would be and what the principled objective is?
Binary cross-entropy loss should be used with a sigmoid activation in the last layer, and it severely penalizes opposite predictions. It does not take into account that the output is one-hot encoded and that the sum of the predictions should be 1. But since mis-predictions are severely penalized, the model somewhat learns to classify properly.
Now, to enforce the one-hot prior, use a softmax activation with categorical cross-entropy. This is what you should use.
The problem is applying the softmax in your case, since Keras doesn't support a softmax on each pixel.
The easiest way to go about it is to permute the dimensions to (n_rows, n_cols, 7) using a Permute layer and then reshape to (n_rows*n_cols, 7) using a Reshape layer. Then you can add the softmax activation layer and use cross-entropy loss. The data should also be reshaped accordingly.
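A rough sketch of that Permute + Reshape route (channels-first / Theano ordering; n_rows and n_cols are placeholders), assuming the network currently ends with a (7, n_rows, n_cols) output:

model.add(Permute((2, 3, 1)))                 # (7, n_rows, n_cols) -> (n_rows, n_cols, 7)
model.add(Reshape((n_rows * n_cols, 7)))      # -> (n_rows*n_cols, 7)
model.add(Activation('softmax'))              # softmax over the 7 classes for each pixel
model.compile(loss='categorical_crossentropy', optimizer='adam')
# The labels must be reshaped the same way, to (n_rows*n_cols, 7) per sample.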
The other way of doing it is to implement a depth-softmax:
def depth_softmax(matrix):
    sigmoid = lambda x: 1 / (1 + K.exp(-x))
    sigmoided_matrix = sigmoid(matrix)
    softmax_matrix = sigmoided_matrix / K.sum(sigmoided_matrix, axis=0)
    return softmax_matrix
and use it as a lambda layer:
model.add(Deconvolution2D(7, 1, 1, border_mode='same', output_shape=(7, n_rows, n_cols)))
model.add(Permute((2, 3, 1)))
model.add(BatchNormalization())
model.add(Lambda(depth_softmax))
If the tf image_dim_ordering is used then you can do away with the Permute layers.
For more reference check here.
I tested the solution of @indraforyou and think that the proposed method has some mistakes. As the comment section does not allow for proper code segments, here is what I think would be the fixed version:
def depth_softmax(matrix):
    from keras import backend as K
    exp_matrix = K.exp(matrix)
    softmax_matrix = exp_matrix / K.expand_dims(K.sum(exp_matrix, axis=-1), axis=-1)
    return softmax_matrix
This method will expect the ordering of the matrix to be (height, width, channels).
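For example (hypothetical usage mirroring the earlier snippet, with the channels moved last before the softmax):

model.add(Deconvolution2D(7, 1, 1, border_mode='same', output_shape=(7, n_rows, n_cols)))
model.add(Permute((2, 3, 1)))        # -> (n_rows, n_cols, 7), i.e. channels last
model.add(Lambda(depth_softmax))     # softmax over the channel (class) axis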

LSTM vs. Hidden Layer Training in Tensorflow

I am messing around with LSTMs and have a conceptual question. I created a matrix of bogus data based on the following rules:
For each 1-D list in the matrix:
If previous element is less than 10, then this next element is the previous one plus 1.
Else, this element is sin(previous element)
This way, the sequence depends quite simply on the previous element. I set up an LSTM to learn the recurrence and trained it on the lists one at a time. I have an LSTM layer followed by a fully connected feed-forward layer. It learns the +1 step very easily, but has trouble with the sin step: it will seemingly pick a random number between -1 and 1 when producing the next element whenever the previous one was greater than 10. My question is this: is the training only modifying the variables in my fully connected feed-forward layer? Is that why it can't learn the non-linear sin function?
Here's the code snippet in question:
lstm = rnn_cell.LSTMCell(lstmSize)
y_ = tf.placeholder(tf.float32, [None, OS])
outputs, state = rnn.rnn(lstm, x, dtype=tf.float32)
outputs = tf.transpose(outputs, [1, 0, 2])
last = tf.gather(outputs, int(outputs.get_shape()[0]) - 1)
weights = tf.Variable(tf.truncated_normal([lstmSize, OS]))
bias = tf.Variable(tf.constant(0.1, shape=[OS]))
y = tf.nn.elu(tf.matmul(last, weights) + bias)
error = tf.reduce_mean(tf.square(tf.sub(y_, y)))
train_step = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(error)
The error computation and shape handling seem to be correct, at least in the sense that it learns the +1 step quickly without crashing. Shouldn't the LSTM be able to handle the non-linear sin function? It seems almost trivially easy, so my guess is that I set something up wrong and the LSTM isn't learning anything.
