I just implemented the generalised dice loss (multi-class version of dice loss) in keras, as described in ref :
(my targets are defined as: (batch_size, image_dim1, image_dim2, image_dim3, nb_of_classes))
def generalized_dice_loss_w(y_true, y_pred):
    # Compute weights: "the contribution of each label is corrected by the inverse of its volume"
    Ncl = y_pred.shape[-1]
    w = np.zeros((Ncl,))
    for l in range(0,Ncl): w[l] = np.sum( np.asarray(y_true[:,:,:,:,l]==1,np.int8) )
    w = 1/(w**2+0.00001)
    # Compute gen dice coef:
    numerator = y_true*y_pred
    numerator = w*K.sum(numerator,(0,1,2,3))
    numerator = K.sum(numerator)
    denominator = y_true+y_pred
    denominator = w*K.sum(denominator,(0,1,2,3))
    denominator = K.sum(denominator)
    gen_dice_coef = numerator/denominator
    return 1-2*gen_dice_coef
But something must be wrong. I'm working with 3D images that I have to segment into 4 classes (1 background class and 3 object classes; the dataset is imbalanced). First odd thing: while my train loss and accuracy improve during training (and converge really fast), my validation loss and accuracy stay constant through the epochs (see image). Second, when predicting on test data, only the background class is predicted: I get a constant volume.
I used the exact same data and script but with categorical cross-entropy loss and got plausible results (object classes are segmented). That means something is wrong with my implementation. Any idea what it could be?
Plus, I believe it would be useful to the Keras community to have a generalised dice loss implementation, as it seems to be used in most recent semantic segmentation tasks (at least in the medical imaging community).
PS: the way the weights are defined seems odd to me; I get values around 10^-10. Has anyone else tried to implement this? I also tested my function without the weights but got the same problems.
I think the problem here is your weights. Imagine you are trying to solve a multiclass segmentation problem, but in each image only a few classes are ever present. A toy example of this (and the one which led me to this problem) is to create a segmentation dataset from MNIST in the following way.
x = 28x28 image and y = 28x28x11 where each pixel is classified as background if it is below a normalised grayscale value of 0.4, and otherwise is classified as the digit which is the original class of x. So if you see a picture of the number one, you will have a bunch of pixels classified as one, and the background.
Now in this dataset you will only ever have two classes present in the image. This means that, following your dice loss, 9 of the weights will be
1./(0. + eps) = large
and so for every image we are strongly penalising all 9 non-present classes. An evidently strong local minimum the network wants to find in this situation is to predict everything as the background class.
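To make the size of that imbalance concrete, here is a minimal NumPy sketch of the weights from the question for one toy image. The pixel counts (700 background, 84 digit pixels, 9 absent classes) are made up for illustration; the 0.00001 constant is the one used in the question.

```python
import numpy as np

eps = 1e-5  # the constant added in the question's weight formula

# Hypothetical per-class pixel counts for one 28x28 image:
# background, the digit that is present, and nine absent digit classes.
counts = np.array([700.0, 84.0] + [0.0] * 9)

weights = 1.0 / (counts ** 2 + eps)

print(weights[:2])  # tiny weights for the two classes that are present
print(weights[2])   # very large weight for every absent class

# How much more an absent class weighs than the digit that is actually there.
ratio = weights[2] / weights[1]
print(ratio)
```

With these numbers an absent class outweighs the present digit by many orders of magnitude, so any probability mass on an absent class dominates the denominator of the loss.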
We do want to penalise any incorrectly predicted classes which are not in the image, but not so strongly. So we just need to modify the weights. This is how I did it:
def gen_dice(y_true, y_pred, eps=1e-6):
    """both tensors are [b, h, w, classes] and y_pred is in logit form"""
    # [b, h, w, classes]
    pred_tensor = tf.nn.softmax(y_pred)
    y_true_shape = tf.shape(y_true)
    # [b, h*w, classes]
    y_true = tf.reshape(y_true, [-1, y_true_shape[1]*y_true_shape[2], y_true_shape[3]])
    y_pred = tf.reshape(pred_tensor, [-1, y_true_shape[1]*y_true_shape[2], y_true_shape[3]])
    # [b, classes]
    # count how many of each class are present in
    # each image, if there are zero, then assign
    # them a fixed weight of eps
    counts = tf.reduce_sum(y_true, axis=1)
    weights = 1. / (counts ** 2)
    weights = tf.where(tf.math.is_finite(weights), weights, eps)
    multed = tf.reduce_sum(y_true * y_pred, axis=1)
    summed = tf.reduce_sum(y_true + y_pred, axis=1)
    # [b]
    numerators = tf.reduce_sum(weights*multed, axis=-1)
    denom = tf.reduce_sum(weights*summed, axis=-1)
    dices = 1. - 2. * numerators / denom
    dices = tf.where(tf.math.is_finite(dices), dices, tf.zeros_like(dices))
    return tf.reduce_mean(dices)
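As a sanity check, the same computation can be sketched in plain NumPy and run on a tiny example. This is only a rough port under some assumptions: it takes probabilities instead of logits, the name gen_dice_np and the 4x4 toy image are made up, and it skips the final non-finite guard.

```python
import numpy as np

def gen_dice_np(y_true, probs, eps=1e-6):
    """NumPy port of gen_dice; y_true and probs are [b, h, w, classes], probs already softmaxed."""
    b, h, w, c = y_true.shape
    y_true = y_true.reshape(b, h * w, c)
    probs = probs.reshape(b, h * w, c)
    counts = y_true.sum(axis=1)  # [b, classes]
    # Classes absent from an image get the small fixed weight eps.
    weights = np.where(counts > 0, 1.0 / np.maximum(counts, 1.0) ** 2, eps)
    multed = (y_true * probs).sum(axis=1)
    summed = (y_true + probs).sum(axis=1)
    numerators = (weights * multed).sum(axis=-1)
    denom = (weights * summed).sum(axis=-1)
    return float(np.mean(1.0 - 2.0 * numerators / denom))

# Toy 4x4 image with 3 classes: 12 background pixels, 4 pixels of class 1, class 2 absent.
labels = np.zeros((4, 4), dtype=int)
labels[1:3, 1:3] = 1
y_true = np.eye(3)[labels][None]  # [1, 4, 4, 3]

all_background = np.zeros_like(y_true)
all_background[..., 0] = 1.0

loss_perfect = gen_dice_np(y_true, y_true)
loss_background = gen_dice_np(y_true, all_background)
print(loss_perfect, loss_background)
```

A perfect prediction gives a loss of 0, while predicting background everywhere is clearly penalised, which is the behaviour we want from the modified weights.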
Related
I'm currently training a VAE model.
The images in question are microstructure rock images (like these).
I defined a compound loss function as the sum of two terms:
MSE, as my images are grayscale but non-binary.
KL divergence.
I was getting NaN values for the loss, but figured out that a way around this is to use a weighted sum of the two losses. I chose to weight the MSE by the image size (256x256), so it becomes:
MSE = MSE x 256 x 256
and the KL divergence by a factor of 0.1.
The NaN problem was then solved, but my model now predicts a single value for the whole image: the output is an array of 256*256 values that are all the same, e.g. 0.502.
Model specs:
10 layers encoder / decoder
Latent vector space of dimension 5
SGD optimizer at lr=0.001
Loss values during training go from around a billion to about 3000 by the 2nd epoch and then fluctuate around that value
Accuracy during training or validation is below 0.001; I've read this metric is irrelevant anyway when it comes to VAEs
Here is how I sample from the latent vector:
sample = Lambda(get_sample_from_dist, output_shape=(latent_dim, ), name='sample')([mu, log_sigma])
def get_sample_from_dist(args):
    mean_vec, std_dev_vec = args
    eta_vec = K.random_normal(shape=(K.shape(mean_vec)[0], K.int_shape(mean_vec)[1]), mean=0, stddev=1)
    return mean_vec + K.exp(std_dev_vec) * eta_vec
and here is how the encoder generates mu and log_sigma:
# x is the output of the last encoder layer
mu = Dense(latent_dim, name='latent_mu')(x)
log_sigma = Dense(latent_dim, name='latent_sigma')(x)
and here is my loss
def vae_loss_func(inputs, outputs, mu, log_sigma):
    x1 = K.flatten(inputs)
    x2 = K.flatten(outputs)
    reconstruction_loss = losses.mse(x1, x2)*256**2
    kl_loss = -0.5* 0.1*K.sum(1 + log_sigma - K.square(mu) - K.square(K.exp(log_sigma)), axis=-1)
    vae_loss = K.mean(reconstruction_loss + kl_loss)
    return vae_loss
Any thoughts where things are going wrong?
I tried different weighting factors in the loss function and using strides and dropout layers; none of these worked. I'm expecting the generated image to vary in pixel value and eventually capture the rock structure.
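One detail that may be worth checking in the loss above is the KL term. The sketch below compares the posted expression (without the 0.1 factor) to the closed-form KL divergence between N(mu, sigma^2) and N(0, 1). It assumes log_sigma is the log of the standard deviation, which is what the sampling code K.exp(std_dev_vec) suggests; the function names are made up, and this is not claimed to be the cause of the constant output.

```python
import numpy as np

def kl_from_log_std(mu, log_sigma):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)) summed over latent dims, with log_sigma = log(std)."""
    return -0.5 * np.sum(1.0 + 2.0 * log_sigma - mu ** 2 - np.exp(2.0 * log_sigma), axis=-1)

def kl_as_posted(mu, log_sigma):
    """The expression used in vae_loss_func, without the 0.1 weighting."""
    return -0.5 * np.sum(1.0 + log_sigma - mu ** 2 - np.exp(log_sigma) ** 2, axis=-1)

mu = np.zeros((1, 5))  # latent dimension 5, as in the model specs
for value in (0.0, -0.35, -1.0):
    log_sigma = np.full((1, 5), value)
    print(value, kl_from_log_std(mu, log_sigma)[0], kl_as_posted(mu, log_sigma)[0])
```

Under that assumption the two expressions agree at log_sigma = 0 but differ elsewhere, and the posted one can drop below zero, which a KL divergence never does. If log_sigma were instead meant as a log-variance, the sampling line would be the one to revisit.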
It makes intuitive sense to me that the label's dimension should be the same as the neural network's last layer's dimension. However, with some experiments using PyTorch, it turns out that it somehow works.
Code:
import torch
import torch.nn as nn
X = torch.tensor([[1],[2],[3],[4]], dtype=torch.float32) # training input
Y = torch.tensor([[2],[4],[6],[8]], dtype=torch.float32) # training label
model = nn.Linear(1,3)
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
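for epoch in range(10):
    y_pred = model(X)                  # forward pass, shape (4, 3)
    loss = nn.MSELoss()(y_pred, Y)     # prediction first, target second
    loss.backward()
    optimizer.step()                   # update with the freshly computed gradients
    optimizer.zero_grad()              # then clear them for the next iteration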
In the above code, model = nn.Linear(1,3) is used instead of model = nn.Linear(1,1). As a result, while Y.shape is (4,1), y_pred.shape is (4,3).
The code works with a warning saying that "Using a target size that is different to the input size will likely lead to incorrect results due to broadcasting. "
I got the following output when I executed model(torch.tensor([10], dtype=torch.float32)):
tensor([20.0089, 19.6121, 19.1967], grad_fn=<AddBackward0>)
All three outputs seem correct to me. But how is the loss calculated if the sizes of the data are different?
Should we in any case use a target size that is different to the input size? Is there a benefit for this?
Assuming you are working with batch_size=4, you are using a target with 1 component vs 3 for your predicted tensor. You don't actually see the intermediate results when computing the loss with nn.MSELoss; using the reduction='none' option will allow you to do so:
>>> criterion = nn.MSELoss(reduction='none')
>>> y = torch.rand(2,1)
>>> y_hat = torch.rand(2,3)
>>> criterion(y_hat, y).shape
torch.Size([2, 3])
Considering this, you can conclude that the target y, being too small, has been broadcast to the shape of the predicted tensor y_hat. Essentially, in your example, you will get the same result (without the warning) as:
>>> y_repeat = y.repeat(1, 3)
>>> criterion(y_hat, y_repeat)
This means that, for each element of the batch, you are L2-optimizing all three components of the prediction against a single target value: MSE(y_hat[0,0], y[0]), MSE(y_hat[0,1], y[0]), and MSE(y_hat[0,2], y[0]), and likewise for the remaining rows of y.
The warning is there to make sure you're conscious of this broadcast operation. If this is what you want, you should broadcast the target tensor yourself. Otherwise, it doesn't make sense to use a target whose size differs from the input's.
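The broadcast itself can be reproduced with plain NumPy, which follows the same broadcasting rules. This is a stand-in sketch rather than the PyTorch internals, and the prediction values are made up.

```python
import numpy as np

y = np.array([[2.0], [4.0], [6.0], [8.0]])           # targets, shape (4, 1)
y_hat = np.array([[2.1, 1.9, 2.0],
                  [4.2, 3.8, 4.0],
                  [5.9, 6.1, 6.0],
                  [8.0, 8.3, 7.7]])                  # predictions, shape (4, 3)

per_element = (y_hat - y) ** 2                       # y is broadcast to (4, 3)
explicit = (y_hat - np.repeat(y, 3, axis=1)) ** 2    # the same thing, spelled out

print(per_element.shape)
print(np.allclose(per_element, explicit))
print(per_element.mean())                            # what a 'mean' reduction would return
```

Every column of the prediction is pulled towards the same target, which is why all three outputs in the question end up close to 20.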
When we train neural networks, we typically use gradient descent, which relies on a continuous, differentiable real-valued cost function. The final cost function might, for example, take the mean squared error. Or put another way, gradient descent implicitly assumes the end goal is regression - to minimize a real-valued error measure.
Sometimes what we want a neural network to do is perform classification - given an input, classify it into two or more discrete categories. In this case, the end goal the user cares about is classification accuracy - the percentage of cases classified correctly.
But when we are using a neural network for classification, though our goal is classification accuracy, that is not what the neural network is trying to optimize. The neural network is still trying to optimize the real-valued cost function. Sometimes these point in the same direction, but sometimes they don't. In particular, I've been running into cases where a neural network trained to correctly minimize the cost function, has a classification accuracy worse than a simple hand-coded threshold comparison.
I've boiled this down to a minimal test case using TensorFlow. It sets up a perceptron (neural network with no hidden layers), trains it on an absolutely minimal dataset (one input variable, one binary output variable) assesses the classification accuracy of the result, then compares it to the classification accuracy of a simple hand-coded threshold comparison; the results are 60% and 80% respectively. Intuitively, this is because a single outlier with a large input value, generates a correspondingly large output value, so the way to minimize the cost function is to try extra hard to accommodate that one case, in the process misclassifying two more ordinary cases. The perceptron is correctly doing what it was told to do; it's just that this does not match what we actually want of a classifier. But the classification accuracy is not a continuous differentiable function, so we can't use it as the target for gradient descent.
How can we train a neural network so that it ends up maximizing classification accuracy?
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
pred = tf.tensordot(X, W, 1) + b
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
    # Print update at successive doublings of time
    if epoch&(epoch-1) == 0 or epoch == epochs-1:
        print('{} {} {} {}'.format(
            epoch,
            cost.eval({X: train_X, Y: train_Y}),
            W.eval(),
            b.eval(),
        ))
    optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
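The same gap can be reproduced without TensorFlow by solving the least-squares problem in closed form, since that is the minimum gradient descent is heading for with this cost. A minimal NumPy sketch on the same five data points (variable names are illustrative):

```python
import numpy as np

train_x = np.array([0.0, 0.0, 2.0, 2.0, 9.0])
train_y = np.array([0, 0, 1, 1, 0])

# Least-squares fit of pred = w * x + b.
design = np.stack([train_x, np.ones_like(train_x)], axis=1)
(w, b), *_ = np.linalg.lstsq(design, train_y.astype(float), rcond=None)

# Same decision rules as in the script above.
linear_correct = int(np.sum((w * train_x + b > 0.5) == (train_y == 1)))
threshold_correct = int(np.sum((train_x > 1.0) == (train_y == 1)))

print(w, b)
print(linear_correct, threshold_correct)
```

The outlier at x = 9 drags the fitted line down so far that both x = 2 points fall below 0.5, giving 3 of 5 correct against 4 of 5 for the hand-coded threshold.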
How can we train a neural network so that it ends up maximizing classification accuracy?
I'm asking for a way to get a continuous proxy function that's closer to the accuracy
To start with, the loss function used today for classification tasks in (deep) neural nets was not invented with them, but it goes back several decades, and it actually comes from the early days of logistic regression. Here is the equation for the simple case of binary classification:
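Assuming the loss referred to here is the standard binary cross-entropy (log loss) of logistic regression, for m samples with labels y_i in {0, 1} and predicted probabilities p_i it reads

$$J = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]$$

and a minimal pure-Python sketch of it (the function name and example values are made up) is:

```python
import math

def binary_cross_entropy(y_true, p_pred):
    """Mean log loss; y_true holds 0/1 labels, p_pred holds probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [0, 0, 1, 1]
confident = binary_cross_entropy(labels, [0.1, 0.2, 0.9, 0.8])
hesitant = binary_cross_entropy(labels, [0.4, 0.6, 0.6, 0.4])
print(confident, hesitant)
```

Confident correct probabilities give a lower loss than hesitant or wrong ones, and the function is smooth in p, which is what makes it usable with gradient-based optimization.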
The idea behind it was exactly to come up with a continuous & differentiable function, so that we would be able to exploit the (vast, and still expanding) arsenal of convex optimization for classification problems.
It is safe to say that the above loss function is the best we have so far, given the desired mathematical constraints mentioned above.
Should we consider this problem (i.e. better approximating the accuracy) solved and finished? At least in principle, no. I am old enough to remember an era when the only activation functions practically available were tanh and sigmoid; then came ReLU and gave a real boost to the field. Similarly, someone may eventually come up with a better loss function, but arguably this is going to happen in a research paper, and not as an answer to a SO question...
That said, the very fact that the current loss function comes from very elementary considerations of probability and information theory (fields that, in sharp contrast with the current field of deep learning, stand upon firm theoretical foundations) creates at least some doubt as to if a better proposal for the loss may be just around the corner.
There is another subtle point on the relation between loss and accuracy, which makes the latter something qualitatively different than the former, and is frequently lost in such discussions. Let me elaborate a little...
All the classifiers related to this discussion (i.e. neural nets, logistic regression etc) are probabilistic ones; that is, they do not return hard class memberships (0/1) but class probabilities (continuous real numbers in [0, 1]).
Limiting the discussion for simplicity to the binary case, when converting a class probability to a (hard) class membership, we are implicitly involving a threshold, usually equal to 0.5, such that if p[i] > 0.5, then class[i] = "1". Now, we can find many cases where this naive default choice of threshold will not work (heavily imbalanced datasets are the first to come to mind), and we'll have to choose a different one. But the important point for our discussion here is that this threshold selection, while being of central importance to the accuracy, is completely external to the mathematical optimization problem of minimizing the loss, and serves as a further "insulation layer" between them, compromising the simplistic view that loss is just a proxy for accuracy (it is not). As nicely put in the answer of this Cross Validated thread:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
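A small sketch of that separation: the predicted probabilities below are fixed (so any loss computed from them is fixed too), yet the accuracy changes with the threshold chosen afterwards. The probabilities and labels are made-up values.

```python
def accuracy_at(probs, labels, threshold):
    """Fraction of samples classified correctly when p > threshold means class 1."""
    hits = sum((p > threshold) == (y == 1) for p, y in zip(probs, labels))
    return hits / len(labels)

# Hypothetical model outputs and true labels.
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.90]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(accuracy_at(probs, labels, 0.5))   # default threshold
print(accuracy_at(probs, labels, 0.33))  # a threshold chosen after training
```

Nothing about the model or its loss changed between the two calls; only the decision rule did.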
Enlarging somewhat an already broad discussion: Can we possibly move completely away from the (very) limiting constraint of mathematical optimization of continuous & differentiable functions? In other words, can we do away with back-propagation and gradient descent?
Well, we are actually doing so already, at least in the sub-field of reinforcement learning: 2017 was the year when new research from OpenAI on something called Evolution Strategies made headlines. And as an extra bonus, here is an ultra-fresh (Dec 2017) paper by Uber on the subject, again generating much enthusiasm in the community.
I think you are forgetting to pass your output through a sigmoid. Fixed below:
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
# CHANGE HERE: Remember, you need an activation function!
pred = tf.nn.sigmoid(tf.tensordot(X, W, 1) + b)
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
    # Print update at successive doublings of time
    if epoch&(epoch-1) == 0 or epoch == epochs-1:
        print('{} {} {} {}'.format(
            epoch,
            cost.eval({X: train_X, Y: train_Y}),
            W.eval(),
            b.eval(),
        ))
    optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
The output:
0 0.28319069743156433 [ 0.75648874] -0.9745011329650879
1 0.28302448987960815 [ 0.75775659] -0.9742625951766968
2 0.28285878896713257 [ 0.75902224] -0.9740257859230042
4 0.28252947330474854 [ 0.76154679] -0.97355717420578
8 0.28187844157218933 [ 0.76656926] -0.9726400971412659
16 0.28060704469680786 [ 0.77650583] -0.970885694026947
32 0.27818527817726135 [ 0.79593837] -0.9676888585090637
64 0.2738055884838104 [ 0.83302218] -0.9624817967414856
128 0.26666420698165894 [ 0.90031379] -0.9562843441963196
256 0.25691407918930054 [ 1.01172411] -0.9567816257476807
512 0.2461051195859909 [ 1.17413962] -0.9872989654541016
1024 0.23519910871982574 [ 1.38549554] -1.088881492614746
2048 0.2241383194923401 [ 1.64616168] -1.298340916633606
4096 0.21433120965957642 [ 1.95981205] -1.6126530170440674
8192 0.2075471431016922 [ 2.31746769] -1.989408016204834
9999 0.20618653297424316 [ 2.42539024] -2.1028473377227783
4/5 = perceptron accuracy
4/5 = threshold accuracy
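To see what the trained model does with each point, the final W and b from the last log line can be plugged into a sigmoid by hand. A minimal pure-Python sketch (only the two constants come from the log above; everything else is illustrative):

```python
import math

w = 2.42539024           # final W from the log
b = -2.1028473377227783  # final b from the log

train_x = [0.0, 0.0, 2.0, 2.0, 9.0]
train_y = [0, 0, 1, 1, 0]

preds = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in train_x]
correct = sum((p > 0.5) == (y == 1) for p, y in zip(preds, train_y))
cost = sum((p - y) ** 2 for p, y in zip(preds, train_y)) / len(train_y)

print([round(p, 3) for p in preds])
print(correct, cost)
```

Because the sigmoid saturates, the outlier at x = 9 can cost at most 1 in squared error, so the model simply gives up on that point and gets the other four right.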
I would like to code with Keras a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset where there are a few labeled images and a lot of unlabeled images: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have a single input and shared "encoding" convolutional layers, then split into two heads (fork-style), a classification head and a decoding head, so that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that as we have full control over the computational graph.
But with Keras, things are more high-level and I feel that all the calls to ".fit" must always provide all the data at once (so it would force me to tie together the classification head and the autoencoding head into one time-step).
One way in keras to almost do that would be with something that goes like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)
# both heads are listed as outputs, matching the two targets passed to fit
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and I fear that if I just want to fit things in that way it will fail and ask for the missing head:
for epoch in range(10):
    # classifications step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit" to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
Sometimes when you don't have a label you can pass a zero vector instead of a one-hot encoded vector. It should not change your result, because a zero vector produces no error signal with categorical cross-entropy loss.
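The reason is visible in the formula: categorical cross-entropy is -sum_k y_k * log(p_k), so every term is multiplied by a zero target. A minimal NumPy sketch with made-up probabilities:

```python
import numpy as np

def categorical_cross_entropy(y_true, probs):
    """Per-sample loss: -sum_k y_true[k] * log(probs[k])."""
    return -np.sum(y_true * np.log(probs), axis=-1)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.7, 0.2, 0.1]])
y_true = np.array([[0.0, 1.0, 0.0],    # labelled sample
                   [0.0, 0.0, 0.0]])   # unlabelled sample encoded as a zero vector

losses = categorical_cross_entropy(y_true, probs)
print(losses)
```

The unlabelled row contributes exactly zero loss (and therefore zero gradient), although it still counts towards the batch size when the loss is averaged.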
My custom to_categorical function looks like this:
def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
Here y contains the labels, and translator_dict is a Python dictionary which maps each label to its unique index, like this:
{'unisex':2, 'female': 1, 'male': 0}
If an UNK label can't be found in this dictionary, its encoded label will be a zero vector.
If you use this trick you also have to modify your accuracy function to see real accuracy numbers: you have to filter out all zero vectors from the metric.
def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0)) # zero vector mask
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
Note: you have to use larger batches (e.g. 32) in order to prevent a zero-matrix update, because it can make your accuracy metric go crazy; I don't know why.
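One possible explanation, offered as a guess rather than something verified against Keras: if a batch contains no labelled rows at all, the mask in tricky_accuracy removes everything, and the mean of an empty result is undefined. The NumPy sketch below mirrors the masking logic and makes that case explicit; masked_accuracy is a made-up name.

```python
import numpy as np

def masked_accuracy(y_true, y_pred):
    """Accuracy over rows whose target is not the all-zero 'unknown' vector."""
    mask = y_true.sum(axis=-1) != 0
    if not mask.any():
        return float("nan")  # no labelled rows in this batch
    hits = y_true[mask].argmax(axis=-1) == y_pred[mask].argmax(axis=-1)
    return float(hits.mean())

y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0],   # unlabelled row, ignored by the metric
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.6, 0.3, 0.1]])

mixed_batch = masked_accuracy(y_true, y_pred)
empty_batch = masked_accuracy(np.zeros((2, 3)), y_pred[:2])
print(mixed_batch, empty_batch)
```

Larger batches make an all-unlabelled batch much less likely, which would be consistent with the advice above.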
Alternative solution
Use Pseudo Labeling :)
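For reference, a rough sketch of the core step of pseudo-labelling: take the model's predictions on unlabelled data and keep only the confident ones as if they were true labels. The function name, the 0.9 confidence cut-off and the probabilities are all illustrative assumptions.

```python
import numpy as np

def pseudo_label(probs, num_classes, confidence=0.9):
    """Return (indices, one_hot_labels) for predictions at or above `confidence`."""
    keep = np.flatnonzero(probs.max(axis=-1) >= confidence)
    one_hot = np.eye(num_classes)[probs[keep].argmax(axis=-1)]
    return keep, one_hot

# Hypothetical model outputs on three unlabelled images.
unlabelled_probs = np.array([[0.95, 0.03, 0.02],
                             [0.40, 0.35, 0.25],
                             [0.05, 0.92, 0.03]])

keep, pseudo = pseudo_label(unlabelled_probs, num_classes=3)
print(keep)
print(pseudo)
```

The selected samples and their pseudo-labels would then be mixed into the labelled training set for further training rounds.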
You can train jointly; you have to pass an array of labels instead of a single label.
I used fit_generator, e.g.
model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) // batch_size,  # integer number of steps
    epochs=epochs)
def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        # one input batch, one label array per output head
        yield batch_x, [gender_label_batch, category_label_batch]
I tried to build a simple MLP with 2 hidden layers and 3 output classes.
What I have done in the model is:
Input images are 120x120 rgb images. Flattened size (3 * 120 * 120)
2 hidden layers of size 100.
Relu activation is used
Output layer has 3 neurons
Code
def model(input, weights, biases):
    l_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    l_1 = tf.nn.relu(l_1)
    l_2 = tf.add(tf.matmul(l_1, weights['h2']), biases['b2'])
    l_2 = tf.nn.relu(l_2)
    out = tf.matmul(l_2, weights['out']) + biases['out']
    return out
Optimizer
pred = model(input_batch, weights, biases)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
optimizer = tf.train.GradientDescentOptimizer(rate).minimize(cost)
The model however does not work. The accuracy is only equal to that of a random model.
The example followed is this one:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/multilayer_perceptron.py
You have a copy-paste typo in def model. The first argument is named input, but the next line uses x.
Another trick to use when you suspect that the model is not being trained is to run it on the same batch again and again. If the implementation is correct and the model is being trained, it will soon learn that batch by heart, yielding 100% accuracy. If it does not, that is an indicator that something is wrong in your implementation.
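The single-batch check can be sketched in a few lines. To keep it self-contained, the example below uses a tiny logistic regression in NumPy as a stand-in for the real network, with made-up data; the idea carries over unchanged to the MLP.

```python
import numpy as np

# One small, fixed batch (hypothetical data): the label equals the first feature.
batch_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
batch_y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)
b = 0.0
learning_rate = 1.0

# Train on the very same batch over and over.
for step in range(500):
    probs = 1.0 / (1.0 + np.exp(-(batch_x @ w + b)))
    grad = probs - batch_y  # gradient of the log loss w.r.t. the logits
    w -= learning_rate * batch_x.T @ grad / len(batch_y)
    b -= learning_rate * grad.mean()

probs = 1.0 / (1.0 + np.exp(-(batch_x @ w + b)))
accuracy = float(np.mean((probs > 0.5) == (batch_y == 1.0)))
print(accuracy)
```

If a correctly wired model cannot reach 100% on a batch it has seen hundreds of times, the problem is in the code (inputs, labels, loss or update step) rather than in the data or the architecture.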