I want to add a perceptual loss to the MSE loss in my objective function. I wrote the code below for this:
def custom_objective(y_true, y_pred):
    tosub = K.constant([103.939, 116.779, 123.68])
    y1 = vgg_model(y_pred * 255. - tosub)
    y2 = vgg_model(y_true * 255. - tosub)
    loss2 = K.mean(K.square(y2 - y1), axis=-1)
    loss1 = K.mean(K.square(y_pred - y_true), axis=-1)
    loss = loss1 + loss2
    return loss
The problem is that the shape of loss1 is (BatchSize, 224, 224), while the shape of loss2 is (BatchSize, 7, 7), so I get an error about incompatible shapes, which makes sense. How can I add these two properly? Should I flatten them first, and if so, how?
The loss function should always return a scalar (per sample in the batch or over the whole batch), since we want to minimize it (i.e. you can't minimize a vector, unless you define what you mean by "minimizing a vector"). Therefore, one simple way to reduce this to a scalar is to take the average across all the axes, except the batch axis which is averaged over internally:
loss2 = K.mean(K.square(y2 - y1), axis=[1,2,3])
loss1 = K.mean(K.square(y_pred - y_true), axis=[1,2,3])
loss = loss1 + loss2
Update: Let me clarify that it is OK if the loss function returns a vector or even an n-D array (actually the loss function above returns a vector of length batch_size), but keep in mind that at the end Keras takes the average of returned values and that's the real value of loss (which would be minimized).
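For completeness, here is a sketch of the full objective with both terms reduced to one scalar per sample, reusing the same vgg_model and the [0, 1]-scaled inputs from the question (this just combines the question's code with the fix above):
from keras import backend as K

def custom_objective(y_true, y_pred):
    tosub = K.constant([103.939, 116.779, 123.68])
    y1 = vgg_model(y_pred * 255. - tosub)   # VGG features of the prediction
    y2 = vgg_model(y_true * 255. - tosub)   # VGG features of the ground truth
    # average over all spatial/channel axes so both terms have shape (batch_size,)
    loss2 = K.mean(K.square(y2 - y1), axis=[1, 2, 3])
    loss1 = K.mean(K.square(y_pred - y_true), axis=[1, 2, 3])
    return loss1 + loss2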
Related
I am working on a multi-label classification problem. My ground-truth labels are of shape 14 x 10 x 128, where 14 is the batch_size, 10 is the sequence_length, and 128 is a vector with value 1 if the item in the sequence belongs to the object and 0 otherwise.
My output has the same shape: 14 x 10 x 128. Since my input sequences were of varying length, I had to pad them to a fixed length of 10. I'm trying to compute the loss of the model as follows:
total_loss = 0.0
unpadded_seq_lengths = [3, 4, 5, 7, 9, 3, 2, 8, 5, 3, 5, 7, 7, ...] # true lengths of sequences

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

for data in training_dataloader:
    optimizer.zero_grad()
    # shape of input 14 x 10 x 128
    output = model(data)
    batch_loss = 0.0
    for batch_idx, sequence in enumerate(output):
        # sequence shape is 10 x 128
        true_seq_len = unpadded_seq_lengths[batch_idx]
        # only keep unpadded gt and predicted labels since we don't want loss to be influenced by padded values
        predicted_labels = sequence[:true_seq_len, :] # for example, 3 x 128
        gt_labels = gt_labels_padded[batch_idx, :true_seq_len, :] # same shape as above, gt_labels_padded has shape 14 x 10 x 128
        # loop through unpadded predicted and gt labels and calculate loss
        for item_idx, predicted_labels_seq_item in enumerate(predicted_labels):
            # predicted_labels_seq_item and gt_labels_seq_item are 1D vectors of length 128
            gt_labels_seq_item = gt_labels[item_idx]
            current_loss = criterion(predicted_labels_seq_item, gt_labels_seq_item)
            total_loss += current_loss
            batch_loss += current_loss
    batch_loss.backward()
    optimizer.step()
Can anybody please check whether I'm calculating the loss correctly? Thanks.
Update:
Is this the correct approach for calculating accuracy metrics?
# batch size: 14
# seq length: 10
for epoch in range(10):
    TP = FP = TN = FN = 0.
    for x, y, mask in tr_dl:
        # mask shape: (10,)
        out = model(x) # out shape: (14, 10, 128)
        y_pred = (torch.sigmoid(out) >= 0.5).float().type(torch.int64) # consider all predictions above 0.5 as 1, rest 0
        y_pred = y_pred[mask] # y_pred shape: (14, 10, 10, 128)
        y_labels = y[mask] # y_labels shape: (14, 10, 10, 128)

        # do I flatten y_pred and y_labels?
        y_pred = y_pred.flatten()
        y_labels = y_labels.flatten()

        for idx, prediction in enumerate(y_pred):
            if prediction == 1 and y_labels[idx] == 1:
                # calculate IOU (overlap of prediction and gt bounding box)
                iou = 0.78 # assume we get this iou value for objects at idx
                if iou >= 0.5:
                    TP += 1
                else:
                    FP += 1
            elif prediction == 1 and y_labels[idx] == 0:
                FP += 1
            elif prediction == 0 and y_labels[idx] == 1:
                FN += 1
            else:
                TN += 1

    EPOCH_ACC = (TP + TN) / (TP + TN + FP + FN)
It is usually recommended to stick with batch-wise operations and avoid per-element processing inside the main training loop. One way to handle this case is to make your dataset return padded inputs and labels, along with a mask that will come in handy for the loss computation. In other words, to compute the loss term with sequences of varying sizes, we use a mask instead of taking individual slices.
Dataset
The way to proceed is to make sure you build the mask in the dataset and not in the inference loop. Here I am showing a minimal implementation that you should be able to transfer to your dataset without much hassle:
import random
import torch
from torch.utils import data

class Dataset(data.Dataset):
    def __init__(self):
        super().__init__()

    def __len__(self):
        return 100

    def __getitem__(self, index):
        i = random.randint(5, SEQ_LEN) # for demo purposes, generate x with a random length
        x = torch.rand(i, EMB_SIZE)
        y = torch.randint(0, 2, (i, N_CLASSES)).float() # multi-hot target: one binary value per class

        # pad data to fit in batch
        x_padded = torch.cat((torch.zeros(SEQ_LEN - i, EMB_SIZE), x))
        y_padded = torch.cat((torch.zeros(SEQ_LEN - i, N_CLASSES), y))

        # construct tensor to mask loss
        mask = torch.cat((torch.zeros(SEQ_LEN - i), torch.ones(i)))

        return x_padded, y_padded, mask
Essentially in the __getitem__, we not only pad the input x and target y with zero values, we also construct a simple mask containing the positions of the padded values in the currently processed element.
Notice how:
x_padded, shaped (SEQ_LEN, EMB_SIZE)
y_padded, shaped (SEQ_LEN, N_CLASSES)
mask, shaped (SEQ_LEN,)
are all three tensors which are shape invariant across the dataset, yet mask contains the padding information necessary for us to compute the loss function appropriately.
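As a quick usage sketch (the constants below are placeholder demo values; adjust them to your data, e.g. EMB_SIZE and N_CLASSES need not be equal):
SEQ_LEN, EMB_SIZE, N_CLASSES = 10, 128, 128   # placeholder demo values

ds = Dataset()
dl = data.DataLoader(ds, batch_size=14)
x, y, mask = next(iter(dl))
# x: (14, 10, 128), y: (14, 10, 128), mask: (14, 10)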
Inference
The loss you've used, nn.BCEWithLogitsLoss, is the correct one since it's a multi-dimensional loss used for binary classification. In other words, you can use it in this multi-label classification task, considering each one of the 128 logits as an individual binary prediction. Do not use nn.CrossEntropyLoss as suggested elsewhere, since its softmax pushes towards a single logit (i.e. a single class), which is the behaviour required for single-label classification tasks.
Therefore, in the training loop, we simply have to apply the mask to our loss.
criterion = nn.BCEWithLogitsLoss(reduction='none')  # keep per-element losses so the mask can be applied

for x, y, mask in dl:
    y_pred = model(x)                                # (batch, SEQ_LEN, N_CLASSES)
    loss = (mask.unsqueeze(-1) * criterion(y_pred, y)).mean()
    # optionally normalize by mask.sum() * N_CLASSES instead of a plain mean
    # backpropagation, loss postprocessing, logs, etc.
This is what you need for the first part of the question; there are already loss functions implemented in TensorFlow: https://medium.com/@aadityaura_26777/the-loss-function-for-multi-label-and-multi-class-f68f95cae525. Yours is just tf.nn.weighted_cross_entropy_with_logits, but you need to set the weight.
The second part of the question is not as straightforward because of the conditioning on the IOU. Generally, in machine learning you should rely heavily on matrix operations; in your case you probably need to pre-compute the IOU as a 0/1 vector and multiply it element-wise with y_pred, which gives you a modified y_pred. After that, you can use any available accuracy function to compute the final result.
If you can use nn.CrossEntropyLoss instead of BCEWithLogitsLoss, there is an argument called ignore_index that you can use to exclude your padded sequences. The difference between the two losses is the activation function used (softmax vs. sigmoid), but I think you can still use CrossEntropyLoss for binary classification as well.
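For reference, a minimal sketch of how ignore_index behaves, assuming class-index targets (one class id per timestep) rather than the 128-dim multi-hot vectors above; the shapes and values here are hypothetical:
import torch
import torch.nn as nn

PAD_IDX = -100                               # the default ignore_index of nn.CrossEntropyLoss
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

logits = torch.randn(14, 128, 10)            # (batch, num_classes, seq_len)
targets = torch.randint(0, 128, (14, 10))    # one class index per timestep
targets[:, 7:] = PAD_IDX                     # mark the padded timesteps
loss = criterion(logits, targets)            # padded positions contribute nothing to the loss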
I am trying to implement my own loss function for binary classification. To get started, I want to reproduce the exact behavior of the binary objective. In particular, I want that:
The loss of both functions have the same scale
The training and validation slope is similar
predict_proba(X) returns probabilities
None of this is the case for the code below:
import sklearn.datasets
import lightgbm as lgb
import numpy as np

X, y = sklearn.datasets.load_iris(return_X_y=True)
X, y = X[y <= 1], y[y <= 1]

def loglikelihood(labels, preds):
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess

model = lgb.LGBMClassifier(objective=loglikelihood) # or "binary"
model.fit(X, y, eval_set=[(X, y)], eval_metric="binary_logloss")
lgb.plot_metric(model.evals_result_)
With objective="binary" and with objective=loglikelihood I get different metric plots (not reproduced here); with loglikelihood the slope is not even smooth.
Moreover, sigmoid has to be applied to model.predict_proba(X) to get probabilities for loglikelihood (as I have figured out from https://github.com/Microsoft/LightGBM/issues/2136).
Is it possible to get the same behavior with a custom loss function? Does anybody understand where all these differences come from?
Looking at the output of model.predict_proba(X) in each case, we can see that the built-in binary_logloss model returns probabilities, while the custom model returns logits.
The built-in evaluation function takes probabilities as input. To fit the custom objective, we need a custom evaluation function which will take logits as input.
Here is how you could write this. I've changed the sigmoid calculation so that it doesn't overflow if logit is a large negative number.
def loglikelihood(labels, logits):
    # numerically stable sigmoid:
    preds = np.where(logits >= 0,
                     1. / (1. + np.exp(-logits)),
                     np.exp(logits) / (1. + np.exp(logits)))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess

def my_eval(labels, logits):
    # numerically stable log-sigmoid:
    logsigmoid = np.where(logits >= 0,
                          -np.log(1 + np.exp(-logits)),
                          logits - np.log(1 + np.exp(logits)))
    loss = (-logsigmoid + logits * (1 - labels)).mean()
    return "error", loss, False

model1 = lgb.LGBMClassifier(objective='binary')
model1.fit(X, y, eval_set=[(X, y)], eval_metric="binary_logloss")

model2 = lgb.LGBMClassifier(objective=loglikelihood)
model2.fit(X, y, eval_set=[(X, y)], eval_metric=my_eval)
Now the results are the same.
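Following the linked issue, here is a sketch of recovering probabilities from the custom-objective model by applying a sigmoid to its raw scores (scipy's expit is just a numerically stable sigmoid; the raw_score flag is assumed to return the untransformed logits for the custom objective):
from scipy.special import expit   # numerically stable sigmoid

probs_builtin = model1.predict_proba(X)[:, 1]      # built-in objective: already probabilities
raw_scores = model2.predict(X, raw_score=True)     # custom objective: raw logits
probs_custom = expit(raw_scores)                   # comparable probabilities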
I'm trying to define a pinball loss function for implementing 'quantile regression' in a neural network with Keras (with TensorFlow as backend).
The definition is here: pinball loss
It's hard to implement this with the usual K.mean() etc. functions, since they operate on the whole batch of y_pred, y_true, whereas I need to consider each component of y_pred, y_true individually. Here's my original code:
def pinball_1(y_true, y_pred):
    loss = 0.1
    with tf.Session() as sess:
        y_true = sess.run(y_true)
        y_pred = sess.run(y_pred)
    y_pin = np.zeros((len(y_true), 1))
    y_pin = tf.placeholder(tf.float32, [None, 1])
    for i in range(len(y_true)):
        if y_true[i] >= y_pred[i]:
            y_pin[i] = loss * (y_true[i] - y_pred[i])
        else:
            y_pin[i] = (1 - loss) * (y_pred[i] - y_true[i])
    pinball = tf.reduce_mean(y_pin, axis=-1)
    return K.mean(pinball, axis=-1)

sgd = SGD(lr=0.1, clipvalue=0.5)
model.compile(loss=pinball_1, optimizer=sgd)
model.fit(Train_X, Train_Y, nb_epoch=10, batch_size=20, verbose=2)
I attempted to convert y_pred and y_true into vectorized data structures so that I could index them and work with individual components, but the problem seems to come from my lack of knowledge about how to treat y_pred and y_true individually.
I tried to dig into the lines pointed to by the errors, but I almost got lost.
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'dense_16_target' with dtype float
[[Node: dense_16_target = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
How can I fix it? Thanks!
I’ve figured this out by myself with Keras backend:
def pinball(y_true, y_pred):
    global i
    tao = (i + 1) / 10
    pin = K.mean(K.maximum(y_true - y_pred, 0) * tao +
                 K.maximum(y_pred - y_true, 0) * (1 - tao))
    return pin
This is a more efficient version:
def pinball_loss(y_true, y_pred, tau):
    err = y_true - y_pred
    return K.mean(K.maximum(tau * err, (tau - 1) * err), axis=-1)
Using an additional parameter and the functools.partial function is IMHO the cleanest way of setting different values for tau:
model.compile(loss=functools.partial(pinball_loss, tau=0.1), optimizer=sgd)
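For example, to train one model per quantile (a sketch assuming a hypothetical build_model() helper that returns a fresh, uncompiled Keras model):
import functools

quantile_models = {}
for tau in (0.1, 0.5, 0.9):
    m = build_model()  # hypothetical helper returning an uncompiled Keras model
    m.compile(loss=functools.partial(pinball_loss, tau=tau), optimizer='sgd')
    quantile_models[tau] = m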
I have a TensorFlow model with some of these characteristics:
state_size = 800,
num_classes = 14313,
batch_size = 10,
num_steps = 16, # width of the tensor
num_layers = 3
x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')
rnn_inputs = [tf.squeeze(i, squeeze_dims=[1]) for i in
tf.split(x_one_hot, num_steps, 1)] # still a list of tensors (batch_size, num_classes)
...
logits = tf.matmul(rnn_outputs, W) + b
predictions = tf.nn.softmax(logits)
Now I want to feed it an np.array (shape = batch_size x num_steps, so 10 x 16) and I get a predictions tensor back.
Weirdly, its shape is 160 x 14313. The latter is the number of classes, but where does the 160 come from? I don't understand it. I would like a probability for each of my classes, for each element of the batch (which is 10). How did num_steps become involved, and how do I read from this predictions tensor which element is expected after those 16 steps?
In this case the 160 comes from the shape, as you suspected: 160 = batch_size × num_steps = 10 × 16. Each batch of 10 has 16 timesteps, and these two dimensions are flattened together when the RNN outputs are reshaped for the matmul.
At this point you have logits of shape 160 x num_classes, so predictions[i] gives the probability of each class for one (batch element, timestep) pair.
That is why, to get the chosen class, you would do something like tf.argmax(predictions, 1); this gives a tensor of shape 160 in your case, i.e. the predicted class for each timestep of each batch element.
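A sketch of getting back to a per-batch-element view, assuming predictions_np is the evaluated predictions array (hypothetical name) and the 160 rows are ordered batch-major, i.e. row = batch_idx * num_steps + step_idx (if your RNN outputs were stacked time-major, swap the first two axes after reshaping):
probs = predictions_np.reshape(batch_size, num_steps, num_classes)  # (10, 16, 14313)
last_step = probs[:, -1, :]                 # class distribution after the 16 steps, per batch element
predicted_ids = last_step.argmax(axis=-1)   # one predicted class id per batch element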
In order to get probabilities from the logits, you could apply a sigmoid:
def prob(logit):
    return 1 / (1 + np.exp(-logit))
I am trying to implement a custom Keras objective function:
in 'Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression', Narihira et al.
This is the sum of equations (4) and (6) from the paper (the equation image is not reproduced here). Y* is the ground truth, Y a prediction map, and y = Y* - Y.
This is my code:
def custom_objective(y_true, y_pred):
    # Eq. (4) Scale invariant L2 loss
    y = y_true - y_pred
    h = 0.5 # lambda
    term1 = K.mean(K.sum(K.square(y)))
    term2 = K.square(K.mean(K.sum(y)))
    sca = term1 - h*term2
    # Eq. (6) Gradient L2 loss
    gra = K.mean(K.sum((K.square(K.gradients(K.sum(y[:,1]), y)) + K.square(K.gradients(K.sum(y[1,:]), y)))))
    return (sca + gra)
However, I suspect that the equation (6) is not correctly implemented because the results are not good. Am I computing this right?
Thank you!
Edit:
I am trying to approximate (6) convolving with Prewitt filters. It works when my input is a chunk of images i.e. y[batch_size, channels, row, cols], but not with y_true and y_pred (which are of type TensorType(float32, 4D)).
My code:
def cconv(image, g_kernel, batch_size):
    g_kernel = theano.shared(g_kernel)
    M = T.dtensor3()
    conv = theano.function(
        inputs=[M],
        outputs=conv2d(M, g_kernel, border_mode='full'),
    )
    accum = 0
    for curr_batch in range(batch_size):
        accum = accum + conv(image[curr_batch])
    return accum/batch_size

def gradient_loss(y_true, y_pred):
    y = y_true - y_pred
    batch_size = 40
    # Direction i
    pw_x = np.array([[-1,0,1],[-1,0,1],[-1,0,1]]).astype(np.float64)
    g_x = cconv(y, pw_x, batch_size)
    # Direction j
    pw_y = np.array([[-1,-1,-1],[0,0,0],[1,1,1]]).astype(np.float64)
    g_y = cconv(y, pw_y, batch_size)
    gra_l2_loss = K.mean(K.square(g_x) + K.square(g_y))
    return (gra_l2_loss)
The crash is produced in:
accum = accum + conv(image[curr_batch])
...and the error message is the following:
*** TypeError: ('Bad input argument to theano function with name "custom_models.py:836" at index 0 (0-based)', 'Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?')
How can I use y (y_true - y_pred) as a numpy array, or how can I solve this issue?
SIL2 (scale-invariant L2 loss, Eq. 4)
term1 = K.mean(K.square(y))
term2 = K.square(K.mean(y))
[...]
One mistake repeated across the code: when you see (1/n) * sum(...) in the equations, it is a mean, not the mean of a sum.
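Putting Eq. (4) together with the corrected means (a sketch, keeping the question's lambda h = 0.5):
def scale_invariant_l2(y_true, y_pred, h=0.5):
    y = y_true - y_pred
    term1 = K.mean(K.square(y))    # (1/n) * sum(y^2)
    term2 = K.square(K.mean(y))    # ((1/n) * sum(y))^2
    return term1 - h * term2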
Gradient
After reading your comment and giving it more thought, I think there is a confusion about the gradient. At least I got confused.
There are two ways of interpreting the gradient symbol:
The gradient of a vector, where y is differentiated with respect to the parameters of your model (usually the weights of the neural net). In previous edits I started to write in this direction, because that's the sort of approach used to train the model (e.g. gradient descent). But I think I was wrong.
The pixel intensity gradient in a picture, as you mentioned in your comment. The diff of each pixel with its neighbor in each direction. In which case I guess you have to translate the example you gave into Keras.
To sum up, K.gradients() and numpy.gradient() are not used in the same way: numpy implicitly treats (i, j) (the row and column indices) as the two input variables, whereas when you feed a 2D image to a neural net, every single pixel is an input variable. Hope I'm clear.
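As a Keras-only sketch of the second interpretation (the pixel-intensity gradient of Eq. (6)), you could keep everything symbolic with K.conv2d instead of a compiled Theano function, so no numpy array is ever needed. This assumes channels_last 4D tensors with a single channel; for multi-channel maps you would tile the kernels along the channel axes:
import numpy as np
from keras import backend as K

def gradient_l2_loss(y_true, y_pred):
    y = y_true - y_pred
    # Prewitt kernels, shaped (rows, cols, in_channels, out_channels) for K.conv2d
    pw_x = K.constant(np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype='float32').reshape(3, 3, 1, 1))
    pw_y = K.constant(np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype='float32').reshape(3, 3, 1, 1))
    g_x = K.conv2d(y, pw_x, padding='same', data_format='channels_last')  # horizontal intensity gradient
    g_y = K.conv2d(y, pw_y, padding='same', data_format='channels_last')  # vertical intensity gradient
    return K.mean(K.square(g_x) + K.square(g_y))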