Can I use my own cost function in keras? - machine-learning

When compiling a model, you pass a parameter loss into the compile function. For instance:
model.compile(loss='mean_squared_error', optimizer='adam')
But I was curious if there is a way in Keras to pass in my own cost function?

Yes, you can. A custom loss can be implemented as a function that takes two tensors, the ground truth and the predicted y (in that order), and returns the loss (a scalar, or one value per sample). The math inside the function needs to be expressed with TensorFlow (or Keras backend) operations so that the model can backpropagate gradients through it. If you need your function to accept more inputs than just y_true and y_pred, you can wrap your custom loss in an outer function that takes the extra arguments and returns a function that needs only y_true and y_pred. Two examples follow.
Mixed loss between binary crossentropy and mse
from keras.losses import mean_squared_error, binary_crossentropy

def my_custom_loss(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    crossentropy = binary_crossentropy(y_true, y_pred)
    return mse + crossentropy
Weighted mixture (wrapped)
def my_custom_loss_wrapper(mse_weight, xentropy_weight):
    def my_custom_loss(y_true, y_pred):
        mse = mean_squared_error(y_true, y_pred)
        crossentropy = binary_crossentropy(y_true, y_pred)
        return mse_weight * mse + xentropy_weight * crossentropy
    return my_custom_loss
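To make the wrapper pattern concrete, here is a minimal sketch of how such a loss might be passed to compile. It assumes the Keras that ships with TensorFlow; the function name weighted_mse_xent, the toy model, and the random data are all made up for illustration.

```python
import numpy as np
from tensorflow import keras

def weighted_mse_xent(mse_weight, xentropy_weight):
    """Hypothetical wrapper: returns a loss that only needs (y_true, y_pred)."""
    def loss(y_true, y_pred):
        mse = keras.losses.mean_squared_error(y_true, y_pred)
        xent = keras.losses.binary_crossentropy(y_true, y_pred)
        return mse_weight * mse + xentropy_weight * xent
    return loss

# Toy binary classifier, only there to show where the loss plugs in
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
# Call the wrapper: compile receives the inner function, not the wrapper itself
model.compile(loss=weighted_mse_xent(0.3, 0.7), optimizer="adam")

# Made-up data: label is 1 when the features sum to more than 2
x = np.random.rand(32, 4).astype("float32")
y = (x.sum(axis=1, keepdims=True) > 2.0).astype("float32")
history = model.fit(x, y, epochs=2, batch_size=8, verbose=0)
```

The point to notice is that the wrapper is called when compiling, so the weights are fixed at that moment and Keras only ever sees a two-argument function.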

Related

How do loss functions know for which model to compute gradients in PyTorch?

I am unsure how PyTorch manages to link the loss function to the model I want it to be computed for. There is never an explicit reference between the loss and the model, such as the one between the model's parameters and the optimizer.
Say for example I want to train 2 networks on the same dataset, so I want to utilize a single pass through the dataset. How would PyTorch link the appropriate loss functions to the appropriate models? Here's code for reference:
import torch
from torch import nn, optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import shap
# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
transforms.Normalize((0.5,), (0.5,)),
])
# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
model = nn.Sequential(nn.Linear(784, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 10),
nn.LogSoftmax(dim=1))
model2 = nn.Sequential(nn.Linear(784, 128),
nn.ReLU(),
nn.Linear(128, 10),
nn.LogSoftmax(dim=1))
# Define the loss
criterion = nn.NLLLoss()
criterion2 = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.003)
optimizer2 = optim.SGD(model2.parameters(), lr=0.003)
epochs = 5
for e in range(epochs):
    running_loss = 0
    running_loss_2 = 0
    for images, labels in trainloader:
        # Flatten MNIST images into a 784 long vector
        images = images.view(images.shape[0], -1) # batch_size x total_pixels
        # Training pass
        optimizer.zero_grad()
        optimizer2.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        output2 = model2(images)
        loss2 = criterion2(output2, labels)
        loss2.backward()
        optimizer2.step()
        running_loss += loss.item()
        running_loss_2 += loss2.item()
    print(f"Training loss 1: {running_loss/len(trainloader)}")
    print(f"Training loss 2: {running_loss_2/len(trainloader)}")
    print()
So once again, how does pytorch know how to compute the appropriate gradients for the appropriate models when loss.backward() and loss2.backward() are called?
Whenever you perform forward operations using one of your model parameters (or any torch.Tensor that has requires_grad == True), PyTorch builds a computational graph. When you operate on descendants in this graph, the graph is extended. In your case, you have an nn.Module called model which has some trainable model.parameters(), so PyTorch builds a graph from your model.parameters() all the way to the loss as you perform the forward operations. The graph is then traversed in reverse during the backward pass to propagate the gradients back to the parameters. For loss in your code above, the graph is something like
model.parameters() --> [intermediate variables in model] --> output --> loss
                                       ^                                 ^
                                       |                                 |
                                    images                             labels
When you call loss.backward(), PyTorch traverses this graph in reverse to reach all trainable parameters (only the model.parameters() in this case) and accumulates the gradient into param.grad for each of them. The optimizer then relies on this information, gathered during the backward pass, to update the parameters.
For loss2 the story is similar.
The official pytorch tutorials are a good resource for more in-depth information on this.
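A small sketch may make the graph argument easier to check for yourself. The two linear models, the random data, and the variable names below are invented for illustration; the only claim being demonstrated is that a backward call fills in gradients for the parameters that took part in producing that particular loss.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two independent toy models (hypothetical stand-ins for model and model2)
model_a = nn.Linear(4, 2)
model_b = nn.Linear(4, 2)

# A single criterion object is shared: it holds no reference to either model
criterion = nn.MSELoss()

x = torch.randn(8, 4)
target = torch.randn(8, 2)

# Each loss tensor carries its own graph back to the parameters that produced it
loss_a = criterion(model_a(x), target)
loss_b = criterion(model_b(x), target)

loss_a.backward()
# Only model_a's parameters were reached by this backward pass
grads_a_set = all(p.grad is not None for p in model_a.parameters())
grads_b_unset = all(p.grad is None for p in model_b.parameters())
print(grads_a_set, grads_b_unset)
print(loss_a.grad_fn)  # the graph node that produced loss_a
```

So the link between a loss and a model lives in the loss tensor itself (through grad_fn), not in the criterion object, which is why sharing one criterion between both models would also work.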

Using cross-validation to select optimal threshold: binary classification in Keras

I have a Keras model that takes a transformed vector x as input and outputs probabilities that each input value is 1.
I would like to take the predictions from this model and find an optimal threshold. That is, maybe the cutoff value for "this value is 1" should be 0.23, or maybe it should be 0.78, or something else. I know cross-validation is a good tool for this.
My question is how to work this in to training. For example, say I have the following model (taken from here):
from keras.models import Sequential
from keras.layers import Dense

def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
I train the model and get some output probabilities:
model.fit(train_x, train_y)
predictions = model.predict(train_x)
Now I want to learn the threshold for the value of each entry in predictions that would give the best accuracy, for example. How can I learn this parameter, instead of just choosing one after training is complete?
EDIT: For example, say I have this:
def fake_model(self):
    # Model that returns probability that each of 10 values is 1
    a_input = Input(shape=(2, 10), name='a_input')
    dense_1 = Dense(5)(a_input)
    outputs = Dense(10, activation='sigmoid')(dense_1)

    def hamming_loss(y_true, y_pred):
        return tf.to_float(tf.reduce_sum(abs(y_true - y_pred))) / tf.to_float(tf.size(y_pred))

    fakemodel = Model(a_input, outputs)
    # Use the outputs of the model; find the threshold value that minimizes the Hamming loss
    # Record the final confusion matrix.
How can I train a model like this end-to-end?
If an ROC curve isn't what you are looking for, you could create a custom Keras Layer that takes in the outputs of your original model and tries to learn an optimal threshold given the true outputs and the predicted probabilities.
This layer subtracts the threshold from the predicted probability, multiplies by a relatively large constant (in this case 100) and then applies the sigmoid function. Here is a plot that shows the function at three different thresholds (.3, .5, .7).
Below is the code defining this layer and creating a model composed solely of it. After fitting your original model, feed its output probabilities to this model and start training for an optimal threshold.
import keras

class ThresholdLayer(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ThresholdLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Single trainable scalar: the threshold itself
        self.kernel = self.add_weight(name="threshold", shape=(1,), initializer="uniform",
                                      trainable=True)
        super(ThresholdLayer, self).build(input_shape)

    def call(self, x):
        # Steep sigmoid centered at the threshold: a smooth stand-in for a hard cutoff
        return keras.backend.sigmoid(100*(x-self.kernel))

    def compute_output_shape(self, input_shape):
        return input_shape

# Assumption: one predicted probability per sample; adjust the shape to your model's output
input_layer = keras.layers.Input(shape=(1,))
out = ThresholdLayer()(input_layer)
threshold_model = keras.Model(inputs=input_layer, outputs=out)
threshold_model.compile(optimizer="sgd", loss="mse")
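To get a feel for what the layer computes without running Keras, the same formula can be evaluated directly in NumPy. This is only a sketch of the function sigmoid(100 * (p - threshold)) described above; the helper name soft_threshold and the sample probabilities are made up.

```python
import numpy as np

def soft_threshold(p, threshold, steepness=100.0):
    """Smooth approximation of the step function (p >= threshold)."""
    z = steepness * (np.asarray(p, dtype=float) - threshold)
    return 1.0 / (1.0 + np.exp(-z))

probs = np.array([0.2, 0.45, 0.5, 0.55, 0.8])
for t in (0.3, 0.5, 0.7):
    # Values far below t go to ~0, values far above t go to ~1
    print(t, np.round(soft_threshold(probs, t), 3))
```

Because the function is smooth, its gradient with respect to the threshold is non-zero near the cutoff, which is what lets gradient descent move the threshold at all.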
First, here's a direct answer to your question. You're thinking of an ROC curve. For example, assuming some data X_test and y_test:
from matplotlib import pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
y_pred = model.predict(X_test).ravel()
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
my_auc = auc(fpr, tpr)
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve close-up')
plt.legend(loc='best')
plt.show()
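The plots above show the trade-off but do not pick a threshold by themselves. One common way to read a cutoff off the curve is Youden's J statistic (the point that maximizes tpr - fpr); this is a technique swapped in here for illustration, not something the answer prescribes, and the labels and scores below are made up.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical held-out labels and predicted probabilities
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_test, y_score)

# Youden's J: vertical distance between the ROC curve and the diagonal
j_scores = tpr - fpr
best = int(np.argmax(j_scores))
best_threshold = float(thresholds[best])

# Apply the chosen cutoff to turn probabilities into hard labels
y_label = (y_score >= best_threshold).astype(int)
print(best_threshold, j_scores[best])
```

A different criterion (for example a cost-weighted one) would give a different cutoff, so the choice of J here is an assumption about what "optimal" means.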
Second, regarding my comment, here's an example of one attempt. It can be done in Keras, or TF, or anywhere, although he does it with XGBoost.
Hope that helps!
The first idea I have is kind of brute force.
You compute a metric on a test set separately for each of your inputs and its corresponding predicted output.
Then, for each of them, iterate over threshold values between 0 and 1 until the metric is optimized for the given input/prediction pair.
For many of the popular metrics of classification quality (accuracy, precision, recall, etc) you just cannot learn the optimal threshold while training your neural network.
This is because these metrics are not differentiable (they are piecewise constant in the threshold, so the gradient is zero almost everywhere), and gradient updates therefore cannot set the threshold (or any other parameter) correctly. You are forced to optimize a nice smooth loss (like negative log likelihood) when training most of the parameters, and then tune the threshold by grid search.
Of course, you can come up with a smoothed version of your metric and optimize it (and sometimes people do this). But in most cases it is OK to optimize log-likelihood, get a nice probabilistic classifier, and tune the thresholds on top of it. E.g. if you want to optimize accuracy, then you should first estimate class probabilities as accurately as possible (to get close to the perfect Bayes classifier), and then just choose their argmax.
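The "tune the threshold by grid search" step can be sketched in a few lines of NumPy. The helper name best_threshold, the candidate grid, and the validation data are assumptions made for this example; in practice the probabilities would come from the trained model on held-out data.

```python
import numpy as np

def best_threshold(y_true, y_prob, candidates=None):
    """Return the (threshold, accuracy) pair with the highest accuracy on the given data."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if candidates is None:
        candidates = np.linspace(0.0, 1.0, 101)  # grid with step 0.01
    # Accuracy of the rule "predict 1 when probability >= t" for every candidate t
    accuracies = np.array([np.mean((y_prob >= t) == y_true) for t in candidates])
    best = int(np.argmax(accuracies))
    return float(candidates[best]), float(accuracies[best])

# Made-up validation labels and predicted probabilities
y_val = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p_val = np.array([0.05, 0.10, 0.20, 0.25, 0.30, 0.60, 0.70, 0.90])

threshold, accuracy = best_threshold(y_val, p_val)
print(threshold, accuracy)
```

To use this with cross-validation, one option is to run it on each validation fold and average the selected thresholds; swapping accuracy for another metric only changes the line that computes accuracies.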

Can a Neural Network learn a simple interpolation?

I’ve tried to train a 2 layer neural network on a simple linear interpolation for a discrete function, I’ve tried lots of different learning rates as well as different activation functions, and it seems like nothing is being learned!
I’ve literally spent the last 6 hours trying to debug the following code, but it seems like there’s no bug! What's the explanation?
from torch.utils.data import Dataset
import os
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import random
LOW_X=255
MID_X=40000
HIGH_X=200000
LOW_Y=torch.Tensor([0,0,1])
MID_Y=torch.Tensor([0.2,0.5,0.3])
HIGH_Y=torch.Tensor([1,0,0])
BATCH_SIZE=4
def x_to_tensor(x):
    if x<=MID_X:
        return LOW_Y+(x-LOW_X)*(MID_Y-LOW_Y)/(MID_X-LOW_X)
    if x<=HIGH_X:
        return MID_Y+(x-MID_X)*(HIGH_Y-MID_Y)/(HIGH_X-MID_X)
    return HIGH_Y

class XYDataset(Dataset):
    LENGTH=10000
    def __len__(self):
        return self.LENGTH
    def __getitem__(self, idx):
        x=random.randint(LOW_X,HIGH_X)
        y=x_to_tensor(x)
        return x,y

class Interpolate(nn.Module):
    def __init__(self, num_outputs,hidden_size=10):
        super(Interpolate, self).__init__()
        self.hidden_size=hidden_size
        self.x_to_hidden = nn.Linear(1, hidden_size)
        self.hidden_to_out = nn.Linear(hidden_size,num_outputs)
        self.activation = nn.Tanh() #I have tried Sigmoid and Relu activations as well
        self.softmax=torch.nn.Softmax(dim=1)
    def forward(self, x):
        out = self.x_to_hidden(x)
        out = self.activation(out)
        out = self.hidden_to_out(out)
        out = self.softmax(out)
        return out

dataset=XYDataset()
trainloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE,
                                          shuffle=True, num_workers=4)
criterion= nn.MSELoss()

def train_net(net,epochs=10,lr=5.137871216190041e-05,l2_regularization=2.181622809797563e-12):
    optimizer= optim.Adam(net.parameters(),lr=lr,weight_decay=l2_regularization)
    net.train(True)
    running_loss=0.0
    for epoch in range(epochs):
        for i,data in enumerate(trainloader):
            inputs,targets=data
            inputs,targets=torch.FloatTensor(inputs.float()).view(-1,1),torch.FloatTensor(targets.float())
            optimizer.zero_grad()
            outputs=net(inputs)
            loss=criterion(outputs,targets)
            loss.backward()
            optimizer.step()
            running_loss+=loss.item()
            if (len(trainloader)*epoch+i)%200==199:
                running_loss=running_loss/(200*BATCH_SIZE)
                print('[%d,%5d] loss: %.6f ' % (epoch+1,i+1,running_loss))
                running_loss=0.0

for i in range(-11,3):
    net=Interpolate(num_outputs=3)
    train_net(net,lr=10**i,epochs=1)
    print('for learning rate {} net output on low x is {}'.format(i,net(torch.Tensor([255]).view(-1,1))))
Although your problem is quite simple, it is poorly scaled: x ranges from 255 to 200K. This poor scaling leads to numerical instability and overall makes the training process unnecessarily unstable.
To overcome this technical issue, you simply need to scale your inputs to [-1, 1] (or [0, 1]) range.
Note that this scaling is quite ubiquitous in deep-learning: images are scaled to [-1, 1] range (see, e.g., torchvision.transforms.Normalize).
To understand better the importance of scaled responses, you can look into the mathematical analysis done in this paper.
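The scaling suggested above can be sketched as a one-line linear map. The helper name scale_x is made up; LOW_X and HIGH_X reuse the constants from the question, and the map only uses arithmetic, so the same function should also work on a float tensor of inputs.

```python
LOW_X = 255
HIGH_X = 200000

def scale_x(x, low=LOW_X, high=HIGH_X):
    """Map x from [low, high] linearly onto [-1, 1]."""
    return 2.0 * (x - low) / (high - low) - 1.0

# The endpoints of the input range land exactly on -1 and 1
print(scale_x(LOW_X), scale_x(HIGH_X))
# A point in between stays strictly inside the interval
print(scale_x(40000))
```

In the training loop from the question this would be applied to the inputs before they are reshaped and fed to the network, and the same map would have to be applied at prediction time.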
You can perform a simple interpolation with a NN; however, you have to consider the following.
I would recommend the following settings:
For the activation function: for a simple interpolation, an identity activation turns the NN into a linear regressor, which may generalize well. However, you should also consider the Rectified Linear Unit (ReLU) for big data and Logistic/Tanh for regular-size data as other options.
With large amounts of data, I would select an iterative optimizer for the weights, such as simple gradient descent or Adam. On the other hand, if you have little data, I would use a quasi-Newton method such as L-BFGS, since you will get a good approximation of the weights in reasonably lower computational time.
Vary the number of neurons in each layer and the number of layers, performing batch learning, to seek better approximations.

Learning curve is the same for training and validation?

I've been training a neural network with scikit-learn's MLPRegressor using ShuffleSplit with 10 splits and 20% of the data set aside for testing. First I use GridSearchCV to find good parameters. I then instantiate a new (unfitted) estimator with those params, and finally use the plot_learning_curve function, with a MAPE scorer and the same ShuffleSplit cv.
In most of the learning curve examples I've seen, the validation and training curves are distinctly separate. However, I've consistently been getting learning curves where the cross-validation and training curves are almost identical. How should I interpret this: does it seem realistic, or have I made a mistake somewhere?
Learning Curve
As requested, here's the code:
node_range = list(range(1,16))
layer_range = range(1,6)
hidden_sizes = [(nodes,) * layers for layers in layer_range for nodes in node_range]
param_grid = [{'hidden_layer_sizes': hidden_sizes,
'activation': ['relu'],
'learning_rate_init': [0.5]}
]
cv = ShuffleSplit(n_splits=10, test_size=0.2)
search = GridSearchCV(estimator, param_grid, cv=cv, scoring=neg_MAPE, refit=True)
search.fit(X, y)
best_params = search.best_params_
estimator = MLPRegressor().set_params(**best_params)
plot_learning_curve(estimator, X, y, cv=cv, scoring=neg_MAPE)
And here is my scorer:
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

neg_MAPE = make_scorer(mean_absolute_percentage_error, greater_is_better=False)

sklearn metrics for multiclass classification

I have performed GaussianNB classification using sklearn. I tried to calculate the metrics using the following code:
print accuracy_score(y_test, y_pred)
print precision_score(y_test, y_pred)
Accuracy score is working correctly but precision score calculation is showing error as:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
As the target is multiclass, can I still get metric scores such as precision, recall, etc.?
The function call precision_score(y_test, y_pred) is equivalent to precision_score(y_test, y_pred, pos_label=1, average='binary').
The documentation (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html) tells us:
'binary':
Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.
So the problem is that your labels are not binary: as the error message says, the target is multiclass, i.e. it contains more than two distinct class labels. Fortunately, there are other options which should work with your data:
precision_score(y_test, y_pred, average=None) will return the precision scores for each class, while
precision_score(y_test, y_pred, average='micro') will return the total ratio of tp/(tp + fp).
The pos_label argument will be ignored if you choose another average option than binary.
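As a quick illustration of how the average options differ, here is a small sketch on made-up three-class labels (the label lists are invented for this example; 'macro' is a further documented option that takes the unweighted mean of the per-class scores).

```python
from sklearn.metrics import precision_score

# Hypothetical true and predicted labels for a 3-class problem
y_test = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

# One precision value per class, in sorted label order
per_class = precision_score(y_test, y_pred, average=None)
# Pooled over all classes: total tp / (total tp + total fp)
micro = precision_score(y_test, y_pred, average='micro')
# Unweighted mean of the per-class precisions
macro = precision_score(y_test, y_pred, average='macro')

print(per_class, micro, macro)
```

Here class 0 is predicted once and correctly, class 1 is predicted twice with one hit, and class 2 is predicted three times with two hits, which is what the per-class array reflects. The same average argument is accepted by recall_score and f1_score.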
