I would like to obtain not only accuracy and Cohen's kappa values from a k-fold cross-validation, but AUC as well. I know how to obtain the average accuracy, Cohen's kappa, and AUC, as well as the accuracy and Cohen's kappa for each fold, but I don't know how to obtain an AUC value for each fold.
Here is an example using different data:
# load required packages
library(caret)
library(mlbench)
# load data
data(Sonar)
# rename data
my_data <- Sonar
# apply train control to get accuracy and Cohen's kappa
fitControl <-
  trainControl(
    method = "cv",
    number = 10,
    classProbs = TRUE,
    savePredictions = TRUE
  )
# run 10-fold cross-validation
model <- train(
  Class ~ .,
  data = my_data,
  method = "glm",
  trControl = fitControl
)
getTrainPerf(model)
# get the accuracy and kappa values for each fold
model$resample
I also know that I can use ROC as the metric in the train function, fit the model to optimize ROC, and then obtain ROC values. But I would like to optimize Cohen's kappa and still see AUC scores for each fold. How might I accomplish this?
I have a Keras model that takes a transformed vector x as input and outputs probabilities that each input value is 1.
I would like to take the predictions from this model and find an optimal threshold. That is, maybe the cutoff value for "this value is 1" should be 0.23, or maybe it should be 0.78, or something else. I know cross-validation is a good tool for this.
My question is how to work this in to training. For example, say I have the following model (taken from here):
from keras.models import Sequential
from keras.layers import Dense

def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(60, input_dim=60, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
I train the model and get some output probabilities:
model.fit(train_x, train_y)
predictions = model.predict(train_x)
Now I want to learn the threshold for the value of each entry in predictions that would give the best accuracy, for example. How can I learn this parameter, instead of just choosing one after training is complete?
EDIT: For example, say I have this:
from keras.layers import Input, Dense
from keras.models import Model
import tensorflow as tf

def fake_model(self):
    # Model that returns the probability that each of 10 values is 1
    a_input = Input(shape=(2, 10), name='a_input')
    dense_1 = Dense(5)(a_input)
    outputs = Dense(10, activation='sigmoid')(dense_1)

    def hamming_loss(y_true, y_pred):
        # tf.to_float is TF1-style; in TF2 use tf.cast(..., tf.float32)
        return tf.to_float(tf.reduce_sum(abs(y_true - y_pred))) / tf.to_float(tf.size(y_pred))

    fakemodel = Model(a_input, outputs)

    # Use the outputs of the model; find the threshold value that minimizes the Hamming loss.
    # Record the final confusion matrix.
How can I train a model like this end-to-end?
If an ROC curve isn't what you are looking for, you could create a custom Keras Layer that takes in the outputs of your original model and tries to learn an optimal threshold given the true outputs and the predicted probabilities.
This layer subtracts the threshold from the predicted probability, multiplies by a relatively large constant (in this case 100) and then applies the sigmoid function. Here is a plot that shows the function at three different thresholds (.3, .5, .7).
Below is the code defining this layer and building a model composed solely of it. After fitting your original model, feed its output probabilities to this model and train it to find an optimal threshold.
import keras

class ThresholdLayer(keras.layers.Layer):
    def __init__(self, **kwargs):
        super(ThresholdLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # A single trainable weight: the threshold itself.
        self.kernel = self.add_weight(name="threshold", shape=(1,),
                                      initializer="uniform", trainable=True)
        super(ThresholdLayer, self).build(input_shape)

    def call(self, x):
        # Steep sigmoid around the learned threshold: close to 0 below it, close to 1 above it.
        return keras.backend.sigmoid(100 * (x - self.kernel))

    def compute_output_shape(self, input_shape):
        return input_shape

# input_layer should be a Keras Input matching the shape of the base model's
# output probabilities, e.g. keras.layers.Input(shape=(1,))
out = ThresholdLayer()(input_layer)
threshold_model = keras.Model(inputs=input_layer, outputs=out)
threshold_model.compile(optimizer="sgd", loss="mse")
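Training the threshold might then look roughly like this. This is a hypothetical usage sketch, not code from the question: orig_model, train_x, and train_y stand in for your fitted base model and its training data, and threshold_model is the model built above on input_layer = keras.layers.Input(shape=(1,)).

# Hypothetical usage sketch: orig_model, train_x, train_y are assumed names for
# the fitted base model and its training data.
probs = orig_model.predict(train_x)                  # base-model probabilities, shape (n, 1)
threshold_model.fit(probs, train_y, epochs=100, verbose=0)
print("learned threshold:", threshold_model.get_weights()[0])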
First, here's a direct answer to your question. You're thinking of an ROC curve. For example, assuming some data X_test and y_test:
from matplotlib import pyplot as plt
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
y_pred = model.predict(X_test).ravel()
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
my_auc = auc(fpr, tpr)
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Model_name (area = {:.3f})'.format(my_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve close-up')
plt.legend(loc='best')
plt.show()
Second, regarding my comment, here's an example of one such attempt. It can be done in Keras, or TF, or anywhere else, although that example uses XGBoost.
Hope that helps!
The first idea I have is a kind of brute force.
On a test set, compute a metric separately for each input and its corresponding predicted output.
Then, for each of them, iterate over threshold values between 0 and 1 until the metric is optimized for the given input/prediction pair.
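A rough sketch of that sweep applied to a whole held-out set might look like this. The names are hypothetical: probs would be the model's predicted probabilities, y_true the corresponding labels, and accuracy is used as the example metric.

import numpy as np
from sklearn.metrics import accuracy_score

def best_threshold(y_true, probs, metric=accuracy_score,
                   grid=np.linspace(0.0, 1.0, 101)):
    # Evaluate the metric for every candidate cutoff and keep the best one.
    scores = [metric(y_true, (probs >= t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# e.g. threshold, score = best_threshold(y_test, model.predict(x_test).ravel())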
For many of the popular metrics of classification quality (accuracy, precision, recall, etc) you just cannot learn the optimal threshold while training your neural network.
This is because these metrics are not differentiable, so gradient updates will fail to set the threshold (or any other parameter) correctly. You are therefore forced to optimize a nice smooth loss (like negative log-likelihood) while training most of the parameters, and then tune the threshold by grid search afterwards.
Of course, you can come up with a smoothed version of your metric and optimize it (and sometimes people do this). But in most cases it is OK to optimize log-likelihood, get a nice probabilistic classifier, and tune the thresholds on top of it. E.g. if you want to optimize accuracy, then you should first estimate class probabilities as accurately as possible (to get close to the perfect Bayes classifier), and then just choose their argmax.
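To illustrate the first option, a smoothed stand-in for accuracy could look something like this in Keras. This is a sketch, not part of the original answer; the sharpness constant and the fixed 0.5 cutoff are arbitrary choices.

import keras.backend as K

def soft_accuracy(y_true, y_pred, sharpness=10.0):
    # A steep sigmoid around 0.5 stands in for the hard threshold,
    # so "is this prediction correct?" becomes a smooth quantity.
    hard_ish = K.sigmoid(sharpness * (y_pred - 0.5))
    return K.mean(y_true * hard_ish + (1.0 - y_true) * (1.0 - hard_ish))

def soft_accuracy_loss(y_true, y_pred):
    # Minimizing this loss maximizes the smoothed accuracy.
    return 1.0 - soft_accuracy(y_true, y_pred)

# model.compile(optimizer='adam', loss=soft_accuracy_loss)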
Where is the explicit connection between the optimizer and the loss?
How does the optimizer know where to get the gradients of the loss without a call like optimizer.step(loss)?
More context:
When I minimize the loss, I don't have to pass the gradients to the optimizer:
loss.backward() # Back Propagation
optimizer.step() # Gardient Descent
Without delving too deep into the internals of pytorch, I can offer a simplistic answer:
Recall that when initializing the optimizer, you explicitly tell it which parameters (tensors) of the model it should be updating. The gradients are "stored" by the tensors themselves (they have grad and requires_grad attributes) once you call backward() on the loss. After the gradients for all tensors in the model have been computed, calling optimizer.step() makes the optimizer iterate over all the parameters (tensors) it is supposed to update and use their internally stored grad to update their values.
More info on computational graphs and the additional "grad" information stored in pytorch tensors can be found in this answer.
Referencing the parameters by the optimizer can sometimes cause trouble, e.g., when the model is moved to the GPU after initializing the optimizer.
Make sure you are done setting up your model before constructing the optimizer. See this answer for more details.
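A minimal sketch of that wiring, with a throwaway linear model and random data standing in for a real setup:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # any model; its parameters carry .grad
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # told up front which tensors to update

x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()            # clear gradients left over from a previous step
loss = criterion(model(x), y)
loss.backward()                  # fills p.grad for every parameter that requires grad
optimizer.step()                 # reads those .grad values; it never sees `loss` itself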
When you call loss.backward(), all it does is compute the gradient of the loss w.r.t. all the parameters that contributed to the loss and have requires_grad=True, and store them in the parameter.grad attribute of each parameter.
optimizer.step() updates all the parameters based on parameter.grad
Perhaps this will clarify the connection between loss.backward and optim.step a little (although the other answers are to the point).
import torch

# Our "model"
x = torch.tensor([1., 2.], requires_grad=True)
y = 100*x
# Compute loss
loss = y.sum()
# Compute gradient of the loss w.r.t. to the parameters
print(x.grad) # None
loss.backward()
print(x.grad) # tensor([100., 100.])
# Modify the parameters by subtracting the gradient
optim = torch.optim.SGD([x], lr=0.001)
print(x) # tensor([1., 2.], requires_grad=True)
optim.step()
print(x) # tensor([0.9000, 1.9000], requires_grad=True)
loss.backward() sets the grad attribute of all tensors with requires_grad=True in the computational graph that produced loss (only x in this case).
The optimizer just iterates through the list of parameters (tensors) it received on initialization, and wherever a tensor has requires_grad=True, it subtracts the value of the gradient stored in its .grad property (simply multiplied by the learning rate in the case of SGD). It doesn't need to know which loss the gradients were computed with respect to; it just needs access to that .grad property so it can do x = x - lr * x.grad.
Note that if we were doing this in a train loop we would call optim.zero_grad() because in each train step we want to compute new gradients - we don't care about gradients from the previous batch. Not zeroing grads would lead to gradient accumulation across batches.
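Continuing the toy example above, such a training loop could look like this (purely illustrative; y and the loss are recomputed each step because x changes):

import torch

x = torch.tensor([1., 2.], requires_grad=True)
optim = torch.optim.SGD([x], lr=0.001)

for step in range(100):
    optim.zero_grad()      # discard gradients from the previous step
    y = 100 * x            # recompute y because x has changed
    loss = y.sum()
    loss.backward()        # write fresh gradients into x.grad
    optim.step()           # x <- x - lr * x.grad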
Some answers have explained this well, but I'd like to give a specific example to explain the mechanism.
Suppose we have a function: z = 3x^2 + y^3.
The gradients of z w.r.t. x and y are dz/dx = 6x and dz/dy = 3y^2, and SGD updates each parameter as param = param - lr * param.grad.
The initial values are x = 1 and y = 2.
import torch
from torch import optim

x = torch.tensor([1.0], requires_grad=True)
y = torch.tensor([2.0], requires_grad=True)
z = 3*x**2 + y**3

print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)
# print result should be:
# x.grad: None
# y.grad: None
# z.grad: None
Then we calculate the gradients of x and y at the current values (x = 1, y = 2):
# calculate the gradients
z.backward()
print("x.grad: ", x.grad)
print("y.grad: ", y.grad)
print("z.grad: ", z.grad)
# print result should be:
# x.grad: tensor([6.])
# y.grad: tensor([12.])
# z.grad: None
Finally, we use the SGD optimizer to update the values of x and y according to that formula:
# create an optimizer; pass x, y as the parameters to be updated, with learning rate lr=0.1
optimizer = optim.SGD([x, y], lr=0.1)
# execute an update step
optimizer.step()
# print the updated values of x and y
print("x:", x)
print("y:", y)
# print result should be:
# x: tensor([0.4000], requires_grad=True)
# y: tensor([0.8000], requires_grad=True)
Let's say we defined a model: model, and loss function: criterion and we have the following sequence of steps:
pred = model(input)
loss = criterion(pred, true_labels)
loss.backward()
pred will have a grad_fn attribute that references the function that created it and ties it back to the model. Therefore, loss.backward() will have information about the model it is working with.
Try removing the grad_fn attribute, for example with:
pred = pred.clone().detach()
Then the model gradients will be None and, consequently, the weights will not get updated.
And the optimizer is tied to the model because we pass model.parameters() when we create the optimizer.
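As a small illustration of that point (a sketch with a throwaway linear model and random data, not code from the question):

import torch
import torch.nn as nn

model = nn.Linear(4, 1)
criterion = nn.MSELoss()
x, y = torch.randn(8, 4), torch.randn(8, 1)

pred = model(x).clone().detach()   # severs the graph connection back to the model
pred.requires_grad_(True)          # needed so backward() has something to differentiate
loss = criterion(pred, y)
loss.backward()

print(model.weight.grad)           # None: the gradients never reach the model
print(pred.grad is None)           # False: only the detached tensor receives a gradient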
Short answer:
loss.backward()  # computes the gradients of all parameters for which we set requires_grad=True; the parameters could be any variables defined in the code, like h2h or i2h.
optimizer.step()  # according to the optimizer (defined previously in our code), updates those parameters to finally reach the minimum loss (error).
When compiling a model, you pass a parameter loss into the compile function. For instance:
model.compile(loss='mean_squared_error', optimizer='adam')
But I was curious if there is a way in Keras to pass in my own cost function?
Yes, you can. A custom loss can be implemented as a function that takes two tensors, i.e. the predicted y and the ground truth, and returns a scalar. The math employed by the function needs to be defined over TensorFlow functions so the model can backpropagate through it. If you need your function to accept more inputs than just y_pred and y_true, you can wrap your custom loss in a broader function that takes the extra arguments and returns a function needing only y_true and y_pred. Two examples follow.
Mixed loss between binary crossentropy and mse
from keras.losses import mean_squared_error, binary_crossentropy
def my_custom_loss(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    crossentropy = binary_crossentropy(y_true, y_pred)
    return mse + crossentropy
Weighted mixture (wrapped)
def my_custom_loss_wrapper(mse_weight, xentropy_weight):
    def my_custom_loss(y_true, y_pred):
        mse = mean_squared_error(y_true, y_pred)
        crossentropy = binary_crossentropy(y_true, y_pred)
        return mse_weight * mse + xentropy_weight * crossentropy
    return my_custom_loss
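The wrapped loss can then be plugged into compile. This is a sketch: the model architecture and the 0.5 weights are arbitrary placeholders.

from keras.models import Sequential
from keras.layers import Dense

# Placeholder binary classifier; any model with a sigmoid output would do.
model = Sequential([Dense(16, activation='relu', input_dim=8),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam',
              loss=my_custom_loss_wrapper(mse_weight=0.5, xentropy_weight=0.5))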
I've been training a neural network with scikit-learn's MLPRegressor using ShuffleSplit with 10 splits and 20% of the data set aside for testing. First I use GridSearchCV to find good parameters. I then instantiate a new (unfitted) estimator with those params, and finally use the plot_learning_curve function, with a MAPE scorer and the same ShuffleSplit cv.
In most of the learning-curve examples I've seen, the validation and training curves are distinctly separate. However, I've consistently been getting learning curves where the cross-validation and training curves are almost identical. How should I interpret this: does it seem realistic, or have I made a mistake somewhere?
Learning Curve
As requested, here's the code:
from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.neural_network import MLPRegressor

estimator = MLPRegressor()  # base estimator passed to the grid search

node_range = list(range(1, 16))
layer_range = range(1, 6)
hidden_sizes = [(nodes,) * layers for layers in layer_range for nodes in node_range]
param_grid = [{'hidden_layer_sizes': hidden_sizes,
               'activation': ['relu'],
               'learning_rate_init': [0.5]}
              ]
cv = ShuffleSplit(n_splits=10, test_size=0.2)
search = GridSearchCV(estimator, param_grid, cv=cv, scoring=neg_MAPE, refit=True)
search.fit(X, y)
best_params = search.best_params_

estimator = MLPRegressor().set_params(**best_params)
plot_learning_curve(estimator, X, y, cv=cv, scoring=neg_MAPE)
And here is my scorer:
import numpy as np
from sklearn.metrics import make_scorer

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

neg_MAPE = make_scorer(mean_absolute_percentage_error, greater_is_better=False)