What is the standard way to detect if a model has converged? I was going to record 5 losses with a 95% confidence interval for each, and if they all agreed then I'd halt the script. I assume training until convergence must already be implemented somewhere in PyTorch or PyTorch Lightning. I don't need a perfect solution, just the standard way to do this automatically - i.e. halt when converged.
My solution is easy to implement. First create a criterion and change the reduction to 'none', so it outputs a tensor of per-sample losses of size [B]. Every time you log, record that loss together with its 95% confidence interval (or the std if you prefer, but that is much less accurate). Each time you add a new loss with its confidence interval, keep only the most recent 5 (or 10) records and check that those 5 losses are all within each other's 95% CIs. If that holds, halt.
You can compute the CI with this:
from typing import Tuple

import scipy.stats
import torch
from torch import Tensor


def torch_compute_confidence_interval(data: Tensor,
                                      confidence: float = 0.95
                                      ) -> Tuple[Tensor, Tensor]:
    """
    Computes the mean and confidence interval for a given sample of data.
    """
    n = len(data)
    mean: Tensor = data.mean()
    se: Tensor = data.std(unbiased=True) / (n ** 0.5)  # standard error of the mean
    t_p: float = float(scipy.stats.t.ppf((1 + confidence) / 2., n - 1))
    ci = t_p * se
    return mean, ci
and you can create the criterion as follows:
loss: nn.Module = nn.CrossEntropyLoss(reduction='none')
so the train loss is now of size [B].
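To make the idea concrete, here is a minimal sketch of the halting loop I have in mind (my own construction, not an existing PyTorch/Lightning feature); model, dataloader, optimizer and log_freq are hypothetical placeholders:
from collections import deque
import torch.nn as nn

# Sketch of the halting criterion described above; `model`, `dataloader`,
# `optimizer` and `log_freq` are hypothetical placeholders.
criterion = nn.CrossEntropyLoss(reduction='none')   # per-sample losses, shape [B]
history = deque(maxlen=5)                           # last 5 (mean, ci) records

def has_converged(history) -> bool:
    """True once every recorded mean lies within every other record's 95% CI."""
    if len(history) < history.maxlen:
        return False
    return all(abs(m1 - m2) <= ci2 for m1, _ in history for m2, ci2 in history)

for step, (x, y) in enumerate(dataloader):
    per_sample_loss = criterion(model(x), y)          # shape [B]
    per_sample_loss.mean().backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % log_freq == 0:
        mean, ci = torch_compute_confidence_interval(per_sample_loss.detach())
        history.append((mean.item(), ci.item()))
        if has_converged(history):
            print(f'halting: loss {mean.item():.4f} +/- {ci.item():.4f}')
            break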
Note that I know how to train with a fixed number of epochs, so I am not really looking for that - just the halting criterion for when to stop once the model looks converged, i.e. what a person would do when they look at their learning curve, but automatically.
ref:
https://forums.pytorchlightning.ai/t/what-is-the-standard-way-to-halt-a-script-when-it-has-converged/1415
Set an EarlyStopping callback (https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.EarlyStopping.html#pytorch_lightning.callbacks.EarlyStopping) in your trainer:
callbacks = [
    EarlyStopping(
        monitor="val_f1_score",
        min_delta=0.01,
        patience=10,  # NOTE: counted in validation epochs, not train epochs
        verbose=False,
        mode="min",   # use "max" if a larger val_f1_score counts as an improvement
    ),
]
trainer = pl.Trainer(callbacks=callbacks)
This will monitor changes in val_f1_score during training (note that you have to log this value with self.log("val_f1_score", val_f1) in your pl.LightningModule). Training stops when the monitored quantity has not improved by at least min_delta for more than patience validation epochs.
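For completeness, here is a minimal sketch of a LightningModule that logs the monitored key (a toy model, not the asker's code; plain accuracy stands in for a real F1 metric):
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F

class LitClassifier(pl.LightningModule):
    # Toy module, only to show where the monitored value gets logged.
    def __init__(self, in_dim: int = 32, n_classes: int = 3):
        super().__init__()
        self.net = nn.Linear(in_dim, n_classes)

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        # EarlyStopping(monitor="val_f1_score") looks up this logged key;
        # plain accuracy stands in for a real F1 computation here.
        val_f1 = (logits.argmax(dim=1) == y).float().mean()
        self.log("val_f1_score", val_f1, prog_bar=True)
        self.log("val_loss", F.cross_entropy(logits, y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)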
I know this question has been asked before, but I have tried all of their solutions and nothing is working for me.
My Problem:
I am running a CNN to classify some images, a typical task, nothing too crazy. I compile my model as follows:
model.compile(optimizer = keras.optimizers.Adam(learning_rate = exp_learning_rate),
              loss = tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics = ['accuracy'])
I fit this on my training dataset, and evaluated on my validation dataset as follows:
history = model.fit(train_dataset, validation_data = validation_dataset, epochs = 5)
And then I evaluated on a separate test set as follows:
model.evaluate(test_dataset)
Which resulted in this:
4/4 [==============================] - 30s 7s/step - loss: 1.7180 - accuracy: 0.8627
However, when I run:
model.predict(test_dataset)
I have the following confusion matrix output:
This clearly isn't 86% accuracy like the .evaluate method tells me. In fact, it's actually 35.39% accuracy. To make sure it wasn't an issue with my testing dataset, I had my model predict on my training and validation datasets and I still got a similar percentage as here (~30%), despite my training and validation accuracy during fitting going up to 96% and 87%, respectively.
Question:
I don't know why .predict and .evaluate are outputting different results. What's happening there? It seems like when I call .predict, it's not using any of the weights that I trained during fitting (in fact, given that there are 3 classes, this output is no better than blindly guessing each label). Are the weights from my fitting not being transferred over to my prediction? My loss function is correct (I label-encoded my data as TensorFlow expects for sparse_categorical_crossentropy), and when I pass 'accuracy', it will just take the accuracy corresponding to my loss function. All of this should be consistent. But why is there such a discrepancy between the results of .evaluate and .predict? Which one should I trust?
My Attempts to Fix My Issue:
I thought maybe the sparse categorical cross entropy wasn't right, so I one-hot encoded my target labels and used the categorical_crossentropy loss instead. I still have the EXACT same issue as above.
Concerns:
If .evaluate is incorrect, then doesn't that mean my training accuracy and validation accuracy during fitting are inaccurate as well? Don't those use the .evaluate method too? If that's the case, then what can I trust? The loss isn't a good indication of whether my model is doing well, because it is well known that minimal loss does not imply good accuracy (although the converse is usually true, depending on what standard of "good" we're using). How do I gauge my model's effectiveness if my accuracy metrics aren't correct? I don't really know what to look at anymore, because I have no other way to gauge whether my model is learning. If someone could please help me understand what is happening, I would appreciate it so much. I'm so frustrated.
Edit: (10-28-2021: 12:26 AM)
Ok, so I'll provide some more code to really troubleshoot this.
I originally preprocessed my data as such:
image_size = (256, 256)
batch_size = 16
train_ds = keras.preprocessing.image_dataset_from_directory(
    directory = image_directory,
    label_mode = 'categorical',
    shuffle = True,
    validation_split = 0.2,
    subset = 'training',
    seed = 24,
    batch_size = batch_size
)
val_ds = keras.preprocessing.image_dataset_from_directory(
    directory = image_directory,
    label_mode = 'categorical',
    shuffle = True,
    validation_split = 0.2,
    subset = 'validation',
    seed = 24,
    batch_size = batch_size
)
Where image_directory is a string with a path containing my images. Now you could probably read documentation, but the image_dataset_from_directory method actually returns a tf.data.Dataset object containing a bunch of batches of the respective (training, validation) data.
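If it helps, you can confirm this by pulling a single element out of the dataset (my own quick check, not part of the original post):
for images, labels in train_ds.take(1):
    print(images.shape)   # e.g. (16, 256, 256, 3) with batch_size = 16
    print(labels.shape)   # e.g. (16, 3) for 3 classes with label_mode = 'categorical'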
I imported the VGG16 architecture to do my classification so I called the respective preprocessing function for VGG16 as follows:
preprocess_input = tf.keras.applications.vgg16.preprocess_input
train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))
val_ds = val_ds.map(lambda x, y: (preprocess_input(x), y))
This transformed the images into something that was suitable as input for VGG16. Then, in my last processing steps, I did the following validation/test split:
val_batches = tf.data.experimental.cardinality(val_ds)
test_dataset = val_ds.take(val_batches // 3)
validation_dataset = val_ds.skip(val_batches // 3)
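As a quick sanity check on the split (my addition, not in the original code), you can print the number of batches in each part:
print(tf.data.experimental.cardinality(test_dataset).numpy())        # val_batches // 3
print(tf.data.experimental.cardinality(validation_dataset).numpy())  # the remaining batches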
Then I proceeded to prefetch my data:
AUTOTUNE = tf.data.AUTOTUNE
train_dataset = train_ds.prefetch(buffer_size = AUTOTUNE)
validation_dataset = validation_dataset.prefetch(buffer_size = AUTOTUNE)
test_dataset = test_dataset.prefetch(buffer_size = AUTOTUNE)
The Problem:
The problem occurs in the method above. I'm still not sure whether or not .evaluate is a true indicator of accuracy for my model. But I realized that .evaluate and .predict always coincide when my neural network is a keras.Sequential() model. However (correct me if I'm wrong), what I suspect is that VGG16, when imported from the keras.applications API, is actually NOT a keras.Sequential() model, and therefore the .predict and .evaluate results don't coincide when I feed my data straight into my model. (I was going to post this as an answer, but I don't have sufficient knowledge or research to confirm that any of what I said is correct; someone please chime in, because I like learning about things I know little to nothing about. This is an edit for now.)
In the end, I worked around my problem by calling ImageDataGenerator(...).flow_from_directory(...) instead of image_dataset_from_directory(), as follows:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    preprocessing_function = preprocess_input,
    width_shift_range = 0.2,
    height_shift_range = 0.2,
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True
)
val_datagen = ImageDataGenerator(
    preprocessing_function = preprocess_input
)
train_ds = train_datagen.flow_from_directory(
    train_image_directory,
    target_size = (224, 224),
    batch_size = 16,
    seed = 24,
    shuffle = True,
    classes = ['class1', 'class2', 'class3'],
    class_mode = 'categorical'
)
test_ds = val_datagen.flow_from_directory(
    test_image_directory,
    target_size = (224, 224),
    batch_size = 16,
    seed = 24,
    shuffle = False,
    classes = ['class1', 'class2', 'class3'],
    class_mode = 'categorical'
)
(NOTE: I got this based off the following link from tensorflow's documentation: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_directory)
This completes all the preprocessing for me. Then, when I call model.evaluate(test_ds), it returns the exact same result as when I do model.predict_generator(test_ds). After some minor processing of the prediction output, I use the following code for my confusion matrix:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

Y_pred = model.predict(test_ds)
y_pred = np.argmax(Y_pred, axis=1)
cf = confusion_matrix(test_ds.classes, y_pred)  # aligned because shuffle=False
sns.heatmap(cf, annot=True, xticklabels=class_names, yticklabels=class_names)
plt.title('Performance of Model on Testing Set')
This eliminates the discrepancy in the confusion matrix and the result of model.evaluate(test_ds).
The Takeaway:
If you're loading images into a classification model, and your loss and accuracy match but there is a discrepancy between your predictions and your loss/accuracy, try preprocessing in every way possible. I usually preprocess my images using the image_dataset_from_directory() method for all my keras.Sequential() models; however, for the VGG16 model, which I suspect is not a Sequential() model, using ImageDataGenerator(...).flow_from_directory(...) resulted in the correct format for the model to generate predictions consistent with the performance metrics.
TLDR I didn't answer any of my original questions, but I found a workaround. Sorry if this is spam in any way. As is the nature of most Stack Overflow posts, I hope my turmoil in the last few hours helps someone way in the future.
I had the same problem, and even with ImageDataGenerator the odd behaviour persisted.
But I think the problem is the shuffle flag of the validation set: with shuffle=True the data is reshuffled on every pass, so the order of the predictions returned by model.predict no longer lines up with the order of the labels you compare them against, which scrambles the confusion matrix. With shuffle=False the two stay aligned.
You changed that from here:
val_ds = keras.preprocessing.image_dataset_from_directory(
    directory = image_directory,
    label_mode = 'categorical',
    shuffle = True,
    validation_split = 0.2,
    subset = 'validation',
    seed = 24,
    batch_size = batch_size
)
To here:
test_ds = val_datagen.flow_from_directory(
    test_image_directory,
    target_size = (224, 224),
    batch_size = 16,
    seed = 24,
    shuffle = False,
    classes = ['class1', 'class2', 'class3'],
    class_mode = 'categorical'
)
It makes intuitive sense to me that the label's dimension should be the same as the neural network's last layer's dimension. However, with some experiments using PyTorch, it turns out that it somehow works.
Code:
import torch
import torch.nn as nn
X = torch.tensor([[1],[2],[3],[4]], dtype=torch.float32) # training input
Y = torch.tensor([[2],[4],[6],[8]], dtype=torch.float32) # training label
model = nn.Linear(1,3)
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss()
for epoch in range(10):
    y_pred = model(X)
    loss = criterion(y_pred, Y)  # shapes: y_pred is (4, 3), Y is (4, 1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
In the above code, model = nn.Linear(1,3) is used instead of model = nn.Linear(1,1). As a result, while Y.shape is (4,1), y_pred.shape is (4,3).
The code works with a warning saying that "Using a target size that is different to the input size will likely lead to incorrect results due to broadcasting. "
I got the following output when I executed model(torch.tensor([10], dtype=torch.float32)):
tensor([20.0089, 19.6121, 19.1967], grad_fn=<AddBackward0>)
All three outputs seem correct to me. But how is the loss calculated if the sizes of the data are different?
Is there ever a case where we should use a target size that is different from the input size? Is there a benefit to doing so?
Assuming you are working with batch_size=4, you are using a target with 1 component vs 3 for your predicted tensor. You don't actually see the intermediate results when computing the loss with the default nn.MSELoss; using the reduction='none' option will let you see them:
>>> criterion = nn.MSELoss(reduction='none')
>>> y = torch.rand(2,1)
>>> y_hat = torch.rand(2,3)
>>> criterion(y_hat, y).shape
torch.Size([2, 3])
Considering this, you can conclude that the target y, being too small, has been broadcasted to the predicted tensor y_hat. Essentially, in your example, you will get the same result (without the warning) as:
>>> y_repeat = y.repeat(1, 3)
>>> criterion(y_hat, y_repeat)
This means that, for each batch element, you are L2-optimizing all of its components against a single value: MSE(y_hat[0,0], y[0]), MSE(y_hat[0,1], y[0]), and MSE(y_hat[0,2], y[0]); the same goes for the other rows.
The warning is there to make sure you're conscious of this broadcast operation. Maybe this is what you're looking to do; in that case, you should broadcast the target tensor yourself (e.g. with repeat or expand). Otherwise, it doesn't make sense to do so.
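As a quick check of the equivalence claimed above (my own snippet, using the same assumed shapes and the torch/nn imports from the example):
>>> criterion_mean = nn.MSELoss()                 # default reduction='mean'
>>> y = torch.rand(2, 1)
>>> y_hat = torch.rand(2, 3)
>>> torch.allclose(criterion_mean(y_hat, y),      # broadcasts y (emits the warning)
...                criterion_mean(y_hat, y.repeat(1, 3)))
True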
I am having a lot of trouble with a neural network model using the R neuralnet() function. When I train a network on all of the data, the predictions are very accurate, as expected. However, when I split the data into training and test sets, the test predictions are terrible. I cannot figure out what I am doing wrong. I would appreciate any advice or help troubleshooting, as I don't think I'll be able to figure this out on my own. Thanks in advance.
I have included the R code, some plots, and an example of the data below; the full data set is 3600 observations.
Best Regards-Pat
UPDATE 05/12/18: Based on feedback that this looks like overfitting, I tried stopping the training earlier and found that the MSE of the test prediction never gets very low; it is lowest approaching 0 training epochs and rises from there (plot included and code appended).
###########
#ANN Models
###########
#Load libraries
library(plyr)
library(ggplot2)
library(gridExtra)
library(neuralnet)
#Retain only numerically coded data from data1 in (data2) for ANN fitting
data2 = data1[,c(3:7)]
#Calculate Min and Max for Scaling
max_data = apply(data2,2,max)
min_data = apply(data2,2,min)
#Scale data 0-1
data2_scaled = scale(data2,center=min_data,scale=max_data-min_data)
#Check data structure
data2_scaled
#Fit neural net model
model_nn1 = neuralnet(formula=time~instructions+nodes+machine_num+app_num,data=data2_scaled,hidden=c(8,8),stepmax=1000000,threshold=0.01)
#Calculate Min and Max Response for rescaling
max_time = max(data2$time)
min_time = min(data2$time)
#Rescale neural net response predictions
pred_nn1 = model_nn1$net.result[[1]][,1]*(max_time-min_time)+min_time
#Compare model predictions to actual values
a03 = cbind.data.frame(data1$time,pred_nn1,data1$machine,data1$app)
colnames(a03) = c("actual","prediction","machine","app")
attach(a03)
p01 = ggplot(a03,aes(x=actual,y=prediction))+
geom_point(aes(color=machine),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (ALL DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
p02 = ggplot(a03,aes(x=actual,y=prediction))+
geom_point(aes(color=app),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (ALL DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
grid.arrange(p01,p02,nrow=1)
#Visualize ANN
plot(model_nn1)
#Epochs taken to train "steps"
model_nn1$result.matrix[3,]
#########################
#Testing and Training ANN
#########################
#Split the data into a test and training set
index = sample(1:nrow(data2_scaled),round(0.80*nrow(data2_scaled)))
train_data = as.data.frame(data2_scaled[index,])
test_data = as.data.frame(data2_scaled[-index,])
model_nn2 = neuralnet(formula=time~instructions+nodes+machine_num+app_num,data=train_data,hidden=c(3,2),stepmax=1000000,threshold=0.01)
pred_nn2_scaled = compute(model_nn2,test_data[,c(1,2,4,5)])
pred_nn2 = pred_nn2_scaled$net.result*(max_time-min_time)+min_time
test_data_time = test_data$time*(max_time-min_time)+min_time
a04 = cbind.data.frame(test_data_time,pred_nn2,data1[-index,2],data1[-index,1])
colnames(a04) = c("actual","prediction","machine","app")
attach(a04)
p01 = ggplot(a04,aes(x=actual,y=prediction))+
geom_point(aes(color=machine),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (TEST DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
p02 = ggplot(a04,aes(x=actual,y=prediction))+
geom_point(aes(color=app),size=1)+
scale_y_continuous("Predicted Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
scale_x_continuous("Actual Execution Time [s]",breaks=seq(0,1000,100),limits=c(0,1000))+
ggtitle("Neural Net Fit (TEST DATA):\nActual vs. Predicted Execution Time")+
geom_abline(intercept=0,slope=1)+
theme_light()
grid.arrange(p01,p02,nrow=1)
#EARLY STOPPING TEST
i = 1000
summary_data = data.frame(matrix(rep(0,4*i),ncol=4))
colnames(summary_data) = c("treshold","epochs","train_mse","test_mse")
for (j in 1:i){
  a = runif(1,min=0.01,max=10)
  #Train the model
  model_nn2 = neuralnet(formula=time~instructions+nodes+machine_num+app_num,data=train_data,hidden=3,stepmax=1000000,threshold=a)
  #Calculate Min and Max Response for rescaling
  max_time = max(data2$time)
  min_time = min(data2$time)
  #Predict test data from trained nn
  pred_nn2_scaled = compute(model_nn2,test_data[,c(1,2,4,5)])
  #Rescale test prediction
  pred_test_data_time = pred_nn2_scaled$net.result*(max_time-min_time)+min_time
  #Rescale test actual
  test_data_time = test_data$time*(max_time-min_time)+min_time
  #Rescale train prediction
  pred_train_data_time = model_nn2$net.result[[1]][,1]*(max_time-min_time)+min_time
  #Rescale train actual
  train_data_time = train_data$time*(max_time-min_time)+min_time
  #Calculate mse
  test_mse = mean((pred_test_data_time-test_data_time)^2)
  train_mse = mean((pred_train_data_time-train_data_time)^2)
  #Summarize
  summary_data[j,1] = a
  summary_data[j,2] = model_nn2$result.matrix[3,]
  summary_data[j,3] = round(train_mse,3)
  summary_data[j,4] = round(test_mse,3)
  print(summary_data[j,])
}
plot(summary_data$epochs,summary_data$test_mse,pch=19,xlim=c(0,2000),ylim=c(0,300000),cex=0.5,xlab="Training Steps",ylab="MSE",main="Early Stopping Test: Comparing MSE : TEST=BLACK TRAIN=RED")
points(summary_data$epochs,summary_data$train_mse,pch=19,col=2,cex=0.5)
I would guess that it is overfitting. The network is learning to reproduce the data like a dictionary instead of learning the underlying function in the data. There are various things which can cause this and ways to address them.
Things which cause overfitting are:
The network could be training for too long.
The network could have far more weights than training examples.
Ways to reduce overfitting are:
Create a validation dataset and stop training the network as soon as the validation set's loss starts increasing. This is a necessity.
Reducing the network size. (Less weights)
Using a regularization technique like weight decay or dropout.
Also, it may be possible that the problem is too difficult for a neural network to solve based on the data it is given. Reproducing training data does not prove that the network can solve the problem; it only proves that the network can remember things like a dictionary.
I restored a pre-trained model in Tensorflow 1.2 to do some testing work. I assumed the model was well-trained since the loss decreased to very low (0.0001). However, with either the testing samples or training samples, the accuracy ops give me a value which is almost 0. Is this because I'm using the wrong accuracy function or is it because the model is the problem?
Here is the accuracy function, the test_image below is a batch with a single test sample, test_image_label is a single label:
correct_prediction = tf.equal(tf.argmax(GoogleNet(test_image), 1), tf.argmax(test_image_label, 0))
accuracy = tf.cast(correct_prediction, tf.float32)

with tf.Session() as sess:
    accuracy_vector = []
    for num in range(len(testnames)):
        accuracy_vector.append(sess.run(accuracy, feed_dict={keep_prob: 1.0}))
    print(accuracy_vector)
    mean_accuracy = sess.run(tf.divide(tf.add_n(accuracy_vector), len(testnames)))
    print("test accuracy %g" % mean_accuracy)
The model is defined as GoogleNet(data) above, it is a function that returns the logits of the input batch. The training was done like this:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=train_label_batch, logits=GoogleNet(train_batch)))
train_step = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(cost, global_step=global_step)
The train_step is run in every iteration. I think it is worth noting that after restoring the model, I cannot run print(GoogleNet(test_image).eval(feed_dict={keep_prob: 1.0})) in the session, with which I intended to take a look at the output of the model. It returns the error:
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Variable_213
[[Node: Variable_213/read = Identity[T=DT_FLOAT, _class=["loc:@Variable_213"], _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_213)]]