Keras discrepancy between .evaluate and .predict - machine-learning

I know this question has been asked before, but I have tried all of their solutions and nothing is working for me.
My Problem:
I am running a CNN to classify some images, a typical task, nothing too crazy. I have the following compilation of my model.
model.compile(optimizer = keras.optimizers.Adam(learning_rate = exp_learning_rate),
              loss = tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics = ['accuracy'])
I fit this on my training dataset, and evaluated on my validation dataset as follows:
history = model.fit(train_dataset, validation_data = validation_dataset, epochs = 5)
And then I evaluated on a separate test set as follows:
model.evaluate(test_dataset)
Which resulted in this:
4/4 [==============================] - 30s 7s/step - loss: 1.7180 - accuracy: 0.8627
However, when I run:
model.predict(test_dataset)
I have the following confusion matrix output:
This clearly isn't 86% accuracy like the .evaluate method tells me. In fact, it's actually 35.39% accuracy. To make sure it wasn't an issue with my testing dataset, I had my model predict on my training and validation datasets and I still got a similar percentage as here (~30%) despite my training, validation accuracy during fitting going up to 96%, 87%, respectively.
Question:
I don't know why .predict and .evaluate are outputting different results? What's happening there? It seems like when I call .predict, it's not using any of the weights that I trained during fitting? (in fact, given that there are 3 classes, this output is no better than just blindly guessing each label). Are the weights from my fitting not being transferred over to my prediction? My loss function is correct (I label encoded my data as tensorflow wishes to be used with sparse_categorical_crossentropy) and when I pass 'accuracy', it will just take the accuracy corresponding to my loss function. All of this should be consistent. But why is there such a discrepancy with the results of .evaluate and .predict? Which one should I trust?
My Attempts to Fix My Issue:
I thought maybe the sparse categorical cross entropy wasn't right, so I one-hot encoded my target labels and used the categorical_crossentropy loss instead. I still have the EXACT same issue as above.
Concerns:
If the .evaluate is incorrect, then doesn't that mean my training accuracy and validation accuracy during fitting are inaccurate as well? Don't those use the .evaluate method as well? If that's the case, then what can I trust? The loss isn't a good indication of if my model is doing well because it is well-known that minimal loss does not imply good accuracy (although the converse is usually true depending on what standard of "good" we're using). How do I gauge my model's effectiveness in the case that my accuracy metrics aren't correct? I don't really know what to look at anymore because I have no other way to gauge if my model is learning, if someone could please help me understand what is happening I would appreciate it so much. I'm so frustrated.
Edit: (10-28-2021: 12:26 AM)
Ok, so I'll provide some more code to really troubleshoot this.
I originally preprocessed my data as such:
image_size = (256, 256)
batch_size = 16
train_ds = keras.preprocessing.image_dataset_from_directory(
    directory = image_directory,
    label_mode = 'categorical',
    shuffle = True,
    validation_split = 0.2,
    subset = 'training',
    seed = 24,
    batch_size = batch_size
)
val_ds = keras.preprocessing.image_dataset_from_directory(
    directory = image_directory,
    label_mode = 'categorical',
    shuffle = True,
    validation_split = 0.2,
    subset = 'validation',
    seed = 24,
    batch_size = batch_size
)
Where image_directory is a string with a path containing my images. Now you could probably read documentation, but the image_dataset_from_directory method actually returns a tf.data.Dataset object containing a bunch of batches of the respective (training, validation) data.
I imported the VGG16 architecture to do my classification so I called the respective preprocessing function for VGG16 as follows:
preprocess_input = tf.keras.applications.vgg16.preprocess_input
train_ds = train_ds.map(lambda x, y: (preprocess_input(x), y))
val_ds = val_ds.map(lambda x, y: (preprocess_input(x), y))
This transformed the images into something that was suitable as input for VGG16. Then, in my last processing steps, I did the following validation/test split:
val_batches = tf.data.experimental.cardinality(val_ds)
test_dataset = val_ds.take(val_batches // 3)
validation_dataset = val_ds.skip(val_batches // 3)
Then I proceeded to cache and prefetch my data:
AUTOTUNE = tf.data.AUTOTUNE
train_dataset = train_ds.prefetch(buffer_size = AUTOTUNE)
validation_dataset = validation_dataset.prefetch(buffer_size = AUTOTUNE)
test_dataset = test_dataset.prefetch(buffer_size = AUTOTUNE)
The Problem:
The problem occurs in the method above. I'm still not sure whether or not .evaluate is a true indicator of accuracy for my model. But I realized that the .evaluate and .predict always coincide when my neural network is a keras.Sequential() model. However, (correct me if I'm wrong) what I am suspecting is that VGG16, when imported from keras.applications API, is actually NOT a keras.Sequential() model. Therefore, I don't think that the .predict and .evaluate results actually coincide when I feed my data straight into my model (I was going to post this as an answer, but I don't have sufficient knowledge nor research to confirm that any of what I said is correct, someone please chime in because I like learning things I know little to nothing about, an edit this is for now).
In the end, I worked around my problem by calling ImageDataGenerator() instead of image_dataset_from_directory() as follows:
train_datagen = ImageDataGenerator(
    preprocessing_function = preprocess_input,
    width_shift_range = 0.2,
    height_shift_range = 0.2,
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True
)
val_datagen = ImageDataGenerator(
    preprocessing_function = preprocess_input
)
train_ds = train_datagen.flow_from_directory(
    train_image_directory,
    target_size = (224, 224),
    batch_size = 16,
    seed = 24,
    shuffle = True,
    classes = ['class1', 'class2', 'class3'],
    class_mode = 'categorical'
)
test_ds = val_datagen.flow_from_directory(
    test_image_directory,
    target_size = (224, 224),
    batch_size = 16,
    seed = 24,
    shuffle = False,
    classes = ['class1', 'class2', 'class3'],
    class_mode = 'categorical'
)
(NOTE: I got this based off the following link from tensorflow's documentation: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_directory)
This completes all the preprocessing for me. Then, when I call model.evaluate(test_ds), it returns the exact same result as when I do model.predict_generator(test_ds). After some minor processing of the prediction output, I use the following code for my confusion matrix:
Y_pred = model.predict(test_ds)
y_pred = np.argmax(Y_pred, axis=1)
cf = confusion_matrix(test_ds.classes, y_pred)
sns.heatmap(cf, annot = True, xticklabels = class_names,
            yticklabels = class_names)
plt.title('Performance of Model on Testing Set')
This eliminates the discrepancy in the confusion matrix and the result of model.evaluate(test_ds).
The Takeaway:
If you're loading images into a classification model and your loss and accuracy agree with each other, but your predictions disagree with both, try a different preprocessing route. I usually preprocess my images using the image_dataset_from_directory() method for all my keras.Sequential() models. However, for the VGG16 model, which I suspect is not a Sequential() model, using ImageDataGenerator(...).flow_from_directory(...) gave the model input in a format that produced predictions consistent with the performance metrics.
TLDR I didn't answer any of my original questions, but I found a workaround. Sorry if this is spam in any way. As is the nature of most Stack Overflow posts, I hope my turmoil in the last few hours helps someone way in the future.

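I had the same problem, and the odd behaviour persisted even with ImageDataGenerator.
I think the real cause is the shuffle flag of the validation set. With shuffle = True the data is reshuffled on every pass, so if the true labels are collected in a separate pass from .predict, they no longer line up with the predictions.
You changed that flag from this: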
val_ds = keras.preprocessing.image_dataset_from_directory(
    directory = image_directory,
    label_mode = 'categorical',
    shuffle = True,
    validation_split = 0.2,
    subset = 'validation',
    seed = 24,
    batch_size = batch_size
)
To here:
test_ds = val_datagen.flow_from_directory(
    test_image_directory,
    target_size = (224, 224),
    batch_size = 16,
    seed = 24,
    shuffle = False,
    classes = ['class1', 'class2', 'class3'],
    class_mode = 'categorical'
)
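If you want to keep shuffle = True, one possible way around the ordering problem is to gather the true labels in the same pass that produces the predictions, so the two cannot get out of step. This is only a sketch: the helper name collect_predictions_and_labels and its predict_fn argument are made up for illustration, it assumes one-hot labels (label_mode = 'categorical'), and the toy check at the bottom uses fake data instead of a real model.

```python
import numpy as np

def collect_predictions_and_labels(batches, predict_fn):
    """Iterate once over (inputs, one-hot labels) batches and return
    the true and predicted class indices in matching order."""
    y_true, y_pred = [], []
    for x_batch, y_batch in batches:
        probs = np.asarray(predict_fn(x_batch))          # shape (batch, n_classes)
        y_pred.append(np.argmax(probs, axis=1))
        y_true.append(np.argmax(np.asarray(y_batch), axis=1))
    return np.concatenate(y_true), np.concatenate(y_pred)

# Toy check with fake data: the "model" simply echoes its input,
# and the inputs are the one-hot labels themselves.
rng = np.random.default_rng(0)
labels = np.eye(3)[rng.integers(0, 3, size=32)]
batches = [(labels[i:i + 16], labels[i:i + 16]) for i in range(0, 32, 16)]

y_true, y_pred = collect_predictions_and_labels(batches, lambda x: x)
accuracy = float(np.mean(y_true == y_pred))
print(accuracy)
```

With a Keras model, something like collect_predictions_and_labels(test_dataset, model.predict_on_batch) should work, since iterating over a tf.data.Dataset yields (images, labels) batches. The resulting y_true and y_pred can then go straight into confusion_matrix.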

Related

when setting .eval() my model performs worse than when I set .train()

During the training phase, I select the model parameters with the best performance metric.
if performance_metric.item() > max_performance:
    max_performance = performance_metric.item()
    torch.save(neural_net.state_dict(), PATH+'/best_model.pt')
This is the neural network model used:
class Neural_Net(nn.Module):
    def __init__(self, M,shape_input,batch_size):
        super(Neural_Net, self).__init__()
        self.lstm = nn.LSTM(shape_input,M)
        #self.dense1 = nn.Linear(shape_input,M)
        self.dense1 = nn.Linear(M,M) #Used with the LSTM
        torch.nn.init.xavier_uniform_(self.dense1.weight)
        self.dense2 = nn.Linear(M,M)
        torch.nn.init.xavier_uniform_(self.dense2.weight)
        self.dense3 = nn.Linear(M,1)
        torch.nn.init.xavier_uniform_(self.dense3.weight)
        self.drop = nn.Dropout(0.7)
        self.bachnorm1 = nn.BatchNorm1d(M)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()
        self.hidden_cell = (torch.zeros(1,batch_size,M),torch.zeros(1,batch_size,M))
    def forward(self, x):
        lstm_out, self.hidden_cell = self.lstm(x.view(1 ,len(x), -1), self.hidden_cell)
        x = self.drop(self.relu(self.dense1(self.bachnorm1(lstm_out.view(len(x), -1)))))
        x = self.drop(self.relu(self.dense2(x)))
        x = self.relu(self.dense3(x))
        return x
After that I load the model with the best parameters and set the evaluation mode:
neural_net.load_state_dict(torch.load(PATH+'/best_model.pt'))
neural_net.eval()
The results are completely random. When I set train() the performance is similar to the selected best model parameter.
Is there an important aspect of eval() that I am forgetting? Is the batch normalization used correctly? In the test phase I am using batches of the same size as in the training phase.
It is hard to be sure without knowing your batch size, your training/test dataset sizes, or how the training and test datasets differ, but this issue has been discussed on the PyTorch forums before.
In my experience, it sounds very much like the latent representation of your training data inside the model is significantly different from that of your validation data. The main advice I can provide is to try reducing the momentum of your batchnorm layer. It might also be worth substituting a layernorm layer (which doesn't track a running mean/standard deviation) OR setting track_running_stats=False in the BatchNorm1d layer and seeing if the problem persists.
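The three suggestions can be sketched as follows. This is a minimal illustration on random data, not the asker's network: the width M = 8, the input x, and the helper train_eval_gap are all assumptions made up for the example. It measures how far a layer's output in eval() mode moves away from its output in train() mode on the same batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
M = 8  # hidden width, chosen arbitrarily for this sketch

# Option 1: keep running statistics, but update them more slowly (default momentum is 0.1).
bn_slow = nn.BatchNorm1d(M, momentum=0.01)

# Option 2: do not track running statistics; batch statistics are then used in eval() as well.
bn_batch_stats = nn.BatchNorm1d(M, track_running_stats=False)

# Option 3: LayerNorm normalises each sample on its own and keeps no batch statistics.
layer_norm = nn.LayerNorm(M)

# Fake activations whose mean and scale are far from the initial running stats (0 and 1).
x = torch.randn(16, M) * 5 + 3

def train_eval_gap(layer, x):
    """Largest absolute difference between train-mode and eval-mode outputs."""
    layer.train()
    out_train = layer(x)
    layer.eval()
    out_eval = layer(x)
    return (out_train - out_eval).abs().max().item()

gap_default = train_eval_gap(nn.BatchNorm1d(M), x)
gap_batch_stats = train_eval_gap(bn_batch_stats, x)
gap_layer_norm = train_eval_gap(layer_norm, x)
print(gap_default, gap_batch_stats, gap_layer_norm)
```

The default BatchNorm1d shows a clear gap after a single update because its running statistics have barely moved from their initial values, while the other two layers behave the same in both modes. Whether that is the cause of the random results above still depends on the data and batch size.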

PyTorch: Predicting future values with LSTM

I'm currently working on building an LSTM model to forecast time-series data using PyTorch. I used lag features to pass the previous n steps as inputs to train the network. I split the data into three sets, i.e., train-validation-test split, and used the first two to train the model. My validation function takes the data from the validation data set and calculates the predicted values by passing it to the LSTM model using the DataLoader and TensorDataset classes. Initially, I got pretty good results with R2 values in the region of 0.85-0.95.
However, I have an uneasy feeling about whether this validation function is also suitable for testing my model's performance, because the function takes the actual X values, i.e., time-lag features, from the DataLoader to predict y^ values, i.e., predicted target values, instead of using the predicted y^ values as features in the next prediction. This situation seems far from reality, where the model has no clue of the real values of the previous time steps, especially if you forecast time-series data for longer time periods, say 3-6 months.
I'm currently a bit puzzled about tackling this issue and defining a function to predict future values relying on the model's values rather than the actual values in the test set. I have the following function predict, which makes a one-step prediction, but I haven't really figured out how to predict the whole test dataset using DataLoader.
def predict(self, x):
    # move the input row to the model's device
    x = x.to(device)
    # make prediction
    yhat = self.model(x)
    # move the result to the CPU and convert it to a numpy array
    yhat = yhat.detach().cpu().numpy()
    return yhat
You can find how I split and load my datasets, my constructor for the LSTM model, and the validation function below. If you need more information, please do not hesitate to reach out to me.
Splitting and Loading Datasets
def create_tensor_datasets(X_train_arr, X_val_arr, X_test_arr, y_train_arr, y_val_arr, y_test_arr):
    train_features = torch.Tensor(X_train_arr)
    train_targets = torch.Tensor(y_train_arr)
    val_features = torch.Tensor(X_val_arr)
    val_targets = torch.Tensor(y_val_arr)
    test_features = torch.Tensor(X_test_arr)
    test_targets = torch.Tensor(y_test_arr)
    train = TensorDataset(train_features, train_targets)
    val = TensorDataset(val_features, val_targets)
    test = TensorDataset(test_features, test_targets)
    return train, val, test
def load_tensor_datasets(train, val, test, batch_size=64, shuffle=False, drop_last=True):
    train_loader = DataLoader(train, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    val_loader = DataLoader(val, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    test_loader = DataLoader(test, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
    return train_loader, val_loader, test_loader
Class LSTM
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim, dropout_prob):
        super(LSTMModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        self.lstm = nn.LSTM(
            input_dim, hidden_dim, layer_dim, batch_first=True, dropout=dropout_prob
        )
        self.fc = nn.Linear(hidden_dim, output_dim)
    def forward(self, x, future=False):
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_()
        c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_()
        out, (hn, cn) = self.lstm(x, (h0.detach(), c0.detach()))
        out = out[:, -1, :]
        out = self.fc(out)
        return out
Validation (defined within a trainer class)
def validation(self, val_loader, batch_size, n_features):
    with torch.no_grad():
        predictions = []
        values = []
        for x_val, y_val in val_loader:
            x_val = x_val.view([batch_size, -1, n_features]).to(device)
            y_val = y_val.to(device)
            self.model.eval()
            yhat = self.model(x_val)
            predictions.append(yhat.cpu().detach().numpy())
            values.append(y_val.cpu().detach().numpy())
    return predictions, values
I've finally found a way to forecast values based on predicted values from the earlier observations. As expected, the predictions were rather accurate in the short term and became slightly worse in the long term. It is not so surprising that the future predictions drift over time, as they no longer depend on the actual values. Reflecting on my results and the discussions I had on the topic, here are my take-aways:
In real-life cases, the real values can be retrieved and fed into the model at each step of the prediction -be it weekly, daily, or hourly- so that the next step can be predicted with the actual values from the previous step. So, testing the performance based on the actual values from the test set may somewhat reflect the real performance of the model that is maintained regularly.
However, for predicting future values in the long term, forecasting, if you will, you need to make either multiple one-step predictions or multi-step predictions that span over the time period you wish to forecast.
Making multiple one-step predictions based on the values predicted by the model yields plausible results in the short term. As the forecasting period increases, the predictions become less accurate and therefore less fit for the purpose of forecasting.
To make multiple one-step predictions and update the input after each prediction, we have to work our way through the dataset one by one, as if we are going through a for-loop over the test set. Not surprisingly, this makes us lose all the computational advantages that matrix operations and mini-batch training provide us.
An alternative could be predicting sequences of values, instead of predicting the next value only, say using RNNs with multi-dimensional output with many-to-many or seq-to-seq structure. They are likely to be more difficult to train and less flexible when making predictions for different time periods. An encoder-decoder structure may prove useful for solving this, though I have not implemented it myself.
Below is the code for my function that forecasts the next n_steps based on the last row of the dataset X (time-lag features) and y (target value). To iterate over each row in my dataset, I would set batch_size to 1 and n_features to the number of lagged observations.
def forecast(self, X, y, batch_size=1, n_features=1, n_steps=100):
    predictions = []
    # shift the lag window by one and insert the last known target value
    X = torch.roll(X, shifts=1, dims=2)
    X[..., -1, 0] = y.item(0)
    with torch.no_grad():
        self.model.eval()
        for _ in range(n_steps):
            X = X.view([batch_size, -1, n_features]).to(device)
            yhat = self.model(X)
            # move the result to the CPU before converting to numpy
            yhat = yhat.detach().cpu().numpy()
            # shift the window again and feed the prediction back in
            X = torch.roll(X, shifts=1, dims=2)
            X[..., -1, 0] = yhat.item(0)
            predictions.append(yhat)
    return predictions
The following line shifts the values along dimension 2 (the last dimension of the 3-D tensor) by one position, so that a tensor [[[x1, x2, x3, ... , xn ]]] becomes [[[xn, x1, x2, ... , x(n-1)]]].
X = torch.roll(X, shifts=1, dims=2)
And, the line below selects the first element from the last dimension of the 3d tensor and sets that item to the predicted value stored in the NumPy ndarray (yhat), [[xn+1]]. Then, the new input tensor becomes [[[x(n+1), x1, x2, ... , x(n-1)]]]
X[..., -1, 0] = yhat.item(0)
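The same window update can be traced with plain NumPy on a tiny example. This is a simplified sketch, not the forecast method above: toy_model is a hypothetical stand-in for the trained LSTM (it just returns the mean of the window), the window is one-dimensional, and the most recent value is assumed to sit at index 0.

```python
import numpy as np

def toy_model(window):
    # Hypothetical stand-in for the trained network: predicts the window mean.
    return float(window.mean())

def recursive_forecast(window, n_steps, model=toy_model):
    """Predict n_steps ahead, feeding each prediction back into the lag window."""
    window = np.asarray(window, dtype=float).copy()
    predictions = []
    for _ in range(n_steps):
        yhat = model(window)
        predictions.append(yhat)
        window = np.roll(window, 1)  # last (oldest) value wraps around to index 0
        window[0] = yhat             # and is overwritten by the newest prediction
    return predictions, window

# [1, 2, 3] -> predict 2.0 -> window becomes [2, 1, 2] -> predict 5/3 -> [5/3, 2, 1]
preds, final_window = recursive_forecast([1.0, 2.0, 3.0], n_steps=2)
print(preds, final_window)
```

After the second step none of the original oldest values remain near the front of the window, which is why errors in early predictions carry over into later ones.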
Recently, I've decided to put together the things I had learned and the things I would have liked to know earlier. If you'd like to have a look, you can find the links down below. I hope you'll find it useful. Feel free to comment or reach out to me if you agree or disagree with any of the remarks I made above.
Building RNN, LSTM, and GRU for time series using PyTorch
Predicting future values with RNN, LSTM, and GRU using PyTorch

How to train model with features selected by SelectKBest?

I am using SelectKBest() in Sklearn's Pipeline() class to reduce the number of features from 30 down to the 5 best features. When I fit the classifier, I get different test results, as expected with feature selection. However, I spotted an error in my code which doesn't seem to cause an actual error at runtime.
When I call predict(), I realised that it was still being given all 30 features as input, as if feature selection wasn't occurring, even though I only trained the model on the 5 best features. Surely giving 30 features to an SVM to predict a class will crash if it was only trained on the 5 best features?
In my train_model(df) function, my code looks as follows:
def train_model(df):
    x,y = balance_dataset(df)
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
    feature_selection = SelectKBest()
    pipe = Pipeline([('sc', preprocessing.MinMaxScaler()),
                     ('feature_selection', feature_selection),
                     ('SVM', svm.SVC(decision_function_shape = 'ovr', kernel = 'poly'))])
    candidate_parameters = [{'SVM__C': [0.01, 0.1, 1], 'SVM__gamma': [0.01, 0.1, 1], 'feature_selection__k': [5]}]
    clf = GridSearchCV(estimator = pipe, param_grid = candidate_parameters, cv = 5, n_jobs = -1)
    clf.fit(X_train, y_train )
    return clf
However, this is what happens when I call trade():
def trade(df):
    clf = train_model(df)
    for index, row in trading_set.iterrows():
        features = row[:-3] #features is now an array of 30 features, even though model is only trained on 5
        if trade_balance > 0:
            trades[index] = trade_balance
            if clf.predict(features) == 1: #So this should crash and give an input shape error, but it doesn't
                #Rest of code unnecessary#
So my question is, how do I know that the model is really being trained on only the 5 best features?
Your code is correct, and there is no reason why it should throw you any error. You are confused between the pipeline object and the model itself, which is only one block of the pipeline.
In your example, the pipeline is taking 30 features, scaling them, selecting the 5 best, then training an SVM on these 5 best features. So your SVM has been trained on 5 best features, but you still need to pass all 30 features to your pipeline, because your pipeline expects data to come in in the same format as during the training.
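To check this for yourself, you can inspect the fitted steps of the pipeline. The sketch below uses synthetic data from make_classification in place of the asker's dataframe and skips the grid search, so the data and variable names are assumptions; it also relies on the n_features_in_ attribute, which is only available in reasonably recent scikit-learn versions.

```python
import numpy as np
from sklearn import preprocessing, svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data: 200 samples with 30 features.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

pipe = Pipeline([('sc', preprocessing.MinMaxScaler()),
                 ('feature_selection', SelectKBest(k=5)),
                 ('SVM', svm.SVC(kernel='poly'))])
pipe.fit(X, y)

# Boolean mask over the 30 input columns; True marks a column that was kept.
selector = pipe.named_steps['feature_selection']
selected_idx = np.flatnonzero(selector.get_support())

# Number of features the SVM step itself was fitted on.
n_seen_by_svm = pipe.named_steps['SVM'].n_features_in_

# The pipeline as a whole still expects all 30 columns at predict time.
preds = pipe.predict(X[:3])
print(selected_idx, n_seen_by_svm, preds)
```

With GridSearchCV, the fitted pipeline is available as clf.best_estimator_, so the same inspection would start from clf.best_estimator_.named_steps.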

Tensorflow low train/test accuracy

I restored a pre-trained model in Tensorflow 1.2 to do some testing work. I assumed the model was well-trained since the loss decreased to very low (0.0001). However, with either the testing samples or training samples, the accuracy ops give me a value which is almost 0. Is this because I'm using the wrong accuracy function or is it because the model is the problem?
Here is the accuracy function, the test_image below is a batch with a single test sample, test_image_label is a single label:
correct_prediction = tf.equal(tf.argmax(GoogleNet(test_image), 1), tf.argmax(test_image_label, 0))
accuracy = tf.cast(correct_prediction, tf.float32)
with Session() as sess:
    accuracy_vector = []
    for num in range(len(testnames)):
        accuracy_vector.append(sess.run(accuracy, feed_dict={keep_prob: 1.0}))
    print(accuracy_vector)
    mean_accuracy = sess.run(tf.divide(tf.add_n(accuracy_vector), len(testnames)))
    print("test accuracy %g"%mean_accuracy)
The model is defined as GoogleNet(data) above, it is a function that returns the logits of the input batch. The training was done like this:
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=train_label_batch, logits=GoogleNet(train_batch)))
train_step = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(cost, global_step=global_step)
The train_step is run in every iteration. I think it is worth noting that after restoring the model, I cannot run print(GoogleNet(test_image).eval(feed_dict={keep_prob: 1.0})) in the session, which I intended to use to take a look at the output of the model. It returns the following error: FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Variable_213
[[Node: Variable_213/read = Identity[T=DT_FLOAT, _class=["loc:@Variable_213"], _device="/job:localhost/replica:0/task:0/cpu:0"](Variable_213)]]

Keras: model with one input and two outputs, trained jointly on different data (semi-supervised learning)

I would like to code with Keras a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset where there is a few labeled images and a lot of unlabeled images: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have a single input shape and shared "encoding" convolutional layers, but would split into two heads (fork-style), a classification head and a decoding head, so that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that as we have full control over the computational graph.
But with Keras, things are more high-level and I feel that all the calls to ".fit" must always provide all the data at once (so it would force me to tie together the classification head and the autoencoding head into one time-step).
One way in keras to almost do that would be with something that goes like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)
# both heads are listed as outputs so that fit() can take two targets
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and I fear that if I just want to fit things in that way it will fail and ask for the missing head:
for epoch in range(10):
    # classifications step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit" to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
Sometimes when you don't have a label you can pass zero vector instead of one hot encoded vector. It should not change your result because zero vector doesn't have any error signal with categorical cross entropy loss.
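A quick way to see why the zero vector carries no error signal is to write the loss out by hand. The sketch below is a plain NumPy re-implementation of per-sample categorical cross entropy, not the Keras function itself; the function name and the example probabilities are made up for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample loss: -sum_k y_true[k] * log(y_pred[k])."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
y_true = np.array([[1.0, 0.0, 0.0],    # labelled sample
                   [0.0, 0.0, 0.0]])   # unlabelled sample encoded as a zero vector

losses = categorical_crossentropy(y_true, y_pred)
print(losses)
```

Every term of the sum is multiplied by a zero in y_true for the unlabelled sample, so its loss is zero whatever the prediction is. Keep in mind that a loss averaged over the batch still divides by the full batch size, so unlabelled samples dilute the mean even though they add nothing to the sum.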
My custom to_categorical function looks like this:
def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
Here y contains the labels, and translator_dict is a Python dictionary which maps each label to its unique index, like this:
{'unisex':2, 'female': 1, 'male': 0}
If an UNK label can't be found in this dictionary, then its encoded label will be a zero vector.
If you use this trick, you also have to modify your accuracy function to see real accuracy numbers: you have to filter out all zero vectors from the metric.
def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0)) # zero vector mask
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
Note: you have to use larger batches (e.g. 32) to avoid batches that contain only zero vectors, because such batches can make your accuracy metric go crazy. I don't know why.
Alternative solution
Use Pseudo Labeling :)
You can train jointly; you just have to pass a list of label arrays instead of a single label.
I used fit_generator, e.g.
model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) // batch_size,  # integer number of steps
    epochs=epochs)
def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        # fill one batch with randomly chosen samples
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        # one input batch, one label array per output head
        yield batch_x, [gender_label_batch, category_label_batch]
