I am new to Keras and have learned how to fit and evaluate a model.
After evaluating the model, one can see the actual predictions it makes.
I am wondering: is it also possible to see the predictions during fitting in Keras? So far I can't find any code that does this.

Since this question doesn't specify "epochs", and since using callbacks may represent extra computation, I don't think it's exactly a duplication.
With tensorflow, you can use a custom training loop with eager execution turned on. A simple tutorial for creating a custom training loop:
Basically you will:
# transform your data into a Dataset:
dataset = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(some_buffer).batch(batchSize)
# the above is buggy in some versions regarding shuffling, you may need to shuffle
# again between each epoch
# create an optimizer
optimizer = tf.keras.optimizers.Adam()
# create an epoch loop:
for e in range(epochs):
    # create a batch loop
    for i, (x, y_true) in enumerate(dataset):
        # create a tape to record actions
        with tf.GradientTape() as tape:
            # take the model's predictions
            y_pred = model(x)
            # calculate loss
            loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
        # calculate gradients
        gradients = tape.gradient(loss, model.trainable_weights)
        # apply gradients
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
You can use the y_pred variable for anything, including getting its NumPy value with numpy_pred = y_pred.numpy().
The tutorial gives some more details about metrics and the validation loop.
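To make the idea concrete without depending on TensorFlow, here is a minimal framework-free sketch of the same pattern: a hand-written training loop for a one-weight linear model that stores the predictions of every epoch. The function name train_and_record and the toy data are made up for illustration; in the Keras loop above, y_pred plays the role of preds.

```python
def train_and_record(xs, ys, epochs=50, lr=0.05):
    """Fit y = w * x by gradient descent and keep each epoch's predictions."""
    w = 0.0
    history = []  # one list of predictions per epoch
    for _ in range(epochs):
        # predictions made with the current weight (before this epoch's update)
        preds = [w * x for x in xs]
        history.append(preds)
        # gradient of the mean squared error with respect to w
        grad = sum(2 * (p - y) * x for p, x, y in zip(preds, xs, ys)) / len(xs)
        w -= lr * grad
    return w, history


# hypothetical toy data following y = 2 * x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, history = train_and_record(xs, ys)
print(history[0])   # predictions before any update
print(history[-1])  # predictions in the last epoch
```

Because the loop is written by hand, the predictions are ordinary values that can be logged, plotted, or saved at any point during training.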


Validation accuracy fluctuating while training accuracy increase?

I have a multiclassification problem that depends on historical data. I am trying LSTM using loss='sparse_categorical_crossentropy'. The train accuracy and loss increase and decrease respectively. But, my test accuracy starts to fluctuate wildly.
What am I doing wrong?
Input data:
X = np.reshape(X, (X.shape[0], X.shape[1], 1))
(200146, 13, 1)
My model
# fix random seed for reproducibility
seed = 7
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=False, random_state=seed)
cvscores = []
for train, test in kfold.split(X, y):
    regressor = Sequential()
    # Units = the number of LSTM units that we want to have in this first layer -> we want very high dimensionality, so we need a high number
    # return_sequences = True because we are adding another layer after this
    # input shape = the last two dimensions and the indicator
    regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X[train].shape[1], 1)))
    # Extra LSTM layer
    regressor.add(LSTM(units=50, return_sequences=True))
    # 3rd
    regressor.add(LSTM(units=50, return_sequences=True))
    # output layer
    regressor.add(Dense(4, activation='softmax', kernel_regularizer=regularizers.l2(0.001)))
    # Compile the RNN
    regressor.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    # Set callback functions to early stop training and save the best model so far
    callbacks = [EarlyStopping(monitor='val_loss', patience=9),
                 ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
    history = regressor.fit(X[train], y[train], epochs=250, callbacks=callbacks,
                            validation_data=(X[test], y[test]))
    # plot train and validation loss
    pyplot.title('model train vs validation loss')
    pyplot.legend(['train', 'validation'], loc='upper right')
    # evaluate the model
    scores = regressor.evaluate(X[test], y[test], verbose=0)
    print("%s: %.2f%%" % (regressor.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
What you are describing here is overfitting. This means your model keeps learning about your training data and doesn't generalize; in other words, it is learning the exact features of your training set. This is one of the main problems you deal with in deep learning. There is no single solution; you have to try out different architectures, different hyperparameters and so on.
You can start with a small model that underfits (that is, both training and validation accuracy are low) and keep increasing its size until it overfits. Then you can play around with the optimizer and other hyperparameters.
By smaller model I mean one with fewer hidden units or fewer layers.
You seem to have too many LSTM layers stacked on top of each other, which eventually leads to overfitting. You should probably decrease the number of layers.
Your model seems to be overfitting, since the training error keeps on reducing while validation error fails to. Overall, it fails to generalize.
You should try reducing the model complexity by removing some of the LSTM layers. Also, try varying the batch size; larger batches usually reduce the fluctuations in the loss.
You can also consider varying the learning rate.
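A common way to limit the damage from overfitting is to stop once the validation loss has not improved for a number of epochs, which is the idea behind the EarlyStopping(patience=9) callback in the question. Below is a simplified plain-Python sketch of that patience rule. It is not the Keras implementation, and the function name and loss values are invented for illustration.

```python
def early_stop_epoch(val_losses, patience):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # never triggered: training runs through the last epoch
    return len(val_losses) - 1


# hypothetical, fluctuating validation losses
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.9, 0.6]
stop_at = early_stop_epoch(val_losses, patience=3)
```

With a small patience, a noisy validation curve can stop training before a later improvement (here the 0.6 in the final epoch is never reached), which is one reason to smooth out the fluctuations first.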

Difference between doing cross-validation and validation_data/validation_split in Keras

First, I split the dataset into train and test, for example:
X_train, X_test, y_train, y_test = train_test_split(,, test_size=0.4, random_state=999)
I then use GridSearchCV with cross-validation to find the best performing model:
validator = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="accuracy", cv=cv)
And by doing this, I have:
A model is trained using k-1 of the folds as training data; the resulting
model is validated on the remaining part of the data.
But then, when reading about the Keras fit function, the documentation introduces two more terms:
validation_split: Float between 0 and 1. Fraction of the training data
to be used as validation data. The model will set apart this fraction
of the training data, will not train on it, and will evaluate the loss
and any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling.
validation_data: tuple (x_val, y_val) or tuple (x_val, y_val,
val_sample_weights) on which to evaluate the loss and any model
metrics at the end of each epoch. The model will not be trained on
this data. validation_data will override validation_split.
From what I understand, validation_split (to be overridden by validation_data) will be used as an unchanged validation dataset, meanwhile hold-out set in cross-validation changes during each cross-validation step.
First question: is it necessary to use validation_split or validation_data since I already do cross validation?
Second question: if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?
grid_result =, train_labels, validation_data=None, validation_split=0)
Question 3: If I do so, what will happen during the training, would Keras just simply ignore the validation step?
Question 4: Does the validation_split belong to k-1 folds or the hold-out fold, or will it be considered as "test set" (like in the case of cross validation) which will never be used to train the model.
Validation is performed to ensure that the model is not overfitting on the dataset and it would generalize to new data. Since in the parameters grid search you are also doing validation then there is no need to perform the validation step by the Keras model itself during training. Therefore to answer your questions:
is it necessary to use validation_split or validation_data since I already do cross validation?
No, as I mentioned above.
if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?
No, since by default no validation is done in Keras (i.e. by default we have validation_split=0.0, validation_data=None in fit() method).
If I do so, what will happen during the training, would Keras just simply ignore the validation step?
Yes, Keras won't perform the validation when training the model. However note that, as I mentioned above, the grid search procedure would perform validation to better estimate the performance of the model with a specific set of parameters.
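The difference between the two mechanisms can be sketched in plain Python on index lists. This is only an illustration under simplifying assumptions: the function names are made up, the k-fold helper uses contiguous, non-shuffled folds, and it assumes the number of samples is divisible by k.

```python
def validation_split_indices(n_samples, validation_split):
    """Fixed hold-out: the last fraction of the samples, taken before shuffling."""
    split_at = int(n_samples * (1.0 - validation_split))
    train = list(range(split_at))
    val = list(range(split_at, n_samples))
    return train, val


def kfold_test_indices(n_samples, k):
    """Rotating hold-out: each fold is the validation part exactly once."""
    fold_size = n_samples // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]


train_idx, val_idx = validation_split_indices(8, 0.25)
folds = kfold_test_indices(8, 4)
print(val_idx)  # the same tail of the data on every epoch
print(folds)    # a different hold-out part on every CV iteration
```

Within a grid search, any validation_split would be carved out of the k-1 training folds that are passed to fit(), so it never touches the hold-out fold used for scoring.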

Cross Validation in Keras

I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras
from sklearn.model_selection import StratifiedKFold
def load_data():
    # load your data using this function
    pass
def create_model():
    # create your model using this function
    pass
def train_evaluate(model, data_train, labels_train, data_test, labels_test):
    # fit and evaluate here.
    pass
if __name__ == "__main__":
    X, Y = load_data()
    kFold = StratifiedKFold(n_splits=10)
    for train, test in kFold.split(X, Y):
        model = None
        model = create_model()
        train_evaluate(model, X[train], Y[train], X[test], Y[test])
In my studies on neural networks, I learned that the knowledge representation of a neural network is in its synaptic weights, and that during training the weights are updated to reduce the network's error rate and improve its performance. (In my case, I'm using supervised learning.)
For better training and assessment of neural network performance, a commonly used method is cross-validation, which partitions the data set for training and evaluation of the model.
My doubt is...
In this code snippet:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
We define, train and evaluate a new neural net for each of the generated partitions?
If my goal is to fine-tune the network for the entire dataset, why is it not correct to define a single neural network and train it with the generated partitions?
That is, why is this piece of code like this?
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
and not so?
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Is my understanding of how the code works wrong? Or my theory?
If my goal is to fine-tune the network for the entire dataset
It is not clear what you mean by "fine-tune", or even what exactly is your purpose for performing cross-validation (CV); in general, CV serves one of the following purposes:
Model selection (choose the values of hyperparameters)
Model assessment
Since you don't define any search grid for hyperparameter selection in your code, it would seem that you are using CV in order to get the expected performance of your model (error, accuracy etc).
Anyway, for whatever reason you are using CV, the first snippet is the correct one; your second snippet
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
will train your model sequentially over the different partitions (i.e. train on partition #1, then continue training on partition #2 etc), which essentially is just training on your whole data set, and it is certainly not cross-validation...
That said, a final step after the CV, which is often only implied (and frequently missed by beginners), is that once you are satisfied with your chosen hyperparameters and/or model performance as given by your CV procedure, you go back and train your model again, this time with the entire available data.
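Why reusing one model across folds invalidates the scores can be shown with a deliberately extreme toy model that simply memorizes its training pairs. This is a plain-Python sketch: MemorizingModel, kfold_split and the data are all invented for illustration and stand in for create_model(), kFold.split() and the real dataset.

```python
class MemorizingModel:
    """Toy model that remembers every (x, y) pair it was ever trained on."""

    def __init__(self):
        self.memory = {}

    def fit(self, xs, ys):
        self.memory.update(zip(xs, ys))

    def score(self, xs, ys):
        hits = sum(1 for x, y in zip(xs, ys) if self.memory.get(x) == y)
        return hits / len(xs)


def kfold_split(n_samples, k):
    """Contiguous folds; assumes n_samples is divisible by k."""
    fold_size = n_samples // k
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        train = [j for j in range(n_samples) if j not in test]
        yield train, test


def run_fold(model, X, Y, train, test):
    model.fit([X[i] for i in train], [Y[i] for i in train])
    return model.score([X[i] for i in test], [Y[i] for i in test])


X = list(range(12))
Y = [x % 3 for x in X]

# first snippet: a new model per fold, so the test fold is always unseen
fresh_scores = [run_fold(MemorizingModel(), X, Y, tr, te)
                for tr, te in kfold_split(len(X), 3)]

# second snippet: one model for all folds, so later test folds were already trained on
shared = MemorizingModel()
reused_scores = [run_fold(shared, X, Y, tr, te)
                 for tr, te in kfold_split(len(X), 3)]
```

The fresh models score 0.0 on every fold because they cannot generalize at all, while the shared model reports perfect scores from the second fold onward purely because it has already seen those samples.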
You can use wrappers of the Scikit-Learn API with Keras models.
Given inputs x and y, here's an example of repeated 5-fold cross-validation:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.models import *
from tensorflow.keras.layers import *
# note: this wrapper module was removed from recent TensorFlow releases;
# the separate SciKeras package provides a replacement
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
def buildmodel():
    model = Sequential([
        Dense(10, activation="relu"),
        Dense(5, activation="relu"),
        Dense(1),  # output layer (assumed: a single regression target)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mse'])
    return model
estimator = KerasRegressor(build_fn=buildmodel, epochs=100, batch_size=10, verbose=0)
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
results = cross_val_score(estimator, x, y, cv=kfold, n_jobs=2)  # 2 cpus
results.mean()  # Mean MSE
I think many of your questions will be answered if you read about nested cross-validation. This is a good way to "fine tune" the hyper parameters of your model. There's a thread here:
The biggest issue to be aware of is "peeking" or circular logic. Essentially, you want to make sure that none of the data used to assess model accuracy is seen during training.
One example where this might be problematic is if you are running something like PCA or ICA for feature extraction. If doing something like this, you must be sure to run PCA on your training set, and then apply the transformation matrix from the training set to the test set.
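The same rule applies to any fitted preprocessing step. As a minimal sketch, the example below uses standardization as a deliberately simpler stand-in for PCA: the statistics are estimated on the training set only and then reused, unchanged, on the test set. The helper names and the numbers are made up for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Estimate the transformation parameters from the training data only."""
    return mean(train_values), pstdev(train_values)


def transform(values, mu, sigma):
    """Apply an already-fitted transformation to any data set."""
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0]
test = [4.0]

mu, sigma = fit_scaler(train)               # the test data is never looked at here
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)    # reuses the training statistics
```

Fitting the scaler on train + test together would let information about the test set leak into the features the model is trained on.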
The main idea of testing your model performance is to perform the following steps:
Train a model on a training set.
Evaluate your model on data not used during the training process, in order to simulate the arrival of new data.
So basically, the data on which you finally test your model should mimic the first batch of data you'll get from your client/application to apply your model to.
That's why cross-validation is so powerful: it lets every data point in your whole dataset be used as a simulation of new data.
And now - to answer your question - every cross-validation should follow the following pattern:
for train, test in kFold.split(X, Y):
    model = training_procedure(train, ...)
    score = evaluation_procedure(model, test, ...)
because, after all, you'll first train your model and then use it on new data. In your second approach you cannot treat it as mimicking a training process because, e.g., in the second fold your model would keep information from the first fold, which is not equivalent to your training procedure.
Of course, you could apply a training procedure that uses 10 folds of consecutive training in order to fine-tune the network. But this is not cross-validation; you would still need to evaluate this procedure using the kind of scheme above.
The commented out functions make this a little less obvious, but the idea is to keep track of your model performance as you iterate through your folds and at the end provide either those lower level performance metrics or an averaged global performance. For example:
The train_evaluate function ideally would output some accuracy score for each split, which could be combined at the end.
def train_evaluate(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)
X, Y = load_data()
kFold = StratifiedKFold(n_splits=10)
scores = np.zeros(10)
idx = 0
for train, test in kFold.split(X, Y):
    model = create_model()
    scores[idx] = train_evaluate(model, X[train], Y[train], X[test], Y[test])
    idx += 1
So yes you do want to create a new model for each fold as the purpose of this exercise is to determine how your model as it is designed performs on all segments of the data, not just one particular segment that may or may not allow the model to perform well.
This type of approach becomes particularly powerful when applied along with a grid search over hyperparameters. In this approach you train a model with varying hyperparameters using the cross validation splits and keep track of the performance on splits and overall. In the end you will be able to get a much better idea of which hyperparameters allow the model to perform best. For a much more in depth explanation see sklearn Model Selection and pay particular attention to the sections of Cross Validation and Grid Search.

What is the learning rate status when applying keras model fit() iteratively?

I am applying Keras model fitting iteratively (within a for loop) because the dataset is large. My goal is to split the dataset into 100 parts, read one part at a time, and apply the fit() method to it.
My question: in each iteration, does the fit() method begin from the initial learning rate (lr=0.1) that I set during model compilation? Or does it remember the last updated learning rate and apply it directly on a new call of the fit() method?
My code sample is as follows:
# Define model
# Set the optimizer
sgd = SGD(lr=0.1, decay=1e-08, momentum=0.9, nesterov=False)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
# Fit model and train
for j in range(100):
    print('Data extracting from big matrix ...')
    X_train = HDF5Matrix(path_train, 'X', start=st, end=ed)
    Y_train = HDF5Matrix(path_train, 'y', start=st, end=ed)
    print('Fitting model ...')
    model.fit(X_train, Y_train, batch_size=100, shuffle='batch', nb_epoch=1,
              validation_data=(X_test, Y_test))
The updated learning rate is remembered in the optimizer object model.optimizer, which is just the sgd variable in your example.
In callbacks such as LearningRateScheduler, the learning rate variable is updated (some lines are removed for clarity).
def on_epoch_begin(self, epoch, logs=None):
    lr = self.schedule(epoch)
    K.set_value(self.model.optimizer.lr, lr)
However, when decay is used (as in your example), the learning rate variable is not directly updated, but the variable model.optimizer.iterations is updated. This variable records how many batches have been used in model fitting, and the learning rate with decay is computed in SGD.get_updates() by:
lr = self.lr
if self.initial_decay > 0:
    lr *= (1. / (1. + self.decay * K.cast(self.iterations,
                                          K.dtype(self.decay))))
So in either case, as long as the model is not re-compiled, it will use the updated learning rate in the new fit() calls.
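The decay rule can be reproduced in a few lines of plain Python to see how the effective learning rate carries over between fit() calls. This is a toy sketch: ToyOptimizer and toy_fit are made-up stand-ins that only count batches, and the decay value is exaggerated so the effect is visible.

```python
def decayed_lr(initial_lr, decay, iterations):
    """Time-based decay: lr = initial_lr / (1 + decay * iterations)."""
    return initial_lr * (1.0 / (1.0 + decay * iterations))


class ToyOptimizer:
    """Keeps the same state an optimizer would keep between fit() calls."""

    def __init__(self, lr, decay):
        self.lr = lr
        self.decay = decay
        self.iterations = 0  # number of batches processed so far

    def current_lr(self):
        return decayed_lr(self.lr, self.decay, self.iterations)


def toy_fit(optimizer, n_batches):
    """Stand-in for one fit() call: every batch increments the counter."""
    for _ in range(n_batches):
        optimizer.iterations += 1


opt = ToyOptimizer(lr=0.1, decay=0.5)
lrs = []
for _ in range(3):  # three consecutive "fit" calls on the same optimizer
    toy_fit(opt, n_batches=2)
    lrs.append(opt.current_lr())
print(lrs)
```

The counter is never reset between calls, so each call continues the decay where the previous one stopped. With the question's decay=1e-08 the same thing happens, only far more slowly.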

Keras: model with one input and two outputs, trained jointly on different data (semi-supervised learning)

I would like to code with Keras a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset where there are a few labeled images and a lot of unlabeled images:
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have a single input and shared "encoding" convolutional layers, but would then split into two heads (fork-style), a classification head and a decoding head, so that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that as we have full control over the computational graph.
But with Keras, things are more high-level and I feel that all the calls to ".fit" must always provide all the data at once (so it would force me to tie together the classification head and the autoencoding head into one time-step).
One way in keras to almost do that would be with something that goes like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(..., metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and I fear that if I just want to fit things in that way it will fail and ask for the missing head:
for epoch in range(10):
    # classifications step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit" to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
Sometimes, when you don't have a label, you can pass a zero vector instead of a one-hot encoded vector. It should not change your result, because a zero vector doesn't produce any error signal with categorical cross-entropy loss.
My custom to_categorical function looks like this:
def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
Here y contains the labels, and translator_dict is a Python dictionary which maps each label to its unique index, like this:
{'unisex':2, 'female': 1, 'male': 0}
If a label such as UNK can't be found in this dictionary, its encoded label will be a zero vector.
If you use this trick, you also have to modify your accuracy function to see real accuracy numbers: you have to filter out all zero vectors from your metrics.
def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0))  # zero vector mask
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
Note: you have to use larger batches (e.g. 32) in order to prevent a batch whose label matrix is all zeros, because that can make your accuracy metrics go crazy; I don't know why.
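To see why an all-zero target contributes no error signal, here is a minimal plain-Python sketch of the per-sample categorical cross-entropy, -sum(y_true * log(y_pred)). The function is a hand-written illustration, not the Keras loss, and the probabilities are made up.

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Per-sample loss: minus the sum of y_true * log(y_pred) over classes."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


y_pred = [0.2, 0.5, 0.3]  # hypothetical softmax output for one sample

labeled_loss = categorical_crossentropy([0, 1, 0], y_pred)    # equals -log(0.5)
unlabeled_loss = categorical_crossentropy([0, 0, 0], y_pred)  # all-zero target
print(labeled_loss, unlabeled_loss)
```

Every term of the sum is multiplied by a zero target, so the loss is zero whatever the prediction is, and such a sample produces no gradient for the classification head.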
Alternative solution
Use Pseudo Labeling :)
You can train jointly; you have to pass an array instead of a single label.
I used fit_generator, e.g.
steps_per_epoch=len(dataset) / batch_size,
def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        yield batch_x, [gender_label_batch, category_label_batch]
