I am using convolutional neural networks (via Keras) for facial expression recognition (55 subjects). My data set is quite hard and contains around 450k examples with 7 classes. I have balanced my training set per subject and per class label.
I implemented a very simple CNN architecture (with real-time data augmentation):
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization
from keras.layers.advanced_activations import PReLU

model = Sequential()
model.add(Convolution2D(32, 3, 3, border_mode=borderMode, init=initialization, input_shape=(48, 48, 3)))
model.add(BatchNormalization())
model.add(PReLU())
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(256))
model.add(BatchNormalization())
model.add(PReLU())
model.add(Dropout(0.5))
model.add(Dense(nb_output))
model.add(Activation('softmax'))
After the first epoch, my training loss decreases steadily while the validation loss increases. Could overfitting happen that soon? Or is there a problem with my data being confusing? Should I also balance my test set?
It could be that the task is easy enough that the model learns it within one epoch, and training for more epochs just increases overfitting.
But if you have balanced the train set and not the test set, what may be happening is that you are training for one task (expression recognition on evenly distributed data) and then you are testing on a slightly different task, because the test set is not balanced.
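One quick way to check whether the unbalanced test set is part of the problem is to look at per-class metrics instead of a single aggregate number. A minimal sketch, assuming the trained model and integer-encoded test labels are available as model, x_test and y_test (placeholder names, not from the original post):

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# predicted class indices on the (unbalanced) test set
y_pred = np.argmax(model.predict(x_test), axis=1)

# per-class precision/recall reveals whether a few frequent classes
# dominate the overall accuracy
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

If a handful of over-represented expressions drive most of the accuracy, the mismatch between the balanced training distribution and the unbalanced test distribution is a likely explanation.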
I can't seem to find a concrete answer to this question. I am currently doing transfer learning from a VGG19 network, and my target domain is document classification (either classifying purely from the visual features, or using the CNN's features as input to another model).
I want to understand in which cases it is desirable to keep all fully connected layers of the model, and in which cases I should remove them and add a new fully connected layer on top of the last convolutional layer. What does each of these choices imply for training, prediction, etc.?
These are code examples using Keras of what I mean:
Extracting the last fully connected layer:
original_model = VGG19(include_top=True, weights='imagenet', input_shape=(224, 224, 3))
layer_name = 'fc2'
x = Dropout(0.5)(original_model.get_layer(layer_name).output)
x = BatchNormalization()(x)
predictions = Dense(num_classes, activation='softmax')(x)
features_model = Model(inputs=original_model.input, outputs=predictions)
adam = optimizers.Adam(lr=0.001)
features_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
features_model.summary()
return features_model
Adding one fully connected layer after the last convolutional layer:
original_model = VGG19(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
x = Flatten()(original_model.output)
x = Dense(4096, activation='relu')(x)
x = Dropout(0.5)(x)
x = BatchNormalization()(x)
predictions = Dense(num_classes, activation='softmax')(x)
head_model = Model(inputs=original_model.input, outputs=predictions)
adam = optimizers.Adam(lr=0.001)
head_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
head_model.summary()
return head_model
Is there a rule of thumb for what to choose when doing transfer-learning?
In my past experience (successfully applying transfer learning from stock-market data to business forecasting), you should keep the original structure: if you are doing transfer learning, you will want to load the weights trained on the original architecture without running into mismatches between architectures. Then you unfreeze parts of the CNN, so training starts from an already high accuracy and adapts the weights to the target problem.
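As a rough sketch of what "unfreeze parts of the CNN" can look like in Keras (the choice to unfreeze only VGG19's last convolutional block, and the learning rate, are illustrative assumptions, not something prescribed above):

from keras.applications import VGG19
from keras.layers import Dense, Flatten
from keras.models import Model
from keras import optimizers

base = VGG19(include_top=False, weights='imagenet', input_shape=(224, 224, 3))

# keep every layer frozen except the last convolutional block ('block5...')
for layer in base.layers:
    layer.trainable = layer.name.startswith('block5')

x = Flatten()(base.output)
predictions = Dense(num_classes, activation='softmax')(x)  # num_classes as in the question
model = Model(inputs=base.input, outputs=predictions)

# recompile after changing the trainable flags, usually with a small learning rate
model.compile(optimizer=optimizers.Adam(lr=1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])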
However, if you cut the network at the Flatten layer and replace the original fully connected layers with a smaller head, the computational cost will decrease, as you will have fewer parameters to train.
I follow the rule of keeping neural nets as simple as possible (simpler models tend to generalize better) while keeping them efficient.
@Kamen, as a complement to your comment: how much data you will need depends on the variance of your data. The more variance there is, the more layers and weights you will need to learn the details. However, as you increase the complexity of the architecture, your neural net becomes more prone to overfitting, which can be mitigated using Dropout, for instance.
As fully connected layers are the most expensive part of a neural net, adding one or two of them will increase your parameter count a lot and demand more training time. With more layers you may get higher accuracy, but you may also overfit.
For instance, on MNIST you can reach an accuracy higher than 99% with a quite simple architecture using only 10,000 examples. However, ImageNet has over 1,000,000 examples (roughly 155 GB) and demands a more complex architecture, like VGG16.
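To get a feel for how much of the cost sits in those fully connected layers, you can compare the parameter counts of VGG19 with and without its original top; a minimal sketch (weights=None so nothing is downloaded, since the counts do not depend on the weight values):

from keras.applications import VGG19

full = VGG19(include_top=True, weights=None)                                   # conv blocks + fc1/fc2/predictions
conv_only = VGG19(include_top=False, weights=None, input_shape=(224, 224, 3))  # conv blocks only

# the vast majority of VGG19's parameters live in the fully connected head
print('with top :', full.count_params())
print('conv only:', conv_only.count_params())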
I am developing a simple autoencoder, and to find the right hyperparameters I run a grid search on a small subset of the dataset. Can the number of epochs found this way also be used when training on the full, much larger dataset? Does the number of epochs depend on the size of the dataset or not? E.g., should I expect to need many more epochs for a large dataset and fewer epochs for a small one?
In general, yes: the number of epochs will change if the dataset is bigger.
The number of epochs should not be decided a priori. Instead, run the training, monitor the training and validation losses over time, and stop training when the validation loss reaches a plateau or starts increasing. This technique is called "early stopping" and is good practice in machine learning.
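In Keras this is available as the EarlyStopping callback; a minimal sketch for an autoencoder, where model, x_train, the patience value, and the upper bound on epochs are all placeholders/assumptions:

from keras.callbacks import EarlyStopping

# stop once the validation loss has not improved for 5 consecutive epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=5)

model.fit(x_train, x_train,          # autoencoder: the target is the input itself
          epochs=1000,               # generous upper bound; early stopping cuts it short
          batch_size=32,
          validation_split=0.1,
          callbacks=[early_stopping])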
I am working on an image classification problem in Keras.
I am training the model using model.fit_generator for data augmentation.
While training per epoch, I am also evaluating on validation data.
Training is done on 90% of the data and validation on the remaining 10%. The following is my code:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint, LearningRateScheduler
from keras.optimizers import SGD

datagen = ImageDataGenerator(
    rotation_range=20,
    zoom_range=0.3)

batch_size = 32
epochs = 30

model_checkpoint = ModelCheckpoint('myweights.hdf5', monitor='val_acc', verbose=1,
                                   save_best_only=True, mode='max')

lr = 0.01
sgd = SGD(lr=lr, decay=1e-6, momentum=0.9, nesterov=False)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])

def step_decay(epoch):
    # initialize the base learning rate, drop factor, and
    # how often (in epochs) to drop it
    initAlpha = 0.01
    factor = 1
    dropEvery = 3
    # compute the learning rate for the current epoch
    alpha = initAlpha * (factor ** np.floor((1 + epoch) / dropEvery))
    # return the learning rate
    return float(alpha)

history = model.fit_generator(datagen.flow(xtrain, ytrain, batch_size=batch_size),
                              steps_per_epoch=xtrain.shape[0] // batch_size,
                              callbacks=[LearningRateScheduler(step_decay), model_checkpoint],
                              validation_data=(xvalid, yvalid),
                              epochs=epochs, verbose=1)
However, upon plotting the training accuracy and validation accuracy (as well as the training loss and validation loss), I noticed the validation accuracy is higher than training accuracy (and likewise, validation loss is lower than training loss). Here are my resultant plots after training (please note that validation is referred to as "test" in the plots):
When I do not apply data augmentation, the training accuracy is higher than the validation accuracy. From my understanding, the training accuracy should typically be greater than the validation accuracy. Can anyone give insight into why this is not the case in my situation when data augmentation is applied?
The following is just a theory, but it is one that you can test!
One possible explanation for why your validation accuracy is better than your training accuracy is that the data augmentation you are applying to the training data makes the task significantly harder for the network. (It's not totally clear from your code sample, but it looks like you are applying the augmentation only to your training data, not your validation data.)
To see why this might be the case, imagine you are training a model to recognise whether someone in the picture is smiling or frowning. Most pictures of faces have the face the "right way up" so the model could solve the task by recognising the mouth and measuring if it curves upwards or downwards. If you now augment the data by applying random rotations, the model can no longer focus just on the mouth, as the face could be upside down. In addition to recognising the mouth and measuring its curve, the model now also has to work out the orientation of the face as a whole and compare the two.
In general, applying random transformations to your data is likely to make it harder to classify. This can be a good thing as it makes your model more robust to changes in the input, but it also means that your model gets an easier ride when you test it on non-augmented data.
This explanation might not apply to your model and data, but you can test it in two ways:
If you decrease the range of the augmentation transformations you are using you should see the training and validation loss get closer together.
If you apply the exact same augmentation transformations to the validation data as you do to the training data, then you should see the validation accuracy drop below the training accuracy, as you expected.
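A minimal sketch of that second check, reusing the generator and variable names from the question's code so the validation batches go through exactly the same transformations:

# feed the validation data through the same augmenting generator
val_flow = datagen.flow(xvalid, yvalid, batch_size=batch_size, shuffle=False)

history = model.fit_generator(
    datagen.flow(xtrain, ytrain, batch_size=batch_size),
    steps_per_epoch=xtrain.shape[0] // batch_size,
    validation_data=val_flow,
    validation_steps=xvalid.shape[0] // batch_size,
    epochs=epochs, verbose=1)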
I am building a sample project in Keras. The project is to identify the difference between cats and dogs. I found an example online with the model as such:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Activation

# note: input_shape=(3, 150, 150) assumes channels-first image ordering
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(3, 150, 150)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
My question is: how do people know which layers to use? Are there guidelines or rules of thumb for when to use a Conv2D vs. a Conv1D vs. another layer?
In short: they don't. Coming up with a good architecture is a major part of current deep learning research. There are some rules of thumb and intuitions, but mostly it comes down to experience, or to copying existing architectures that have been reported to work.
Very briefly:
convolutions are used when you have spatial and/or temporal structure in your data, e.g. images, videos, sound, etc.
pooling has similar use cases to convolutions; it still requires spatial and/or temporal structure (unless it is applied to a whole channel/dimension) and provides a way of removing "details" (usually noise) and reducing the dimensionality of the signal
recurrent layers are used when your data has a sequential character
fully connected layers are needed to "force" a given output dimension (thus they are often used as the last layer), or when one does not know of any structure that can be exploited (since they are pretty much the most generic layers)
However, how to compose these layers, which hyperparameters to use, and how many layers to use are huge open research questions, and at the very beginning the best approach is to copy someone else's architecture and gain some experience/intuition about what works and what does not for the data you are working with.
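As a rough illustration of how the structure of the data drives the layer choice (the shapes and layer sizes below are arbitrary examples, not recommendations):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Conv1D, GlobalMaxPooling1D, LSTM, Dense

# 2-D spatial structure (e.g. 150x150 RGB images) -> Conv2D
image_model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(10, activation='softmax'),
])

# 1-D temporal structure (e.g. 100 timesteps of a 16-channel signal) -> Conv1D
signal_model = Sequential([
    Conv1D(32, 3, activation='relu', input_shape=(100, 16)),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid'),
])

# sequences where long-range order matters -> a recurrent layer
sequence_model = Sequential([
    LSTM(32, input_shape=(100, 16)),
    Dense(1, activation='sigmoid'),
])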
In many examples, I see train/cross-validation dataset splits being performed using KFold, StratifiedKFold, or another pre-built dataset splitter. Keras models have a built-in validation_split kwarg that can be used during training.
model.fit(self, x, y, batch_size=32, nb_epoch=10, verbose=1, callbacks=[], validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None)
(https://keras.io/models/model/)
validation_split: float between 0 and 1: fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
I am new to the field and its tools, so my intuition about what the different splitters offer is limited. Mainly, though, I can't find any information on how Keras' validation_split works. Can someone explain it to me, and when a separate method is preferable? The built-in kwarg seems to me like the cleanest and easiest way to split off a validation dataset, without having to architect your training loops much differently.
The difference between the two is quite subtle and they can be used in conjunction.
KFold and similar functions in scikit-learn will randomly split your data into k folds. You can then train k models, each time holding out a single fold and testing on it.
validation_split takes a fraction of your data non-randomly. According to the Keras documentation it will take the fraction from the end of your data, e.g. 0.1 will hold out the final 10% of rows in the input matrix. The purpose of the validation split is to allow you to assess how the model is performing on the training set and a held out set at every epoch in the training period. If the model continues to improve on the training set but not the validation set then it is a clear sign of potential overfitting.
You could theoretically use KFold cross-validation to construct a model while also using validation_split to monitor the performance of each model. At each fold you will be generating a new validation_split from the training data.
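A minimal sketch of that combination, assuming x and y are NumPy arrays and build_model() is a placeholder for a function that returns a freshly compiled Keras model (none of these names come from the original post):

import numpy as np
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kfold.split(x):
    model = build_model()  # placeholder: returns a new, compiled model each fold
    # validation_split carves the last 10% off this training fold to monitor
    # overfitting epoch by epoch
    model.fit(x[train_idx], y[train_idx],
              epochs=10, batch_size=32,
              validation_split=0.1)
    # evaluate on the fold that was held out entirely
    scores.append(model.evaluate(x[test_idx], y[test_idx], verbose=0))

print(np.mean(scores, axis=0))  # average loss/metrics across the k folds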