CNN training loss is so unstable - machine-learning

[Image: my CNN network configuration]
Above is my network configuration.
I am training a CNN on pictures of size 192*192.
My target is an 11-class classification network.
However, the loss and the accuracy on the testing dataset are very unstable. I have to run 15+ epochs to get a stable accuracy and loss, and the maximum accuracy is only 50%.
What can I do to improve the performance?

I would recommend first looking at widely known models such as VGG-16, LeNet, or VGG-19 and checking how their Conv2D and max-pooling layers are arranged.
Start with a very basic model without any batch normalization or Leaky ReLU layers: keep just the Conv2D and max-pooling layers and train that model for a few epochs.
Next, try other activations, e.g. swapping ReLU for TanH, and try changing max pooling to average pooling.
Since you are solving a classification problem, use a softmax layer at the end, and introduce Dense layer(s) after flattening.
Your dataset should be reasonably large, and the target should be one-hot encoded if you wish to use the softmax layer.
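As a rough illustration of such a baseline (a minimal sketch only; the filter counts, Dense size, and optimizer here are assumptions, not tuned values), something like the following Keras model would be a reasonable starting point for 192*192 inputs and 11 one-hot encoded classes:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Plain Conv2D + max-pooling stacks with ReLU, no batch norm or LeakyReLU yet.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(192, 192, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(11, activation='softmax'),  # 11 classes, one-hot encoded targets
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Once this plain version trains stably, reintroduce batch normalization, Leaky ReLU, or other changes one at a time and compare.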

Related

Not able to train ResNet model with triplet loss, while VGG16 works, why?

I am trying to do transfer learning with the ResNet50V2 model using a triplet loss function. I have set include_top=False, input shape (160, 160, 3), with ImageNet weights. The last 3 layers of my model are shown in the image below, with 6 million trainable parameters.
During training I can see the loss value decrease from 7.6 to 0.8, but the accuracy does not improve. When I replace the model with VGG16 and train its last 3 layers, the accuracy improves from 50% to 90% while the loss decreases from 6.0 to 0.5.
Where am I going wrong? Is there anything specific I should look at while training a ResNet model? How should I train the ResNet model?
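For reference, a typical way to set up the described backbone looks roughly like this (a sketch only: the pooling, embedding size, and L2 normalisation here are assumptions, since the asker's last three layers are visible only in the missing image, and the triplet loss itself is whatever implementation the asker already uses):

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50V2

# Frozen ResNet50V2 backbone producing L2-normalised embeddings for a triplet loss.
base = ResNet50V2(include_top=False, weights='imagenet', input_shape=(160, 160, 3))
base.trainable = False  # train only the new head, as with the VGG16 variant

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(256, activation='relu')(x)
x = layers.Dense(128)(x)  # embedding dimension (assumed)
embeddings = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)
model = Model(base.input, embeddings)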

What is best practice for which CNN fully-connected layers to keep when doing transfer-learning?

I can't seem to find a concrete answer to this question. I am currently doing transfer learning from a VGG19 network, and my target domain is document classification (either purely by visual classification or using the CNN's feature extraction as input to another model).
I want to understand in which cases it is desirable to keep all fully connected layers of the model, and in which cases I should remove them and add a new fully connected layer on top of the last convolutional layer. What does each of these choices imply for training, predictions, etc.?
Here are Keras code examples of what I mean:
Extracting the last fully connected layer:
from keras.applications import VGG19
from keras.layers import Dense, Dropout, BatchNormalization
from keras.models import Model
from keras import optimizers

def build_fc2_model(num_classes):
    # Keep VGG19 up to its second fully connected layer ('fc2') and put a new
    # classification head on top of it.
    original_model = VGG19(include_top=True, weights='imagenet', input_shape=(224, 224, 3))
    layer_name = 'fc2'
    x = Dropout(0.5)(original_model.get_layer(layer_name).output)
    x = BatchNormalization()(x)
    predictions = Dense(num_classes, activation='softmax')(x)
    features_model = Model(inputs=original_model.input, outputs=predictions)
    adam = optimizers.Adam(lr=0.001)
    features_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
    features_model.summary()
    return features_model
Adding one fully connected layer after the last convolutional layer:
from keras.applications import VGG19
from keras.layers import Dense, Dropout, Flatten, BatchNormalization
from keras.models import Model
from keras import optimizers

def build_conv_head_model(num_classes):
    # Drop VGG19's fully connected layers and train a new head on top of the
    # last convolutional block.
    base_model = VGG19(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
    x = Flatten()(base_model.output)
    x = Dense(4096, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = BatchNormalization()(x)
    predictions = Dense(num_classes, activation='softmax')(x)
    head_model = Model(inputs=base_model.input, outputs=predictions)
    adam = optimizers.Adam(lr=0.001)
    head_model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
    head_model.summary()
    return head_model
Is there a rule of thumb for what to choose when doing transfer-learning?
In my experience (I have successfully applied transfer learning from stock-market models to business forecasting), you should keep the original structure: when doing transfer learning you will want to load the weights trained on the original architecture without running into mismatches between the architectures. You then unfreeze parts of the CNN, so training starts from an already high accuracy and adapts the weights to the target problem, as sketched just below.
However, if you remove the Flatten layer (and the dense layers after it), the computational cost decreases because you have fewer parameters to train.
I follow the rule of keeping neural nets as simple as possible (which usually means better generalization) while maintaining high efficiency.
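A minimal sketch of that load-then-unfreeze step, assuming Keras and a VGG19 backbone (unfreezing only the last convolutional block is an illustrative choice, not a prescription):

from keras.applications import VGG19

# Load the original architecture with its pretrained weights, freeze everything,
# then unfreeze only the last convolutional block for fine-tuning.
base_model = VGG19(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
for layer in base_model.layers:
    layer.trainable = False
for layer in base_model.layers:
    if layer.name.startswith('block5'):  # last conv block in VGG19
        layer.trainable = True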
@Kamen, as a complement to your comment regarding how much data you will need: it depends on the variance of your data. The more variance, the more layers and weights you will need to learn the details. However, as you increase the complexity of the architecture, the network becomes more prone to overfitting, which can be reduced with Dropout, for instance.
Fully connected layers are the most expensive part of a network, so adding one or two of them increases the parameter count a lot and demands more training time. With more layers you may get higher accuracy, but you may also overfit.
For instance, MNIST with 10,000 examples can reach an accuracy higher than 99% with a quite simple architecture, whereas ImageNet, with about 1,000,000 examples (155 GB), demands a more complex structure such as VGG16.

How does Caffe update the gradients of a blob with more than one output branch?

Caffe supports multiple losses. During backpropagation, some blobs may then receive gradients coming from several different losses. How does Caffe handle the gradients of such a blob?
As far as I know, this is not a concern when designing networks, but the question really confuses me when I try to write a new layer. Thanks for any ideas!
This is not an issue specific to Caffe or any other deep-learning tool; it is purely a mathematical question. When you have several losses, each loss has a loss_weight assigned to it, and the overall loss of the net is the weighted sum of all the losses. Consequently, the gradients computed for the net are gradients of this weighted sum: there is no gradient-per-loss that needs to be combined separately, just a single loss that is a weighted sum of the loss layers.
Caffe usually inserts a "Split" layer when the "top" of one layer feeds several layers (in your example the output of "conv2" is "Split" into a "bottom" of the "auxiliary loss" and of "ip1").
Looking at the backward-propagation code of the "Split" layer, you can see that all the top.diffs are summed into bottom.diff.
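A toy numeric sketch of that behaviour (plain NumPy, not Caffe code): two hypothetical losses share one blob, and the gradient reaching the shared blob is the sum of the weighted per-branch gradients, which is exactly what summing the top.diffs into bottom.diff amounts to:

import numpy as np

# A blob x feeds two branches, each ending in its own loss:
#   loss1 = sum(x**2), loss2 = sum(3*x), total = w1*loss1 + w2*loss2
x = np.array([1.0, 2.0, 3.0])
w1, w2 = 1.0, 0.5  # the loss_weight of each loss layer

grad_from_loss1 = w1 * 2 * x                # d(w1*loss1)/dx
grad_from_loss2 = w2 * 3 * np.ones_like(x)  # d(w2*loss2)/dx

# The "Split" backward pass sums the incoming top.diffs into one bottom.diff,
# which equals the gradient of the weighted total loss w.r.t. the shared blob.
bottom_diff = grad_from_loss1 + grad_from_loss2
print(bottom_diff)  # [3.5 5.5 7.5]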

Transfer Learning and linear classifier

In cs231n handout here, it says
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns... Hence, the best idea might be to train a linear classifier on the CNN codes.
I'm not sure what "linear classifier" means here. Does it refer to the last fully connected layer? (For example, AlexNet has three fully connected layers; is the linear classifier the last of them?)
Usually when people say "linear classifier" they refer to a linear SVM (support vector machine). A linear classifier learns a weight vector w and a threshold (aka "bias") b such that for each example x, the sign of
<w, x> + b
is positive for the "positive" class and negative for the "negative" class.
The last (usually fully connected) layer of a neural net can be considered a form of linear classifier.
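A minimal sketch of the "linear classifier on CNN codes" recipe, assuming Keras and scikit-learn and using VGG16's 'fc2' outputs as the codes (Keras ships no AlexNet; the dummy images and labels below are placeholders for a real dataset):

import numpy as np
from keras.applications import VGG16
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from sklearn.svm import LinearSVC

# Frozen ConvNet as a fixed feature extractor ("CNN codes"), linear SVM on top.
base = VGG16(weights='imagenet', include_top=True)
feature_extractor = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

# Dummy stand-ins for a real dataset: (N, 224, 224, 3) images and integer labels.
images = np.random.rand(8, 224, 224, 3) * 255.0
labels = np.arange(8) % 2

codes = feature_extractor.predict(preprocess_input(images))  # shape (N, 4096)
clf = LinearSVC(C=1.0)
clf.fit(codes, labels)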

Ada-Delta method doesn't converge when used in Denoising AutoEncoder with MSE loss & ReLU activation?

I just implemented AdaDelta (http://arxiv.org/abs/1212.5701) for my own deep neural network library.
The paper more or less says that SGD with AdaDelta is not sensitive to hyperparameters and that it always converges to somewhere good (at least, the reconstruction loss of AdaDelta-SGD is comparable to that of a well-tuned Momentum method).
When I used AdaDelta-SGD as the learning method in a denoising autoencoder, it converged in some specific settings, but not always.
When I used MSE as the loss function and Sigmoid as the activation function, it converged very quickly, and after 100 epochs the final reconstruction loss was better than with plain SGD, SGD with Momentum, and AdaGrad.
But when I used ReLU as the activation function, it didn't converge; it stayed stuck (oscillating) at a high (bad) reconstruction loss, just as when you use plain SGD with a very high learning rate.
The reconstruction loss it got stuck at was about 10 to 20 times higher than the final reconstruction loss obtained with the Momentum method.
I really don't understand why this happens, given that the paper says AdaDelta just works.
Please let me know the reason behind this phenomenon and how I could avoid it.
The activation of a ReLU is unbounded, which makes it difficult to use in autoencoders, since your training vectors most likely do not have arbitrarily large responses. ReLU simply isn't a good fit for the output of that type of network.
You can force a ReLU into an autoencoder by applying some transformation to the output layer, as is done here. However, they don't discuss the quality of the results as an autoencoder, only as a pre-training method for classification, so it's not clear that it's a worthwhile endeavor for building an autoencoder either.
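One common workaround, sketched here under the assumption of a Keras setup with inputs scaled to [0, 1] (the layer sizes and 784-dimensional input are illustrative), is to keep ReLU in the hidden layers but use a bounded activation such as sigmoid on the output layer so the reconstruction range matches the data:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adadelta

# Denoising autoencoder: ReLU hidden layers, but a bounded sigmoid output layer
# so the reconstruction stays in the same [0, 1] range as the (scaled) inputs.
dae = Sequential([
    Dense(256, activation='relu', input_shape=(784,)),
    Dense(64, activation='relu'),
    Dense(256, activation='relu'),
    Dense(784, activation='sigmoid'),  # bounded output instead of ReLU
])
dae.compile(optimizer=Adadelta(), loss='mse')
# dae.fit(x_noisy, x_clean, epochs=100, batch_size=128)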
