how to implement L2 Regularization in caffe or DIGITS? - machine-learning

I am using Caffe and also NVIDIA DIGITS. I want to use AlexNet pretrained on ImageNet and wanna fine tune it on my medical data. I have nearly 1000 images and using 80% for training, I generated 40,000 images by data augmentation (using cropping and rotation). However I face a severe overfitting. I tried to overcome this by adding multiple dropout layers. and the result change from :
to:
but my accuracy does not improve.
my network specifications:
AlexNet pre-trained on ImageNet
base learning rate: 0.001
learning rate multiplier: 0.1 for convolution layers and 1 for fully connected layers and xavier weight initialisation.
dropout: 0.5
Now I want to add L2 regularization. I did not find such layer in Caffe and I should maybe make it myself.
first question: Do you have any solution for my problem? ( I have tried other ways like changing stepsize, changing learning rate from 1 to 10^(-5) and I found 0.001 is better, weigh decay changes, adding various dropout layer (which helped as you see))
second question: can you please help me how I can implement L2 regularization??

You have L2 regularization by default in caffe.
See this thread for more information.

Related

Data normalization Convolutional Autoencoders

Iam a little bit confused about how to normalize/standarize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my traning images consists of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are to options to pre-process the images:
- normalization
- standarization (z-score)
When normalizing using the MinMax approach (scaling between 0-1) the network works fine, but my question here is:
- When using the min max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standarization, the loss decreases for the two first epochs, after that it stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
cheers,
Mike
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used it in a denoising application. You can use quantiles 5% and 95% instead, or use z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note however that this remark on activation only concerns the output layer, any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.

Neural Network with Input - Relu - SoftMax - Cross Entropy Weights and Activations grow unbounded

I have implemented a neural network with 3 layers Input to Hidden Layer with 30 neurons(Relu Activation) to Softmax Output layer. I am using the cross entropy cost function. No outside libraries are being used. This is working on the NMIST dataset so 784 input neurons and 10 output neurons.
I have got about 96% accuracy with hyperbolic tangent as my hidden layer activation.
When I try to switch to relu activation my activations grow very fast which cause my weights grow unbounded as well until it blows up!
Is this a common problem to have when using relu activation?
I have tried L2 Regularization with minimal success. I end up having to set the learning rate lower by a factor of ten compared to the tanh activation and I have tried adjusting the weight decay rate accordingly and still the best accuracy I have gotten is about 90%. The rate of weight decay is still outpaced in the end by the updating of certain weights in the network which lead to an explosion.
It seems everyone is just replacing their activation functions with relu and they experience better results, so I keep looking for bugs and validating my implementation.
Is there more that goes into using relu as an activation function? Maybe I have problems in my implemenation, can someone validate accuracy with the same neural net structure?
as you can see the Relu function is unbounded on positive values, thus creating the weights to grow
in fact, that's why hyperbolic tangent and alike function are being used in those cases, to bound the output value between a certain range (-1 to 1 or 0 to 1 in most cases)
there is another approach to deal with this phenomenon called weights decay
the basic motivation is to get a more generalised model (avoid overfitting) and make sure the weights won't blow up you use a regulation value depending on the weight itself when update them
meaning that bigger weights get bigger penalty
you can farther read about it here

How should I optimize neural network for image classification using pretrained models

Thank you for viewing my question. I'm trying to do image classification based on some pre-trained models, the images should be classified to 40 classes. I want to use VGG and Xception pre-trained model to convert each image to two 1000-dimensions vectors and stack them to a 1*2000 dimensions vector as the input of my network and the network has an 40 dimensions output. The network has 2 hidden layers, one with 1024 neurons and the other one has 512 neurons.
Structure:
image-> vgg(1*1000 dimensions), xception(1*1000 dimensions)->(1*2000 dimensions) as input -> 1024 neurons -> 512 neurons -> 40 dimension output -> softmax
However, using this structure I can only achieve about 30% accuracy. So my question is that how could I optimize the structure of my networks to achieve higher accuracy? I'm new to deep learning so I'm not quiet sure my current design is 'correct'. I'm really looking forward to your advice
I'm not entirely sure I understand your network architecture, but some pieces don't look right to me.
There are two major transfer learning scenarios:
ConvNet as fixed feature extractor. Take a pretrained network (any of VGG and Xception will do, do not need both), remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. For example, in an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Tip #1: take only one pretrained network.
Tip #2: no need for multiple hidden layers for your own classifier.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet becomes progressively more specific to the details of the classes contained in the original dataset.
Tip #3: keep the early pretrained layers fixed.
Tip #4: use a small learning rate for fine-tuning because you don't want to distort other pretrained layers too quickly and too much.
This architecture much more resembled the ones I saw that solve the same problem and has higher chances to hit high accuracy.
There are couple of steps you may try when the model is not fitting well:
Increase training time and decrease learning rate. It may be stopping at very bad local optima.
Add additional layers that can extract specific features for the large number of classes.
Create multiple two-class deep networks for each class ('yes' or 'no' output class). This will let each network be more specialized for each class, rather than training one single network to learn all 40 classes.
Increase training samples.

Perceptron and shape recognition

I recently implemented a simple Perceptron. This type of perceptron (composed of only one neuron giving binary information in output) can only solve problems where classes can be linearly separable.
I would like to implement a simple shape recognition in images of 8 by 8 pixels. I would like for example my neural network to be able to tell me if what I drawn is a circle, or not.
How to know if this problem has classes being linearly separable ? Because there is 64 inputs, can it still be linearly separable ? Can a simple perceptron solve this kind of problem ? If not, what kind of perceptron can ? I am a bit confused about that.
Thank you !
This problem, in a general sense, can not be solved by a single layer perception. In general other network structures such as convolutional neural networks are best for solving image classification problems, however given the small size of your images a multilayer perception may be sufficient.
Most problems are linearly separable, but not necessarily in 2 dimensions. Adding extra layers to a network allows it to transform data in higher dimensions so that it is linearly separable.
Look into multilayer perceptrons or convolutional neural networks. Examples of classification on the MNIST dataset might be helpful as well.

Convolutional ImageNet network is invariant to flipping images

I am using Deep learning caffe framework for image classification.
I have coins with faces. Some of them are left directed some of them are right.
To classify them I am using common aproach - take weights and structure from pretrained ImageNet network that have already capture a lot of image patterns and train mostly the last layer to fit my training set.
But I have found that netowork does not works on this set:
I have taken some coin for example leftdirected , generated horizontally flipped image for it and marked it as right sided.
For this set convolutional net gets ~50% accuracy, it is exactly random result.
I have also tried to train net on 2 images ( 2 flipped versions of "h" letter ). But with the same result - 50% . ( If I choose to diffetrent letters and train net on augemeneted dataset - i receive 100% accuracy very fast ) . But invariance to flipping brokes my classification.
My question is: is there exists some aproach that allowes me to use advantages of pretrained imagenet but broke somehow this invariance. And what layer on net make invariance possible.
I am using "caffe" for generating net based on this example approach:
https://github.com/BVLC/caffe/blob/master/examples/02-fine-tuning.ipynb
Caffe basic/baseline models trained on image net mostly use the very trivial image augmentation: flipping images horizontally. That is, imagenet classes are indeed the same when flipped horizontally. Thus, the weights you are trying to fine-tune were trained in a setting where horizontal flip should be ignored and I suppose what you see is a net that captured this quite well - it is no longer sensitive to this particular transformation.
It is not trivial to tell at what layer of the net this invariance is happening and therefore it is not easy to say what layers should be fine-tuned to overcome this behavior. I suppose this invariance is quite fundamental to the network and I will not be surprise if it required re-training of the entire net.

Resources