How should I optimize a neural network for image classification using pretrained models? - image-processing

Thank you for viewing my question. I'm trying to do image classification based on some pre-trained models; the images should be classified into 40 classes. I want to use the pre-trained VGG and Xception models to convert each image into two 1000-dimensional vectors and stack them into a 1*2000-dimensional vector as the input to my network, which has a 40-dimensional output. The network has 2 hidden layers, one with 1024 neurons and the other with 512 neurons.
image-> vgg(1*1000 dimensions), xception(1*1000 dimensions)->(1*2000 dimensions) as input -> 1024 neurons -> 512 neurons -> 40 dimension output -> softmax
However, using this structure I can only achieve about 30% accuracy. So my question is: how could I optimize the structure of my network to achieve higher accuracy? I'm new to deep learning, so I'm not quite sure my current design is 'correct'. I'm really looking forward to your advice.

I'm not entirely sure I understand your network architecture, but some pieces don't look right to me.
There are two major transfer learning scenarios:
ConvNet as fixed feature extractor. Take a pretrained network (either VGG or Xception will do; you do not need both), remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. For example, in an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Tip #1: take only one pretrained network.
Tip #2: no need for multiple hidden layers for your own classifier.
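To make the "train a linear classifier on the extracted codes" step concrete, here is a minimal sketch of a softmax classifier trained with plain SGD. It assumes the feature vectors have already been extracted; the toy features below are synthetic stand-ins, not real ConvNet codes, and all function names and hyperparameters are illustrative choices rather than anything prescribed above.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_softmax_classifier(features, labels, num_classes, lr=0.1, epochs=50, seed=0):
    """Fit one linear layer + softmax with plain SGD on precomputed features."""
    rng = random.Random(seed)
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            probs = softmax([sum(w * v for w, v in zip(weights[c], x)) + biases[c]
                             for c in range(num_classes)])
            for c in range(num_classes):
                # Gradient of cross-entropy w.r.t. the class score: p_c - 1[c == y]
                grad = probs[c] - (1.0 if c == y else 0.0)
                biases[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights, biases


def predict(weights, biases, x):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)


def make_toy_features(num_classes=3, dim=8, per_class=30, seed=1):
    """Synthetic stand-in for ConvNet codes: each class clusters around its own axis."""
    rng = random.Random(seed)
    features, labels = [], []
    for c in range(num_classes):
        for _ in range(per_class):
            x = [rng.gauss(0.0, 0.3) for _ in range(dim)]
            x[c] += 2.0
            features.append(x)
            labels.append(c)
    return features, labels


features, labels = make_toy_features()
weights, biases = train_softmax_classifier(features, labels, num_classes=3)
accuracy = sum(predict(weights, biases, x) == y for x, y in zip(features, labels)) / len(labels)
```

With real data you would replace `make_toy_features` with the codes computed by the pretrained network and set `num_classes` to 40.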
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.
Tip #3: keep the early pretrained layers fixed.
Tip #4: use a small learning rate for fine-tuning because you don't want to distort other pretrained layers too quickly and too much.
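One way to picture tips #3 and #4 together is a per-layer learning rate: zero for the frozen early layers, a small value for the later pretrained layers, and a larger one for the new classifier. The sketch below is only an illustration; the layer names, parameter values and learning rates are hypothetical, not taken from any particular network.

```python
def sgd_step(params, grads, learning_rates):
    """Apply one SGD update; layers with learning rate 0.0 stay frozen."""
    updated = {}
    for layer, values in params.items():
        lr = learning_rates.get(layer, 0.0)  # unlisted layers are treated as frozen
        updated[layer] = [v - lr * g for v, g in zip(values, grads[layer])]
    return updated


# Hypothetical layer names and values, purely for illustration.
params = {"conv1": [0.5, -0.2], "conv5": [0.1, 0.3], "classifier": [0.0, 0.0]}
grads = {"conv1": [1.0, 1.0], "conv5": [1.0, -1.0], "classifier": [2.0, -2.0]}
learning_rates = {
    "conv1": 0.0,        # early, generic features: kept fixed
    "conv5": 1e-4,       # later pretrained layer: small steps only
    "classifier": 1e-2,  # freshly initialised classifier: larger steps
}

new_params = sgd_step(params, grads, learning_rates)
```

Most deep learning frameworks offer their own mechanism for this (freezing layers or setting per-layer learning rates), so in practice you would use that rather than a hand-written update.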
This architecture much more closely resembles the ones I have seen solve the same problem, and it has a better chance of reaching high accuracy.

There are a couple of steps you may try when the model is not fitting well:
Increase training time and decrease learning rate. Training may be stopping at a very bad local optimum.
Add additional layers that can extract specific features for the large number of classes.
Create multiple two-class deep networks for each class ('yes' or 'no' output class). This will let each network be more specialized for each class, rather than training one single network to learn all 40 classes.
Increase training samples.
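The "one yes/no network per class" idea from the list above can be sketched as follows. The helper names are hypothetical, and picking the class with the highest 'yes' score is an assumption on my part about how the separate networks would be combined; the answer does not specify it.

```python
def one_vs_rest_targets(labels, num_classes):
    """For each class, build a yes/no (1/0) target list over all samples."""
    return [[1 if y == c else 0 for y in labels] for c in range(num_classes)]


def combine_binary_scores(scores_per_class):
    """Pick the class whose binary network gives the highest 'yes' score."""
    return max(range(len(scores_per_class)), key=scores_per_class.__getitem__)


labels = [0, 2, 1, 2]
targets = one_vs_rest_targets(labels, num_classes=3)
# targets[c] is the training target for the binary network of class c
predicted = combine_binary_scores([0.10, 0.35, 0.80])
```

For the 40-class problem this means 40 target lists and 40 binary networks, each trained on all samples.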


Autoencoder vs Pre-trained network for feature extraction

I wanted to know if anyone has any guidance on what is better for image classification with a small number of samples per class (around 20) yet a lot of classes (about 400), for relatively big RGB images (around 600x600).
I know that Autoencoders can be used for feature extraction, such that I can just let an autoencoder run on the images unsupervised, and thus reduce the dimensionality of the images to train on those dimensionally-reduced images.
Similarly, I also know that you can just use a pre-trained network, strip the final layer and change it into a linear layer to your own dataset's number of classes, and then just train that final layer or a few layers before it to fit your dataset.
I haven't been able to find any resources online that determine which of these two techniques for feature extraction is better and under which conditions. Does anyone have any advice?

Reducing pixels in large data set (sklearn)

I'm currently working on a classification project but I'm in doubt about how I should start off.
The task: accurately classifying pictures of size 80*80 (so 6400 pixels) into the correct class (binary).
5260 training samples, 600 test samples
As there are more pixels than samples, it seems logical to me to 'drop' most of the pixels and only look at the important ones before I even start working out a classification method (like SVM, KNN etc.).
Say the training data consists of X_train (predictors) and Y_train (outcomes). So far, I've tried looking at the SelectKBest() method from sklearn for feature extraction. But what would be the best way to use this method, and how do I know which value of k to select?
It could also be the case that I'm completely on the wrong track here, so correct me if I'm wrong or suggest another approach to this if possible.
You are suggesting to reduce the dimension of your feature space. That is a method of regularization to reduce overfitting. You haven't mentioned overfitting is an issue so I would test that first. Here are some things I would try:
Use transfer learning. Take a pretrained network for image recognition tasks and fine tune it to your dataset. Search for transfer learning and you'll find many resources.
Train a convolutional neural network on your dataset. CNNs are the go-to method for machine learning on images. Check for overfitting.
If you want to reduce the dimensionality of your dataset, resize the image. Going from 80x80 => 40x40 will reduce the number of pixels by 4x; assuming your task doesn't depend on fine details of the image, you should maintain classification performance.
There are other things you may want to consider but I would need to know more about your problem and its requirements.
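As a rough illustration of the resizing suggestion, the sketch below halves an 80x80 image by averaging each 2x2 block, which takes 6400 pixels down to 1600. It assumes a single-channel image stored as a list of rows; the image itself is synthetic, and in practice an image library's resize function would do the same job.

```python
def downsample_2x(image):
    """Halve height and width by averaging each non-overlapping 2x2 block."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


# Synthetic 80x80 grayscale image standing in for one training sample.
image = [[float((r + c) % 256) for c in range(80)] for r in range(80)]
small = downsample_2x(image)
pixels_before = len(image) * len(image[0])
pixels_after = len(small) * len(small[0])
```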

Convolutional ImageNet network is invariant to flipping images

I am using the Caffe deep learning framework for image classification.
I have coins with faces. Some of them face left, some of them face right.
To classify them I am using the common approach: take the weights and structure from a pretrained ImageNet network that has already captured a lot of image patterns, and train mostly the last layer to fit my training set.
But I have found that the network does not work on this set:
I took some coins, for example left-facing ones, generated a horizontally flipped image for each, and marked it as right-facing.
For this set the convolutional net gets ~50% accuracy, which is exactly a random result.
I have also tried to train the net on 2 images (2 flipped versions of the letter "h"), but with the same result: 50%. (If I choose two different letters and train the net on an augmented dataset, I reach 100% accuracy very fast.) But the invariance to flipping breaks my classification.
My question is: is there some approach that allows me to use the advantages of a pretrained ImageNet network but somehow break this invariance? And which layer of the net makes the invariance possible?
I am using Caffe to generate the net, based on this example approach:
Caffe basic/baseline models trained on ImageNet mostly use a very trivial image augmentation: flipping images horizontally. That is, ImageNet classes are indeed the same when flipped horizontally. Thus, the weights you are trying to fine-tune were trained in a setting where horizontal flips should be ignored, and I suppose what you see is a net that captured this quite well: it is no longer sensitive to this particular transformation.
It is not trivial to tell at what layer of the net this invariance is happening, and therefore it is not easy to say what layers should be fine-tuned to overcome this behavior. I suppose this invariance is quite fundamental to the network, and I would not be surprised if it required re-training the entire net.
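One way to probe where the invariance sits is to compare a layer's features for an image and for its mirror image: a gap near zero suggests the features at that layer no longer distinguish left from right. The sketch below only shows the shape of such a check. The two toy extractors are hypothetical stand-ins for real layer activations, which you would have to pull out of the network yourself.

```python
def flip_horizontal(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def flip_sensitivity(extract_features, image):
    """Largest absolute feature change when the image is mirrored (0.0 = invariant)."""
    original = extract_features(image)
    mirrored = extract_features(flip_horizontal(image))
    return max(abs(a - b) for a, b in zip(original, mirrored))


def global_mean(image):
    """Toy flip-invariant feature: mean over the whole image."""
    values = [v for row in image for v in row]
    return [sum(values) / len(values)]


def left_right_means(image):
    """Toy flip-sensitive features: mean of the left half and of the right half."""
    half = len(image[0]) // 2
    left = [v for row in image for v in row[:half]]
    right = [v for row in image for v in row[half:]]
    return [sum(left) / len(left), sum(right) / len(right)]


# A tiny "left-facing" pattern: bright on the left, dark on the right.
image = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
invariant_gap = flip_sensitivity(global_mean, image)
sensitive_gap = flip_sensitivity(left_right_means, image)
```

Running the same comparison layer by layer on the real network would indicate from which depth onwards the left/right information is lost, which is the part that would need re-training.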

Deep Belief Networks vs Convolutional Neural Networks

I am new to the field of neural networks and I would like to know the difference between Deep Belief Networks and Convolutional Networks.
Also, is there a Deep Convolutional Network which is the combination of Deep Belief and Convolutional Neural Nets?
This is what I have gathered till now. Please correct me if I am wrong.
For an image classification problem, Deep Belief networks have many layers, each of which is trained using a greedy layer-wise strategy.
For example, if my image size is 50 x 50, and I want a Deep Network with 4 layers namely
Input Layer
Hidden Layer 1 (HL1)
Hidden Layer 2 (HL2)
Output Layer
My input layer will have 50 x 50 = 2500 neurons, HL1 = 1000 neurons (say), HL2 = 100 neurons (say) and output layer = 10 neurons.
In order to train the weights (W1) between the Input Layer and HL1, I use an AutoEncoder (2500 - 1000 - 2500) and learn W1 of size 2500 x 1000 (this is unsupervised learning). Then I feed forward all images through the first hidden layer to obtain a set of features, then use another autoencoder (1000 - 100 - 1000) to get the next set of features, and finally use a softmax layer (100 - 10) for classification. (Only learning the weights of the last layer, HL2 - Output, which is the softmax layer, is supervised learning.)
(I could use RBM instead of autoencoder).
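The greedy layer-wise procedure described above can be written down as a list of training stages. The sketch below only enumerates the stages and the weight shapes each one produces for the 2500 - 1000 - 100 - 10 example; it does not train anything, and the function name and tuple layout are my own illustrative choices.

```python
def layerwise_training_plan(layer_sizes):
    """List the training stages for greedy layer-wise pretraining.

    layer_sizes = [input, hidden_1, ..., hidden_n, output]
    Each stage is (kind, architecture, shape of the weights that are kept).
    """
    stages = []
    for visible, hidden in zip(layer_sizes[:-2], layer_sizes[1:-1]):
        # Unsupervised: autoencoder visible -> hidden -> visible; keep the encoder weights.
        stages.append(("autoencoder", (visible, hidden, visible), (visible, hidden)))
    # Supervised: softmax layer on top of the last hidden representation.
    top = (layer_sizes[-2], layer_sizes[-1])
    stages.append(("softmax", top, top))
    return stages


plan = layerwise_training_plan([2500, 1000, 100, 10])
for kind, architecture, weight_shape in plan:
    print(kind, architecture, "weights:", weight_shape)
```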
If the same problem was solved using Convolutional Neural Networks, then for 50x50 input images, I would develop a network using only 7 x 7 patches (say). My layers would be
Input Layer (7 x 7 = 49 neurons)
HL1 (25 neurons for 25 different features) - (convolution layer)
Pooling Layer
Output Layer (Softmax)
And for learning the weights, I take 7 x 7 patches from images of size 50 x 50, and feed forward through convolutional layer, so I will have 25 different feature maps each of size (50 - 7 + 1) x (50 - 7 + 1) = 44 x 44.
I then use a window of say 11x11 for pooling and hence get 25 feature maps of size (4 x 4) as the output of the pooling layer. I use these feature maps for classification.
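The size arithmetic in this example can be checked with two small helpers. This is a sketch that assumes a 'valid' convolution (no padding, stride 1) and non-overlapping pooling, which is what the numbers in the question imply; the function names are illustrative.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Side length of a 'valid' convolution output (no padding)."""
    return (input_size - kernel_size) // stride + 1


def pool_output_size(input_size, window):
    """Side length after non-overlapping pooling with the given window."""
    return input_size // window


feature_map = conv_output_size(50, 7)        # (50 - 7 + 1) = 44
pooled = pool_output_size(feature_map, 11)   # 44 / 11 = 4
num_filters = 25
flattened = num_filters * pooled * pooled    # 25 * 4 * 4 = 400
```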
While learning the weights, I don't use the layer wise strategy as in Deep Belief Networks (Unsupervised Learning), but instead use supervised learning and learn the weights of all the layers simultaneously. Is this correct or is there any other way to learn the weights?
Is what I have understood correct?
So if I want to use DBNs for image classification, I should resize all my images to a particular size (say 200x200) and have that many neurons in the input layer, whereas in the case of CNNs, I train only on a smaller patch of the input (say 10 x 10 for an image of size 200x200) and convolve the learnt weights over the entire image?
Do DBNs provide better results than CNNs or is it purely dependent on the dataset?
Thank You.
Generally speaking, DBNs are generative neural networks that stack Restricted Boltzmann Machines (RBMs). You can think of RBMs as being generative autoencoders; if you want a deep belief net you should be stacking RBMs and not plain autoencoders, as Hinton and his student Teh proved that stacking RBMs results in sigmoid belief nets.
Convolutional neural networks have performed better than DBNs by themselves in current literature on benchmark computer vision datasets such as MNIST. If the dataset is not a computer vision one, then DBNs can most definitely perform better. In theory, DBNs should be the best models, but it is very hard to estimate joint probabilities accurately at the moment. You may be interested in Lee et al.'s (2009) work on Convolutional Deep Belief Networks, which looks to combine the two.
I will try to explain the situation through the example of learning shoes.
If you use a DBN to learn those images, here are the bad things that will happen in your learning algorithm:
there will be shoes in different places;
all the neurons will try to learn not only the shoes but also the position of the shoes in the images, because the weights have no concept of a 'local image patch'.
A DBN makes sense if all your images are aligned in terms of size, translation and rotation.
The idea of convolutional networks is that there is a concept called weight sharing. Let me try to extend this 'weight sharing' concept:
first you look at 7x7 patches and, following your example, you can say that 3 of the neurons in your first layer learned the 'front', 'back-bottom' and 'back-upper' parts of shoes, as these would look alike within a 7x7 patch across all shoes.
Normally the idea is to have multiple convolution layers one after another to learn
lines/edges in the first layer,
arcs, corners in the second layer,
higher-level concepts in higher layers, such as the front of a shoe, an eye in a face, a wheel on a car, or rectangles, cones and triangles: primitive, yet combinations of the previous layers' outputs.
You can think of these 3 different things I mentioned as 3 different neurons. Such areas/neurons will fire when there are shoes in some part of the image.
Pooling preserves your higher activations while sub-sampling your images, creating a lower-dimensional space that makes things computationally easier and feasible.
So at the last layer, when you look at your 25x4x4 output, in other words a 400-dimensional vector, if there is a shoe somewhere in the picture your 'shoe neuron(s)' will be active, whereas the non-shoe neurons will be close to zero.
And to learn which neurons are for shoes and which ones are not, you feed that 400-dimensional vector to another supervised classifier (this can be anything, such as a multi-class SVM or, as you said, a softmax layer).
I advise you to have a glance at Fukushima's 1980 paper to understand what I am trying to say about translation invariance and the line -> arc -> semicircle -> shoe front -> shoe idea (even just looking at the images in the paper will give you some idea).
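The weight-sharing point can be seen in a few lines: one shared detector slid over the whole image, followed by max pooling, responds the same wherever the pattern sits. This is a toy sketch; the 2x2 pattern is a made-up stand-in for a "shoe part", and pooling over the entire feature map is a simplification of the windowed pooling discussed above.

```python
def correlate_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' mode, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def global_max(feature_map):
    """Max pooling over the whole feature map."""
    return max(max(row) for row in feature_map)


def place_pattern(size, pattern, top, left):
    """Blank size x size image with the pattern pasted at (top, left)."""
    image = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image


pattern = [[1.0, 0.0], [0.0, 1.0]]   # toy stand-in for a "shoe part"
kernel = pattern                     # one shared detector for that part
response_top_left = global_max(correlate_valid(place_pattern(6, pattern, 0, 0), kernel))
response_bottom_right = global_max(correlate_valid(place_pattern(6, pattern, 4, 4), kernel))
response_blank = global_max(correlate_valid([[0.0] * 6 for _ in range(6)], kernel))
```

A fully connected layer, by contrast, has separate weights for every pixel position, so it would have to learn the same part once per location.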

Neural Network Picture Classification

I would like to implement a Picture Classification using Neural Network. I want to know the way to select the Features from the Picture and the number of Hidden units or Layers to go with.
For now I have an idea of resizing the image to 50x50 or smaller so that the number of features is smaller and all inputs have a constant size. The features would be the RGB values of each pixel. Will that be fine, or is there a better way?
Also, I decided to go with 1 hidden layer with half as many units as the inputs. I can change the number to get better results. Or would I require more layers?
There are numerous image data sets that are successfully learned by neural networks, like
MNIST (here you will find many links to papers)
and CIFAR-10/100.
Note that you need many training examples. Usually one hidden layer is sufficient. But it can be hard to determine the "right" number of neurons. Sometimes the number of hidden neurons should even be greater than the number of inputs. When you use 2 or more hidden layers you will usually need fewer hidden nodes and the training will be faster. But when you have too many hidden layers it can be difficult to train the weights in the first layer.
A kind of neural network designed especially for images is the convolutional neural network. They usually work much better than multilayer perceptrons and are much faster.
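One reason for the difference is the number of weights to learn. The sketch below compares the first layer of the multilayer perceptron from the question (50x50 RGB input, hidden layer with half as many units) against a convolutional layer. The conv layer's size (25 filters of 7x7) is an assumption chosen purely for illustration, not something the question specifies.

```python
def dense_params(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return inputs * outputs + outputs


def conv_params(kernel_size, in_channels, num_filters):
    """Weights plus biases of a convolutional layer with shared filters."""
    return kernel_size * kernel_size * in_channels * num_filters + num_filters


inputs = 50 * 50 * 3                       # 50x50 RGB image, flattened: 7500 values
hidden = inputs // 2                       # half as many hidden units as inputs
mlp_first_layer = dense_params(inputs, hidden)
conv_first_layer = conv_params(kernel_size=7, in_channels=3, num_filters=25)
```

Under these assumptions the fully connected layer has about 28 million parameters while the convolutional layer has a few thousand.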
A 50x50 image gives a feature matrix of 2500 pixels, i.e. 7500 values with RGB channels. Your neural network may memorize this but will most probably perform poorly on other images.
Therefore this type of problem is more about image processing and feature extraction. Your features will change according to your requirements. See this similar question about image processing and neural networks
A 1-layer network will only be suitable for linear problems. Are you sure your problem is linear? Otherwise you will need a multi-layer neural network.
