Deep Belief Networks vs Convolutional Neural Networks - machine-learning

I am new to the field of neural networks and I would like to know the difference between Deep Belief Networks and Convolutional Networks.
Also, is there a Deep Convolutional Network which is the combination of Deep Belief and Convolutional Neural Nets?
This is what I have gathered till now. Please correct me if I am wrong.
For an image classification problem, Deep Belief networks have many layers, each of which is trained using a greedy layer-wise strategy.
For example, if my image size is 50 x 50, and I want a Deep Network with 4 layers namely
Input Layer
Hidden Layer 1 (HL1)
Hidden Layer 2 (HL2)
Output Layer
My input layer will have 50 x 50 = 2500 neurons, HL1 = 1000 neurons (say) , HL2 = 100 neurons (say) and output layer = 10 neurons,
in order to train the weights (W1) between Input Layer and HL1, I use an AutoEncoder (2500 - 1000 - 2500) and learn W1 of size 2500 x 1000 (This is unsupervised learning). Then I feed forward all images through the first hidden layers to obtain a set of features and then use another autoencoder ( 1000 - 100 - 1000) to get the next set of features and finally use a softmax layer (100 - 10) for classification. (only learning the weights of the last layer (HL2 - Output which is the softmax layer) is supervised learning).
(I could use RBM instead of autoencoder).
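To make the bookkeeping of the greedy layer-wise scheme described above concrete, here is a minimal sketch. The function name `greedy_layerwise_plan` is made up for illustration; it only lists which autoencoder would be trained at each step and does not train anything.

```python
def greedy_layerwise_plan(layer_sizes):
    """Return the autoencoder shapes used for pretraining and the final classifier shape.

    layer_sizes lists every layer from input to output, e.g. [2500, 1000, 100, 10].
    """
    hidden_pairs = list(zip(layer_sizes[:-2], layer_sizes[1:-1]))
    # Each hidden layer is pretrained by an autoencoder: n_in -> n_hidden -> n_in.
    autoencoders = [(n_in, n_hidden, n_in) for n_in, n_hidden in hidden_pairs]
    # The last pair of layers is the supervised softmax classifier.
    classifier = (layer_sizes[-2], layer_sizes[-1])
    return autoencoders, classifier


autoencoders, classifier = greedy_layerwise_plan([2500, 1000, 100, 10])
print(autoencoders)  # [(2500, 1000, 2500), (1000, 100, 1000)]
print(classifier)    # (100, 10)
```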
If the same problem was solved using Convolutional Neural Networks, then for 50x50 input images, I would develop a network using only 7 x 7 patches (say). My layers would be
Input Layer (7 x 7 = 49 neurons)
HL1 (25 neurons for 25 different features) - (convolution layer)
Pooling Layer
Output Layer (Softmax)
And for learning the weights, I take 7 x 7 patches from images of size 50 x 50, and feed forward through convolutional layer, so I will have 25 different feature maps each of size (50 - 7 + 1) x (50 - 7 + 1) = 44 x 44.
I then use a window of, say, 11 x 11 for pooling and hence get 25 feature maps of size 4 x 4 as the output of the pooling layer. I use these feature maps for classification.
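The size arithmetic above can be checked with a short sketch. It assumes a convolution without padding, stride 1, and non-overlapping pooling windows; the helper names are chosen just for this example.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Side length of a 'valid' convolution output (no padding)."""
    return (input_size - kernel_size) // stride + 1


def pool_output_size(input_size, window):
    """Side length after non-overlapping pooling with a square window."""
    return input_size // window


conv_side = conv_output_size(50, 7)          # 50 - 7 + 1 = 44
pool_side = pool_output_size(conv_side, 11)  # 44 / 11 = 4
n_features = 25 * pool_side * pool_side      # 25 maps of 4 x 4 = 400 values
print(conv_side, pool_side, n_features)
```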
While learning the weights, I don't use the layer wise strategy as in Deep Belief Networks (Unsupervised Learning), but instead use supervised learning and learn the weights of all the layers simultaneously. Is this correct or is there any other way to learn the weights?
Is what I have understood correct?
So if I want to use DBN's for image classification, I should resize all my images to a particular size (say 200x200) and have that many neurons in the input layer, whereas in case of CNN's, I train only on a smaller patch of the input (say 10 x 10 for an image of size 200x200) and convolve the learnt weights over the entire image?
Do DBNs provide better results than CNNs or is it purely dependent on the dataset?
Thank You.

Generally speaking, DBNs are generative neural networks that stack Restricted Boltzmann Machines (RBMs). You can think of RBMs as being generative autoencoders; if you want a deep belief net you should be stacking RBMs and not plain autoencoders, as Hinton and his student Teh showed that stacking RBMs results in sigmoid belief nets.
Convolutional neural networks have performed better than DBNs by themselves in current literature on benchmark computer vision datasets such as MNIST. If the dataset is not a computer vision one, then DBNs can most definitely perform better. In theory, DBNs should be the best models, but it is very hard to estimate joint probabilities accurately at the moment. You may be interested in Lee et al.'s (2009) work on Convolutional Deep Belief Networks, which looks to combine the two.

I will try to explain the situation using the example of learning to recognize shoes.
If you use a DBN to learn those images, here is the bad thing that will happen in your learning algorithm:
There will be shoes in different places.
All the neurons will try to learn not only the shoes but also their position in the image, because the weights have no concept of a 'local image patch'.
A DBN makes sense if all your images are aligned in terms of size, translation and rotation.
The idea of convolutional networks is built on a concept called weight sharing. Let me apply this 'weight sharing' concept to your example:
First you looked at 7x7 patches. As an example, you can say that 3 of your neurons in the first layer learned the 'front', 'back-bottom' and 'back-upper' parts of shoes, as these would look alike in a 7x7 patch across all shoes.
Normally the idea is to have multiple convolution layers one after another to learn
lines/edges in the first layer,
arcs, corners in the second layer,
higher-level concepts in higher layers, like the front of a shoe, an eye in a face, a wheel in a car, or primitives such as rectangles, cones and triangles, each built as a combination of the previous layers' outputs.
You can think of these 3 different things I told you as 3 different neurons. And such areas/neurons in your images will fire when there are shoes in some part of the image.
Pooling will protect your higher activations while sub-sampling your images and creating a lower-dimensional space to make things computationally easier and feasible.
So at the last layer, when you look at your 25 x 4 x 4 output, in other words a 400-dimensional vector, if there is a shoe somewhere in the picture your 'shoe neuron(s)' will be active, whereas non-shoe neurons will be close to zero.
And to find out which neurons are for shoes and which ones are not, you feed that 400-dimensional vector to another supervised classifier (this can be anything, like a multi-class SVM or, as you said, a softmax layer).
I advise you to have a glance at Fukushima's 1980 paper to understand what I am trying to say about translation invariance and the line -> arc -> semicircle -> shoe front -> shoe idea (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf). Even just looking at the images in the paper will give you some idea.
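The effect of weight sharing plus pooling described in this answer can be illustrated with a toy sketch. Everything here is made up for illustration: a 2 x 2 pattern stands in for a 'shoe part', the filter is set by hand to match it rather than learned, and pooling is reduced to a single global maximum.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (cross-correlation, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def global_max(feature_map):
    """Pool a whole feature map down to its strongest response."""
    return max(max(row) for row in feature_map)


def place(pattern, top, left, size=8):
    """Put a small pattern on an empty size x size canvas."""
    canvas = [[0] * size for _ in range(size)]
    for a, row in enumerate(pattern):
        for b, value in enumerate(row):
            canvas[top + a][left + b] = value
    return canvas


pattern = [[1, 0], [1, 1]]  # toy "shoe part"
kernel = pattern            # hand-set filter that matches that part exactly

top_left = place(pattern, 0, 0)
bottom_right = place(pattern, 5, 4)

# The same weights respond equally strongly wherever the part appears.
print(global_max(conv2d_valid(top_left, kernel)))      # 3
print(global_max(conv2d_valid(bottom_right, kernel)))  # 3
```

A fully connected layer would need separate weights for each of the two positions; the shared filter does not.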

Related

How should I optimize neural network for image classification using pretrained models

Thank you for viewing my question. I'm trying to do image classification based on some pre-trained models; the images should be classified into 40 classes. I want to use the VGG and Xception pre-trained models to convert each image to two 1000-dimensional vectors and stack them into a 1*2000-dimensional vector as the input of my network, and the network has a 40-dimensional output. The network has 2 hidden layers, one with 1024 neurons and the other with 512 neurons.
Structure:
image-> vgg(1*1000 dimensions), xception(1*1000 dimensions)->(1*2000 dimensions) as input -> 1024 neurons -> 512 neurons -> 40 dimension output -> softmax
However, using this structure I can only achieve about 30% accuracy. So my question is: how could I optimize the structure of my network to achieve higher accuracy? I'm new to deep learning, so I'm not quite sure my current design is 'correct'. I'm really looking forward to your advice.
I'm not entirely sure I understand your network architecture, but some pieces don't look right to me.
There are two major transfer learning scenarios:
ConvNet as fixed feature extractor. Take a pretrained network (either VGG or Xception will do; you do not need both), remove the last fully-connected layer (this layer’s outputs are the 1000 class scores for a different task like ImageNet), then treat the rest of the ConvNet as a fixed feature extractor for the new dataset. For example, in an AlexNet, this would compute a 4096-D vector for every image that contains the activations of the hidden layer immediately before the classifier. Once you extract the 4096-D codes for all images, train a linear classifier (e.g. Linear SVM or Softmax classifier) for the new dataset.
Tip #1: take only one pretrained network.
Tip #2: no need for multiple hidden layers for your own classifier.
Fine-tuning the ConvNet. The second strategy is to not only replace and retrain the classifier on top of the ConvNet on the new dataset, but to also fine-tune the weights of the pretrained network by continuing the backpropagation. It is possible to fine-tune all the layers of the ConvNet, or it’s possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier features of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that should be useful to many tasks, but later layers of the ConvNet become progressively more specific to the details of the classes contained in the original dataset.
Tip #3: keep the early pretrained layers fixed.
Tip #4: use a small learning rate for fine-tuning because you don't want to distort other pretrained layers too quickly and too much.
This architecture much more closely resembles the ones I have seen solve the same problem, and it has a higher chance of reaching high accuracy.
There are a couple of steps you may try when the model is not fitting well:
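The "fixed feature extractor plus linear classifier" scenario can be sketched without any framework. This is only an illustration under loud assumptions: the 2-D points below are toy stand-ins for the codes a frozen pretrained network would produce (no real network is called), there are 3 classes instead of 40, and the classifier is a plain softmax regression trained with batch gradient descent.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    return [sum(w * v for w, v in zip(w_k, x)) + b_k for w_k, b_k in zip(weights, biases)]


def train_softmax(features, labels, n_classes, lr=0.1, epochs=100):
    """Train a single linear layer on top of fixed features (no hidden layers)."""
    n, dim = len(features), len(features[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, y in zip(features, labels):
            probs = softmax(class_scores(weights, biases, x))
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == y else 0.0)  # d(loss)/d(score_k)
                grad_b[k] += err
                for d in range(dim):
                    grad_w[k][d] += err * x[d]
        for k in range(n_classes):
            biases[k] -= lr * grad_b[k] / n
            for d in range(dim):
                weights[k][d] -= lr * grad_w[k][d] / n
    return weights, biases


def predict(weights, biases, x):
    scores = class_scores(weights, biases, x)
    return scores.index(max(scores))


# Hypothetical stand-in for extracted features: three well-separated clusters.
random.seed(0)
centers = [(3.0, 0.0), (-1.5, 2.6), (-1.5, -2.6)]
features, labels = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(20):
        features.append([cx + random.gauss(0, 0.3), cy + random.gauss(0, 0.3)])
        labels.append(label)

weights, biases = train_softmax(features, labels, n_classes=3)
accuracy = sum(predict(weights, biases, x) == y for x, y in zip(features, labels)) / len(labels)
print(accuracy)
```

With real extracted features the only change would be the contents of `features` and `labels`; the pretrained network itself is never updated in this scenario.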
Increase training time and decrease learning rate. It may be stopping at very bad local optima.
Add additional layers that can extract specific features for the large number of classes.
Create multiple two-class deep networks for each class ('yes' or 'no' output class). This will let each network be more specialized for each class, rather than training one single network to learn all 40 classes.
Increase training samples.

Can't understand how filters in a Conv net are calculated

I've been studying machine learning for 4 months, and I understand the concepts behind the MLP. The problem came when I started reading about Convolutional Neural Networks. Let me tell you what I know and then ask what I'm having trouble with.
The core parts of a CNN are:
Convolutional Layer: you have "n" number of filters that you use to generate "n" feature maps.
RELU Layer: you use it for normalizing the output of the convolutional layer.
Sub-sampling Layer: used for "generating" a new feature map that represents more abstract concepts.
Repeat the first 3 layers several times; the last part is a common classifier, such as an MLP.
My doubts are the following:
How do I create the filters used in the Convolutional Layer? Do I have to create a filter, train it, and then put it in the Conv Layer, or do I train it with the backpropagation algorithm?
Imagine I have a conv layer with 3 filters, then it will output 3 feature maps. After applying the RELU and Sub-sampling layer, I will still have 3 feature maps (smaller ones). When passing again through the Conv Layer, how do I calculate the output? Do I have to apply the filter in each feature map separately, or do some kind of operation over the 3 feature maps and then make the sum? I don't have any idea of how to calculate the output of this second Conv Layer, and how many feature maps it will output.
How do I pass the data from the Conv layers to the MLP (for classification in the last part of the NN)?
If someone knows of a simple implementation of a CNN without using a framework I will appreciate it. I think the best way of learning how stuff works is by doing it by yourself. In another time, when you already know how stuff works, you can use frameworks, because they save you a lot of time.
You train it with the backpropagation algorithm, the same way you train an MLP.
You apply each filter to every feature map of the previous layer and sum the results into one output feature map, so the layer outputs one feature map per filter. For example, if you have 10 feature maps in the first layer and a filter in the second layer has shape 3*3, then a 3*3 filter is applied to each of the ten feature maps in the first layer, with different weights for each feature map, so in this case one filter has 3*3*10 weights.
To understand it more easily, keep in mind that a pixel of a non-grayscale image has three values (red, green and blue), so if you're passing images to a convolutional neural network, then in the input layer you already have 3 feature maps (for RGB), and one value in the next layer will be connected to all 3 feature maps in the first layer.
You should flatten the convolutional feature maps, for example if you have 10 feature maps with the size of 5*5, then you will have a layer with 250 values and then nothing different from MLP, you connect all of these artificial neurons to all of the artificial neurons in the next layer by weights.
Here someone has implemented convolutional neural network without frameworks.
I would also recommend you those lectures.
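The second-layer computation and the flattening step described in this answer can be sketched in plain Python. This is a minimal illustration, not a full CNN: biases and the activation are left out, all inputs and weights are set to 1.0 so the result is easy to check, and the function names are made up for this example.

```python
def conv_multi_channel(feature_maps, filt):
    """Apply one filter to a stack of input feature maps.

    feature_maps: list of C maps, each H x W.
    filt: list of C kernels, each k x k (one kernel per input map).
    Returns one (H-k+1) x (W-k+1) map: the per-map responses summed.
    """
    k = len(filt[0])
    out_h = len(feature_maps[0]) - k + 1
    out_w = len(feature_maps[0][0]) - k + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for fmap, kernel in zip(feature_maps, filt):
        for i in range(out_h):
            for j in range(out_w):
                out[i][j] += sum(
                    fmap[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k)
                )
    return out


def flatten(feature_maps):
    """Turn a stack of 2-D maps into one flat vector for the MLP."""
    return [v for fmap in feature_maps for row in fmap for v in row]


# 10 input maps of 7x7 and one filter with 3*3*10 weights.
inputs = [[[1.0] * 7 for _ in range(7)] for _ in range(10)]
one_filter = [[[1.0] * 3 for _ in range(3)] for _ in range(10)]
n_weights = sum(len(row) for kernel in one_filter for row in kernel)  # 90

out_map = conv_multi_channel(inputs, one_filter)  # one 5x5 map, every value 90.0
second_layer = [out_map for _ in range(10)]       # pretend the layer has 10 such filters
flat = flatten(second_layer)                      # 10 * 5 * 5 = 250 values
print(n_weights, len(out_map), len(out_map[0]), out_map[0][0], len(flat))
```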

Digit Recognition on CNN

I am testing printed digits (0-9) on a Convolutional Neural Network. It gives 99+% accuracy on the MNIST dataset, but when I generated images from fonts installed on my computer (Arial, Calibri, Cambria, Cambria Math, Times New Roman) and trained on them (104 images per font; 25 fonts in total, 4 images per font with little difference between them), the training error rate does not go below 80%, i.e. 20% accuracy. Why?
Here is "2" number Images sample -
I resized every image 28 x 28.
Here is more detail :-
Training data size = 28 x 28 images.
Network parameters - As LeNet5
Architecture of Network -
Input Layer -28x28
| Convolutional Layer - (Relu Activation);
| Pooling Layer - (Tanh Activation)
| Convolutional Layer - (Relu Activation)
| Local Layer(120 neurons) - (Relu)
| Fully Connected (Softmax Activation, 10 outputs)
This works, giving 99+% accuracy on MNIST. Why is it so bad with computer-generated fonts? A CNN can handle a lot of variance in data.
I see two likely problems:
Preprocessing: MNIST is not only 28px x 28px, but also:
The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.
Source: MNIST website
Overfitting:
MNIST has 60,000 training examples and 10,000 test examples. How many do you have?
Did you try dropout (see paper)?
Did you try dataset augmentation techniques? (e.g. slightly shifting the image, probably changing the aspect ratio a bit, you could also add noise - however, I don't think those will help)
Did you try smaller networks? (And how big are your filters / how many filters do you have?)
Remarks
Interesting idea! Did you try simply applying the trained MNIST network on your data? What are the results?
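The centering step from the MNIST preprocessing quoted above can be sketched as follows. This is a simplified illustration, not the original NIST pipeline: it assumes the digit has already been scaled into a 20x20 box, shifts by whole pixels only, and uses a toy 4x4 blob instead of a real digit.

```python
def center_of_mass(image):
    """Return (row, col) of the intensity-weighted center of a 2-D list."""
    total = sum(sum(row) for row in image)
    r = sum(i * v for i, row in enumerate(image) for v in row) / total
    c = sum(j * v for row in image for j, v in enumerate(row)) / total
    return r, c


def center_in_field(image, field=28):
    """Translate `image` so its center of mass lands at the center of a field x field canvas."""
    r, c = center_of_mass(image)
    mid = (field - 1) / 2
    dr, dc = round(mid - r), round(mid - c)  # whole-pixel shift (simplifying assumption)
    canvas = [[0] * field for _ in range(field)]
    for i, row in enumerate(image):
        for j, v in enumerate(row):
            ti, tj = i + dr, j + dc
            if 0 <= ti < field and 0 <= tj < field:
                canvas[ti][tj] = v
    return canvas


# Toy 20x20 "digit": a 4x4 blob stuck in the top-left corner.
digit = [[1 if i < 4 and j < 4 else 0 for j in range(20)] for i in range(20)]
centered = center_in_field(digit)
print(center_of_mass(digit))     # (1.5, 1.5)
print(center_of_mass(centered))  # (13.5, 13.5)
```

Font-rendered digits that are merely resized to 28 x 28 will generally not satisfy this centering, which is one reason a network trained on MNIST may see them as very different inputs.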
It may be an overfitting problem. This can happen when your network is too complex for the problem it is trying to solve.
Check this article: http://es.mathworks.com/help/nnet/ug/improve-neural-network-generalization-and-avoid-overfitting.html
It definitely looks like an issue of overfitting. I see that you have two convolution layers, two max pooling layers and two fully connected. But how many weights total? You only have 96 examples per class, which is certainly smaller than the number of weights you have in your CNN. Remember that you want at least 5 times more instances in your training set than weights in your CNN.
You have two solutions to improve your CNN:
Shake each instance in the training set: shift each digit by about 1 pixel in every direction. This alone multiplies your training set by 9.
Use a transformer layer. It adds an elastic deformation to each digit at each epoch. This strengthens learning considerably by artificially increasing your training set. Moreover, it makes the network much more effective at predicting other fonts.
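The first suggestion (shifting each digit by one pixel) is easy to sketch. The example below assumes images are plain 2-D lists and that pixels shifted in from outside the image are background (0); the elastic deformation of the second suggestion is not shown.

```python
def shift(image, dr, dc, fill=0):
    """Shift a 2-D list by (dr, dc) pixels, padding the exposed border with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            si, sj = i - dr, j - dc
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = image[si][sj]
    return out


def shifted_copies(image):
    """All 9 shifts by -1, 0 or +1 pixel in each direction (includes the original)."""
    return [shift(image, dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


digit = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
copies = shifted_copies(digit)
print(len(copies))  # 9
```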

Convolutional ImageNet network is invariant to flipping images

I am using the Caffe deep learning framework for image classification.
I have coins with faces. Some of them face left, some of them face right.
To classify them I am using the common approach: take the weights and structure from a pretrained ImageNet network that has already captured a lot of image patterns, and train mostly the last layer to fit my training set.
But I have found that the network does not work on this set:
I took some coins, for example left-facing ones, generated a horizontally flipped image for each, and labeled it as right-facing.
For this set the convolutional net gets ~50% accuracy, which is exactly a random result.
I have also tried to train the net on 2 images (2 flipped versions of the letter "h"), but with the same result: 50%. (If I choose two different letters and train the net on an augmented dataset, I reach 100% accuracy very fast.) But the invariance to flipping breaks my classification.
My question is: is there some approach that lets me use the advantages of a pretrained ImageNet network but somehow break this invariance? And which layer of the net makes the invariance possible?
I am using "caffe" for generating net based on this example approach:
https://github.com/BVLC/caffe/blob/master/examples/02-fine-tuning.ipynb
Caffe's basic/baseline models trained on ImageNet mostly use a very trivial image augmentation: flipping images horizontally. That is, ImageNet classes are indeed the same when flipped horizontally. Thus, the weights you are trying to fine-tune were trained in a setting where a horizontal flip should be ignored, and I suppose what you see is a net that captured this quite well: it is no longer sensitive to this particular transformation.
It is not trivial to tell at what layer of the net this invariance arises, and therefore it is not easy to say which layers should be fine-tuned to overcome this behavior. I suppose this invariance is quite fundamental to the network, and I would not be surprised if it required re-training the entire net.

How to fit a classifier with high accuracy on the training set with low features?

I have input (r,c) in range (0, 1] as the coordinate of a pixel of an image and its color 1 or 2 only.
I have about 6,400 pixels.
My attempt at fitting X=(r,c) and y=color was a failure: the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image, the 2nd is the image I use to train on, it has only 2 colors. The last is the image that the neural network generated with about 500 weights training with 50 iterations. Input Layer is 2, one hidden layer of size 100, and the output layer is 2. (for binary classification like this, I may need only one output layer but I am just preparing for multi-class classification)
The classifier failed to fit the training set, why is that? I tried generating high polynomial terms of those 2 features but it doesn't help. I tried using Gaussian kernel and random 20-100 landmarks on the picture to add more features, also got similar output. I tried using logistic regressions, doesn't help.
Please help me increase the accuracy.
Here's the input: input.txt (you can load it into Octave; the variables are coordinate (the r,c features) and idx (the color)).
You can try plotting it first to make sure that you understand the input then try training on it and tell me if you get better result.
Your problem is hard to model. You are trying to fit a function from R^2 to R which has lots of complexity: lots of "spikes" and lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not a useful one. In order to overfit your network to such a setting you will need plenty of hidden units. So what are the options for doing so?
General things that are missing in the question, and are important
Your output variable should be {0, 1} if you are fitting your network through cross entropy cost (log likelihood), which you should use for classification.
50 iterations (if you are talking about mini-batch iterations) is orders of magnitude too small, unless you mean 50 epochs (iterations over the whole training set).
Actual things, that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or Tanh, hard to say looking at the output) - you can instead use RBF activations, and increase number of hidden neurons to ~5000,
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit a function of this complexity. Try an architecture of type 100-100-100 instead.
If the above fails - increase number of hidden units, that's all you need - enough capacity.
In general: neural networks are not designed for working with low-dimensional datasets. The pixel-position-to-color mapping is a nice example from the web of something you can learn, but it is completely artificial and seems to actually harm people's intuitions.
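To put the capacity argument in numbers, here is a small sketch that counts the trainable parameters (weights plus biases) of a fully connected network. It assumes one bias per non-input unit; the helper name is made up for this example.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(count_parameters([2, 100, 2]))            # 502
print(count_parameters([2, 100, 100, 100, 2]))  # 20702
```

The 2-100-2 network from the question has about 500 parameters for roughly 6,400 pixels, while the suggested 100-100-100 architecture has about 20,700, which is why it has a much better chance of memorizing the image.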

Resources