Digit Recognition on CNN - machine-learning

I am testing printed digits (0-9) on a Convolutional Neural Network. It is giving 99+ % accuracy on the MNIST Dataset, but when I tried it using fonts installed on computer (Ariel, Calibri, Cambria, Cambria math, Times New Roman) and trained the images generated by fonts (104 images per font(Total 25 fonts - 4 images per font(little difference)) the training error rate does not go below 80%, i.e. 20% accuracy. Why?
Here is "2" number Images sample -
I resized every image 28 x 28.
Here is more detail :-
Training data size = 28 x 28 images.
Network parameters - As LeNet5
Architecture of Network -
Input Layer -28x28
| Convolutional Layer - (Relu Activation);
| Pooling Layer - (Tanh Activation)
| Convolutional Layer - (Relu Activation)
| Local Layer(120 neurons) - (Relu)
| Fully Connected (Softmax Activation, 10 outputs)
This works, giving 99+% accuracy on MNIST. Why is so bad with computer-generated fonts? A CNN can handle lot of variance in data.

I see two likely problems:
Preprocessing: MNIST is not only 28px x 28px, but also:
The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.
Source: MNIST website
Overfitting:
MNIST has 60,000 training examples and 10,000 test examples. How many do you have?
Did you try dropout (see paper)?
Did you try dataset augmentation techniques? (e.g. slightly shifting the image, probably changing the aspect ratio a bit, you could also add noise - however, I don't think those will help)
Did you try smaller networks? (And how big are your filters / how many filters do you have?)
Remarks
Interesting idea! Did you try simply applying the trained MNIST network on your data? What are the results?

It may be an overfitting problem. It could happen when your network is too complex for the problem to resolve.
Check this article: http://es.mathworks.com/help/nnet/ug/improve-neural-network-generalization-and-avoid-overfitting.html

It definitely looks like an issue of overfitting. I see that you have two convolution layers, two max pooling layers and two fully connected. But how many weights total? You only have 96 examples per class, which is certainly smaller than the number of weights you have in your CNN. Remember that you want at least 5 times more instances in your training set than weights in your CNN.
You have two solutions to improve your CNN:
Shake each instance in the training set. You each number about 1 pixel around. It will already multiply your training set by 9.
Use a transformer layer. It will add an elastic deformation to each number at each epoch. It will strengthen a lot the learning by artificially increase your training set. Moreover, it will make it much more effective to predict other fonts.

Related

Data normalization Convolutional Autoencoders

Iam a little bit confused about how to normalize/standarize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my traning images consists of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are to options to pre-process the images:
- normalization
- standarization (z-score)
When normalizing using the MinMax approach (scaling between 0-1) the network works fine, but my question here is:
- When using the min max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standarization, the loss decreases for the two first epochs, after that it stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
cheers,
Mike
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used it in a denoising application. You can use quantiles 5% and 95% instead, or use z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note however that this remark on activation only concerns the output layer, any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.

Poor performance on digit recognition with CNN trained on MNIST dataset

I trained a CNN (on tensorflow) for digit recognition using MNIST dataset.
Accuracy on test set was close to 98%.
I wanted to predict the digits using data which I created myself and the results were bad.
What I did to the images written by me?
I segmented out each digit and converted to grayscale and resized the image into 28x28 and fed to the model.
How come that I get such low accuracy on my data set where as such high accuracy on test set?
Are there other modifications that i'm supposed to make to the images?
EDIT:
Here is the link to the images and some examples:
Excluding bugs and obvious errors, my guess would be that your problem is that you are capturing your hand written digits in a way that is too different from your training set.
When capturing your data you should try to mimic as much as possible the process used to create the MNIST dataset:
From the oficial MNIST dataset website:
The original black and white (bilevel) images from NIST were size
normalized to fit in a 20x20 pixel box while preserving their aspect
ratio. The resulting images contain grey levels as a result of the
anti-aliasing technique used by the normalization algorithm. the
images were centered in a 28x28 image by computing the center of mass
of the pixels, and translating the image so as to position this point
at the center of the 28x28 field.
If your data has a different processing in the training and test phases then your model is not able to generalize from the train data to the test data.
So I have two advices for you:
Try to capture and process your digit images so that they look as similar as possible to the MNIST dataset;
Add some of your examples to your training data to allow your model to train on images similar to the ones you are classifying;
For those still have a hard time with the poor quality of CNN based models for MNIST:
https://github.com/christiansoe/mnist_draw_test
Normalization was the key.

How to fit a classifier with high accuracy on the training set with low features?

I have input (r,c) in range (0, 1] as the coordinate of a pixel of an image and its color 1 or 2 only.
I have about 6,400 pixels.
My attempt of fitting X=(r,c) and y=color was a failure the accuracy won't go higher than 70%.
Here's the image:
The first is the actual image, the 2nd is the image I use to train on, it has only 2 colors. The last is the image that the neural network generated with about 500 weights training with 50 iterations. Input Layer is 2, one hidden layer of size 100, and the output layer is 2. (for binary classification like this, I may need only one output layer but I am just preparing for multi-class classification)
The classifier failed to fit the training set, why is that? I tried generating high polynomial terms of those 2 features but it doesn't help. I tried using Gaussian kernel and random 20-100 landmarks on the picture to add more features, also got similar output. I tried using logistic regressions, doesn't help.
Please help me increase the accuracy.
Here's the input:input.txt (you can load it into Octave the variable is coordinate (r,c features) and idx (color)
You can try plotting it first to make sure that you understand the input then try training on it and tell me if you get better result.
Your problem is hard to model. You are trying to fit function from R^2 to R, which has lots of complexity - lots of "spikes", lots of discontinuous regions (pixels that are completely separated from the rest). This is not an easy problem, and not usefull one.. In order to overfit your network to such setting you will need plenty of hidden units. Thus, what are the options to do so?
General things that are missing in the question, and are important
Your output variable should be {0, 1} if you are fitting your network through cross entropy cost (log likelihood), which you should use for classification.
50 iteraions (if you are talking about some mini-batch iteraions) is orders of magnitude to small, unless you mean 50 epochs (iterations over whole training set).
Actual things, that will probably need to be done (at least one of the below):
I assume that you are using ReLU activations (or Tanh, hard to say looking at the output) - you can instead use RBF activations, and increase number of hidden neurons to ~5000,
If you do not want to go with RBFs, then you will need 1-2 additional hidden layers to fit function of this complexity. Try architecture of type 100-100-100 instaed.
If the above fails - increase number of hidden units, that's all you need - enough capacity.
In general: neural networks are not designed for working with low dimensional datasets. This is nice example from the web, that you can learn pix-pos to color mapping, but it is completely artificial and seems to actually harm people intuitions.

Deep Belief Networks vs Convolutional Neural Networks

I am new to the field of neural networks and I would like to know the difference between Deep Belief Networks and Convolutional Networks.
Also, is there a Deep Convolutional Network which is the combination of Deep Belief and Convolutional Neural Nets?
This is what I have gathered till now. Please correct me if I am wrong.
For an image classification problem, Deep Belief networks have many layers, each of which is trained using a greedy layer-wise strategy.
For example, if my image size is 50 x 50, and I want a Deep Network with 4 layers namely
Input Layer
Hidden Layer 1 (HL1)
Hidden Layer 2 (HL2)
Output Layer
My input layer will have 50 x 50 = 2500 neurons, HL1 = 1000 neurons (say) , HL2 = 100 neurons (say) and output layer = 10 neurons,
in order to train the weights (W1) between Input Layer and HL1, I use an AutoEncoder (2500 - 1000 - 2500) and learn W1 of size 2500 x 1000 (This is unsupervised learning). Then I feed forward all images through the first hidden layers to obtain a set of features and then use another autoencoder ( 1000 - 100 - 1000) to get the next set of features and finally use a softmax layer (100 - 10) for classification. (only learning the weights of the last layer (HL2 - Output which is the softmax layer) is supervised learning).
(I could use RBM instead of autoencoder).
If the same problem was solved using Convolutional Neural Networks, then for 50x50 input images, I would develop a network using only 7 x 7 patches (say). My layers would be
Input Layer (7 x 7 = 49 neurons)
HL1 (25 neurons for 25 different features) - (convolution layer)
Pooling Layer
Output Layer (Softmax)
And for learning the weights, I take 7 x 7 patches from images of size 50 x 50, and feed forward through convolutional layer, so I will have 25 different feature maps each of size (50 - 7 + 1) x (50 - 7 + 1) = 44 x 44.
I then use a window of say 11x11 for pooling hand hence get 25 feature maps of size (4 x 4) for as the output of the pooling layer. I use these feature maps for classification.
While learning the weights, I don't use the layer wise strategy as in Deep Belief Networks (Unsupervised Learning), but instead use supervised learning and learn the weights of all the layers simultaneously. Is this correct or is there any other way to learn the weights?
Is what I have understood correct?
So if I want to use DBN's for image classification, I should resize all my images to a particular size (say 200x200) and have that many neurons in the input layer, whereas in case of CNN's, I train only on a smaller patch of the input (say 10 x 10 for an image of size 200x200) and convolve the learnt weights over the entire image?
Do DBNs provide better results than CNNs or is it purely dependent on the dataset?
Thank You.
Generally speaking, DBNs are generative neural networks that stack Restricted Boltzmann Machines (RBMs) . You can think of RBMs as being generative autoencoders; if you want a deep belief net you should be stacking RBMs and not plain autoencoders as Hinton and his student Yeh proved that stacking RBMs results in sigmoid belief nets.
Convolutional neural networks have performed better than DBNs by themselves in current literature on benchmark computer vision datasets such as MNIST. If the dataset is not a computer vision one, then DBNs can most definitely perform better. In theory, DBNs should be the best models but it is very hard to estimate joint probabilities accurately at the moment. You may be interested in Lee et. al's (2009) work on Convolutional Deep Belief Networks which looks to combine the two.
I will try to explain the situation through learning shoes.
If you use DBN to learn those images here is the bad thing that will happen in your learning algorithm
there will be shoes on different places.
all the neurons will try to learn not only shoes but also the place of the shoes in the images because it will not have the concept of 'local image patch' inside weights.
DBN makes sense if all your images are aligned by means of size, translation and rotation.
the idea of convolutional networks is that, there is a concept called weight sharing. If I try to extend this 'weight sharing' concept
first you looked at 7x7 patches, and according to your example - as an example of 3 of your neurons in the first layer you can say that they learned shoes 'front', 'back-bottom' and 'back-upper' parts as these would look alike for a 7x7 patch through all shoes.
Normally the idea is to have multiple convolution layers one after another to learn
lines/edges in the first layer,
arcs, corners in the second layer,
higher concepts in higher layers like shoes front, eye in a face, wheel in a car or rectangles cones triangles as primitive but yet combinations of previous layers outputs.
You can think of these 3 different things I told you as 3 different neurons. And such areas/neurons in your images will fire when there are shoes in some part of the image.
Pooling will protect your higher activations while sub-sampling your images and creating a lower-dimensional space to make things computationally easier and feasible.
So at last layer when you look at your 25X4x4, in other words 400 dimensional vector, if there is a shoe somewhere in the picture your 'shoe neuron(s)' will be active whereas non-shoe neurons will be close to zero.
And to understand which neurons are for shoes and which ones are not you will put that 400 dimensional vector to another supervised classifier(this can be anything like multi-class-SVM or as you said a soft-max-layer)
I can advise you to have a glance at Fukushima 1980 paper to understand what I try to say about translation invariance and line -> arc -> semicircle -> shoe front -> shoe idea (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf). Even just looking at the images in the paper will give you some idea.

Overflowing of neural network weights in training

I'm training my neural network to classify some things in an image. I crop 40x40 pixels images and classify it that it as some object or not. So it has 1600 input neurons, 3 hidden layers (500, 200, 30) and 1 output neuron that must say 1 or 0. I use the Flood library.
I cannot train it with QuasiNewtonMethod, because it uses a big matrix in the algorithm and it do not fit in my memory. So I use GradientDescent and the ObjectiveFunctional is NormalizedSquaredError.
The problem is that by training it overflows the weights and the output of the neural network is INF or NaN for every input.
Also my dataset is too big (about 800mb when it is in CSV) and I can't load it fully. So I made many InputTargetDataSets with 1000 instances and saved it as XML (the default format for Flood) and training it for one epoch on each dataset randomly shuffled. But also when I train it just on one big dataset (10000 instances) it overflows.
Why is this happening and how can I prevent that?
I would recommend normalization of inputs. You should also think about that if you have 1600 neurons..output of input layer will sum(if sigmoid neurons) and there can be many problems.
It is quite useful to print out some steps..for example in which step it overflows.
There are some tips for weights of neurons. I would recommend very small < 0.01. Maybe if you could give more info about NN and intervals of inputs, weights etc. I could give you some other ideas.
And btw I think it is mathematically proved that two layers should be enough so there is no need for three hidden layers if you are not using some specialized algorithms which simulate human eye..

Resources