I'm building a neural network for Image classificaion/recognition. There are 1000 images (30x30 greyscale) for each of the 10 classes. Images of different classes are placed in different folders. I'm planning to use Back-propagation algorithm to train the net.
Does the order in which I feed training examples into the net affect it's convergence?
Should I feed training examples in random order?
First I will answer your question
Yes it will affect it's convergence
Yes it's encouraged to do that, it's called randomized arrangement
But why?
referenced from here
A common example in most ANN software is IRIS data, where you have 150 instances comprising your dataset. These are about three different types of Iris flowers (Versicola, Virginics, and Setosa). The data set contains measurements of four variables (sepal length and width, and petal length and width). The cases are arranged so that the first case 50 cases belong to Setosa, while cases 51-100 belong to Versicola, and the rest belong to Virginica. Now, what you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 instances in Versicola class, then all 50 in Virginics class, then all 50 in Setosa class. Without randomization your training set wont represent all the classes and, hence, no convergence, and will fail to generalize.
Another example, in the past I also have 100 images for each Alphabets (26 classes),
When I trained them ordered (per alphabet), it failed to converged but after I randomized it got converged easily because the neural network can generalize the alphabets.
Related
I'm quite new to neural network and I recently built neural network for number classification in vehicle license plate. It has 3 layers: 1 input layer for 16*24(382 neurons) number image with 150 dpi , 1 hidden layer(199 neurons) with sigmoid activation function, 1 softmax output layer(10 neurons) for each number 0 to 9.
I'm trying to expand my neural network to also classify letters in license plate. But I'm worried if I just simply add more classes into output, for example add 10 letters into classification so total 20 classes, it would be hard for neural network to separate feature from each class. And also, I think it might cause problem when input is one of number and neural network wrongly classifies as one of letter with biggest probability, even though sum of probabilities of all number output exceeds that.
So I wonder if it is possible to build hierarchical neural network in following manner:
There are 3 neural networks: 'Item', 'Number', 'Letter'
'Item' neural network classifies whether input is numbers or letters.
If 'Item' neural network classifies input as numbers(letters), then input goes through 'Number'('Letter') neural network.
Return final output from Number(Letter) neural network.
And learning mechanism for each network is below:
'Item' neural network learns all images of numbers and letters. So there are 2 output.
'Number'('Letter') neural network learns images of only numbers(letter).
Which method should I pick to have better classification? Just simply add 10 more classes or build hierarchical neural networks with method above?
I'd strongly recommend training only a single neural network with outputs for all the kinds of images you want to be able to detect (so one output node per letter you want to be able to recognize, and one output node for every digit you want to be able to recognize).
The main reason for this is because recognizing digits and recognizing letters is really kind of exactly the same task. Intuitively, you can understand a trained neural network with multiple layers as performing the recognition in multiple steps. In the hidden layer it may learn to detect various kinds of simple, primitive shapes (e.g. the hidden layer may learn to detect vertical lines, horizontal lines, diagonal lines, certain kinds of simple curved shapes, etc.). Then, in the weights between hidden and output layers, it may learn how to recognize combinations of multiple of these primitive shapes as a specific output class (e.g. a vertical and a horizontal line in roughly the correct locations may be recoginzed as a capital letter L).
Those "things" it learns in the hidden layer will be perfectly relevant for digits as well as letters (that vertical line which may indicate an L may also indicate a 1 when combined with other shapes). So, there are useful things to learn that are relevant for both ''tasks'', and it will probably be able to learn these things more easily if it can learn them all in the same network.
See also a this answer I gave to a related question in the past.
I'm trying to expand my neural network to also classify letters in license plate. But i'm worried if i just simply add more classes into output, for example add 10 letters into classification so total 20 classes, it would be hard for neural network to separate feature from each class.
You're far from where it becomes problematic. ImageNet has 1000 classes and is commonly done in a single network. See the AlexNet paper. If you want to learn more about CNNs, have a look at chapter 2 of "Analysis and Optimization of
Convolutional Neural Network Architectures". And when you're on it, see chapter 4 for hirarchical classification. You can read the summary for ... well, a summary of it.
I would like to know how to define or represent a negative training set if I would want to train a binary classifier from a pre-trained model say, AlexNet on ILSVRC12 (or ImageNet) dataset. What I am currently thinking of is to take one the classes which is not related as the negative training set while the one which is related as positive one. Is there any better way which is more elegant?
The CNNs trained on the ILSVRC data set are already discriminating among 1000 classes of images. Yes, you can use one of those topologies to train a binary classifier, but I suggest that you start with an untrained model and run it through your two chosen classes. If you start with a trained model, you have to unlearn a lot, and your result is still trying to discriminate among 1000 classes: that last FC layer is going to give you trouble.
There are ways to work around the 1000-class problem. If your application already overlaps one or more of the trained classes, then simply add a layer that maps those classes to label "1" and all the others to label "0".
If you're insistent on retaining the trained kernels, then try replacing the final FC layer (1000) with a 2-class FC layer. Then choose your two classes (applicable images vs everything else) and run your training.
I want to classify the input as one of 3 possibilities. Is it better to use 3 networks with one output each or 1 network with 3 outputs?
(i.e. 3 networks that output 0 or 1 or 1 network that outputs a one hot vector of length 3 [1,0,0]
Does the answer change depending on how complex the incoming data is to classify?
At what amount of outputs does it make sense to partition the networks (if ever)? For example, if I want to classify into 20 groups, does it make a difference?
I would say it would make more sense to use a single network with multiple outputs.
The main reason is that hidden layers (I'm assuming you'll have at least one hidden layer) can be interpreted as transforming the data from the original space (feature space) into a different space that is more suitable for the task (classification in your case). For example, when training a network to recognize faces from raw pixels, it might use a hidden layer to first detect simple shapes such as small lines based on pixels, then use another hidden layer to detect simple shapes such as eyes/noses based on the lines from the first layer, etc. (it may not be entirely as ''clean'' as this, but this is an easy-to-understand example).
Such a transformation that a network can learn is typically useful for the classification task, regardless of what class the specific example has. For example, it is useful to be able to detect eyes in images regardless of whether or not the actual image contains a face; if you do indeed detect two eyes, you can classify it as a face, and otherwise you classify it as not being a face. In both cases, you were looking for eyes.
So, by splitting up into multiple networks, you may end up learning quite similar patterns in all networks anyway. Then you might as well have saved yourself the computational effort and just learned it once.
Another disadvantage of splitting up into multiple networks would be that you would probably cause your dataset to become imbalanced (or more imbalanced if it already is imbalanced). Suppose you have three classes, with exactly 1/3 of the dataset belonging to each class. If you use three networks for three binary classification tasks, you suddenly always have 1/3 ''1'' classes and 2/3 ''0'' classes. A network may then become biased towards predicting 0s everywhere, since those are the majority classes in each of the three separate problems.
Note that this is all based on my intuition; the best solution if you have time would be to simply try both approaches and test! I don't think I have ever seen someone using multiple networks for a single classification task in practice though, so if you only have time for one approach I'd recommend going for a single network.
I think the only case where it would really make sense to use multiple networks would be if you actually want to predict multiple unrelated values (or at least values that are not strongly related). For example, if, given images, you want to 1) predict whether or not there is a dog on the image, and 2) whether it is a photograph or a painting. Then it may be better to use two networks with two outputs each, instead of a single network with four outputs.
In this case i want to make letter recognition, the letter is scanned from a paper. the result of that process i have 5 x 5 binary matrix. so, it would use 25 input node. but i don't understand how to determine total hidden layer nodes and outputs node for that cases.i want to build the architecture of multilayer perecptron for that cases. thanks for your help!
Every NN has three types of layers: input, hidden, and output.
Creating the NN architecture therefore means coming up with values for the number of layers of each type and the number of nodes in each of these layers.
The Input Layer
Simple--every NN has exactly one of them--no exceptions that I'm aware of.
With respect to the number of neurons comprising this layer, this parameter is completely and uniquely determined once you know the shape of your training data. Specifically, the number of neurons comprising that layer is equal to the number of features (columns) in your data. Some NN configurations add one additional node for a bias term.
The Output Layer
Like the Input layer, every NN has exactly one output layer. Determining its size (number of neurons) is simple; it is completely determined by the chosen model configuration.
Is your NN going running in Machine Mode or Regression Mode (the ML convention of using a term that is also used in statistics but assigning a different meaning to it is very confusing). Machine mode: returns a class label (e.g., "Premium Account"/"Basic Account"). Regression Mode returns a value (e.g., price).
If the NN is a regressor, then the output layer has a single node.
If the NN is a classifier, then it also has a single node unless softmax is used
in which case the output layer has one node per class label in your model.
The Hidden Layers
So those few rules set the number of layers and size (neurons/layer) for both the input and output layers. That leaves the hidden layers.
How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all. Of course, you don't need an NN to resolve your data either, but it will still do the job.
Beyond that, as you probably know, there's a mountain of commentary on the question of hidden layer configuration in NNs (see the insanely thorough and insightful NN FAQ for an excellent summary of that commentary). One issue within this subject on which there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very small. One hidden layer is sufficient for the large majority of problems.
So what about size of the hidden layer(s)--how many neurons? There are some empirically-derived rules-of-thumb, of these, the most commonly relied on is 'the optimal size of the hidden layer is usually between the size of the input and size of the output layers'. Jeff Heaton, author of Introduction to Neural Networks in Java offers a few more.
In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.
Optimization of the Network Configuration
Pruning describes a set of techniques to trim network size (by nodes not layers) to improve computational performance and sometimes resolution performance. The gist of these techniques is removing nodes from the network during training by identifying those nodes which, if removed from the network, would not noticeably affect network performance (i.e., resolution of the data). (Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training; look weights very close to zero--it's the nodes on either end of those weights that are often removed during pruning.) Obviously, if you use a pruning algorithm during training then begin with a network configuration that is more likely to have excess (i.e., 'prunable') nodes--in other words, when deciding on a network architecture, err on the side of more neurons, if you add a pruning step.
Put another way, by applying a pruning algorithm to your network during training, you can approach optimal network configuration; whether you can do that in a single "up-front" (such as a genetic-algorithm-based algorithm) I don't know, though I do know that for now, this two-step optimization is more common.
Formula
One additional rule of thumb for supervised learning networks, the upperbound on the number of hidden neurons that won't result in over-fitting is:
Others recommend setting alpha to a value between 5 and 10, but I find a value of 2 will often work without overfitting. As explained by this excellent NN Design text, you want to limit the number of free parameters in your model (its degree or number of nonzero weights) to a small portion of the degrees of freedom in your data. The degrees of freedom in your data is the number samples * degrees of freedom (dimensions) in each sample or Ns∗(Ni+No) (assuming they're all independent). So alpha is a way to indicate how general you want your model to be, or how much you want to prevent overfitting.
For an automated procedure you'd start with an alpha of 2 (twice as many degrees of freedom in your training data as your model) and work your way up to 10 if the error for training data is significantly smaller than for the cross-validation data set.
References
Advameg (2016) Comp.Ai.Neural-nets FAQ, part 1 of 7: Introduction. Available at: http://www.faqs.org/faqs/ai-faq/neural-nets/part1/preamble.html
How to choose the number of hidden layers and nodes in a feedforward neural network? (2016a) Available at: https://stats.stackexchange.com/a/136542
How to choose the number of hidden layers and nodes in a feedforward neural network? (2016b) Available at: https://stats.stackexchange.com/a/1097
Legal, H.R. - and Info, C. (2016) Introduction to neural networks for java, 2nd edition. Available at: http://www.heatonresearch.com/book/programming-neural-networks-java-2.html
Working on something for, I need to compare some classification techniques (support vector machines, neural networks, decision trees, etc). My contact person at university told me to use the Kaggle data set https://www.kaggle.com/c/GiveMeSomeCredit/data.
The data set consist of a training set of 150,000 borrowers and a test set of 100,000 borrowers. For me, only the training set is useful, because the test set does not have the outcome of the borrowers.
My questions is, how many instances should I use, keeping in mind the computational effort of a large data set. In the papers which I've used for my literature study the size of the data sets vary from 500 to 2500 instances.
How much instances would you use?
split the data,
training on 90% and testing on the remaining 10%:
size = int(len(brown_tagged_sents) * 0.9)
size 4160