I have recently found this answer to a question regarding choosing a good neural network structure, specifically in regard to the hidden layers and their neurons. The answer gives a general formula that produces a non-optimal but good starting point for the network.
Nh = (Ns)/(a * (Ni + No))
Where:
Ni = number of input neurons.
No = number of output neurons.
Ns = number of samples in training data set.
a = an arbitrary scaling factor usually 2-10.
So what does this mean if I have 8 inputs, 8 outputs (regression mode), and a training set of 37 million examples? Plugging in the numbers with a scaling factor of 10 is over 230,000 hidden neurons, and reducing that scaling factor will only cause it to go up. Surely this is far too many?
For clarification, the training set is stock market candlestick data, and I am attempting to train the network to predict what the next candle might look like given the current one (and perhaps a few previous ones too). I plan on using backpropagation, which I know is very resource intensive, so surely I don't need hundreds of thousands of neurons in the hidden layers to make the network effective, would I?
Related
I am new to Machine/Deep learning area!
If I understood correctly, when I am using images as an input,
the number of neurons at input layer = the number of pixels (i.e resolution)
The weights and biases are updated through back-propagation to achieive low as possible error-rate.
Question 1.
So, even one single image data will adjust the values of weights & biases (through back-propagation algorithm), then how does adding more similar images into this MLP improve the performance?
(I must be missing something big.. however to me, it seems like it will only be optimised for the given single image and if i input the next one (of similar img), it will only be optimised for the next one )
Question 2.
If I want to train my MLP to recognise certain types of images ( Let's say clothes / animals ) , what is a good number of training set for each label(i.e clothes,animals)? I know more training set will produce better result, however how much number would be ideal for good enough performance?
Question 3. (continue)
A bit different angle question,
There is a google cloud vision API , which will take images as an input, and produce label/probability as an output. So this API will give me an output of 100 (lets say) labels and the probabilities of each label.
(e.g, when i put an online game screenshot, it will produce as below,)
Can this type of data be used as an input to MLP to categorise certain type of images?
( Assuming I know all possible types of labels that Google API produces and using all of them as input neurons )
Pixel values represent an image. But also, I think this type of API output results can represent an image in different angle.
If so, what would be the performance difference ?
e.g) when classifying 10 different types of images,
(pixels trained model) vs (output labels trained model)
I can help you with the "intuitive" picture.
First, it may be worth looking at convolution neural nets and deep learning and see how to handle images as input to reduce number of weights. It will not be 1 weight per pixel.
Also, what exactly you mean by "performance"? That is not a well defined question. If you use 1 image, say a cat, do you mean by performance that you can identify cats in other pictures, or how well you are able to get close to your cat?
Imagine you have a table of 3 weights, 1 input and 1 output, and trained your network to have error of < 0.01, and the desired output is 0.5
W1 | W2 | W3 | Output
0.1 0.2 0.05 0.5006
If you retrain the network, you may get a different
W1 | W2 | W3 | Output
0.3 0.2 0.08 0.49983
Since the weights are way different, you can imagine that there are several solutions.
Then, if you add another input, you can imagine that some of those weights which worked for first solution will work for the second.
Then you add another input. Then subset of the solutions with 2 inputs will work for 3 inputs. Etc.
When you have enough unrelated or noisy inputs, you won't find a subset of weights which meet your error criterion. Either you need to add weights (more degrees of freedom) or increase the error target, or both.
Now, you have a learning rate when you train a network. Say you are doing online training (for each input you update the weights), not batch training (you find the error vector for a batch (subset) of the input and you update your weights based on that, 1 time for the batch).
Now, suppose your learning rate was 0.01 and weight of 0.1. Intuitively:
If, for the first input, the first weight had derivative of 5, then your weight has new value of 0.1 - 0.01*5 = 0.05
If you feed your next input, say the derivative was -5. That means that the second input "disagrees" with the first change, and tries to go back to 0.01
If the derivative for the second input was 5, that means that the second weight "agrees" with the first.
If you have 20 inputs, some will pull the value up, some will push the value down. You keep looping through the training and then the value will approach a value which most of the inputs agree on, hence minimizing the error caused by that weight.
For question 2:
My mathematical guts feel tells me you definitely need at least 2*weight number to have any meaning to the training, but you should make that at least 10x the number of weights for the least minimum amount to even make a conclusion about your network, unless you are not trying to guess something new (for example, for xor gate, you can probably get away with way less input than weights, but that is a bit long discussion)
Note:
With 1 image, you can rotate it, stretch it, mix it with other images... to create another images and increase your input set.
If you have a simple input like xor gate, you can create inputs like (0.3, 0.7) (0.3, 0.6) (0.2, 0.8)... to expand your training set.
For question 3:
This is equivalent to chaining google's network with a network you create serially, but training each part separately.
Basically: You have Pictures --> 10 labels input to your network --> your classification
The problem I see there is, you may not know all the possible outputs of google's classification. But say they are consistent,
Is your label same as one of the 10 labels? If so, use the given label. If it is a different type of label, you can use that API to simplify your network. What are the consequences or what is the performance?
That is beyond me. In neural nets, while they have good mathematical theories to tell us what they can do, many posed problems such as the one you asked require either a special mathematical analysis (perhaps get PhD on some insight related to that class of problems) or, as most do, show empirical results.
In this case i want to make letter recognition, the letter is scanned from a paper. the result of that process i have 5 x 5 binary matrix. so, it would use 25 input node. but i don't understand how to determine total hidden layer nodes and outputs node for that cases.i want to build the architecture of multilayer perecptron for that cases. thanks for your help!
Every NN has three types of layers: input, hidden, and output.
Creating the NN architecture therefore means coming up with values for the number of layers of each type and the number of nodes in each of these layers.
The Input Layer
Simple--every NN has exactly one of them--no exceptions that I'm aware of.
With respect to the number of neurons comprising this layer, this parameter is completely and uniquely determined once you know the shape of your training data. Specifically, the number of neurons comprising that layer is equal to the number of features (columns) in your data. Some NN configurations add one additional node for a bias term.
The Output Layer
Like the Input layer, every NN has exactly one output layer. Determining its size (number of neurons) is simple; it is completely determined by the chosen model configuration.
Is your NN going running in Machine Mode or Regression Mode (the ML convention of using a term that is also used in statistics but assigning a different meaning to it is very confusing). Machine mode: returns a class label (e.g., "Premium Account"/"Basic Account"). Regression Mode returns a value (e.g., price).
If the NN is a regressor, then the output layer has a single node.
If the NN is a classifier, then it also has a single node unless softmax is used
in which case the output layer has one node per class label in your model.
The Hidden Layers
So those few rules set the number of layers and size (neurons/layer) for both the input and output layers. That leaves the hidden layers.
How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all. Of course, you don't need an NN to resolve your data either, but it will still do the job.
Beyond that, as you probably know, there's a mountain of commentary on the question of hidden layer configuration in NNs (see the insanely thorough and insightful NN FAQ for an excellent summary of that commentary). One issue within this subject on which there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very small. One hidden layer is sufficient for the large majority of problems.
So what about size of the hidden layer(s)--how many neurons? There are some empirically-derived rules-of-thumb, of these, the most commonly relied on is 'the optimal size of the hidden layer is usually between the size of the input and size of the output layers'. Jeff Heaton, author of Introduction to Neural Networks in Java offers a few more.
In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.
Optimization of the Network Configuration
Pruning describes a set of techniques to trim network size (by nodes not layers) to improve computational performance and sometimes resolution performance. The gist of these techniques is removing nodes from the network during training by identifying those nodes which, if removed from the network, would not noticeably affect network performance (i.e., resolution of the data). (Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training; look weights very close to zero--it's the nodes on either end of those weights that are often removed during pruning.) Obviously, if you use a pruning algorithm during training then begin with a network configuration that is more likely to have excess (i.e., 'prunable') nodes--in other words, when deciding on a network architecture, err on the side of more neurons, if you add a pruning step.
Put another way, by applying a pruning algorithm to your network during training, you can approach optimal network configuration; whether you can do that in a single "up-front" (such as a genetic-algorithm-based algorithm) I don't know, though I do know that for now, this two-step optimization is more common.
Formula
One additional rule of thumb for supervised learning networks, the upperbound on the number of hidden neurons that won't result in over-fitting is:
Others recommend setting alpha to a value between 5 and 10, but I find a value of 2 will often work without overfitting. As explained by this excellent NN Design text, you want to limit the number of free parameters in your model (its degree or number of nonzero weights) to a small portion of the degrees of freedom in your data. The degrees of freedom in your data is the number samples * degrees of freedom (dimensions) in each sample or Ns∗(Ni+No) (assuming they're all independent). So alpha is a way to indicate how general you want your model to be, or how much you want to prevent overfitting.
For an automated procedure you'd start with an alpha of 2 (twice as many degrees of freedom in your training data as your model) and work your way up to 10 if the error for training data is significantly smaller than for the cross-validation data set.
References
Advameg (2016) Comp.Ai.Neural-nets FAQ, part 1 of 7: Introduction. Available at: http://www.faqs.org/faqs/ai-faq/neural-nets/part1/preamble.html
How to choose the number of hidden layers and nodes in a feedforward neural network? (2016a) Available at: https://stats.stackexchange.com/a/136542
How to choose the number of hidden layers and nodes in a feedforward neural network? (2016b) Available at: https://stats.stackexchange.com/a/1097
Legal, H.R. - and Info, C. (2016) Introduction to neural networks for java, 2nd edition. Available at: http://www.heatonresearch.com/book/programming-neural-networks-java-2.html
So I have a set of data, 1900 rows and 22 columns. 21 column is just numbers but that one crucial that I want to train the data on has 3 stages: a,b, and c.
I have tried both decision trees/jungles, and neural networks and no matter how I set them up I can't get more than 55% precision.
Usually it's around 50% accuracy and the best I was ever able to get was 55% overall accuracy and around 70% average.
Should I even use NN on a such small dataset? As I said I tried with other ML algorithms but they don't yield anything better.
I think that there is no clear answer to your question. Low accuracy score may come from a few reasons. I will state some of them in the following points :
When you use decision trees / neural networks - low accuracy may be a result of a wrong setup of metaparameters (like maximum height of a tree or number of trees in DT or wrong topology or data preparation in NN case). What I advise you is to use a grid or random search for both NN and DT to look for the best metaparameters for your algorithm (in case of "static" (not sequential data) packages like e.g. h20 in R or Scikit-learn in Python may do a great job) and in neural network case - normalize your data properly (e.g. subtract mean and divide by standard deviation every x column of your data).
Your dataset might be inconsistent. If e.g. your data has not a property that there exists a functional dependency between x and y (what means that y = f(x) for some f) then what is learnt during a training session is a probability that given x - your example belong to some specified class. This inconsistency might seriously harm your accuracy. What I advice you in this case is to try specify if that phenomenon occurs and then e.g. try to segmentate your data to solve the problem.
Your data set might be simply too small. Try to get more data in this case.
I have dataset which is built from 940 attributes and 450 instance and I'm trying to find the best classifier to get the best results.
I have used every classifier that WEKA suggest (such as J48, costSensitive, combinatin of several classifiers, etc..)
The best solution I have found is J48 tree with accuracy of 91.7778 %
and the confusion matrix is:
394 27 | a = NON_C
10 19 | b = C
I want to get better reuslts in the confution matrix for TN and TP at least 90% accuracy for each.
Is there something that I can do to improve this (such as long time run classifiers which scans all options? other idea I didn't think about?
Here is the file:
https://googledrive.com/host/0B2HGuYghQl0nWVVtd3BZb2Qtekk/
Please help!!
I'd guess that you got a data set and just tried all possible algorithms...
Usually, it is a good to think about the problem:
to find and work only with relevant features(attributes), otherwise
the task can be noisy. Relevant features = features that have high
correlation with class (NON_C,C).
your dataset is biased, i.e. number of NON_C is much higher than C.
Sometimes it can be helpful to train your algorithm on the same portion of positive and negative (in your case NON_C and C) examples. And cross-validate it on natural (real) portions
size of your training data is small in comparison with the number of
features. Maybe increasing number of instances would help ...
...
There are quite a few things you can do to improve the classification results.
First, it seems that your training data is severly imbalanced. By training with that imbalance you are creating a significant bias in almost any classification algorithm
Second, you have a larger number of features than examples. Consider using L1 and/or L2 regularization to improve the quality of your results.
Third, consider projecting your data into a lower dimension PCA space, say containing 90 % of the variance. This will remove much of the noise in the training data.
Fourth, be sure you are training and testing on different portions of your data. From your description it seems like you are training and evaluating on the same data, which is a big no no.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
If we have 10 eigenvectors then we can have 10 neural nodes in input layer.If we have 5 output classes then we can have 5 nodes in output layer.But what is the criteria for choosing number of hidden layer in a MLP and how many neural nodes in 1 hidden layer?
how many hidden layers?
a model with zero hidden layers will resolve linearly separable data. So unless you already know your data isn't linearly separable, it doesn't hurt to verify this--why use a more complex model than the task requires? If it is linearly separable then a simpler technique will work, but a Perceptron will do the job as well.
Assuming your data does require separation by a non-linear technique, then always start with one hidden layer. Almost certainly that's all you will need. If your data is separable using a MLP, then that MLP probably only needs a single hidden layer. There is theoretical justification for this, but my reason is purely empirical: Many difficult classification/regression problems are solved using single-hidden-layer MLPs, yet I don't recall encountering any multiple-hidden-layer MLPs used to successfully model data--whether on ML bulletin boards, ML Textbooks, academic papers, etc. They exist, certainly, but the circumstances that justify their use is empirically quite rare.
How many nodes in the hidden layer?
From the MLP academic literature. my own experience, etc., I have gathered and often rely upon several rules of thumb (RoT), and which I have also found to be reliable guides (ie., the guidance was accurate, and even when it wasn't, it was usually clear what to do next):
RoT based on improving convergence:
When you begin the model building, err on the side of more nodes
in the hidden layer.
Why? First, a few extra nodes in the hidden layer isn't likely do any any harm--your MLP will still converge. On the other hand, too few nodes in the hidden layer can prevent convergence. Think of it this way, additional nodes provides some excess capacity--additional weights to store/release signal to the network during iteration (training, or model building). Second, if you begin with additional nodes in your hidden layer, then it's easy to prune them later (during iteration progress). This is common and there are diagnostic techniques to assist you (e.g., Hinton Diagram, which is just a visual depiction of the weight matrices, a 'heat map' of the weight values,).
RoTs based on size of input layer and size of output layer:
A rule of thumb is for the size of this [hidden] layer to be somewhere
between the input layer size ... and the output layer size....
To calculate the number of hidden nodes we use a general rule of:
(Number of inputs + outputs) x 2/3
RoT based on principal components:
Typically, we specify as many hidden nodes as dimensions [principal
components] needed to capture 70-90% of the variance of the input data
set.
And yet the NN FAQ author calls these Rules "nonsense" (literally) because they: ignore the number of training instances, the noise in the targets (values of the response variables), and the complexity of the feature space.
In his view (and it always seemed to me that he knows what he's talking about), choose the number of neurons in the hidden layer based on whether your MLP includes some form of regularization, or early stopping.
The only valid technique for optimizing the number of neurons in the Hidden Layer:
During your model building, test obsessively; testing will reveal the signatures of "incorrect" network architecture. For instance, if you begin with an MLP having a hidden layer comprised of a small number of nodes (which you will gradually increase as needed, based on test results) your training and generalization error will both be high caused by bias and underfitting.
Then increase the number of nodes in the hidden layer, one at a time, until the generalization error begins to increase, this time due to overfitting and high variance.
In practice, I do it this way:
input layer: the size of my data vactor (the number of features in my model) + 1 for the bias node and not including the response variable, of course
output layer: soley determined by my model: regression (one node) versus classification (number of nodes equivalent to the number of classes, assuming softmax)
hidden layer: to start, one hidden layer with a number of nodes equal to the size of the input layer. The "ideal" size is more likely to be smaller (i.e, some number of nodes between the number in the input layer and the number in the output layer) rather than larger--again, this is just an empirical observation, and the bulk of this observation is my own experience. If the project justified the additional time required, then I start with a single hidden layer comprised of a small number of nodes, then (as i explained just above) I add nodes to the Hidden Layer, one at a time, while calculating the generalization error, training error, bias, and variance. When generalization error has dipped and just before it begins to increase again, the number of nodes at that point is my choice. See figure below.
To automate the selection of the best number of layers and best number of neurons for each of the layers, you can use genetic optimization.
The key pieces would be:
Chromosome: Vector that defines how many units in each hidden layer (e.g. [20,5,1,0,0] meaning 20 units in first hidden layer, 5 in second, ... , with layers 4 and 5 missing). You can set a limit on the maximum number number of layers to try, and the max number of units in each layer. You should also place restrictions of how the chromosomes are generated. E.g. [10, 0, 3, ... ] should not be generated, because any units after a missing layer (the '3,...') would be irrelevant and would waste evaluation cycles.
Fitness Function: A function that returns the reciprocal of the lowest training error in the cross-validation set of a network defined by a given chromosome. You could also include the number of total units, or computation time if you want to find the "smallest/fastest yet most accurate network".
You can also consider:
Pruning: Start with a large network, then reduce the layers and hidden units, while keeping track of cross-validation set performance.
Growing: Start with a very small network, then add units and layers, and again keep track of CV set performance.
It is very difficult to choose the number of neurons in a hidden layer, and to choose the number of hidden layers in your neural network.
Usually, for most applications, one hidden layer is enough. Also, the number of neurons in that hidden layer should be between the number of inputs (10 in your example) and the number of outputs (5 in your example).
But the best way to choose the number of neurons and hidden layers is experimentation. Train several neural networks with different numbers of hidden layers and hidden neurons, and measure the performance of those networks using cross-validation. You can stick with the number that yields the best performing network.
Recently there is theoretical work on this https://arxiv.org/abs/1809.09953. Assuming you use a RELU MLP, all hidden layers have the same number of nodes and your loss function and true function that you're approximating with a neural network obey some technical properties (in the paper), you can choose your depth to be of order $\log(n)$ and your width of hidden layers to be of order $n^{d/(2(\beta+d))}\log^2(n)$. Here $n$ is your sample size, $d$ is the dimension of your input vector, and $\beta$ is a smoothness parameter for your true function. Since $\beta$ is unknown, you will probably want to treat it as a hyperparameter.
Doing this you can guarantee that with probability that converges to $1$ as function of sample size your approximation error converges to $0$ as a function of sample size. They give the rate. Note that this isn't guaranteed to be the 'best' architecture, but it can at least give you a good place to start with. Further, my own experience suggests that things like dropout can still help in practice.