Large weights in NN - machine-learning

My NN when training seems to produce weights that are very large. Is it okay if weights in a neural network have values greater than 1 or less than -1?
Furthermore, the extremely large weights tend to be in the connections between the input and first hidden layer. For example, the input-hidden layer weights look something like this:
-12.728901995585,-13.2337212413569,5.73922593605989,-5.12803672380726......
Whereas the hidden-output weights look more like:
-0.00434217225630834,0.130458439630824,0.153923956195796,0.59407334088441
The NN functions fine, however the large weights are concerning as usually I seem them being between -1 and 1. Are larger weights okay?
Thanks.

According to this example https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/ your NN's weights between input and hidden layers may seem to be off the usual (assuming feed forward NNs). Towards the end of the given example you will see how smaller the changes are generated through back-propagation.
But depending on the training iteration count and NN architecture it is possible to have large weights too. Also larger weights mean the preceding neuron has more effect than the following neuron on NN's decision.
In general NN's interior values are difficult to understand. If your NN does the job, I guess it's OK to have large weights between some layers.
I am unfortunately not very proficient on the subject. So corrections and their explanations are welcome if any. I am looking forward to learn from experts.

Related

Suggestions needed about the generalization of a regression neural network

I've trained a deep neural network of a few hundreds of features which analyzes geo data of a city, and calculate a score per sample based on the profile between the observer and the target location. That is, the longer the distance between the observer and target, the more features I will have for this sample. When I train my NN with samples from part of a city and test with other parts of the same city, the NN works very well, but when I apply my NN to other cities, the NN starts to give high standard deviation of errors, especially on cases which the samples of the city I'm applying the NN to generally has more features than samples of the city I used to train this NN. To deal with that, I've appended 10% of empty samples in training which was able to reduce the errors by half, but the remaining errors are still too large compare to the solutions calculated by hand. May I have some advise of generalize a regression neural network? Thanks!
I was going to ask for more examples of your data, and your network, but it wouldn't really matter.
How to improve the generalization of a regression neural network?
You can use exactly the same things you would use for a classification neural network. The only difference is what it does with the numbers that are output from the penultimate layer!
I've appended 10% of empty samples in training which was able to reduce the errors by half,
I didn't quite understand what that meant (so I'd still be interested if you expanded your question with some more concrete details), but it sounds a bit like using dropout. In Keras you append a Dropout() layer between your other layers:
...
model.append(Dense(...))
model.append(Dropout(0.2))
model.append(Dense(...))
...
0.2 means 20% dropout, which is a nice starting point: you could experiment with values up to about 0.5.
You could read the original paper or this article seems to be a good introduction with keras examples.
The other generic technique is to add some L1 and/or L2 regularization, here is the manual entry.
I typically use a grid search to experiment with each of these, e.g. trying each of 0, 1e-6, 1e-5 for each of L1 and L2, and each of 0, 0.2, 0.4 (usually using the same value between all layers, for simplicity) for dropout. (If 1e-5 is best, I might also experiment with 5e-4 and 1e-4.)
But, remember that even better than the above are more training data. Also consider using domain knowledge to add more data, or more features.

Where do neurons in a neural network share their predictive results (learned functions)?

Definitely a noob NN question, but here it is:
I understand that neurons in a layer of a neural network all initialize with different (essentially random) input-feature weights as a means to vary their back-propagation results so they can converge to different functions describing the input data. However, I do not understand when or how these neurons generating unique functions to describe the input data "communicate" their results with each other, as is done in ensemble ML methods (e.g. by growing a forest of trees with randomized initial-decision criteria and then determining the most discriminative models in the forest). In the trees ensemble example, all of the trees are working together to generalize the rules each model learns.
How, where, and when do neurons communicate their prediction functions? I know individual neurons use gradient descent to converge to their respective functions, but they are unique since they started with unique weights. How do they communicate these differences? I imagine there's some subtle behavior in combining the neuronic results in the output layer where this communication is occurring. Also, is this communication part of the iterative training process?
Someone in the comments section (https://datascience.stackexchange.com/questions/14028/what-is-the-purpose-of-multiple-neurons-in-a-hidden-layer) asked a similar question, but I didn't see it answered.
Help would be greatly appreciated!
During propagation, each neuron typically participates in forming the value in multiple neurons of the next layer. In back-propagation, each of those next-layer neurons will try to push the participating neurons' weights around in order to minimise the error. That's pretty much it.
For example, let's say you're trying to get a NN to recognise digits. Let's say that one neuron in a hidden layer starts getting ideas on recognising vertical lines, another starts finding horisontal lines, and so on. The next-layer neuron that is responsible for finding 1 will see that if it wants to be more accurate, it should pay lots of attention to the vertical line guy; and also the more the horisontal line guy yells the more it's not a 1. That's what weights are: telling each neuron how strongly it should care about each of its inputs. In turn, the vertical line guy will learn how to recognise vertical lines better, by adjusting weights for its input layer (e.g. individual pixels).
(This is quite abstract though. No-one told the vertical line guy that he should be recognising vertical lines. Different neurons just train for different things, and by the virtue of mathematics involved, they end up picking different features. One of them might or might not end up being vertical line.)
There is no "communication" between neurons on the same layer (in the base case, where layers flow linearly from one to the next). It's all about neurons on one layer getting better at predicting features that the next layer finds useful.
At the output layer, the 1 guy might be saying "I'm 72% certain it's a 1", while the 7 guy might be saying "I give that 7 a B+", while the third one might be saying "A horrible 3, wouldn't look at twice". We instead usually either take whoever's loudest's word for it, or we normalise the output layer (divide by the sum of all outputs) so that we have actual comparable probabilities. However, this normalisation is not actually a part of neural network itself.

Why do I get good accuracy with IRIS dataset with a single hidden node?

I have a minimal example of a neural network with a back-propagation trainer, testing it on the IRIS data set. I started of with 7 hidden nodes and it worked well.
I lowered the number of nodes in the hidden layer to 1 (expecting it to fail), but was surprised to see that the accuracy went up.
I set up the experiment in azure ml, just to validate that it wasn't my code. Same thing there, 98.3333% accuracy with a single hidden node.
Can anyone explain to me what is happening here?
First, it has been well established that a variety of classification models yield incredibly good results on Iris (Iris is very predictable); see here, for example.
Secondly, we can observe that there are relatively few features in the Iris dataset. Moreover, if you look at the dataset description you can see that two of the features are very highly correlated with the class outcomes.
These correlation values are linear, single-feature correlations, which indicates that one can most likely apply a linear model and observe good results. Neural nets are highly nonlinear; they become more and more complex and capture greater and greater nonlinear feature combinations as the number of hidden nodes and hidden layers is increased.
Taking these facts into account, that (a) there are few features to begin with and (b) that there are high linear correlations with class, would all point to a less complex, linear function as being the appropriate predictive model-- by using a single hidden node, you are very nearly using a linear model.
It can also be noted that, in the absence of any hidden layer (i.e., just input and output nodes), and when the logistic transfer function is used, this is equivalent to logistic regression.
Just adding to DMlash's very good answer: The Iris data set can even be predicted with a very high accuracy (96%) by using just three simple rules on only one attribute:
If Petal.Width = (0.0976,0.791] then Species = setosa
If Petal.Width = (0.791,1.63] then Species = versicolor
If Petal.Width = (1.63,2.5] then Species = virginica
In general neural networks are black boxes where you never really know what they are learning but in this case back-engineering should be easy. It is conceivable that it learnt something like the above.
The above rules were found by using the OneR package.

Echo state neural network?

Is anyone here who is familiar with echo state networks? I created an echo state network in c#. The aim was just to classify inputs into GOOD and NOT GOOD ones. The input is an array of double numbers. I know that maybe for this classification echo state network isn't the best choice, but i have to do it with this method.
My problem is, that after training the network, it cannot generalize. When i run the network with foreign data (not the teaching input), i get only around 50-60% good result.
More details: My echo state network must work like a function approximator. The input of the function is an array of 17 double values, and the output is 0 or 1 (i have to classify the input into bad or good input).
So i have created a network. It contains an input layer with 17 neurons, a reservoir layer, which neron number is adjustable, and output layer containing 1 neuron for the output needed 0 or 1. In a simpler example, no output feedback is used (i tried to use output feedback as well, but nothing changed).
The inner matrix of the reservoir layer is adjustable too. I generate weights between two double values (min, max) with an adjustable sparseness ratio. IF the values are too big, it normlites the matrix to have a spectral radius lower then 1. The reservoir layer can have sigmoid and tanh activaton functions.
The input layer is fully connected to the reservoir layer with random values. So in the training state i run calculate the inner X(n) reservor activations with training data, collecting them into a matrix rowvise. Using the desired output data matrix (which is now a vector with 1 ot 0 values), i calculate the output weigths (from reservoir to output). Reservoir is fully connected to the output. If someone used echo state networks nows what im talking about. I ise pseudo inverse method for this.
The question is, how can i adjust the network so it would generalize better? To hit more than 50-60% of the desired outputs with a foreign dataset (not the training one). If i run the network again with the training dataset, it gives very good reults, 80-90%, but that i want is to generalize better.
I hope someone had this issue too with echo state networks.
If I understand correctly, you have a set of known, classified data that you train on, then you have some unknown data which you subsequently classify. You find that after training, you can reclassify your known data well, but can't do well on the unknown data. This is, I believe, called overfitting - you might want to think about being less stringent with your network, reducing node number, and/or training based on a hidden dataset.
The way people do it is, they have a training set A, a validation set B, and a test set C. You know the correct classification of A and B but not C (because you split up your known data into A and B, and C are the values you want the network to find for you). When training, you only show the network A, but at each iteration, to calculate success you use both A and B. So while training, the network tries to understand a relationship present in both A and B, by looking only at A. Because it can't see the actual input and output values in B, but only knows if its current state describes B accurately or not, this helps reduce overfitting.
Usually people seem to split 4/5 of data into A and 1/5 of it into B, but of course you can try different ratios.
In the end, you finish training, and see what the network will say about your unknown set C.
Sorry for the very general and basic answer, but perhaps it will help describe the problem better.
If your network doesn't generalize that means it's overfitting.
To reduce overfitting on a neural network, there are two ways:
get more training data
decrease the number of neurons
You also might think about the features you are feeding the network. For example, if it is a time series that repeats every week, then one feature is something like the 'day of the week' or the 'hour of the week' or the 'minute of the week'.
Neural networks need lots of data. Lots and lots of examples. Thousands. If you don't have thousands, you should choose a network with just a handful of neurons, or else use something else, like regression, that has fewer parameters, and is therefore less prone to overfitting.
Like the other answers here have suggested, this is a classic case of overfitting: your model performs well on your training data, but it does not generalize well to new test data.
Hugh's answer has a good suggestion, which is to reduce the number of parameters in your model (i.e., by shrinking the size of the reservoir), but I'm not sure whether it would be effective for an ESN, because the problem complexity that an ESN can solve grows proportional to the logarithm of the size of the reservoir. Reducing the size of your model might actually make the model not work as well, though this might be necessary to avoid overfitting for this type of model.
Superbest's solution is to use a validation set to stop training as soon as performance on the validation set stops improving, a technique called early stopping. But, as you noted, because you use offline regression to compute the output weights of your ESN, you cannot use a validation set to determine when to stop updating your model parameters---early stopping only works for online training algorithms.
However, you can use a validation set in another way: to regularize the coefficients of your regression! Here's how it works:
Split your training data into a "training" part (usually 80-90% of the data you have available) and a "validation" part (the remaining 10-20%).
When you compute your regression, instead of using vanilla linear regression, use a regularized technique like ridge regression, lasso regression, or elastic net regression. Use only the "training" part of your dataset for computing the regression.
All of these regularized regression techniques have one or more "hyperparameters" that balance the model fit against its complexity. The "validation" dataset is used to set these parameter values: you can do this using grid search, evolutionary methods, or any other hyperparameter optimization technique. Generally speaking, these methods work by choosing values for the hyperparameters, fitting the model using the "training" dataset, and measuring the fitted model's performance on the "validation" dataset. Repeat N times and choose the model that performs best on the "validation" set.
You can learn more about regularization and regression at http://en.wikipedia.org/wiki/Least_squares#Regularized_versions, or by looking it up in a machine learning or statistics textbook.
Also, read more about cross-validation techniques at http://en.wikipedia.org/wiki/Cross-validation_(statistics).

Why do we have to normalize the input for an artificial neural network? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
Why do we have to normalize the input for a neural network?
I understand that sometimes, when for example the input values are non-numerical a certain transformation must be performed, but when we have a numerical input? Why the numbers must be in a certain interval?
What will happen if the data is not normalized?
It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is
rarely strictly necessary to standardize the inputs, at least in theory. The
reason is that any rescaling of an input vector can be effectively undone by
changing the corresponding weights and biases, leaving you with the exact
same outputs as you had before. However, there are a variety of practical
reasons why standardizing the inputs can make training faster and reduce the
chances of getting stuck in local optima. Also, weight decay and Bayesian
estimation can be done more conveniently with standardized inputs.
In neural networks, it is good idea not just to normalize data but also to scale them. This is intended for faster approaching to global minima at error surface. See the following pictures:
Pictures are taken from the coursera course about neural networks. Author of the course is Geoffrey Hinton.
Some inputs to NN might not have a 'naturally defined' range of values. For example, the average value might be slowly, but continuously increasing over time (for example a number of records in the database).
In such case feeding this raw value into your network will not work very well. You will teach your network on values from lower part of range, while the actual inputs will be from the higher part of this range (and quite possibly above range, that the network has learned to work with).
You should normalize this value. You could for example tell the network by how much the value has changed since the previous input. This increment usually can be defined with high probability in a specific range, which makes it a good input for network.
There are 2 Reasons why we have to Normalize Input Features before Feeding them to Neural Network:
Reason 1: If a Feature in the Dataset is big in scale compared to others then this big scaled feature becomes dominating and as a result of that, Predictions of the Neural Network will not be Accurate.
Example: In case of Employee Data, if we consider Age and Salary, Age will be a Two Digit Number while Salary can be 7 or 8 Digit (1 Million, etc..). In that Case, Salary will Dominate the Prediction of the Neural Network. But if we Normalize those Features, Values of both the Features will lie in the Range from (0 to 1).
Reason 2: Front Propagation of Neural Networks involves the Dot Product of Weights with Input Features. So, if the Values are very high (for Image and Non-Image Data), Calculation of Output takes a lot of Computation Time as well as Memory. Same is the case during Back Propagation. Consequently, Model Converges slowly, if the Inputs are not Normalized.
Example: If we perform Image Classification, Size of Image will be very huge, as the Value of each Pixel ranges from 0 to 255. Normalization in this case is very important.
Mentioned below are the instances where Normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent
When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect some of the parameters. That leads to large oscillations in the search space, as you are bouncing between steep slopes. To compensate, you have to stabilize optimization with small learning rates.
Consider features x1 and x2, where range from 0 to 1 and 0 to 1 million, respectively. It turns out the ratios for the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. These are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.
I believe the answer is dependent on the scenario.
Consider NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear so that F(A * input) = A * output, then you might choose to either leave the input/output unnormalised in their raw forms, or normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or nearly any task that outputs a probability, where F(A * input) = 1 * output
In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.
What's more, NN is often implemented to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation, causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent NN from learning this variation as a predictor (NN does not see this variation, hence cannot use it).
The reason normalization is needed is because if you look at how an adaptive step proceeds in one place in the domain of the function, and you just simply transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, then you get different results. It boils down to the question of adapting a linear piece to a data point. How much should the piece move without turning and how much should it turn in response to that one training point? It makes no sense to have a changed adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected and I can send you a paper if you write to wwarmstrong AT shaw.ca
On a high level, if you observe as to where normalization/standardization is mostly used, you will notice that, anytime there is a use of magnitude difference in model building process, it becomes necessary to standardize the inputs so as to ensure that important inputs with small magnitude don't loose their significance midway the model building process.
example:
√(3-1)^2+(1000-900)^2 ≈ √(1000-900)^2
Here, (3-1) contributes hardly a thing to the result and hence the input corresponding to these values is considered futile by the model.
Consider the following:
Clustering uses euclidean or, other distance measures.
NNs use optimization algorithm to minimise cost function(ex. - MSE).
Both distance measure(Clustering) and cost function(NNs) use magnitude difference in some way and hence standardization ensures that magnitude difference doesn't command over important input parameters and the algorithm works as expected.
Hidden layers are used in accordance with the complexity of our data. If we have input data which is linearly separable then we need not to use hidden layer e.g. OR gate but if we have a non linearly seperable data then we need to use hidden layer for example ExOR logical gate.
Number of nodes taken at any layer depends upon the degree of cross validation of our output.

Resources