Should you normalize outputs of a neural network for regression tasks? - machine-learning

I've made a CNN that takes a signal as input and outputs the parameters used in a simulation to create that signal. I've heard that for regression tasks you don't normally normalize the outputs of a neural network. But the variables the model is trying to predict have very different standard deviations: one variable is always in the range [1x10^-20, 1x10^-24], while another is almost always in the range [8, 16]. My question is: since all loss functions first take the difference between the target and the actual output, and this difference naturally scales with the std of that output variable, wouldn't the loss of the network depend mostly on the accuracy of the output variables with large stds and hardly at all on those with small stds? Also, would unnormalized outputs hinder training, since the network can get a low loss for an output variable with a very low std just by guessing values close to its mean?
If this is the case why can't I find much on the internet talking about or suggesting to normalize outputs? It seems really important for getting reliable loss values.
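To make the concern concrete, here is a minimal sketch in plain Python. The two target ranges come from the question; the uniform sampling, the "lazy" mean predictor, and the helper names (`standardize`, `unstandardize`) are assumptions made up for illustration. It compares the per-variable MSE of a model that only ever predicts the mean, before and after standardizing the targets.

```python
import random
import statistics

random.seed(0)

# Hypothetical targets: two simulation parameters on very different scales.
n = 1000
small = [random.uniform(1e-24, 1e-20) for _ in range(n)]  # tiny-scale parameter
large = [random.uniform(8.0, 16.0) for _ in range(n)]     # order-10 parameter

def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def standardize(values):
    """Return z-scored values plus the mean and std needed to undo it."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values], mean, std

def unstandardize(z, mean, std):
    """Map a standardized prediction back to the original units."""
    return z * std + mean

# A "lazy" model that always predicts the mean of each target.
raw_loss_small = mse([statistics.mean(small)] * n, small)
raw_loss_large = mse([statistics.mean(large)] * n, large)

# After standardizing, the same lazy model predicts 0 for both targets.
z_small, mu_s, sd_s = standardize(small)
z_large, mu_l, sd_l = standardize(large)
std_loss_small = mse([0.0] * n, z_small)
std_loss_large = mse([0.0] * n, z_large)

print("raw losses:         ", raw_loss_small, raw_loss_large)
print("standardized losses:", std_loss_small, std_loss_large)
```

On the raw targets the small-scale term is many orders of magnitude below the large-scale one, so a summed loss barely notices it. After standardization both terms are on the same footing, and predictions can be mapped back to physical units with `unstandardize` and the stored mean and std.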

Related

Accuracy of Neural Networks in case of doing prediction of a continuous variable

Is there a way to calculate Accuracy instead of Error metrics for neural networks when doing regression (prediction of continuous variable) the same way we do when classifying categorical variables?
Accuracy is a classification concept, but you can still print the predicted values and compare them with the actual values of the dependent variable.
The problem with a continuous variable is that the probability of reproducing a given value exactly is (practically) zero. For instance, if your neural network produces 2.000001 and the actual value is 2, this would count as a wrong prediction because the two values differ (although they are very close). Error metrics such as the root mean square error therefore measure the average (squared) difference instead.
However, depending on your application, you could introduce a threshold value ϵ and consider a given output of your neural network as correct if the absolute value of the difference between the observed value and the output is smaller than ϵ, and then compute the percentage of correct predictions.
In practice such a metric is not minimized directly, because it is piecewise constant and its gradient is zero almost everywhere, but it is still a useful quantity to compute.
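The thresholded accuracy described above can be sketched in a few lines of plain Python. The function name `tolerance_accuracy` and the sample values are made up for illustration; the right ϵ depends entirely on your application.

```python
def tolerance_accuracy(y_true, y_pred, eps):
    """Fraction of predictions whose absolute error is smaller than eps."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) < eps)
    return hits / len(y_true)

# Made-up observed values and network outputs.
y_true = [2.0, 3.5, 10.0, 7.25]
y_pred = [2.000001, 3.9, 10.05, 7.0]

acc = tolerance_accuracy(y_true, y_pred, eps=0.1)
print(acc)  # two of the four predictions fall within 0.1 of the target
```

You would still train with a differentiable loss such as MSE and only report this number alongside it.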

Neural Network Custom Binary Prediction

I am trying to design a neural network that makes a custom binary prediction.
Normally to do binary prediction, I would use a softmax as my last layer, and then my loss could be the difference between the prediction I made and the true binary value.
However, what if I don't want to use a softmax layer? Instead, I output a real-valued number and check whether some condition on this number is true. In a really simple case, I check whether this number is positive. If it is, I predict 1, else I predict 0. Let's say I want all the numbers to be positive, so the true predictions should be all 1, and I want to train this network such that it outputs all positive numbers. I am confused as to how to formulate a loss function for this problem so that I am able to backpropagate and train the network.
Does anyone have an idea how to create this kind of network?
Here's how you should approach it. Effectively, you need to transform the labels into positive and negative target values (say +1 and -1) and solve a regression problem. The loss function can be a simple L1 or L2 loss. The network will learn to output a prediction close to the training target, which you can afterwards interpret by checking which target it is closer to, i.e. positive or negative. You can even make some targets larger (e.g. +2 or +10) to emphasize that those examples are very important. Example code: linear regression in tensorflow.
However, I have to warn you that this approach has serious drawbacks; see for instance this question. A single outlier in the training data can easily skew your predictions. Classification with softmax + cross-entropy loss is more stable, which is why it is almost always the better choice.
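As a rough illustration of both points, here is a sketch in plain Python that replaces the network with a one-dimensional least-squares line (an assumption made to keep the example tiny; the data and the helper names `fit_line` and `predict_labels` are hypothetical). It fits +1/-1 targets with an L2 loss, reads the sign of the output as the class, and then shows how a single far-away but correctly labelled point shifts the decision.

```python
def fit_line(xs, ts):
    """Closed-form least-squares fit of t ~ slope * x + intercept (L2 loss)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    slope = sxy / sxx
    return slope, mean_t - slope * mean_x

def predict_labels(xs, slope, intercept):
    """Interpret a positive output as class 1 and anything else as class 0."""
    return [1 if slope * x + intercept > 0 else 0 for x in xs]

# Clean toy data: negative inputs are class 0 (target -1), positive are class 1 (+1).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ts = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
slope, intercept = fit_line(xs, ts)
clean_preds = predict_labels(xs, slope, intercept)

# Same data plus one extreme, correctly labelled class-1 example at x = 20.
xs_out = xs + [20.0]
ts_out = ts + [1.0]
slope_o, intercept_o = fit_line(xs_out, ts_out)
outlier_preds = predict_labels(xs_out, slope_o, intercept_o)

print(clean_preds)    # every point on the correct side
print(outlier_preds)  # the point at x = 0.5 now lands on the wrong side
```

The regression loss punishes the output at x = 20 for being "too positive", so the line flattens and the boundary moves. A cross-entropy loss does not penalize confident correct predictions in this way.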

Machine learning multi-classification: Why use 'one-hot' encoding instead of a number

I'm currently working on a classification problem with TensorFlow, and I'm new to the world of machine learning, but there is something I don't get.
I have successfully tried to train models that output the y tensor like this:
y = [0,0,1,0]
But I can't understand the principle behind it...
Why not just train the same model to output classes such as y = 3 or y = 4?
This seems much more flexible, because I can imagine having a multi-classification problem with 2 million possible classes, and it would be much more efficient to output a number between 0-2,000,000 than to output a tensor of 2,000,000 items for every result.
What am I missing?
In principle, you could train your model to classify input instances by producing a single output, something like
y=1 means input=dog, y=2 means input=airplane. An approach like that, however, brings a lot of problems:
How do I interpret the output y=1.5?
Why am I trying to regress a number as if I were working with continuous data when, in reality, I'm working with discrete data?
In fact, what you are doing is treating a multi-class classification problem like a regression problem.
This is logically wrong (unless you're doing binary classification; in that case, a positive and a negative output are all you need).
To avoid these (and other) issues, we use a final layer of neurons, one per class, and associate a high activation with the right class.
The one-hot encoding represents the fact that you want to force your network to have a single high-activation output when a certain input is present.
Thus, every input=dog will have 1, 0, 0 as its target output, and so on.
In this way, you're correctly treating a discrete classification problem and producing a discrete, easily interpretable output. In fact, you'll always extract the output neuron with the highest activation using tf.argmax, so even if your network hasn't learned to produce a perfect one-hot encoding, you'll still be able to extract the most likely output without ambiguity.
The answer is in how that final tensor, or single value, is calculated. In an NN, your y=3 would be built by a weighted sum over the values of the previous layer.
Trying to train towards single values would then imply a linear relationship between the category IDs where none exists: for the true value y=4, the output y=3 would be considered better than y=1 even though the category numbering is arbitrary, e.g. 1: dogs, 3: cars, 4: cats.
Neural networks use gradient descent to optimize a loss function. In turn, this loss function needs to be differentiable.
A discrete output would be (indeed is) a perfectly valid and valuable output for a classification network. Problem is, we don't know how to optimize this net efficiently.
Instead, we rely on a continuous loss function. This loss function is usually based on something that is more or less related to the probability of each label -- and for this, you need a network output that has one value per label.
Typically, the output that you describe is then deduced from this soft, continuous output by taking the argmax of these pseudo-probabilities.
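The pipeline these answers describe (one score per class, a continuous loss against a one-hot target, then argmax to recover the discrete label) can be sketched in plain Python. The class names and raw scores below are made up, and the helpers are simple stand-ins for what a framework such as TensorFlow provides.

```python
import math

def one_hot(index, num_classes):
    """Target vector with a single 1.0 at the position of the true class."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def softmax(scores):
    """Turn raw scores into pseudo-probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Continuous, differentiable loss between probabilities and a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def argmax(values):
    """Index of the largest value, i.e. the predicted class ID."""
    return max(range(len(values)), key=lambda i: values[i])

classes = ["dog", "car", "cat"]   # hypothetical labels
scores = [0.2, 1.1, 2.5]          # hypothetical outputs of the last layer
probs = softmax(scores)
target = one_hot(2, len(classes)) # the true class is "cat"

loss = cross_entropy(probs, target)
predicted = classes[argmax(probs)]
print(probs, loss, predicted)
```

Note that the loss only cares about the probability assigned to the true class; no ordering between "dog", "car" and "cat" is implied anywhere.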

How many back propagations to run on a single entry of new data in online machine learning using neural networks?

I want my neural network to be trained on every new data point that it classifies incorrectly. Assuming that I somehow label the data correctly every time the network makes a mistake, how many backpropagation passes do I need to run on this single new instance in order to train my network for that particular case? Is there a better way to train a neural network in real-time scenarios?
It depends on the optimization algorithm you use. The backpropagation by itself calculates only the gradient, which is used by the next iteration of the algorithm.
In the simplest case you can use a self-developed gradient descent and check the behavior of your cost function. If the cost function decreases less than some threshold epsilon, you might break the optimization loop for the current instance. You can also limit the maximum number of iterations.
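A minimal sketch of such a loop, assuming a single linear neuron with squared-error cost instead of a full network (the function name `train_on_instance` and the default values for `lr`, `eps` and `max_iters` are hypothetical choices, not recommendations):

```python
def train_on_instance(w, b, x, y, lr=0.1, eps=1e-6, max_iters=1000):
    """Gradient descent on one (x, y) pair.

    Stops when the cost improves by less than eps between iterations,
    or after max_iters iterations, whichever comes first.
    """
    prev_cost = None
    cost = None
    for step in range(1, max_iters + 1):
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        err = pred - y
        cost = 0.5 * err ** 2
        # Termination test: the decrease in cost fell below the threshold.
        if prev_cost is not None and prev_cost - cost < eps:
            return w, b, cost, step
        # Gradient of the cost with respect to the weights and the bias.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b = b - lr * err
        prev_cost = cost
    return w, b, cost, max_iters

w_new, b_new, final_cost, steps = train_on_instance(
    w=[0.0, 0.0], b=0.0, x=[1.0, 2.0], y=3.0
)
print(steps, final_cost)
```

With a real network the forward pass and gradient would come from backpropagation, but the stopping logic stays the same.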
It is worth using a more advanced optimizer, such as fminunc in Matlab, which stops by itself once it reaches an optimum.
You may find this post about different termination conditions of gradient descent very useful.
I think learning from one single instance is not really efficient. The cost function can behave erratically. You may consider mini-batch learning, where you learn from small batches of new instances. It should give smoother, more stable learning.
In order to illustrate how the network's accuracy depends on the number of iterations and on the batch size, I experimented a bit with a neural network used to recognize handwritten digits. I had 4000 examples in the training set and 1000 examples in the validation set. Then I ran the learning algorithm with different parameters and measured the resulting accuracy. You can see the result here:
Of course this plot describes only my particular case, but you can get some intuition on what to expect and on how to validate network parameters.

Echo state neural network?

Is anyone here familiar with echo state networks? I created an echo state network in C#. The aim was just to classify inputs into GOOD and NOT GOOD ones. The input is an array of double numbers. I know that an echo state network may not be the best choice for this classification, but I have to do it with this method.
My problem is that after training, the network cannot generalize. When I run the network on foreign data (not the training input), I get only around 50-60% correct results.
More details: my echo state network must work like a function approximator. The input of the function is an array of 17 double values, and the output is 0 or 1 (I have to classify the input as bad or good).
So I have created a network. It contains an input layer with 17 neurons, a reservoir layer whose neuron count is adjustable, and an output layer containing 1 neuron for the required output of 0 or 1. In the simpler version, no output feedback is used (I tried output feedback as well, but nothing changed).
The inner matrix of the reservoir layer is adjustable too. I generate weights between two double values (min, max) with an adjustable sparseness ratio. If the values are too big, the matrix is normalized to have a spectral radius lower than 1. The reservoir layer can use sigmoid or tanh activation functions.
The input layer is fully connected to the reservoir layer with random weights. In the training stage I compute the inner reservoir activations X(n) on the training data and collect them row-wise into a matrix. Using the desired output matrix (which is now a vector of 1 or 0 values), I calculate the output weights (from reservoir to output). The reservoir is fully connected to the output. Anyone who has used echo state networks knows what I'm talking about. I use the pseudo-inverse method for this.
The question is: how can I adjust the network so that it generalizes better, i.e. hits more than 50-60% of the desired outputs on a foreign dataset (not the training one)? If I run the network again on the training dataset, it gives very good results, 80-90%, but what I want is better generalization.
I hope someone had this issue too with echo state networks.
If I understand correctly, you have a set of known, classified data that you train on, and then some unknown data that you subsequently classify. You find that after training you can reclassify your known data well, but can't do well on the unknown data. This is, I believe, called overfitting. You might want to think about being less stringent with your network, reducing the number of nodes, and/or validating against a held-out dataset.
The way people do it is, they have a training set A, a validation set B, and a test set C. You know the correct classification of A and B but not C (because you split up your known data into A and B, and C are the values you want the network to find for you). When training, you only show the network A, but at each iteration, to calculate success you use both A and B. So while training, the network tries to understand a relationship present in both A and B, by looking only at A. Because it can't see the actual input and output values in B, but only knows if its current state describes B accurately or not, this helps reduce overfitting.
Usually people seem to split 4/5 of data into A and 1/5 of it into B, but of course you can try different ratios.
In the end, you finish training, and see what the network will say about your unknown set C.
Sorry for the very general and basic answer, but perhaps it will help describe the problem better.
If your network doesn't generalize that means it's overfitting.
To reduce overfitting on a neural network, there are two ways:
get more training data
decrease the number of neurons
You also might think about the features you are feeding the network. For example, if it is a time series that repeats every week, then one feature is something like the 'day of the week' or the 'hour of the week' or the 'minute of the week'.
Neural networks need lots of data. Lots and lots of examples. Thousands. If you don't have thousands, you should choose a network with just a handful of neurons, or else use something else, like regression, that has fewer parameters, and is therefore less prone to overfitting.
Like the other answers here have suggested, this is a classic case of overfitting: your model performs well on your training data, but it does not generalize well to new test data.
Hugh's answer has a good suggestion, which is to reduce the number of parameters in your model (i.e., by shrinking the size of the reservoir), but I'm not sure whether it would be effective for an ESN, because the problem complexity that an ESN can solve grows proportionally to the logarithm of the size of the reservoir. Reducing the size of your model might actually make the model not work as well, though this might be necessary to avoid overfitting for this type of model.
Superbest's solution is to use a validation set to stop training as soon as performance on the validation set stops improving, a technique called early stopping. But, as you noted, because you use offline regression to compute the output weights of your ESN, you cannot use a validation set to determine when to stop updating your model parameters; early stopping only works for online training algorithms.
However, you can use a validation set in another way: to regularize the coefficients of your regression! Here's how it works:
Split your training data into a "training" part (usually 80-90% of the data you have available) and a "validation" part (the remaining 10-20%).
When you compute your regression, instead of using vanilla linear regression, use a regularized technique like ridge regression, lasso regression, or elastic net regression. Use only the "training" part of your dataset for computing the regression.
All of these regularized regression techniques have one or more "hyperparameters" that balance the model fit against its complexity. The "validation" dataset is used to set these parameter values: you can do this using grid search, evolutionary methods, or any other hyperparameter optimization technique. Generally speaking, these methods work by choosing values for the hyperparameters, fitting the model using the "training" dataset, and measuring the fitted model's performance on the "validation" dataset. Repeat N times and choose the model that performs best on the "validation" set.
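The steps above can be sketched in plain Python as follows. Everything concrete here is an assumption for illustration: the random matrix `states` stands in for your collected reservoir activations, the labels are synthetic, ridge regression is picked as the regularized method, and the grid of `lambdas` is arbitrary. A constant 1.0 column is appended as a bias and, for simplicity, is penalized like any other weight.

```python
import random

random.seed(0)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ridge_fit(rows, targets, lam):
    """Ridge regression: solve (X^T X + lam I) w = X^T y."""
    d = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(d)]
    return solve(gram, rhs)

def mse(rows, targets, w):
    """Mean squared error of the linear readout w on the given rows."""
    preds = [sum(wi * ri for wi, ri in zip(w, r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Synthetic stand-in for reservoir states, plus a constant bias column.
n_samples, n_features = 100, 5
states = [[random.uniform(-1, 1) for _ in range(n_features)] + [1.0]
          for _ in range(n_samples)]
rule = [1.0, -2.0, 0.5, 0.0, 0.0]
targets = [1.0 if sum(a * s for a, s in zip(rule, row)) + random.gauss(0, 0.3) > 0
           else 0.0 for row in states]

# Step 1: split into "training" (80%) and "validation" (20%) parts.
split = int(0.8 * n_samples)
train_x, val_x = states[:split], states[split:]
train_y, val_y = targets[:split], targets[split:]

# Steps 2-3: fit on the training part for each lambda, score on validation.
lambdas = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
val_errors = {lam: mse(val_x, val_y, ridge_fit(train_x, train_y, lam))
              for lam in lambdas}
best_lam = min(val_errors, key=val_errors.get)
best_w = ridge_fit(train_x, train_y, best_lam)
print(best_lam, val_errors[best_lam])
```

In your ESN, `ridge_fit` would take the place of the pseudo-inverse step, with the regularization strength chosen by validation error rather than training error.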
You can learn more about regularization and regression at http://en.wikipedia.org/wiki/Least_squares#Regularized_versions, or by looking it up in a machine learning or statistics textbook.
Also, read more about cross-validation techniques at http://en.wikipedia.org/wiki/Cross-validation_(statistics).

Resources