Neural network weird prediction [closed] - machine-learning

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am trying to implement a neural network. I use backpropagation to compute the gradients, then multiply them by the learning rate and subtract them from the corresponding weights (basically gradient descent; please tell me if this is wrong).
The first thing I tried once backpropagation and gradient descent were ready was to train a simple XOR classifier, where the inputs are (0,0), (1,0), (0,1), (1,1) and the corresponding outputs are 0, 1, 1, 0. My network has 2 input units, 1 output unit and one hidden layer with 3 units. When I train it with a learning rate of 3.0 for >100 iterations (I even tried >5000), the cost drops to a specific point and then stays constant. The weights are randomly initialized on every run, but the cost always gets stuck at the same value. After training, the output for any of the above inputs is always 0.5000.
I then changed the inputs to (-1,-1), (1,-1), (-1,1), (1,1) and the outputs to -1, 1, 1, -1. With the same learning rate the cost now keeps dropping no matter how many iterations I run, but the results are still wrong and always very close to 0. I even trained for an insane number of iterations: [iterations: 20kk, inputs: (1, -1), output: 1.6667e-08] and [iterations: 200kk, inputs: (1, -1), output: 1.6667e-09]. For inputs (1, 1) and the others, the output is also very close to 0.
It seems like the output is always mean(min(y), max(y)), no matter in what form I provide the inputs and outputs. I can't figure out what I'm doing wrong. Can someone please help?

There are many places where things could be going wrong:
check your gradients numerically
you have to use nonlinear hidden units to learn XOR: do you have a nonlinear activation there?
you need a bias neuron: do you have one?
Minor things that should not cause the problem you describe, but are worth fixing either way:
do you have a sigmoidal activation in the output node (as your network is a classifier)?
do you train with a cross-entropy cost (although this is a minor problem)?
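A minimal sketch of the numerical gradient check, assuming NumPy and a 2-3-1 network like the one in the question, with sigmoid hidden units, bias terms, a sigmoid output and a cross-entropy cost. The function names (`forward`, `loss`, `backprop`, `numerical_grad`) are illustrative and not taken from the asker's code. It compares backpropagated gradients with central finite differences; a large mismatch would point to a bug in the backward pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    H = sigmoid(X @ W1 + b1)   # nonlinear hidden layer with bias
    P = sigmoid(H @ W2 + b2)   # sigmoid output unit with bias
    return H, P

def loss(params, X, y):
    # Mean binary cross-entropy over the samples.
    _, P = forward(params, X)
    return -np.mean(y * np.log(P) + (1 - y) * np.log(1 - P))

def backprop(params, X, y):
    W1, b1, W2, b2 = params
    H, P = forward(params, X)
    dZ2 = (P - y) / X.shape[0]          # sigmoid + cross-entropy simplifies to P - y
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)    # chain rule through the hidden sigmoid
    # Same order as params: dW1, db1, dW2, db2
    return [X.T @ dZ1, dZ1.sum(axis=0), H.T @ dZ2, dZ2.sum(axis=0)]

def numerical_grad(params, X, y, eps=1e-5):
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            up = loss(params, X, y)
            p[idx] = old - eps
            down = loss(params, X, y)
            p[idx] = old                     # restore the parameter
            g[idx] = (up - down) / (2 * eps)
        grads.append(g)
    return grads

# XOR data from the question.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
params = [rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=(3, 1)), np.zeros(1)]

analytic = backprop(params, X, y)
numeric = numerical_grad(params, X, y)
max_err = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print("largest gradient mismatch:", max_err)
```

If the mismatch is tiny, the backward pass agrees with the cost function and the problem probably lies elsewhere, for example in a missing bias or a missing nonlinearity.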

Related

deep neural network model stops learning after one epoch

I am training an unsupervised NN model and, for some reason, after exactly one epoch (80 steps) the model stops learning.
Do you have any idea why this might happen and what I should do to prevent it?
Here is more information about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is custom: it is the objective function of the optimization problem.
So if my optimization problem is min f(x), then in my DNN loss = f(x). I have 64 inputs, 64 outputs, and 3 layers in between:
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and the last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network.
With more layers and nodes (doubled: 8 layers of 200 nodes each), I can make a little more progress toward a lower error, but again after 100 steps the training error becomes flat!
The symptom is that the training loss stops improving relatively early. Assuming your problem is learnable at all, there are many possible reasons for this behavior. The following are the most relevant:
Improper preprocessing of input: neural networks prefer input with zero mean. E.g., if the input is all positive, it will restrict the weights to be updated in the same direction, which may not be desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling each channel to unit standard deviation may also be helpful.
Generalization ability of the network: the network is not complicated or deep enough for the task.
This is very easy to check. You can train the network on just a few images (say, from 3 to 10). The network should be able to overfit the data and drive the loss to almost 0. If that is not the case, you may have to add more layers, such as using more than 1 Dense layer.
Another good idea is to use pre-trained weights (see Applications in the Keras documentation). You may adjust the Dense layers at the top to fit your problem.
Improper weight initialization: improper weight initialization can prevent the network from converging (https://youtu.be/gYpoJMlgyXA, the same video as before).
For the ReLU activation, you may want to use He initialization instead of the default Glorot initialization. I find that this may be necessary sometimes but not always.
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, deep-viz-keras. They are very useful for opening the black box of convolutional networks.
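Two of the suggestions above, zero-mean/unit-variance inputs and He initialization, can be sketched in NumPy as follows. The image batch, the layer sizes and the helper name `he_normal` are made up for illustration; in Keras you would normally pass `kernel_initializer="he_normal"` to the layer instead of writing this by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 8 RGB images of 16x16 pixels with values in [0, 255].
images = rng.integers(0, 256, size=(8, 16, 16, 3)).astype(np.float64)

# Per-channel standardization: subtract the channel mean, divide by the channel std.
mean = images.mean(axis=(0, 1, 2), keepdims=True)
std = images.std(axis=(0, 1, 2), keepdims=True)
standardized = (images - mean) / (std + 1e-8)   # small constant avoids division by zero

def he_normal(fan_in, fan_out, rng):
    """He initialization: zero-mean Gaussian with std sqrt(2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Weights for a hypothetical dense layer with 512 inputs and 256 ReLU units.
W = he_normal(512, 256, rng)

print("channel means:", standardized.mean(axis=(0, 1, 2)))
print("weight std:", W.std())
```

In practice the mean and standard deviation should be computed on the training set only and reused for validation and test data.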
I faced the same problem; here is what fixed it for me:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!
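As a rough illustration of the label change described above, here is how the two-column one-hot labels could be collapsed into a single 0/1 vector with NumPy and scored with binary cross-entropy. The predicted probabilities and the helper `binary_cross_entropy` are invented for the example; the poster's actual model and blog post are not shown in the thread.

```python
import numpy as np

# One-hot labels in the original format from the post.
one_hot = np.array([[0, 1], [1, 0], [1, 0]])

# Collapse to one integer label per sample (index of the 1 in each row).
labels = one_hot.argmax(axis=1)
print(labels)  # [1 0 0]

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Mean binary cross-entropy for 0/1 targets and predicted P(class 1)."""
    p = np.clip(p, eps, 1.0 - eps)   # keep log() finite
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical model outputs: probability that each sample belongs to class 1.
probs = np.array([0.9, 0.1, 0.2])
bce = binary_cross_entropy(labels, probs)
print("binary cross-entropy:", bce)
```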

Bound output values [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
Here is an open question:
suppose I need to predict a student's exam score given some inputs, e.g. hours spent on prep, previous scores, etc. How should I bound the output between 0 - 100? What are the best practices out there?
Thanks!
Edit:
Since the answers are mostly concerned about bounding model output after we have the predictions, is it possible to train the model beforehand such that this bound is implicitly learned by the model?
You would train an Isotonic Regression model: http://scikit-learn.org/stable/modules/generated/sklearn.isotonic.IsotonicRegression.html
Or you could simply clip the predicted values that are out of bounds.
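A small sketch of both ideas in NumPy, with made-up numbers. Clipping bounds predictions after the fact; squashing the model's raw output through a scaled sigmoid is one common way to build the 0 to 100 bound into the model itself, which is an assumption on my part about what the edit in the question is asking for. The name `bounded_score` is hypothetical.

```python
import numpy as np

# Hypothetical raw predictions from an unbounded regressor.
raw = np.array([-12.5, 48.0, 103.7])

# Option 1: clip after prediction.
clipped = np.clip(raw, 0.0, 100.0)
print(clipped)

# Option 2: make the bound part of the model by passing the last layer's
# output z through a sigmoid scaled to the range (0, 100).
def bounded_score(z):
    return 100.0 / (1.0 + np.exp(-z))

z = np.array([-50.0, 0.0, 50.0])
scores = bounded_score(z)
print(scores)
```

With option 2 the targets stay on the 0 to 100 scale during training and the model cannot produce an out-of-range score.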
It is general practice, when training on multi-flavored data, to scale it to between 0 and 1. For example, say your data was:
[input: [10 hrs studying, 100% on last test], output: [95% on this test] ]
then you should first scale both input and output by dividing each element by its greatest observed value or by its greatest possible value:
input = input/input.max
output = output/100
[input: [0.1 , 1], output: [0.95] ]
When you are done training and want to predict a test score, simply multiply the output by 100 and you are done.
BTW what you want to do is well documented on stephenwelch's Neural Network Youtube series.
You can do either Normalisation or Standardisation. Normalisation transforms your values into [0, 1]; Standardisation rescales them to zero mean and unit standard deviation, so the result is not confined to a fixed range.
I am not sure why you need the range to be 0-100, but if that is really so, you can multiply by 100 after normalising to get that range.
Normalise: Here each value of your feature column is converted like so:
X_new = (X - X_min) / (X_max - X_min)
where X_min and X_max are min and max values in the feature.
Standardise: Here each value of your feature column is converted like so:
X_new = (X - Mean) / StandardDeviation
where Mean and StandardDeviation are the mean and SD values of your feature.
Check which one gives you better results. If your data has extreme outliers, Standardisation might give better results.
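In sklearn, you can use sklearn.preprocessing.MinMaxScaler or sklearn.preprocessing.StandardScaler to do the conversions. (sklearn.preprocessing.normalize does something different: it rescales each sample to unit norm.)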
HTH
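The two formulas above, written out in NumPy on a made-up feature column (the `hours` values are invented). sklearn's `MinMaxScaler` and `StandardScaler` apply the same arithmetic per column.

```python
import numpy as np

# Hypothetical feature column: hours spent on exam preparation.
hours = np.array([2.0, 5.0, 10.0, 40.0])

# Normalise: X_new = (X - X_min) / (X_max - X_min), result lies in [0, 1].
normalised = (hours - hours.min()) / (hours.max() - hours.min())

# Standardise: X_new = (X - Mean) / StandardDeviation, result has mean 0 and std 1.
standardised = (hours - hours.mean()) / hours.std()

print(normalised)
print(standardised)
```

Note that the standardised values are not bounded: the largest value here ends up well above 1.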

How to find the value of theta 0 and theta 1? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I am new to ML and I am not sure how to solve this problem.
Could someone explain, step by step, how to find these values?
From a newcomer's point of view, you can simply test each candidate:
h1=0.5+0.5x
h2=0+0.5x
h3=0.5+0x
h4=1+0.5x
h5=1+x
Then check which one of the h's (1..5) gives exactly the observed values of y (0.5, 1, 2, 0) for the given values of the independent variable x (1, 2, 4, 0).
You can answer that by plugging the sample values of x into the equations above.
I hope I made it simple enough.
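The brute-force check described above could look like this in NumPy. The data and the five candidates are the ones listed in the answer; the dictionary layout is just one way to organise them.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 0.0])
y = np.array([0.5, 1.0, 2.0, 0.0])

# Candidate hypotheses h(x) = theta0 + theta1 * x, stored as (theta0, theta1).
candidates = {
    "h1": (0.5, 0.5),
    "h2": (0.0, 0.5),
    "h3": (0.5, 0.0),
    "h4": (1.0, 0.5),
    "h5": (1.0, 1.0),
}

# Keep only the candidates that reproduce every observed y exactly.
perfect = [name for name, (t0, t1) in candidates.items()
           if np.allclose(t0 + t1 * x, y)]
print(perfect)  # ['h2']
```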
Here is the catch: it's one of the easiest problems in machine learning.
We just have to create a linear regression model that fits the given data.
STEP 1: UNDERSTANDING THE PROBLEM
As mentioned at the end of the question, the model should fit the data completely.
We have to find theta0 and theta1 such that, for a given value of x, Htheta(x) gives the correct value of y.
STEP 2: FINDING THETA1
From the m examples, take any 2 examples (x1, y1) and (x2, y2)
Htheta(x2)-Htheta(x1) = theta1*(x2)-theta1*(x1)
----- subtracting the two hypotheses (this eliminates theta0)
Htheta(x2) = y2 and Htheta(x1) = y1
(the y corresponding to each x in the data, since the parameters fit the provided data exactly)
(y2-y1)/(x2-x1) = theta1
---- factoring out theta1 and then dividing both sides of the equation by (x2-x1)
From this:
theta1 = 0.5
STEP 3: CALCULATING THETA0
Take any example and put the values of theta1, y and x into this equation
y = theta1*x + theta0
theta0 will come out to be 0
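Steps 2 and 3 can be checked with a few lines of plain Python, using the x and y values quoted in the earlier answer. This only works because the fit is stated to be exact; with noisy data the two-point slope would depend on which points you pick.

```python
x = [1.0, 2.0, 4.0, 0.0]
y = [0.5, 1.0, 2.0, 0.0]

# Step 2: slope from any two examples (here the first two).
theta1 = (y[1] - y[0]) / (x[1] - x[0])

# Step 3: intercept from any single example.
theta0 = y[0] - theta1 * x[0]

# Sanity check: the line must reproduce every observed y.
residuals = [yi - (theta0 + theta1 * xi) for xi, yi in zip(x, y)]

print(theta0, theta1)  # 0.0 0.5
print(residuals)
```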
My approach would be to view these points by plotting a graph of the x, y values. Since it's a straight line through the origin, calculate the tangent of its angle with the x-axis using ordinary trigonometry, which in this case is y/x (since it's mentioned that they fit perfectly!!). E.g.:
tan(angle) = 0.5/1 or 1/2
The slope theta1 is this tangent itself, i.e. 0.5; the angle arctan(1/2) is approximately 0.46 rad
Note: this is not a scalable approach, just some maths fun! Sorry.
In general you would use some non-iterative approach (probably based on solving a system of linear equations) or some iterative approach like GD (gradient descent), but things are simpler here, as it's already given that there is a perfect fit.
Perfect fit means: loss/error of zero.
A loss of zero implies that theta0 needs to be zero, or else sample 4 (the last one, with x = 0 and y = 0) induces a loss
The overall loss is the sum of the per-sample losses and each component is nonnegative -> we can't tolerate a loss on any sample here
With theta0 fixed at zero, sample 4 produces no loss for any value of theta1 (infinitely many solutions)
But sample 1 shows that theta1 has to be 0.5 to induce no loss
Check the others: it fits perfectly
One assumption I made:
Gradient descent will converge to the optimal solution (which is not always true, even for convex optimization problems; it depends on the learning rate; one might use line searches to prove convergence based on some assumptions about the problem; but all that is irrelevant here)

LSTM network learning

I have attempted to program my own LSTM (long short-term memory) neural network. I would like to verify that the basic functionality is working. I have implemented a backpropagation through time (BPTT) algorithm to train a single-cell network.
Should a single-cell LSTM network be able to learn a simple sequence, or is more than one cell necessary? The network does not seem to be able to learn a simple sequence such as 1 0 0 0 1 0 0 0 1 0 0 0 1.
I am sending the sequence of 1's and 0's one by one, in order, into the network, and feeding it forward. I record each output for the sequence.
After running the whole sequence through the LSTM cell, I feed the mean error signals back into the cell, saving the weight changes internal to the cell in a separate collection. After running all the errors through one by one and calculating the new weights after each error, I average the new weights together to get the new weight, for each weight in the cell.
Am I doing something wrong? I would very much appreciate any advice.
Thank you so much!
Having only one cell (one hidden unit) is not a good idea, even if you are just testing the correctness of your code. You should try 50, even for such a simple problem. This paper: http://arxiv.org/pdf/1503.04069.pdf gives very clear gradient rules for updating the parameters. Having said that, there is no need to implement your own LSTM, even if your dataset and/or the problem you are working on is new. Picking an existing library (Theano, mxnet, Torch etc.) and modifying from there is, I think, the easier way: it is less error-prone and it supports GPU computing, which is essential for training an LSTM within a reasonable amount of time.
I haven't tried 1 hidden unit before, but I am sure 2 or 3 hidden units will work for sequence 0,1,0,1,0,1. It is not necessarily the more the cells, the better the result. Training difficulty also increases with the number of cells.
You said you averaged new weights together to get the new weight. Does that mean you run many training sessions and take the average of the trained weights?
There are many possible reasons why your LSTM does not work, even if you implemented it correctly. The weights are not easy to train with simple gradient descent.
Here are my suggestions for weight optimization:
Use the momentum method for gradient descent.
Add some Gaussian noise to your training set to prevent overfitting.
Use adaptive learning rates for each unit.
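For the first suggestion, a classical momentum update might look like the sketch below. The function name `momentum_step`, the hyperparameters and the toy quadratic objective are assumptions chosen for illustration; in an LSTM the gradient would come from BPTT instead.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """One classical momentum update: v accumulates a decaying sum of past gradients."""
    v = beta * v - lr * grad
    return w + v, v

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(300):
    w, v = momentum_step(w, v, grad=w)

print("final weights:", w)
```

The velocity `v` must be kept between updates, one entry per weight, and initialised to zero.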
Maybe you can take a look at Coursera's course Neural Network offered by Toronto University, and discuss with people there.
Or you can take a look at other examples on GitHub. For instance :
https://github.com/JANNLab/JANNLab/tree/master/examples/de/jannlab/examples
The best way to test an LSTM implementation (after gradient checking) is to try it out on the toy memory problems described in the original LSTM paper itself.
The best one that I often use is the 'Addition Problem':
We give a sequence of tuples of the form (value, mask). Value is a real valued scalar number between 0 and 1. Mask is a binary value - either 0 or 1.
0.23, 0
0.65, 0
...
0.86, 0
0.13, 1
0.76, 0
...
0.34, 0
0.43, 0
0.12, 1
0.09, 0
..
0.83, 0 -> 0.125
In the entire sequence of such tuples (usually of length 100), only 2 tuples should have the mask set to 1; the rest should have the mask set to 0. The target at the final time step is the average of the two values for which the mask was 1. The outputs at all time steps other than the last one are ignored. The values and the positions of the mask are chosen arbitrarily. Thus, this simple task shows whether your implementation can actually remember things over long periods of time.
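A possible generator for this task, following the description above (uniform values in [0, 1], exactly two masked positions, target equal to their average). The function name `make_addition_example` and its signature are my own; the original paper's exact setup may differ in details.

```python
import numpy as np

def make_addition_example(length=100, rng=None):
    """Return a (length, 2) array of (value, mask) rows and the scalar target."""
    rng = np.random.default_rng() if rng is None else rng
    values = rng.uniform(0.0, 1.0, size=length)
    mask = np.zeros(length)
    # Pick two distinct time steps to mark with mask = 1.
    i, j = rng.choice(length, size=2, replace=False)
    mask[[i, j]] = 1.0
    target = 0.5 * (values[i] + values[j])   # average of the two marked values
    return np.stack([values, mask], axis=1), target

seq, target = make_addition_example(100, np.random.default_rng(0))
print(seq[:5])
print("target:", target)
```

Feed `seq` to the network one row per time step and compute the loss only on the final output.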

Neural Network Diverging instead of converging

I have implemented a neural network (using CUDA) with 2 layers. (2 Neurons per layer).
I'm trying to make it learn 2 simple quadratic polynomial functions using backpropagation.
But instead of converging, it is diverging (the output goes to infinity).
Here are some more details about what I've tried:
I had set the initial weights to 0, but since it was diverging I randomized the initial weights
I read that a neural network might diverge if the learning rate is too high, so I reduced the learning rate to 0.000001
The two functions I am trying to get it to learn are: 3 * i + 7 * j + 9 and j*j + i*i + 24 (I am giving the layer i and j as input)
I had implemented it as a single layer previously, and that could approximate the polynomial functions better
I am thinking of implementing momentum in this network, but I'm not sure it would help it learn
I am using a linear (as in no) activation function
There is oscillation in the beginning, but the output starts diverging the moment any of the weights becomes greater than 1
I have checked and rechecked my code but there doesn't seem to be any kind of issue with it.
So here's my question: what is going wrong here?
Any pointer will be appreciated.
If the problem you are trying to solve is a classification problem, try a 3-layer network (3 is enough according to Kolmogorov). The connections from inputs A and B to hidden node C (C = A*wa + B*wb) represent a line in AB space. That line divides the space into correct and incorrect half-spaces. The connections from the hidden layer to the output put the hidden layer values in correlation with each other, giving you the desired output.
Depending on your data, the error function may look like a hair comb, so implementing momentum should help. Keeping the learning rate at 1 proved optimal for me.
Your training sessions will get stuck in local minima every once in a while, so network training will consist of several consecutive sessions. If a session exceeds the maximum number of iterations, or the amplitude is too high, or the error is obviously high, the session has failed; start another.
At the beginning of each session, reinitialize your weights with random values between -0.5 and +0.5.
It really helps to chart your error descent. You will get that "Aha!" factor.
The most common reason for neural network code to diverge is that the coder has forgotten the negative sign in the weight-update expression.
Another reason could be a problem with the error expression used for calculating the gradients.
If neither of these holds, then we need to see the code to answer.
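The effect of a missing minus sign is easy to reproduce on a one-parameter toy problem. The sketch below minimises f(w) = w^2 with the correct update and with the sign flipped; the function `run` and its parameters are illustrative only.

```python
def run(sign, lr=0.1, steps=50):
    """Apply w <- w + sign * lr * grad to f(w) = w**2, starting from w = 1."""
    w = 1.0
    for _ in range(steps):
        grad = 2.0 * w           # derivative of w**2
        w = w + sign * lr * grad
    return w

descent = run(sign=-1.0)   # correct: step against the gradient, w shrinks toward 0
ascent = run(sign=+1.0)    # bug: step along the gradient, w grows without bound

print("with minus sign:", descent)
print("with plus sign: ", ascent)
```

Even with a very small learning rate the flipped version still grows at every step, which matches the symptom of divergence persisting after the learning rate was lowered.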
