How to use a dataset in Neural Network training - machine-learning

I am trying to implement a neural network and am currently working on the backpropagation part. I don't need help with the coding. I have written the feedforward part so far with great success. My question is more about the dataset I am using.
This is my data set:
http://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data
I have to use 5-fold cross-validation with backpropagation and train until the error falls below a threshold of 0.1, 0.01, and 0.001, meaning I have to do 3 trials. Ignore the first 2 values of each data point. I have already normalized the set. The architecture of the network is 7 neurons in the input layer, 3 neurons in 1 hidden layer, and 1 output. Very basic implementation.
So my question is,
Do I have to break the set up into 5 smaller subsets, keeping one for testing and the rest for validation and training?
How long do I train? Say I use 1 fold (about 42 data points per fold) and I reach my desired error threshold. Do I stop and use the test data? Otherwise, do I load the next fold and keep training? What if I run out of data before I reach my desired error threshold?
Also, should I follow something like this?
a. use input values
b. feed forward, then compare error
c. backpropagate and adjust weights
d. repeat steps a-c until I reach the error threshold and finish all data in the current fold?
Thank you for your time and response. I am really trying to understand how to use the dataset. I will update you guys once I have written the code.
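For reference, 5-fold cross-validation is usually organized roughly as in the sketch below: the data is split into 5 folds once, and each fold takes one turn as the held-out test set while a fresh network is trained on the other four. The function name k_fold_indices and the sample count of 209 (inferred from "about 42 per fold") are assumptions for illustration, not something stated in the post.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, before splitting
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Assumed sample count; each fold then holds 41 or 42 samples.
splits = list(k_fold_indices(209, k=5))
for train, test in splits:
    # Here one would re-initialize the weights, train on `train` until the
    # error threshold (or an epoch limit) is reached, then evaluate on `test`.
    print(len(train), len(test))
```

Under this scheme, running out of data is not an issue: training makes repeated passes (epochs) over the same four training folds, and an epoch limit guards against a threshold that is never reached.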

Related

Neural network online training

I want to implement a simple feed-forward neural network to approximate the function y=f(x)=ax^2 where a is some constant and x is the input value.
The NN has one input node, one hidden layer with 1-n nodes, and one output node. For example, I input the value 2.0 -> the NN produces 4.0, and again I input 3.0 -> the NN produces 9.0 or close to it and so on.
If I understand "online training" correctly, the training data is fed in one example at a time, meaning I input the value 2.0 -> I iterate with gradient descent 100 times, and then I pass the value 3.0 and iterate another 100 times.
However, when I try to do this with my experimental/learning NN - I input the value 2.0 -> the error gets very small -> the output is very close to 4.0.
Now if I want to predict for the input 3.0 -> the NN produces 4.36 or something instead of 9.0. So the NN just learns the last training value.
How can I use online-training to get a Neural Network that approximates the desired function for a range [-d, d]? What am I missing?
The reason why I like online training is that eventually I want to input a time series and map that series to the desired function. This is beside the point, but in case someone was wondering.
Any advice would be greatly appreciated.
More info - I am activating the hidden layer with the Sigmoid function and the output layer with the linear one.
The reason why I like online-training is that eventually I want to input a time series - and map that series to the desired function.
Recurrent Neural Networks (RNNs) are the state of the art for modeling time series. This is because they can take inputs of arbitrary length, and they can also use internal state to model the changing behavior of the series over time.
Training feedforward neural networks for time series is an older method that will generally not perform as well. They require a fixed-size input, so you must choose a fixed-size sliding time window, and they don't preserve state, so it is hard for them to learn a time-varying function.
I can find very little about "online training" of feedforward neural nets with stochastic gradient descent to model non-stationary behavior except for a couple of very vague references. I don't think this provides any benefit besides allowing you to train in real time when you are getting a stream of data one at a time. I don't think it will actually help you model time-dependent behavior.
Most of the older methods I can find in the literature about online learning for neural networks use a hybrid approach with a neural network and some other method that can help capture time dependencies. Again, these should all be inferior to RNNs, not to mention harder to implement in practice.
Furthermore, I don't think you are implementing online training correctly. It should be stochastic gradient descent with a mini-batch size of 1. Therefore, you only run one iteration of gradient descent on each training example per training epoch. Since you are running 100 iterations before moving on to the next training example, you are going too far down the error gradient with respect to that single example, resulting in serious overfitting to a single data point. This is why you get poor results on the next input. I don't think this is a justifiable method of training, nor do I think it will work for time series.
You haven't mentioned what your activations are or your loss function is, so I can't comment on whether those are appropriate for the task.
Also, I don't think learning y=ax^2 is a good analogy for time series prediction. This is a static function that always gives the same output for a given input, regardless of the index of the input or the value of previous inputs.
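To illustrate the point about online training, here is a minimal sketch of stochastic gradient descent with a single update per example per epoch, using a sigmoid hidden layer and a linear output as described in the question. The function name train_online, the hidden size, learning rate, epoch count, and input range are all assumed values for illustration and would need tuning.

```python
import math
import random


def train_online(a=1.0, d=1.0, hidden=8, epochs=500, lr=0.05, seed=0):
    """Fit y = a*x^2 on [-d, d]; returns (MSE before, MSE after) training."""
    rng = random.Random(seed)
    data = [(x, a * x * x) for x in (-d + 2 * d * i / 20 for i in range(21))]
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]  # input -> hidden
    b1 = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden biases
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden -> output
    b2 = 0.0                                          # output bias

    def forward(x):
        h = [1.0 / (1.0 + math.exp(-(w1[j] * x + b1[j]))) for j in range(hidden)]
        y = sum(w2[j] * h[j] for j in range(hidden)) + b2  # linear output
        return h, y

    def mse():
        return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

    before = mse()
    for _ in range(epochs):
        rng.shuffle(data)          # visit the examples in a new order each epoch
        for x, t in data:          # exactly ONE gradient step per example
            h, y = forward(x)
            err = y - t
            for j in range(hidden):
                grad_hidden = err * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_hidden * x
                b1[j] -= lr * grad_hidden
            b2 -= lr * err
    return before, mse()


before, after = train_online()
print(before, after)
```

The key difference from the approach in the question is the inner loop: every example gets one small step, and the whole range is revisited each epoch, so no single point dominates the weights.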

Updating a Neural Network input

I trained a 4-input, 1-output NN for 1 month, and then the same NN was upgraded to 5 inputs and 1 output. Should I repeat the training with the new configuration, or can I still use the old training?
You'll almost definitely need to repeat the training, unless you can feed your five-input NN to your trained 4-input NN, in which case you might be able to get away with less. It depends on exactly what the new variable represents.
If the remaining 4 inputs still represent the same thing, you do not have to start from scratch. Instead, add a new neuron to the input layer, along with edges between it and the hidden units. Initialize the new weights as usual, but keep the existing weights. In other words, you are using your previous network as the starting point of the optimization. It should converge much faster, and in general it is the better option if you no longer have access to the historical data (or do not have time to retrain everything).
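As a rough sketch of that idea, assume the input-to-hidden weights are stored as one row per hidden unit. Adding a fifth input then means appending one small random weight to each row while leaving the trained weights untouched. The helper name add_input, the initialization scale, and the example weights are hypothetical.

```python
import random


def add_input(hidden_weights, scale=0.01, seed=0):
    """Return a copy of the weights with one extra input column appended.

    hidden_weights[j][i] is the weight from input i to hidden unit j.
    """
    rng = random.Random(seed)
    return [row + [rng.uniform(-scale, scale)] for row in hidden_weights]


# Hypothetical trained weights: 3 hidden units, 4 inputs each.
old_weights = [
    [0.2, -0.5, 0.1, 0.7],
    [-0.3, 0.8, 0.4, -0.1],
    [0.6, 0.0, -0.9, 0.3],
]
new_weights = add_input(old_weights)  # 3 hidden units, 5 inputs each
print(new_weights)
```

Because the new weights start near zero, the extended network initially behaves almost like the old one, and further training only has to learn how the new input should contribute.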

Time Series Prediction using Recurrent Neural Networks

I am using a Bike Sharing dataset to predict the number of rentals in a day, given the input. I will use 2011 data to train and 2012 data to validate. I successfully built a linear regression model, but now I am trying to figure out how to predict time series by using Recurrent Neural Networks.
The data set has 10 attributes (such as month, working day or not, temperature, humidity, windspeed), all numerical, though one attribute is the day of the week (Sunday: 0, Monday: 1, etc.).
I assume that one day can and probably will depend on previous days (and I will not need all 10 attributes), so I thought about using RNN. I don't know much, but I read some stuff and also this. I think about a structure like this.
I will have 10 input neurons, a hidden layer and 1 output neuron. I don't know how to decide on how many neurons the hidden layer will have.
I guess that I need a matrix to connect input layer to hidden layer, a matrix to connect hidden layer to output layer, and a matrix to connect hidden layers in neighbouring time-steps, t-1 to t, t to t+1. That's total of 3 matrices.
In one tutorial, the activation function was sigmoid. I'm not sure exactly, but if I use a sigmoid function, I will only get output between 0 and 1. What should I use as the activation function? My plan is to repeat this n times:
For each training data:
  Forward propagate
    Propagate the input to hidden layer, add it to propagation of previous hidden layer to current hidden layer. And pass this to activation function.
    Propagate the hidden layer to output.
    Find error and its derivative, store it in a list
  Back propagate
    Find current layers and errors from list
    Find current hidden layer error
    Store weight updates
  Update weights (matrices) by multiplying them by learning rate.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
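The forward pass with the three matrices described above can be sketched as follows. This is only an illustration under assumptions: the names W_xh, W_hh, and W_hy, the sigmoid hidden activation, the linear (unbounded) output, and the tiny example weights are all hypothetical choices, not taken from any tutorial.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, W_xh, W_hh, W_hy):
    """Run a simple recurrent network over a sequence of input vectors."""
    hidden = [0.0] * len(W_hh)  # hidden state before the first time step
    outputs = []
    for x in inputs:
        from_input = matvec(W_xh, x)            # input -> hidden
        from_previous = matvec(W_hh, hidden)    # hidden(t-1) -> hidden(t)
        hidden = [sigmoid(a + b) for a, b in zip(from_input, from_previous)]
        outputs.append(matvec(W_hy, hidden)[0])  # linear output, not squashed
    return outputs


# Tiny hypothetical example: 2 inputs, 2 hidden units, 1 output.
W_xh = [[1.0, 0.0], [0.0, 1.0]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_hy = [[5.0, 5.0]]
predictions = rnn_forward([[1.0, 1.0], [0.5, 0.2]], W_xh, W_hh, W_hy)
print(predictions)
```

Leaving the output neuron linear is one way to get real-valued predictions outside the 0-1 range even when the hidden layer uses a sigmoid.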
It seems to be the correct way to do it if you just want to learn the basics. If you want to build a neural network for practical use, this is a very poor approach and, as Marcin's comment says, almost everyone who constructs neural nets for practical use does so with packages that provide a ready-made neural network implementation. Let me answer your questions one by one...
I don't know how to decide on how many neurons the hidden layer will have.
There is no golden rule for choosing the right architecture for your neural network. People have established many empirical rules out of experience, and the right number of neurons is decided by trying out various combinations and comparing the output. A good starting point would be 3/2 times the number of input plus output neurons, i.e. (10+1)*(3/2) = 16.5, so you could start with about 16 neurons in the hidden layer and then reduce the number based on your output.
What should I use as activation function?
Again, there is no 'right' function. It totally depends on what suits your data. Additionally, there are many candidate functions, such as the sigmoid-shaped hyperbolic tangent and logistic functions, or RBF. A good starting point would be the logistic function, but again you will only find the right function through trial and error.
Is this the correct way to do it? I want real numerical values as output, instead of a number between 0-1.
Bounded activation functions such as the logistic sigmoid (including one assigned to the output neuron) will give you an output between 0 and 1, so you will have to use a multiplier to convert it to real values, give the output neuron a linear activation, or use some kind of encoding with multiple output neurons. Coding this manually will be complicated.
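The multiplier idea can be sketched as min-max scaling: the targets are squeezed into the 0-1 range before training, and the network's outputs are mapped back afterwards. The helper names and the daily rental counts below are made up for illustration.

```python
def fit_min_max(values):
    """Return the (min, max) of the training targets."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant series")
    return lo, hi


def scale(value, lo, hi):
    """Map a real value into the 0-1 range of a sigmoid output."""
    return (value - lo) / (hi - lo)


def unscale(scaled, lo, hi):
    """Map a 0-1 network output back to the original units."""
    return scaled * (hi - lo) + lo


# Hypothetical daily rental counts used as training targets.
rentals = [431, 801, 1349, 1562, 1600]
lo, hi = fit_min_max(rentals)
targets = [scale(r, lo, hi) for r in rentals]       # train against these
recovered = [unscale(t, lo, hi) for t in targets]   # convert predictions back
print(targets, recovered)
```

The minimum and maximum should come from the training data only and then be reused unchanged for validation and test data.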
Another aspect to consider would be your training iterations. Doing it 'n' times doesn't help. You need to find the optimal training iterations with trial and error as well to avoid both under-fitting and over-fitting.
The correct way to do it would be to use packages in Python or R, which will allow you to train neural nets with large amount of customization quickly, where you can train and test multiple nets with different activation functions (and even different training algorithms) and network architecture without too much hassle. With some amount of trial and error, you will eventually find the net that gives you desirable output.

Output length of recurrent neural network

I have written two LSTM RNN programs in Python that do sequence prediction. I have a simple sequence (say a noisy sine wave) and I am training my networks to "predict" future values along the sine wave. My first program just predicts the single next value (so there is only 1 output neuron), while the second predicts the next 5 values (i.e. 5 output neurons). To get a prediction 5 steps ahead with the first program, I need to call the predict function several times (using the previous prediction's output).
Both cases seem to work quite well, but what I'm really trying to work out is which of these two network architectures is best for this problem. There is practically nothing in the literature comparing these output models.
I think using the output as an input is not a good idea for this problem. Your output will always have some error, and it might increase with each step (steady-state error).
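To make the feedback loop concrete, here is a minimal sketch of iterated one-step forecasting. The toy_predictor is a stand-in for a trained single-output network and the window size is an assumed value; the point is only that every forecast after the first is computed from earlier forecasts, so any error they contain is carried forward.

```python
def iterated_forecast(predict_next, history, steps, window=3):
    """Predict `steps` values ahead by feeding each prediction back in."""
    values = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = predict_next(values[-window:])
        forecasts.append(next_value)
        values.append(next_value)  # the prediction becomes part of the input
    return forecasts


def toy_predictor(window):
    """Hypothetical stand-in for a trained one-step model: linear extrapolation."""
    return window[-1] + (window[-1] - window[-2])


forecast = iterated_forecast(toy_predictor, [1.0, 2.0, 3.0], steps=5)
print(forecast)
```

A network with 5 output neurons avoids this loop entirely, since all 5 values are predicted from observed data only.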

Validation Set in Backpropagation in a Neural Network

I have a neural network model, and so far I am running the training set forward, calculating the errors, and adjusting the weights.
As I understand it, after I do this for each training set example I need to run an example from the validation set forward and calculate the errors. When the validation set error stops decreasing, but the training set error is still decreasing it is time to stop because over-fitting is starting to occur. After we stop, we use the testing set to calculate how much error is in our network.
Please correct me if there are any mistakes so far.
My question is what error are we comparing? Are we just comparing the error of the output layer? Or are we comparing the errors from every node? If so, how exactly do we define the overall error of the network, just sum up all the errors?
My question is what error are we comparing?
We are comparing the error only on the output layer. So, if you plot an error vs. epoch graph, you will have two curves there. The line for training error goes down as you run more epochs. But the line for validation error goes down only up to a certain point before starting to go up. This indicates overfitting, and you want to find the point where the validation error was lowest.
Note that you are talking about individual samples while I am talking about epochs. For batch methods these errors are usually plotted after one iteration over the data set (training or validation). So each point on the plot is the mean error or mean squared error from that epoch.
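Finding that lowest point can be sketched as below, assuming one validation error has been recorded per epoch. The patience parameter (how many epochs without improvement to tolerate before stopping) and the error values are assumptions added for illustration.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch with the lowest validation error seen before stopping.

    Training is considered stopped once `patience` epochs have passed
    without the validation error improving on its best value.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch


# Made-up per-epoch validation errors: falling, then rising again.
validation_curve = [0.90, 0.50, 0.30, 0.35, 0.40, 0.45]
stop_at = early_stopping_epoch(validation_curve, patience=2)
print(stop_at)
```

In practice the weights from the best epoch are saved and restored, since training continues for a few epochs past it before the stop is detected.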
Also, if we have more than 1 output, are we just taking the sum of the errors in the output layer, or should it be some kind of weighted sum?
The multiple-output case is interesting. Basically, we are trying to find the early stopping point at which to stop training the weights. On the very last layer of a multiple-output network, the weights are trained using different error derivatives and can have different optimal early stopping points. You may want to plot them separately if you think that is the case. Otherwise, a simple sum of errors is sufficient. A weighted sum would mean that you care to optimize for one output over another, even when that causes the other one(s) to over- or under-train.
If you are thinking about implementing separate early stopping points, you can use sum of MSEs to get stopping point for all internal weights that depend on all error derivatives. For the weights on the last layer, use their corresponding MSEs to get their separate stopping points.
Let's say I have 60% training, 20% validation, and 20% test set. For each epoch, I run through the 60 training set samples while adjusting the weights on each sample and also calculating the error on each validation sample.
Another way to do the weight update is to calculate the updates for each sample and then apply the average of all updates at the end of the epoch. This is good if your training data has noise, outliers, or misclassified samples. For example, a couple of outliers will not be able to massively distort the weights, since their 'bad' updates will get averaged out with the other 'good' updates.
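The two update schemes can be compared on a deliberately tiny model. The sketch below assumes a single-weight linear model y = w*x with squared error, which is a simplification chosen only to keep the arithmetic visible; the function names are hypothetical.

```python
def averaged_epoch_update(w, samples, lr):
    """Compute every sample's gradient first, then apply their average once."""
    # Gradient of 0.5 * (w*x - y)**2 with respect to w is (w*x - y) * x.
    gradients = [(w * x - y) * x for x, y in samples]
    return w - lr * sum(gradients) / len(gradients)


def per_sample_epoch_update(w, samples, lr):
    """Apply each sample's update immediately (online learning)."""
    for x, y in samples:
        w -= lr * (w * x - y) * x
    return w


samples = [(1.0, 2.0), (2.0, 4.0)]  # generated by y = 2*x
w_averaged = averaged_epoch_update(0.0, samples, lr=0.1)
w_online = per_sample_epoch_update(0.0, samples, lr=0.1)
print(w_averaged, w_online)
```

With averaging, one extreme sample contributes only 1/N of the step, whereas in the per-sample version it moves the weight by its full gradient.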
Since there are only 1/3 as many validation samples as training samples, do I run through the validation 3 times for each epoch?
Why do we iterate over the validation set? Do we calculate error in validation to get weight updates? No. We do all our updating only using the training set. Validation is only there to see how our trained model generalizes outside of the training data. Think of it as a test before the test you run with the test set. Now, does it make sense to run over the validation set 3 times in each epoch? No, it doesn't.
I use the last calculated weights for online learning correct?
Yes. Error calculation and weight updates happen as new samples come in.
When we use the test set to calculate the error of our final model, are we using mse for this or does it even really matter too much which we use?
If your model is producing real-valued output, then use MSE. If your system is trying to solve a classification problem, use classification error. For example, a 10% classification error means that 10% of the test set was misclassified by your model during testing.
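Both measures are short enough to sketch directly. The function names and the example values are illustrative only.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference, for real-valued outputs."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def classification_error(predicted_labels, true_labels):
    """Fraction of samples whose predicted label is wrong."""
    wrong = sum(1 for p, t in zip(predicted_labels, true_labels) if p != t)
    return wrong / len(true_labels)


regression_error = mean_squared_error([1.0, 2.0], [1.0, 4.0])
label_error = classification_error([0, 1, 1, 0], [0, 1, 0, 0])
print(regression_error, label_error)
```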
