Time Series Ahead Prediction in Neural Network (N Point Ahead Prediction) Large Scale Iterative Training - machine-learning

(N=90) Point-Ahead Prediction Using a Neural Network:
I am trying to predict 3 minutes ahead, i.e. 180 points ahead. Because I compressed my time series by replacing every 2 points with their mean, I have to make an (N=90)-step-ahead prediction.
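As a small illustration of the compression step, here is a sketch in Python (not the asker's MATLAB code). The function name `downsample_by_mean` and the sample values are made up; the only facts taken from the post are the 180-second horizon and the factor of 2.

```python
def downsample_by_mean(series, factor=2):
    """Average each consecutive group of `factor` samples into one point."""
    usable = len(series) - len(series) % factor  # drop an incomplete trailing group
    return [sum(series[i:i + factor]) / factor for i in range(0, usable, factor)]


horizon_seconds = 180                        # 3 minutes at one sample per second
factor = 2                                   # two raw points become one
horizon_steps = horizon_seconds // factor    # steps ahead on the compressed series

compressed = downsample_by_mean([30, 32, 34, 36, 38], factor)
print(horizon_steps, compressed)
```

The trailing fifth sample is dropped here because it has no partner; whether to drop or keep such a remainder is a design choice the post does not specify.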
My time series data is given in seconds. The values lie between 30 and 90. They usually move from 30 to 90 and from 90 to 30, as seen in the example below.
My data can be downloaded from: https://www.dropbox.com/s/uq4uix8067ti4i3/17HourTrace.mat
I am having trouble implementing a neural network to predict N points ahead. My only feature is the previous time values. I used an Elman recurrent neural network and also newff.
In my scenario I need to predict 90 points ahead. First, here is how I separated my input and target data manually:
For Example:
data_in = [1,2,3,4,5,6,7,8,9,10]; % imagine 1:10 only defines the array index values
N = 90; % number of points predicted ahead
P(:, :)        T(:)    |  P(:, :)        T(:)      (it could also be spaced by 2, "2 theta time")
[1,2,3,4,5]    [5+N]   |  [1,3,5,7,9]    [9+N]
[2,3,4,5,6]    [6+N]   |  [2,4,6,8,10]   [10+N]
...
until it reaches to end of the data
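The input/target layout above can be sketched as a sliding window. This is a minimal Python illustration under assumptions: the function name `make_windows`, the `stride` parameter (standing in for the "every 2nd point" variant), and the stand-in series are not from the post.

```python
def make_windows(series, window=5, horizon=90, stride=1):
    """Return (inputs, targets) for horizon-step-ahead prediction.

    Each input holds `window` samples spaced `stride` apart; the target is the
    sample `horizon` steps after the last input sample.
    """
    inputs, targets = [], []
    span = (window - 1) * stride              # distance from first to last input sample
    for start in range(len(series) - span - horizon):
        last = start + span
        inputs.append([series[start + k * stride] for k in range(window)])
        targets.append(series[last + horizon])
    return inputs, targets


# Stand-in data: each value equals its 1-based index, as in the example above.
series = list(range(1, 201))
P, T = make_windows(series, window=5, horizon=90)
P2, T2 = make_windows(series, window=5, horizon=90, stride=2)
print(P[0], T[0], P2[0], T2[0])
```

With this stand-in series, the first row pairs the window `[1,2,3,4,5]` with the value at index `5+N`, matching the left-hand column of the layout.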
I have 100 input points and 90 output points in Elman recurrent neural networks. What could be the most efficient hidden node size?
input_layer_size = 90;
NodeNum1 =90;
net = newelm(threshold,[NodeNum1 ,prediction_ahead],{'tansig', 'purelin'});
net.trainParam.lr = 0.1;
net.trainParam.goal = 1e-3;
Before training, I filter the data with a Kalman filter, normalize it into the range [0,1], and after that I shuffle the data.
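The normalization step might look like the following Python sketch. The helper names are hypothetical, and taking the bounds from the training portion only (so they can be reused unchanged on test data) is an assumption, not something the post states.

```python
def fit_minmax(values):
    """Return the (lowest, highest) value, to be reused for every later data split."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("constant series cannot be scaled to [0, 1]")
    return lo, hi


def to_unit(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]


def from_unit(values, lo, hi):
    """Invert to_unit, e.g. to report predictions in the original units."""
    return [v * (hi - lo) + lo for v in values]


train_values = [30.0, 45.0, 60.0, 90.0]      # made-up values in the 30-90 range
lo, hi = fit_minmax(train_values)
scaled = to_unit(train_values, lo, hi)
restored = from_unit(scaled, lo, hi)
print(scaled, restored)
```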
1) I was not able to train on my complete data. First I tried to train on the complete M data, which is around 900,000 points, but that did not give me a solution.
2) Second, I tried iterative training, where in each iteration the newly added data is merged with the already trained data. After 20,000 trained points the accuracy starts to decrease. The first 1000 points fit perfectly in training, but when I start to iteratively merge in the new data and continue training, the training accuracy drops very rapidly from 90 to 20.
For example.
chunk = 1000;                 % assumes samples are stored as columns
P = []; T = [];
for counter = 1:floor(size(P_test, 2) / chunk)
    idx = (counter - 1)*chunk + 1 : counter*chunk;
    P = [P P_test(:, idx)];   % iteratively merge the next portion of the data
    T = [T T_test(:, idx)];
    net = train(net, P, T, [], []);   % train until it reaches the minimum error
    normTrainOutput = sim(net, P, [], []);
end
This approach is very slow, and after a point it no longer gives good results.
My third approach was also iterative training. It was similar to the previous one, but in each iteration I train only on the current 1000-point portion of the data, without merging it with the previously trained data. For example, I train on the first 1000 points until the network reaches the minimum error, which gives >95% accuracy. When I then do the same for the second 1000-point portion of the data, it overwrites the weights and the predictor mainly behaves according to the most recently trained portion of the data.
chunk = 1000;                 % assumes samples are stored as columns
for counter = 1:floor(size(P_test, 2) / chunk)
    idx = (counter - 1)*chunk + 1 : counter*chunk;
    P = P_test(:, idx);       % only the current 1000 samples, no merging
    T = T_test(:, idx);
    net = train(net, P, T, [], []);   % I also tried adapt()
    normTrainOutput = sim(net, P, [], []);
end
Trained DATA: This figure is a snapshot from my training set. The blue line is the original time series and the red line is the values predicted by the trained neural network. The MSE is around 50.
Tested DATA: In the picture below, you can see the prediction for my testing data from the neural network, which was trained with 20,000 input points while keeping the MSE below 50 on the training data set. It is able to catch a few patterns, but mostly it does not give really good accuracy.
I was not able to succeed with any of these approaches. In each iteration I also observe that a slight change in alpha completely overwrites what was already learned and puts more focus on the currently trained portion of the data.
I have not been able to come up with a solution to this problem. In iterative training, should I keep the learning rate small and the number of epochs small?
I also could not find an efficient way to predict 90 points ahead in a time series. Any suggestions on what I should do in order to predict N points ahead, or any tutorial or link for information?
What is the best way to do iterative training? In my second approach, when I reach 15,000 trained points, the training accuracy suddenly starts to drop. Should I change alpha iteratively at run time?
==========
Any suggestions, or pointers to what I am doing wrong, would be much appreciated.
I also implemented a recurrent neural network, but when training on large data I faced the same problems. Is it possible to do adaptive learning (online learning) in recurrent neural networks (newelm)? The weights do not update and I did not see any improvement.
If yes, how is it possible, and which functions should I use?
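net = newelm(threshold, [6, 8, 90], {'tansig', 'tansig', 'purelin'});
net.trainFcn = 'trains';
batch_size = 10;
k = 1;
while k + batch_size - 1 <= size(Pt, 2)
    cols = k : k + batch_size - 1;                % current batch of columns
    net = train(net, Pt(:, cols), Tt(:, cols));
    k = k + batch_size;                           % move on to the next batch
end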

Have a look at Echo State Networks (ESNs) or other forms of Reservoir Computing. They are perfect for time series prediction, very easy to use and converge fast. You don't need to worry about the structure of the network at all (every neuron in the mid-layer has random weights which do not change). You only learn the output weights.
If I understood the problem correctly, with Echo State Networks, I would just train the network to predict the next point AND 90 points ahead. This can be done by simply forcing the desired output in the output neurons and then performing ridge regression to learn the output weights.
When running the network after training, at every step n it would output the next point (n+1), which you feed back to the network as input (to continue the iteration), and the point 90 steps ahead (n+90), which you can use however you want; for example, you could also feed it back to the network so that it affects the next outputs.
Sorry if the answer is not very clear. It's hard to explain how reservoir computing works in a short answer, but if you just read the article in the link, you will find it very easy to understand the principles.
If you do decide to use ESNs, also read this paper to understand the most important property of ESNs and really know what you're doing.
EDIT: Depending on how "predictable" your system is, predicting 90 points ahead may still be very difficult. For example if you're trying to predict a chaotic system, noise would introduce very large errors if you're predicting far ahead.
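To make the idea more concrete, here is a rough pure-Python sketch of the training step: a fixed random reservoir, two outputs (next point and 90 points ahead), and ridge regression for the output weights. Everything specific is an assumption for illustration: the toy sine wave standing in for the real trace, the 20-unit reservoir, the regularization value, and all function names. Scaling the reservoir by its largest absolute row sum is a simplification used here instead of the usual spectral-radius scaling, and the free-running loop that feeds predictions back as input is not shown.

```python
import math
import random


def run_reservoir(inputs, w_in, w_res):
    """Drive the fixed random reservoir and collect [1, u, state] feature rows."""
    n = len(w_res)
    state = [0.0] * n
    rows = []
    for u in inputs:
        state = [math.tanh(w_in[i] * u + sum(w_res[i][j] * state[j] for j in range(n)))
                 for i in range(n)]
        rows.append([1.0, u] + state)
    return rows


def ridge_fit(rows, targets, lam=1e-4):
    """Solve (X^T X + lam*I) W = X^T Y by Gauss-Jordan elimination with pivoting."""
    d, m = len(rows[0]), len(targets[0])
    a = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0) for j in range(d)]
         + [sum(r[i] * t[k] for r, t in zip(rows, targets)) for k in range(m)]
         for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(d):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [[a[i][d + k] / a[i][i] for k in range(m)] for i in range(d)]


def predict(rows, w_out):
    """Apply the learned linear readout to each feature row."""
    return [[sum(r[i] * w_out[i][k] for i in range(len(r))) for k in range(len(w_out[0]))]
            for r in rows]


random.seed(0)
n_res, horizon, washout = 20, 90, 50
w_in = [random.uniform(-0.5, 0.5) for _ in range(n_res)]
w_res = [[random.uniform(-1, 1) for _ in range(n_res)] for _ in range(n_res)]
# Simplification: bound every absolute row sum by 0.9 so the reservoir forgets old inputs.
scale = 0.9 / max(sum(abs(v) for v in row) for row in w_res)
w_res = [[v * scale for v in row] for row in w_res]

signal = [math.sin(2 * math.pi * n / 200) for n in range(1200)]   # toy stand-in series
usable = len(signal) - horizon
rows = run_reservoir(signal[:usable], w_in, w_res)[washout:]      # discard the warm-up states
targets = [[signal[n + 1], signal[n + horizon]] for n in range(washout, usable)]
w_out = ridge_fit(rows, targets)                                  # only the readout is learned
outputs = predict(rows, w_out)
mse_far = sum((o[1] - t[1]) ** 2 for o, t in zip(outputs, targets)) / len(targets)
print("training MSE for the 90-step-ahead output:", mse_far)
```

A periodic toy signal is far easier than the real trace, so the training error here says nothing about how well the approach would do on the asker's data.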

Use fuzzy logic with membership functions to predict the future data. It will be an efficient method.

Related

Can a dense layer on many inputs be represented as a single matrix multiplication?

Denote a[2, 3] to be a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-element classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
Where weights is a weight matrix 10x2, input is an input vector 1x10, and output is an output vector 1x2.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
Where weights is the same, output_err is 100x2, and input is 100x10.
However, I tried to implement a neural network in this way from scratch and I am currently unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. The single matrix multiplication is mathematically valid, but it is not equivalent to training on the inputs one at a time. Essentially, computing all inputs in this way is like running the network with a batch size equal to the number of inputs: the weights do not get updated between inputs, but rather all at once. So while calculating everything together seems valid, each input does not individually influence the training step by step. However, with a reasonable batch size, you can do 2D matrix multiplications, where the input is batch_size by input_size, in order to speed up training.
In addition, if predicting on many inputs (in the test stage, for example), since no weights are updated, an entire matrix multiplication of num_inputs by input_size can be run to compute all inputs in parallel.
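A small check of both points, written as a pure-Python sketch with made-up numbers and hypothetical helper names (a real implementation would use a numerical library). It shows that the batched forward pass equals the per-input forward pass, and that the batched weight gradient is the sum of the per-input gradients computed with the same, not yet updated, weights.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


inputs = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [0.0, 1.0, 1.0]]   # 3 samples x 3 features
weights = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]                # 3 features x 2 outputs

batch_out = matmul(inputs, weights)                             # one product for all samples
row_out = [matmul([x], weights)[0] for x in inputs]             # one sample at a time

output_err = [[1.0, -1.0], [0.5, 0.0], [0.0, 2.0]]              # made-up error signal
batch_grad = matmul(transpose(inputs), output_err)              # 3 x 2 weight gradient
per_sample = [matmul(transpose([x]), [e]) for x, e in zip(inputs, output_err)]
summed_grad = [[sum(g[i][j] for g in per_sample) for j in range(2)] for i in range(3)]

print(batch_out == row_out)
print(batch_grad, summed_grad)
```

Because the batched gradient is a sum over the batch, it grows with the batch size, so the learning rate usually has to be chosen with that in mind (for example by dividing the gradient by the number of samples).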

Interpretation of a learning curve in machine learning

While following the Coursera-Machine Learning class, I wanted to test what I learned on another dataset and plot the learning curve for different algorithms.
I (quite randomly) chose the Online News Popularity Data Set, and tried to apply a linear regression to it.
Note : I'm aware it's probably a bad choice but I wanted to start with linear reg to see later how other models would fit better.
I trained a linear regression and plotted the following learning curve :
This result is particularly surprising for me, so I have questions about it :
Is this curve even remotely possible or is my code necessarily flawed?
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
If it is not, any hint to where I made a mistake?
Here's my code (Octave / Matlab) just in case:
Plot :
lambda = 0;
startPoint = 5000;
stepSize = 500;
[error_train, error_val] = ...
learningCurve([ones(mTrain, 1) X_train], y_train, ...
[ones(size(X_val, 1), 1) X_val], y_val, ...
lambda, startPoint, stepSize);
plot(error_train(:,1),error_train(:,2),error_val(:,1),error_val(:,2))
title('Learning curve for linear regression')
legend('Train', 'Cross Validation')
xlabel('Number of training examples')
ylabel('Error')
Learning curve :
S = ['Reg with '];
for i = startPoint:stepSize:m
    temp_X = X(1:i,:);
    temp_y = y(1:i);
    % Initialize Theta
    initial_theta = zeros(size(X, 2), 1);
    % Create "short hand" for the cost function to be minimized
    costFunction = @(t) linearRegCostFunction(X, y, t, lambda);
    % Now, costFunction is a function that takes in only one argument
    options = optimset('MaxIter', 50, 'GradObj', 'on');
    % Minimize using fmincg
    theta = fmincg(costFunction, initial_theta, options);
    [J, grad] = linearRegCostFunction(temp_X, temp_y, theta, 0);
    error_train = [error_train; [i J]];
    [J, grad] = linearRegCostFunction(Xval, yval, theta, 0);
    error_val = [error_val; [i J]];
    fprintf('%s %6i examples \r', S, i);
    fflush(stdout);
end
Edit : if I shuffle the whole dataset before splitting train/validation and doing the learning curve, I have very different results, like the 3 following :
Note : the training set size is always around 24k examples, and validation set around 8k examples.
Is this curve even remotely possible or is my code necessarily flawed?
It's possible, but not very likely. You might be picking the hard to predict instances for the training set and the easy ones for the test set all the time. Make sure you shuffle your data, and use 10 fold cross validation.
Even if you do all this, it is still possible for it to happen, without necessarily indicating a problem in the methodology or the implementation.
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
Let's assume that your data can only be properly fitted by a 3rd degree polynomial, and you're using linear regression. This means that the more data you add, the more obvious it will be that your model is inadequate (higher training error). Now, if you choose few instances for the test set, the error will be smaller, because linear vs 3rd degree might not show a big difference for too few test instances for this particular problem.
For example, if you do some regression on 2D points, and you always pick 2 points for your test set, you will always have 0 error for linear regression. An extreme example, but you get the idea.
How big is your test set?
Also, make sure that your test set remains constant throughout the plotting of the learning curves. Only the train set should increase.
If it is not, any hint to where I made a mistake?
Your test set might not be large enough or your train and test sets might not be properly randomized. You should shuffle the data and use 10 fold cross validation.
You might want to also try to find other research regarding that data set. What results are other people getting?
Regarding the update
That makes a bit more sense, I think. Test error is generally higher now. However, those errors look huge to me. Probably the most important information this gives you is that linear regression is very bad at fitting this data.
Once more, I suggest you do 10 fold cross validation for learning curves. Think of it as averaging all of your current plots into one. Also shuffle the data before running the process.
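As a rough illustration of the advice above (shuffle first, keep the validation set fixed, grow only the training set), here is a Python sketch on synthetic data. It is not a port of the Octave code: the cubic toy data, the one-variable closed-form fit, and all names are assumptions, and it uses a single split rather than 10-fold cross validation.

```python
import random


def fit_line(xs, ys):
    """Closed-form least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def mse(xs, ys, a, b):
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


random.seed(1)
xs = [random.uniform(-2, 2) for _ in range(400)]
data = [(x, x ** 3 + random.gauss(0, 0.1)) for x in xs]   # cubic target, linear model
random.shuffle(data)                                      # shuffle before splitting
train, val = data[:300], data[300:]                       # validation set stays fixed
vx, vy = zip(*val)

curve = []
for m in range(20, 301, 40):
    tx, ty = zip(*train[:m])
    a, b = fit_line(tx, ty)                               # fit only on the first m examples
    curve.append((m, mse(tx, ty, a, b), mse(vx, vy, a, b)))

for m, err_train, err_val in curve:
    print(m, round(err_train, 3), round(err_val, 3))
```

Note that the model is fitted on exactly the `m` examples whose training error is reported; fitting on a different subset than the one being scored would distort the curve.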

how to handle large number of features machine learning

I developed an image processing program that identifies what a number is, given an image of numbers. Each image was 27x27 pixels = 729 pixels. I take each R, G and B value, which means I have 2187 variables from each image (+1 for the intercept = total of 2188).
I used the below gradient descent formula:
Repeat {
θj := θj − (α/m) ∑ (hθ(x) − y) xj
}
Where θj is the coefficient on variable j; α is the learning rate; hθ(x) is the hypothesis; y is the real value and xj is the value of variable j. m is the number of training examples. hθ(x) and y are evaluated for each training example (that is what the summation sign is for). Further, the hypothesis is defined as:
hθ(x) = 1/(1 + e^(−z))
z = θ0 + θ1x1 + θ2x2 + θ3x3 + ... + θnxn
With this, and 3000 training images, I was able to train my program in just over an hour and when tested on a cross validation set, it was able to identify the correct image ~ 67% of the time.
I wanted to improve that so I decided to attempt a polynomial of degree 2.
However the number of variables jumps from 2188 to 2,394,766 per image! It takes me an hour just to do 1 step of gradient descent.
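The jump in the number of variables can be reproduced with a one-line count. The sketch below (Python, with a hypothetical helper name) counts all monomials of total degree at most 2 in the 2187 pixel variables, including the constant term.

```python
import math


def poly_feature_count(n_vars, degree):
    """Number of monomials of total degree <= degree in n_vars variables (constant included)."""
    return math.comb(n_vars + degree, degree)


n_vars = 27 * 27 * 3                        # R, G and B value for each of 729 pixels
linear = poly_feature_count(n_vars, 1)      # 2188
quadratic = poly_feature_count(n_vars, 2)   # 2394766
print(linear, quadratic)
```

The count grows roughly with the square of the number of variables for degree 2, which is why expanding the features explicitly becomes impractical so quickly.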
So my question is, how is this vast number of variables handled in machine learning? On the one hand, I don't have enough space to even hold that many variables for each training set. On the other hand, I am currently storing 2188 variables per training sample, but I have to perform O(n^2) just to get the values of each variable multiplied by another variable (i.e. the polynomial to degree 2 values).
So any suggestions / advice is greatly appreciated.
Try to use some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images).
Vectorize your gradient descent. With most math libraries, or in Matlab etc., it will run much faster.
Parallelize the algorithm and then run it on multiple CPUs (but maybe your library for multiplying vectors already supports parallel computations).
Along with Jirka-x1's answer, I would first say that this is one of the key differences in working with image data than say text data for ML: high dimensionality.
Second... this is a duplicate, see How to approach machine learning problems with high dimensional input space?

Gradient descent stochastic update - Stopping criterion and update rule - Machine Learning

My dataset has m features and n data points. Let w be a vector (to be estimated). I'm trying to implement gradient descent with stochastic update method. My minimizing function is least mean square.
The update algorithm is shown below:
for i = 1 ... n data:
for t = 1 ... m features:
w_t = w_t - alpha * (<w>.<x_i> - <y_i>) * x_t
where <x> is a row vector of m features, <y> is a column vector of true labels, and alpha is a constant.
My questions:
Now according to wiki, I don't need to go through all data points and I can stop when error is small enough. Is it true?
I don't understand what should be the stopping criterion here. If anyone can help with this that would be great.
With this formula - which I used in for loop - is it correct? I believe (<w>.<x_i> - <y_i>) * x_t is my ∆Q(w).
Now according to wiki, I don't need to go through all data points and I can stop when error is small enough. Is it true?
This is especially true when you have a really huge training set and going through all the data points is very expensive. Then, you would check the convergence criterion after K stochastic updates (i.e. after processing K training examples). While it's possible, it doesn't make much sense to do this with a small training set. Another thing people do is randomize the order in which training examples are processed, to avoid having too many correlated examples in a row, which may result in "fake" convergence.
I don't understand what should be the stopping criterion here. If anyone can help with this that would be great.
There are a few options. I recommend trying as many of them as you can and deciding based on empirical results.
1. The difference in the objective function for the training data is smaller than a threshold.
2. The difference in the objective function for held-out data (aka development data, validation data) is smaller than a threshold. The held-out examples should NOT include any of the examples used for training (i.e. for stochastic updates) nor any of the examples in the test set used for evaluation.
3. The total absolute difference in parameters w is smaller than a threshold.
4. In 1, 2, and 3 above, instead of specifying a threshold, you could specify a percentage. For example, a reasonable stopping criterion is to stop training when |squared_error(w) - squared_error(previous_w)| < 0.01 * squared_error(previous_w).
5. Sometimes, we don't care if we have the optimal parameters. We just want to improve the parameters we originally had. In such a case, it's reasonable to preset a number of iterations over the training data and stop after that, regardless of whether the objective function actually converged.
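The percentage-based criterion, combined with a shuffled processing order and a preset cap on the number of passes, could be sketched as follows. This is an illustrative Python version under assumptions: the synthetic data, the default values for `alpha`, `rel_tol` and `max_epochs`, and the function name are all made up, and the criterion is checked once per pass over the data rather than after every K updates.

```python
import random


def sgd_least_squares(xs, ys, alpha=0.01, rel_tol=0.01, max_epochs=1000, seed=0):
    """Stochastic gradient descent for least squares with a relative stopping rule."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    order = list(range(len(xs)))

    def squared_error(weights):
        return sum((sum(wj * xj for wj, xj in zip(weights, x)) - y) ** 2
                   for x, y in zip(xs, ys))

    previous = squared_error(w)
    for epoch in range(1, max_epochs + 1):
        rng.shuffle(order)                      # randomize the processing order each pass
        for i in order:
            residual = sum(wj * xj for wj, xj in zip(w, xs[i])) - ys[i]
            # All components are updated from the same residual, computed before the update.
            w = [wj - alpha * 2 * residual * xj for wj, xj in zip(w, xs[i])]
        current = squared_error(w)
        if abs(current - previous) <= rel_tol * previous:
            break                               # change is below 1% of the previous error
        previous = current
    return w, epoch


rng = random.Random(42)
true_w = [1.0, 2.0, -3.0]                       # bias weight, then two feature weights
xs = [[1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
ys = [sum(t * x for t, x in zip(true_w, row)) + rng.gauss(0, 0.1) for row in xs]
w, epochs = sgd_least_squares(xs, ys)
print(epochs, [round(v, 2) for v in w])
```

Checking the error on the training data corresponds to option 1; swapping in a held-out set inside `squared_error` would give option 2.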
With this formula - which I used in for loop - is it correct? I believe (w.x_i - y_i) * x_t is my ∆Q(w).
It should be 2 * (w.x_i - y_i) * x_t but it's not a big deal given that you're multiplying by the learning rate alpha anyway.
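The factor of 2 comes from differentiating the squared error of a single example. Written out in the notation of the question (with $x_{i,t}$ denoting feature $t$ of example $i$, a symbol introduced here for clarity):

```latex
Q_i(w) = \left( w \cdot x_i - y_i \right)^2
\qquad\Longrightarrow\qquad
\frac{\partial Q_i}{\partial w_t} = 2 \left( w \cdot x_i - y_i \right) x_{i,t}

w_t \leftarrow w_t - \alpha \, \frac{\partial Q_i}{\partial w_t}
     = w_t - 2\alpha \left( w \cdot x_i - y_i \right) x_{i,t}
```

Dropping the 2 is therefore the same as running the exact update with a learning rate of $\alpha / 2$.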

Does it make any sense that weights and threshold are growing proportionally when training my perceptron?

I am taking my first steps in neural networks and to do so I am experimenting with a very simple single layer, single output perceptron which uses a sigmoidal activation function. I am updating my weights on-line each time a training example is presented using:
weights += learningRate * (correct - result) * {input,1}
Here weights is a n-length vector which also contains the weight from the bias neuron (- threshold), result is the result as computed by the perceptron (and processed using the sigmoid) when given the input, correct is the correct result and {input,1} is the input augmented with 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logic AND, the weights don't converge for a long time, instead they keep growing similarly and they maintain a ratio of circa -1.5 with the threshold, for instance the three weights are in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks like it is connected to some missing stopping condition in the learning, if I try to use the identity function as the activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't give an explanation to this.
Thank you
Tunnuz
It is because the sigmoid activation function doesn't reach one (or zero) even with very highly positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high -- try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perceptron, it helps to also look at correct and result.
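The difference between the two activation functions can be seen in a small experiment. The following Python sketch applies the update rule from the question to logic AND; the learning rates, epoch limits and function names are arbitrary choices for illustration, not values from the post.

```python
import math

AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def step(z):
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(activation, learning_rate, max_epochs):
    """Online rule from the question: weights += lr * (correct - result) * {input, 1}."""
    weights = [0.0, 0.0, 0.0]                   # two inputs plus the bias weight
    for epoch in range(1, max_epochs + 1):
        changed = False
        for inputs, correct in AND_DATA:
            x = list(inputs) + [1.0]
            result = activation(sum(w * xi for w, xi in zip(weights, x)))
            error = correct - result
            if error != 0.0:
                changed = True
                weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
        if not changed:                         # a full pass without updates: stop
            break
    return weights, epoch


def norm(weights):
    return math.sqrt(sum(w * w for w in weights))


step_weights, step_epochs = train(step, 1.0, 100)
sig_short, _ = train(sigmoid, 0.5, 1000)
sig_long, sig_epochs = train(sigmoid, 0.5, 5000)

print("step:", step_weights, "stopped after", step_epochs, "epochs")
print("sigmoid weight norm after 1000 epochs:", norm(sig_short))
print("sigmoid weight norm after 5000 epochs:", norm(sig_long))
```

With the step function the loop stops once every example is classified correctly, while with the sigmoid the error is never exactly zero in this run, so the weights keep being updated until the epoch limit is reached.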
