Time Series Prediction using LSTM

I am using Jason Brownlee's tutorial (mirror) to apply an LSTM network to some syslog/network log data. He's a master!
I have syslog data (a specific event) for each day for the last year, so I am using an LSTM network for time series analysis. I am using the LSTM implementation from the Keras deep learning library.
As I understand -
About Batch_size
A batch of data is a fixed-sized number of rows from the training
dataset that defines how many patterns to process before updating
the weights of the network. Based on the batch_size the Model
takes random samples from the data for the analysis. For time series
this is not desirable, hence the batch_size should always be 1.
About setting the value for shuffle
By default, the samples within an epoch are shuffled prior to being exposed to the network. This is undesirable for the LSTM
because we want the network to build up state as it learns across
the sequence of observations. We can disable the shuffling of
samples by setting “shuffle” to “False”.
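For reference, here is roughly how those two settings translate into Keras code (a minimal sketch; the data shapes and layer sizes below are placeholders, not my real setup):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    # dummy data: (samples, timesteps, features) inputs, one target per sample
    X = np.random.rand(100, 1, 1)
    y = np.random.rand(100)

    model = Sequential()
    model.add(LSTM(4, input_shape=(1, 1)))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')

    # batch_size=1  -> weights are updated after every single sample
    # shuffle=False -> samples are presented in their original (time) order
    model.fit(X, y, epochs=10, batch_size=1, shuffle=False, verbose=0)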
Scenario 1 -
Using the above two rules/guidelines, I ran several trials with different numbers of neurons, epoch sizes and different layers, and got better results than the baseline model (persistence model).
Scenario 2 -
Without using the above guidelines/rules, I ran several trials with different numbers of neurons, epoch sizes and different layers, and got even better results than in Scenario 1.
Query - Is setting shuffle to False and batch_size to 1 for time series a rule or a guideline?
Reading the tutorial, it seems logical that time series data should not be shuffled, since we do not want to change the sequence of the data; but for my data the results are better if I let it be shuffled.
In the end, I think what matters is how I get better predictions from my runs.
I think I should set aside "theory" in favour of concrete evidence such as metrics, elbows, RMSEs, etc.
Kindly enlighten.

It depends a lot on the size of your data and also on the number of variables. In my experience, decreasing the batch size gives better results since the updates are more frequent, but on huge datasets it is very expensive, and you have to play with this trade-off (training time vs. result).
About your shuffle: it may be the case that your data is not that correlated with the past. If so, shuffling the data helps the network to learn and to generalize (for example when the data is ordered by label); see reason 7 of 37 Reasons why your Neural Network is not working.
As for batch size: the larger it is, the harder it is to generalize (reason 11). When the data clearly depends on the past, you can declare your LSTM in Keras as stateful, which means "that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch", according to the Keras API. Hope this helps.
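For completeness, here is a minimal sketch of what a stateful LSTM declaration looks like in Keras (shapes and sizes are placeholders):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    batch_size, timesteps, features = 1, 1, 1
    X = np.random.rand(100, timesteps, features)
    y = np.random.rand(100)

    model = Sequential()
    # stateful=True requires a fixed batch size via batch_input_shape
    model.add(LSTM(4, batch_input_shape=(batch_size, timesteps, features), stateful=True))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')

    # states are carried across batches, so reset them once per pass over the series
    for epoch in range(10):
        model.fit(X, y, epochs=1, batch_size=batch_size, shuffle=False, verbose=0)
        model.reset_states()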

Related

How to adjust to the randomness of the neural network weights?

The weights of the network are random during the initialization. Thus, if you train the network multiple times with multiple different random weights, you will get different results.
My question is:
What do you do during hyperparameter tuning? Do you retrain the network multiple times for each hyperparameter configuration, and take the mean of the results as the value of that hyperparameter configuration?
And if this is the case, does anyone use the information provided by the standard deviation?
For the final results reported on the test data, do we train the network multiple times to compensate for the random weights, or just once?
For example, in the paper A Neural Representation of Sketch Drawings, they report the log-likelihood for different categories in a table, and I don't get the methodology behind those numbers.
I appreciate any clarification :-)
I'd say fix the seed so you get the same random init every time, and play with the hyperparameters only. Of course, if you want to try different random inits (e.g. one of https://keras.io/initializers/), then that would itself be a hyperparameter.
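Something along these lines, for example (a minimal sketch; exact bit-for-bit reproducibility on GPU may need additional settings):

    import random
    import numpy as np
    import tensorflow as tf

    # fix the seeds once, before building the model, so every
    # hyperparameter trial starts from the same weight initialization
    seed = 1234
    random.seed(seed)
    np.random.seed(seed)
    tf.set_random_seed(seed)   # tf.random.set_seed(seed) in TF 2.x

    # ... build and train the Keras model as usual; the random init is now repeatable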
The paper you cited isn't about the network's weight initialization.
This is about the weighting of two loss functions, as the following key phrase reveals:
Our training procedure follows the approach of the Variational
Autoencoder [15], where the loss function is the sum of two terms: the
Reconstruction Loss, L_R, and the Kullback-Leibler Divergence Loss, L_KL.
Anyway, to answer your question: there are several other random factors in a neural model, not just the weight initialization.
There are also several methods to handle this randomness and its variance.
Some of them are training the network multiple times, as you mentioned, with different train-test splits, different cross-validation schemes, and many others.
You can fix the initial state of the random generator so that every hyper-parameter tuning run sees the same "randomness" in the weights, but you can, and sometimes should, do this at different stages of the training process: e.g. use seed(1234) for the weight initialization, but use seed(555) when building the train-test sets to get a similar distribution in the two sets.
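A rough sketch of that idea, with one seed for the split and a different seed per training run, reporting the mean and standard deviation over the runs (build_model() is a hypothetical stand-in for your own model code):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.random.rand(500, 5), np.random.rand(500)   # dummy data

    # fixed seed for the split -> the same train/test distribution every time
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=555)

    scores = []
    for run_seed in [1234, 1235, 1236, 1237, 1238]:
        np.random.seed(run_seed)             # a different weight init per run
        model = build_model()                # hypothetical: returns a compiled Keras model
        model.fit(X_tr, y_tr, epochs=10, verbose=0)
        scores.append(model.evaluate(X_te, y_te, verbose=0))

    print("score: %.4f +/- %.4f" % (np.mean(scores), np.std(scores)))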

Should I use the same training set from one epoch to another (convolutional neural network)

From what I know about convolutional neural networks, you must feed the same training examples each epoch, but shuffled (so the network won't remember some particular order while training).
However, in this article, they feed the network 640000 randomly sampled training instances each epoch (so only some of the training examples were "seen" before):
Each training instance was a uniformly sampled set of 3 images, 2 of
which are of the same class (x and x+), and the third (x−) of a
different class. Each training epoch consisted of 640000 such
instances (randomly chosen each epoch), and a fixed set of 64000
instances used for test.
So, do I have to use the same training examples each epoch, and why?
Experimental results are poor when I use random samples - the accuracy varies a lot. But I want to know why.
Most of the time you want to use as much data as you can. However, in the paper you cite they train with a triplet loss, which uses triples of images, and there could be billions of such triples.
You might wonder why we introduce the idea of an epoch at all if we're likely to obtain a different training set each time. The answer is technical: we'd like to evaluate the network on the validation data once in a while, and you might also want to decay the learning rate based on the number of completed epochs.
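To make the "different training set each epoch" idea concrete, here is a toy sketch (sample_triplet and train_step are hypothetical placeholders, not the paper's code):

    import random

    def sample_triplet(images_by_class):
        """Draw (anchor, positive, negative): two images of one class, one of another."""
        pos_class, neg_class = random.sample(list(images_by_class), 2)
        anchor, positive = random.sample(images_by_class[pos_class], 2)
        negative = random.choice(images_by_class[neg_class])
        return anchor, positive, negative

    def run_epoch(images_by_class, n_instances, train_step):
        # a fresh "training set" is drawn every epoch instead of reusing a fixed one
        for _ in range(n_instances):
            train_step(sample_triplet(images_by_class))   # placeholder optimizer step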

Using LSTM for binary classification

I have time series data of size 100000*5: 100000 samples and five variables. I have labeled each of the 100000 samples as either 0 or 1, i.e. binary classification.
I want to train it using an LSTM because of the time series nature of the data. I have seen examples of LSTM for time series prediction; is it suitable to use in my case?
Not sure about your needs.
LSTM is best suited to sequence models, like the time series you mention, but your description doesn't look like a time series.
Anyway, you may use an LSTM on time series not for prediction but for classification, as in this article.
In my experience, for binary classification with only 5 features you could find better methods; an LSTM will consume more memory than other methods and could get worse results.
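For what it's worth, a sequence classifier along those lines could look roughly like this in Keras (window length and layer sizes are arbitrary placeholders):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    timesteps, features = 50, 5
    X = np.random.rand(1000, timesteps, features)   # windows of the 5-variable series
    y = np.random.randint(0, 2, size=1000)          # one 0/1 label per window

    model = Sequential()
    model.add(LSTM(32, input_shape=(timesteps, features)))
    model.add(Dense(1, activation='sigmoid'))       # sigmoid + binary_crossentropy for 0/1 labels
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X, y, epochs=5, batch_size=64, verbose=0)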
First of all, you can see it from a different perspective: instead of having 100,000 labeled samples of 5 variables, treat it as 100,000 unlabeled samples of 6 variables, where the 6th variable is the label.
Therefore, you can train your LSTM as a multivariate predictor of the 6th variable, i.e. the sample label, and compare its predictions with the ground truth during testing to evaluate performance.
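A rough sketch of that data layout (all sizes below are small placeholders, not the real 100000*5 data; the model itself would be an ordinary multivariate LSTM predictor):

    import numpy as np

    X = np.random.rand(2000, 5)                      # the 5 original variables
    y = np.random.randint(0, 2, size=(2000, 1))      # the 0/1 label of each sample
    data = np.hstack([X, y])                         # 6th column = label

    timesteps = 50
    windows = np.array([data[i:i + timesteps] for i in range(len(data) - timesteps)])
    targets = data[timesteps:, 5]                    # the next step's label column
    # windows: (n, timesteps, 6) LSTM inputs; targets: the value the LSTM predicts,
    # thresholded and compared with the ground-truth labels at test time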

Effect of Data Parallelism on Training Result

I'm currently trying to implement multi-GPU training for my TensorFlow network. One solution would be to run one model per GPU, each with its own data batch, and to combine their weights after each training iteration; in other words, "data parallelism".
So for example if I use 2 GPUs, train with them in parallel, and combine their weights afterwards, then shouldn't the resulting weights be different compared to training with those two data batches in sequence on one GPU? Because both GPUs have the same input weights, whereas the single GPU has modified weights for the second batch.
Is this difference just marginal, and therefore not relevant for the end result after many iterations?
The order of the batches fed into training makes some difference, but the difference may be small if you have a large number of batches. Each batch pulls the variables in the model a bit towards the minimum of the loss. A different order may make the path towards the minimum a bit different, but as long as the loss is decreasing, your model is training and its evaluation gets better and better.
Sometimes, to prevent the same batches from repeatedly "pulling" the model in the same direction, and to avoid it becoming good only for some of the input data, the input for each model replica is randomly shuffled before being fed into the training program.
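To illustrate just the combination step, here is a toy sketch of averaging the replicas' weights after a parallel iteration (framework-specific machinery is omitted; this only shows the arithmetic):

    import numpy as np

    def combine_replica_weights(replica_weights):
        """Average the corresponding weight tensors of all replicas."""
        return [np.mean(tensors, axis=0) for tensors in zip(*replica_weights)]

    # e.g. two replicas, each holding its own list of weight arrays after one step
    w_gpu0 = [np.array([[0.1, 0.2]]), np.array([0.3])]
    w_gpu1 = [np.array([[0.3, 0.0]]), np.array([0.1])]
    print(combine_replica_weights([w_gpu0, w_gpu1]))
    # -> [array([[0.2, 0.1]]), array([0.2])]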

Neural Network Outputs Are Not Changing Very Much

I have 20 output neurons on a feed-forward neural network, for which I have already tried varying the number of hidden layers and the number of neurons per hidden layer. When testing, I've noticed that while the outputs are not always exactly the same, they vary very little from test case to test case, especially relative to one another. It seems to output nearly the same values (within 0.0005, depending on the initial weights) on every test case; the one that is the highest is always the highest. Is there a reason for this?
Note: I'm using a feed-forward neural network, with resilient and common backpropagation, separating training/validation/testing and shuffling in between training sets.
UPDATE: I'm using the network to categorize patterns from 4 inputs into one of twenty output possibilities. I have 5000 training sets, 800 validation sets, and 1500 testing sets. Number of rounds can vary depending on what I'm doing, on my current training case, the training error seems to converge too quickly (under 20 epochs). However, I have noticed this non-variance at other times when the error will decrease over a period of 1000 epochs. I have also adjusted the learning rate and momentum for the regular propagation. Resilient propagation does not use a learning rate or momentum for updates. This is being implemented using Encog.
Your dataset seems problematic to begin with. 20 outputs for 4 inputs seems like a lot; the number of outputs is generally much smaller than the number of inputs. Most probably, either the dataset is wrongly formulated, or you have misunderstood something in the problem you are trying to solve. Anyway, some things regarding your other comments:
First of all, you don't have 5000 training sets, 800 validation sets and 1500 test sets; you have one training set with 5000 patterns, one validation set with 800 patterns and one test set with 1500 patterns.
Second, the output can't be exactly the same on each run, since the weights are initialized randomly and the outputs depend on them. However, we want them to be similar across runs; if they weren't, it would mean they depend too much on the random initialization, and the network wouldn't work well.
In your case, the highest output is the selected category, so if the same output is the highest every time, your network is working well.
If the network output is almost the same for different input patterns, the network is unable to categorize the input well.
You say your network has 4 input nodes and 20 output nodes (right?). So, if the inputs are binary, there are 2*2*2*2 = 16 different possible input patterns. Why the hell do you need 800 validation patterns?
Your training data may be corrupt.
