Darts: Methods for Efficiently Predicting on a Test set (without retraining) - time-series

I am using the TFTModel. After training (and validating) using the fit method, I would like to predict all data points in the train, test and validation set using the already trained model.
Currently, there are only the methods:
historical_forcast: supports predicting for multiple time steps (with corresponding look backs) but just one time series
predict: supports predicting for multiple time series but just for n next time steps.
What I am looking for is a method like historical_forcast but where series, past_covariates, and future_covariates are supported for being predicted without retraining. My best attempt so far is to run the following code block on an already trained model:
predictions = []
for s, past_cov, future_cov in zip(series, past_covariates, future_covariates):
predictions.append(model.historical_forecasts(
s,
past_covariates=past_cov,
future_covariates=future_cov,
retrain=False,
start=model.input_chunk_length,
verbose=True
))
Where series, past_covariates, and future_covariates are lists of target time series and covariates respectively, each consisting of the concatenated train, val and test series which I split afterwards again to ensure the availability of the past values needed for predicting at the beginning of test and val.
My objection / question about this: is there a more efficient way to do this through better batching with the current interface, our would I have to call the torch model my self?

Related

How to use tf.metrics.accuracy?

I want to use tf.metrics.accuracy to track the accuracy of my predictions, but I am unsure of how to use the update_op (acc_update_op below) that the function returns:
accuracy, acc_update_op = tf.metrics.accuracy(labels, predictions)
I was thinking that adding it to tf.GraphKeys.UPDATE_OPS would make sense, but I am not sure how to do this.
tf.metrics.accuracy is one of the many streamed metric TensorFlow operations (another one of which is tf.metrics.recall). Upon creation, two variables (count and total) are created in order to accumulate all incoming results for one final outcome. The first returned value is a tensor for the calculation count / total. The second op returned is a stateful function which updates these variables. Streamed metric functions are useful when evaluating the performance of a classifier over multiple batches of data. A quick example of use:
# building phase
with tf.name_scope("streaming"):
accuracy, acc_update_op = tf.metrics.accuracy(labels, predictions)
test_fetches = {
'accuracy': accuracy,
'acc_op': acc_update_op
}
# when testing the classifier
with tf.name_scope("streaming"):
# clear counters for a fresh evaluation
sess.run(tf.local_variables_initializer())
for _i in range(n_batches_in_test):
fd = get_test_batch()
outputs = sess.run(test_fetches, feed_dict=fd)
print("Accuracy:", outputs['accuracy'])
I was thinking that adding it to tf.GraphKeys.UPDATE_OPS would make sense, but I am not sure how to do this.
That would not be a good idea unless you are only using the UPDATE_OPS collection for testing purposes. Usually, the collection will already have certain control operations for the training phase (such as moving batch normalization parameters) that are not meant to be run alongside the validation phase. It may be best to either keep them in a new collection or add these operations to the fetch dictionary manually.

Predictive modelling

How to perform regression(Random Forest,Neural Networks) for this kind of data?
The data contains features and we need to predict sales qty based on week and attributes
here I am attaching the sample data
Here we are trying to predict sales quantity based on other attributes
Multivariate linear regression
Assuming
input variables x[][] (each row corresponds to a sample, each column corresponds to a variable such as week, season, ..)
expected output y[] (as many rows as x)
parameters being learned theta[] (as many as there are input variables + 1)
you are optimizing a function h:
h = sum for all j of { x[j][i] * p[i] - y[j] } is minimal
This can easily be achieved through gradient descent.
You can also include combinations of parameters (and simply include more thetas for those pseudo-parameters)
I have some code lying around in a GitHub repository that performs basic multivariate linear regression (for a course I sometimes teach).
https://github.com/jorisschellekens/ml/tree/master/linear_regression

Is this the right way to predict a time series with the stateful LSTM neural network?

I am using LSTM neural networks (stateful) for time series prediction.
I'm hoping that the stateful LSTM can capture the hidden patterns and make a satisfactory prediction (the physical law that cause the variation of the time series is not clear).
I have a time series X with a length of 1500 (actual observational data), and my purpose is to predict the future 100.
I suppose predict the next 10 will be more promising than predict the next 100 (is that right?).
So, I prepare the training data like this (always using 100 values to predict the next 10; x_n denotes the n-th element in X):
shape of trainX: [140, 100, 1]
shape of trainY: [140, 10, 1]
---
0: [x_0, x_1, ..., x_99] -> [x_100, x_101, ..., x_109]
1: [x_10, x_11, ..., x_109] -> [x_110, x_111, ..., x_119]
2: [x_20, x_21, ..., x_119] -> [x_120, x_121, ..., x_129]
...
139: [x_1390, x_1391, ..., x_1489] -> [x_1490, x_1491, ..., x_1499]
---
After the training, I want to use the model to predict the next 10 values [x_1500 - x_1509] with [x_1400 - x_1499], and then predict the next 10 values [x_1510 - x_1519] with [x_1410 - x_1509].
Is this the right way?
After a lot of reading of documents and examples, I can train a model and make the prediction, but the result seems not satisfactory.
To validate the method, I assume that the last 100 (x_1400 - x_1499) values are unknown, and remove them from trainX and trainY, then try to train a model and predict them. Lastly, compare the predicted values with the observed values.
Any suggestions or comments will be appreciated.
The time series looks like this:
Your question is really complexed. Before I will try to answer it - I'll share my doubts with you about is it sensible to use LSTM for your task. You want to use a really advanced model (LSTM are capable to learn really complexed patterns) to a time series which seems to be relatively easy. Moreover - you have a really small amout of data. To be honest - I would try to train simpler and easier methods first (like ARMA or ARIMA).
To answer your question - if your approach is good - it seems to be reasonable. Other reasonable methods are predicting all 100 steps or e.g. 50 steps twice. With 10 steps you might come across error cumulation - but still it might be a good method.
As I mentioned earlier - I would rather try easier ML method for this task but if you really want to use LSTM you may tackle this problem in a following way:
Define metaparameters like number of steps ahead you want to predict, the size of input fed to network.
Try to use e.g. grid search in order to find the best value of this metaparameters. Evaluate each setup using k-fold crossvalidation.
Retrain final model using the best metaparameter setup.
You have relatively small amount of data so you may easily find the best values of hyperparameters. This will also show you if your approach is good or not - simply check the results provided by the best solution.

How to split documents into training set and test set?

I am trying to build a classification model. I have 1000 text documents in local folder. I want to divide them into training set and test set with a split ratio of 70:30(70 -> Training and 30 -> Test) What is the better approach to do so? I am using python.
I wanted a approach programatically to split the training set and test set. First to read the files in local directory. Second, to build a list of those files and shuffle them. Thirdly to split them into a training set and test set.
I tried a few ways by using built in python keywords and functions only to fail. Lastly I got the idea of approaching it. Also Cross-validation is a good option to be considered for the building general classification models.
Not sure exactly what you're after, so I'll try to be comprehensive. There will be a few steps:
Get a list of the files
Randomize the files
Split files into training and testing sets
Do the thing
1. Get a list of the files
Let's assume that your files all have the extension .data and they're all in the folder /ml/data/. What we want to do is get a list of all of these files. This is done simply with the os module. I'm assuming you have no subdirectories; this would change if there were.
import os
def get_file_list_from_dir(datadir):
all_files = os.listdir(os.path.abspath(datadir))
data_files = list(filter(lambda file: file.endswith('.data'), all_files))
return data_files
So if we were to call get_file_list_from_dir('/ml/data'), we would get back a list of all the .data files in that directory (equivalent in the shell to the glob /ml/data/*.data).
2. Randomize the files
We don't want the sampling to be predictable, as that is considered a poor way to train an ML classifier.
from random import shuffle
def randomize_files(file_list):
shuffle(file_list)
Note that random.shuffle performs an in-place shuffling, so it modifies the existing list. (Of course this function is rather silly since you could just call shuffle instead of randomize_files; you can write this into another function to make it make more sense.)
3. Split files into training and testing sets
I'll assume a 70:30 ratio instead of any specific number of documents. So:
from math import floor
def get_training_and_testing_sets(file_list):
split = 0.7
split_index = floor(len(file_list) * split)
training = file_list[:split_index]
testing = file_list[split_index:]
return training, testing
4. Do the thing
This is the step where you open each file and do your training and testing. I'll leave this to you!
Cross-Validation
Out of curiosity, have you considered using cross-validation? This is a method of splitting your data so that you use every document for training and testing. You can customize how many documents are used for training in each "fold". I could go more into depth on this if you like, but I won't if you don't want to do it.
Edit: Alright, since you requested I will explain this a little bit more.
So we have a 1000-document set of data. The idea of cross-validation is that you can use all of it for both training and testing — just not at once. We split the dataset into what we call "folds". The number of folds determines the size of the training and testing sets at any given point in time.
Let's say we want a 10-fold cross-validation system. This means that the training and testing algorithms will run ten times. The first time will train on documents 1-100 and test on 101-1000. The second fold will train on 101-200 and test on 1-100 and 201-1000.
If we did, say, a 40-fold CV system, the first fold would train on document 1-25 and test on 26-1000, the second fold would train on 26-40 and test on 1-25 and 51-1000, and on.
To implement such a system, we would still need to do steps (1) and (2) from above, but step (3) would be different. Instead of splitting into just two sets (one for training, one for testing), we could turn the function into a generator — a function which we can iterate through like a list.
def cross_validate(data_files, folds):
if len(data_files) % folds != 0:
raise ValueError(
"invalid number of folds ({}) for the number of "
"documents ({})".format(folds, len(data_files))
)
fold_size = len(data_files) // folds
for split_index in range(0, len(data_files), fold_size):
training = data_files[split_index:split_index + fold_size]
testing = data_files[:split_index] + data_files[split_index + fold_size:]
yield training, testing
That yield keyword at the end is what makes this a generator. To use it, you would use it like so:
def ml_function(datadir, num_folds):
data_files = get_file_list_from_dir(datadir)
randomize_files(data_files)
for train_set, test_set in cross_validate(data_files, num_folds):
do_ml_training(train_set)
do_ml_testing(test_set)
Again, it's up to you to implement the actual functionality of your ML system.
As a disclaimer, I'm no expert by any means, haha. But let me know if you have any questions about anything I've written here!
that's quite simple if you use numpy, first load the documents and make them a numpy array, and then:
import numpy as np
docs = np.array([
'one', 'two', 'three', 'four', 'five',
'six', 'seven', 'eight', 'nine', 'ten',
])
idx = np.hstack((np.ones(7), np.zeros(3))) # generate indices
np.random.shuffle(idx) # shuffle to make training data and test data random
train = docs[idx == 1]
test = docs[idx == 0]
print(train)
print(test)
the result:
['one' 'two' 'three' 'six' 'eight' 'nine' 'ten']
['four' 'five' 'seven']
Just make a list of the filenames using os.listdir(). Use collections.shuffle() to shuffle the list, and then training_files = filenames[:700] and testing_files = filenames[700:]
You can use train_test_split method provided by sklearn. See documentation here:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Parameter selection and k-fold cross-validation

I have one dataset, and need to do cross-validation, for example, a 10-fold cross-validation, on the entire dataset. I would like to use radial basis function (RBF) kernel with parameter selection (there are two parameters for an RBF kernel: C and gamma). Usually, people select the hyperparameters of SVM using a dev set, and then use the best hyperparameters based on the dev set and apply it to the test set for evaluations. However, in my case, the original dataset is partitioned into 10 subsets. Sequentially one subset is tested using the classifier trained on the remaining 9 subsets. It is obviously that we do not have fixed training and test data. How should I do hyper-parameter selection in this case?
Is your data partitioned into exactly those 10 partitions for a specific reason? If not you could concatenate/shuffle them together again, then do regular (repeated) cross validation to perform a parameter grid search. For example, with using 10 partitions and 10 repeats gives a total of 100 training and evaluation sets. Those are now used to train and evaluate all parameter sets, hence you will get 100 results per parameter set you tried. The average performance per parameter set can be computed from those 100 results per set then.
This process is built-in in most ML tools already, like with this short example in R, using the caret library:
library(caret)
library(lattice)
library(doMC)
registerDoMC(3)
model <- train(x = iris[,1:4],
y = iris[,5],
method = 'svmRadial',
preProcess = c('center', 'scale'),
tuneGrid = expand.grid(C=3**(-3:3), sigma=3**(-3:3)), # all permutations of these parameters get evaluated
trControl = trainControl(method = 'repeatedcv',
number = 10,
repeats = 10,
returnResamp = 'all', # store results of all parameter sets on all partitions and repeats
allowParallel = T))
# performance of different parameter set (e.g. average and standard deviation of performance)
print(model$results)
# visualization of the above
levelplot(x = Accuracy~C*sigma, data = model$results, col.regions=gray(100:0/100), scales=list(log=3))
# results of all parameter sets over all partitions and repeats. From this the metrics above get calculated
str(model$resample)
Once you have evaluated a grid of hyperparameters you can chose a reasonable parameter set ("model selection", e.g. by choosing a well performing while still reasonable incomplex model).
BTW: I would recommend repeated cross validation over cross validation if possible (eventually using more than 10 repeats, but details depend on your problem); and as #christian-cerri already recommended, having an additional, unseen test set that is used to estimate the performance of your final model on new data is a good idea.

Resources