Should I select ARIMA parameters based on training data or the whole data? - time-series

I have time series data and I would like to build an ARIMA forecasting model. I have split my data into training and test sets; I will train the model only on the training set and evaluate it on the test set.
So my question is: when I plot the ACF and PACF to get an idea of the appropriate p and q parameters, should I plot them on my training set or on the whole data? And for Auto ARIMA, should I feed it the whole data or just the training set?
I tried both the training data and the whole data, and they give different results (for both the ACF/PACF plots and Auto ARIMA). So which data should I use?
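For reference, a minimal sketch of the comparison described above, using statsmodels for the ACF/PACF plots and pmdarima for Auto ARIMA; the toy series and 80/20 split are placeholders, not the asker's data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pmdarima as pm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Toy daily series standing in for the real data
rng = pd.date_range("2020-01-01", periods=200, freq="D")
y = pd.Series(np.cumsum(np.random.randn(200)), index=rng)

split = int(len(y) * 0.8)
y_train, y_test = y.iloc[:split], y.iloc[split:]

# ACF/PACF on the training split vs. the whole series
fig, axes = plt.subplots(2, 2, figsize=(10, 6))
plot_acf(y_train, ax=axes[0, 0], title="ACF (train)")
plot_pacf(y_train, ax=axes[0, 1], title="PACF (train)")
plot_acf(y, ax=axes[1, 0], title="ACF (all data)")
plot_pacf(y, ax=axes[1, 1], title="PACF (all data)")
plt.tight_layout()

# Auto ARIMA on the training split vs. the whole series
model_train = pm.auto_arima(y_train, seasonal=False, suppress_warnings=True)
model_all = pm.auto_arima(y, seasonal=False, suppress_warnings=True)
print(model_train.order, model_all.order)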

Related

Differencing a time series in Prophet

I have a series that has a linear trend but no seasonality, and I have been trying different time series algorithms. I tried ARIMA using pmdarima and I get good results with first-order differencing of the series.
Next, I am using Prophet. With the series as is, I get a high MAE. So I differenced the series and used Prophet to make predictions. But now the predicted values (yhat) are on the differenced scale. How do I convert the predicted values in the yhat column back to the original scale so that I can calculate the MAE and evaluate the model?
Is it even possible? I have tried all the solutions I could find, but since this is unlike a min-max scaler, I have not been able to find a way out. Most of the solutions require the first value of the original series to inverse-difference the differenced series.
Any help will be appreciated.
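For what it's worth, the inverse-differencing idea mentioned above usually boils down to a cumulative sum of the predicted differences plus the last observed value of the original series. A minimal sketch on toy data; y, split and yhat_diff are assumed names, with yhat_diff playing the role of Prophet's forecast of the differenced series:

import numpy as np
import pandas as pd

# Toy stand-ins: y is the original series, the last 10 points are the forecast window,
# and yhat_diff pretends to be Prophet's predictions of the first-differenced series.
y = pd.Series(np.cumsum(np.random.randn(110)))
split = 100
yhat_diff = y.diff().iloc[split:].to_numpy()

# Undo first-order differencing: each original value equals the previous original value
# plus the predicted difference, so cumulatively sum the predicted differences and add
# the last observed value of the original (undifferenced) series before the window.
last_observed = y.iloc[split - 1]
yhat_original = last_observed + np.cumsum(yhat_diff)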

Normalize time-series data before or after split of training and testing data?

I use a classification model on time-series data and I normalize the data before splitting it into train and test sets. Now, I know that train and test data should be treated separately to prevent data leakage. What would be the proper order of the normalization steps here? Should I apply steps 1, 2 and 3 separately to train and test after splitting the data with a sliding window? I use a sliding window here to compare each hour (test) with its previous 24 hours of data (train). Here is the order I am currently using in the pipeline:
1. Moving averages (mean)
2. Resampling every hour
3. Standardization
4. Split the data into train and test using a sliding window (24 hours of train, sliding forward 1 hour at a time for test)
5. Fit the model using the train data
6. Predict using the test data
Steps 1 and 2 can be done safely; you just have to make sure the moving average uses only past values: X'_i = mean(X_i, X_{i-1}, X_{i-2}, ..., X_{i-n}).
However, in step 3 the normalization/standardization parameters (the max and min if you are using a min-max scaler, or the mean and standard deviation if you are using standardization) should be computed from the training data only and then applied to the whole dataset, so your pipeline would be something like this:
1. Moving average (using only past values)
2. Resampling every hour
3. Split the data into train and test
4. Get the standardization parameters (mean and std) from the train data
5. Standardize the whole dataset (train and test) using the parameters computed in step 4
6. Fit the model using the train data
7. Predict using the test data
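A minimal sketch of that pipeline using scikit-learn's StandardScaler; the toy minute-level series, window sizes and variable names are illustrative, not taken from the question:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy minute-level series
idx = pd.date_range("2021-01-01", periods=3000, freq="min")
s = pd.Series(np.cumsum(np.random.randn(3000)), index=idx)

# Steps 1-2: trailing moving average (past values only) and hourly resampling
smoothed = s.rolling(window=60).mean()
hourly = smoothed.resample("H").mean().dropna()

# Step 3: one sliding-window position, 24 h of train followed by 1 h of test
window = hourly.iloc[-25:]
train, test = window.iloc[:24], window.iloc[24:]

# Steps 4-5: standardization parameters from the train data only, applied to both
scaler = StandardScaler().fit(train.to_frame())
train_std = scaler.transform(train.to_frame())
test_std = scaler.transform(test.to_frame())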

Time series forecasting

I've been following a lot of tutorials that use LSTMs to forecast time-series data. My question is: how do we predict on new data that is not part of the dataset, since almost all the tutorials show Keras's predict function being used on the test split?
How do we actually forecast into the future?
Usually, you create your training data such that the model receives n points and predicts the following m points. Once your model is trained, you take the last n available points of your dataset (or new points from the present), and the model will output a prediction of the m points in the future.
If you want to predict more than m points into the future, you can predict m points, use them as input to predict another m points, and so on. However, you should be aware that with this technique you will probably get worse results, as you are accumulating errors.
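A minimal sketch of that recursive scheme, assuming a trained Keras-style model that maps the last n points (shaped (1, n, 1)) to the next m points; the function and argument names are made up for illustration:

import numpy as np

def forecast_recursively(model, history, n, m, steps_ahead):
    """Roll the model forward: feed the last n points, append the m predicted
    points to the history, and repeat until at least steps_ahead future points
    have been produced. model is assumed to be a trained Keras model with input
    shape (1, n, 1) and output shape (1, m)."""
    history = list(history)
    future = []
    while len(future) < steps_ahead:
        window = np.array(history[-n:], dtype="float32").reshape(1, n, 1)
        pred = model.predict(window, verbose=0).reshape(-1)  # m predicted points
        future.extend(pred.tolist())
        history.extend(pred.tolist())  # errors start accumulating from here on
    return np.array(future[:steps_ahead])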

Machine learning: training model from test data

I was wondering whether a model also trains itself on the test data when it is evaluated on it multiple times, leading to an overfitting scenario. Normally we split the data into train-test splits, and I have noticed some people split it into three sets: train, test and eval, where eval is for the final evaluation of the model. I might be wrong, but my point is that if the above-mentioned scenario is not true, then there is no need for an eval data set.
I need some clarification.
The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., been trained on) -- assuming you have the labels in a supervised learning problem.
People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). Then the selected model is usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
Does this help?
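As a rough illustration of that workflow (tune on the test set, retrain on train + test, report on eval), here is a sketch with scikit-learn on toy data; the model, metric and split sizes are arbitrary choices, not a prescription:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Toy regression data standing in for whatever features/targets you have
X = np.random.randn(500, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + np.random.randn(500)

# Split into train / test (for tuning) / eval (final report)
X_tmp, X_eval, y_tmp, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Tune a hyperparameter using the test set
best_alpha, best_mse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# Retrain on train + test with the chosen hyperparameter, then report on eval
final = Ridge(alpha=best_alpha).fit(np.vstack([X_train, X_test]),
                                    np.concatenate([y_train, y_test]))
print(mean_squared_error(y_eval, final.predict(X_eval)))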
Consider that you have train and test datasets. The train dataset is the one for which you know the output; you train your model on it and then try to predict the output for the test dataset.
Most people further split the train dataset into train and validation sets. So first you fit your model on the train data and evaluate it on the validation set, and only then do you run the model on the test dataset.
Now you may be wondering how this helps and whether it is of any use.
It helps you to understand your model's performance on seen data (the validation data) and on unseen data (your test data).
This is where the bias-variance trade-off comes into the picture.
https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
Let's consider a binary classification example where a student's previous semester grades, sports achievements, extracurriculars, etc. are used to predict whether or not they will pass the final semester.
Let's say we have around 10000 samples (data of 10000 students).
Now we split them:
Training set - 6000 samples
Validation set - 2000 samples
Test set - 1000 samples
The data is generally split into three sets (training, validation, and test) for the following reasons:
1) Feature selection: Let's assume you have trained the model using some algorithm. You calculate the training accuracy and the validation accuracy, plot the learning curves, determine whether the model is overfitting or underfitting, and make changes (add or remove features, add more samples, etc.). Repeat until you have the best validation accuracy, then test the model on the test set to get your final score.
2) Parameter selection: When you use an algorithm like KNN, you need to find the K value which fits the model best. You can plot the accuracy for different K values, choose the one with the best validation accuracy, and then use it on your test set (the same applies when you choose n_estimators for random forests, etc.); see the sketch after this answer.
3) Model selection: You can also train models with different algorithms and choose the one which fits the data best by checking the accuracy on the validation set.
So basically the validation set helps you evaluate your model's performance and decide how to fine-tune it for the best accuracy.
Hope you find this helpful.
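Here is a sketch of point 2 above (choosing K for KNN by validation accuracy, then scoring once on the test set), with toy data and split sizes loosely following the 6000/2000/1000 example; everything in it is illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy binary-classification data standing in for the student example
X, y = make_classification(n_samples=10000, n_features=8, random_state=0)

# Roughly 6000 / 2000 / 1000 train / validation / test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=6000, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, train_size=2000,
                                                test_size=1000, random_state=0)

# Pick the K with the best validation accuracy, then report the test accuracy once
best_k = max(range(1, 16, 2),
             key=lambda k: KNeighborsClassifier(n_neighbors=k)
                           .fit(X_train, y_train).score(X_val, y_val))
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))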

Overfitting and Data splitting

Let's say that I have a data file like:
Index,product_buying_date,col1,col2
0,2013-01-16,34,Jack
1,2013-01-12,43,Molly
2,2013-01-21,21,Adam
3,2014-01-09,54,Peirce
4,2014-01-17,38,Goldberg
5,2015-01-05,72,Chandler
..
..
2000000,2015-01-27,32,Mike
with more rows like these, and I have a target variable y. Assume whatever else you need for convenience.
Now, I am aware that we divide the data into 2 parts, i.e. Train and Test. Then we divide Train 70:30, build the model with the 70% and validate it with the 30%. We tune the parameters so that the model does not overfit, and then predict on the Test data. For example: I divide the 2,000,000 rows into two equal parts. 1,000,000 is Train; 30% of that (300,000) is the validation part, and the remaining 70% (700,000) is where I build the model.
QUESTION: Does the above logic depend on how the original data is split?
Generally we shuffle the data and then break it into train, validation and test sets (train + validation = Train).
But what if the split is alternating? Say, when I divide the data into Train and Test first, I give the even rows to Test and the odd rows to Train. (Here the data is initially sorted by the 'product_buying_date' column, so when I split it into odd and even rows the split is uniform.)
And when I build the model with Train, I overfit it so that I get the maximum AUC on the Test data.
QUESTION: Isn't overfitting helping in this case?
QUESTION: Does the above logic depend on how the original data is split?
If the dataset is large (hundreds of thousands of rows), you can randomly split the data and you should not have any problem. But if the dataset is small, then you can adopt a different approach such as cross-validation. Cross-validation means that you make n training/validation splits out of your Training set.
Suppose you have 2000 data points; you split them like this:
1000 - Training dataset
1000 - testing dataset
5-fold cross-validation would then mean that you make five 800/200 training/validation splits out of the 1000 training points.
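A small sketch of that scheme with scikit-learn's KFold on toy data; the model and variable names are just for illustration:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

# 1000 training points (toy data); the other 1000 test points stay untouched
X_train = np.random.randn(1000, 4)
y_train = (X_train[:, 0] + np.random.randn(1000) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for fit_idx, val_idx in kf.split(X_train):
    # Each fold: 800 points to fit, 200 to validate
    model = LogisticRegression().fit(X_train[fit_idx], y_train[fit_idx])
    scores.append(model.score(X_train[val_idx], y_train[val_idx]))
print(np.mean(scores))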
QUESTION: Isn't overfitting helping in this case?
The number one rule of machine learning is that you don't touch the test data set. It is a holy data set that should not be touched.
If you overfit to the test data to get the maximum AUC score, then the validation dataset loses its meaning. The foremost aim of any ML algorithm is to reduce the generalization error, i.e. the algorithm should perform well on unseen data. If you tune your algorithm on the test data, you won't be able to meet this criterion. In cross-validation, too, you do not touch your test set: you select your algorithm, tune its parameters with the validation dataset, and only after that apply your algorithm to the test dataset, which gives your final score.
