How to predict future data based on a present data set (time series)

I'm trying to predict future values from my present data set in Python using pandas. After making my data set stationary, none of the time series algorithms give correct predictions. Can anyone please help me?

Related

Forecasting Value in time series data with multiple independent variables

I have a data set with the attributes (Date, Value, Variable-1, Variable-2, Variable-3, Variable-4, Variable-5) and more than 100k rows. I want to predict the future "Value" based on the five variables, trained in a time series manner; there will be seasonal trends and low and high scores in "Value". Can someone suggest a statistical or machine learning/deep learning solution for this?
Here is a screenshot of the data set; I want to forecast the "Value" variable.
This is a very interesting problem, and you can use the vector autoregression (VAR) method to solve it. Packages for it are available in both R and Python.
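A minimal sketch of what this could look like in Python with statsmodels (the file name, column names, lag order, and forecast horizon are placeholders, not taken from the question):

import pandas as pd
from statsmodels.tsa.api import VAR

# "data.csv" and the column names are placeholders for the asker's file.
df = pd.read_csv("data.csv", parse_dates=["Date"], index_col="Date")
cols = ["Value", "Variable-1", "Variable-2", "Variable-3", "Variable-4", "Variable-5"]
df = df[cols]

# VAR assumes stationary series, so difference first if the levels are not stationary.
diffed = df.diff().dropna()

model = VAR(diffed)
results = model.fit(maxlags=15, ic="aic")   # pick the lag order by AIC

# Forecast the next 10 steps for all series, including "Value".
last_obs = diffed.values[-results.k_ar:]
print(results.forecast(last_obs, steps=10))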

Temporal train-test split for forecasting

I know this may be a basic question, but I want to know if I am using the train/test split correctly.
Say I have data that ends at 2019, and I want to predict values in the next 5 years.
The graph I produced is provided below:
My training data runs from 1996 to 2014 and my test data from 2014 to 2019. The predictions fit the test data very well. I then used this model to make predictions from 2019 to 2024.
Is this the correct way to do it, or should my predictions also be for 2014-2019, just like the test data?
The test/validation data is useful for evaluating which predictor to use. Once you have decided which model to use, you should train the model on the whole data set (1996-2019) so that you do not lose possibly valuable knowledge from 2014-2019. Take into account that when working with time series, the newer part of the series usually matters more for your prediction than older values of the series.
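A rough sketch of that workflow in Python (the yearly index, the file name, and the ARIMA(1, 1, 1) order are assumptions for illustration, not from the question):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical yearly series from 1996 to 2019.
series = pd.read_csv("values.csv", index_col="Year")["Value"]

train = series.loc[1996:2014]            # fit candidate models on the older part
test = series.loc[2015:2019]             # evaluate them on the held-out recent part

candidate = ARIMA(train, order=(1, 1, 1)).fit()
test_pred = candidate.forecast(steps=len(test))
print("test MSE:", ((test_pred.values - test.values) ** 2).mean())

# Once the model is chosen, refit on ALL data (1996-2019) before forecasting 2020-2024.
final = ARIMA(series, order=(1, 1, 1)).fit()
print(final.forecast(steps=5))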

Using Random Forest for time series dataset

For a time series dataset, I would like to do some analysis and create a prediction model. Usually, we would split the data (by random sampling throughout the entire data set) into a training set and a testing set, use the training set with the randomForest function, and keep the testing part to check the behaviour of the model.
However, I have been told that it is not possible to split data by random sampling for time series data.
I would appreciate it if someone could explain how to split time series data into training and testing sets, or whether there is an alternative way to do time series random forest.
Regards
We live in a world where "future-to-past causality" only occurs in cool sci-fi movies. Thus, when modeling time series, we like to avoid explaining past events with future events. Also, we like to verify that our models, strictly trained on past events, can explain future events.
To model a time series T with RF, rolling is used: for day t, the value T[t] is the target, and the values T[t-k] for k = {1, 2, ..., h}, where h is the past horizon, are used to form the features. For non-stationary time series, T is converted to, e.g., the relative change T_rel[t] = (T[t+1] - T[t]) / T[t].
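A minimal sketch of this rolling feature construction in Python with scikit-learn (the file name, column name, and h = 5 are assumptions; the out-of-bag score mentioned below is enabled via oob_score=True):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# "series.csv" and the column names are placeholder assumptions.
series = pd.read_csv("series.csv", parse_dates=["Date"], index_col="Date")["T"]

# Relative change for a non-stationary series, as described above.
t_rel = series.pct_change().dropna()

# Lagged features T_rel[t-k] for k = 1..h, with T_rel[t] as the target.
h = 5
frame = pd.DataFrame({"target": t_rel})
for k in range(1, h + 1):
    frame[f"lag_{k}"] = t_rel.shift(k)
frame = frame.dropna()

X, y = frame.drop(columns="target"), frame["target"]
rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB R^2:", rf.oob_score_)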
To evaluate performance, I advise checking the out-of-bag (OOB) cross-validation measure of RF. Be aware that there are some pitfalls that can render this measure over-optimistic:
Unknown future-to-past contamination: the rolling is somehow faulty and the model uses future events to explain that same future within the training set.
Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.
Possible other mistakes I don't know of yet.
In the end, everyone can make the above mistakes in some latent way. To check that this is not happening, you need to validate your model with back-testing, where each day is forecast by a model strictly trained on past events only.
When OOB-CV and back-testing wildly disagree, this may be a hint that there is a bug in the code.
To back-test, roll over T[t-1] to T[t-traindays]: fit the model on this training data and forecast T[t]. Then increase t by one (t++) and repeat.
To speed this up, you may train your model only once, or only at every n-th increment of t.
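A back-testing sketch along those lines, reusing the lag-feature frame from the earlier sketch (the 500-day training window and retraining every 20 steps are arbitrary assumptions):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Reuses the `frame` of lagged features built in the sketch above.
train_days = 500       # length of the rolling training window (arbitrary)
retrain_every = 20     # retrain only at every n-th increment of t (arbitrary)
preds, actuals = [], []

model = None
for i, t in enumerate(range(train_days, len(frame))):
    if model is None or i % retrain_every == 0:
        window = frame.iloc[t - train_days:t]                      # strictly past rows
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(window.drop(columns="target"), window["target"])
    row = frame.iloc[[t]]
    preds.append(model.predict(row.drop(columns="target"))[0])
    actuals.append(row["target"].iloc[0])

print("back-test RMSE:", np.sqrt(np.mean((np.array(preds) - np.array(actuals)) ** 2)))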
Reading the sales file (slice() below comes from dplyr):
library(dplyr)
Sales <- read.csv("Sales.csv")
Finding the length of the training set and of the full data set.
train_len=round(nrow(Sales)*0.8)
test_len=nrow(Sales)
Splitting your data into training and testing sets; here I have used an 80-20 split, which you can change. Make sure your data is sorted in ascending order of date.
Training Set
training <- slice(Sales, 1:train_len)
Testing Set
testing <- slice(Sales, (train_len + 1):test_len)

normalization methods for stream data

I am using the CluStream algorithm and I have figured out that I need to normalize my data. I decided to use min-max scaling for this, but I think the values of newly arriving data objects will then be computed differently, because the min and max values may change. Am I correct? If so, which algorithm should I use?
Instead of computing the global min-max over the whole data, you can use a local normalization based on a sliding window (e.g., using just the last 15 seconds of data). This approach is very common for computing a local mean filter in signal and image processing.
I hope this helps.
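A minimal sketch of that sliding-window idea in Python (the window length, dimensionality, and the synthetic stream are assumptions for illustration):

from collections import deque

import numpy as np

class SlidingMinMax:
    """Min-max scale each new point using only the last `window` points seen."""

    def __init__(self, window=100, eps=1e-9):
        self.buffer = deque(maxlen=window)
        self.eps = eps

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.buffer.append(x)
        data = np.stack(self.buffer)
        lo, hi = data.min(axis=0), data.max(axis=0)
        return (x - lo) / (hi - lo + self.eps)      # per-feature local min-max

scaler = SlidingMinMax(window=100)
rng = np.random.default_rng(0)
for _ in range(1000):                                # stand-in for the real stream
    point = rng.random(3)                            # e.g. 3-dimensional stream objects
    normalized = scaler.update(point)
    # feed `normalized` into CluStream here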
When normalizing stream data, you need to use the statistical properties of the training set. During streaming, you just clip values that are too large or too small to the stored min/max. There is no other way; it's a stream.
As a trade-off, you can continuously collect the statistical properties of all your data and retrain your model from time to time to adapt to evolving data. I don't know CluStream, but after a quick search it seems to be an algorithm designed to help with exactly such trade-offs.
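A sketch of that training-set-statistics approach with clipping and periodic refitting (the synthetic data and the refit interval are only illustrative):

import numpy as np

class TrainStatsMinMax:
    """Min-max scaling with statistics from the training set; clip new extremes."""

    def __init__(self, train_data, eps=1e-9):
        train_data = np.asarray(train_data, dtype=float)
        self.lo = train_data.min(axis=0)
        self.hi = train_data.max(axis=0)
        self.eps = eps

    def transform(self, x):
        x = np.clip(np.asarray(x, dtype=float), self.lo, self.hi)   # cut too big/low values
        return (x - self.lo) / (self.hi - self.lo + self.eps)

rng = np.random.default_rng(0)
train = rng.random((500, 3))                  # stand-in for the historical training data
scaler = TrainStatsMinMax(train)

seen = list(train)
for _ in range(1000):                         # stand-in for the stream
    point = rng.random(3) * 1.5               # may fall outside the training range
    normalized = scaler.transform(point)
    seen.append(point)
    if len(seen) % 250 == 0:                  # periodically refit on everything seen so far
        scaler = TrainStatsMinMax(np.stack(seen))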

No improvement in prediction accuracy despite obtaining the best C and g values

I am a newbie in machine learning, so I apologize in advance for this silly question.
I used LIBSVM to train a model on scaled training data with the default parameters. However, I got only 26.26% accuracy on the test data. I therefore used grid search to obtain the optimum C and gamma (g) values.
I plugged in the best C and g values and re-trained on my training data, but there was no change in accuracy. Can anyone please explain the reason behind this? Thanks.
You have to normalize the data first and then use grid search to obtain the optimum C and gamma (g) values.
Remember that you need to normalize both the training and the testing data, using the same scaling parameters for both.
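A hedged sketch of that workflow using scikit-learn's SVC (which wraps LIBSVM) instead of the LIBSVM command-line tools; the file names and the parameter grid are assumptions:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Hypothetical pre-split arrays; replace with your own data loading.
X_train, y_train = np.load("X_train.npy"), np.load("y_train.npy")
X_test, y_test = np.load("X_test.npy"), np.load("y_test.npy")

# Putting the scaler inside the pipeline normalizes the test data with the
# parameters learned from the training data only.
pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": 2.0 ** np.arange(-5, 16, 2),
                "svc__gamma": 2.0 ** np.arange(-15, 4, 2)},
    cv=5,
)
grid.fit(X_train, y_train)
print("best C/gamma:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))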
