Weka time series: auto-complete missing dates - time-series

I am using Weka's time series package for a forecasting task. I need to implement the forecast programmatically in Java.
I followed the example given at
http://wiki.pentaho.com/display/DATAMINING/Time+Series+Analysis+and+Forecasting+with+Weka#TimeSeriesAnalysisandForecastingwithWeka-2Requirements. However, in some situations the time series (weekly data) is not consecutive, e.g.:
2010-03-05
2010-03-26
The results from the code will differ from what the Weka UI gives. The reason is that, as I can see from Weka's UI output, Weka automatically supplements the missing dates as follows:
2010-03-05
2010-03-12
2010-03-19
2010-03-26
with the predicted values used as training values.
Does anyone know how to make this happen in Java code?
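For illustration only (this is pandas, not the Weka Java API the question asks about), here is a small sketch of the gap-filling behaviour described above: reindexing the weekly series so the missing dates appear as rows whose values can then be filled with predictions. The column names and values are made up.

```python
import pandas as pd

# Made-up weekly series with a three-week gap, as in the question.
df = pd.DataFrame({"date": ["2010-03-05", "2010-03-26"], "value": [10.0, 13.0]})
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# Reindex to a strict weekly frequency: 2010-03-12 and 2010-03-19 are added
# as rows with NaN values, which a forecaster can then fill with predictions.
weekly = df.asfreq("7D")
print(weekly)
```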

Related

Forecasting Value in time series data with multiple independent variables

I have a dataset with the attributes (Date, Value, Variable-1, Variable-2, Variable-3, Variable-4, Variable-5) and more than 100k rows. I want to predict the future "Value" based on the five variables, trained in a time series manner; there will be seasonal trends as well as low and high scores in "Value". Can someone suggest a statistical or machine learning/deep learning solution for this?
Here is a screenshot of the dataset; I want to forecast the Value variable.
This is a very interesting problem, and you can use the vector autoregression (VAR) method to solve it. Packages are available in both R and Python.
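Below is a minimal sketch of that approach in Python with statsmodels; the file name, the daily frequency, and the 30-step forecast horizon are assumptions, and in practice you would first difference or otherwise make the series stationary.

```python
import pandas as pd
from statsmodels.tsa.api import VAR

# Assumed file and column names matching the question's schema.
df = pd.read_csv("data.csv", parse_dates=["Date"], index_col="Date")
cols = ["Value", "Variable-1", "Variable-2", "Variable-3", "Variable-4", "Variable-5"]
data = df[cols].asfreq("D")  # assumes daily, gap-free observations

# VAR models all series jointly, so the lags of every variable help predict "Value".
model = VAR(data)
results = model.fit(maxlags=14, ic="aic")  # lag order picked by AIC, capped at 14

# Forecast 30 steps past the end of the data, starting from the last observed lags.
lag_order = results.k_ar
forecast = results.forecast(data.values[-lag_order:], steps=30)
```

The forecast array has one column per variable in the same order as cols, so its first column holds the predicted Value.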

Temporal train-test split for forecasting

I know this may be a basic question, but I want to know if I am using the train/test split correctly.
Say I have data that ends at 2019, and I want to predict values in the next 5 years.
The graph I produced is provided below:
My training data spans 1996-2014 and my test data spans 2014-2019. The test data fits the training data perfectly. I then used this test data to make predictions for 2019-2024.
Is this the correct way to do it, or should my predictions also be for 2014-2019, just like the test data?
The test/validation data is useful for evaluating which predictor to use. Once you have decided which model to use, you should train it on the whole dataset, 1996-2019, so that you do not lose possibly valuable knowledge from 2014-2019. Take into account that when working with time series, the newer part of the series usually carries more weight in your prediction than the older values.
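As a concrete illustration of that workflow, here is a small Python sketch. The ARIMA(1,1,1) specification, file name, and column names are placeholders; the point is the temporal split for model selection followed by a refit on the full history.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Placeholder yearly series covering 1996-2019.
series = pd.read_csv("history.csv", parse_dates=["year"], index_col="year")["value"]

# Temporal split: no shuffling, the test years come strictly after the training years.
train = series[:"2014"]
test = series["2015":"2019"]

# Step 1: model selection. Fit on the training years and compare the forecasts
# against the held-out 2015-2019 values.
candidate = ARIMA(train, order=(1, 1, 1)).fit()
test_forecast = candidate.forecast(steps=len(test))

# Step 2: final model. Refit the chosen specification on all of 1996-2019,
# then forecast the genuinely unknown 2020-2024 period.
final = ARIMA(series, order=(1, 1, 1)).fit()
future_forecast = final.forecast(steps=5)
```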

ARIMA nodes in KNIME: how to use them?

I'm new to KNIME and trying to use ARIMA for extrapolation of my time series data, but I've failed to make the ARIMA Predictor do its work.
Input data are of the following format
year,cv_diff
2011,-4799.099999999977
2012,60653.5
2013,64547.5
2014,60420.79999999993
And I would like to predict values, for example, for the years 2015 and 2016.
I'm using the String to Date/Time node to convert the year to a date. In the ARIMA Learner I can choose only the cv_diff field. And this is the first question: for the option 'Column containing univariate time series', should I set the year column or the variable that I'm going to predict? In my case I have only one option, the cv_diff variable. After that I connect the Learner's output to the ARIMA Predictor's input and execute. Execution fails with 'ERROR ARIMA Predictor 2:3 Execute failed: The column with the defined time series was not found. Please configure the node anew.'
Help me understand which variable I should set for the Learner and the Predictor. Should it be a non-time-series variable? And how will the ARIMA nodes then know which column to use as the time series?
You should set cv_diff as the time series variable and connect the input to the Predictor too. (And do not try to set too large values for the parameters: with so few data points, learning will not work.)
Here is an example:
Finally, I've figured it out. The option 'Column containing univariate time series' for the ARIMA Learner node seems a little confusing, especially for those unfamiliar with time series analysis. I shouldn't have provided any time field explicitly, because ARIMA treats the variable it is going to predict as if it were collected at equal time intervals, and it doesn't matter what kind of intervals they are.
I've found a good explanation of what 'univariate time series' means
The term "univariate time series" refers to a time series that
consists of single (scalar) observations recorded sequentially over equal time increments. Some examples are monthly CO2 concentrations and southern
oscillations to predict el nino effects.
Although a univariate time series data set is usually given as a single column of numbers, time is in fact an implicit variable in the time series. If the data are equi-spaced, the time variable, or index, does not need to be explicitly given. The time variable may sometimes be explicitly used for plotting the series. However, it is not used in the time series model itself.
So, I should choose the cv_diff variable for both the Learner and the Predictor and not provide any timestamps or other time-related columns.
One more thing that I didn't understand at first: I should train on one series of data and then provide another SERIES for which I want predictions. That is a little different from other machine learning workflows, where you only need to provide new data and there is no notion of a series at all.
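To make the "implicit time" point concrete outside of KNIME, here is a minimal sketch in Python with statsmodels rather than a KNIME workflow: the model only sees the ordered cv_diff numbers from the question (no year column at all), and the model order is a placeholder kept small because there are so few points.

```python
from statsmodels.tsa.arima.model import ARIMA

# The yearly cv_diff values from the question, in time order; no date column
# is passed at all, since ARIMA assumes equally spaced observations.
cv_diff = [-4799.1, 60653.5, 64547.5, 60420.8]

# With only four points, keep the model order minimal (mirrors the advice above).
fit = ARIMA(cv_diff, order=(1, 0, 0)).fit()

# The next two forecast steps correspond to 2015 and 2016.
print(fit.forecast(steps=2))
```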

What ML package (RNN model) can predict with data of fewer time steps than the training data?

(Since my original question is probably not getting an answer because it is too specific to one package, I will ask a more general one.)
In an RNN model, we have an input and an output at every step. Say a model is trained on data with 6 time steps. Of course, if I use test data with 6 time steps I will get outputs, and I have succeeded in that. But theoretically, if I only have data for the first 3 time steps, I should also get an output from the 3rd output node (without re-training a model on the first 3 time steps). However, I found that at least the "keras" package can't do this.
Are there any packages that support such prediction? Preferably in Python, and preferably with an LSTM layer.
As far as I understand your problem, you have two options, depending on your goal.
You can pad the sequences at the end with zeros so they have the correct dimension, and then use only the first n outputs, matching your test data's length (see the sketch after this list).
You can use a stateful implementation.
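Here is a rough sketch of the padding option in Keras. The layer sizes and example values are placeholders, and the model is left untrained since only the input/output shapes matter here: a Masking layer makes the zero-padded steps inert, return_sequences=True gives one output per time step, and you simply read the output at the last real step.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense, TimeDistributed
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_STEPS, N_FEATURES = 6, 1  # model trained on 6-step sequences

model = Sequential([
    Input(shape=(MAX_STEPS, N_FEATURES)),
    Masking(mask_value=0.0),          # ignore the zero-padded steps
    LSTM(32, return_sequences=True),  # one output per time step
    TimeDistributed(Dense(1)),
])
model.compile(optimizer="adam", loss="mse")

# A test sequence with only 3 observed steps, zero-padded at the end to 6 steps.
short_seq = [[0.1], [0.2], [0.3]]
x = pad_sequences([short_seq], maxlen=MAX_STEPS, dtype="float32", padding="post")

# Keep only the output at the 3rd (last real) step.
y = model.predict(x)
prediction_at_step_3 = y[0, 2, 0]
```

The Masking layer is what keeps the padded steps from influencing the LSTM state, so the prediction at step 3 depends only on the 3 observed inputs.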

Using Random Forest for time series dataset

For a time series dataset, I would like to do some analysis and create a prediction model. Usually, we would split the data (by random sampling throughout the entire dataset) into a training set and a testing set, use the training set with the randomForest function, and keep the testing part to check the behaviour of the model.
However, I have been told that it is not possible to split data by random sampling for time series data.
I would appreciate it if someone could explain how to split the data into training and testing sets for time series data, or whether there is an alternative way to do a time series random forest.
Regards
We live in a world where "future-to-past causality" only occurs in cool sci-fi movies. Thus, when modeling time series we like to avoid explaining past events with future events. We also like to verify that our models, strictly trained on past events, can explain future events.
To model a time series T with RF, rolling is used. For day t, the value T[t] is the target, and the values T[t-k] with k = {1, 2, ..., h}, where h is the past horizon, are used to form the features. For a nonstationary time series, T is converted to, e.g., the relative change Trel[t] = (T[t+1] - T[t]) / T[t].
To evaluate performance, I advise checking the out-of-bag cross-validation (OOB-CV) measure of the RF. Be aware that there are some pitfalls that can render this measure over-optimistic:
Unknown future-to-past contamination: the rolling is somehow faulty, and the model uses future events to explain that same future within the training set.
Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.
Possible other mistakes I don't know of yet.
In the end, anyone can make the above mistakes in some latent way. To check that this is not happening, you need to validate your model with backtesting, where each day is forecast by a model strictly trained on past events only.
When OOB-CV and backtesting wildly disagree, this may be a hint of a bug in the code.
To backtest, do the rolling on T[t-1] to T[t-traindays]. Train a model on this data and forecast T[t]. Then increase t by one (t++) and repeat.
To speed things up, you may train your model only once, or only at every n-th increment of t.
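As a rough Python sketch of the rolling feature construction and the backtest loop described above (the series, the horizon H = 3, and the 60-day training window are made-up placeholders, and the relative-change transform is omitted for brevity):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Placeholder series; in practice load your own data, kept in time order.
series = pd.Series(range(100), dtype=float)

H = 3            # past horizon: T[t-1] ... T[t-3] become the features
TRAIN_DAYS = 60  # length of each rolling training window

# Rolling feature construction: target T[t], features T[t-k] for k = 1..H.
frame = pd.DataFrame({"y": series})
for k in range(1, H + 1):
    frame[f"lag_{k}"] = series.shift(k)
frame = frame.dropna()

# Backtest: each day t is forecast by a model trained strictly on the
# TRAIN_DAYS days before it; then t is advanced by one and we repeat.
abs_errors = []
for t in range(TRAIN_DAYS, len(frame)):
    train = frame.iloc[t - TRAIN_DAYS:t]
    test_row = frame.iloc[t]
    rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
    rf.fit(train.drop(columns="y"), train["y"])
    # rf.oob_score_ is the out-of-bag measure discussed above; compare it
    # against the backtest error to spot leakage.
    pred = rf.predict(test_row.drop("y").to_frame().T)[0]
    abs_errors.append(abs(pred - test_row["y"]))

print("mean absolute backtest error:", sum(abs_errors) / len(abs_errors))
```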
Reading the sales file (the slice() function used below comes from the dplyr package):
library(dplyr)
Sales <- read.csv("Sales.csv")
Finding the length of the training set:
train_len <- round(nrow(Sales) * 0.8)
test_len <- nrow(Sales)
Splitting your data into training and testing sets. Here I have used an 80-20 split; you can change that. Make sure your data is sorted in ascending order by date.
Training set:
training <- slice(Sales, 1:train_len)
Testing set:
testing <- slice(Sales, (train_len + 1):test_len)
