ARIMA model producing a straight line prediction - time-series

I did some experiments with the ARIMA model on two datasets:
- Airline passengers data
- USD vs Indian rupee data
I am getting a normal zig-zag prediction on the Airline passengers data:
ARIMA order=(2,1,2)
Model Results
But on the USD vs Indian rupee data, the prediction comes out as a straight line:
ARIMA order=(2,1,2)
Model Results
SARIMAX order=(2,1,2), seasonal_order=(0,0,1,30)
Model Results
I tried different parameters, but for the USD vs Indian rupee data I always get a straight-line prediction.
One more doubt: I have read that the ARIMA model does not support time series with a seasonal component (for that we have SARIMA). Then why is the ARIMA model producing predictions with a cycle for the Airline passengers data?

Having gone through a similar issue recently, I would recommend the following:
Visualize the seasonal decomposition of the data to make sure that seasonality actually exists in it. Also make sure that the dataframe has a frequency set on its index. You can enforce a frequency on a pandas dataframe like this:
dh = df.asfreq('W')  # weekly frequency; fill the resulting NaNs with an appropriate method
Here is a sample snippet for the seasonal decomposition:
import statsmodels.api as sm
import matplotlib.pyplot as plt

# 'additive' vs 'multiplicative' is data specific
decomposition = sm.tsa.seasonal_decompose(dh['value'], model='additive',
                                          extrapolate_trend='freq')
fig = decomposition.plot()
plt.show()
The plot will show whether seasonality exists in your data. Please feel free to go through this amazing document regarding seasonal decomposition. Decomposition
If you're sure that the seasonal period of your data is 30, then you should be able to get a good result with the pmdarima package. The package is very effective at finding optimal (p, d, q) values for your model. Here is the link to it: pmdarima
example code pmdarima
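For illustration, here is a minimal auto_arima sketch rather than a definitive recipe; the column dh['value'] and the seasonal period m=30 are assumptions carried over from above:
import pmdarima as pm

# search (p, d, q)(P, D, Q, m) automatically; m=30 is the assumed seasonal period
model = pm.auto_arima(dh['value'], seasonal=True, m=30,
                      stepwise=True, suppress_warnings=True, trace=True)
print(model.summary())

# forecast the next 30 steps
forecast = model.predict(n_periods=30)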
If you're unsure about seasonality, please consult with a domain expert about the seasonal effects of your data or try experimenting with different seasonal components in your model and estimate the error.
Please make sure that the stationarity of the data is checked (e.g. with the Augmented Dickey-Fuller test) before training the model. pmdarima can suggest the d component as follows:
from pmdarima.arima import ndiffs

# estimate the differencing order d with two different stationarity tests
kpss_diff = ndiffs(dh['value'].values, alpha=0.05, test='kpss', max_d=12)
adf_diff = ndiffs(dh['value'].values, alpha=0.05, test='adf', max_d=12)
n_diffs = max(adf_diff, kpss_diff)
You may also find d with the help of the document I provided here. If the answer isn't helpful, please provide the data source for the exchange rate and I will try to explain the process flow with sample code.

Related

Time series prediction using GP - training data

I am trying to implement time series forecasting using genetic programming. I am creating random trees (Ramped Half-n-Half) with s-expressions and evaluating each expression using RMSE to calculate the fitness. My problem is with the training process. Say I want to predict gold prices and the training data looks like this:
date        open       high       low        close
28/01/2008  90.959999  91.889999  90.75      91.75
29/01/2008  91.360001  91.720001  90.809998  91.150002
30/01/2008  90.709999  92.580002  90.449997  92.059998
31/01/2008  90.919998  91.660004  90.739998  91.400002
01/02/2008  91.75      91.870003  89.220001  89.349998
04/02/2008  88.510002  89.519997  88.050003  89.099998
05/02/2008  87.900002  88.690002  87.300003  87.68
06/02/2008  89         89.650002  88.75      88.949997
07/02/2008  88.949997  89.940002  88.809998  89.849998
08/02/2008  90         91         89.989998  91
As I understand it, this data is nonlinear, so my questions are:
1- Do I need to make any changes to this data, like exponential smoothing? And why?
2- When looping over the current population and evaluating the fitness of each expression on the training data, should I calculate the RMSE on just part of this data or on all of it?
3- When the algorithm finishes and I get the expression with the best (lowest) fitness, does this mean that when I apply any row from the training data, the output should be the next day's price?
I've read some research papers about this and noticed that some of them mention dividing the training data when calculating the fitness and some apply exponential smoothing. However, I found them a bit difficult to read and understand, and most implementations I've found are in Python or R, which I am not familiar with.
I appreciate any help on this.
Thank you.
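Regarding question 3: one common framing (an assumption on my part, not necessarily what the cited papers do) is to slide a window over the close prices, so an evolved expression maps the last few closes to the next day's close, and its RMSE fitness is computed over all such windows in the training data. A rough Python sketch:
import math

def rmse_fitness(predict_fn, closes, window=5):
    # predict_fn stands in for an evolved expression: it takes the last
    # `window` closes and returns a prediction for the next day's close
    squared_errors = []
    for i in range(window, len(closes)):
        predicted = predict_fn(closes[i - window:i])
        squared_errors.append((predicted - closes[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# toy usage with a naive "repeat the last close" expression and the closes above
closes = [91.75, 91.150002, 92.059998, 91.400002, 89.349998,
          89.099998, 87.68, 88.949997, 89.849998, 91.0]
print(rmse_fitness(lambda w: w[-1], closes, window=3))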

Trend and Seasonality from time series

How can we extract the trend and seasonality from a time series the way SARIMAX does internally?
I need this to understand how much importance (feature importance) the trend, seasonality, AR component, MA component, and exogenous variables have for the forecast.
You can do it this way:
from statsmodels.tsa.seasonal import seasonal_decompose

# decomposition
# df is the dataframe and y is the name of the column whose trend and
# seasonality you want to see; the model can be 'additive' or 'multiplicative'
decomposition = seasonal_decompose(x=df.y, model='multiplicative')
decomposition.plot()
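If you also need the components themselves (for example, to gauge their importance) rather than just the plot, the result object above exposes them as attributes; a small sketch reusing the same df.y column:
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(x=df.y, model='multiplicative')

# each attribute is a pandas Series aligned with the original index
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid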

What does the lightgbm Python Dataset reference parameter mean?

I am trying to figure out how to train a GBDT classifier with lightgbm in Python, but I am getting confused by the example provided on the official website.
Following the steps listed, I find that validation_data comes from nowhere, and there is no clue about the format of valid_data nor about the merit of training the model with or without it.
Another question that comes with it: the documentation says that "the validation data should be aligned with training data", while the Dataset details contain another statement saying that "If this is Dataset for validation, training data should be used as reference".
My final questions are: why should validation data be aligned with training data? What is the meaning of reference in Dataset and how is it used during training? Is the alignment goal accomplished by setting reference to the training data? What is the difference between this "reference" strategy and cross-validation?
Hope someone can help me out of this maze, thanks!
The idea of "validation data should be aligned with training data" is simple:
every preprocessing step you apply to the training data should be applied in the same way to the validation data, and in production of course. This applies to every ML algorithm.
For example, for a neural network you will often normalize your training inputs (subtract the mean and divide by the std).
Suppose you have a variable "age" with a mean of 26 in training. It will be mapped to 0 for the training of your neural network. For the validation data, you want to normalize in the same way as the training data (using the mean and std of the training data), so that 26 in validation is still mapped to 0 (same value -> same prediction).
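A toy numeric illustration of that alignment (the numbers are made up):
import numpy as np

train_age = np.array([20.0, 26.0, 32.0])   # training mean is 26
valid_age = np.array([26.0, 40.0])

mu, sigma = train_age.mean(), train_age.std()

train_norm = (train_age - mu) / sigma      # 26 -> 0.0 in training
valid_norm = (valid_age - mu) / sigma      # 26 -> 0.0 in validation as well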
This is the same for LightGBM. The data will be "bucketed" (in short, every continuous value is discretized into bins) and you want to map the continuous values to the same bins in training and in validation. Those bins are calculated using the "reference" dataset.
Regarding training without validation, this is something you don't want to do most of the time! It is very easy to overfit the training data with boosted trees if you don't have a validation set to tune parameters such as num_boost_round.
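With the native lgb.train API (and a reasonably recent LightGBM version), the reference argument is typically used like the sketch below; X_train, y_train, X_valid, y_valid are placeholders, not variables from your code:
import lightgbm as lgb

# X_train, y_train, X_valid, y_valid are placeholder arrays/frames
train_set = lgb.Dataset(X_train, label=y_train)
# the validation data reuses the bin boundaries computed on the training Dataset
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

params = {'objective': 'binary', 'metric': 'auc'}
booster = lgb.train(params, train_set,
                    num_boost_round=500,
                    valid_sets=[valid_set],
                    callbacks=[lgb.early_stopping(stopping_rounds=50)])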
Still, everything is tricky.
Can you share a full example with and without using this "reference=" argument?
For example, will the following two snippets behave differently?
import lightgbm as lgbm
importance_type_LGB = 'gain'
d_train = lgbm.Dataset(train_data_with_NANs, label= target_train)
d_valid = lgbm.Dataset(train_data_with_NANs, reference= target_train)
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)
versus:
import lightgbm as lgbm
importance_type_LGB = 'gain'
lgb_clf = lgbm.LGBMClassifier(class_weight = 'balanced' ,importance_type = importance_type_LGB)
lgb_clf.fit(test_data_with_NANs,target_train)
test_data_predict_proba_lgb = lgb_clf.predict_proba(test_data_with_NANs)

Reproduce ARIMA Forecast (Pandas)

I am quite new to the ARIMA model, and I have a question on how to analyze the ACF (autocorrelation function) chart against the lag. Is it correct to take the ACF value of 0.5, which corresponds to a lag of about 450, and then set the ARIMA model based on these values?
This is my graph:
and this is my simple code for the ARIMA model:
from statsmodels.tsa.arima_model import ARIMA
from pandas import DataFrame
from matplotlib import pyplot
# fit model
model = ARIMA(df['valore'], order=(400,1,0))
model_fit = model.fit(disp=0)
print(model_fit.summary())
# plot residual errors
residuals = DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()
residuals.plot(kind='kde')
pyplot.show()
print(residuals.describe())
Thanks!
P.S. my page in jupyter format and the data (csv) can be found at: github
In theory it is possible to include an order of 400 in an ARIMA model. In practice that value is astronomically high for an ARIMA model (Anything higher than 3 or 4 is considered unusual in an ARIMA model). I would double check your data and also double check how you are calculating the ACF.
Additionally the p order of the ARIMA(p,d,q) model is usually determined using the PACF, not the ACF. You use the ACF for determining q.
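To apply that in practice, you can look at both correlograms with statsmodels; a short sketch, assuming the same df['valore'] series as above:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

# difference once (matching d=1) before reading off the correlograms
diffed = df['valore'].diff().dropna()

plot_pacf(diffed, lags=40)   # significant early spikes suggest the AR order p
plot_acf(diffed, lags=40)    # significant early spikes suggest the MA order q
plt.show()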

Classification with numerical label?

I know of a couple of classification algorithms, such as decision trees, but I can't apply any of them to the problem I have at hand.
I have a dataset in which each row contains information about a purchase. Its columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to make a prediction that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner and experimenting with some of its features. Nothing I've tried there allows me to have a real number (the amount spent) as the label. Maybe I'm doing something wrong?
You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algo where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it could probably get you started
# Load libraries
import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

def fit_predict_model(data_import):
    """Find and tune the optimal model. Make a prediction on housing data."""
    # Get the features and labels from your data
    X, y = data_import.data, data_import.target
    # Set up a Decision Tree Regressor
    regressor = DecisionTreeRegressor()
    parameters = {'max_depth': (4, 5, 6, 7), 'random_state': [1]}
    scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
    ## fit your data to it ##
    reg = GridSearchCV(estimator=regressor, param_grid=parameters,
                       scoring=scoring_function, cv=10, refit=True)
    fitted_data = reg.fit(X, y)
    print("Best Parameters: ")
    print(fitted_data.best_params_)
    # Use the model to predict the output of a particular sample
    x = [## input a test sample in this list ##]
    y = reg.predict([x])  # predict expects a 2D array, hence the extra brackets
    print("Prediction: " + str(y))

fit_predict_model(##your data in here)
I took this almost directly from a project I was working on to predict housing prices, so there are probably some unnecessary libraries, and without doing validation you have no clue how accurate this model would be, but it should get you started.
Check out this link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
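As a follow-up on that validation caveat, a rough hold-out check might look like this sketch; X and y are the same placeholder feature/target arrays as in the function above:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# X and y are placeholders for your feature matrix and spend amounts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

regressor = DecisionTreeRegressor(max_depth=5, random_state=1)
regressor.fit(X_train, y_train)
print("Hold-out MAE:", mean_absolute_error(y_test, regressor.predict(X_test)))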
Yes, as the comments have pointed out, it's regression that you need. Linear regression does sound like a good starting point as you don't have a huge number of variables.
In RapidMiner, type "regression" into the Operators menu and you'll see several options under Modelling -> Functions: Linear Regression, Polynomial, Vector, etc. (There are more, but as a beginner let's start here.)
Right-click any of these operators and press Show Operator Info and you'll see that numerical labels are allowed.
Next, scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, and it's good to get you started with an example.
Let me know if you need any help.
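For completeness, outside RapidMiner the same linear regression starting point could be sketched in scikit-learn roughly as follows; the column names and values here are invented for illustration:
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# invented purchase data: who, where, when (hour of day) and the amount spent
df = pd.DataFrame({
    'customer_id': ['c1', 'c2', 'c1', 'c3'],
    'store_id': ['s1', 's1', 's2', 's2'],
    'hour': [9, 13, 18, 11],
    'amount': [12.5, 40.0, 7.3, 22.1],
})

# one-hot encode the id columns, pass the numeric hour through unchanged
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['customer_id', 'store_id']),
    remainder='passthrough',
)
model = make_pipeline(preprocess, LinearRegression())
model.fit(df[['customer_id', 'store_id', 'hour']], df['amount'])
print(model.predict(df[['customer_id', 'store_id', 'hour']].head(1)))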
