I am forecasting new COVID cases and want to see the trend and seasonality present in the dataset.
I tried to inspect the trend and seasonality using seasonal_decompose:
from statsmodels.tsa.seasonal import seasonal_decompose
#decomposition
decomposition = seasonal_decompose(x = df.new_cases,
model = 'multiplicative')
decomposition.plot()
and I got this:
I am not able to understand the seasonal graph. What is it trying to show? Does it mean my dataset doesn't have any seasonality?
And what does the Resid graph indicate?
Related
I'm trying to predict weekly sales of a store from the well-known Walmart dataset, as a simple time series prediction exercise. I fit the model and run the predict command, but the predictions are all NaNs.
The data is stationary.
I took a rolling mean over 10 weeks, detrended the sales numbers using this rolling mean, and differenced the data.
Auto ARIMA selects an ARIMA(1,0,2) model.
I fit the model and predict, but the predictions are all NaNs. Also, the predicted NaNs start at an index from almost 10 years ago! The same happens with a SARIMAX model.
Please help me solve this issue!
My code is below:
train = rmdetdiff.iloc[:110]['weekly_sales']
test = rmdetdiff.iloc[110:]['weekly_sales']
model1 = ARIMA(train, order=(1,0,2))
model1_fit = model1.fit()
start = len(train)
end = len(train)+len(test)-1
rmdetdiff['arima_pred'] = model1_fit.predict(start=start, end=end, dynamic=True)
rmdetdiff[['arima_pred','weekly_sales']].plot(legend=True)
Here is the plot after the predictions (all the blank space before weekly_sales appears after the model1.fit command):
The shape of rmdetdiff is (133, 1).
I am working on a dataset with the following ACF and PACF plots. Having performed the Augmented Dickey-Fuller test, I find that the series is stationary, since the p-value is extremely small and the test statistic is smaller than each critical value.
import pandas as pd
from statsmodels.tsa.stattools import adfuller
test = adfuller(df['Debit'], autolag="AIC")
out = pd.Series(test[0:4], index = ['Test Statistic','p-val',"#Lags Used","Number of Observations Used"])
for key,value in test[4].items():
    out[f'Critical Value {key}']=value
out
Test Statistic -1.846322e+01
p-val 2.145214e-30
#Lags Used 7.200000e+01
Number of Observations Used 1.269350e+05
Critical Value 1% -3.430402e+00
Critical Value 5% -2.861563e+00
Critical Value 10% -2.566782e+00
dtype: float64
But the ADF results do not match what I see in the ACF and PACF plots, which exhibit anomalies not seen in any of the time series I've encountered in tutorials:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fg,ax = plt.subplots(2,2)
plot_acf(df_d['Debit'], lags=40,ax=ax[0,0], title="Autocorrelation")
plot_acf(df_d['Debit'].diff().dropna(), lags=40,ax=ax[0,1], title="First Difference Autocorrelation")
plot_pacf(df_d['Debit'], lags = 40, ax=ax[1,0], title="Partial Autocorrelation")
plot_pacf(df_d['Debit'].diff().dropna(), lags=40,ax=ax[1,1], title="First Difference Partial Autocorrelation")
Looking at the charts, I'm unable to determine the ARIMA(p,d,q) parameters because of the statistically significant bumps at lags in the 30s. I also tried the auto_arima function, but to no avail. How can I determine the parameters of the model?
How can we extract the trend and seasonality from a time series the way SARIMAX does internally?
I need this to understand how important (in the sense of feature importance) the trend, seasonality, AR component, MA component and exogenous variables are to the forecast.
You can do it this way:
from statsmodels.tsa.seasonal import seasonal_decompose
#decomposition
decomposition = seasonal_decompose(x = df.y, model = 'multiplicative')
decomposition.plot()
# df is the dataframe and y is the name of the column whose trend and seasonality you want to see.
# model can be 'additive' or 'multiplicative'.
I did some experiments with the ARIMA model on 2 datasets:
Airline passengers data
USD vs Indian rupee data
I am getting a normal zig-zag prediction on Airline passengers data
ARIMA order=(2,1,2)
Model Results
But on USD vs Indian rupee data, I am getting prediction as a straight line
ARIMA order=(2,1,2)
Model Results
SARIMAX order=(2,1,2), seasonal_order=(0,0,1,30)
Model Results
I tried different parameters but for USD vs Indian rupee data I am always getting a straight line prediction.
One more question: I have read that the ARIMA model does not support time series with a seasonal component (for that we have SARIMA). Then why does the ARIMA model produce predictions with a cycle for the Airline passengers data?
Having gone through a similar issue recently, I would recommend the following:
Visualize the seasonal decomposition of the data to make sure that seasonality exists in your data. Make sure that the dataframe's index has a frequency. You can enforce a frequency on a pandas dataframe with the following:
dh = df.asfreq('W')  # weekly frequency; fill the resulting NaNs with an appropriate method
Here is sample code for the seasonal decomposition:
import matplotlib.pyplot as plt
import statsmodels.api as sm
# whether additive or multiplicative is appropriate depends on the data
decomposition = sm.tsa.seasonal_decompose(dh['value'], model='additive',
                                          extrapolate_trend='freq')
fig = decomposition.plot()
plt.show()
The plot will show whether seasonality exists in your data. Please feel free to go through this amazing document regarding seasonal decomposition. Decomposition
If you're sure that the seasonal period of your data is 30, you should be able to get a good result with the pmdarima package, which is very effective at finding suitable p, d, q values for your model. Here is the link to it: pmdarima
example code pmdarima
If you're unsure about seasonality, please consult with a domain expert about the seasonal effects of your data or try experimenting with different seasonal components in your model and estimate the error.
Check the stationarity of the data with the Dickey-Fuller test before training the model. pmdarima can estimate the d component as follows:
from pmdarima.arima import ndiffs
kpss_diff = ndiffs(dh['value'].values, alpha=0.05, test='kpss', max_d=12)
adf_diff = ndiffs(dh['value'].values, alpha=0.05, test='adf', max_d=12)
n_diffs = max(adf_diff, kpss_diff)
You may also find d with the help of the document I provided here. If the answer isn't helpful, please provide the data source for exchange rate. I will try to explain the process flow with a sample code.
I am quite new to the ARIMA model, and I have a question about how to read the ACF (autocorrelation function) chart as a function of the lag. Is it correct to take the ACF value of 0.5, which corresponds to a lag of about 450, and set the ARIMA model order based on it?
This is my graph:
and this is my simple code for arima model:
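from pandas import DataFrame
from matplotlib import pyplot
# Legacy API: statsmodels.tsa.arima_model was removed in statsmodels 0.13;
# newer releases provide statsmodels.tsa.arima.model.ARIMA, whose fit() has no disp argument.
from statsmodels.tsa.arima_model import ARIMA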
# fit model
model = ARIMA(df['valore'], order=(400,1,0))
model_fit = model.fit(disp=0)
print(model_fit.summary())
# plot residual errors
residuals = DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()
residuals.plot(kind='kde')
pyplot.show()
print(residuals.describe())
Thanks!
P.S. My Jupyter notebook and the data (CSV) can be found at: github
In theory, it is possible to use an order of 400 in an ARIMA model. In practice, that value is far too high (anything higher than 3 or 4 is considered unusual in an ARIMA model). I would double-check your data and also double-check how you are calculating the ACF.
Additionally, the p order of the ARIMA(p,d,q) model is usually determined using the PACF, not the ACF. The ACF is used to determine q.