I have a dataset of rainfall time series from 2010 to 2019 for various districts near Vellore. When I ran the ADF (Augmented Dickey-Fuller) test, the result said my dataset is stationary, which I took to mean that it has no seasonality.
Am I doing something wrong? Rainfall is normally concentrated in particular months (the rainy season), so shouldn't there be seasonality in my dataset?
ADF Result
Results of Dickey-Fuller Test:
Test Statistic -1.770941e+01
p-value 3.507811e-30
#Lags Used 7.000000e+00
Number of Observations Used 3.644000e+03
Critical Value (1%) -3.432146e+00
Critical Value (5%) -2.862333e+00
Critical Value (10%) -2.567192e+00
According to this result, my test statistic of -17.7 is far below the critical values (for example -2.57 at the 10% level), which means my data is already stationary.
The dataset contains daily data, so there are a lot of zeros too. Does this affect the seasonality?
Thank you!
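Check the same series with the KPSS test, using the trend option:
from statsmodels.tsa.stattools import kpss
kpss(df, regression='ct')
The parameter regression='ct' makes the null hypothesis stationarity around a deterministic trend (trend-stationarity). Keep in mind that neither ADF nor KPSS tests for seasonality directly; both are about unit roots and trends.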
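To see why a stationarity verdict does not rule out seasonality, here is a minimal sketch on synthetic data (not the asker's rainfall series). It builds ten years of made-up daily "rainfall" with a yearly cycle and many zeros, then measures the sample autocorrelation at a one-year lag and at a half-year lag. The generator `make_rain` and the helper `autocorr` are hypothetical names introduced for illustration; only the standard library is used.

```python
import math
import random


def make_rain(n_days=3650, period=365, seed=0):
    """Synthetic daily rainfall: a yearly cycle plus noise, clipped at zero."""
    rng = random.Random(seed)
    series = []
    for t in range(n_days):
        seasonal = 5.0 * math.sin(2 * math.pi * t / period)
        series.append(max(0.0, seasonal + rng.gauss(0.0, 2.0)))
    return series


def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var


rain = make_rain()
zero_share = sum(1 for v in rain if v == 0.0) / len(rain)
acf_year = autocorr(rain, 365)       # same season, one year apart
acf_half_year = autocorr(rain, 182)  # opposite seasons

print(f"share of zero days: {zero_share:.2f}")
print(f"autocorrelation at lag 365: {acf_year:.2f}")
print(f"autocorrelation at lag 182: {acf_half_year:.2f}")
```

A clearly positive value at the yearly lag together with a negative one at the half-year lag points to a seasonal pattern, even though a series like this has no unit root and would typically be reported as stationary by ADF. On real data the same check could be applied to monthly totals with a lag of 12.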
I am trying to create a model that predicts whether it will rain in the next 5 days (multi-step) or not, so I don't need the precipitation value, just a "yes" or "no". I've been testing different tools and algorithms, and I guess the big challenge here is dealing with the zero-skewed data.
The dataset consists of hourly data with columns such as precipitation, temperature, pressure, wind speed and humidity. It has around 1 million rows. There is no requirement to use a multivariate approach.
Rain occurs mostly in months 1, 2, 3, 11 and 12.
So I tried a univariate LSTM on the data, and I got the best results with hourly sampling. I used the following architecture:
model=Sequential()
model.add(LSTM(150,return_sequences=True,input_shape=(1,look_back)))
model.add(LSTM(50,return_sequences=True))
model.add(LSTM(50))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(trainX, trainY, epochs=15, batch_size=4096, validation_data=(testX, testY), shuffle=False)
I'm using a lookback value of 24*60 hours, which is 60 days, or roughly 2 months.
Train/Validation Loss:
https://i.stack.imgur.com/CjDbR.png
Final result:
https://i.stack.imgur.com/p6SnD.png
I read that a train/validation loss curve like this means the model is underfitting. Is that right? What could I do to fix it?
Before using the LSTM I tried Prophet, which gave really bad results, and I also tried autoarima, but it couldn't handle a yearly seasonality (365 days).
If the model is underfitting, you can try increasing the learning rate, training for longer, and using more training data.
It is also worth tracking an external metric such as the F1 score, because the raw loss value is hard for a human to interpret.
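As a sketch of what such an external metric looks like, the F1 score can be computed directly from binary predictions and labels. The helper `f1_score_binary` below is a hypothetical name written for illustration in plain Python; libraries such as scikit-learn provide an equivalent `f1_score` function. The labels are made up.

```python
def f1_score_binary(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = rain, 0 = no rain.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
always_dry = [0] * 10
some_model = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

accuracy_dry = sum(t == p for t, p in zip(y_true, always_dry)) / len(y_true)
print(accuracy_dry)                          # 0.8
print(f1_score_binary(y_true, always_dry))   # 0.0
print(f1_score_binary(y_true, some_model))   # 0.5
```

With mostly dry hours, a model that never predicts rain still reaches high accuracy, which is why F1 on the rain class is the more informative number here.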
Just looking at your example, I would start by experimenting with the loss function. Your target seems to be binary, so it would be wiser to use a binary loss (for example binary cross-entropy) instead of a regression loss such as mean squared error.
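To make that suggestion concrete, here is a minimal sketch, assuming the target is defined as "any precipitation in the next 5 days". It turns a precipitation series into yes/no labels and compares binary cross-entropy with squared error on one confident wrong prediction. The helper names `rain_ahead_labels` and `binary_cross_entropy` and the toy numbers are assumptions made for illustration, not the asker's data or code.

```python
import math


def rain_ahead_labels(precip, horizon):
    """Label t is 1 if any of the next `horizon` steps after t has rain."""
    labels = []
    for t in range(len(precip) - horizon):
        window = precip[t + 1 : t + 1 + horizon]
        labels.append(1 if any(v > 0 for v in window) else 0)
    return labels


def binary_cross_entropy(y, p, eps=1e-7):
    """Binary cross-entropy for one label y in {0, 1} and probability p."""
    p = min(max(p, eps), 1 - eps)  # keep log() finite
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# Toy series; for hourly data and 5 days the horizon would be 5 * 24.
precip = [0.0, 0.0, 0.0, 1.2, 0.0, 0.0, 0.0, 0.0]
labels = rain_ahead_labels(precip, horizon=2)
print(labels)  # [0, 1, 1, 0, 0, 0]

# A confident miss (true label 1, predicted probability 0.01):
print(round(binary_cross_entropy(1, 0.01), 3))  # 4.605
print(round((1 - 0.01) ** 2, 3))                # 0.98
```

Cross-entropy penalizes the confident miss far more than squared error does. In Keras this would correspond to a final Dense(1, activation='sigmoid') layer with loss='binary_crossentropy'; that change is a suggestion, not something tested on the asker's data.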
I am working on a dataset with the following ACF and PACF plots. According to the Augmented Dickey-Fuller test, the series is stationary: the p-value is extremely small and the test statistic is smaller than each critical value.
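import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Run the ADF test and collect the headline numbers in a Series
test = adfuller(df['Debit'], autolag="AIC")
out = pd.Series(test[0:4], index=['Test Statistic', 'p-val', "#Lags Used", "Number of Observations Used"])
# test[4] maps significance levels ('1%', '5%', '10%') to critical values
for key, value in test[4].items():
    out[f'Critical Value {key}'] = value
out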
Test Statistic -1.846322e+01
p-val 2.145214e-30
#Lags Used 7.200000e+01
Number of Observations Used 1.269350e+05
Critical Value 1% -3.430402e+00
Critical Value 5% -2.861563e+00
Critical Value 10% -2.566782e+00
dtype: float64
But the results of the ADF test do not match what I see in the ACF and PACF plots, which exhibit anomalies I have not seen in any of the time series I've encountered in tutorials.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Top row: ACF of the series and of its first difference; bottom row: PACF
fg, ax = plt.subplots(2, 2)
plot_acf(df_d['Debit'], lags=40, ax=ax[0, 0], title="Autocorrelation")
plot_acf(df_d['Debit'].diff().dropna(), lags=40, ax=ax[0, 1], title="First Difference Autocorrelation")
plot_pacf(df_d['Debit'], lags=40, ax=ax[1, 0], title="Partial Autocorrelation")
plot_pacf(df_d['Debit'].diff().dropna(), lags=40, ax=ax[1, 1], title="First Difference Partial Autocorrelation")
Looking at the charts, I'm unable to determine the ARIMA(p,d,q) parameters because of the statistically significant bumps around lag 30. I also tried the auto_arima function, but to no avail. How can I determine the parameters of the model?
I am trying to implement time series forecasting using genetic programming. I am creating random trees (Ramped Half-n-Half) with s-expressions and evaluating each expression using RMSE to calculate the fitness. My problem is the training process. If I want to predict gold prices and the training data looked like this:
date open high low close
28/01/2008 90.959999 91.889999 90.75 91.75
29/01/2008 91.360001 91.720001 90.809998 91.150002
30/01/2008 90.709999 92.580002 90.449997 92.059998
31/01/2008 90.919998 91.660004 90.739998 91.400002
01/02/2008 91.75 91.870003 89.220001 89.349998
04/02/2008 88.510002 89.519997 88.050003 89.099998
05/02/2008 87.900002 88.690002 87.300003 87.68
06/02/2008 89 89.650002 88.75 88.949997
07/02/2008 88.949997 89.940002 88.809998 89.849998
08/02/2008 90 91 89.989998 91
As I understand, this data is nonlinear so my questions are:
1- Do I need to make any changes to this data like exponential smoothing? and why?
2- When looping the current population and evaluating the fitness of each expression on the training data, should I calculate the RMSE on just part of this data or all of it?
3- When the algorithm finishes and I get an expression with the best (lowest) fitness, does this mean that when I apply any row from the training data, the output should be the price of the next day?
I've read some research papers about this and I noticed some of them mentioning dividing the training data when calculating the fitness and some of them are doing exponential smoothing. However, I found them a bit difficult to read and understand, and most implementations I've found are either in python or R which I am not familiar with.
I appreciate any help on this.
Thank you.
I have a discrete time series covering 49 quarters between January 2007 and March 2019, which I am trying to analyse. Before undertaking various forms of analysis I wanted to check for seasonality, and I have tried two methods for this in R. First I used the wo function (Webel and Ollech) from the seastests package, which told me that the data do not display seasonality.
library(seastests)
summary(wo(tt))
> summary(wo(tt))
Test used: WO
Test statistic: 0
P-value: 0.8174965 0.5785041 0.2495668
The WO - test does not identify seasonality
However, I wanted to double-check this and used the decompose function, whose output (below) appears to suggest a seasonal component. Can anyone advise whether:
I am reading the decomposed data correctly?
AND
Why there is such disagreement between decompose and the seastest results?
The decompose function is a simple function that basically estimates the (moving) period average. The volatility of your time series increases strongly in the last few years, so the averages may pick up on some random increases. Also, the seasonal component that you obtain from decompose() will basically always look seasonal, even for pure noise:
set.seed(1234)
x <- ts(rnorm(80), frequency=4)
seastests::wo(x)
plot(decompose(x))
Therefore, seasonality tests are preferable for assessing whether a time series really is seasonal.
Still, if you have information that the data generating process has changed, you may want to use the test on the last few years of observations.
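The same point can be reproduced without R. The sketch below, in plain Python, mimics only the last step of a classical additive decomposition, the per-quarter average (it skips the moving-average trend removal that decompose() performs first), and applies it to pure white noise. The helper `seasonal_means` is a hypothetical name introduced for illustration.

```python
import random


def seasonal_means(x, period):
    """Average of each position in the cycle, centered to sum to zero."""
    means = []
    for k in range(period):
        values = x[k::period]
        means.append(sum(values) / len(values))
    overall = sum(means) / period
    return [m - overall for m in means]


rng = random.Random(1234)
noise = [rng.gauss(0.0, 1.0) for _ in range(80)]  # 20 "years" of quarters

figure = seasonal_means(noise, period=4)
print([round(v, 2) for v in figure])

# Repeating the 4 values over the whole series gives a perfectly regular
# "seasonal" panel, although the data contain no seasonality at all.
seasonal_component = [figure[t % 4] for t in range(len(noise))]
```

Each quarterly mean of 20 noise values has a standard error of about 0.22, so the estimated pattern is rarely flat. A formal test is what tells you whether the pattern is larger than chance alone would produce.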
I ran some experiments with the ARIMA model on two datasets:
Airline passengers data
USD vs Indian rupee data
I am getting a normal zig-zag prediction on Airline passengers data
ARIMA order=(2,1,2)
Model Results
But on USD vs Indian rupee data, I am getting prediction as a straight line
ARIMA order=(2,1,2)
Model Results
SARIMAX order=(2,1,2), seasonal_order=(0,0,1,30)
Model Results
I tried different parameters but for USD vs Indian rupee data I am always getting a straight line prediction.
One more question: I have read that the ARIMA model does not support time series with a seasonal component (for that we have SARIMA). Why, then, does the ARIMA model produce predictions with a cycle for the airline passengers data?
Having gone through a similar issue recently, I would recommend the following:
Visualize a seasonal decomposition of the data to make sure that seasonality actually exists. Make sure the dataframe's index carries a frequency. You can enforce a frequency on a pandas dataframe as follows:
dh = df.asfreq('W')  # weekly frequency; fill any resulting NaNs with an appropriate method
Here is sample code for a seasonal decomposition:
import matplotlib.pyplot as plt
import statsmodels.api as sm

# additive or multiplicative is data specific
decomposition = sm.tsa.seasonal_decompose(dh['value'], model='additive',
                                          extrapolate_trend='freq')
fig = decomposition.plot()
plt.show()
The plot will show whether seasonality exists in your data. For more background on seasonal decomposition, see this document: Decomposition
If you're sure that the seasonal period is 30, you should be able to get a good result with the pmdarima package, which is very effective at finding suitable (p, d, q) values for your model. Here is the link to it: pmdarima
example code pmdarima
If you're unsure about seasonality, please consult with a domain expert about the seasonal effects of your data or try experimenting with different seasonal components in your model and estimate the error.
Make sure you check the stationarity of the data (for example with the Dickey-Fuller test) before training the model. pmdarima can estimate the differencing order d as follows:
from pmdarima.arima import ndiffs
kpss_diff = ndiffs(dh['value'].values, alpha=0.05, test='kpss', max_d=12)
adf_diff = ndiffs(dh['value'].values, alpha=0.05, test='adf', max_d=12)
n_diffs = max(adf_diff , kpss_diff )
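Once n_diffs is known, the series is differenced that many times before (or inside) the model. As a minimal sketch of what differencing does, in plain Python with a hypothetical helper `difference` and a made-up series:

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Made-up series with a quadratic trend.
y = [t * t for t in range(6)]   # [0, 1, 4, 9, 16, 25]
print(difference(y, d=1))        # [1, 3, 5, 7, 9]
print(difference(y, d=2))        # [2, 2, 2, 2]
```

In practice you would usually pass d=n_diffs in the ARIMA order rather than differencing by hand, since the model then undoes the differencing when it produces forecasts.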
You may also find d with the help of the document I linked above. If this answer isn't helpful, please provide the data source for the exchange rate, and I will try to explain the process with sample code.