I am working on a dataset with the following ACF and PACF plots. According to the Augmented Dickey-Fuller test, the series is stationary: the p-value is extremely small and the test statistic is smaller than each critical value.
from statsmodels.tsa.stattools import adfuller
import pandas as pd

# Run the ADF test with the lag order chosen by AIC
test = adfuller(df['Debit'], autolag="AIC")
out = pd.Series(test[0:4], index=['Test Statistic', 'p-val', '#Lags Used', 'Number of Observations Used'])
for key, value in test[4].items():
    out[f'Critical Value {key}'] = value
out
Test Statistic -1.846322e+01
p-val 2.145214e-30
#Lags Used 7.200000e+01
Number of Observations Used 1.269350e+05
Critical Value 1% -3.430402e+00
Critical Value 5% -2.861563e+00
Critical Value 10% -2.566782e+00
dtype: float64
But the results of the ADF test do not match my expectations from the ACF and PACF plots, which exhibit anomalies I have not seen in any of the time series covered in tutorials.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, ax = plt.subplots(2, 2)
plot_acf(df_d['Debit'], lags=40, ax=ax[0, 0], title="Autocorrelation")
plot_acf(df_d['Debit'].diff().dropna(), lags=40, ax=ax[0, 1], title="First Difference Autocorrelation")
plot_pacf(df_d['Debit'], lags=40, ax=ax[1, 0], title="Partial Autocorrelation")
plot_pacf(df_d['Debit'].diff().dropna(), lags=40, ax=ax[1, 1], title="First Difference Partial Autocorrelation")
Looking at the charts, I am unable to determine the ARIMA(p, d, q) parameters because of the statistically significant spikes around lag 30. I also tried the auto_arima function, but to no avail. How can I determine the parameters of the model?
I have a dataset (a time series of rainfall data from 2010 to 2019 for various districts near Vellore). When I ran the ADF (Augmented Dickey-Fuller) test, it reported my dataset to be stationary, which I took to mean there is no seasonality.
My question is: am I doing something wrong? Normally rainfall is concentrated in particular months (the rainy season, of course), so shouldn't there be seasonality in my dataset?
ADF Result
Results of Dickey-Fuller Test:
Test Statistic -1.770941e+01
p-value 3.507811e-30
#Lags Used 7.000000e+00
Number of Observations Used 3.644000e+03
Critical Value (1%) -3.432146e+00
Critical Value (5%) -2.862333e+00
Critical Value (10%) -2.567192e+00
According to this result, my test statistic of -17.7 is much smaller than the critical values, e.g. -2.56 at the 10% level. Hence my data is already stationary.
The dataset contains daily data, so there are also a lot of zeros. Does this affect the seasonality?
Thank you!
Check the same thing with the KPSS test, including a deterministic trend component:
kpss(df, regression='ct')
The parameter regression='ct' makes the KPSS test check stationarity around a constant plus a linear trend, rather than around a constant only.
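For reference, here is a minimal sketch with statsmodels (the column name 'Rainfall' is only a placeholder for whatever your series is called):

from statsmodels.tsa.stattools import kpss

# KPSS null hypothesis with regression='ct': the series is stationary around a deterministic trend.
stat, p_value, n_lags, crit = kpss(df['Rainfall'], regression='ct')
print(f"KPSS statistic: {stat:.3f}, p-value: {p_value:.3f}")
print("Critical values:", crit)

Keep in mind that ADF and KPSS have opposite null hypotheses (ADF: unit root; KPSS: stationarity), so the two tests complement rather than replace each other.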
I'm using the ScikitLearn flavour of the DecisionTree.jl package to create a random forest model for a binary classification problem of one of the RDatasets data sets (see bottom of the DecisionTree.jl home page for what I mean by ScikitLearn flavour). I'm also using the MLBase package for model evaluation.
I have built a random forest model of my data and would like to create a ROC Curve for this model. Reading the documentation available, I do understand what a ROC curve is in theory. I just can't figure out how to create one for a specific model.
From the Wikipedia page, the last part of the first sentence, about the discrimination threshold, is the one causing my confusion: "In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied." There is more on the threshold value throughout the article, but this still confuses me for binary classification problems. What is the threshold value and how do I vary it?
Also, the MLBase documentation on ROC curves says "Compute an ROC instance or an ROC curve (a vector of ROC instances), based on given scores and a threshold thres", but it doesn't really explain this threshold anywhere else.
Example code for my project is given below. Basically, I want to create a ROC curve for the random forest but I'm not sure how to or if it's even appropriate.
using DecisionTree
using RDatasets
using MLBase
quakes_data = dataset("datasets", "quakes");
# Add in a binary column as feature column for classification
quakes_data[:MagGT5] = convert(Array{Int32,1}, quakes_data[:Mag] .> 5.0)
# Getting features and labels where label = 1 is mag > 5 and label = 2 is mag <= 5
features = convert(Array, quakes_data[:, [1:3;5]]);
labels = convert(Array, quakes_data[:, 6]);
labels[labels .== 0] .= 2
# Create a random forest model with the tuning parameters I want
r_f_model = RandomForestClassifier(nsubfeatures = 3, ntrees = 50, partialsampling=0.7, maxdepth = 4)
# Train the model in-place on the dataset (there isn't a fit function without the in-place functionality)
DecisionTree.fit!(r_f_model, features, labels)
# Apply the trained model to the test features data set (here I haven't partitioned into training and test)
r_f_prediction = convert(Array{Int64,1}, DecisionTree.predict(r_f_model, features))
# Applying the model to the training set and looking at model stats
TrainingROC = roc(labels, r_f_prediction) #getting the stats around the model applied to the train set
# p::T # positive in ground-truth
# n::T # negative in ground-truth
# tp::T # correct positive prediction
# tn::T # correct negative prediction
# fp::T # (incorrect) positive prediction when ground-truth is negative
# fn::T # (incorrect) negative prediction when ground-truth is positive
I also read this question but didn't really find it helpful.
The task in binary classification is to give a 0/1 (or true/false, red/blue) label to a new, unlabeled data point. Most classification algorithms are designed to output a continuous real value. This value is optimized to be higher for points with known or predicted label 1, and lower for points with known or predicted label 0. To turn this value into a 0/1 prediction, an additional threshold is used: points with a value higher than the threshold are predicted to be labeled 1, and points with a lower value are predicted to be labeled 0.
Why is this setup useful? Because sometimes mispredicting a 0 instead of a 1 is more costly than the reverse, and then you can set the threshold low, making the algorithm predict 1 more often.
In the extreme case where predicting 0 instead of a 1 costs nothing for the application, you can set the threshold at infinity, making the classifier always output 0 (which is then obviously the best solution, since it incurs no cost).
The threshold trick cannot eliminate errors from the classifier - no classifier in real-world problems is perfect or free from noise. What it can do is change the ratio between the 0-when-really-1 errors and 1-when-really-0 errors for the final classification.
As you increase the threshold, more points are classified with a 0 label. Consider a chart with the fraction of points classified with 0 on the x-axis, and the fraction of points with a 0-when-really-1 error on the y-axis. For each value of the threshold, plot a point for the resulting classifier on this chart. Plotting a point for all thresholds you get a curve. This is (some variant of) the ROC curve, which summarizes the abilities of the classifier. An often used metric for quality of classification is the AUC or area-under-curve of this chart, but in fact, the whole curve can be of interest in applications.
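To make the threshold sweep concrete, here is a minimal sketch (in Python rather than the Julia/MLBase setup from the question, with hypothetical scores and labels arrays, and using the standard false-positive-rate / true-positive-rate axes):

import numpy as np

def roc_points(scores, labels):
    # Sweep the threshold over every observed score and record one (FPR, TPR)
    # point per resulting classifier. Assumes both classes are present in labels.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)      # 1 = positive class, 0 = negative class
    points = []
    for thr in np.unique(scores):
        pred = (scores >= thr).astype(int)      # predict positive if score >= threshold
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        tn = np.sum((pred == 0) & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

The essential input is a continuous score per point (e.g. a class probability or vote fraction), not the hard 0/1 labels returned by predict: with hard predictions there is effectively only one threshold, so you get a single point rather than a curve.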
A summary like this appears in many texts on machine learning, which are a google query away.
Hope this clarifies the role of the threshold and its relation to ROC curves.
I'm investigating sensor measurements of NO2 in the atmosphere over the course of several days. My first interest is to find periodicity of the data to which end I'm using autocorrelation.
My problem is that common practice seems to be to apply both a moving average and filtering to the measurements: a moving average of about 10-50 data points, and readings above the sensor's maximum of 200 µg/m³ clipped to 200 µg/m³ (as far as my understanding goes).
Anyhow, when performing my autocorrelation I found that processing the raw signal and the averaged/filtered signal gives wildly different results, as can be seen in the appended autocorrelation figure (bottom), which leads me to my question:
When performing autocorrelation, do I wrongfully change the result by using an averaged/filtered input signal to my autocorrelation function? And if so, which way is "correct"?
Top: raw sensor measurement of NO2 concentration, with no moving average or filtering. Middle: measurement processed with a 30-point moving average and any reading > 200 clipped to 200. Bottom: autocorrelation of the two measurements above, with some slight smoothing. The right-hand scale is inactive and possible end effects are not of interest.
Comments on the figure: I know it looks odd that the moving-average signal is flat most of the time and that this flat level is not at a constant 200 (the maximum). That is not of interest here; the behavior of the autocorrelation is my concern.
Applying a moving average before autocorrelating is the same as applying the moving average twice (once forward and once backward) after autocorrelating.
Let * denote convolution and ^R denote time-reversal of a signal, and let x and m be your input signal and moving-average filter:
AutoCorrelate(x*m) = (x*m) * (x*m)^R
= x * m * x^R * m^R
= x * x^R * m * m^R
= AutoCorrelate(x) * (m * m^R)
Note that a moving average filter is the same shape as its time-reversal, so by filtering the signal before autocorrelation, you have effectively filtered the autocorrelation twice.
Since a moving average filter is a low-pass filter, this explains why the curves in your autocorrelation are smoothed out.
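As a quick numerical check of this identity, here is a minimal sketch (the white-noise signal and the 30-point window are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)      # example raw signal
m = np.ones(30) / 30           # 30-point moving-average filter

def autocorrelate(s):
    # linear autocorrelation = convolution of the signal with its time-reversal
    return np.convolve(s, s[::-1])

lhs = autocorrelate(np.convolve(x, m))                         # filter, then autocorrelate
rhs = np.convolve(autocorrelate(x), np.convolve(m, m[::-1]))   # autocorrelate, then filter twice

print(np.allclose(lhs, rhs))   # True, up to floating-point error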
Whether or not this is appropriate really depends on your application. If the moving average filter only removes noise, then it's a good idea. If the moving average removes important parts of the signal that indicate its timing, then it's not a good idea.
I have been trying to get into more detail on resampling methods and implemented them on a small data set of 1000 rows. The data was split into a training set of 800 rows and a validation set of 200 rows. I used k-fold cross-validation and repeated k-fold cross-validation to train a KNN classifier on the training set. Based on my understanding I have made some interpretations of the results; however, I have certain doubts about them (see the questions below):
Results :
10-fold CV
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10-fold CV with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10-fold CV, 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10-fold CV with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
While selecting the parameter, k = 9 is the optimal value for the highest accuracy. However, I don't understand how to take Kappa into consideration when finally choosing the parameter value.
The number of repeats has to be increased until we get a stabilised result; the accuracy changes when the repeats are increased from 10 to 1000, but the results are similar for 1000 and 2000 repeats. Would it be right to consider the results at 1000/2000 repeats a stabilised performance estimate?
Is there any rule of thumb for the number of repeats?
Finally, should I now train the model on my complete training data (800 rows) and then test the accuracy on the validation set?
Accuracy and Kappa are just different classification performance metrics. In a nutshell, the difference is that Accuracy does not take possible class imbalance into account, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R caret you can do so via the metric argument of train.
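To see the difference concretely, here is a small illustration (in Python with scikit-learn rather than caret, using made-up labels) of how accuracy can look good on imbalanced classes while Kappa does not:

from sklearn.metrics import accuracy_score, cohen_kappa_score

# Imbalanced ground truth: 90% class 0, 10% class 1.
y_true = [0] * 90 + [1] * 10
# A useless classifier that always predicts the majority class.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))     # 0.9 -- looks good despite learning nothing
print(cohen_kappa_score(y_true, y_pred))  # 0.0 -- Kappa corrects for chance agreement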
You would see a similar effect of slightly different performance results when running e.g. the 10-fold CV with 10 repeats multiple times - you will just get slightly different results for those as well. Something you should look out for is the variance of classification performance over your partitions and repeats. If you obtain a small variance, you can conclude that by training on all your data you will likely obtain a model that gives similar (hence stable) results on new data. But if you obtain a huge variance, you can conclude that just by chance (being lucky or unlucky) you might instead obtain a model that gives you either rather good or rather bad performance on new data. By the way: the prediction performance variance is something e.g. R caret::train will give you automatically, hence I'd advise using it.
See above: look at the variance and increase the repeats until you can e.g. repeat the whole process and obtain a similar average performance and variance of performance.
Yes. CV and resampling methods exist to give you information about how well your model will perform on new data. So, after performing CV/resampling and obtaining this information, you will usually use all your data (both the train and test partitions!) to train the final model that you use in your application scenario.
My dataset has m features and n data points. Let w be a vector of weights (to be estimated). I'm trying to implement gradient descent with the stochastic update method. The function I am minimizing is the least mean squares error.
The update algorithm is shown below:
for i = 1 ... n data points:
    for t = 1 ... m features:
        w_t = w_t - alpha * (<w> . <x_i> - y_i) * x_{i,t}
where <x_i> is a row vector of the m features of the i-th data point, y_i is its true label (an entry of the column vector <y>), and alpha is a constant learning rate.
My questions:
According to the Wikipedia article, I don't need to go through all data points, and I can stop when the error is small enough. Is that true?
I don't understand what the stopping criterion should be here. If anyone can help with this, that would be great.
Is the formula I used in the for loop correct? I believe (<w> . <x_i> - y_i) * x_{i,t} is my ∇Q(w).
According to the Wikipedia article, I don't need to go through all data points, and I can stop when the error is small enough. Is that true?
This is especially true when you have a really huge training set and going through all the data points is expensive. In that case, you would check the convergence criterion after every K stochastic updates (i.e. after processing K training examples). While it's possible, it doesn't make much sense to do this with a small training set. Another thing people do is randomize the order in which training examples are processed, to avoid having too many correlated examples in a row, which may result in "fake" convergence.
I don't understand what the stopping criterion should be here. If anyone can help with this, that would be great.
There are a few options. I recommend trying as many of them as possible and deciding based on empirical results.
1. The difference in the objective function on the training data is smaller than a threshold.
2. The difference in the objective function on held-out data (aka development data, validation data) is smaller than a threshold. The held-out examples should NOT include any of the examples used for training (i.e. for stochastic updates), nor any of the examples in the test set used for evaluation.
3. The total absolute difference in the parameters w is smaller than a threshold.
4. In 1, 2, and 3 above, instead of specifying a threshold, you could specify a percentage. For example, a reasonable stopping criterion is to stop training when |squared_error(w) - squared_error(previous_w)| < 0.01 * squared_error(previous_w).
5. Sometimes, we don't care whether we have the optimal parameters; we just want to improve the parameters we originally had. In that case, it's reasonable to preset a number of iterations over the training data and stop after that, regardless of whether the objective function actually converged.
Is the formula I used in the for loop correct? I believe (w . x_i - y_i) * x_{i,t} is my ∇Q(w).
It should be 2 * (w . x_i - y_i) * x_{i,t}, but that's not a big deal given that you're multiplying by the learning rate alpha anyway.
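To tie the pieces together, here is a minimal sketch of the stochastic update with shuffling and the relative-change stopping rule from option 4 above (X, y, alpha, tol and max_epochs are all placeholder names and values):

import numpy as np

def sgd_least_squares(X, y, alpha=0.01, tol=0.01, max_epochs=100, seed=0):
    # X: (n, m) array of features, y: length-n array of true labels.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    prev_sse = np.inf
    for epoch in range(max_epochs):
        for i in rng.permutation(n):              # shuffle to avoid correlated examples in a row
            error = X[i] @ w - y[i]
            w -= alpha * 2 * error * X[i]         # per-example gradient: 2 * (w.x_i - y_i) * x_i
        sse = np.sum((X @ w - y) ** 2)            # squared error on the training data
        if abs(sse - prev_sse) < tol * prev_sse:  # stop when the relative change is small
            break
        prev_sse = sse
    return w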