I have my ACF and PACF plots from the R commands acf() and pacf(). I would like to know whether it is possible to automatically find the last significant lag (the last spike outside the confidence bounds) with a command that reads those values directly rather than off the plots. The aim is to build an ARIMA(p,d,q) model, and for that I need to find p and q.
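For reference, here is a minimal sketch in Python/statsmodels (rather than R, and with a hypothetical series y) of what I mean by reading the values programmatically instead of off the plots:
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

acf_vals, acf_conf = acf(y, nlags=40, alpha=0.05)      # estimates + confidence band
half_width = acf_conf[:, 1] - acf_vals                 # band half-width around zero
sig = np.where(np.abs(acf_vals[1:]) > half_width[1:])[0] + 1
last_sig_acf_lag = sig.max() if sig.size else 0        # candidate cut-off for q

pacf_vals, pacf_conf = pacf(y, nlags=40, alpha=0.05)
half_width = pacf_conf[:, 1] - pacf_vals
sig = np.where(np.abs(pacf_vals[1:]) > half_width[1:])[0] + 1
last_sig_pacf_lag = sig.max() if sig.size else 0       # candidate cut-off for p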
Thanks
A superimposed display for train/val splits using StatisticsGen
Hi,
I'm currently using a TFX pipeline inside Kubeflow. I'm struggling to get StatisticsGen to show a single graph with the train and validation split curves superimposed, which would allow a better comparison of the distributions. This is exactly how tfdv.visualize_statistics(lhs_statistics=train_stats, rhs_statistics=eval_stats, lhs_name='train', rhs_name='eval') behaves (see illustration 1), and I would like StatisticsGen to also provide such a superimposed-splits graph.
Thanks for any reference or help so that I can move forward.
Regards
You can use something like
# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')
From the TensorFlow Data Validation tutorial.
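If you want to reproduce this from a notebook against the StatisticsGen output, a rough sketch could look like the following; the artifact paths are hypothetical and the exact file names depend on your TFX version (older versions write a TFRecord that is read with tfdv.load_statistics instead):
import tensorflow_data_validation as tfdv

# Hypothetical paths to the per-split statistics written by StatisticsGen
train_stats = tfdv.load_stats_binary(
    '<pipeline_root>/StatisticsGen/statistics/<execution_id>/Split-train/FeatureStats.pb')
eval_stats = tfdv.load_stats_binary(
    '<pipeline_root>/StatisticsGen/statistics/<execution_id>/Split-eval/FeatureStats.pb')

tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')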
I did some experiments with the ARIMA model on two datasets:
Airline passengers data
USD vs Indian rupee data
I am getting a normal zig-zag prediction on the Airline passengers data:
ARIMA order=(2,1,2)
Model Results
But on the USD vs Indian rupee data, the prediction comes out as a straight line:
ARIMA order=(2,1,2)
Model Results
SARIMAX order=(2,1,2), seasonal_order=(0,0,1,30)
Model Results
I tried different parameters but for USD vs Indian rupee data I am always getting a straight line prediction.
One more doubt: I have read that the ARIMA model does not support time series with a seasonal component (for that we have SARIMA). Then why is the ARIMA model producing predictions with a cycle for the Airline passengers data?
Having gone through a similar issue recently, I would recommend the following:
Visualize the seasonal decomposition of the data to make sure that seasonality actually exists in it. Please also make sure that the DataFrame has a frequency set; you can enforce a frequency on a pandas DataFrame with the following:
dh = df.asfreq('W')  # for weekly resampled data; fill the resulting NaNs with an appropriate method
Here is a sample code to do seasonal decomposition:
import statsmodels.api as sm
import matplotlib.pyplot as plt

# 'additive' or 'multiplicative' is data specific
decomposition = sm.tsa.seasonal_decompose(dh['value'], model='additive',
                                          extrapolate_trend='freq')
fig = decomposition.plot()
plt.show()
The plot will show whether seasonality exists in your data. Please feel free to go through this amazing document regarding seasonal decomposition. Decomposition
If you're sure that the seasonal component of the model is 30, then you should be able to get a good result with the pmdarima package. The package is extremely effective at finding optimal pdq values for your model. Here is the link to it: pmdarima
example code pmdarima
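For illustration, a rough sketch of what such a search could look like (reusing dh['value'] from above; the seasonal period m=30 is only the assumption from your question, not a recommendation):
import pmdarima as pm

model = pm.auto_arima(dh['value'],
                      d=None,               # let the unit-root tests pick d
                      seasonal=True, m=30,  # assumed seasonal period
                      stepwise=True, trace=True,
                      suppress_warnings=True, error_action='ignore')
print(model.summary())
forecast = model.predict(n_periods=30)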
If you're unsure about seasonality, please consult with a domain expert about the seasonal effects of your data or try experimenting with different seasonal components in your model and estimate the error.
Please make sure that the stationarity of the data is checked with a Dickey-Fuller test before training the model. pmdarima supports finding the d component with the following:
from pmdarima.arima import ndiffs
kpss_diff = ndiffs(dh['value'].values, alpha=0.05, test='kpss', max_d=12)
adf_diff = ndiffs(dh['value'].values, alpha=0.05, test='adf', max_d=12)
n_diffs = max(adf_diff, kpss_diff)
You may also find d with the help of the document I provided here. If the answer isn't helpful, please provide the data source for the exchange rate and I will try to explain the process flow with sample code.
In order to cluster a set of time series, I'm looking for a smart distance metric.
I've tried some well-known metrics, but none of them fits my case.
An example: let's assume that my clustering algorithm extracts these three centroids [s1, s2, s3]:
I want to put this new example [sx] in the most similar cluster:
The most similar centroid is the second one, so I need to find a distance function d that gives me d(sx, s2) < d(sx, s1) and d(sx, s2) < d(sx, s3).
edit
Here are the results with the metrics [cosine, euclidean, minkowski, dynamic time warping]:
edit 2
User Pietro P suggested applying the distances to the cumulative version of the time series.
The solution works; here are the plots and the metrics:
Nice question! Using any standard distance on R^n (Euclidean, Manhattan or, more generally, Minkowski) over those time series cannot achieve the result you want, since those metrics are invariant under permutations of the coordinates of R^n (while time is strictly ordered, and that is exactly the phenomenon you want to capture).
A simple trick that can do what you ask is to use the cumulative version of the time series (the running sum of the values as time increases) and then apply a standard metric. Using the Manhattan metric, the distance between two time series would be the area between their cumulative versions.
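A minimal sketch of that trick (assuming the series are equal-length 1-D numpy arrays named as in your question):
import numpy as np

def cumulative_manhattan(a, b):
    # Manhattan distance between the running sums of two equal-length series
    return np.abs(np.cumsum(a) - np.cumsum(b)).sum()

centroids = [s1, s2, s3]
distances = [cumulative_manhattan(sx, c) for c in centroids]
best_cluster = int(np.argmin(distances))   # expected to be 1, i.e. s2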
Another approach would be to utilize DTW, which is an algorithm to compute the similarity between two temporal sequences. Full disclosure: I coded a Python package for this purpose called trendypy; you can download it via pip (pip install trendypy). Here is a demo on how to utilize the package. You're basically computing the total minimum distance for different combinations to set the cluster centers.
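If you prefer not to add a dependency, a plain DTW distance is short enough to sketch directly; this is the textbook dynamic-programming recurrence rather than the trendypy API:
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]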
What about using the standard Pearson correlation coefficient? Then you can assign the new point to the cluster with the highest coefficient.
import scipy.stats
correlation, p_value = scipy.stats.pearsonr(<new time series>, <centroid>)
Pietro P's answer is just a special case of applying a convolution to your time series.
If I used the kernel:
[1,1,...,1,1,1,0,0,0,0,...0,0]
I would get a cumulative series.
Applying a convolution works because you're giving each data point information about its neighbours; the result is now order dependent.
It might be interesting to try a Gaussian convolution or other kernels.
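For concreteness, a rough sketch of both kernels (assuming a 1-D numpy array x):
import numpy as np

x = np.array([1.0, 2.0, 0.5, 3.0])

# A step kernel of ones reproduces the cumulative series (Pietro P's trick)
cumulative = np.convolve(x, np.ones(len(x)))[:len(x)]   # equals np.cumsum(x)

# A Gaussian kernel instead blends each point with its neighbours
t = np.arange(-3, 4)
gaussian = np.exp(-0.5 * (t / 1.5) ** 2)
gaussian /= gaussian.sum()
smoothed = np.convolve(x, gaussian, mode='same')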
I am a developer that has been tasked with working out how previous results using SPSS were gathered, so we can repeat the process with some new data. We can't ask the person who did the original analysis because he is sadly no longer with us, so it has fallen to me to unravel what he did.
I am not a statistician and do not need to understand the principles involved. I really just need to know what menu items to navigate to.
We had a survey done, which asked a lot of questions of 10,000 people. A subset of 15 of these questions is being used for the analysis.
I know that factor analysis was done to reduce the data to 4 factors. K-means clustering was then used to find the cluster centers. This is what I'm after now.
I have worked out how to do the factor analysis to get the component score coefficient matrix that matches the data I have in my database. This was done by going to Analyze > Dimension Reduction > Factor. I then chose a fixed number of factors (4) from the "Extract" section, "Varimax" rotation from the "Rotation" section and checked the "Display factor score coefficient matrix" in the "Scores" section.
This gave data like this:
Matrix Value 1 Value 2 Value 3 Value 4
Q1 -0.0756 0.2134 -0.0245 -0.1236
Q2 ... ... ... ...
Q3 ... ... ... ...
...
What I have no idea of is how to proceed with this to do the k-means clustering.
The results I have in the database look like this:
Cluster centers Value 1 Value 2 Value 3 Value 4 Value 5
FAC1_1 -0.8373 -0.5766 0.2100 1.3499 0.2940
FAC2_1 ... ... ... ... ...
FAC3_1 ... ... ... ... ...
FAC4_1 ... ... ... ... ...
Now, I know that k-means clustering can be done on the original data set by using Analyze > Classify > K-means Cluster, but I don't know how to reference the factor analysis I've done.
Could someone give me some insight into how to create these cluster centers using SPSS?
In the GUI for FACTOR analysis (Analyze > Dimension Reduction > Factor), there is a sub-dialog "Scores"; make sure "Save as variables" is checked.
This will save the factor scores in your data, i.e. the variables FAC1_1, FAC2_1, FAC3_1, FAC4_1.
It is these variables that you then need to add as input variables in the K-Means GUI.
It is better to set up your work in syntax so that if anyone else ever wants to replicate it they can do so (and ideally your predecessor should have left his breadcrumbs in a syntax document too; I would make every attempt to find this document if there is a remote possibility of it existing, a file with the .sps extension).
Here's how you'd set this up in syntax and what his/her workings may have looked like:
/* Replicate the factor analysis (four factors) and save the factor score variables */.
FACTOR
/VARIABLES < INPUT THE 15 VARIABLES HERE >
/MISSING LISTWISE
/ANALYSIS < INPUT THE 15 VARIABLES HERE >
/PRINT EXTRACTION ROTATION FSCORE
/FORMAT SORT BLANK(.10)
/PLOT ROTATION
/CRITERIA FACTORS(4) ITERATE(25)
/EXTRACTION PC
/CRITERIA ITERATE(25)
/ROTATION VARIMAX
/SAVE REG(ALL)
/METHOD=CORRELATION.
/* Replicate the clustering using factor scores as inputs, generating 5 segments */.
QUICK CLUSTER FAC1_1 FAC2_1 FAC3_1 FAC4_1
/MISSING=LISTWISE
/CRITERIA=CLUSTER(5) MXITER(10) CONVERGE(0)
/METHOD=KMEANS(NOUPDATE)
/SAVE CLUSTER (Seg5)
/PRINT INITIAL.
/* Check centroids match*/.
MEANS FAC1_1 FAC2_1 FAC3_1 FAC4_1 BY Seg5 /CELLS MEAN.
If you can replicate the FACTOR score variables exactly, that is a good start. If the factor scores match but the centroids do not, then it is most likely because the segment assignments are now different: despite using the same inputs and methodology, if the case ordering differs from before, K-Means (QUICK CLUSTER) can and most likely will yield different segment assignments because of its starting points.
I don't know of any way round this, but in principle these are the likely steps he/she took.
I have done the same kind of analysis for a project of mine. First carry out the factor analysis; once you have been able to extract a good amount of variance, save the factor scores (in SPSS).
To save the factor scores, go to Analyze > Dimension Reduction > Factor > Scores > Save as variables.
As you save the scores, new variables will be created in the Variable View based on the number of components.
After you have saved the factor scores, go to Analyze > Classify > K-Means Cluster, select the new variables (factor scores), enter the number of initial clusters required, and click OK.
If you have access to the system where the original work was done, look for the journal file (typically named statistics.jnl and kept in the location specified under Edit > Options > Files).
If journaling was in effect with the append option, it will have all the commands the user ran.
I'm doing the same set of analyses for a project. Just for your information, the two-step clustering procedure offered by SPSS is more robust than K-means (Punj & Stewart, 1983). In K-means, how are you going to choose K? You can also use the clvalid package to get the optimal number of clusters if you insist on using K-means.
Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: review and suggestions for application. Journal of marketing research, 134-148.
I was given a data set with 1,000 variables and have been asked to run Pearson's correlations on the explanatory variables and a binary dependent variable. I generate the correlations using the following code:
correlations /variables = x1 to x500 with y
correlations /variables = x501 to x1000 with y
The resulting output is a table which appears to be unsortable in SPSS or other software (e.g. Excel):
x1 Pearson Correlation
p-value
N
-----------------------
x2 Pearson Correlation
p-value
N
-----------------------
.
.
.
-----------------------
xi Pearson Correlation
p-value
N
-----------------------
I want to be able to rank the variables according to Pearson's Correlation and then p-value. Does SPSS have the capability to save the Variable Name, Pearson's Correlation value and p-Value as a table, and then rank them?
I am too used to Stata and R and could not find anything in the manual. Would a workaround be to run 1,000 univariate regression models against the single dependent variable and try saving those coefficients?
Thanks!
You can easily pivot the statistics into the columns of the output table, which would give you a sortable arrangement. Try it with a few variables to see how this works: double-click the table to activate it, then use Pivot > Pivoting Trays to open the controls for pivoting.
To do this for your real data, you will want to capture the table using OMS, creating a new dataset, which you can then sort or do other data manipulation operations. When you create your OMS command, you will want to tell it to pivot the table so that the dataset arrangement is convenient.
Bear in mind that fishing for the highest correlations is likely to give you an overly optimistic view of the predictive power of the top variables.
The NAIVEBAYES procedure (Statistics Server) might be another approach to consider. Check the Command Syntax Reference for details.