Standard Error of the Median Survival Time in SPSS

I would like to understand how the standard error of the median survival time is calculated in SPSS 19.0.
I've looked at the Algorithms document (ftp://ftp.software.ibm.com/software/analytics/spss/support/Stats/Docs/19.0/Client/User_Manuals/English/IBM_SPSS_Statistics_19_Algorithms.pdf) and couldn't find an answer.
I found this review article (http://www.barkerstats.com/PDFs/tas.pdf), which gives the formula used in SPSS 17.0:
$$ se(P_i) = P_i \sqrt{\sum_{j=1}^{i} \frac{q_j}{r_j \, p_j}} $$
Is this correct? If so, how do I apply this quantity to obtain the standard error of the median survival time?
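For reference, the quantity in that formula (Greenwood's formula for the survival proportion) can be computed directly from life-table counts. Below is a minimal numpy sketch with toy numbers, not SPSS output; r_j is the number at risk and d_j the number of deaths in interval j:

```python
import numpy as np

# Toy life-table counts (hypothetical): r_j = at risk, d_j = deaths in interval j
r = np.array([100, 90, 75])
d = np.array([10, 15, 5])

q = d / r            # conditional probability of dying in interval j
p = 1 - q            # conditional probability of surviving interval j
P = np.cumprod(p)    # cumulative survival P_i

# Greenwood's standard error of P_i
se_P = P * np.sqrt(np.cumsum(q / (r * p)))
```

Note that this is the standard error of the survival proportion P_i, not of the median itself; how SPSS maps it onto the median time is exactly the open question here, so treat this only as the Greenwood building block.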


Time series prediction using GP - training data

I am trying to implement time series forecasting using genetic programming. I am creating random trees (ramped half-and-half) of s-expressions and evaluating each expression with RMSE to calculate its fitness. My problem is the training process. If I want to predict gold prices and the training data looks like this:
date        open       high       low        close
28/01/2008  90.959999  91.889999  90.75      91.75
29/01/2008  91.360001  91.720001  90.809998  91.150002
30/01/2008  90.709999  92.580002  90.449997  92.059998
31/01/2008  90.919998  91.660004  90.739998  91.400002
01/02/2008  91.75      91.870003  89.220001  89.349998
04/02/2008  88.510002  89.519997  88.050003  89.099998
05/02/2008  87.900002  88.690002  87.300003  87.68
06/02/2008  89         89.650002  88.75      88.949997
07/02/2008  88.949997  89.940002  88.809998  89.849998
08/02/2008  90         91         89.989998  91
As I understand, this data is nonlinear so my questions are:
1. Do I need to preprocess this data, for example with exponential smoothing? If so, why?
2. When looping over the current population and evaluating the fitness of each expression on the training data, should I calculate the RMSE on just part of the data or on all of it?
3. When the algorithm finishes and I get the expression with the best (lowest) fitness, does that mean that feeding it any row of the training data should output the next day's price?
I've read some research papers about this and noticed that some of them divide the training data when calculating fitness, while others apply exponential smoothing. However, I found them difficult to follow, and most implementations I've found are in Python or R, which I am not familiar with.
I appreciate any help on this.
Thank you.
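On question 2 above, a common approach is to score fitness on a chronological training split and keep a hold-out segment to detect overfitting. A minimal sketch, where evaluate() is a hypothetical stand-in for evaluating a GP s-expression:

```python
import numpy as np

def rmse(pred, actual):
    return np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2))

# Close prices from the sample data above; each close predicts the next day's close
closes = np.array([91.75, 91.150002, 92.059998, 91.400002, 89.349998,
                   89.099998, 87.68, 88.949997, 89.849998, 91.0])
X, y = closes[:-1], closes[1:]            # one-step-ahead targets

split = int(0.7 * len(X))                 # chronological split, no shuffling
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

def evaluate(expr, x):
    """Hypothetical stand-in for evaluating a GP s-expression on inputs x.
    Here it simply carries the price forward (a naive baseline)."""
    return x

fitness = rmse(evaluate(None, X_train), y_train)   # what the GP loop optimizes
holdout = rmse(evaluate(None, X_test), y_test)     # monitored, never optimized
```

Scoring fitness on the full series rewards memorization; the hold-out RMSE is what tells you whether the evolved expression generalizes to unseen days.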

ARIMA model producing a straight line prediction

I ran some experiments with the ARIMA model on two datasets:
Airline passengers data
USD vs Indian rupee data
On the airline passengers data I get a normal zig-zag prediction with ARIMA order=(2,1,2) [model results plot]. But on the USD vs Indian rupee data I get a straight-line prediction, both with ARIMA order=(2,1,2) and with SARIMAX order=(2,1,2), seasonal_order=(0,0,1,30) [model results plots].
I tried different parameters, but for the USD vs Indian rupee data I always get a straight-line prediction.
One more doubt: I have read that the ARIMA model does not support time series with a seasonal component (for that we have SARIMA). Then why does the ARIMA model produce predictions with cycles on the airline passengers data?
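On the straight-line behaviour: after first differencing, an ARIMA(2,1,2) forecasts the *changes* of the series, and multi-step AR forecasts decay geometrically toward the mean of those changes. An exchange rate is close to a random walk, so the fitted coefficients are small, the forecast differences die out after a few steps, and the re-integrated forecast becomes a straight line. A tiny sketch with hypothetical coefficients:

```python
# Hypothetical fitted AR(2) on the differenced series: small coefficients,
# as is typical when the raw series is near a random walk.
phi1, phi2, mu = 0.10, 0.05, 0.0
diffs = [0.4, -0.2]          # last two observed differences

for _ in range(10):          # iterate the multi-step forecast recursion
    diffs.append(mu + phi1 * diffs[-1] + phi2 * diffs[-2])

# diffs[-1] is essentially mu: every further step adds the same constant,
# so the forecast of the level traces a straight line.
```

The airline data, by contrast, retains strong autocorrelation even after differencing, so its AR and MA terms (with near-cyclical roots) keep producing visible oscillations over the forecast horizon, even without an explicit seasonal component.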
Having gone through a similar issue recently, I would recommend the following:
Visualize the seasonal decomposition of the data to make sure that seasonality actually exists in it. Also make sure that the dataframe has a frequency set; you can enforce a frequency on a pandas dataframe like this:
dh = df.asfreq('W')  # for weekly resampled data; fill NaNs with an appropriate method
Here is sample code to do the seasonal decomposition:
import statsmodels.api as sm
import matplotlib.pyplot as plt

# 'additive' vs 'multiplicative' is data specific
decomposition = sm.tsa.seasonal_decompose(dh['value'], model='additive', extrapolate_trend='freq')
fig = decomposition.plot()
plt.show()
The plot will show whether seasonality exists in your data. Please feel free to go through this amazing document regarding seasonal decomposition.
If you're sure that the seasonal period is 30, then you should be able to get a good result with the pmdarima package, which is extremely effective at finding optimal (p, d, q) values for your model.
If you're unsure about seasonality, please consult with a domain expert about the seasonal effects of your data or try experimenting with different seasonal components in your model and estimate the error.
Please make sure to check the stationarity of the data with the Dickey-Fuller test before training the model. pmdarima supports finding the d component as follows:
from pmdarima.arima import ndiffs

kpss_diff = ndiffs(dh['value'].values, alpha=0.05, test='kpss', max_d=12)
adf_diff = ndiffs(dh['value'].values, alpha=0.05, test='adf', max_d=12)
n_diffs = max(adf_diff, kpss_diff)
You may also find d with the help of the document I linked above. If the answer isn't helpful, please provide the data source for the exchange rate and I will try to explain the process flow with sample code.

SPSS Bootstrap with custom sample size

I have a very large sample of 11236 cases for each of my two variables (ms and gar). I now want to calculate Spearman's rho correlation with bootstrapping in SPSS.
I figured out the standard syntax for bootstrapping in SPSS with bias corrected and accelerated confidence intervals:
DATASET ACTIVATE DataSet1.
BOOTSTRAP
/SAMPLING METHOD=SIMPLE
/VARIABLES INPUT=ms gar
/CRITERIA CILEVEL=95 CITYPE=BCA NSAMPLES=10000
/MISSING USERMISSING=EXCLUDE.
NONPAR CORR
/VARIABLES=ms gar
/PRINT=SPEARMAN TWOTAIL NOSIG
/MISSING=PAIRWISE.
But this syntax resamples all 11236 cases 10,000 times.
How can I instead take a random sample of 106 cases (√11236 = 106), calculate Spearman's rho, and repeat 10,000 times, drawing a new random sample of 106 cases at each bootstrap step?
Use the sample selection procedures (Data > Select Cases). You can specify an approximate or exact random sample, or select specific cases. Then run the BOOTSTRAP and NONPAR CORR commands.
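What the question describes is m-out-of-n subsampling rather than a standard bootstrap (which resamples all n cases with replacement). Outside SPSS the loop is short; a sketch on toy stand-in data, with Spearman's rho computed as the Pearson correlation of ranks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for ms and gar (the real data has n = 11236 cases)
n, m, B = 11236, 106, 1000           # use B = 10000 for the real analysis
ms = rng.normal(size=n)
gar = 0.3 * ms + rng.normal(size=n)  # built-in association, for illustration

def spearman(x, y):
    # Spearman's rho = Pearson correlation of the ranks (no ties here)
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

rhos = []
for _ in range(B):
    idx = rng.choice(n, size=m, replace=False)   # fresh random 106 cases each step
    rhos.append(spearman(ms[idx], gar[idx]))

ci_lo, ci_hi = np.percentile(rhos, [2.5, 97.5])  # simple percentile interval
```

Keep in mind that rho estimated from 106 cases is much noisier than rho from all 11236, so this interval describes subsample variability, not the precision of the full-sample estimate; the BCa intervals in the BOOTSTRAP syntax above are a refinement of the plain percentile interval used here.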

SVD output interpretation in mahout

I am trying to run an SVD job in Mahout. I have a document × term matrix (call it A) of size 372053 × 21338 (M = 372053 documents, N = 21338 unique words), so A is M × N. I ran the SVD using Mahout and got the cleaned eigenvectors (I gave the expected rank as R = 200), so I now have an eigenvector matrix of size R × N.
Stating the SVD equation
A = U * S * V' (V' being transpose of V)
I need to convert the matrix A to the new space to get the compressed vectors of the documents (I am trying to implement LSI).
What output do I get from Mahout's SVD (in terms of the equation above)? I read on the mailing list that we can get the eigenvalues from the NamedVectors in the generated eigenvector matrix.
Please guide me on how to proceed from here to generate the document-term matrix A in the new space (of size M × R).
Any help is highly appreciated :)
A good starting point for LSI with Stochastic SVD on Mahout can be found here.
The good part is that the paper also describes the folding-in process and is explicit about the output format in terms of the SVD equation.
The work is integrated into the latest version, 0.8, and can be used with the SSVDCli job or through the Mahout CLI with mahout ssvd <options>.
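In terms of A = U · S · V', the compressed M × R document matrix is A · V_R (equivalently U_R · S_R), where the R × N eigenvector output plays the role of V'. A small numpy sketch with toy sizes (the real case is M = 372053, N = 21338, R = 200):

```python
import numpy as np

M, N, R = 6, 5, 2
A = np.random.default_rng(1).random((M, N))    # toy document-term matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_r, s_r, V_r = U[:, :R], s[:R], Vt[:R, :].T   # rank-R truncation; Vt is R x N

docs_new = A @ V_r        # M x R: documents projected into the LSI space
# The same vectors fall out of the left factors directly:
assert np.allclose(docs_new, U_r * s_r)
```

Folding in a new document works the same way: project its term vector d as d · V_R to place it in the R-dimensional space.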

IBM SPSS 20: Rank and output a Matrix of Pearson's Correlations?

I was given a data set with 1,000 variables and have been asked to run Pearson's correlations on the explanatory variables and a binary dependent variable. I generate the correlations using the following code:
correlations /variables = x1 to x500 with y
correlations /variables = x501 to x1000 with y
The resulting output is a table that appears un-sortable in SPSS or in other software (e.g. Excel):
x1 Pearson Correlation
p-value
N
-----------------------
x2 Pearson Correlation
p-value
N
-----------------------
.
.
.
-----------------------
xi Pearson Correlation
p-value
N
-----------------------
I want to be able to rank the variables by Pearson correlation and then by p-value. Does SPSS have the capability to save the variable name, Pearson correlation, and p-value as a table, and then rank them?
I am too used to Stata and R and could not find anything in the manual. Would a workaround be to run 1,000 univariate regression models, each with a single explanatory variable, and try saving those coefficients?
Thanks!
You can easily pivot the statistics into the columns of the output table, which gives a sortable arrangement. Try it with a few variables to see how this works: double-click the table to activate it, then use Pivot > Pivoting Trays to open the pivoting controls.
To do this for your real data, capture the table with OMS, creating a new dataset that you can then sort or otherwise manipulate. When you create the OMS command, tell it to pivot the table so that the dataset arrangement is convenient.
Bear in mind that fishing for the highest correlations is likely to give you an overly optimistic view of the predictive power of the top variables.
The NAIVEBAYES procedure (Statistics Server) might be another approach to consider. Check the Command Syntax Reference for details.
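If a detour outside SPSS is acceptable, the whole rank-and-sort is a few lines in Python with scipy; a sketch on toy data (variable names hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 10                    # toy stand-in for 1,000 explanatory variables
X = rng.normal(size=(n, p))
y = (X[:, 0] + rng.normal(size=n) > 0).astype(float)   # binary dependent variable

rows = []
for j in range(p):
    r, pval = stats.pearsonr(X[:, j], y)   # correlation and two-sided p-value
    rows.append((f"x{j + 1}", r, pval))

# Rank by |r| descending, breaking ties by p-value ascending
rows.sort(key=lambda t: (-abs(t[1]), t[2]))
```

With a binary y this Pearson correlation is the point-biserial correlation, and the fishing warning above still applies: the top of a ranked list of 1,000 screened correlations will look better than it will perform.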
