I have a very large sample of 11236 cases for each of my two variables (ms and gar). I now want to calculate Spearman's rho correlation with bootstrapping in SPSS.
I figured out the standard syntax for bootstrapping in SPSS with bias-corrected and accelerated (BCa) confidence intervals:
DATASET ACTIVATE DataSet1.
BOOTSTRAP
/SAMPLING METHOD=SIMPLE
/VARIABLES INPUT=ms gar
/CRITERIA CILEVEL=95 CITYPE=BCA NSAMPLES=10000
/MISSING USERMISSING=EXCLUDE.
NONPAR CORR
/VARIABLES=ms gar
/PRINT=SPEARMAN TWOTAIL NOSIG
/MISSING=PAIRWISE.
But this syntax resamples all of my 11236 cases 10000 times.
How can I instead take a random sample of 106 cases (√11236), calculate Spearman's rho, and repeat this 10000 times, drawing a new random sample of 106 cases at each bootstrap step?
Use the sample selection procedures - Data > Select Cases. You can specify an approximate or exact random sample or select specific cases. Then run the BOOTSTRAP and NONPAR CORR commands.
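For illustration only, and outside SPSS: the scheme the question describes (drawing 106 of the 11236 cases without replacement, computing Spearman's rho, and repeating 10000 times with a fresh subsample each step) could be sketched in Python with NumPy/SciPy roughly as follows; the generated ms and gar arrays are only placeholders for the real data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Placeholder data standing in for the real 11236-case variables ms and gar
ms = rng.normal(size=11236)
gar = 0.3 * ms + rng.normal(size=11236)

n, m, reps = len(ms), 106, 10000
rhos = np.empty(reps)
for b in range(reps):
    idx = rng.choice(n, size=m, replace=False)  # fresh random subsample of 106 cases
    rho, _ = spearmanr(ms[idx], gar[idx])
    rhos[b] = rho

# For example, a simple percentile interval over the 10000 subsampled correlations
print(np.percentile(rhos, [2.5, 97.5]))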
I am working with SPSS 26 and I have some trouble finding out which functions to use...
I have scores from repeated measurements (9 setups, each with 3 stimulus types and 10 scores per type) and need to calculate the absolute differences in scores in order to create cumulative frequency tables. The whole thing is about the test-retest variability of the scores obtained with the instrument. The main goal is to be able to say that, e.g., XX% of the scores for setup X and stimulus type X were within X points. Sorry, I hope that is somehow understandable :) I appreciate any help I can get!
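Purely as an illustration, and in Python/pandas rather than SPSS (with made-up column names and synthetic data), the computation described above, absolute test-retest differences per setup and stimulus type followed by a cumulative frequency table, might look roughly like this:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical long-format data: 9 setups x 3 stimulus types x 10 scores,
# each measured on two occasions (test and retest)
df = pd.DataFrame({
    "setup": np.repeat(np.arange(1, 10), 30),
    "stimulus": np.tile(np.repeat([1, 2, 3], 10), 9),
    "score_t1": rng.integers(0, 50, 270),
    "score_t2": rng.integers(0, 50, 270),
})
df["abs_diff"] = (df["score_t1"] - df["score_t2"]).abs()

# Cumulative percentage of scores within X points, per setup and stimulus type
within = (
    df.groupby(["setup", "stimulus"])["abs_diff"]
      .value_counts(normalize=True)
      .sort_index()
      .groupby(level=[0, 1])
      .cumsum() * 100
)
print(within.head(12))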
I have been trying to get into more detail on resampling methods and applied them to a small data set of 1000 rows. The data was split into a training set of 800 rows and a validation set of 200 rows. I used k-fold cross-validation and repeated k-fold cross-validation to train a KNN classifier on the training set. Based on my understanding I have made some interpretations of the results; however, I have certain doubts about them (see the questions below):
Results:
10-fold CV
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10-fold CV with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10-fold CV with 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10-fold CV with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
While selecting the tuning parameter, k = 9 is the optimal value for the highest accuracy. However, I don't understand how to take Kappa into consideration when finally choosing the parameter value.
The number of repeats has to be increased until the results stabilise; the accuracy changes when the repeats are increased from 10 to 1000, but the results are similar for 1000 and 2000 repeats. Is it right to consider the results at 1000/2000 repeats a stabilised performance estimate?
Is there any rule of thumb for the number of repeats?
Finally, should I now train the model on my complete training data (800 rows) and test the accuracy on the validation set?
Accuracy and Kappa are just different classification performance metrics. In a nutshell, the difference is that Accuracy does not take possible class imbalance into account, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R's caret you can do so via the metric argument of caret::train.
You would see a similar effect of slightly different performance results when running, e.g., the 10-fold CV with 10 repeats several times: you will just get slightly different results each time as well. Something you should look out for is the variance of the classification performance over your partitions and repeats. If you obtain a small variance, you can infer that, by training on all your data, you will likely obtain a model that gives you similar (hence stable) results on new data. But if you obtain a huge variance, you can infer that, just by chance (being lucky or unlucky), you might instead obtain a model that gives you either rather good or rather bad performance on new data. By the way, the prediction performance variance is something R's caret::train will give you automatically, hence I'd advise using it.
See above: look at the variance and increase the repeats until repeating the whole process gives you a similar average performance and a similar variance of performance.
Yes, CV and resampling methods exist to give you information about how well your model will perform on new data. So, after performing CV and resampling and obtaining this information, you will usually use all your data (both the train and test partitions!) to train a final model for your application scenario.
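As a rough illustration of looking at both the average and the spread of scores across repeats (sketched here with scikit-learn and synthetic data, not the original R/caret setup from the question):
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 800-row training set
X, y = make_classification(n_samples=800, n_features=10, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
for k in (5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                             cv=cv, scoring="accuracy")
    # A small standard deviation across folds and repeats suggests a stable estimate
    print(f"k={k}: accuracy {scores.mean():.4f} +/- {scores.std():.4f}")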
I am trying to learn RankSVM using the OHSUMED dataset and the SVM Rank library, as explained at the following link:
http://research.microsoft.com/en-s/um/beijing/projects/letor/Baselines/RankSVM-Struct.txt
I used the same parameters as the link suggests for the OHSUMED dataset, i.e.:
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold1_l1_c0.0002_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold2_l1_c0.002_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold3_l1_c0.01_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold4_l1_c0.02_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold5_l1_c0.01_e0.001.log
But when I train my model and run the "svm_rank_classify" command, I get the following result:
Reading model...done.
Reading test examples...done.
Classifying test examples...done
Runtime (without IO) in cpu-seconds: 0.00
Average loss on test set: 0.3864
Zero/one-error on test set: 100.00% (0 correct, 22 incorrect, 22 total)
NOTE: The loss reported above is the fraction of swapped pairs averaged over
all rankings. The zero/one-error is fraction of perfectly correct
rankings!
Total Num Swappedpairs : 31337
Avg Swappedpairs Percent: 38.64
Could you please suggest any steps I might be missing here?
Thanks.
The zero/one-error is the percentage of rankings (i.e. qid sets) where the model ranked at least one pair incorrectly. Your accuracy over all pairs is actually:
(100 - Avg Swappedpairs Percent) = 61.36%
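A tiny check on the reported numbers (nothing beyond the arithmetic already stated above):
# With 22 rankings, a single swapped pair in each ranking is enough to push the
# zero/one-error to 100%; the pairwise accuracy is just the complement of the loss.
avg_swapped_pct = 38.64                # "Avg Swappedpairs Percent" from the output
pairwise_accuracy = 100 - avg_swapped_pct
print(f"pairwise accuracy = {pairwise_accuracy:.2f}%")  # 61.36%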
I always have trouble understanding the significance of the chi-squared test and how to use it for feature selection. I tried reading the Wikipedia page but I didn't get a practical understanding. Can anyone explain?
The chi-squared test helps you determine the most significant features among a list of available features by measuring the dependence between each feature variable and the target variable.
The example below is taken from https://chrisalbon.com/machine-learning/chi-squared_for_feature_selection.html
The test below will select the two best features (since we assign 2 to the "k" parameter) out of the four features available initially.
# Load libraries
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Load iris data
iris = load_iris()
# Create features and target
X = iris.data
y = iris.target
# Convert to categorical data by converting data to integers
X = X.astype(int)
# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
X_kbest = chi2_selector.fit_transform(X, y)
# Show results
print('Original number of features:', X.shape[1])
print('Reduced number of features:', X_kbest.shape[1])
Chi-squared feature selection is a univariate feature selection technique for categorical variables. It can also be used for continuous variables, but they need to be categorized (binned) first.
How does it work?
It tests whether the outcome class and the categorical variable are independent (the null hypothesis) by calculating a chi-squared statistic from their contingency table. For more details on contingency tables and the chi-squared test, see the video: https://www.youtube.com/watch?v=misMgRRV3jQ
To categorize continuous data, there is a range of techniques available, from simple frequency-based binning to more advanced approaches such as Minimum Description Length and entropy-based binning methods.
An advantage of using the chi-squared test on a binned continuous variable is that it can capture non-linear relations with the outcome variable.
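As a small, purely illustrative sketch of the procedure just described (synthetic data, simple quartile binning of a continuous feature, and SciPy's chi2_contingency for the test of independence):
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
# Hypothetical data: one continuous feature related non-linearly to a binary outcome
feature = rng.normal(size=300)
outcome = (np.abs(feature) + rng.normal(scale=0.5, size=300) > 1).astype(int)

# Simple frequency-based binning (quartiles) of the continuous feature
binned = pd.qcut(feature, q=4, labels=False)

# Contingency table of binned feature vs. outcome, then the chi-squared test
table = pd.crosstab(binned, outcome)
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p-value = {p:.4f}, dof = {dof}")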
I am using Support Vector Machines for document classification. My feature set for each document is a tf-idf vector. I have M documents, each with a tf-idf vector of size N, giving an M x N matrix.
M is just 10 documents and each tf-idf vector has 1000 word entries, so I have many more features than documents. Also, each word occurs in only 2 or 3 documents. When I normalize each feature (word), i.e. column normalization to [0,1], with
val_feature_j_row_i = ( val_feature_j_row_i - min_feature_j ) / ( max_feature_j - min_feature_j)
it only gives me 0 or 1, of course.
And it gives me bad results. I am using libsvm with an RBF kernel, C = 0.0312, gamma = 0.007815.
Any recommendations?
Should I include more documents? Or use other kernel functions such as sigmoid, or better normalization methods?
The list of things to consider and correct is quite long, so first of all I would recommend some machine-learning reading before trying to tackle the problem itself. There are dozens of great books (e.g. Haykin's "Neural Networks and Learning Machines") as well as online courses that will help you with such basics, like those listed here: http://www.class-central.com/search?q=machine+learning
Getting back to the problem itself:
10 documents is orders of magnitude too small to get any significant results and/or insight into the problem,
there is no universal method of data preprocessing; you have to analyze it through numerous tests and data analysis,
SVMs are parametric models: you cannot use a single pair of C and gamma values and expect any reasonable results. You have to check dozens of them to even get a clue "where to search". The simplest method for doing so is the so-called grid search (see the sketch after this list),
1000 features is a large number of dimensions; this suggests that using a kernel which implies an infinite-dimensional feature space is quite... redundant. It would be a better idea to first try simpler kernels, which have a smaller chance of overfitting (linear or a low-degree polynomial),
also, is tf*idf a good choice if "each word occurs in 2 or 3 documents"? It seems doubtful, unless what you actually mean is 20-30% of the documents,
finally, regarding the simple feature squashing:
it only gives me 0 or 1, of course.
it should result in values spread across the [0,1] interval, not just its endpoints, so if that is what you are seeing, you probably have an error in your implementation.
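A rough sketch of the grid-search point from the list above (using scikit-learn rather than raw libsvm, and random placeholder data instead of the actual 10-document tf-idf matrix):
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Placeholder stand-in for an M x N tf-idf matrix (more documents than the 10 in the question)
rng = np.random.default_rng(0)
X = rng.random((40, 100))
y = np.tile([0, 1], 20)

# Scale each feature to [0,1], then search over kernels, C and gamma
pipe = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])
param_grid = [
    {"svm__kernel": ["linear"], "svm__C": [0.01, 0.1, 1, 10, 100]},
    {"svm__kernel": ["rbf"], "svm__C": [0.01, 0.1, 1, 10, 100],
     "svm__gamma": [1e-3, 1e-2, 1e-1, 1]},
]
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)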