Nested for loop for t-test function to calculate p-values

I want to calculate sample-based p-values for my data, which has ~150 control and ~300 AD samples. I wrote two nested for loops, but it takes 7 hours to calculate all the p-values. How can I write this code with apply functions?
This is my code:
for (i in 1:301) {       # number of AD samples in columns
  for (j in 1:39637) {   # number of genes
    # one-sample t-test of the control values (columns 302:449) for gene j
    # against the single AD value data[j, i]
    pval[j, i] <- t.test(data[j, 302:449], mu = data[j, i])$p.value
  }
}
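One way to replace the loops with apply-style functions (a sketch only, assuming data is a numeric matrix with genes in rows, AD samples in columns 1:301 and controls in columns 302:449, as in the loop above):

ad_cols   <- 1:301      # AD samples
ctrl_cols <- 302:449    # control samples

# one column of p-values per AD sample, one row per gene
pval <- sapply(ad_cols, function(i) {
  sapply(seq_len(nrow(data)), function(j) {
    # same one-sample t-test as in the loop: control values for gene j
    # tested against the single AD value data[j, i]
    t.test(data[j, ctrl_cols], mu = data[j, i])$p.value
  })
})

Note that this still calls t.test() once per gene per AD sample (~12 million calls), so apply functions alone will not be dramatically faster than the loops; the real speedup comes from vectorising the one-sample t statistic across genes (e.g. with rowMeans() and row standard deviations) or from parallelising over the AD columns (e.g. parallel::mclapply()).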

Related

SPSS 26: How to calculate the absolute differences in scores from repeated measurements in order to create cumulative frequency tables

I am working with SPSS 26 and I have some trouble finding out which functions to use...
I have scores from repeated measurements (9 setups, each with 3 stimulus types and 10 scores per type) and need to calculate the absolute differences in scores in order to create cumulative frequency tables. The whole thing is about the test-retest variability of the scores obtained with the instrument. The main goal is to be able to say that, e.g., XX % of the scores for setup X and stimulus type X were within X points. Sorry, I hope that is somehow understandable :) I appreciate any help I can get!

PCA vs averaging columns

I have a dataframe with 300 float-type columns and 1 integer column, which is the dependent variable. The 300 columns are of 3 kinds:
1. Kind A: columns 1 to 100
2. Kind B: columns 101 to 200
3. Kind C: columns 201 to 300
I want to reduce the number of dimensions. Should I average the values for each kind and aggregate them into 3 columns (one per kind), or should I perform a dimensionality reduction technique like PCA? What is the justification for either approach?
Option 1:
Do not do dimensionality reduction if you have a large number of training samples (say, more than 5*300 samples for training).
Option 2:
Since you know that there are 3 kinds of data, run a PCA on each of those three kinds separately and take, say, 2 features from each (see the R sketch after these options), i.e.
f1, f2 = PCA(kind A columns)
f3, f4 = PCA(kind B columns)
f5, f6 = PCA(kind C columns)
train(f1, f2, f3, f4, f5, f6)
Option 3:
Run PCA on all columns and keep only the number of components that preserve 90%+ of the variance.
Do not average; averaging features for dimensionality reduction is in general a very bad idea. If you really want to average, and you know for certain that some features are more important than others, rather do a weighted average.
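A minimal R sketch of Options 2 and 3 (df and y are assumed names for the question's data frame and its integer target column):

# Option 2: PCA per kind, keep the first 2 components of each
pca_features <- function(block, k = 2) {
  prcomp(block, center = TRUE, scale. = TRUE)$x[, 1:k]
}
f_A <- pca_features(df[, 1:100])     # kind A
f_B <- pca_features(df[, 101:200])   # kind B
f_C <- pca_features(df[, 201:300])   # kind C
train_data <- data.frame(f_A, f_B, f_C, y = df$y)   # 6 features + target

# Option 3: how many components keep 90%+ of the total variance
p <- prcomp(df[, 1:300], center = TRUE, scale. = TRUE)
cumvar <- cumsum(p$sdev^2) / sum(p$sdev^2)
k90 <- which(cumvar >= 0.9)[1]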
PCA will only keep the components that capture the most variance, so not all of the original features will contribute equally to determining the output.
So it may be better if you do averaging, as it takes all the features into account when determining the output.
As you have a larger number of features, it is better if all the features are used to determine the output.

How to calculate the accuracy of classes from a 7x7 confusion matrix?

So I've got the following results from Naïve Bayes classification on my data set:
I am stuck, however, on understanding how to interpret the data. I want to find and compare the accuracy of each class (a-g).
I know accuracy is found using this formula: accuracy = (tp + tn) / (tp + tn + fp + fn).
However, let's take class a. If I take the number of correctly classified instances - 313 - and divide it by the total number of 'a' instances (4953) from row a, this gives ~6.32%. Would this be the accuracy?
EDIT: if we use the column instead of the row, we get 313/1199, which gives ~26.1% and seems a more reasonable number.
EDIT 2: I have calculated the accuracy of a in Excel using the accuracy formula shown above, and it gives me 84%.
This doesn't seem right, as the overall classification accuracy is ~24%.
No -- all you've calculated is tp/(tp+fn), the total correct identifications of class a, divided by the total of actual a examples. This is recall, not accuracy. You need to include the other two figures.
fp is the rest of the a column; tn is all of the other figures in the non-a rows and columns, the 6x6 sub-matrix. This will reduce all 35K+ trials to a 2x2 matrix with labels a and not a, the 2x2 confusion matrix with which you're already familiar.
Yes, you get to repeat that reduction for each of the seven classes. I recommend doing it programmatically.
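For instance, in R (a sketch only; cm is an assumed name for the 7x7 matrix of counts, with actual classes in rows and predicted classes in columns, labelled a-g):

per_class_stats <- function(cm) {
  t(sapply(seq_len(nrow(cm)), function(k) {
    tp <- cm[k, k]
    fn <- sum(cm[k, ]) - tp       # rest of row k: actual k, predicted something else
    fp <- sum(cm[, k]) - tp       # rest of column k: predicted k, actually something else
    tn <- sum(cm) - tp - fn - fp  # the 6x6 block involving neither actual nor predicted k
    c(accuracy  = (tp + tn) / sum(cm),
      recall    = tp / (tp + fn),
      precision = tp / (tp + fp))
  }))
}
per_class_stats(cm)   # one row per class: accuracy, recall and precision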
RESPONSE TO OP UPDATE
Your accuracy is that high: you have a huge quantity of true negatives, not-a samples that were properly classified as not-a.
Perhaps it doesn't feel right because our experience focuses more on the class in question. There are other statistics that handle that focus.
Recall is tp / (tp+fn) -- of all items actually in class a, what percentage did we properly identify? This is the 6.32% figure.
Precision is tp / (tp + fp) -- of all items identified as class a, what percentage were actually in that class? This is the 26.1% figure you calculated.

Predictive modelling

How do I perform regression (Random Forest, Neural Networks) for this kind of data?
The data contains a number of attributes, and we are trying to predict the sales quantity based on the week and those attributes.
Here I am attaching the sample data.
Multivariate linear regression
Assuming
input variables x[][] (each row corresponds to a sample, each column to an input variable such as week, season, ...)
expected output y[] (as many rows as x)
parameters being learned theta[] (as many as there are input variables + 1, for the intercept)
you are minimizing the squared error of the hypothesis h:
h(x[j]) = theta[0] + sum over i of { theta[i] * x[j][i] }
J = sum over all j of { (h(x[j]) - y[j])^2 }
This can easily be achieved through gradient descent.
You can also include combinations of input variables (and simply add more thetas for those pseudo-variables).
I have some code lying around in a GitHub repository that performs basic multivariate linear regression (for a course I sometimes teach).
https://github.com/jorisschellekens/ml/tree/master/linear_regression
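For illustration, a minimal gradient-descent sketch in R (not the code from that repository; X and y are assumed names for a numeric predictor matrix and the sales-quantity vector):

gradient_descent <- function(X, y, alpha = 0.01, iters = 5000) {
  X1 <- cbind(1, scale(X))          # intercept column plus scaled input variables
  theta <- rep(0, ncol(X1))         # one theta per input variable + 1
  for (k in seq_len(iters)) {
    residual <- X1 %*% theta - y    # h(x[j]) - y[j] for every sample j
    theta <- theta - alpha * (t(X1) %*% residual) / nrow(X1)   # gradient step on J
  }
  drop(theta)
}

In practice you would fit this with lm() (or randomForest::randomForest() / nnet::nnet() for the models mentioned in the question), but the sketch shows what the gradient-descent formulation above actually optimizes.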

SPSS Bootstrap with custom sample size

I have a very large sample of 11236 cases for each of my two variables (ms and gar). I now want to calculate Spearman's rho correlation with bootstrapping in SPSS.
I figured out the standard syntax for bootstrapping in SPSS with bias corrected and accelerated confidence intervals:
DATASET ACTIVATE DataSet1.
BOOTSTRAP
/SAMPLING METHOD=SIMPLE
/VARIABLES INPUT=ms gar
/CRITERIA CILEVEL=95 CITYPE=BCA NSAMPLES=10000
/MISSING USERMISSING=EXCLUDE.
NONPAR CORR
/VARIABLES=ms gar
/PRINT=SPEARMAN TWOTAIL NOSIG
/MISSING=PAIRWISE.
But this syntax resamples all of my 11236 cases 10000 times.
How can I take a random sample of 106 cases (√11236 = 106), calculate Spearman's rho, and repeat this 10000 times (with a new random sample of 106 cases at each bootstrap step)?
Use the sample selection procedures - Data > Select Cases. You can specify an approximate or exact random sample or select specific cases. Then run the BOOTSTRAP and NONPAR CORR commands.
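If the SPSS selection route gets awkward, the same resampling scheme is easy to sketch outside SPSS, e.g. in R (assuming a data frame dat with numeric columns ms and gar; this gives a plain percentile interval rather than SPSS's BCa interval):

set.seed(1)
rhos <- replicate(10000, {
  idx <- sample(nrow(dat), 106)     # fresh random subsample of 106 cases each step
  cor(dat$ms[idx], dat$gar[idx], method = "spearman")
})
quantile(rhos, c(0.025, 0.975))     # 95% percentile interval for Spearman's rho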
