How to correct for multiple testing with an enrichment test? - p-value

I ran an enrichment test to find genes mutated at a different rate within one cluster of samples compared to outside it, based on a two-tailed Fisher's exact test.
So finally I have a 5x10 matrix of p-values.
I wonder how to correct them for multiple testing. Should I correct by genes or by clusters?

If you have SNP data such that all your clusters have more or less the same p-values, then it is my understanding that you should pick a random p-value from each cluster, use those when correcting for multiple testing, and then apply that p-value's correction to the entire cluster. If there are large differences within a cluster, I think you should correct by gene.
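For the mechanics of the correction itself, here is a minimal sketch in Python (not from the original thread) showing how a matrix of p-values can be corrected either across all tests at once or column-by-column; the 5x10 shape, the random values, and the choice of Benjamini-Hochberg FDR are assumptions for illustration only.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical 5x10 matrix: rows = clusters, columns = genes
rng = np.random.default_rng(0)
pvals = rng.uniform(size=(5, 10))

# Option 1: correct all 50 tests together (Benjamini-Hochberg FDR)
flat_adjusted = multipletests(pvals.ravel(), method="fdr_bh")[1]
adjusted_all = flat_adjusted.reshape(pvals.shape)

# Option 2: correct within each gene (column) separately
adjusted_by_gene = np.column_stack(
    [multipletests(pvals[:, j], method="fdr_bh")[1] for j in range(pvals.shape[1])]
)

print(adjusted_all)
print(adjusted_by_gene)
```

Whether option 1 or option 2 (or correcting by cluster instead) is appropriate depends on which family of hypotheses you consider to be tested together, which is exactly the question being asked here.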

Related

Hierarchical Clustering

I have read some resources and found out how hierarchical clustering works. However, when I compare it with k-means clustering, it seems to me that k-means really produces a specific number of clusters, whereas hierarchical analysis shows me how the samples can be clustered. What I mean is that I do not get a specific number of clusters in hierarchical clustering; I only get a scheme of how the clusters can be constituted and a picture of the relations between the samples.
Thus, I cannot understand where I can use this clustering method.
Hierarchical clustering (HC) is just another distance-based clustering method, like k-means. The number of clusters can be roughly determined by cutting the dendrogram produced by HC. Determining the number of clusters in a data set is not an easy task for any clustering method, and usually depends on your application. Tuning the thresholds in HC may be more explicit and straightforward for researchers, especially for a very large data set. I think this question is also related.
In k-means clustering, k is a hyperparameter that you need to find in order to divide your data points into clusters, whereas in hierarchical clustering (let's take one type of hierarchical clustering, i.e. agglomerative), you first consider every point in your dataset as its own cluster and then repeatedly merge the two most similar clusters until you are left with a single cluster. I will explain this with an example.
Suppose initially you have 13 points (x_1, x_2, ..., x_13) in your dataset, so at the start you have 13 clusters. Now, in the second step, say you get 7 clusters (x_1-x_2, x_4-x_5, x_6-x_8, x_3-x_7, x_11-x_12, x_10, x_13) based on the similarity between the points. In the third step, say you get 4 clusters (x_1-x_2-x_4-x_5, x_6-x_8-x_10, x_3-x_7-x_13, x_11-x_12). Continuing like this, you eventually arrive at a step where all the points in your dataset form one cluster, which is also the last step of the agglomerative clustering algorithm.
So in hierarchical clustering there is no such hyperparameter: depending on your problem, if you want 7 clusters, stop at the second step; if you want 4 clusters, stop at the third step; and so on.
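To make the "stop at the step that gives you k clusters" idea concrete, here is a minimal sketch using SciPy's agglomerative clustering; the random 2-D data and the Ward linkage are assumptions for illustration, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: 13 points in 2-D, standing in for x_1 ... x_13
rng = np.random.default_rng(42)
X = rng.normal(size=(13, 2))

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(X, method="ward")

# "Stop" at the level that yields the number of clusters you want
labels_7 = fcluster(Z, t=7, criterion="maxclust")  # 7 clusters
labels_4 = fcluster(Z, t=4, criterion="maxclust")  # 4 clusters

print(labels_7)
print(labels_4)
```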
A practical advantage of hierarchical clustering is the possibility of visualizing the results using a dendrogram. If you don't know in advance what number of clusters you're looking for (as is often the case...), the dendrogram plot can help you choose k without having to create separate clusterings. A dendrogram can also give great insight into the data structure, help identify outliers, etc. Hierarchical clustering is also deterministic, whereas k-means with random initialization can give you different results when run several times on the same data.
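And a small sketch of the dendrogram visualization mentioned above, again with SciPy and made-up data; looking for long vertical stretches before a merge is one common way to pick a cut by eye.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

Z = linkage(X, method="average")

# The dendrogram shows at which distance clusters were merged;
# long vertical gaps suggest natural places to cut.
dendrogram(Z)
plt.xlabel("sample index")
plt.ylabel("merge distance")
plt.show()
```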
Hope this helps.

Use k-means test results for training set SPSS

I am a student working with SPSS (Statistics) for the first time. I used 1,000 rows of data to run the k-means cluster tool and obtained the results. I now want to take those results and run them against a test set (another 1,000 rows) to see how my model did.
I am not sure how to do this; any help is greatly appreciated!
Thanks
For a clustering model (or any unsupervised model), there really is no right or wrong result. There is no target variable that you can compare the model's result (the cluster allocation) to, and the idea of splitting the data set into a training and a testing partition does not apply to these types of models.
The best you can do is to review the output of the model and explore the cluster allocations and determine whether these appear to be useful for the intended purpose.
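That said, if you do want to see how the second 1,000 rows would be allocated by the clusters learned on the first 1,000, you can apply the trained model to the new rows and compare the allocations. SPSS has its own facilities for scoring new cases; as a hedged illustration outside SPSS, here is what that could look like with scikit-learn, where the data and the choice of k=4 are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins for the two 1,000-row partitions
rng = np.random.default_rng(1)
train = rng.normal(size=(1000, 5))
test = rng.normal(size=(1000, 5))

# Fit k-means on the first partition (k=4 is an arbitrary choice here)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(train)

# Allocate the held-out rows to the nearest learned centroid
test_labels = km.predict(test)

# Compare cluster sizes between the two partitions
print(np.bincount(km.labels_))
print(np.bincount(test_labels))
```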

How to check if a data point is within the boundary of a cluster or not

Suppose I have done clustering (using 3 features) and got 4 clusters, trained on a set of data points.
Now in production I will be getting a different set of data points, and based on the feature values of each data point I need to know whether it falls into one of the pre-defined clusters that I made earlier. This is not doing clustering, but rather finding whether a point falls within a pre-defined cluster.
How do I find whether the point is in a cluster?
Do I need to run linear regression to find the equation of the boundary covering the cluster?
There is no general answer to your question. The way a new point is assigned to a cluster is a property of the clustering method itself, so the crucial thing is what clustering procedure was used in the first place. Each well-defined clustering method (in the mathematical sense) provides you with a partitioning of the whole input space, not just of the finite training set. Such techniques include k-means, GMM, ...
However, there are exceptions: clustering methods which are simply heuristics, and not valid optimization problems. For example, if you use hierarchical clustering there is no partitioning of the space, so you cannot correctly assign a new point to any cluster, and you are left with dozens of equally correct, heuristic methods that will do something, but you cannot say which one is correct. These heuristics include:
"closest point heuristics", which is essentialy equivalent of training 1-NN on your clusters
"build a valid model heuristics" which is a generalization of the above where you fit some classifier (of your choice) to mimic the original clustering (and select its hyperparameters through cross validation).
"what would happen if I re-run the clustering", if you can re-run the clustering from the previous solution you can simply check what cluster it falls into given previous clustering as a starting point.
...
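As a hedged sketch (not from the original answer), here is what the two situations look like in Python: a fitted k-means model partitions the whole space and can assign new points directly, while a hierarchical clustering only labels the training set, so assigning a new point needs something like the closest-point (1-NN) heuristic. The data, k=4, and the use of scikit-learn/SciPy are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # hypothetical training points, 3 features
new_point = rng.normal(size=(1, 3))  # a point arriving in production

# Case 1: k-means partitions the whole input space, so assignment is well defined
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("k-means cluster:", km.predict(new_point)[0])

# Case 2: hierarchical clustering only labels the training points ...
Z = linkage(X, method="ward")
hc_labels = fcluster(Z, t=4, criterion="maxclust")

# ... so assigning the new point needs a heuristic, e.g. 1-NN on the clustered points
nn = KNeighborsClassifier(n_neighbors=1).fit(X, hc_labels)
print("1-NN heuristic cluster:", nn.predict(new_point)[0])
```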

Kmeans2 returning the ordered centroids list based on how many data points they correspond to

I am using the OpenCV Kmeans2() clustering function on a data set. The cluster centroids start out initialized randomly, and then clustering proceeds.
How do I obtain those centroids in a certain order, say the centroid of the largest chunk of the data set first, then the next (in decreasing order, assuming there are no ties between clusters), and so on?
Why don't you order them as you like yourself?
For most CV applications, this is not necessary, and you can do it as easily yourself, I guess. It's not as if there was an optimized k-means that returned clusters in a particular order. As k-means is initialized randomly, there is no "natural" order of the clusters. And as you already noted, there can be ties, both in cluster assignment and cluster sizes.
I am planning to count up the labels and see which centroid's cluster contains the largest share of the data set.
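Counting the labels and sorting is exactly that. As a hedged sketch, here is how it could look with the cv2.kmeans Python binding (the original question uses the older Kmeans2 interface, and the data here is made up); the idea is to count points per label, reorder the centroids by those counts, and optionally relabel the points to match.

```python
import numpy as np
import cv2

# Hypothetical data set: 500 points, 2 features, float32 as OpenCV expects
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2)).astype(np.float32)

K = 4
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
_, labels, centers = cv2.kmeans(data, K, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)

# Count how many points fall in each cluster, then reorder centroids
counts = np.bincount(labels.ravel(), minlength=K)
order = np.argsort(counts)[::-1]          # largest cluster first
centers_sorted = centers[order]

# Optionally relabel the points so label 0 is the largest cluster, etc.
relabel = np.empty(K, dtype=int)
relabel[order] = np.arange(K)
labels_sorted = relabel[labels.ravel()]

print(counts[order])
print(centers_sorted)
```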

Determining groups in a hierarchical cluster

I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Segaran's Programming Collective Intelligence. The output is a binary tree with a "distance" value at each node that tells you how far apart the two child nodes are.
I can then display this as a dendrogram, which makes it fairly easy for a human to spot which values are grouped together. However, I'm having difficulty coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:
The number of groups
Which points should be placed in each group
Is there a standard algorithm for this?
I think there is no default way to do this. Simple 'manual' methods would be to either:
specify the number of clusters you want/expect
set a threshold for the maximum distance between two nodes; any nodes with a larger distance belong to another cluster
There are some automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package, which automatically deals with this problem; pvclust could also be used. Two more methods for dealing with this problem are described in Salvador (2002) and Daniels (2006).
I have found that the Calinski-Harabasz index (also known as the Variance Ratio Criterion) works well with dendrograms produced by hierarchical clustering. You can find more information (and a comparative study) in this paper.
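As a hedged illustration of that suggestion (in Python rather than R, with made-up data), you can cut the tree at every candidate number of groups and keep the cut with the best Calinski-Harabasz score; the data generation and the Ward linkage are assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import calinski_harabasz_score

# Hypothetical data with some group structure
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0, 4, 8)])

Z = linkage(X, method="ward")

# Score each candidate number of groups and keep the best one
scores = {}
for k in range(2, 10):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = calinski_harabasz_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("chosen number of groups:", best_k)
```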
