I am running an SPSS TwoStep Cluster analysis and requested that a new column be added with the cluster assignment. The column contains mostly positive values, but I notice several -1 values. Does anyone know what this means?
Thanks!
I found it: -1 is the outlier cluster.
Related
I am currently studying CluStream, and I have some doubts regarding the results. I will proceed to explain:
If the micro clusters are clustered using k-means, every micro cluster belongs to the closest macro cluster (measured by the Euclidean distance between the centers).
Now, looking at the following sample result:
we can see that the macro clusters do not group all the micro clusters …
What does this mean? How should we treat the micro clusters that do not lie inside any macro cluster? Should I find the closest macro cluster for every micro cluster in order to label them?
EDIT:
Checking the MOA source code on GitHub, I found that the macro cluster radius is calculated by multiplying the average deviation by the so-called ‘radius factor’ (whose value is fixed at 1.8). However, when I ask the macro clusters for their weights, with a huge time window and no fading component, I can see that the macro clusters summarize the information of all the points ... all the current micro clusters are considered! So even if some micro clusters lie outside the macro cluster spheres, we know that they belong to the closest one. It's k-means after all!
So I still have a question: why is the macro cluster radius calculated that way? What does it represent? Shouldn't the algorithm return the labeled micro clusters instead?
Any feedback is welcome. TIA!
The key question is: what does the user need?
Labeling micro-clusters is okay, but what use is it to the user?
In most cases, all that people use from the k-means result are the cluster centers, because the entire objective of k-means is essentially "find the best k-point approximation to the data".
So most likely, the only information users of CluStream are going to use is the k current cluster centers, and maybe the weight and age of each.
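The nearest-center labeling raised in the question can be sketched in a few lines. This is a minimal illustration in plain Python, not MOA's implementation; the coordinates and the helper name `nearest_center` are made up for the example.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Hypothetical micro-cluster centers and macro-cluster (k-means) centers.
micro_centers = [(0.1, 0.2), (0.3, 0.1), (5.0, 5.2), (4.8, 4.9), (2.4, 2.6)]
macro_centers = [(0.2, 0.15), (4.9, 5.05)]

# Every micro cluster gets a label, even one that would fall outside
# any drawn macro-cluster sphere (such as the last point here).
labels = [nearest_center(m, macro_centers) for m in micro_centers]
print(labels)
```

The drawn radius plays no part in this assignment; only the distance between centers matters.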
I am a student working with SPSS (statistics) for the first time. I used 1,000 rows of test data to run the k-means cluster tool and obtained the results. I now want to take those results and run them against a test set (another 1,000 rows) to see how my model did.
I am not sure how to do this; any help is greatly appreciated!
Thanks
For a clustering model (or any unsupervised model), there really is no right or wrong result. There is no target variable against which you can compare the model's result (the cluster allocation), so the idea of splitting the data set into training and testing partitions does not apply to these types of models.
The best you can do is to review the output of the model and explore the cluster allocations and determine whether these appear to be useful for the intended purpose.
I ran an enrichment test to find genes mutated at a different rate within one cluster of samples compared with outside it, based on a two-tailed Fisher’s exact test.
So I end up with a 5x10 matrix of p-values.
I wonder how to correct them for multiple testing. Should I correct by gene or by cluster?
If you have SNP data such that all your clusters have more or less the same p-values, then it is my understanding that you should pick a random p-value from each cluster, use these when correcting for multiple testing, and subsequently apply that correction to the entire cluster. If there are large differences within a cluster, I think you should correct by gene.
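For comparison, a common generic approach (not the procedure described in this answer) is to treat all entries of the matrix as one family of tests and apply the Benjamini-Hochberg false discovery rate adjustment. The sketch below is plain Python; the small matrix is invented for illustration, and whether the family should be the whole matrix, a row, or a column depends on the hypothesis being tested.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical 2 x 3 matrix of raw p-values (a 5 x 10 matrix works the same way).
pmatrix = [[0.001, 0.04, 0.20],
           [0.03, 0.5, 0.01]]

# Flatten, adjust across all tests at once, then restore the matrix shape.
flat = [p for row in pmatrix for p in row]
adj_flat = benjamini_hochberg(flat)
n_cols = len(pmatrix[0])
adj_matrix = [adj_flat[r * n_cols:(r + 1) * n_cols] for r in range(len(pmatrix))]
print(adj_matrix)
```

To correct by row or by column instead, call `benjamini_hochberg` on each row or column separately.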
I'm trying to determine the right number of clusters for clustering some data.
I know that this is possible using the Davies–Bouldin Index (DBI).
To use DBI, you compute it for each candidate number of clusters; the number that minimizes the DBI is taken as the right number of clusters.
The question is:
how to know if 2 clusters are better than 1 cluster using DBI? So, how can I compute DBI when I have just 1 cluster?
Only considering the average DBI of all clusters apparently is not a good idea.
Increasing the number of clusters k without penalty tends to reduce the DBI of the resulting clustering, down to the extreme case of zero DBI when each data point is its own cluster (because each data point coincides with its own centroid).
how to know if 2 clusters are better than 1 cluster using DBI? So, how can I compute DBI when I have just 1 cluster?
So it's hard to say which one is better if you only use the average DBI as the performance metric.
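A small sketch of the index itself may help here. The usual definition compares each cluster with its most similar other cluster, so it needs at least two clusters and is not defined for k = 1. The code below is plain Python with Euclidean distances; the function names and the toy data are assumptions for illustration, and it does not guard against two clusters sharing the same centroid.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    k = len(clusters)
    if k < 2:
        raise ValueError("the Davies-Bouldin index needs at least two clusters")
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's points to its own centroid.
    scatter = [sum(math.dist(p, centers[i]) for p in c) / len(c)
               for i, c in enumerate(clusters)]
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

two_clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
singletons = [[(0, 0)], [(1, 1)], [(5, 5)]]
print(davies_bouldin(two_clusters))
print(davies_bouldin(singletons))
```

With every point in its own cluster the scatter terms are all zero, which is the degenerate zero-DBI case mentioned above.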
A good practical method is to use the Elbow method.
It looks at the percentage of variance explained as a function of the number of clusters: you should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".
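The curve described above can be computed with a short sketch. This uses a bare-bones Lloyd's k-means in plain Python on synthetic data with three blobs; the data, the seeds, and the function name `kmeans` are assumptions for illustration, and a library implementation with multiple restarts would be preferable in practice.

```python
import math
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    sse = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
    return centers, sse

# Synthetic data: three Gaussian blobs (hypothetical).
data_rng = random.Random(1)
points = [(cx + data_rng.gauss(0, 0.3), cy + data_rng.gauss(0, 0.3))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(30)]

sse_by_k = {k: kmeans(points, k)[1] for k in range(1, 7)}
# Fraction of variance explained relative to a single cluster.
explained = {k: 1 - sse / sse_by_k[1] for k, sse in sse_by_k.items()}
for k in sorted(explained):
    print(k, round(explained[k], 3))
```

The elbow is the value of k after which the printed fraction stops rising sharply. With a single random start the curve can be bumpy, which is one reason to rerun with several seeds.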
Some other good alternatives for choosing the optimal number of clusters:
Determining the number of clusters in a data set
How to define number of clusters in K-means clustering?
I’m using the following log-likelihood formula to measure the similarity between a document and a cluster:
log p(d|c) = sum (c(w,d) * log p(w|c));
c(w,d) is the frequency of word w in document d, and p(w|c) is the likelihood of word w being generated by cluster c.
The problem is that, based on this similarity, the document is often assigned to the wrong cluster. If I assign the document to the cluster with the highest log p(d|c) (as it is usually a negative value, I take –log p(d|c)), then it will be the cluster that contains a lot of words from the document but where the probability of these words is low.
If I assign the document to the cluster with the lowest log p(d|c), then it will be a cluster that shares only one word with the document.
Can someone explain to me how to use the log-likelihood correctly? I am trying to implement this function in Java. I have already looked on Google Scholar, but didn’t find a suitable explanation of log-likelihood in text mining.
Thanks in advance
Your log likelihood formulation is correct for describing a document with a multinomial model (words in each document are generated independently from a multinomial distribution).
To get the maximum likelihood cluster assignment, you should be taking the cluster assignment, c, that maximizes log p(d|c). log p(d|c) should be a negative number - the maximum is the number closest to zero.
If you are getting cluster assignments that don't make sense, it is likely that this is because the multinomial model does not describe your data well. So, the answer to your question is most likely that you should either choose a different statistical model or use a different clustering method.
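The maximum-likelihood assignment described above can be sketched as follows. This is a minimal illustration in Python rather than Java, with toy clusters invented for the example. It also assumes add-one (Laplace) smoothing over a shared vocabulary so that words unseen in a cluster get a small nonzero probability; that smoothing is an assumption of this sketch, not something stated in the question.

```python
import math
from collections import Counter

def word_log_probs(cluster_docs, vocabulary, alpha=1.0):
    """Estimate log p(w|c) with additive smoothing (alpha is an assumed setting)."""
    counts = Counter(w for doc in cluster_docs for w in doc)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {w: math.log((counts[w] + alpha) / total) for w in vocabulary}

def log_likelihood(doc, log_probs):
    """log p(d|c) = sum over words w of c(w,d) * log p(w|c)."""
    return sum(n * log_probs[w] for w, n in Counter(doc).items())

# Hypothetical clusters, each a list of tokenized documents.
clusters = {
    "sports": [["ball", "goal", "team"], ["team", "match", "goal"]],
    "finance": [["stock", "market", "price"], ["price", "bank", "market"]],
}
doc = ["goal", "team", "price"]

# The shared vocabulary covers every cluster and the document being scored.
vocabulary = {w for docs in clusters.values() for d in docs for w in d} | set(doc)

scores = {name: log_likelihood(doc, word_log_probs(docs, vocabulary))
          for name, docs in clusters.items()}
# Pick the largest log-likelihood, i.e. the negative number closest to zero.
best = max(scores, key=scores.get)
print(scores, best)
```

Note that the comparison is on log p(d|c) itself. Taking –log p(d|c) flips the ordering, so the smallest value of –log p(d|c) would then mark the best cluster.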