choose cluster in hierarchical clustering - machine-learning

How can I choose a cluster if a point is at the same distance from two different points?
Here, X1 is at the same distance from X2 and X3. Can I directly make a cluster of X1/X2/X3, or should I go one by one, as X1/X2 and then X1/X2/X3?

In general you should always follow the rule of merging two at a time if you want to keep all the typical properties of hierarchical clustering (such as the uniform meaning of each "slice through" the tree: if you start merging many points in one step, you get an "unbalanced" structure, and the height of the clustering tree will mean different things in different places). Furthermore, merging all three at once only really makes sense for min (single) linkage; if you use average linkage or other, more complex rules, it is not even true that after merging two points the third one will be the next to add (it might even end up in a different cluster). However, clustering of this type (greedy agglomeration) is just a heuristic with some particular properties, so altering it a bit gives you yet another clustering with some other properties. Saying which one is "correct" is impossible; both are wrong to some extent, and what matters is your exact usage later on.
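You can see the pairwise-merge behaviour directly by running an agglomerative clustering on a tiny tied example; a minimal sketch using scipy, with made-up coordinates:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # Three points where x1 is equidistant from x2 and x3 (made-up coordinates).
    X = np.array([[0.0, 0.0],    # x1
                  [1.0, 0.0],    # x2
                  [-1.0, 0.0]])  # x3

    # Single (min) linkage: merges are always pairwise; ties are broken arbitrarily.
    Z = linkage(X, method="single")
    print(Z)  # each row is one merge: [cluster_i, cluster_j, distance, new_size]
    # Even with the tie, exactly two clusters are merged per step: first, e.g.,
    # {x1, x2} at distance 1.0, then {x1, x2} with {x3}, also at distance 1.0.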


SPSS GLM Significance of Predictors are different when building interaction terms vs creating the interaction variables

I was wondering if anyone knows how SPSS builds the interaction terms/calculates the significance for predictors behind the scenes in a GLM? From my understanding it dummy codes variables and treats the one that comes alphabetically last as the reference group.
The reason I'm asking is I have a GLM model which has 3 continuous predictors and two categorical predictors (dummy coded). When I build all the 2-way and 3-way interactions with syntax, i.e.:
Age_Centred Age_Centred*Dx Age_Centred*gender Age_Centred*Dx*gender BMI_Centred BMI_Centred*Dx BMI_Centred*gender BMI_Centred*Dx*gender BPS_Centred BPS_Centred*Dx BPS_Centred*gender BPS_Centred*Dx*gender Dx Dx*gender Dx*ICV_Centred Dx*ICV_Centred*gender gender ICV_Centred ICV_Centred*gender.
vs manually creating all the variables by hand, i.e.:
Age_Centred Age_Centred_Dx Age_Centred_gender Age_Centred_gender_Dx BMI_Centred BMI_Centred_Dx BMI_Centred_gender BMI_Centred_gender_Dx BPS_Centred BPS_Centred_Dx BPS_Centred_gender BPS_Centred_gender_Dx Dx gender_Dx ICV_Dx ICV_Centred_Dx_gender gender ICV_Centred ICV_gender.
I end up with a model which has the same intercept, overall significance, and R squared; however, the individual significance of the predictors changes. Refer to the output below. To troubleshoot, I've tried to flip the reference groups when manually creating the variables, but it still does not replicate the results. I've had another statistician try the same thing, and they reached the same point I did. Does it have to do with some of the parameters being redundant?
[Output screenshots: (1) building the terms via syntax; (2) physically creating the variables by multiplying them together]
All the details one might reasonably want about how GLM (and UNIANOVA, which is the same underlying code) parameterizes models, estimates parameters, and conducts hypothesis tests are available in the IBM SPSS Statistics Algorithms manual, available for download as a pdf at ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/26.0/en/client/Manuals/IBM_SPSS_Statistics_Algorithms.pdf. (Note that this is a large file, about 78 MB; clicking on the link starts a download.) In addition to the information in the GLM chapter, appendices F (Indicator Method) and H (Sums of Squares) are relevant, respectively, for building the design matrix and specifying linear combinations of model parameters for computing sums of squares for testing hypotheses.
In building the design matrix, categorical predictors (factors) are indeed represented by sets of indicator (0-1) variables. For a factor with k levels, k indicator variables are created, one for each observed level of the factor. The procedure does not explicitly treat the last category (sorted in ascending order, alphabetical for strings) as a reference category, though in simpler models the effect is essentially the same. If there is an intercept in the model, then the kth indicator will be redundant (linearly dependent on the intercept and the preceding k-1 indicators). The estimation algorithm used in GLM/UNIANOVA sets the row and column of the cross-product matrix representing the redundant column in the design matrix to 0s and aliases the corresponding parameter estimate to 0. The results are similar to a reparameterization that treats the last category as a reference category, except that you have to remember the redundant parameter is there if you want to specify a linear combination of the parameters to estimate.
If you suppress the intercept, then for the first factor entered into the model the kth indicator would not be redundant (unless the factor is preceded by an unusual covariate or set of covariates). Any subsequent factors included in the model would involve redundant parameters, as would any interactions among factors, whether or not an intercept is included. Interactions among factors are created by multiplying the 0s and 1s for each level of one factor by those for each level of the other factor. So for an interaction of two two-level factors, four columns are generated, of which typically the last three are redundant.
Covariates are entered simply by copying the values of the variables into the design matrix. Interactions involving covariates and other covariates multiply values for the columns involved within each row, and interactions involving covariates and factors multiply covariates (or products of them) by the indicator variables for the factor(s). Usually covariate-by-covariate terms do not involve redundancies, but factor-by-covariate terms do.
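As a rough sketch of the indicator coding described above (an illustration in Python, not SPSS's actual code; the data and variable names are hypothetical):

    import numpy as np

    # Hypothetical data: gender is a two-level factor, age_c a centred covariate.
    gender = np.array([0, 0, 1, 1, 0, 1])
    age_c = np.array([-2.0, 1.0, 0.5, -1.5, 2.0, 0.0])

    intercept = np.ones_like(age_c)
    # One indicator per observed level of the factor (k = 2 -> 2 columns).
    g0 = (gender == 0).astype(float)
    g1 = (gender == 1).astype(float)

    # With an intercept, g1 = intercept - g0, so the last indicator is redundant;
    # GLM/UNIANOVA aliases its parameter estimate to 0 rather than dropping it.
    # Factor-by-covariate interaction: multiply the covariate by each indicator.
    design = np.column_stack([intercept, g0, g1, age_c, age_c * g0, age_c * g1])
    print(np.linalg.matrix_rank(design))  # 4, not 6: two columns are redundant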
To get to the specifics of what's going on with your data, I can't replicate your exact results without your data, but I am able to replicate the patterns shown if I assume you've used the binary Dx variable as a covariate and the binary gender variable as a factor in each analysis. (There seem to actually be four continuous predictors in your model rather than three, but that doesn't affect anything of importance for understanding what's going on.)
There are two aspects of the situation to be considered. One is the parameterization and how the two ways of entering the variables into the model treat the variables and whether or not they produce the same estimates of parameters. The second is how the model specification results in the Type III tests shown in the ANOVA tables.
If I'm understanding things correctly based on what you've posted here, you should find, if you compare parameter estimates for the two analyses, that the estimates for the intercepts and the non-redundant estimates for gender ([gender=0]) are the same, with the same standard errors. For the terms involving just covariates or products of covariates, I expect that you will find the parameter estimates to differ between the two analyses and produce different t statistics. For interactions involving gender and covariates (which is all the other variables or products created outside the procedure), I expect the estimates will be the same in magnitude and opposite in sign, with the same standard errors.
None of the estimates or tests here are wrong. The models fitted involve interaction effects. An interaction means that the effect of one variable varies with the levels of the other variable(s) in the interaction, and in order to estimate the same simple effects you have to parameterize the model in the same way, at least as far as the non-redundant parameters are concerned. However, to get the Type III tests for all terms to be identical, it's not always enough to have the same parameter estimates and standard errors. Type III tests involve a concept called containment that must also be considered.
For two effects in a model, effect A is contained in effect B if:
A and B contain the same covariate terms, if any.
B contains all factor effects in A, and at least one more (with the intercept being contained in all factor-only effects).
In your original model, the intercept is contained in the gender effect, gender is not contained in any effects, and all the covariate main effects and two-way interactions among covariates are contained within the interactions between those terms and gender, while the three-way interactions (which include gender) are not contained within any other effects.
Type III sums of squares (not invented by SPSS, but by our friends at SAS) are based on linear combinations of parameters where a given effect is adjusted for any effects that do not contain it, and made orthogonal to any effects that contain it. The practical application of these rules is complicated (see Appendix H of the algorithms).
If you recode the gender variable to swap the 0 and 1 values, specify it as a covariate along with all the other variables, and fit the same model, you should be able to match all the non-redundant parameter estimates from the original model, along with their standard errors and t statistics. However, because the containment relationships in the original model are no longer there, the Type III tests for the terms not involving gender (which were previously contained in terms involving gender) will not match up.
The bottom line is that all results are translatable and all correct for what's being done, and that in order to make much sense out of individual terms you have to carefully focus on what's being estimated in a given parameterization, as well as the containment relationships. The difficult part gets simpler once you take seriously the fact that when variable X is involved in interaction terms, there is no single estimate of the effect of X. Any estimates are conditional on where you fix the value(s) of the terms with which X interacts.

How is Growing Neural Gas used for clustering?

I know how the algorithm works, but I'm not sure how it determines the clusters. Based on images, I guess that it treats all the neurons that are connected by edges as one cluster, so you might get two clusters if there are two groups of neurons, each internally connected. But is that really it?
I also wonder.. is GNG really a neural network? It doesn't have a propagation function or an activation function or weighted edges.. isn't it just a graph? I guess that depends on personal opinion a bit but I would like to hear them.
UPDATE:
This thesis www.booru.net/download/MasterThesisProj.pdf deals with GNG clustering, and on page 11 you can see an example of what looks like clusters of connected neurons. But then I'm also confused by the number of iterations. Let's say I have 500 data points to cluster. Once I put them all in, do I remove them and add them again to adapt the existing network? And how often do I do that?
I mean, I have to re-add them at some point: when adding a new neuron r between two old neurons u and v, some data points formerly belonging to u should now belong to r because it is closer. But the algorithm does not include reassigning these data points. And even if I remove them after one iteration and add them all again, the stale assignment of the points during the rest of that first iteration changes how the network evolves, doesn't it?
NG and GNG are a form of self-organizing map (SOM), which are also referred to as "Kohonen neural networks".
These are based on an older, much wider view of neural networks, from when they were still inspired by nature rather than driven by the GPU capabilities for matrix operations. Back then, when you did not yet have massive-SIMD architectures, there was nothing bad about having neurons self-organize rather than being pre-organized in strict layers.
I would not call this clustering, although that term is commonly (ab)used in related work, because I don't see any strong property of these "clusters".
SOMs are literally maps, as in geography. A SOM is a set of nodes ("neurons"), usually arranged in a 2d rectangular or hexagonal grid (= the map). The positions of the nodes in the input space are then optimized iteratively to fit the data. Because they influence their neighbors, they cannot move freely. Think of wrapping a net around a tree; the knots of the net are your neurons. NG and GNG appear to be pretty much the same thing, but with a more flexible structure of nodes. But actually a nice property of SOMs is the 2d map that you get.
The only approach I remember for clustering was to project the input data to the discrete 2d space of the SOM grid, then run k-means on this projection. It will probably work okayish (as in: it will perform similarly to k-means), but I'm not convinced that it's theoretically well supported.
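If you want to try that projection idea, here is a minimal sketch assuming the third-party minisom package together with scikit-learn (the grid size, data, and hyperparameters are arbitrary choices for illustration):

    import numpy as np
    from minisom import MiniSom
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))  # toy data, 4 features

    # Fit a 10x10 SOM on the input data.
    som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
    som.train_random(X, 1000)

    # Project every point to the 2d grid coordinates of its best-matching unit.
    proj = np.array([som.winner(x) for x in X], dtype=float)

    # Run k-means on the discrete 2d projection to extract clusters.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(proj)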

How to check if a data point is within the boundary of a cluster or not

Suppose I have done clustering (using 3 features) and got 4 clusters, training on a set of data points.
Now, in production, I will be getting new data points, and based on the values of a point's features I need to know whether it falls into one of the pre-defined clusters I made earlier. This is not clustering, but rather finding whether a point falls within a pre-defined cluster.
How do I find whether the point is in a cluster?
Do I need to run linear regression to find the equation of the boundary covering the cluster?
There is no general answer to your question. The way a new point is assigned to a cluster is a property of the clustering itself, so the crucial question is what clustering procedure was used in the first place. Each well-defined clustering method (in the mathematical sense) provides you with a partitioning of the whole input space, not just of the finite training set. Such techniques include k-means, GMM, ...
However, there are exceptions: clustering methods which are simply heuristics, not valid optimization problems. For example, if you use hierarchical clustering there is no partitioning of the space, thus you cannot correctly assign a new point to any cluster, and you are left with dozens of equally valid heuristic methods which will do something, but you cannot say which one is correct. These heuristics include (see the sketch after this list):
the "closest point" heuristic, which is essentially equivalent to training 1-NN on your clusters;
the "build a valid model" heuristic, a generalization of the above where you fit some classifier (of your choice) to mimic the original clustering (and select its hyperparameters through cross-validation);
"what would happen if I re-ran the clustering": if you can re-run the clustering from the previous solution, you can simply check which cluster the new point falls into, given the previous clustering as a starting point;
...
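For illustration, a minimal sketch with scikit-learn, contrasting a method that partitions the whole space (k-means, which has a well-defined predict) with the 1-NN heuristic for a clustering that does not (the data here is made up):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, KMeans
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 3))  # toy training data, 3 features
    x_new = rng.normal(size=(1, 3))      # a production data point

    # k-means defines a partition of the whole space: predict() is well defined.
    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)
    print(km.predict(x_new))

    # Hierarchical clustering has no predict(); the "closest point" heuristic
    # fits a 1-NN classifier on the cluster labels instead.
    hc_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_train)
    nn = KNeighborsClassifier(n_neighbors=1).fit(X_train, hc_labels)
    print(nn.predict(x_new))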

K-Means Clustering: Evaluating new cluster centers

Is it better to compute new cluster centers after each iteration over all data points, or after assigning each single data point to a cluster? To clarify, which of the two methods is preferred:
You assign all the data points to various clusters and then find the new cluster centers.
Or, you assign the next data point to the nearest cluster, find the new cluster center, move on to the next point, and repeat...
These are, more or less, the two main approaches:
The first is more or less Lloyd's approach: you iterate over all data points, assign each to the nearest cluster, then move all centers accordingly, and repeat.
The second is more or less Hartigan's approach: you iterate over the data points one at a time and check whether it is better to move the current point to another cluster (does it minimize the energy / make the clusters more "dense"?), repeating until no change is possible.
Which of the two is better? Empirical studies show multiple advantages of the Hartigan approach. In particular, one can prove that Hartigan will not do worse than Lloyd (each Hartigan optimum is also a Lloyd optimum, but not the other way around). There is a nice theoretical and practical analysis in http://ijcai.org/papers13/Papers/IJCAI13-249.pdf showing that one should follow the second approach, especially if there are many, potentially irrelevant, features in the data.
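To make the difference concrete, here is a bare-bones sketch of the two update schedules (an illustration of the idea, not an optimized or canonical implementation; the Hartigan step uses the standard SSE change formulas n_a/(n_a-1)*||x-c_a||^2 for removing a point and n_b/(n_b+1)*||x-c_b||^2 for inserting it):

    import numpy as np

    def lloyd_step(X, centers):
        # Batch schedule: assign ALL points first, then recompute all centers.
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
        return labels, centers

    def hartigan_pass(X, labels, centers, counts):
        # Online schedule: move points one at a time, updating centers at once.
        changed = False
        for i, x in enumerate(X):
            a = labels[i]
            if counts[a] == 1:
                continue  # moving the only member would empty cluster a
            # SSE decrease from removing x from cluster a (its center shifts):
            gain = counts[a] / (counts[a] - 1) * ((x - centers[a]) ** 2).sum()
            # SSE increase from inserting x into each candidate cluster:
            costs = counts / (counts + 1) * ((x - centers) ** 2).sum(axis=1)
            costs[a] = np.inf  # staying put is not a move
            b = int(costs.argmin())
            if costs[b] < gain:  # move only if it strictly lowers the objective
                centers[a] = (counts[a] * centers[a] - x) / (counts[a] - 1)
                centers[b] = (counts[b] * centers[b] + x) / (counts[b] + 1)
                counts[a] -= 1
                counts[b] += 1
                labels[i] = b
                changed = True
        return changed

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    centers = X[:3].copy()                      # arbitrary initialization
    labels, centers = lloyd_step(X, centers)    # one Lloyd iteration
    counts = np.bincount(labels, minlength=3)
    while hartigan_pass(X, labels, centers, counts):
        pass                                    # Hartigan passes until stable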

What parameters can I play with using mcl?

I am clustering undirected graphs using mcl. To do so, I have chosen a threshold under which nodes are connected, a similarity measure for each edge, and the inflation parameter to tune the granularity of my graph. I have been playing around with these parameters, but so far the clusters I get seem to be too large (visualizations suggest that the largest clusters should be cut into 2 or more clusters). Therefore, I was wondering what other parameters I could play with to improve my clustering. (I am currently working with the scheme parameter of mcl to see whether increasing the accuracy would help, but if there are other, more specific parameters that could help to get smaller clusters, for instance, please let me know.)
There are really two main things to consider. The first and most important is outside mcl (http://micans.org/mcl/) itself, namely how the network is constructed. I've written about it elsewhere, but I'll repeat it here because it is important.
If you have a weighted similarity, choose an edge-weight (similarity) cutoff such that the topology of the network becomes informative; too many edges or too few edges yield little discriminative information in the absence/presence structure of edges. Choose it such that no edges connect things you consider very dissimilar, and that edges connect things you consider somewhat similar to quite similar. In the case of mcl, the dynamic range in edge weight between 'a bit similar' and 'very similar' should, as a rule of thumb, be one order of magnitude, i.e. two-fold or five-fold or ten-fold, as opposed to varying from 0.9 to 1.0. Of course, it is possible to give simple networks to mcl and it will just utilise the absence/presence of edges. Make sure the network does not become very dense; a very rough rule of thumb is to aim for a total number of edges on the order of V * sqrt(V) if the number of nodes (vertices) is V, that is, each node has on average on the order of sqrt(V) neighbours.
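As a sketch of this construction step in Python (the similarity matrix, node names, and cutoff here are all made up), writing the thresholded network in mcl's label ("abc") format of one whitespace-separated "node node weight" line per edge:

    import itertools
    import numpy as np

    # Hypothetical input: a symmetric similarity matrix over named nodes.
    rng = np.random.default_rng(0)
    sim = rng.random((50, 50))
    sim = (sim + sim.T) / 2
    names = [f"n{i}" for i in range(len(sim))]

    CUTOFF = 0.8  # edge-weight cutoff; tune it so the topology stays informative

    with open("graph.abc", "w") as f:
        for i, j in itertools.combinations(range(len(names)), 2):
            if sim[i, j] >= CUTOFF:
                f.write(f"{names[i]}\t{names[j]}\t{sim[i, j]:.4f}\n")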
The above, network construction, is really crucial, and it is advisable to try different approaches. Now, given a network, there is really only one mcl parameter to vary: the inflation parameter (the -I option). A good set of values to test is 1.4, 2, 3, 4, 6.
In summary, if you are exploring, try different ways of network construction, using your knowledge of the data to make the network a meaningful representation, and combine this with trying different mcl inflation values.
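If you script the exploration, a sweep over those inflation values might look like this (a sketch assuming the mcl binary is on your PATH and that graph.abc is a label-format network as above; higher -I generally yields more, smaller clusters):

    import subprocess

    for inflation in (1.4, 2.0, 3.0, 4.0, 6.0):
        out = f"out.graph.I{int(inflation * 10)}"
        # --abc: read label-format input; -I: inflation; -o: output file
        subprocess.run(
            ["mcl", "graph.abc", "--abc", "-I", str(inflation), "-o", out],
            check=True,
        )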
