Hierarchical Clustering with branching factor > 2? - machine-learning

All the hierarchical clustering methods that I have seen implemented in Python (scipy, scikit-learn, etc.) split or combine two clusters at a time. This forces the branching factor to be 2 at each node. For my purpose, I want the model to allow a branching factor greater than 2. That's helpful in situations where there are ties between clusters.
I'm not familiar with any hierarchical clustering techniques that have a branching factor greater than 2; do they exist?

Cluster this data set with single link:
0 0
0 1
1 0
1 1
And you will see a 4-way merge.
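For instance, a minimal sketch with SciPy: the library still records binary merges, but here they all happen at the same height, so cutting the dendrogram at that tied height recovers the 4-way merge.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Four corners of the unit square; every nearest-neighbour distance is 1.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

    Z = linkage(X, method='single')
    print(Z)  # the third column (merge heights) is 1.0 for every merge

    labels = fcluster(Z, t=1.0, criterion='distance')
    print(labels)  # all four points land in one flat cluster: the 4-way merge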
But for other linkages, always finding the best 3-way split would likely increase the runtime cost to O(n^4). You really don't want that.

Related

How to select features for clustering?

I had time-series data, which I have aggregated into 3 weeks and transposed to features.
Now I have features: A_week1, B_week1, C_week1, A_week2, B_week2, C_week2, and so on.
Some of the features are discrete, some continuous.
I am thinking of applying K-Means or DBSCAN.
How should I approach the feature selection in such situation?
Should I normalise the features? Should I introduce some new ones that would somehow link the periods together?
Since K-means and DBSCAN are unsupervised learning algorithms, feature selection for them comes down to a grid search: try different feature sets and parameters and evaluate each resulting clustering with internal measures such as the Davies–Bouldin index or the Silhouette coefficient, among others. If you're using Python, scikit-learn provides both an exhaustive grid search and these internal measures.
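A minimal sketch of that kind of search with scikit-learn (the data here is a random placeholder; in practice you would plug in your weekly features and loop over whatever feature subsets and parameters you want to compare):
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, davies_bouldin_score
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = StandardScaler().fit_transform(rng.normal(size=(300, 6)))  # placeholder features

    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k,
              round(silhouette_score(X, labels), 3),      # higher is better
              round(davies_bouldin_score(X, labels), 3))  # lower is better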
Formalize your problem, don't just hack some code.
K-means minimizes the sum of squares. If the features have different scales, they get different influence on the optimization. Therefore, you need to carefully choose the weights (scaling factors) of each variable to balance their importance the way you want (and note that a 2x scaling factor does not make the variable twice as important, because the objective is squared).
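A small sketch of that weighting idea (the weights here are made up for illustration; a weight w changes a variable's contribution to the squared distances by w^2):
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3)) * [1.0, 50.0, 0.01]   # wildly different raw scales

    weights = np.array([1.0, 2.0, 0.5])                 # chosen importance per variable
    X_weighted = StandardScaler().fit_transform(X) * weights

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_weighted)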
For DBSCAN, the distance is only a binary decision: close enough, or not. If you use the GDBSCAN version, this is easier to understand than with distances. But with mixed variables, I would suggest using the maximum norm. Two objects are then close if they differ in each variable by at most "eps". You can set eps=1 and scale your variables such that 1 is a "too big" difference. For example, in discrete variables you may want to tolerate one or two discrete steps, but not three.
Logically, it's easy to see that the maximum-distance threshold decomposes into a conjunction of one-variable clauses:
maxdistance(x, y) <= eps   <=>   for all i: |x_i - y_i| <= eps
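A sketch of that setup with scikit-learn (the data and the eps/min_samples values are illustrative; scale your own variables first so that a difference of 1 in any variable is "too big"):
    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.array([[0.0, 2.0], [0.4, 2.0], [0.9, 2.5], [5.0, 0.0]])  # toy, pre-scaled data

    # Maximum (Chebyshev) norm: two points are neighbours only if they differ
    # by at most eps = 1 in *every* variable.
    labels = DBSCAN(eps=1.0, min_samples=2, metric='chebyshev').fit_predict(X)
    print(labels)  # the first three points cluster together; the last one is noise (-1)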

Hierarchical Clustering

I have read some resources and I found out how hierarchical clustering works. However, when I compare it with k-means clustering, it seems to me that k-means really produces a specific number of clusters, whereas hierarchical analysis shows me how the samples can be clustered. What I mean is that I do not get a specific number of clusters in hierarchical clustering. I only get a scheme of how the clusters can be constituted and a picture of the relations between the samples.
Thus, I cannot understand where I can use this clustering method.
Hierarchical clustering (HC) is just another distance-based clustering method, like k-means. The number of clusters can be roughly determined by cutting the dendrogram produced by HC. Determining the number of clusters in a data set is not an easy task for any clustering method, and it usually depends on your application. Tuning the thresholds in HC may be more explicit and straightforward for researchers, especially for a very large data set. I think this question is also related.
In k-means clustering, k is a hyperparameter that you need to find in order to divide your data points into clusters, whereas in hierarchical clustering (let's take one type of hierarchical clustering, i.e. agglomerative) you first consider each point in your dataset as its own cluster, then merge the two most similar clusters, and repeat this until you get a single cluster. I will explain this with an example.
Suppose initially you have 13 points (x_1, x_2, ..., x_13) in your dataset, so at the start you have 13 clusters. Now in the second step, let's say you get 7 clusters (x_1-x_2, x_4-x_5, x_6-x_8, x_3-x_7, x_11-x_12, x_10, x_13) based on the similarity between the points. In the third step, let's say you get 4 clusters (x_1-x_2-x_4-x_5, x_6-x_8-x_10, x_3-x_7-x_13, x_11-x_12). Continuing like this, you eventually arrive at a step where all the points in your dataset form one cluster, which is also the last step of the agglomerative clustering algorithm.
So in hierarchical clustering there is no k to fix in advance: depending on your problem, if you want 7 clusters you stop at the second step, if you want 4 clusters you stop at the third step, and likewise (see the sketch below).
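A minimal sketch of this with SciPy (random data standing in for the 13 points): the full merge tree is built once, and you "stop" at whatever number of clusters you want by cutting it afterwards.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(42)
    X = rng.normal(size=(13, 2))          # 13 points, as in the example above

    Z = linkage(X, method='average')      # agglomerative merges, from 13 clusters down to 1

    labels_7 = fcluster(Z, t=7, criterion='maxclust')  # "stop" at 7 clusters
    labels_4 = fcluster(Z, t=4, criterion='maxclust')  # or at 4, from the same tree
    print(labels_7)
    print(labels_4)
    # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree in a plotting environment.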
A practical advantage of hierarchical clustering is the possibility of visualizing the results with a dendrogram. If you don't know in advance what number of clusters you're looking for (as is often the case...), the dendrogram plot can help you choose k with no need to create separate clusterings. The dendrogram can also give great insight into the data structure, help identify outliers, etc. Hierarchical clustering is also deterministic, whereas k-means with random initialization can give you different results when run several times on the same data.
Hope this helps.

K-means: Only two optimal clusters

I am running a k-means algorithm in R and trying to find the optimal number of clusters, k. Using the silhouette method, the gap statistic, and the elbow method, I determined that the optimal number of clusters is 2. While there are no predefined clusters for the business, I am concerned that k = 2 is not very insightful, which leads me to a few questions.
1) What does an optimal k = 2 mean in terms of the data's natural clustering? Does this suggest that maybe there are no clear clusters, or that having no clusters is better than any clustering?
2) At k = 2, the R-squared is low (.1). At k = 5, the R-squared is much better (.32). What are the exact trade-offs of selecting k = 5, knowing it's not optimal? Would it be that you can increase the number of clusters, but they may not be distinct enough?
3) My n=1000, I have 100 variables to choose from, but only selected 5 from domain knowledge. Would increasing the number of variables necessarily make the clustering better?
4) As a follow up to question 3, if a variable is introduced and lowers the R-squared, what does that say about the variable?
I am no expert but I will try to answer as best as I can:
1) Your optimal-cluster-number methods gave you k = 2, so that would suggest there is clear clustering; the number is just low (2). To help with this, use your knowledge of the domain to aid the interpretation: do 2 clusters make sense given your domain?
2) Yes, you're correct. The optimal solution in terms of R-squared is to have as many clusters as data points, but this isn't optimal in terms of why you're doing k-means. You're doing k-means to gain more insightful information from the data; that is your primary goal. So if you choose k = 5, your data will fit the 5 clusters better, but as you say there probably isn't much distinction between them, so you're not gaining any insight (see the sketch after this list for what the R-squared measures).
3) Not necessarily; in fact, adding variables blindly could make it worse. K-means operates in Euclidean space, so every variable is given an even weighting in determining the clusters. If you add variables that are not relevant, their values will still distort the n-dimensional space, making your clusters worse.
4) (Double-check my logic here, I'm not 100% sure on this one.) If a variable is introduced with the same number of clusters and it drops the R-squared, then yes, it is a useful variable to add; it means it has correlation with your other variables.
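For what it's worth, here is a hypothetical illustration of the R-squared being discussed, taken to be the share of total variance explained by the clustering (between-cluster sum of squares over total sum of squares), with random data standing in for the 1000 x 5 matrix:
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))        # stand-in for the 1000 x 5 data set

    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    for k in (2, 5):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        r_squared = 1 - km.inertia_ / total_ss   # inertia_ = within-cluster sum of squares
        print(k, round(r_squared, 3))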

Find the best set of features to separate 2 known groups of data

I need some points of view to know whether what I am doing is good or wrong, or whether there is a better way to do it.
I have 10,000 elements. For each of them I have about 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know those 2 groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2,000 of those elements, then I look at how good the score is when I test on the other 8,000 elements.
Now I would like to know which features maximize this separation.
My first approach was to test each combination of features with the SVM and track the score given by the SVM. If the score is good, those features are relevant for separating those 2 sets of data.
But this takes far too much time: there are 2^500 possible feature subsets.
The second approach was to remove one feature and see how much the score is impacted. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right: when there are 500 features, removing just one feature doesn't change the final score much.
Is this a correct way to do it?
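For concreteness, the second approach described above might look like this (a sketch with synthetic data and an arbitrary threshold for "changes a lot"):
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 50))            # toy stand-in for the real 10,000 x 500 matrix
    y = (X[:, 3] > 0).astype(int)              # the two known groups (toy rule)

    baseline = LinearSVC(dual=False).fit(X[:1500], y[:1500]).score(X[1500:], y[1500:])
    for j in range(X.shape[1]):
        X_drop = np.delete(X, j, axis=1)
        score = LinearSVC(dual=False).fit(X_drop[:1500], y[:1500]).score(X_drop[1500:], y[1500:])
        if baseline - score > 0.05:            # arbitrary threshold for "changes a lot"
            print(f"feature {j} looks relevant (score drops by {baseline - score:.2f})")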
Have you tried any other methods? Maybe you can try a decision tree or random forest; they would give you your best features based on entropy gain. Can I assume all the features are independent of each other? If not, please remove the redundant ones as well.
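A minimal sketch of the random-forest suggestion (synthetic data and illustrative parameter values, not the asker's real matrix):
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 500))         # 10,000 elements, 500 features
    y = (X[:, 3] + X[:, 7] > 0).astype(int)    # the two known groups (toy rule)

    forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    forest.fit(X[:2000], y[:2000])             # train on 2,000, as in the question

    top = np.argsort(forest.feature_importances_)[::-1][:10]
    print(top)                                 # indices of the 10 most separating features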
Also, for support vector machines, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it's based more on linear SVM.
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for c_i are the ones distributed most differently in the sets of positive and negative examples of c_i. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ² is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent t_k and c_i are. The terms t_k with the lowest value for χ²(t_k, c_i) are thus the most independent from c_i; since we are interested in the terms which are not, we select the terms for which χ²(t_k, c_i) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term t_j and a category c_k, ECCD(t_j, c_k) can be computed from a contingency table. Let A be the number of documents in the category containing t_j; B, the number of documents in the other categories containing t_j; C, the number of documents of c_k which do not contain t_j; and D, the number of documents in the other categories which do not contain t_j (with N = A + B + C + D).
Using this contingency table, Information Gain can be estimated directly from the counts A, B, C and D.
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
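As a sketch, the usual mutual-information form of Information Gain can be computed from the A/B/C/D counts like this (my own formulation of the standard textbook formula, not necessarily the exact expression used in the paper):
    import math

    def information_gain(A, B, C, D):
        # Cells: (term present, in class), (present, other classes),
        #        (absent, in class), (absent, other classes).
        N = A + B + C + D
        ig = 0.0
        for n_tc, n_t, n_c in [(A, A + B, A + C), (B, A + B, B + D),
                               (C, C + D, A + C), (D, C + D, B + D)]:
            if n_tc > 0:
                ig += (n_tc / N) * math.log((n_tc * N) / (n_t * n_c))
        return ig

    print(information_gain(A=80, B=20, C=10, D=890))  # high IG: term concentrates in one class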
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
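A sketch of that decision-tree idea (synthetic data; a depth-1 "stump" picks the single feature that best splits the two groups):
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X[:, 5] > 0.2).astype(int)        # group membership driven by feature 5

    stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
    print(stump.tree_.feature[0])          # index of the feature used at the root node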
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes the separability. The algorithm works by projecting your data into a space where the variance within classes is minimized and the variance between classes is maximized.
You can use it to reduce the number of dimensions required to classify, and also use it as a linear classifier.
However, with this technique you would lose the original features and their meaning, and you may want to avoid that.
If you want more details I found this article to be a good introduction.
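A small sketch of the LDA suggestion with scikit-learn (synthetic data shaped like the question's 10,000 x 500 matrix; the two-group rule is made up):
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 500))
    y = (X[:, 0] - X[:, 1] > 0).astype(int)             # two known groups (toy rule)

    lda = LinearDiscriminantAnalysis(n_components=1).fit(X[:2000], y[:2000])
    X_proj = lda.transform(X[2000:])                    # 1-D projection maximizing separability
    print(lda.score(X[2000:], y[2000:]))                # it also works as a linear classifier
    print(np.argsort(np.abs(lda.coef_[0]))[::-1][:10])  # original features with the largest weights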

Determining optimal number of clusters and Davies–Bouldin Index?

I'm trying to evaluate what the right number of clusters is for clustering some data.
I know that this is possible using Davies–Bouldin Index (DBI).
To use DBI you compute it for each candidate number of clusters, and the number of clusters that minimizes the DBI is taken as the right number of clusters.
The question is:
how to know if 2 clusters are better than 1 cluster using DBI? So, how can I compute DBI when I have just 1 cluster?
Only considering the average DBI of all clusters is apparently not a good idea.
Certainly, increasing the number of clusters k without any penalty can keep driving the DBI of the resulting clustering down, to the extreme case of zero DBI if each data point is considered its own cluster (because each data point coincides with its own centroid).
So it's hard to say whether 2 clusters are better than 1 if you only use the average DBI as the performance metric.
A good practical method is to use the Elbow method.
Another method looks at the percentage of variance explained as a function of the number of clusters: you should choose a number of clusters such that adding another cluster doesn't give much better modeling of the data. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain drops, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".
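A minimal sketch combining the two ideas (toy blob data; note that DBI is only defined for k >= 2, which is exactly why the single-cluster case cannot be scored with it):
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])  # 3 blobs

    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    for k in range(2, 8):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        explained = 1 - km.inertia_ / total_ss        # percentage of variance explained
        dbi = davies_bouldin_score(X, km.labels_)     # lower is better
        print(k, round(explained, 3), round(dbi, 3))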
Some other good alternatives with respect to choosing the optimal number of clusters:
Determining the number of clusters in a data set
How to define number of clusters in K-means clustering?
