I want to cluster some data points but the maximum number of points per cluster is limited. So there is a maximum size per cluster. Is there any clustering algorithm for that?
Also, can I define my own size function? For example, instead of taking the number of points in a cluster as its size, I want to sum a column over all the points in the cluster.
A quick, though not optimal, solution is to split the data into two parts iteratively until the number of points in each part is under the limit.
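The iterative splitting idea can be sketched as follows. This is only a minimal sketch, not an optimal method: the split rule (a median cut along the coordinate with the largest range) and the names `split_until_small`, `size` and `coords` are assumptions made for illustration, not part of any library. Passing a different `size` function also covers the custom-size case, for example summing a weight column instead of counting points.

```python
def split_until_small(items, max_size, size=len, coords=lambda item: item):
    """Recursively bisect `items` until size(group) <= max_size.

    size   -- hypothetical size function; defaults to the number of points
    coords -- hypothetical accessor returning the coordinate tuple of an item
    """
    # A single item cannot be split further, even if it is "too large".
    if size(items) <= max_size or len(items) <= 1:
        return [items]

    n_dims = len(coords(items[0]))

    def spread(d):
        values = [coords(it)[d] for it in items]
        return max(values) - min(values)

    # Assumed split rule: cut at the median of the widest coordinate.
    axis = max(range(n_dims), key=spread)
    ordered = sorted(items, key=lambda it: coords(it)[axis])
    mid = len(ordered) // 2
    return (split_until_small(ordered[:mid], max_size, size, coords)
            + split_until_small(ordered[mid:], max_size, size, coords))


# Size = number of points (at most 3 per group).
points = [(float(i), 0.0) for i in range(10)]
groups = split_until_small(points, 3)

# Size = sum of a weight column; each item is (coordinates, weight).
weighted = [((0.0,), 5), ((1.0,), 5), ((2.0,), 5), ((3.0,), 5)]
weighted_groups = split_until_small(
    weighted, 10,
    size=lambda group: sum(w for _, w in group),
    coords=lambda item: item[0],
)
```

The groups produced this way respect the limit but ignore the natural cluster structure, which is why it is only a quick fallback.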
The problem of k-means clustering with minimum size constraints is addressed in this paper:
Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8.
However, the approach proposed in this paper can be easily extended to the maximum size constraints.
Here is an implementation of this algorithm and an extension to it which addresses both minimum size and maximum size constraints.
As for your question about a custom size function, that is a more difficult problem, for which I guess local search approaches are more appropriate.
As clustering will usually try to make the clusters as large as possible, this isn't really clustering then anymore. More like a minimum spanning tree, where you remove the longest edges to find groups.
You could try something like x-means, i.e. a k-means variation where you split clusters that you consider to be too large.
I'm confused about the difference between the following parameters in HDBSCAN
Correct me if I'm wrong.
For min_samples, if it is set to 7, then clusters formed need to have 7 or more points.
For cluster_selection_epsilon, if it is set to 0.5 meters, then any clusters that are more than 0.5 meters apart will not be merged into one. Meaning that each cluster will only include points that are 0.5 meters apart or less.
How is that different from min_cluster_size?
They technically do two different things.
min_samples = the minimum number of neighbours to a core point. The higher this is, the more points are going to be discarded as noise/outliers. This is from the DBSCAN part of HDBSCAN.
min_cluster_size = the minimum size a final cluster can be. The higher this is, the bigger your clusters will be. This is from the H part of HDBSCAN.
Increasing min_samples will increase the size of the clusters, but it does so by discarding data as outliers using DBSCAN.
Increasing min_cluster_size while keeping min_samples small, by comparison, keeps those outliers but instead merges any smaller clusters with their most similar neighbour until all clusters are above min_cluster_size.
If you want many highly specific clusters, use a small min_samples and a small min_cluster_size.
If you want more generalized clusters but still want to keep most of the detail, use a small min_samples and a large min_cluster_size.
If you want very very general clusters and to discard a lot of noise in the clusters, use a large min_samples and a large min_cluster_size.
(It's not possible to use min_samples larger than min_cluster_size, afaik)
I am using Word2Vec with a dataset of roughly 11,000,000 tokens looking to do both word similarity (as part of synonym extraction for a downstream task) but I don't have a good sense of how many dimensions I should use with Word2Vec. Does anyone have a good heuristic for the range of dimensions to consider based on the number of tokens/sentences?
The typical range is 100 to 300 dimensions. I would say you need at least 50 dimensions to reach even minimally acceptable accuracy. If you pick fewer dimensions, you will start to lose the properties of high-dimensional spaces. If training time is not a big deal for your application, I would stick with 200 dimensions, as it gives nice features. The highest accuracy can be obtained with 300 dimensions. Beyond 300 dimensions the word features won't improve dramatically, and training will be extremely slow.
I do not know of a theoretical explanation or strict bounds for choosing the dimension in high-dimensional spaces (and there might not be an application-independent explanation for that), but I would refer you to Pennington et al., Figure 2a, where the x axis shows the vector dimension and the y axis shows the accuracy obtained. That should provide empirical justification for the above argument.
I think that the number of dimensions for word2vec depends on your application. The most common empirical value is about 100, which tends to perform well.
The number of dimensions affects over- and underfitting. 100-300 dimensions is the common knowledge. Start with one number and check the accuracy on your test set versus your training set. The bigger the dimension size, the easier it is to overfit the training set and perform badly on the test set. Tuning this parameter is required if you have high accuracy on the training set and low accuracy on the test set: this means that the dimension size is too big, and reducing it might solve the overfitting problem of your model.
I'm trying to read through PCA and saw that the objective was to maximize the variance. I don't quite understand why. Any explanation of other related topics would be helpful
Variance is a measure of the "variability" of the data you have. Potentially the number of components is infinite (actually, after numerization it is at most equal to the rank of the matrix, as #jazibjamil pointed out), so you want to "squeeze" the most information in each component of the finite set you build.
If, to exaggerate, you were to select a single principal component, you would want it to account for the most variability possible: hence the search for maximum variance, so that the one component collects the most "uniqueness" from the data set.
Note that PCA does not actually increase the variance of your data. Rather, it rotates the data set in such a way as to align the directions in which it is spread out the most with the principal axes. This enables you to remove those dimensions along which the data is almost flat. This decreases the dimensionality of the data while keeping the variance (or spread) among the points as close to the original as possible.
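To make the rotation picture concrete, here is a small sketch for the two-dimensional case only. The function names `pca_2d` and `variance` are made up for this illustration, and the closed-form angle used here works only for a 2x2 covariance matrix; real implementations use an eigendecomposition or SVD. The sample points are invented.

```python
import math


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def pca_2d(points):
    """Rotate centered 2D points so the first axis has maximum variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the direction of maximum variance (2D closed form).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    rotated = [((x - mx) * c + (y - my) * s,
                -(x - mx) * s + (y - my) * c) for x, y in points]
    return theta, rotated


# Invented data lying roughly along the line y = x.
pts = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]
theta, rotated = pca_2d(pts)

var_before = variance([x for x, _ in pts]) + variance([y for _, y in pts])
var_pc1 = variance([a for a, _ in rotated])
var_pc2 = variance([b for _, b in rotated])
print(var_before, var_pc1 + var_pc2)  # total variance is unchanged by the rotation
print(var_pc1, var_pc2)               # almost all of it now sits on the first axis
```

Dropping the second rotated coordinate then loses very little of the spread, which is the dimensionality reduction step.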
Maximizing the component vector variances is the same as maximizing the 'uniqueness' of those vectors. Thus your vectors are as distant from each other as possible. That way, if you only use the first N component vectors, you're going to capture more space with highly varying vectors than with like vectors. Think about what Principal Component actually means.
Take for example a situation where you have 2 lines that are orthogonal in a 3D space. You can capture the environment much more completely with those orthogonal lines than 2 lines that are parallel (or nearly parallel). When applied to very high dimensional states using very few vectors, this becomes a much more important relationship among the vectors to maintain. In a linear algebra sense you want independent rows to be produced by PCA, otherwise some of those rows will be redundant.
See this PDF from Princeton's CS Department for a basic explanation.
Maximizing variance basically means choosing the axes that cover the maximum spread of the data points. Why? Because the direction of these axes is what really matters: it more or less explains the correlations, and later on we will compress/project the points along those axes to get rid of some dimensions.
Say, in the document classification domain, I have a dataset of 1000 instances where the instances (documents) have rather little content, and another dataset of, say, 200 instances where each individual instance has richer content. If IDF is not a concern, will the number of instances really matter in training? Do classification algorithms take that into account in some way?
You could pose this as a general machine learning problem. The simplest problem that can help you understand how the size of training data matters is curve fitting.
The uncertainty and bias of a classifier or a fitted model are functions of the sample size. Small sample size is a well-known problem which we often try to avoid by collecting more training samples. This is because the uncertainty of non-linear classifiers is estimated by a linear approximation of the model, and this estimate is accurate only if a large number of samples is available, which is the main condition of the central limit theorem.
The proportion of outliers is also an important factor you should consider when deciding on the training sample size. If a larger sample size means a greater proportion of outliers, then you should limit the sample size.
The document size is actually an indirect indicator of the feature space size. If, for example, you have only 10 features from each document, then you're trying to separate/classify the documents in a 10-dimensional space. If you have 100 features in each document, then the same is happening in a 100-dimensional space. I guess it's easy for you to see that drawing lines that separate the documents is easier in a higher dimension.
For both document size and sample size the rule of thumb is to go as high as possible, but in practice this is not always possible. If, for example, you estimate the uncertainty function of the classifier, you will find a threshold beyond which larger sample sizes lead to virtually no reduction in uncertainty and bias. Empirically, you can also find this threshold for some problems by Monte Carlo simulation.
Most engineers don't bother to estimate uncertainty and that often leads to sub-optimal behavior of the methods they implement. This is fine for toy problems but in real-world problems considering uncertainty of estimations and computation is vital for most systems. I hope that answers your questions to some degree.
I have implemented k-means clustering to find the clusters in 300 objects. Each of my objects
has about 30 dimensions. The distance is calculated using the Euclidean metric.
I need to know
How would I determine if my algorithm works correctly? I can't have a graph which will
give some idea about the correctness of my algorithm.
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions
instead of 30 ?
The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.
How would I determine if my [clustering] algorithm works correctly?
k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"
Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:
inter-centroidal separation / intra-cluster variance
As the value of this ratio increases, the quality of your clustering result increases.
This is intuitive. The first of these metrics is just how far apart each cluster is from the others (measured between the cluster centers).
But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.
In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or to compare the results from the same algorithm run under different variable parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).
The desired result is tight (small) clusters, each one far away from the others.
The calculation is simple:
For inter-centroidal separation:
calculate the pair-wise distance between cluster centers; then
calculate the median of those distances.
For intra-cluster variance:
for each cluster, calculate the distance of every data point in a given cluster from
its cluster center; next
(for each cluster) calculate the variance of the sequence of distances from the step above; then
average these variance values.
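The steps above can be sketched in a few lines of Python. This is a minimal sketch under some assumptions: each cluster is given as a list of coordinate tuples, the function name `separation_ratio` is made up for illustration, and the population variance is used. It needs at least two clusters, and it divides by zero if every point in every cluster is equally far from its center.

```python
import math
from itertools import combinations
from statistics import mean, median, pvariance


def separation_ratio(clusters):
    """Inter-centroidal separation divided by mean intra-cluster variance."""
    # Cluster center = coordinate-wise mean of the cluster's points.
    centers = [tuple(mean(col) for col in zip(*pts)) for pts in clusters]

    # Median of the pair-wise distances between cluster centers.
    inter = median([math.dist(a, b) for a, b in combinations(centers, 2)])

    # Per cluster: variance of the point-to-center distances; then average.
    intra = mean([
        pvariance([math.dist(p, c) for p in pts])
        for pts, c in zip(clusters, centers)
    ])
    return inter / intra


# Invented data: same center separation, but the first result is much tighter.
tight = [[(0.0, 0.0), (0.1, 0.0), (0.4, 0.0)],
         [(10.0, 0.0), (10.1, 0.0), (10.4, 0.0)]]
loose = [[(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)],
         [(10.0, 0.0), (11.0, 0.0), (14.0, 0.0)]]
ratio_tight = separation_ratio(tight)
ratio_loose = separation_ratio(loose)
print(ratio_tight > ratio_loose)
```

The ratio is only meaningful for comparing results on the same data; its absolute value depends on the scale of the features.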
That's my answer to the first question. Here's the second question:
Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30 ?
First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?
Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:
subtract their feature vectors element-wise,
square each item in that result vector,
sum that result,
take the square root of that scalar.
Nowhere in this sequence of calculations is scale implicated.
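Written out as code, the four steps look like this. It is a plain-Python sketch with a made-up function name, `euclidean`; the same function is applied unchanged to 2 and to 100 dimensions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    diffs = [x - y for x, y in zip(a, b)]   # subtract element-wise
    squares = [d * d for d in diffs]        # square each item
    total = sum(squares)                    # sum the result
    return math.sqrt(total)                 # square root of the scalar


d_2d = euclidean((0.0, 0.0), (3.0, 4.0))
d_100d = euclidean([0.0] * 100, [1.0] * 100)
print(d_2d, d_100d)
```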
But whether Euclidean distance is the appropriate similarity metric for your problem, depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender? M/F) If one of your dimensions is "current location" and of the 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
In any event, since we don't know anything about your data, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.
To identify an appropriate similarity metric given your data:
Euclidean distance is good when the dimensions are comparable and on the same scale. If one dimension represents length and another represents the weight of an item, Euclidean distance should be replaced with a weighted distance.
Reduce it to 2D and plot it: this is a good way to see visually whether it works.
Or you can use a sanity check, such as finding the cluster centers and checking that none of the items in a cluster is too far from its center.
Can't you just try sum |xi - yi| instead of (xi - yi)^2
in your code, and see if it makes much difference?
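As a rough sketch of that experiment, the distance can be passed as a parameter to the assignment step so that the two variants are easy to compare. The names `manhattan`, `squared_euclidean` and `nearest_center` are made up for this example, and the point and centers are invented so that the two distances disagree.

```python
def manhattan(a, b):
    """Sum of |xi - yi|."""
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Sum of (xi - yi)^2; the square root does not change which center is nearest."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers, dist):
    """Index of the center closest to `point` under the given distance."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


point = (0.0, 0.0)
centers = [(3.0, 3.0), (0.0, 5.0)]

by_l1 = nearest_center(point, centers, manhattan)          # distances 6 and 5
by_l2 = nearest_center(point, centers, squared_euclidean)  # distances 18 and 25
print(by_l1, by_l2)
```

Note that with L1 distance the mean is no longer the minimizing cluster center (the coordinate-wise median is), so a full comparison would change the update step as well, not only the assignment.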
I can't have a graph which will give some idea about the correctness of my algorithm.
A couple of possibilities:
look at some points midway between 2 clusters in detail
vary k a bit, see what happens (what is your k ?)
to map 30d down to 2d; see the plots under
also SO questions/tagged/pca
By the way, scipy.spatial.cKDTree
can easily give you say 3 nearest neighbors of each point,
in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at.
It's fast up to ~ 20d, and with early cutoff works even in 128d.
Added: I like Cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
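For reference, a minimal sketch of cosine distance, taken here as 1 minus the cosine similarity (a common convention, assumed rather than taken from the answer above). The function name `cosine_distance` is made up, and the function is undefined for all-zero vectors.

```python
import math


def cosine_distance(a, b):
    """1 - cos(angle between a and b); depends on direction only, not length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Raises ZeroDivisionError if either vector is all zeros.
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance((1.0, 2.0), (2.0, 4.0))  # close to 0
orthogonal = cosine_distance((1.0, 0.0), (0.0, 1.0))      # exactly 1
print(same_direction, orthogonal)
```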
Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data is too noisy or has a non-Gaussian distribution.
You might want to try the Manhattan distance (or cityblock), which is robust to that (bear in mind that robustness always comes at a cost: a bit of the information is lost, in this case).
There are many further distance metrics for specific problems (for example, Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.