DBSCAN Clustering algorithm - machine-learning

In the DBSCAN algorithm I get cluster labels of -1. What does this mean? And how do I find how many clusters are generated when I use minpts=5 and eps=13?

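If you are using scikit-learn, the label -1 marks noisy samples, i.e. points that were not assigned to any cluster, according to scikit-learn's documentation: sklearn.cluster.DBSCAN
The method fit_predict returns a numpy array containing one cluster label per sample. The shape of this array therefore gives the number of samples, not the number of clusters; the number of clusters is the number of distinct labels in the array, excluding -1. For more info, check the link above.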
You can also check this tutorial from scikit-learn: Demo of DBSCAN clustering algorithm
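As a minimal sketch of how the labels can be counted: the helper name summarize_labels and the example labels below are made up for illustration; with scikit-learn the labels would come from something like DBSCAN(eps=13, min_samples=5).fit_predict(X).

```python
def summarize_labels(labels):
    """Return (number of clusters, number of noise points) for DBSCAN-style labels."""
    labels = [int(label) for label in labels]
    # Every distinct label except -1 is a cluster; -1 is the noise marker.
    n_clusters = len(set(labels) - {-1})
    n_noise = labels.count(-1)
    return n_clusters, n_noise


# Hypothetical labels standing in for the output of fit_predict.
example_labels = [0, 0, 1, -1, 1, 2, -1, 0]
n_clusters, n_noise = summarize_labels(example_labels)
print(n_clusters, n_noise)  # 3 2
```

The same function works on the numpy array returned by fit_predict, since it only iterates over the labels.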

Related

How does APPLY_KMEANS work in Vertica

I am testing the machine learning tools in Vertica. I understand how KMEANS works, since it just divides the data into clusters. However, I do not understand how APPLY_KMEANS works on new data.
It looks to me like it acts more like a classification method, since it classifies new data into the existing clusters. So what algorithm is used (k-nearest neighbors)? It's not very clear from the documentation.
k-means is a clustering algorithm (not classification!) that iterates over 2 steps:
Assignment step: assign each point to its nearest centroid
Update step: update the centroids' coordinates
When you build your k-means model, you first initialize the centroids (there are different strategies; random initialization is one of them), then you iterate until your clustering is ok (your error is below a given threshold).
What defines your model is actually your centroids.
When using APPLY_KMEANS you will run an assignment step using data from your query and centroids from your model. Points will then be assigned to clusters depending on their distance with respect to centroids.
Hope it helps
pltrdy
Note about Clustering vs Classification:
We can be tempted to think that clustering is a kind of classification. Still, classification must only refer to supervised learning while clustering corresponds to unsupervised learning. Thus, don't do it :)
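The assignment step described in this answer can be sketched as follows. This is not Vertica's implementation; it is a generic illustration that assumes Euclidean distance, and the function name, centroids and points are made up.

```python
import math


def assign_to_nearest_centroid(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    assignments = []
    for point in points:
        # Euclidean distance from this point to every centroid (an assumption).
        distances = [math.dist(point, centroid) for centroid in centroids]
        assignments.append(distances.index(min(distances)))
    return assignments


# Centroids play the role of the fitted model; new_points are unseen data.
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_points = [(1.0, -1.0), (9.0, 11.0), (4.0, 4.0)]
print(assign_to_nearest_centroid(new_points, centroids))  # [0, 1, 0]
```

No update step runs here: the centroids stay fixed, which is why applying the model to new data looks like classification even though the model was learned without labels.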

Choosing the number of clusters in hierarchical agglomerative clustering with scikit

The Wikipedia article on determining the number of clusters in a dataset indicated that I do not need to worry about such a problem when using hierarchical clustering. However, when I tried to use scikit-learn's agglomerative clustering I saw that I have to feed it the number of clusters as a parameter "n_clusters", without which I get the hardcoded default of two clusters. How can I go about choosing the right number of clusters for the dataset in this case? Is the wiki article wrong?
Wikipedia is simply making an extreme simplification which has nothing to do with real life. Hierarchical clustering does not avoid the problem of choosing the number of clusters. It simply constructs a tree spanning all samples, which shows which samples (later on, clusters) merge together to create a bigger cluster. This happens recursively until only two clusters remain (presumably why the default number of clusters is 2), which are then merged into the whole dataset. You are left alone with "cutting" through the tree to get an actual clustering. Once you fit AgglomerativeClustering you can traverse the whole tree and analyze which clusters to keep:
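import itertools
import numpy as np
from sklearn.cluster import AgglomerativeClustering
# Toy data: 3 samples around 0 and 2 samples around 100, 10 features each
X = np.concatenate([np.random.randn(3, 10), np.random.randn(2, 10) + 100])
clustering = AgglomerativeClustering()
clustering.fit(X)
# Leaves are numbered 0..n_samples-1; the i-th merge creates node n_samples + i.
# The counter must be created once, otherwise every node gets the same id.
node_ids = itertools.count(X.shape[0])
[{'node_id': next(node_ids), 'left': x[0], 'right': x[1]} for x in clustering.children_]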
ELKI (not scikit-learn, but Java) has a number of advanced methods that extract clusters from a hierarchical clustering. They are smarter than just cutting the tree at a particular height; for example, they can produce a hierarchy of clusters of a minimum size.
You could check if these methods work for you.
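For comparison, the plain "cut the tree" approach mentioned in the answers can be sketched by replaying merges until the desired number of clusters remains. This is only an illustration: cut_tree is a made-up helper, the merge list is hypothetical, and it assumes the merges are listed in the order they were performed, with the i-th merge creating node n_samples + i (the layout of AgglomerativeClustering.children_).

```python
def cut_tree(children, n_samples, n_clusters):
    """Replay the first (n_samples - n_clusters) merges and label each sample."""
    # members[node_id] lists the original samples contained in that node.
    members = {i: [i] for i in range(n_samples)}
    for step, (left, right) in enumerate(children[: n_samples - n_clusters]):
        members[n_samples + step] = members.pop(left) + members.pop(right)
    # Whatever nodes are left are the clusters; number them arbitrarily.
    labels = [None] * n_samples
    for label, samples in enumerate(members.values()):
        for sample in samples:
            labels[sample] = label
    return labels


# Hypothetical merges for 5 samples, in the same layout as children_.
children = [(0, 1), (3, 4), (2, 5), (6, 7)]
print(cut_tree(children, n_samples=5, n_clusters=2))  # [1, 1, 1, 0, 0]
print(cut_tree(children, n_samples=5, n_clusters=3))  # [1, 1, 0, 2, 2]
```

Trying several values of n_clusters on the same tree is cheap, because the tree itself does not need to be rebuilt.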

How to compute accuracy for cluster evaluation in Weka

How do we compute accuracy for clusters using Weka?
I can use this formula:
Accuracy (A) = (tp+tn)/Total # samples
but how can I find the true positives, false positives, true negatives and false negatives in the output of an experiment in the Weka tool?
There are a few different clustering modes in Weka:
Use training set (default): After clustering, Weka classifies the training instances into clusters it developed and computes the percentage of instances falling in each cluster. For example, X% in cluster 0 and Y% in cluster 1, etc.
Supplied test set: With Weka it is possible to evaluate clusterings on separate test data if the cluster representation is probabilistic, as with the EM algorithm.
Clustering evaluation using classes: In this mode Weka first ignores the class attribute and generates the clustering. During testing, it assigns class labels to the clusters on the basis of the majority value of the class attribute within each cluster. Finally, it computes the classification error and also shows the corresponding confusion matrix.
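The "clustering evaluation using classes" mode described above can be sketched as follows. This follows the majority-vote description in this answer only; Weka's own implementation may differ in details, and the function name and data are made up for illustration.

```python
from collections import Counter


def classes_to_clusters_accuracy(cluster_ids, class_labels):
    """Label each cluster with its majority class, then score those labels."""
    by_cluster = {}
    for cluster_id, class_label in zip(cluster_ids, class_labels):
        by_cluster.setdefault(cluster_id, []).append(class_label)
    # Majority class per cluster (the class attribute is only used here,
    # after the clustering has already been built).
    majority = {
        cluster_id: Counter(labels).most_common(1)[0][0]
        for cluster_id, labels in by_cluster.items()
    }
    correct = sum(
        majority[cluster_id] == class_label
        for cluster_id, class_label in zip(cluster_ids, class_labels)
    )
    return correct / len(class_labels), majority


# Hypothetical cluster assignments and true classes for 7 instances.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
class_labels = ["a", "a", "b", "b", "b", "b", "a"]
accuracy, mapping = classes_to_clusters_accuracy(cluster_ids, class_labels)
print(mapping)   # {0: 'a', 1: 'b'}
print(accuracy)  # 5 of 7 instances match their cluster's majority class
```

Once every cluster carries a class label, the usual confusion-matrix counts (tp, tn, fp, fn) can be read off by comparing the induced labels with the true ones.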
Take a look at cross-validation principles. Use ClusterEvaluation's methods crossValidateModel and evaluateClusterer in your Java code. Or you can also experiment with that in the Weka GUI directly.
Based on this answer to a similar question, the ClassificationViaClustering meta classifier, which can be downloaded through the package manager, will do what you want.

CvSVM.predict() gives 'NaN' output and low accuracy

I am using CvSVM to classify only two types of facial expression. I used LBP(Local Binary Pattern) based histogram to extract features from the images, and trained using cvSVM::train(data_mat,labels_mat,Mat(),Mat(),params), where,
data_mat is of size 200x3452, containing the normalized (0-1) feature histograms of 200 samples in row-major form, with 3452 features each (this depends on the number of neighbourhood points)
labels_mat is the corresponding label matrix, containing only two values, 0 and 1.
The parameters are:
CvSVMParams params;
params.svm_type =CvSVM::C_SVC;
params.kernel_type =CvSVM::LINEAR;
params.C =0.01;
params.term_crit=cvTermCriteria(CV_TERMCRIT_ITER,(int)1e7,1e-7);
The problems are:
While testing I get very bad results (around 10%-30% accuracy), even after trying different kernels and the train_auto() function.
CvSVM::predict(test_data_mat,true) gives 'NaN' output.
I will greatly appreciate any help with this, it's got me stumped.
I suppose that your classes are hard to separate linearly, or not linearly separable at all, in the feature space you use.
It may be better to apply PCA to your dataset before the classifier training step
and estimate the effective dimensionality of this problem.
I also think it would be useful to test your dataset with other classifiers.
You can adapt the standard OpenCV example points_classifier.cpp for this purpose.
It includes a lot of different classifiers with a similar interface that you can play with.
The SVM's generalization power is low here. First reduce your data dimension by principal component analysis, then change your SVM kernel type to RBF.

Bad clustering results with mahout on Reuters 21578 dataset

I've used a part of the Reuters 21578 dataset and Mahout k-means for clustering. To be more specific, I extracted only the texts that have a unique value for the category 'topics'. So I've been left with 9494 texts, each belonging to one of 66 categories. I've used seqdirectory to create sequence files from the texts and then seq2sparse to create the vectors. Then I ran k-means with the cosine distance measure (I've tried tanimoto and euclidean too, with no better luck), cd=0.1 and k=66 (same as the number of categories). I then evaluated the results with the silhouette measure, using custom Java code and the matlab implementation of silhouette (just to be sure that there is no error in my code), and I get an average silhouette of 0.0405 for the clustering. Knowing that the best clustering could give an average silhouette value close to 1, I see that the clustering result I get is no good at all.
So is this due to Mahout, or is the quality of the categorization in the Reuters dataset low?
PS: I'm using Mahout 0.7
PS2: Sorry for my bad English..
I've never actually worked with Mahout, so I cannot say what it does by default, but you might consider checking what sort of distance metric it uses by default. For example, if the metric is Euclidean distance on unnormalized document word counts, you can expect very poor cluster quality, as document length will dominate any meaningful comparison between documents. On the other hand, something like cosine distance on normalized or tf-idf weighted word counts can do much better.
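The effect of document length can be seen on a toy example. The word-count vectors below are made up for illustration and have nothing to do with the Reuters data or with Mahout's implementation.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical raw word counts over a 3-word vocabulary.
short_doc = [2, 1, 0]    # short text on some topic
long_doc = [20, 10, 0]   # same word proportions, ten times longer
other_doc = [0, 1, 2]    # short text with a different word mix

# Euclidean distance puts the short text closer to the unrelated short text.
print(euclidean_distance(short_doc, long_doc) > euclidean_distance(short_doc, other_doc))  # True
# Cosine distance ignores length and puts the two same-topic texts together.
print(cosine_distance(short_doc, long_doc) < cosine_distance(short_doc, other_doc))  # True
```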
One other thing to look at is the distribution of topics in Reuters 21578. It is very skewed towards a few topics such as "acq" or "earn", while others are used only a handful of times. This can make it difficult to achieve good external clustering metrics.
