Validating Output From a Clustering Algorithm - machine-learning

Is there an objective way to validate the output of a clustering algorithm?
I'm using scikit-learn's affinity propagation clustering against a dataset composed of objects with many attributes. The difference matrix supplied to the clustering algorithm is composed of the weighted difference of these attributes. I'm looking for a way to objectively validate tweaks in the distance weightings as reflected in the resulting clusters. The dataset is large and has enough attributes that manual examination of small examples is not a reasonable way to verify the produced clusters.

Yes:
Give the clusters to a domain expert, and have him analyze if the structure the algorithm found is sensible. Not so much if it is new, but if it is sensible.
... and No:
There is not automatic evaluation available that is fair. In the sense that it takes the objective of unsupervised clustering into account: knowledge discovery aka: learn something new about your data.
There are two common ways of evaluating clusterings automatically:
internal cohesion. I.e. there is some particular property such as in-cluser variance compared to between-cluster variance to minimize. The problem is that it's usually fairly trivial to cheat. I.e. to construct a trivial solution that scores really well. So this method must not be used to compare methods based on different assumptions. You can't even fairly compare different types of linkage for hiearchical clustering.
external evaluation. You use a labeled data set, and score algorithms by how well they rediscover existing knowledge. Sometimes this works quite well, so it is an accepted state of the art for evaluation. Yet, any supervised or semi-supervised method will of course score much better on this. As such, it is A) biased towards supervised methods, and B) actually going completely against the knowledge discovery idea of finding something you did not yet know.
If you really mean to use clustering - i.e. learn something about your data - you will at some point have to inspect the clusters, preferrably by a completely independent method such as a domain expert. If he can tell you that e.g. the user group identified by the clustering is a non-trivial group not yet investigated closely, then you are a winner.
However, most people want to have a "one click" (and one-score) evaluation, unfortunately.
Oh, and "clustering" is not really a machine learning task. There actually is no learning involved. To the machine learning community, it is the ugly duckling that nobody cares about.

There is another way to evaluate the clustering quality by computing a stability metric on subfolds, a bit like cross validation for supervised models:
Split the dataset in 3 folds A, B and C. Compute two clustering with you algorithm on A+B and A+C. Compute the Adjusted Rand Index or Adjusted Mutual Information of the 2 labelings on their intersection A and consider this value as an estimate of the stability score of the algorithm.
Rinse-repeat by shuffling the data and splitting it into 3 other folds A', B' and C' and recompute a stability score.
Average the stability scores over 5 or 10 runs to have a rough estimate of the standard error of the stability score.
As you can guess this is very computer intensive evaluation method.
It is still an open research area to know whether or not this Stability-based evaluation of clustering algorithms is really useful in practice and to identify when it can fail to produce a valid criterion for model selection. Please refer to Clustering Stability: An Overview by Ulrike von Luxburg and references therein for an overview of the state of the art on those matters.
Note: it is important to use Adjusted for Chance metrics such as ARI or AMI if you want to use this strategy to select the best value of k in k-means for instance. Non adjusted metrics such as NMI and V-measure will tend to favor models with higher k arbitrarily.

Related

Best Learning model for high numerical dimension data? (with Rapidminer)

I have a dataset of approx. 4800 rows with 22 attributes, all numerical, describing mostly the geometry of rock / minerals, and 3 different classes.
I tried out a cross validation with k-nn Model inside it, with k= 7 and Numerical Measure -> Camberra Distance as parameters set..and I got a performance of 82.53% and 0.673 kappa. Is that result representative for the dataset? I mean 82% is quite ok..
Before doing this, I evaluated the best subset of attributes with a decision table, I got out 6 different attributes for that.
the problem is, you still don't learn much from that kind of models, like instance-based k-nn. Can I get any more insight from knn? I don't know how to visualize the clusters in that high dimensional space in Rapidminer, is that somehow possible?
I tried decision tree on the data, but I got too much branches (300 or so) and it looked all too messy, the problem is, all numerical attributes have about the same mean and distribution, therefore its hard to get a distinct subset of meaningful attributes...
ideally, the staff wants to "Learn" something about the data, but my impression is, that you cannot learn much meaningful of that data, all that works best is "Blackbox" Learning models like Neural Nets, SVM, and those other instance-based models...
how should I proceed?
Welcome to the world of machine learning! This sounds like a classic real-world case: we want to make firm conclusions, but the data rows don't cooperate. :-)
Your goal is vague: "learn something"? I'm taking this to mean that you're investigating, hoping to find quantitative discriminations among the three classes.
First of all, I highly recommend Principal Component Analysis (PCA): find out whether you can eliminate some of these attributes by automated matrix operations, rather than a hand-built decision table. I expect that the messy branches are due to unfortunate choice of factors; decision trees work very hard at over-fitting. :-)
How clean are the separations of the data sets? Since you already used Knn, I'm hopeful that you have dense clusters with gaps. If so, perhaps a spectral clustering would help; these methods are good at classifying data based on gaps between the clusters, even if the cluster shapes aren't spherical. Interpretation depends on having someone on staff who can read eigenvectors, to interpret what the values mean.
Try a multi-class SVM. Start with 3 classes, but increase if necessary until your 3 expected classes appear. (Sometimes you get one tiny outlier class, and then two major ones get combined.) The resulting kernel functions and the placement of the gaps can teach you something about your data.
Try the Naive Bayes family, especially if you observe that the features come from a Gaussian or Bernoulli distribution.
As a holistic approach, try a neural net, but use something to visualize the neurons and weights. Letting the human visual cortex play with relationships can help extract subtle relationships.

Find the best set of features to separate 2 known group of data

I need some point of view to know if what I am doing is good or wrong or if there is better way to do it.
I have 10 000 elements. For each of them I have like 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know those 2 groups I don't try to find them)
For now I am using svm. I train the svm on 2000 of those elements, then I look at how good the score is when I test on the 8000 other elements.
Now I would like to now which features maximize this separation.
My first approach was to test each combination of feature with the svm and follow the score given by the svm. If the score is good those features are relevant to separate those 2 sets of data.
But this takes too much time. 500! possibility.
The second approach was to remove one feature and see how much the score is impacted. If the score changes a lot that feature is relevant. This is faster, but I am not sure if it is right. When there is 500 feature removing just one feature don't change a lot the final score.
Is this a correct way to do it?
Have you tried any other method ? Maybe you can try decision tree or random forest, it would give out your best features based on entropy gain. Can i assume all the features are independent of each other. if not please remove those as well.
Also for Support vectors , you can try to check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it's based more on linear SVM.
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the
ones distributed most differently in the sets of positive and negative examples of
ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj , ck) can be
computed from a contingency table. Let A be the number
of documents in the category containing tj ; B, the number
of documents in the other categories containing tj ; C, the
number of documents of ck which do not contain tj and D,
the number of documents in the other categories which do
not contain tj (with N = A + B + C + D):
Using this contingency table, Information Gain can be estimated by:
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes the separability. The algorithm works by projecting your data in a space where the variance within classes is minimum and the one between classes is maximum.
You can use it reduce the number of dimensions required to classify, and also use it as a linear classifier.
However with this technique you would lose the original features with their meaning, and you may want to avoid that.
If you want more details I found this article to be a good introduction.

Machine Learning Algorithm selection

I am new in machine learning. My problem is to make a machine to select a university for the student according to his location and area of interest. i.e it should select the university in the same city as in the address of the student. I am confused in selection of the algorithm can I use Perceptron algorithm for this task.
There are no hard rules as to which machine learning algorithm is the best for which task. Your best bet is to try several and see which one achieves the best results. You can use the Weka toolkit, which implements a lot of different machine learning algorithms. And yes, you can use the perceptron algorithm for your problem -- but that is not to say that you would achieve good results with it.
From your description it sounds like the problem you're trying to solve doesn't really require machine learning. If all you want to do is match a student with the closest university that offers a course in the student's area of interest, you can do this without any learning.
I second the first remark that you probably don't need machine learning if the student has to live in the same area as the university. If you want to use an ML algorithm, maybe it would best to think about what data you would have to start with. The thing that comes to mind is a vector for a university that has certain subjects/areas for each feature. Then compute a distance from a vector which is like an ideal feature vector for the student. Minimize this distance.
The first and formost thing you need is a labeled dataset.
It sounds like the problem could be decomposed into a ML problem however you first need a set of positive and negative examples to train from.
How big is your dataset? What features do you have available? Once you answer these questions you can select an algorithm that bests fits the features of your data.
I would suggest using decision trees for this problem which resembles a set of if else rules. You can just take the location and area of interest of the student as conditions of if and else if statements and then suggest a university for him. Since its a direct mapping of inputs to outputs, rule based solution would work and there is no learning required here.
Maybe you can use a "recommender system"or a clustering approach , you can investigate more deeply the techniques like "collaborative filtering"(recommender system) or k-means(clustering) but again, as some people said, first you need data to learn from, and maybe your problem can be solved without ML.
Well, there is no straightforward and sure-shot answer to this question. The answer depends on many factors like the problem statement and the kind of output you want, type and size of the data, the available computational time, number of features, and observations in the data, to name a few.
Size of the training data
Accuracy and/or Interpretability of the output
Accuracy of a model means that the function predicts a response value for a given observation, which is close to the true response value for that observation. A highly interpretable algorithm (restrictive models like Linear Regression) means that one can easily understand how any individual predictor is associated with the response while the flexible models give higher accuracy at the cost of low interpretability.
Speed or Training time
Higher accuracy typically means higher training time. Also, algorithms require more time to train on large training data. In real-world applications, the choice of algorithm is driven by these two factors predominantly.
Algorithms like Naïve Bayes and Linear and Logistic regression are easy to implement and quick to run. Algorithms like SVM, which involve tuning of parameters, Neural networks with high convergence time, and random forests, need a lot of time to train the data.
Linearity
Many algorithms work on the assumption that classes can be separated by a straight line (or its higher-dimensional analog). Examples include logistic regression and support vector machines. Linear regression algorithms assume that data trends follow a straight line. If the data is linear, then these algorithms perform quite good.
Number of features
The dataset may have a large number of features that may not all be relevant and significant. For a certain type of data, such as genetics or textual, the number of features can be very large compared to the number of data points.

What is the appropriate Machine Learning Algorithm for this scenario?

I am working on a Machine Learning problem which looks like this:
Input Variables
Categorical
a
b
c
d
Continuous
e
Output Variables
Discrete(Integers)
v
x
y
Continuous
z
The major issue that I am facing is that Output Variables are not totally independent of each other and there is no relation that can be established between them. That is, there is a dependence but not due to the causality (one value being high doesn't imply that the other will be high too but the chances of other being higher will improve)
An Example would be:
v - Number of Ad Impressions
x - Number of Ad Clicks
y - Number of Conversions
z - Revenue
Now, for an Ad to be clicked, it has to first appear on a search, so Click is somewhat dependent on Impression.
Again, for an Ad to be Converted, it has to be first clicked, so again Conversion is somewhat dependent on Click.
So running 4 instances of the problem predicting each of the output variables doesn't make sense to me. Infact there should be some way to predict all 4 together taking care of their implicit dependencies.
But as you can see, there won't be a direct relation, infact there would be a probability that is involved but which can't be worked out manually.
Plus the output variables are not Categorical but are in fact Discrete and Continuous.
Any inputs on how to go about solving this problem. Also guide me to existing implementations for the same and which toolkit to use to quickly implement the solution.
Just a random guess - I think this problem can be targeted by Bayesian Networks. What do you think ?
Bayesian Networks will do fine in your case. Your network won't be that huge either so you can live with exact inference algorithms like graph elimination or junction tree. If you decide to use BNs, then you can use Kevin Murphy's BN toolbox. Here is a link to that. For a more general toolbox that uses Gibbs sampling for approximate Monte Carlo inference, you can use BUGS.
Edit:
As an example look at the famous sprinkler example here. For totally discrete variables, you define the conditional probability tables as in the link. For instance you say that given that today is cloudy, there is a 0.8 probability of rain. You define all probability distributions, where the graph shows the causality relations (i.e. if cloud then rain etc.) Then as query you ask to your inference algorithm questions like, given that grass was wet; was it cloudy, was it raining, was the sprinkler on and so on.
To use BNs one needs a system model that is described in terms of causality relations (Directed Acyclic Graph) and probability transitions. If you wanna learn your system parameters there are techniques like EM algorithm. However, learning the graph structure is a really hard task and supervised machine learning approaches will do better in that case.

Performance Analysis of Clustering Algorithms

I have been given 2 data sets and want to perform cluster analysis for the sets using KNIME.
Once I have completed the clustering, I wish to carry out a performance comparison of 2 different clustering algorithms.
With regard to performance analysis of clustering algorithms, would this be a measure of time (algorithm time complexity and the time taken to perform the clustering of the data etc) or the validity of the output of the clusters? (or both)
Is there any other angle one look at to identify the performance (or lack of) for a clustering algorithm?
Many thanks in advance,
T
It depends a lot on what data you have available.
A common way of measuring the performance is with respect to existing ("external") labels (albeit that would make more sense for classification than for clustering). There are around two dozen measures you can use for this.
When using an "internal" quality measure, make sure that it is independent of the algorithms. For example, k-means optimizes such a measure, and will always come out best when evaluating with respect to this measure.
There are two categories of clustering evaluation methods and the choice depends
on whether a ground truth is available. The first category is the extrinsic methods which require the existence of a ground truth and the other category is the intrinsic methods. In general, extrinsic methods try to assign a score to a clustering, given the ground truth, whereas intrinsic methods evaluate clustering by examining how well the clusters are separated and how compact they are.
For extrinsic methods (remember you need to have a ground available) one option is to use the BCubed precision and recall metrics. The BCubed precion and recall metrics differ from the traditional precision and recall in the sense that clustering is an unsupervised learning technique and therefore we do not know the labels of the clusters beforehand. For this reason BCubed metrics evaluate the precion and recall for evry object in a clustering on a given dataset according to the ground truth. The precision of an example is an indication of how many other examples in the same cluster belong to the same category as the example. The recall of an example reflects how many examples of the same category are assigned to the same cluster. Finally, we can combine these two metrics in one using the F2 metric.
Sources:
Data Mining Concepts and Techniques by Jiawei Han, Micheline, Kamber and Jian Pei
http://www.cs.utsa.edu/~qitian/seminar/Spring11/03_11_11/IR2009.pdf
My own experience in evaluating the performance of clustering
A simple approach for the extrinsic methods where there is a ground truth available is to use a distance metric between clusterings; the ground truth is simply considered to be a clustering. Two good measures to use are the Variation of Information by Meila and, in my humble opinion, the split join distance by myself also discussed by Meila. I do not recommend the Mirkin index or the Rand index - I've written more about it here on stackexchange.
These metrics can be split into two constituent parts, each representing the distance of one of the clusterings to the largest common subclustering. It is worthwhile to consider both parts; if the ground truth part (to common subclustering) is very small, it means that the tested clustering is close to a superclustering; if the other part is small it means that the tested clustering is close to the common subclustering and hence close to a subclustering of the ground truth. In both cases the clustering can be said to be compatible with the ground truth. For more information see the link above.
There are several benchmarks for the clustering algorithms evaluation with extrinsic quality measures (accuracy) and intrinsic measures (some internal statistics of the formed clusters):
Clubmark demonstrated in ICDM'18
WebOCD, see description in the paper
Circulo
ParallelComMetric
CluSim
CoDAR (the sources might be acquired from the paper authors)
Selection of the appropriate benchmark depends on the kind of the clustering algorithm (hard or soft clustering), kind (pairwise relations, attributed datasets or mixed) and size of the clustering data, required evaluation metrics and the admissible amount of the supervision. The Clubmark paper describes evaluation criteria in details.
The Clubmark is developed for the fully automatic parallel evaluation of many clustering algorithms (processing input data specified by the pairwise relations) on many large datasets (millions and billions of clustering elements) and evaluated mostly by accuracy metrics tracing resource consumption (processing and execution time, peak resident memory consumption, etc.).
But for a couple of algorithms on a couple of datasets even the manual evaluation is appropriate.

Resources