How to compute accuracy for cluster evaluation in Weka - machine-learning

How do we compute accuracy for clusters using Weka?
I can use this formula:
Accuracy (A) = (TP + TN) / total number of samples
but how can I know what the true positives, false positives, true negatives and false negatives are in the output of an experiment in the Weka tool?

There are a few different clustering modes in Weka:
Use training set (default): after clustering, Weka assigns the training instances to the clusters it built and reports the percentage of instances falling in each cluster, e.g. X% in cluster 0, Y% in cluster 1, and so on.
Supplied test set: Weka can evaluate a clustering on separate test data if the cluster representation is probabilistic (e.g. the EM algorithm).
Classes to clusters evaluation: in this mode Weka first ignores the class attribute and generates the clustering. During evaluation it assigns a class label to each cluster based on the majority value of the class attribute within that cluster. It then computes the classification error and also shows the corresponding confusion matrix.
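That last mode is what gives you an accuracy figure for a clustering. A minimal sketch of the same majority-class idea outside Weka (plain numpy; the toy arrays are made up for illustration):
import numpy as np

clusters = np.array([0, 0, 1, 1, 1, 0])  # cluster assignment per instance
labels   = np.array([0, 0, 1, 1, 0, 0])  # true class per instance

correct = 0
for c in np.unique(clusters):
    members = labels[clusters == c]
    majority = np.bincount(members).argmax()  # majority class within the cluster
    correct += np.sum(members == majority)

print(correct / len(labels))  # fraction of instances matching their cluster's majority class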

Take a look at cross-validation principles. Use ClusterEvaluation's methods crossValidateModel and evaluateClusterer in your Java code, or experiment with them directly in the Weka GUI.
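If you prefer Python over Java, the python-weka-wrapper3 package wraps the same Weka classes. A rough sketch, assuming that package's usual naming (the method names here are from memory; treat them as assumptions and check the package docs):
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.clusterers import Clusterer, ClusterEvaluation

jvm.start()
# hypothetical ARFF file containing only the attributes to cluster on
data = Loader(classname="weka.core.converters.ArffLoader").load_file("data.arff")
clusterer = Clusterer(classname="weka.clusterers.EM")
clusterer.build_clusterer(data)
# cross-validated log-likelihood (density-based clusterers such as EM only)
print(ClusterEvaluation.crossvalidate_model(clusterer, data, 10, Random(1)))
jvm.stop()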

Based on this answer to a similar question, the ClassificationViaClustering meta-classifier, which can be downloaded through the Weka package manager, will do what you want.

Related

how to find the set of influential features in clusters?

I have 4 clusters and I need to find the set of most influential features in each cluster so that I can get some insight about the characteristics of the cluster and thus to understand the behavior of these clusters. How can I do this?
A rudimentary way of addressing the problem is to look at descriptive statistics for the features of the cluster centroids.
A snippet to find the most influential variables:
import pandas as pd

var_influence = cc.describe()  # cc is a DataFrame containing the cluster centroids
# The descriptive statistics of the cluster centroids are saved in the DataFrame var_influence.
# Sorting the columns by the 'std' row surfaces the variables with the highest standard deviation.
var_influence.sort_values(axis=1, by='std', ascending=False).iloc[:, :10]
This is quicker and better for finding the influential variables than the box-plot approach (which is hard to visualise as the number of features grows). Since all the variables are normalised, it is very easy to compare across features.
A max-min approach can also be used; it shows the variables with the largest range. As all the variables are normalised, max-min is a good way to validate the above result. Code for the same below:
pd.Series(var_influence.loc['max'] - var_influence.loc['min']).sort_values(ascending=False)[:10]
Multiclass classification
A more serious approach to finding the influential features is multi-class classification: the cluster labels are used as the target variable to train a multi-class classification model on the data. The resulting model coefficients can be used to determine the importance of the features.
The approach that I use is to train a classifier to predict each cluster label (1 if the instance belongs to the corresponding cluster, 0 otherwise), and then use the model attributes to determine the most discriminating variables per cluster. I've been doing that with RandomForest and the feature_importances_ attribute in scikit-learn, and I have always had very good results.
I then use boxplots / density plots to represent the distributions of those variables per cluster.
You can also use more traditional approaches, like comparing the means by cluster for each variable, and use statistical tests such as ANOVA to get more reliable results (a scipy sketch follows after the example below).
Edit: Here is an example in Python :
from sklearn.ensemble import RandomForestClassifier

cols = data.columns[1:-4]  # names of the feature columns used below (dataset-specific slice)
for cl in data.cluster.unique():
    # one-vs-rest target: 1 for the current cluster, -1 for everything else
    custom_target = data.cluster.copy()
    custom_target.loc[custom_target != cl] = -1
    custom_target.loc[custom_target == cl] = 1
    clf = RandomForestClassifier(n_estimators=100, random_state=10)
    clf.fit(data.values[:, 1:-4], custom_target)
    # rank the features by importance, most discriminating first
    imps, features = zip(*sorted(zip(clf.feature_importances_, cols), reverse=True))
    # store the results as you like
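For the ANOVA comparison mentioned above, a minimal sketch with scipy's f_oneway, reusing the hypothetical data DataFrame and cols from the loop (a small p-value suggests the feature's mean differs across clusters):
from scipy.stats import f_oneway

for col in cols:
    groups = [data.loc[data.cluster == cl, col] for cl in data.cluster.unique()]
    stat, pvalue = f_oneway(*groups)  # one-way ANOVA across the clusters
    print(col, round(pvalue, 4))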

How does APPLY_KMEANS work in Vertica

I am testing the machine learning tools in Vertica. I understand how KMEANS works, since it just divides the data into clusters. However, I do not understand how APPLY_KMEANS works on new data.
It looks to me like it acts more like a classification method, since it classifies new data into the existing clusters. So what algorithm is used (k-nearest neighbors)? It's not very clear from the documentation.
k-means is a clustering algorithm (not classification!) that iterates over two steps:
Assignment step: assign each point to its nearest centroid
Update step: update the centroid coordinates
When you build your k-means model, you first initialize the centroids (different strategies exist, e.g. random initialization), then you iterate until your clustering is good enough (i.e. your error falls below a given threshold).
What defines your model is actually your centroids.
When using APPLY_KMEANS, you run a single assignment step using the data from your query and the centroids from your model. Points are then assigned to clusters depending on their distance to the centroids.
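A minimal sketch of that assignment step (plain numpy, not Vertica's actual implementation; the centroids and points are made up):
import numpy as np

centroids = np.array([[0.0, 0.0], [5.0, 5.0]])  # what the k-means model stores
points    = np.array([[0.5, 1.0], [4.0, 6.0]])  # new data from the query

# Euclidean distance of every point to every centroid, then pick the nearest
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
print(dists.argmin(axis=1))  # cluster assignment per point: [0 1]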
Hope it helps
pltrdy
Note about clustering vs classification:
It is tempting to think of clustering as a kind of classification. Still, classification refers only to supervised learning, while clustering corresponds to unsupervised learning. Thus, don't do it :)

labelling of dataset in machine learning

I have a question about some basic concepts of machine learning. The examples I observed gave a brief overview: for training the system, a feature vector is given as input. In the case of supervised learning, the dataset is labelled. I am confused about labelling. For example, if I have to distinguish between two types of pictures, I will provide a feature vector, and on the output side for testing I'll provide 1 for type A and 2 for type B. But if I want to extract a region of interest (ROI) from a dataset of images, how will I label my data to extract the ROI using an SVM? I hope I have conveyed my confusion. Thanks in anticipation.
In supervised learning, such as with SVMs, the dataset should be composed as follows:
<i-th feature vector><i-th label>
where i goes from 1 to the number of patterns (also called examples or observations) in your training set; each of these tuples represents a single record in your training set which can be used to train the SVM classifier.
So you basically have a set composed of such tuples, and if you have just two labels (a binary classification problem) you can easily use an SVM. The SVM model is trained on the training set and the training labels, and once the training phase has finished you can use another set (called the validation set or test set), structured in the same way as the training set, to test the accuracy of your SVM.
In other words the SVM workflow should be structured as follows:
train the SVM using the training set and the training labels
predict the labels for the validation set using the model trained in the previous step
if you know the actual validation labels, you can match the predicted labels against them and check how many were correctly predicted. The ratio between the number of correctly predicted labels and the total number of labels in the validation set is a scalar in [0, 1] called the accuracy of your SVM model (see the sketch after this list)
if you're interested in the ROI, you might want to inspect the trained SVM parameters (mainly the weights and bias) to reconstruct the separating hyperplane
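A minimal scikit-learn sketch of the first three steps (the dataset here is synthetic, purely for illustration):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# toy binary dataset standing in for the feature vectors and labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)  # step 1: train on the training set
y_pred = clf.predict(X_val)                    # step 2: predict validation labels
print(accuracy_score(y_val, y_pred))           # step 3: accuracy, a scalar in [0, 1]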
It is also important to know that the training set records should be correctly labelled a priori: if the training labels are not correct, the SVM will never be able to correctly predict the output for previously unseen patterns. You do not have to label your data according to the ROI you want to extract; the data must be correctly labelled a priori. The SVM will take the entire set of type A pictures and the set of type B pictures and learn the decision boundary that separates them. Do not trick the labels: if you do, you're not doing classification, machine learning or pattern recognition. You're basically tricking the results.

data imbalance in SVM using libSVM

How should I set my gamma and cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm getting a constant error of having all the predicted labels set to 'true' due to the data imbalance.
If the issue isn't with libSVM but with my dataset, how should I handle this imbalance from a theoretical machine learning standpoint? The number of features I'm using is between 4 and 10, and I have a small set of 250 data points.
Class imbalance has nothing to do with the selection of C and gamma. To deal with this issue you should use a class weighting scheme, which is available in, for example, the scikit-learn package (built on libsvm).
Selection of the best C and gamma is performed using grid search with cross-validation. You should try a vast range of values here: for C it is reasonable to try values between 1 and 10^15, while a simple and good heuristic for the gamma range is to compute the pairwise distances between all your data points and select gamma according to the percentiles of this distribution. Think of it as putting a Gaussian with variance 1/gamma on each point: if you select a gamma such that these Gaussians overlap with many points, you get a very "smooth" model, while a small variance leads to overfitting.
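A minimal scikit-learn sketch combining both ideas, class weighting plus a grid search over C and gamma (the grids and the X_train/y_train names are placeholders):
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [1, 10, 100, 1000],         # widen as needed, up to very large values
    "gamma": [0.001, 0.01, 0.1, 1],  # e.g. derived from pairwise-distance percentiles
}
# class_weight='balanced' scales C per class, inversely to class frequency
search = GridSearchCV(SVC(class_weight="balanced"), param_grid, cv=5)
search.fit(X_train, y_train)  # X_train, y_train: your labelled data
print(search.best_params_)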
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class; this basically means changing C. Typically the smallest class gets weighed higher; a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this using its -wX flags (see the sketch after this list).
Subsample the overrepresented class to obtain an equal number of positives and negatives and proceed with training as you traditionally would for a balanced set. Note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
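A small sketch of the weighting rule npos * wpos = nneg * wneg, expressed here as a scikit-learn class_weight dict (the counts approximate the 75%/25% split from the question; with raw LIBSVM the equivalent would be the -w flags):
from sklearn.svm import SVC

n_pos, n_neg = 187, 63  # roughly 75% 'true' / 25% 'false' of 250 points
# pick weights so that n_pos * w_pos == n_neg * w_neg
w_pos, w_neg = 1.0, n_pos / n_neg
clf = SVC(class_weight={1: w_pos, 0: w_neg})
# clf.fit(X, y)  # X, y: your data, labels encoded as 1='true', 0='false'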
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthesized data for the minority class. It is a simple algorithm that can deal with some imbalanced datasets pretty well.
In each iteration, SMOTE takes two random instances of the minority class and adds an artificial example of the same class somewhere on the segment between them. The algorithm keeps injecting such samples into the dataset until the two classes become balanced or some other criterion is met (e.g. a certain number of examples has been added).
Associating a weight with the minority class is a special case of this algorithm: when you associate weight w_i with instance i, you are basically adding w_i - 1 extra copies of instance i!
What you need to do is augment your initial dataset with the samples created by this algorithm and train the SVM on this new dataset. You can find many implementations online in different languages, such as Python and Matlab.
There have been other extensions of this algorithm; I can point you to more materials if you want.
To test the classifier you need to split the dataset into train and test sets, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally test it on the test set. If you include the generated instances when testing, you will end up with a biased (and ridiculously higher) accuracy and recall.
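A minimal sketch with the imbalanced-learn package, which provides a SMOTE implementation (note that only the training split is resampled, exactly as stressed above; X and y stand for your full dataset):
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X, y: your full feature matrix and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)  # train split only
clf = SVC().fit(X_res, y_res)
print(clf.score(X_test, y_test))  # tested on untouched, original-distribution data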

percentage in classification of learning algorithm

I'm using Weka. I have a training set, and the class of the examples in the training set is boolean.
After I have the training set, I want to predict the percentage of new input being true or false. I want to get a number between 0 and 1, not only 0 or 1.
How can I do that? I have seen that in the prediction there are only the possible classes.
Thanks in advance.
You can only make the same kind of prediction with the learned classifier -- it learns to make the predictions you train it to make. The kind of prediction you want sounds more like regression. That is, you don't want a strict classification, but a continuous value designating the membership probability.
The easiest way to achieve what you want is to replace the Booleans in your training set with 0/1 values and learn a regression model. This will give you numbers, although not necessarily only between 0 and 1.
To get real probabilities, you would need to use a classifier that calculates probabilities (such as Naive Bayes) and write some custom code (using the Weka library) to retrieve them. See the javadoc of Classifier.distributionForInstance, the method that gives you access to the class probabilities.
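As a quick illustration of the same idea outside Weka, scikit-learn exposes class probabilities through predict_proba (Naive Bayes on toy data, purely for illustration):
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5, random_state=0)  # 0/1 target
clf = GaussianNB().fit(X, y)
# one row per instance: [P(class 0), P(class 1)], each between 0 and 1
print(clf.predict_proba(X[:3]))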
