I have 4 clusters and I need to find the set of most influential features in each cluster so that I can get some insight about the characteristics of the cluster and thus to understand the behavior of these clusters. How can I do this?
A rudimentary method of addressing the problem is by finding the descriptive statistics for the features of the cluster centroids.
Snippet to find the most influencing variables:
var_influence=cc.describe() #cc contains the cluster centroids
# The descriptive statistics of the cluster centroids are saved in a Dataframe var_influence.
# Sorting by standard deviation will give the variables with high standard deviation.
var_influence.sort_values(axis=1, by='std', ascending=False).iloc[:,:10]
This way it is quicker and better to find the influencing variables when compared to the box plot way (Which is hard to visualise with increasing features). As all the variables are normalised it is very easy to compare across features.
A max-min approach can also be used, this will allow us to see the variables with maximum bandwidth. As all the variables are normalised the max-min is a good way to validate the above result.Code for the same below
pd.Series(var_influence.loc['max']-var_influence.loc['min']).sort_values(ascending=False)[:10]
Multiclass classification
A more serious approach to find the influencing features is Multi-class classification: The cluster labels are used as a target variable to train a multi-class classification model on the data. The resulting model coefficients can be used to determine the importance of the features.
The approach that I use is to train a classifier to predict each cluster label (1 if the corresponding cluster, 0 else), and then use the model attributes to determine the most discriminating variables per cluster. I've been doing that with RandomForest and the attribute feature_importances_ in sickit learn and I always got very good results.
I then use boxplots / density plots to represent the distributions of those variables per cluster.
You can also more traditional approaches, like comparing the means by cluster for each variable, and use statistical tests like ANOVA to get more reliable results.
Edit: Here is an example in Python :
for cl in data.cluster.unique():
custom_target = data.cluster.copy()
custom_target.loc[custom_target != cl] = -1
custom_target.loc[custom_target == cl] = 1
clf = RandomForestClassifier(100 , random_state = 10)
clf.fit(data.values[: , 1:-4], custom_target)
imps , features = zip(*sorted(zip(clf.feature_importances_, cols) , reverse = True))
# store the results as you like
Related
K-means clustering in sklearn, number of clusters is known in advance (it is 2).
There are multiple features. Feature values are initially without any weight assigned, i.e. they are treated equally weighted. However, task is to assign custom weights to each feature, in order to get best possible clusters separation.
How to determine optimum sample weights (sample_weight) for each feature, in order to get best possible separation of the two clusters?
If this is not possible for k-means, or for sklearn, I am interested in any alternative clustering solution, the point is that I need method of automatic determination of appropriate weights for multivariate features, in order to maximize clusters separation.
In meantime, I have implemented following: clustering by each component separately, then calculating silhouette score, calinski harabasz score, dunn score and inverse davies bouldin score for each component (feature) separately. Then scaling those scores to same magnitude, then PCA them to 1 feature. This produced weights for each component. It seems this approach produces reasonable results. I suppose better approach would be full factorial experiment (DOE), but it seems that this simple approach produces satisfactory results as well.
How do we compute accuracy for clusters using Weka?
I can use this formula:
Accuracy (A) = (tp+tn)/Total # samples
but how can I know what is the true positive, false positive, true negative and false negative in the output of experiment in the Weka tool?
There are a few different clustering modes in Weka:
Use training set (default): After clustering, Weka classifies the training instances into clusters it developed and computes the percentage of instances falling in each cluster. For example, X% in cluster 0 and Y% in cluster 1, etc.
Supplied test set: It is possible with Weka to evaluate clusterings on separate test data if the cluster representation is probabilistic like EM algorithm.
Clustering evaluation using classes: In this mode Weka first ignores the class attribute and generates the clustering. During testing, it assigns class labels to the clusters on the basis of the majority value of the class attribute within each cluster. Finally, it computes the classification error and also shows the corresponding confusion matrix.
Take a look on cross-validation principles. Use ClusterEvaluation 's methods crossValidateModel and evaluateClusterer in your java code. Or you can also experiment that with the weka GUI directly.
Based on this answer to a similar question the classificationViaClusteringmeta classifier which can be downloaded through the package manager will do what you want.
I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
one way to do this is to use precision and recall, as defined here in wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?
I've compiled, a while ago, a list of metrics used to evaluate classification and regression algorithms, under the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The technique is mostly to run an algorithm on some data to get a model, and then apply that model on new, previously unseen data, and evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jacknife
Crossvalidation
K-fold validation
bootstrap.
Talking about ML in general is a quite vast field, but I'll try to answer any way. The Wikipedia definition of ML is the following
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context learning can be defined parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is well known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I supposed this case to keep it quite simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding - at the moment - how your algorithm would look like.
Now on the one hand - depending on the algorithm - if you feed your algorithm with one of the remaining repetitions, it may give you some certainty estimate which may be used to characterize the recognition of just one of the repetitions. On the other hand you may use all of the remaining repetitions to test the learned algorithm. For each of the repetitions you pass it to the algorithm and compare the expected output with the actual output. After all you'll have an accuracy value for the learned algorithm calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good start to read on would be Pattern Recognition and Machine Learning by Christopher M Bishop
There are various metrics for evaluating the performance of ML model and there is no rule that there are 20 or 30 metrics only. You can create your own metrics depending on your problem. There are various cases wherein when you are solving real - world problem where you would need to create your own custom metrics.
Coming to the existing ones, it is already listed in the first answer, I would just highlight each metrics merits and demerits to better have an understanding.
Accuracy is the simplest of the metric and it is commonly used. It is the number of points to class 1/ total number of points in your dataset. This is for 2 class problem where some points belong to class 1 and some to belong to class 2. It is not preferred when the dataset is imbalanced because it is biased to balanced one and it is not that much interpretable.
Log loss is a metric that helps to achieve probability scores that gives you better understanding why a specific point is belonging to class 1. The best part of this metric is that it is inbuild in logistic regression which is famous ML technique.
Confusion metric is best used for 2-class classification problem which gives four numbers and the diagonal numbers helps to get an idea of how good is your model.Through this metric there are others such as precision, recall and f1-score which are interpretable.
How should I set my gamma and Cost parameters in libSVM when I am using an imbalanced dataset that consists of 75% 'true' labels and 25% 'false' labels? I'm getting a constant error of having all the predicted labels set on 'True' due to the data imbalance.
If the issue isn't with libSVM, but with my dataset, how should I handle this imbalance from a Theoretical Machine Learning standpoint? *The number of features I'm using is between 4-10 and I have a small set of 250 data points.
Classes imbalance has nothing to do with selection of C and gamma, to deal with this issue you should use the class weighting scheme which is avaliable in for example scikit-learn package (built on libsvm)
Selection of best C and gamma is performed using grid search with cross validation. You should try vast range of values here, for C it is reasonable to choose values between 1 and 10^15 while a simple and good heuristic of gamma range values is to compute pairwise distances between all your data points and select gamma according to the percentiles of this distribution - think about putting in each point a gaussian distribution with variance equal to 1/gamma - if you select such gamma that this distribution overlaps will many points you will get very "smooth" model, while using small variance leads to the overfitting.
Imbalanced data sets can be tackled in various ways. Class balance has no effect on kernel parameters such as gamma for the RBF kernel.
The two most popular approaches are:
Use different misclassification penalties per class, this basically means changing C. Typically the smallest class gets weighed higher, a common approach is npos * wpos = nneg * wneg. LIBSVM allows you to do this using its -wX flags.
Subsample the overrepresented class to obtain an equal amount of positives and negatives and proceed with training as you traditionally would for a balanced set. Take note that you basically ignore a large chunk of data this way, which is intuitively a bad idea.
I know this has been asked some time ago, but I would like to answer it since you might find my answer useful.
As others have mentioned, you might want to consider using different weights for the minority classes or using different misclassification penalties. However, there is a more clever way of dealing with the imbalanced datasets.
You can use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to generate synthesized data for the minority class. It is a simple algorithm that can deal with some imbalance datasets pretty well.
In each iteration of the algorithm, SMOTE considers two random instances of the minority class and add an artificial example of the same class somewhere in between. The algorithm keeps injecting the dataset with the samples until the two classes become balanced or some other criteria(e.g. add certain number of examples). Below you can find a picture describing what the algorithm does for a simple dataset in 2D feature space.
Associating weight with the minority class is a special case of this algorithm. When you associate weight $w_i$ with instance i, you are basically adding the extra $w_i - 1$ instances on top of the instance i!
What you need to do is to augment your initial dataset with the samples created by this algorithm, and train the SVM with this new dataset. You can also find many implementation online in different languages like Python and Matlab.
There have been other extensions of this algorithm, I can point you to more materials if you want.
To test the classifier you need to split the dataset into test and train, add synthetic instances to the train set (DO NOT ADD ANY TO THE TEST SET), train the model on the train set, and finally test it on the test set. If you consider the generated instances when you are testing you will end up with a biased(and ridiculously higher) accuracy and recall.
I am working on a classification problem, which has different sensors. Each sensor collect a sets of numeric values.
I think its a classification problem and want to use weka as a ML tool for this problem. But I am not sure how to use weka to deal with the input values? And which classifier will best fit for this problem( one instance of a feature is a sets of numeric value)?
For example, I have three sensors A ,B, C. Can I define 5 collected data from all sensors,as one instance? Such as, One instance of A is {1,2,3,4,5,6,7}, and one instance of B is{3,434,534,213,55,4,7). C{424,24,24,13,24,5,6}.
Thanks a lot for your time on reviewing my question.
Commonly the first classifier to try is Naive Bayes (you can find it under "Bayes" directory in Weka) because it's fast, parameter less and the classification accuracy is hard to beat whenever the training sample is small.
Random Forest (you can find it under "Tree" directory in Weka) is another pleasant classifier since it process almost any data. Just run it and see whether it gives better results. It can be just necessary to increase the number of trees from the default 10 to some higher value. Since you have 7 attributes 100 trees should be enough.
Then I would try k-NN (you can find it under "Lazy" directory in Weka and it's called "IBk") because it commonly ranks amount the best single classifiers for a wide range of datasets. The only issues with k-nn are that it scales badly for large datasets (> 1GB) and it needs to fine tune k, the number of neighbors. This value is by default set to 1 but with increasing number of training samples it's commonly better to set it up to some higher integer value in range from 2 to 60.
And finally for some datasets where both, Naive Bayes and k-nn performs poorly, it's best to use SVM (under "Functions", it's called "Lib SVM"). However, it can be hassle to set up all the parameters of the SVM to get competitive results. Hence I leave it to the end when I already know what classification accuracies to expect. This classifier may not be the most convenient if you have more than two classes to classify.