How to derive cluster properties - machine-learning

I have clustered ~40000 points into 79 clusters. Each point is a vector of 18 features. I want to derive the characteristics of each cluster, i.e. its prominent features. Are there machine-learning algorithms for this?

If you are confident the clusters are meaningful for your particular needs, you could view it as a classification problem.
One option would be to apply a feature selection algorithm to rank the features. You could use recursive feature elimination to identify a subset of features that is predictive of the cluster labels.
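A minimal sketch of that idea with scikit-learn; the random X and labels are stand-ins for your real feature matrix and cluster assignments:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 18))          # stand-in for your 40000 x 18 matrix
labels = rng.integers(0, 79, size=5000)  # stand-in for your 79 cluster labels

# Recursive feature elimination: repeatedly fit a classifier on the
# cluster labels and drop the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, labels)
print(rfe.ranking_)  # rank 1 = kept; higher ranks were eliminated earlier
```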
Another good option for interpreting the clusters could be building a decision tree. With decision trees you can see what features are used to best separate the classes (clusters in your case). You could also use an ensemble like Random Forest and ask for feature importance scores.
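A minimal sketch of both suggestions, reusing the X and labels stand-ins from above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# A shallow tree gives human-readable splits that separate the clusters.
tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)
print(export_text(tree))

# A forest gives a per-feature importance score.
forest = RandomForestClassifier(n_estimators=100).fit(X, labels)
for i in forest.feature_importances_.argsort()[::-1][:5]:
    print(f"feature {i}: importance {forest.feature_importances_[i]:.3f}")
```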

Related

Relation between coefficients in linear regression and feature importance in decision trees

I recently started a machine learning (ML) project that needs to identify the features (inputs a1, a2, a3 ... an) that have a large impact on the target/outputs.
I used linear regression to get the coefficients of the features, and a decision tree algorithm (for example, Random Forest Regressor) to get the important features (or feature importances).
Is my understanding right that a feature with a large coefficient in linear regression should also be near the top of the feature importance list in the decision tree algorithm?
Not really. If your input features are not normalized, you could have a relatively big coefficient for features with a relatively big mean/std. If your features are normalized, then yes, this could be an indicator of feature importance, but there are still other things to consider.
You could try some of sklearn's feature selection classes, which should do this automatically for you.
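A minimal sketch of the scaling point above, on made-up toy data: standardize the features first, then compare the linear-regression coefficients with random-forest importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) * [1, 10, 100, 1, 1]   # very different scales
y = 2 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(size=1000)

# Without scaling, coefficient magnitudes reflect feature scale;
# after StandardScaler they become comparable across features.
Xs = StandardScaler().fit_transform(X)
print(np.abs(LinearRegression().fit(Xs, y).coef_))
print(RandomForestRegressor(n_estimators=100).fit(X, y).feature_importances_)
```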
The short answer to your question is no, not necessarily, considering the fact that we do not know what your different inputs are, whether they are in the same unit system, their range of variation, etc.
I am not sure why you have combined linear regression with a decision tree. But I will just assume you have a working model, say a linear regression that provides good accuracy on the test set. From what you have asked, you probably need to look at sensitivity analysis based on the obtained model. I would suggest doing some reading on the SALib library and on the subject of sensitivity analysis in general.
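A minimal SALib sketch, assuming a fitted regression model and uniform input bounds (the exact sampling module can differ between SALib versions):

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol
from sklearn.linear_model import LinearRegression

# Toy "working model": a linear regression fitted on made-up data.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1, size=(500, 3))
y_train = 3 * X_train[:, 0] + X_train[:, 1] + rng.normal(scale=0.1, size=500)
model = LinearRegression().fit(X_train, y_train)

problem = {"num_vars": 3, "names": ["a1", "a2", "a3"], "bounds": [[0, 1]] * 3}
samples = saltelli.sample(problem, 1024)       # Sobol sample of the input space
Si = sobol.analyze(problem, model.predict(samples))
print(Si["S1"])  # first-order sensitivity index per input
```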

Difference between Content Based Recommender and K means clustering

As the name says, it's a relatively straightforward question. In both, we calculate the similarity between two items (using different measures, of course), and we recommend the items closest to the item the user just used. Can anybody explain to me how the two are different things?
From a conceptual perspective, a Content-Based Recommender is a recommender system, and it does not necessarily work with clustering strategies; it could implement any strategy. A Content-Based Recommender could apply classification, prediction, clustering, or merge all of these strategies to provide a recommendation to something we call a Decision Support System.
K-means is a strategy that uses the attributes of a dataset as vectors and, based on the Euclidean distance between items, assigns each item in the dataset to one of a given number k of clusters.
A Content-Based Recommender could use k-means as part of a strategy to provide a recommendation to a Decision Support System.
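For illustration only, a minimal sketch of that combination: cluster item feature vectors with k-means, then recommend the nearest items from the same cluster as the item the user just used.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
items = rng.normal(size=(200, 10))       # stand-in item feature vectors
km = KMeans(n_clusters=8, n_init=10).fit(items)

just_used = 0                            # the item the user just used
same_cluster = np.flatnonzero(km.labels_ == km.labels_[just_used])
dists = np.linalg.norm(items[same_cluster] - items[just_used], axis=1)
print(same_cluster[dists.argsort()[1:4]])  # 3 nearest items, skipping itself
```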

Difference between feature selection, clustering and dimensionality reduction algorithms

Could someone explain the difference between feature selection, clustering and dimensionality reduction algorithms?
Feature selection algorithms find the predominant variables: the ones that best represent the data or are the best predictors of the class (e.g., GBM / lasso).
Clustering helps us see which clusters of variables clearly define the output.
Isn't this the same as a dimensionality reduction algorithm? Doesn't feature selection + clustering do the same as dimensionality reduction algorithms?
Feature Selection:
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.
Clustering:
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
Dimensionality Reduction:
In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction.
When you have many features and want to use only some of them, you can apply feature selection (e.g., mRMR). So that means you have applied a form of dimensionality reduction.
However, clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields (see Clustering in Machine Learning). When you want to group (cluster) different data points according to their features, you can apply clustering (e.g., k-means) with or without dimensionality reduction.
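A minimal sketch contrasting the three ideas on the same toy data: SelectKBest keeps a subset of the original columns (feature selection), PCA builds new columns (feature extraction, another form of dimensionality reduction), and KMeans groups the rows, not the columns (clustering).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)   # 2 original features
X_pca = PCA(n_components=2).fit_transform(X)              # 2 new components
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # group the samples
print(X_sel.shape, X_pca.shape, labels.shape)
```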

Machine Learning Text Classification technique

I am new to machine learning. I am working on a project where a machine learning concept needs to be applied.
Problem statement:
I have a large number (say 3000) of keywords. These need to be classified into seven fixed categories. Each category has training data (sample keywords). I need to come up with an algorithm such that, when a new keyword is passed to it, it predicts which category the keyword belongs to.
I am not aware of which text classification technique should be applied for this. Are there any tools that can be used?
Please help.
Thanks in advance.
This comes under linear classification. You can use a naive Bayes classifier for this. Most ML frameworks will have an implementation of naive Bayes, e.g., Mahout.
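If you prefer Python, here is a minimal naive Bayes sketch with scikit-learn; the toy keywords and categories are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_keywords = ["mortgage rate", "touchdown pass", "stock dividend", "free kick"]
train_labels = ["finance", "sports", "finance", "sports"]

# Bag-of-words counts feeding a multinomial naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_keywords, train_labels)
print(clf.predict(["penalty kick"]))  # -> ['sports']
```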
Yes, I would also suggest using Naive Bayes, which is more or less the baseline classification algorithm here. On the other hand, there are obviously many other algorithms; random forests and support vector machines come to mind. See http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/ If you use a standard toolkit such as Weka, RapidMiner, etc., these algorithms should be available. There is also OpenNLP for Java, which comes with a maximum entropy classifier.
You could use the Word2Vec word cosine distance between descriptions of each of your categories and the keywords in the dataset, and then simply match each keyword to the category with the closest distance.
Alternatively, you could create a training dataset from keywords already matched to categories and use any ML classifier, for example one based on artificial neural networks, using the vectors of keyword cosine distances to each category as the input to your model. But this could require a large quantity of training data to reach good accuracy. For example, the MNIST dataset contains 70000 samples, which allowed me to reach 99.62% cross-validation accuracy with a simple CNN; for another dataset with only 2000 samples, I reached only about 90% accuracy.
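A minimal sketch of the cosine-matching idea; `embed` is a hypothetical stand-in for a real word-vector lookup (e.g., a loaded Word2Vec model), faked here with random vectors purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = {}

def embed(text):
    # Hypothetical: average the word vectors of `text`. A real version
    # would look words up in a trained Word2Vec model instead.
    return np.mean([vectors.setdefault(w, rng.normal(size=50))
                    for w in text.split()], axis=0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

categories = {"finance": embed("money banking investment"),
              "sports": embed("football game team")}
keyword_vec = embed("stock market")
# Pick the category whose description vector is closest to the keyword.
print(max(categories, key=lambda c: cosine(categories[c], keyword_vec)))
```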
There are many classification algorithms. Yours looks to be a text classification problem; some good classifiers to try out would be SVM and naive Bayes. For SVM, the liblinear and libshorttext classifiers are good options (and have been used in many industrial applications):
liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
libshorttext: https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
They are also included in ML tools such as scikit-learn and WEKA.
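For example, a minimal sketch with scikit-learn's LinearSVC (which wraps liblinear); the toy data is made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_keywords = ["interest rate", "goal keeper", "hedge fund", "world cup"]
train_labels = ["finance", "sports", "finance", "sports"]

# TF-IDF features feeding a linear SVM.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_keywords, train_labels)
print(clf.predict(["fund manager"]))  # -> ['finance']
```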
Even with a classifier in hand, it still takes some work to build and validate a practically useful one. One of the challenges is to mix discrete (boolean and enumerable) and continuous ('numbers') predictive variables seamlessly; some algorithmic preprocessing is generally necessary.
Neural networks do offer the possibility of using both types of variables. However, they require skilled data scientists to yield good results. A straightforward option is to use an online classifier web service like Insight Classifiers to build and validate a classifier in one go; N-fold cross-validation is used there.
You can represent the presence or absence of each word in a separate column; the outcome variable is the desired category.
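A minimal sketch of that representation, using a binary bag-of-words matrix:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free kick", "stock dividend stock"]
vec = CountVectorizer(binary=True)    # 1 if the word occurs, else 0
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())    # one column per word
print(X.toarray())                    # presence/absence rows per document
```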

More accurate approach than k-means clustering

In a Radial Basis Function network (RBF network), all the prototypes (the center vectors of the RBF functions) in the hidden layer have to be chosen. This step can be performed in several ways:
Centers can be randomly sampled from some set of examples.
Or, they can be determined using k-means clustering.
One approach to making an intelligent selection of prototypes is to perform k-means clustering on the training set and use the cluster centers as the prototypes.
We all know that k-means clustering is characterized by its simplicity (it is fast), but it is not very accurate.
That is why I would like to know what other approaches can be more accurate than k-means clustering.
Any help will be much appreciated.
Several k-means variations exist: k-medians, Partitioning Around Medoids (PAM), fuzzy c-means clustering, Gaussian mixture models trained with the expectation-maximization algorithm, k-means++, etc.
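As one concrete example from that list, a minimal sketch of a Gaussian mixture model fitted with expectation-maximization in scikit-learn, on made-up blob data:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, covariance_type="full").fit(X)
print(gmm.means_)                # component means (analogue of centroids)
print(gmm.predict_proba(X[:3]))  # soft assignments rather than hard labels
```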
I use PAM (Partitioning Around Medoids) to be more accurate when my dataset contains some outliers (noise with values that are very different from the other values) and I don't want the centers to be influenced by this data. In PAM, a center is called a medoid.
There is a more statistical approach to cluster analysis, called the Expectation-Maximization algorithm, which uses statistical modeling to determine the clusters. This is probably a better approach when you have a lot of data regarding your cluster centroids and training data.
This link also lists several other clustering algorithms out there in the wild. Obviously, some are better than others, depending on the amount and/or the type of data you have.
There is a wonderful course on Udacity, Intro to Artificial Intelligence, where one lesson is dedicated to unsupervised learning and Professor Thrun explains some clustering algorithms in great detail. I highly recommend that course!
I hope this helps!
In terms of K-Means, you can run it on your sample a number of times (say, 100) and then choose the clustering (and, by consequence, the centroids) that has the smallest K-Means criterion value (the sum of the squared Euclidean distances between each entity and its respective centroid).
You can also use some initialization algorithms (the intelligent K-Means comes to mind, but you can also google K-Means++). You can find a very good review of K-Means in the paper by A. K. Jain, "Data clustering: 50 years beyond K-means".
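A minimal sketch of both points with scikit-learn, whose KMeans runs n_init restarts, keeps the run with the smallest criterion (exposed as inertia_), and seeds with K-Means++ by default; the blob data is made up:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
km = KMeans(n_clusters=4, init="k-means++", n_init=100).fit(X)
print(km.inertia_)  # best (lowest) criterion value over the 100 restarts
```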
You can also check hierarchical methods, such as the Ward method.
