Could someone explain the difference between feature selection, clustering, and dimensionality reduction algorithms?
Feature selection algorithms: these find the predominant variables, either the ones that best represent the data or the best parameters for indicating the class.
e.g., GBM / lasso
Clustering helps us see which clusters of variables clearly define the output.
Isn't this the same as a dimensionality reduction algorithm?
Doesn't feature selection + clustering do the same thing as dimensionality reduction algorithms?
Feature Selection:
In machine learning and statistics, feature selection, also known as
variable selection, attribute selection or variable subset selection,
is the process of selecting a subset of relevant features (variables,
predictors) for use in model construction.
Clustering:
Cluster analysis or clustering is the task of grouping a set of
objects in such a way that objects in the same group (called a
cluster) are more similar (in some sense or another) to each other
than to those in other groups (clusters).
Dimensionality Reduction:
In machine learning and statistics, dimensionality reduction or
dimension reduction is the process of reducing the number of random
variables under consideration, and can be divided into feature
selection and feature extraction.
When you have many features and want to use only some of them, you can apply feature selection (e.g., mRMR). So yes, that means you have applied a form of dimensionality reduction.
Clustering, however, is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields (check Clustering in Machine Learning). When you want to group (cluster) different data points according to their features, you can apply clustering (e.g., k-means) with or without using dimensionality reduction first.
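As a minimal sketch of the distinction (assuming scikit-learn is available; the dataset and parameter choices are arbitrary and purely illustrative): feature selection keeps a subset of the original columns, feature extraction such as PCA builds new columns, and clustering groups the rows.

```python
# Minimal sketch contrasting feature selection, dimensionality reduction
# (feature extraction) and clustering; assumes scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)                  # 150 samples, 4 features

# Feature selection: keep a subset of the ORIGINAL features.
X_sel = SelectKBest(mutual_info_classif, k=2).fit_transform(X, y)

# Feature extraction (dimensionality reduction): build NEW features
# as combinations of the original ones.
X_pca = PCA(n_components=2).fit_transform(X)

# Clustering: group the SAMPLES (rows), not the features.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(X_sel.shape, X_pca.shape, labels.shape)      # (150, 2) (150, 2) (150,)
```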
Related
The end goal is to create a binary classifier that would output "YES" for around 10% of the instances (based on training data). The classifier would use binary, continuous and maybe some categorical features.
Currently I am extracting a continuous feature in range [0; 1] that should describe similarity between the real name of a product and its potential mention in a text field. I am trying out different methods for extracting this feature (Levenshtein distance and some other algorithms).
I am not sure what kind of feature metrics I should use to select (or at least approximate) the best extraction method for this feature, so the question is:
What kind of metrics should be used to reason about the best extraction method for a particular feature, which would later be used with different binary classification algorithms, if this feature is:
1) binary
2) continuous
3) categorical
Would I use something like Pearson correlation for 2), or would information gain be a better metric?
The metrics should not be very classifier-specific (I would like to use the extracted features with multiple algorithms, such as decision trees, logistic regression, and neural networks, with small adjustments).
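As a hedged sketch of such classifier-agnostic "filter" metrics (toy data only, assuming SciPy and scikit-learn; the point-biserial correlation is simply Pearson's r computed for a continuous feature against a binary target, and mutual information corresponds to the information-gain idea):

```python
# Toy illustration of filter metrics for one continuous candidate feature
# against a binary target; the data is synthetic and purely illustrative.
import numpy as np
from scipy.stats import pointbiserialr
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
target = (rng.random(1000) < 0.10).astype(int)       # ~10% positive instances
feature = 0.6 * target + 0.4 * rng.random(1000)      # fake similarity score in [0, 1]

r, p = pointbiserialr(target, feature)               # Pearson r for binary vs. continuous
mi = mutual_info_classif(feature.reshape(-1, 1), target, random_state=0)

print(f"point-biserial r = {r:.3f} (p = {p:.3g}), mutual information = {mi[0]:.3f}")
```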
I've got a set of F features, e.g. Lab color space and entropy. By concatenating all features together, I obtain a feature vector of dimension d (between 12 and 50, depending on which features are selected).
I usually get between 1000 and 5000 new samples, denoted x. A Gaussian mixture model is then trained on these vectors, but I don't know which class each feature vector comes from. What I do know, though, is that there are only 2 classes. From the GMM prediction I get the probability of a feature vector belonging to class 1 or 2.
My question now is: how do I obtain the best subset of features, for instance only entropy and normalized RGB, that will give me the best classification accuracy? I guess this is achieved if the class separability is increased by the feature subset selection.
Maybe I can utilize Fisher's linear discriminant analysis, since I already have the mean and covariance matrices obtained from the GMM. But wouldn't I then have to calculate the score for every combination of features?
It would be nice to get some feedback on whether this is an unrewarding approach and I'm on the wrong track, and/or any other suggestions.
One way of finding "informative" features is to use the features that will maximise the log likelihood. You could do this with cross validation.
https://www.cs.cmu.edu/~kdeng/thesis/feature.pdf
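As a rough illustration of that idea (not taken from the linked thesis; the feature matrix, component count, and fold count below are arbitrary assumptions), one could score each candidate feature subset by the held-out log-likelihood of a GMM fit on just those columns:

```python
# Score candidate feature subsets by cross-validated GMM log-likelihood.
# Assumes scikit-learn; the data and subset size are placeholders.
import numpy as np
from itertools import combinations
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def subset_loglik(X, cols, n_components=2, n_splits=5):
    """Average held-out log-likelihood of a GMM fit on the given columns."""
    scores = []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        gmm = GaussianMixture(n_components=n_components, random_state=0)
        gmm.fit(X[train][:, cols])
        scores.append(gmm.score(X[test][:, cols]))   # mean log-likelihood per sample
    return np.mean(scores)

X = np.random.default_rng(0).normal(size=(1000, 6))  # placeholder feature matrix
best = max(combinations(range(X.shape[1]), 2),
           key=lambda c: subset_loglik(X, list(c)))
print("best 2-feature subset by held-out log-likelihood:", best)
```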
Another idea might be to use another unsupervised algorithm that automatically selects features, such as a clustering forest
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
In that case the clustering algorithm will automatically split the data based on information gain.
Fisher LDA will not select features, but will project your original data into a lower-dimensional subspace. If you are interested in subspace methods, another interesting approach might be spectral clustering, which also operates in a subspace, or unsupervised neural networks such as autoencoders.
In a Radial Basis Function network (RBF network), the prototypes (the center vectors of the RBF functions) in the hidden layer have to be chosen. This step can be performed in several ways:
Centers can be randomly sampled from some set of examples.
Or, they can be determined using k-means clustering.
One of the approaches for making an intelligent selection of prototypes is to perform k-means clustering on our training set and to use the cluster centers as the prototypes.
We all know that k-means clustering is characterized by its simplicity (it is fast), but it is not always very accurate.
That is why I would like to know what other approaches can be more accurate than k-means clustering.
Any help will be much appreciated.
Several k-means variations exist: k-medians, Partitioning Around Medoids (PAM), fuzzy c-means clustering, Gaussian mixture models trained with the expectation-maximization algorithm, k-means++, etc.
I use PAM (Partitioning Around Medoids) to be more accurate when my dataset contains some "outliers" (noise with values that are very different from the other values) and I don't want the centers to be influenced by this data. In PAM, a center is called a medoid.
There is a more statistical approach to cluster analysis, based on the expectation-maximization (EM) algorithm. It uses statistical analysis to determine the clusters. This is probably a better approach when you have a lot of data for your cluster centroids and training data.
This link also lists several other clustering algorithms out there in the wild. Obviously, some are better than others, depending on the amount of data you have and/or the type of data you have.
There is a wonderful course on Udacity, Intro to Artificial Intelligence, where one lesson is dedicated to unsupervised learning, and Professor Thrun explains some clustering algorithms in very great detail. I highly recommend that course!
I hope this helps,
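As a concrete (but hedged) illustration of two of those alternatives, here is how one might extract prototype vectors with scikit-learn, using either k-means centroids or the means of a Gaussian mixture fitted with EM; the data and the number of prototypes are placeholders:

```python
# Two ways to obtain RBF prototypes: k-means centroids vs. EM-fitted
# Gaussian mixture means. Assumes scikit-learn; data is illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(500, 4))   # placeholder training inputs
n_prototypes = 10

kmeans_prototypes = KMeans(n_clusters=n_prototypes, n_init=10,
                           random_state=0).fit(X).cluster_centers_
em_prototypes = GaussianMixture(n_components=n_prototypes,
                                random_state=0).fit(X).means_

print(kmeans_prototypes.shape, em_prototypes.shape)  # (10, 4) (10, 4)
```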
In terms of K-Means, you can run it on your sample a number of times (say, 100) and then choose the clustering (and, by consequence, the centroids) that has the smallest K-Means criterion value (the sum of the squared Euclidean distances between each entity and its respective centroid).
You can also use an initialization algorithm (the intelligent K-Means comes to mind, but you can also google K-Means++). You can find a very good review of K-Means in a paper by A. K. Jain called "Data clustering: 50 years beyond K-means".
You can also check hierarchical methods, such as Ward's method.
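For what it's worth, both suggestions have direct counterparts in scikit-learn (a sketch under the assumption that its implementation is acceptable): n_init re-runs K-Means and keeps the run with the smallest criterion value (inertia_, the sum of squared distances to the assigned centroids), and init="k-means++" gives the seeding mentioned above.

```python
# Restarted k-means with k-means++ seeding; keeps the run with the
# lowest within-cluster sum of squares. Data is a placeholder.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 4))

km = KMeans(n_clusters=10, init="k-means++", n_init=100, random_state=0).fit(X)
print("best within-cluster sum of squares over 100 restarts:", km.inertia_)
```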
1) How can I apply feature reduction methods like LSI in Weka for text classification?
2) Can applying feature reduction methods like LSI improve the accuracy of classification?
Take a look at the FilteredClassifier class or at AttributeSelectedClassifier. With FilteredClassifier you can use a feature reduction method such as Principal Component Analysis (PCA). Here is a video showing how to filter your dataset using PCA, so that you can try different classifiers on the reduced dataset.
It can help, but there is no guarantee. If you remove redundant features, or transform the features in some way (as SVD or PCA do), the classification task can become simpler. In any case, a large number of features usually leads to the curse of dimensionality, and attribute selection is a way to avoid it.
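This is Weka-specific advice, but for readers working outside Weka, a rough scikit-learn analogue of the same idea (wrapping an LSI-style reduction and a classifier into a single pipeline, similar in spirit to FilteredClassifier) might look like the sketch below; the toy texts, labels, and dimensions are made up:

```python
# TF-IDF + truncated SVD (essentially LSI) feeding a classifier, wrapped
# in one pipeline. Toy data; assumes scikit-learn.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

texts = ["cheap meds now", "meeting at noon", "win a prize", "lunch tomorrow"]
labels = [1, 0, 1, 0]                                 # toy spam/ham labels

clf = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2, random_state=0),  # LSI-style reduction
                    LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["win cheap meds"]))
```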
I'm quite new to machine learning, and I'm trying to understand some basic concepts properly. My problem is the following:
I have a set of data observations and the corresponding target values {x, t}. I'm trying to train a function on this data in order to predict the value of unobserved data, and I'm trying to achieve this using the maximum a posteriori (MAP) technique (i.e. a Bayesian approach) with Gaussian basis functions of the form:
$$\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2\sigma_j^2}\right)$$
How can I choose:
1) the number of basis functions to use (M)?
2) the mean of each function ($\mu_j$)?
3) the variance of each function ($\sigma_j^2$)?
There are different approaches to this in the literature. The most common approach is to perform an unsupervised clustering of the input data (see the Netlab toolbox). Some other approaches are described in the papers "EMRBF: A Statistical Basis for Using Radial Basis Functions for Process Control" and "Robust Full Bayesian Learning for Radial Basis Networks".
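For the "unsupervised clustering of the input data" recipe, a minimal sketch (assuming scikit-learn, a one-dimensional input, and an arbitrary choice of M and of the width heuristic) could look like this: k-means centroids become the means $\mu_j$, each $\sigma_j$ is set from the spread of the corresponding cluster, and M is simply the chosen number of clusters.

```python
# Choose RBF basis functions by clustering the inputs: centroids -> mu_j,
# per-cluster standard deviation -> sigma_j, number of clusters -> M.
# All numbers are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans

x = np.random.default_rng(0).uniform(-3, 3, size=(200, 1))   # placeholder inputs
M = 8                                                        # number of basis functions

km = KMeans(n_clusters=M, n_init=10, random_state=0).fit(x)
mu = km.cluster_centers_                                     # means mu_j, shape (M, 1)
sigma = np.array([x[km.labels_ == j].std() + 1e-8 for j in range(M)])  # widths sigma_j

def phi(x_new):
    """Design matrix of phi_j(x) = exp(-(x - mu_j)^2 / (2 * sigma_j^2))."""
    d2 = (x_new - mu.T) ** 2                                 # (n, M) squared distances
    return np.exp(-d2 / (2 * sigma ** 2))

print(phi(x).shape)                                          # (200, 8)
```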