Determine most important feature per class - machine-learning

Imagine a machine learning problem where you have 20 classes and about 7000 sparse boolean features.
I want to figure out what the 20 most unique features per class are. In other words, features that are used a lot in a specific class but aren't used in other classes, or hardly used.
What would be a good feature selection algorithm or heuristic that can do this?

When you train a multi-class logistic regression classifier, the trained model is a num_class x num_feature matrix whose [i, j] entry is the weight of feature j in class i. The feature indices are the same as in your input feature matrix.
In scikit-learn you can access these parameters directly, so you can find the most important features per class like this:
import numpy as np
from sklearn.linear_model import SGDClassifier

regul = 0.0001  # regularization strength
clf = SGDClassifier(loss='log_loss', alpha=regul, penalty='l1', learning_rate='optimal',
                    max_iter=10, shuffle=False, n_jobs=3, fit_intercept=True)  # loss='log', n_iter=10 in older versions
clf.fit(X_train, Y_train)
for i in range(clf.coef_.shape[0]):
    top20_indices = np.argsort(clf.coef_[i])[-20:]
    print(top20_indices)
clf.coef_ is the matrix containing the weight of each feature in each class, so clf.coef_[0][2] is the weight of the third feature in the first class.
If, when you build your feature matrix, you keep track of the index of each feature in a dictionary where dic[id] = feature_name, you'll be able to retrieve the names of the top features from that dictionary.
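For example, with such a dictionary in hand (a small sketch reusing the clf and dic names from above):
for i in range(clf.coef_.shape[0]):
    top20_indices = np.argsort(clf.coef_[i])[-20:][::-1]
    print("class %d:" % i, [dic[j] for j in top20_indices])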
For more information refer to scikit-learn text classification example

Random Forest and Naive Bayes should be able to handle this for you. Given the sparsity, I'd go for Naive Bayes first. Random Forest would be better if you're looking for combinations of features.
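One concrete way to get at "used a lot in this class but hardly elsewhere" is a per-class (one-vs-rest) chi-squared score. This is only a sketch of that heuristic using scikit-learn, not necessarily what the answer above had in mind; feature_names is assumed to be your index-to-name mapping.
import numpy as np
from sklearn.feature_selection import chi2

def top_features_per_class(X, y, feature_names, k=20):
    result = {}
    for c in np.unique(y):
        scores, _ = chi2(X, (y == c).astype(int))        # class c vs. the rest
        top = np.argsort(np.nan_to_num(scores))[-k:][::-1]
        result[c] = [feature_names[i] for i in top]
    return result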

Related

Word2vec is a Generalization or memorization algorithm?

I need to know whether word2vec is a generalization algorithm, like most ML algorithms, or a memorization algorithm like kNN.
Since we have two types of algorithms, model-based and memory-based, which category does word2vec fall into when it is used for most_similar_items?
Let me define generalization as the ability of a model which has completed training to be effective in prediction across a whole range of inputs, including inputs that were not part of training. From that perspective, Word2Vec cannot predict for words that are not part of the training dataset, because it has never seen their contexts and so has no embedding for them. To qualify as a generalization method, it would need to be able to predict on an input that was not part of the training dataset.
A Word2Vec model maintains a dictionary mapping each known word to its embedding/vector; in short, it cannot predict on unknown words. This is one of the important differences between the fastText model and Word2Vec: fastText builds word vectors from character n-grams, so it can compose an embedding for an out-of-vocabulary word.
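To make that concrete, here is a minimal sketch using gensim (the library choice and its 4.x parameter names are my own assumption, not part of the original answer):
from gensim.models import Word2Vec, FastText

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]
w2v = Word2Vec(sentences, vector_size=16, min_count=1)
ft = FastText(sentences, vector_size=16, min_count=1)

print("cat" in w2v.wv)    # True: the word was seen during training
print("cats" in w2v.wv)   # False: Word2Vec has no vector for an unseen word
print(ft.wv["cats"][:3])  # fastText composes a vector from character n-grams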

Regression like quantification of the importance of variables in random forest

Is it possible to quantify the importance of variables in figuring out the probability of an observation falling into one class? Something similar to logistic regression.
For example:
If I have the following independent variables
1) Number of cats the person has
2) Number of dogs a person has
3) Number of chickens a person has
With my dependent variable being: Whether a person is a part of PETA or not
Is it possible to say something like "if the person adopts one more cat on top of the animals he already has, his probability of being a part of PETA increases by 0.12"?
I am currently using the following methodology to reach this particular scenario:
1) Build a random forest model using the training data
2) Predict the customer's probability of falling into one particular class (PETA vs non-PETA)
3) Artificially increase the number of cats owned by each observation by 1
4) Predict the customer's new probability to fall in one of the two classes
5) The average change between (4)'s probability and (2)'s probability is the average increase in a person's probability if he has adopted a cat.
Does this make sense? Is there any flaw in the methodology that I haven't thought of? Is there a better way of doing the same ?
If you're using scikit-learn, you can easily do this by accessing the feature_importances_ property of the fitted RandomForestClassifier. According to the scikit-learn documentation:
The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection.
The feature_importances_ property stores each feature's importance averaged over the trees (scikit-learn computes it as the mean decrease in impurity, normalized so the importances sum to one).
Here's an example. Let's start by importing the necessary libraries.
# using this for some array manipulations
import numpy as np
# of course we're going to plot stuff!
import matplotlib.pyplot as plt
# dummy iris dataset
from sklearn.datasets import load_iris
#random forest classifier
from sklearn.ensemble import RandomForestClassifier
Once we have these, we're going to load the dummy dataset, define a classification model and fit the data to the model.
data = load_iris()

# we're gonna use 100 trees
forest = RandomForestClassifier(n_estimators=100)

# fit data to model by passing features and labels
forest.fit(data.data, data.target)
Now we can use the Feature Importance Property to get a score of each feature, based on how well it is able to classify the data into different targets.
# find importances of each feature
importances = forest.feature_importances_
# find the standard dev of each feature to assess the spread
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)

# find sorting indices of importances (descending)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")
for f in range(data.data.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
Feature ranking:
1. feature 2 (0.441183)
2. feature 3 (0.416197)
3. feature 0 (0.112287)
4. feature 1 (0.030334)
Now we can plot the importance of each feature as a bar graph and decide if it's worth keeping them all. We also plot the error bars to assess the significance.
plt.figure()
plt.title("Feature importances")
plt.bar(range(data.data.shape[1]), importances[indices],
color="b", yerr=std[indices], align="center")
plt.xticks(range(data.data.shape[1]), indices)
plt.xlim([-1, data.data.shape[1]])
plt.show()
Bar graph of feature importances
I apologize, I didn't catch the part where you mention what kind of statement you're trying to make. I'm assuming your response variable is either zero or one. You could try something like this:
Fit a linear regression model over the data. This won't really give you the most accurate fit, but it will be robust to get the information you're looking for.
Find the response of the model with the original inputs. (It most likely won't be ones or zeros)
Artificially change the inputs, and find the difference in the outputs of the original data and modified data, like you suggested in your question.
Try it out with a logistic regression as well. Which kind of regression works best really depends on your data and how it is distributed. You definitely have to use a regression of some sort to find the change in probability with a change in input.
You can even try a single layer neural network with a regression/linear output layer to do the same thing. Add layers or non-linear activation functions if the data is less likely to be linear.
Cheers!
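Here is a rough sketch of the perturbation methodology from the question (steps 2 to 5), using a random forest; the data is a made-up toy set with columns [cats, dogs, chickens] purely for illustration:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.poisson(lam=2, size=(500, 3)).astype(float)          # toy [cats, dogs, chickens]
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 2).astype(int)   # toy PETA membership

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

p_orig = forest.predict_proba(X)[:, 1]       # step 2: P(PETA) on the original data
X_plus = X.copy()
X_plus[:, 0] += 1                            # step 3: every observation adopts one more cat
p_new = forest.predict_proba(X_plus)[:, 1]   # step 4: re-predict
print("average change in P(PETA):", np.mean(p_new - p_orig))   # step 5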

LDA and PCA on a dataset containing two classes

I would like to compare the accuracies of running logistic regression on a dataset following PCA and LDA. The dataset I am using is the Wisconsin cancer dataset, which contains two classes (malignant or benign tumors) and 30 features. I have already conducted PCA on this data and have been able to get good accuracy scores with 10 principal components. I know that LDA is similar to PCA. My understanding is that you calculate the mean vectors of each feature for each class, compute scatter matrices and then get the eigenvalues for the dataset. Is LDA similar to PCA in the sense that I can choose 10 LDA eigenvalues to better separate my data? I have tried LDA with scikit-learn, however it has only given me one LDA component back. Is this because I only have 2 classes, or do I need to do an additional step? I would like to have 10 LDA components in order to compare them with my 10 principal components. Is this even possible?
Actually both LDA and PCA are linear transformation techniques: LDA is supervised whereas PCA is unsupervised (it ignores class labels). You can picture PCA as a technique that finds the directions of maximal variance, and LDA as a technique that also cares about class separability. Remember that LDA makes assumptions about normally distributed classes and equal class covariances (at least the multiclass version, the generalized version by Rao). As for the number of components: LDA yields at most (number of classes - 1) discriminant directions, so with your two-class dataset scikit-learn can only give you a single LDA component; getting 10 of them is not possible.
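A small sketch showing that limit in scikit-learn (the Wisconsin breast cancer data ships with sklearn as load_breast_cancer):
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)

X_pca = PCA(n_components=10).fit_transform(X)   # 10 principal components: fine
print(X_pca.shape)                              # (569, 10)

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
print(lda.transform(X).shape)                   # (569, 1): at most n_classes - 1 = 1 component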

How to do text classification with label probabilities?

I'm trying to solve a text classification problem for academic purposes. I need to classify tweets into labels like "cloud", "cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in the training data has probabilities against all the labels. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has a 21% chance of being hot and a 79% chance of wind. I have worked on classification problems which predict whether it's wind or hot or other. But in this problem, each training example has probabilities against all the labels. I have previously used the Mahout naive Bayes classifier, which takes a single specific label for a given text to build the model. How can I feed these per-label probabilities as input to a classifier?
In a probabilistic setting, these probabilities reflect uncertainty about the class label of your training instance. This affects parameter learning in your classifier.
There's a natural way to incorporate this: in Naive Bayes, for instance, when estimating the parameters of your models, instead of each word getting a count of one for the class to which the document belongs, it gets a fractional count equal to the document's probability of belonging to that class. Thus documents with a high probability of belonging to a class contribute more to that class's parameters. The situation is exactly equivalent to learning a mixture of multinomials model with EM, where the probabilities you have play the role of the membership/indicator variables for your instances.
Alternatively, if your classifier were a neural net with softmax output, instead of the target output being a vector with a single [1] and lots of zeros, the target output becomes the probability vector you're supplied with.
I don't, unfortunately, know of any standard implementations that would allow you to incorporate these ideas.
If you want an off-the-shelf solution, you could use a learner that supports multiclass classification and instance weights. Say you have k classes with probabilities p_1, ..., p_k. For each input instance, create k new training instances with identical features, with labels 1, ..., k, and assign them weights p_1, ..., p_k respectively.
Vowpal Wabbit is one such learner that supports multiclass classification with instance weights.
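As a rough sketch of that instance-weighting trick (using scikit-learn's sample_weight instead of Vowpal Wabbit, just to keep the example self-contained; the feature counts and label probabilities are made up):
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 0, 2, 0],           # toy word counts for 2 documents
              [0, 3, 0, 1]], dtype=float)
P = np.array([[0.79, 0.21, 0.00],     # per-document probabilities over 3 labels
              [0.10, 0.30, 0.60]])
n_docs, n_classes = P.shape

# k weighted copies of every document, one per candidate label
X_exp = np.repeat(X, n_classes, axis=0)
y_exp = np.tile(np.arange(n_classes), n_docs)
w_exp = P.ravel()

clf = LogisticRegression(max_iter=1000).fit(X_exp, y_exp, sample_weight=w_exp)
print(clf.predict_proba(X))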

Model in Naive Bayes

When we train a training set using a decision tree classifier, we get a tree model, and this model can be converted to rules and incorporated into Java code.
Now if I train the training set using Naive Bayes, in what form is the model? And how can I incorporate the model into my Java code?
If there is no model resulted from the training, then what is the difference between Naive Bayes and lazy learner (ex. kNN)?
Thanks in advance.
Naive Bayes constructs estimates of the conditional probabilities P(f_1,...,f_n|C_j), where f_i are features and C_j are classes. Using Bayes' rule together with estimates of the priors P(C_j) and the evidence P(f_i), these can be turned into x = P(C_j|f_1,...,f_n), which can be roughly read as "given features f_i, I think they describe an object of class C_j and my certainty is x". In fact, NB assumes that the features are independent, so it actually works with simple probabilities of the form x = P(f_i|C_j), i.e. "given f_i, I think it is C_j with probability x".
So the form of the model is set of probabilities:
Conditional probabilities P(f_i|C_j) for each feature f_i and each class C_j
priors P(C_j) for each class
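To see how small that model really is, here is a sketch using scikit-learn's BernoulliNB (exporting the model to Java then amounts to serializing these two tables):
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 0, 1],     # toy boolean features: 4 objects x 3 features
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])   # two classes

nb = BernoulliNB().fit(X, y)

# the whole "model" is just these two tables of probabilities
print(np.exp(nb.class_log_prior_))    # priors P(C_j)
print(np.exp(nb.feature_log_prob_))   # conditionals P(f_i = 1 | C_j)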
KNN, on the other hand, is something completely different. It is not really a "learned model" in the strict sense, as you don't tune any parameters. It is rather a classification algorithm which, given a training set and a number k, simply answers the question "for a given point x, what is the majority class among the k nearest points in the training set?".
The main difference is in the input data. Naive Bayes works on objects that are "observations": you simply need some features which are present or absent in the classified object. It does not matter whether it is a color, an object in a photo, a word in a sentence or an abstract concept in a highly complex topological object. KNN, in contrast, is a distance-based classifier which requires you to classify objects between which you can measure a distance. So in order to classify abstract objects you first have to come up with some metric, a distance measure, which describes their similarity, and the result will be highly dependent on those definitions. Naive Bayes, on the other hand, is a simple probabilistic model which does not use the concept of distance at all. It treats all objects the same way: a feature is there or it isn't, end of story (of course it can be generalized to continuous variables with a given density function, but that is not the point here).
The Naive Bayes will construct/estimate the probability distribution from which your training samples have been generated.
Now, given this probability distribution for all your output classes, you take a test sample, and depending on which class has the highest probability of generating this sample, you assign the test sample to that class.
In short, you take the test sample and run it through all the probability distributions (one for each class) and calculate the probability of generating this test sample for that particular distribution.
