Multiclass classification using Gaussian Naive Bayes - machine-learning

I know that Naive Bayes is good at binary classification, but I wanted to know how multiclass classification works.
For example: I did a text classification with Naive Bayes earlier, in which I vectorized the text to find the probability of each word in the document, and later used the vectorized data to fit a Naive Bayes classifier.
Now, I am working with data that looks like:
A, B, C, D, E, F, G
210, 203, 0, 30, 710, 2587452, 0
273, 250, 0, 30, 725, 3548798, 1
283, 298, 0, 31, 785, 3987452, 3
In the above data, there are six features (A-F) and G is the class, taking the value 0, 1, or 2.
I have almost 70,000 entries in the dataset.
After splitting the data into training and test sets, I fit the training data with sklearn's GaussianNB.
After fitting, when I try to predict the test data, it only ever predicts 0 or 2.
So, my question is: just as I performed vectorization before fitting the Naive Bayes classifier during text classification, is there any preprocessing I need to do on the above data before fitting the GaussianNB classifier with the training data, so that it can predict all three classes (0, 1, and 2) instead of only 0 and 2?

I know that Naive Bayes is good at binary classification, but I wanted to know how multiclass classification works.
There is nothing in Naive Bayes specific to binary classification; it is designed to do multiclass classification just fine.
So, my question is: just as I performed vectorization before fitting the Naive Bayes classifier during text classification, is there any preprocessing I need to do on the above data before fitting the GaussianNB classifier with the training data, so that it can predict all three classes (0, 1, and 2) instead of only 0 and 2?
No, no preprocessing is needed for the multiclass part. For the Gaussian part, however: as the name suggests, this model tries to fit a Gaussian pdf to each feature. Consequently, if your features do not follow a Gaussian distribution, it can fail. If you can figure out a transformation of each feature (based on the data you have) that makes it more Gaussian-like, it will help the model. For example, some of your features appear to be huge numbers, which can cause serious difficulties if they do not follow a Gaussian distribution. You might want to normalise your data, or even drop these features.
The only reason your model never predicts 1 is that, under the Naive Bayes assumptions and with the data provided, that class is never probable enough to be considered. You can try normalising the features as described above. If this fails, you can also artificially "overweight" selected classes by providing your own prior attribute to sklearn (which is normally estimated from the data as "how often a sample of class X is encountered"; if you raise it, the class will be considered more probable).
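A minimal sketch of both ideas (the synthetic data below is only a stand-in for the real A-F/G columns, and the prior values are purely illustrative): scaling can be done with scikit-learn's StandardScaler in a pipeline, and the class prior can be set through GaussianNB's priors parameter.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the six features A-F and the class G (0, 1, or 2).
n = 1000
y = rng.integers(0, 3, size=n)
X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 6))
X[:, 5] *= 1e6  # mimic the huge values in column F

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardising each feature removes the wildly different scales
# before GaussianNB fits a per-feature Gaussian.
model = make_pipeline(StandardScaler(), GaussianNB())
model.fit(X_train, y_train)
print(sorted(set(model.predict(X_test))))  # ideally [0, 1, 2]

# If one class is still never predicted, overweight it via `priors`
# (values must sum to 1; these numbers are purely illustrative).
boosted = make_pipeline(StandardScaler(),
                        GaussianNB(priors=[0.25, 0.5, 0.25]))
boosted.fit(X_train, y_train)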

Related

LDA and PCA on a dataset containing two classes

I would like to compare the accuracy of running logistic regression on a dataset following PCA and LDA. The dataset I am using is the Wisconsin cancer dataset, which contains two classes (malignant or benign tumors) and 30 features. I have already run PCA on this data and have been able to get good accuracy scores with 10 principal components. I know that LDA is similar to PCA. My understanding is that you calculate the mean vectors of each feature for each class, compute scatter matrices, and then get the eigenvalues for the dataset. Is LDA similar to PCA in the sense that I can choose 10 LDA eigenvalues to better separate my data? I have tried LDA with scikit-learn; however, it has only given me one LDA component back. Is this because I only have 2 classes, or do I need to do an additional step? I would like to have 10 LDA components in order to compare them with my 10 principal components. Is this even possible?
Actually, both LDA and PCA are linear transformation techniques: LDA is supervised, whereas PCA is unsupervised (it ignores class labels). You can picture PCA as a technique that finds the directions of maximal variance, and LDA as a technique that also cares about class separability. Remember that LDA makes assumptions about normally distributed classes and equal class covariances (at least the multiclass version; the generalized version by Rao). And to answer the question directly: LDA yields at most (number of classes - 1) discriminant components, so with two classes scikit-learn can only give you one; ten LDA components are not possible here.
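As a short sketch on the Wisconsin breast cancer data the question mentions, scikit-learn makes the cap visible: PCA happily returns 10 components, while LinearDiscriminantAnalysis is limited to n_classes - 1 = 1.

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_breast_cancer(return_X_y=True)

# PCA: unsupervised, any number of components up to n_features.
print(PCA(n_components=10).fit_transform(X).shape)   # (569, 10)

# LDA: supervised, at most n_classes - 1 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=1)
print(lda.fit_transform(X, y).shape)                 # (569, 1)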

Machine Learning - one class classification/novelty detection/anomaly assessment?

I need a machine learning algorithm that will satisfy the following requirements:
The training data are a set of feature vectors, all belonging to the same "positive" class (as I cannot produce negative data samples).
The test data are some feature vectors which might or might not belong to the positive class.
The prediction should be a continuous value, which should indicate the "distance" from the positive samples (i.e. 0 means the test sample clearly belongs to the positive class and 1 means it is clearly negative, but 0.3 means it is somewhat positive)
An example:
Let's say that the feature vectors are 2D feature vectors.
Positive training data:
(0, 1), (0, 2), (0, 3)
Test data:
(0, 10) should be an anomaly, but not a distinct one
(1, 0) should be an anomaly, but with higher "rank" than (0, 10)
(1, 10) should be an anomaly, with an even higher anomaly "rank"
The problem you described is usually referred to as outlier, anomaly or novelty detection. There are many techniques that can be applied to this problem. A nice survey of novelty detection techniques can be found here. The article gives a thorough classification of the techniques and a brief description of each, but as a start, I will list some of the standard ones:
K-nearest neighbors - a simple distance-based method which assumes that normal data samples are close to other normal data samples, while novel samples lie far from the normal points. A Python implementation of KNN can be found in scikit-learn.
Mixture models (e.g. Gaussian Mixture Model) - probabilistic models that model the generative probability density function of the data, for instance using a mixture of Gaussian distributions. Given a set of normal data samples, the goal is to find the parameters of a probability distribution that describes the samples best. Then, the probability of a new sample is used to decide whether it belongs to the distribution or is an outlier. scikit-learn implements Gaussian Mixture Models and uses the Expectation-Maximization algorithm to learn them.
One-class Support Vector Machine (SVM) - an extension of the standard SVM classifier which tries to find a boundary that separates the normal samples from the unknown novel samples (in the classic approach, the boundary is found by maximizing the margin between the normal samples and the origin of the space, projected to the so-called "feature space"). scikit-learn has an implementation of one-class SVM, along with a nice example whose plot illustrates the boundary one-class SVM finds "around" the normal data samples.
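A minimal sketch using the 2D toy data from the question (the nu and gamma values are illustrative assumptions, not tuned): scikit-learn's OneClassSVM returns the continuous score the question asks for via decision_function, with more negative values meaning "more anomalous".

import numpy as np
from sklearn.svm import OneClassSVM

X_train = np.array([[0, 1], [0, 2], [0, 3]])   # positive samples only
X_test = np.array([[0, 10], [1, 0], [1, 10]])  # candidate anomalies

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.5).fit(X_train)

# Signed distance to the learned boundary: positive inside (normal),
# negative outside (anomalous).
print(clf.decision_function(X_test))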

scikit-learn classifies stopwords

Here is an example with a step-by-step procedure to train the system and classify input data.
It classifies the given five dataset domains correctly. However, it also classifies input that consists only of stopwords.
e.g.
Input : docs_new = ['God is love', 'what is where']
Output :
'God is love' => soc.religion.christian
'what is where' => soc.religion.christian
Here, 'what is where' should not be classified, as it contains only stopwords. How does scikit-learn behave in this scenario?
I am not sure what classifier you are using. But let's assume you use a Naive Bayes classifier.
In this case, the sample is labeled as the class for which the posterior probability is maximum given a particular pattern of words.
And the posterior probability is calculated as
posterior ∝ likelihood × prior
(the evidence term is dropped since it is constant). Additionally, additive smoothing is applied to avoid scenarios where the likelihood is zero.
Anyway, if you have only stop words in your input text, the likelihood is constant for all classes and the posterior probability is entirely determined by your prior probability. So, what basically happens is that a Naive Bayes classifier (if the priors were estimated from the training data) will assign the class label that occurs most often in the training data.
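A minimal sketch of that behaviour (the toy count matrix is an assumption): an all-zero feature vector has the same likelihood under every class, so MultinomialNB falls back to the class prior and predicts the majority class.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: three documents of class 0, one of class 1.
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [3, 0, 2],
              [0, 2, 1]])
y = np.array([0, 0, 0, 1])

clf = MultinomialNB().fit(X, y)  # additive smoothing by default

# A stopword-only document maps to the zero vector, so the posterior
# reduces to the prior and the majority class wins.
print(clf.predict(np.zeros((1, 3))))   # [0]
print(np.exp(clf.class_log_prior_))    # [0.75 0.25]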
A classifier always predicts one of the classes that it saw during its training phase, by definition. I don't know what you did to produce the classifier, but most likely it's just predicting the majority class for any sample without interesting features; that's what naive Bayes, linear SVMs and other typical text classifiers do.
Standard text classification uses a TfidfVectorizer to transform text into tokens and then into feature vectors that serve as input to the classifier.
One of its init parameters is stop_words; with stop_words='english' the vectorizer will produce no features for the sentence 'what is where'.
Stop words are matched lexically against every input token using a built-in English stop-word list, which you can examine here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py
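A minimal sketch (the toy corpus is an assumption): with stop_words='english', the sentence 'what is where' produces an all-zero row, since every one of its tokens is filtered out before counting.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['God is love', 'graphics cards render images fast']
vect = TfidfVectorizer(stop_words='english')
vect.fit(docs)

X_new = vect.transform(['God is love', 'what is where'])
print(X_new.getnnz(axis=1))   # [2 0] -- the second row has no features at all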

Classifier prediction results biased

I built a classifier with 13 features (no binary ones) and normalized each sample individually using the scikit-learn tool (Normalizer().transform).
When I make predictions, it predicts all training samples as positive and all test samples as negative (irrespective of whether they are actually positive or negative).
What anomalies should I focus on: my classifier, the features, or the data?
Notes:
1) I normalize the test and training sets (individually for each sample) separately.
2) I tried cross-validation, but the performance is the same.
3) I used both linear and RBF SVM kernels.
4) I tried without normalizing too, but got the same poor results.
5) I have the same number of positive and negative training samples (400 each), and 34 positive and 1000+ negative test samples.
If you're training on balanced data, the fact that it "predicts all training samples as positive" is probably enough to conclude that something has gone wrong.
Try building something very simple (e.g. a linear SVM with one or two features) and look at the model as well as a visualization of your training data; follow the scikit-learn example: http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
There's also a possibility that your input data has many large outliers impacting the transform process...
Try doing feature selection on the training data (separately from your test/validation data).
Feature selection on your whole dataset can easily lead to overfitting.
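A minimal sketch of leakage-free feature selection (the synthetic data and the k=5 choice are illustrative): putting SelectKBest inside a pipeline means each cross-validation fold selects features from its own training split only, never from the validation split.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, n_features=13, random_state=0)

# Scaling and selection are re-fit inside every CV training fold.
pipe = make_pipeline(StandardScaler(),
                     SelectKBest(f_classif, k=5),
                     SVC(kernel='linear'))
print(cross_val_score(pipe, X, y, cv=5).mean())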

How to do text classification with label probabilities?

I'm trying to solve a text classification problem for academic purposes. I need to classify tweets into labels like "cloud", "cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in the training data has probabilities against all the labels. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has a 21% chance of being "hot" and a 79% chance of being "wind". I have worked on classification problems that predict whether a tweet is wind or hot or other. But in this problem, each training example has probabilities against all the labels. I have previously used the Mahout Naive Bayes classifier, which takes a single specific label for a given text to build the model. How do I feed these per-label probabilities into a classifier?
In a probabilistic setting, these probabilities reflect uncertainty about the class label of your training instance. This affects parameter learning in your classifier.
There's a natural way to incorporate this: in Naive Bayes, for instance, when estimating parameters in your model, instead of each word getting a count of one for the class the document belongs to, it gets a count equal to the document's probability of belonging to that class. Thus documents with a high probability of belonging to a class contribute more to that class's parameters. The situation is exactly equivalent to learning a mixture of multinomials model using EM, where the probabilities you have are identical to the membership/indicator variables for your instances.
Alternatively, if your classifier were a neural net with a softmax output, instead of the target output being a one-hot vector (a single 1 and lots of zeros), the target output becomes the probability vector you're supplied with.
I don't, unfortunately, know of any standard implementations that would allow you to incorporate these ideas.
If you want an off-the-shelf solution, you could use a learner that supports multiclass classification and instance weights. Let's say you have k classes with probabilities p_1, ..., p_k. For each input instance, create k new training instances with identical features, with labels 1, ..., k, and assign weights p_1, ..., p_k respectively.
Vowpal Wabbit is one such learner that supports multiclass classification with instance weights.
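The same duplication trick also works in scikit-learn, since MultinomialNB's fit() accepts sample_weight. A minimal sketch (the tiny corpus and probabilities are illustrative assumptions, not the original data):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

tweets = ["so windy on the course today", "heat wave all week"]
# Per-tweet label probabilities; columns are ["hot", "wind"].
probs = np.array([[0.21, 0.79],
                  [0.90, 0.10]])

X = CountVectorizer().fit_transform(tweets)

# Duplicate each instance once per label, weighted by its probability.
n, k = probs.shape
rows = np.repeat(np.arange(n), k)   # each tweet repeated k times
X_dup = X[rows]
y_dup = np.tile(np.arange(k), n)    # label ids 0..k-1 for every tweet
w_dup = probs.ravel()

clf = MultinomialNB()
clf.fit(X_dup, y_dup, sample_weight=w_dup)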
