Calculating Probabilities in Naive Bayes Classification - machine-learning

I have a data set consisting of both categorical and continuous attributes. I want to apply the Naive Bayes classification method to classify the data.
How do I calculate probabilities for both of these types?
Should I use a counting method for the categorical data, and assume some distribution and compute probabilities from it for the continuous data?

As Naive Bayes assumes independence of each feature observation given the class label, you have
P(cat1, con1|y) = P(cat1|y)P(con1|y)
where cat1 is a categorical variable and con1 is a continuous one, so you model each of these probabilities completely independently. As you suggested, for the categorical variable you can use a simple empirical (count-based) estimator (but remember to apply some smoothing technique so you do not get zero probabilities), and for the continuous variable you need a more sophisticated estimator, such as maximum likelihood estimation with a fixed distribution family (for example Gaussians), or something more complex, such as any probabilistic classifier/model.
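A minimal sketch of this mixed approach, assuming one hypothetical categorical feature and one continuous feature with toy data: Laplace-smoothed counts for the categorical likelihood and a per-class Gaussian fitted by MLE for the continuous one.
import numpy as np
from collections import Counter

# hypothetical toy data: one categorical and one continuous feature per sample
cat = np.array(['a', 'b', 'a', 'b', 'a', 'a'])
con = np.array([1.0, 3.2, 0.8, 2.9, 1.1, 0.9])
y   = np.array([0, 1, 0, 1, 0, 1])

classes = np.unique(y)
cat_values = np.unique(cat)
priors = {c: np.mean(y == c) for c in classes}

# categorical: Laplace-smoothed empirical estimates of P(cat|y)
cat_probs = {}
for c in classes:
    counts = Counter(cat[y == c])
    total = (y == c).sum() + len(cat_values)          # +1 pseudo-count per possible value
    cat_probs[c] = {v: (counts[v] + 1) / total for v in cat_values}

# continuous: Gaussian MLE per class for P(con|y)
gauss = {c: (con[y == c].mean(), con[y == c].std() + 1e-9) for c in classes}

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def predict(cat_x, con_x):
    # argmax over y of P(y) * P(cat|y) * P(con|y)
    scores = {c: priors[c] * cat_probs[c][cat_x] * gaussian_pdf(con_x, *gauss[c])
              for c in classes}
    return max(scores, key=scores.get)

print(predict('a', 1.0))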

Related

Use categorical data as feature/target without encoding it

I recently found a model that classifies the Iris flower based on the size of its leaves. There are 3 types of flowers as the target (dependent variable). As far as I know, categorical data should be encoded so that it can be used in machine learning. However, in this model the data is used directly, without an encoding step.
Can anyone explain when to use encoding? Thank you in advance!
A relevant question: encoding of continuous feature variables.
Originally, the Iris data were published by Fisher when he published his linear discriminant classifier.
Generally, a distinction is made between:
Real-valued classifiers
Discrete-feature classifiers
Linear discriminant analysis and quadratic discriminant analysis are real-valued classifiers. Trying to add discrete variables as extra inputs does not work. Special procedures for working with indicator variables (the term used in statistics) in discriminant analysis have been developed. The k-nearest-neighbour classifier also really only works well with real-valued feature variables.
The naive Bayes classifier is most commonly used for classification problems with discrete features. When you don't want to assume conditional independence between the feature variables, a multinomial classifier can be applied to discrete features.
Neural networks and support vector machines can combine real-valued and discrete features. My advice is to use a separate input node for each discrete outcome; don't use a single input node fed with values like (0: small, 1: minor, 2: medium, 3: larger, 4: big). One-input-node-per-outcome encoding will improve your training result and yield better test set performance.
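As a rough sketch of that advice (assuming pandas is available and a hypothetical 'size' column), one-hot encoding creates one 0/1 input per discrete outcome:
import pandas as pd

# hypothetical data frame with one ordinal-looking categorical column
df = pd.DataFrame({'size': ['small', 'big', 'medium', 'small', 'larger']})

# one column (input node) per discrete outcome instead of a single 0..4 code
encoded = pd.get_dummies(df, columns=['size'])
print(encoded)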
The random forest classifier also combines real-valued and discrete features seamlessly.
My final advice is to train and compare on a test set at least 4 different types of classifiers, as there is no such thing as a universally best type of classifier.

How to use Genetic Algorithm to find weight of voting classifier in WEKA?

I am working from this article: "A novel method for predicting kidney stone type using ensemble learning". The author used a genetic algorithm to find the optimal weight vector for voting with WEKA, but I don't see how they did that. How can I use a genetic algorithm to find the weights of a voting classifier with WEKA?
The paragraph below has been extracted from the article:
In order to enhance the performance of the voting algorithm, a weighted majority vote is used. Simple majority vote algorithm is usually an effective way to combine different classifiers, but not all classifiers have the same effect on the classification problem. To optimize the results from weight majority vote classifier, we need to find the optimal weight vector. Applying Genetic algorithms is our solution for finding the optimal weight vector in this problem.
Assuming you have some trained classifiers and a test set, you can create a method calculateFitness(double[] weights). In this method, for each Instance, compute all the individual predictions and a merged prediction according to the weights. Use the combined predictions and the true values to calculate the total score you want to maximize/minimize.
Using the calculateFitness method you can create a custom GA to find the best weights.
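The same fitness idea, sketched here in plain Python/numpy with hypothetical predicted-probability arrays rather than WEKA's Java API:
import numpy as np

# hypothetical per-classifier class-probability predictions on a validation set:
# preds has shape (n_classifiers, n_instances, n_classes); y_true holds the true labels
def calculate_fitness(weights, preds, y_true):
    weights = np.asarray(weights) / np.sum(weights)   # normalise the candidate weight vector
    combined = np.tensordot(weights, preds, axes=1)   # weighted average of the probability outputs
    accuracy = np.mean(np.argmax(combined, axis=1) == y_true)
    return accuracy                                   # score for the GA to maximise

# a GA (a library or a simple custom loop) would evolve candidate weight vectors
# and evaluate each one with calculate_fitness(candidate, preds, y_true)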

Determine most important feature per class

Imagine a machine learning problem where you have 20 classes and about 7000 sparse boolean features.
I want to figure out which 20 features are most distinctive for each class; in other words, features that are used a lot in a specific class but are not used, or hardly used, in the other classes.
What would be a good feature selection algorithm or heuristic that can do this?
When you train a multi-class logistic regression classifier, the trained model is a num_class x num_feature matrix, where the [i, j] entry is the weight of feature j in class i. The feature indices are the same as in your input feature matrix.
In scikit-learn you can access these parameters directly.
If you use a scikit-learn classification algorithm, you can find the most important features per class as follows:
import numpy as np
from sklearn.linear_model import SGDClassifier

regul = 1e-4  # example regularisation strength
clf = SGDClassifier(loss='log_loss', alpha=regul, penalty='l1', l1_ratio=0.9,
                    learning_rate='optimal', max_iter=10, shuffle=False, n_jobs=3, fit_intercept=True)
clf.fit(X_train, Y_train)  # X_train, Y_train: your training data
for i in range(clf.coef_.shape[0]):
    top20_indices = np.argsort(clf.coef_[i])[-20:]  # 20 largest weights for class i
    print(top20_indices)
clf.coef_ is the matrix containing the weight of each feature in each class, so clf.coef_[0][2] is the weight of the third feature in the first class.
If, when you build your feature matrix, you keep track of the index of each feature in a dictionary where dic[id] = feature_name, you'll be able to retrieve the names of the top features using that dictionary.
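For example (assuming a dic built while constructing the feature matrix, or a fitted vectorizer, both hypothetical here):
# dic[id] = feature_name was filled while building the feature matrix, or from a fitted vectorizer:
# dic = dict(enumerate(vectorizer.get_feature_names_out()))  # get_feature_names() in older versions
for i in range(clf.coef_.shape[0]):
    top20_indices = np.argsort(clf.coef_[i])[-20:]
    print(f"class {i}:", [dic[j] for j in reversed(top20_indices)])  # most important features first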
For more information, refer to the scikit-learn text classification example.
Random Forest and Naive Bayes should be able to handle this for you. Given the sparsity, I'd go for Naive Bayes first. Random Forest would be better if you're looking for feature combinations.

scikit learn classifies stopwords

Here is an example with a step-by-step procedure to train the system and classify input data.
It classifies the given 5 dataset categories correctly. However, it also assigns a class to input that consists only of stop words.
e.g.
Input : docs_new = ['God is love', 'what is where']
Output :
'God is love' => soc.religion.christian
'what is where' => soc.religion.christian
Here 'what is where' should not be classified, as it contains only stop words. How does scikit-learn behave in this scenario?
I am not sure what classifier you are using. But let's assume you use a Naive Bayes classifier.
In this case, the sample is labeled as the class for which the posterior probability is maximum given a particular pattern of words.
The posterior probability is calculated as
posterior ∝ likelihood × prior
(note that the evidence term was dropped since it is constant). Additionally, there is additive smoothing to avoid scenarios where the likelihood is zero.
Anyway, if you have only stop words in your input text, the likelihood is constant across all classes and the posterior probability is determined entirely by the prior. So what basically happens is that a Naive Bayes classifier (if the priors were estimated from the training data) will assign the class label that occurs most often in the training data.
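A small illustration of that behaviour, using MultinomialNB on hypothetical toy counts and an all-zeros feature vector, i.e. a document whose every token was removed as a stop word:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy counts: class 1 is three times as frequent as class 0
X = np.array([[2, 0], [0, 3], [1, 2], [0, 4]])
y = np.array([0, 1, 1, 1])

clf = MultinomialNB().fit(X, y)
empty_doc = np.zeros((1, 2))          # no surviving features
print(clf.predict(empty_doc))         # -> [1], the majority class
print(clf.predict_proba(empty_doc))   # equals the class priors, here [0.25, 0.75]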
A classifier always predicts one of the classes that it saw during its training phase, by definition. I don't know how you produced the classifier, but most likely it's just predicting the majority class for any sample without informative features; that's what naive Bayes, linear SVMs and other typical text classifiers do.
Standard text classification uses a TfidfVectorizer to transform the text into tokens and then into feature vectors that are fed to the classifier.
One of its init parameters is stop_words; with stop_words='english' the vectorizer will produce no features for the sentence 'what is where'.
Stop words are matched lexically against every input token using a built-in English stop-word list, which you can examine here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py
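A quick check of that behaviour, reusing the docs_new sentences from the tutorial as hypothetical training texts:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['God is love', 'OpenGL on the GPU is fast']
vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())            # 'is', 'on', 'the' are gone (get_feature_names() in older versions)
print(vec.transform(['what is where']).nnz)   # 0 -> no features at all for a stop-word-only sentence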

How to do text classification with label probabilities?

I'm trying to solve a text classification problem for academic purposes. I need to classify tweets into labels like "cloud", "cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in the training data has probabilities against all the labels. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has a 21% chance of being hot and a 79% chance of being wind. I have worked on classification problems that predict whether it's wind or hot or something else, but in this problem each training example has probabilities against all the labels. I have previously used the Mahout naive Bayes classifier, which takes a specific label for a given text to build the model. How can I convert these per-label probabilities into input for a classifier?
In a probabilistic setting, these probabilities reflect uncertainty about the class label of your training instance. This affects parameter learning in your classifier.
There's a natural way to incorporate this: in Naive Bayes, for instance, when estimating parameters, instead of each word contributing a count of one to the class the document belongs to, it contributes a fractional count equal to the document's probability of belonging to that class. Thus documents with a high probability of belonging to a class contribute more to that class's parameters. The situation is exactly equivalent to learning a mixture of multinomials model with EM, where the probabilities you have play the role of the membership/indicator variables for your instances.
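A rough sketch of those fractional counts, assuming hypothetical bag-of-words count matrices X (documents by words) and P (documents by classes, rows summing to one):
import numpy as np

# X: (n_docs, n_words) word counts; P: (n_docs, n_classes) label probabilities
def fit_soft_nb(X, P, alpha=1.0):
    class_priors = P.mean(axis=0)          # soft class frequencies
    word_counts = P.T @ X                  # (n_classes, n_words) fractional word counts
    word_probs = (word_counts + alpha) / (word_counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    return np.log(class_priors), np.log(word_probs)

def predict_soft_nb(X, log_priors, log_word_probs):
    return np.argmax(log_priors + X @ log_word_probs.T, axis=1)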
Alternatively, if your classifier were a neural net with a softmax output, instead of the target output being a vector with a single 1 and lots of zeros, the target output becomes the probability vector you're supplied with.
I don't, unfortunately, know of any standard implementations that would allow you to incorporate these ideas.
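Lacking a standard implementation, here is a minimal numpy sketch of the soft-target cross-entropy idea, with hypothetical logits and the 21%/79% example from the question:
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([1.2, -0.5, 0.3])          # hypothetical network outputs for [wind, hot, other]
target = np.array([0.79, 0.21, 0.0])         # supplied label probabilities instead of a one-hot vector

p = softmax(logits)
loss = -np.sum(target * np.log(p + 1e-12))   # cross-entropy with soft targets
grad = p - target                            # gradient w.r.t. the logits, same form as the one-hot case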
If you want an off-the-shelf solution, you could use a learner that supports multiclass classification and instance weights. Let's say you have k classes with probabilities p_1, ..., p_k. For each input instance, create k new training instances with identical features, with labels 1, ..., k, and assign weights p_1, ..., p_k respectively.
Vowpal Wabbit is one such learner that supports multiclass classification with instance weights.
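The same instance-replication trick also works with any scikit-learn estimator that accepts sample_weight; a sketch, assuming a feature matrix X and a per-instance probability matrix P (both hypothetical):
import numpy as np
from sklearn.linear_model import SGDClassifier

# X: (n, d) features; P: (n, k) per-instance label probabilities
def expand_with_weights(X, P):
    n, k = P.shape
    X_rep = np.repeat(X, k, axis=0)       # each instance duplicated k times
    y_rep = np.tile(np.arange(k), n)      # labels 0..k-1 for each copy
    w_rep = P.reshape(-1)                 # the probabilities become instance weights
    return X_rep, y_rep, w_rep

X_rep, y_rep, w_rep = expand_with_weights(X, P)
clf = SGDClassifier(loss='log_loss').fit(X_rep, y_rep, sample_weight=w_rep)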
