I am trying to implement Gaussian Naive Bayes from the scikit-learn library. I know that Naive Bayes is based on Bayes' theorem, which at a high level is defined as:
posterior = (prior * likelihood) / evidence.
As far as I know, the prior and evidence are learned from the training data.
I am not sure about the likelihood. Q1: Is it also learned from the training data, or is it obtained by maximum likelihood estimation? Q2: Are there any hyperparameters that need to be tuned?
Suppose you write Bayes' theorem as:
P(A|B) = (P(B|A)*P(A))/P(B)
Where,
P(A|B) = Posterior Probability
P(B|A) = Likelihood
P(A) = Prior Probability
P(B) = Marginal Likelihood (the evidence)
Answers to Your Question
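Q1: The likelihood is estimated from the training data. In Gaussian Naive Bayes, each feature is modelled within each class as a Normal distribution, and the mean and variance of that distribution are fitted by maximum likelihood estimation. The two options in your question are therefore not alternatives: maximum likelihood estimation is how the likelihood is learned from the training data.
Q2: Naive Bayes has almost no hyperparameters to tune. In scikit-learn's GaussianNB the only ones are the optional class priors and the var_smoothing term. Having so little to tune is one reason it tends to generalize reasonably well.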
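To make the answer concrete, here is a minimal pure-Python sketch of what fitting a Gaussian Naive Bayes model involves. It is not scikit-learn's actual implementation. The function names `fit_gaussian_nb` and `posterior`, the toy data, and the absolute `var_smoothing` constant are assumptions made for illustration; scikit-learn's `GaussianNB` has a parameter of the same name that plays a similar stabilising role, but it is scaled differently.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate priors and per-class feature means/variances from training data."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The MLE variance divides by n (not n - 1); the constant keeps it non-zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_smoothing
            for col, m in zip(columns, means)
        ]
        model[label] = {"prior": n / len(X), "means": means, "vars": variances}
    return model


def posterior(model, x):
    """Return P(class | x) for every class, normalised by the evidence."""
    log_joint = {}
    for label, p in model.items():
        total = math.log(p["prior"])
        for v, m, var in zip(x, p["means"], p["vars"]):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        log_joint[label] = total
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_joint.values())
    evidence = sum(math.exp(v - top) for v in log_joint.values())
    return {label: math.exp(v - top) / evidence for label, v in log_joint.items()}


# Hypothetical toy data: two features, two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9]]
y = ["a", "a", "a", "b", "b"]
model = fit_gaussian_nb(X, y)
probs = posterior(model, [1.1, 2.0])
```

Everything the model needs (prior, means, variances) comes straight from the training data, and the evidence is just the normalising sum over classes, so nothing is left to tune apart from the smoothing constant.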
Related
I'm working on sentiment analysis using an NB classifier. I've found some sources (blogs, tutorials, etc.) saying that the training corpus should be balanced:
33.3% Positive;
33.3% Neutral
33.3% Negative
My question is:
Why should the corpus be balanced? Bayes' theorem is based on the probability of a reason/case. So for training purposes, isn't it important that in the real world, for example, negative tweets are only 10% and not 33.3%?
You are correct, balancing data is important for many discriminative models, but not really for NB.
However, it might still be beneficial to bias the P(y) estimates to get better predictive performance (because of the various simplifications the model makes, the probability assigned to the minority class can be heavily underestimated). For NB this is not about balancing the data, but about directly modifying the estimated P(y) so that accuracy on the validation set is maximised.
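As a rough sketch of that idea for a two-class problem: keep the class-conditional likelihoods fixed and search over candidate values of the minority-class prior, keeping the one with the best validation accuracy. The helper names `predict` and `accuracy`, the validation values, and the grid of candidate priors are all hypothetical.

```python
def predict(likelihoods, prior_minority):
    """Pick the class with the larger prior * likelihood.

    likelihoods is a (P(x | major), P(x | minor)) pair.
    """
    lik_major, lik_minor = likelihoods
    score_major = (1.0 - prior_minority) * lik_major
    score_minor = prior_minority * lik_minor
    return "minor" if score_minor > score_major else "major"


def accuracy(validation, prior_minority):
    """Fraction of validation examples classified correctly under this prior."""
    hits = sum(predict(lik, prior_minority) == label for lik, label in validation)
    return hits / len(validation)


# Hypothetical validation set: (class-conditional likelihoods, true label).
validation = [
    ((0.30, 0.10), "major"),
    ((0.20, 0.15), "major"),
    ((0.05, 0.20), "minor"),
    ((0.10, 0.25), "minor"),
    ((0.40, 0.05), "major"),
]

# Grid of candidate priors: 0.05, 0.10, ..., 0.95.
candidates = [i / 20 for i in range(1, 20)]
best_prior = max(candidates, key=lambda p: accuracy(validation, p))
```

On this made-up data a prior of 0.1 for the minority class misclassifies both minority examples, while the selected prior classifies all five correctly. The chosen value should be picked on a validation set and then checked on separate test data.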
In my opinion the best dataset for training purposes is a sample of the real-world data that your classifier will be used with.
This is true for all classifiers (although some of them are indeed not suited to unbalanced training sets, in which case you have no choice but to skew the distribution), but particularly for probabilistic classifiers such as Naive Bayes. So the best sample should reflect the natural class distribution.
Note that this matters for more than the class prior estimates. Naive Bayes will calculate, for each feature, the likelihood of the class given that feature. If your Bayesian classifier is built specifically to classify texts, it will use global document frequency measures (the number of times a given word occurs in the dataset, across all categories). If the number of documents per category in the training set doesn't reflect the natural distribution, the global frequency of terms usually seen in infrequent categories will be overestimated, and that of terms usually seen in frequent categories underestimated. Thus not only will the prior class probabilities be incorrect, but so will all the P(category=c|term=t) estimates.
I just coded a Naive Bayes classifier for text classification that is giving me expected results. My features are words, and my classes are text classes. I've coded a multinomial Naive Bayes classifier.
However I would prefer my classifier to output real percentage values ...
To do so I've got to compute the evidence probability as explained in this wikipedia page.
I have no problem computing the prior and the conditional probabilities. However, I do not know how to compute the evidence probability P(X), and the little documentation that discusses it is not very clear.
I've tried :
P(X) as the product of P(Xi), where Xi is my feature (basically the product of the proportion of each feature within the pool).
P(X) as the sum over all classes Ck of P(Ck) * (product of P(Xi|Ck)).
Neither of these solutions gives me correct percentages ...
Do you know how to compute the evidence probability in my case?
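For what it's worth, the second attempt is the law of total probability applied to the Naive Bayes factorisation, so it has the right form for the evidence under the model. The first attempt assumes the features are independent without conditioning on the class, which is not what the model assumes, so posteriors divided by it need not sum to one. If the second form still gives odd percentages, one possible cause (an assumption, since the code is not shown) is numerical underflow from multiplying many small probabilities. Below is a minimal sketch that does the same computation in log space; the function name `posteriors`, the dictionary layout, and the example numbers are hypothetical.

```python
import math


def posteriors(log_priors, log_likelihoods, counts):
    """Return P(class | document) for a multinomial Naive Bayes model.

    log_priors:      {class: log P(class)}
    log_likelihoods: {class: {word: log P(word | class)}}
    counts:          {word: number of occurrences in the document}
    """
    log_joint = {
        c: log_priors[c]
        + sum(n * log_likelihoods[c][w] for w, n in counts.items())
        for c in log_priors
    }
    # Evidence = sum over classes of the joint, computed with log-sum-exp:
    # subtract the maximum before exponentiating to avoid underflow.
    top = max(log_joint.values())
    log_evidence = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {c: math.exp(v - log_evidence) for c, v in log_joint.items()}


# Hypothetical two-class model with a two-word document vocabulary.
log_priors = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_likelihoods = {
    "spam": {"free": math.log(0.5), "meeting": math.log(0.1)},
    "ham": {"free": math.log(0.1), "meeting": math.log(0.4)},
}
result = posteriors(log_priors, log_likelihoods, {"free": 2, "meeting": 1})
```

Here the joint values are 0.4 * 0.5^2 * 0.1 = 0.01 for "spam" and 0.6 * 0.1^2 * 0.4 = 0.0024 for "ham", so the evidence is 0.0124 and the two posteriors sum to one. The multinomial coefficient is the same for every class and cancels, which is why it is left out.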
I have a OneVsRestClassifier (scikit-learn) which has been trained.
clf = OneVsRestClassifier(LogisticRegression(C=1.2, penalty='l1')).fit(X_train, y_train)
I want to find out the loss on my test data. I used the log_loss function, but it does not seem to work because I have multiple classes as outputs for each test case. What should I do?
The classification problem that you are referring to is known as a Multi-Label Classification problem. Using the OneVsRestClassifier is a good choice for this purpose. By default, the score method uses subset accuracy, which is a very harsh metric because it requires you to predict the entire set of labels correctly for each sample.
Some other loss functions, provided by scikit-learn, that you can use are as follows:
Hamming Loss - This measures the Hamming distance between your predicted labels and the true labels, i.e. the fraction of individual labels that are predicted incorrectly. This is an intuitive formula to understand the Hamming distance.
Jaccard Similarity Coefficient Score - This measures the Jaccard similarity between your predicted labels and the true labels.
Precision, Recall and F-Measures - In the case of multi-label classification, the notion of Precision, Recall and F-Measures can be applied to each class independently. The following guide explains how to combine them across all labels in multi-label classification.
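To show what the first two metrics measure compared with subset accuracy, here is a small pure-Python sketch on label indicator rows. The helper functions and the example labels are hypothetical and written only for illustration; in practice you would call the scikit-learn counterparts `sklearn.metrics.hamming_loss`, `sklearn.metrics.jaccard_score` (with `average='samples'`) and `sklearn.metrics.accuracy_score`.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return wrong / sum(len(row) for row in y_true)


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label set is predicted exactly."""
    exact = sum(row_true == row_pred for row_true, row_pred in zip(y_true, y_pred))
    return exact / len(y_true)


def jaccard_per_sample(y_true, y_pred):
    """Mean over samples of |true AND pred| / |true OR pred|."""
    scores = []
    for row_true, row_pred in zip(y_true, y_pred):
        both = sum(1 for t, p in zip(row_true, row_pred) if t and p)
        either = sum(1 for t, p in zip(row_true, row_pred) if t or p)
        # Assumption: a sample with no true and no predicted labels scores 1.0.
        scores.append(both / either if either else 1.0)
    return sum(scores) / len(scores)


# Hypothetical labels: 3 samples, 3 possible labels each (1 = label present).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]

loss = hamming_loss(y_true, y_pred)
exact = subset_accuracy(y_true, y_pred)
jaccard = jaccard_per_sample(y_true, y_pred)
```

On this example only one of the three samples is matched exactly, so subset accuracy is 1/3, even though just 3 of the 9 individual labels are wrong (Hamming loss 1/3) and the mean Jaccard score is about 0.61. That gap is what makes subset accuracy so harsh.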
If you also need to rank the labels, as is done in multi-label ranking problems, then there are other more advanced techniques available in scikit-learn which are very well documented with examples here. If you are dealing with this kind of problem, let me know in the comments and I will explain each of these metrics in more detail.
Hope this helps!
I'm trying to modify a standard kNN algorithm to obtain the probability of belonging to a class instead of just the usual classification. I haven't found much information about Probabilistic kNN, but as far as I understand, it works similarly to kNN, with the difference that it calculates the percentage of examples of every class inside the given radius.
So I wonder, what is the difference between Naive Bayes and Probabilistic kNN? All I can spot is that Naive Bayes takes the prior probability into consideration, while PkNN does not. Am I getting it wrong?
Thanks in advance!
To be honest there is nearly no similarity.
Naive Bayes assumes that each class follows a simple distribution whose features are independent of one another. In the continuous case, it fits a Normal distribution with a diagonal covariance matrix to each class as a whole and then makes its decision through argmax_y P(y) N(x | m_y, Sigma_y).
KNN, on the other hand, is not a probabilistic model. The modification you are referring to is simply a "smooth" version of the original idea, where you return the ratio of each class in the set of nearest neighbours (and this is not really any "probabilistic kNN", it is just regular kNN with a rough estimate of probability). It assumes nothing about the data distribution (besides it being locally smooth). In particular, it is a nonparametric model which, given enough training samples, will fit perfectly to any dataset. Naive Bayes will fit perfectly only to K Gaussians (where K is the number of classes).
(I don't know how to format math formulas. For more details and clear representations, please see this.)
I would like to propose an opposite view: KNN is a kind of simplified Naive Bayes (NB), if we view KNN as a means of density estimation.
To perform density estimation, we attempt to estimate p(x) = k/NV, where k is the number of samples lying in a region R, N is the total number of samples, and V is the volume of the region R. Usually, there are two ways to estimate it: (1) fix V and calculate k, which is known as kernel density estimation or the Parzen window; (2) fix k and calculate V, which is KNN-based density estimation. The latter is much less well known than the former because of its many drawbacks.
Yet we can use KNN-based density estimation to connect KNN and NB. Given N samples in total, Ni of them in class ci, we can write NB in the form of KNN-based density estimation by considering a region containing x:
P(ci|x) = P(x|ci)P(ci)/P(x) = (ki/NiV)(Ni/N)/(k/NV) = ki/k,
where ki is the number of samples of class ci lying in the region. The final form ki/k is actually the KNN classifier.
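The final ratio ki/k can be sketched in a few lines of Python (3.8 or later, for `math.dist`). The function name `knn_class_ratios`, the Euclidean distance, and the toy points are assumptions for illustration; the "region" here is simply whatever ball happens to contain the k nearest training points.

```python
import math
from collections import Counter


def knn_class_ratios(train, query, k):
    """Return {class: ki / k} over the k training points nearest to query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / k for label, n in counts.items()}


# Hypothetical 2-D training points with two classes.
train = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.2), "a"),
    ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"),
    ((0.9, 1.1), "b"),
]
ratios = knn_class_ratios(train, (0.3, 0.3), k=4)
```

With k = 4 the neighbourhood of the query holds three points of class "a" and one of class "b", so the estimate is 0.75 versus 0.25. Note that Ni and N cancel in the derivation, which is why the class prior never appears explicitly in the result.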
I have a data set consisting of both categorical and continuous attributes. I want to apply the Naive Bayes classification method to classify the data.
How do I calculate probabilities for both of these types?
Should I use the counting method for the categorical data, and assume some distribution and calculate from that for the continuous data?
Since Naive Bayes assumes that the features are independent of one another given the class label, you have
P(cat1, con1|y) = P(cat1|y)P(con1|y)
where cat1 is some categorical variable and con1 is continuous. You model each of these probabilities completely independently. As you suggested, for the categorical variable you can use a simple empirical estimator (but remember to apply some smoothing so that you do not get zero probabilities), and for the continuous variable you need a more sophisticated estimator (such as MLE over a fixed family of distributions, for example Gaussians, or something more complex, such as any probabilistic classifier/model).
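A minimal sketch of that recipe for one categorical and one continuous feature follows. It uses Laplace (add-alpha) smoothing for the categorical part and a Gaussian fitted by MLE for the continuous part. The function names `fit` and `log_joint`, the `alpha` default, the small variance floor, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def fit(rows, labels, categories, alpha=1.0):
    """Fit per-class estimators; rows are (categorical, continuous) pairs."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    model = {}
    for label, items in grouped.items():
        n = len(items)
        cat_counts = Counter(cat for cat, _ in items)
        # Laplace smoothing: unseen categories still get non-zero probability.
        cat_probs = {
            c: (cat_counts[c] + alpha) / (n + alpha * len(categories))
            for c in categories
        }
        values = [v for _, v in items]
        mean = sum(values) / n
        # MLE variance, with a tiny floor so it can never be exactly zero.
        var = sum((v - mean) ** 2 for v in values) / n + 1e-9
        model[label] = {"prior": n / len(rows), "cat": cat_probs, "mean": mean, "var": var}
    return model


def log_joint(model, cat, con):
    """log P(y) + log P(cat | y) + log P(con | y) for every class y."""
    out = {}
    for label, p in model.items():
        log_gauss = (
            -0.5 * math.log(2 * math.pi * p["var"])
            - (con - p["mean"]) ** 2 / (2 * p["var"])
        )
        out[label] = math.log(p["prior"]) + math.log(p["cat"][cat]) + log_gauss
    return out


# Hypothetical toy data: (colour, measurement) pairs with two classes.
rows = [("red", 1.0), ("red", 1.2), ("blue", 0.9), ("blue", 3.0), ("blue", 3.4)]
labels = ["a", "a", "a", "b", "b"]
categories = ["red", "blue", "green"]

model = fit(rows, labels, categories)
scores = log_joint(model, "green", 1.1)
best = max(scores, key=scores.get)
```

The query uses "green", which never occurs in training. Without smoothing its probability would be zero for both classes and the logarithm would fail; with smoothing the decision falls to the continuous feature, which favours class "a".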