I have to implement a Naive Bayes classifier for classifying documents into classes. For the conditional probability of a term given a class, with Laplace smoothing, we have:
prob(t | c) = (Num(occurrences of t in the docs of class c) + 1) / (Num(documents in class c) + |V|)
It's a Bernoulli model, which will have either 1 or 0, and the vocabulary is really large, perhaps around 20,000 words. So won't the Laplace smoothing give really small values due to the large size of the vocabulary, or am I doing something wrong?
According to the pseudocode from this link: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html, for the Bernoulli model we just add 2 instead of |V|. Why is that?
Consider the case of multinomial naive Bayes. The smoothing you defined above is such that you can never get a zero probability.
With the multivariate/Bernoulli case, there is an additional constraint: probabilities of exactly 1 are not allowed either. This is because when some t from the known vocabulary is not present in the document d, a probability of 1 - prob(t | c) is multiplied to the document probability. If prob(t | c) is 1, then once again this is going to produce a posterior probability of 0.
(Likewise, when using logs instead, log(1 - prob(t | c)) is undefined when the probability is 1)
So in the Bernoulli equation, (Nct + 1) / (Nc + 2), both cases are protected against. If Nct == Nc, the estimate is (Nc + 1) / (Nc + 2), which is strictly less than 1; if Nct == 0, it is 1 / (Nc + 2), which is strictly greater than 0. In the degenerate case Nct == Nc == 0, the estimate is 1/2, which produces a likelihood of 1/2 regardless of whether t is present (P(t | c) == 1/2) or absent (1 - P(t | c) == 1/2).
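To make this concrete, here is a minimal Python sketch (the counts are made up) of the (Nct + 1) / (Nc + 2) estimate and of how every vocabulary term, present or absent, contributes to the document log-likelihood under the Bernoulli model:

import numpy as np

# Sketch of the smoothed Bernoulli estimate discussed above; the counts are made up.
# Nct = number of documents of class c that contain term t, Nc = number of documents of class c.
def bernoulli_prob(Nct, Nc):
    # P(t | c) = (Nct + 1) / (Nc + 2): never exactly 0 and never exactly 1
    return (Nct + 1) / (Nc + 2)

print(bernoulli_prob(10, 10))  # term seen in every doc of the class -> 11/12, not 1.0
print(bernoulli_prob(0, 10))   # term never seen in the class        -> 1/12,  not 0.0

# Under the Bernoulli model every vocabulary term contributes to the document
# log-likelihood: log P(t|c) if the term is present, log(1 - P(t|c)) if it is absent.
def doc_log_likelihood(present, probs):
    present = np.asarray(present, dtype=bool)
    probs = np.asarray(probs, dtype=float)
    return np.sum(np.where(present, np.log(probs), np.log1p(-probs)))

print(doc_log_likelihood([True, False, False], [11/12, 1/12, 6/12]))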
I'd like to classify a set of 3D images (MRI). There are 4 classes (i.e. grades of disease A, B, C, D), where the distinction between the 4 grades is not trivial, therefore the labels I have for the training data are not one class per image. Each label is a set of 4 probabilities, one per class, e.g.
0.7 0.1 0.05 0.15
0.35 0.2 0.45 0.0
...
... would basically mean that
The first image belongs to class A with a probability of 70%, class B with 10%, C with 5% and D with 15%
etc., I'm sure you get the idea.
I don't understand how to fit a model with these labels, because scikit-learn classifiers expect only one label per training sample. Using just the class with the highest probability gives miserable results.
Can I train my model with scikit-learn multilabel classification (and how)?
Please note:
Feature extraction is not the problem.
Prediction is not the problem.
Can I handle this somehow with the multilabel classification framework?
For predict_proba to return the probability for each class A, B, C, D the classifier needs to be trained with one label per image.
If yes: How?
Use the image class as the label (Y) in your training set. That is, your input dataset will look something like this:
F1 F2 F3 F4 Y
1 0 1 0 A
0 1 1 1 B
1 0 0 0 C
0 0 0 1 D
(...)
where F# are the features of each image and Y is the class as assigned by the doctors.
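As a rough sketch of what that looks like in scikit-learn (the feature values below are made up, and RandomForestClassifier is just one possible choice; any classifier that exposes predict_proba would do):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy version of the table above: F1..F4 are made-up features, y is the single
# class per image as assigned by the doctors.
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 0, 1]])
y = np.array(['A', 'B', 'C', 'D'])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# predict_proba then returns one probability per class (A, B, C, D) for a new image.
print(clf.classes_)
print(clf.predict_proba([[1, 0, 1, 1]]))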
If no: Any other approaches?
For the case where you have more than one label per image, that is multiple potential classes or their respective probabilities, multilabel models might be a more appropriate choice, as documented in Multiclass and multilabel algorithms.
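If you do want to try the multilabel route, a minimal sketch of that format in scikit-learn could look like the following; the label sets are hypothetical, and MultiLabelBinarizer with OneVsRestClassifier is just one way to set it up:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: each image may carry more than one plausible grade.
X = np.array([[1, 0, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 0, 0],
              [0, 0, 0, 1]], dtype=float)
labels = [['A'], ['A', 'B'], ['C', 'D'], ['D']]

Y = MultiLabelBinarizer().fit_transform(labels)            # binary indicator matrix, one column per class
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

print(clf.predict([[1, 0, 1, 1]]))        # one 0/1 indicator per class
print(clf.predict_proba([[1, 0, 1, 1]]))  # per-class probabilities, not forced to sum to 1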
This is a very basic question, but I could not find enough reasons to convince myself. Why must logistic regression use multiplication instead of addition for the likelihood function l(w)?
Your question is more general than just joint likelihood for logistic regression. You're asking why we multiply probabilities instead of add them to represent a joint probability distribution. Two notes:
This applies when we assume the random variables are independent. Otherwise we need to calculate conditional probabilities using the chain rule of probability. You can look at Wikipedia for more information.
We multiply because that's how the joint distribution is defined. Here is a simple example:
Say we have two probability distributions:
X = 1, 2, 3, each with probability 1/3
Y = 0 or 1, each with probability 1/2
We want to calculate the joint likelihood, L(X=x, Y=y), i.e. the probability that X takes the value x and Y takes the value y.
For example, L(X=1,Y=0) = P(X=1) * P(Y=0) = 1/6. It wouldn't make sense to write P(X=1) + P(Y=0) = 1/3 + 1/2 = 5/6.
Now it's true that in maximum likelihood estimation, we only care about those values of some parameter, theta, which maximize the likelihood function. In this case, we know that if theta maximizes L(X=x, Y=y) then the same theta also maximizes log L(X=x, Y=y). This is where you may have seen addition of log-probabilities come into play:
Hence we can take log P(X=x, Y=y) = log P(X=x) + log P(Y=y)
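A quick numeric check of the example in Python:

import math

# X uniform on {1, 2, 3}, Y uniform on {0, 1}, assumed independent as above.
p_x1 = 1 / 3
p_y0 = 1 / 2

joint = p_x1 * p_y0            # P(X=1, Y=0) = 1/6
not_a_joint = p_x1 + p_y0      # 5/6 -- this is not a joint probability

# On the log scale the product becomes a sum.
assert math.isclose(math.log(joint), math.log(p_x1) + math.log(p_y0))
print(joint, not_a_joint)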
In short
This could be summarized as "joint probabilities represent an AND". When X and Y are independent, P(X AND Y) = P(X,Y) = P(X)P(Y). Not to be confused with P(X OR Y) = P(X) + P(Y) - P(X,Y).
Let me know if this helps.
In Graphical Models and Bayesian Networks, how do you implement XOR problem?
I read about Bayesian network vs. Bayes classifier here:
A Naive Bayes classifier is a simple model that describes a particular class of Bayesian network, where all of the features are class-conditionally independent. Because of this, there are certain problems that Naive Bayes cannot solve (example below). However, its simplicity also makes it easier to apply, and it requires less data to get a good result in many cases.
Example: XOR. You have a learning problem with binary features x_1, x_2 and a target variable y = x_1 XOR x_2.
In a Naive Bayes classifier, x_1 and x_2 must be treated independently - so you would compute things like "The probability that y = 1 given that x_1 = 1" - hopefully you can see that this isn't helpful, because x_1 = 1 doesn't make y = 1 any more or less likely. Since a Bayesian network does not assume independence, it would be able to solve such a problem.
I googled, but could not figure out how. Can someone give me a hint or good references? Thanks!
This is actually fairly simple.
The DAG of the model would look like
x1 -> XOR <- x2
The probability distribution for the XOR node can then be written
x1 x2 | P(XOR=1|x1,x2)
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 0
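Here is a small Python sketch of inference by enumeration over that network; the uniform priors on x1 and x2 are an assumption, since only the XOR table is given above:

import itertools

# Conditional probability table P(XOR=1 | x1, x2) from the table above,
# plus assumed uniform priors on the parent nodes x1 and x2.
p_x1 = {0: 0.5, 1: 0.5}
p_x2 = {0: 0.5, 1: 0.5}
p_xor1_given = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}

# P(XOR=1) = sum over x1, x2 of P(x1) * P(x2) * P(XOR=1 | x1, x2)
p_xor1 = sum(p_x1[a] * p_x2[b] * p_xor1_given[(a, b)]
             for a, b in itertools.product((0, 1), repeat=2))
print(p_xor1)  # 0.5 with uniform parents

# The network conditions XOR on both parents jointly, which is exactly what
# Naive Bayes cannot do, since it treats x1 and x2 independently given y.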
What should be taken as m in the m-estimate of probability in Naive Bayes?
So for this example (3 training samples of the class "selected", with professions teacher, teacher, and farmer),
what m value should I take? Can I take it to be 1?
Here p = prior probability = 0.5.
So can I take P(a_i | selected) = (n_c + 0.5) / (3 + 1)?
For Naive Bayes text classification, the given formula is P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|).
In the book it says that this is adopted from the m-estimate by assuming uniform priors and m equal to the size of the vocabulary.
But if we have only 2 classes then p = 0.5. So how can m*p be 1? Shouldn't it be |Vocabulary| * 0.5? How is this equation obtained from the m-estimate?
In calculating the probabilities for the attribute profession, with prior probabilities of 0.5 and m = 1:
P(teacher|selected)=(2+0.5)/(3+1)=5/8
P(farmer|selected)=(1+0.5)/(3+1)=3/8
P(Business|Selected)=(0+0.5)/(3+1)= 1/8
But shouldn't the class probabilities add up to 1? In this case it is not.
Yes, you can use m = 1. According to Wikipedia, a pseudocount of 1 corresponds to Laplace smoothing. m is generally chosen to be small (I have read that m = 2 is also used), especially if you don't have that many samples in total, because a higher m distorts your data more.
Background information: the parameter m is also known as the pseudocount (virtual examples) and is used for additive smoothing. It prevents the probabilities from being 0. A zero probability is very problematic, since it drives the whole product to 0. I found a nice example illustrating the problem in this book preview here (search for pseudocount).
"m estimate of probability" is confusing.
In the given example, m and p should be like this:
m = 3 (this could be any value; you can specify it yourself)
p = 1/3 = 1/|v| (where |v| is the number of unique values of the feature)
If you use m = |v|, then m*p = 1, which is Laplace smoothing. The "m-estimate of probability" is a generalized version of Laplace smoothing.
In the above example you may think m = 3 is too much; in that case you can reduce m, for example to 0.2.
I believe the uniform prior should be 1/3, not 1/2. This is because you have 3 professions, so you're assigning equal prior probability to each one. That way, the probabilities you listed sum to 1 (and if you also take m = |v| = 3, then m*p = 1, i.e. Laplace smoothing).
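For reference, here is the calculation redone in Python with p = 1/3, using the counts from the profession example:

# m-estimate: (n_c + m*p) / (n + m), with the counts from the profession example.
def m_estimate(n_c, n, m, p):
    return (n_c + m * p) / (n + m)

n, m, p = 3, 1, 1 / 3
counts = {'teacher': 2, 'farmer': 1, 'business': 0}
probs = {prof: m_estimate(n_c, n, m, p) for prof, n_c in counts.items()}

print(probs)                 # teacher 7/12, farmer 1/3, business 1/12
print(sum(probs.values()))   # 1.0 -- with p = 1/3 the estimates sum to 1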
With p = uniform prior = 1/|Vocabulary| and m = |Vocabulary|, the m-estimate (n_k + m*p) / (n + m) becomes:
P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
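A quick numeric check of that reduction (the vocabulary size and counts are made up):

import math

V = 20000          # hypothetical vocabulary size
n_k, n = 3, 120    # made-up counts for one word and one class
m, p = V, 1 / V    # m = |Vocabulary|, p = uniform prior over words

# The m-estimate with these choices equals simple add-one smoothing.
assert math.isclose((n_k + m * p) / (n + m), (n_k + 1) / (n + V))
print((n_k + 1) / (n + V))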
I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data.
Mitchell defines a perceptron as follows:
o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, and -1 otherwise.
That is, the output is 1 or -1 depending on whether the weighted sum of the inputs exceeds the threshold -w_0.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight w_i associated with input x_i according to the rule:
w_i <- w_i + eta * (t - o) * x_i
where t is the target output, o is the perceptron's output, and eta is the learning rate.
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of the x's, you are, in fact, splitting your data into 2 classes with the hyperplane a_1 x_1 + … + a_n x_n = 0 (one class where a_1 x_1 + … + a_n x_n > 0, the other where it is < 0).
Consider a 2D example: X = (x, y) and W = (a, b), then X · W = a*x + b*y. sgn returns 1 if its argument is greater than 0; that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -(a/b) x (assuming b > 0; the inequality flips if b < 0). This equation is linear and divides the 2D plane into 2 parts.
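To see the rule in action, here is a minimal Python sketch of the perceptron training rule on a linearly separable toy problem (logical AND with targets in {-1, +1}); the learning rate, initialization, and epoch cap are arbitrary choices:

import numpy as np

# Logical AND is linearly separable; x0 = 1 is prepended so w0 plays the role of the threshold.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])

rng = np.random.default_rng(0)
w = rng.normal(size=3)
eta = 0.1

for epoch in range(100):                  # iterate until no example is misclassified
    errors = 0
    for x_i, t_i in zip(X, t):
        o_i = 1 if x_i @ w > 0 else -1    # sgn(w . x)
        if o_i != t_i:
            w += eta * (t_i - o_i) * x_i  # perceptron training rule: w <- w + eta*(t - o)*x
            errors += 1
    if errors == 0:
        break

print(w)
print(np.where(X @ w > 0, 1, -1))         # matches t because the data is linearly separable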