Naive Bayes Confusion

I'm working on homework for my machine learning course and am having trouble understanding the question on Naive Bayes. The problem I have is a variation of question number 2 on the following page:
https://www.cs.utexas.edu/~mooney/cs343/hw3-old/hw3.html
The numbers I have are slightly different, so I'll replace the numbers from my assignment with the example above. I'm currently attempting to figure out the probability that the first text is physics. To do so, I have something that looks a little like this:
P(physics|X) = P(physics) * P(carbon|physics) * P(atom|physics) * P(life|physics) * P(earth|physics) / [SOMETHING]
P(physics|X) = 0.35 * 0.005 * 0.1 * 0.001 * 0.005 / [SOMETHING]
I'm basing this off of an example that I've seen in my notes, but I can't seem to figure out what I'm supposed to divide by. I'll provide the example from the notes as well.
Perhaps I'm going about this in the wrong way, but I'm unsure where the P(X) term that we're dividing by is coming from. How does this relate to the probability that the text is physics? I feel that getting this issue resolved will make the remainder of the assignment simple.

The denominator P(X) is just the sum of P(X|Y)*P(Y) for all possible classes.
Now, it's important to note that in Naive Bayes, you do not have to compute this P(X). You only have to compute P(X|Y)*P(Y) for each class, and then select the class that produced the highest probability.
In your case, I assume you must have several classes. You mentioned physics, but there must be others like chemistry or math.
So you can compute:
P(physics|X) = P(X|physics) * P(physics) / P(X)
P(chemistry|X) = P(X|chemistry) * P(chemistry) / P(X)
P(math|X) = P(X|math) * P(math) / P(X)
P(X) is the sum of P(X|Y)*P(Y) for all classes:
P(X) = P(X|physics)*P(physics) + P(X|chemistry)*P(chemistry) + P(X|math)*P(math)
(By the way, the above statement is exactly analogous to the example in the image that you provided. The equations are a bit complicated there, but if you rearrange them, you will find that P(X) = P(X|positive)*P(positive) + P(X|negative)*P(negative) in that example).
To produce the answer (that is, to determine Y among physics, chemistry, or math), you would select the maximum value among P(physics|X), P(chemistry|X), and P(math|X).
As I mentioned, you do not need to compute P(X) because this term exists in the denominator of all of P(physics|X), P(chemistry|X), and P(math|X). Thus, you only need to determine the max among P(X|physics)*P(physics), P(X|chemistry)*P(chemistry), and P(X|math)*P(math).
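To make that concrete, here is a minimal Python sketch (with made-up priors and per-word likelihoods, since I don't have your actual numbers) showing that the unnormalized scores and the normalized posteriors pick the same class:

# Made-up numbers, purely for illustration.
priors = {'physics': 0.35, 'chemistry': 0.40, 'math': 0.25}   # P(Y)
likelihoods = {                                                # P(X|Y): product of the word probabilities
    'physics':   0.005 * 0.1 * 0.001 * 0.005,
    'chemistry': 0.010 * 0.02 * 0.002 * 0.001,
    'math':      0.001 * 0.001 * 0.0005 * 0.0001,
}

# Unnormalized scores: P(X|Y) * P(Y)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# P(X) is the sum of those scores over all classes.
p_x = sum(scores.values())

# Normalized posteriors: P(Y|X) = P(X|Y) * P(Y) / P(X)
posteriors = {c: scores[c] / p_x for c in scores}

# Dividing every class by the same P(X) cannot change which one is largest.
print(max(scores, key=scores.get))          # predicted class without ever computing P(X)
print(max(posteriors, key=posteriors.get))  # same predicted class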

The point is that you don't really need a value for P(X), because it is the same for all classes. So you can ignore it and just compare the numbers before the division step; the highest number gives the predicted class.
The reason it appears in the equation at all is that it comes from Bayes' rule:
P(C1|X) = P(X|C1) * P(C1) / P(X)

Related

Trying to understand expected value in Linear Regression

I'm having trouble understanding a lecture slide in my school's machine learning course
Why does the expected value of Y equal f(X)? What does that mean?
My understanding is that X and Y are vectors, and f(X) outputs a vector Y where each individual value (y_i) in the Y vector corresponds to f(x_i), where x_i is the value in X at index i. But now it's taking the expected value of Y, which is going to be a single value, so how is that equal to f(X)?
X, Y (uppercase) are vectors
x_i,y_i (lowercase with subscript) are scalars at index i in X,Y
There is a lot of confusion here. First, let's start with definitions.
Definitions
Expectation operator E[.]: takes a random variable as input and gives a scalar/vector as output. Say Y is a normally distributed random variable with mean Mu and variance Sigma^2, usually stated as
Y ~ N(Mu, Sigma^2); then E[Y] = Mu.
Function f(.): Takes a scalar/vector (not a random variable) and gives a scalar/vector. In this context it is an affine function, that is f(X) = a*X + b where a and b are fixed constants.
What's Going On
Now you can view linear regression from two angles.
Stats View
One angle assumes that your response variable, Y, is a normally distributed random variable because
Y = a*X + b + epsilon
where
epsilon ~ N(0, sigma^2)
and X follows some other distribution. We don't really care how X is distributed and treat it as given. In that case the conditional distribution is
Y|X ~ N(a*X + b, sigma^2)
Notice here that a, b, and also X are just numbers; there is no randomness associated with them.
Maths View
The other view is the math view, where I assume that there is a function f(.) that governs the real-life process: if in real life I observe X, then f(X) should be the output. Of course this is not exactly the case, and the deviations are assumed to be due to various reasons such as gauge error, etc. The claim is that this function is linear:
f(X) = a*X + b
Synthesis
Now how do we combine these? Well, as follows:
E[Y|X] = a*X + b = f(X)
Regarding your question, I would first point out that it should be Y|X and not Y by itself.
Second, there are tons of possible ontological discussions about what each term here represents in real life. X, Y (uppercase) could be vectors. X, Y (uppercase) could also be random variables. A sample of these random variables might be stored in vectors, and both would be represented with uppercase letters (the best way is to use different fonts for each). In this case, your sample becomes your data. Discussions about the general view of the model and its relevance to real life should be made at the random-variable level. Discussions about how to infer the parameters and how linear regression algorithms work should be made at the matrix and vector level. There could be other discussions where you need to care about both.
I hope this somewhat disorganized answer helps you. In general, if you want to learn this kind of material, make sure you know what kinds of mathematical objects and operators you are dealing with, what they take as input, and how they relate to real life.
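As a concrete illustration of the synthesis above, here is a small simulation sketch (the values a = 2, b = 1, sigma = 0.5 and x = 3 are arbitrary choices for this example) showing that the average of many draws of Y given a fixed X approaches f(X) = a*X + b:

import numpy as np

rng = np.random.default_rng(0)

a, b, sigma = 2.0, 1.0, 0.5  # arbitrary constants for illustration
x = 3.0                      # condition on a fixed value of X

# Draw many realizations of Y | X = x, where Y = a*x + b + epsilon.
epsilon = rng.normal(0.0, sigma, size=100_000)
y = a * x + b + epsilon

print(y.mean())    # sample mean of Y | X = x, close to 7.0
print(a * x + b)   # f(x) = E[Y | X = x] = 7.0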

Z3: express linear algebra properties

I would like to prove properties of expressions involving matrices and vectors (potentially large size, but size is fixed).
For example I want to prove that the outcome of an expression is a diagonal matrix or a triangular matrix, or it is positive definite, ...
To that end I'd like encode well known properties and identities from linear algebra, such as:
||x + y|| <= ||x|| + ||y||
(A * B) * C = A * (B * C)
det(A * B) = det(A) * det(B)
Tr(zA) = z * Tr(A)
(I + AB) ^ (-1) = I - A(I + BA) ^ (-1) * B
...
I have attempted to implement this in Z3, but even for simple properties it returns unknown or times out. I've tried with the array theory and quantifiers.
I'd like to know whether this problem can be solved with an SMT solver, or whether SMT solvers are not suited for these kinds of problems. Could you give a hint with a small example?
You can certainly use Z3 to do this.
I have constructed a small example here, which defines the identity matrix and what it means to be a diagonal matrix, and then proves that the identity matrix is diagonal.
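The linked example isn't reproduced here, but a sketch in the same spirit might look like the following (my own reconstruction, assuming the Z3 Python API and a fixed 3x3 matrix encoded as a function from indices to reals):

from z3 import *

N = 3  # fixed matrix size (an assumption for this sketch)

# Encode a matrix as a Z3 function from (row, col) to a real value.
A = Function('A', IntSort(), IntSort(), RealSort())
i, j = Ints('i j')
in_range = And(0 <= i, i < N, 0 <= j, j < N)

# A is the identity matrix: 1 on the diagonal, 0 elsewhere.
is_identity = ForAll([i, j],
                     Implies(in_range,
                             A(i, j) == If(i == j, RealVal(1), RealVal(0))))

# A is diagonal: all off-diagonal entries are 0.
is_diagonal = ForAll([i, j],
                     Implies(And(in_range, i != j), A(i, j) == 0))

# Prove identity => diagonal by showing its negation is unsatisfiable.
s = Solver()
s.add(is_identity, Not(is_diagonal))
print(s.check())  # expect: unsat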
So, it is definitely possible to do this kind of work in Z3. Though you may find you have a better time using a tool built on top of Z3 that has more interactive proving features, such as Dafny or F*.

Why does logistic regression use multiplication instead of addition for the likelihood function?

This is a very basic question, but I could not find enough reasons to convince myself: why must logistic regression use multiplication instead of addition for the likelihood function l(w)?
Your question is more general than just the joint likelihood for logistic regression. You're asking why we multiply probabilities instead of adding them to represent a joint probability distribution. Two notes:
This applies when we assume the random variables are independent. Otherwise we need to calculate conditional probabilities using the chain rule of probability. You can look at Wikipedia for more information.
We multiply because that's how the joint distribution is defined. Here is a simple example:
Say we have two probability distributions:
X = 1, 2, 3, each with probability 1/3
Y = 0 or 1, each with probability 1/2
We want to calculate the joint likelihood, L(X=x, Y=y), which is the probability that X takes the value x and Y takes the value y.
For example, L(X=1, Y=0) = P(X=1) * P(Y=0) = 1/6. It wouldn't make sense to write P(X=1) + P(Y=0) = 1/3 + 1/2 = 5/6.
Now it's true that in maximum likelihood estimation, we only care about those values of some parameter, theta, which maximize the likelihood function. In that case, we know that if theta maximizes L(X=x, Y=y), then the same theta also maximizes log L(X=x, Y=y). This is where you may have seen addition of probabilities come into play.
Hence we can take the log: log P(X=x, Y=y) = log P(X=x) + log P(Y=y).
In short
This could be summarized as "joint probabilities represent an AND". When X and Y are independent, P(X AND Y) = P(X,Y) = P(X)P(Y). Not to be confused with P(X OR Y) = P(X) + P(Y) - P(X,Y).
Let me know if this helps.
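To make this concrete, here is a small numeric sketch (the data and the one-parameter model p = sigmoid(w*x) are made up for illustration) showing that maximizing the product of the per-example Bernoulli probabilities and maximizing the sum of their logs select the same parameter:

import numpy as np

# Hypothetical binary outcomes y for a single feature x.
x = np.array([0.5, 1.5, 2.0, 3.0])
y = np.array([0, 0, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(w):
    # Joint likelihood: a product, because the examples are assumed independent.
    p = sigmoid(w * x)
    return np.prod(np.where(y == 1, p, 1 - p))

def log_likelihood(w):
    # The same quantity on the log scale: the product turns into a sum.
    p = sigmoid(w * x)
    return np.sum(np.where(y == 1, np.log(p), np.log(1 - p)))

ws = np.linspace(-5, 5, 1001)
print(ws[np.argmax([likelihood(w) for w in ws])])      # w that maximizes the product
print(ws[np.argmax([log_likelihood(w) for w in ws])])  # same w maximizes the sum of logs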

Sum of all the bits in a Bit Vector of Z3

Given a bit vector in Z3, I am wondering how I can sum up each individual bit of this vector.
E.g.,
a = BitVecVal(3, 2)
sum_all_bit(a) = 2
Is there any pre-implemented APIs/functions that support this? Thank you!
It isn't part of the bit-vector operations.
You can create an expression as follows:
from functools import reduce
from z3 import *

def sub(b):
    n = b.size()
    # Extract each bit as a 1-bit vector.
    bits = [Extract(i, i, b) for i in range(n)]
    # Zero-extend each bit to n bits so they can be added without overflow.
    bvs = [Concat(BitVecVal(0, n - 1), bit) for bit in bits]
    # Add up all the extended bits.
    return reduce(lambda a, b: a + b, bvs)

print(sub(BitVecVal(4, 7)))
Of course, log(n) bits for the result will suffice if you prefer.
The page:
https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
has various algorithms for counting bits, which can be translated to Z3/Python with relative ease, I suppose.
My favorite is: https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
which has the nice property that it loops as many times as there are set bits in the input. (But you shouldn't extrapolate from that to any meaningful complexity metric, as you do arithmetic in each loop, which might be costly. The same is true for all these algorithms.)
Having said that, if your input is fully symbolic, you can't really beat the simple iterative algorithm, as you can't short-cut the iteration count. Above methods might work faster if the input has concrete bits.
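For instance, Kernighan's method could be transcribed by unrolling the loop a fixed number of times, roughly like this sketch (my own translation, assuming the Z3 Python API):

from z3 import *

def popcount_kernighan(b):
    n = b.size()
    count = BitVecVal(0, n)
    v = b
    # Unroll the loop n times; each step clears the lowest set bit, if any bits remain.
    for _ in range(n):
        count = If(v != 0, count + 1, count)
        v = If(v != 0, v & (v - 1), v)
    return count

print(simplify(popcount_kernighan(BitVecVal(13, 8))))  # 13 = 0b00001101 -> 3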
So you're computing the Hamming weight of a bit vector. Based on a previous question I asked, one of the developers gave this answer. Building on that original answer, this is how I do it today:
from math import ceil, log2
from z3 import *
def HW(bvec):
    return Sum([ZeroExt(int(ceil(log2(bvec.size()))), Extract(i, i, bvec)) for i in range(bvec.size())])

Need help maximizing 3 factors in multiple, similar objects and ordering appropriately

I need to write an algorithm in any language that would order an array based on 3 factors. I use resorts as an example (like Hipmunk). Let's say I want to go on vacation. I want the cheapest spot, with the best reviews, and the most attractions. However, there is obviously no way I can find one that is #1 in all 3.
Example (assuming there are 20 important attractions):
Resort A: $150/night...98/100 in favorable reviews...18 of 20 attractions
Resort B: $99/night...85/100 in favorable reviews...12 of 20 attractions
Resort C: $120/night...91/100 in favorable reviews...16 of 20 attractions
Resort B looks the most appealing in price, but is 3rd in the other 2 categories. However, I could choose Resort C for only $21 more a night and get more attractions and better reviews. Price is still important to me, but Resort A has outstanding reviews and a ton of attractions: is $51 more worth the splurge?
I want to be able to populate a list ordered from "best to worst" (in quotes because it is subjective to the consumer). How would I go about maximizing the value for each resort?
Should I put a weight for each factor (ie: 55% price, 30% reviews, 15% amenities) and come to the result of a set number and order them that way?
Or do I need the mode, median, and range for all the hotels, determine the average price, and have the hotels around the average price hold the most weight?
If this is a little confusing, check out www.hipmunk.com. They have an airplane sort they call Agony (and a hotel sort which is similar to my question) that they use as their own. I used resorts as an example to hopefully make my question a little clearer. How does one put math to a problem like this?
I was about to ask the same question about multiple-factor weighted sorting, because my research only came up with answers (e.g. formulas with explanations) for two-factor sorting.
Even though we're both asking about 3 factors, I'll list the possibilities I've found in case they're helpful.
Possibilities:
Note: S is the "sorting score", which is what you'd sort by (asc or desc).
"Linearly weighted" - use a function like: S = (w1 * F1) + (w2 * F2) + (w3 * F3), where wx are arbitrarily assigned weights, and Fx are the values of the factors. You'd also want to normalize F (i.e. Fx_n = Fx / Fmax).
"Base-N weighted" - more like grouping than weighting, it's just a linear weighting where weights are increasing multiples of base-10 (a similar principle to CSS selector specificity), so that more important factors are significantly higher: S = 1000 * F1 + 100 * F2 ....
Estimated True Value (ETV) - this is apparently what Google Analytics introduced in their reporting, where the value of one factor influences (weights) another factor - the consequence being to sort on more "statistically significant" values. The link explains it pretty well, so here's just the equation: S = (F2 / F2_max * F1) + ((1 - (F2 / F2_max)) * F1_avg), where F1 is the "more important" factor ("bounce rate" in the article), and F2 is the "significance modifying" factor ("visits" in the article).
Bayesian Estimate - looks really similar to ETV, this is how IMDb calculates their rating. See this StackOverflow post for explanation; equation: S = (F2 / (F2 + F2_lim)) * F1 + (F2_lim / (F2 + F2_lim)) * F1_avg, where Fx are the same as #3, and F2_lim is the minimum threshold limit for the "significance" factor (i.e. any value less than X shouldn't be considered).
Options #3 and #4 look really promising, since you don't really have to choose an arbitrary weighting scheme like you do in #1 and #2, but then the problem is how do you do this for more than two factors?
In your case, assigning the weights in #1 would probably be fine. You'll need to fine-tune the algorithm depending on what your users consider more important - you could expose the weights wx as a filter (like a 1-10 dropdown) so your users can adjust their search on the fly. Or, if you wanted to get clever, you could poll your users before they search ("Which is more important to you?"), assign a weighting set based on the response, and after tracking enough polls auto-suggest a weighting scheme based on the most common responses.
Hope that gets you on the right track.
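For example, here is a minimal sketch of option #1 using the three resorts and the 55/30/15 weighting from the question (min-max normalization per factor, with price inverted since lower is better):

resorts = [
    {'name': 'A', 'price': 150, 'reviews': 98, 'attractions': 18},
    {'name': 'B', 'price': 99,  'reviews': 85, 'attractions': 12},
    {'name': 'C', 'price': 120, 'reviews': 91, 'attractions': 16},
]

# Example weights: how much the user cares about each factor.
weights = {'price': 0.55, 'reviews': 0.30, 'attractions': 0.15}

def normalized(value, values, invert=False):
    # Min-max scaling to [0, 1]; invert for "lower is better" factors such as price.
    lo, hi = min(values), max(values)
    x = (value - lo) / (hi - lo)
    return 1 - x if invert else x

def score(r):
    return (weights['price'] * normalized(r['price'], [h['price'] for h in resorts], invert=True)
            + weights['reviews'] * normalized(r['reviews'], [h['reviews'] for h in resorts])
            + weights['attractions'] * normalized(r['attractions'], [h['attractions'] for h in resorts]))

# Sort from "best" to "worst" by descending score.
for r in sorted(resorts, key=score, reverse=True):
    print(r['name'], round(score(r), 3))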
What about having variable weights and letting the user adjust them through some input like levers, so that the sort order is dynamically updated?
