In the Machine Learning course on Coursera, in Week 3, Andrew Ng discusses the decision boundary and at 1:00 states that:
hθ(x)≥0.5 → y=1
(Hypothesis will predict y=1 if it's value is greater than or equal to 0.5)
hθ(x)<0.5 → y=0
(Hypothesis will predict y=0 if it's value is less than 0.5)
where hθ(x) is the sigmoid function of θᵀx.
Doubt-
If this is the case, then Pr(y=1) will always be greater than or equal to 0.5, since y=1 is predicted only when hθ(x) ≥ 0.5, and this hθ(x) is taken to be the probability of y=1, as discussed here at 4:45.
The same goes for Pr(y=0), which is predicted when hθ(x) < 0.5, so its value would always be less than 0.5.
But this shouldn't be the case, probabilities of y=0 and y=1 should range from 0 to 1.
I am afraid there is a misunderstanding of what is being modeled and/or what probability of a dependent variable is.
First of all we are talking about conditional probabilities P(y|x), not marginals P(y), second of all:
h(x) = P(y=1|x) = 1-P(y=0|x)
there is no claim that "the probability P(y=0|x) is modeled when h(x)<0.5"; this is false. The model provides both quantities at the same time: it predicts P(y=1|x) = h(x) and, at the same time (due to basic properties of probabilities), P(y=0|x) = 1-h(x). This is also why we have the 0.5 threshold: when predicting a class, the question you are answering is which class is the most probable, and notice that:
P(y=1|x) > P(y=0|x) <-> h(x) > 1-h(x) <-> 2h(x) > 1 <-> h(x) > 0.5
It does not mean that probability of one class or another is "always bigger than 0.5" or always smaller - there is just one probability, being modeled by h(x), and 0.5 comes from the above equation to get final label, not its probability.
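A minimal sketch of this in Python (assuming numpy and an arbitrary, already-fitted theta), just to make concrete that h(x) models one probability and the 0.5 threshold only picks the label:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical parameters and one feature vector (intercept term prepended)
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.3, 0.8])

h = sigmoid(theta @ x)           # h(x) = P(y=1 | x)
p_y0 = 1.0 - h                   # P(y=0 | x) follows for free
label = 1 if h >= 0.5 else 0     # the threshold just picks the more probable class
print(h, p_y0, label)
```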
I think the doubt comes from saying "probability of a class p(y=1)". We only calculate the conditional probability of each class given a data point x. Also, h(x)=0.5 means the data point has an equal chance of belonging to either class, and graphically this corresponds to the straight line partitioning the two classes.
Related
It is often said that L1 regularization helps with feature selection. How does the L1 norm do that?
And also, why is L2 regularization not able to do that?
At the beginning, please notice that L1 and L2 regularization may not always work like this; there are various quirks, and it depends on the applied regularization strength and other factors.
First of all, we will consider Linear Regression as the simplest case.
Secondly it's easiest to consider only two weights for this problem to get some intuition.
Now, let's introduce a simple constraint: sum of both weights has to be equal to 1.0 (e.g. first weight w1=0.2 and second w2=0.8 or any other combination).
And the last assumptions:
x1 feature has perfect positive correlation with target (e.g. 1.0 * x1 = y, where y is our target)
x2 has almost perfect positive correlation with the target (e.g. x2 = 0.99 * y, so with weight 1.0 it recovers only 0.99 of the target)
(alpha (meaning regularization strength) will be set to 1.0 for both L1 and L2 in order not to clutter the picture further).
L2
Weights values
For two variables (weights) and L2 regularization we would have the following formula:
alpha * (w1^2 + w2^2)/2 (mean of their squares)
Now, we would like to minimize the above equation as it's part of the cost function.
One can easily see both have to be set to 0.5 (remember, their sum has to equal 1.0!), because 0.5^2 + 0.5^2 = 0.5. For any other two values summing to 1 we would get a greater value (e.g. 0.8^2 + 0.2^2 = 0.64 + 0.04 = 0.68), hence 0.5 and 0.5 is the optimal solution.
Target predictions
In this case we are pretty close for all data points, because:
0.5 * 1.0 + 0.5 * 0.99 = 0.995 (of `y`)
So we are "off" only by 0.005 for each sample. What this means is that the regularization on the weights has a greater effect on the cost function than this small difference (that's why w1 wasn't chosen as the only variable and the value was "split").
BTW. Exact values above will differ slightly (e.g. w1 ~0.49 but it's easier to follow along this way I think).
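As a quick numerical check of this trade-off (a sketch assuming a single training sample with y = 1, so x1 = 1.0 and x2 = 0.99, squared error as the data loss, and alpha = 1.0):

```python
def l2_cost(w1, w2, x1=1.0, x2=0.99, y=1.0, alpha=1.0):
    prediction = w1 * x1 + w2 * x2
    data_loss = (prediction - y) ** 2           # squared error on the one sample
    penalty = alpha * (w1 ** 2 + w2 ** 2) / 2   # L2 penalty from the formula above
    return data_loss + penalty

print(l2_cost(0.5, 0.5))  # ~0.2500 -> splitting the weights wins
print(l2_cost(1.0, 0.0))  # 0.5     -> putting everything on w1 is penalized more
```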
Final insight
With L2 regularization, two similar weights tend to be "split" in half, as this minimizes the regularization penalty.
L1
Weights values
This time it will be even easier: for two variables (weights) and L1 regularization we would have the following formula:
alpha * (|w1| + |w2|)/2 (mean of their absolute values)
This time it doesn't matter how w1 and w2 are set (as long as their sum is equal to 1.0), because |0.5| + |0.5| = |0.2| + |0.8| = |1.0| + |0.0| = ... (and so on).
In this case L1 regularization will prefer w1 = 1.0; the reason is below.
Target predictions
As the distribution of the weights does not matter in this case, it is the loss value we are after (under the sum-to-1.0 constraint). For perfect predictions it would be:
1.0 * 1.0 + 0.0 * 0.99 = 1.0
This time we are not "off" at all, so it's "best" to choose just w1; there is no need for w2 in this case.
Final insight
With L1 regularization, similar weights tend to be zeroed out in favor of the one whose feature can predict the target with the smallest coefficient (and hence the smallest L1 penalty).
BTW. If we had an x3 which was once again positively correlated with the target and described by the equation
0.1 * x3 = y
then only x3 would be chosen, with weight equal to 0.1.
Reality
In reality there is almost never "perfect correlation" of variables, there are many features interacting with each other, there are hyperparameters and imperfect optimizers amongst many other factors.
This simplified view should give you an intuition to "why" though.
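If you want to see this on (hypothetical) data, here is a small scikit-learn sketch with two nearly duplicate features; the exact coefficients depend on alpha, the noise, and the sample size, but the qualitative behaviour matches the intuition above:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
y = rng.normal(size=200)
# x1 tracks the target perfectly, x2 almost perfectly
X = np.column_stack([y, 0.99 * y + 0.01 * rng.normal(size=200)])

print(Ridge(alpha=1.0).fit(X, y).coef_)   # both weights nonzero, roughly shared
print(Lasso(alpha=0.1).fit(X, y).coef_)   # one weight (almost) zeroed out
```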
A common application of your question is in different types of regression. Here is a link that explains the difference between Ridge (L2) and Lasso (L1) regression:
https://stats.stackexchange.com/questions/866/when-should-i-use-lasso-vs-ridge
Given a classification problem in Machine Learning the hypothesis is described as below.
hθ(x)=g(θ'x)
z = θ'x
g(z) = 1 / (1+e^−z)
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x)≥0.5→y=1
hθ(x)<0.5→y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z) ≥ 0.5
when z ≥ 0
Remember:
z = 0: e^0 = 1 ⇒ g(z) = 1/2
z → ∞: e^(−∞) → 0 ⇒ g(z) = 1
z → −∞: e^∞ → ∞ ⇒ g(z) = 0
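A quick numeric check of those limits (a small sketch, nothing more):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

print(g(0.0))     # 0.5   (e^0 = 1)
print(g(100.0))   # ~1.0  (e^-100 ~ 0)
print(g(-100.0))  # ~0.0  (e^100 is huge)
```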
So if our input to g is θᵀx, then that means:
hθ(x) = g(θᵀx) ≥ 0.5
when θᵀx ≥ 0
From these statements we can now say:
θ'x≥0⇒y=1
θ'x<0⇒y=0
If the decision boundary is the line that separates the area where y = 0 from the area where y = 1, and is created by our hypothesis function:
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?
This is basic logistic regression with a threshold. Your theta' * x is just the vector notation of your weight vector multiplied by your input. If you put that into the logistic function, which outputs a value strictly between 0 and 1, you threshold that value at 0.5: if it's equal to or above this, you treat the sample as positive, and as negative otherwise.
The classification algorithm is just that simple. The training is a bit more complicated, and its goal is to find a weight vector theta which satisfies the condition to correctly classify all your labeled data... or at least as much of it as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.
Courtesy of other posters on a different tech forum.
Solving for theta'*x >= 0 and theta'*x<0 gives the decision boundary. The RHS of the inequality ( i.e. 0) comes from the sigmoid function.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.
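In two dimensions you can make that locus explicit. A sketch, assuming a hypothetical theta = [theta0, theta1, theta2] and feature vector [1, x1, x2]: the boundary is theta0 + theta1*x1 + theta2*x2 = 0, which you can solve for x2 and draw as a line.

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])    # hypothetical fitted parameters [theta0, theta1, theta2]

def predict(x1, x2):
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0         # equivalent to g(z) >= 0.5

def boundary_x2(x1):
    # theta0 + theta1*x1 + theta2*x2 = 0  =>  x2 = -(theta0 + theta1*x1) / theta2
    return -(theta[0] + theta[1] * x1) / theta[2]

print(boundary_x2(0.0))    # 3.0 -> the line x1 + x2 = 3 separates the classes
print(predict(1.0, 1.0))   # 0 (below the line)
print(predict(3.0, 2.0))   # 1 (on/above the line)
```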
This is a very basic question, but I could not find enough reasons to convince myself. Why must logistic regression use multiplication instead of addition for the likelihood function l(w)?
Your question is more general than just joint likelihood for logistic regression. You're asking why we multiply probabilities instead of add them to represent a joint probability distribution. Two notes:
This applies when we assume random variables are independent. Otherwise we need to calculate conditional probabilities using the chain rule of probability. You can look at wikipedia for more information.
We multiply because that's how the joint distribution is defined. Here is a simple example:
Say we have two probability distributions:
X = 1, 2, 3, each with probability 1/3
Y = 0 or 1, each with probability 1/2
We want to calculate the joint likelihood function, L(X=x, Y=y), which is the probability that X takes on value x and Y takes on value y.
For example, L(X=1,Y=0) = P(X=1) * P(Y=0) = 1/6. It wouldn't make sense to write P(X=1) + P(Y=0) = 1/3 + 1/2 = 5/6.
Now it's true that in maximum likelihood estimation, we only care about those values of some parameter, theta, which maximizes the likelihood function. In this case, we know that if theta maximizes L(X=x,Y=y) then the same theta will also maximize log L(X=x,Y=y). This is where you may have seen addition of probabilities come into play.
Hence we can take the log P(X=x,Y=y) = log P(X=x) + log P(Y=y)
In short
This could be summarized as "joint probabilities represent an AND". When X and Y are independent, P(X AND Y) = P(X,Y) = P(X)P(Y). Not to be confused with P(X OR Y) = P(X) + P(Y) - P(X,Y).
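A tiny sketch of the example above, and of where addition legitimately appears (only after taking logs):

```python
import math

p_x = {1: 1/3, 2: 1/3, 3: 1/3}    # P(X = x)
p_y = {0: 1/2, 1: 1/2}            # P(Y = y)

# joint likelihood of independent events: multiply
joint = p_x[1] * p_y[0]
print(joint)                                      # 1/6

# log-likelihood: the product becomes a sum, which is where addition shows up
log_joint = math.log(p_x[1]) + math.log(p_y[0])
print(math.isclose(math.exp(log_joint), joint))   # True
```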
Let me know if this helps.
What should be taken as m in m estimate of probability in Naive Bayes?
So for this example
what m value should I take? Can I take it to be 1?
Here p=prior probabilities=0.5.
So can I take P(a_i|selected)=(n_c+ 0.5)/ (3+1)
For Naive Bayes text classification the given estimate is P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|).
In the book it says that this is adopted from the m-estimate by letting uniform priors and with m equal to the size of the vocabulary.
But if we have only 2 classes then p=0.5. So how can mp be 1? Shouldn't it be |vocabulary|*0.5? How is this equation obtained from m-estimate?
In calculating the probabilities for the attribute profession, with the prior probabilities being 0.5 and taking m=1:
P(teacher|selected)=(2+0.5)/(3+1)=5/8
P(farmer|selected)=(1+0.5)/(3+1)=3/8
P(Business|Selected)=(0+0.5)/(3+1)= 1/8
But shouldn't the class probabilities add up to 1? In this case it is not.
Yes, you can use m=1. According to wikipedia if you choose m=1 it is called Laplace smoothing. m is generally chosen to be small (I read that m=2 is also used). Especially if you don't have that many samples in total, because a higher m distorts your data more.
Background information: The parameter m is also known as pseudocount (virtual examples) and is used for additive smoothing. It prevents the probabilities from being 0. A zero probability is very problematic, since it puts any multiplication to 0. I found a nice example illustrating the problem in this book preview here (search for pseudocount)
"m estimate of probability" is confusing.
In the given examples, m and p should be like this.
m = 3 (* this could be any value; you can specify this)
p = 1/3 = 1/|v| (* |v| is the number of unique values in the feature)
If you use m=|v| then m*p=1, so it is called Laplace smoothing. "m estimate of probability" is the generalized version of Laplace smoothing.
In the above example you may think m=3 is too much; in that case you can reduce m, e.g. to 0.2.
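A small sketch of the m-estimate, P(a|c) = (n_c + m*p) / (n + m), applied to the profession counts from the question (m and p are whatever you choose; here m = 3 and p = 1/3 as above):

```python
def m_estimate(n_c, n, m, p):
    # n_c: count of the attribute value within the class, n: examples in the class,
    # m: pseudocount ("equivalent sample size"), p: prior for the value
    return (n_c + m * p) / (n + m)

n = 3                                  # 3 "selected" examples in the question
counts = {"teacher": 2, "farmer": 1, "business": 0}

probs = {k: m_estimate(v, n, m=3, p=1/3) for k, v in counts.items()}
print(probs)                 # {'teacher': 0.5, 'farmer': 0.333..., 'business': 0.166...}
print(sum(probs.values()))   # 1.0 -- with p = 1/|values| the estimates sum to 1
```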
I believe the uniform prior should be 1/3, not 1/2. This is because you have 3 professions, so you're assigning equal prior probability to each one. Like this, mp=1, and the probabilities you listed sum to 1.
From p = uniform priors (p = 1/|Vocabulary|) and m equal to the size of the vocabulary, you get m·p = 1, which gives the (n_k + 1) / (n + |Vocabulary|) form above.
Could someone please explain to me how to update the bias throughout backpropagation?
I've read quite a few books, but can't find bias updating!
I understand that bias is an extra input of 1 with a weight attached to it (for each neuron). There must be a formula.
Following the notation of Rojas 1996, chapter 7, backpropagation computes partial derivatives of the error function E (aka cost, aka loss)
∂E/∂w[i,j] = delta[j] * o[i]
where w[i,j] is the weight of the connection between neurons i and j, j being one layer higher in the network than i, and o[i] is the output (activation) of i (in the case of the "input layer", that's just the value of feature i in the training sample under consideration). How to determine delta is given in any textbook and depends on the activation function, so I won't repeat it here.
These values can then be used in weight updates, e.g.
// update rule for vanilla online gradient descent
w[i,j] -= gamma * o[i] * delta[j]
where gamma is the learning rate.
The rule for bias weights is very similar, except that there's no input from a previous layer. Instead, bias is (conceptually) caused by input from a neuron with a fixed activation of 1. So, the update rule for bias weights is
bias[j] -= gamma_bias * 1 * delta[j]
where bias[j] is the weight of the bias on neuron j, the multiplication with 1 can obviously be omitted, and gamma_bias may be set to gamma or to a different value. If I recall correctly, lower values are preferred, though I'm not sure about the theoretical justification of that.
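Put together, a minimal sketch of one online gradient-descent step for a single sigmoid layer (hypothetical shapes and data; delta here is the one you get for a squared-error loss with a sigmoid activation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

gamma = 0.1                          # learning rate (gamma_bias = gamma here)
W = np.random.randn(3, 2) * 0.1      # weights w[i, j]: 3 inputs -> 2 output neurons
b = np.zeros(2)                      # bias[j], one per output neuron

x = np.array([0.5, -1.0, 2.0])       # o[i]: activations of the previous layer
t = np.array([1.0, 0.0])             # target

o = sigmoid(x @ W + b)               # forward pass
delta = (o - t) * o * (1 - o)        # delta[j] for squared error + sigmoid

W -= gamma * np.outer(x, delta)      # w[i, j] -= gamma * o[i] * delta[j]
b -= gamma * delta                   # bias[j] -= gamma * 1 * delta[j]
```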
The amount you change each individual weight and bias will be the partial derivative of your cost function in relation to each individual weight and each individual bias.
∂C/∂(index of bias in network)
Since your cost function probably doesn't explicitly depend on individual weights and biases (the cost might equal (network output - expected output)^2, for example), you'll need to relate the partial derivatives of each weight and bias to something you know, i.e. the activation values (outputs) of neurons. Here's a great guide to doing this:
https://medium.com/@erikhallstrm/backpropagation-from-the-beginning-77356edf427d
This guide states how to do these things clearly, but can sometimes be lacking on explanation. I found it very helpful to read chapters 1 and 2 of this book as I read the guide linked above:
http://neuralnetworksanddeeplearning.com/chap1.html
(provides essential background for the answer to your question)
http://neuralnetworksanddeeplearning.com/chap2.html
(answers your question)
Basically, biases are updated in the same way that weights are updated: a change is determined based on the gradient of the cost function at a multi-dimensional point.
Think of the problem your network is trying to solve as being a landscape of multi-dimensional hills and valleys (gradients). This landscape is a graphical representation of how your cost changes with changing weights and biases. The goal of a neural network is to reach the lowest point in this landscape, thereby finding the smallest cost and minimizing error. If you imagine your network as a traveler trying to reach the bottom of these gradients (i.e. gradient descent), then the amount you change each weight (and bias) by is related to the slope of the incline (gradient of the function) that the traveler is currently climbing down. The exact location of the traveler is given by a multi-dimensional coordinate point (weight1, weight2, weight3, ..., weight_n), where the bias can be thought of as another kind of weight. Thinking of the weights/biases of a network as the variables of the network's cost function makes it clear that ∂C/∂(index of bias in network) must be used.
I understand that the function of bias is to make a level adjustment of the input values. Below is what happens inside the neuron. The activation function will of course produce the final output, but it is left out here for clarity.
O = W1 I1 + W2 I2 + W3 I3
In a real neuron something happens already at the synapses: the input data is level-adjusted with the average of the samples and scaled with the deviation of the samples. Thus the input data is normalized, and with equal weights all inputs will have the same effect. The normalized In is calculated from the raw data in (n is the index).
Bn = average(in); Sn = 1/stdev(in); In = (in + Bn) * Sn
However, this does not need to be performed separately, because the neuron weights and bias can do the same job. When you substitute In with in, you get the new formula
O = w1 i1 + w2 i2 + w3 i3+ wbs
The last term wbs is the bias, and the new weights wn are
wbs = W1 B1 S1 + W2 B2 S2 + W3 B3 S3
wn = Wn Sn
So there is a bias, and it will/should be adjusted automatically by backpropagation.
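A quick numeric check of that substitution (a sketch with made-up numbers; it only verifies that normalizing the inputs is equivalent to rescaling the weights and adding a single bias term, as derived above):

```python
import numpy as np

W = np.array([0.3, -0.2, 0.5])   # original weights on the normalized inputs
B = np.array([0.1, 0.4, -0.3])   # per-input offsets Bn
S = np.array([2.0, 0.5, 1.5])    # per-input scales Sn
i = np.array([1.2, -0.7, 0.9])   # raw inputs in

# neuron output on normalized inputs: In = (in + Bn) * Sn
out_normalized = np.sum(W * (i + B) * S)

# the same output with rescaled weights plus one bias term
w = W * S                        # wn = Wn * Sn
wbs = np.sum(W * B * S)          # wbs = W1*B1*S1 + W2*B2*S2 + W3*B3*S3
out_folded = np.sum(w * i) + wbs

print(np.isclose(out_normalized, out_folded))   # True
```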