How L1 norm select features? - machine-learning

It is often said that L1 regularization helps with feature selection. How does the L1 norm do that?
And why is L2 regularization not able to do the same?

To begin with, please note that L1 and L2 regularization do not always behave like this; there are various quirks, and the outcome depends on the regularization strength and other factors.
First of all, we will consider Linear Regression as the simplest case.
Secondly it's easiest to consider only two weights for this problem to get some intuition.
Now, let's introduce a simple constraint: sum of both weights has to be equal to 1.0 (e.g. first weight w1=0.2 and second w2=0.8 or any other combination).
And the last assumptions:
x1 feature has perfect positive correlation with target (e.g. 1.0 * x1 = y, where y is our target)
x2 has almost perfect positive correlation with target (e.g. 0.99 * x2 = y)
(alpha, the regularization strength, will be set to 1.0 for both L1 and L2 in order not to clutter the picture further).
L2
Weights values
For two variables (weights) and L2 regularization we would have the following formula:
alpha * (w1^2 + w2^2)/2 (mean of their squares)
Now, we would like to minimize the above equation as it's part of the cost function.
One can easily see that both have to be set to 0.5 (remember, their sum has to equal 1.0!), because 0.5 ^ 2 + 0.5 ^ 2 = 0.5. For any other two values summing to 1 we would get a greater value (e.g. 0.8 ^ 2 + 0.2 ^ 2 = 0.64 + 0.04 = 0.68), hence 0.5 and 0.5 is the optimal solution.
Target predictions
In this case we are pretty close for all data points, because:
0.5 * 1.0 + 0.5 * 0.99 = 0.995 (of `y`)
So we are "off" only by 0.005 for each sample. What this means is that regularization on weights has greater effect on cost function than this small difference (that's why w1 wasn't chosen as the only variable and the values were "split").
BTW. Exact values above will differ slightly (e.g. w1 ~0.49 but it's easier to follow along this way I think).
Final insight
With L2 regularization two similar weights tend to be "split" in half as it minimizes the regularization penalty
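The L2 part of this toy example can be checked numerically. The following is a minimal sketch, assuming the same simplified setup as above (two weights constrained to sum to 1.0, alpha = 1.0). The function name l2_penalty and the grid of candidate values are illustrative choices, not part of any library.

```python
# Toy setup from the answer: w1 + w2 = 1.0, so w2 is fully determined by w1.
def l2_penalty(w1, w2, alpha=1.0):
    # alpha * mean of the squared weights
    return alpha * (w1 ** 2 + w2 ** 2) / 2

# Hypothetical grid of candidate values for w1 between 0.0 and 1.0.
candidates = [i / 100 for i in range(101)]

# Pick the split that gives the smallest penalty under the sum constraint.
best_w1 = min(candidates, key=lambda w1: l2_penalty(w1, 1 - w1))
best_penalty = l2_penalty(best_w1, 1 - best_w1)
uneven_penalty = l2_penalty(0.8, 0.2)

print(best_w1, best_penalty, uneven_penalty)
```

On this grid the even split w1 = w2 = 0.5 gives the smallest penalty, and an uneven split such as 0.8 / 0.2 is penalized more, which is the "splitting" behaviour described above.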
L1
Weights values
This time it will be even easier: for two variables (weights) and L1 regularization we would have the following formula:
alpha * (|w1| + |w2|)/2 (mean of their absolute values)
This time the penalty does not depend on what w1 or w2 is set to (as long as their sum equals 1.0), so |0.5| + |0.5| = |0.2| + |0.8| = |1.0| + |0.0|... (and so on).
In this case the model with L1 regularization will prefer w1 = 1.0; the reason is given below.
Target predictions
As the distribution of weights does not matter for the penalty in this case, it is the loss value we are after (under the 1.0 sum constraint). For perfect predictions it would be:
1.0 * 1.0 + 0.0 * 0.99 = 1.0
This time we are not "off" at all and it's "best" to choose just w1, no need for w2 in this case.
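The same kind of numerical check works for the L1 case. This is a rough sketch under the assumptions of the toy example: the weights sum to 1.0, alpha = 1.0, and the prediction for a target y is w1 * 1.0 * y + w2 * 0.99 * y. The helper names l1_penalty and squared_error are made up for illustration.

```python
def l1_penalty(w1, w2, alpha=1.0):
    # alpha * mean of the absolute weights
    return alpha * (abs(w1) + abs(w2)) / 2

def squared_error(w1, w2, y=1.0):
    # Assumed toy prediction: x1 reproduces y exactly, x2 is slightly off.
    prediction = w1 * 1.0 * y + w2 * 0.99 * y
    return (y - prediction) ** 2

# Hypothetical grid of candidate values for w1 between 0.0 and 1.0.
candidates = [i / 100 for i in range(101)]

# The penalty is the same for every split (rounded to hide float noise).
distinct_penalties = {round(l1_penalty(w1, 1 - w1), 12) for w1 in candidates}

# With a constant penalty, the data loss alone decides the winner.
best_w1 = min(
    candidates,
    key=lambda w1: squared_error(w1, 1 - w1) + l1_penalty(w1, 1 - w1),
)

print(distinct_penalties, best_w1)
```

Because the penalty is flat along the constraint, the total cost is lowest where the prediction error is lowest, which here is w1 = 1.0 and w2 = 0.0.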
Final insight
With L1 regularization, the weights of similar features tend to be zeroed out in favor of the one feature that predicts the target best with the lowest coefficient.
BTW. If we had x3 which would once again be correlated positively with our values to predict and described by equation
0.1 * x3 = y
Only x3 would be chosen with weight equal to 0.1
Reality
In reality there is almost never "perfect correlation" of variables, there are many features interacting with each other, there are hyperparameters and imperfect optimizers amongst many other factors.
This simplified view should give you an intuition to "why" though.

A common application of your question is in different types of regression. Here is a link that explains the difference between Ridge (L2) and Lasso (L1) regression:
https://stats.stackexchange.com/questions/866/when-should-i-use-lasso-vs-ridge

Related

Why Dice Coefficient and not IOU for segmentation tasks?

I have seen people using IOU as the metric for detection tasks and Dice Coeff for segmentation tasks. The two metrics look very similar in terms of their equations, except that Dice gives twice the weight to the intersection part. If I am correct, then
Dice: (2 x (A*B) / (A + B))
IOU : (A * B) / (A + B)
Is there any particular reason for preferring dice for segmentation and IOU for detection?

This is not exactly right.
The Dice coefficient (also known as the Sørensen–Dice coefficient and F1 score) is defined as two times the area of the intersection of A and B, divided by the sum of the areas of A and B:
Dice = 2 |A∩B| / (|A|+|B|) = 2 TP / (2 TP + FP + FN)
(TP=True Positives, FP=False Positives, FN=False Negatives)
The IOU (Intersection Over Union, also known as the Jaccard Index) is defined as the area of the intersection divided by the area of the union:
Jaccard = |A∩B| / |A∪B| = TP / (TP + FP + FN)
Note that the sum of the areas of A and B is not the same as the area of the union of A and B. In particular, if there is 100% overlap, then the one is twice the other. This is the reason for the "two times" in the Dice coefficient: they are both defined such that, with 100% overlap, the values are 1, and with 0% overlap the values are 0.
Which one to use depends on personal preference and customs in each field. That you see one used more in one field is related more to chance than anything else. Someone started using the Dice coefficient for segmentation, and other people just followed along. Someone started using IOU for detection, and other people just followed along.
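To make the two definitions concrete, here is a small sketch that computes both metrics on Python sets. The sets a and b are made-up examples, and the functions assume at least one of the sets is non-empty (otherwise both metrics divide by zero).

```python
def dice(a, b):
    # 2 |A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def iou(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

# Hypothetical "pixel index" sets: two of four elements overlap.
a = {1, 2, 3, 4}
b = {3, 4, 5, 6}

d = dice(a, b)
j = iou(a, b)

# The two metrics are monotonically related: Dice = 2J / (1 + J).
print(d, j, 2 * j / (1 + j))
```

Since each metric can be computed from the other, they always rank two segmentations in the same order, which fits the point above that the choice is mostly convention.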

In segmentation tasks, the Dice coefficient (Dice loss = 1 - Dice coefficient) is used as a loss function because it is differentiable, whereas IoU is not differentiable.
Both can be used as metrics to evaluate the performance of your model, but as a loss function only the Dice coefficient/loss is used.
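One common way to use Dice as a loss is to compute it on predicted probabilities instead of hard masks (often called a "soft" Dice loss). The sketch below assumes flat lists of per-pixel probabilities and 0/1 targets; the name soft_dice_loss and the smoothing constant eps are illustrative assumptions, not taken from a specific library.

```python
def soft_dice_loss(probs, targets, eps=1e-7):
    # Products of probabilities and labels replace the hard intersection count,
    # so the expression stays smooth in the predicted probabilities.
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    # eps (an assumption here) avoids division by zero for empty masks.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

targets = [1, 0, 1, 0]

perfect = soft_dice_loss([1.0, 0.0, 1.0, 0.0], targets)
uncertain = soft_dice_loss([0.6, 0.4, 0.6, 0.4], targets)
wrong = soft_dice_loss([0.0, 1.0, 0.0, 1.0], targets)

print(perfect, uncertain, wrong)
```

The loss shrinks as the predicted probabilities move towards the target mask, so it can be minimized with gradient-based training.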

ML Classification - Decision Boundary Algorithm

Given a classification problem in Machine Learning the hypothesis is described as below.
hθ(x)=g(θ'x)
z = θ'x
g(z) = 1 / (1+e^−z)
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x)≥0.5→y=1
hθ(x)<0.5→y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z) ≥ 0.5 when z ≥ 0
Remember:
z = 0, e^0 = 1 ⇒ g(z) = 1/2
z → ∞, e^(−∞) → 0 ⇒ g(z) → 1
z → −∞, e^(∞) → ∞ ⇒ g(z) → 0
So if our input to g is θ'x, then that means:
hθ(x) = g(θ'x) ≥ 0.5 when θ'x ≥ 0
From these statements we can now say:
θ'x≥0⇒y=1
θ'x<0⇒y=0
The decision boundary is the line that separates the area where y = 0 from the area where y = 1, and it is created by our hypothesis function.
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?

This is basic logistic regression with a threshold. Your theta' * x is just vector notation for your weight vector multiplied by your input. If you put that into the logistic function, which outputs a value strictly between 0 and 1, you then threshold that value at 0.5. If it is equal to or above 0.5, you treat the sample as positive, and as negative otherwise.
The classification algorithm is just that simple. The training is a bit more complicated; its goal is to find a weight vector theta that correctly classifies all your labeled data, or at least as much of it as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.

Courtesy of other posters on a different tech forum.
Solving for theta'*x >= 0 and theta'*x<0 gives the decision boundary. The RHS of the inequality ( i.e. 0) comes from the sigmoid function.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.
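The equivalence between thresholding g(theta'x) at 0.5 and checking the sign of theta'x can be seen in a few lines. This is only a sketch: the weight vector theta is a made-up example whose boundary is the line x1 + x2 = 3, and the first entry of each input is the constant 1 for the intercept.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_by_probability(theta, x):
    # Threshold the hypothesis h(x) = g(theta' x) at 0.5.
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if sigmoid(z) >= 0.5 else 0

def predict_by_sign(theta, x):
    # Equivalent rule: only the sign of theta' x matters.
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# Hypothetical parameters: -3 + x1 + x2 = 0 is the decision boundary.
theta = [-3.0, 1.0, 1.0]
points = [[1.0, 2.0, 2.0], [1.0, 1.0, 1.0], [1.0, 1.0, 2.0]]

by_probability = [predict_by_probability(theta, x) for x in points]
by_sign = [predict_by_sign(theta, x) for x in points]
print(by_probability, by_sign)
```

The third point lies exactly on the boundary (theta'x = 0, so g = 0.5) and is assigned to class 1 by both rules.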

Why not logistic regression uses multiplication instead of addition for error matrix?

This is a very basic question, but I could not find enough reasons to convince myself. Why must logistic regression use multiplication instead of addition for the likelihood function l(w)?

Your question is more general than just joint likelihood for logistic regression. You're asking why we multiply probabilities instead of add them to represent a joint probability distribution. Two notes:
This applies when we assume random variables are independent. Otherwise we need to calculate conditional probabilities using the chain rule of probability. You can look at wikipedia for more information.
We multiply because that's how the joint distribution is defined. Here is a simple example:
Say we have two probability distributions:
X = 1, 2, 3, each with probability 1/3
Y = 0 or 1, each with probability 1/2
We want to calculate the joint likelihood function, L(X=x,Y=y), which is the probability that X takes the value x and Y takes the value y.
For example, L(X=1,Y=0) = P(X=1) * P(Y=0) = 1/6. It wouldn't make sense to write P(X=1) + P(Y=0) = 1/3 + 1/2 = 5/6.
Now it's true that in maximum likelihood estimation, we only care about those values of some parameter, theta, which maximize the likelihood function. In this case, we know that if theta maximizes L(X=x,Y=y) then the same theta will also maximize log L(X=x,Y=y). This is where you may have seen addition come into play: the log turns the product of probabilities into a sum of log-probabilities.
Hence we can take the log P(X=x,Y=y) = log P(X=x) + log P(Y=y)
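A short numerical illustration of the product versus the sum of logs, using made-up per-sample probabilities. The second half shows a practical side effect that is worth knowing (an addition to the argument above, not part of it): a long product of small probabilities underflows in floating point, while the sum of logs stays usable.

```python
import math

# Hypothetical probabilities assigned to three independent observations.
probs = [0.9, 0.8, 0.7]

likelihood = math.prod(probs)                      # joint probability: a product
log_likelihood = sum(math.log(p) for p in probs)   # same quantity on the log scale

print(math.log(likelihood), log_likelihood)

# With many observations the raw product underflows to 0.0,
# while the log-likelihood remains a finite number.
many = [0.01] * 1000
underflowed = math.prod(many)
stable = sum(math.log(p) for p in many)

print(underflowed, stable)
```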
In short
This could be summarized as "joint probabilities represent an AND". When X and Y are independent, P(X AND Y) = P(X,Y) = P(X)P(Y). Not to be confused with P(X OR Y) = P(X) + P(Y) - P(X,Y).
Let me know if this helps.

Understanding softmax classifier

I am trying to understand a simple implementation of Softmax classifier from this link - CS231n - Convolutional Neural Networks for Visual Recognition. Here they implemented a simple softmax classifier. In the example of Softmax Classifier on the link, there are random 300 points on a 2D space and a label associated with them. The softmax classifier will learn which point belong to which class.
Here is the full code of the softmax classifier. Or you can see the link I have provided.
import numpy as np
# X [N x D], y [N integer labels], D and K come from the data-generation step in the linked notes
# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
# gradient descent loop
num_examples = X.shape[0]
for i in range(200):
    # evaluate class scores, [N x K]
    scores = np.dot(X, W) + b
    # compute the class probabilities
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]
    # compute the loss: average cross-entropy loss and regularization
    corect_logprobs = -np.log(probs[range(num_examples),y])
    data_loss = np.sum(corect_logprobs)/num_examples
    reg_loss = 0.5*reg*np.sum(W*W)
    loss = data_loss + reg_loss
    if i % 10 == 0:
        print("iteration %d: loss %f" % (i, loss))
    # compute the gradient on scores
    dscores = probs
    dscores[range(num_examples),y] -= 1
    dscores /= num_examples
    # backpropagate the gradient to the parameters (W,b)
    dW = np.dot(X.T, dscores)
    db = np.sum(dscores, axis=0, keepdims=True)
    dW += reg*W # regularization gradient
    # perform a parameter update
    W += -step_size * dW
    b += -step_size * db
I can't understand how they computed the gradient here. I assume that they computed the gradient in these lines:
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # regularization gradient
But how? I mean, why is the gradient dW equal to np.dot(X.T, dscores)? And why is the gradient db equal to np.sum(dscores, axis=0, keepdims=True)? So how did they compute the gradient on the weights and the bias? Also, why did they compute the regularization gradient?
I am just starting to learn about convolutional neural networks and deep learning, and I heard that CS231n - Convolutional Neural Networks for Visual Recognition is a good starting place for that. I did not know where to post deep learning related questions, so I posted this on Stack Overflow. If there is a better place for questions related to deep learning, please let me know.

The gradients start being computed here:
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
First, this sets dscores equal to the probabilities computed by the softmax function. Then, it subtracts 1 from the probabilities computed for the correct classes in the second line, and then it divides by the number of training samples in the third line.
Why does it subtract 1? Because you want the probabilities of the correct labels to be 1, ideally. So it subtracts what it should predict from what it actually predicts: if it predicts something close to 1, the result will be a negative number close to zero, so the gradient will be small, because you're close to a solution. Otherwise, it will be a negative number far from zero (close to -1), so the gradient will be bigger, and you'll take larger steps towards the solution.
Your score function is simply w*x + b. Its derivative with respect to w is x, which is why dW is the dot product between the transposed inputs X.T and the gradient of the scores / output layer.
The derivative of w*x + b with respect to b is 1, which is why you simply sum dscores when backpropagating.
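One way to convince yourself that these expressions are the right gradients is a numerical gradient check. The sketch below uses small random data (the sizes, the seed and the helper numerical_grad are assumptions for illustration, not part of the CS231n code) and compares the analytic dW and db of the data loss with central finite differences. Regularization is left out to keep the check short.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 5, 2, 3                      # hypothetical tiny problem
X = rng.standard_normal((N, D))
y = rng.integers(0, K, size=N)
W = 0.01 * rng.standard_normal((D, K))
b = np.zeros((1, K))

def softmax_probs(W, b):
    scores = X @ W + b
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def data_loss(W, b):
    # average cross-entropy loss, without the regularization term
    return -np.log(softmax_probs(W, b)[np.arange(N), y]).mean()

# Analytic gradients, following the same steps as the code in the question.
dscores = softmax_probs(W, b)
dscores[np.arange(N), y] -= 1
dscores /= N
dW = X.T @ dscores
db = dscores.sum(axis=0, keepdims=True)

def numerical_grad(f, param, h=1e-5):
    # Central differences, perturbing one entry of param at a time (in place).
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + h
        plus = f()
        param[idx] = old - h
        minus = f()
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * h)
    return grad

num_dW = numerical_grad(lambda: data_loss(W, b), W)
num_db = numerical_grad(lambda: data_loss(W, b), b)

dW_matches = np.allclose(dW, num_dW, atol=1e-7)
db_matches = np.allclose(db, num_db, atol=1e-7)
print(dW_matches, db_matches)
```

If the analytic and numerical gradients agree, the shortcut "probabilities minus one-hot labels, then X.T times that" is indeed the derivative of the loss with respect to W, and the column sums are the derivative with respect to b.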

Gradient Descent
Backpropagation computes the gradients used to reduce the cost J of the entire system (the softmax classifier here); the task is to optimize the weight parameter W so as to minimize that cost. Provided the cost function J = f(W) is convex, gradient descent W = W - α * f'(W) will converge to the Wmin that minimizes J, as long as α is small enough. The hyperparameter α is called the learning rate; it needs to be tuned too, but that is outside the scope of this answer.
Y should be read as J in the diagram. Imagine you are on the surface of a place whose shape is defined as J = f(W) and you need to reach the point Wmin. There is no gravity so you do not know which way is toward the bottom but you know the function and your coordinate. How do you know which way you should go? You can find the direction from the derivative f'(W) and move to a new coordinate by W = W - α * f'(W). By repeating this, you can get closer and closer to the point Wmin.
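The update rule W = W - α * f'(W) can be tried on a one-dimensional convex function. This is a minimal sketch; the function J = (W - 3)^2, the starting point and the learning rate are arbitrary choices for illustration.

```python
def f(w):
    # A simple convex cost with its minimum at w = 3.
    return (w - 3.0) ** 2

def f_prime(w):
    return 2.0 * (w - 3.0)

alpha = 0.1   # learning rate (hypothetical value)
w = -5.0      # arbitrary starting coordinate

for _ in range(100):
    # Step against the derivative: downhill on J = f(W).
    w = w - alpha * f_prime(w)

print(w, f(w))
```

Each step shrinks the distance to the minimum by a constant factor here, so w ends up very close to 3.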
Backpropagation at the Affine Layer
At the node where the multiply or dot operation happens (the affine layer), the function is J = f(W) = X * W. Suppose there are m fixed two-dimensional coordinates represented as X. How can we find the hyperplane which minimizes J = f(W) = X * W, and its vector W?
We can get closer to the optimal W by repeating the gradient descent W += -α * X if α is appropriate.
Chain Rule
When there are layers after the Affine layer such as the softmax layer and the log loss layer in the softmax classifier, we can calculate the gradient with the chain rule. In the diagram, replace sigmoid with softmax.
As stated in Computing the Analytic Gradient with Backpropagation in the CS231n page, the gradient contribution from the softmax layer and the log loss layer is the dscores part. See the Note section below too.
By applying that gradient to the affine layer via the chain rule, the code below is derived, where α is replaced with step_size. In practice, step_size needs to be tuned as well.
dW = np.dot(X.T, dscores)
W += -step_size * dW
The bias gradient can be derived by applying the chain rule towards the bias b with the gradients (dscores) from the later layers.
db = np.sum(dscores, axis=0, keepdims=True)
Regularization
As stated in the Regularization section of the CS231n page, the cost function (objective) is adjusted by adding a regularization term, which is reg_loss in the code. Its purpose is to reduce overfitting. The intuition, in my understanding, is that if specific features cause overfitting, we can counter it by inflating the cost with their weight parameters W, because gradient descent will then work to reduce the cost contribution from those weights. Since we do not know which features are responsible, all of W is used. The factor 0.5 in 0.5 * W*W is there because it gives the simple derivative W.
reg_loss = 0.5*reg*np.sum(W*W)
The gradient contribution reg*W comes from the derivative of reg_loss. The value of reg is a hyperparameter to be tuned in real training.
d(reg_loss)/dW = 0.5 * reg * 2 * W = reg * W
It is added to the gradient coming from the layers after the affine layer.
dW += reg*W # regularization gradient
The process of getting the derivative from the cost including the regularization is omitted in the CS231n page referenced in the post, probably because it is common practice to just write down the gradient of the regularization, but this is confusing for those who are learning. See Coursera Machine Learning Week 3 Cost Function by Andrew Ng for the regularization.
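The step from reg_loss to reg*W can be verified with a finite-difference check on a handful of weights. This is a sketch with made-up values; reg_loss and reg_grad here are small stand-alone helpers written for this check, not functions from the CS231n code.

```python
def reg_loss(weights, reg):
    # 0.5 * reg * sum of squared weights, as in the classifier code
    return 0.5 * reg * sum(w * w for w in weights)

def reg_grad(weights, reg):
    # Analytic derivative: the 0.5 cancels the 2 from differentiating w^2.
    return [reg * w for w in weights]

weights = [0.5, -1.5, 2.0]   # hypothetical weights
reg = 1e-3
h = 1e-6

numeric = []
for i in range(len(weights)):
    plus = list(weights)
    minus = list(weights)
    plus[i] += h
    minus[i] -= h
    # Central difference for the i-th partial derivative.
    numeric.append((reg_loss(plus, reg) - reg_loss(minus, reg)) / (2 * h))

analytic = reg_grad(weights, reg)
print(analytic, numeric)
```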
Note
The bias parameter b is replaced with X0, since the bias can be dropped as a separate term by folding it into W through a constant input X0 = 1.

How to update the bias in neural network backpropagation?

Could someone please explain to me how to update the bias throughout backpropagation?
I've read quite a few books, but can't find bias updating!
I understand that bias is an extra input of 1 with a weight attached to it (for each neuron). There must be a formula.

Following the notation of Rojas 1996, chapter 7, backpropagation computes partial derivatives of the error function E (aka cost, aka loss)
∂E/∂w[i,j] = delta[j] * o[i]
where w[i,j] is the weight of the connection between neurons i and j, j being one layer higher in the network than i, and o[i] is the output (activation) of i (in the case of the "input layer", that's just the value of feature i in the training sample under consideration). How to determine delta is given in any textbook and depends on the activation function, so I won't repeat it here.
These values can then be used in weight updates, e.g.
// update rule for vanilla online gradient descent
w[i,j] -= gamma * o[i] * delta[j]
where gamma is the learning rate.
The rule for bias weights is very similar, except that there's no input from a previous layer. Instead, bias is (conceptually) caused by input from a neuron with a fixed activation of 1. So, the update rule for bias weights is
bias[j] -= gamma_bias * 1 * delta[j]
where bias[j] is the weight of the bias on neuron j, the multiplication with 1 can obviously be omitted, and gamma_bias may be set to gamma or to a different value. If I recall correctly, lower values are preferred, though I'm not sure about the theoretical justification of that.
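The two update rules can be seen side by side in a tiny example. This is a sketch under assumptions that are not in the answer: a single sigmoid neuron, the squared error E = 0.5 * (o - target)^2 (which gives delta = (o - target) * o * (1 - o)), and one made-up training sample whose input is 0, so that only the bias can change the output.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training sample: the input is 0, the desired output is 0.9.
x, target = 0.0, 0.9
w, bias = 0.3, 0.0
gamma = gamma_bias = 0.5   # learning rates (assumed equal here)

for _ in range(5000):
    o = sigmoid(w * x + bias)
    # delta for a sigmoid output neuron with squared error
    delta = (o - target) * o * (1.0 - o)
    w -= gamma * x * delta           # weight rule: o[i] * delta[j], with o[i] = x
    bias -= gamma_bias * 1 * delta   # bias rule: same, with a fixed input of 1

output = sigmoid(w * x + bias)
print(w, bias, output)
```

The weight never moves because its input is 0, while the bias alone brings the output to the target, which is exactly the "extra input of 1" view described above.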

The amount you change each individual weight and bias will be the partial derivative of your cost function in relation to each individual weight and each individual bias.
∂C/∂(index of bias in network)
Since your cost function probably doesn't explicitly depend on individual weights and values (Cost might equal (network output - expected output)^2, for example), you'll need to relate the partial derivatives of each weight and bias to something you know, i.e. the activation values (outputs) of neurons. Here's a great guide to doing this:
https://medium.com/@erikhallstrm/backpropagation-from-the-beginning-77356edf427d
This guide states how to do these things clearly, but can sometimes be lacking on explanation. I found it very helpful to read chapters 1 and 2 of this book as I read the guide linked above:
http://neuralnetworksanddeeplearning.com/chap1.html
(provides essential background for the answer to your question)
http://neuralnetworksanddeeplearning.com/chap2.html
(answers your question)
Basically, biases are updated in the same way that weights are updated: a change is determined based on the gradient of the cost function at a multi-dimensional point.
Think of the problem your network is trying to solve as being a landscape of multi-dimensional hills and valleys (gradients). This landscape is a graphical representation of how your cost changes with changing weights and biases. The goal of a neural network is to reach the lowest point in this landscape, thereby finding the smallest cost and minimizing error. If you imagine your network as a traveler trying to reach the bottom of these gradients (i.e. Gradient Descent), then the amount you will change each weight (and bias) by is related to the slope of the incline (gradient of the function) that the traveler is currently climbing down. The exact location of the traveler is given by a multi-dimensional coordinate point (weight1, weight2, weight3, ... weight_n), where the bias can be thought of as another kind of weight. Thinking of the weights/biases of a network as the variables of the network's cost function makes it clear that ∂C/∂(index of bias in network) must be used.

As I understand it, the function of the bias is to adjust the level of the input values. Below is what happens inside the neuron. The activation function will of course produce the final output, but it is left out for clarity.
O = W1 I1 + W2 I2 + W3 I3
In a real neuron something already happens at the synapses: the input data is level-adjusted with the average of the samples and scaled with the deviation of the samples. Thus the input data is normalized, and with equal weights the inputs will have the same effect. The normalized In is calculated from the raw data in (n is the index).
Bn = -average(in); Sn = 1/stdev(in); In = (in + Bn) Sn
However, this does not need to be performed separately, because the neuron weights and bias can do the same job. When you substitute In with in, you get a new formula
O = w1 i1 + w2 i2 + w3 i3 + wbs
The last term wbs is the bias, and the new weights wn are
wbs = W1 B1 S1 + W2 B2 S2 + W3 B3 S3
wn = Wn Sn
So a bias exists, and it will/should be adjusted automagically by backpropagation.
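The claim that shifting and scaling the inputs can be folded into the weights and a bias is easy to check numerically. In this sketch the raw samples and the weights W are made up, and the shift B is taken as the negative column mean so that (i + B) * S is the usual normalization.

```python
import statistics

# Hypothetical raw samples (rows) with three input features (columns).
raw = [[1.0, 10.0, 100.0], [2.0, 20.0, 300.0], [3.0, 60.0, 200.0]]
W = [0.5, -0.2, 0.1]   # weights acting on the normalized inputs

columns = list(zip(*raw))
B = [-statistics.mean(c) for c in columns]       # level shift per input
S = [1.0 / statistics.stdev(c) for c in columns]  # scale per input

def output_normalized(sample):
    # O = sum of Wn * In, with In = (in + Bn) * Sn
    return sum(Wn * (i + Bn) * Sn for Wn, i, Bn, Sn in zip(W, sample, B, S))

# Fold the normalization into new weights and a single bias term.
w = [Wn * Sn for Wn, Sn in zip(W, S)]
wbs = sum(Wn * Bn * Sn for Wn, Bn, Sn in zip(W, B, S))

def output_folded(sample):
    # O = w1 i1 + w2 i2 + w3 i3 + wbs on the raw inputs
    return sum(wn * i for wn, i in zip(w, sample)) + wbs

pairs = [(output_normalized(s), output_folded(s)) for s in raw]
print(pairs)
```

Both forms give the same output for every sample, so a network that has a trainable bias can learn the level adjustment itself.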
