Understanding softmax classifier - machine-learning

I am trying to understand a simple implementation of Softmax classifier from this link - CS231n - Convolutional Neural Networks for Visual Recognition. Here they implemented a simple softmax classifier. In the example of Softmax Classifier on the link, there are random 300 points on a 2D space and a label associated with them. The softmax classifier will learn which point belong to which class.
Here is the full code of the softmax classifier. Or you can see the link I have provided.
# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
# gradient descent loop
num_examples = X.shape[0]
for i in xrange(200):
# evaluate class scores, [N x K]
scores = np.dot(X, W) + b
# compute the class probabilities
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]
# compute the loss: average cross-entropy loss and regularization
corect_logprobs = -np.log(probs[range(num_examples),y])
data_loss = np.sum(corect_logprobs)/num_examples
reg_loss = 0.5*reg*np.sum(W*W)
loss = data_loss + reg_loss
if i % 10 == 0:
print "iteration %d: loss %f" % (i, loss)
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
# backpropate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # regularization gradient
# perform a parameter update
W += -step_size * dW
b += -step_size * db
I cant understand how they computed the gradient here. I assume that they computed the gradient here -
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # regularization gradient
But How? I mean Why gradient of dW is np.dot(X.T, dscores)? And Why the gradient of db is np.sum(dscores, axis=0, keepdims=True)?? So how they computed the gradient on weight and bias? Also why they computed the regularization gradient?
I am just starting to learn about convolutional neural networks and deep learning. And I heard that CS231n - Convolutional Neural Networks for Visual Recognition is a good starting place for that. I did not know where to place deep learning related post. So, i placed them on stackoverflow. If there is any place to post questions related to deep learning please let me know.

The gradients start being computed here:
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
First, this sets dscores equal to the probabilities computed by the softmax function. Then, it subtracts 1 from the probabilities computed for the correct classes in the second line, and then it divides by the number of training samples in the third line.
Why does it subtract 1? Because you want the probabilities of the correct labels to be 1, ideally. So it subtracts what it should predict from what it actually predicts: if it predicts something close to 1, the subtraction will be a large negative number (close to zero), so the gradient will be small, because you're close to a solution. Otherwise, it will be a small negative number (far from zero), so the gradient will be bigger, and you'll take larger steps towards the solution.
Your activation function is simply w*x + b. Its derivative with respect to w is x, which is why dW is the dot product between x and the gradient of the scores / output layer.
The derivative of w*x + b with respect to b is 1, which is why you simply sum dscores when backpropagating.

Gradient Descent
Backpropagation is to reduce the cost J of the entire system (softmax classifier here) and it is a problem to optimize the weight parameter W to minimize the cost. Providing the cost function J = f(W) is convex, the gradient descent W = W - α * f'(W) will result in the Wmin which minimizes J. The hyperparameter α is called learning rate which we need to optimize too, but not in this answer.
Y should be read as J in the diagram. Imagine you are on the surface of a place whose shape is defined as J = f(W) and you need to reach the point Wmin. There is no gravity so you do not know which way is toward the bottom but you know the function and your coordinate. How do you know which way you should go? You can find the direction from the derivative f'(W) and move to a new coordinate by W = W - α * f'(W). By repeating this, you can get closer and closer to the point Wmin.
Back propagation at Affin Layer
At the node where multiply or dot operation happens (affin), the function is J = f(W) = X * W. Suppose there are m number of fixed two dimensional coordinates represented as X. How can we find the hyper-plane which minimizes J = f(W) = X * W and its vector W?
We can get closer to the optimal W by repeating the gradient descent W += -α * X if α is appropriate.
Chain Rule
When there are layers after the Affine layer such as the softmax layer and the log loss layer in the softmax classifier, we can calculate the gradient with the chain rule. In the diagram, replace sigmoid with softmax.
As stated in Computing the Analytic Gradient with Backpropagation in the cs321 page, the gradient contribution from the softmax layer and the log loss layer is the dscore part. See the Note section below too.
By applying the gradient to that of the affine layer via the chain rule, the code is derived where α is replaced with step_size. In reality, the step_size needs to be learned as well.
dW = np.dot(X.T, dscores)
W += -step_size * dW
The bias gradient can be derived by applying the chain rule towards the bias b with the gradients (dscore) from the post layers.
db = np.sum(dscores, axis=0, keepdims=True)
Regularization
As stated in Regularization of the cs231 page, the cost function (objective) is adjusted by adding the regularization, which is reg_loss in the code. It is to reduce the over-fitting. The intuition is, in my understanding, if specific feature(s) cause overfitting, we can reduce it by inflating the cost with their weight parameters W, because the gradient descent will work to reduce the cost contributions from the weights. Since we do not know which ones, use all W. The reason of 0.5 * W*W is because it gives simple derivative W.
reg_loss = 0.5*reg*np.sum(W*W)
The gradient contribution reg*W is from the derivative of reg_loss. The reg is a hyper parameter to be learned in the real training.
reg_loss/dw -> 0.5 * reg * 2 * W
It is added to the gradient from the layers after the affin.
dW += reg*W # regularization gradient
The process to get the derivative from the cost including the regularization is omitted in the cs231 page referenced in the post, probably because it is a common practice to just put the gradient of the regularization, but confusing for those who are learning. See Coursera Machine Learning Week 3 Cost Function by Andrew Ng for the regularization.
Note
The bias parameter b is substituted with X0 as the bias can be omitted by shifting to the base.

Related

How L1 norm select features?

It has been often said that L1 regularization helps in feature selection? How the L1 norm does that?
And also why L2 normalization is not able to do that?
At the beginning please notice L1 and L2 regularization may not always work like that, there are various quirks and it depends on applied strength and other factors.
First of all, we will consider Linear Regression as the simplest case.
Secondly it's easiest to consider only two weights for this problem to get some intuition.
Now, let's introduce a simple constraint: sum of both weights has to be equal to 1.0 (e.g. first weight w1=0.2 and second w2=0.8 or any other combination).
And the last assumptions:
x1 feature has perfect positive correlation with target (e.g. 1.0 * x1 = y, where y is our target)
x2 has almost perfect positive correlation with target (e.g. 0.99 * x2 = y)
(alpha (meaning regularization strength) will be set to 1.0 for both L1 and L2 in order no to clutter the picture further).
L2
Weights values
For two variables (weights) and L2 regularization we would have the following formula:
alpha * (w1^2 + w2^2)/2 (mean of their squares)
Now, we would like to minimize the above equation as it's part of the cost function.
One can easily see both has to be set to 0.5 (remember, their sum has to be equal 1.0!), because 0.5 ^ 2 + 0.5 ^ 2 = 0.5. For any other two values summing to 1 we would get a greater value (e.g. 0.8 ^ 2 + 0.2 ^ 2 = 0.64 + 0.68), hence 0.5 and 0.5 is optimal solution.
Target predictions
In this case we are pretty close for all data points, because:
0.5 * 1.0 + 0.5 + 0.99 = 0.995 (of `y`)
So we are "off" only by 0.005 for each sample. What this means is that regularization on weights has greater effect on cost function than this small difference (that's why w1 wasn't chosen as the only variable and the values were "split").
BTW. Exact values above will differ slightly (e.g. w1 ~0.49 but it's easier to follow along this way I think).
Final insight
With L2 regularization two similar weights tend to be "split" in half as it minimizes the regularization penalty
L1
Weights values
This time it will be even easier: for two variables (weights) and L1 regularization we would have the following formula:
alpha * (|w1| + |w2|)/2 (mean of their absolute values)
This time it doesn't matter what w1 or w2 is set to (as long as their sum has to be equal to 1.0), so |0.5| + |0.5| = |0.2| + |0.8| = |1.0| + |0.0|... (and so on).
In this case L1 regularization will prefer 1.0, the reason below
Target predictions
As the distribution of weights does not matter in this case it's loss value we are after (under the 1.0 sum constraint). For perfect predictions it would be:
1.0 * 1.0 + 0.0 * 0.99 = 1.0
This time we are not "off" at all and it's "best" to choose just w1, no need for w2 in this case.
Final insight
With L1 regularization similar weights tend to be zeroed out in favor of the one connected to feature being better at predicting final target with lowest coefficient.
BTW. If we had x3 which would once again be correlated positively with our values to predict and described by equation
0.1 * x3 = y
Only x3 would be chosen with weight equal to 0.1
Reality
In reality there is almost never "perfect correlation" of variables, there are many features interacting with each other, there are hyperparameters and imperfect optimizers amongst many other factors.
This simplified view should give you an intuition to "why" though.
A common application of your question is in different types of regression. Here is a link that explains the difference between Ridge (L2) and Lasso (L1) regression:
https://stats.stackexchange.com/questions/866/when-should-i-use-lasso-vs-ridge

ML Classification - Decision Boundary Algorithm

Given a classification problem in Machine Learning the hypothesis is described as below.
hθ(x)=g(θ'x)
z = θ'x
g(z) = 1 / (1+e^−z)
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x)≥0.5→y=1
hθ(x)<0.5→y=0
The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
g(z)≥0.5
whenz≥0
Remember.
z=0,e0=1⇒g(z)=1/2
z→∞,e−∞→0⇒g(z)=1
z→−∞,e∞→∞⇒g(z)=0
So if our input to g is θTX, then that means:
hθ(x)=g(θTx)≥0.5
whenθTx≥0
From these statements we can now say:
θ'x≥0⇒y=1
θ'x<0⇒y=0
If The decision boundary is the line that separates the area where y = 0 and where y = 1 and is created by our hypothesis function:
What part of this relates to the Decision Boundary? Or where does the Decision Boundary algorithm come from?
This is basic logistic regression with a threshold. So your theta' * x is just the vector notation of your weight vector multiplied by your input. If you put that into the logistic function which outputs a value between 0 and 1 exclusively, you'll threshold that value at 0.5. So if it's equal and above this, you'll treat it as a positive sample and as a negative one otherwise.
The classification algorithm is just that simple. The training is a bit more complicated and the goal of it is the find a weight vector theta which satisfies the condition to correctly classify all your labeled data...or at least as much as possible. The way to do this is to minimize a cost function which measures the difference between the output of your function and the expected label. You can do this using gradient descent. I guess, Andrew Ng is teaching this.
Edit: Your classification algorithm is g(theta'x)>=0.5 and g(theta'x)<0.5, so a basic step function.
Courtesy of other posters on a different tech forum.
Solving for theta'*x >= 0 and theta'*x<0 gives the decision boundary. The RHS of the inequality ( i.e. 0) comes from the sigmoid function.
Theta gives you the hypothesis that best fits the training set.
From theta, you can compute the decision boundary - it is the locus of points where (X * theta) = 0, or equivalently where g(X * theta) = 0.5.

Why input is scaled in tf.nn.dropout in tensorflow?

I can't understand why dropout works like this in tensorflow. The blog of CS231n says that, "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." Also you can see this from picture(Taken from the same site)
From tensorflow site, With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0.
Now, why the input element is scaled up by 1/keep_prob? Why not keep the input element as it is with probability and not scale it with 1/keep_prob?
This scaling enables the same network to be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:
The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.
Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op to scale up the weights by 1. / keep_prob at training time. The effect on performance is negligible, and the code is simpler (because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network).
Let's say the network had n neurons and we applied dropout rate 1/2
Training phase, we would be left with n/2 neurons. So if you were expecting output x with all the neurons, now you will get on x/2. So for every batch, the network weights are trained according to this x/2
Testing/Inference/Validation phase, we dont apply any dropout so the output is x. So, in this case, the output would be with x and not x/2, which would give you the incorrect result. So what you can do is scale it to x/2 during testing.
Rather than the above scaling specific to Testing phase. What Tensorflow's dropout layer does is that whether it is with dropout or without (Training or testing), it scales the output so that the sum is constant.
Here is a quick experiment to disperse any remaining confusion.
Statistically the weights of a NN-layer follow a distribution that is usually close to normal (but not necessarily), but even in the case when trying to sample a perfect normal distribution in practice, there are always computational errors.
Then consider the following experiment:
DIM = 1_000_000 # set our dims for weights and input
x = np.ones((DIM,1)) # our input vector
#x = np.random.rand(DIM,1)*2-1.0 # or could also be a more realistic normalized input
probs = [1.0, 0.7, 0.5, 0.3] # define dropout probs
W = np.random.normal(size=(DIM,1)) # sample normally distributed weights
print("W-mean = ", W.mean()) # note the mean is not perfect --> sampling error!
# DO THE DRILL
h = defaultdict(list)
for i in range(1000):
for p in probs:
M = np.random.rand(DIM,1)
M = (M < p).astype(int)
Wp = W * M
a = np.dot(Wp.T, x)
h[str(p)].append(a)
for k,v in h.items():
print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)" % (k, np.mean(v), np.mean(v)/float(k)))
Sample output:
x-mean = 1.0
W-mean = -0.001003985674840264
For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)
Notice that the unscaled activations diminish due to the statistically imperfect normal distribution.
Can you spot an obvious correlation between the W-mean and the average linear activation means?
If you keep reading in cs231n, the difference between dropout and inverted dropout is explained.
In a network with no dropout, the activations in layer L will be aL. The weights of next layer (L+1) will be learned in such a manner that it receives aL and produces output accordingly. But with a network containing dropout (with keep_prob = p), the weights of L+1 will be learned in such a manner that it receives p*aL and produces output accordingly. Why p*aL? Because the Expected value, E(aL), will be probability_of_keeping(aL)*aL + probability_of_not_keeping(aL)*0 which will be equal to p*aL + (1-p)*0 = p*aL. In the same network, during testing time there will be no dropout. Hence the layer L+1 will receive aL simply. But its weights were trained to expect p*aL as input. Therefore, during testing time you will have to multiply the activations with p. But instead of doing this, you can multiply the activations with 1/p during training only. This is called inverted dropout.
Since we want to leave the forward pass at test time untouched (and tweak our network just during training), tf.nn.dropout directly implements inverted dropout, scaling the values.

Neural Networks: Why does the perceptron rule only work for linearly separable data?

I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data?
Mitchell defines a perceptron as follows:
That is, it is y is 1 or -1 if the sum of the weighted inputs exceeds some threshold.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random
weights, then iteratively apply the perceptron to each training
example, modify- ing the perceptron weights whenever it misclassifies
an example. This process is repeated, iterating through the training
examples as many times as needed until the perceptron classifies all
training examples correctly. Weights are modified at each step
according to the perceptron training rule, which revises the weight wi
associated with input xi according to the rule:
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of xs, and you, in fact, split your data into 2 classes using a hyperplane a_1 x_1 + … + a_n x_n > 0
Consider a 2D example: X = (x, y) and W = (a, b) then X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -a/b x (assuming b != 0). And this equation is linear and divides a 2D plane into 2 parts.

How to update the bias in neural network backpropagation?

Could someone please explain to me how to update the bias throughout backpropagation?
I've read quite a few books, but can't find bias updating!
I understand that bias is an extra input of 1 with a weight attached to it (for each neuron). There must be a formula.
Following the notation of Rojas 1996, chapter 7, backpropagation computes partial derivatives of the error function E (aka cost, aka loss)
∂E/∂w[i,j] = delta[j] * o[i]
where w[i,j] is the weight of the connection between neurons i and j, j being one layer higher in the network than i, and o[i] is the output (activation) of i (in the case of the "input layer", that's just the value of feature i in the training sample under consideration). How to determine delta is given in any textbook and depends on the activation function, so I won't repeat it here.
These values can then be used in weight updates, e.g.
// update rule for vanilla online gradient descent
w[i,j] -= gamma * o[i] * delta[j]
where gamma is the learning rate.
The rule for bias weights is very similar, except that there's no input from a previous layer. Instead, bias is (conceptually) caused by input from a neuron with a fixed activation of 1. So, the update rule for bias weights is
bias[j] -= gamma_bias * 1 * delta[j]
where bias[j] is the weight of the bias on neuron j, the multiplication with 1 can obviously be omitted, and gamma_bias may be set to gamma or to a different value. If I recall correctly, lower values are preferred, though I'm not sure about the theoretical justification of that.
The amount you change each individual weight and bias will be the partial derivative of your cost function in relation to each individual weight and each individual bias.
∂C/∂(index of bias in network)
Since your cost function probably doesn't explicitly depend on individual weights and values (Cost might equal (network output - expected output)^2, for example), you'll need to relate the partial derivatives of each weight and bias to something you know, i.e. the activation values (outputs) of neurons. Here's a great guide to doing this:
https://medium.com/#erikhallstrm/backpropagation-from-the-beginning-77356edf427d
This guide states how to do these things clearly, but can sometimes be lacking on explanation. I found it very helpful to read chapters 1 and 2 of this book as I read the guide linked above:
http://neuralnetworksanddeeplearning.com/chap1.html
(provides essential background for the answer to your question)
http://neuralnetworksanddeeplearning.com/chap2.html
(answers your question)
Basically, biases are updated in the same way that weights are updated: a change is determined based on the gradient of the cost function at a multi-dimensional point.
Think of the problem your network is trying to solve as being a landscape of multi-dimensional hills and valleys (gradients). This landscape is a graphical representation of how your cost changes with changing weights and biases. The goal of a neural network is to reach the lowest point in this landscape, thereby finding the smallest cost and minimizing error. If you imagine your network as a traveler trying to reach the bottom of these gradients (i.e. Gradient Descent), then the amount you will change each weight (and bias) by is related to the the slope of the incline (gradient of the function) that the traveler is currently climbing down. The exact location of the traveler is given by a multi-dimensional coordinate point (weight1, weight2, weight3, ... weight_n), where the bias can be thought of as another kind of weight. Thinking of the weights/biases of a network as the variables for the network's cost function make it clear that ∂C/∂(index of bias in network) must be used.
I understand that the function of bias is to make level adjust of the
input values. Below is what happens inside the neuron. The activation function of course
will make the final output, but it is left out for clarity.
O = W1 I1 + W2 I2 + W3 I3
In real neuron something happens already at synapses, the input data is level adjusted with average of samples and scaled with deviation of samples. Thus the input data is normalized and with equal weights they will make the same effect. The normalized In is calculated from raw data in (n is the index).
Bn = average(in); Sn = 1/stdev((in); In= (in+Bn)Sn
However this is not necessary to be performed separately, because the neuron weights and bias can do the same function. When you subsitute In with the in, you get new formula
O = w1 i1 + w2 i2 + w3 i3+ wbs
The last wbs is the bias and new weights wn as well
wbs = W1 B1 S1 + W2 B2 S2 + W3 B3 S3
wn =W1 (in+Bn) Sn
So there exists a bias and it will/should be adjusted automagically with the backpropagation

Resources