Must L2 regularization be added to the cost function when using Linear Regression?
I'm not adding L2 or taking it into account when computing the cost. Is that wrong?
The code snippet below should be sufficient:
def gradient(self, X, Y, alpha, minibatch_size):
    predictions = None
    for batch in self.iterate_minibatches(X, Y, minibatch_size, shuffle=True):
        x, y = batch
        predictions = x.dot(self.theta)
        for it in range(self.theta.size):
            temp = x[:, it]
            temp.shape = (y.size, 1)
            errors_x1 = (predictions - y) * temp
            self.theta[it] = self.theta[it] - alpha * (1.0 / y.size) * errors_x1.sum() + self.lambda_l2 * self.theta[it] * self.theta[it].T
    print self.cost(X, Y, self.theta)

def cost(self, X, Y, theta, store=True):
    predictions = X.dot(theta)
    from sklearn.metrics import mean_squared_error
    cost = mean_squared_error(Y, predictions, sample_weight=None, multioutput='uniform_average')
    if store is True:
        self.cost_history.append(cost)
    return cost
It is not necessary to add L2 (or L1) regularization to your Linear Regression (LR) implementation.
However, adding an L2 regularization term to the cost function does have an advantage over LR without a regularization term. Most importantly, the regularization term helps you reduce model overfitting and improve the generalization of your models. LR with L2 regularization is commonly known as "Ridge Regression".
In addition to Ridge Regression, LR with L1 regularization is known as Lasso Regression. If you build regression models using Lasso Regression, your models will be sparse. Hence, Lasso can be used for feature selection as well.
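For instance, here is a minimal scikit-learn sketch (my own illustration, with made-up data and arbitrary alpha values you would normally tune) comparing plain LR, Ridge and Lasso:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X = np.random.randn(100, 5)
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * np.random.randn(100)

lr = LinearRegression().fit(X, y)      # no regularization
ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty: shrinks the coefficients
lasso = Lasso(alpha=0.1).fit(X, y)     # L1 penalty: tends to zero out coefficients

print(lr.coef_)
print(ridge.coef_)
print(lasso.coef_)                     # often sparse: some coefficients exactly 0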
Good Luck!
The recursive feature elimination with cross-validation is taking too long to run. How do I increase the speed?
X and y
X = df.iloc[:,7:-2]
y = df["subtype"]
X.shape
(867, 142513)
Scale the dataset
# scale the dataset
sc = StandardScaler()
X = pd.DataFrame(sc.fit_transform(X))
RFE with cross-validation
# Recursive feature elimination with cross-validation
## Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
## The "accuracy" scoring shows the proportion of correct classifications
min_features_to_select = 7000 # Minimum number of features to consider
rfecv = RFECV(
    estimator=svc,
    step=7,
    cv=StratifiedKFold(5),
    scoring="accuracy",
    min_features_to_select=min_features_to_select,
)
rfecv.fit_transform(X, y)
In multi-class logistic regression, let's say we use softmax and cross-entropy.
Does SGD with one training example update all the weights, or only the portion of the weights associated with the label?
For example, the label is one-hot [0,0,1].
Is the whole matrix W_{feature_dim \times num_class} updated, or only W^{3}_{feature_dim \times 1}?
Thanks
All of your weights are updated.
You have y = Softmax(W x + β), so to predict a y out of a single x you are making use of all your W weights. If something is used during the forward pass (prediction), then it also gets updated during the backward pass (SGD). Perhaps a more intuitive way of thinking about it is that you are essentially predicting the class membership probability for your features; assigning weight to some class means removing weight from another, so you need to update both.
Take for instance the simple case of x ∈ ℝ, y ∈ ℝ³. Then W ∈ ℝ^(1×3). Before activation, your prediction for some given x would look like: y = [y₁ = W₁₁x + β₁, y₂ = W₁₂x + β₂, y₃ = W₁₃x + β₃]. You have an error signal for all of these mini-predictions, coming out of the categorical cross-entropy, for which you must then compute the derivative with respect to the W and β terms.
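To make this concrete, here is a small NumPy sketch of my own (not from the question) for a single example whose one-hot label is [0, 0, 1]; the gradient with respect to W is non-zero in every column, so all weights are updated:

import numpy as np

x = np.array([1.0, 2.0, -1.0])             # one training example, 3 features
y_onehot = np.array([0.0, 0.0, 1.0])       # label is the third class
W = 0.01 * np.random.randn(3, 3)           # feature_dim x num_class
beta = np.zeros(3)

scores = x @ W + beta
p = np.exp(scores) / np.exp(scores).sum()  # softmax probabilities

dscores = p - y_onehot                     # gradient of cross-entropy wrt the scores
dW = np.outer(x, dscores)                  # gradient wrt W: every column is non-zero
dbeta = dscores

print(dW)                                  # all three columns of W receive an update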
I hope this is clear
I've been training a neural network with scikit-learn's MLPRegressor using ShuffleSplit with 10 splits and 20% of the data set aside for testing. First I use GridSearchCV to find good parameters. I then instantiate a new (unfitted) estimator with those params, and finally use the plot_learning_curve function, with a MAPE scorer and the same ShuffleSplit cv.
In most of the learning curve examples I've seen, the validation and training curves are distinctly separate. However, I've consistently been getting learning curves where the cross-validation and training curves are almost identical. How should I interpret this - does it seem realistic, or have I made a mistake somewhere?
Learning Curve
As requested, here's the code:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import ShuffleSplit, GridSearchCV

node_range = list(range(1,16))
layer_range = range(1,6)
hidden_sizes = [(nodes,) * layers for layers in layer_range for nodes in node_range]
param_grid = [{'hidden_layer_sizes': hidden_sizes,
               'activation': ['relu'],
               'learning_rate_init': [0.5]}
              ]
cv = ShuffleSplit(n_splits=10, test_size=0.2)
estimator = MLPRegressor()  # initial (unfitted) estimator for the grid search
search = GridSearchCV(estimator, param_grid, cv=cv, scoring=neg_MAPE, refit=True)
search.fit(X, y)
best_params = search.best_params_
estimator = MLPRegressor().set_params(**best_params)
plot_learning_curve(estimator, X, y, cv=cv, scoring=neg_MAPE)
And here is my scorer:
import numpy as np
from sklearn.metrics import make_scorer

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

neg_MAPE = make_scorer(mean_absolute_percentage_error, greater_is_better=False)
I am trying to understand a simple implementation of a Softmax classifier from this link - CS231n - Convolutional Neural Networks for Visual Recognition. There they implement a simple softmax classifier. In the Softmax classifier example at that link, there are 300 random points in a 2D space, each with an associated label. The softmax classifier learns which point belongs to which class.
Here is the full code of the softmax classifier. Or you can see the link I have provided.
# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
# gradient descent loop
num_examples = X.shape[0]
for i in xrange(200):
  # evaluate class scores, [N x K]
  scores = np.dot(X, W) + b
  # compute the class probabilities
  exp_scores = np.exp(scores)
  probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]
  # compute the loss: average cross-entropy loss and regularization
  corect_logprobs = -np.log(probs[range(num_examples),y])
  data_loss = np.sum(corect_logprobs)/num_examples
  reg_loss = 0.5*reg*np.sum(W*W)
  loss = data_loss + reg_loss
  if i % 10 == 0:
    print "iteration %d: loss %f" % (i, loss)
  # compute the gradient on scores
  dscores = probs
  dscores[range(num_examples),y] -= 1
  dscores /= num_examples
  # backpropate the gradient to the parameters (W,b)
  dW = np.dot(X.T, dscores)
  db = np.sum(dscores, axis=0, keepdims=True)
  dW += reg*W # regularization gradient
  # perform a parameter update
  W += -step_size * dW
  b += -step_size * db
I can't understand how they computed the gradient here. I assume that this is where they computed the gradient:
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # regularization gradient
But how? I mean, why is the gradient dW equal to np.dot(X.T, dscores)? And why is the gradient db equal to np.sum(dscores, axis=0, keepdims=True)? So how did they compute the gradient with respect to the weights and the bias? Also, why did they compute the regularization gradient?
I am just starting to learn about convolutional neural networks and deep learning, and I heard that CS231n - Convolutional Neural Networks for Visual Recognition is a good starting place for that. I did not know where to place deep-learning-related posts, so I placed them on Stack Overflow. If there is a better place to post questions related to deep learning, please let me know.
The gradients start being computed here:
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
First, this sets dscores equal to the probabilities computed by the softmax function. Then, it subtracts 1 from the probabilities computed for the correct classes in the second line, and then it divides by the number of training samples in the third line.
Why does it subtract 1? Because, ideally, you want the predicted probability of the correct label to be 1. So it subtracts what it should predict from what it actually predicts: if the prediction is close to 1, the difference is a small negative number (close to zero), so the gradient is small, because you are close to a solution. Otherwise, the difference is a large negative number (far from zero), so the gradient is bigger, and you take larger steps towards the solution.
The score function is simply w*x + b. Its derivative with respect to w is x, which is why dW is the dot product between X (transposed) and the gradient of the scores / output layer.
The derivative of w*x + b with respect to b is 1, which is why you simply sum dscores when backpropagating.
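If it helps, here is a small self-contained NumPy check of my own (not part of the course code) that compares these analytic gradients against a numerical gradient on a tiny random problem:

import numpy as np

np.random.seed(0)
N, D, K = 5, 3, 4                          # tiny toy problem
X = np.random.randn(N, D)
y = np.random.randint(K, size=N)
W = 0.01 * np.random.randn(D, K)
b = np.zeros((1, K))

def loss_fn(W, b):
    # mean cross-entropy loss, no regularization term here
    scores = X.dot(W) + b
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    return -np.log(probs[range(N), y]).mean()

# analytic gradient, mirroring the course code
scores = X.dot(W) + b
exp_scores = np.exp(scores)
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
dscores = probs
dscores[range(N), y] -= 1
dscores /= N
dW = X.T.dot(dscores)
db = dscores.sum(axis=0, keepdims=True)

# numerical gradient for a single entry of W, via central differences
h = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += h
W_minus[0, 0] -= h
numerical = (loss_fn(W_plus, b) - loss_fn(W_minus, b)) / (2 * h)

print(dW[0, 0], numerical)                 # the two values should match closely

The same comparison can be repeated for db or for any other entry of W.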
Gradient Descent
Backpropagation is used to reduce the cost J of the entire system (the softmax classifier here), and the problem is to optimize the weight parameter W so as to minimize that cost. Provided the cost function J = f(W) is convex, the gradient descent W = W - α * f'(W) will result in the W_min that minimizes J. The hyperparameter α is called the learning rate, which we need to optimize too, but not in this answer.
Y should be read as J in the diagram. Imagine you are on the surface of a landscape whose shape is defined by J = f(W), and you need to reach the point W_min. There is no gravity, so you do not know which way is downhill, but you know the function and your coordinates. How do you know which way to go? You can find the direction from the derivative f'(W) and move to a new coordinate via W = W - α * f'(W). By repeating this, you get closer and closer to the point W_min.
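As a toy illustration (my own sketch, not from the course material): minimizing J = (W - 3)^2 with gradient descent, where f'(W) = 2 * (W - 3):

W = 0.0
alpha = 0.1                  # learning rate
for _ in range(100):
    grad = 2 * (W - 3)       # f'(W) for J = (W - 3)^2
    W = W - alpha * grad
print(W)                     # converges towards the minimizer W_min = 3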
Backpropagation at the Affine Layer
At the node where a multiply or dot operation happens (the affine layer), the function is J = f(W) = X * W. Suppose there are m fixed two-dimensional points represented as X. How can we find the hyperplane that minimizes J = f(W) = X * W, and its vector W?
We can get closer to the optimal W by repeating the gradient descent step W += -α * X (since dJ/dW = X here), provided α is appropriate.
Chain Rule
When there are layers after the affine layer, such as the softmax layer and the log loss layer in the softmax classifier, we can calculate the gradient with the chain rule. In the diagram, replace sigmoid with softmax.
As stated in Computing the Analytic Gradient with Backpropagation on the CS231n page, the gradient contribution from the softmax layer and the log loss layer is the dscores part. See the Note section below too.
By propagating that gradient to the affine layer via the chain rule, the code below is derived, where α is replaced with step_size. In reality, the step_size needs to be tuned as well.
dW = np.dot(X.T, dscores)
W += -step_size * dW
The bias gradient can be derived by applying the chain rule with respect to the bias b, using the gradients (dscores) from the later layers.
db = np.sum(dscores, axis=0, keepdims=True)
Regularization
As stated in the Regularization section of the CS231n page, the cost function (objective) is adjusted by adding the regularization term, which is reg_loss in the code. Its purpose is to reduce over-fitting. The intuition, in my understanding, is that if specific feature(s) cause overfitting, we can suppress them by inflating the cost with their weight parameters W, because gradient descent will then work to reduce the cost contributions coming from those weights. Since we do not know which weights are responsible, we use all of W. The reason for 0.5 * W*W is that it gives the simple derivative W.
reg_loss = 0.5*reg*np.sum(W*W)
The gradient contribution reg*W comes from the derivative of reg_loss. The reg is a hyperparameter to be tuned in real training.
d(reg_loss)/dW = 0.5 * reg * 2 * W = reg * W
It is added to the gradient from the layers after the affine layer.
dW += reg*W # regularization gradient
The process of deriving the gradient from the cost including the regularization term is omitted in the CS231n page referenced in the post, probably because it is common practice to just add the gradient of the regularization, but this can be confusing for those who are learning. See Coursera Machine Learning Week 3, Cost Function by Andrew Ng, for the regularization.
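For a quick numeric sanity check (my own sketch, not from the course code), you can verify that the derivative of 0.5 * reg * np.sum(W*W) with respect to a single weight is indeed reg * W:

import numpy as np

reg = 1e-3
W = np.random.randn(3, 4)

analytic = reg * W[0, 0]                  # claimed derivative of reg_loss wrt W[0, 0]

h = 1e-6
W_h = W.copy()
W_h[0, 0] += h
numeric = (0.5 * reg * np.sum(W_h * W_h) - 0.5 * reg * np.sum(W * W)) / h

print(analytic, numeric)                  # should agree up to numerical error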
Note
The bias parameter b can be absorbed into the weights by adding a constant input X0 = 1, so the bias term can be omitted by shifting to that basis.
I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I am having some trouble understanding why exactly the perceptron rule only works for linearly separable data.
Mitchell defines a perceptron as follows:
That is, y is 1 if the sum of the weighted inputs exceeds some threshold, and -1 otherwise.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule:
wi ← wi + η(t − o)xi
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of the x_i, so you are, in fact, splitting your data into 2 classes using the hyperplane a_1 x_1 + … + a_n x_n = 0 (one class where the sum is > 0, the other where it is not).
Consider a 2D example: X = (x, y) and W = (a, b); then X · W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -(a/b)*x (assuming b > 0; the inequality flips if b < 0). This equation is linear and divides the 2D plane into 2 parts.
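As an illustration (my own sketch, not from Mitchell's book), here is the perceptron training rule wi ← wi + η(t − o)xi applied to a tiny linearly separable 2D data set; because a line separates the two classes, the loop stops with zero misclassifications:

import numpy as np

# toy data: class +1 lies above the line y = x, class -1 below it
X = np.array([[0., 1.], [1., 2.], [2., 3.], [1., 0.], [2., 1.], [3., 2.]])
t = np.array([1, 1, 1, -1, -1, -1])

w = np.zeros(2)
b = 0.0
eta = 0.1                                     # learning rate

for epoch in range(100):
    errors = 0
    for x_i, t_i in zip(X, t):
        o = 1 if w.dot(x_i) + b > 0 else -1   # perceptron output sgn(w.x + b)
        if o != t_i:
            w += eta * (t_i - o) * x_i        # perceptron training rule
            b += eta * (t_i - o)
            errors += 1
    if errors == 0:                           # every training example classified correctly
        break

print(w, b)   # w[0]*x + w[1]*y + b = 0 is the separating line

If the data were not linearly separable, the inner loop would never reach zero errors and the weights would keep being corrected indefinitely.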