How does scikit-learn implement line search? - machine-learning

In this section of the documentation on gradient boosting, it says
Gradient Boosting attempts to solve this minimization problem numerically via steepest descent: The steepest descent direction is the negative gradient of the loss function evaluated at the current model F_{m-1}, which can be calculated for any differentiable loss function:

F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_F L(y_i, F_{m-1}(x_i))

where the step length \gamma_m is chosen using line search:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}\right)
I understand the purpose of the line search, but I don't understand the algorithm itself. I read through the source code, but it's still not clicking. An explanation would be much appreciated.

The implementation depends on which loss function you choose when initializing a GradientBoostingClassifier instance (using this as an example; the regression part should be similar). The default loss function is 'deviance' and the corresponding optimization algorithm is implemented here. In the _update_terminal_region function, a simple one-step Newton iteration is implemented.
Is this the answer you want?

I suspect the thing you find confusing is this: you can see where scikit-learn computes the negative gradient of the loss function and fits a base estimator to that negative gradient. It looks like the _update_terminal_region method is responsible for figuring out the step size, but you can't see anywhere it might be solving the line search minimization problem as written in the documentation.
The reason you can't find a line search happening is that, for the special case of decision tree regressors, which are just piecewise constant functions, the optimal solution is usually known. For example, if you look at the _update_terminal_region method of the LeastAbsoluteError loss function, you see that the leaves of the tree are given the value of the weighted median of the difference between y and the predicted value for the examples for which that leaf is relevant. This median is the known optimal solution.
To summarize what's happening, for each gradient descent iteration the following steps are taken:
Compute the negative gradient of the loss function at the current prediction.
Fit a DecisionTreeRegressor to the negative gradient. This fitting produces a tree with good splits for decreasing the loss.
Replace the values at the leaves of the DecisionTreeRegressor with values that minimize loss. These are usually computed from some simple known formula that takes advantage of the fact that the decision tree is just a piecewise constant function.
This method should be at least as good as what is described in the docs, but I think in some cases it might not be identical to it.
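To make these steps concrete, here is a minimal sketch of one such iteration for the absolute-error case. This is not scikit-learn's actual code: the function name, tree depth, and learning rate are made up for illustration, and the leaf update uses a plain (unweighted) median.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def lad_boosting_step(X, y, current_pred, learning_rate=0.1):
        """One illustrative gradient-boosting iteration for absolute-error loss."""
        # 1. The negative gradient of |y - F| with respect to F is sign(y - F).
        negative_gradient = np.sign(y - current_pred)

        # 2. Fit a regression tree to the negative gradient.
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, negative_gradient)

        # 3. Replace each leaf's value by the loss-minimizing constant for the
        #    samples in that leaf: for absolute error, the median of the residuals.
        leaf_ids = tree.apply(X)
        new_pred = current_pred.copy()
        for leaf in np.unique(leaf_ids):
            mask = leaf_ids == leaf
            step = np.median(y[mask] - current_pred[mask])
            new_pred[mask] += learning_rate * step
        return tree, new_pred

Repeating this for many iterations, each time starting from the previous new_pred, gives the boosted model; step 3 plays the role the generic line search plays in the documentation's description.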

From your comments it seems it is the algorithm itself that is unclear, not the way scikit-learn implements it.
Notation in the Wikipedia article is slightly sloppy: one does not simply differentiate with respect to a function evaluated at a point. Once you replace F_{m-1}(x_i) with \hat{y}_i and replace the partial derivative with a partial derivative evaluated at \hat{y} = F_{m-1}(x), things become clearer:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, \hat{y}_i - \gamma \left[\frac{\partial L(y_i, \hat{y})}{\partial \hat{y}}\right]_{\hat{y} = \hat{y}_i}\right)

This would also remove x_i (sort of) from the minimization problem and shows the intent of line search: to optimize depending on the current prediction and not on the training set. Now, notice that once F_{m-1} is fixed, \hat{y}_i and the evaluated partial derivative are just numbers. Hence you're just minimizing a function of the single scalar \gamma:

g(\gamma) = \sum_{i=1}^{n} L(y_i, \hat{y}_i - \gamma g_i), \quad \text{with } g_i = \left[\frac{\partial L(y_i, \hat{y})}{\partial \hat{y}}\right]_{\hat{y} = \hat{y}_i}
So line search simply optimizes one degree of freedom you have (once you've found the right gradient direction) - the step size.
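If you wanted to carry out that one-dimensional minimization explicitly, rather than rely on a closed-form leaf value, it boils down to something like the following sketch (this is not scikit-learn code; the function and variable names are made up, and the loss argument stands for any vectorized per-sample loss):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def line_search_step(y, y_hat, h, loss):
        """Pick gamma minimizing sum_i loss(y_i, y_hat_i + gamma * h_i).

        y_hat is the current prediction F_{m-1}(x); h is the output of the
        newly fitted base learner h_m(x).
        """
        objective = lambda gamma: np.sum(loss(y, y_hat + gamma * h))
        return minimize_scalar(objective).x

    # Example with squared error, where the optimal step can also be found analytically.
    y = np.array([1.0, 2.0, 3.0])
    y_hat = np.array([0.5, 2.5, 2.0])
    h = y - y_hat                     # negative gradient for squared error
    gamma = line_search_step(y, y_hat, h, lambda t, p: 0.5 * (t - p) ** 2)
    print(gamma)                      # approximately 1.0 for this loss and direction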

Related

Difference between linear regression and gradient descent

based assignment and I chose machine learning as my topic. I'm still in high school, so I don't know much about calculus.
My end goal is to try using a machine learning algorithm to predict stock values. But I want to understand what I'm doing without copying and analyzing existing code that performs my required function.
This also isn't really programming-related, but mostly concerns the theory part of it. I read through articles on linear regression and watched the lecture that Stanford has on YouTube, but I don't get it. These are my main confusions:
Are linear regression and gradient descent different algorithms or a set of algorithms used together to predict or classify stuff?
Are y = mx + c and f(x) = ϴ_0 + ϴ_1*x the same? What can I calculate with this?
This equation is shown in the linear regression part so what exactly does this do?
I will try to answer all three questions you asked.
First, let me classify ML into some categories.
Regression - Predicting continuous valued output (example, stock prediction)
Classification - Predicting discrete valued output (example, spam classification)
Now regression can also be classified as linear regression or polynomial regression.
Linear Regression is the simplest one. This is how it works.
Suppose I have this data.
These are the house prices plotted against the size of the house. Now I want a straight line that best fits this data. Maybe I will try this line.
And I will try more and more lines to see which actually fits the data best. Now, to obtain different lines I will vary parameters like a and b in y = a + b*x. This answers your second question: this equation represents a straight line which you are trying to fit to the data.
But how will I decide if one line is a better fit than another? I will calculate some value which represents the error my line makes in predicting the y values of all the x values in my data. This is actually called a cost function. I can choose a cost function like this:

J(a, b) = (1 / (2m)) * SUM_{i=1}^{m} ((a + b*x_i) - y_i)^2
(Ignore if it doesn't make sense).
But basically I want my cost function (the value representing the error) to be as small as possible, and Gradient Descent is one algorithm that can minimize my cost function. Gradient Descent can actually minimize any general function, so it is not exclusive to Linear Regression, but it is still popular for linear regression. This answers your first question.
The next step is to know how Gradient Descent works. This is the algorithm: repeat until convergence, a := a - alpha * dJ/da and b := b - alpha * dJ/db (updating both parameters simultaneously), where alpha is the learning rate.
This is what you asked in your third question. This is the line of code which actually adjusts your fitting line (called the hypothesis) while minimizing the cost function.
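To tie the pieces together, here is a small self-contained sketch of gradient descent fitting the line y = a + b*x; the toy numbers and the learning rate are made up for illustration.

    import numpy as np

    # Toy data roughly following y = 2 + 3*x.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])

    a, b = 0.0, 0.0    # parameters of the hypothesis y = a + b*x
    alpha = 0.05       # learning rate

    for _ in range(5000):
        error = (a + b * x) - y          # prediction error of the current line
        grad_a = error.mean()            # dJ/da for J = (1/(2m)) * sum(error^2)
        grad_b = (error * x).mean()      # dJ/db
        a -= alpha * grad_a              # simultaneous update of both parameters
        b -= alpha * grad_b

    print(a, b)    # should end up close to (2, 3), the best-fitting line

Each pass of the loop is one application of the update rule above; the loop simply runs for a fixed number of iterations instead of checking convergence, to keep the sketch short.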

Will the shape of the Loss function change during training?

I have some problem understanding the theory of loss function and hope some one can help me.
Usually when people try to explain gradient descent, they will show you a loss function that looks like the very first image in this post: gradient descent: all you need to know. I understand the entire theory of gradient descent is to adjust the weights and minimize the loss function.
My question is: will the shape of the loss function change during training, or will it just stay as in the image shown in the above post? I know that the weights are something we are always tuning, so the parameters that determine the shape of the loss function should be the inputs X = {x1, x2, ..., xn}. Let's make an easy example: suppose our inputs are [[1,2,3,4,5],[5,4,3,2,1]] and the labels are [1,0] (only two training samples for simplicity, and we are setting the batch size to 1). Then the loss function should be something like this for the first training sample:
L = (1-nonlinear(1*w1+2*w2+3*w3+4*w4+5*w5+b))^2
and for the second training sample the loss function should be:
L = (0-nonlinear(5*w1+4*w2+3*w3+2*w4+1*w5+b))^2
Apparently, these two loss functions don't look the same if we plot them, so does that mean the shape of the loss function changes during training? Then why are people still using that one image (a point that slides down the loss function and finds the global minimum) to explain gradient descent theory?
Note: I'm not changing the loss function; it is still mean squared error. I'm saying that the shape of the loss function seems to be changing.
I know where my problem comes from! I thought that we were not able to plot a function such as f(x, y) = xy without any constant in it, but we actually can! I searched Google for the graph of f(x, y) = xy and we truly can plot it! So now I understand: as long as we have the loss function, we can get the plot. Thanks, guys.
The function stays the same. The point of Gradient Descent is to find the lowest point on a given loss function that you define.
Generally, the loss function you are training to minimize does not change throughout the course of a training session. The flaw in the reasoning is that you are assuming the loss function is characterized by the weights of the network, when in fact the weights of that network are a sort of input to the loss function.
To clarify, let us assume we are predicting some N-dimensional piece of information and we have a ground truth vector, call it p, and a loss function L taking in a prediction vector p_hat which we define as
L(p_hat) := norm(p - p_hat).
This is a very primitive (and quite ineffective) loss function, but it is one nonetheless. Once we begin training, this loss function is what we try to minimize to get our network to perform as well as possible. Notice that this loss function attains different values for different inputs p_hat; this does not mean the loss function is changing! In the end, the loss function is an N-dimensional hypersurface in an (N+1)-dimensional hyperspace that stays the same no matter what (similar to the thing you see in the image, where it is a 2-dimensional surface in a 3-dimensional space).
Gradient descent tries to find a minimum on this surface constructed by the loss function, but we do not really know what the surface looks like as a whole; instead, we find out small things about the surface by evaluating the loss function at the values of p_hat we give it.
Note, this is all a huge oversimplification, but it can be a useful way to think about it when getting started.
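To make this concrete with the two samples from the question, here is a small sketch; the sigmoid nonlinearity is just an assumed choice for the "nonlinear" function. The dataset-level loss is one fixed function of the weights, and training only changes the point at which we evaluate it.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[1, 2, 3, 4, 5],
                  [5, 4, 3, 2, 1]], dtype=float)
    y = np.array([1.0, 0.0])

    def loss(w, b):
        """Mean squared error over the whole dataset.

        As a function of (w, b) this surface never changes during training;
        gradient descent only moves the point (w, b) at which we evaluate it.
        """
        preds = sigmoid(X @ w + b)
        return np.mean((y - preds) ** 2)

    w, b = np.zeros(5), 0.0
    print(loss(w, b))               # the same surface ...
    print(loss(w + 0.1, b - 0.2))   # ... evaluated at a different point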
A Loss Function is a metric that measures the distance from your predictions to your targets.
The idea is to choose the weights so that your predictions are close to your targets, that is, your model has learned/memorized the input.
The loss function should usually not be changed during training, because the minimum of the original function might not coincide with that of the new one, so the gradient descent's work would be lost.

How does Support Vector Regression work?

I'm trying to understand SVR model.
To do that I looked at SVM, and it's pretty clear to me. But there aren't many explanations of SVR.
The first question is why it's called Support Vector Regression, and how do we use vectors to predict numerical values?
Also, I don't understand some parameters such as epsilon and gamma. How do they influence the predicted result?
An SVM learns a so-called decision function from your features, such that features from your positive class produce positive real numbers and features from the negative class produce negative numbers (at least most of the time, depending on your data).
For two features you can visualize this in a 2D plane. The function assigns a real value to each point in the plane; this value can be depicted as a color. This plot shows the values as different shades of blue.
The feature values resulting in zero form the so-called decision boundary.
This function itself has two kinds of parameters:
Kernel-dependent parameters. In your case, for the radial basis function kernel, these are epsilon and gamma, which you set before learning (strictly speaking, only gamma is a kernel parameter; epsilon belongs to SVR's epsilon-insensitive loss, but both are set before learning).
And the so-called support vectors, which are determined during learning. Support vectors are just parameters of your decision function.
Learning is nothing more than determining good support vectors (parameters!).
In this 2D example video the colors don't show the actual function value, but only the sign. You can see how gamma influences the smoothness of the decision function.
To answer your question:
SVR builds such a function, but with a different goal. The function does not try to assign positive outcomes to your positive examples and negative outcomes to the negative examples.
Instead, the function is built to approximate the given numeric outcomes.
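As a rough illustration of how epsilon and gamma behave, here is a small scikit-learn sketch on made-up data; the numbers are arbitrary and only meant to show the effect of the two parameters.

    import numpy as np
    from sklearn.svm import SVR

    # Toy 1-D regression problem: a noisy sine wave.
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
    y = np.sin(X).ravel() + 0.1 * rng.randn(80)

    # gamma controls how wiggly the RBF regression function can be;
    # epsilon is the half-width of the tube inside which errors are ignored.
    for gamma, epsilon in [(0.1, 0.1), (10.0, 0.1), (0.1, 0.5)]:
        model = SVR(kernel="rbf", gamma=gamma, epsilon=epsilon, C=1.0)
        model.fit(X, y)
        print(gamma, epsilon, "support vectors:", len(model.support_))

A larger gamma lets the fitted curve follow the data more closely (and eventually overfit), while a larger epsilon widens the insensitive tube, which typically leaves fewer support vectors.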

Using gradient descent instead of L-BFGS in a sparse autoencoder

In Andrew Ng's lecture notes, they use L-BFGS and get some hidden features. Can I use gradient descent instead and produce the same hidden features? All the other parameters are the same; I only change the optimization algorithm.
When I use L-BFGS, my autoencoder can produce the same hidden features as in the lecture notes, but when I use gradient descent the features in the hidden layer are gone and look totally random.
To be specific, in order to optimize the cost function, I implement 1) the cost function and 2) the gradient of each weight and bias, and pass them to the scipy optimize toolbox to minimize the cost function. This setting gives me reasonable hidden features.
But when I change to gradient descent, I tried updating with "Weight - gradient of the Weight" and "Bias - gradient of the Bias", but the resulting hidden features look totally random.
Can somebody help me understand the reason? Thanks.
Yes, you can use SGD instead; in fact, it is the most popular choice in practice. L-BFGS-B is not a typical method for training neural networks. However:
you will have to tweak the hyperparameters of the training method; you cannot just reuse the ones that worked for L-BFGS, as it is a completely different method (OK, not completely, but it uses first-order optimization instead of second-order),
you should include momentum in your SGD; it is an extremely easy way to get a kind of second-order approximation and is known to (when carefully tuned) perform as well as actual second-order methods in practice.
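A minimal sketch of such an update rule with a learning rate and momentum is below; compute_gradients is a hypothetical placeholder for whatever function returns the gradients your scipy-based code already computes, and the default hyperparameters are arbitrary.

    import numpy as np

    def sgd_momentum(params, compute_gradients, lr=0.01, momentum=0.9, steps=1000):
        """Gradient descent with momentum over a list of numpy parameter arrays.

        compute_gradients(params) is assumed to return gradients (for the full
        batch or a minibatch) with the same shapes as the entries of params.
        """
        velocities = [np.zeros_like(p) for p in params]
        for _ in range(steps):
            grads = compute_gradients(params)
            for p, v, g in zip(params, velocities, grads):
                v *= momentum          # keep a decaying average of past gradients
                v -= lr * g            # note the learning rate, not the raw gradient
                p += v                 # in-place update of the parameter array
        return params

Note the lr factor: updating with the raw "Weight - gradient of the Weight" amounts to a step size of 1, which is usually far too large and can easily produce the random-looking hidden features described in the question.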

How to make the labels of superpixels locally consistent in a gray-level map?

I have a bunch of gray-scale images decomposed into superpixels. Each superpixel in these images has a label in the range [0, 1]. You can see one sample image below.
Here is the challenge: I want the spatially (locally) neighboring superpixels to have consistent labels (close in value).
I'm kind of interested in smoothing local labels but do not want to apply Gaussian smoothing functions or whatever, as some colleagues suggested. I have also heard about Conditional Random Field (CRF). Is it helpful?
Any suggestion would be welcome.
I'm kind of interested in smoothing local labels but do not want to apply Gaussian smoothing functions or whatever, as some colleagues suggested.
And why is that? Why not consider the helpful advice of your colleagues, who are actually right? Applying a smoothing function is the most reasonable way to go.
I have also heard about Conditional Random Field (CRF). Is it helpful?
This also suggests that you should rather go with your colleagues' advice, as a CRF has nothing to do with your problem. A CRF is a classifier (a sequence classifier, to be exact) requiring labeled examples to learn from, and it has nothing to do with the setting presented.
What are typical approaches?
Exactly what your colleagues proposed: you should define a smoothing function and apply it to your function values (I will not use the term "labels", as it is misleading; you have continuous values in [0, 1], whereas "label" denotes a categorical variable in machine learning) and their neighbourhood.
Another approach would be to define some optimization problem, where your current assignment of values is one goal, and the second one is "closeness", for example:
Let us assume that you have points with values {(x_i, y_i)}_{i=1}^N and that n(x) returns the indices of the neighbouring points of x.
Consequently, you are trying to find {a_i}_{i=1}^N that minimize
SUM_{i=1}^N (y_i - a_i)^2 + C * SUM_{i=1}^N SUM_{j \in n(x_i)} (a_i - a_j)^2

where the first sum measures closeness to the current values, the constant C weights the two parts, and the second sum measures closeness to the neighbouring values.
You can solve the above optimization problem using many techniques, for example with scipy.optimize.minimize.
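A rough sketch of that formulation with scipy.optimize.minimize; the y values and the neighbours dictionary below are placeholders for your actual superpixel values and adjacency structure.

    import numpy as np
    from scipy.optimize import minimize

    # Current superpixel values in [0, 1] and a made-up adjacency list.
    y = np.array([0.1, 0.9, 0.2, 0.8, 0.5])
    neighbours = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
    C = 1.0   # weight of the smoothness term

    def objective(a):
        data_term = np.sum((y - a) ** 2)
        # Each neighbour pair is counted twice here, which only rescales C.
        smooth_term = sum((a[i] - a[j]) ** 2
                          for i, js in neighbours.items() for j in js)
        return data_term + C * smooth_term

    result = minimize(objective, x0=y)   # start from the current values
    print(result.x)                      # smoothed values, still close to y

Increasing C pushes neighbouring values closer together; with C = 0 the original values are returned unchanged.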
I am not sure that your request makes any sense.
Having close label values for nearby superpixels is trivial: take some smooth function of (X, Y), such as a constant or affine function taking values in the range [0, 1], and assign the function's value to the superpixel centered at (X, Y).
You could also take the distance function from any point in the plane.
But this is of no use as it is unrelated to the image content.
