I have some problem understanding the theory of loss function and hope some one can help me.
Usually when people try to explain gradient descent to you, they will show you a loss function that looks like the very first image in this post gradient descent: all you need to know. I understand the entire theory of gradient descent is to adjust the weights and minimize the loss function.
My question is, will the shape of the Loss function change during the training or it will just stay remain as the image shown in the above post? I know that the weights are something that we are always tuning so the parameters that determines the shape of the Loss function should be the inputs X={x1,x2,...xn}. Let's make an easy example: suppose our inputs are [[1,2,3,4,5],[5,4,3,2,1]] and labels are [1,0] (Only two training sample for ease, and we are setting the batch size to 1). Then the loss function should be some thing like this for the first training sample
L = (1-nonlinear(1*w1+2*w2+3*w3+4*w4+5*w5+b))^2
and for the second training sample the loss function should be:
L = (0-nonlinear(5*w1+4*w2+3*w3+2*w4+1*w5+b))^2
Apparently, these two loss functions doesn't looks like the same if we plot them so does that mean the shape of the Loss function are changing during training? Then why are people still using that one image ( A point that slides down from the Loss function and finds the global minima) to explain the gradient descent theory?
Note: I'm not changing the loss function, the loss function are still mean square error. I'm trying to say that the shape of the Loss function seems to be changing.
I know where my problem comes from! I thought that we are not able to plot a function such as f(x,y) = xy without any constant in it, but we actually could! I searched the graph on google for f(x,y)=xy and truly we can plot them out! So now I understand, as long as we get the lost function, we can get the plot! Thanks guys
The function stays the same. The point of Gradient Decent is to find the lowest point on a given loss function that you define.
Generally, the loss function you are training to minimize does not change throughout the course of a training session. The flaw in reasoning is that you are assuming that the loss function is characterized by weights of the network, when in fact the weights of that network are a sort-of input to the loss function.
To clarify, let us assume we are predicting some N-dimensional piece of information and we have a ground truth vector, call it p, and a loss function L taking in a prediction vector p_hat which we define as
L(p_hat) := norm(p - p_hat).
This is a very primitive (and quite ineffective) loss function, but it is one nonetheless. Once we begin training, this loss function will be the function that we will try to minimize to get our network to perform the best with respect to. Notice that this loss function will attain different values for different inputs of p_hat, this does not mean the loss function is changing! In the end, the loss function will be an N-dimensional hypersurface in an N+1-dimensional hyperspace that stays the same no matter what (similar to the thing you see in the image where it is a 2-dimensional surface in a 3-dimensional space).
Gradient descent tries to find a minimum on this surface that is constructed by the loss function, but we do not really know what the surface looks like as a whole, instead, we find out small things about the surface by evaluating the loss function as the values of p_hat we give it.
Note, this is all a huge oversimplification, but can be a useful way to think about it getting started.
A Loss Function is a metric that measures the distance from your predictions to your targets.
The ideia is to choose the weighs so your predictions are close to your targets, that is, your model learned/memorized the input.
The loss function should usually not be changed during training, because the minimum in the original function might not coincide with the new one, so the gradient descent's work is lost.
Related
I'm running a FCN in Keras that uses the binary cross-entropy as the loss function. However, im not sure how the losses are accumulated.
I know that the loss gets applied at the pixel level, but then are the losses for each pixel in the image summed up to form a single loss per image? Or instead of being summed up, is it being averaged?
And furthermore, are the loss of each image simply summed(or is it some other operation) over the batch?
I assume that you question is a general one, and to specific to a particular model (if not can you share your model?).
You are right that if the cross-entropy is used at a pixel level, the results have to be reduced (summed or averaged) over all pixels to get a single value.
Here is an example of a convolutional autoencoder in tensorflow where this step is specific:
https://github.com/udacity/deep-learning/blob/master/autoencoder/Convolutional_Autoencoder_Solution.ipynb
The relevant lines are:
loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets_, logits=logits)
cost = tf.reduce_mean(loss)
Whether you take the mean or sum of the cost function does not change the value of the minimizer. But If you take the mean, then the value of the cost function is more easily comparable between experiments when you change the batch size or image size.
I am trying to implement a binary classifier using logistic regression for data drawn from 2 point sets (classes y (-1, 1)). As seen below, we can use the parameter a to prevent overfitting.
Now I am not sure, how to choose the "good" value for a.
Another thing I am not sure about is how to choose a "good" convergence criterion for this sort of problem.
Value of 'a'
Choosing "good" things is a sort of meta-regression: pick any value for a that seems reasonable. Run the regression. Try again with a values larger and smaller by a factor of 3. If either works better than the original, try another factor of 3 in that direction -- but round it from 9x to 10x for readability.
You get the idea ... play with it until you get in the right range. Unless you're really trying to optimize the result, you probably won't need to narrow it down much closer than that factor of 3.
Data Set Partition
ML folks have spent a lot of words analysing the best split. The optimal split depends very much on your data space. As a global heuristic, use half or a bit more for training; of the rest, no more than half should be used for testing, the rest for validation. For instance, 50:20:30 is a viable approximation for train:test:validate.
Again, you get to play with this somewhat ... except that any true test of the error rate would be entirely new data.
Convergence
This depends very much on the characteristics of your empirical error space near the best solution, as well as near local regions of low gradient.
The first consideration is to choose an error function that is likely to be convex and have no flattish regions. The second is to get some feeling for the magnitude of the gradient in the region of a desired solution (normalizing your data will help with this); use this to help choose the convergence radius; you might want to play with that 3x scaling here, too. The final one is to play with the learning rate, so that it's scaled to the normalized data.
Does any of this help?
In this section of the documentation on gradient boosting, it says
Gradient Boosting attempts to solve this minimization problem numerically via steepest descent: The steepest descent direction is the negative gradient of the loss function evaluated at the current model F_{m-1} which can be calculated for any differentiable loss function:
Where the step length \gamma_m is chosen using line search:
I understand the purpose of the line search, but I don't understand the algorithm itself. I read through the source code, but it's still not clicking. An explanation would be much appreciated.
The implementation is depending on which loss function you choose when initialize a GradientBoostingClassifier instance(use this for example, the regression part should be similar). The default loss function is 'deviance' and the corresponding optimization algorithm is implemented here. In the _update_terminal_region function, a simple Newton iteration is implemented with only one step.
Is this the answer you want?
I suspect the thing you find confusing is this: you can see where scikit-learn computes the negative gradient of the loss function and fits a base estimator to that negative gradient. It looks like the _update_terminal_region method is responsible for figuring out the step size, but you can't see anywhere it might be solving the line search minimization problem as written in the documentation.
The reason you can't find a line search happening is that, for the special case of decision tree regressors, which are just piecewise constant functions, the optimal solution is usually known. For example, if you look at the _update_terminal_region method of the LeastAbsoluteError loss function, you see that the leaves of the tree are given the value of the weighted median of the difference between y and the predicted value for the examples for which that leaf is relevant. This median is the known optimal solution.
To summarize what's happening, for each gradient descent iteration the following steps are taken:
Compute the negative gradient of the loss function at the current prediction.
Fit a DecisionTreeRegressor to the negative gradient. This fitting produces a tree with good splits for decreasing the loss.
Replace the values at the leaves of the DecisionTreeRegressor with values that minimize loss. These are usually computed from some simple known formula that takes advantage of the fact that the decision tree is just a piecewise constant function.
This method should be at least as good as what is described in the docs, but I think in some cases might not be identical to it.
From your comments it seems the algorithm itself is unclear and not the way scikitlearn implements it.
Notation in the wikipedia article is slightly sloppy, one does not simply differentiate by a function evaluated at a point. Once you replace F_{m-1}(x_i) with \hat{y_i} and replace partial derivative with a partial derivative evaluated at \hat{y}=F_{m-1}(x) things become clearer:
This would also remove x_{i} (sort of) from the minimization problem and shows the intent of line search - to optimize depending on the current prediction and not depending on the training set. Now, notice that:
Hence you're just minimizing:
So line search simply optimizes one degree of freedom you have (once you've found the right gradient direction) - the step size.
I have been using the training method proposed in the cifar10_multi_gpu_train example for (local) multi-gpu training, i.e., creating several towers and then average the gradient. However, I was wondering the following: What does happen if I just take the losses coming from the different GPUs, sum them up and then just apply gradient descent to that new loss.
Would that work? Probably this is a silly question, and there must be a limitation somewhere. So I would be happy if you could comment on this.
Thanks and best regards,
G.
It would not work with the sum. You would get a bigger loss and consequentially bigger and probably erroneous gradients. While averaging the gradients you get an average of the direction that the weights have to take in order to minimize the loss, but each single direction is the one computed for the exact loss value.
One thing that you can try is to run the towers independently and then average the weights from time to time, slower convergence rate but faster processing on each node.
Recently I was told that the labels of regression data should also be normalized for better result but I am pretty doubtful of that. I have never tried normalizing labels in both regression and classification that's why I don't know if that state is true or not. Can you please give me a clear explanation (mathematically or in experience) about this problem?
Thank you so much.
Any help would be appreciated.
When you say "normalize" labels, it is not clear what you mean (i.e. whether you mean this in a statistical sense or something else). Can you please provide an example?
On Making labels uniform in data analysis
If you are trying to neaten labels for use with the text() function, you could try the abbreviate() function to shorten them, or the format() function to align them better.
The pretty() function works well for rounding labels on plot axes. For instance, the base function hist() for drawing histograms calls on Sturges or other algorithms and then uses pretty() to choose nice bin sizes.
The scale() function will standardize values by subtracting their mean and dividing by the standard deviation, which in some circles is referred to as normalization.
On the reasons for scaling in regression (in response to comment by questor). Suppose you regress Y on covariates X1, X2, ... The reasons for scaling covariates Xk depend on the context. It can enable comparison of the coefficients (effect sizes) of each covariate. It can help ensure numerical accuracy (these days not usually an issue unless covariates on hugely different scales and/or data is big). For a readable intro see Psychosomatic medicine editors' guide. For a mathematically intense discussion see Sylvain Sardy's guide.
In particular, in Bayesian regression, rescaling is advisable to ensure convergence of MCMC estimation; e.g. see this discussion.
You mean features not labels.
It is not necessary to normalize your features for regression or classification, even though in some cases, it is a trick that can help converging faster. You might want to check this post.
To my experience, when using a simple model like a linear regression with only a few variables, keeping the features as they are (without normalization) is preferable since the model is more interpretable.
It may be that what you mean is that you should scale your labels. The reason is so convergence is faster, and you don't get numeric instability.
For example, if your labels are in the range (1000, 1000000) and the weights are initialized close to zero, a mse loss would be so large, you'd likely get NaN errors.
See https://datascience.stackexchange.com/q/22776/38707 for a similar discussion.
for a regression problem with algorithms including decision tree or logistic regression and linear regression I tested in two modes: 1- with label scaling using MinMaxScaler 2- without label scaling the result that i got was : r2 score is the same in 2 mode mse and mae scales
for diabetes dataset using linear regression the result before and after is
without scaling:
Mean Squared Error: 3424.3166
Mean Absolute Error: 46.1742
R2_score : 0.33
after scaling labels:
Mean Squared Error: 0.0332
Mean Absolute Error: 0.1438
R2_score : 0.33
also below link can be useful which says scaling can be helpful in fast convergence enter scale or not scale labels in deep leaning?