I'm studying Markov Random Fields, and, apparently, inference in MRF is hard / computationally expensive. Specifically, Kevin Murphy's book Machine Learning: A Probabilistic Perspective says the following:
"In the first term, we fix y to its observed values; this is sometimes called the clamped term. In the second term, y is free; this is sometimes called the unclamped term or contrastive term. Note that computing the unclamped term requires inference in the model, and this must be done once per gradient step. This makes training undirected graphical models harder than training directed graphical models."
Why are we performing inference here? I understand that we're summing over all y's, which seems expensive, but I don't see where we're actually estimating any parameters. Wikipedia also talks about inference, but only talks about calculating the conditional distribution, and needing to sum over all non-specified nodes.. but.. that's not what we're doing here, is it?
Alternatively, any have good intuition on why inference in MRF is difficult?
Sources:
Chapter 19 of ML:PP: https://www.cs.ubc.ca/~murphyk/MLbook/pml-print3-ch19.pdf
Specific section seen below
When training your CRF, you want to estimate your parameters, \theta.
In order to do this, you can differentiate your loss function (Equation 19.38) with respect to \theta, set it to 0, and solve for \theta.
You can't analytically solve the equation for \theta if you do this though. You can, however, minimise Equation 19.38 by gradient descent. Since the loss function is convex, it is guaranteed that gradient descent will get you the globally optimal solution when it converges.
Equation 19.41 is the actual gradient which you need to compute in order to be able to do gradient descent. The first term is easy (and computationally cheap) to compute as you are summing up over the observed values of y. However, the second term requires you to do inference. In this term, you are not summing up over the observed value of y as in the first term. Instead, you need to compute the configuration of y (inference), and then calculate the value of the potential function under this configuration.
Related
I'm recently studying the theory about neural network. And I'm a little confuse about the role of gradient descent and activation function in ANN.
From what I understand, the activation function is used for transforming the model to non-linear model. So that it can solve the problem that is not linear separable. And the gradient descent is the tool to help model learn.
So my questions are :
If I use an activation function such as sigmoid for the model, but instead of using gradient decent to improve the model, I use classic perceptron learning rule : Wj = Wj + a*(y-h(x)), where the h(x) is the sigmoid function with the net input. Can the model learn the non-linear separable problem ?
If I do not include the non-linear activation function in the model. Just simple net input : h(x) = w0 + w1*x1 + ... + wj*xj. And using gradient decent to improve the model. Can the model learn the non-linear separable problem ?
I'm really confused about this problem, that which one is the main reason that the model can learn non-linear separable problem.
Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f such that y=f(x), from a set of observations {(x_i,y_i)}. The following problems arise in practice:
You know nothing about f. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The type of architecture you will use will determine a family of function h_w(x), and each value of w will represent one function in this family. Note that because there is usually an infinite number of possible w, the family of functions h_w(x) are often infinitely large.
The goal of learning will then be to determine which w is most appropriate. This is where gradient descent intervenes: it is just an optimisation tool that helps you pick reasonably good w, and thus select a particular model h(x).
The problem is, the actual f function you are trying to approximate may not be part of the family h_w you decided to pick, and so you are .
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like sigmoid at the output of a single layer model ANN will not help it learn a non-linear function. Indeed a single layer ANN is equivalent to linear regression, and adding the sigmoid transforms it into Logistic Regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash it to [0,1], so that it can be interpreted as a class membership probability. In short, the sigmoid acts a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break eg. gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is taken from the value of sum(w_i*x_i). This is probably not really convincing, so let's illustrate instead using the Tensorflow Playground. Note that the learning rule does not matter here, because the family of function you're optimising over consist only of linear functions on their input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with a simple linear regression. You don't even project your result back to [0,1], so the output will not be simple to interpret as class probability, but the final result will be the same. See the Playground for a visual proof.
What is needed then?
To learn a non-linearly separable problem, you have several solutions:
Preprocess the input x into x', so that taking x' as an input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so generally only applicable to very simple problems. In the playground problem, since we're working with a circle, we can add the squares of x1 and x2 to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit.
We could try to automatically learn the right representation of the data, by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that using a single hidden layer is enough to approximate anything as long as make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that making the network deeper makes it require less neurons overall, even though it makes training more complex.
Yes, gradient descent is quite capable of solving a non-linear problem. The method works as long as the various transformations are roughly linear within a "delta" of the adjustments. This is why we adjust our learning rates: to stay within the ranges in which linear assumptions are relatively accurate.
Non-linear transformations give us a better separation to implement the ideas "this is boring" and "this is exactly what I'm looking for!" If these functions are smooth, or have a very small quantity of jumps, we can apply our accustomed approximations and iterations to solve the overall system.
Determining the useful operating ranges is not a closed-form computation, by any means; as with much of AI research, it requires experimentation and refinement. The direct answer to your question is that you've asked the wrong entity -- try the choices you've listed, and see which works best for your application.
I am learning machine learning from coursera. But I'm little confused between gradient descent and cost function. When and where I should use those?
J(ϴ) is minimized by trial and error approach i.e. trying lot of values and then checking the output. So in practice this means that this work is done by hand and is time consuming.
Gradient Descent basically just does what J(ϴ) does but in a automated way — change the theta values, or parameters, bit by bit, until we hopefully arrived a minimum. This is an iterative method where the model moves to the direction of steepest descent i.e. the optimal value of theta.
Why use Gradient descent? it is easy to implement and is generic optimization technique so will work even if you change your model. It is also better to use GD if you have a lot of features because in this case, normal J(ϴ) computation becomes very expensive.
Gradient Descent requires a cost function(there are many types of cost functions). One common function that is often used is mean squared error, which measure the difference between the estimator (the dataset) and the estimated value (the prediction).
We need this cost function because we want to minimize it. Minimizing any function means finding the deepest valley in that function. Keep in mind that, the cost function is used to monitor the error in predictions of an ML model. So minimizing this, basically means getting to the lowest error value possible or increasing the accuracy of the model. In short, We increase the accuracy by iterating over a training data set while tweaking the parameters(the weights and biases) of our model.
In short, the whole point of Gradient descent is to minimize the cost function
In machine learning cost function, if we want to minimize the influence of two parameters, let's say theta3 and theta4, it seems like we have to give a large value of regularization parameter just like the equation below.
I am not quite sure why the bigger regularization parameter reduces the influence instead of increasing it. How does this function work?
It is because that the optimum values of thetas are found by minimizing the cost function.
As you increase the regularization parameter, optimization function will have to choose a smaller theta in order to minimize the total cost.
Quoting from similar question's answer:
At a high level you can think of regularization parameters as applying a kind of Occam's razor that favours simple solutions. The complexity of models is often measured by the size of the model w viewed as a vector. The overall loss function as in your example above consists of an error term and a regularization term that is weighted by λ, the regularization parameter. So the regularization term penalizes complexity (regularization is sometimes also called penalty). It is useful to think what happens if you are fitting a model by gradient descent. Initially your model is very bad and most of the loss comes from the error terms, so the model is adjusted to primarily to reduce the error term. Usually the magnitude of the model vector increases as the optimization progresses. As the model is improving and the model vector is growing the regularization term becomes a more significant part of the loss. Regularization prevents the model vector growing arbitrarily for negligible reductions in the error. λ just determines the relative importance of keeping the model simple relative to reducing training error.
There are different types of regularization terms in common use. The one you have, and most commonly used in SVMs, is L2 regularization. It has the side effect of spreading weight more evenly between the components of the model vector. The main alternative is L1 or lasso regularization which has the form λ∑i|wi|, ie it penalizes the sum absolute values of the model parameters. It favors concentrating the size of the model in only a few components, the opposite of L2 regularization. Generally L2 tends to be preferable for low dimensional models while lasso tends to work better for high dimensional models like text classification where it leads to sparse models, ie models with few non-zero parameters.
There is also elastic net regularization, which is just a weighted combination of L1 and L2 regularization. So you have 3 terms in your loss function: error term and the 2 regularization terms each with its own regularization parameter.
You said that you want to minimize the influence of two parameters, theta3 and theta4, meaning those two are both NOT important, so we are going to tell the model we want to fit by:
minimize the weights of theta3 and theta4 cause they don't really matter
And here is the learning process of the model:
Given theta3 and theta4 a really big parameter lambda , when theta3 or theta4 grows, your loss functions grows heavily relatively cause they(theta3 and theta4) both have a big multiplier(lambda), to minimize your object function(loss function), both theta3 and theta4 can only be chosen a very small value, saying that they are not important.
As regularization parameter increases from 0 to infinity, the residual sum of squares in linear regression decreases ,Variance of model decreases and Bias increases .
I will try it in most simple language. i think what you are asking is, how does adding a regularization term at the end deceases the value of parameters like theta3 and theta4 here.
So, lets first assume you added this to the end of your loss function which should massively increase the loss, making the function a bit more bias compared to before. Now we will use any optimization method, lets say gradient descent here and the job of gradient descent is to find all values of theta, now remember the fact that until this point we dont any value of theta and if you solve it you will realize the the values of theta are gonna be different if you hadnt used the regularization term at the end. To be exact, its gonna be less for theta3 and theta4.
So this will make sure your hypothesis has more bias and less variance. In simple term, it will make the equation is bit worse or not as exact as before but it will generalize the equation better.
I'm going through the ML Class on Coursera on Logistic Regression and also the Manning Book Machine Learning in Action. I'm trying to learn by implementing everything in Python.
I'm not able to understand the difference between the cost function and the gradient. There are examples on the net where people compute the cost function and then there are places where they don't and just go with the gradient descent function w :=w - (alpha) * (delta)w * f(w).
What is the difference between the two if any?
Whenever you train a model with your data, you are actually producing some new values (predicted) for a specific feature. However, that specific feature already has some values which are real values in the dataset. We know the closer the predicted values to their corresponding real values, the better the model.
Now, we are using cost function to measure how close the predicted values are to their corresponding real values.
We also should consider that the weights of the trained model are responsible for accurately predicting the new values. Imagine that our model is y = 0.9*X + 0.1, the predicted value is nothing but (0.9*X+0.1) for different Xs.
[0.9 and 0.1 in the equation are just random values to understand.]
So, by considering Y as real value corresponding to this x, the cost formula is coming to measure how close (0.9*X+0.1) is to Y.
We are responsible for finding the better weight (0.9 and 0.1) for our model to come up with a lowest cost (or closer predicted values to real ones).
Gradient descent is an optimization algorithm (we have some other optimization algorithms) and its responsibility is to find the minimum cost value in the process of trying the model with different weights or indeed, updating the weights.
We first run our model with some initial weights and gradient descent updates our weights and find the cost of our model with those weights in thousands of iterations to find the minimum cost.
One point is that gradient descent is not minimizing the weights, it is just updating them. This algorithm is looking for minimum cost.
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
It's strange to think about it, but there is more than one measure for how "accurately" a line fits to data points.
To access how accurately a line fits the data, we have a "cost" function which which can compare predicted vs. actual values and provide a "penalty" for how wrong it is.
penalty = cost_funciton(predicted, actual)
A naive cost function might just take the difference between the predicted and actual.
More sophisticated functions will square the value, since we'd rather have many small errors than one large error.
Additionally, each point has a different "sensitivity" to moving the line. Some points react very strongly to movement. Others react less strongly.
Often, you can make a tradeoff, and move TOWARD a point that is sensitive, and AWAY from a point that is NOT sensitive. In that scenario , you get more than you give up.
The "gradient" is a way of measuring how sensitive each point is to moving the line.
This article does a good job of describing WHY there is more than one measure, and WHY some points are more sensitive than others:
https://towardsdatascience.com/wrapping-your-head-around-gradient-descent-with-pictures-3fbd810235f5?source=friends_link&sk=7117e5de8c66bd4a4c2bb2a87a928773
Let's take an example of logistic regression model for binary classification. Output(Predicted Value) of the model for any given input will be offset(deviation) with respect to the actual output(Expected Value) while training. So, the model needs to be trained with minimal error(loss) so that model can perform well with high accuracy.
The function used to find the parameters(m and c in case of linear equation, y = mx+c) value at which the minimal error(loss) occurs is called Cost Function/Loss Function. Loss function is a term used to find the loss for single row/record of the training sample and Cost function is a term used to find the loss for the entire training dataset.
Now, How do we find the parameter(m and c in our case) values at which the minimum loss occurs? Its by using gradient descent algorithm using the equation, which helps us to find the points at which the minimum loss occurs and the parameters values at this points are considered for model building (let say y = 0.5x + 2) where m=.5 and c=2 are the points at which the loss is minimum.
Cost function is something is like at what cost you are building your model for a good model that cost should be minimum. To find the minimum cost function we use gradient descent method. That give value of coefficients to determine minimum cost function
When we have a high degree linear polynomial that is used to fit a set of points in a linear regression setup, to prevent overfitting, we use regularization, and we include a lambda parameter in the cost function. This lambda is then used to update the theta parameters in the gradient descent algorithm.
My question is how do we calculate this lambda regularization parameter?
The regularization parameter (lambda) is an input to your model so what you probably want to know is how do you select the value of lambda. The regularization parameter reduces overfitting, which reduces the variance of your estimated regression parameters; however, it does this at the expense of adding bias to your estimate. Increasing lambda results in less overfitting but also greater bias. So the real question is "How much bias are you willing to tolerate in your estimate?"
One approach you can take is to randomly subsample your data a number of times and look at the variation in your estimate. Then repeat the process for a slightly larger value of lambda to see how it affects the variability of your estimate. Keep in mind that whatever value of lambda you decide is appropriate for your subsampled data, you can likely use a smaller value to achieve comparable regularization on the full data set.
CLOSED FORM (TIKHONOV) VERSUS GRADIENT DESCENT
Hi! nice explanations for the intuitive and top-notch mathematical approaches there. I just wanted to add some specificities that, where not "problem-solving", may definitely help to speed up and give some consistency to the process of finding a good regularization hyperparameter.
I assume that you are talking about the L2 (a.k. "weight decay") regularization, linearly weighted by the lambda term, and that you are optimizing the weights of your model either with the closed-form Tikhonov equation (highly recommended for low-dimensional linear regression models), or with some variant of gradient descent with backpropagation. And that in this context, you want to choose the value for lambda that provides best generalization ability.
CLOSED FORM (TIKHONOV)
If you are able to go the Tikhonov way with your model (Andrew Ng says under 10k dimensions, but this suggestion is at least 5 years old) Wikipedia - determination of the Tikhonov factor offers an interesting closed-form solution, which has been proven to provide the optimal value. But this solution probably raises some kind of implementation issues (time complexity/numerical stability) I'm not aware of, because there is no mainstream algorithm to perform it. This 2016 paper looks very promising though and may be worth a try if you really have to optimize your linear model to its best.
For a quicker prototype implementation, this 2015 Python package seems to deal with it iteratively, you could let it optimize and then extract the final value for the lambda:
In this new innovative method, we have derived an iterative approach to solving the general Tikhonov regularization problem, which converges to the noiseless solution, does not depend strongly on the choice of lambda, and yet still avoids the inversion problem.
And from the GitHub README of the project:
InverseProblem.invert(A, be, k, l) #this will invert your A matrix, where be is noisy be, k is the no. of iterations, and lambda is your dampening effect (best set to 1)
GRADIENT DESCENT
All links of this part are from Michael Nielsen's amazing online book "Neural Networks and Deep Learning", recommended reading!
For this approach it seems to be even less to be said: the cost function is usually non-convex, the optimization is performed numerically and the performance of the model is measured by some form of cross validation (see Overfitting and Regularization and why does regularization help reduce overfitting if you haven't had enough of that). But even when cross-validating, Nielsen suggests something: you may want to take a look at this detailed explanation on how does the L2 regularization provide a weight decaying effect, but the summary is that it is inversely proportional to the number of samples n, so when calculating the gradient descent equation with the L2 term,
just use backpropagation, as usual, and then add (λ/n)*w to the partial derivative of all the weight terms.
And his conclusion is that, when wanting a similar regularization effect with a different number of samples, lambda has to be changed proportionally:
we need to modify the regularization parameter. The reason is because the size n of the training set has changed from n=1000 to n=50000, and this changes the weight decay factor 1−learning_rate*(λ/n). If we continued to use λ=0.1 that would mean much less weight decay, and thus much less of a regularization effect. We compensate by changing to λ=5.0.
This is only useful when applying the same model to different amounts of the same data, but I think it opens up the door for some intuition on how it should work, and, more importantly, speed up the hyperparametrization process by allowing you to finetune lambda in smaller subsets and then scale up.
For choosing the exact values, he suggests in his conclusions on how to choose a neural network's hyperparameters the purely empirical approach: start with 1 and then progressively multiply÷ by 10 until you find the proper order of magnitude, and then do a local search within that region. In the comments of this SE related question, the user Brian Borchers suggests also a very well known method that may be useful for that local search:
Take small subsets of the training and validation sets (to be able to make many of them in a reasonable amount of time)
Starting with λ=0 and increasing by small amounts within some region, perform a quick training&validation of the model and plot both loss functions
You will observe three things:
The CV loss function will be consistently higher than the training one, since your model is optimized for the training data exclusively (EDIT: After some time I've seen a MNIST case where adding L2 helped the CV loss decrease faster than the training one until convergence. Probably due to the ridiculous consistency of the data and a suboptimal hyperparametrization though).
The training loss function will have its minimum for λ=0, and then increase with the regularization, since preventing the model from optimally fitting the training data is exactly what regularization does.
The CV loss function will start high at λ=0, then decrease, and then start increasing again at some point (EDIT: this assuming that the setup is able to overfit for λ=0, i.e. the model has enough power and no other regularization means are heavily applied).
The optimal value for λ will be probably somewhere around the minimum of the CV loss function, it also may depend a little on how does the training loss function look like. See the picture for a possible (but not the only one) representation of this: instead of "model complexity" you should interpret the x axis as λ being zero at the right and increasing towards the left.
Hope this helps! Cheers,
Andres
The cross validation described above is a method used often in Machine Learning. However, choosing a reliable and safe regularization parameter is still a very hot topic of research in mathematics.
If you need some ideas (and have access to a decent university library) you can have a look at this paper:
http://www.sciencedirect.com/science/article/pii/S0378475411000607