So I've done a Coursera ML course, and now I am looking at scikit-learn's logistic regression, which is a little different. I've been using the sigmoid function, and the cost function was split into two separate cases, for y=0 and y=1. But scikit-learn has a single function (I've found out that this is the generalised logistic function) which really doesn't make sense to me.
http://scikit-learn.org/stable/_images/math/760c999ccbc78b72d2a91186ba55ce37f0d2cf37.png
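Transcribed as best I can, the formula in that image (scikit-learn's L2-penalised logistic regression objective) is:

$$\min_{w,\, c} \; \frac{1}{2} w^T w \;+\; C \sum_{i=1}^{n} \log\!\left(\exp\!\left(-y_i \left(X_i^T w + c\right)\right) + 1\right)$$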
I am sorry, I don't have enough reputation to post the image inline.
So my main concern is the case when y=0: then the cost function always has this log(e^0 + 1) value, so it doesn't matter what X or w are. Can someone explain this to me?
If I am correct, the $y_i$ in this formula can only take the values -1 or 1 (as derived in the link given by #jinyu0310). Normally you use this cost function (with regularisation):
(Sorry for the equation inserted as an image, but I cannot use LaTeX here; the image comes from goo.gl/M53u44.)
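For reference, the regularised cost that the image presumably shows is the standard cross-entropy form with $y_i \in \{0, 1\}$:

$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \Big[\, y_i \log h_w(x_i) + (1 - y_i) \log\big(1 - h_w(x_i)\big) \,\Big] + \frac{\lambda}{2m} \|w\|^2$$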
So you always have two terms, one of which plays a role when y_i = 0 and the other when y_i = 1. I am trying to find a better explanation of the notation that scikit-learn uses in this formula, but so far no luck.
Hope that helps you. It is just another way of writing the cost function, with a regularisation factor. Keep in mind also that a constant factor in front of everything will not play a role in the optimisation process, since you want to find the minimum and are not interested in overall factors that multiply everything.
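To make the equivalence concrete, here is a small sketch (plain NumPy; the function names and z, the linear score X_i.w + c, are my own notation) checking that scikit-learn's per-sample loss log(exp(-y*z) + 1) with y in {-1, +1} matches the Coursera-style cross-entropy with y in {0, 1}:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coursera_loss(y01, z):
    # y in {0, 1}: -[y*log(h) + (1-y)*log(1-h)], with h = sigmoid(z)
    h = sigmoid(z)
    return -(y01 * np.log(h) + (1 - y01) * np.log(1 - h))

def sklearn_loss(ypm1, z):
    # y in {-1, +1}: log(exp(-y*z) + 1)
    return np.log(np.exp(-ypm1 * z) + 1)

z = np.linspace(-5, 5, 11)
for y01, ypm1 in [(0, -1), (1, 1)]:
    assert np.allclose(coursera_loss(y01, z), sklearn_loss(ypm1, z))
print("identical losses for y=0 <-> y=-1 and y=1 <-> y=+1")
```

So the y=0 case of the Coursera form corresponds to y=-1 in scikit-learn's form, where the loss is log(e^{z} + 1) and does depend on X and w.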
Related
I'm currently studying the theory of neural networks, and I'm a little confused about the roles of gradient descent and the activation function in an ANN.
From what I understand, the activation function is used to transform the model into a non-linear model, so that it can solve problems that are not linearly separable. And gradient descent is the tool that helps the model learn.
So my questions are:
If I use an activation function such as the sigmoid for the model, but instead of using gradient descent to improve the model, I use the classic perceptron learning rule Wj = Wj + a*(y - h(x)), where h(x) is the sigmoid applied to the net input, can the model learn a non-linearly separable problem?
If I do not include a non-linear activation function in the model, and just use the net input h(x) = w0 + w1*x1 + ... + wj*xj with gradient descent to improve the model, can the model learn a non-linearly separable problem?
I'm really confused about which of the two is the main reason that the model can learn a non-linearly separable problem.
Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f such that y=f(x), from a set of observations {(x_i,y_i)}. The following problems arise in practice:
You know nothing about f. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The type of architecture you use will determine a family of functions h_w(x), and each value of w will represent one function in this family. Note that because there is usually an infinite number of possible w, the family of functions h_w(x) is often infinitely large.
The goal of learning will then be to determine which w is most appropriate. This is where gradient descent intervenes: it is just an optimisation tool that helps you pick reasonably good w, and thus select a particular model h(x).
The problem is, the actual function f you are trying to approximate may not be part of the family h_w you decided to pick, and so the best you can hope for is the closest approximation within that family.
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like the sigmoid at the output of a single-layer ANN will not help it learn a non-linear function. Indeed, a single-layer ANN is equivalent to linear regression, and adding the sigmoid transforms it into logistic regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash it to [0,1], so that it can be interpreted as a class membership probability. In short, the sigmoid acts as a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break e.g. gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is taken from the value of sum(w_i*x_i). This is probably not really convincing, so let's illustrate instead using the Tensorflow Playground. Note that the learning rule does not matter here, because the family of functions you're optimising over consists only of linear functions of their input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with simple linear regression. You don't even project your result back to [0,1], so the output will not be easy to interpret as a class probability, but the final result will be the same. See the Playground for a visual proof.
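To make this tangible outside the Playground, here is a small sketch (scikit-learn; the toy setup is my own) on XOR, the classic non-linearly-separable problem. A single-layer sigmoid model fails regardless of the learning rule, while one hidden layer fixes it:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# XOR: no straight line separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Single layer + sigmoid output = logistic regression: the boundary is
# linear, so it can never get all four points right (at best 3 of 4).
linear = LogisticRegression().fit(X, y)
print("logistic regression accuracy:", linear.score(X, y))

# One hidden layer gives a non-linear boundary; typically reaches 1.0.
mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=5000, random_state=0).fit(X, y)
print("one hidden layer accuracy:", mlp.score(X, y))
```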
What is needed then?
To learn a non-linearly separable problem, you have several solutions:
Preprocess the input x into x', so that taking x' as input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so it is generally only applicable to very simple problems. In the Playground problem, since we're working with a circle, we can add the squares of x1 and x2 to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit (see the sketch after this list).
We could try to automatically learn the right representation of the data by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that a single hidden layer is enough to approximate anything, as long as you make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that a deeper network needs fewer neurons overall, even though it makes training more complex.
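Here is a sketch of the first solution (scikit-learn; make_circles stands in for the Playground's circular dataset, and the values are my own choices). Adding the squared coordinates turns a hopeless linear fit into a near-perfect one:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Points inside vs. outside a circle: not linearly separable as given.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

raw = LogisticRegression().fit(X, y)
print("raw inputs:", raw.score(X, y))  # near chance

# Append x1^2 and x2^2: the boundary x1^2 + x2^2 = r^2 becomes a hyperplane.
X_aug = np.hstack([X, X ** 2])
aug = LogisticRegression().fit(X_aug, y)
print("with squared features:", aug.score(X_aug, y))  # near 1.0
```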
Yes, gradient descent is quite capable of solving a non-linear problem. The method works as long as the various transformations are roughly linear within a "delta" of the adjustments. This is why we adjust our learning rates: to stay within the ranges in which linear assumptions are relatively accurate.
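As a minimal illustration (plain Python; the toy function is my own choice), gradient descent happily minimises a non-linear objective, provided the learning rate keeps each step inside the region where the local linear approximation holds:

```python
# Gradient descent on a non-linear, non-convex function f(w) = (w^2 - 1)^2.
def f(w):
    return (w ** 2 - 1) ** 2

def grad(w):
    return 4 * w * (w ** 2 - 1)

w, lr = 2.0, 0.01  # a much larger learning rate overshoots and diverges
for _ in range(200):
    w -= lr * grad(w)
print(f"w = {w:.4f}, f(w) = {f(w):.6f}")  # settles near the minimum w = 1
```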
Non-linear transformations give us a better separation between "this is boring" and "this is exactly what I'm looking for!". If these functions are smooth, or have only a small number of jumps, we can apply our accustomed approximations and iterations to solve the overall system.
Determining the useful operating ranges is not a closed-form computation, by any means; as with much of AI research, it requires experimentation and refinement. The direct answer to your question is that you've asked the wrong entity -- try the choices you've listed, and see which works best for your application.
What does it mean to provide weights to each sample in classification? How does a classification algorithm like logistic regression or SVMs use weights to emphasise certain examples more than others? I would love to go into detail and unpack how these algorithms leverage sample weights.
If you look at the sklearn documentation for logistic regression, you can see that the fit function has an optional sample_weight parameter which is defined as an array of weights assigned to individual samples.
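As a quick sketch of what that looks like in practice (the synthetic dataset and the 10x weighting are my own choices; sample_weight itself is the real parameter), upweighting one class's samples visibly shifts the fitted coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

uniform = LogisticRegression().fit(X, y)

# Give every class-1 sample ten times the weight of a class-0 sample:
# its errors now cost 10x more, so the boundary shifts to protect class 1.
weights = np.where(y == 1, 10.0, 1.0)
weighted = LogisticRegression().fit(X, y, sample_weight=weights)

print("uniform :", uniform.coef_, uniform.intercept_)
print("weighted:", weighted.coef_, weighted.intercept_)
```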
This option is meant for imbalanced datasets. Let's take an example: I've got a lot of data and some of it is just noise, but other points are really important to me, and I'd like my algorithm to consider them much more than the other points. So I assign them a weight to make sure they will be dealt with properly.
It changes the way the loss is calculated: the error (the residuals) is multiplied by the weight of the point, and thus the minimum of the objective function is shifted. I hope that's clear enough. I don't know if you're familiar with the math behind it, so I provide here a small introduction to have everything to hand (apologies if this was not needed):
https://perso.telecom-paristech.fr/rgower/pdf/M2_statistique_optimisation/Intro-ML-expanded.pdf
See a good explanation here: https://www.kdnuggets.com/2019/11/machine-learning-what-why-how-weighting.html .
While going through Andrew Ng's Coursera course on machine learning, I found this particular thing: the price of a house might go down after a certain value of x in the quadratic regression equation. Can anyone explain why that is so?
Andrew Ng is trying to show that a quadratic function doesn't really make sense to represent the price of houses.
This is what the graph of a quadratic function, y = ax^2 + bx + c, might look like (figure omitted; the values of a, b and c were chosen randomly for this example).
As you can see in the figure, the graph first rises to a maximum and then begins to dip. This isn't representative of the real-world since the price of a house wouldn't normally come down with an increasingly larger house.
He recommends that we use a different polynomial function to represent this problem better, such as the cubic function.
(Figure omitted; the values of a, b, c and d were chosen randomly for this example.)
In reality, we would use a different method altogether for choosing the best polynomial function to fit a problem: we would try different polynomial functions on a cross-validation dataset and have an algorithm choose the best-suited one. We could also manually choose a polynomial function for a dataset if we already know the trend that our data will follow (due to prior mathematical or physical knowledge).
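Here is a sketch of that cross-validation selection (scikit-learn; the "house price" data is synthetic and my own, chosen to keep rising but flatten out):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic "price vs. size" data that keeps rising but flattens out.
rng = np.random.default_rng(0)
size = rng.uniform(50, 350, 200).reshape(-1, 1)
price = 100 * np.sqrt(size[:, 0]) + rng.normal(0, 20, 200)

# Score each candidate degree on held-out folds and compare.
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, size, price, cv=5).mean()
    print(f"degree {degree}: mean CV R^2 = {score:.3f}")
```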
I am implementing Probabilistic Matrix Factorization models in theano and would like to make use of Adam gradient descent rules.
My goal is to have code that is as uncluttered as possible, which means that I do not want to explicitly keep track of the 'm' and 'v' quantities from Adam's algorithm.
It would seem that this is possible, especially after seeing how Lasagne's Adam is implemented: it hides the 'm' and 'v' quantities inside the update rules of theano.function.
This works when the negative log-likelihood is formed with each term that handles a different quantity. But in Probabilistic Matrix Factorization each term contains a dot product of one latent user vector and one latent item vector. This way, if I make an instance of Lasagne's Adam on each term, I will have multiple 'm' and 'v' quantities for the same latent vector, which is not how Adam is supposed to work.
I also posted on Lasagne's group, actually twice, where there is some more detail and some examples.
I thought about two possible implementations:
Each existing rating (which means each existing term in the global NLL objective function) has its own Adam, updated via a dedicated call to a theano.function. Unfortunately this leads to an incorrect use of Adam, as the same latent vector will be associated with different 'm' and 'v' quantities used by the Adam algorithm, which is not how Adam is supposed to work.
Calling Adam on the entire NLL objective, which makes the update mechanism plain gradient descent instead of SGD, with all the known disadvantages (high computation time, getting stuck in local minima, etc.).
My questions are:
Is there perhaps something that I did not understand correctly about how Lasagne's Adam works?
Would option 2 actually behave like SGD, in the sense that every update to a latent vector will influence other updates (in the same Adam call) that make use of that updated vector?
Do you have any other suggestions on how to implement it?
Any idea on how to solve this issue and avoid manually keeping replicated vectors and matrices for the 'v' and 'm' quantities?
It looks like in the paper they are suggesting that you optimise the entire function at once using gradient descent:
We can then perform gradient descent in Y , V , and W to minimize the objective function given by Eq. 10.
So I would say your option 2 is the correct way to implement what they have done.
There aren't many complexities or non-linearities in there (besides the sigmoid), so you probably aren't that likely to run into the typical optimisation problems associated with neural networks that necessitate something like Adam. So as long as it all fits in memory, I guess this approach will work.
If it doesn't fit in memory, perhaps there's some way you could devise a minibatch version of the loss to optimise over. Would also be interested to see if you could add one or more neural networks to replace some of these conditional probabilities. But that's getting a bit off-topic...
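For what it's worth, here is a minimal sketch of option 2 combined with minibatching, assuming the standard Theano/Lasagne API (U, V, the index vectors, and all hyperparameters are my own illustrative names and values). One Adam instance over the whole objective keeps a single 'm' and 'v' per parameter, hidden inside the updates:

```python
import numpy as np
import theano
import theano.tensor as T
import lasagne

n_users, n_items, k = 100, 80, 10
U = theano.shared(0.1 * np.random.randn(n_users, k))  # user latent vectors
V = theano.shared(0.1 * np.random.randn(n_items, k))  # item latent vectors

rows = T.ivector("rows")  # user index of each observed rating
cols = T.ivector("cols")  # item index of each observed rating
vals = T.dvector("vals")  # the observed rating values

# Squared-error NLL over whichever ratings are fed in, plus L2 priors.
pred = (U[rows] * V[cols]).sum(axis=1)
lam = 0.01
nll = T.sum((vals - pred) ** 2) + lam * (T.sum(U ** 2) + T.sum(V ** 2))

# One Adam over the whole objective: a single 'm' and 'v' per parameter,
# created and updated inside the theano.function's update rules.
updates = lasagne.updates.adam(nll, [U, V], learning_rate=0.01)
train = theano.function([rows, cols, vals], nll, updates=updates)
```

Calling train on a random subset of the observed (user, item, rating) triples then performs one minibatch Adam step, so you get SGD-like behaviour without ever touching 'm' and 'v' yourself.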
I have been working on Support Vector Machines for about 2 months now. I have coded the SVM myself, and for the optimisation problem of the SVM I have used Sequential Minimal Optimization (SMO) by Dr. John Platt.
Right now I am in the phase where I am going to grid search to find the optimal C value for my dataset. (Please find details of my project application and dataset here: SVM Classification - minimum number of input sets for each class.)
I have successfully checked my custom-implemented SVM's accuracy for C values ranging from 2^0 to 2^6. But now I am having some issues with the convergence of SMO for C > 128.
For example, I have tried to find the alpha values for C = 128, and it takes a long time before SMO actually converges and successfully returns the alpha values.
The time taken for SMO to converge is about 5 hours for C = 100. That is huge, I think (because SMO is supposed to be fast), even though I'm getting good accuracy.
I am stuck right now because I cannot test the accuracy for higher values of C.
I am actually displaying the number of alphas changed in every pass of SMO, and I keep getting 10, 13, 8... alphas changing continuously. The KKT conditions assure convergence, so what is the weird thing happening here?
Please note that my implementation works fine for C <= 100 with good accuracy, though the execution time is long.
Please give me inputs on this issue.
Thank You and Cheers.
For most SVM implementations, training time can increase dramatically with larger values of C. To get a sense of how training time in a reasonably good implementation of SMO scales with C, take a look at the log-scale line for libSVM in the graph below.
SVM training time vs. C, from Sentelle et al.'s "A Fast Revised Simplex Method for SVM Training": http://dmcer.net/StackOverflowImages/svm_scaling.png
You probably have two easy ways and one not so easy way to make things faster.
Let's start with the easy stuff. First, you could try loosening your convergence criterion. A strict criterion like epsilon = 0.001 will take much longer to train, while typically resulting in a model that is no better than a looser criterion like epsilon = 0.01. Second, you should try to profile your code to see if there are any obvious bottlenecks.
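To get a feel for this without touching your own implementation, here is a sketch using scikit-learn's SVC, which wraps libSVM (the dataset and parameter values are arbitrary), timing training across C and the stopping tolerance:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Compare fit time across C and the stopping tolerance (tol ~ your epsilon).
for C in (1, 100, 1000):
    for tol in (1e-3, 1e-2):
        start = time.time()
        SVC(C=C, tol=tol, kernel="rbf").fit(X, y)
        print(f"C={C:<5} tol={tol:g}: {time.time() - start:.2f}s")
```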
The not so easy fix, would be to switch to a different optimization algorithm (e.g., SVM-RSQP from Sentelle et al.'s paper above). But, if you have a working implementation of SMO, you should probably only really do that as a last resort.
If you want complete convergence, especially if C is large, it takes a very long time. You can consider setting a looser stopping criterion and giving a maximum number of iterations (the default in LibSVM is 1000000); if the number of iterations is larger, the time multiplies and the gain is not worth the cost. The result may then not fully meet the KKT conditions: some support vectors will be inside the band and some non-support vectors outside it, but the error is small and acceptable. In my opinion, if higher accuracy is required, it is better to use another quadratic programming algorithm instead of the SMO algorithm.