Scikit-learn - Stochastic Gradient Descent with custom cost and gradient functions - machine-learning

I am implementing matrix factorization to predict a movie rating by a reviewer. The dataset is taken from MovieLens (http://grouplens.org/datasets/movielens/). This is a well-studied recommendation problem, so I am implementing this matrix factorization method purely for learning purposes.
I model the cost function as the root-mean-square error between the predicted and actual ratings in the training dataset. I use the scipy.optimize.minimize function (with conjugate gradient descent) to factor the movie rating matrix, but this optimization tool is too slow even for a dataset with only 100K ratings. I plan to scale my algorithm to the dataset with 20 million ratings.
I have been searching for a Python-based stochastic gradient descent (SGD) solution, but the SGD implementation in scikit-learn does not let me plug in my own cost and gradient functions.
I can implement my own stochastic gradient descent, but I wanted to check whether a tool for this already exists.
Basically, I am wondering if there is an API similar to this:
optimize.minimize(my_cost_function,
                  my_input_param,
                  jac=my_gradient_function,
                  ...)
Thanks!

The vanilla method is so simple to implement that I don't think there is a "framework" around it.
It is just
my_input_param -= alpha * my_gradient_function(my_input_param)
(note the minus sign: to minimize the cost you step against the gradient).
You may also want to have a look at Theano, which will do the differentiation for you. Depending on what you want to do, it might be a bit of overkill, though.
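For concreteness, here is a minimal hand-rolled SGD loop for the matrix-factorization setting described in the question, with the gradient of a regularized squared error written out explicitly. The function and variable names are illustrative, not from any library; treat this as a sketch rather than a tuned implementation.

    import numpy as np

    def sgd_matrix_factorization(ratings, n_users, n_items, n_factors=10,
                                 alpha=0.01, reg=0.1, n_epochs=20, seed=0):
        """ratings: iterable of (user_index, item_index, rating) triples."""
        rng = np.random.default_rng(seed)
        U = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
        V = rng.normal(scale=0.1, size=(n_items, n_factors))   # item factors
        for _ in range(n_epochs):
            for u, i, r in ratings:
                err = U[u] @ V[i] - r             # prediction error for one rating
                grad_u = err * V[i] + reg * U[u]  # gradient of 0.5*err^2 + L2 penalty
                grad_v = err * U[u] + reg * V[i]
                U[u] -= alpha * grad_u            # step against the gradient
                V[i] -= alpha * grad_v
        return U, V

Each update touches only one user row and one item row, which is what makes SGD feasible for tens of millions of ratings.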

I've been trying to do something similar in R but with a different custom cost function.
As I understand it, the key is to find the gradient and see which direction takes you towards the local minimum.
With linear regression (y = mx + c) and a least-squares cost function, the cost for a single point is
(mx + c - y)^2
The partial derivative of this with respect to m is
2x(mx + c - y)
With the more traditional machine learning notation, where m = theta, this gives the vectorized update (the factor of 2 is absorbed into the learning rate)
theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
I don't know this for sure, but I would assume that for linear regression with a cost function of sqrt(mx + c - y), the gradient step is the partial derivative with respect to m, which I believe is
x / (2 * sqrt(mx + c - y))
If any or all of this is incorrect, please correct me. This is something I am trying to learn myself, and I would appreciate knowing if I'm heading in completely the wrong direction.
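To make the gradient step above concrete, here is a small NumPy sketch on made-up data (mean-squared-error cost), using the averaged partial derivatives 2x(mx + c - y) and 2(mx + c - y):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)  # true m = 3, c = 2

    m, c, learning_rate = 0.0, 0.0, 0.05
    for _ in range(2000):
        residual = m * x + c - y
        grad_m = 2 * np.mean(x * residual)  # d/dm of mean((m*x + c - y)^2)
        grad_c = 2 * np.mean(residual)      # d/dc of mean((m*x + c - y)^2)
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c

    print(m, c)  # should approach roughly 3 and 2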

Related

Math Behind Linear Regression

I am trying to understand the math behind linear regression, and I have verified on multiple sites that linear regression works under the OLS method with y = mx + c to get the best-fit line.
So in order to calculate the slope and intercept we use the formulas below (if I am not wrong):
m = sum of[ (x - mean(x)) * (y - mean(y)) ] / sum of[ (x - mean(x))^2 ]
c = mean(y) - m * mean(x)
With this we get the m and c values to substitute into the equation above, get the predicted y values, and predict for new x values.
But my doubt is: when is "gradient descent" used? I understand it is also used for calculating the coefficients, in such a way that it reduces the cost function by finding a local minimum.
Please help me with this:
Do these two approaches have separate functions in Python/R?
Or does linear regression work on gradient descent by default (and if so, when is the formula above used for calculating the m and c values)?
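As a quick numeric illustration of the closed-form formulas above, here is a minimal NumPy sketch on made-up data; np.polyfit is used only as an independent check:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    y = 4.0 * x + 1.5 + rng.normal(scale=0.2, size=200)  # true m = 4, c = 1.5

    # closed-form OLS estimates
    m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    c = y.mean() - m * x.mean()

    print(m, c)
    print(np.polyfit(x, y, 1))  # same slope and intercept, computed by NumPy

Gradient descent applied to the same squared-error cost converges to the same m and c. In scikit-learn, for example, LinearRegression uses a direct least-squares solver, while SGDRegressor uses stochastic gradient descent, so the two approaches do live in separate functions.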

Why is make_friedman1 used?

I have started to learn ML and am confused by make_friedman1. It greatly improved my accuracy and increased the data size, but the data isn't the same; it changed after using this function. What does make_friedman1 actually do?
If the make_friedman1 asked about here is the one in sklearn.datasets, then it is the function that generates the "Friedman #1" regression problem. The inputs are 10 independent variables uniformly distributed on the interval [0, 1]; only 5 out of these 10 are actually used. Outputs are created according to the formula:
y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e
where e is Gaussian noise, N(0, sd).
Quoting from Friedman's original paper, Multivariate Adaptive Regression Splines:
A new method is presented for flexible regression modeling of high
dimensional data. The model takes the form of an expansion in product
spline basis functions, where the number of basis functions as well as
the parameters associated with each one (product degree and knot
locations) are automatically determined by the data. This procedure is
motivated by the recursive partitioning approach to regression and
shares its attractive properties. Unlike recursive partitioning,
however, this method produces continuous models with continuous
derivatives. It has more power and flexibility to model relationships
that are nearly additive or involve interactions in at most a few
variables
A spline joins many polynomial curves end-to-end to make a new smooth curve.
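A short sketch of what the generator returns, checking the formula directly (with noise=0.0 the generated y matches the formula exactly, and columns 5-9 never enter it):

    import numpy as np
    from sklearn.datasets import make_friedman1

    # 200 samples, 10 features drawn uniformly from [0, 1], no noise
    X, y = make_friedman1(n_samples=200, n_features=10, noise=0.0, random_state=0)

    y_formula = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
                 + 20 * (X[:, 2] - 0.5) ** 2
                 + 10 * X[:, 3]
                 + 5 * X[:, 4])

    print(X.shape, y.shape)           # (200, 10) (200,)
    print(np.allclose(y, y_formula))  # True: only the first 5 features matter

Note that make_friedman1 generates a fresh synthetic dataset rather than transforming existing data, which is why the data looks different after calling it.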

Gradient from non-trainable weights function

I'm trying to implement a self-written loss function. My pipeline is as follows:
x -> {constant computation} = x_feature -> machine learning training -> y_feature -> {constant computation} = y_produced
These "constant computations" are necessary to bring out the differences between the desired output and the produced output.
So if I take the L2 norm of y_produced and y_original, how should I incorporate this loss into the original loss?
Please note that y_produced has a different dimension than y_feature.
As long as you are using differentiable operations, there is no difference between "constant transformations" and "learnable" ones. There is no such distinction; look, for example, at the linear layer of a neural net:
f(x) = sigmoid( W * x + b )
Is it constant or learnable? W and b are trained, but "sigmoid" is not, yet the gradient flows the same way regardless of whether something is a variable or not. In particular, the gradient w.r.t. x is the same for
g(x) = sigmoid( A * x + c )
where A and c are constants.
The only problem you will encounter is using non-differentiable operations, such as argmax, sorting, indexing, sampling, etc. These operations do not have a well-defined gradient, so you cannot directly use first-order optimisers with them. As long as you stick with differentiable operations, the problem you describe does not really exist: there is no difference between "constant transformations" and any other transformations, regardless of changes in size etc.
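A tiny sketch of this point, using PyTorch purely as an illustration (the question is framework-agnostic): the gradient with respect to the trainable weights flows through a constant, non-trainable transformation without any special handling.

    import torch

    A = torch.randn(4, 3)                      # constant transformation (not trained)
    W = torch.randn(2, 4, requires_grad=True)  # learnable weights
    x = torch.randn(3)
    target = torch.randn(2)

    y_produced = torch.sigmoid(W @ (A @ x))    # constant step followed by a learnable layer
    loss = ((y_produced - target) ** 2).sum()  # L2-style loss
    loss.backward()

    print(W.grad)  # well-defined gradient, even though A is constant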

Wouldn't setting the first derivative of the cost function J to 0 give the exact theta values that minimize the cost?

I am currently doing Andrew Ng's ML course. From my calculus knowledge, the first derivative test of a function gives its critical points, if there are any. And considering the convex nature of the linear / logistic regression cost functions, it is a given that there will be a global / local optimum. If that is the case, rather than going the long route of taking a minuscule baby step at a time to reach the global minimum, why don't we use the first derivative test to get the values of theta that minimize the cost function J in a single attempt, and have a happy ending?
That being said, I do know that there is an alternative to gradient descent called the normal equation, which does just that in one step, unlike the former.
On second thought, I am wondering if it is mainly because of the multiple unknown variables involved in the equation (which is why partial derivatives come into play?).
Let's take an example: the gradient of the simple linear regression cost function,
∇RSS(w), where RSS(w) = (y - Hw)^T (y - Hw)
y : output
H : feature matrix
w : weights
RSS: residual sum of squares
Setting this gradient to 0 to get the closed-form solution gives:
w = (H^T H)^(-1) H^T y
Now, assuming there are D features, the time complexity of inverting the D x D matrix H^T H is around O(D^3). If there are a million features, it is computationally infeasible to do this within a reasonable amount of time.
We use gradient descent methods instead because they give reasonably accurate solutions in much less time.
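A minimal NumPy sketch of the closed-form solution above on made-up data; the D x D solve is the step whose cost grows roughly like O(D^3):

    import numpy as np

    rng = np.random.default_rng(0)
    H = rng.normal(size=(100, 5))                  # 100 samples, D = 5 features
    w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
    y = H @ w_true + rng.normal(scale=0.01, size=100)

    # w = (H^T H)^(-1) H^T y, computed with a linear solve rather than an explicit inverse
    w = np.linalg.solve(H.T @ H, H.T @ y)
    print(w)  # close to w_true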

Are there any machine learning regression algorithms that can train on ordinal data?

I have a function f(x): R^n --> R (sorry, is there a way to do LaTeX here?), and I want to build a machine learning algorithm that estimates f(x) for any input point x, based on a bunch of sample xs in a training data set. If I know the value of f(x) for every x in the training data, this should be simple - just do a regression, or take the weighted average of nearby points, or whatever.
However, this isn't what my training data looks like. Rather, I have a bunch of pairs of points (x, y), and I know the value of f(x) - f(y) for each pair, but I don't know the absolute values of f(x) for any particular x. It seems like there ought to be a way to use this data to find an approximation to f(x), but I haven't found anything after some Googling; there are papers like this but they seem to assume that the training data comes in the form of a set of discrete labels for each entity, rather than having labels over pairs of entities.
I'm just making something up here, but could I try kernel density estimation over f'(x) and then integrate to get f(x)? Or is that crazy, or is there a better known technique?
You could assume that f is linear, which would simplify things - if f is linear we know that:
f(x-y) = f(x) - f(y)
For example, suppose you assume f(x) = <w, x>, making w the parameter you want to learn. What would the squared loss for a sample (x, y) with known difference d look like?
loss((x,y), d) = (f(x)-f(y) - d)^2
= (<w,x> - <w,y> - d)^2
= (<w, x-y> - d)^2
= (<w, z> - d)^2 // where z:=x-y
This is simply the squared loss with input z = x - y.
Practically, you would need to construct z = x - y for each pair and then learn f using linear regression over inputs z and outputs d.
This model might be too weak for your needs, but it's probably the first thing you should try. Otherwise, as soon as you step away from the linearity assumption, you would likely arrive at a difficult non-convex optimization problem.
I don't see a way to get absolute results. Any constant in your function (f(x) = g(x) + c) will disappear, in the same way constants disappear in an integral.
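Here is a small sketch of that recipe with scikit-learn on synthetic data (fit_intercept=False because, as noted, any constant offset is unidentifiable from differences):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    w_true = rng.normal(size=5)     # hidden linear f(x) = <w_true, x>
    X = rng.normal(size=(500, 5))
    Y = rng.normal(size=(500, 5))
    d = X @ w_true - Y @ w_true     # observed differences f(x) - f(y)

    Z = X - Y                       # one input per pair: z = x - y
    model = LinearRegression(fit_intercept=False).fit(Z, d)
    print(model.coef_)              # approximately recovers w_true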
