Does scikit-learn provide a facility to perform regression using a Gaussian or polynomial kernel? I looked at the APIs and I don't see any.
Has anyone built a package on top of scikit-learn that does this?
Theory
Polynomial regression is a special case of linear regression; the main idea is in how you choose your features. Consider multivariate regression with 2 variables, x1 and x2. Linear regression will look like this: y = a1 * x1 + a2 * x2.
Now suppose you want polynomial regression (let's make it a degree-2 polynomial). We create a few additional features: x1*x2, x1^2 and x2^2. So we get your 'linear regression':
y = a1 * x1 + a2 * x2 + a3 * x1*x2 + a4 * x1^2 + a5 * x2^2
This nicely illustrates an important concept, the curse of dimensionality: the number of new features grows much faster than linearly with the degree of the polynomial. You can read more about this concept here.
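To get a feel for how fast this grows, here is a quick sketch (the choice of 10 original features and the listed degrees are arbitrary). The number of polynomial terms up to degree d for n features, including the bias term, is comb(n + d, d):
from math import comb

# Number of polynomial features (including the bias) for n original features
# and degree d; it blows up quickly with d.
n = 10
for d in (2, 3, 5, 10):
    print(d, comb(n + d, d))   # 66, 286, 3003, 184756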
Practice with scikit-learn
You do not need to do all of this by hand in scikit-learn. Polynomial regression is already available there (as of version 0.15; check how to update it here).
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

# Two training samples with two features each, plus their targets
X = [[0.44, 0.68], [0.99, 0.23]]
vector = [109.85, 155.72]
predict = [[0.49, 0.18]]

# Expand the features with all polynomial terms up to degree 2
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
predict_ = poly.transform(predict)

# Plain linear regression on the expanded features
clf = linear_model.LinearRegression()
clf.fit(X_, vector)
print(clf.predict(predict_))
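If you want to see which expanded features the model is actually fitting, a quick check (get_feature_names_out is available in recent scikit-learn versions; older versions had get_feature_names instead):
print(poly.get_feature_names_out())
# ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']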
Either you use Support Vector Regression, sklearn.svm.SVR, and set the appropriate kernel (see here).
Or you install the latest master version of sklearn and use the recently added sklearn.preprocessing.PolynomialFeatures (see here) and then OLS or Ridge on top of that.
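For example, a minimal SVR sketch on the toy data from the answer above (the values of C and gamma here are arbitrary choices, not recommendations):
from sklearn.svm import SVR

X = [[0.44, 0.68], [0.99, 0.23]]
y = [109.85, 155.72]

# Gaussian (RBF) kernel; use kernel="poly", degree=2 for a polynomial kernel instead
svr = SVR(kernel="rbf", C=1000.0, gamma=0.1)
svr.fit(X, y)
print(svr.predict([[0.49, 0.18]]))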
Related
I'm using sklearn for SVR (regression) with an RBF kernel. I want to know how the inference is done under the hood. I thought it was a function of the support vectors, the function mean, and gamma, but it appears I'm missing one aspect (probably some scaling based on how close two points are).
Here is "my equation" that I've tried in the graphs below.
out = mean
for vect in vectors:
    out = out + (vect.y - mean) * math.exp(-(vect.x - x) ** 2 * gamma)
When I use just 2 training points spaced far apart, my equation matches what sklearn reports with svr.predict.
With 3 training points, 2 of them close together, my equation does not match what svr.predict gives.
Given the support vectors, gamma, and mean, and anything else needed, what is the equation for SVR inference with RBF kernel? Can those be obtained from the sklearn svr class?
The equation that works for me for SVR inference with the RBF kernel, using the sklearn library, is as follows in Python:
# x and y are already defined and are the training data for the SVR;
# C, gamma, epsilon and tol are the chosen hyperparameters
import math
from sklearn import svm

svr = svm.SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon, tol=tol)
svr.fit(x, y)

# Collect the (input value, target) pairs used as support vectors
vectors = []
for i in svr.support_:
    vectors.append([x[i][0], y[i]])

# Manual prediction for a single query point x_new (the value you want to predict for)
out = svr._intercept_[0]
for vect, coef in zip(vectors, svr._dual_coef_[0]):
    out = out + coef * math.exp(-(vect[0] - x_new) ** 2 * gamma)
I found that svr._intercept_[0] contains the y offset for the function.
I found that svr._dual_coef_[0] contains the coefficients to multiply each of the exponentials by.
I found that svr.support_ contains the indexes of the elements in your training set used as the support vectors.
I realize I'm accessing attributes that are meant to be used only inside the SVR class; however, I don't see an official API method for accessing these values, and this works for me for now.
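For what it's worth, the same prediction can also be reconstructed from the public attributes dual_coef_, intercept_ and support_vectors_. A minimal sketch, assuming 1-D inputs and a fitted svr as above (x_new and the helper name are my own):
import numpy as np

def manual_rbf_predict(svr, x_new, gamma):
    # Sum of dual coefficient * RBF kernel value over the support vectors, plus the intercept
    k = np.exp(-gamma * (svr.support_vectors_[:, 0] - x_new) ** 2)
    return float(svr.dual_coef_[0] @ k + svr.intercept_[0])

# manual_rbf_predict(svr, 0.3, gamma) should match svr.predict([[0.3]])[0]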
What if different inputs give the same output, since non-linear equations can have more than one root?
Also, can multiple entirely different weight values give approximately the same optimal predictions?
For example,
import numpy as np
import matplotlib.pyplot as plt

length = 1000
X = np.random.rand(length, 1) * 100

# Two quadratics with quite different coefficients
plt.plot(X, 3.5*(X**2) + 3.45*X + 3.44, "g.", alpha=0.8)
plt.plot(X, 3.60818952*(X**2) - 5.94902958*X - 17.74136614, "y.")
plt.show()
gives the same curves.
matplotlib graph
In multi-class logistic regression, let's say we use softmax and cross-entropy.
With SGD on one training example, are all the weights updated, or only the portion of the weights associated with the label?
For example, say the label is the one-hot vector [0, 0, 1].
Is the whole matrix W of shape feature_dim × num_class updated, or only the third column W^(3) of shape feature_dim × 1?
Thanks
All of your weights are updated.
You have y = Softmax(W x + β), so to predict a y out of a single x you are making use of all your W weights. If something is used during the forward pass (prediction), then it also gets updated during the backward pass (SGD). Perhaps a more intuitive way of thinking about it is that you are essentially predicting the class membership probability for your features; assigning weight to some class means removing weight from another, so you need to update both.
Take for instance the simple case of x ∈ ℝ, y ∈ ℝ³. Then W ∈ ℝ^(1×3). Before activation, your prediction for some given x would look like: y = [y1 = W11*x + β1, y2 = W12*x + β2, y3 = W13*x + β3]. You have an error signal for all of these mini-predictions, coming out of the categorical cross-entropy, for which you must then compute the derivative with respect to the W and β terms.
I hope this is clear
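A small numpy sketch of one SGD step for softmax + cross-entropy makes this concrete (the sizes and variable names are just for illustration):
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

feature_dim, num_class = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(feature_dim, num_class))
b = np.zeros(num_class)

x = rng.normal(size=feature_dim)    # one training example
y = np.array([0.0, 0.0, 1.0])       # one-hot label

p = softmax(W.T @ x + b)            # forward pass
grad_W = np.outer(x, p - y)         # gradient of the cross-entropy w.r.t. W

# Every column of grad_W is non-zero in general, so the SGD step touches all of W,
# not just the column of the true class.
W -= 0.1 * grad_W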
To reduce the problem of over-fitting in linear regression in machine learning, it is suggested to modify the cost function by including the squares of the parameters. This results in smaller values of the parameters.
This is not at all intuitive to me. How can having smaller values for the parameters result in a simpler hypothesis and help prevent over-fitting?
I put together a rather contrived example, but hopefully it helps.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
First, build a linear dataset with a train/test split, 5 samples in each:
X, y, c = datasets.make_regression(10, 1, noise=5, coef=True, shuffle=True, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=5)
Fit the data with a fifth order polynomial with no regularization.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
    ('poly', PolynomialFeatures(5)),
    ('model', Ridge(alpha=0.))  # alpha=0 indicates no regularization
])
pipeline.fit(X_train, y_train)
Looking at the coefficients
pipeline.named_steps['model'].coef_
pipeline.named_steps['model'].intercept_
# y_pred = -12.82 + 33.59 x + 292.32 x^2 - 193.29 x^3 - 119.64 x^4 + 78.87 x^5
Here the model touches all the training points, but it has large coefficients and does not come close to the test points.
Let's try again, but change our L2 regularization
pipeline.set_params(model__alpha=1)
pipeline.fit(X_train, y_train)
# y_pred = 6.88 + 26.13 x + 16.58 x^2 + 12.47 x^3 + 5.86 x^4 - 5.20 x^5
Here we see a smoother shape, with less wiggling around. It no longer touches all the training points, but it is a much smoother curve. The coefficients are smaller due to the regularization being added.
This is a bit more complicated. It depends very much on the algorithm you are using.
To make an easy but slightly silly example: instead of optimising the parameters of the function
y = a*x1 + b*x2
you could also optimise the parameters of
y = 1/a * x1 + 1/b * x2
Obviously, if you minimise the parameters in the former case, you need to maximise them in the latter case.
The reason that for most algorithms you minimise the square of the parameters comes from computational learning theory.
Let's assume for the following you want to learn a function
f(x) = a + b * x + c * x^2 + d * x^3 + ...
One can argue that a function where only a is different from zero is more likely than a function where a and b are different from zero, and so on.
Following Occam's razor (if you have two hypotheses explaining your data, the simpler one is more likely to be the right one), you should prefer a hypothesis where more of your parameters are zero.
To give an example, let's say your data points are (x, y) ∈ {(-1, 0), (1, 0)}.
Which function would you prefer?
f(x) = 0
or
f(x) = -1 + 1*x^2
Extending this a bit you can go from parameters which are zero to parameters which are small.
If you want to try it out you can sample some data points from a linear function and add a bit of gaussian noise. If you want to find a perfect polynomial fit you need a pretty complicated function with typically pretty large weights. However, if you apply regularisation you will come close to your data generating function.
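A minimal sketch of that experiment (the data size, polynomial degree and alpha are my own assumptions):
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(20, 1))
y = 2 * x[:, 0] + 1 + rng.normal(scale=0.2, size=20)   # linear data + Gaussian noise

X_poly = PolynomialFeatures(degree=10).fit_transform(x)

ols = LinearRegression().fit(X_poly, y)        # no regularisation
ridge = Ridge(alpha=1.0).fit(X_poly, y)        # L2 regularisation

# The unregularised fit typically ends up with much larger weights than the ridge fit
print(np.abs(ols.coef_).max(), np.abs(ridge.coef_).max())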
But if you want to set your reasoning on rock-solid theoretical foundations, I would recommend applying Bayesian statistics. The idea there is that you define a probability distribution over regression functions; that way you can define for yourself what a "probable" regression function is.
(Actually Machine Learning by Tom Mitchell contains a pretty good and more detailed explanation)
Adding squared terms to your function (going from linear to polynomial) means you can draw a curve instead of just a straight line.
Example of a polynomial function:
y = q + t1*x1 + t2*x2^2
Adding this, however, can lead to a model that follows the training data too closely, with the result that new data is matched poorly. This effect gets stronger as you add more and more polynomial terms (3rd, 4th order). So when adding polynomial terms, you always have to watch out that the model does not become overfitted.
To get more insight in this, draw some curves in a spreadsheet and see how the curves change following your data.
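If you prefer Python over a spreadsheet, a quick sketch (the coefficients are arbitrary):
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
plt.plot(x, 1 + 2 * x, label="linear: 1 + 2x")
plt.plot(x, 1 + 2 * x + 1.5 * x**2, label="polynomial: 1 + 2x + 1.5x^2")
plt.legend()
plt.show()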
I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data.
Mitchell defines a perceptron as follows:
o(x1, ..., xn) = 1 if w0 + w1*x1 + ... + wn*xn > 0, and -1 otherwise
That is, the output is 1 if the weighted sum of the inputs exceeds some threshold, and -1 otherwise.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly. Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule:
wi <- wi + eta * (t - o) * xi
where t is the target output, o is the output produced by the perceptron, and eta is a small positive constant called the learning rate.
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of the components of x, you are, in fact, splitting your data into 2 classes with the hyperplane a_1 x_1 + … + a_n x_n = 0.
Consider a 2D example: X = (x, y) and W = (a, b); then X · W = a*x + b*y. sgn returns 1 if its argument is greater than 0, so for class #1 you have a*x + b*y > 0, which is equivalent to y > -(a/b)*x (assuming b > 0; the inequality flips if b < 0). This equation is linear and divides the 2D plane into 2 parts.
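A minimal sketch of the perceptron training rule on a toy linearly separable dataset (the data, learning rate and iteration cap are my own choices):
import numpy as np

# Linearly separable 2D data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])

w = np.zeros(2)
b = 0.0
eta = 0.1   # learning rate

for _ in range(100):                        # iterate until everything is classified correctly
    errors = 0
    for x_i, t_i in zip(X, t):
        o = 1 if w @ x_i + b > 0 else -1    # sgn of the weighted sum
        if o != t_i:
            w += eta * (t_i - o) * x_i      # perceptron training rule
            b += eta * (t_i - o)
            errors += 1
    if errors == 0:
        break

print(w, b)
On data that is not linearly separable (e.g. XOR), the inner loop never reaches zero errors, which is exactly the limitation the question is about.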