I'm using sklearn for SVR (regression) using an RBF kernel. I'm want to know how the inference is done under the hood. I thought it was a function of the support vectors, function mean, and gamma, but it appears I'm missing one aspect (probably some scaling based on how close 2 points are.
Here is "my Equation" that I've tried in the graph's below.
out = mean
for vect in vectors:
out = out + (vect.y - mean) * math.exp(-(vect.x - x) ** 2 * gamma)
When I do just 2 points spaced away, my equation matches what skLearn reports with svr.predict.
With 3 training points and 2 close together, my equation does not match what svr.predict gives:
Given the support vectors, gamma, and mean, and anything else needed, what is the equation for SVR inference with RBF kernel? Can those be obtained from the sklearn svr class?
The equation that works for me using sklearn library and SVR inference with RBF kernel is as follows with python code:
# x and y is already defined, and is the training data for the SVR
svr = svm.SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon, tol=tol)
svr.fit(x,y)
vectors = []
for i in svr.support_:
vectors.append([x[i][0], y[i]])
out = svr._intercept_[0]
for vect, coef in zip(vectors, svr._dual_coef_[0]):
out = out + coef * math.exp(-(vect[0] - x) ** 2 * gamma)
I found that svr._intercept_[0] contains the y offset for the function.
I found that svr._dual_coef_[0] contains the coefficients to multiply each of the exponentials by.
I found that svr.support_ contains the indexes of the elements in your training set used as the support vectors.
I realize I'm accessing what is intended to be accessed within the svr class only, however, I don't see an official API method for accessing these variables, and this is working for me for now.
Related
I want to implement a kernel ridge regression that also works within MLJ. Moreover, I want to have the option to use either feature vectors or a predefined kernel matrix as in Python sklearn.
When I run this code
const MMI = MLJModelInterface
MMI.#mlj_model mutable struct KRRModel <: MLJModelInterface.Deterministic
mu::Float64 = 1::(_ > 0)
kernel::String = "linear"
end
function MMI.fit(m::KRRModel,verbosity::Int,K,y)
K = MLJBase.matrix(K)
fitresult = inv(K+m.mu*I)*y
cache = nothing
report = nothing
return (fitresult,cache,report)
end
N = 10
K = randn(N,N)
K = K*K
a = randn(N)
y = K*a + 0.2*randn(N)
m = KRRModel()
kregressor = machine(m,K,y)
cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)
evaluate!(kregressor, resampling=cv, measure=rms, verbosity=1)
the evaluate! function evaluates the machine on different subsets of rows of K. Due to the Representer Theorem, a kernel ridge regression has a number of nonzero coefficients equal to the number of samples. Hence, a reduced size matrix K[train_rows,train_rows] can be used instead of K[train_rows,:].
To denote I'm using a kernel matrix I'd set m.kernel = "" . How do I make evaluate! select the columns as well as the rows to form a smaller matrix when m.kernel = ""?
This is my first time using MLJ and I'd like to make as few modifications as possible.
Quoting the answer I got on the Julia Discourse from #ablaom
The intended use of evaluate! is to estimate the generalisation error
associated with some supervised learning model, by subsampling
observations, as in cross-validation, a common use-case. I’m afraid
there is no natural way for evaluate! do feature subsampling.
https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/
FYI: There is a version of kernel regression implementing the MLJ
model interface, namely kernel partial least squares regression from
the package GitHub - lalvim/PartialLeastSquaresRegressor.jl:
Implementation of a Partial Least Squares Regressor 2 .
Am trying to understand math behind Linear Regression and i have verified in multiple sites that Linear Regression works under OLS method with y=mx+c to get best fit line
So in order to calculate intercept and slope we use below formula(if am not wrong)
m = sum of[ (x-mean(x)) (y-mean(y)) ] / sum of[(x-mean(x))]
c = mean(y) - b1(mean(x))
So with this we get x and c values to substitute in above equation to get y predicted values and can predict for newer x values.
But my doubt is when does "Gradient Descent" used. I understood it is also used for calculating co-efficients only in such a way it reduces the cost function by finding local minima value.
Please help me in this
Are this two having separate functions in python/R.
Or linear regression by default works on Gradient Descent(if so then when does above formula used for calculating m and c values)
If I had 2 features x1 and x2 where I know that the pattern is:
if x1 < x2 then
class1
else
class2
Can any machine learning algorithm find such a pattern? What algorithm would that be?
I know that I could create a third feature x3 = x1-x2. Then feature x3 can easily be used by some machine learning algorithms. For example a decision tree can solve the problem 100% using x3 and just 3 nodes (1 decision and 2 leaf nodes).
But, is it possible to solve this without creating new features? This seems like a problem that should be easily solved 100% if a machine learning algorithm could only find such a pattern.
I tried MLP and SVM with different kernels, including svg kernel and the results are not great. As an example of what I tried, here is the scikit-learn code where the SVM could only get a score of 0.992:
import numpy as np
from sklearn.svm import SVC
# Generate 1000 samples with 2 features with random values
X_train = np.random.rand(1000,2)
# Label each sample. If feature "x1" is less than feature "x2" then label as 1, otherwise label is 0.
y_train = X_train[:,0] < X_train[:,1]
y_train = y_train.astype(int) # convert boolean to 0 and 1
svc = SVC(kernel = "rbf", C = 0.9) # tried all kernels and C values from 0.1 to 1.0
svc.fit(X_train, y_train)
print("SVC score: %f" % svc.score(X_train, y_train))
Output running the code:
SVC score: 0.992000
This is an oversimplification of my problem. The real problem may have hundreds of features and different patterns, not just x1 < x2. However, to start with it would help a lot to know how to solve for this simple pattern.
To understand this, you must go into the settings of all the parameters provided by sklearn, and C in particular. It also helps to understand how the value of C influences the classifier's training procedure.
If you look at the equation in the User Guide for SVC, there are two main parts to the equation - the first part tries to find a small set of weights that solves the problem, and the second part tries to minimize the classification errors.
C is the penalty multiplier associated with misclassifications. If you decrease C, then you reduce the penalty (lower training accuracy but better generalization to test) and vice versa.
Try setting C to 1e+6. You will see that you almost always get 100% accuracy. The classifier has learnt the pattern x1 < x2. But it figures that a 99.2% accuracy is enough when you look at another parameter called tol. This controls how much error is negligible for you and by default it is set to 1e-3. If you reduce the tolerance, you can also expect to get similar results.
In general, I would suggest you to use something like GridSearchCV (link) to find the optimal values of hyper parameters like C as this internally splits the dataset into train and validation. This helps you to ensure that you are not just tweaking the hyperparameters to get a good training accuracy but you are also making sure that the classifier will do well in practice.
Does scikit-learn provide facility to perform regression using a gaussian or polynomial kernel? I looked at the APIs and I don't see any.
Has anyone built a package on top of scikit-learn that does this?
Theory
Polynomial regression is a special case of linear regression. With the main idea of how do you select your features. Looking at the multivariate regression with 2 variables: x1 and x2. Linear regression will look like this: y = a1 * x1 + a2 * x2.
Now you want to have a polynomial regression (let's make 2 degree polynomial). We will create a few additional features: x1*x2, x1^2 and x2^2. So we will get your 'linear regression':
y = a1 * x1 + a2 * x2 + a3 * x1*x2 + a4 * x1^2 + a5 * x2^2
This nicely shows an important concept curse of dimensionality, because the number of new features grows much faster than linearly with the growth of degree of polynomial. You can take a look about this concept here.
Practice with scikit-learn
You do not need to do all this in scikit. Polynomial regression is already available there (in 0.15 version. Check how to update it here).
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
X = [[0.44, 0.68], [0.99, 0.23]]
vector = [109.85, 155.72]
predict= [0.49, 0.18]
poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X)
predict_ = poly.fit_transform(predict)
clf = linear_model.LinearRegression()
clf.fit(X_, vector)
print clf.predict(predict_)
Either you use Support Vector Regression sklearn.svm.SVR and set the appropritate kernel (see here).
Or you install the latest master version of sklearn and use the recently added sklearn.preprocessing.PolynomialFeatures (see here) and then OLS or Ridge on top of that.
I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data?
Mitchell defines a perceptron as follows:
That is, it is y is 1 or -1 if the sum of the weighted inputs exceeds some threshold.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random
weights, then iteratively apply the perceptron to each training
example, modify- ing the perceptron weights whenever it misclassifies
an example. This process is repeated, iterating through the training
examples as many times as needed until the perceptron classifies all
training examples correctly. Weights are modified at each step
according to the perceptron training rule, which revises the weight wi
associated with input xi according to the rule:
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of xs, and you, in fact, split your data into 2 classes using a hyperplane a_1 x_1 + … + a_n x_n > 0
Consider a 2D example: X = (x, y) and W = (a, b) then X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -a/b x (assuming b != 0). And this equation is linear and divides a 2D plane into 2 parts.