Kernel Function in Gaussian Processes

Given a kernel in a Gaussian process, is it possible to know the shape of the functions drawn from the prior distribution without sampling them first?

I think the best way to get a feel for the shape of the prior functions is to draw them. Here's a 1-dimensional example:
Below are samples from the GP prior (the mean is 0 and the covariance matrix is induced by the squared exponential kernel), generated by the code that follows. As you can see they are smooth, and the plot gives a general feeling for how "wiggly" they are. Note that in the multi-dimensional case the samples look qualitatively similar along each dimension. The kernel also tells you a lot analytically - for the squared exponential kernel the sample paths are infinitely differentiable, and its length scale controls the wiggliness - but sampling is the quickest way to build intuition.
Here's the full code I used; feel free to write your own kernel or tweak the parameters to see how that affects the samples:
import numpy as np
import matplotlib.pyplot as pl

def kernel(a, b, gamma=0.1):
    """ GP squared exponential kernel """
    # pairwise squared distances between the rows of a and b
    sq_dist = np.sum(a**2, 1).reshape(-1, 1) + np.sum(b**2, 1) - 2*np.dot(a, b.T)
    return np.exp(-0.5 * (1 / gamma) * sq_dist)

n = 300   # number of points.
m = 10    # number of functions to draw.
s = 1e-6  # jitter added to the diagonal for numerical stability.
X = np.linspace(-5, 5, n).reshape(-1, 1)
K = kernel(X, X)
L = np.linalg.cholesky(K + s * np.eye(n))
f_prior = np.dot(L, np.random.normal(size=(n, m)))

pl.figure(1)
pl.clf()
pl.plot(X, f_prior)
pl.title('%d samples from the GP prior' % m)
pl.axis([-5, 5, -3, 3])
pl.show()
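For example, re-running the sampling step with a larger gamma produces visibly smoother draws (a quick variation on the code above; in this parameterisation gamma plays the role of a squared length scale):
K2 = kernel(X, X, gamma=1.0)  # larger gamma -> longer length scale -> smoother samples
L2 = np.linalg.cholesky(K2 + s * np.eye(n))
f_smooth = np.dot(L2, np.random.normal(size=(n, m)))
pl.plot(X, f_smooth)
pl.title('smoother samples with gamma=1.0')
pl.show()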

Related

Efficient pseudo-inverse for PyTorch 2D convolution

Background:
Thanks for your attention! I am learning the basics of 2D convolution, linear algebra and PyTorch, and I have run into an implementation problem with the pseudo-inverse of the convolution operator. Specifically, I have no idea how to implement it in an efficient way. Please see the problem statement below for details; any help/tip/suggestion is welcome.
The Original Problem:
I have an image feature x with shape [b, c, h, w] and a 3x3 convolutional kernel K with shape [c, c, 3, 3], with y = K * x. How can I implement the corresponding pseudo-inverse on y in an efficient way?
That is, given y = K * x = Ax, how do I implement x_hat = (A^+)y?
I guess there should be some operations using torch.fft, but I still have no idea how to implement it, and I do not know if an implementation already exists.
import torch
import torch.nn.functional as F
c = 32
K = torch.randn(c, c, 3, 3)
x = torch.randn(1, c, 128, 128)
y = F.conv2d(x, K, padding=1)
print(y.shape)
# How to implement pseudo-inverse for y = K * x in an efficient way?
Some of My Efforts:
I know that the 2D convolution is a linear operator, equivalent to a matrix product, so we could write out the matrix form of the convolution and compute its pseudo-inverse. However, I think this type of operation will be inefficient, and I have no idea how to implement it efficiently.
According to Wikipedia, the pseudo-inverse satisfies A(A_pinv(y)) = y for any y in the range of A; in general, A_pinv(y) gives the least-squares solution x_hat.
(Thanks again for reading such a long post!)
This takes the problem to another level.
The convolution itself is a linear operation, so you can determine the matrix of the operation and solve a least-squares problem directly [1], or compute the pseudo-inverse as you mentioned and then apply it to different outputs, recovering a projection of the input.
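For very small sizes you can even materialise that matrix explicitly, by pushing one-hot basis images through conv2d; a brute-force sketch (the sizes and names here are mine, and this only illustrates the least-squares route, it is not an efficient method):
import torch
import torch.nn.functional as F

c, h, w = 2, 8, 8
K_small = torch.randn(c, c, 3, 3)

# each row of the identity is a one-hot input image; running conv2d on all
# of them gives the columns of the matrix A with vec(y) = A @ vec(x)
basis = torch.eye(c * h * w).reshape(c * h * w, c, h, w)
A = F.conv2d(basis, K_small, padding=1).reshape(c * h * w, -1).T

x_small = torch.randn(1, c, h, w)
y_small = F.conv2d(x_small, K_small, padding=1)

# pseudo-inverse reconstruction, feasible only at toy sizes
x_rec = (torch.linalg.pinv(A) @ y_small.reshape(-1)).reshape(1, c, h, w)
print(torch.mean((x_small - x_rec)**2))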
I am changing your code to use padding=0 and a batch of 4:
import torch
import torch.nn.functional as F
# your code, modified: padding removed (padding=0 is the default), batch of 4
c = 32
K = torch.randn(c, c, 3, 3)
x = torch.randn(4, c, 128, 128)
y = F.conv2d(x, K, bias=torch.zeros((c,)))
Also, as you already suggested, the convolution can be computed as ifft(fft(h) * fft(x)). However, conv2d actually computes a cross-correlation, so you have to conjugate the filter, giving ifft(conj(fft(h)) * fft(x)). You also have to apply this along two axes and make sure both FFTs are calculated at the same size; since the data is real, we can use the multi-dimensional real FFT. To be complete, conv2d works on multiple channels, so we have to calculate sums of convolutions. Since the FFT is linear, we can simply compute those summations in the frequency domain using einsum.
# FFT the kernel and the input at the full input size (128 x 128)
s = x.shape[-2:]
K_f = torch.fft.rfftn(K, s)
x_f = torch.fft.rfftn(x, s)
# sum over the input-channel index k in the frequency domain;
# conj() because conv2d computes a cross-correlation, not a convolution
y_f = torch.einsum('jkxy,ikxy->ijxy', K_f.conj(), x_f)
y_hat = torch.fft.irfftn(y_f, s)
Except for the borders it should be accurate (remember the FFT computes a cyclic convolution), so the valid region of y_hat matches the cropped output of conv2d:
print(torch.max(abs(y_hat[:, :, :-2, :-2] - y)))
Now, notice the pattern jk,ik->ij in the einsum: it means y_f[i,j] = sum_k K_f[j,k] * x_f[i,k], i.e. y_f = x_f @ K_f.T if @ denotes the matrix product over the first two dimensions. So to invert the operation we can interpret the first two dimensions as matrices. The function torch.linalg.pinv computes pseudo-inverses on the last two axes, so in order to use it we have to permute the axes. If we right-multiply the output by the pseudo-inverse of the transposed K_f, we should invert the operation.
s = 128, 128
K_f = torch.fft.rfftn(K, s)
# reverse the axes so the per-frequency c x c channel matrices sit on the
# last two dimensions, pseudo-invert them, and reverse back; the double
# reversal leaves exactly the per-frequency pseudo-inverse of K_f
# (permute(3, 2, 1, 0) does what .T does on a 2-D tensor)
K_f_inv = torch.linalg.pinv(K_f.permute(3, 2, 1, 0)).permute(3, 2, 1, 0)
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
print(torch.mean((x - x_hat)**2) / torch.mean(x**2))
Notice that I am using the full cyclic output, while conv2d actually crops the borders. Let's simulate that cropping by zeroing the border region that conv2d discards (k is the kernel size, 3 here):
k = K.shape[-1]  # kernel size
y_hat[:, :, 128-(k-1):, :] = 0
y_hat[:, :, :, 128-(k-1):] = 0
Repeating the calculation, you will see that the input is no longer recovered accurately, so you have to be careful about what happens at the borders of your convolution. But in situations where you can get this to work, it will indeed be efficient.
# same inversion as above, now applied to the cropped y_hat
s = 128, 128
K_f = torch.fft.rfftn(K, s)
K_f_inv = torch.linalg.pinv(K_f.permute(3, 2, 1, 0)).permute(3, 2, 1, 0)
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
print(torch.mean((x - x_hat)**2) / torch.mean(x**2))

Optimise custom gaussian processes kernel in scikit using gridsearch

I'm working with Gaussian processes, and when I use the scikit-learn GP modules I struggle to create and optimise custom kernels using GridSearchCV. The best way to describe this problem is with the classic Mauna Loa example, where the appropriate kernel is constructed using a combination of already defined kernels such as RBF and RationalQuadratic. In that example the parameters of the custom kernel are not optimised but treated as given. What if I wanted to run a more general case where I would want to estimate those hyperparameters using cross-validation? How should I go about constructing the custom kernel and then the corresponding param_grid object for the grid search?
In a very naive way I could construct a custom kernel using something like this:
def custom_kernel(a, b, ls, l, alpha, nl):
    kernel = a * RBF(length_scale=ls) \
        + b * RationalQuadratic(length_scale=l, alpha=alpha) \
        + WhiteKernel(noise_level=nl)
    return kernel
however this function can't, of course, be called from GridSearchCV using e.g. GaussianProcessRegressor(kernel=custom_kernel(a, b, ls, l, alpha, nl)).
One possible path forward is presented in this SO question, but I was wondering whether there's an easier way than coding the kernel (along with its hyperparameters) from scratch, since I want to work with combinations of standard kernels and may also want to mix them up.
So this is how far I got. It answers the question, but it is really, really slow on the Mauna Loa example; then again, that is probably a difficult dataset to work with:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel, RationalQuadratic, ExpSineSquared
import numpy as np
from sklearn.datasets import fetch_openml

# from https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_co2.html
def load_mauna_loa_atmospheric_co2():
    # as_frame=False keeps the array-based interface in recent scikit-learn
    ml_data = fetch_openml(data_id=41187, as_frame=False)
    months = []
    ppmv_sums = []
    counts = []
    y = ml_data.data[:, 0]
    m = ml_data.data[:, 1]
    month_float = y + (m - 1) / 12
    ppmvs = ml_data.target
    for month, ppmv in zip(month_float, ppmvs):
        if not months or month != months[-1]:
            months.append(month)
            ppmv_sums.append(ppmv)
            counts.append(1)
        else:
            # aggregate monthly sum to produce average
            ppmv_sums[-1] += ppmv
            counts[-1] += 1
    months = np.asarray(months).reshape(-1, 1)
    avg_ppmvs = np.asarray(ppmv_sums) / counts
    return months, avg_ppmvs

X, y = load_mauna_loa_atmospheric_co2()

# Kernel with parameters given in the GPML book
k1 = ConstantKernel(constant_value=66.0**2) * RBF(length_scale=67.0)  # long-term smooth rising trend
k2 = ConstantKernel(constant_value=2.4**2) * RBF(length_scale=90.0) \
    * ExpSineSquared(length_scale=1.3, periodicity=1.0)  # seasonal component
k3 = ConstantKernel(constant_value=0.66**2) \
    * RationalQuadratic(length_scale=1.2, alpha=0.78)  # medium-term irregularity
k4 = ConstantKernel(constant_value=0.18**2) * RBF(length_scale=0.134) \
    + WhiteKernel(noise_level=0.19**2)  # noise terms
kernel_gpml = k1 + k2 + k3 + k4
gp = GaussianProcessRegressor(kernel=kernel_gpml)

# print parameters to discover the nested hyperparameter names
print(gp.get_params())

param_grid = {'alpha': np.logspace(-2, 4, 5),
              'kernel__k1__k1__k1__k1__constant_value': np.logspace(-2, 4, 5),
              'kernel__k1__k1__k1__k2__length_scale': np.logspace(-2, 2, 5),
              'kernel__k2__k2__noise_level': np.logspace(-2, 1, 5)}
grid_gp = GridSearchCV(gp, cv=5, param_grid=param_grid, n_jobs=4)
grid_gp.fit(X, y)
What helped me was to initialise the model first as gp = GaussianProcessRegressor(kernel=kernel_gpml) and then use its get_params() method to get the list of the model's hyperparameter names.
Finally, I note that in their book Rasmussen and Williams appear to have used leave-one-out cross-validation to tune the hyperparameters.
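If you want to mirror that, swapping the grid search's CV strategy for leave-one-out is a small change (a sketch reusing gp and param_grid from above; note that the default R^2 score is undefined on single-sample test folds, so an error-based score is needed, and LOO makes the search even slower):
from sklearn.model_selection import GridSearchCV, LeaveOneOut

grid_gp_loo = GridSearchCV(gp, cv=LeaveOneOut(), param_grid=param_grid,
                           scoring='neg_mean_squared_error', n_jobs=4)
grid_gp_loo.fit(X, y)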

What would be a good loss function to penalize the magnitude and sign difference

I'm in a situation where I need to train a model to predict a scalar value, and it's important that the predicted value has the same sign as the true value, while keeping the squared error as small as possible.
What would be a good choice of loss function for that?
For example:
Let's say the predicted value is -1 and the true value is 1. The loss between the two should be a lot greater than the loss between 3 and 1, even though the squared errors of (3, 1) and (-1, 1) are equal: (3 - 1)^2 = (-1 - 1)^2 = 4.
Thanks a lot!
This turned out to be a really interesting question - thanks for asking it! First, remember that you want your loss function to be built entirely out of differentiable operations, so that you can back-propagate through it. This means that arbitrary logic won't necessarily do. To restate your problem: you want a differentiable function of two variables that increases sharply when the two variables have different signs, and more slowly when they share the same sign. Additionally, you want some control over how sharply the values increase relative to one another, which calls for two configurable constants. I started constructing a function that met these needs, but then remembered one you can find in any high-school geometry textbook: the elliptic paraboloid!
The standard formulation doesn't meet the requirement of sign-agreement symmetry, so I had to introduce a rotation; the plot produced by the code below shows the result. Note that the surface increases more sharply when the signs don't agree, less sharply when they do, and that the constants controlling this behaviour are configurable. The code below is all that was needed to define and plot the loss function. I don't think I've ever used a geometric form as a loss function before - really neat.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3-D projection on older matplotlib
from matplotlib import cm

def elliptic_paraboloid_loss(x, y, c_diff_sign, c_same_sign):
    # Compute a rotated elliptic paraboloid: after the 45-degree rotation,
    # x_rot runs along the sign-agreement diagonal (y = x) and y_rot along
    # the sign-disagreement diagonal (y = -x); the two constants control
    # how fast the loss grows along each diagonal.
    t = np.pi / 4
    x_rot = (x * np.cos(t)) + (y * np.sin(t))
    y_rot = (x * -np.sin(t)) + (y * np.cos(t))
    z = ((x_rot**2) / c_diff_sign) + ((y_rot**2) / c_same_sign)
    return z

c_diff_sign = 4
c_same_sign = 2

a = np.arange(-5, 5, 0.1)
b = np.arange(-5, 5, 0.1)
loss_map = np.zeros((len(a), len(b)))
for i, a_i in enumerate(a):
    for j, b_j in enumerate(b):
        loss_map[i, j] = elliptic_paraboloid_loss(a_i, b_j, c_diff_sign, c_same_sign)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # fig.gca(projection='3d') was removed in newer matplotlib
X, Y = np.meshgrid(a, b)
surf = ax.plot_surface(X, Y, loss_map, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)
plt.show()
From what I understand, your current loss function is something like:
loss = mean_squared_error(y, y_pred)
What you could do is add another component to your loss: one that penalizes negative numbers and does nothing to positive numbers, together with a coefficient for how much you want to penalize it. For that we can use a negated ReLU; let's call this component Neg_ReLU, with Neg_ReLU(v) = max(0, -v). Then your loss function becomes:
loss = mean_squared_error(y, y_pred) + Neg_ReLU(y_pred)
So for example, if your prediction is -1 (with true value 1), the total error would be:
mean_squared_error(1, -1) + 1
And if your prediction is 3, the total error would be:
mean_squared_error(1, 3) + 0
(Note how Neg_ReLU(3) = 0 and Neg_ReLU(-1) = 1.)
If you want to penalize the negative values more, you can add a coefficient:
coeff_negative_value = 2
loss = mean_squared_error(y, y_pred) + coeff_negative_value * Neg_ReLU(y_pred)
Now the negative values are penalized more heavily.
In TensorFlow the negated ReLU can be built like this:
tf.nn.relu(tf.math.negative(value))
So, summarizing, your total loss will be:
coeff = 1
Neg_ReLU = tf.nn.relu(tf.math.negative(y_pred))
total_loss = mean_squared_error(y, y_pred) + coeff * Neg_ReLU
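Putting it together, a minimal sketch of this as a Keras-compatible loss function (the reduction choices and the function name are mine):
import tensorflow as tf

def mse_with_negative_penalty(y_true, y_pred, coeff=1.0):
    # standard squared-error term
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    # Neg_ReLU term: zero for non-negative predictions,
    # grows linearly as predictions go negative
    neg_relu = tf.reduce_mean(tf.nn.relu(tf.math.negative(y_pred)))
    return mse + coeff * neg_relu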

Under what parameters are SVC and LinearSVC in scikit-learn equivalent?

I read this thread about the difference between SVC() and LinearSVC() in scikit-learn.
I now have a binary classification problem (for such a problem, the one-vs-one/one-vs-rest strategy difference between the two can be ignored), and I want to find out under which parameters these two classes would give me the same result. First of all, of course, we should set kernel='linear' for SVC().
However, I just could not get the same result from both. I could not find the answer in the documentation; could anybody help me find the equivalent parameter set I am looking for?
Updated:
I modified the following code from an example on the scikit-learn website, and apparently they are not the same:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target
# merge classes 1 and 2 to get a binary problem
for i in range(len(y)):
    if y[i] == 2:
        y[i] = 1

h = .02  # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C, dual=True, loss='hinge').fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel',
          'LinearSVC (linear kernel)']

for i, clf in enumerate((svc, lin_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(1, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])
plt.show()
Result: the output figure from the previous code shows that the two decision boundaries differ.
In the mathematical sense you need to set:
SVC(kernel='linear', **kwargs)  # by default it uses the RBF kernel
and
LinearSVC(loss='hinge', **kwargs)  # by default it uses the squared hinge loss
Another element, which cannot easily be fixed, is increasing intercept_scaling in LinearSVC: in this implementation the bias is regularized (which is not true in SVC, nor should it be true in an SVM - thus this is not an SVM). Consequently the two will never be exactly equal (unless bias=0 for your problem), as they assume two different models:
SVC:       1/2 ||w||^2     + C SUM xi_i
LinearSVC: 1/2 ||[w b]||^2 + C SUM xi_i
Personally I consider LinearSVC one of the mistakes of the sklearn developers - this class is simply not a linear SVM.
After increasing the intercept scaling (to 10.0) the two decision boundaries come much closer.
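Concretely, in the example code above that means something like (10.0 is just a value that happens to work reasonably here):
lin_svc = svm.LinearSVC(C=C, dual=True, loss='hinge',
                        intercept_scaling=10.0).fit(X, y)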
However, if you scale it up too much it will also fail, as tolerance and the number of iterations then become crucial.
To sum up: LinearSVC is not a linear SVM; do not use it if you do not have to.

How should I train a machine learning algorithm on data with a large class imbalance? (SVM)

I am trying to train my SVM on data of clicks and conversions by people who see banners. The main problem is that clicks make up around 0.2% of all the data, so the classes are heavily imbalanced. When I use a plain SVM, during the testing phase it always predicts the "view" class and never "click" or "conversion". On average it gives 99.8% right answers (because of the imbalance), but 0% right predictions on the "click" and "conversion" examples. How can you tune the SVM algorithm (or select another one) to take the imbalance into account?
The most basic approach here is to use the so-called "class weighting scheme" - in the classical SVM formulation there is a C parameter used to control the misclassification penalty. It can be changed into separate C1 and C2 parameters used for class 1 and class 2 respectively. The most common choice of C1 and C2 for a given C is to put
C1 = C / n1
C2 = C / n2
where n1 and n2 are the sizes of class 1 and class 2 respectively. So you "punish" the SVM much harder for misclassifying the less frequent class than for misclassifying the more common one.
Many existing libraries (like libSVM) support this mechanism via their class_weight parameters.
Example using Python and sklearn:
import numpy as np
import matplotlib.pyplot as pl
from sklearn import svm

# we create two clusters of random points: 1000 in one class, 100 in the other
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
          0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * (n_samples_1) + [1] * (n_samples_2)

# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]

# get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)
ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]

# plot separating hyperplanes and samples
h0 = pl.plot(xx, yy, 'k-', label='no weights')
h1 = pl.plot(xx, wyy, 'k--', label='with weights')
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.Paired)
pl.legend()
pl.axis('tight')
pl.show()
In particular, in sklearn you can simply turn on automatic weighting by setting class_weight='balanced' (older versions called this class_weight='auto').
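Tying this back to the C/n_i rule above: scikit-learn sets the effective penalty of class i to class_weight[i] * C, so you can spell the rule out explicitly or delegate it (a sketch reusing the variables from the example):
# manual weights following the C / n_i rule
weights = {0: 1.0 / n_samples_1, 1: 1.0 / n_samples_2}
clf_manual = svm.SVC(kernel='linear', C=1.0, class_weight=weights).fit(X, y)

# or let sklearn pick weights inversely proportional to class frequencies
clf_balanced = svm.SVC(kernel='linear', class_weight='balanced').fit(X, y)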
This paper describes a variety of techniques. One simple approach (though often a poor fit for SVMs) is to just replicate the minority class(es) until the classes are balanced:
http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf
