Compute chi-square distance in Python - image-processing

I'm using the KNN model from sklearn (documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) to train a model for image classification. As you can see in the documentation, there is no option to pass the chi-square distance as a metric to the KNeighborsClassifier function, but there is an option to pass a callable, so I can pass a function that I built to calculate the chi-square distance. So I tried to write my own function.
I know that for two images, the chi-square distance is calculated according to this formula:
chi2(A, B) = 0.5 * sum_i ((A_i - B_i)^2 / (A_i + B_i))
My task is to solve this problem without using for loops, as they take too long. I need to solve it using vectorization: the data passed to the chi-square function are images, so if I have two images represented as np arrays, I can do math operations on them without loops. For example, for A = [1, 2, 3] and B = [3, 4, 5], A + B just gives [4, 6, 8]; no loop is needed to calculate this. In the same way I need to calculate the chi-square function.
Anyway, when I tried, for example, this function:
def chi2(A, B):
    # compute the chi-squared distance using the formula above
    chi = 0.5 * (((A - B) ** 2) / (A + B))
    return chi
to calculate the chi-square distance, I get an error, and if I try other similar functions, for example this code:
def chi2_distance(A, B):
    # compute the chi-squared distance using the formula above
    chi = 0.5 * np.sum([((a - b) ** 2) / (a + b)
                        for (a, b) in zip(A, B)])
    return chi
I get warnings:
RuntimeWarning: invalid value encountered in double_scalars
  for (a, b) in zip(A, B)])
and the program runs practically forever.
Any suggestions for efficient code to calculate the chi-square distance (as I said, with no loops)?
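For what it's worth, here is a minimal vectorized sketch (my own illustration; the eps guard is an assumption): it flattens both arrays, adds a small eps so that bins where A + B == 0 don't produce 0/0, and returns a scalar, which is what a callable metric for KNeighborsClassifier must return:

import numpy as np

def chi2_distance(A, B, eps=1e-10):
    # vectorized chi-squared distance: 0.5 * sum((A - B)^2 / (A + B))
    A = np.asarray(A, dtype=np.float64).ravel()
    B = np.asarray(B, dtype=np.float64).ravel()
    return 0.5 * np.sum(((A - B) ** 2) / (A + B + eps))

The RuntimeWarning above is what NumPy emits for 0/0, which happens wherever a + b == 0, and the first function fails because it returns an array rather than a single number.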

Related

Efficient pseudo-inverse for PyTorch 2D convolution

Background:
Thanks for your attention! I am learning the basics of 2D convolution, linear algebra and PyTorch. I have run into an implementation problem with the pseudo-inverse of the convolution operator. Specifically, I have no idea how to implement it in an efficient way. Please see the following problem statement for details. Any help/tip/suggestion is welcome.
(Thanks a lot for your attention!)
The Original Problem:
I have an image feature x with shape [b,c,h,w] and a 3x3 convolutional kernel K with shape [c,c,3,3]. There is y = K * x. How can I implement the corresponding pseudo-inverse on y in an efficient way?
Given y = K * x = Ax, how can I implement x_hat = (A^+)y?
I guess there should be some operations using torch.fft. However, I still have no idea how to implement it, and I do not know whether an implementation already exists.
import torch
import torch.nn.functional as F
c = 32
K = torch.randn(c, c, 3, 3)
x = torch.randn(1, c, 128, 128)
y = F.conv2d(x, K, padding=1)
print(y.shape)
# How to implement pseudo-inverse for y = K * x in an efficient way?
Some of My Efforts:
I know that 2D convolution is a linear operator; it is equivalent to a "matrix product" operator. We could actually write out the matrix form of the convolution and calculate its pseudo-inverse. However, I think this type of operation would be inefficient, and I have no idea how to implement it in an efficient way.
According to Wikipedia, the pseudo-inverse may satisfy the property A(A_pinv(x)) = x, where A is the convolutional operator, A_pinv is its pseudo-inverse, and x may be any image feature.
(Thanks again for reading such a long post!)
This takes the problem to another level.
The convolution itself is a linear operation: you can determine the matrix of the operation and solve a least-squares problem directly [1], or compute the pseudo-inverse as you mentioned and then apply it to different outputs, predicting a projection of the input.
I am changing your code to use padding=0:
import torch
import torch.nn.functional as F
# your code
c = 32
K = torch.randn(c, c, 3, 3)
x = torch.randn(4, c, 128, 128)
y = F.conv2d(x, K, bias=torch.zeros((c,)))
Also, as you probably already suggested, the convolution can be computed as ifft(fft(h)*fft(x)). However, conv2d actually computes a cross-correlation, so you have to conjugate the filter, leading to ifft(conj(fft(h))*fft(x)). You also have to apply this along two axes, and you have to make sure both FFTs are calculated at the same size; since the data is real, we can use the multi-dimensional real FFT. To be complete, conv2d works on multiple channels, so we have to calculate sums of convolutions. Since the FFT is linear, we can simply compute those sums in the frequency domain using einsum.
s = x.shape[-2:]
K_f = torch.fft.rfftn(K, s)
x_f = torch.fft.rfftn(x, s)
# sum over input channels in the frequency domain; conj because conv2d is a cross-correlation
y_f = torch.einsum('jkxy,ikxy->ijxy', K_f.conj(), x_f)
y_hat = torch.fft.irfftn(y_f, s)
Except for the borders it should be accurate (remember FFT computes a cyclic convolution).
torch.max(abs(y_hat[:,:,:-2,:-2] - y[:,:,:,:]))
Now, notice the pattern jk,ik->ij in the einsum: it means y_f[i,j] = sum(K_f[j,k] * x_f[i,k]), i.e. y_f = x_f @ K_f.T, if @ is the matrix product over the first two dimensions. So to invert this operation we can interpret the first two dimensions as matrices. The function pinv computes pseudo-inverses on the last two axes, so in order to use it we have to permute the axes. If we right-multiply the output by the pseudo-inverse of the transposed K_f, we should invert the operation.
s = 128, 128
K_f = torch.fft.rfftn(K, s)
# pinv works on the last two axes, so transpose to put the channel axes there, then undo
K_f_inv = torch.linalg.pinv(K_f.T).T
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
# relative mean squared reconstruction error
print(torch.mean((x - x_hat)**2) / torch.mean(x**2))
Notice that I am using the full (cyclic) convolution, but conv2d actually crops the image borders. Let's apply that:
k = 3  # kernel size, as in the question
y_hat[:,:,128-(k-1):,:] = 0
y_hat[:,:,:,128-(k-1):] = 0
Repeating the calculation, you will see that the recovered input is not accurate anymore, so you have to be careful about what you do with your convolution; but in situations where you can make this work, it will indeed be efficient.
s = 128,128
K_f = torch.fft.rfftn(K, s)
K_f_inv = torch.linalg.pinv(K_f.T).T
y_f = torch.fft.rfftn(y_hat, s)
x_f = torch.einsum('jkxy,ikxy->ijxy', K_f_inv.conj(), y_f)
x_hat = torch.fft.irfftn(x_f, s)
print(torch.mean((x - x_hat)**2) / torch.mean((x)**2))

How tf.gradients works in TensorFlow

Given a linear model like the following, I would like to get the gradient vector with respect to W and b.
# tf Graph Input
X = tf.placeholder("float")
Y = tf.placeholder("float")
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred = tf.add(tf.mul(X, W), b)
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
However, if I try something like the following, where cost is a function cost(x, y, w, b) and I only want the gradients with respect to w and b:
grads = tf.gradients(cost, tf.all_variables())
my placeholders (X and Y) will also be included.
Even if I do get a gradient with [x, y, w, b], how do I know which element in the gradient belongs to which parameter, since it is just a list with no names indicating which parameter each derivative was taken with respect to?
In this question I'm using parts of this code and I build on this question.
Quoting the docs for tf.gradients
Constructs symbolic partial derivatives of sum of ys w.r.t. x in xs.
So, this should work:
dc_dw, dc_db = tf.gradients(cost, [W, b])
Here, tf.gradients() returns the gradient of cost wrt each tensor in the second argument as a list in the same order.
Read tf.gradients for more information.
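For completeness, a minimal end-to-end sketch (a hedged illustration, not part of the original answer: TF1-style graph mode, tf.multiply in place of the older tf.mul, and an arbitrary n_samples value, which the question assumes is defined elsewhere):

import numpy as np
import tensorflow as tf

rng = np.random
n_samples = 17  # arbitrary value for the sake of the example

X = tf.placeholder("float")
Y = tf.placeholder("float")
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")

pred = tf.add(tf.multiply(X, W), b)
cost = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)

# gradients of cost w.r.t. W and b only, returned in the same order as the list
dc_dw, dc_db = tf.gradients(cost, [W, b])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    gw, gb = sess.run([dc_dw, dc_db],
                      feed_dict={X: rng.rand(n_samples), Y: rng.rand(n_samples)})
    print("dc/dW =", gw, "dc/db =", gb)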

Can you give me a short step-by-step numerical example of the radial basis function kernel trick? I would like to understand how to apply it to a perceptron

I understand the perceptron well, so please focus only on the kernel. I am not familiar with mathematical expressions, so please give me a numerical example and a guide to the kernel.
For example:
My perceptron's hyperplane is x1*w1 + x2*w2 + x3*w3 + b = 0. The RBF kernel formula is k(x,z) = exp(-|x - z|^2 / (2*variance^2)), and this is where the radial basis function kernel comes into play. Is x an input, and what is the z variable here?
And from what do I have to calculate the variance, if it is the variance in the formula?
I have understood somewhere that I have to plug this formula into the perceptron decision function x1*w1 + x2*w2 + x3*w3 + b = 0; but what does it look like if I plug it in?
I am asking for a numerical example to avoid confusion.
Linear Perceptron
As you know, linear perceptrons can be trained for binary classification. More precisely, if there are n features, x1, x2, ..., xn, in n-dimensional space, Rn, and you want to label points in 2 categories, y1 & y2 (usually -1 and +1), you can use a linear perceptron, which defines a hyperplane w1*x1 + ... + wn*xn + b = 0 to do so.
w1*x1 + ... + wn*xn + b > 0 or W.X + b > 0 ==> class = y1
w1*x1 + ... + wn*xn + b < 0 or W.X + b < 0 ==> class = y2
A linear perceptron will work well only if the problem is linearly separable in Rn. For example, in 2D space, this means that one line can separate the 2 sets of points.
Algorithm
One common algorithm to train the perceptron, i.e., to find the weights and bias, w's & b, based on N data points, X1, ..., XN, and their labels, Y1, ..., YN, is the following:
Initialize: W = zeros(n,1); b = 0
For i=1 to N:
    Calculate F(Xi) = W.Xi + b
    If F(Xi)*Yi <= 0:
        W <--- W + Xi*Yi
        b <--- b + Yi
This will give the final values for W & b. Besides, based on the training, W will be a linear combination of the training points, Xi's; more precisely, the ones that were misclassified. So W = a1*X1 + ... + aN*XN, where the a's are in {0, y1, y2}.
Now, if there is a new point, let's say Z, to label, we check the sign of F(Z) = W.Z + b = a1*(X1.Z) + ... + aN*(XN.Z) + b. It is interesting that only the inner products of the new point with the training points take part in it.
Kernel Perceptron
Now, if the problem is not linearly separable, one may try to go to a higher-dimensional space in which a hyperplane can do the classification. As an example, consider a circle in 2D space. The points inside and outside of the circle can't be separated by a line. However, if you find a transformation that takes the points to 3D space such that the first 2 coordinates remain the same for all points, and the 3rd coordinate becomes +1 for the points inside the circle and -1 for those outside, then the plane defined by 3rd coordinate = 0 can separate the points.
Finding such transformations can be difficult and computationally heavy, so the kernel trick is introduced. Notice that we only used the inner product of new points with the training points. The kernel trick employs this fact and defines the inner product of the transformed points without actually finding the transformation.
If the unknown transformation is P(X), then the kernel function will be:
K(Xi,Xj) = <P(Xi),P(Xj)>. So instead of finding P, kernel functions are defined which represent the scalar result of the inner product in the high-dimensional space. There are also theorems about which functions can be kernel functions, i.e., correspond to an inner product in another space.
After choosing a kernel function, the algorithm will be modified as follows:
Initialize: F(X) = 0
For i=1 to N:
    Calculate F(Xi)
    If F(Xi)*Yi <= 0:
        F(.) <--- F(.) + K(.,Xi)*Yi + Yi
At the end, F(.) = a1*K(.,X1) + ... + aN*K(.,XN) + b, where the a's are in {0, y1, y2}.
RBF Kernel
The radial basis function is one type of kernel function; it actually computes the inner product in an infinite-dimensional space. It can be written as
K(Xi,Xj) = exp(- norm2(Xi-Xj)^2 / (2*sigma^2))
Sigma is a parameter that you can tune to find an optimum value. For example, you can train the model with different values of sigma and then pick the best one based on performance. You can start with sigma = 1.
After training the model to find F(.), for a new data point Z, the sign of F(Z) = a1*K(Z,X1) + ... + aN*K(Z,XN) + b will determine the class.
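For concreteness, here is a minimal Python sketch of the kernel perceptron above (my own illustration; the function names, the epochs loop, and the toy data are assumptions, not part of the original answer):

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def train_kernel_perceptron(X, Y, sigma=1.0, epochs=10):
    # a[i] accumulates Yi each time Xi is misclassified; b is the bias
    N = len(X)
    a = np.zeros(N)
    b = 0.0
    for _ in range(epochs):
        for i in range(N):
            F_xi = sum(a[j] * rbf_kernel(X[j], X[i], sigma) for j in range(N)) + b
            if F_xi * Y[i] <= 0:
                a[i] += Y[i]
                b += Y[i]
    return a, b

def predict(Z, X, a, b, sigma=1.0):
    # sign of F(Z) = a1*K(Z,X1) + ... + aN*K(Z,XN) + b
    return np.sign(sum(a[j] * rbf_kernel(X[j], Z, sigma) for j in range(len(X))) + b)

# toy data: points inside vs. outside a circle (not linearly separable)
X = np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [1.5, 0.0], [0.0, -1.6], [-1.4, 1.2]])
Y = np.array([1, 1, 1, -1, -1, -1])
a, b = train_kernel_perceptron(X, Y)
# classify a point near the origin (inside) and one far away (outside)
print(predict(np.array([0.05, 0.05]), X, a, b))
print(predict(np.array([1.6, -1.3]), X, a, b))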
Remarks:
Regarding your question about variance, you don't need to find any variance.
About x and z in your question, in each iteration, you should find the kernel output for the current data point and all the previously added points (the points that were misclassified and hence were added to F).
I couldn't come up with a simple instructive numerical example.
References:
I borrowed some notation from http://alex.smola.org/teaching/pune2007/pune_3.pdf

The number of parameters in Gaussian mixture model

I have D-dimensional data with K components.
How many parameters are there if I use a model with full covariance matrices?
And how many if I use diagonal covariance matrices?
Answer by xyLe_ at CrossValidated
https://stats.stackexchange.com/a/229321/127305
Simply do the math.
For each Gaussian you have:
1. A symmetric full DxD covariance matrix, giving (D*D - D)/2 + D parameters ((D*D - D)/2 free off-diagonal elements plus D diagonal elements)
2. A D-dimensional mean vector, giving D parameters
3. A mixing weight, giving another parameter
This results in Df = (D*D - D)/2 + 2D + 1 parameters for each Gaussian.
Given you have K components, you have K*Df parameters.
In the diagonal case the covariance parameters reduce to D, because of the absence of off-diagonal elements.
Thus yielding Df = 2D + 1.
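As a quick sanity check, here is a small sketch of the count in code (the function name is mine; it follows the answer in counting one mixing weight per component):

def gmm_param_count(D, K, full_covariance=True):
    # covariance parameters per component: (D*D - D)/2 + D if full, D if diagonal
    cov = (D * D - D) // 2 + D if full_covariance else D
    # plus D mean parameters and 1 mixing weight per component
    return K * (cov + D + 1)

print(gmm_param_count(D=3, K=2))                         # 2 * (6 + 3 + 1) = 20
print(gmm_param_count(D=3, K=2, full_covariance=False))  # 2 * (3 + 3 + 1) = 14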

How does fitEllipse work in OpenCV?

I am working with OpenCV and I need to understand how the function fitEllipse actually works. I looked at the code at https://github.com/Itseez/opencv/blob/master/modules/imgproc/src/shapedescr.cpp and I know it uses least squares to determine the likely ellipse. I also looked at the paper given in the documentation (Andrew W. Fitzgibbon, R.B. Fisher. A Buyer's Guide to Conic Fitting. Proc. 5th British Machine Vision Conference, Birmingham, pp. 513-522, 1995).
But I cannot understand the algorithm exactly. For example, why does it need to solve the least-squares problem 3 times? Why is bd initialized to 10000 before the first SVD (I guess it is just a value for initialization, but why can this value be arbitrary)? Why do the values in Ad need to be negative before the first SVD?
Thank you!
Here is MATLAB code; it might help.
function [Q,a]=fit_ellipse_fitzgibbon(data)
% function [Q,a]=fit_ellipse_fitzgibbon(data)
%
% Ellipse specific fit, according to:
%
% Direct Least Square Fitting of Ellipses,
% A. Fitzgibbon, M. Pilu and R. Fisher. PAMI 1996
%
%
% See Also:
% FIT_ELLIPSE_LS
% FIT_ELLIPSE_HALIR
[m,n] = size(data);
assert((m==2||m==3)&&n>5);
x = data(1,:)';
y = data(2,:)';
D = [x.^2 x.*y y.^2 x y ones(size(x))]; % design matrix
S = D'*D; % scatter matrix
C(6,6)=0; C(1,3)=-2; C(2,2)=1; C(3,1)=-2; % constraints matrix
% solve the generalized eigensystem
[V,D] = eig(S, C);
% find the only negative eigenvalue
[n_r, n_c] = find(D < 0 & ~isinf(D));
if isempty(n_c),
    warning('Error getting the ellipse parameters, will do LS');
    [Q, a] = fit_ellipse_ls(data);
    return;
end
% the parameters
a = V(:, n_c);
[A B C D E F] = deal(a(1),a(2),a(3),a(4),a(5),a(6)); % deal is slow!
Q = [A B/2 D/2; B/2 C E/2; D/2 E/2 F];
end % fit_ellipse_fitzgibbon
The Fitzgibbon solution has some numerical stability issues, though; see the work of Halir for a solution to this.
It is essentially a least-squares solution, but specifically designed so that it will produce a valid ellipse, not just any conic.
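For reference, here is a rough NumPy/SciPy port of the MATLAB function above (an untested sketch of the same direct fit; the fit_ellipse_ls fallback is omitted):

import numpy as np
from scipy.linalg import eig

def fit_ellipse_fitzgibbon(x, y):
    # design matrix [x^2, xy, y^2, x, y, 1] and scatter matrix
    D = np.column_stack([x**2, x*y, y**2, x, y, np.ones_like(x)])
    S = D.T @ D
    # constraint matrix, same sign convention as the MATLAB code above
    C = np.zeros((6, 6))
    C[0, 2] = C[2, 0] = -2
    C[1, 1] = 1
    # generalized eigensystem S*a = lambda*C*a
    w, V = eig(S, C)
    w = np.real(w)
    # pick the single negative, finite eigenvalue
    i = np.flatnonzero(np.isfinite(w) & (w < 0))[0]
    a = np.real(V[:, i])  # conic coefficients [A, B, C, D, E, F]
    A, B, Cc, Dd, E, F = a
    Q = np.array([[A,    B/2, Dd/2],
                  [B/2,  Cc,  E/2],
                  [Dd/2, E/2, F]])
    return Q, a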
