Understanding asymmetrically quantized matrix multiplication - machine-learning

I am trying to implement the equation for asymmetrically quantized matrix multiplication from the paper https://arxiv.org/abs/2106.08295 (equation 13).
In this scheme, real-valued matrices are represented as an affine map of integer matrices with a scale factor s and a zero point z. The terms in blue can be pre-computed.
Computing each of the final terms is simple enough, but I do not understand how they are supposed to be summed together. The first term, s_w * s_x * W_int * X_int, is a matrix multiplication of integer values, so if the input matrices W and X have dimensions m x n and n x p respectively, the result is an m x p matrix. The second term, s_w * z_w * s_x * X_int, would have dimensions n x p, so I do not understand how it could be subtracted from the first term.
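For concreteness, here is a minimal NumPy sketch of how the four terms of such a decomposition end up shape-compatible, assuming per-tensor (scalar) zero points: the cross terms reduce over the shared dimension n and then broadcast to m x p. All shapes and values below are illustrative, not taken from the paper.

```python
import numpy as np

# Illustrative shapes and quantization parameters (not from the paper)
m, n, p = 4, 5, 3
rng = np.random.default_rng(0)
W_int = rng.integers(0, 256, size=(m, n))
X_int = rng.integers(0, 256, size=(n, p))
s_w, z_w = 0.02, 128   # weight scale and zero point (per-tensor scalars)
s_x, z_x = 0.05, 128   # input scale and zero point

# Reference: dequantize first, then multiply in floating point
ref = (s_w * (W_int - z_w)) @ (s_x * (X_int - z_x))   # m x p

# Expanded form: every term is (or broadcasts to) m x p
term1 = W_int @ X_int                            # m x p integer matmul
term2 = z_w * X_int.sum(axis=0)                  # shape (p,), broadcasts over rows
term3 = z_x * W_int.sum(axis=1, keepdims=True)   # shape (m, 1), broadcasts over cols
term4 = n * z_w * z_x                            # scalar
out = s_w * s_x * (term1 - term2 - term3 + term4)

assert np.allclose(ref, out)
```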
I understand this algorithm is implemented in many tinyML applications, e.g. TensorFlow Lite. Does anyone have experience here, or know how it is done?

Related

Vector coefficients based on similarity

I've been looking for a solution to create a recommendation system based on vector similarity.
Basically, I have a few vectors per user, for example:
User1: [0,3,7,8,5] , [3,5,8,2,4] , [1,5,3,9,4]
User2: [3,1,6,7,9] , [2,4,1,3,8] , [7,8,3,3,1]
For every vector I need to calculate a coefficient and, based on that coefficient, differentiate one vector from another. I've found formulas that calculate a coefficient based on the similarity of two vectors, which isn't really what I want. I need a formula that calculates a coefficient per vector; I then do some other calculations with those coefficients. Are there any good formulas for this?
Thanks
So going based off your response to my comment: I don't think there's a similarity coefficient measure that will do what you want. Let me explain why...
Similarity coefficients are functions f(x, y) -> c, where x and y are vectors and c is a scalar. Note that f takes two parameters. f(x, y) = f(y, x), but f(x) is meaningless: it's asking for the similarity of x relative to... nothing.
So what? We could just use a function g(x) = f(x, V) where V is a fixed vector. E.g. let V = [1, 1, ..., 1]. Now we have a monadic function that gives us a similarity value for every individual vector. But...
Knowing f(x, y) = c and f(x, z) = c' doesn't tell you a whole lot about f(y, z). Take vectors in 2-space: x = [1, 1], y = [0, 1], z = [1, 0]. A similarity function symmetric in the two dimensions would say f(x, y) = f(x, z), but hopefully not = f(y, z). So our g function above isn't very useful, because knowing how similar two vectors are to V doesn't tell us much about how similar they are to each other.
So what can you do? I think a simple solution to your problem would be a variation of the k nearest neighbors algorithm. It allows you to find vectors close to a given vector (or, if you prefer to find clusters of vectors without specifying a given vector, look up clustering).
EDIT: taking inspiration from Yahya's answer: if your vectors are super huge and kNN or clustering is too difficult, consider principal component analysis or some other method of cutting them down to size (reducing the number of dimensions); just keep in mind that whatever you do will likely be lossy.
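For illustration, a minimal scikit-learn sketch of the nearest-neighbors idea, stacking the example vectors from the question into one matrix (the per-user grouping is ignored here):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# All vectors from the question, stacked row-wise (user grouping ignored)
X = np.array([
    [0, 3, 7, 8, 5], [3, 5, 8, 2, 4], [1, 5, 3, 9, 4],   # User1
    [3, 1, 6, 7, 9], [2, 4, 1, 3, 8], [7, 8, 3, 3, 1],   # User2
])

# For each vector, find its 2 nearest neighbors among the others
nn = NearestNeighbors(n_neighbors=3).fit(X)   # 3 = the vector itself + 2 neighbors
dist, idx = nn.kneighbors(X)
print(idx[:, 1:])   # neighbor indices, excluding each vector itself
```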

Fourier Transform Scaling the magnitude

I have a set of signals S1, S2, ..., SN, for which I am numerically computing the Fourier transforms F1, F2, ..., FN, where the Si's and Fi's are C++ vectors (my computations are in C++).
My computation objective is as follows:
compute the product: F = F1*F2*...*FN
inverse Fourier transform F to get S.
What I numerically observe is that when I try to compute the product, the numbers become either too small or too big.
I thought of scaling F1 by, say, a1 so that overflow or underflow is avoided.
With this scaling, my step 1 above becomes
F' = (F1/a1)*(F2/a2)*...*(FN/aN) = F'1*F'2*...*F'N
When I inverse transform, my final S' will differ from S by a scale factor; the structural form of S will not change. By this I mean only the normalization of S is different.
My questions are:
Is my rationale correct?
If my rationale is correct, then given a C++ vector "Fi", how can I choose a good scale "ai" to scale the Fi's up or down?
Many thanks in advance.
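For what it's worth, the rationale in step 2 can be sanity-checked numerically. A small NumPy sketch (Python rather than C++ for brevity; choosing a_i as the max magnitude of each spectrum is just one illustrative option):

```python
import numpy as np

rng = np.random.default_rng(0)
signals = [rng.standard_normal(64) for _ in range(4)]
spectra = [np.fft.fft(s) for s in signals]

# Reference: the unscaled product and its inverse transform
F = np.prod(spectra, axis=0)
S = np.fft.ifft(F)

# Scaled product: divide each factor by a_i (here: its max magnitude)
scales = [np.abs(Fi).max() for Fi in spectra]
Fs = np.prod([Fi / ai for Fi, ai in zip(spectra, scales)], axis=0)
Ss = np.fft.ifft(Fs)

# S' differs from S only by the known overall factor prod(a_i)
assert np.allclose(S, Ss * np.prod(scales))
```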
You can map the range of the Fi vector onto a new domain, say [a, b]: the minimum value of Fi becomes a and the maximum becomes b. You can scale the vector using this equation:
Fnew[i] = (b - a) / (max - min) * (F[i] - min) + a,
where min is the minimum value of F and max is the maximum value of F.
The only problem now is choosing a and b. I would choose [1, 2], because the values are small (so you won't get overflow easily) but not close to 0 (so you won't get underflow either).
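A quick sketch of this min-max rescaling (in Python for brevity; the question's vectors are complex FFT outputs, so in practice you would apply this to magnitudes or to real and imaginary parts separately, and you should guard against max == min):

```python
def rescale(F, a=1.0, b=2.0):
    """Affinely map the values of F into [a, b]."""
    lo, hi = min(F), max(F)
    if hi == lo:                      # constant vector: nothing to spread out
        return [a for _ in F]
    return [(b - a) / (hi - lo) * (v - lo) + a for v in F]

print(rescale([3.0, 7.0, 5.0]))       # [1.0, 2.0, 1.5]
```

One caveat worth noting: unlike dividing by a single factor a_i, this affine map also shifts the values, so the product of the rescaled spectra is no longer just a scaled version of the original product.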

Machine Learning: Why xW+b instead of Wx+b?

I have started to learn machine learning, and now I am trying to play around with TensorFlow.
I often see examples like this:
pred = tf.add(tf.mul(X, W), b)
I also saw such a line in a plain NumPy implementation. Why is x*W+b always used instead of W*x+b? Is there an advantage to multiplying the matrices in this order? I see that it is possible (if X, W and b are transposed), but I do not see an advantage. In math class at school we always used Wx+b.
Thank you very much
This is the reason:
By default w is a vector of weights, and in maths a vector is considered a column, not a row.
X is a collection of data. It is an n x d matrix (where n is the number of data points and d the number of features); upper-case X is an n x d matrix and lower-case x a single data point, a 1 x d matrix.
To multiply them correctly, applying each weight to its corresponding feature, you must use X*w+b:
With X*w you multiply every feature by its corresponding weight, and by adding b you add the bias term to every prediction.
If you multiply w*X you multiply a (1 x d) by an (n x d), which makes no sense: the inner dimensions do not match.
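A tiny NumPy sketch of the shapes (values are illustrative):

```python
import numpy as np

n, d = 4, 3                   # 4 data points, 3 features (illustrative)
X = np.random.rand(n, d)      # data matrix: one row per data point
w = np.random.rand(d, 1)      # weight column vector
b = 0.5                       # bias

pred = X @ w + b              # (n, d) @ (d, 1) -> (n, 1): one prediction per row
# w @ X would be (d, 1) @ (n, d): inner dimensions (1 vs n) do not match
```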
I'm also confused by this. I guess it may be a matter of dimensions. For an n x m matrix W and an n-dimensional vector x, xW+b can easily be viewed as mapping an n-dimensional feature to an m-dimensional one, i.e., you can think of W as an n-dimensional -> m-dimensional operation, whereas with Wx+b (x must now be an m-dimensional vector) it becomes an m-dimensional -> n-dimensional operation, which looks less natural in my opinion. :D

Dimensionality in Isomap

Given D-dimensional data (D features), what is the maximum number of projections (d dimensions) in Isomap?
Usually for PCA or LDA you have the same number of possible components as features... In Isomap it seems to be possible to have more (?)
Imagine that your input consists of N data vectors, each of which is D-dimensional.
Isomap constructs an N x N matrix by finding near neighbors and then looking for paths between near neighbors to connect far-away ones; it then solves an eigenvalue problem on this N x N matrix to define the projections.
So, IF you have more data vectors than you have dimensions in each data vector, THEN you may have more projections than you have input dimensions.
An easy data set to explore this is:
create 100 points randomly on a circle in 2D. Then N = 100, D = 2.
The Isomap projections and coefficients will be non-zero for the first 99 projections.
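A quick way to reproduce this with scikit-learn (a sketch; it relies on Isomap posing the eigenproblem on the N x N graph-distance matrix, which is why n_components may exceed D):

```python
import numpy as np
from sklearn.manifold import Isomap

# 100 random points on a circle in 2D: N = 100, D = 2
theta = np.random.default_rng(0).uniform(0, 2 * np.pi, 100)
X = np.column_stack([np.cos(theta), np.sin(theta)])

# More components than input features: allowed, because the
# eigenproblem lives on the N x N matrix, not the D x D one
emb = Isomap(n_neighbors=10, n_components=10).fit_transform(X)
print(emb.shape)   # (100, 10)
```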

Deriving the Inverse Filter of Image Convolution Kernel

Does anyone have an idea how to calculate the inverse of a 2-D filter?
Let's say I have a 3x3 filter:
0 1 0
1 1 1
0 1 0
I want to find its inverse.
It's easy to do using DFT.
But let's say I want to do it by convolution.
Now, that's the problem: MATLAB's symbolic math isn't my specialty.
Assuming there is a 3x3 inverse filter, the convolution of the two should result in:
0 0 0
0 1 0
0 0 0
The problem is creating the right set of equations for that and solving them.
It is easy to think about symbolically, yet I couldn't do it.
Any ideas?
Thanks.
P.S.
I'm not sure there is an inverse filter for this one, as it has zeros in its DTFT.
Moreover, someone should allow LaTeX in this forum the way MathOverflow does.
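For reference, a small NumPy/SciPy sketch of the equation set-up described above, solved with least squares, since an exact 3x3 inverse need not exist (per the DTFT-zeros caveat in the P.S.); the 'same'-size output is taken to match the 3x3 target shown above:

```python
import numpy as np
from scipy.signal import convolve2d

f = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]], dtype=float)

# Each unit impulse in g contributes one column of the linear system A*g = d
cols = []
for i in range(3):
    for j in range(3):
        e = np.zeros((3, 3))
        e[i, j] = 1.0
        cols.append(convolve2d(f, e, mode='same').ravel())
A = np.stack(cols, axis=1)

d = np.zeros(9)
d[4] = 1.0                             # the 3x3 delta, flattened
g, *_ = np.linalg.lstsq(A, d, rcond=None)
print(g.reshape(3, 3))                 # best 3x3 approximation in the least-squares sense
```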
Let h[n] be the finite impulse response of a 1D filter. What does that imply about its inverse filter? The inverse cannot possibly be FIR.
Proof: Let H(omega) G(omega) = 1 for all omega, where H is the DTFT of h[n], and G is the DTFT of g[n]. If h[n] is FIR, then g[n] must be IIR.
Of course, there are ways to approximate the inverse IIR filter with an FIR filter. A basic method is adaptive filtering, e.g., Least Mean Squares (LMS) algorithm. Or just truncate the IIR filter. You still need to worry about stability, though.
For practical purposes, there probably is no desirable solution to your specific question. Especially if, for example, this is in image processing and you are trying to inverse an FIR blur filter with FIR sharpening filter. The final image will not look that good, period, unless your sharpening filter is really, really large.
EDIT: Let y[n] = b0 x[n-0] + b1 x[n-1] + ... + bN x[n-N]. Let this equation characterize the forward system, where y is the output and x is the input. By definition, the impulse response is the output when the input is an impulse: h[n] = b0 d[n-0] + b1 d[n-1] + ... + bN d[n-N]. This impulse response has finite length N+1.
Now, consider the inverse system, where x is the output and y is the input. Its impulse response g[n] is described by the recurrence equation d[n] = b0 g[n] + b1 g[n-1] + ... + bN g[n-N]. Equivalently, b0 g[n] = d[n] - b1 g[n-1] - ... - bN g[n-N].
Without loss of generality, assume that b0 and bN are both nonzero. Suppose g[n] had finite length, with its last nonzero sample at index M. Evaluating the recurrence at n = M + N, where d[M+N] and the terms g[M+1], ..., g[M+N-1] all vanish, gives b0 g[M+N] = -bN g[M], which is nonzero; this contradicts M being the last nonzero sample. Because this system has feedback, its impulse response is infinitely long. QED.
Causality does not matter. The inverse of a delay is an advance, and vice versa. Neither a delay nor an advance alters the finiteness of an impulse response. Shift an infinite impulse response left or right; it is still infinite.
EDIT 2: For clarification, this proof is not related to my original "proof". One was in frequency domain, the other in time domain.
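To see the recurrence in action, here is a tiny Python sketch for the FIR filter h = [1, 1], i.e. b0 = b1 = 1 (illustrative):

```python
# Inverse impulse response g[n] built from the recurrence
# b0*g[n] = d[n] - b1*g[n-1] - ... - bN*g[n-N]
b = [1.0, 1.0]
g = []
for n in range(10):
    d = 1.0 if n == 0 else 0.0
    acc = d - sum(b[k] * g[n - k] for k in range(1, len(b)) if n - k >= 0)
    g.append(acc / b[0])
print(g)   # [1.0, -1.0, 1.0, -1.0, ...]: it never dies out, i.e. the inverse is IIR
```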
In practice, one useful solution is Wiener deconvolution http://en.wikipedia.org/wiki/Wiener_deconvolution which basically, in Fourier space, divides by the spectrum of the given filter. Zeros are handled by adding a fudge term: instead of 1/H(w), use conj(H(w))/(abs(H(w))^2 + c), where H(w) is the discrete Fourier transform of the filter h(x) (add a 2nd dimension as you like), "w" stands for omega, and conj denotes the complex conjugate. The constant c is chosen based on the noise level of the signal or image.
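As a sketch (hypothetical image size, assuming circular convolution since everything goes through FFTs on a fixed grid):

```python
import numpy as np

# Wiener deconvolution for the 3x3 cross kernel above
h = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]], dtype=float)

shape = (64, 64)                 # hypothetical image size
H = np.fft.fft2(h, shape)        # filter spectrum on that grid
c = 1e-2                         # fudge constant, tuned to the noise level
G = np.conj(H) / (np.abs(H) ** 2 + c)

# To deblur an image y of the same shape:
# x_hat = np.real(np.fft.ifft2(np.fft.fft2(y) * G))
```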
For a separable filter (i.e., one from which you can pull out a horizontal and a vertical filter that can be applied in either order and give the same result as the composite 2-D filter), you could attempt to calculate the inverse of each one individually.
The inverse of a filter H(w) is G(w) = 1/H(w), so one way to do it is to take the DFT of the impulse response (the h[n] time-domain coefficients), invert it point-wise, and inverse-DFT the result. It's not always easy to get an analytic expression for such a filter, so you could either compute it numerically (approximating to the desired precision) or do adaptive inverse filtering, as Steve suggested. See Widrow and Stearns's Adaptive Signal Processing for more info on this latter method.
One way is to write a matrix representation of convolution with your filter and then try to find some (regularized) inverse of this matrix.
For example, the 1D filter [1, -1] is perhaps the simplest discrete differential approximation there is.
Its matrix has two diagonals: 1 on the main diagonal and -1 on the first off-diagonal.
Its inverse is an integrator filter: an IIR filter with an infinite trail of 1s. Since our matrix has limited size, in practice this means filling the matrix with 1s everywhere on one side of the diagonal.
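A small NumPy sketch of this for the [1, -1] example (finite matrix size, as noted above):

```python
import numpy as np

n = 6
# Convolution matrix of the 1D filter [1, -1]: 1 on the main diagonal,
# -1 on the first off-diagonal (a discrete difference operator)
D = np.eye(n) - np.eye(n, k=-1)

# Its inverse is lower-triangular ones: a running sum (integrator)
Dinv = np.linalg.inv(D)
print(np.allclose(Dinv, np.tril(np.ones((n, n)))))  # True
```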
Deriving the Inverse Kernel of a Given 2D Convolution Kernel
This is basically a generalization of the question - Deriving the Inverse Filter of Image Convolution Kernel.
Problem Formulation
Given a Convolution Kernel $ f \in \mathbb{R}^{m \times n} $ find its inverse kernel, $ g \in \mathbb{R}^{p \times q} $ such that $ f \ast g = h = \delta $.
Solution I
One could build the Matrix Form of the Convolution Operator.
The matrix form can replicate the various modes of image filtering (the modes are mutually exclusive):

Boundary Conditions (applied by padding the image and applying the convolution in 'Valid' mode):
- Zero Padding: padding the image with zeros.
- Circular: using circular / periodic continuation of the image; matches frequency domain convolution.
- Replicate: replicating the edge values (nearest neighbor); usually creates the least artifacts in the real world.
- Symmetric: mirroring the image along the edges.

Convolution Shape:
- Full: the output size is the full extent, $ \left( m + p - 1 \right) \times \left( n + q - 1 \right) $.
- Same: the output size is the same as the input ("Image") size.
- Valid: the output size is the size of the full overlap of the image and kernel.
When using image filtering the output size matches the input size, hence the matrix form is square and the inverse is defined.
Using the convolution matrix, the matrix might not be square (unless "Same" is chosen), hence the pseudo-inverse should be derived.
One should notice that while the input matrix is sparse, the inverse matrix isn't.
Moreover, while the convolution matrix has a special form (Toeplitz, neglecting the boundary conditions), the inverse won't.
Hence this solution is more accurate than the next one, yet it also uses a higher degree of freedom (the solution isn't necessarily in Toeplitz form).
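As a sketch of Solution I (Python/NumPy rather than MATLAB; the grid size is illustrative, and the convolution matrix is built column by column from unit impulses):

```python
import numpy as np
from scipy.signal import convolve2d

f = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]], dtype=float)

# Build the matrix of the 'Full' convolution operator: each column is the
# (flattened) response to one unit impulse on an r x c input grid
r, c = 8, 8
cols = []
for i in range(r):
    for j in range(c):
        e = np.zeros((r, c))
        e[i, j] = 1.0
        cols.append(convolve2d(e, f, mode='full').ravel())
A = np.stack(cols, axis=1)       # ((r+2)*(c+2)) x (r*c): not square

# Non-square, hence the pseudo-inverse; note it is dense, unlike A
A_pinv = np.linalg.pinv(A)
print(A.shape, A_pinv.shape)     # (100, 64) (64, 100)
```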
Solution II
In this solution the inverse is derived using minimization of the following cost function:
$$ \arg \min_{g} \frac{1}{2} {\left\| f \ast g - h \right\|}_{2}^{2} $$
The derivative is given by:
$$ \frac{\partial \frac{1}{2} {\left\| f \ast g - h \right\|}_{2}^{2} }{\partial g} = f \star \left( f \ast g - h \right) $$
where $ \star $ is the correlation operation.
In practice the convolution in the objective function is done in 'full' mode (MATLAB idiom).
In practice it is done as:
hObjFun = @(mG) 0.5 * sum((conv2(mF, mG, 'full') - mH) .^ 2, 'all');
mObjFunGrad = conv2(conv2(mF, mG, 'full') - mH, mF(end:-1:1, end:-1:1), 'valid');
where the code flips mF for correlation and uses 'valid' mode (in matrix form it is the adjoint / transpose, hence the output is smaller).
The optimization problem is Strictly Convex and can be solved easily using Gradient Descent:
for ii = 1:numIterations
    mObjFunGrad = conv2(conv2(mF, mG, 'full') - mH, mF(end:-1:1, end:-1:1), 'valid');
    mG = mG - (stepSize * mObjFunGrad);
end
Full code is available in my StackExchange Code StackOverflow Q2080835 GitHub repository (have a look at the StackOverflow\Q2080835 folder).
