Shift right every DW in a __m128i by a different amount - sse

I want to shift right every element of a __m128i register by a different amount. I know this is possible by multiplication if we want to shift left, like below:
__m128i mul_constant = _mm_set_epi32(8, 4, 2, 1);
__m128i left_vshift = _mm_mullo_epi32(R, mul_constant);
But, what is the solution if we want to shift it right?

I finally did it as below:
Shifting every element left by a different amount and then right-shifting all four 32-bit elements by 3 gave me what I wanted.
R = _mm_mullo_epi32(R, _mm_set_epi32(1, 2, 4, 8));
R = _mm_srli_epi32(R, 3);
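For reference, here is a generalized sketch of the same trick (my own formulation, not from the original post): to right-shift element i by n_i, multiply it by 1 << (MAX - n_i) and then right-shift every element by MAX, where MAX is the largest count. One caveat, which also applies to the two-liner above: the multiply discards the top MAX - n_i bits of each element, so the result matches a true per-element right shift only when the inputs have that much headroom. Note that _mm_mullo_epi32 requires SSE4.1.

#include <smmintrin.h>  // SSE4.1, for _mm_mullo_epi32

// Sketch: logical right shift of the four 32-bit elements by {3,2,1,0}
// (hypothetical counts, for illustration). Element i is multiplied by
// 1 << (MAX - n_i), then all elements are shifted right by MAX.
static inline __m128i srl_var_epi32(__m128i v)
{
    const int MAX = 3;                        // largest shift count used
    // _mm_set_epi32 takes the highest element first
    __m128i mul = _mm_set_epi32(1 << (MAX - 0),   // element 3: shift by 0
                                1 << (MAX - 1),   // element 2: shift by 1
                                1 << (MAX - 2),   // element 1: shift by 2
                                1 << (MAX - 3));  // element 0: shift by 3
    v = _mm_mullo_epi32(v, mul);              // left-shift each by MAX - n_i
    return _mm_srli_epi32(v, MAX);            // uniform right shift by MAX
}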

Image registration and focus stacking

Background:
I am looking to align images for a focus stacking application using a smartphone. Links to images:
First in stack: 1, Last in stack: 2, Final stacked images: 3
I.e. images are nominally the same, BUT contain:
Systematic change in FOCUS as the focal plane shifts between images
Magnification changes slightly (smartphone feature as focus changes!)
Camera moves slightly due to random vibrations.
Images need to be aligned for the focus-stacking APP to work.
Progress to date:
I use OpenCV's findTransformECC() to get the alignment. It works well after some experimentation; see cv2.MOTION_EUCLIDEAN for the warp_mode in ECC image alignment method, which was useful for improving the initialization of the warp matrix:
Images aligned at pixel level
60 s to process an 8 Mpix image (1 s for a 0.5 Mpix image), on a 3-year-old portable PC with OpenCV release libraries
See stacked image link above.
I briefly investigated a feature detector (SIFT). It did not align the images well, presumably due to the change in focus between images.
Code:
int scale = 1;
int scaleSmall = 4;
float scaleDiff = (float)scaleSmall / scale; // cast avoids integer division
for (i = 0; i < numImages; i++) {
    file = dir + image + to_string(i) + ".jpg";
    col[i] = imread(file);
    resize(col[i], z[i], Size(col[i].cols / scale, col[i].rows / scale));
    cvtColor(z[i], zg[i], CV_BGR2GRAY);
    resize(zg[i], zgSmall[i], Size(col[i].cols / scaleSmall, col[i].rows / scaleSmall));
}
// Set a 2x3 or 3x3 warp matrix depending on the motion model.
// See https://www.learnopencv.com/image-alignment-ecc-in-opencv-c-python/
// Define the motion model
const int warp_mode = MOTION_HOMOGRAPHY;
// Initialize the matrix to identity
if (warp_mode == MOTION_HOMOGRAPHY) {
    warp_init = Mat::eye(3, 3, CV_32F);
    warp_matrix = Mat::eye(3, 3, CV_32F);
    warp_matrix_prev = Mat::eye(3, 3, CV_32F);
    scaleTX = (Mat_<float>(3, 3) << 1, 1, scaleDiff, 1, 1, scaleDiff, 1 / scaleDiff, 1 / scaleDiff, 1);
}
else {
    warp_init = Mat::eye(2, 3, CV_32F);
    warp_matrix = Mat::eye(2, 3, CV_32F);
    warp_matrix_prev = Mat::eye(2, 3, CV_32F);
    scaleTX = (Mat_<float>(2, 3) << 1, 1, scaleDiff, 1, 1, scaleDiff);
}
// Specify the number of iterations.
int number_of_iterations = 5000;
// Specify the threshold of the increment
// in the correlation coefficient between two iterations
double termination_eps = 1e-8;
// Define termination criteria
TermCriteria criteria(TermCriteria::COUNT + TermCriteria::EPS, number_of_iterations, termination_eps);
for (i = 1; i < numImages; i++) {
    // Check the images are a sensible size
    if (zg[0].rows < 10 || zg[i].rows < 10)
        return;
    // Run the ECC algorithm on the small images first to get an initial guess.
    // The result is stored in warp_init.
    if (i == 1) {
        findTransformECC(zgSmall[0], zgSmall[i], warp_init, warp_mode, criteria);
        // See https://stackoverflow.com/questions/45997891/cv2-motion-euclidean-for-the-warp-mode-in-ecc-image-alignment-method
        warp_matrix = warp_init * scaleTX;
    }
    // The warp matrix from the previous iteration is used as the initialisation
    findTransformECC(zg[0], zg[i], warp_matrix, warp_mode, criteria);
    if (warp_mode != MOTION_HOMOGRAPHY) {
        warpAffine(zg[i], ag[i], warp_matrix, zg[i].size(), INTER_LINEAR + WARP_INVERSE_MAP);
        warpAffine(z[i], acol[i], warp_matrix, zg[i].size(), INTER_LINEAR + WARP_INVERSE_MAP);
    }
    else {
        // Use warpPerspective for homography
        warpPerspective(z[i], acol[i], warp_matrix, z[i].size(), INTER_LINEAR + WARP_INVERSE_MAP);
        warpPerspective(zg[i], ag[i], warp_matrix, zg[i].size(), INTER_LINEAR + WARP_INVERSE_MAP);
    }
}
Question:
Can the image registration speed be significantly improved (using the same hardware)?
There are at least three improvements that can be made (a sketch of the gradient-domain idea follows this list):
1. 5000 iterations may be unnecessary; try limiting the count to 500. Moreover, transforming the images to the gradient domain may help. See the GetGradient function from this tutorial.
2. You can assume that perspective effects are negligible, so you can change warp_mode to MOTION_AFFINE to limit the degrees of freedom from 8 to 6.
3. You can also try another, much faster approach that is based on phase correlation (frequency domain). In its standard form it only estimates translation between images, but you can transfer them to log-polar space to get translation, rotation and scale invariance. This code implements the third approach.
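As a sketch of the gradient-domain idea (my own, assuming OpenCV's Sobel; GetGradient is the name used in the linked tutorial, but this body is an assumption, not a copy of it):

#include <opencv2/opencv.hpp>

// Sketch: align on gradient magnitudes instead of raw intensities, which
// tends to be more robust to the focus change between the images.
static cv::Mat GetGradient(const cv::Mat& gray)
{
    cv::Mat gx, gy, mag;
    cv::Sobel(gray, gx, CV_32FC1, 1, 0, 3);  // horizontal derivative
    cv::Sobel(gray, gy, CV_32FC1, 0, 1, 3);  // vertical derivative
    cv::magnitude(gx, gy, mag);              // sqrt(gx^2 + gy^2)
    return mag;
}

// Hypothetical usage in the loop above, with fewer iterations:
// findTransformECC(GetGradient(zg[0]), GetGradient(zg[i]),
//                  warp_matrix, MOTION_AFFINE, criteria);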

How to implement decay towards zero in signed fixed point math, in sse?

There are many decay-like physical events (for example body friction or charge leak) that are usually modelled iteratively as x' = x * 0.99, which is usually very easy to write in floating-point arithmetic.
However, I need to do this in 16-bit "8.8" signed fixed point, in SSE. For an efficient implementation on a typical ALU, the formula above can be rewritten as x = x - x/128, or x = x - (x>>7), where >> is an "arithmetic", sign-extending right shift.
And I'm stuck here, because _mm_sra_epi16() produces totally counterintuitive behaviour, which is easily verified by the following example:
#include <cstdint>
#include <iostream>
#include <emmintrin.h>
using namespace std;

int main(int argc, char** argv) {
    cout << "required: ";
    for (int i = -1; i < 7; ++i) {
        cout << hex << (0x7fff >> i) << ", ";
    }
    cout << endl;
    cout << "produced: ";
    __m128i a = _mm_set1_epi16(0x7fff);
    __m128i b = _mm_set_epi16(-1, 0, 1, 2, 3, 4, 5, 6);
    auto c = _mm_sra_epi16(a, b);
    for (auto i = 0; i < 8; ++i) {
        cout << hex << c.m128i_i16[i] << ", ";
    }
    cout << endl;
    return 0;
}
Output would be as follows:
required: 0, 7fff, 3fff, 1fff, fff, 7ff, 3ff, 1ff,
produced: 0, 0, 0, 0, 0, 0, 0, 0,
It only applies the first shift count to all elements, as if it were actually an _mm_sra1_epi16 function, accidentally named sra and given a __m128i second argument by some funny clause for no reason. So this cannot be used in SSE.
On the other hand, I've heard that the division algorithm is enormously complex, and thus _mm_div_epi16 is absent in SSE and also cannot be used.
What to do and how to implement/vectorize that popular "decay" technique?
x -= x>>7 is trivial to implement with SSE2, using a constant shift count for efficiency. This compiles to 2 instructions if AVX is available, otherwise a movdqa is needed to copy v before a destructive right-shift.
__m128i downscale(__m128i v) {
    __m128i dec = _mm_srai_epi16(v, 7);
    return _mm_sub_epi16(v, dec);
}
GCC even auto-vectorizes it (Godbolt).
void foo(short *__restrict a) {
    for (int i = 0; i < 10240; i++) {
        a[i] -= a[i] >> 7;   // inner loop uses the same psraw / psubw
    }
}
Unlike float, fixed-point has constant absolute precision over the full range, not constant relative precision. So for small positive numbers, v>>7 will be zero and your decrement will stall. (Negative inputs behave differently: the shift underflows to -1 rather than 0, because arithmetic right shift rounds towards -infinity, so small negative values keep decaying to 0.)
For small inputs, where the shift result can be 0, you might want to OR with _mm_set1_epi16(1) to make sure the decrement is non-zero. This has a negligible effect on large-ish inputs. However, it will eventually make a downscale chain go from 0 to -1, and then back up to 0, because -1 | 1 == -1 in 2's complement.
__m128i downscale_nonzero(__m128i v) {
    __m128i dec = _mm_srai_epi16(v, 7);
    dec = _mm_or_si128(dec, _mm_set1_epi16(1));
    return _mm_sub_epi16(v, dec);
}
If you start negative, the sequence would be: large negative, decaying logarithmically until about -128, then linearly through -4, -3, -2, -1, 0, then -1, 0, -1, ...
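To make the stall and the 0/-1 oscillation concrete, here is a tiny scalar model of the two update rules (my own illustration, in plain C++ rather than intrinsics):

#include <cstdint>
#include <cstdio>

int main()
{
    int16_t plain = 5, nonzero = 5;            // small positive start values
    for (int step = 0; step < 10; ++step) {
        std::printf("plain=%d  nonzero=%d\n", plain, nonzero);
        plain   -= plain >> 7;                 // stalls: 5 >> 7 == 0
        nonzero -= (nonzero >> 7) | 1;         // forced decrement: reaches 0,
    }                                          // then oscillates 0, -1, 0, ...
    return 0;
}

With a start value of 5, the plain rule never moves, while the forced-decrement rule counts down 5, 4, 3, 2, 1, 0 and then oscillates between 0 and -1, exactly as described above.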
Your code got all-zeros because _mm_sra_epi16 uses the low 64 bits of the 2nd source vector as a 64-bit shift count that applies to all elements. Read the manual. So you shifted all the bits out of each 16-bit element.
It's not idiotic, but per-element shift counts require AVX2 (for 32/64-bit elements) or AVX512BW for _mm_srav_epi16, or 64-bit arithmetic right shifts, which would make sense for the way you're trying to use it. (But the shift count is unsigned, so -1 is also going to shift out all the bits.)
Indeed, that instruction should be named _mm_sra1_epi16()
Yup, that would make sense. But remember that when these were named, AVX2 _mm_srav_* didn't exist yet. Also, that specific name would not be ideal because 1 and i are not the most visually distinct. (i for immediate, for the psraw xmm1, imm8 form instead of the psraw xmm1, xmm2/m128 form of the asm instruction: http://felixcloutier.com/x86/PSRAW:PSRAD:PSRAQ.html).
The other way it makes sense is that the MMX/SSE2 asm instruction has two forms: immediate (with the same count for all elements, of course), and vector. Instead of forcing you to broadcast the count to all elements, the vector version takes the scalar count in the bottom 64 bits of a vector register. I think the intended use-case is after a movd xmm0, eax or something; see the sketch below.
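As an illustration of that intended pattern, a minimal sketch (my own; _mm_cvtsi32_si128 is the movd intrinsic):

#include <emmintrin.h>

// Sketch: apply one runtime shift count uniformly to all eight 16-bit
// elements. _mm_cvtsi32_si128 zero-extends the count into the low 64 bits
// of a vector, which is exactly what the psraw xmm, xmm form consumes.
static inline __m128i sra_all_epi16(__m128i v, int count)
{
    return _mm_sra_epi16(v, _mm_cvtsi32_si128(count));
}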
If you need per-element-variable shift counts without AVX512, see various Q&As about emulating it, e.g. Shifting 4 integers right by different values SIMD.
Some of the workarounds use multiplies by powers of 2 for variable left-shift, and then a right shift to put the data where needed. (But you need to somehow get the 1<<n SIMD vector prepared, so this works if the same set of counts is reused for many vectors, or especially if it's a compile-time constant).
With 16-bit elements, you can use just one _mm_mulhi_epi16 to do runtime-variable right-shift counts with no precision loss or range limits. mulhi(x, y) is exactly like (x*(int)y) >> 16, so you can use y=1<<14 to right shift by 16-14 = 2 in that element.
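Here is a sketch of that mulhi trick with made-up example counts. One caveat worth flagging: a count of 1 would need the multiplier 1 << 15, which doesn't fit in a signed 16-bit element, so this particular formulation covers counts 2..16.

#include <emmintrin.h>

// Sketch: per-element arithmetic right shifts for 16-bit lanes with one
// pmulhw. Element i is shifted by n_i, encoded as the multiplier
// 1 << (16 - n_i). The example counts {2,3,4,5,6,7,8,9} are made up.
static inline __m128i sra_var_epi16(__m128i x)
{
    // _mm_set_epi16 takes the highest element first
    __m128i mul = _mm_set_epi16(
        1 << (16 - 9), 1 << (16 - 8),   // elements 7, 6: shift by 9, 8
        1 << (16 - 7), 1 << (16 - 6),   // elements 5, 4: shift by 7, 6
        1 << (16 - 5), 1 << (16 - 4),   // elements 3, 2: shift by 5, 4
        1 << (16 - 3), 1 << (16 - 2));  // elements 1, 0: shift by 3, 2
    return _mm_mulhi_epi16(x, mul);     // (x * 2^(16-n)) >> 16  ==  x >> n
}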

Obtain sigma of gaussian blur between two images

Suppose I have an image A, and I applied a Gaussian blur on it with sigma=3, so I got another image B. Is there a way to find the applied sigma if A and B are given?
Further clarification:
Image A:
Image B:
I want to write a function that take A,B and return Sigma:
double get_sigma(cv::Mat const& A,cv::Mat const& B);
Any suggestions?
EDIT: The suggested approach doesn't work in practice in its original form (i.e. using only 9 equations for a 3 x 3 kernel); I realized this later. See EDIT1 below for an explanation and EDIT2 for a method that works.
EDIT2: As suggested by Humam, I used the Least Squares Estimate (LSE) to find the coefficients.
I think you can estimate the filter kernel by solving a linear system of equations in this case. A linear filter weighs the pixels in a window by its coefficients, takes their sum and assigns that value to the center pixel of the window in the result image. So, for a 3 x 3 filter with coefficients
h11 h12 h13
h21 h22 h23
h31 h32 h33
the resulting pixel value in the filtered image is
result_pix_value = h11 * a(y, x) + h12 * a(y, x+1) + h13 * a(y, x+2) +
h21 * a(y+1, x) + h22 * a(y+1, x+1) + h23 * a(y+1, x+2) +
h31 * a(y+2, x) + h32 * a(y+2, x+1) + h33 * a(y+2, x+2)
where the a's are the pixel values within the window in the original image. Here, for the 3 x 3 filter, you have 9 unknowns, so you need 9 equations. You can obtain those 9 equations using 9 pixels of the resulting image. Then you can form an Ax = b system and solve for x to obtain the filter coefficients. With the coefficients available, the sigma can be recovered (see the sketch below).
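For that last step, here is a sketch of how sigma could be recovered from the estimated coefficients (my own reasoning, assuming the kernel is a sampled Gaussian h(x,y) ~ exp(-(x^2+y^2)/(2*sigma^2)); the kernel's normalization factor cancels in the ratio of two coefficients):

#include <cmath>

// Sketch: for a sampled Gaussian kernel,
//   center / 4-neighbour = exp(1/(2*sigma^2))
// so sigma = 1 / sqrt(2 * log(center / neighbour)).
double sigma_from_kernel(double h_center, double h_neighbour)
{
    return 1.0 / std::sqrt(2.0 * std::log(h_center / h_neighbour));
}

Plugging in the kernel estimated below (0.11410 at the center, 0.11184 for a 4-neighbour) gives sigma of about 5.0, consistent with the fspecial('gaussian', [3 3], 5) call used later in this answer.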
In the following example I'm using non-overlapping windows as shown to obtain the equations.
You don't have to know the size of the filter. If you use a larger size, the coefficients that are not relevant will be close to zero.
Your result image size is different from the input image, so I didn't use that image for the following calculation. Instead, I use your input image and apply my own filter.
I tested this in Octave. You can quickly run it if you have Octave/Matlab. For Octave, you need to load the image package.
I'm using the following kernel to blur the image:
h =
0.10963 0.11184 0.10963
0.11184 0.11410 0.11184
0.10963 0.11184 0.10963
When I estimate it using a window size of 5, I get the following. As I said, the coefficients that are not relevant are close to zero.
g =
9.5787e-015 -3.1508e-014 1.2974e-015 -3.4897e-015 1.2739e-014
-3.7248e-014 1.0963e-001 1.1184e-001 1.0963e-001 1.8418e-015
4.1825e-014 1.1184e-001 1.1410e-001 1.1184e-001 -7.3554e-014
-2.4861e-014 1.0963e-001 1.1184e-001 1.0963e-001 9.7664e-014
1.3692e-014 4.6182e-016 -2.9215e-014 3.1305e-014 -4.4875e-014
EDIT1:
First of all, my apologies.
This approach doesn't really work in practice. I used filt = conv2(a, h, 'same'); in the code. The resulting image data type in this case is double, whereas in an actual image the data type is usually uint8, so there's a loss of information, which we can think of as noise. I simulated this with the minor modification filt = floor(conv2(a, h, 'same'));, and then I didn't get the expected results.
The sampling approach is also not ideal, because it can result in a degenerate system. A better approach is to use random sampling, avoiding the borders and making sure the entries in the b vector are unique. In the ideal case, as in my original code, this ensures the system Ax = b has a unique solution.
One approach would be to reformulate this as an Mv = 0 system and try to minimize the squared norm of Mv under the constraint that the squared norm of v is 1, which we can solve using SVD. I could be wrong here, and I haven't tried this.
Another approach is to use the symmetry of the Gaussian kernel; then a 3 x 3 kernel has only 3 unknowns instead of 9. I think this imposes additional constraints on the v of the above paragraph.
I'll try these out and post the results, even if I don't get the expected results.
EDIT2:
Using the LSE, we can find the filter coefficients as pinv(A'A)A'b. For completeness, I'm adding simple (and slow) LSE code.
Initial Octave Code:
clear all
im = double(imread('I2vxD.png'));
k = 5;
r = floor(k/2);
a = im(:, :, 1); % take the red channel
h = fspecial('gaussian', [3 3], 5); % filter with a 3x3 gaussian
filt = conv2(a, h, 'same');
% use non-overlapping windows to form the Ax = b system
% NOTE: boundary error checking isn't performed in the code below
s = floor(size(a)/2);
y = s(1);
x = s(2);
w = k*k;
y1 = s(1)-floor(w/2) + r;
y2 = s(1)+floor(w/2);
x1 = s(2)-floor(w/2) + r;
x2 = s(2)+floor(w/2);
b = [];
A = [];
for y = y1:k:y2
    for x = x1:k:x2
        b = [b; filt(y, x)];
        f = a(y-r:y+r, x-r:x+r);
        A = [A; f(:)'];
    end
end
% estimated filter kernel
g = reshape(A\b, k, k)
LSE method:
clear all
im = double(imread('I2vxD.png'));
k = 5;
r = floor(k/2);
a = im(:, :, 1); % take the red channel
h = fspecial('gaussian', [3 3], 5); % filter with a 3x3 gaussian
filt = floor(conv2(a, h, 'same'));
s = size(a);
y1 = r+2; y2 = s(1)-r-2;
x1 = r+2; x2 = s(2)-r-2;
b = [];
A = [];
for y = y1:2:y2
    for x = x1:2:x2
        b = [b; filt(y, x)];
        f = a(y-r:y+r, x-r:x+r);
        f = f(:)';
        A = [A; f];
    end
end
g = reshape(A\b, k, k) % A\b returns the least squares solution
%g = reshape(pinv(A'*A)*A'*b, k, k)

Generate random numbers matrix in OpenCV

I want to know how I can generate a matrix of random numbers of any given size, for example 2x4. The matrix should consist of signed whole numbers in a given range, for example [-500, +500].
I have read the documentation of RNG, but I am not sure how I should use it.
I also referred to this question, but it did not provide the solution I am looking for.
I know this might be a silly question, but any help on it would be truly appreciated.
If you want values to be uniformly distributed, you can use cv::randu
Mat1d mat(2, 4); // Or: Mat mat(2, 4, CV_64FC1);
double low = -500.0;
double high = +500.0;
randu(mat, Scalar(low), Scalar(high));
Note that the upper bound is exclusive, so this example represents data in range [-500, +500).
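Since the question asks for signed whole numbers, note that randu can also fill an integer matrix directly; a small variation on the above (the exclusive upper bound becomes 501):

Mat mat(2, 4, CV_32SC1);               // 2x4 matrix of signed 32-bit integers
randu(mat, Scalar(-500), Scalar(501)); // uniformly distributed in [-500, +500]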
If you want values to be normally distributed, you can use cv::randn
Mat1d mat(2, 4); // Or: Mat mat(2, 4, CV_64FC1);
double mean = 0.0;
double stddev = 500.0 / 3.0; // 99.7% of values will be inside [-500, +500] interval
randn(mat, Scalar(mean), Scalar(stddev));
This works for matrices with up to 4 channels, e.g.:
Mat3b random_image(100,100);
randu(random_image, Scalar(0,0,0), Scalar(256,256,256));

Creating an image of difference of adjacent pixels with digitalmicrograph (DM) script

The following DigitalMicrograph function tries to create an image by taking the difference of neighboring pixels in each sub-row of a row of the image. The first pixel of each sub-row is replaced with the mean of that sub-row.
E.g. if the input image is 8 pixels wide and 1 row tall and the sub-row size is 4:
In_img = {8,9,2,4,9,8,7,5}
Then the output image will be -
Out_img = {mean(8,9,2,4)=5.75,9-8=1,2-9=-7,4-2=2,mean(9,8,7,5)=7.25,8-9=-1,7-8=-1,5-7=-2}
When I run this script, the first pixel of the first row is correct, but the rest of the pixels are incorrect. When I set the loop limits to only one sub-row and one row, i.e. x=1 and y=1, the script works correctly.
Any ideas as to what may be happening or what may be wrong with the script?
The test image is here and the result is here.
// Function to compute the standard deviation (sigma n-1) of an image, or
// a set of values passed in as pixel values in an image. The
// number of data points (n) the mean and the sum are also returned.
// version:20080229
// D. R. G. Mitchell, adminnospam#dmscripting.com (remove the nospam to make this email address work)
// v1.0, February 2008
void StandardDeviation(image arrayimg, number &stddev, number &n, number &mean, number &sum)
{
    mean = mean(arrayimg)
    number xsize, ysize
    getsize(arrayimg, xsize, ysize)
    n = xsize * ysize
    sum = sum(arrayimg)
    image imgsquared = arrayimg * arrayimg
    number sumofvalssqrd = sum(imgsquared)
    stddev = sqrt(((n*sumofvalssqrd) - (sum*sum)) / (n*(n-1)))
}
image getVectorImage(image refImage, number rowsize)
{
    number fh, fv, fhx
    getsize(refImage, fh, fv)
    fhx = trunc(fh/rowsize)
    //result("ByteSize of refimage = "+refImage.ImageGetDataElementByteSize()+"\n")
    //create image to save the std of each row of the ref image.
    //The std values are saved as pixels of one row. The row size is the same as the number of rows.
    //use fhx*rowsize for the new image size as fhx is a truncated value.
    image retImage := RealImage("", 4, fhx*rowsize, fv)
    image workImage = slice1(refImage, rowsize+1, 0, 0, 0, rowsize-1, 1)
    number stddev, nopix, mean, sum
    for (number y=0; y<fv; y++)
    {
        for (number x=0; x<fhx; x++)
        {
            //result("x,y="+x+","+y+"; fhx="+fhx+"; rowsize="+rowsize+"\n")
            workImage = slice1(refImage, x*rowsize+1, y, 0, 0, rowsize-1, 1) - slice1(refImage, x*rowsize, y, 0, 0, rowsize-1, 1)
            showimage(workImage)
            StandardDeviation(workImage, stddev, nopix, mean, sum)
            retImage[y, x*rowsize+1, y+1, x*rowsize+rowsize] = workImage
            retImage[y, x] = mean
            result("mean # row "+y+" = "+mean+"\n")
        }
    }
    return retImage
}
showimage(getVectorImage(getfrontimage(),rowsize))
After your edit, I understood that you want to do something like this:
and that this should be performed for each line of the image individually.
The following script does this. (Explanations below.)
image Modify( image in, number subsize )
{
    // Some checking
    number sx, sy
    in.GetSize( sx, sy )
    if ( 0 != sx%subsize )
        Throw( "The image width is not an integer multiple of the subsize." )

    // Do the means...
    number nTile = sx/subsize
    image meanImg := RealImage( "Means", 4, nTile, sy )
    meanImg = 0
    for ( number i=0; i<subsize; i++ )
        meanImg += in.Slice2( i,0,0, 0,nTile,subsize, 1,sy,1 )
    meanImg *= 1/subsize

    // Do the shifted difference
    image dif := RealImage( "Diff", 4, sx-1, sy )
    dif = in.slice2( 1,0,0, 0,sx-1,1, 1,sy,1 ) - in.slice2( 0,0,0, 0,sx-1,1, 1,sy,1 )

    // Compile the result
    image out := in.ImageClone()
    out.SetName( in.getName() + "mod" )
    out.slice2( 1,0,0, 0,sx-1,1, 1,sy,1 ) = dif
    out.slice2( 0,0,0, 0,nTile,subsize, 1,sy,1 ) = meanImg
    return out
}
number sx = 8, sy = 4
image img := RealImage( "test", 4, 8, 4 )
img = icol*10 + trunc( Random()*10 )
img.ShowImage()
Modify(img,4).ShowImage()
Some explanations:
You want to do two different things to the image, so you have to be careful not to overwrite data in pixels you will subsequently use for computation! Images are processed pixel by pixel, so if you first compute the mean and write it into the first pixel, the evaluation of the second pixel will be the difference of "9" and the just-stored mean value (not the original "8"). So you have to split the computation and use "buffer" copies.
The Slice2 command is extremely convenient, because it allows you to define a step size when sampling. You can use it to address the dark-grey pixels directly.
Be aware of the difference between := and = in image expressions. The first is a memory assignment:
A := B means that A now is the same memory location as B. A is basically another name for B.
A = B means A gets the values of B (copied). A and B are two different memory locations and only values are copied over.
I have some observations on your script:
Instead of the defined method for getting the mean/sum/stdev/n of an image, you can just as easily get those numbers from any image img using the following:
mean: number m = mean( img )
sum: number s = sum( img )
stdev: number sd = sqrt( variance( img ) )
pixels: number n = sum( 0 * img + 1 )
If you want the difference of an image with an image "shifted by one", you don't have to loop over the lines/columns; you can directly use the slice2() command, a [ ] notation using icol and irow, or the command offset(). Personally, I prefer the slice2() command.
If I want a script which gives me the standard deviation of the difference of each row with its successor row, i.e. stdDev( row_(y) - row_(y+1) ) for all y < sizeY, my script would be:
Image img := GetFrontImage()
number sx,sy
img.GetSize(sx,sy)
number dy = 1
Image dif = img.Slice2(0,0,0, 0,sx,1, 1,sy-1,1 ) - img.Slice2(0,dy,0, 0,sx,1, 1,sy-1,1)
Image sDevs := RealImage( "Row's stDev", 4, sy-1 )
for ( number y=0; y<sy-1; y++ )
    sDevs[y,0] = SQRT( Variance( dif.Slice1(0,y,0, 0,sx,1) ) )
sDevs.ShowImage()
Is this what you are trying to achieve? If not, please edit your question for clarification.
