What does this x86 SSE code do? - opencv

I see this piece of code in OpenCV.
__m128i delta = _mm_set1_epi8(-128),
t = _mm_set1_epi8((char)threshold),
K16 = _mm_set1_epi8((char)K);
(void)K16;
(void)delta;
(void)t;
Can someone explain to me what it does? I know what the individual SSE functions do, but what happens in the next three lines is unclear.

_mm_set1_epi8 broadcasts the given signed 8-bit value into all sixteen byte lanes of the 128-bit register. The three (void) casts that follow do nothing at run time; they just suppress unused-variable warnings in builds where those variables end up unused:
http://msdn.microsoft.com/en-us/library/6e14xhyf(v=vs.90).aspx

Related

I can't get vImage (Accelerate Framework) to convert 420Yp8_Cb8_Cr8 (planar) to ARGB8888

I'm trying to convert Planar YpCbCr to RGBA and it's failing with error kvImageRoiLargerThanInputBuffer.
I tried two different ways. Here're some code snippets.
Note that 'thumbnail_buffers + 1' and 'thumbnail_buffers + 2' have half the width and height of 'thumbnail_buffers + 0',
because I'm dealing with 4:2:0 and have (1/2)*(1/2) as many chroma samples as luma samples. This silently fails
(even though I asked for diagnostics with kvImagePrintDiagnosticsToConsole).
error = vImageConvert_YpCbCrToARGB_GenerateConversion(
kvImage_YpCbCrToARGBMatrix_ITU_R_709_2,
&fullrange_8bit_clamped_to_fullrange,
&convertInfo,
kvImage420Yp8_Cb8_Cr8, kvImageARGB8888,
kvImagePrintDiagnosticsToConsole);
uint8_t BGRA8888_permuteMap[4] = {3, 2, 1, 0};
uint8_t alpha = 255;
vImage_Buffer dest;
error = vImageConvert_420Yp8_Cb8_Cr8ToARGB8888(
thumbnail_buffers + 0, thumbnail_buffers + 1, thumbnail_buffers + 2,
&dest,
&convertInfo, BGRA8888_permuteMap, alpha,
kvImagePrintDiagnosticsToConsole //I don't think this flag works here
);
So I tried again with vImageConvert_AnyToAny:
vImage_CGImageFormat cg_BGRA8888_format = {
.bitsPerComponent = 8,
.bitsPerPixel = 32,
.colorSpace = baseColorspace,
.bitmapInfo =
kCGImageAlphaNoneSkipFirst | kCGBitmapByteOrder32Little,
.version = 0,
.decode = (CGFloat*)0,
.renderingIntent = kCGRenderingIntentDefault
};
vImageCVImageFormatRef vformat = vImageCVImageFormat_Create(
kCVPixelFormatType_420YpCbCr8Planar,
kvImage_ARGBToYpCbCrMatrix_ITU_R_709_2,
kCVImageBufferChromaLocation_Center,
baseColorspace,
0);
vImageConverterRef icref = vImageConverter_CreateForCVToCGImageFormat(
vformat,
&cg_BGRA8888_format,
(CGFloat[]){0, 0, 0},
kvImagePrintDiagnosticsToConsole,
&error );
vImage_Buffer dest;
error = vImageBuffer_Init( &dest, image_height, image_width, 8, kvImagePrintDiagnosticsToConsole);
error = vImageConvert_AnyToAny( icref, thumbnail_buffers, &dest, (void*)0, kvImagePrintDiagnosticsToConsole); //kvImageGetTempBufferSize
I get the same error but this time I get the following message printed to the console.
<Error>: kvImagePrintDiagnosticsToConsole: vImageConvert_AnyToAny: srcs[1].height must be >= dests[0].height
But this doesn't make any sense to me. How can my Cb height be anything other than half my Yp height (which is the same as my dest RGB height) when I've got 4:2:0 data?
(Likewise for width?)
What on earth am I doing wrong? I'm going to be doing other conversions as well (4:4:4, 4:2:2, etc.), so any clarification
on these APIs would be greatly appreciated. Further, what is my chroma siting supposed to be for these conversions? Above I use
kCVImageBufferChromaLocation_Center. Is that correct?
Some new info:
Since posting this I saw a glaring error, but fixing it didn't help. Notice that in the vImageConvert_AnyToAny case above, I initialized the destination buffer with just the image width instead of 4*width to make room for RGBA. That must be the problem, right? Nope.
Notice further that in the vImageConvert_* case, I didn't initialize the destination buffer at all. Fixed that too and it didn't help.
So far I've tried the conversion six different ways choosing one from (vImageConvert_* | vImageConvert_AnyToAny) and choosing one from (kvImage420Yp8_Cb8_Cr8 | kvImage420Yp8_CbCr8 | kvImage444CrYpCb8) feeding the appropriate number of input buffers each time--carefully checking that the buffers take into account the number of samples per pixel per plane. Each time I get:
<Error>: kvImagePrintDiagnosticsToConsole: vImageConvert_AnyToAny: srcs[0].width must be >= dests[0].width
which makes no sense to me. If my luma plane is say 100 wide, my RGBA buffer should be 400 wide. Please any guidance or working code going from YCC to RGBA would be greatly appreciated.
Okay, I figured it out--part user error, part Apple bug. I was thinking about the vImage_Buffer's width and height wrongly. For example, I specified the output buffer as 4 * image_width pixels wide at 8 bits per pixel, when it should have been simply image_width pixels wide at 32 bits per pixel--the same amount of memory, but sending the wrong message to the APIs. The literal '8' on that line blinded me from remembering what that slot was, I guess. A lesson I must have learned many times--name your magic numbers.
Anyway, now the bug part. Making the input and output buffers correct with regards to width, height, pixel depth fixed all the calls to the low-level vImageConvert_420Yp8_Cb8_Cr8ToARGB8888 and friends. For example in the planar YCC case, your Cb and Cr buffers would naturally have half the width and half the height of the Yp plane. However, in the vImageConvert_AnyToAny cases these buffers caused the calls to fail and bail--saying silly things like I needed my Cb plane to have the same dimensions as my Yp plane even for 4:2:0. This appears to be a bug in some preflighting done by Apple before calling the lower-level code that does the work.
I worked around the vImageConvert_AnyToAny bug by simply making input buffers that were too big and only filling Cb and Cr data in the top-left quadrant. The data were found there during the conversion just fine. I made these too-big buffers with vImageBuffer_Init(), where Apple allocated the too-big malloc that goes to waste. I didn't try making the vImage_Buffer's by hand--lying about the size and allocating just the memory I need. This may work, or perhaps Apple will crawl off into the weeds trusting the width and height. If you hand-make one, however, you'd better tell the truth about rowBytes.
I'm going to leave this answer for a bit before marking it correct, hoping someone at Apple sees this and fixes the bug and perhaps gets inspired to improve the documentation for those of us stumbling about.

OpenCV Threshold Type

I have a question about OpenCV's example on Basic Thresholding as provided in the link below:
http://docs.opencv.org/2.4/doc/tutorials/imgproc/threshold/threshold.html#goal
I am slowly beginning to understand the code and have tried out an example too. However I am confused about a part of the code regarding thresholding operations. How does the thresholding function know which threshold operation to use?
This is where it is called:
threshold( src_gray, dst, threshold_value, max_BINARY_value,threshold_type);
I get that the last parameter, "threshold_type", is how it knows which threshold operation to use (e.g. binary, binary inverted, truncated, etc.). However, in the code, this is all that is assigned to threshold_type:
int threshold_type = 3;
As it is only assigned the int value 3, how does the threshold function know which operation to apply? Could someone explain it to me?
You should avoid using numeric literals when calling OpenCV methods; use the constants defined in the cv namespace instead. It makes no difference to the output, but it makes the code more readable. The deciphered set of inputs to the cv::threshold() method is:
THRESH_BINARY = 0,
THRESH_BINARY_INV = 1,
THRESH_TRUNC = 2,
THRESH_TOZERO = 3,
THRESH_TOZERO_INV = 4,
THRESH_MASK = 7,
THRESH_OTSU = 8,
THRESH_TRIANGLE = 16
According to this table, you are using thresholdType == THRESH_TOZERO.

how to deinterleave image channel in SSE

Is there any way to de-interleave 32bpp image channels in SSE, similar to the NEON code below?
//Read all r,g,b,a pixels into 4 registers
uint8x8x4_t SrcPixels8x8x4= vld4_u8(inPixel32);
//Widen the R channel from 8-bit to 32-bit, low and high halves
ChannelR1_32x4 = vmovl_u16(vget_low_u16(vmovl_u8(SrcPixels8x8x4.val[0])));
ChannelR2_32x4 = vmovl_u16(vget_high_u16(vmovl_u8(SrcPixels8x8x4.val[0])));
Basically I want all color channels in separate vectors, with each vector holding 4 elements of 32 bits, to do some calculations. But I am not very familiar with SSE and could not find such an instruction; can someone suggest a way to do this, or a better approach? Any help is highly appreciated.
Since the 8-bit values are unsigned, you can do this with shifting and masking, much as you would in scalar code, e.g.
__m128i vrgba;
__m128i vr = _mm_and_si128(vrgba, _mm_set1_epi32(0xff));
__m128i vg = _mm_and_si128(_mm_srli_epi32(vrgba, 8), _mm_set1_epi32(0xff));
__m128i vb = _mm_and_si128(_mm_srli_epi32(vrgba, 16), _mm_set1_epi32(0xff));
__m128i va = _mm_srli_epi32(vrgba, 24);
Note that I'm assuming your RGBA elements have the R component in the least-significant 8 bits and the A component in the most-significant 8 bits; if they are the opposite endianness, you can just swap the names of the vr/vg/vb/va vectors.

Z3Py: How should I represent some 32-bit; 16-bit and 8-bit registers?

I am a newbie to Z3. Sorry if this is a stupid question.
I am basically trying to implement a simple symbolic execution engine on x86-32bit assembly instructions. Here is the problem I am facing now:
Suppose before execution, I have initialize some registers by using BitVec.
self.eq['%eax'] = BitVec('reg%d' % 1, 32)
self.eq['%ebx'] = BitVec('reg%d' % 2, 32)
self.eq['%ecx'] = BitVec('reg%d' % 3, 32)
self.eq['%edx'] = BitVec('reg%d' % 4, 32)
So here is my question: how do I handle 16-bit or even 8-bit registers?
Is there any way I can extract an 8-bit part from a 32-bit BitVec, assign it some value, and then put it back? Can I do that in Z3? Or is there a better way?
Am I clear? Thank you a lot!
You can extract parts of a bitvector which results in a new, smaller bitvector value that you can use any way you like (for example add).
You can replace parts of a bitvector by first extracting all the parts and then concatenating smaller bitvectors into one big one.
For example, incrementing the upper half of eax would look like this:
eaxNew = concat(add(extract(eaxOld, upperHalf), 1), extract(eaxOld, lowerHalf))
(Pseudo-code)
http://research.microsoft.com/en-us/um/redmond/projects/z3/namespacez3py.html

loaddup_pd/unpacklo_pd on Xeon Phi

If I have the following doubles in a 512-wide SIMD vector, as in a Xeon Phi register:
m0 = |b4|a4|b3|a3|b2|a2|b1|a1|
is it possible to make it into:
m0_d = |a4|a4|a3|a3|a2|a2|a1|a1|
using a single instruction?
Also, since there are no bitwise intrinsics for doubles, is this still a valid way to achieve the above?
m0_t = _mm512_swizzle_pd(m0,_MM_SWIZ_REG_CDAB);//m0_t->|a4|b4|a3|b3|a2|b2|a1|b1|
__m512d res = _mm512_mask_or_epi64(m0,k1,zero,m0_t);//k1 is 0xAA
It can be achieved as follows:
m0_d = _mm512_mask_swizzle_pd(m0,0xAA,m0,_MM_SWIZ_REG_CDAB);
It might seem that the swizzle operation is limited, but with the masked variant we can achieve other permutations too.
