How to insert a u32 into a u8 array in Zig?

I have a u8 buffer that stores a series of values including the RGB components from two u32 values (of which only 24 bits are used).
I'm currently using bitwise operations to extract the individual components, then using @truncate to drop the upper bits so they fit.
var buffer = [_]u8{0} ** max_buffer_size;
const fg: u32 = 0xAABBCC;
const bg: u32 = 0xAABBCC;
// ...
buffer[idx + 2] = @truncate(u8, fg & 0xFF);
buffer[idx + 3] = @truncate(u8, (fg >> 8) & 0xFF);
buffer[idx + 4] = @truncate(u8, (fg >> 16) & 0xFF);
buffer[idx + 5] = @truncate(u8, bg & 0xFF);
buffer[idx + 6] = @truncate(u8, (bg >> 8) & 0xFF);
buffer[idx + 7] = @truncate(u8, (bg >> 16) & 0xFF);
Is there a way to insert the u32 values directly into this memory instead without needing to unpack the values?
// this (obviously) doesn't work (expecting a u8, getting a u32)
buffer[idx + 2] = fg;

You can use writeIntNative or other variants (check std.mem source code) to do this.
E.g.
std.mem.writeIntNative(u32, buffer[idx..][0..@sizeOf(u32)], fg);
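For comparison, the same idea written as a C++ sketch (an illustration; write_u32_native is a hypothetical helper, not part of any library): memcpy the four bytes of the value into the buffer, which, like writeIntNative, uses the host's native byte order.
#include <cstdint>
#include <cstring>

// Write a 32-bit value into a byte buffer at idx using the native byte order.
void write_u32_native(uint8_t* buffer, size_t idx, uint32_t value) {
    std::memcpy(buffer + idx, &value, sizeof value);
}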

Related

How to implement fast majority voting for a bit matrix

I have a representation of a large bit matrix where I'd like to efficiently retrieve the majority bit for each matrix column (i.e., the bit value that occurs most often). The background is that the matrix rows represent ORB feature descriptors and the value I'm looking for resembles the mean in the Hamming domain.
The implementation I'm currently working with looks like this
// holds the column-sum for each bit
std::vector<int> sum(32 * 8, 0);
// cv::Mat mat is a matrix of values ∈ [0, 255] filled elsewhere
for (int i = 0; i < mat.rows; ++i)
{
    const cv::Mat &d = mat.row(i);
    const unsigned char *p = d.ptr<unsigned char>();
    // count bits set column-wise
    for (int j = 0; j < d.cols; ++j, ++p)
    {
        if (*p & (1 << 7)) ++sum[j * 8];
        if (*p & (1 << 6)) ++sum[j * 8 + 1];
        if (*p & (1 << 5)) ++sum[j * 8 + 2];
        if (*p & (1 << 4)) ++sum[j * 8 + 3];
        if (*p & (1 << 3)) ++sum[j * 8 + 4];
        if (*p & (1 << 2)) ++sum[j * 8 + 5];
        if (*p & (1 << 1)) ++sum[j * 8 + 6];
        if (*p & (1)) ++sum[j * 8 + 7];
    }
}
cv::Mat mean = cv::Mat::zeros(1, 32, CV_8U);
unsigned char *p = mean.ptr<unsigned char>();
const int N2 = (int)mat.rows / 2 + mat.rows % 2;
for (size_t i = 0; i < sum.size(); ++i)
{
    if (sum[i] >= N2)
    {
        // set bit in mean only if the corresponding matrix column
        // contains more 1s than 0s
        *p |= 1 << (7 - (i % 8));
    }
    if (i % 8 == 7) ++p;
}
The bottleneck is the big loop with all the bit shifting. Is there any way or known bit magic to make this any faster?
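One straightforward change to try (a hedged sketch, not a measured benchmark): add the bit value itself instead of branching on it. That removes the eight conditionals per byte and leaves an inner loop the compiler can often vectorize.
#include <cstddef>
#include <vector>

// Branch-free column sums over `rows` descriptors of `bytes` bytes each,
// stored row-major in `data`; sum[j * 8 + k] counts set bits in that column.
void columnSums(const unsigned char* data, std::size_t rows, std::size_t bytes,
                std::vector<int>& sum)
{
    sum.assign(bytes * 8, 0);
    for (std::size_t i = 0; i < rows; ++i) {
        const unsigned char* p = data + i * bytes;
        for (std::size_t j = 0; j < bytes; ++j) {
            for (int k = 0; k < 8; ++k)
                sum[j * 8 + k] += (p[j] >> (7 - k)) & 1; // add the bit, no branch
        }
    }
}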

Integer literal '255' overflows when stored into 'Int8'

I have the following code and I get the following error: "Integer literal '255' overflows when stored into 'Int8'".
func decodeIDArrayItem(index: Int, tokenArray: UnsafeMutablePointer<CChar>) {
    var value = tokenArray[index * 4] & 0xFF
    value <<= 8;
    value |= tokenArray[index * 4 + 1] & 0xFF
    value <<= 8;
    value |= tokenArray[index * 4 + 2] & 0xFF
    value <<= 8;
    value |= tokenArray[index * 4 + 3] & 0xFF
}
Any thoughts?
func decodeIDArrayItem(index: Int, tokenArray: UnsafeMutablePointer<CChar>) -> UInt32 {
    // UInt8(bitPattern:) avoids the runtime trap that UInt32(Int8) hits for negative bytes
    var value: UInt32 = UInt32(UInt8(bitPattern: tokenArray[index * 4]))
    value <<= 8
    value |= UInt32(UInt8(bitPattern: tokenArray[index * 4 + 1]))
    value <<= 8
    value |= UInt32(UInt8(bitPattern: tokenArray[index * 4 + 2]))
    value <<= 8
    value |= UInt32(UInt8(bitPattern: tokenArray[index * 4 + 3]))
    return value
}
It sounds like you're trying to extract 8-bit data into a 32-bit format. You're getting issues because CChar is a signed char. Anyway, try UInt32; it will work fine.
I hope the code below will help you.
func decodeIDArrayItem(index: Int, tokenArray: UnsafeMutablePointer<CChar>) -> UInt32 {
    // assemble 4 bytes into one 32-bit value;
    // UInt8(bitPattern:) reinterprets the signed CChar as an unsigned byte
    var value: UInt32
    let byte1: UInt32 = UInt32(UInt8(bitPattern: tokenArray[index * 4]))           // index 0
    let byte2: UInt32 = UInt32(UInt8(bitPattern: tokenArray[index * 4 + 1])) << 8  // index 1
    let byte3: UInt32 = UInt32(UInt8(bitPattern: tokenArray[index * 4 + 2])) << 16 // index 2
    let byte4: UInt32 = UInt32(UInt8(bitPattern: tokenArray[index * 4 + 3])) << 24 // index 3
    value = byte1 | byte2 | byte3 | byte4
    return value
}
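The signed-char pitfall isn't Swift-specific. A minimal C++ illustration of the same sign-extension behavior, with a made-up byte value:
#include <cstdint>
#include <cstdio>

int main() {
    signed char byte = static_cast<signed char>(0xF0); // raw byte 0xF0, value -16
    // Widening a signed char sign-extends, so the upper 24 bits become 1s:
    uint32_t bad = static_cast<uint32_t>(byte); // 0xFFFFFFF0
    // Going through an unsigned byte (or masking with 0xFF) keeps only the low 8 bits:
    uint32_t good = static_cast<uint8_t>(byte); // 0x000000F0
    std::printf("%08X %08X\n", static_cast<unsigned>(bad), static_cast<unsigned>(good));
    return 0;
}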
haha! this worked for me
Int8(truncatingIfNeeded: 0xff)
nice little hack

Is there a direct way to get a unique value representing RGB color in opencv C++

My image is an RGB image. I want to get a unique value (a single code) to represent the RGB color value of a certain pixel. For example, if the pixel's red channel = 23, green channel = 200, and blue channel = 45, this RGB color could be represented by 232765. I wish there were a direct OpenCV C++ function to get such a value from a pixel. Note that this value should be unique for that RGB value.
I want something like this and I know this is not correct.
uniqueColorForPixel_i_j=(matImage.at<Vec3b>(i,j)).getUniqueColor();
I hope something can be done if we can get the Scalar value of a pixel. And just as RNG can generate a random Scalar RGB value from a number, can we get the inverse...
Here is a small code sample showing how to pass a Vec3b directly to the function, plus an alternative to the shift-and approach.
The code is based on this answer.
UPDATE
I added also a simple struct BGR, that will handle more easily the conversion between Vec3b and unsigned.
UPDATE 2
The code in your question:
uniqueColorForPixel_i_j=(matImage.at<Vec3b>(i,j)).getUniqueColor();
doesn't work because you're trying to call the method getUniqueColor() on a Vec3b which hasn't this method. You should instead pass the Vec3b as the argument of unsigned getUniqueColor(const Vec3b& v);.
The code should clarify this:
#include <opencv2/opencv.hpp>
using namespace cv;

unsigned getUniqueColor_v1(const Vec3b& v)
{
    return ((v[2] & 0xff) << 16) + ((v[1] & 0xff) << 8) + (v[0] & 0xff);
}

unsigned getUniqueColor_v2(const Vec3b& v)
{
    // reinterpret the pixel bytes as an unsigned and mask off the top byte
    return 0x00ffffff & *((unsigned*)(v.val));
}

struct BGR
{
    Vec3b v;
    unsigned u;

    BGR(const Vec3b& v_) : v(v_) {
        u = ((v[2] & 0xff) << 16) + ((v[1] & 0xff) << 8) + (v[0] & 0xff);
    }

    BGR(unsigned u_) : u(u_) {
        v[0] = uchar(u & 0xff);
        v[1] = uchar((u >> 8) & 0xff);
        v[2] = uchar((u >> 16) & 0xff);
    }
};

int main()
{
    Vec3b v(45, 200, 23);

    unsigned col1 = getUniqueColor_v1(v);
    unsigned col2 = getUniqueColor_v2(v);
    unsigned col3 = BGR(v).u;
    // col1 == col2 == col3
    //
    // hex: 0x0017c82d
    // dec: 1558573

    Vec3b v2 = BGR(col3).v;
    // v2 == v

    //////////////////////////////
    // Taking values from a mat
    //////////////////////////////

    // Just 2 10x10 green mats
    Mat mat1(10, 10, CV_8UC3);
    mat1.setTo(Vec3b(0, 255, 0));
    Mat3b mat2(10, 10, Vec3b(0, 255, 0));

    int row = 2;
    int col = 3;

    unsigned u1 = getUniqueColor_v1(mat1.at<Vec3b>(row, col));
    unsigned u2 = BGR(mat1.at<Vec3b>(row, col)).u;
    unsigned u3 = getUniqueColor_v1(mat2(row, col));
    unsigned u4 = BGR(mat2(row, col)).u;
    // u1 == u2 == u3 == u4

    return 0;
}
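Note that the question's example pixel (R=23, G=200, B=45) is Vec3b(45, 200, 23) in OpenCV's BGR storage order, so its packed value is 0x0017C82D = 1558573. Unlike an ad-hoc digit concatenation such as 232765, the packed value is guaranteed unique for each RGB triple.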

YUV420 to RGB conversion

I converted an RGB matrix to YUV matrix using this formula:
Y = (0.257 * R) + (0.504 * G) + (0.098 * B) + 16
Cr = V = (0.439 * R) - (0.368 * G) - (0.071 * B) + 128
Cb = U = -(0.148 * R) - (0.291 * G) + (0.439 * B) + 128
I then did a 4:2:0 chroma subsample on the matrix. I think I did this correctly: I took 2x2 submatrices from the YUV matrix, ordered the values from least to greatest, and took the average of the two middle values.
I then used this formula, from Wikipedia, to access the Y, U, and V planes:
size.total = size.width * size.height;
y = yuv[position.y * size.width + position.x];
u = yuv[(position.y / 2) * (size.width / 2) + (position.x / 2) + size.total];
v = yuv[(position.y / 2) * (size.width / 2) + (position.x / 2) + size.total + (size.total / 4)];
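(For reference, a small C++ helper spelling out that indexing, assuming a contiguous I420 buffer with even width and height:)
#include <cstddef>
#include <cstdint>

struct YUV { uint8_t y, u, v; };

// Fetch Y, U, V for pixel (x, y) from a planar I420 buffer of the given width:
// a full-size Y plane followed by quarter-size U and V planes.
YUV sampleI420(const uint8_t* yuv, int width, int height, int x, int y) {
    const std::size_t total = std::size_t(width) * height; // size of the Y plane
    YUV s;
    s.y = yuv[std::size_t(y) * width + x];
    s.u = yuv[total + std::size_t(y / 2) * (width / 2) + (x / 2)];
    s.v = yuv[total + total / 4 + std::size_t(y / 2) * (width / 2) + (x / 2)];
    return s;
}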
I'm using OpenCV so I tried to interpret this as best I can:
y = src.data[(i*channels)+(j*step)];
u = src.data[(j%4)*step + ((i%2)*channels+1) + max];
v = src.data[(j%4)*step + ((i%2)*channels+2) + max + (max%4)];
src is the YUV subsampled matrix. Did I interpret that formula correctly?
Here is how I converted the colours back to RGB:
bgr.data[(i*channels)+(j*step)] = (1.164 * (y - 16)) + (2.018 * (u - 128)); // B
bgr.data[(i*channels+1)+(j*step)] = (1.164 * (y - 16)) - (0.813 * (v - 128)) - (0.391 * (u - 128)); // G
bgr.data[(i*channels+2)+(j*step)] = (1.164 * (y - 16)) + (1.596 * (v - 128)); // R
The problem is my image does not return to its original colours.
Here are the images for reference:
http://i.stack.imgur.com/vQkpT.jpg (Subsampled)
http://i.stack.imgur.com/Oucc5.jpg (Output)
I see that I should be converting from YUV444 to RGB now, but I don't quite understand what the clip function does in the sample I found on the wiki.
C = Y' − 16
D = U − 128
E = V − 128
R = clip(( 298 * C + 409 * E + 128) >> 8)
G = clip(( 298 * C - 100 * D - 208 * E + 128) >> 8)
B = clip(( 298 * C + 516 * D + 128) >> 8)
Does the >> mean I should shift bits?
I'd appreciate any help/comments! Thanks
Update
Tried doing the YUV444 conversion but it just made my image appear in shades of green.
y = src.data[(i*channels)+(j*step)];
u = src.data[(j%4)*step + ((i%2)*channels+1) + max];
v = src.data[(j%4)*step + ((i%2)*channels+2) + max + (max%4)];
c = y - 16;
d = u - 128;
e = v - 128;
bgr.data[(i*channels+2)+(j*step)] = clip((298*c + 409*e + 128)/256);
bgr.data[(i*channels+1)+(j*step)] = clip((298*c - 100*d - 208*e + 128)/256);
bgr.data[(i*channels)+(j*step)] = clip((298*c + 516*d + 128)/256);
And my clip function:
int clip(double value)
{
    return (value > 255) ? 255 : (value < 0) ? 0 : value;
}
I had the same problem when decoding WebM frames to RGB. I finally found the solution after hours of searching.
Take the SCALEYUV function from here: http://www.telegraphics.com.au/svn/webpformat/trunk/webpformat.h
Then to decode the RGB data from YUV, see this file:
http://www.telegraphics.com.au/svn/webpformat/trunk/decode.c
Search for "py = img->planes[0];", there are two algorithms to convert the data. I only tried the simple one (after "// then fall back to cheaper method.").
Comments in the code also refer to this page: http://www.poynton.com/notes/colour_and_gamma/ColorFAQ.html#RTFToC30
Works great for me.
You won't get back exactly the same image, since subsampling UV discards information.
You don't say whether the result is completely wrong (i.e. an error) or just not perfect.
R = clip(( 298 * C + 409 * E + 128) >> 8)
G = clip(( 298 * C - 100 * D - 208 * E + 128) >> 8)
B = clip(( 298 * C + 516 * D + 128) >> 8)
The >> 8 is a bit shift, equivalent to dividing by 256. It just lets you do all the arithmetic in integer units rather than floating point, for speed.
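Putting it together, a minimal C++ sketch of the per-pixel integer conversion (an illustration of the formulas above):
#include <algorithm>
#include <cstdint>

// Clamp an intermediate result into the valid 0..255 byte range.
static inline uint8_t clip8(int v) {
    return static_cast<uint8_t>(std::min(255, std::max(0, v)));
}

// Integer-only YUV -> RGB for one pixel (the >> 8 divides by 256).
void yuv2rgb(uint8_t y, uint8_t u, uint8_t v,
             uint8_t& r, uint8_t& g, uint8_t& b) {
    const int c = int(y) - 16;
    const int d = int(u) - 128;
    const int e = int(v) - 128;
    r = clip8((298 * c + 409 * e + 128) >> 8);
    g = clip8((298 * c - 100 * d - 208 * e + 128) >> 8);
    b = clip8((298 * c + 516 * d + 128) >> 8);
}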
I was experimenting with the formulas on the wiki and found that this mixed formula:
byte c = (byte) (y - 16);
byte d = (byte) (u - 128);
byte e = (byte) (v - 128);
byte r = (byte) (c + (1.370705 * (e)));
byte g = (byte) (c - (0.698001 * (d)) - (0.337633 * (e)));
byte b = (byte) (c + (1.732446 * (d)));
produces "better" errors for my images, simply makes some black points pure green (i.e. rgb = 0x00FF00) which is better for detection and correction ...
wiki source: https://en.wikipedia.org/wiki/YUV#Y.27UV420p_.28and_Y.27V12_or_YV12.29_to_RGB888_conversion

Fast RGB => YUV conversion in OpenCL

I know the following formula can be used to convert RGB images to YUV images. In the following formula, R, G, B, Y, U, V are all 8-bit unsigned integers, and intermediate values are 16-bit unsigned integers.
Y = ( ( 66 * R + 129 * G + 25 * B + 128) >> 8) + 16
U = ( ( -38 * R - 74 * G + 112 * B + 128) >> 8) + 128
V = ( ( 112 * R - 94 * G - 18 * B + 128) >> 8) + 128
But when the formula is used in OpenCL, it's a different story:
1. 8-bit memory write access is an optional extension, which means some OpenCL implementations may not support it.
2. Even when that extension is supported, it's deadly slow compared with 32-bit write access.
To get better performance, every 4 pixels will be processed at the same time, so the input is 12 8-bit integers and the output is 3 32-bit unsigned integers (the first holds 4 Y samples, the second 4 U samples, the last 4 V samples).
My question is how to get these 3 32-bit integers directly from the 12 8-bit integers. Is there a formula that produces them directly, or do I just need to use the old formula to get 12 8-bit results (4 Y, 4 U, 4 V) and assemble the 3 32-bit integers with bitwise operations?
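(In practice the answers below do the latter: each sample is computed with the scalar formula, then four 8-bit results are packed into one 32-bit word with shifts. A minimal C++ sketch of the packing step:)
#include <cstdint>

// Pack four 8-bit samples into one 32-bit word, sample 0 in the lowest byte.
uint32_t pack4(uint8_t s0, uint8_t s1, uint8_t s2, uint8_t s3) {
    return uint32_t(s0) | (uint32_t(s1) << 8) | (uint32_t(s2) << 16) | (uint32_t(s3) << 24);
}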
Even though this question was asked 2 years ago, I think some working code would help here. Regarding the initial concerns about bad performance when directly accessing 8-bit values, it's better to perform 32-bit direct access when possible.
Some time ago I developed and used the following OpenCL kernel to convert ARGB (the typical Windows bitmap pixel layout) to the y-plane (full sized), u/v-half-plane (quarter sized) memory layout as input for libx264 encoding.
__kernel void ARGB2YUV (
    __global unsigned int * sourceImage,
    __global unsigned int * destImage,
    unsigned int srcHeight,
    unsigned int srcWidth,
    unsigned int yuvStride // must be srcWidth/4 since we pack 4 pixels into 1 Y-unit (with 4 y-pixels)
)
{
    int i, j;
    unsigned int RGBs [ 4 ];
    unsigned int posSrc, RGB, Value4 = 0, Value, yuvStrideHalf, srcHeightHalf, yPlaneOffset, posOffset;
    unsigned char red, green, blue;

    unsigned int posX = get_global_id(0);
    unsigned int posY = get_global_id(1);

    if ( posX < yuvStride ) {
        // Y plane - pack 4 y's within each work item
        if ( posY >= srcHeight )
            return;

        posSrc = (posY * srcWidth) + (posX * 4);

        RGBs [ 0 ] = sourceImage [ posSrc ];
        RGBs [ 1 ] = sourceImage [ posSrc + 1 ];
        RGBs [ 2 ] = sourceImage [ posSrc + 2 ];
        RGBs [ 3 ] = sourceImage [ posSrc + 3 ];

        for ( i=0; i<4; i++ ) {
            RGB = RGBs [ i ];
            blue = RGB & 0xff; green = (RGB >> 8) & 0xff; red = (RGB >> 16) & 0xff;

            Value = ( ( 66 * red + 129 * green + 25 * blue ) >> 8 ) + 16;
            Value4 |= (Value << (i * 8));
        }

        destImage [ (posY * yuvStride) + posX ] = Value4;
        return;
    }

    posX -= yuvStride;
    yuvStrideHalf = yuvStride >> 1;

    // U plane - pack 4 u's within each work item
    if ( posX >= yuvStrideHalf )
        return;

    srcHeightHalf = srcHeight >> 1;
    if ( posY < srcHeightHalf ) {
        posSrc = ((posY * 2) * srcWidth) + (posX * 8);

        RGBs [ 0 ] = sourceImage [ posSrc ];
        RGBs [ 1 ] = sourceImage [ posSrc + 2 ];
        RGBs [ 2 ] = sourceImage [ posSrc + 4 ];
        RGBs [ 3 ] = sourceImage [ posSrc + 6 ];

        for ( i=0; i<4; i++ ) {
            RGB = RGBs [ i ];
            blue = RGB & 0xff; green = (RGB >> 8) & 0xff; red = (RGB >> 16) & 0xff;

            Value = ( ( -38 * red + -74 * green + 112 * blue ) >> 8 ) + 128;
            Value4 |= (Value << (i * 8));
        }

        yPlaneOffset = yuvStride * srcHeight;
        posOffset = (posY * yuvStrideHalf) + posX;
        destImage [ yPlaneOffset + posOffset ] = Value4;
        return;
    }

    posY -= srcHeightHalf;
    if ( posY >= srcHeightHalf )
        return;

    // V plane - pack 4 v's within each work item
    posSrc = ((posY * 2) * srcWidth) + (posX * 8);

    RGBs [ 0 ] = sourceImage [ posSrc ];
    RGBs [ 1 ] = sourceImage [ posSrc + 2 ];
    RGBs [ 2 ] = sourceImage [ posSrc + 4 ];
    RGBs [ 3 ] = sourceImage [ posSrc + 6 ];

    for ( i=0; i<4; i++ ) {
        RGB = RGBs [ i ];
        blue = RGB & 0xff; green = (RGB >> 8) & 0xff; red = (RGB >> 16) & 0xff;

        Value = ( ( 112 * red + -94 * green + -18 * blue ) >> 8 ) + 128;
        Value4 |= (Value << (i * 8));
    }

    yPlaneOffset = yuvStride * srcHeight;
    posOffset = (posY * yuvStrideHalf) + posX;
    destImage [ yPlaneOffset + (yPlaneOffset >> 2) + posOffset ] = Value4;
    return;
}
This code performs only global 32-bit memory access while 8-bit processing happens within each work item.
Oh, and here's the proper code to invoke the kernel:
unsigned int width = 1024;
unsigned int height = 768;
unsigned int frameSize = width * height;
const unsigned int argbSize = frameSize * 4; // ARGB pixels
const unsigned int yuvSize = frameSize + (frameSize >> 1); // Y,U,V planes
const unsigned int yuvStride = width >> 2; // since we pack 4 RGBs into "one" YYYY
// Allocates ARGB buffer
ocl_rgb_buffer = clCreateBuffer ( context, CL_MEM_READ_WRITE, argbSize, 0, &error );
// ... error handling ...
ocl_yuv_buffer = clCreateBuffer ( context, CL_MEM_READ_WRITE, yuvSize, 0, &error );
// ... error handling ...
error = clSetKernelArg ( kernel, 0, sizeof(cl_mem), &ocl_rgb_buffer );
error |= clSetKernelArg ( kernel, 1, sizeof(cl_mem), &ocl_yuv_buffer );
error |= clSetKernelArg ( kernel, 2, sizeof(unsigned int), &height);
error |= clSetKernelArg ( kernel, 3, sizeof(unsigned int), &width);
error |= clSetKernelArg ( kernel, 4, sizeof(unsigned int), &yuvStride);
// ... error handling ...
const size_t local_ws[] = { 16, 16 };
const size_t global_ws[] = { yuvStride + (yuvStride >> 1), height };
error = clEnqueueNDRangeKernel ( queue, kernel, 2, NULL, global_ws, local_ws, 0, NULL, NULL );
// ... error handling ...
Note: have a look at the work item calculations. Some additional code needs to be added (e.g. rounding up so there are sufficient spare items) to make sure that the global work sizes are multiples of the local work sizes.
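A common way to do that (a sketch, not part of the original answer; it relies on the bounds checks already in the kernel): round each global dimension up to the next multiple of the local size.
// Round a global work size up to the next multiple of the local size.
size_t roundUp(size_t global, size_t local) {
    return ((global + local - 1) / local) * local;
}

const size_t local_ws[] = { 16, 16 };
const size_t global_ws[] = { roundUp(yuvStride + (yuvStride >> 1), local_ws[0]),
                             roundUp(height, local_ws[1]) };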
Like this? Use int4 unless your platform can use int3. Also, you can pack 5 pixels into an int16, so you waste 1/16 instead of 1/4 of the memory bandwidth.
__kernel void rgb2yuv( __global int3* input, __global int3* output )
{
    // note: the variables must be declared; see the comments below about int3
    int3 rgb = input[get_global_id(0)];
    int R = rgb.x;
    int G = rgb.y;
    int B = rgb.z;

    int3 yuv;
    yuv.x = ( ( 66 * R + 129 * G + 25 * B + 128) >> 8) + 16;
    yuv.y = ( ( -38 * R - 74 * G + 112 * B + 128) >> 8) + 128;
    yuv.z = ( ( 112 * R - 94 * G - 18 * B + 128) >> 8) + 128;
    output[get_global_id(0)] = yuv;
}
According to the OpenCL specification, the data type int3 doesn't exist.
Page 123:
Supported values of n are 2, 4, 8, and 16...
In your kernel, the variables rgb, R, G, B, and yuv would need to be at least __private int4.
OpenCL 1.1 added support for vector types with n = 3. However, I strongly recommend you don't use them. Different vendor implementations have different bugs, and it's not saving you anything.
