On iOS how to quickly convert RGB24 to BGR24? - ios

I use vImageConvert_RGB888toPlanar8 and vImageConvert_Planar8toRGB888 from Accelerate.framework to convert RGB24 to BGR24, but when the data need to transform is very big, such as 3M or 4M, the time need to spend on this is about 10ms. So some one know some fast enough idea?.My code like this:
- (void)transformRGBToBGR:(const UInt8 *)pict{
rgb.data = (void *)pict;
vImage_Error error = vImageConvert_RGB888toPlanar8(&rgb,&red,&green,&blue,kvImageNoFlags);
if (error != kvImageNoError) {
NSLog(#"vImageConvert_RGB888toARGB8888 error");
}
error = vImageConvert_Planar8toRGB888(&blue,&green,&red,&bgr,kvImageNoFlags);
if (error != kvImageNoError) {
NSLog(#"vImagePermuteChannels_ARGB8888 error");
}
free((void *)pict);
}

With a RGB888ToPlanar8 call you scatter the data and then gather it once again. This is very-very-very bad. If the memory overhead of 33% is affordable, try using the RGBA format and permute the B/R bytes in-place.
If you want to save 33% percents, then I might suggest the following. Iterate all the pixels, but read only a multiple of 4 bytes (since lcm(3,4) is 12, that is 3 dwords).
uint8_t* src_image;
uint8_t* dst_image;
uint32_t* src = (uint32_t*)src_image;
uint32_t* dst = (uint32_t*)dst_image;
uint32_t v1, v2, v3;
uint32_t nv1, nv2, nv3;
for(int i = 0 ; i < num_pixels / 12 ; i++)
{
// read 12 bytes
v1 = *src++;
v2 = *src++;
v3 = *src++;
// shuffle bits in the pixels
// [R1 G1 B1 R2 | G2 B2 R3 G3 | B3 R4 G4 B4]
nv1 = // [B1 G1 R1 B2]
((v1 >> 8) & 0xFF) | (v1 & 0x00FF0000) | ((v1 >> 16) & 0xFF) | ((v2 >> 24) & 0xFF);
nv2 = // [G2 R2 B3 G3]
...
nv3 = // [R3 B4 G4 R4]
...
// write 12 bytes
*dst++ = nv1;
*dst++ = nv2;
*dst++ = nv3;
}
Even better can be done with NEON intrinsics.
See this link from ARM's website to see how the 24-bit swapping is done.
The BGR-to-RGB can be done in-place like this:
void neon_asm_convert_BGR_TO_RGB(uint8_t* img, int numPixels24)
{
// numPixels is divided by 24
__asm__ volatile(
"0: \n"
"# load 3 64-bit regs with interleave: \n"
"vld3.8 {d0,d1,d2}, [%0] \n"
"# swap d0 and d2 - R and B\n"
"vswp d0, d2 \n"
"# store 3 64-bit regs: \n"
"vst3.8 {d0,d1,d2}, [%0]! \n"
"subs %1, %1, #1 \n"
"bne 0b \n"
:
: "r"(img), "r"(numPixels24)
: "r4", "r5"
);
}

Just swap the channels - BGRA to RGBA
- (void)convertBGRAFrame:(const CLPBasicVideoFrame &)bgraFrame toRGBA:(CLPBasicVideoFrame &)rgbaFrame
{
vImage_Buffer bgraImageBuffer = {
.width = bgraFrame.width,
.height = bgraFrame.height,
.rowBytes = bgraFrame.bytesPerRow,
.data = bgraFrame.rawPixelData
};
vImage_Buffer rgbaImageBuffer = {
.width = rgbaFrame.width,
.height = rgbaFrame.height,
.rowBytes = rgbaFrame.bytesPerRow,
.data = rgbaFrame.rawPixelData
};
const uint8_t byteSwapMap[4] = { 2, 1, 0, 3 };
vImage_Error error;
error = vImagePermuteChannels_ARGB8888(&bgraImageBuffer, &rgbaImageBuffer, byteSwapMap, kvImageNoFlags);
if (error != kvImageNoError) {
NSLog(#"%s, vImage error %zd", __PRETTY_FUNCTION__, error);
}
}

Related

how to convert pixelBuffer from BGRA to YUV

i want to convert pixelBuffer from BGRA to YUV(420V).
Using the convert function, most of the videos in my mobile phone photo albums are running normally ,
Execpt the one video from my colleagues, after converted the pixels are insanity,
the video from my colleagues is quite normal,
Video
ID : 1
Format : AVC
Format/Info : Advanced Video Codec
Format profile : Main#L3.1
Format settings : CABAC / 1 Ref Frames
Format settings, CABAC : Yes
Format settings, Reference frames : 1 frame
Format settings, GOP : M=1, N=15
Codec ID : avc1
Codec ID/Info : Advanced Video Coding
Duration : 6 s 623 ms
Source duration : 6 s 997 ms
Bit rate : 4 662 kb/s
Width : 884 pixels
Clean aperture width : 884 pixels
Height : 492 pixels
Clean aperture height : 492 pixels
Display aspect ratio : 16:9
Original display aspect ratio : 16:9
Frame rate mode : Variable
Frame rate : 57.742 FPS
Minimum frame rate : 20.000 FPS
Maximum frame rate : 100.000 FPS
Color space : YUV
Chroma subsampling : 4:2:0
Bit depth : 8 bits
Scan type : Progressive
Bits/(Pixel*Frame) : 0.186
Stream size : 3.67 MiB (94%)
Source stream size : 3.79 MiB (97%)
Title : Core Media Video
Encoded date : UTC 2021-10-29 09:54:03
Tagged date : UTC 2021-10-29 09:54:03
Color range : Limited
Color primaries : Display P3
Transfer characteristics : BT.709
Matrix coefficients : BT.709
Codec configuration box : avcC
this is my function, i do not know what is wrong.
CFDictionaryRef CreateCFDictionary(CFTypeRef* keys, CFTypeRef* values, size_t size) {
return CFDictionaryCreate(kCFAllocatorDefault,
keys,
values,
size,
&kCFTypeDictionaryKeyCallBacks,
&kCFTypeDictionaryValueCallBacks);
}
static void bt709_rgb2yuv8bit_TV(uint8_t R, uint8_t G, uint8_t B, uint8_t &Y, uint8_t &U, uint8_t &V)
{
Y = 0.183 * R + 0.614 * G + 0.062 * B + 16;
U = -0.101 * R - 0.339 * G + 0.439 * B + 128;
V = 0.439 * R - 0.399 * G - 0.040 * B + 128;
}
CVPixelBufferRef RGB2YCbCr8Bit(CVPixelBufferRef pixelBuffer)
{
CVPixelBufferLockBaseAddress(pixelBuffer, 0);
uint8_t *baseAddress = (uint8_t *)CVPixelBufferGetBaseAddress(pixelBuffer);
int w = (int) CVPixelBufferGetWidth(pixelBuffer);
int h = (int) CVPixelBufferGetHeight(pixelBuffer);
// int stride = (int) CVPixelBufferGetBytesPerRow(pixelBuffer) / 4;
OSType pixelFormat = kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange;
CVPixelBufferRef pixelBufferCopy = NULL;
const size_t attributes_size = 1;
CFTypeRef keys[attributes_size] = {
kCVPixelBufferIOSurfacePropertiesKey,
};
CFDictionaryRef io_surface_value = CreateCFDictionary(nullptr, nullptr, 0);
CFTypeRef values[attributes_size] = {io_surface_value};
CFDictionaryRef attributes = CreateCFDictionary(keys, values, attributes_size);
CVReturn status = CVPixelBufferCreate(kCFAllocatorDefault,
w,
h,
pixelFormat,
attributes,
&pixelBufferCopy);
if (status != kCVReturnSuccess) {
std::cout << "YUVBufferCopyWithPixelBuffer :: failed" << std::endl;
return nullptr;
}
if (attributes) {
CFRelease(attributes);
attributes = nullptr;
}
CVPixelBufferLockBaseAddress(pixelBufferCopy, 0);
size_t y_stride = CVPixelBufferGetBytesPerRowOfPlane(pixelBufferCopy, 0);
size_t uv_stride = CVPixelBufferGetBytesPerRowOfPlane(pixelBufferCopy, 1);
int plane_h1 = (int) CVPixelBufferGetHeightOfPlane(pixelBufferCopy, 0);
int plane_h2 = (int) CVPixelBufferGetHeightOfPlane(pixelBufferCopy, 1);
uint8_t *y = (uint8_t *) CVPixelBufferGetBaseAddressOfPlane(pixelBufferCopy, 0);
memset(y, 0x80, plane_h1 * y_stride);
uint8_t *uv = (uint8_t *) CVPixelBufferGetBaseAddressOfPlane(pixelBufferCopy, 1);
memset(uv, 0x80, plane_h2 * uv_stride);
int y_bufferSize = w * h;
int uv_bufferSize = w * h / 4;
uint8_t *y_planeData = (uint8_t *) malloc(y_bufferSize * sizeof(uint8_t));
uint8_t *u_planeData = (uint8_t *) malloc(uv_bufferSize * sizeof(uint8_t));
uint8_t *v_planeData = (uint8_t *) malloc(uv_bufferSize * sizeof(uint8_t));
int u_offset = 0;
int v_offset = 0;
uint8_t R, G, B;
uint8_t Y, U, V;
for (int i = 0; i < h; i ++) {
for (int j = 0; j < w; j ++) {
int offset = i * w + j;
B = baseAddress[offset * 4];
G = baseAddress[offset * 4 + 1];
R = baseAddress[offset * 4 + 2];
bt709_rgb2yuv8bit_TV(R, G, B, Y, U, V);
y_planeData[offset] = Y;
//隔行扫描 偶数行的偶数列取U 奇数行的偶数列取V
if (j % 2 == 0) {
(i % 2 == 0) ? u_planeData[u_offset++] = U : v_planeData[v_offset++] = V;
}
}
}
for (int i = 0; i < plane_h1; i ++) {
memcpy(y + i * y_stride, y_planeData + i * w, w);
if (i < plane_h2) {
for (int j = 0 ; j < w ; j+=2) {
//NV12 和 NV21 格式都属于 YUV420SP 类型。它也是先存储了 Y 分量,但接下来并不是再存储所有的 U 或者 V 分量,而是把 UV 分量交替连续存储。
//NV12 是 IOS 中有的模式,它的存储顺序是先存 Y 分量,再 UV 进行交替存储。
memcpy(uv + i * y_stride + j, u_planeData + i * w/2 + j/2, 1);
memcpy(uv + i * y_stride + j + 1, v_planeData + i * w/2 + j/2, 1);
}
}
}
free(y_planeData);
free(u_planeData);
free(v_planeData);
CVPixelBufferUnlockBaseAddress(pixelBuffer, 0);
CVPixelBufferUnlockBaseAddress(pixelBufferCopy, 0);
return pixelBufferCopy;
}
pixelBuffer BGRA is normal
pixelBuffer YUV insanity
In the video metadata there is a line Color space: YUV It looks like that this video isn't BGRA
When you calculate source pixel you must use stride (length of image row in bytes) instead of width because distance between rows in image may be bigger than width * pixel_size_in_bytes. I recommend to check this case on images with odd width.
int offset = i * stride + j;
You already has it commented at the beginning of function:
int stride = (int) CVPixelBufferGetBytesPerRow(pixelBuffer) / 4;
It is better to use builtin functions for converting images. Here is an example from one of my projects:
vImage_CGImageFormat out_cg_format = CreateVImage_CGImageFormat( target_pixel_format );
CGColorSpaceRef color_space = CGColorSpaceCreateDeviceRGB();
vImageCVImageFormatRef in_cv_format = vImageCVImageFormat_Create(
MSPixFmt_to_CVPixelFormatType(source_pixel_format),
kvImage_ARGBToYpCbCrMatrix_ITU_R_601_4,
kCVImageBufferChromaLocation_Center,
color_space,
0 );
CGColorSpaceRelease(color_space);
CGColorSpaceRelease(out_cg_format.colorSpace);
vImage_Error err = kvImageNoError;
vImageConverterRef converter = vImageConverter_CreateForCVToCGImageFormat(in_cv_format, &out_cg_format, NULL, kvImagePrintDiagnosticsToConsole, &err);
vImage_Buffer src_planes[4] = {{0}};
vImage_Buffer dst_planes[4] = {{0}};
unsigned long source_plane_count = vImageConverter_GetNumberOfSourceBuffers(converter);
for( unsigned int i = 0; i < source_plane_count; i++ )
{
src_planes[i] = (vImage_Buffer){planes_in[i], pic_size.height, pic_size.width, strides_in[i]};
}
unsigned long target_plane_count = vImageConverter_GetNumberOfDestinationBuffers(converter);
for( unsigned int i = 0; i < target_plane_count; i++ )
{
dst_planes[i] = (vImage_Buffer){planes_out[i], pic_size.height, pic_size.width, strides_out[i]};
}
err = vImageConvert_AnyToAny(converter, src_planes, dst_planes, NULL, kvImagePrintDiagnosticsToConsole);

How to Calculate CRC Starting at Last Byte

I'm trying to implement a CRC-CCITT calculator in VHDL. I was able to initially do that; however, I recently found out that data is delivered starting at the least-significant byte. In my code, data is transmitted 7 bytes at a time through a frame. So let's say we have the following data: 123456789 in ASCII or 313233343536373839 in hex. The data would be transmitted as such (with the following CRC):
-- First frame of data
RxFrame.Data <= (
1 => x"39", -- LSB
2 => x"38",
3 => x"37",
4 => x"36",
5 => x"35",
6 => x"34",
7 => x"33"
);
-- Second/last frame of data
RxFrame.Data <= (
1 => x"32",
2 => x"31", -- MSB
3 => xx, -- "xx" means irrelevant data, not part of CRC calculation.
4 => xx, -- This occurs only in the last frame, when it specified in
5 => xx, -- byte 0 which bytes contain data
6 => xx,
7 => xx
);
-- Calculated CRC should be 0x31C3
Another example with data 0x4376669A1CFC048321313233343536373839 and its correct CRC is shown below:
-- First incoming frame of data
RxFrame.Data <= (
1 => x"39", -- LSB
2 => x"38",
3 => x"37",
4 => x"36",
5 => x"35",
6 => x"34",
7 => x"33"
);
-- Second incoming frame of data
RxFrame.Data <= (
1 => x"32",
2 => x"31",
3 => x"21",
4 => x"83",
5 => x"04",
6 => x"FC",
7 => x"1C"
);
-- Third/last incoming frame of data
RxFrame.Data <= (
1 => x"9A",
2 => x"66",
3 => x"76",
4 => x"43", -- MSB
5 => xx, -- Irrelevant data, specified in byte 0
6 => xx,
7 => xx
);
-- Calculated CRC should be 0x2848
Is there a concept I'm missing? Is there a way to calculate the CRC with the data being received in reverse order? I am implementing this for CANopen SDO block protocols. Thanks!
CRC calculation algorithm to verify SDO block transfer from CANopen standard
Example code to generate a CRC16 with the bytes read in reverse (last byte first), using a function to do a carryless multiply modulo the CRC polynomial. An explanation follows.
#include <stdio.h>
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
#define POLY (0x1021u)
/* carryless multiply modulo crc polynomial */
uint16_t MpyModPoly(uint16_t a, uint16_t b) /* (a*b)%poly */
{
uint16_t pd = 0;
uint16_t i;
for(i = 0; i < 16; i++){
/* assumes twos complement */
pd = (pd<<1)^((0-(pd>>15))&POLY);
pd ^= (0-(b>>15))&a;
b <<= 1;
}
return pd;
}
/* generate crc in reverse byte order */
uint16_t Crc16R(uint8_t * b, size_t sz)
{
uint8_t *e = b + sz; /* end of bfr ptr */
uint16_t crc = 0u; /* crc */
uint16_t pdm = 0x100u; /* padding multiplier */
while(e > b){ /* generate crc */
pdm = MpyModPoly(0x100, pdm);
crc ^= MpyModPoly( *--e, pdm);
}
return(crc);
}
/* msg will be processed in reverse order */
static uint8_t msg[] = {0x43,0x76,0x66,0x9A,0x1C,0xFC,0x04,0x83,
0x21,0x31,0x32,0x33,0x34,0x35,0x36,0x37,
0x38,0x39};
int main()
{
uint16_t crc;
crc = Crc16R(msg, sizeof(msg));
printf("%04x\n", crc);
return 0;
}
Example code using X86 xmm pclmulqdq and psrlq, to emulate a 16 bit by 16 bit hardware (VHDL) carryless multiply:
/* __m128i is an intrinsic for X86 128 bit xmm register */
static __m128i poly = {.m128i_u32[0] = 0x00011021u}; /* poly */
static __m128i invpoly = {.m128i_u32[0] = 0x00008898u}; /* 2^31 / poly */
/* carryless multiply modulo crc polynomial */
/* using xmm pclmulqdq and psrlq */
uint16_t MpyModPoly(uint16_t a, uint16_t b)
{
__m128i ma, mb, mp, mt;
ma.m128i_u64[0] = a;
mb.m128i_u64[0] = b;
mp = _mm_clmulepi64_si128(ma, mb, 0x00); /* mp = a*b */
mt = _mm_srli_epi64(mp, 16); /* mt = mp>>16 */
mt = _mm_clmulepi64_si128(mt, invpoly, 0x00); /* mt = mt*ipoly */
mt = _mm_srli_epi64(mt, 15); /* mt = mt>>15 = (a*b)/poly */
mt = _mm_clmulepi64_si128(mt, poly, 0x00); /* mt = mt*poly */
return mp.m128i_u16[0] ^ mt.m128i_u16[0]; /* ret mp^mt */
}
/* external code to generate invpoly */
#define POLY (0x11021u)
static __m128i invpoly; /* 2^31 / poly */
void GenMPoly(void) /* generate __m12i8 invpoly */
{
uint32_t N = 0x10000u; /* numerator = x^16 */
uint32_t Q = 0; /* quotient = 0 */
for(size_t i = 0; i <= 15; i++){ /* 31 - 16 = 15 */
Q <<= 1;
if(N&0x10000u){
Q |= 1;
N ^= POLY;
}
N <<= 1;
}
invpoly.m128i_u16[0] = Q;
}
Explanation: consider the data as separate strings of ever increasing length, padded with zeroes at the end. For the first few bytes of your example, the logic would calculate
CRC = CRC16({39})
CRC ^= CRC16({38 00})
CRC ^= CRC16({37 00 00})
CRC ^= CRC16({36 00 00 00})
...
To speed up this calculation, rather than actually pad with n zero bytes, you can do a carryless multiply of a CRC by 2^{n·8} modulo POLY, where POLY is the 17 bit polynomial used for CRC16:
CRC = CRC16({39})
CRC ^= (CRC16({38}) · (2^08 % POLY)) % POLY
CRC ^= (CRC16({37}) · (2^10 % POLY)) % POLY
CRC ^= (CRC16({36}) · (2^18 % POLY)) % POLY
...
A carryless multiply modulo POLY is equivalent to what CRC16 does, so this translates into pseudo code (all values in hex, 2^8 = 100)
CRC = 0
PDM = 100 ;padding multiplier
PDM = (100 · PDM) % POLY ;main loop (2 lines per byte)
CRC ^= ( 39 · PDM) % POLY
PDM = (100 · PDM) % POLY
CRC ^= ( 38 · PDM) % POLY
PDM = (100 · PDM) % POLY
CRC ^= ( 37 · PDM) % POLY
PDM = (100 · PDM) % POLY
CRC ^= ( 36 · PDM) % POLY
...
Implementing (A · B) % POLY is based on binary math:
(A · B) % POLY = (A · B) ^ (((A · B) / POLY) · POLY)
Where multiply is carryless (XOR instead of add) and divide is borrowless (XOR instead of subtract). Since the divide is borrowless, and most significant term of POLY is x^16, the quotient
Q = (A · B) / POLY
only depends on the upper 16 bits of (A · B). Dividing by POLY uses multiplication by the 16 bit constant IPOLY = (2^31)/POLY followed by a right shift:
Q = (A · B) / POLY = (((A · B) >> 16) · IPOLY) >> 15
The process uses a 16 bit by 16 bit carryless multiply, producing a 31 bit product.
POLY = 0x11021u ; CRC polynomial (17 bit)
IPOLY = 0x08898u ; 2^31 / POLY
; generated by external software
MpyModPoly(A, B)
{
MP = A · B ; MP = A · B
MT = MP >> 16 ; MT = MP >> 16
MT = MT · IPOLY ; MT = MT · IPOLY
MT = MT >> 15 ; MT = (A · B) / POLY
MT = MT · POLY ; MT = ((A · B) / POLY) * POLY
return MP xor MT ; (A·B) ^ (((A · B) / POLY) · POLY)
}
A hardware based carryless multiply would look something like this 4 bit · 4 bit example.
p[] = [a3 a2 a1 a0] · [b3 b2 b1 b0]
p[] is a 7 bit product generated with 7 parallel circuits.
The time for multiply would be worst case propagation time for p3.
p6 = a3&b3
p5 = a3&b2 ^ a2&b3
p4 = a3&b1 ^ a2&b2 ^ a1&b3
p3 = a3&b0 ^ a2&b1 ^ a1&b2 ^ a0&b3
p2 = a2&b0 ^ a1&b1 ^ a0&b2
p1 = a1&b0 ^ a0&b1
p0 = a0&b0
If the xor gates available only have 2 bit inputs, the logic can
be split up. For example:
p3 = (a3&b0 ^ a2&b1) ^ (a1&b2 ^ a0&b3)
I don't know if your VHDL toolset includes a library for carryless multiply. For a 16 bit by 16 bit multiply resulting in a 31 bit product (p30 to p00), p15 has 16 outputs from the 16 ands (in parallel), which could be xor'ed using a tree like structure, 8 xors in parallel feeding into 4 xors in parallel feeding into 2 xor's in parallel into a single xor. So the propagation time would be 1 and and 4 xor propagation times.
Here is an example in C that you can adapt. Since you mentioned VHDL, this is a bit-wise implementation suitable for casting into gates and flip-flops. However, if cycles are more precious to you than memory and gates, then there is also a byte-wise table-driven version that would run in 1/8 the number of cycles.
What this does is the inverse of what is done in a normal CRC calculation. It then applies the same size input in zeros with a normal CRC to get what the normal CRC would have been on that input. Running the zeros through takes the same number of cycles as the inverse CRC, i.e. O(n) where n is the size of the input. If that latency is too large, that can be reduced to O(log n) cycles, with some investment in gates.
#include <stddef.h>
// Update crc with the CRC-16/XMODEM of n zero bytes. (This can be done in
// O(log n) time or cycles instead of O(n), with a little more effort.)
static unsigned crc16x_zeros_bit(unsigned crc, size_t n) {
for (size_t i = 0; i < n; i++)
for (int k = 0; k < 8; k++)
crc = crc & 0x8000 ? (crc << 1) ^ 0x1021 : crc << 1;
return crc & 0xffff;
}
// Update crc with the CRC-16/XMODEM of the len bytes at mem in reverse. If mem
// is NULL, then return the initial value for the CRC. When done,
// crc16x_zeros_bit() must be used to apply the total length of zero bytes, in
// order to get what the CRC would have been if it were calculated on the bytes
// fed in the opposite order.
static unsigned crc16x_inverse_bit(unsigned crc, void const *mem, size_t len) {
unsigned char const *data = mem;
if (data == NULL)
return 0;
crc &= 0xffff;
for (size_t i = 0; i < len; i++) {
for (int k = 0; k < 8; k++)
crc = crc & 1 ? (crc >> 1) ^ 0x8810 : crc >> 1;
crc ^= (unsigned)data[i] << 8;
}
return crc;
}
#include <stdio.h>
int main(void) {
// Do framed example.
unsigned crc = crc16x_inverse_bit(0, NULL, 0);
crc = crc16x_inverse_bit(crc, (void const *)"9876543", 7);
crc = crc16x_inverse_bit(crc, (void const *)"21", 2);
crc = crc16x_zeros_bit(crc, 9);
printf("%04x\n", crc);
// Do another one.
crc = crc16x_inverse_bit(0, NULL, 0);
crc = crc16x_inverse_bit(crc, (void const *)"9876543", 7);
crc = crc16x_inverse_bit(crc, (void const *)"21!\x83\x04\xfc\x1c", 7);
crc = crc16x_inverse_bit(crc, (void const *)"\x9a" "fvC", 4);
crc = crc16x_zeros_bit(crc, 18);
printf("%04x\n", crc);
return 0;
}
Here is the O(log n) version of crc16x_zeros_bit():
// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial. For speed, a cannot be zero.
static inline unsigned multmodp(unsigned a, unsigned b) {
unsigned p = 0;
for (;;) {
if (a & 1) {
p ^= b;
if (a == 1)
break;
}
a >>= 1;
b = b & 0x8000 ? (b << 1) ^ 0x1021 : b << 1;
}
return p & 0xffff;
}
// Return x^(8n) modulo p(x).
static unsigned x2nmodp(size_t n) {
unsigned p = 1; // x^0 == 1
unsigned q = 0x10; // x^2^2
while (n) {
q = multmodp(q, q); // x^2^k mod p(x), k = 3,4,...
if (n & 1)
p = multmodp(q, p);
n >>= 1;
}
return p;
}
// Update crc with the CRC-16/XMODEM of n zero bytes.
static unsigned crc16x_zeros_bit(unsigned crc, size_t n) {
return multmodp(x2nmodp(n), crc);
}

Converting from single channel 12-bit little endian raw binary data

I have a raw binary image file where every pixel consists of 12 bit data (gray-scale). For example, the first four pixels in the raw file:
0x0 0xC0
0x1 0x05
0x2 0x5C
0x3 0xC0
0x4 0x05
0x5 0x5C
This corresponds to 4 pixel values with the value 0x5C0 (little endian).
Unfortunately, using the following command:
convert -size 384x184 -depth 12 gray:frame_0.raw out.tiff
interprets the pixel values incorrectly (big endian), resulting in the pixel values 0xC00 0x55C 0xC00 0x55C.
I tried the options -endian LSB and -endian MSB, but unfortunately they only change the output byte order, not the input byte order.
How do I get convert to open the raw image as 12-bit little endian data?
I had a quick try at this, but I have no test data but it should be fairly close and easy to detect errors with your images:
// pad12to16.c
// Mark Setchell
// Pad 12-bit data to 16-bit
//
// Compile with:
// gcc pad12to16.c -o pad12to16
//
// Run with:
// ./pad12to16 < 12-bit.dat > 16-bit.dat
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>
#include <sys/types.h>
#define BYTESPERREAD 6
#define PIXPERWRITE 4
int main(){
unsigned char buf[BYTESPERREAD];
unsigned short pixel[PIXPERWRITE];
// Read 6 bytes at a time and decode to 4 off 16-bit pixels
while(read(0,buf,BYTESPERREAD)==BYTESPERREAD){
pixel[0] = buf[0] | ((buf[1] & 0xf) << 8);
pixel[1] = (buf[2] << 4) | ((buf[1] & 0xf0) >> 4);
pixel[2] = buf[3] | ((buf[2] & 0xf) << 8);
pixel[3] = (buf[5] << 4) | ((buf[4] & 0xf0) >> 4);
write(1,pixel,PIXPERWRITE*2);
}
return 0;
}
So you would run this (I think):
./pad12to16 < 12-bit.dat | convert -size 384x184 -depth 16 gray:- result.tif
Mark's answer is correct, as you'll need to involve some external tool to sort-out the data stream. Usually there's some sort of padding when working with 12-bit depth. In the example blob provided, we see that the each pair of pixels share a common byte. The task of splitting the shared byte, and shifting what-to-where is fairly easy. This answer compliments Mark's answer, and argues that ImageMagick's C-API might as well be used.
// my12bit_convert.c
#include <stdio.h>
#include <stdlib.h>
#include <magick/MagickCore.h>
#include <wand/MagickWand.h>
static ExceptionType serverty;
#define LEADING_HALF(x) ((x >> 4) & 0xF)
#define FOLLOWING_HALF(x) (x & 0xF)
#define TO_DOUBLE(x) ((double)x / (double)0xFFF);
#define IS_OK(x,y) if(x == MagickFalse) { fprintf(stderr, "%s\n", MagickGetException(y, &serverty)); }
int main(int argc, const char * argv[]) {
// Prototype vars
int
i,
tmp_pixels[2];
double * pixel_buffer;
size_t
w = 0,
h =0,
total = 0,
iterator = 0;
ssize_t
x = 0,
y = 0;
const char
* path = NULL,
* output = NULL;
unsigned char read_pixel_chunk[3];
FILE * fh;
MagickWand * wand;
PixelWand * pwand;
MagickBooleanType ok;
// Iterate over arguments and collect size, input, & output.
for ( i = 1; i < argc; i++ ) {
if (argv[i][0] == '-') {
if (LocaleCompare("size", &argv[i][1]) == 0) {
i++;
if (i == argc) {
fprintf(stderr, "Missing `WxH' argument for `-size'.");
return EXIT_FAILURE;
}
GetGeometry(argv[i], &x, &y, &w, &h);
}
} else if (path == NULL){
path = argv[i];
} else {
output = argv[i];
}
}
// Validate to some degree
if ( path == NULL ) {
fprintf(stderr, "Missing input path\n");
return EXIT_FAILURE;
}
if ( output == NULL ) {
fprintf(stderr, "Missing output path\n");
return EXIT_FAILURE;
}
total = w * h;
if (total == 0) {
fprintf(stderr, "Unable to determine size of %s. (use `-size WxH')\n", path);
return EXIT_FAILURE;
}
// Allocated memory and start the party!
pixel_buffer = malloc(sizeof(double) * total);
MagickWandGenesis();
// Read input file, and sort 12-bit pixels.
fh = fopen(path, "rb");
if (fh == NULL) {
fprintf(stderr, "Unable to read `%s'\n", path);
return 1;
}
while(!feof(fh)) {
total = fread(read_pixel_chunk, 3, 1, fh);
if (total) {
// 0xC0 0x05
// ^------' ==> 0x05C0
tmp_pixels[0] = FOLLOWING_HALF(read_pixel_chunk[1]) << 8 | read_pixel_chunk[0];
// 0x05 0x5C
// '------^ ==> 0x05C0
tmp_pixels[1] = read_pixel_chunk[2] << 4 | LEADING_HALF(read_pixel_chunk[1]);
// 0x5C0 / 0xFFF ==> 0.359463
pixel_buffer[iterator++] = TO_DOUBLE(tmp_pixels[0]);
pixel_buffer[iterator++] = TO_DOUBLE(tmp_pixels[1]);
}
}
fclose(fh);
// Create image
wand = NewMagickWand();
pwand = NewPixelWand();
ok = PixelSetColor(pwand, "white");
IS_OK(ok, wand);
// Create new Image
ok = MagickNewImage(wand, w, h, pwand);
IS_OK(ok, wand);
// Import pixels as gray, or intensity, values.
ok = MagickImportImagePixels(wand, x, y, w, h, "I", DoublePixel, pixel_buffer);
IS_OK(ok, wand);
// Save ouput
ok = MagickWriteImage(wand, output);
IS_OK(ok, wand);
// Clean house
DestroyPixelWand(pwand);
DestroyMagickWand(wand);
MagickWandTerminus();
if (pixel_buffer) {
free(pixel_buffer);
}
return 0;
}
Which can be compiled with
LLVM_CFLAGS=`MagickWand-config --cflags`
LLVM_LDFLAGS=`MagickWand-config --ldflags`
clang $LLVM_CFLAGS $LLVM_LDFLAGS -o my12bit_convert my12bit_convert.c
And usage
./my12bit_convert -size 384x184 frame_0.raw out.tiff

How to quickly transform from RGB24 to BGRA on iOS?

which I use is vImageConvert_RGB888toARGB8888 and vImagePermuteChannels_ARGB8888. First change it from RGB24 to ARGB, because OPENGL ES doesn't support ARGB, so need to change ARGB to BGRA with vImagePermuteChannels_ARGB8888. But when the data need to change is big the time cost is high, so someone with better methods? thanks! My codes like this:
- (void)transformRGBToBGRA:(const UInt8 *)pict{
rgb.data = (void *)pict;
vImage_Error error = vImageConvert_RGB888toARGB8888(&rgb,NULL,0,&argb,NO,kvImageNoFlags);
if (error != kvImageNoError) {
NSLog(#"vImageConvert_RGB888toARGB8888 error");
}
const uint8_t permuteMap[4] = {3,2,1,0};
error = vImagePermuteChannels_ARGB8888(&argb,&bgra,permuteMap,kvImageNoFlags);
if (error != kvImageNoError) {
NSLog(#"vImagePermuteChannels_ARGB8888 error");
}
free((void *)pict);
}
Now I found the fastest solution is assembly. Just like this:
__asm__ volatile(
"0: \n"
"# load 3 64-bit regs with interleave: \n"
"vld3.8 {d0,d1,d2}, [%0]! \n"
"# swap d0 and d2 - R and B\n"
"vswp d0, d2 \n"
"# store 4 64-bit regs: \n"
"vst4.8 {d0,d1,d2,d3}, [%2]! \n"
"subs %1, %1, #1 \n"
"bne 0b \n"
:
: "r"(img), "r"(numPixels24), "+r"(bgraData)
: "r6", "r7","r8"
);

Fast RGB => YUV conversion in OpenCL

I know the following formula can be used to convert RGB images to YUV images. In the following formula, R, G, B, Y, U, V are all 8-bit unsigned integers, and intermediate values are 16-bit unsigned integers.
Y = ( ( 66 * R + 129 * G + 25 * B + 128) >> 8) + 16
U = ( ( -38 * R - 74 * G + 112 * B + 128) >> 8) + 128
V = ( ( 112 * R - 94 * G - 18 * B + 128) >> 8) + 128
But when the formula is used in OpenCL it's a different story.
1. 8-bit memory write access is an optional extension, which means some OpenCL implementations may not support it.
2. even the above extension is supported, it's deadly slow compared with 32-bit write access.
In order to get better performance, every 4 pixels will be processed at the same time, so the input is 12 8-bit integers and the output is 3 32-bit unsigned integers(the first one stands for 4 Y samples, the second one stands for 4 U samples, the last one stands for 4 V samples).
My question is how to get these 3 32-bit integers directly from the 12 8-bit integers? Is there a formula to get these 3 32-bit integers, or I just need to use the old formula to get 12 8-bit integer results(4 Y, 4 U, 4 V) and construct the 3 32-bit integers with bit-wise operation?
Even though this question was asked 2 years ago, i think some working code would help here. In terms of the initial concerns about bad performance when directly accessing 8-bit values, it's better to perform 32-bit direct access when possible.
Some time ago I've developed and used the following OpenCL kernel to convert ARGB (typical windows bitmap pixel layout) to the y-plane (full sized), u/v-half-plane (quarter sized) memory layout as input for libx264 encoding.
__kernel void ARGB2YUV (
__global unsigned int * sourceImage,
__global unsigned int * destImage,
unsigned int srcHeight,
unsigned int srcWidth,
unsigned int yuvStride // must be srcWidth/4 since we pack 4 pixels into 1 Y-unit (with 4 y-pixels)
)
{
int i,j;
unsigned int RGBs [ 4 ];
unsigned int posSrc, RGB, Value4 = 0, Value, yuvStrideHalf, srcHeightHalf, yPlaneOffset, posOffset;
unsigned char red, green, blue;
unsigned int posX = get_global_id(0);
unsigned int posY = get_global_id(1);
if ( posX < yuvStride ) {
// Y plane - pack 4 y's within each work item
if ( posY >= srcHeight )
return;
posSrc = (posY * srcWidth) + (posX * 4);
RGBs [ 0 ] = sourceImage [ posSrc ];
RGBs [ 1 ] = sourceImage [ posSrc + 1 ];
RGBs [ 2 ] = sourceImage [ posSrc + 2 ];
RGBs [ 3 ] = sourceImage [ posSrc + 3 ];
for ( i=0; i<4; i++ ) {
RGB = RGBs [ i ];
blue = RGB & 0xff; green = (RGB >> 8) & 0xff; red = (RGB >> 16) & 0xff;
Value = ( ( 66 * red + 129 * green + 25 * blue ) >> 8 ) + 16;
Value4 |= (Value << (i * 8));
}
destImage [ (posY * yuvStride) + posX ] = Value4;
return;
}
posX -= yuvStride;
yuvStrideHalf = yuvStride >> 1;
// U plane - pack 4 u's within each work item
if ( posX >= yuvStrideHalf )
return;
srcHeightHalf = srcHeight >> 1;
if ( posY < srcHeightHalf ) {
posSrc = ((posY * 2) * srcWidth) + (posX * 8);
RGBs [ 0 ] = sourceImage [ posSrc ];
RGBs [ 1 ] = sourceImage [ posSrc + 2 ];
RGBs [ 2 ] = sourceImage [ posSrc + 4 ];
RGBs [ 3 ] = sourceImage [ posSrc + 6 ];
for ( i=0; i<4; i++ ) {
RGB = RGBs [ i ];
blue = RGB & 0xff; green = (RGB >> 8) & 0xff; red = (RGB >> 16) & 0xff;
Value = ( ( -38 * red + -74 * green + 112 * blue ) >> 8 ) + 128;
Value4 |= (Value << (i * 8));
}
yPlaneOffset = yuvStride * srcHeight;
posOffset = (posY * yuvStrideHalf) + posX;
destImage [ yPlaneOffset + posOffset ] = Value4;
return;
}
posY -= srcHeightHalf;
if ( posY >= srcHeightHalf )
return;
// V plane - pack 4 v's within each work item
posSrc = ((posY * 2) * srcWidth) + (posX * 8);
RGBs [ 0 ] = sourceImage [ posSrc ];
RGBs [ 1 ] = sourceImage [ posSrc + 2 ];
RGBs [ 2 ] = sourceImage [ posSrc + 4 ];
RGBs [ 3 ] = sourceImage [ posSrc + 6 ];
for ( i=0; i<4; i++ ) {
RGB = RGBs [ i ];
blue = RGB & 0xff; green = (RGB >> 8) & 0xff; red = (RGB >> 16) & 0xff;
Value = ( ( 112 * red + -94 * green + -18 * blue ) >> 8 ) + 128;
Value4 |= (Value << (i * 8));
}
yPlaneOffset = yuvStride * srcHeight;
posOffset = (posY * yuvStrideHalf) + posX;
destImage [ yPlaneOffset + (yPlaneOffset >> 2) + posOffset ] = Value4;
return;
}
This code performs only global 32-bit memory access while 8-bit processing happens within each work item.
Oh.. and the proper code to invoke the kernel
unsigned int width = 1024;
unsigned int height = 768;
unsigned int frameSize = width * height;
const unsigned int argbSize = frameSize * 4; // ARGB pixels
const unsigned int yuvSize = frameSize + (frameSize >> 1); // Y,U,V planes
const unsigned int yuvStride = width >> 2; // since we pack 4 RGBs into "one" YYYY
// Allocates ARGB buffer
ocl_rgb_buffer = clCreateBuffer ( context, CL_MEM_READ_WRITE, argbSize, 0, &error );
// ... error handling ...
ocl_yuv_buffer = clCreateBuffer ( context, CL_MEM_READ_WRITE, yuvSize, 0, &error );
// ... error handling ...
error = clSetKernelArg ( kernel, 0, sizeof(cl_mem), &ocl_rgb_buffer );
error |= clSetKernelArg ( kernel, 1, sizeof(cl_mem), &ocl_yuv_buffer );
error |= clSetKernelArg ( kernel, 2, sizeof(unsigned int), &height);
error |= clSetKernelArg ( kernel, 3, sizeof(unsigned int), &width);
error |= clSetKernelArg ( kernel, 4, sizeof(unsigned int), &yuvStride);
// ... error handling ...
const size_t local_ws[] = { 16, 16 };
const size_t global_ws[] = { yuvStride + (yuvStride >> 1), height };
error = clEnqueueNDRangeKernel ( queue, kernel, 2, NULL, global_ws, local_ws, 0, NULL, NULL );
// ... error handling ...
Note: have a look at the work item calculations. Some additional code needs to be added (e.g. using mod so as to add sufficient spare items) to make sure that work item sizes fit to local work sizes.
Like this? Use int4 unless your platform can use int3. Also you can pack 5 pixels into an int16 so you are wasting 1/16 instead of 1/4 of the memory bandwidth.
__kernel void rgb2yuv( __global int3* input, __global int3* output){
rgb = input[get_global_id(0)];
R = rgb.x;
G = rgb.y;
B = rgb.z;
yuv.x = ( ( 66 * R + 129 * G + 25 * B + 128) >> 8) + 16;
yuv.y = ( ( -38 * R - 74 * G + 112 * B + 128) >> 8) + 128;
yuv.z = ( ( 112 * R - 94 * G - 18 * B + 128) >> 8) + 128;
output[get_global_id(0)] = yuv;
}
Along with opencl specification data type int3 doesn't exists.
Page 123:
Supported values of n are 2, 4, 8, and 16...
In your kernel variables rgb, R, G, B, and yuv should be at least __private int4.
OpenCL 1.1 added support for typen where n = 3. However, I strongly recommend you don't use it. Different vendor implementations have different bugs, and it's not saving you anything.

Resources