Fast UInt to Float conversion in Swift - ios

I am doing some real-time image analysis on a live video stream. I am using vImage to calculate histograms and vDSP for some further processing. I have Objective-C code that has worked well over the years, and I am now converting it to Swift. While it works, it is too slow. I have found that the main problem is converting the vImage histogram, which is UInt (vImagePixelCount), to Float that vDSP can handle. In Objective-C I use vDSP to do the conversion:
err = vImageHistogramCalculation_Planar8(&vBuffY, histogramY, 0);
vDSP_vfltu32((const unsigned int *)histogramY, 2, histFloatY, 1, 256);
However, the vImage histogram is UInt, not UInt32, so I can't use vDSP_vfltu32 in Swift. Instead I am using
let err = vImageHistogramCalculation_Planar8(&vBuffY, &histogramY, 0)
let histFloatY = histogramY.compactMap{ Float($0) }
The problem is that this code is more than 100 times slower than the Objective-C version. Are there any faster alternatives?

vImageHistogramCalculation_Planar8() writes the histogram into a buffer with 256 elements of type vImagePixelCount, which is a type alias for unsigned long in C; that is a 64-bit integer on 64-bit platforms.
Your Objective-C code “cheats” by casting the unsigned long pointer to an unsigned int pointer in the call to vDSP_vfltu32() and setting the stride to 2. So what happens here is that the lower 32 bits of each unsigned long are converted to a float (assuming a little-endian platform). That works as long as no count exceeds 2^32 - 1.
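To see why this works, here is a small C illustration (my own sketch, not from the original post; it assumes a little-endian platform where unsigned long is 64 bits) of reading the low 32-bit word of each 64-bit count with a stride of 2:

```c
#include <stdint.h>

/* Illustration only: on a little-endian, 64-bit platform, viewing an
   array of unsigned long counts through a uint32_t pointer with a
   stride of 2 yields the low 32 bits of each count, which is exactly
   what the vDSP_vfltu32 call relies on. */
uint32_t low_word(const unsigned long *counts, int i) {
    const uint32_t *words = (const uint32_t *)counts;
    return words[2 * i]; /* stride 2: skip the high word of each count */
}
```

Note that `low_word` silently truncates any count above 2^32 - 1, which is the caveat mentioned above.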
You can do exactly the same in Swift, except that here the type casting is done by “rebinding” the memory:
let err = vImageHistogramCalculation_Planar8(&vBuffY, &histogramY, 0)
histogramY.withUnsafeBytes {
    let uint32ptr = $0.bindMemory(to: UInt32.self)
    vDSP_vfltu32(uint32ptr.baseAddress!, 2, &histFloatY, 1, 256)
}

Related

Fast vectorized pixel-wise operations on images

I want to measure the degree of similarity between two same-sized grayscale images using mean square error. I can't use any framework that is not part of the macOS SDK (e.g. OpenCV, Eigen). A simple implementation of this algorithm without vectorization looks like this:
vImage_Buffer imgA;
vImage_Buffer imgB;
NSUInteger mse = 0;
unsigned char *pxlsA = (unsigned char *)imgA.data;
unsigned char *pxlsB = (unsigned char *)imgB.data;
for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
    NSUInteger d = pxlsA[i] - pxlsB[i];
    mse += d * d;
}
Is there some way to do this without loop, in more vectorized way? Maybe something like:
mse = ((imgA - imgB) ^ 2).sum();
The answer to this question lies in the vDSP library, which is part of the macOS SDK.
https://developer.apple.com/documentation/accelerate/vdsp
vDSP - Perform basic arithmetic operations and common digital signal processing routines on large vectors.
In my situation the vectors are not really big, but it still helps.
First, you need to convert unsigned char * to float * (and notably, I don't know how to do this without a loop). Then you need two vDSP functions: vDSP_vsbsbm and vDSP_sve.
vDSP_vsbsbm - Multiplies the difference of two single-precision vectors by a second difference of two single-precision vectors.
vDSP_sve - Calculates the sum of values in a single-precision vector.
So the final code looks like that:
float *fpxlsA = (float *)malloc(imgA.height * imgA.width * sizeof(float));
float *fpxlsB = (float *)malloc(imgB.height * imgB.width * sizeof(float));
float *output = (float *)malloc(imgB.height * imgB.width * sizeof(float));
for (size_t i = 0; i < imgA.height * imgA.width; ++i) {
    fpxlsA[i] = (float)(pxlsA[i]);
    fpxlsB[i] = (float)(pxlsB[i]);
}
vDSP_vsbsbm(fpxlsA, 1, fpxlsB, 1, fpxlsA, 1, fpxlsB, 1, output, 1, imgA.height * imgA.width);
float sum;
vDSP_sve(output, 1, &sum, imgA.height * imgA.width);
free(output);
free(fpxlsA);
free(fpxlsB);
So, this code did exactly what I wanted, in a more vectorized form. But the result isn't good enough. Comparing the performance of the loop approach and the vDSP approach, vDSP is about twice as fast when there is no additional memory allocation. But in reality, where the additional allocations take place, the loop approach is slightly faster.
This appears to be part of macOS: https://developer.apple.com/documentation/accelerate
A nice and fast way to write that loop using pointer arithmetic would be as follows:
int d;
size_t i = imgA.height * imgA.width;
while (i--)
{
    d = (int)(*pxlsA++) - (int)(*pxlsB++);
    mse += d * d;
}
EDIT
Oops: since those are unsigned chars, and since we calculate the difference, we need to use signed integers to do so.
And another edit: we must use pxls... here; I don't know what img... is.

Convert decibels to volume using Accelerate Framework

I am building a kind of audio fader effect.
I am using vDSP_vdbcon to turn a buffer of volumes into decibels, applying some modifications in dB-space, and would like to convert the decibel buffer back into volume using the Accelerate framework.
Thanks!
Here is what I use for each element for decibel values between -40 and 0. It gives pretty good results.
float decibelsToMag(float decibels) {
    return powf(10.0f, 0.05f * decibels);
}
I don't know the Accelerate vector equivalent for the pow function. But here's a half vectorized version.
void decibelsToMags(float *decibels, float *mag, int count) {
    float mul = 0.05f;
    vDSP_vsmul(decibels, 1, &mul, mag, 1, count);
    for (int i = 0; i < count; i++) {
        mag[i] = powf(10.0f, mag[i]);
    }
}
Post back if you can figure out the vDSP version of the loop.

NEON acceleration for 12-bit to 8-bit

I have a buffer of 12-bit data (stored in 16-bit values) and need to convert it to 8-bit (shift right by 4).
How can NEON accelerate this processing?
Thank you for your help
Brahim
I took the liberty of assuming a few things, explained below, but this kind of code (untested; it may require a few modifications) should provide a good speedup compared to a naive non-NEON version:
#include <arm_neon.h>
#include <stdint.h>
void convert(const uint16_t * restrict input, // the buffer to convert
             uint8_t * restrict output,       // the buffer in which to store the result
             int sz) {                        // their (common) size
    /* Assuming the buffer size is a multiple of 8 */
    for (int i = 0; i < sz; i += 8) {
        // Load a vector of 8 16-bit values:
        uint16x8_t v = vld1q_u16(input + i);
        // Shift it right by 4, narrowing to 8-bit values:
        uint8x8_t shifted = vshrn_n_u16(v, 4);
        // Store it in the output buffer:
        vst1_u8(output + i, shifted);
    }
}
Things I assumed here:
that you're working with unsigned values. If it's not the case, it will be easy to adapt anyway (uint* -> int*, *_u8->*_s8 and *_u16->*_s16)
as the values are loaded 8 by 8, I assumed the buffer length was a multiple of 8 to avoid edge cases. If that's not the case, you should probably pad it artificially to a multiple of 8.
Finally, the 2 resource pages used from the NEON documentation:
about loads and stores of vectors.
about shifting vectors.
Hope this helps!
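For reference (my own addition, not part of the answer above), a plain scalar version of the same conversion is useful both for validating the NEON code on a small buffer and for handling a tail that is not a multiple of 8:

```c
#include <stdint.h>

/* Scalar reference for the 12-bit-in-16-bit to 8-bit conversion:
   each value is shifted right by 4, dropping the low 4 bits. */
void convert_scalar(const uint16_t *input, uint8_t *output, int sz) {
    for (int i = 0; i < sz; ++i)
        output[i] = (uint8_t)(input[i] >> 4);
}
```

A common pattern is to run the NEON loop over `sz & ~7` elements and this scalar loop over the remainder.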
prototype: void dataConvert(void *pDst, void *pSrc, unsigned int count);

1:
    vld1.16     {q8-q9}, [r1]!
    vld1.16     {q10-q11}, [r1]!
    vqrshrn.u16 d16, q8, #4
    vqrshrn.u16 d17, q9, #4
    vqrshrn.u16 d18, q10, #4
    vqrshrn.u16 d19, q11, #4
    vst1.16     {q8-q9}, [r0]!
    subs        r2, #32
    bgt         1b

q flag: saturation
r flag: rounding
Change u16 to s16 in case of signed data.

What are the upper and lower limits and types of pixel values in OpenCV?

What are the upper and lower limits of pixel values in OpenCV and how can I get them?
The only limits I could figure out are CV_8U type Mat's, where the lower limit for pixel values in a channel is 0, the upper is 255. What are these values for other Mat's?
Say CV_32F, CV_32S?
OpenCV Equivalent C/C++ data types:
CV_8U -> unsigned char (min = 0, max = 255)
CV_8S -> char (min = -128, max = 127)
CV_16U -> unsigned short (min = 0, max = 65535)
CV_16S -> short (min = -32768, max = 32767)
CV_32S -> int (min = -2147483648, max = 2147483647)
CV_32F -> float (min = -FLT_MAX, max = FLT_MAX, approx. +/- 3.4e38)
CV_64F -> double (min = -DBL_MAX, max = DBL_MAX, approx. +/- 1.8e308)
Check this tutorial for data type ranges.
One thing to consider is that while displaying images of type CV_32F or CV_64F with imshow or cvShowImage, OpenCV expects values to be normalized between 0.0 and 1.0. Else, it saturates the pixel values.
CV_32F means a 32-bit floating point number. CV_32S means a 32-bit signed integer. I'm sure you can guess what CV_64F stands for. The internet is full of references for the ranges that different data types can take on; here is 32S, for instance.
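If you'd rather query these bounds than hard-code them, the standard C limits headers provide the same numbers. A small sketch (my own addition; the CV_* labels in the comments just map to the table above):

```c
#include <limits.h>
#include <float.h>
#include <stdio.h>

/* Print the value range of the C type behind each OpenCV depth,
   taken straight from <limits.h> and <float.h>. */
void print_cv_ranges(void) {
    printf("CV_8U : 0 .. %d\n", UCHAR_MAX);              /* 0 .. 255 */
    printf("CV_8S : %d .. %d\n", SCHAR_MIN, SCHAR_MAX);  /* -128 .. 127 */
    printf("CV_16U: 0 .. %d\n", USHRT_MAX);              /* 0 .. 65535 */
    printf("CV_16S: %d .. %d\n", SHRT_MIN, SHRT_MAX);    /* -32768 .. 32767 */
    printf("CV_32S: %d .. %d\n", INT_MIN, INT_MAX);      /* -2147483648 .. 2147483647 */
    printf("CV_32F: finite range +/- %g\n", FLT_MAX);    /* approx. 3.4e38 */
    printf("CV_64F: finite range +/- %g\n", DBL_MAX);    /* approx. 1.8e308 */
}
```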

how to make a CUDA Histogram kernel?

I am writing a CUDA kernel to compute the histogram of a picture, but I have no idea how to return an array from the kernel, and the array changes while other threads are reading it. Any possible solution for this?
__global__ void Hist(
    TColor *dst, // input image
    int imageW,
    int imageH,
    int *data
){
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ix < imageW && iy < imageH)
    {
        // this assigns the specific RED value of the image to pixel
        int pixel = get_red(dst[imageW * (iy) + (ix)]);
        data[pixel]++; // ?? problem statement ...
    }
}
@param d_dst: input image; TColor is equal to float4.
@param data: the array for the histogram, size [256]
extern "C" void
cuda_Hist(TColor *d_dst, int imageW, int imageH, int *data)
{
    dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
    dim3 grid(iDivUp(imageW, BLOCKDIM_X), iDivUp(imageH, BLOCKDIM_Y));
    Hist<<<grid, threads>>>(d_dst, imageW, imageH, data);
}
Have you looked at the SDK sample? The "histogram" sample is available in the CUDA SDK (currently version 3.0 on the NVIDIA developer site, version 3.1 beta available for registered developers).
The documentation with the sample explains nicely how to handle your summation, either using global memory atomics on the GPU or by collecting the results for each block separately and then doing a separate reduction (either on the host or the GPU).
Histogramming is not particularly efficient when implemented with CUDA (or with GPGPU in general) - typically you need to generate lots of partial histograms in shared memory and then sum them. You might want to consider keeping this particular task on the CPU.
You will have to either use an atomic function to keep other threads from writing to the same memory, or use partial histograms. Either way it is not that efficient unless the input image is very, very large.
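The partial-histogram strategy both answers describe can be sketched in plain C (my own illustration, with hypothetical names; on the GPU each "block" would be a thread block keeping its partial histogram in shared memory, and the reduction would be a second kernel or atomicAdd into global memory):

```c
#include <string.h>

#define BINS 256
#define NBLOCKS 4

/* Plain-C sketch of the partial-histogram approach: each "block"
   accumulates a private histogram over its slice of the pixels, so no
   two blocks ever write the same bin; the partials are then reduced
   into the final histogram in a separate step. */
void histogram_partials(const unsigned char *pixels, int n, unsigned int *hist) {
    unsigned int partial[NBLOCKS][BINS];
    memset(partial, 0, sizeof(partial));
    int chunk = (n + NBLOCKS - 1) / NBLOCKS;
    for (int b = 0; b < NBLOCKS; ++b) {          /* each block... */
        int lo = b * chunk;
        int hi = (lo + chunk > n) ? n : lo + chunk;
        for (int i = lo; i < hi; ++i)
            partial[b][pixels[i]]++;             /* ...fills its own bins */
    }
    memset(hist, 0, BINS * sizeof(unsigned int));
    for (int b = 0; b < NBLOCKS; ++b)            /* reduction step */
        for (int k = 0; k < BINS; ++k)
            hist[k] += partial[b][k];
}
```

The per-block privatization is what removes the race in `data[pixel]++`; only the final reduction needs synchronization.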
