Vulkan Buffer WorkGroupID not returning actual value when large number of elements - buffer

Creating a buffer with pow(2, 24) elements and local_size_x = 64 in the layout input qualifier gives a highest WorkGroupID of 262143, which is as expected: pow(2, 24) / 64 - 1, since the ID is zero-indexed.
However, if we increase the global dimension (the number of elements, i.e. the size of the problem) to, say, pow(2, 25), WorkGroupID returns values that do not match the math.
Here are some limits that the device got that I think matter:
maxStorageBufferRange: uint32_t = 4294967295
maxComputeSharedMemorySize: uint32_t = 32768
maxComputeWorkGroupCount: uint32_t[3] = 00000202898A8EC4
maxComputeWorkGroupCount[0]: uint32_t = 65535
maxComputeWorkGroupCount[1]: uint32_t = 65535
maxComputeWorkGroupCount[2]: uint32_t = 65535
maxComputeWorkGroupInvocations: uint32_t = 1024
maxComputeWorkGroupSize: uint32_t[3] = 00000202898A8ED4
maxComputeWorkGroupSize[0]: uint32_t = 1024
maxComputeWorkGroupSize[1]: uint32_t = 1024
maxComputeWorkGroupSize[2]: uint32_t = 1024
I am not allocating more elements than the device supports.
After 2 days and 16 hours I still have not figured out what's going on...
WorkGroupSize, WorkGroupID, LocalInvocationID and GlobalInvocationID all show the same problem once I reach a certain number of elements. It is no wonder that GlobalInvocationID shows it too, given how it is calculated...
#version 450
// Size of the local work group is defined through the input layout qualifier
layout(local_size_x = 64, local_size_y = 1, local_size_z = 1) in;
layout(set = 0, binding = 0) buffer deviceBuffer
{
uint x[];
};
void main() {
uint i = gl_GlobalInvocationID.x;
//uint i = gl_WorkGroupSize.x * gl_WorkGroupID.x + gl_LocalInvocationID.x;
//x[i] += x[i];
// Number of work groups in the dispatch (the groupCountX passed to vkCmdDispatch)
//x[i] = gl_NumWorkGroups.x;
// Size of the local work group specified in the input layout qualifier
//x[i] = gl_WorkGroupSize.x;
// Index of the current work group, in the range [0, gl_NumWorkGroups.x - 1]
x[i] = gl_WorkGroupID.x;
//x[i] = gl_LocalInvocationID.x;
}

maxComputeWorkGroupCount[0]: uint32_t = 65535
maxComputeWorkGroupCount[1]: uint32_t = 65535
maxComputeWorkGroupCount[2]: uint32_t = 65535
vkCmdDispatch has the size x = pow(2, 25), y = 1, z = 1
Based on the info you provided, groupCountX = 2^25 = 33554432, but the limit is maxComputeWorkGroupCount[0] = 65535 = 2^16 - 1.
The Vulkan specification Valid Usage for vkCmdDispatch says:
groupCountX must be less than or equal to VkPhysicalDeviceLimits::maxComputeWorkGroupCount[0]
Violating Valid Usage is undefined behavior. "Undefined behavior" means anything from "everything seemingly working fine" to "your PC collapses into a black hole and destroys this solar system". For all intents and purposes, violating Valid Usage is a logic error in the application code.
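The arithmetic behind that limit can be sketched in a few lines. This is only an illustration in Python, not Vulkan code: the helper names groups_needed and split_dispatch are hypothetical, and spreading the work groups over the Y dimension is one common workaround that the answer above does not itself propose.

```python
MAX_GROUP_COUNT_X = 65535  # maxComputeWorkGroupCount[0] from the limits above
MAX_GROUP_COUNT_Y = 65535  # maxComputeWorkGroupCount[1] from the limits above


def groups_needed(num_elements, local_size_x):
    """Work groups required so that every element gets one invocation."""
    return (num_elements + local_size_x - 1) // local_size_x


def split_dispatch(num_elements, local_size_x,
                   max_x=MAX_GROUP_COUNT_X, max_y=MAX_GROUP_COUNT_Y):
    """Return (groupCountX, groupCountY) that covers num_elements within the limits."""
    total = groups_needed(num_elements, local_size_x)
    if total <= max_x:
        return total, 1
    # Too many groups for one dimension: spread them over rows in Y.
    count_y = (total + max_x - 1) // max_x
    if count_y > max_y:
        raise ValueError("problem size exceeds what a 2D dispatch can cover")
    count_x = (total + count_y - 1) // count_y
    return count_x, count_y


count_x, count_y = split_dispatch(2**25, 64)
```

With such a split, the shader would have to rebuild a linear group index, for example gl_WorkGroupID.y * gl_NumWorkGroups.x + gl_WorkGroupID.x, and skip invocations whose element index falls beyond the buffer, because count_x * count_y can be slightly larger than the number of groups actually needed.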

Related

Convolution of Image Processing in Processing language

Since the Corona situation has turned my studies into self-study, as a Processing newbie I am not having an easy time getting into image processing, more specifically convolution. I hope you can help me.
My lecturer, who unfortunately is almost never reachable, left me the following convolution code. The theory behind convolution is clear to me, but I have many gaps in understanding the code. Could someone add line comments so that I can follow the code more fluently?
The code is as follows:
color convolution (int x, int y, float[][] matrix, int matrix_size, PImage img){
float rtotal = 0.0;
float gtotal = 0.0;
float btotal = 0.0;
int offset = matrix_size / 2;
for (int i = 0; i < matrix_size; i++){
for (int j= 0; j < matrix_size; j++){
int xloc = x+i-offset;
int yloc = y+j-offset;
int loc = xloc + img.width*yloc;
rtotal += (red(img.pixels[loc]) * matrix[i][j]);
gtotal += (green(img.pixels[loc]) * matrix[i][j]);
btotal += (blue(img.pixels[loc]) * matrix[i][j]);
}
}
rtotal = constrain(rtotal, 0, 255);
gtotal = constrain(gtotal, 0, 255);
btotal = constrain(btotal, 0, 255);
return color(rtotal, gtotal, btotal);
}
I have to do a bit of guesswork since I'm not positive about all of the functions you're using and I'm not familiar with the Processing 3+ library, but here's my best shot at it.
color convolution (int x, int y, float[][] matrix, int matrix_size, PImage img){
// Note: the 'matrix' parameter here will also frequently be referred to as
// a 'window' or 'kernel' in research
// I'm not certain what your PImage class is from, but I'll assume
// you're using the Processing 3+ library and work off of that assumption
// how much of each color we see within the kernel (matrix) space
float rtotal = 0.0;
float gtotal = 0.0;
float btotal = 0.0;
// this offset is to zero-center our kernel
// the fact that we use matrix_size / 2 sort of implicitly
// alludes to the fact that our matrix_size should be an odd-number
// so that we can have a middle-pixel
int offset = matrix_size / 2;
// looping through the kernel. the fact that we use 'matrix_size'
// as our end-condition for both dimensions means that our 'matrix' kernel
// must always be a square
for (int i = 0; i < matrix_size; i++){
for (int j= 0; j < matrix_size; j++){
// calculating the index conversion from 2D to the 1D format that PImage uses
// refer to: https://processing.org/tutorials/pixels/
// for a better understanding of PImage indexing (about 1/3 of the way down the page)
// WARNING: by subtracting the offset it is possible to hit negative
// x,y values here if you pick an x or y position less than matrix_size / 2.
// the same index-out-of-bounds can occur on the high end.
// When you convolve using a kernel of N x N size (N here would be matrix_size)
// you can only convolve for x in [N / 2, width - 1 - N / 2]
// and for y in [N / 2, height - 1 - N / 2]
int xloc = x+i-offset;
int yloc = y+j-offset;
// this is the final 1D PImage index that corresponds to [xloc, yloc] in our 2D image
// really go back up and take a look at the link if this doesn't make sense, it's pretty good
int loc = xloc + img.width*yloc;
// I have to do some speculation again since I'm not certain what red(img.pixels[loc]) does
// I'll assume it returns the red channel of the pixel
// this section just adds up all of the pixel colors multiplied by the value in the kernel
rtotal += (red(img.pixels[loc]) * matrix[i][j]);
gtotal += (green(img.pixels[loc]) * matrix[i][j]);
btotal += (blue(img.pixels[loc]) * matrix[i][j]);
}
}
// the fact that no further division or averaging happens after the for-loops implies
// that the kernel you feed in should have balanced values for your kernel size
// for example, a kernel that's designed to average out the color over the 3 x 3 area
// it covers (this would be like blurring the image) would be filled with 1/9
// in general: the kernel you're using should have a sum of 1 for all of the numbers inside
// this is just 'in general' you can play around with not doing that, but you'll probably notice a
// darkening effect for when the sum is less than 1, and a brightening effect if it's greater than 1
// for more info on kernels, read this: https://en.wikipedia.org/wiki/Kernel_(image_processing)
// I don't have the code for this constrain function,
// but it's almost certainly just your typical clamp (constrains the values to [0, 255])
// Note: this means that your values saturate at 0 and 255
// if you see a lot of black or white then that means your kernel
// probably isn't balanced as mentioned above
rtotal = constrain(rtotal, 0, 255);
gtotal = constrain(gtotal, 0, 255);
btotal = constrain(btotal, 0, 255);
// Finished!
return color(rtotal, gtotal, btotal);
}
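The out-of-bounds warning in the comments above can be made concrete with a small sketch. It is written in Python rather than Processing, works on a grayscale image stored as a list of rows, and clamps neighbour coordinates to the image border. The clamping policy and the name convolve_pixel are assumptions made for illustration; the lecturer's code does no border handling.

```python
def convolve_pixel(x, y, kernel, image):
    """Weighted sum of the neighbourhood around (x, y) in a 2D grayscale image."""
    size = len(kernel)            # kernel is assumed square with an odd size
    offset = size // 2            # zero-centres the kernel on (x, y)
    height, width = len(image), len(image[0])
    total = 0.0
    for i in range(size):
        for j in range(size):
            # Clamp to the nearest valid pixel instead of indexing out of bounds.
            xloc = min(max(x + i - offset, 0), width - 1)
            yloc = min(max(y + j - offset, 0), height - 1)
            total += image[yloc][xloc] * kernel[i][j]
    # Same saturation as constrain(total, 0, 255) in the original code.
    return min(max(total, 0.0), 255.0)


# A 3 x 3 averaging kernel: nine entries of 1/9, so the entries sum to 1.
box_blur = [[1 / 9] * 3 for _ in range(3)]
image = [
    [0, 0, 0, 0],
    [0, 90, 90, 0],
    [0, 90, 90, 0],
    [0, 0, 0, 0],
]
centre = convolve_pixel(1, 1, box_blur, image)
corner = convolve_pixel(0, 0, box_blur, image)
```

Because the kernel entries sum to 1, a uniformly coloured image keeps its brightness; this is the "balanced kernel" property described in the comments.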

How can I align the frequency bins with the fourier transform magnitude?

I am attempting to implement a Fast Fourier Transform with associated complex magnitude function on the STM32F411RE Nucleo developer board. My goal is to separate a combined signal with multiple sinusoidal elements into their separate frequency components, with correct amplitude.
My issue is that I cannot correctly line up the frequency bins coming out of the complex magnitude function with the frequencies. I am also starting to question the validity of these results.
I have tried a number of different implementations that people have posted for the FFT algorithm with the magnitude fix, most notably the examples on Stack Overflow by SleuthEye and the blog by LB9MG.
AFAIK I have a similar approach, but somehow their approaches yield the desired results and mine do not. Below is my code that I have altered to work via the implementation that SleuthEye has created.
int main(void)
{
fftLen = 32; // can be 32, 64, 128, 256, 512, 1024, 2048, 4096
half_fftLen = fftLen/2;
volatile float32_t sampleFreq = 50 * fftLen; // Fs = binsize * fft length, desired binsize = 50 hz
arm_rfft_fast_instance_f32 inst;
arm_status status;
status = arm_rfft_fast_init_f32(&inst, fftLen);
float32_t signalCombined[fftLen] = {0};
float32_t fftCombined[fftLen] = {0};
float32_t fftMagnitude[fftLen] = {0};
volatile float32_t fftFreq[fftLen] = {0};
float32_t maxAmp;
uint32_t maxAmpInd;
while (1)
{
for (int i = 0; i< fftLen; i++)
{
signalCombined[i] = 40 * arm_sin_f32(450 * i); // 450 frequency at 40 amplitude
}
arm_rfft_fast_f32(&inst, signalCombined, fftCombined, 0); // perhaps switch to complex transform to allow for negative frequencies?
arm_cmplx_mag_f32(fftCombined, fftMagnitude, half_fftLen);
fftMagnitude[0] = fftCombined[0];
fftMagnitude[half_fftLen] = fftCombined[1];
arm_max_f32(fftMagnitude, half_fftLen, &maxAmp, &maxAmpInd); // We need the 3 max values
for (int k = 0; k < fftLen ; k++)
{
fftFreq[k] = ((k*sampleFreq)/fftLen);
}
}
}
Shown below are the results that I get out of the code listed above: whilst I do get a magnitude out of the algorithms (at the correct index 12), it does not correspond to the frequency or the amplitude of the input array signalCombined[].
Does anyone have an idea of why this is happening? Like so many of my errors it is probably a really trivial and stupid thing, but I cannot figure out for the life of me why this is happening.
EDIT: thanks to SleuthEye's help finding the frequencies is now possible, as the initial approach for generating the sin() signal was done incorrectly.
Some new issues popped up as the FFT only appears to yield the correct frequencies for the 32 samples, despite the bin size scaling accordingly to accommodate the adjusted sample size.
I am also unable to implement the amplitude-fixing algorithm: following the example code 2*(1/N)*abs(X(k))^2 from SleuthEye's link, I wrote my own implementation, 2 * powf(fabs(fftMagnitude[j]), 2) / fftLen, as shown in the code below, but it does not yield results that are even close to correct.
while (1)
{
for (int i = 0; i < fftLen; i++)
{
signalCombined[i] = 400 * arm_sin_f32(2 * PI * 450 * i / sampleFreq); // Sin Alpha, 400 amp at 10 kHz
// 700 * arm_sin_f32(2 * PI * 33000 * i / sampleFreq) + // Sin Bravo, 700 amp at 33 kHz
// 300 * arm_sin_f32(2 * PI * 50000 * i / sampleFreq); // Sin Charlie, 300 amp at 50 kHz
}
arm_rfft_fast_f32(&inst, signalCombined, fftCombined, 0); // calculate the fourier transform of the time domain signal
arm_cmplx_mag_f32(fftCombined, fftMagnitude, half_fftLen); // calculate the magnitude of the fourier transform
fftMagnitude[0] = fftCombined[0];
fftMagnitude[half_fftLen] = fftCombined[1];
for (int j = 0; j <= half_fftLen; j++) // sizeof(fftMagnitude) is a byte count, not an element count
{
fftMagnitude[j] = 2 * powf(fabs(fftMagnitude[j]), 2) / fftLen; // Algorithm to fix the amplitude of each unique frequency
}
arm_max_f32(fftMagnitude, half_fftLen, &maxAmp, &maxAmpInd); // We need the 3 max values
for (int k = 0; k < fftLen ; k++)
{
fftFreq[k] = ((k*sampleFreq)/fftLen);
}
}
Your tone generation does not take into account the sampling frequency of 1600Hz, so you are effectively generating a tone at a frequency of 450*1600/(2*PI) ~ 114591Hz which gets aliased to ~608Hz. That 608Hz frequency roughly corresponds to a frequency index around 12 when using an FFT size of 32.
The generation of a 450Hz tone at a 1600Hz sampling frequency should be done as follows:
for (int i = 0; i< fftLen; i++)
{
signalCombined[i] = 40 * arm_sin_f32(2 * PI * 450 * i / sampleFreq);
}
As far as matching the amplitude, keep in mind that there is a scaling factor between the time domain and the frequency domain of approximately 0.5*fftLen (see this other post of mine).
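Both points can be checked numerically. The sketch below uses Python with a naive O(N^2) DFT instead of the CMSIS arm_rfft_fast_f32 call, so it only illustrates the arithmetic. The helper names dft_magnitudes and aliased_frequency are hypothetical.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of bins 0 .. N/2 of a naive O(N^2) DFT."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags


def aliased_frequency(freq, sample_rate):
    """Frequency in [0, sample_rate / 2] that freq folds onto after sampling."""
    folded = freq % sample_rate
    return min(folded, sample_rate - folded)


fft_len = 32
sample_rate = 50.0 * fft_len  # 1600 Hz, so each bin is 50 Hz wide

# 450 Hz tone with amplitude 40, generated with the corrected formula.
signal = [40 * math.sin(2 * math.pi * 450 * i / sample_rate)
          for i in range(fft_len)]
mags = dft_magnitudes(signal)

peak_bin = max(range(len(mags)), key=lambda k: mags[k])
peak_freq = peak_bin * sample_rate / fft_len   # bin index -> frequency in Hz
peak_amp = mags[peak_bin] / (0.5 * fft_len)    # undo the ~0.5 * fftLen scaling

# Where the original 40 * sin(450 * i) tone ends up after aliasing.
alias = aliased_frequency(450 * sample_rate / (2 * math.pi), sample_rate)
```

The 450 Hz tone falls exactly on a bin centre here, which is why dividing by 0.5 * fft_len recovers the amplitude so cleanly. For frequencies between bins, the energy spreads over neighbouring bins and the peak reads lower.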

How to make OpenCV IplImage for 16 bit gray-data?

This code creates a gray-scale IplImage from 8-bit data.
IplImage* img_gray_resize = NULL;
img_gray_resize = cvCreateImage(cvSize(320, 256), IPL_DEPTH_8U, 1);
DWORD dwCount;
LVDS_SetDataMode(0); // 0 for 8 bit mode and 1 for 16 bit mode
dwCount = (LONG)320 * (LONG)256;
unsigned char* m_pImage = NULL;
m_pImage = new unsigned char[320 * 256];
for (int i=0; i<320 * 256; i++) m_pImage[i] = 0;
LVDS_GetFrame(&dwCount, m_pImage);
int width = 320;
int height = 256;
int nn = 0;
int ii = 0;
for (int y=0; y<height; y++)
{
for (int x=0; x<width; x++)
{
ii = y * width + x;
if(nn < (height*width))
img_gray_resize->imageData[ii] = m_pImage[nn++];
}
}
delete [] m_pImage;
I need to display a 16-bit gray-scale image. If I display 8-bit data, some information is missing from the image. LVDS_SetDataMode() can provide both types of data. I am using a library for a frame grabber device. Please help me.
16 bit images should be stored in IPL_DEPTH_16U (or CV_16U) mode. This is the correct memory layout.
However, displaying them depends on your display hardware.
Most regular display APIs, e.g. OpenCV's highgui, can only display 8-bit images.
To actually display the image, you will have to convert your image to 8-bits for display.
You will need to decide how to do this. There are many ways to do this, depending on your application and complexity. Some options are:
Show MSB = right-shift the image by 8 bits.
Show LSB = saturate anything above 255.
In fact, right-shift by any value between 0-8 bits, combined with a cv::saturate_cast to avoid value wrap-around.
HDR->LDR = Apply dynamic range compression algorithms.
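The shift-and-saturate option can be sketched as follows. This is plain Python on a list of integers rather than OpenCV code, and to_8bit is a hypothetical helper that plays the role of a right shift followed by cv::saturate_cast.

```python
def to_8bit(value16, shift):
    """Right-shift a 16-bit sample and saturate the result to [0, 255]."""
    if not 0 <= shift <= 8:
        raise ValueError("shift must be between 0 and 8 bits")
    return min(value16 >> shift, 255)


# A few hypothetical 16-bit samples, chosen only for illustration.
frame16 = [0, 200, 300, 0x1234, 0xFFFF]

msb_view = [to_8bit(v, 8) for v in frame16]  # keeps the 8 most significant bits
lsb_view = [to_8bit(v, 0) for v in frame16]  # keeps low values, clips the rest
mid_view = [to_8bit(v, 6) for v in frame16]  # a compromise between the two
```

A smaller shift preserves detail in dark regions at the cost of clipping bright ones, so the right value depends on how much of the 16-bit range the sensor actually uses.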
As far as I know, only 8-bit data can be displayed, so you need to find the best way to convert the 16-bit data to 8-bit while minimizing the information you lose. Histogram equalization can be applied to do this.
Finally, I have solved the problem in the following way:
dwCount = (LONG)320 * (LONG)256 * 2;
LVDS_SetDataMode(1);
img_gray_resize->imageData[ii] = m_pImage[nn++] >> 6;
Just shift the bits to the right (by 2, 3, 4, 5, 6, ...) and use whichever value gives a good result.

Stored UIImage pixel data into c array, unable to determine array's element count

I initialized the array like so
CGImageRef imageRef = CGImageCreateWithImageInRect(image.CGImage, bounds);
CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
NSUInteger width = CGImageGetWidth(imageRef);
NSUInteger height = CGImageGetHeight(imageRef);
unsigned char *rawData = malloc(height * width * 4);
NSUInteger bytesPerPixel = 4;
NSUInteger bytesPerRow = bytesPerPixel * width;
NSUInteger bitsPerComponent = 8;
CGContextRef context = CGBitmapContextCreate(rawData, width, height, bitsPerComponent, bytesPerRow, colorSpace, kCGImageAlphaPremultipliedLast | kCGBitmapByteOrder32Big);
However, when I tried checking the count through an NSLog, I always get 4 (4/1, specifically).
int count = sizeof(rawData)/sizeof(rawData[0]);
NSLog(@"%d", count);
Yet when I NSLog the value of individual elements, it returns non zero values.
ex.
CGFloat f1 = rawData[15];
CGFloat f2 = rawData[n], where n is image width*height*4;
//I wasn't expecting this to work since the last element should be n-1
Finally, I tried
int n = lipBorder.size.width *lipBorder.size.height*4*2; //lipBorder holds the image's dimensions, I tried multiplying by 2 because there are 2 pixels for every CGPoint in retina
CGFloat f = rawData[n];
This would return different values each time for the same image, (ex. 0.000, 115.000, 38.000).
How do I determine the count / how are the values being stored into the array?
rawData is a pointer to unsigned char, as such its size is 32 bits (4 bytes)[1]. rawData[0] is an unsigned char, as such its size is 8 bits (1 byte). Hence, 4/1.
You've probably seen this done with arrays before, where it does work as you would expect:
unsigned char temp[10] = {0};
NSLog(@"%zu", sizeof(temp)/sizeof(temp[0])); // Prints 10
Note, however, that you are dealing with a pointer to unsigned char, not an array of unsigned char - the semantics are different, hence why this doesn't work in your case.
If you want the size of your buffer, you'll be much better off simply using height * width * 4, since that's what you passed to malloc anyway. If you really must, you could divide that by sizeof(char) or sizeof(rawData[0]) to get the number of elements, but since they're chars you'll get the same number anyway.
Now, rawData is just a chunk of memory somewhere. There's other memory before and after it. So, if you attempt to do something like rawData[height * width * 4], what you're actually doing is attempting to access the next byte of memory after the chunk allocated for rawData. This is undefined behaviour, and can result in random garbage values being returned[2] (as you've observed), some "unassigned memory" marker value being returned, or a segmentation fault occurring.
[1]: iOS is a 32-bit platform
[2]: probably whatever value was put into that memory location last time it was legitimately used.
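The array-versus-pointer distinction can also be observed from Python's ctypes module, which exposes C type sizes. This is only an illustrative sketch, not Objective-C. The image dimensions are hypothetical, and the pointer size it reports depends on the platform it runs on (typically 8 bytes on a 64-bit system rather than the 4 bytes discussed above).

```python
import ctypes

# A fixed-size array type: its total size is known, so the sizeof ratio
# gives the element count, as with `unsigned char temp[10]`.
array_type = ctypes.c_ubyte * 10
array_count = ctypes.sizeof(array_type) // ctypes.sizeof(ctypes.c_ubyte)

# A pointer type: sizeof reports the size of the pointer itself,
# not the size of whatever buffer it points to.
pointer_type = ctypes.POINTER(ctypes.c_ubyte)
pointer_ratio = ctypes.sizeof(pointer_type) // ctypes.sizeof(ctypes.c_ubyte)

# For a malloc-style buffer, the caller has to remember the length.
width, height, bytes_per_pixel = 320, 240, 4  # hypothetical image dimensions
buffer_len = width * height * bytes_per_pixel
raw_data = (ctypes.c_ubyte * buffer_len)()
last_valid_index = buffer_len - 1
```

Unlike C, ctypes arrays are bounds-checked, so raw_data[buffer_len] raises an IndexError instead of silently reading the next byte of memory.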
The pointer returned by malloc is a void* pointer meaning that it returns a pointer to an address in memory. It seems that the width and the height that are being returned are 0. This would explain why you are only being allocated 4 bytes for your array.
You also said that you tried
int n = lipBorder.size.width *lipBorder.size.height*4*2; //lipBorder holds the image's dimensions, I tried multiplying by 2 because there are 2 pixels for every CGPoint in retina
CGFloat f = rawData[n];
and were receiving different values each time. This behavior is to be expected given that your array is only 4 bytes long and you are accessing an area of memory that is much further ahead. The value kept changing because you were accessing memory that was not in your array, but in a location
lipBorder.size.width *lipBorder.size.height*4*2 - 4 bytes past the end of your array. C in no way prevents you from accessing any memory within your program. If you had accessed memory that is off limits to your program, you would have received a segmentation fault.
You can therefore access the n + 1, n + 2 or n + whatever element. It only means that you are accessing memory that is past the end of your array.
Incrementing the pointer rawData would move the memory address by one byte. Incrementing an int pointer would move the memory address by 4 bytes (sizeof(int)).

Why Global memory version is faster than constant memory in my CUDA code?

I am working on a CUDA program and wanted to speed up the computation using constant memory, but it turned out that using constant memory makes my code ~30% slower.
I know that constant memory is good at broadcasting reads to whole warps and I thought that my program could take an advantage of it.
Here is constant memory code:
__constant__ float4 constPlanes[MAX_PLANES_COUNT];
__global__ void faultsKernelConstantMem(const float3* vertices, unsigned int vertsCount, int* displacements, unsigned int planesCount) {
unsigned int blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
unsigned int vertexIndex = __mul24(blockId, blockDim.x) + threadIdx.x;
if (vertexIndex >= vertsCount) {
return;
}
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
//__syncthreads();
for (unsigned int planeIndex = 0; planeIndex < planesCount; ++planeIndex) {
float4 plane = constPlanes[planeIndex];
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
}
displacements[vertexIndex] = displacementSteps;
}
The global memory code is the same, but it has one more parameter (a pointer to the array of planes) and uses it instead of the global constant array.
I thought that those first global memory reads
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
may cause "desynchronization" of the threads, so that they would not take advantage of the broadcasting of constant memory reads. I tried calling __syncthreads(); before reading constant memory, but it did not change anything.
What is wrong? Thanks in advance!
System:
CUDA Driver Version: 5.0
CUDA Capability: 2.0
Parameters:
number of vertices: ~2.5 millions
number of planes: 1024
Results:
constant mem version: 46 ms
global mem version: 35 ms
EDIT:
I've tried many things to make the constant memory version faster, such as:
1) Commenting out the two global memory reads to see if they have any impact; they do not. Global memory was still faster.
2) Processing more vertices per thread (from 8 to 64) to take advantage of CM caches. This was even slower than one vertex per thread.
2b) Using shared memory to store displacements and vertices: load all of them at the beginning, process, and save all displacements. Again, slower than the CM example shown.
After this experience I really do not understand how CM read broadcasting works and how it can be used correctly in my code. This code probably cannot be optimized with CM.
EDIT2:
Another day of tweaking, I've tried:
3) Process more vertices (8 to 64) per thread with memory coalescing (every thread goes with increment equal to total number of threads in system) -- this gives better results than increment equal to 1 but still no speedup
4) Replace this if statement
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
which gives 'unpredictable' results, with a little bit of math to avoid branching, using this code:
float dist = v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w;
int distInt = (int)(dist * (1 << 29)); // distance is in range (0 - 2), stretch it to int range
int sign = 1 | (distInt >> (sizeof(int) * CHAR_BIT - 1)); // compute sign without using ifs
displacementSteps += sign;
Unfortunately this is a lot slower (~30%) than using the if, so ifs are not as evil as I thought.
EDIT3:
I am closing this question with the conclusion that this problem probably cannot be improved by using constant memory. These are my results*:
*Times reported as median from 15 independent measurements. When constant memory was not large enough for saving all planes (4096 and 8192), kernel was invoked multiple times.
Although a compute capability 2.0 chip has 64k of constant memory, each of the multi-processors has only 8k of constant-memory cache. Your code has each thread requiring access to all 16k of the constant memory, so you are losing performance through cache misses. To effectively use constant memory for the plane data, you will need to restructure your implementation.
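The sizes involved can be worked out in a few lines. This is only arithmetic in Python, using the 8k cache figure quoted above. The helper names are hypothetical, and processing the planes in cache-sized batches is just one possible restructuring, with no claim about the speedup it would give.

```python
FLOAT4_BYTES = 16                 # a float4 holds four 4-byte floats
CONSTANT_CACHE_BYTES = 8 * 1024   # per-multiprocessor constant cache, as stated above


def plane_bytes(plane_count):
    """Bytes of constant memory occupied by plane_count float4 planes."""
    return plane_count * FLOAT4_BYTES


def planes_per_batch(cache_bytes=CONSTANT_CACHE_BYTES):
    """Largest number of planes whose data fits in the constant cache."""
    return cache_bytes // FLOAT4_BYTES


def batches_needed(plane_count, cache_bytes=CONSTANT_CACHE_BYTES):
    """Batches needed if each batch must fit in the constant cache."""
    per_batch = planes_per_batch(cache_bytes)
    return (plane_count + per_batch - 1) // per_batch


footprint = plane_bytes(1024)     # the 1024 planes from the question
fits_in_cache = footprint <= CONSTANT_CACHE_BYTES
```

For the 1024 planes in the question the footprint is twice the cache size, which matches the 16k figure in the answer.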

Resources