I would like to be able to define an MTLBuffer and populate data directly into the buffer (or as efficiently as possible).
If I do the following, the values used in the shader are 1.0 and 2.0 (for X and Y respectively), not 3.0 and 4.0, which are set after the MTLBuffer is created.
int bufferLength = 128 * 128;
float pointBuffer[bufferLength * 2]; // 2 for X and Y
//Populate array with test values
for (int i = 0; i < (bufferLength * 2); i += 2) {
pointBuffer[i] = 1.0; //X
pointBuffer[i + 1] = 2.0; //Y
}
id<MTLBuffer> pointDataBuffer = [device newBufferWithBytes:&pointBuffer length:sizeof(pointBuffer) options:MTLResourceOptionCPUCacheModeDefault];
//Populate array with updated test values
for (int i = 0; i < (bufferLength * 2); i += 2) {
pointBuffer[i] = 3.0; //X
pointBuffer[i + 1] = 4.0; //Y
}
//In the (Swift) class with the pipeline:
commandEncoder!.setBuffer(pointDataBuffer, offset: 0, index: 4)
Based on the docs, it seems like I need to call didModifyRange: but pointDataBuffer does not seem to recognize the selector.
Is there a way to update the array without having to recreate the MTLBuffer?
-newBufferWithBytes:... makes a copy of the passed-in bytes. It does not keep referencing them, so subsequent changes to pointBuffer do not affect it.
However, buffers like this one (whose storage mode is not private) provide access to their storage through the -contents method. (The -didModifyRange: method you found applies only to buffers with managed storage mode on macOS, so it isn't needed for a shared buffer like this one.) So, you could do something like this:
float *points = pointDataBuffer.contents;
for (int i = 0; i < (bufferLength * 2); i += 2) {
points[i] = 3.0; //X
points[i + 1] = 4.0; //Y
}
Be careful, though. The CPU and GPU operate asynchronously relative to each other. If there might be commands being processed by the GPU that reference the buffer, then modifying it from the CPU may interfere with the operation of those commands. So, you'll want to take steps to synchronize access to the buffer or otherwise avoid simultaneous CPU and GPU access.
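For example, the simplest (if blocking) way to synchronize is to wait for the GPU work to finish before writing. This is a minimal sketch, not from the original answer; commandBuffer stands for whichever command buffer last referenced pointDataBuffer (hypothetical here):
// Sketch: block the CPU until the GPU work that reads pointDataBuffer is done,
// after which it is safe to rewrite the contents.
[commandBuffer commit];
[commandBuffer waitUntilCompleted]; // simple, but stalls the CPU
float *points = pointDataBuffer.contents;
for (int i = 0; i < (bufferLength * 2); i += 2) {
    points[i] = 3.0f; //X
    points[i + 1] = 4.0f; //Y
}
For real-time rendering, you would more likely use -addCompletedHandler: or a dispatch semaphore with two or three rotating buffers instead of blocking.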
So I'm aware how you can use loadPixels() and updatePixels() to alter the individual pixels of the main canvas as though it were a bitmap. Is there any similar technique for accessing the pixels of a createGraphics() object? Or do I have to write it to the canvas then manipulate that?
Or am I supposed to use a drawingContext object somehow?
If you want to manipulate pixels, use createImage().
If you want to draw easily using the graphics functions, use createGraphics(); loadPixels() / reading pixels[] should work:
var buffer;
function setup() {
createCanvas(400, 400);
buffer = createGraphics(10,10);
buffer.ellipse(5,5,5);
buffer.loadPixels();
console.log(buffer.pixels);
}
function draw() {
background(220);
image(buffer,0,0,400,400);
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.0.0/p5.min.js"></script>
You can of course write pixels into PGraphics too if you want.
PImage is a bit lighter weight if you don't need the drawing functionality and just need pixels.
Here's an example:
var buffer;
function setup() {
createCanvas(400, 400);
buffer = createGraphics(10,10);
buffer.ellipse(5,5,5);
buffer.loadPixels();
// print pixels (a list of bytes in order, e.g. [r0,g0,b0,a0,r1,g1,b1,a1,...])
console.log(buffer.pixels);
var gradientW = 3;
var gradientH = 3;
for(var y = 0; y < gradientH; y++){
for(var x = 0; x < gradientW; x++){
// calculate 1D index from x,y
let pixelIndex = x + (y * buffer.width);
// note that as opposed to Processing Java, p5.Image is RGBA (has 4 colour channels, hence the 4 below)
// and the pixels[] array length is width * height * 4 (colour channels),
// therefore the index is also * 4
let rIndex = pixelIndex * 4;
console.log('x',x,'y',y,'pixelIndex',pixelIndex,'red index',rIndex);
// access and assign red
buffer.pixels[rIndex] = round(map(x,0,3,0,255));
// access and assign green
buffer.pixels[rIndex + 1] = round(map(y,0,3,0,255));
// access and assign blue
buffer.pixels[rIndex + 2] = 255 - buffer.pixels[rIndex] + buffer.pixels[rIndex + 1];
// access and assign alpha
buffer.pixels[rIndex + 3] = 255;
}
}
buffer.updatePixels();
}
function draw() {
background(220);
image(buffer,0,0,width,height);
}
<script src="https://cdnjs.cloudflare.com/ajax/libs/p5.js/1.0.0/p5.min.js"></script>
I have a function from Intel IPP that operates on an image / region of an image.
The inputs to the function are a pointer to the image, parameters defining the size of the region to process, and the parameters of the filter.
The IPP function is single-threaded.
Now, I have an image of size M x N.
I want to apply the filter on it in parallel.
The main idea is simple: break the image into 4 sub-images which are independent of each other.
Apply the filter to each sub-image and write the result to a sub-block of an empty image, where each thread writes to a distinct set of pixels.
It's really like processing 4 images, each on its own core.
This is the program I'm doing it with:
void OpenMpTest()
{
const int width = 1920;
const int height = 1080;
static Ipp32f input_image[width * height];  // static: ~8 MB each would overflow the stack
static Ipp32f output_image[width * height];
IppiSize size = { width, height };
int step = width * sizeof(Ipp32f);
/* Splitting the image */
IppiSize section_size = { width / 2, height / 2};
Ipp32f* input_upper_left = input_image;
Ipp32f* input_upper_right = input_image + width / 2;
Ipp32f* input_lower_left = input_image + (height / 2) * width;
Ipp32f* input_lower_right = input_image + (height / 2) * width + width / 2;
Ipp32f* output_upper_left = output_image;
Ipp32f* output_upper_right = output_image + width / 2;
Ipp32f* output_lower_left = output_image + (height / 2) * width;
Ipp32f* output_lower_right = output_image + (height / 2) * width + width / 2;
Ipp32f* input_sections[4] = { input_upper_left, input_upper_right, input_lower_left, input_lower_right };
Ipp32f* output_sections[4] = { output_upper_left, output_upper_right, output_lower_left, output_lower_right };
/* Filter Params */
Ipp32f pKernel[7] = { 1, 2, 3, 4, 3, 2, 1 };
omp_set_num_threads(4);
#pragma omp parallel for
for (int i = 0; i < 4; i++)
ippiFilterRow_32f_C1R(
input_sections[i], step,
output_sections[i], step,
section_size, pKernel, 7, 3);
}
Now, the issue is that I see no gain versus running in single-threaded mode on the whole image.
I tried changing the image size and the filter size, and nothing changes the picture.
The most I could gain was nothing significant (10-20%).
I thought it might have something to do with the fact that I can't "promise" each thread that the zone it received is read-only,
and likewise let it know that the memory location it writes to belongs only to itself.
I read about defining variables as private and shared, yet I couldn't find a guide for dealing with arrays and pointers.
What would be the proper way to deal with pointers and sub-arrays in OpenMP?
How does the performance of threaded IPP compare?
Assuming no race conditions, performance problems with writing to shared arrays are most likely to occur in cache lines where part of the line is written by one thread and another part is read by another.
It's likely to require a data region larger than 10 megabytes or so before full parallel speedup is seen.
You would need deeper analysis, e.g. by Intel VTune Amplifier, to see whether memory bandwidth or data overlaps are limiting performance.
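To illustrate the cache-line point, here is a minimal sketch (not from the original answer; do_work is a hypothetical per-thread computation) of padding per-thread results so that no two threads write within the same cache line:
#include <omp.h>
extern float do_work(int i); /* hypothetical per-thread computation */
/* Assume 64-byte cache lines: pad each thread's result slot to a full
   line so neighbouring threads never write within the same line. */
typedef struct {
    float value;
    char pad[64 - sizeof(float)];
} PaddedResult;
void accumulate(float *results_out)
{
    PaddedResult partial[4];
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 4; i++)
        partial[i].value = do_work(i); /* each thread owns a whole cache line */
    for (int i = 0; i < 4; i++)
        results_out[i] = partial[i].value;
}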
Using the Intel IPP filter, the best solution was the following:
int height = dstRoiSize.height;
int width = dstRoiSize.width;
Ipp32f *pSrc1, *pDst1;
int nThreads, cH, cT;
#pragma omp parallel shared( pSrc, pDst, nThreads, width, height, kernelSize,\
xAnchor, cH, cT ) private( pSrc1, pDst1 )
{
#pragma omp master
{
nThreads = omp_get_num_threads();
cH = height / nThreads;
cT = height % nThreads;
}
#pragma omp barrier
{
int curH;
int id = omp_get_thread_num();
pSrc1 = (Ipp32f*)( (Ipp8u*)pSrc + id * cH * srcStep );
pDst1 = (Ipp32f*)( (Ipp8u*)pDst + id * cH * dstStep );
if( id != ( nThreads - 1 )) curH = cH;
else curH = cH + cT;
IppiSize chunk = { width, curH };
ippiFilterRow_32f_C1R( pSrc1, srcStep, pDst1, dstStep,
                       chunk, pKernel, kernelSize, xAnchor );
}
}
Thank You.
I am parsing a 3D file into OpenGL ES on an iOS device, and after I get the vertices I can't seem to add them to the GLfloat array containing my vertices. At the top of my file I declare this GLfloat array:
GLfloat gFileVertices[] = {
-0.686713, 0.346845, 3.725390, -0.000288, -0.000652, -0.000109,
-0.677196, 0.350971, 3.675733, -0.000288, -0.000652, -0.000109,
-0.673889, 0.340921, 3.726985, -0.000288, -0.000652, -0.000109,
-0.677424, 0.337048, 3.775731, -0.000283, -0.000631, -0.000071,
    // and so on...
};
But how can I put that same data (x, y, z, normal.x, normal.y, normal.z) into that array in a case where each of those values is a variable and there is a variable number of rows?
The solution is to allocate the vertices buffer dynamically at runtime, rather than statically at compile time. In your code there is no way to change the size of gFileVertices once the program is compiled.
For management purposes I will use separate normals and vertices arrays instead of interleaving the data into one.
To parse the file, determine the number of vertices and allocate buffers.
GLfloat* verticesBuff = malloc(sizeof(GLfloat) * vertCount * 3); /* 3 floats per vert */
GLfloat* normalsBuff = malloc(sizeof(GLfloat) * vertCount * 3); /* 3 floats per vert */
Then copy each element into the new array:
/* read from file or whatevs */
for (int i = 0; i < vertCount; i ++)
{
verticesBuff[i * 3] = ...
verticesBuff[i * 3 + 1] = ...
verticesBuff[i * 3 + 2] = ...
normalsBuff[i * 3] = ...
normalsBuff[i * 3 + 1] = ...
normalsBuff[i * 3 + 2] = ...
}
The dynamic arrays can be used in OpenGL just like the static ones:
glVertexPointer(3, GL_FLOAT, 0, verticesBuff);
glNormalPointer(GL_FLOAT, 0, normalsBuff);
That's it! Just make sure to free the buffers when you are done:
free(verticesBuff);
free(normalsBuff);
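If the vertex count isn't known before reading the file, a two-pass parse is one option. This is only a sketch, assuming a hypothetical text format with one "v x y z" line per vertex; loadVertices is an illustrative helper, not part of the original answer:
#include <stdio.h>
#include <stdlib.h>
#include <OpenGLES/ES1/gl.h> /* for GLfloat */
GLfloat* loadVertices(const char *path, int *vertCount)
{
    FILE *f = fopen(path, "r");
    if (!f) return NULL;
    char line[256];
    float x, y, z;
    /* Pass 1: count the vertex lines. */
    int count = 0;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "v %f %f %f", &x, &y, &z) == 3)
            count++;
    /* Pass 2: allocate exactly enough and fill it. */
    GLfloat *verts = malloc(sizeof(GLfloat) * count * 3);
    rewind(f);
    int i = 0;
    while (fgets(line, sizeof(line), f) && i < count) {
        if (sscanf(line, "v %f %f %f", &x, &y, &z) == 3) {
            verts[i * 3] = x;
            verts[i * 3 + 1] = y;
            verts[i * 3 + 2] = z;
            i++;
        }
    }
    fclose(f);
    *vertCount = count;
    return verts;
}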
I am trying to load a flattened 2D matrix into shared memory, shift the data along x, write back to global memory shifting also along y. The input data is therefore shifted along x and y. What I have:
__global__ void test_shift(float *data_old, float *data_new)
{
uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;
__shared__ float VAR;
__shared__ float VAR2[NUM_THREADS];
// load from global to shared
VAR = data_old[glob_index];
// do some stuff on VAR
if (threadIdx.x < NUM_THREADS - 1)
{
VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
}
__syncthreads();
// write to global memory
if (threadIdx.y < ny - 1)
{
glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x; // redefine glob_index to shift along y (+1)
data_new[glob_index] = VAR2[threadIdx.x];
    }
}
The call to the kernel:
test_shift <<< grid, block >>> (data_old, data_new);
and the grid and block dimensions (blockDim.x is equal to the matrix width, i.e. 64):
dim3 block(NUM_THREADS, 1);
dim3 grid(1, ny);
I am not able to achieve it. Could someone please point out what's wrong with this? Should I use a strided index or an offset?
VAR should not have been declared as shared, because in the current form all threads scribble over each other's data when you load from global memory: VAR = data_old[glob_index];.
You also have an out-of-bounds access when you access VAR2[threadIdx.x + 1], so your kernel never finishes (depending on the compute capability of the device - 1.x devices didn't check shared memory accesses as rigorously).
You could have detected the latter by checking the return codes of all calls to CUDA functions for errors.
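For instance, a common error-checking idiom (a sketch, not from the original answer) wraps every CUDA runtime call and checks kernel launches afterwards:
#include <cstdio>
#include <cstdlib>
// Print the error string and abort on the first failing CUDA runtime call.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "CUDA error %s at %s:%d\n", \
                    cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)
// Usage after a kernel launch:
// test_shift <<< grid, block >>> (data_old, data_new);
// CUDA_CHECK(cudaGetLastError()); // launch-time errors
// CUDA_CHECK(cudaDeviceSynchronize()); // errors raised while the kernel ran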
Shared variables are, well, shared by all threads in a single block. This means that you don't get blockDim.y separate sets of shared variables, but only a single set per block.
uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;
__shared__ float VAR;
__shared__ float VAR2[NUM_THREADS];
VAR = data_old[glob_index];
if (threadIdx.x < NUM_THREADS - 1)
{
VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
}
This instructs all threads in a block to write data into a single variable (VAR). Next, you have no synchronization, and you use this variable in the second assignment. This will have an undefined result, because threads from the first warp are reading from this variable while threads from the second warp are still trying to write something there.
You should change VAR to be local, or create an array of shared memory variables for all threads in the block, as sketched below.
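For instance, the per-thread-slot pattern alone (a sketch, not from the original answer, and without the shift logic) would look like this, assuming NUM_THREADS is 64 as in the question:
#define NUM_THREADS 64
// Each thread writes its own slot of the shared array, so no thread
// overwrites another's value; the barrier makes the writes visible.
__global__ void test_shift_shared(const float *data_old, float *data_new)
{
    int glob_index = threadIdx.x + blockIdx.y * blockDim.x;
    __shared__ float VAR[NUM_THREADS];
    VAR[threadIdx.x] = data_old[glob_index];
    __syncthreads();
    data_new[glob_index] = VAR[threadIdx.x];
}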
if (threadIdx.y < ny - 1)
{
glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x;
data_new[glob_index] = VAR2[threadIdx.x];
}
In VAR2[0] you still have some garbage (you've never written there). threadIdx.y is always zero in your blocks.
And avoid using uints; they have (or at least used to have) some performance problems.
Actually, for such a simple task you don't need to use shared memory at all:
__global__ void test_shift(float *data_old, float *data_new)
{
int glob_index = threadIdx.x + blockIdx.y*blockDim.x;
float VAR;
// load from global to local
VAR = data_old[glob_index];
int glob_index_new;
// calculate only if we are going to output something
if ( (blockIdx.y < gridDim.y - 1) && ( threadIdx.x < blockDim.x - 1 ))
{
glob_index_new = threadIdx.x + 1 + (blockIdx.y + 1)*blockDim.x;
// do some stuff on VAR
} else // just write 0.0 to remove garbage
{
glob_index_new = ( (blockIdx.y == gridDim.y - 1) && ( threadIdx.x == blockDim.x - 1 ) )
                     ? 0
                     : ( (blockIdx.y == gridDim.y - 1)
                            ? threadIdx.x
                            : blockIdx.y * blockDim.x );
VAR = 0.0;
}
// write to global memory
data_new[glob_index_new] = VAR;
}
I've been using this web page as a guideline for formant tracking of speech...
http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
It all seems to be going pretty well, except for the last step, which is converting the cepstrum into a smoothed representation for simple peak picking for the formant tracking. The spectrograph looks good, and the cepstrograph (can I say that? :P) also looks good (from what I can tell), but in the final stage the results (the smoothed formant representation) are not what I expected.
I uploaded a sample of each stage as visual images to...
http://imgur.com/a/62duS
This sample is for the speech of the sound 'i' as in 'beed'. According to this site...
http://home.cc.umanitoba.ca/~robh/howto.html#formants
the first formant should come in around 500 Hz, and the second and third around 2200 Hz and 2800 Hz respectively. The spectrograph shows something very similar, but at the last stage I am getting results similar to...
F1 - 891
F2 - 1550
F3 - 2329
Any insight would be greatly appreciated. I've been going round in circles on this for some time. My code looks as follows...
// set up fft parameters
UInt32 log2n = 9;
UInt32 n = 512;
UInt32 window = n;
UInt32 halfN = n/2;
UInt32 stride = 1;
FFTSetup setupReal = [appDelegate getFftSetup];
int stepSize = (hpBuffer.sampleCount-window) / quantizeCount;
// calculate volume from raw samples, because it seems more reliable than fft
UInt32 volumeWindow = 128;
volumeBuffer = malloc(sizeof(float)*quantizeCount);
int windowPos = 0;
for (int i=0; i < quantizeCount; i++) {
windowPos += stepSize;
float total = 0.0f;
float max = 0.0f;
for (int p=windowPos; p < windowPos+volumeWindow; p++) {
total += sampleBuffer.buffer[p];
if (sampleBuffer.buffer[p] > max)
max = sampleBuffer.buffer[p];
}
volumeBuffer[i] = max;
}
// normalize volumebuffer
[FloatAudioBuffer normalizePositiveBuffer:volumeBuffer ofSize:quantizeCount];
// allocate memory for complex array
COMPLEX_SPLIT complexArray;
complexArray.realp = (float*)malloc(4096*sizeof(float));
complexArray.imagp = (float*)malloc(4096*sizeof(float));
// allocate some space for temporary hamming buffer
float *hamBuffer = malloc(n*sizeof(float));
// create spectrum and feature buffer
spectrumBuffer = malloc(sizeof(float)*halfN*quantizeCount);
formantBuffer = malloc(sizeof(float)*4096*quantizeCount);
cepstrumBuffer = malloc(sizeof(float)*halfN*quantizeCount);
lowCepstrumBuffer = malloc(sizeof(float)*featureCount*quantizeCount);
featureBuffer = malloc(sizeof(float)*featureCount*quantizeCount);
// create data point for each quantize segment
float TWOPI = 2.0f * M_PI;
for (int s=0; s < quantizeCount; s++) {
// copy buffer data into a separate array and apply hamming window
int offset = (int)(s * stepSize);
for (int i=0; i < n; i++)
hamBuffer[i] = hpBuffer.buffer[offset+i] * ((1.0f-0.46f) - 0.46f*cos(TWOPI*i/((float)n-1.0f)));
// configure float array into acceptable input array format (interleaved)
vDSP_ctoz((COMPLEX*)hamBuffer, 2, &complexArray, 1, halfN);
// run FFT
vDSP_fft_zrip(setupReal, &complexArray, stride, log2n, FFT_FORWARD);
// Absolute square (equivalent to mag^2)
complexArray.imagp[0] = 0.0f;
vDSP_zvmags(&complexArray, 1, complexArray.realp, 1, halfN);
bzero(complexArray.imagp, (halfN) * sizeof(float));
// scale
float scale = 1.0f / (2.0f*(float)n);
vDSP_vsmul(complexArray.realp, 1, &scale, complexArray.realp, 1, halfN);
// get log of absolute values for passing to inverse FFT for cepstrum
for (int i=0; i < halfN; i++)
complexArray.realp[i] = logf(sqrtf(complexArray.realp[i]));
// save this into spectrum buffer
memcpy(&spectrumBuffer[s*halfN], complexArray.realp, halfN*sizeof(float));
// convert spectrum to interleaved ready for inverse fft
vDSP_ctoz((COMPLEX*)&spectrumBuffer[s*halfN], 2, &complexArray, 1, halfN/2);
// create cepstrum
vDSP_fft_zrip(setupReal, &complexArray, stride, log2n-1, FFT_INVERSE);
//convert interleaved to real and straight into cepstrum buffer
vDSP_ztoc(&complexArray, 1, (COMPLEX*)&cepstrumBuffer[s*halfN], 2, halfN/2);
// copy first part of cepstrum into low cepstrum buffer
memcpy(&lowCepstrumBuffer[s*featureCount], &cepstrumBuffer[s*halfN], featureCount*sizeof(float));
// make an 8192-point array based on the first 15 values
float *tempArray = malloc(8192*sizeof(float));
for (int i=0; i < 8192; i++) {
if (i < 15)
tempArray[i] = cepstrumBuffer[s*halfN+i];
else
tempArray[i] = 0.0f;
}
vDSP_ctoz((COMPLEX*)tempArray, 2, &complexArray, 1, 4096);
float newLog2n = log2f(8192.0f);
complexArray.imagp[0] = 0.0f;
vDSP_fft_zrip(setupReal, &complexArray, stride, newLog2n, FFT_FORWARD);
vDSP_zvmags(&complexArray, 1, complexArray.realp, 1, 4096);
bzero(complexArray.imagp, (4096) * sizeof(float));
// scale
scale = 1.0f / (2.0f*(float)8192);
vDSP_vsmul(complexArray.realp, 1, &scale, complexArray.realp, 1, 4096);
// get magnitude
for (int i=0; i < 4096; i++)
complexArray.realp[i] = sqrtf(complexArray.realp[i]);
// write to formant buffer
memcpy(&formantBuffer[s*4096], complexArray.realp, 4096*sizeof(float));
// complex array now contains formant spectrum
// it's large, so get features here!
// try simple peak picking algorithm for first 3 formants
int formantIndex = 0;
float *peaks = malloc(6*sizeof(float));
for (int i=0; i < 6; i++)
peaks[i] = 0.0f;
for (int i=1; i < 4096-1 && formantIndex < 6; i++) {
if (complexArray.realp[i-1] < complexArray.realp[i] &&
complexArray.realp[i+1] < complexArray.realp[i])
peaks[formantIndex++] = i;
}