WebGL2 maximum size of 3D texture?

I'm creating a 3D texture in WebGL2 with the gl.texImage3D command:

gl.bindTexture(gl.TEXTURE_3D, texture);
const level = 0;
const internalFormat = gl.R32F;
const width = 512;
const height = 512;
const depth = 512;
const border = 0;
const format = gl.RED;
const type = gl.FLOAT;
const data = imagesDataArray;
gl.texImage3D(gl.TEXTURE_3D, level, internalFormat,
              width, height, depth, border,
              format, type, data);
It seems that a size of 512×512×512 with 32F values is somewhat of a deal-breaker: Chrome (running on a laptop with 8 GB of RAM) crashes when uploading a 3D texture of this size, but not always. Using a texture of size 512×512×256 seems to always work on my laptop.
Is there any way to tell in advance the maximum size of 3D texture that the GPU can accommodate in WebGL2?
Best regards

Unfortunately no, there isn't a way to tell how much space there is.
You can query the largest dimensions the GPU can handle, but you can't query the amount of memory it has available, just like you can't query how much memory is available to JavaScript.
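For what it's worth, that dimension query is a one-liner; note it limits each edge of the texture, not the total memory:

const gl = document.createElement('canvas').getContext('webgl2');
// Largest allowed width/height/depth for a 3D texture, in texels per edge.
// It says nothing about whether enough memory exists to actually allocate one.
const max3DSize = gl.getParameter(gl.MAX_3D_TEXTURE_SIZE);
console.log('MAX_3D_TEXTURE_SIZE:', max3DSize);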
That said, 512 × 512 × 512 × 1 channel (R) × 4 bytes (32F) is 536,870,912 bytes, i.e. 0.5 gig. Does your laptop's GPU have 0.5 gig? You actually probably need at least 1 gig of GPU memory to use 0.5 gig as a texture, since the OS needs space for your apps' windows, etc.
Browsers also put different limits on how much memory you can use.
Some things to check:

How much GPU memory do you have?
If it's not more than 0.5 gig, you're out of luck.
If your GPU has more than 0.5 gig, try a different browser; Firefox probably has different limits than Chrome.

Can you create the texture at all?
Use gl.texStorage3D and then call gl.getError. Do you get an out-of-memory error, or does it crash right there?

If gl.texStorage3D does not crash, can you upload a little data at a time with gl.texSubImage3D?
I suspect this won't work even if gl.texStorage3D does, because the browser will still have to allocate 0.5 gig just to clear out your texture. If it does work, that points to another issue: to upload a texture you need 3x to 4x the memory, at least in Chrome.
1. There's your data in JavaScript: data = new Float32Array(size);
2. That data gets sent to the GPU process: gl.texSubImage3D (or any other texSub or texImage command).
3. The GPU process sends that data to the driver: glTexSubImage3D(...) in C++.
Whether the driver needs 1 or 2 copies I have no idea. It's possible it keeps a copy in RAM and uploads one to the GPU; it keeps the copy so it can re-upload the data if it needs to swap it out to make room for something else. Whether or not this happens is up to the driver.
Also note that, while I don't think this is the issue here, the driver is allowed to expand the texture to RGBA32F, needing 2 gig. It's probably not doing this, but I know that in the past certain formats were emulated.
Note: texImage potentially takes more memory than texStorage because the semantics of texImage mean the driver can't actually make the texture until just before you draw, since it has no idea whether you're going to add mip levels later. With texStorage, on the other hand, you tell the driver up front the exact size and number of mips, so it needs no intermediate storage.
function main() {
  const gl = document.createElement('canvas').getContext('webgl2');
  if (!gl) {
    return alert('need webgl2');
  }
  const tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_3D, tex);
  gl.texStorage3D(gl.TEXTURE_3D, 1, gl.R32F, 512, 512, 512);
  log('texStorage3D:', glEnumToString(gl, gl.getError()));
  // upload one 512x512 depth slice at a time rather than all 0.5 gig at once
  const data = new Float32Array(512 * 512 * 512);
  for (let depth = 0; depth < 512; ++depth) {
    gl.texSubImage3D(gl.TEXTURE_3D, 0, 0, 0, depth, 512, 512, 1, gl.RED, gl.FLOAT, data, 512 * 512 * depth);
  }
  log('texSubImage3D:', glEnumToString(gl, gl.getError()));
}
main();
function glEnumToString(gl, value) {
  // look up the name(s) of the WebGL2 constant(s) with this numeric value
  return Object.keys(WebGL2RenderingContext.prototype)
    .filter(k => gl[k] === value)
    .join(' | ');
}

function log(...args) {
  const elem = document.createElement('pre');
  elem.textContent = [...args].join(' ');
  document.body.appendChild(elem);
}

Related

I can't get vImage (Accelerate Framework) to convert 420Yp8_Cb8_Cr8 (planar) to ARGB8888

I'm trying to convert planar YpCbCr to RGBA and it's failing with the error kvImageRoiLargerThanInputBuffer.
I tried two different ways. Here are some code snippets.
Note that 'thumbnail_buffers + 1' and 'thumbnail_buffers + 2' have width and height half those of 'thumbnail_buffers + 0', because I'm dealing with 4:2:0 and have (1/2)×(1/2) as many chroma samples each as luma samples. This silently fails, even though I asked for an explanation with kvImagePrintDiagnosticsToConsole.
error = vImageConvert_YpCbCrToARGB_GenerateConversion(
    kvImage_YpCbCrToARGBMatrix_ITU_R_709_2,
    &fullrange_8bit_clamped_to_fullrange,
    &convertInfo,
    kvImage420Yp8_Cb8_Cr8, kvImageARGB8888,
    kvImagePrintDiagnosticsToConsole);

uint8_t BGRA8888_permuteMap[4] = {3, 2, 1, 0};
uint8_t alpha = 255;
vImage_Buffer dest;
error = vImageConvert_420Yp8_Cb8_Cr8ToARGB8888(
    thumbnail_buffers + 0, thumbnail_buffers + 1, thumbnail_buffers + 2,
    &dest,
    &convertInfo, BGRA8888_permuteMap, alpha,
    kvImagePrintDiagnosticsToConsole); // I don't think this flag works here
So I tried again with vImageConvert_AnyToAny:
vImage_CGImageFormat cg_BGRA8888_format = {
    .bitsPerComponent = 8,
    .bitsPerPixel = 32,
    .colorSpace = baseColorspace,
    .bitmapInfo = kCGImageAlphaNoneSkipFirst | kCGBitmapByteOrder32Little,
    .version = 0,
    .decode = (CGFloat*)0,
    .renderingIntent = kCGRenderingIntentDefault
};

vImageCVImageFormatRef vformat = vImageCVImageFormat_Create(
    kCVPixelFormatType_420YpCbCr8Planar,
    kvImage_ARGBToYpCbCrMatrix_ITU_R_709_2,
    kCVImageBufferChromaLocation_Center,
    baseColorspace,
    0);

vImageConverterRef icref = vImageConverter_CreateForCVToCGImageFormat(
    vformat,
    &cg_BGRA8888_format,
    (CGFloat[]){0, 0, 0},
    kvImagePrintDiagnosticsToConsole,
    &error);

vImage_Buffer dest;
error = vImageBuffer_Init(&dest, image_height, image_width, 8, kvImagePrintDiagnosticsToConsole);
error = vImageConvert_AnyToAny(icref, thumbnail_buffers, &dest, (void*)0, kvImagePrintDiagnosticsToConsole); // kvImageGetTempBufferSize
I get the same error, but this time the following message is printed to the console:
<Error>: kvImagePrintDiagnosticsToConsole: vImageConvert_AnyToAny: srcs[1].height must be >= dests[0].height
But this doesn't make any sense to me. How can my Cb height be anything other than half my Yp height (which is the same as my dest RGB height) when I've got 4:2:0 data? (Likewise for width?)
What on earth am I doing wrong? I'm going to be doing other conversions as well (4:4:4, 4:2:2, etc.), so any clarification on these APIs would be greatly appreciated. Further, what is my chroma siting supposed to be for these conversions? Above I use kCVImageBufferChromaLocation_Center. Is that correct?
Some new info:
Since posting this I saw a glaring error, but fixing it didn't help. Notice that in the vImageConvert_AnyToAny case above, I initialized the destination buffer with just the image width instead of 4*width to make room for RGBA. That must be the problem, right? Nope.
Notice further that in the vImageConvert_* case, I didn't initialize the destination buffer at all. Fixed that too and it didn't help.
So far I've tried the conversion six different ways choosing one from (vImageConvert_* | vImageConvert_AnyToAny) and choosing one from (kvImage420Yp8_Cb8_Cr8 | kvImage420Yp8_CbCr8 | kvImage444CrYpCb8) feeding the appropriate number of input buffers each time--carefully checking that the buffers take into account the number of samples per pixel per plane. Each time I get:
<Error>: kvImagePrintDiagnosticsToConsole: vImageConvert_AnyToAny: srcs[0].width must be >= dests[0].width
which makes no sense to me. If my luma plane is, say, 100 wide, my RGBA buffer should be 400 wide. Please, any guidance or working code going from YCC to RGBA would be greatly appreciated.
Okay, I figured it out--part user error, part Apple bug. I was thinking of the vImage_Buffer's width and height wrong. For example, the output buffer I specified as 4 * image_width, and 8 bits per pixel, when it should have been simply image_width and 32 bits per pixel--the same amount of memory but sending the wrong message to the APIs. The literal '8' on that line blinded me from remembering what that slot was, I guess. A lesson I must have learned many times--name your magic numbers.
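In code terms (a sketch reusing the names from the snippets above), the fix to the destination buffer was:

vImage_Buffer dest;
// width is a count of pixels and the last size argument is bits per pixel:
// 32 for ARGB8888. Same amount of memory as 4 * image_width at 8 bits,
// but now the API is told the right story.
vImage_Error error = vImageBuffer_Init(&dest, image_height, image_width, 32,
                                       kvImagePrintDiagnosticsToConsole);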
Anyway, now the bug part. Making the input and output buffers correct with regard to width, height, and pixel depth fixed all the calls to the low-level vImageConvert_420Yp8_Cb8_Cr8ToARGB8888 and friends. For example, in the planar YCC case, your Cb and Cr buffers would naturally have half the width and half the height of the Yp plane. However, in the vImageConvert_AnyToAny cases these buffers caused the calls to fail and bail, saying silly things like I needed my Cb plane to have the same dimensions as my Yp plane even for 4:2:0. This appears to be a bug in some preflighting done by Apple before calling the lower-level code that does the work.
I worked around the vImageConvert_AnyToAny bug by simply making input buffers that were too big and filling the Cb and Cr data only into the top-left quadrant. The data were found there during the conversion just fine. I made these too-big buffers with vImageBuffer_Init(), where Apple allocated the too-big malloc that goes to waste. I didn't try making the vImage_Buffers by hand, lying about the size and allocating just the memory I need. That may work, or perhaps Apple will crawl off into the weeds trusting the width and height. If you hand-make one, you had better tell the truth about rowBytes, however.
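A sketch of that workaround for one chroma plane (cb_real, the true half-size Cb plane, is a hypothetical name; the same applies to Cr):

#include <string.h>

vImage_Buffer cb_padded;
// full-size buffer to satisfy the preflight check; only the top-left
// quadrant holds real 4:2:0 chroma data
vImageBuffer_Init(&cb_padded, image_height, image_width, 8, kvImageNoFlags);
for (size_t row = 0; row < image_height / 2; ++row) {
    memcpy((uint8_t *)cb_padded.data + row * cb_padded.rowBytes,
           (const uint8_t *)cb_real.data + row * cb_real.rowBytes,
           image_width / 2); // one byte per Cb sample, half the luma width
}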
I'm going to leave this answer for a bit before marking it correct, hoping someone at Apple sees this and fixes the bug and perhaps gets inspired to improve the documentation for those of us stumbling about.

How can I make this function as fast as the vDSP version?

I have written the function below, which uses vDSP function calls to compute a certain result. I thought it would be faster if I rewrote it using the 128 bit vFloat data type to avoid the vDSP function calls. But my vFloat code is still 2-3 times slower than the vDSP version.
I am targeting iOS mainly, but it would be best if the code also runs well on Mac OS.
I measure the speed of these functions on arrays of length 256, which is the typical array length for my application. I want to know how to get this function to run as fast as possible because I have many others like it and I am hoping that once I figure out how to optimize this one I can use the same principles for all the others.
Here is the vDSP version, which on Mac OS is 50% faster with aggressive optimizations enabled, or 2-3x faster with less aggressive compiler settings:
void asymptoticLimitTest2(float limit,
                          const float* input,
                          float* output,
                          size_t numSamples){
    // input / limit => output
    vDSP_vsdiv(input, 1, &limit, output, 1, numSamples);
    // abs(output) => output
    vDSP_vabs(output, 1, output, 1, numSamples);
    // 1 + output => output
    float one = 1.0;
    vDSP_vsadd(output, 1, &one, output, 1, numSamples);
    // input / output => output
    vDSP_vdiv(output, 1, input, 1, output, 1, numSamples);
}
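For reference, the whole chain computes output[i] = input[i] / (1 + |input[i] / limit|); a plain scalar version with the same signature (a sketch, useful for checking results) would be:

#include <math.h>
#include <stddef.h>

// scalar reference: output[i] = input[i] / (1 + |input[i] / limit|)
void asymptoticLimitScalar(float limit,
                           const float* input,
                           float* output,
                           size_t numSamples){
    for (size_t i = 0; i < numSamples; ++i)
        output[i] = input[i] / (1.0f + fabsf(input[i] / limit));
}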
Here is my vFloat version, which I thought would be faster because it avoids all the function calls, but for my application's standard vector length of 256, is not faster:
void asymptoticLimitTest3(float limit,
                          const float* input,
                          float* output,
                          size_t numSamples){
    vFloat limitv = {limit, limit, limit, limit};
    vFloat onev = {1.0, 1.0, 1.0, 1.0};
    size_t n = numSamples;
    // process in chunks of 4 samples
    while(n >= 4){
        vFloat d = vfabsf(*(vFloat *)input / limitv) + onev;
        *(vFloat *)output = *(vFloat *)input / d;
        input += 4;
        output += 4;
        n -= 4;
    }
    // process the remaining samples individually
    while(n > 0){
        float d = fabsf(*input / limit) + 1.0f;
        *output = *input / d;
        input++;
        output++;
        n--;
    }
}
I am hoping to get asymptoticLimitTest3() to run faster than asymptoticLimitTest2(). I'm interested to hear any and all suggestions that will speed up asymptoticLimitTest3().
Thanks in advance for your help.
I did more tests on the timing, and it turns out that the vFloat version is generally NOT slower than the vDSP version. When I did the original tests, I was calling the same function over and over again in a for loop to get the timing info. When I rewrote the loop so that it interleaves the calls with calls to other functions (as would be more normal for an actual application), then the vFloat version is faster. Apparently the vDSP version was picking up some benefit from running the same code repeatedly, perhaps because the code itself remained in the cache.
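A sketch of that kind of measurement (a hypothetical harness; otherUnrelatedWork stands in for whatever else the app does between calls, and its cost is the same for both variants so the comparison stays fair):

#include <stdio.h>
#include <time.h>

extern void asymptoticLimitTest2(float, const float*, float*, size_t); // vDSP
extern void asymptoticLimitTest3(float, const float*, float*, size_t); // vFloat
extern void otherUnrelatedWork(void); // assumed: the app's other per-call work

static double timeVariant(void (*fn)(float, const float*, float*, size_t),
                          const float* in, float* out, size_t n, int iters){
    clock_t start = clock();
    for (int i = 0; i < iters; ++i){
        fn(0.5f, in, out, n);
        otherUnrelatedWork(); // keeps fn's code/data from staying cache-hot
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void){
    enum { N = 256, ITERS = 100000 };
    static float in[N], out[N];
    for (int i = 0; i < N; ++i) in[i] = i * 0.01f;
    printf("vDSP:   %.3f s\n", timeVariant(asymptoticLimitTest2, in, out, N, ITERS));
    printf("vFloat: %.3f s\n", timeVariant(asymptoticLimitTest3, in, out, N, ITERS));
    return 0;
}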

Save the data of Mat (OpenCV) to a memory area located by a pointer

I have created a memory object in shared memory with the following OpenCL function call:

cl_mem buffer_img_GAUSS_TEST = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR,
                                              sizeof(uchar) * size_cols * size_rows,
                                              NULL, &status);

A call to clEnqueueMapBuffer then gives me the pointer:

uchar *src_ptr;
src_ptr = (uchar *)clEnqueueMapBuffer(cmdQueue, buffer_img_GAUSS_TEST, CL_TRUE,
                                      CL_MAP_READ, 0,
                                      sizeof(uchar) * size_cols * size_rows,
                                      0, NULL, NULL, &status);
Now I want to read an image with following OpenCV function call:
Mat img= imread( "scene.jpg", IMREAD_GRAYSCALE );
Is it possible to "say" that the data of the picture should be placed at the data area pointed to by src_ptr? The area behind buffer_img_GAUSS_TEST has exactly the size needed for the image data.
In other words:
I want to replace this part of the code, which copies the data from the image to the address pointed to by src_ptr:
memcpy ( src_ptr, img.data, sizeof(uchar) * img.cols * img.rows);
No.
According to the cv::Mat design, it is impossible to load image data at a specified memory location.
The proper way is to use memcpy (as you do). Move semantics for cv::Mat are undefined (at least I haven't seen anything like that, and there is no information in the docs).
The other solution is to provide the pointer to the image data (via cv::Mat::data) to the other lib you use, without creating a copy of the image, if that lib supports it. In this case you should implement some kind of shared ownership and not deallocate the cv::Mat while the pointer to its data is in use.
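With OpenCL specifically, as in the question, that second approach could look like the following sketch: wrap the Mat's existing storage with CL_MEM_USE_HOST_PTR instead of copying into a separately allocated buffer (context and status as in the question; the Mat must stay alive, and must not be released or reallocated, for as long as the buffer exists):

#include <opencv2/opencv.hpp>

cv::Mat img = cv::imread("scene.jpg", cv::IMREAD_GRAYSCALE);
// no memcpy: the buffer aliases the Mat's own storage
cl_mem buffer_img = clCreateBuffer(context,
                                   CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                   img.total() * img.elemSize(),
                                   img.data,
                                   &status);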

Can dispatch overhead be more expensive than the actual thread work?

Recently I've been thinking about possibilities for hard optimization, the kind where you hand-unroll a loop of 3 iterations just to gain a little.
So one thought came to my mind. Imagine we have a buffer of 1024 elements. We want to multiply every single element of it by 2. And we create a simple kernel, to which we pass the buffer, an outBuffer, their size (to check whether we're outside the bounds), and [[thread_position_in_grid]]. Then we just do a simple multiplication and write that number to the other buffer.
It will look a bit like this:
kernel void multiplyBy2(constant float* in [[buffer(0)]],
                        device float* out [[buffer(1)]],
                        constant Uniforms& uniforms [[buffer(2)]],
                        uint gid [[thread_position_in_grid]])
{
    if (gid >= uniforms.buffer_size) { return; }
    out[gid] = in[gid] * 2.0;
}
What I'm concerned about is whether the actual thread work is still worth the overhead produced by dispatching it.
Would it be more effective to, for example, dispatch 4 times fewer threads, each doing something like this
out[gid * 4 + 0] = in[gid * 4 + 0] * 2.0;
out[gid * 4 + 1] = in[gid * 4 + 1] * 2.0;
out[gid * 4 + 2] = in[gid * 4 + 2] * 2.0;
out[gid * 4 + 3] = in[gid * 4 + 3] * 2.0;
so that each thread works a little longer? Or is it better to make threads as thin as possible?
Yes, and this is true not merely in contrived examples, but in some real-world scenarios too.
For extremely simple kernels like yours, the dispatch overhead can swamp the work to be done, but there's another factor that may have an even bigger effect on performance: sharing fetched data and intermediate results.
If you have a kernel that, for example, reads the 3x3 neighborhood of a pixel from an input texture and writes the average to an output texture, you could share the fetched texture data and partial sums between adjacent pixels by operating on more than one pixel in your kernel function and reducing the total number of threads you dispatch.
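To make that concrete, here is a sketch (illustrative only, with assumed texture names and sizes): a 3×3 box average where each thread produces a 2×2 block of output pixels, so the overlapping fetches are shared and a quarter as many threads are dispatched.

#include <metal_stdlib>
using namespace metal;

// one thread averages the 3x3 neighborhoods of a 2x2 block of output pixels;
// the 4x4 input region is fetched once instead of four separate 3x3 reads
kernel void boxAverage3x3_2x2(texture2d<float, access::read>  src [[texture(0)]],
                              texture2d<float, access::write> dst [[texture(1)]],
                              uint2 gid [[thread_position_in_grid]])
{
    int w = int(dst.get_width());
    int h = int(dst.get_height());
    int2 origin = int2(gid) * 2; // this thread's 2x2 output block
    if (origin.x >= w || origin.y >= h) { return; }

    // fetch the 4x4 input region once, clamping at the edges
    float4 region[4][4];
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x) {
            int2 c = clamp(origin + int2(x - 1, y - 1), int2(0, 0), int2(w - 1, h - 1));
            region[y][x] = src.read(uint2(c));
        }

    // each output pixel reuses the shared fetches
    for (int oy = 0; oy < 2; ++oy)
        for (int ox = 0; ox < 2; ++ox) {
            if (origin.x + ox >= w || origin.y + oy >= h) { continue; }
            float4 sum = 0.0f;
            for (int y = 0; y < 3; ++y)
                for (int x = 0; x < 3; ++x)
                    sum += region[oy + y][ox + x];
            dst.write(sum / 9.0f, uint2(origin) + uint2(ox, oy));
        }
}

The dispatch then covers a grid half as wide and half as tall as the output.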
Perhaps this sates your curiosity. For any practical application, Scott Hunter is right that you should profile on all target devices before and after optimizing.

Improve performance with libpng

I have a microcontroller with an LCD display. I need to display several PNG images. Since the performance of the microcontroller is limited, the time to display an image is too long.
I made benchmarks and detected that the most time is spent in the libpng and not in accessing the display memory or the storage where the (compressed) file is located.
I can manipulate the PNG files before transferring them to the microcontroller.
The data is actually read inside the callback function registered with png_set_read_fn.
Edit:
The pictures are encoded with 8 bits per color plus transparency, resulting in 32 bits per pixel. But most of the pictures have gray colors.
Here is the sequence of functions that I use to convert:
png_ptr = png_create_read_struct(PNG_LIBPNG_VER_STRING, 0, show_png_error, show_png_warn);
info_ptr = png_create_info_struct(png_ptr);
end_info = png_create_info_struct(png_ptr);
png_set_user_limits(png_ptr, MAX_X, MAX_Y);
png_set_read_fn(png_ptr, 0, &read_callback);
png_set_sig_bytes(png_ptr, 0);
png_read_info(png_ptr, info_ptr);
png_read_update_info(png_ptr, info_ptr);

result->image = malloc(required_size);
height = png_get_image_height(png_ptr, info_ptr);
png_bytep *row_pointers = malloc(sizeof(void*) * height);
for (i = 0; i < height; ++i)
    row_pointers[i] = result->image + (i * png_get_rowbytes(png_ptr, info_ptr));

png_set_invert_alpha(png_ptr);
png_read_image(png_ptr, row_pointers);
png_read_end(png_ptr, end_info);
free(row_pointers);
png_destroy_read_struct(&png_ptr, &info_ptr, &end_info);
What parameters should be considered to get the fastest decompression?
It depends upon the nature of the images.
For photos, pngcrush method 12 (filter type 1, zlib strategy 2, zlib level 2) works well. For images with 256 or fewer colors, method 7 (filter type 0, zlib level 9, zlib strategy 0) works well.
Method 12 also happens to be a very fast compressor but as I understand it, that does not matter to you. zlib strategy 2 is Huffman-only compression so the result is the same for any non-zero zlib compression level.
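If you prepare the files offline, the corresponding pngcrush invocations (method selection via its -m option) should be along these lines:

pngcrush -m 7 input.png output.png     (images with 256 or fewer colors)
pngcrush -m 12 input.png output.png    (photos)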
In your code, to obtain the same behavior as pngcrush method 7, use
png_set_compression_level(png_ptr, 9);
png_set_compression_strategy(png_ptr, 0);
png_set_filter(png_ptr, 0, PNG_FILTER_NONE);
and to get pngcrush method 12 behavior,
png_set_compression_level(png_ptr, 2);
png_set_compression_strategy(png_ptr, 2);
png_set_filter(png_ptr, 0, PNG_FILTER_SUB);
