Improve performance with libpng

I have a microcontroller with an LCD display, and I need to display several PNG images. Because the microcontroller's performance is limited, displaying an image takes too long.
I ran benchmarks and found that most of the time is spent in libpng, not in accessing the display memory or the storage where the (compressed) file is located.
I can manipulate the PNG files before transferring them to the microcontroller.
The data is actually read inside the callback function registered with png_set_read_fn.
The pictures are encoded with 8 bits per color channel plus transparency, resulting in 32 bits per pixel. But most of the pictures use gray colors.
Here is the sequence of functions that I use to convert:
png_ptr = png_create_read_struct(PNG_LIBPNG_VER_STRING, 0, show_png_error, show_png_warn);
info_ptr = png_create_info_struct(png_ptr);
end_info = png_create_info_struct(png_ptr);
png_set_user_limits(png_ptr, MAX_X, MAX_Y);
png_set_read_fn(png_ptr, 0, &read_callback);
png_set_sig_bytes(png_ptr, 0);
png_read_info(png_ptr, info_ptr);
png_read_update_info(png_ptr, info_ptr);
result->image = malloc(required_size);
height = png_get_image_height(png_ptr, info_ptr);
png_bytep *row_pointers = malloc(sizeof(void*) * height);
for (i = 0; i < height; ++i)
    row_pointers[i] = result->image + (i * png_get_rowbytes(png_ptr, info_ptr));
png_read_image(png_ptr, row_pointers);
png_read_end(png_ptr, end_info);
png_destroy_read_struct(&png_ptr, &info_ptr, &end_info);
What parameters should be considered to get the fastest decompression?

It depends upon the nature of the images.
For photos, pngcrush method 12 (filter type 1, zlib strategy 2, zlib level 2) works well. For images with 256 or fewer colors, method 7 (filter type 0, zlib level 9, zlib strategy 0) works well.
Method 12 also happens to be a very fast compressor, but as I understand it, that does not matter to you. zlib strategy 2 is Huffman-only compression, so the result is the same for any non-zero zlib compression level.
If you write the PNG files with libpng yourself, you can obtain the same behavior as pngcrush method 7 in the writing code (these calls affect a write struct, not the reading code in your question) with
png_set_compression_level(png_ptr, 9);
png_set_compression_strategy(png_ptr, 0);
and to get pngcrush method 12 behavior,
png_set_compression_level(png_ptr, 2);
png_set_compression_strategy(png_ptr, 2);
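To see the point about the Huffman-only strategy for yourself, here is a minimal sketch using Python's zlib module rather than libpng. The sample bytes are a made-up stand-in for filtered scanline data, not real image data, and the helper name deflate is mine. With strategy Z_HUFFMAN_ONLY (which is zlib strategy 2), the compressed size should come out the same for levels 1, 2 and 9.

```python
import zlib


def deflate(data: bytes, level: int, strategy: int) -> bytes:
    """Compress data with an explicit zlib level and strategy."""
    comp = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS,
                            zlib.DEF_MEM_LEVEL, strategy)
    return comp.compress(data) + comp.flush()


# Hypothetical stand-in for filtered scanline data (not a real image).
sample = bytes((x * 7 + y * 3) % 256 for y in range(64) for x in range(256))

# Huffman-only: the level should not change the output size.
huffman_sizes = {lvl: len(deflate(sample, lvl, zlib.Z_HUFFMAN_ONLY))
                 for lvl in (1, 2, 9)}
# For comparison: default strategy at maximum level.
default_size = len(deflate(sample, 9, zlib.Z_DEFAULT_STRATEGY))

print("Huffman-only sizes by level:", huffman_sizes)
print("Default strategy, level 9:", default_size)
```

Which variant decodes fastest on a given microcontroller is something only a benchmark on that target can answer; this only illustrates how the level and strategy interact.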


I can't get vImage (Accelerate Framework) to convert 420Yp8_Cb8_Cr8 (planar) to ARGB8888

I'm trying to convert planar YpCbCr to RGBA, and it's failing with error kvImageRoiLargerThanInputBuffer.
I tried two different ways. Here are some code snippets.
Note that 'thumbnail_buffers + 1' and 'thumbnail_buffers + 2' have half the width and half the height of 'thumbnail_buffers + 0' because I'm dealing with 4:2:0 and have (1/2)*(1/2) as many chroma samples each as luma samples. This silently fails (even though I asked for an explanation with kvImagePrintDiagnosticsToConsole).
error = vImageConvert_YpCbCrToARGB_GenerateConversion(
kvImage420Yp8_Cb8_Cr8, kvImageARGB8888,
uint8_t BGRA8888_permuteMap[4] = {3, 2, 1, 0};
uint8_t alpha = 255;
vImage_Buffer dest;
error = vImageConvert_420Yp8_Cb8_Cr8ToARGB8888(
thumbnail_buffers + 0, thumbnail_buffers + 1, thumbnail_buffers + 2,
&convertInfo, BGRA8888_permuteMap, alpha,
kvImagePrintDiagnosticsToConsole //I don't think this flag works here
So I tried again with vImageConvert_AnyToAny:
vImage_CGImageFormat cg_BGRA8888_format = {
.bitsPerComponent = 8,
.bitsPerPixel = 32,
.colorSpace = baseColorspace,
.bitmapInfo =
kCGImageAlphaNoneSkipFirst | kCGBitmapByteOrder32Little,
.version = 0,
.decode = (CGFloat*)0,
.renderingIntent = kCGRenderingIntentDefault
vImageCVImageFormatRef vformat = vImageCVImageFormat_Create(
vImageConverterRef icref = vImageConverter_CreateForCVToCGImageFormat(
(CGFloat[]){0, 0, 0},
&error );
vImage_Buffer dest;
error = vImageBuffer_Init( &dest, image_height, image_width, 8, kvImagePrintDiagnosticsToConsole);
error = vImageConvert_AnyToAny( icref, thumbnail_buffers, &dest, (void*)0, kvImagePrintDiagnosticsToConsole); //kvImageGetTempBufferSize
I get the same error but this time I get the following message printed to the console.
<Error>: kvImagePrintDiagnosticsToConsole: vImageConvert_AnyToAny: srcs[1].height must be >= dests[0].height
But this doesn't make any sense to me. How can my Cb height be anything other than half my Yp height (which is the same as my dest RGB height) when I've got 4:2:0 data?
(Likewise for width?)
What on earth am I doing wrong? I'm going to be doing other conversions as well (4:4:4, 4:2:2, etc) so any clarification
on these APIs would be greatly appreciated. Further, what is my siting supposed to be for these conversions? Above I use
kCVImageBufferChromaLocation_Center. Is that correct?
Some new info:
Since posting this I saw a glaring error, but fixing it didn't help. Notice that in the vImageConvert_AnyToAny case above, I initialized the destination buffer with just the image width instead of 4*width to make room for RGBA. That must be the problem, right? Nope.
Notice further that in the vImageConvert_* case, I didn't initialize the destination buffer at all. Fixed that too and it didn't help.
So far I've tried the conversion six different ways choosing one from (vImageConvert_* | vImageConvert_AnyToAny) and choosing one from (kvImage420Yp8_Cb8_Cr8 | kvImage420Yp8_CbCr8 | kvImage444CrYpCb8) feeding the appropriate number of input buffers each time--carefully checking that the buffers take into account the number of samples per pixel per plane. Each time I get:
<Error>: kvImagePrintDiagnosticsToConsole: vImageConvert_AnyToAny: srcs[0].width must be >= dests[0].width
which makes no sense to me. If my luma plane is, say, 100 wide, my RGBA buffer should be 400 wide. Any guidance or working code going from YCC to RGBA would be greatly appreciated.
Okay, I figured it out--part user error, part Apple bug. I was thinking of the vImage_Buffer's width and height wrong. For example, the output buffer I specified as 4 * image_width, and 8 bits per pixel, when it should have been simply image_width and 32 bits per pixel--the same amount of memory but sending the wrong message to the APIs. The literal '8' on that line blinded me from remembering what that slot was, I guess. A lesson I must have learned many times--name your magic numbers.
Anyway, now the bug part. Making the input and output buffers correct with regard to width, height and pixel depth fixed all the calls to the low-level vImageConvert_420Yp8_Cb8_Cr8ToARGB8888 and friends. For example, in the planar YCC case, your Cb and Cr buffers would naturally have half the width and half the height of the Yp plane. However, in the vImageConvert_AnyToAny cases these buffers caused the calls to fail and bail--saying silly things like I needed my Cb plane to have the same dimensions as my Yp plane even for 4:2:0. This appears to be a bug in some preflighting done by Apple before calling the lower-level code that does the work.
I worked around the vImageConvert_AnyToAny bug by simply making input buffers that were too big and only filling Cb and Cr data in the top-left quadrant. The data were found there during the conversion just fine. I made these too-big buffers with vImageBuffer_Init(), where Apple allocates the too-big malloc that goes to waste. I didn't try making the vImage_Buffers by hand--lying about the size and allocating just the memory I need. This may work, or perhaps Apple will crawl off into the weeds trusting the width and height. If you make one by hand, you had better tell the truth about the rowBytes, however.
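To make the width/height/pixel-depth point concrete, here is a small sketch of the arithmetic in Python (not vImage code). PlaneGeometry and geometry_420 are names I made up for illustration, and rounding odd dimensions up for the chroma planes is my assumption. It shows that a destination described as image_width pixels at 32 bits per pixel occupies the same bytes per row as 4 * image_width pixels at 8 bits per pixel, even though only the first description is what the API expects.

```python
from dataclasses import dataclass


@dataclass
class PlaneGeometry:
    """Width and height in pixels (not bytes), plus bits per pixel."""
    width: int
    height: int
    bits_per_pixel: int

    @property
    def min_row_bytes(self) -> int:
        # Tightly packed row; a real allocator may pad rows for alignment.
        return (self.width * self.bits_per_pixel + 7) // 8


def geometry_420(image_width: int, image_height: int):
    """Return (Yp, Cb, Cr, dest) geometries for planar 4:2:0 to ARGB8888."""
    # Assumption: odd dimensions round up for the chroma planes.
    chroma_w = (image_width + 1) // 2
    chroma_h = (image_height + 1) // 2
    yp = PlaneGeometry(image_width, image_height, 8)
    cb = PlaneGeometry(chroma_w, chroma_h, 8)
    cr = PlaneGeometry(chroma_w, chroma_h, 8)
    dest = PlaneGeometry(image_width, image_height, 32)
    return yp, cb, cr, dest


yp, cb, cr, dest = geometry_420(100, 60)
# The mistaken description: four times the width at 8 bits per pixel.
wrong_dest = PlaneGeometry(4 * 100, 60, 8)

print("Cb plane:", cb.width, "x", cb.height)
print("row bytes, correct vs. mistaken:", dest.min_row_bytes, wrong_dest.min_row_bytes)
```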
I'm going to leave this answer for a bit before marking it correct, hoping someone at Apple sees this and fixes the bug and perhaps gets inspired to improve the documentation for those of us stumbling about.

MedianBlur() calculate max kernel size

I want to use the medianBlur function with a very large ksize, like 301 or more. But if I pass a ksize that is too large, the function sometimes crashes. The error message is:
OpenCV Error: (k < 16) in cv::medianBlur_8u_O1, in file ../opencv\modules\imgproc\src\smooth.cpp
(I use opencv4nodejs, but I also tried the original OpenCV 3.4.6).
I tried reducing the ksize in a try/catch loop, but that is not very effective, since I have to work with videos.
I checked out the OpenCV source code and did some research.
In OpenCV 3.4.6, the crash comes from line 241 of opencv\modules\imgproc\src\median_blur.simd.hpp:
for ( k = 0; k < 16 ; ++k )
{
    sum += H.coarse[k];
    if ( sum > t )
    {
        sum -= H.coarse[k];
        break;
    }
}
CV_Assert( k < 16 ); // Error here
t is calculated based on ksize, but the calculations of sum and the H.coarse array are quite complicated.
Doing further research, I found a scientific paper about the algorithm:
I am trying to read it but, honestly, I don't understand much of it.
How do I calculate the maximum ksize with a given image?
The maximum kernel size is determined from the bit depth of the image. As mentioned in the publication you cited:
"An 8-bit value is limited to a max value of 255. Our goal is to
support larger kernel sizes, including kernels that are greater in
size than 17 × 17, thus the larger 32-bit data type is used"
so for an image of data type CV_8U the maximum kernel size is 255.
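If you take that limit at face value, you can clamp the requested size once before calling medianBlur instead of retrying in a try/catch loop. This is only a sketch: clamp_median_ksize is a hypothetical helper, the default of 255 is simply the figure given above for CV_8U (I have not verified it against every OpenCV version), and it relies on medianBlur requiring an odd ksize greater than 1.

```python
def clamp_median_ksize(requested: int, max_ksize: int = 255) -> int:
    """Return the largest odd kernel size >= 3 that does not exceed the limit."""
    k = min(requested, max_ksize)
    if k % 2 == 0:
        k -= 1  # medianBlur needs an odd aperture size
    return max(k, 3)


for wanted in (301, 300, 10, 2):
    print(wanted, "->", clamp_median_ksize(wanted))
```

You would then pass the clamped value as the ksize argument for every frame of the video.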

python opencv create image from bytearray

I am capturing video from a Ricoh Theta V camera. It delivers the video as Motion JPEG (MJPEG). Unfortunately, to get the video you have to do an HTTP POST, which means I cannot use the cv2.VideoCapture(url) feature.
So the way to do this per numerous posts on the web and SO is something like this:
bytes = bytes()
while True:
    bytes +=
    a = bytes.find(b'\xff\xd8')
    b = bytes.find(b'\xff\xd9')
    if a != -1 and b != -1:
        jpg = bytes[a:b+2]
        bytes = bytes[b+2:]
        i = cv2.imdecode(np.fromstring(jpg, dtype=np.uint8), cv2.IMREAD_COLOR)
        cv2.imshow('i', i)
        if cv2.waitKey(1) == 27:
That actually works, except it is slow. I'm processing a 1920x1080 JPEG stream on a MacBook Pro running OSX 10.12.6. The call to imdecode takes approximately 425000 microseconds per image.
Any idea how to do this without imdecode or make imdecode faster? I'd like it to work at 60FPS with HD video (at least).
I'm using Python3.7 and OpenCV4.
Updated Again
I looked into JPEG decoding from a memory buffer using PyTurboJPEG. To compare it with OpenCV's imdecode(), the code goes like this:
#!/usr/bin/env python3
import cv2
import numpy as np
from turbojpeg import TurboJPEG, TJPF_GRAY, TJSAMP_GRAY
# Load image into memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# Decode JPEG from memory into Numpy array using OpenCV
i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
# Use default library installation
jpeg = TurboJPEG()
# Decode JPEG from memory using turbojpeg
i1 = jpeg.decode(r)
cv2.imshow('Decoded with TurboJPEG', i1)
And the answer is that TurboJPEG is 7x faster! That is 4.6ms versus 32.2ms.
In [18]: %timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
32.2 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit i1 = jpeg.decode(r)
4.63 ms ± 55.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Kudos to @Nuzhny for spotting it first!
Updated Answer
I have been doing some further benchmarks on this and was unable to verify your claim that it is faster to save an image to disk and read it with imread() than it is to use imdecode() from memory. Here is how I tested in IPython:
import cv2
import numpy as np
# First use 'imread()'
%timeit i1 = cv2.imread('image.jpg', cv2.IMREAD_COLOR)
116 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Now prepare the exact same image in memory
r = open('image.jpg','rb').read()
inp = np.asarray(bytearray(r), dtype=np.uint8)
# And try again with 'imdecode()'
%timeit i0 = cv2.imdecode(inp, cv2.IMREAD_COLOR)
113 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, I find imdecode() around 3% faster than imread() on my machine. Even if I include the np.asarray() into the timing, it is still quicker from memory than disk - and I have seriously fast 3GB/s NVME disks on my machine...
Original Answer
I haven't tested this but it seems to me that you are doing this in a loop:
read 1k bytes
append it to a buffer
look for JPEG SOI marker (0xffd8)
look for JPEG EOI marker (0xffd9)
if you have found both the start and the end of a JPEG frame, decode it
1) Now, most JPEG images with any interesting content I have seen are between 30kB and 300kB, so you are going to do 30-300 append operations on a buffer. I don't know much about Python, but I guess that may cause a re-allocation of memory, which I guess may be slow.
2) Next you are going to look for the SOI marker in the first 1kB, then again in the first 2kB, then again in the first 3kB, then again in the first 4kB - even if you have already found it!
3) Likewise, you are going to look for the EOI marker in the first 1kB, the first 2kB...
So, I would suggest you try:
1) allocating a bigger buffer at the start and acquiring directly into it at the appropriate offset
2) not searching for the SOI marker if you have already found it - e.g. set it to -1 at the start of each frame and only try and find it if it is still -1
3) only look for the EOI marker in the new data on each iteration, not in all the data you have already searched on previous iterations
4) furthermore, actually, don't bother looking for the EOI marker unless you have already found the SOI marker, because the end of a frame without the corresponding start is no use to you anyway - it is incomplete.
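Suggestions 2) to 4) could be sketched roughly like this. It is an untested-on-camera illustration, not a drop-in replacement: FrameExtractor and feed are names I made up, and instead of a preallocated buffer (suggestion 1) it uses a bytearray, which grows in place. Like the original loop, it assumes the byte pair 0xffd9 only appears at the end of a frame, which may not hold if a frame embeds a thumbnail.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


class FrameExtractor:
    """Incrementally split a byte stream into complete JPEG frames."""

    def __init__(self):
        self.buf = bytearray()
        self.start = -1   # index of SOI in buf, or -1 if not found yet
        self.scanned = 0  # bytes of buf already searched for the next marker

    def feed(self, chunk: bytes):
        """Append a chunk and return a list of any complete frames."""
        self.buf += chunk
        frames = []
        while True:
            if self.start == -1:
                # Search for SOI only while we do not have one. Back up one
                # byte in case the marker straddles two chunks.
                pos = self.buf.find(SOI, max(self.scanned - 1, 0))
                if pos == -1:
                    # Nothing useful yet: keep only the last byte.
                    del self.buf[:-1]
                    self.scanned = len(self.buf)
                    break
                self.start = pos
                self.scanned = pos + 2
            # Search for EOI only in data not searched on earlier calls.
            end = self.buf.find(EOI, max(self.scanned - 1, self.start + 2))
            if end == -1:
                self.scanned = len(self.buf)
                break
            frames.append(bytes(self.buf[self.start:end + 2]))
            del self.buf[:end + 2]
            self.start, self.scanned = -1, 0
        return frames


# Tiny fake "frames" (markers around placeholder payloads) to show the flow.
extractor = FrameExtractor()
stream = b"junk" + SOI + b"abc" + EOI + SOI + b"defg" + EOI
found = []
for i in range(0, len(stream), 3):  # feed in small chunks
    found += extractor.feed(stream[i:i + 3])
print(len(found), "frames found")
```

In the real loop you would call feed() with each block read from the HTTP response and pass every returned frame to the decoder.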
I may be wrong in my assumptions, (I have been before!) but at least if they are public someone cleverer than me can check them!!!
I recommend using turbo-jpeg. It has a Python API: PyTurboJPEG.

Webgl2 maximum size of 3D texture?

I'm creating a 3D texture in webgl with the
gl.bindTexture(gl.TEXTURE_3D, texture) {
const level = 0;
const internalFormat = gl.R32F;
const width = 512;
const height = 512;
const depth = 512;
const border = 0;
const format = gl.RED;
const type = gl.FLOAT;
const data = imagesDataArray;
...... command
It seems that a size of 512*512*512 with 32F values is something of a dealbreaker, since the Chrome browser (running on a laptop with 8 GB of RAM) crashes when uploading a 3D texture of this size, but not always. A texture of, say, 512*512*256 always seems to work on my laptop.
Is there any way to tell in advance the maximum size of 3D texture that the GPU can accommodate under WebGL2?
Best regards
Unfortunately no, there isn't a way to tell how much space there is.
You can query the largest dimensions the GPU can handle, but you can't query the amount of memory it has available, just as you can't query how much memory is available to JavaScript.
That said, 512*512*512*1(R)*4(32F) is at least 0.5 Gig. Does your laptop's GPU have 0.5Gig? You actually probably need at least 1gig of GPU memory to use .5gig as a texture since the OS needs space for your apps' windows etc...
The Browsers also put different limits on how much memory you can use.
Some things to check:
How much GPU memory do you have?
  If it's not more than 0.5gig, you're out of luck.
If your GPU has more than 0.5gig, try a different browser.
  Firefox probably has different limits than Chrome.
Can you create the texture at all?
  Use gl.texStorage3D and then call gl.getError. Does it report an out-of-memory error, or crash right there?
If gl.texStorage3D does not crash, can you upload a little data at a time with gl.texSubImage3D?
  I suspect this won't work even if gl.texStorage3D does work, because the browser will still have to allocate 0.5gig to clear out your texture. If it does work, this points to another issue, which is that to upload a texture you need 3x-4x the memory, at least in Chrome.
There's your data in JavaScript
data = new Float32Array(size);
That data gets sent to the GPU process
gl.texSubImage3D (or any other texSub or texImage command)
The GPU process sends that data to the driver
glTexSubImage3D(...) in C++
Whether the driver needs 1 or 2 copies I have no idea. It's possible it keeps
a copy in ram and uploads one to the GPU. It keeps the copy so it can
re-upload the data if it needs to swap it out to make room for something else.
Whether or not this happens is up to the driver.
Also note that, while I don't think this is the issue, the driver is allowed to expand the texture to RGBA32F, needing 2gig. It's probably not doing this, but I know in the past certain formats were emulated.
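The sizes quoted above are plain arithmetic, which you can check for any texture you plan to create. Here is a quick sketch in Python; texture_bytes is just a helper name for this illustration, and the 3x to 4x multiplier is the rough upload overhead estimated above, not a measured figure.

```python
def texture_bytes(width, height, depth, channels, bytes_per_channel):
    """Raw size of an uncompressed 3D texture, ignoring mips and padding."""
    return width * height * depth * channels * bytes_per_channel


GIB = 1024 ** 3

r32f = texture_bytes(512, 512, 512, 1, 4)     # one 32-bit float channel
rgba32f = texture_bytes(512, 512, 512, 4, 4)  # if expanded to four channels

print("R32F:   ", r32f / GIB, "GiB")
print("RGBA32F:", rgba32f / GIB, "GiB")
# Rough transient cost while uploading, using the 3x-4x estimate.
print("upload estimate:", 3 * r32f / GIB, "to", 4 * r32f / GIB, "GiB")
```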
Note: texImage potentially takes more memory than texStorage because the semantics of texImage mean that the driver can't actually make the texture until just before you draw since it has no idea if you're going to add mip levels later. texStorage on the other hand you tell the driver the exact size and number of mips to start with so it needs no intermediate storage.
function main() {
  const gl = document.createElement('canvas').getContext('webgl2');
  if (!gl) {
    return alert('need webgl2');
  }
  const tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_3D, tex);
  // Allocate the whole texture up front, then check for an error.
  gl.texStorage3D(gl.TEXTURE_3D, 1, gl.R32F, 512, 512, 512);
  log('texStorage3D:', glEnumToString(gl, gl.getError()));
  const data = new Float32Array(512*512*512);
  // Upload one 512x512 slice at a time.
  for (let depth = 0; depth < 512; ++depth) {
    gl.texSubImage3D(gl.TEXTURE_3D, 0, 0, 0, depth, 512, 512, 1, gl.RED, gl.FLOAT, data, 512 * 512 * depth);
  }
  log('texSubImage3D:', glEnumToString(gl, gl.getError()));
}
function glEnumToString(gl, value) {
  return Object.keys(WebGL2RenderingContext.prototype)
      .filter(k => gl[k] === value)
      .join(' | ');
}
function log(...args) {
  const elem = document.createElement('pre');
  elem.textContent = [...args].join(' ');
  document.body.appendChild(elem);
}
main();

Can dispatch overhead be more expensive than an actual thread work?

Recently I've been thinking about the possibilities for hard optimization. I mean the kind of optimization where you sometimes hardcode a loop of 3 iterations just to gain something.
So one thought came to mind. Imagine we have a buffer of 1024 elements and want to multiply every element by 2. We create a simple kernel to which we pass a buffer, outBuffer, their size (to check whether we are out of bounds) and [[thread_position_in_grid]]. Then we just do a simple multiplication and write the result to the other buffer.
It will look a bit like that:
kernel void multiplyBy2(constant float* in [[buffer(0)]],
                        device float* out [[buffer(1)]],
                        constant Uniforms& uniforms [[buffer(2)]],
                        uint gid [[thread_position_in_grid]])
{
    // Skip threads that fall past the end of the buffer.
    if (gid >= uniforms.buffer_size) { return; }
    out[gid] = in[gid] * 2.0;
}
What I'm concerned about is whether the actual thread work is still worth the overhead produced by dispatching it.
Would it be more efficient to, for example, dispatch four times fewer threads, each doing something like this:
out[gid * 4 + 0] = in[gid * 4 + 0] * 2.0;
out[gid * 4 + 1] = in[gid * 4 + 1] * 2.0;
out[gid * 4 + 2] = in[gid * 4 + 2] * 2.0;
out[gid * 4 + 3] = in[gid * 4 + 3] * 2.0;
so that each thread works a little longer? Or is it better to make threads as thin as possible?
Yes, and this is true not merely in contrived examples, but in some real-world scenarios too.
For extremely simple kernels like yours, the dispatch overhead can swamp the work to be done, but there's another factor that may have an even bigger effect on performance: sharing fetched data and intermediate results.
If you have a kernel that, for example, reads the 3x3 neighborhood of a pixel from an input texture and writes the average to an output texture, you could share the fetched texture data and partial sums between adjacent pixels by operating on more than one pixel in your kernel function and reducing the total number of threads you dispatch.
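To put rough numbers on that, here is a small counting sketch in Python (not Metal). It counts the unique texels one thread has to fetch when it handles a horizontal run of adjacent pixels; the function names are mine, and it deliberately ignores the texture cache, which in practice already absorbs part of the redundancy.

```python
def neighborhood(x, y):
    """Coordinates of the 3x3 neighborhood centered on (x, y)."""
    return {(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}


def fetches_per_thread(pixels_per_thread):
    """Unique texels one thread reads for a horizontal run of pixels."""
    texels = set()
    for i in range(pixels_per_thread):
        texels |= neighborhood(i, 0)
    return len(texels)


for n in (1, 2, 4):
    total = fetches_per_thread(n)
    print(f"{n} pixel(s) per thread: {total} fetches, {total / n} per pixel")
```

One pixel per thread costs 9 fetches per pixel, two adjacent pixels cost 12 fetches in total (6 per pixel), and four cost 18 (4.5 per pixel), so the saving grows with the run length while the thread count shrinks.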
Perhaps this sates your curiosity. For any practical application, Scott Hunter is right that you should profile on all target devices before and after optimizing.
