I am trying to split a compressed JPEG bitstream into the 8x8 blocks of the original image. However, I am routinely finding fewer blocks than the image dimensions imply.
I have narrowed this down to the first row of the image, where the padded edge block (identifiable by its lower mean value) is reached after 65 blocks even though the image is 80 blocks across. The ends of subsequent rows are then reached after the expected 80 blocks, indicating no further skipped blocks.
Am I simply missing some EOB markers in the first row, or is there a scenario in which some 8x8 blocks are not encoded into the bitstream?
If you are decoding a color image, it is very possible that the Cb and Cr components are subsampled so that there are not as many 8x8 blocks as for the Y component.
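For example, with 4:2:0 subsampling (the most common case) the Cb and Cr planes have half the resolution in each direction, so an image that is 80 blocks wide in Y is only 40 blocks wide in Cb and in Cr. In an interleaved scan the blocks also appear MCU by MCU (Y Y Y Y Cb Cr for 4:2:0), so counting consecutive entropy-coded blocks as if they all belonged to the luma plane will throw the row arithmetic off.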
I've got some textures from an old computer game. They are stored in one huge file with a simple run-length encoding. I know the image width and height, and the images also have some fully transparent parts.
One texture is structured as follows:
- a color map with 256 colors; each color is 4 bytes: one value each for b, g, r and one zero value
- texel data: in general each byte corresponds to a color in the color map, but
the transparent texels are compressed with a run-length encoding:
the first value indicates how many transparent texels the first row starts with
the following value indicates how many of the following bytes correspond to texels from the color map
(a value of 5 means you have to map the next 5 bytes to the color map)
the value after those 5 bytes could be another transparent texel count, which would again be followed by a colored texel count.
But if the value is 0xFE, there are no other colored texels in this row; if there are fewer texels in this row than given by the image width, the rest of the texels are transparent.
If the value is 0xFF, the end of the image (the last row) is reached; if there were fewer rows than given by the image height, the rest of the texels are transparent.
For example, a 4x4 texture could look like this:
02 01 99 FE (two transparent pixels, color 99, one transparent pixel to fill the width)
01 02 98 99 FE (one transparent pixel, color 98, color 99, one transparent pixel to fill the width)
00 01 99 01 02 98 99 FE (color 99, one transparent pixel, color 98, color 99)
02 02 99 98 FF (two transparent pixels, color 99, color 98)
Since this is a very rudimentary compression, maybe somebody knows whether it has a specific name?
And most importantly: is there a way to upload this "compressed" data to OpenGL? I know that for this I have to specify some encoding for the data in OpenGL.
I've already written an algorithm to convert this data to normal RGBA data. But this takes much more graphics memory than the game actually uses (about 30% of each image is transparent, which could be run-length encoded instead). So if the game is not converting the image to full RGBA, I want to find a way to avoid doing that as well.
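For reference, a decoder for the layout described above could look roughly like this (names are made up, error handling is omitted, and a color-run count is assumed to always follow a transparent count, per the description):

#include <cstdint>
#include <vector>

// Decode one texture into an RGBA8 buffer; 'palette' holds 256 entries of
// 4 bytes (b, g, r, 0) and 'src' points at the texel data for this texture.
std::vector<uint8_t> decodeTexture(const uint8_t* src, const uint8_t* palette,
                                   int width, int height)
{
    std::vector<uint8_t> rgba(width * height * 4, 0);    // starts fully transparent
    int x = 0, y = 0;
    bool expectTransparentCount = true;                  // counts alternate with color runs
    while (y < height) {
        if (expectTransparentCount) {
            uint8_t v = *src++;
            if (v == 0xFF) break;                        // end of image, rest stays transparent
            if (v == 0xFE) { x = 0; ++y; continue; }     // end of row, rest of row stays transparent
            x += v;                                      // skip v transparent texels
            expectTransparentCount = false;
        } else {
            uint8_t n = *src++;                          // n palette-indexed texels follow
            for (int i = 0; i < n; ++i, ++x) {
                const uint8_t* c = palette + 4 * (*src++);
                uint8_t* dst = &rgba[(y * width + x) * 4];
                dst[0] = c[2]; dst[1] = c[1]; dst[2] = c[0]; dst[3] = 0xFF;  // bgr0 -> rgba
            }
            expectTransparentCount = true;
        }
    }
    return rgba;
}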
Can anybody give me some help?
This is just a paletted image that uses RLE encoding for empty spaces. There's not really a name for that. It's like GIF only not as good, but probably easier to decompress.
I've already written an algorithm to convert this data to normal RGBA data.
Then you're done. Upload that to an OpenGL texture.
So if the game is not converting the image to full RGBA, I want to find a way to avoid doing that as well.
You can't.
While you could implement a palette in a shader by using either two textures or a texture and a UBO/SSBO for the palette, you can't implement the run-length encoding scheme in a shader.
RLE is fine for data storage and bulk decompression, but it is terrible at random access. And random access is precisely how textures work. There's no real way to map a texture coordinate to a memory address containing the data for the corresponding texel. And if you can't do that, you can't access the texel.
Actual compressed texture formats are designed in a way that you can go directly from a texture coordinate to an exact memory address for the block of data containing that texel. RLE isn't like that.
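To illustrate the two-texture palette route mentioned above (this covers only the palette, not the RLE), the upload side could look roughly like this; indexData, paletteData, width and height are placeholders for data you have read from the game file, and a fragment shader would then sample the index texture and use the value to fetch from the palette texture:

// 8-bit index image: one byte per texel, sampled with nearest filtering.
GLuint indexTex;
glGenTextures(1, &indexTex);
glBindTexture(GL_TEXTURE_2D, indexTex);
glPixelStorei(GL_UNPACK_ALIGNMENT, 1);                          // rows are tightly packed
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexImage2D(GL_TEXTURE_2D, 0, GL_R8, width, height, 0,
             GL_RED, GL_UNSIGNED_BYTE, indexData);

// 256x1 palette strip; the file stores b, g, r, 0, so either upload as GL_BGRA
// and fix up the alpha channel in the shader, or rewrite the entries to RGBA first.
GLuint paletteTex;
glGenTextures(1, &paletteTex);
glBindTexture(GL_TEXTURE_2D, paletteTex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, 256, 1, 0,
             GL_BGRA, GL_UNSIGNED_BYTE, paletteData);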
You have to know how the texels are indexed within the file, but you can actually access the data without decompressing it, though you have to run through the file linearly. You read it, say, "run" by "run" together with its associated colour, and you increment a counter by the number of texels in the "run" you just read. Given the pixel index as a scalar (into a flat array), you keep reading runs while the counter has not yet passed the index:
while counter <= index:
    read the next "run"
    counter += texels_in_run(value)
As soon as index < counter, you know the pixel is within the "run" you just read and you can get its colour. You can use this to read the data both on the CPU and the GPU, but on the GPU side each thread would need to read the data from the start until it finds the desired pixel. Since the data is still compressed, you get O(n) time in the number of "runs" you read, not in the number of pixels, which may not be that bad…
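A minimal sketch of that lookup, assuming the runs have already been parsed into (count, value) pairs (so not your exact file layout):

#include <vector>

struct Run { int count; int value; };            // value: a palette index, or -1 for transparent

// Returns the value of the run containing pixel 'index', walking the runs linearly.
int valueAt(const std::vector<Run>& runs, int index)
{
    int counter = 0;
    for (const Run& r : runs) {
        counter += r.count;                      // advance past this run
        if (index < counter)                     // the pixel falls inside the run just read
            return r.value;
    }
    return -1;                                   // index lies past the end of the data
}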
Or you could probably find a better way to read the data, instead of accessing it linearly... :)
I didn’t look closely at what you wrote, so I can’t say this method will work with the way the data is formatted in your files, but maybe with some tinkering you’re able to make it work.
I need to iterate over the pixels of a YUV NV12 buffer and set color. I think the conversion for NV12 format should be easy but I can't figure it out. If I could set the top 50x50 pixels at 0,0 to white, I'd be set. Thank you in advance.
Have you tried setting the first 12 bits (1.5 bytes) * number of pixels, i.e. the whole NV12 buffer, to all 0x00 or all 0xFF?
Since you really don't seem to care about the conversion, simply overwriting the buffer would suffice. If that works, you can tackle the other problems, like finding the right color and producing a rect instead of a line.
For the first, you need to understand the YUV coding: https://wiki.videolan.org/YUV#NV12. According to this document you will most likely need to overwrite bytes in the Y range and in the UV range, so you are writing at two different locations. That's very different from an RGB buffer, where all of a pixel's color data has close locality. So you can start by overwriting the first byte (8 bits) in the Y plane and the first U/V byte pair in the interleaved UV plane (those two bytes are shared by a 2x2 block of pixels). That should set one pixel to a different color than before.
Finally you can tackle the display of the 50x50 rectangle. You'll need to know the image dimensions, because you'll need to offset after each row (if the buffer is transmitted by rows!). E.g., this graph:
.------.
|xx |
|xx |
| |
'------'
In an RGB color space, with row-major transmitted values, the buffer would look like this: xx0000xx0000000000. So you would need to overwrite bytes 0-5 and bytes 18-23: the first range is 2 pixels * 3 bytes (RGB), and the next range starts at row number (1) * image width (6) * 3 bytes (RGB) = 18, and so on. You have to apply the same thinking to the YUV color space.
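Putting this together for NV12, a sketch of filling a white 50x50 rectangle at (0, 0) could look like this (white is Y=255, U=V=128; the buffer is assumed to be a tightly packed NV12 frame of the given width and height):

#include <cstdint>

// NV12: a full-resolution Y plane (width*height bytes) followed by an interleaved
// UV plane at half resolution (width*height/2 bytes, one U and one V per 2x2 block).
void drawWhiteRect(uint8_t* buffer, int width, int height, int rectW, int rectH)
{
    uint8_t* yPlane  = buffer;
    uint8_t* uvPlane = buffer + width * height;

    for (int y = 0; y < rectH; ++y)                     // luma: one byte per pixel
        for (int x = 0; x < rectW; ++x)
            yPlane[y * width + x] = 255;

    for (int y = 0; y < rectH / 2; ++y)                 // chroma: one U/V pair per 2x2 block
        for (int x = 0; x < rectW / 2; ++x) {
            uvPlane[y * width + 2 * x]     = 128;       // U
            uvPlane[y * width + 2 * x + 1] = 128;       // V
        }
}

// e.g. drawWhiteRect(nv12Buffer, frameWidth, frameHeight, 50, 50);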
While I was reading the JPEG spec, I came to know that when encoding a JPEG, the image is first broken into 8x8 blocks, and then the DCT and other steps happen.
So I am curious to know: how would an image (raw file) containing a single row get encoded using JPEG?
Would JPEG add 7 extra rows to the file so that it can break it into 8x8 blocks?
A very nice explanation is given in https://dsp.stackexchange.com/questions/35339/jpeg-dct-padding
From Baseline JPEG:
The image is partitioned into blocks of size 8x8.
Each block is then independently transformed using the 8x8 DCT. If the image dimensions are not exact multiples of 8, the blocks on the lower and right hand boundaries may be only partially occupied. These boundary blocks must be padded to the full 8x8 block size and processed in an identical fashion to every other block. The compressor is free to select the value used to pad partial boundary blocks.
In JPEG compression, images whose dimensions are not multiples of the MCU size are padded up to that size.
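For example, a 100x1 grayscale image is coded as 13x1 blocks (ceil(100/8) by ceil(1/8)): the encoder fills the missing 7 rows of each block with padding values of its own choosing (often a copy of the last real row, which keeps the DCT coefficients small), and the decoder discards them again because the frame header still records the true 100x1 size.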
I am trying to optimize a block matching algorithm for motion estimation in OpenCL. Basically the image size is 384x288 and supposing the image is divided into a number of non-overlapping macro blocks of size 16x16, a total of 24x18 macro blocks can be realized.
At each macro block location, the motion between two consecutive frames has to be estimated (this involves searching a nearby region for the minimum sum of absolute differences in pixel intensity, grayscale, using 16x16 blocks). Am I correct in setting the global sizes to 24 and 18 respectively while launching the kernel?
My understanding is that when the OpenCL kernel launches, the location of the macroblock on the original image can be worked out as {get_local_size(0) x 16 - 1, get_local_size(1) x 16 - 1}. Is this correct? Also, what would be the optimal value for the local work group size for this use case?
Thank you
am I correct in setting the global sizes to 24 and 18 respectively
while launching the kernel
If each thread computes a whole macroblock, then yes, you are right about the global size, but the local size should be 1 (or something like 3x2). If instead a single thread computes a single pixel, then no: the global size is the total number of threads, so it should be 384x288 if you compute one pixel per thread.
The number of groups/macroblocks changes with the local size relative to the global size.
If there are 16 threads in a group and 32 threads in total, there would be only 2 groups of threads. The same thing happens for 2D and 3D kernel executions.
The location of the macroblock on the original image can be worked out as
x=get_group_id(0) * get_local_size(0)
y=get_group_id(1) * get_local_size(1)
The group id starts from zero, and the location (x, y) points to the upper-left corner of the patch. The lower-right corner (inclusive) would then be
xLast=get_group_id(0) * get_local_size(0) + get_local_size(0) - 1
yLast=get_group_id(1) * get_local_size(1) + get_local_size(1) - 1
Of course, the origin (0, 0) is assumed to be at the top-left.
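For example, with a 16x16 local size, group id (3, 2) corresponds to the macroblock whose upper-left corner is at (48, 32) and whose lower-right corner is at (63, 47).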
Also what would be the optimal value for local work group size for
this use case?
If you leave the local size parameter empty (NULL), the OpenCL implementation chooses it itself (a suitable size, but maybe not the best one), so the number of groups is not under your control.
Global size and local size will be different depending on whether you have a thread per pixel, a thread per macroblock, or even more than one thread per pixel. For example, if 2 new frames are to be calculated from the older 5 frames, 2 threads per pixel could be used. Or you can do a pixel's whole job in a single thread, do all 16x16 pixels of a macroblock in a single thread, or do everything in a single thread. The choice is yours; you should test (or foresee) whether your algorithm is embarrassingly parallel or mostly serial.
I guess the estimation is something like a 5 (or 11)-point stencil (2D time differentiation?), so a single thread will add and multiply things suitably, apply that to a pixel, then do the same for the other frame's pixel, then for all 16x16 pixels of the macroblock, then for all macroblocks; it should use 1 thread per pixel (re-using the already computed stencil for both frames) (with only 1 color channel?).
You could start with working code (or re-write it yourself) and then parallelize its nested loops; for example, you could scan lines (1D kernel), scan pixels (2D kernel), or scan pixels and their sub-pixels (3D?), such that i becomes get_global_id(0) and j becomes get_global_id(1).
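On the host side, the one-work-item-per-pixel layout with one 16x16 work-group per macroblock could be enqueued roughly like this ('queue' and 'kernel' are assumed to be set up already):

// 384x288 global size split into 24x18 work-groups of 16x16:
// get_group_id(0/1) then ranges over 0..23 / 0..17 (one group per macroblock)
// and get_global_id(0/1) gives the pixel coordinate.
size_t globalSize[2] = { 384, 288 };
size_t localSize[2]  = { 16, 16 };
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                    globalSize, localSize, 0, NULL, NULL);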
Whenever I read a color image with 3 channels via cv::imread, its data alignment is a bit awkward (neither a byte nor an integer boundary per pixel) and it slows me down when I read a single pixel's data in GPU memory. It also seems the cv::Mat class's logic behind the alignment is a bit different from what I had initially thought: it does not add an extra byte between two pixels in a single row so that each pixel starts at a 4-byte boundary; rather, it pads some extra bytes at the END of each row so that every row starts at a 4-byte boundary.
What should I do to pack each pixel's data into a single unsigned integer? Is there a built-in method in OpenCV so that I do not have to pack each pixel's data one by one with bitwise OR operations?
Kind Regards.
You can convert the pixel format from BGR to BGRA.
See this example.
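A minimal sketch (the variable names are just for illustration):

#include <cstdint>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>

cv::Mat bgr = cv::imread("input.png", cv::IMREAD_COLOR);    // 3 bytes per pixel
cv::Mat bgra;
cv::cvtColor(bgr, bgra, cv::COLOR_BGR2BGRA);                // 4 bytes per pixel

// Every pixel now occupies exactly 4 bytes, so a row can be read as 32-bit values;
// bgra.isContinuous() tells you whether the whole image is one packed block.
const uint32_t* row0 = bgra.ptr<uint32_t>(0);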