High-bit vs. low-bit encoding for 10/12-bit images

I had a discussion with a colleague recently on the topic of high-bit vs. low-bit encodings of 12-bit images in 16 bits of memory (e.g. PNG files). He argued that high-bit encodings are easier to use, since many image viewers (e.g. Windows image preview, Explorer thumbnails, ...) can display them in a human-readable way using trivial 16-to-8-bit conversions, while low-bit encoded images appear mostly black.
I'm thinking more from an image processing perspective and thought that surely, low-bit encodings are better, so I sat down, did some experimentation and wrote out my findings. I'm curious if the community here has additional or better insights that I am missing.
When using some image processing backend (e.g. Intel IPP), 10 and 12 bit encodings are often assumed to physically take 16 bits (unsigned short). The memory is read as "a number", so that a gray value of 1 (12 bit) encoded in the low bits yields a "1", while encoded in the high bits it yields a 16 (effectively, left shift by four).
Taking a low-bit image and the corresponding left-shifted high-bit image and performing some operations (possibly including interpolation) will yield identical results after right-shifting the result of the high-bit input (for comparison's sake); a small sketch of this appears after the list below.
The main differences come when looking at the histograms of the images: the low-bit histogram is "dense" and 4k entries long, the high-bit histogram contains 15 zeros for every non-zero entry, and is 65k entries long.
a) Generating lookup tables (LUTs) for operations takes 16x as long; applying them should take identical time.
b) Operations scaling with histogram^2 (e.g. gray-level co-occurrence matrices) become much more costly: 256x memory and time if done naively (but then again, in order to correctly treat any interpolated pixel values that are not allowed in the original 12-bit encoding, you kind of have to).
c) Debugging histogram vectors is a PITA when looking at mostly-zero, interspersed high-bit histograms.
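A minimal sketch of that shift equivalence, with integer averaging standing in for "some operation" (the values and the operation are illustrative only):

    #include <stdint.h>
    #include <stdio.h>

    /* Average two 12-bit pixels once in the low-bit encoding and once in the
     * high-bit encoding; shifting the high-bit result right by 4 gives the
     * same value.  Sums are computed as int, so nothing overflows. */
    int main(void) {
        uint16_t a12 = 1234, b12 = 3071;            /* 12-bit gray values      */
        uint16_t a_hi = a12 << 4, b_hi = b12 << 4;  /* high-bit (left-shifted) */

        uint16_t avg_lo = (a12 + b12) / 2;          /* operate on low bits     */
        uint16_t avg_hi = (a_hi + b_hi) / 2;        /* same operation, high    */

        printf("%u %u\n", (unsigned)avg_lo, (unsigned)(avg_hi >> 4));  /* identical */
        return 0;
    }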
To my great surprise, that was all I could come up with. Anything obvious that I am missing?

Related

16 bit Checksum fuzzy analysis - Leveraging "collisions", biases: a thing?

If playing around with CRC RevEng fails, what next? That is the gist of my question. I am trying to learn how to think about these problems for myself, not just looking for a one-off answer to a single problem.
Assuming the following:
1.) You have full control of the white-box algorithm and can create as many chosen sample messages as you want, each with a valid 16-bit / 2-byte checksum
2.) You can verify as many messages as you want to see if they are valid or not
3.) Static or dynamic analysis of the white-box code is off limits (say the MCU is built on a process geometry that would require an electron microscope to analyze, for example; not impossible, but off limits for our purposes).
Can you use any of these methods or lines of thinking:
1.) Inspect "collisions", i.e. different messages with same checksum. Perhaps XOR these messages together and reveal something?
2.) Leverage strong biases towards certain checksums?
3.) Leverage "Rolling over" of the checksum "keyspace", i.e. every 65535 sequentially incremented messages you will see some type of sequential patterns?
4.) AI ?
Perhaps there are other strategies I am missing?
The CRC RevEng tool was not able to find the answer using numerous settings and configurations.
Key properties and attacks:
If you have two messages + CRCs of the same length and exclusive-or them together, the result is one message and one pure CRC on that message, where "pure" means a CRC definition with a zero initial value and zero final exclusive-or. This helps take those two parameters out of the equation, which can be solved for later. You do not need to know where the CRC is, how long it is, or which bits of the message are participating in the CRC calculation; all of this follows from the CRC's linearity.
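A minimal sketch of that purification step for two equal-length messages, using a generic bitwise CRC-16 model (the polynomial, init, and xorout below are illustrative, not the unknown checksum from the question):

    #include <stdint.h>
    #include <stdio.h>

    /* Bitwise CRC-16, MSB-first, with configurable init/xorout. */
    static uint16_t crc16(const uint8_t *msg, int len, uint16_t poly,
                          uint16_t init, uint16_t xorout) {
        uint16_t crc = init;
        for (int i = 0; i < len; ++i) {
            crc ^= (uint16_t)msg[i] << 8;
            for (int b = 0; b < 8; ++b)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ poly)
                                     : (uint16_t)(crc << 1);
        }
        return crc ^ xorout;
    }

    int main(void) {
        const uint16_t poly = 0x1021, init = 0xFFFF, xorout = 0xFFFF;
        uint8_t a[4] = {0x12, 0x34, 0x56, 0x78};
        uint8_t b[4] = {0xDE, 0xAD, 0xBE, 0xEF};
        uint8_t x[4];
        for (int i = 0; i < 4; ++i) x[i] = a[i] ^ b[i];

        /* XOR of the two (message, CRC) pairs equals the "pure" CRC
         * (init = 0, xorout = 0) of the XORed messages: for equal lengths
         * the init and xorout contributions cancel. */
        uint16_t lhs = crc16(a, 4, poly, init, xorout) ^
                       crc16(b, 4, poly, init, xorout);
        uint16_t rhs = crc16(x, 4, poly, 0, 0);
        printf("%04X %04X\n", lhs, rhs);            /* prints the same value twice */
        return 0;
    }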
Working with purified examples from #1, if you take any two equal-length messages + pure CRCs and exclusive-or them together, you will get another valid message + pure CRC. Using this fact, you can use Gauss-Jordan elimination (over GF(2)) to see how each bit in the message affects other generated bits in the message. With this you can find out a) which bits in the message are participating, b) which bits are likely to be the CRC (though it is possible that other bits could be a different function of the input, which can be resolved by the next point), and c) verify that the check bits are each indeed a linear function of the input bits over GF(2). This can also tell you that you don't have a CRC to begin with, if there don't appear to be any bits that are linear functions of the input bits. If it is a CRC, this can give you a good indication of the length, assuming you have a contiguous set of linearly dependent bits.
Assuming that you are dealing with a CRC, you can now take the input bits and the output bits for several examples and try to solve for the polynomial given different assumptions for the ordering of the input bits (reflected or not, perhaps by bytes or other units) and the direction of the CRC shifting (reflected or not). Since you're talking about an allegedly 16-bit CRC, this can be done most easily by brute force, trying all 32,768 possible polynomials for each set of bit ordering assumptions. You can even use brute force on a 32-bit CRC. For larger CRCs, e.g. 64 bits, you would need to use Berlekamp's algorithm for factoring polynomials over finite fields in order to solve the problem before the heat death of the universe. Then you would factor each message + pure CRC as a polynomial, and look for common factors over multiple examples.
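A rough brute-force sketch for the 16-bit, non-reflected case, assuming the samples have already been reduced to (message, pure CRC) pairs; the sample data here is generated from a "hidden" polynomial purely so the example is self-checking:

    #include <stdint.h>
    #include <stdio.h>

    /* Pure CRC-16 (init = 0, xorout = 0), MSB-first shifting. */
    static uint16_t crc16_pure(const uint8_t *msg, int len, uint16_t poly) {
        uint16_t crc = 0;
        for (int i = 0; i < len; ++i) {
            crc ^= (uint16_t)msg[i] << 8;
            for (int b = 0; b < 8; ++b)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ poly)
                                     : (uint16_t)(crc << 1);
        }
        return crc;
    }

    int main(void) {
        /* Pretend these (message, pure CRC) pairs came out of the
         * purification step above. */
        const uint16_t hidden = 0x8005;
        uint8_t m1[3] = {0x01, 0x02, 0x03}, m2[3] = {0xAA, 0x55, 0xF0};
        uint16_t c1 = crc16_pure(m1, 3, hidden), c2 = crc16_pure(m2, 3, hidden);

        /* Try all 32,768 candidate polynomials: the x^16 term is implicit
         * and a proper CRC polynomial has its low (x^0) bit set. */
        for (uint32_t p = 1; p < 0x10000; p += 2)
            if (crc16_pure(m1, 3, (uint16_t)p) == c1 &&
                crc16_pure(m2, 3, (uint16_t)p) == c2)
                printf("candidate polynomial: 0x%04X\n", (unsigned)p);
        return 0;
    }

Reflected bit orderings would simply repeat the same loop with the byte/bit order of the inputs and the shift direction reversed.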
Now that you have the message bits, CRC bits, bit orderings, and the polynomial, you can go back to your original non-pure messages + CRCs, and solve for the initial value and final exclusive-or. All you need are two examples with different lengths. Then it's a simple two-equations-in-two-unknowns over GF(2).
Enjoy!

Which one is the better CRC scheme?

Say I have to error-check a message some 120 bits long. I have two alternatives for the checksum scheme:
Split the message into five 24-bit strings and append each with a CRC-8 field
Append the whole message with a CRC-32 field
Which scheme has a higher error-detection probability, and why? Let's assume no prior knowledge about the distribution of error patterns.
UPDATE:
What if the system has a natural failure mode in which a cleared bit is received instead of a set bit (i.e., "1" was Tx-ed but "0" was Rx-ed), and the opposite does not happen?
In this case, the probability of long bursts of error bits is much smaller: assuming the valid data has a uniform distribution of "0"s and "1"s, the longest burst will be bounded by the longest run of "1"s in the message.
You have to make some assumption about the error patterns. If you have a uniform distribution over all possible errors, then five 8-bit CRCs will detect more of the errors than one 32-bit CRC, simply because the former has 40 bits of redundancy.
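To put rough numbers on that: assuming each check behaves like a random hash on a uniformly random error, the chance of an error slipping past all five CRC-8s is about 2^-40 (roughly 9e-13), versus about 2^-32 (roughly 2.3e-10) for the single CRC-32.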
However, I can construct many 24-bit error patterns that fool an 8-bit CRC, and use any combination of five of those to get no errors over all of the 8-bit CRCs. Yet almost all of those will be caught by the 32-bit CRC.
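As a small self-check of that claim (the polynomials here are illustrative: CRC-8 with poly 0x07 and the standard reflected CRC-32), the error pattern below is just the CRC-8 generator embedded in the 24-bit block, which a zero-init/zero-xorout CRC-8 cannot see, while the CRC-32 catches it because any non-zero error spanning fewer than 33 bits always changes a 32-bit CRC:

    #include <stdint.h>
    #include <stdio.h>

    /* Bitwise CRC-8, MSB-first, init = 0, xorout = 0, poly 0x07 (illustrative). */
    static uint8_t crc8(const uint8_t *p, int n) {
        uint8_t crc = 0;
        for (int i = 0; i < n; ++i) {
            crc ^= p[i];
            for (int b = 0; b < 8; ++b)
                crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                                   : (uint8_t)(crc << 1);
        }
        return crc;
    }

    /* Standard reflected CRC-32 (poly 0xEDB88320), bit at a time. */
    static uint32_t crc32b(const uint8_t *p, int n) {
        uint32_t crc = 0xFFFFFFFFu;
        for (int i = 0; i < n; ++i) {
            crc ^= p[i];
            for (int b = 0; b < 8; ++b)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
        return ~crc;
    }

    int main(void) {
        uint8_t msg[3] = {0x12, 0x34, 0x56};
        /* Error pattern = the CRC-8 generator x^8+x^2+x+1 shifted into the
         * block; its CRC-8 is zero, so XORing it into any 24-bit chunk
         * leaves that chunk's CRC-8 unchanged. */
        uint8_t err[3] = {0x01, 0x07, 0x00};
        uint8_t bad[3];
        for (int i = 0; i < 3; ++i) bad[i] = msg[i] ^ err[i];

        printf("CRC-8 : %02X vs %02X  (identical: error missed)\n",
               crc8(msg, 3), crc8(bad, 3));
        printf("CRC-32: %08X vs %08X  (different: error caught)\n",
               crc32b(msg, 3), crc32b(bad, 3));
        return 0;
    }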
A good paper by Philip Koopman goes through the evaluation of several CRCs, mostly focusing on their Hamming distance (HD). As Mark Adler pointed out, the error distribution plays an important role in CRC selection (e.g. burst-error detection is one of the properties that varies between CRC polynomials), as does the length of the CRC'ed data.
The Hamming distance of a CRC is the smallest number of bit errors that can go undetected; all error patterns with fewer than HD bit flips are 100% detectable.
Ref:
Cyclic Redundancy Code (CRC) Polynomial Selection For Embedded Networks:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.5.5027&rep=rep1&type=pdf
8-bit vs 32-bit CRC
For example, the 0x97 8-bit CRC polynomial has HD=4 up to 119-bit data words (more than your required 24-bit chunks), which means it detects 100% of all errors of 3 bits or fewer for data of length 119 bits or less.
On the 32-bit side, the 32-bit CRC 0x9d7f97d6 offers HD=9 up to 223-bit data words (greater than 5*24 = 120 bits). This means it will detect 100% of all errors of 8 bits or fewer for data of up to 223 bits.
Theoretically, the five 8-bit CRCs are guaranteed to detect up to 3*5 = 15 bit flips, provided they are evenly distributed (at most 3 errors per 24-bit chunk). The 32-bit CRC, on the other hand, is only guaranteed to detect up to 8 bit flips in the whole 120-bit message.
Error Distribution
Knowing all that, the only missing piece is the error distribution pattern. With it in hand, you'll be able to make an informed decision on the best CRC method to use. You seem to say that long bursts of errors are not possible, but you do not mention the maximum number of bit errors you expect. If errors of 4 to 8 bits are possible, you might be better off with the CRC-32, which still guarantees detection in that range. If you expect only occasional errors of 3 bits or fewer, both would do, though the five 8-bit CRCs consume more bandwidth (40 bits instead of 32). In that case a 32-bit CRC might even be overkill; a smaller CRC-16 or even CRC-9 could provide enough detection capability.
Beyond its Hamming distance, a CRC can no longer catch every possible error pattern, and the longer the protected data, the lower the achievable HD.
The CRC-32, of course. It will detect ordering errors between the five segments, as well as giving you 2^24 times as much error detection.

Efficient way to create a bit mask from multiple numbers possibly using SSE/SSE2/SSE3/SSE4 instructions

Suppose I have 16 ASCII characters (hence sixteen 8-bit numbers) in a 128-bit variable/register. I want to create a bit mask in which the bits whose positions (indexes) are given by those 16 characters are set.
For example, if the string formed from those 16 characters is "CAD...", then in the bit mask the 67th bit, 65th bit, 68th bit and so on should be 1. The rest of the bits should be 0. What is an efficient way to do this, especially using SIMD instructions?
I know that one technique is addition, like this: 2^(67-1) + 2^(65-1) + 2^(68-1) + ...
But this requires a large number of operations. I want to do it in one or two operations/instructions if possible.
Please let me know a solution.
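For reference, the shift-and-OR idea from the question written out as a plain scalar loop (a hypothetical helper; the bit index is taken to be the character code itself, i.e. 0-based, whereas the question's 2^(c-1) form is the 1-based equivalent):

    #include <stdint.h>

    /* Build a 128-bit mask (two 64-bit halves) where bit v is set iff the
     * byte value v occurs among the 16 input characters.
     * Assumes ASCII input, i.e. v < 128. */
    static void build_mask_scalar(const unsigned char in[16], uint64_t mask[2]) {
        mask[0] = mask[1] = 0;
        for (int i = 0; i < 16; ++i) {
            unsigned v = in[i];                       /* e.g. 'C' = 67       */
            mask[v >> 6] |= (uint64_t)1 << (v & 63);  /* one shift + one OR  */
        }
    }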
SSE4.2 contains one instruction that performs almost what you want: PCMPISTRM with immediate operand 0. One of its operands should contain your ASCII characters, the other a constant vector with values like 32, 33, ..., 47. You get the result in the 16 least significant bits of XMM0. Since you need 128 bits, this instruction has to be executed 8 times with different constant vectors (6 times if you need only printable ASCII characters). After each PCMPISTRM, use a bitwise OR to accumulate the result in some XMM register.
There are two disadvantages to this method: (1) you need to read Intel's Architectures Software Developer's Manual to understand PCMPISTRM's details, because it is probably the most complicated SSE instruction ever, and (2) the instruction is pretty slow (a throughput of 1/2 on Nehalem, 1/3 on Sandy Bridge, 1/4 on Bulldozer), so you'll hardly get any significant speed improvement over the 'brute force' method.
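A rough intrinsics sketch of that approach (my own function and variable names; the result is accumulated in scalar registers rather than ORed in an XMM register, for clarity). It assumes the 16 input bytes are printable ASCII with no zero bytes, since the implicit-length PCMPISTRM stops at a NUL, and it therefore only issues the 6 probes covering values 32..127:

    #include <nmmintrin.h>   /* SSE4.2 */
    #include <stdint.h>

    /* Build a 128-bit mask (two 64-bit halves) where bit v is set iff the
     * byte value v (32..127) occurs among the 16 input characters. */
    static void build_mask_sse42(const char in[16], uint64_t mask[2]) {
        __m128i chars = _mm_loadu_si128((const __m128i *)in);
        mask[0] = mask[1] = 0;

        for (int base = 32; base < 128; base += 16) {
            /* probe[k] asks: "does the value base+k occur in 'chars'?" */
            unsigned char probe[16];
            for (int k = 0; k < 16; ++k)
                probe[k] = (unsigned char)(base + k);
            __m128i range = _mm_loadu_si128((const __m128i *)probe);

            /* EQUAL_ANY + BIT_MASK (immediate 0): result bit k is 1 if
             * range[k] matches any character in 'chars'. */
            __m128i r = _mm_cmpistrm(chars, range,
                                     _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY |
                                     _SIDD_BIT_MASK);
            uint64_t bits = (uint64_t)_mm_cvtsi128_si32(r) & 0xFFFF;
            mask[base / 64] |= bits << (base % 64);   /* bits base..base+15 */
        }
    }

Compile with -msse4.2 (or equivalent); the six probe constants could of course be precomputed instead of rebuilt in the loop.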

How are matrices stored in memory?

Note - this may be more related to computer organization than software; not sure.
I'm trying to understand something related to data compression, say for JPEG photos. Essentially a very dense matrix is converted (via discrete cosine transforms) into a much sparser matrix. Supposedly it is this sparse matrix that is stored. Take a look at this link:
http://en.wikipedia.org/wiki/JPEG
Compare the original 8x8 sub-block image example to matrix "B", which has been transformed to have overall lower-magnitude values and many more zeros throughout. How is matrix B stored such that it saves much more memory than the original matrix?
The original matrix clearly needs 8x8 (number of entries) x 8 bits/entry, since values can range from 0 to 255. OK, so I think it's pretty clear we need 64 bytes of memory for this. Matrix B, on the other hand, hmmm. The best-case scenario I can think of is that values range from -26 to +5, so at most an entry (like -26) needs 6 bits (5 bits to represent 26, 1 bit for the sign, I guess). So then you could store 8x8x6 bits = 48 bytes.
The other possibility I see is that the matrix is stored in a "zig zag" order from the top left. Then we can specify a start and an end address and just keep storing along the diagonals until we're only left with zeros. Let's say it's a 32-bit machine; then two addresses (start + end) take 8 bytes; for the other non-zero entries, at 6 bits each, say we have to go along almost all of the top diagonals to store a total of 28 elements. In total this scheme would take 29 bytes.
To summarize my question: if JPEG and other image encoders are claiming to save space by using algorithms to make the image matrix less dense, how is this extra space being realized in my hard disk?
Cheers
The DCT needs to be accompanied by other compression schemes that take advantage of the zeros and other frequently occurring values. A simple example is run-length encoding.
JPEG uses a variant of Huffman coding.
As described under "Entropy coding" there, a zig-zag pattern is used together with RLE, which already reduces the size in many cases. However, as far as I know the DCT by itself doesn't give you a sparse matrix per se; it concentrates the information so that the later entropy coding works better. This is where the compression becomes lossy: the input matrix is transformed with the DCT, then the values are quantized, and then Huffman coding is applied.
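For concreteness, the zig-zag visiting order for an 8x8 block (the same diagonal traversal JPEG's entropy coding uses) can be generated like this; order[k] is the row-major index of the k-th coefficient visited:

    #include <stdio.h>

    int main(void) {
        int order[64], k = 0;
        for (int s = 0; s < 15; ++s) {           /* s = row + col (anti-diagonal) */
            if (s % 2 == 0)                      /* even diagonals: walk up-right */
                for (int i = (s < 8 ? s : 7); i >= 0 && s - i < 8; --i)
                    order[k++] = i * 8 + (s - i);
            else                                 /* odd diagonals: walk down-left */
                for (int j = (s < 8 ? s : 7); j >= 0 && s - j < 8; --j)
                    order[k++] = (s - j) * 8 + j;
        }
        for (int i = 0; i < 64; ++i)             /* prints 0 1 8 16 9 2 3 10 ...  */
            printf("%d%c", order[i], (i % 8 == 7) ? '\n' : ' ');
        return 0;
    }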
The simplest compression would take advantage of repeated sequences of symbols (zeros). A matrix in memory might look like this (values shown in decimal):
0000000000000100000000000210000000000004301000300000000004
After compression it may look like this
(0,13)1(0,11)21(0,12)43010003(0,11)4
(Symbol,Count)...
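A tiny sketch of that (Symbol, Count) idea; only runs of four or more zeros are encoded here (an arbitrary cutoff), shorter runs stay literal, roughly as in the example above:

    #include <stdio.h>
    #include <stddef.h>

    /* Encode runs of '0' as "(0,count)"; everything else is copied through. */
    static void rle_zeros(const char *in) {
        for (size_t i = 0; in[i] != '\0'; ) {
            if (in[i] == '0') {
                size_t run = 0;
                while (in[i] == '0') { ++run; ++i; }
                if (run >= 4) printf("(0,%zu)", run);
                else while (run--) putchar('0');
            } else {
                putchar(in[i++]);
            }
        }
        putchar('\n');
    }

    int main(void) {
        rle_zeros("0000000000000100000000000210000000000004301000300000000004");
        return 0;
    }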
As I understand it, JPEG doesn't only compress, it also drops data. After the 8x8 block is transformed to the frequency domain, the insignificant (high-frequency) data is dropped, which means it effectively only has to store the significant 6x6 or even 4x4 portion. That is how it can achieve a higher compression rate than lossless methods (like GIF).

On-the-fly lossless image compression

I have an embedded application where an image scanner sends out a stream of 16-bit pixels that are later assembled to a grayscale image. As I need to both save this data locally and forward it to a network interface, I'd like to compress the data stream to reduce the required storage space and network bandwidth.
Is there a simple algorithm that I can use to losslessly compress the pixel data?
I first thought of computing the difference between two consecutive pixels and then encoding this difference with a Huffman code. Unfortunately, the pixels are unsigned 16-bit quantities, so the difference can be anywhere in the range -65535 .. +65535, which leads to potentially huge codeword lengths. If a few really long codewords occur in a row, I'll run into buffer overflow problems.
Update: my platform is an FPGA
PNG provides free, open-source, lossless image compression in a standard format using standard tools. PNG uses zlib as part of its compression. There is also libpng. Unless your platform is very unusual, it should not be hard to port this code to it.
How many resources do you have available on your embedded platform?
Could you port zlib and do gzip compression? Even with limited resources, you should be able to port something like LZ77 or LZ78.
There are a wide variety of image compression libraries available. For example, this page lists nothing but libraries/toolkits for PNG images. Which format/library works best for you will most likely depend on the particular resource constraints you are working under (in particular, whether or not your embedded system can do floating-point arithmetic).
The goal with lossless compression is to be able to predict the next pixel based on previous pixels, and then to encode the difference between your prediction and the real value of the pixel. This is what you initially thought of doing, but you were only using the one previous pixel and predicting that the next pixel would be the same.
Keep in mind that if you have all of the previous pixels, you have more relevant information than just the preceding pixel. That is, if you are trying to predict the value of X, you should use the O pixels:
..OOO...
..OX
Also, you would not want to use the previous pixel, B, in the stream to predict X in the following situation:
OO...B <-- End of row
X <- Start of next row
Instead you would make your prediction based on the Os.
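A sketch of that idea in C, using the median-edge-detector (MED) predictor from JPEG-LS as one concrete choice of neighbour-based predictor; the function names and the modulo-2^16 residual representation are my own choices here, not something prescribed by the answer:

    #include <stdint.h>

    /* MED predictor: a = left pixel, b = pixel above, c = pixel above-left. */
    static uint16_t predict(uint16_t a, uint16_t b, uint16_t c) {
        uint16_t mn = a < b ? a : b, mx = a < b ? b : a;
        if (c >= mx) return mn;              /* c large: predict the smaller neighbour */
        if (c <= mn) return mx;              /* c small: predict the larger neighbour  */
        return (uint16_t)(a + b - c);        /* otherwise: planar prediction a + b - c */
    }

    /* Turn a row-major 16-bit image into prediction residuals.  Residuals
     * are taken modulo 2^16 so they still fit in 16 bits; entropy coding of
     * 'res' (e.g. with a Huffman or Golomb code) is a separate step. */
    static void to_residuals(const uint16_t *img, uint16_t *res, int w, int h) {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                uint16_t a = x > 0 ? img[y * w + x - 1] : 0;
                uint16_t b = y > 0 ? img[(y - 1) * w + x] : 0;
                uint16_t c = (x > 0 && y > 0) ? img[(y - 1) * w + x - 1] : 0;
                uint16_t p = (x == 0 && y == 0) ? 0 : predict(a, b, c);
                res[y * w + x] = (uint16_t)(img[y * w + x] - p);  /* mod 2^16 */
            }
    }

Because the residuals wrap modulo 2^16, they never need more than 16 bits, which also sidesteps the -65535 .. +65535 range worry from the question.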
How 'lossless' do you need?
If this is a real scanner there is a limit to the bandwidth/resolution, so even if it can send +/-64K values, it may be physically impossible for adjacent pixels to differ by more than, say, 8 bits' worth.
In that case you can store a start pixel value for each row and then store the differences between consecutive pixels.
This will smear out peaks, but it may be that any peaks of more than 'N' bits are noise anyway.
A good LZ77/RLE hybrid with bells and whistles can get wonderful compression that is fairly quick to decompress. They will also be bigger, badder compressors on smaller files due to the lack of library overhead. For a good, but GPL'd, implementation of this, check out PUCrunch.
