Is there a way to force PMULHRSW to treat 0x8000 as 1.0 instead of -1.0? - image-processing

To process 8-bit pixels, to do things like gamma correction without losing information, we normally upsample the values, work in 16 bits or whatever, and then downsample them to 8 bits.
Now, this is a somewhat new area for me, so please excuse incorrect terminology etc.
For my needs I have chosen to work in "non-standard" Q15, where I only use the upper half of the range (0.0-1.0), and 0x8000 represents 1.0 instead of -1.0. This makes it much easier to calculate things in C.
But I ran into a problem with SSSE3. It has the PMULHRSW instruction which multiplies Q15 numbers, but it uses the "standard" range of Q15 is [-1,1-2⁻¹⁵], so multplying (my) 0x8000 (1.0) by 0x4000 (0.5) gives 0xC000 (-0.5), because it thinks 0x8000 is -1. This is quite annoying.
What am I doing wrong? Should I keep my pixel values in the 0000-7FFF range? Doesn't this kind of defeat the purpose of it being a fixed-point format? Is there a way around this? Maybe some trick?
Is there some kind of definitive treatise on Q15 which discusses all this?

Personally, I'd go with the solution of limiting the max value to 0x7FFF (~0.99something).
You don't have to jump through hoops getting the processor to work the way you'd like it
You don't have to spend a long time documenting the ins and outs of your "weird" code, as operating over 0-0x7FFF will be immediately recognisable to the readers of your code - Q-format is understood (in my experience) to run from -1.0 to +1.0-one lsb. The arithmetic doesn't work out so well otherwise, as the value of 1 lsb is different on each side of the 0!
Unless you can imagine yourself successfully arguing, to a panel of argumentative code reviewers, that that extra bit is critical to the operation of the algorithm rather than just "the last 0.01% of performance", stick to code everyone can understand, and which maps to the hardware you have available.
Alternatively, re-arrange your previous operation so that the pixels all come out to be the negative of what you originally had. Or the following operations to take in the negative of what you previously sent it. Then use values from -1.0 to 0.0 in Q15 format.

If you are sure that you won’t use any number “bigger” than $8000, the only problem would be when at least one of the multipliers is $8000 (–1, though you wish it were 1).
In this case the solution is rather simple:
pmulhrsw xmm0, xmm1
psignw xmm0, xmm0
Or, absolutely equivalent in our case (Thanks, Peter Cordes!):
pmulhrsw xmm0, xmm1
pabsw xmm0, xmm0
This will revert the negative values from multiplying by –1 to their positive values.

Related

16 bit Checksum fuzzy analsysis - Leveraging "collisions", biases a thing?

If playing around with CRC RevEng fails, what next? That is the gist of my question. I am trying to learn more how to think for myself, not just looking for an answer 1 time to 1 problem.
Assuming the following:
1.) You have full control of white box algorithm and can create as many chosen sample messages as you want with valid 16 bit / 2 byte checksums
2.) You can verify as many messages as you want to see if they are valid or not
3.) Static or dynamic analysis of the white box code is off limits (say the MCU is of a lithography that would require electron microscope to analyze for example, not impossible but off limits for our purposes).
Can you use any of these methods or lines of thinking:
1.) Inspect "collisions", i.e. different messages with same checksum. Perhaps XOR these messages together and reveal something?
2.) Leverage strong biases towards certain checksums?
3.) Leverage "Rolling over" of the checksum "keyspace", i.e. every 65535 sequentially incremented messages you will see some type of sequential patterns?
4.) AI ?
Perhaps there are other strategies I am missing?
CRC RevEng tool was not able to find the answer using numerous settings configurations
Key properties and attacks:
If you have two messages + CRCs of the same length and exclusive-or them together, the result is one message and one pure CRC on that message, where "pure" means a CRC definition with a zero initial value and zero final exclusive-or. This helps take those two parameters out of the equation, which can be solved for later. You do not need to know where the CRC is, how long it is, or which bits of the message are participating in the CRC calculation. This linear property holds.
Working with purified examples from #1, if you take any two equal-length messages + pure CRCs and exclusive-or them together, you will get another valid message + pure CRC. Using this fact, you can use Gauss-Jordan elimination (over GF(2)) to see how each bit in the message affects other generated bits in the message. With this you can find out a) which bits in the message are particiapting, b) which bits are likely to be the CRC (though it is possible that other bits could be a different function of the input, which can be resolved by the next point), and c) verify that the check bits are each indeed a linear function of the input bits over GF(2). This can also tell you that you don't have a CRC to begin with, if there don't appear to be any bits that are linear functions of the input bits. If it is a CRC, this can give you a good indication of the length, assuming you have a contiguous set of linearly dependent bits.
Assuming that you are dealing with a CRC, you can now take the input bits and the output bits for several examples and try to solve for the polynomial given different assumptions for the ordering of the input bits (reflected or not, perhaps by bytes or other units) and the direction of the CRC shifting (reflected or not). Since you're talking about an allegedly 16-bit CRC, this can be done most easily by brute force, trying all 32,768 possible polynomials for each set of bit ordering assumptions. You can even use brute force on a 32-bit CRC. For larger CRCs, e.g. 64 bits, you would need to use Berlekamp's algorithm for factoring polynomials over finite fields in order to solve the problem before the heat death of the universe. Then you would factor each message + pure CRC as a polynomial, and look for common factors over multiple examples.
Now that you have the message bits, CRC bits, bit orderings, and the polynomial, you can go back to your original non-pure messages + CRCs, and solve for the initial value and final exclusive-or. All you need are two examples with different lengths. Then it's a simple two-equations-in-two-unknowns over GF(2).
Enjoy!

Explanation for Values in Scharr-Filter used in OpenCV (and other places)

The Scharr-Filter is explained in Scharrs dissertation. However the values given on page 155 (167 in the pdf) are [47 162 47] / 256. Multiplying this with the derivation-filter would yield:
Yet all other references I found use
Which is roughly the same as the ones given by Scharr, scaled by a factor of 32.
Now my guess is that the range can be represented better, but I'm curious if there is an official explanation somewhere.
To get the ball rolling on this question in case no "expert" can be found...
I believe the values [3, 10, 3] ... instead of [47 162 47] / 256 ... are used simply for speed. Recall that this method is competing against the Sobel Operator whose coefficient values are are 0, and positive/negative 1's and 2's.
Even though the divisor in the division, 256 or 512, is a power of 2 and can can be performed by a shift, doing that and multiplying by 47 or 162 is going to take more time. A multiplication by 3 however can in fact be done on some RISC architectures like the IBM POWER series in a single shift-and-add operation. That is 3x = (x << 1) + x. (On these architectures, the shifter and adder are separate units and can be done independently).
I don't find it surprising that Phd paper used the more complicated and probably more precise formula; it needed to prove or demonstrate something, and the author probably wasn't totally certain or concerned that it be used and implemented alongside other methods. The purpose in the thesis was probably to have "perfect rotational symmetry". Afterwards when one decides to implement it, that person I suspect used the approximation formula and gave up a little on perfect rotational symmetry, to gain speed. That person's goal as I said was to have something that was competitive at the expense of little bit of speed for this rotational stuff.
Since I'm guessing you are willing to do work this as it is your thesis, my suggestion is to implement the original algorithm and benchmark it against both the OpenCV Scharr and Sobel code.
The other thing to try to get an "official" answer is: "Use the 'source', Luke!". The code is on github so check it out and see who added the Scharr filter there and contact that person. I won't put the person's name here, but I will say that the code was added 2010-05-11.

Hardware implementation for integer data processing

I am currently trying to implement a data path which processes an image data expressed in gray scale between unsigned integer 0 - 255. (Just for your information, my goal is to implement a Discrete Wavelet Transform in FPGA)
During the data processing, intermediate values will have negative numbers as well. As an example process, one of the calculation is
result = 48 - floor((66+39)/2)
The floor function is used to guarantee the integer data processing. For the above case, the result is -4, which is a number out of range between 0~255.
Having mentioned above case, I have a series of basic questions.
To deal with the negative intermediate numbers, do I need to represent all the data as 'equivalent unsigned number' in 2's complement for the hardware design? e.g. -4 d = 1111 1100 b.
If I represent the data as 2's complement for the signed numbers, will I need 9 bits opposed to 8 bits? Or, how many bits will I need to process the data properly? (With 8 bits, I cannot represent any number above 128 in 2's complement.)
How does the negative number division works if I use bit wise shifting? If I want to divide the result, -4, with 4, by shifting it to right by 2 bits, the result becomes 63 in decimal, 0011 1111 in binary, instead of -1. How can I resolve this problem?
Any help would be appreciated!
If you can choose to use VHDL, then you can use the fixed point library to represent your numbers and choose your rounding mode, as well as allowing bit extensions etc.
In Verilog, well, I'd think twice. I'm not a Verilogger, but the arithmetic rules for mixing signed and unsigned datatypes seem fraught with foot-shooting opportunities.
Another option to consider might be MyHDL as that gives you a very powerful verification environment and allows you to spit out VHDL or Verilog at the back end as you choose.

Input of a fixed point DSP

i'm new to working with dsps and fixed point and i really need to know:
1. Is it the fixed point dsp that converts the float number to Q format or a device does that before feeding the Dsp?
2. Who specifies the Q format to be used. Does each DSP come with a specified Q_format or the programmer does that in his codes.
3. Can i have an idea of how to perform a simple say 4 by 4 fixed point matrix multiplication in c++?
Thanks in anticipation
The format is usually fixed for a given DSP, e.g. Motorola DSP 56k family uses a 24 bit signed fractional format (Q23).
Fixed point is really just the same as an ordinary integer but there's an implicit scale factor. For most operations this makes no difference, e.g. load/store/add/subtract all work the same way regardless of whether the data is integer or fixed point.
When it comes to multiplication or division however the implicit scaling factor needs to be taken into account - typically there will be a shift after the operation to correct for this. DSP instructions take care of this automatically, whereas normal CPUs have to do this explicitly.
When you're doing e.g. a 4x4 matrix multiply you just use the DSP's native fixed point arithmetic instructions and the scaling is all taken care of automatically.

Does CRC has the following feature

When the data transmission is tampered 1 bit or 2 bits, can the receiver correct it automatically?
No, CRC is an error-detecting code, not an error-correcting code.
Read more here
CRC is primarily used as an error-detecting code. If the total number of bits (including those in the CRC) is smaller than the CRC's period, though, it is possible to correct single-bit errors by computing the syndrome (xor the calculated and received CRC's). Each bit will, if flipped individually, generate a unique syndrome. One can iterate the CRC algorithm to find the syndrome that would be associated with each bit; if one finds the syndrome associated with each bit, one can flip it and correct a single-bit error.
One major danger with doing this, though, is that the CRC will be much less useful for rejecting bogus data. If one uses an 8-bit CRC on a packet with 15 bytes of data, only one in 256 random packets would pass validity, but half of all random packets could be "corrected" by flipping a single bit.

Resources