why should the disparity value in sgbm be divisble by 16?

why should the disparity value in sgbm be divisble by 16? - opencv

I am working on opencv sgbm(semi global block matching) function. Here two parameters (minDisparity and numberOfDisparities) are used.
In that why numberOfDisparities values should be divisible by 16?

Probably to simplify the code internally, which uses SSE2. In general, SSE2 instructions:
Work on multiple numbers simultaneously; having the total number of pieces of information be evenly divisible makes things simpler.
SSE2 requires 128-bit (16 byte) memory alignment; alignment can be more easily maintained when things are nice multiples of 16...
If you examine the OpenCV source code, you'll see lots of SSE2 code for the SGBM algorithm.

Related

16 bit Checksum fuzzy analsysis - Leveraging "collisions", biases a thing?

If playing around with CRC RevEng fails, what next? That is the gist of my question. I am trying to learn more how to think for myself, not just looking for an answer 1 time to 1 problem.
Assuming the following:
1.) You have full control of white box algorithm and can create as many chosen sample messages as you want with valid 16 bit / 2 byte checksums
2.) You can verify as many messages as you want to see if they are valid or not
3.) Static or dynamic analysis of the white box code is off limits (say the MCU is of a lithography that would require electron microscope to analyze for example, not impossible but off limits for our purposes).
Can you use any of these methods or lines of thinking:
1.) Inspect "collisions", i.e. different messages with same checksum. Perhaps XOR these messages together and reveal something?
2.) Leverage strong biases towards certain checksums?
3.) Leverage "Rolling over" of the checksum "keyspace", i.e. every 65535 sequentially incremented messages you will see some type of sequential patterns?
4.) AI ?
Perhaps there are other strategies I am missing?
CRC RevEng tool was not able to find the answer using numerous settings configurations

Key properties and attacks:
If you have two messages + CRCs of the same length and exclusive-or them together, the result is one message and one pure CRC on that message, where "pure" means a CRC definition with a zero initial value and zero final exclusive-or. This helps take those two parameters out of the equation, which can be solved for later. You do not need to know where the CRC is, how long it is, or which bits of the message are participating in the CRC calculation. This linear property holds.
Working with purified examples from #1, if you take any two equal-length messages + pure CRCs and exclusive-or them together, you will get another valid message + pure CRC. Using this fact, you can use Gauss-Jordan elimination (over GF(2)) to see how each bit in the message affects other generated bits in the message. With this you can find out a) which bits in the message are particiapting, b) which bits are likely to be the CRC (though it is possible that other bits could be a different function of the input, which can be resolved by the next point), and c) verify that the check bits are each indeed a linear function of the input bits over GF(2). This can also tell you that you don't have a CRC to begin with, if there don't appear to be any bits that are linear functions of the input bits. If it is a CRC, this can give you a good indication of the length, assuming you have a contiguous set of linearly dependent bits.
Assuming that you are dealing with a CRC, you can now take the input bits and the output bits for several examples and try to solve for the polynomial given different assumptions for the ordering of the input bits (reflected or not, perhaps by bytes or other units) and the direction of the CRC shifting (reflected or not). Since you're talking about an allegedly 16-bit CRC, this can be done most easily by brute force, trying all 32,768 possible polynomials for each set of bit ordering assumptions. You can even use brute force on a 32-bit CRC. For larger CRCs, e.g. 64 bits, you would need to use Berlekamp's algorithm for factoring polynomials over finite fields in order to solve the problem before the heat death of the universe. Then you would factor each message + pure CRC as a polynomial, and look for common factors over multiple examples.
Now that you have the message bits, CRC bits, bit orderings, and the polynomial, you can go back to your original non-pure messages + CRCs, and solve for the initial value and final exclusive-or. All you need are two examples with different lengths. Then it's a simple two-equations-in-two-unknowns over GF(2).
Enjoy!

about CUFFT input sizes

It's written that CUFFT library supports algorithms that higly optimized for input sizes can be written in the folowing form: 2^a X 3^b X 5^c X 7^d.
How could they managed to do that?
For as far as I know, FFT must provide best perfomance only for 2^a input size.

This means that input sizes with prime factors larger than 7 would go slower.

The Cooley-Tukey algorithm can operate on a variety of DFT lengths which can be expressed as N = N_1*N_2. The algorithm recursively expresses a DFT of length N into N_1 smaller DFTs of length N_2.
As you note, the fastest is generally the radix-2 factorization, which recursively breaks a DFT of length N into 2 smaller DFTs of length N/2, running in O(NlogN).
However, the actual performance will depend on hardware and implementation. For example, if we are considering the cuFFT with a thread warp size of 32 then DFTs that have a length of some multiple of 32 would be optimal (note: just an example, I'm not aware of the actual optimizations that exist under the hood of the cuFFT.)
Short answer: the underlying code is optimized for any prime factorization up to 7 based on the Cooley-Tukey radix-n algorithm.
http://mathworld.wolfram.com/FastFourierTransform.html
https://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm

Input of a fixed point DSP

i'm new to working with dsps and fixed point and i really need to know:
1. Is it the fixed point dsp that converts the float number to Q format or a device does that before feeding the Dsp?
2. Who specifies the Q format to be used. Does each DSP come with a specified Q_format or the programmer does that in his codes.
3. Can i have an idea of how to perform a simple say 4 by 4 fixed point matrix multiplication in c++?
Thanks in anticipation

The format is usually fixed for a given DSP, e.g. Motorola DSP 56k family uses a 24 bit signed fractional format (Q23).
Fixed point is really just the same as an ordinary integer but there's an implicit scale factor. For most operations this makes no difference, e.g. load/store/add/subtract all work the same way regardless of whether the data is integer or fixed point.
When it comes to multiplication or division however the implicit scaling factor needs to be taken into account - typically there will be a shift after the operation to correct for this. DSP instructions take care of this automatically, whereas normal CPUs have to do this explicitly.
When you're doing e.g. a 4x4 matrix multiply you just use the DSP's native fixed point arithmetic instructions and the scaling is all taken care of automatically.

Efficient way to create a bit mask from multiple numbers possibly using SSE/SSE2/SSE3/SSE4 instructions

Suppose I have 16 ascii characters (hence 16 8 bit numbers) in a 128 bit variable/register. I want to create a bit mask in which those bits will be high whose bit positions (indexes) are represented by those 16 characters.
For example, if the string formed from those 16 characters is "CAD...", in the bit mask 67th bit, 65th bit, 68th bit and so on should be 1. The rest of the bits should be 0. What is the efficient way to do it specially using SIMD instructions?
I know that one of the technique is addition like this: 2^(67-1)+2^(65-1)+2^(68-1)+...
But this will require a large number of operations. I want to do it in one/two operations/instructions if possible.
Please let me know a solution.

SSE4.2 contains one instruction, that performs almost what you want: PCMPISTRM with immediate operand 0. One of its operands should contain your ASCII characters, other - a constant vector with values like 32, 33, ... 47. You get the result in 16 least significant bits of XMM0. Since you need 128 bits, this instruction should be executed 8 times with different constant vectors (6 times if you need only printable ASCII characters). After each PCMPISTRM, use bitwise OR to accumulate the result in some XMM register.
There are 2 disadvantages of this method: (1) you need to read the Intel's architectures software developer's manual to understand PCMPISTRM's details because that's probably the most complicated SSE instruction ever, and (2) this instruction is pretty slow (throughput of 1/2 on Nehalem, 1/3 on Sandy Bridge, 1/4 on Bulldozer), so you'll hardly get any significant speed improvement over 'brute force' method.

What is "vectorization"?

Several times now, I've encountered this term in matlab, fortran ... some other ... but I've never found an explanation what does it mean, and what it does? So I'm asking here, what is vectorization, and what does it mean for example, that "a loop is vectorized" ?

Many CPUs have "vector" or "SIMD" instruction sets which apply the same operation simultaneously to two, four, or more pieces of data. Modern x86 chips have the SSE instructions, many PPC chips have the "Altivec" instructions, and even some ARM chips have a vector instruction set, called NEON.
"Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.
I chose 4 because it's what modern hardware is most likely to directly support for 32-bit floats or ints.
The difference between vectorization and loop unrolling:
Consider the following very simple loop that adds the elements of two arrays and stores the results to a third array.
for (int i=0; i<16; ++i)
C[i] = A[i] + B[i];
Unrolling this loop would transform it into something like this:
for (int i=0; i<16; i+=4) {
C[i] = A[i] + B[i];
C[i+1] = A[i+1] + B[i+1];
C[i+2] = A[i+2] + B[i+2];
C[i+3] = A[i+3] + B[i+3];
}
Vectorizing it, on the other hand, produces something like this:
for (int i=0; i<16; i+=4)
addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);
Where "addFourThingsAtOnceAndStoreResult" is a placeholder for whatever intrinsic(s) your compiler uses to specify vector instructions.
Terminology:
Note that most modern ahead-of-time compilers are able to auto vectorize very simple loops like this, which can often be enabled via a compile option (on by default with full optimization in modern C and C++ compilers, like gcc -O3 -march=native). OpenMP #pragma omp simd is sometimes helpful to hint the compiler, especially for "reduction" loops like summing an FP array where vectorization requires pretending that FP math is associative.
More complex algorithms still require help from the programmer to generate good vector code; we call this manual vectorization, often with intrinsics like x86 _mm_add_ps that map to a single machine instruction as in SIMD prefix sum on Intel cpu or How to count character occurrences using SIMD. Or even use SIMD for short non-looping problems like Most insanely fastest way to convert 9 char digits into an int or unsigned int or How to convert a binary integer number to a hex string?
The term "vectorization" is also used to describe a higher level software transformation where you might just abstract away the loop altogether and just describe operating on arrays instead of the elements that comprise them. e.g. writing C = A + B in some language that allows that when those are arrays or matrices, unlike C or C++. In lower-level languages like that, you could describe calling BLAS or Eigen library functions instead of manually writing loops as a vectorized programming style. Some other answers on this question focus on that meaning of vectorization, and higher-level languages.

Vectorization is the term for converting a scalar program to a vector program. Vectorized programs can run multiple operations from a single instruction, whereas scalar can only operate on pairs of operands at once.
From wikipedia:
Scalar approach:
for (i = 0; i < 1024; i++)
{
C[i] = A[i]*B[i];
}
Vectorized approach:
for (i = 0; i < 1024; i+=4)
{
C[i:i+3] = A[i:i+3]*B[i:i+3];
}

Vectorization is used greatly in scientific computing where huge chunks of data needs to be processed efficiently.
In real programming application , i know it's used in NUMPY(not sure of other else).
Numpy (package for scientific computing in python) , uses vectorization for speedy manipulation of n-dimensional array ,which generally is slower if done with in-built python options for handling arrays.
although tons of explanation are out there , HERE'S WHAT VECTORIZATION IS DEFINED AS IN NUMPY DOCUMENTATION PAGE
Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:
vectorized code is more concise and easier to read
fewer lines of code generally means fewer bugs
the code more closely resembles standard mathematical notation
(making it easier, typically, to correctly code mathematical
constructs)
vectorization results in more “Pythonic” code. Without
vectorization, our code would be littered with inefficient and
difficult to read for loops.

Vectorization, in simple words, means optimizing the algorithm so that it can utilize SIMD instructions in the processors.
AVX, AVX2 and AVX512 are the instruction sets (intel) that perform same operation on multiple data in one instruction. for eg. AVX512 means you can operate on 16 integer values(4 bytes) at a time. What that means is that if you have vector of 16 integers and you want to double that value in each integers and then add 10 to it. You can either load values on to general register [a,b,c] 16 times and perform same operation or you can perform same operation by loading all 16 values on to SIMD registers [xmm,ymm] and perform the operation once. This lets speed up the computation of vector data.
In vectorization we use this to our advantage, by remodelling our data so that we can perform SIMD operations on it and speed up the program.
Only problem with vectorization is handling conditions. Because conditions branch the flow of execution. This can be handled by masking. By modelling the condition into an arithmetic operation. eg. if we want to add 10 to value if it is greater then 100. we can either.
if(x[i] > 100) x[i] += 10; // this will branch execution flow.
or we can model the condition into arithmetic operation creating a condition vector c,
c[i] = x[i] > 100; // storing the condition on masking vector
x[i] = x[i] + (c[i] & 10) // using mask
this is very trivial example though... thus, c is our masking vector which we use to perform binary operation based on its value. This avoid branching of execution flow and enables vectorization.
Vectorization is as important as Parallelization. Thus, we should make use of it as much possible. All modern days processors have SIMD instructions for heavy compute workloads. We can optimize our code to use these SIMD instructions using vectorization, this is similar to parrallelizing our code to run on multiple cores available on modern processors.
I would like to leave with the mention of OpenMP, which lets yo vectorize the code using pragmas. I consider it as a good starting point. Same can be said for OpenACC.

It refers to a the ability to do single mathematical operation on a list -- or "vector" -- of numbers in a single step. You see it often with Fortran because that's associated with scientific computing, which is associated with supercomputing, where vectorized arithmetic first appeared. Nowadays almost all desktop CPUs offer some form of vectorized arithmetic, through technologies like Intel's SSE. GPUs also offer a form of vectorized arithmetic.

By Intel people I think is easy to grasp.
Vectorization is the process of converting an algorithm from operating
on a single value at a time to operating on a set of values at one
time. Modern CPUs provide direct support for vector operations where a
single instruction is applied to multiple data (SIMD).
For example, a CPU with a 512 bit register could hold 16 32- bit
single precision doubles and do a single calculation.
16 times faster than executing a single instruction at a time. Combine
this with threading and multi-core CPUs leads to orders of magnitude
performance gains.
Link https://software.intel.com/en-us/articles/vectorization-a-key-tool-to-improve-performance-on-modern-cpus
In Java there is a option to this be included in JDK 15 of 2020 or late at JDK 16 at 2021. See this official issue.

hope you are well!
vectorization refers to all the techniques that convert scaler implementation, in which a single operation processes a single entity at a time to vector implementation in which a single operation processes multiple entities at the same time.
Vectorization refers to a technique with the help of which we optimize the code to work with huge chunks of data efficiently. application of vectorization seen in scientific applications like NumPy, pandas also you can use this technique while working with Matlab, image processing, NLP, and much more. Overall it optimizes the runtime and memory allocation of the program.
Hope you may get your answer!
Thank you. 🙂

I would define vectorisation a feature of a given language where the responsibility on how to iterate over the elements of a certain collection can be delegated from the programmer (e.g. explicit loop of the elements) to some method provided by the language (e.g. implicit loop).
Now, why do we ever want to do that ?
Code readeability. For some (but not all!) cases operating over the entire collection at once rather than to its elements is easier to read and quicker to code;
Some interpreted languages (R, Python, Matlab.. but not Julia for example) are really slow in processing explicit loops. In these cases vectorisation uses under the hood compiled instructions for these "element order processing" and can be several orders of magnitude faster than processing each programmer-specified loop operation;
Most modern CPUs (and, nowadays, GPUs) have build-in parallelization that is exploitable when we use the vectorisation method provided by the language rather than our self-implemented order of operations of the elements;
In a similar way our programming language of choice will likely use for some vectorisation operations (e.g. matrix operations) software libraries (e.g. BLAS/LAPACK) that exploit multi-threading capabilities of the CPU, another form of parallel computation.
Note that for points 3 and 4 some languages (Julia notably) allow these hardware parallelizations to be exploited also using programmer-defined order processing (e.g. for loops), but this happens automatically and under the hood when using the vectorisation method provided by the language.
Now, while vectorisation has many advantages, sometimes an algorithm is more intuitively expressed using an explicit loop than vectorisation (where perhaps we need to resort to complex linear algebra operations, identity and diagonal matrices... all to retain our "vectorised" approach), and if using an explicit ordering form has no computational disadvantages, this one should be preferred.

See the two answers above. I just wanted to add that the reason for wanting to do vectorization is that these operations can easily be performed in paraell by supercomputers and multi-processors, yielding a big performance gain. On single processor computers there will be no performance gain.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart