What is the use of bit manipulation? - image-processing

I searched for a clear explanation with some examples close to real life.
What is bit manipulation?
Why do we need to use bit manipulation?
As far as I know, we can use bit manipulation in image processing. Can anyone show me a simple problem that can be solved using bit manipulation?
I read about bit manipulation from some links:
Link 1
Link 2
In Link 2, data compression is done using bit packing. Is there any difference between bit manipulation and bit packing?
I would appreciate it if anyone could explain with a very simple example that resembles a real-life problem.

What is bit manipulation?
Bit manipulation usually refers to changing data using bit operators.
I think Wikipedia explains it well enough, so I won't write another article.
https://en.wikipedia.org/wiki/Bit_manipulation
Bit manipulation is the act of algorithmically manipulating bits or
other pieces of data shorter than a word. Computer programming tasks
that require bit manipulation include low-level device control, error
detection and correction algorithms, data compression, encryption
algorithms, and optimization. For most other tasks, modern programming
languages allow the programmer to work directly with abstractions
instead of bits that represent those abstractions. Source code that
does bit manipulation makes use of the bitwise operations: AND, OR,
XOR, NOT, and possibly other operations analogous to the boolean
operators; there are also bit shifts and operations to count ones and
zeros, find high and low one or zero, set, reset and test bits,
extract and insert fields, mask and zero fields, gather and scatter
bits to and from specified bit positions or fields. Integer arithmetic
operators can also effect bit-operations in conjunction with the other
operators.
Bit manipulation, in some cases, can obviate or reduce the need to
loop over a data structure and can give many-fold speed ups, as bit
manipulations are processed in parallel.
Why we need to use bit manipulation?
Because it is fast and often we don't have another choice. For example, in microcontrollers pretty much everything is controlled by manipulating the bits of 8-bit registers. So an output goes high if you set a certain bit to 1.
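For illustration, a minimal C++ sketch of that idea; the register and bit names are made up, on real hardware you would use the memory-mapped register names your device header defines.

#include <cstdint>

volatile std::uint8_t PORT_REG = 0;   // stand-in for a memory-mapped 8-bit output register
constexpr unsigned LED_BIT = 3;       // hypothetical pin number

void led_on()    { PORT_REG = PORT_REG |  (1u << LED_BIT); }  // set bit 3 -> pin goes high
void led_off()   { PORT_REG = PORT_REG & ~(1u << LED_BIT); }  // clear bit 3 -> pin goes low
bool led_is_on() { return (PORT_REG >> LED_BIT) & 1u; }       // test bit 3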
Bit packing is a compression technique that tries to minimize the number of bits necessary to represent a number. While you'll use bit operators to implement it, it is not the same as bit manipulation. It's just one of many, many use cases for bit manipulation.
Can anyone show me a simple problem which can be solved using bit manipulation?
Let's say you have an RGB tuple rgb = 0xa1fc03 and you want to make the green channel 0.
rgb_without_green = rgb & 0xFF00FF
We've bitwise ANDed the value with 0xFF00FF.
Now rgb_without_green is 0xa10003.
Basically any operation boils down to bit manipulation. For most of them you just have convenient solutions. Say instead of 0b00000011 << 0b00000101 you write 3 * 32.
Or have a look at this question, where the addition of two integers is implemented using bit operations: Add two integers using only bitwise operators?
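For illustration, here is a minimal C++ sketch of the same idea (the values and variable names are just examples, not from the question): masking and extracting the colour channels of a 0xRRGGBB value with AND, OR and shifts.

#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t rgb = 0xA1FC03;

    std::uint32_t without_green = rgb & 0xFF00FF;      // zero the green channel -> 0xA10003
    std::uint32_t green         = (rgb >> 8) & 0xFF;   // extract the green channel -> 0xFC
    std::uint32_t half_green    = (rgb & 0xFF00FF) | ((green / 2) << 8);  // put back a halved green

    std::printf("%06X %02X %06X\n", without_green, green, half_green);    // prints A10003 FC A17E03
    return 0;
}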
Edit due to comment
How does the bitwise AND operation between 0xa1fc03 and 0xFF00FF give 0xa10003? I just need to see how to do this calculation.
Bitwise AND means that you AND all the bits of both numbers.
1 AND 1 -> 1
0 AND 1 -> 0
1 AND 0 -> 0
0 AND 0 -> 0
So
0xa1fc03 -> 0b101000011111110000000011
0xff00ff -> 0b111111110000000011111111
AND -> 0b101000010000000000000011
0b101000010000000000000011 -> 0xa10003
With a bit more experience you know that 0xFF is 0b11111111, so you instantly know that 0xa1fc03 AND 0xff00ff is 0xa10003, because you keep everything that is masked with FF and set to 0 everything that is masked with 00.
There are countless resources available. You should not have to ask me how to bitwise AND two numbers. Please do your own research.

I am busy designing a new barcode symbology for real-life applications. It uses a checksum value, which is computed on slices of k bits of large numbers. Hence intense bit manipulation.

Related

Sending Float/Double Values Across CANbus

I am preparing code that sends various electrical parameters across a CANbus (power, voltage, current, etc.). Precision is necessary in this circumstance.
There are two options I see:
Send the value as an integer, but scale it on the sending and receiving side (sending x100, receiving /100, for example). If I do this, then I can send 2.12 V as 212 (or 0xD4).
Send it as a float value, which would require 32 bits but no scaling.
My questions are as follows:
Can float values be sent across CANbus?
If yes to Q1, is that a common practice? Or do most CANbus communication programmers use scaled integers?
Thank you in advance.
First of all, please read this answer since it answers all the floating point concerns:
Does the "Avoid using floating-point" rule of thumb apply to a microcontroller with a floating point unit (FPU)?
Can float values be sent across CANbus?
Yes, though like any multi-byte value they depend on endianness, so you will have to specify a network endianness for floating point in your CAN application layer. Sending 4 or 8 bytes in a single frame is obviously not a problem.
If yes to Q1, is that a common practice? Or do most CANbus communication programmers use scaled integers?
It is not common practice; raw, unscaled integer data is by far the most common and convenient. For example, if you wish to express a voltage between 0 and 5 V, you could send a raw data value between 0 and 65535 and then scale it to mean volts only when you actually need that unit (for example, when showing the voltage to humans on a display).
The raw data form also has the advantage of being technology forwards/backwards compatible. Suppose you read voltage with a 10-bit ADC. Your values will be in the range 0 to 1023. You can express that as 16-bit raw data by multiplying by (65536/1024) = 64. Some years later you switch to a 12-bit ADC with values up to 4095. You can then keep the very same CAN protocol and format, just with increased resolution (smaller "steps" of raw data).
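A minimal C++ sketch of that raw-data convention, purely for illustration: the 0..5 V range, the x64 widening and the function names are assumptions, not part of any CAN standard.

#include <cstdint>
#include <cstdio>

// Sender side: widen a hypothetical 10-bit ADC reading (0..1023) to 16-bit raw data.
std::uint16_t adc_to_raw(std::uint16_t adc10) {
    return static_cast<std::uint16_t>(adc10 * 64u);   // 65536 / 1024 = 64
}

// Receiver side: convert raw data to volts only where the human-readable unit is needed.
double raw_to_volts(std::uint16_t raw) {
    return raw * (5.0 / 65535.0);                     // assumed 0..5 V full scale
}

int main() {
    std::uint16_t raw = adc_to_raw(434);              // e.g. ADC code 434
    std::printf("raw = %u, volts = %.3f\n", static_cast<unsigned>(raw), raw_to_volts(raw));
}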

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as a structure of arrays. What is the best practice for loading the data into registers?
__m128 _mm_set_ps (float e3, float e2, float e1, float e0)
// or
__m128 _mm_loadu_ps (float const* mem_addr)
With _mm_loadu_ps, I'd copy the data into a temporary stack array, vs. copying the data as values directly. Is there a difference?
It can be a tradeoff between latency and throughput, because separate stores into an array will cause a store-forwarding stall when you do a vector load. So it's high latency, but throughput could still be ok, and it doesn't compete with surrounding code for the vector shuffle execution unit. So it can be a throughput win if the surrounding code also has shuffle operations, vs. 3 shuffles to insert 3 elements into an XMM register after a scalar load of the first one. Either way it's still a lot of total uops, and that's another throughput bottleneck.
Most compilers like gcc and clang do a pretty good job with _mm_set_ps () when optimizing with -O3, whether the inputs are in memory or registers. I'd recommend it, except in some special cases.
The most common missed-optimization with _mm_set is when there's some locality between the inputs. e.g. don't do _mm_set_ps(a[i+2], a[i+3], a[i+0], a[i+1]), because many compilers will use their regular pattern without taking advantage of the fact that 2 pairs of elements are contiguous in memory. In that case, use (the intrinsics for) movsd and movhps to load in two 64-bit chunks. (Not movlps: it merges into an existing register instead of zeroing the high elements, so it has a false dependency on the old contents while movsd zeros the high half.) Or a shufps if some reordering is needed between or within the 64-bit chunks.
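A sketch of that pair-load pattern with intrinsics, assuming a points to floats and a[i+0..i+3] are readable; the helper name load_two_pairs is made up, and the double* cast is the conventional idiom compilers accept for a movsd load.

#include <immintrin.h>

static inline __m128 load_two_pairs(const float* a, int i) {
    __m128 lo = _mm_castpd_ps(_mm_load_sd((const double*)(a + i)));  // movsd: a[i+0], a[i+1], upper half zeroed
    __m128 v  = _mm_loadh_pi(lo, (const __m64*)(a + i + 2));         // movhps: a[i+2], a[i+3] into the high half
    // If the order has to change, a single shufps can fix it up, e.g. swap within each pair:
    // v = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    return v;
}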
The "regular pattern" that compilers use will usually be movss / insertps from memory if compiling with SSE4, or movss loads and unpcklps shuffles to combine pairs and then another unpcklps, unpcklpd, or movlhps to shuffle into one register. Or a shufps or shufpd if the compiler likes to waste code-side on immediate shuffle-control operands instead of using fixed shuffles intelligently.
See also Agner Fog's optimization guides for some handy tables of data-movement instructions to get a better idea of what the compiler has to work with, and how stuff performs. Note that Haswell and later can only do 1 shuffle per clock. Also other links in the x86 tag wiki.
There's no really cheap way for a compiler or human to do this, in the general case when you have 4 separate scalars that aren't contiguous in memory at all. Or for register inputs, where it can't optimize the way they're generated in registers in the first place to have some of them already packed together. (e.g. for function args passed in registers to a function that can't / doesn't inline.)
Anyway, it's not a big deal unless you have this inside an inner loop. In that case, definitely worry about it (and check the compiler's asm output to see if it made a mess or could do better if you program the gather yourself with intrinsics that map to single instructions like _mm_load_ss / _mm_shuffle_ps).
If possible, rearrange your data layout to make data contiguous in at least small chunks / stripes. (See https://stackoverflow.com/tags/sse/info, specifically these slides.) But sometimes one part of the program needs the data one way, and the other needs another. Choose the layout that's good for the case that needs to be faster, or that runs more often, or whatever, and suck it up and do the best you can for the other part of the program. :P Possibly transpose / convert once to set up for multiple SIMD operations, but extra passes over data with no computation just suck up time and can hurt your computational intensity (how much ALU work you do for each time you load data into registers) more than they help.
And BTW, actual gather instructions (like AVX2 vgatherdps) are not very fast; even on Skylake it's probably not worth using a gather instruction for four 32-bit elements at known locations. On Broadwell / Haswell, gather is definitely not worth using for this.

Largest amount of entries in lua table

I am trying to build a Sieve of Eratosthenes in Lua. I tried several things, but I see myself confronted with the following problem:
Lua's tables are too small for this scenario. If I just want to create a table with all the numbers (see example below), the table is too "small" even with only 1/8 (...) of the numbers (the number is pretty big, I admit)...
max = 600851475143
numbers = {}
for i = 1, max do
    table.insert(numbers, i)
end
If I execute this script on my Windows machine there is an error message saying: C:\Program Files (x86)\Lua\5.1\lua.exe: not enough memory. I tried it with Lua 5.3 running on my Linux machine too; there the process was simply killed. So it is pretty obvious that Lua can't handle that number of entries.
I don't really know whether it is just impossible to store that many entries in a Lua table or whether there is a simple solution for this (I tried using a long string as well...). And what exactly is the largest number of entries a Lua table can hold?
Update: Would it be possible to somehow manually allocate more memory for the table?
Update 2 (solution for the second question): The second question is an easy one; I just tested it by increasing the count until the program broke: 33,554,432 (2^25) entries fit in one one-dimensional table on my 12 GB RAM system. Why 2^25? Because 64 bits per number * 2^25 = 2,147,483,648 bits (256 MB of raw number data); with Lua's per-entry overhead and table growth, this is apparently where the 2 GB limit of the 32-bit Lua for Windows build is reached.
P.S. You may have noticed that this number is from Project Euler Problem 3. Yes, I am trying to solve that. Please don't give specific hints (..). Thank you :)
The Sieve of Eratosthenes only requires one bit per number, representing whether the number has been marked non-prime or not.
One way to reduce memory usage would be to use bitwise math to represent multiple bits in each table entry. Current Lua implementations have intrinsic support for bitwise-or, -and etc. Depending on the underlying implementation, you should be able to represent 32 or 64 bits (number flags) per table entry.
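For illustration, here is a minimal sketch of that packing idea, written in C++ for brevity (the struct name is made up; the same shifts and masks work with Lua 5.3's integer bitwise operators): the flag for number n lives in word n // 64, bit n % 64.

#include <cstdint>
#include <vector>

struct BitFlags {
    std::vector<std::uint64_t> words;
    explicit BitFlags(std::uint64_t n) : words(n / 64 + 1, 0) {}       // one bit per number 0..n
    void set(std::uint64_t n)        { words[n >> 6] |= (1ULL << (n & 63)); }
    bool test(std::uint64_t n) const { return (words[n >> 6] >> (n & 63)) & 1; }
};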
Another option would be to use one or more very long strings instead of a table. You only need a linear array, which is really what a string is. Just have a long string with "t" or "f", or "0" or "1", at every position.
Caveat: String manipulation in Lua always involves duplication, which rapidly turns into n² or worse complexity in terms of performance. You wouldn't want one continuous string for the whole massive sequence, but you could probably break it up into blocks of a thousand, or of some power of 2. That would reduce your memory usage to 1 byte per number while minimizing the overhead.
Edit: After noticing a point made elsewhere, I realized your maximum number is so large that, even with a bit per number, your memory requirements would optimally be about 73 gigabytes, which is extremely impractical. I would recommend following the advice Piglet gave in their answer, to look at Jon Sorenson's version of the sieve, which works on segments of the space instead of the whole thing.
I'll leave my suggestion, as it still might be useful for Sorenson's sieve, but yeah, you have a bigger problem than you realize.
Lua uses double-precision floats to represent numbers. That's 64 bits per number.
600,851,475,143 numbers result in almost 4.5 terabytes of memory.
So it's not Lua's or its tables' fault. The error message even says
not enough memory
You just don't have enough RAM to allocate that much.
If you had read the linked Wikipedia article carefully, you would have found the following section:
As Sorenson notes, the problem with the sieve of Eratosthenes is not
the number of operations it performs but rather its memory
requirements.[8] For large n, the range of primes may not fit in
memory; worse, even for moderate n, its cache use is highly
suboptimal. The algorithm walks through the entire array A, exhibiting
almost no locality of reference.
A solution to these problems is offered by segmented sieves, where
only portions of the range are sieved at a time.[9] These have been
known since the 1970s, and work as follows
...

Fast way to swap endianness using opencl

I'm reading and writing lots of FITS and DNG images which may contain data of an endianness different from my platform and/or OpenCL device.
Currently I swap the byte order in the host's memory if necessary, which is very slow and requires an extra step.
Is there a fast way to pass a buffer of int/float/short having the wrong endianness to an OpenCL kernel?
Using an extra kernel run just for fixing the endianness would be OK; some overhead-free auto-fixing read/write operation would be perfect.
I know about the variable attribute ((endian(host/device))) but this doesn't help with a big-endian FITS file on a little-endian platform using a little-endian device.
I thought about a solution like this one (neither implemented nor tested, yet):
uchar4 mask = (uchar4) (3, 2, 1, 0);
uchar4 swappedEndianness = shuffle(originalEndianness, mask);
// to be applied on a float/int-buffer somehow
Hoping there's a better solution out there.
Thanks in advance,
runtimeterror
Sure. Since you have a uchar4 - you can simply swizzle the components and write them back.
output[tid] = input[tid].wzyx;
Swizzling is also very performant on SIMD architectures, with very little cost, so you should be able to combine it with other operations in your kernel.
Hope this helps!
Most processor architectures perform best when using instructions that operate on their full register width, for example 32/64 bits. When a CPU/GPU performs byte-wise operations, such as subscripting a uchar4 with .w/.x/.y/.z, it needs to use a mask to retrieve each byte from the integer, shift the byte, and then use an integer add or or to combine the results. For the endianness swap, the processor needs to perform that and, shift, and add/or four times, because there are four bytes.
The most efficient way is as follows:
#define EndianSwap(n) (rotate((n) & 0x00FF00FF, 24U) | (rotate((n), 8U) & 0x00FF00FF))
n can be any uint-based gentype, for example a uint4 variable. Because OpenCL C does not allow C++-style function overloading, a macro is the most convenient choice.
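As a quick sanity check of the bit pattern, the same mask-and-rotate trick can be written in ordinary C++ (std::rotl from C++20 plays the role of OpenCL's rotate; the function name is just for illustration):

#include <bit>        // std::rotl, C++20
#include <cstdint>
#include <cstdio>

std::uint32_t endian_swap(std::uint32_t n) {
    // keep bytes 0 and 2, rotate left by 24; rotate left by 8, keep bytes 0 and 2; OR the halves
    return std::rotl(n & 0x00FF00FFu, 24) | (std::rotl(n, 8) & 0x00FF00FFu);
}

int main() {
    std::printf("%08X\n", endian_swap(0xAABBCCDDu));   // prints DDCCBBAA
}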

The importance of using a 16bit integer

How seriously do developers think about using a 16bit integer when writing code? I've been using 32bit integers ever since I've been programming and I don't really think about using 16bit.
It's so easy to declare a 32-bit int because it's the default for most languages.
What's the upside of using a 16-bit integer apart from a little memory saved?
Now that we have cars, we don't walk or ride horses as much, but we still do walk and ride horses.
There is less need to use shorts these days. In a lot of situations the cost of disk space and availability of RAM mean that we no longer need to squeeze every last bit of storage out of computers as we did 20 years ago, so we can sacrifice a bit of storage efficiency in order to save on development/maintenance costs.
However, where large amounts of data are used, or we are working with systems with small memories (e.g. embedded controllers) or when we are transmitting data over networks, using 32 or 64 bits to represent a 16-bit value is just a waste of memory/bandwidth. It doesn't matter how much memory you have, wasting half or three quarters of it would just be stupid.
Other examples are APIs/interfaces (e.g. TCP/IP port numbers) and algorithms that require manipulation (e.g. rotation) of 16-bit values.
I was interested in the relative performance so I wrote this small test program to perform a very simple test of the speed of allocating, using, and freeing a significant amount of data in both int and short format.
I run the tests several times in case caching and so on are affected.
#include <iostream>
#include <windows.h>

using namespace std;

const int DATASIZE = 1000000;

template <typename DataType>
long long testCount()
{
    long long t1, t2;
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
    DataType* data = new DataType[DATASIZE];
    for (int i = 0; i < DATASIZE; i++) {
        data[i] = 0;
    }
    delete[] data;
    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
    return t2 - t1;
}

int main()
{
    cout << "Test using short : " << testCount<short>() << " ticks.\n";
    cout << "Test using int : " << testCount<int>() << " ticks.\n";
    cout << "Test using short : " << testCount<short>() << " ticks.\n";
    cout << "Test using int : " << testCount<int>() << " ticks.\n";
    cout << "Test using short : " << testCount<short>() << " ticks.\n";
    cout << "Test using int : " << testCount<int>() << " ticks.\n";
    cout << "Test using short : " << testCount<short>() << " ticks.\n";
}
And here are the results on my system (a 64-bit quad-core system running Windows 7 64-bit, but the program is a 32-bit program built using VC++ Express 2010 Beta in release mode):
Test using short : 3672 ticks.
Test using int : 7903 ticks.
Test using short : 4321 ticks.
Test using int : 7936 ticks.
Test using short : 3697 ticks.
Test using int : 7701 ticks.
Test using short : 4222 ticks.
This seems to show that there are significant performance advantages at least in some cases to using short instead of int when there is a large amount of data. I realise that this is far from being a comprehensive test, but it's some evidence that not only do they use less space but they can be faster to process too at least in some applications.
When there are memory constraints, short can help a lot, e.g. while coding for embedded systems you need to consider memory.
16-bit values are still in great demand (though unsigned would do - don't really need signed).
For example,
16 bit Unicode - UTF-16/UCS-2.
16 bit graphics - especially for embedded devices.
16 bit checksums - for UDP headers and similar.
16 bit devices - e.g. many NOR flash devices are 16 bit.
You might need to wrap at 65535.
You might need to work with a message sent from a device which includes fields which are 16 bit. Using 32 bit integers in this case would cause you to be accessing bits at the wrong offset in the message.
You might be working on an embedded 16 bit micro, or an embedded 8 bit micro. Hint: not all processors are x86, 32 bit.
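A small illustrative sketch of the message-field point above (the layout and names are hypothetical, and it assumes the device and host agree on byte order): if a field is 16 bits wide on the wire, you must read exactly 16 bits there, or every later field ends up at the wrong offset.

#include <cstdint>
#include <cstring>

// Hypothetical device message: bytes [0..1] status, [2..3] raw temperature, [4..7] timestamp.
struct DeviceMessage {
    std::uint16_t status;
    std::uint16_t temperature_raw;
    std::uint32_t timestamp;
};

DeviceMessage parse(const std::uint8_t* bytes) {
    DeviceMessage m;
    std::memcpy(&m.status,          bytes + 0, 2);   // exactly 16 bits
    std::memcpy(&m.temperature_raw, bytes + 2, 2);   // a 32-bit read here would swallow the next field
    std::memcpy(&m.timestamp,       bytes + 4, 4);
    return m;
}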
This is really important in database development, because sometimes people use a lot more space than is really needed (e.g. using int when a smaller type would have been sufficient). When you have tables with millions of rows this can be an important factor in e.g. database size and query performance. I would recommend always using the appropriate datatype for columns.
I also try to use the correct datatype for other development. I know it can be a pain dealing with long and small types (it's pretty convenient to have everything int) but I think it pays off in the end, for example when serializing objects.
you ask: Any good reason to keep them around?
Since you say 'language-agnostic' the answer is a 'certainly yes'.
The computer CPU still works with bytes, words, full registers and whatnot, no matter how much these 'data types' are abstracted by some programming languages. There will always be situations where the code needs to 'touch the metal'.
It's hardly a little memory saved [read: 50%] when you allocate memory for a large number of numeric values. Common uses are:
COM and external device interop
Reducing memory consumption for large arrays where each number will never exceed a couple thousands in magnitude
Unique hashes for pairs of objects, where no more than ~65K objects are needed: hash values can only be 32-bit ints, and hash table types transform the value for their internal representation, so collisions are still possible, but equality can be based on exact hash matches (see the sketch after this list)
Speed up algorithms that rely on structs (smaller sized value types translates to increased performance when they are copied around in memory)
In large arrays, "little memory saved" could instead be "much memory saved".
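To make the pair-hash item above concrete, here is a tiny sketch (names are made up): two 16-bit identifiers packed into one 32-bit value with a shift and an OR, so distinct pairs map to distinct hashes and equality can compare the packed value directly.

#include <cstdint>

std::uint32_t pair_hash(std::uint16_t a, std::uint16_t b) {
    return (static_cast<std::uint32_t>(a) << 16) | b;   // high 16 bits: a, low 16 bits: b
}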
The use of 16 bit integers is primarily for when you need to encode things for transmission over a network, for saving on hard disk, etc. without using up any more space than necessary. It might also occasionally be useful to save memory if you have a very large array of integers, or a lot of objects that contain integers.
Use of 16 bit integers without there being a good memory saving reason is pretty pointless. And 16 bit local variables are most often silently implemented with 32 or 64 bit integers anyway.
You have probably been using a 16-bit datatype more often than you knew. The char datatype in both C# and Java is 16 bits, and Unicode text is typically stored in a 16-bit datatype.
The question should really be why we need a 16-bit primitive data type, and the answer would be that there is an awful lot of data out there which is naturally represented in 16 bits. One ubiquitous example is audio, e.g. CD audio is represented as streams of 16 bit signed integers.
16 bits is still plenty big enough to hold pixel channel values (e.g. R, G, or B). Most pixels only use 8 bits to store a channel, but Photoshop has a 16-bit mode that professionals use.
In other words, a pixel might be defined as struct Pixel16 { short R, G, B, A; } or an image might be defined as separate channels of struct Channel16 { short channel[]; }
I think most people use the default int on their platform. However there are times when you have to communicate with older systems or libraries that are expecting 16 bit or even eight bit integers (thank god we don't have to worry about 12 bit integers any more). This is especially true for databases. Also, if you're doing bit masking or bit shifting, you might have an algorithm that specifies the length of the integer. By default, and on platforms where memory is cheap, you should probably use integers sized to your processor.
Those 2 bytes add up. Your data types eventually become part of arrays or databases or messages; they go into data files. It adds up to a lot of wasted space, and on embedded systems it can make a huge difference.
When we do peer review of our code at work, if something is sized incorrectly it will be written up as a discrepancy and must be corrected. If we find something that has a range of 1-1000 using an int32_t, it has to be corrected. The range must also be documented in a comment. Our department does not allow the use of int, long, etc.; we must use int32_t, int16_t, uint16_t, etc. so that the expected size is documented.
uint16_t conicAngle; // angle in tenths of a degree (range 0..3599)
or in Ada:
type Amplitude is range 0 .. 255; -- signal amplitude from FPGA
Get in the habit of using what you need and no more and documenting what you need (if the language doesn't support it).
We are currently in the process of fixing a performance problem by resizing the data types in several messages, they have 32 bit fields that could be 8 or 16 bit. By resizing them appropriately we can reduce the message rate in half and improve our data throughput to meet the requirements.
Once upon a time, in the land of Earth, there existed devices called computers.
In the early days following the invention of "computers," there was limited storage in memory for fancy things like numbers and strings.
Billy, a programmer, was encouraged by the evil Wizard (his boss) to use the least amount of memory that he could!
Then one day, memory sizes got large enough that everyone could use 32-bit numbers if they wanted!
I could continue on, but all the other obvious things were already covered.
