How to calculate CRC32 of two CRC32?

Imagine that we have the CRC32 (Cyclic Redundancy Check) values of two different messages. How can we calculate the CRC32 of their concatenation?
For example:
CRC32 of "hello" = 3610a686
CRC32 of "world" = 3a771143
CRC32 of "helloworld" = f9eb20ad

CRC32 uses an initial value of 0xFFFFFFFF, and post complements the CRC by xoring it with 0xFFFFFFFF (or using not). If you had a modified CRC32 that took these values as parameters, then the first call for "hello" would use initial value = 0xFFFFFFFF, xorout = 0x00000000, such as CRC = CRC32X(0xFFFFFFFF, 0x00000000, "hello", 5), where the 3rd parameter is a pointer to a string, and the 4th parameter the number of bytes in the string. The second call would be CRC32X(CRC, 0xFFFFFFFF, "world", 5), where CRC is the value returned by the first call.
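As a minimal sketch of that idea (crc32x is a hypothetical name mirroring the CRC32X above, not a function from any library; the bitwise loop is just the simplest way to write the standard reflected CRC-32):

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320) with a caller-supplied
 * initial value and final XOR. Hypothetical helper, not a library API. */
uint32_t crc32x(uint32_t init, uint32_t xorout, const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = init;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc ^ xorout;
}

/* CRC-32 of "hello" followed by "world", computed in two calls:
 * the first call skips the final XOR, the second starts from its result. */
uint32_t crc32_hello_world(void)
{
    uint32_t running = crc32x(0xFFFFFFFFu, 0x00000000u, "hello", 5);
    return crc32x(running, 0xFFFFFFFFu, "world", 5);
}
```

Note that this only works when the second call sees the actual bytes of "world"; it does not combine two finished CRCs.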

Combining the separately calculated CRCs of two (or more) blocks in order to get the CRC for the concatenation of these blocks is possible, but it is not trivial and involves quite a bit of linear algebra.
zlib contains a function crc32_combine() that does the work for you, and many reasonably good (and reasonably current) libraries offer similar functionality. Mark Adler (yes, the Mark Adler of the eponymous checksum) has posted a good explanation in the topic CRC Calculation Of A Mostly Static Data Stream.
The Intel document Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction explains the process in gory detail.
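To make the idea concrete without the matrix machinery, here is a deliberately naive sketch. It relies on the identity crc(A||B) = crc(A followed by len(B) zero bytes) xor crc(len(B) zero bytes) xor crc(B), which holds for the standard CRC-32. It is not how zlib implements crc32_combine() (zlib gets the same result in logarithmic time); the function names below are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Running CRC-32 in the zlib calling style: pass 0 for the first block,
 * or a previous result to continue over more data. */
uint32_t crc32_update(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}

/* Continue a running CRC over n zero bytes. */
static uint32_t crc32_zeros(uint32_t crc, size_t n)
{
    const unsigned char zero = 0;
    while (n--)
        crc = crc32_update(crc, &zero, 1);
    return crc;
}

/* Naive O(len_b) combine: needs only crc(A), crc(B) and the length of B. */
uint32_t crc32_combine_naive(uint32_t crc_a, uint32_t crc_b, size_t len_b)
{
    return crc32_zeros(crc_a, len_b) ^ crc32_zeros(0, len_b) ^ crc_b;
}
```

For real use, prefer the library routine; this version costs as much as running the CRC over B again and is only meant to show that no access to the original data is required.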

Related

Lua dissection functions definition

This code is part of a Lua dissection script. Could you explain the meaning of this code please, especially the functions add_le and le_uint? Thanks.
-- Function: Upload functions request
function upload_function_req(buffer, subtree)
    subtree:add_le(buffer(14,2), "func_id:", buffer(14,2):le_uint())
    subtree:add_le(buffer(16,4), "fixed_values:", buffer(16,4):le_uint())
    subtree:add_le(buffer(20,2), "offset:", buffer(20,2):le_uint())
end

The function adds 3 fields to the protocol tree. The buffer(n,m) is a tvbrange, with n indicating the offset into the buffer and m indicating the length. All 3 fields are unsigned integers in little-endian format. The 1st and 3rd fields are 2-byte integers; the 2nd is a 4 byte integer. The function does some unnecessary work though and could be simplified like so:
function upload_function_req(buffer, subtree)
    subtree:add_le(buffer(14,2), "func_id:")
    subtree:add_le(buffer(16,4), "fixed_values:")
    subtree:add_le(buffer(20,2), "offset:")
end
If you want to learn more about the Lua API in Wireshark, you should have a look at the Wireshark Developer's Guide. Under Chapter 11. Wireshark's Lua API Reference Manual, you will find the relevant sub-chapters.
In particular:
The treeitem:add_le() is described in 11.7.1.3 treeitem:add_le([protofield], [tvbrange], [value],[label]).
The tvbrange:le_uint() is described in 11.8.3.3 tvbrange:le_uint().

add_le is the same as add, except that "le" stands for little-endian. Network protocols mostly use big-endian byte order, so use add_le wherever a field is little-endian instead. The same goes for uint and le_uint: both take bytes from the buffer and combine them into an integer, in big-endian or little-endian order respectively.
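Outside of Wireshark, the byte-combining step can be sketched in a few lines of C. The function names are made up for illustration and are not part of the Wireshark API:

```c
#include <stddef.h>
#include <stdint.h>

/* Combine len bytes (1..4) starting at buf[offset], least significant
 * byte first (little-endian). */
uint32_t read_le_uint(const uint8_t *buf, size_t offset, size_t len)
{
    uint32_t value = 0;
    for (size_t i = 0; i < len; i++)
        value |= (uint32_t)buf[offset + i] << (8 * i);
    return value;
}

/* Same bytes, most significant byte first (big-endian, network order). */
uint32_t read_be_uint(const uint8_t *buf, size_t offset, size_t len)
{
    uint32_t value = 0;
    for (size_t i = 0; i < len; i++)
        value = (value << 8) | buf[offset + i];
    return value;
}
```

So the two bytes 0x34 0x12 read as 0x1234 in little-endian order and as 0x3412 in big-endian order.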

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as structure of arrays. What is the best practice for loading the data in registers?
__m128 _mm_set_ps (float e3, float e2, float e1, float e0)
// or
__m128 _mm_loadu_ps (float const* mem_addr)
With _mm_loadu_ps, I'd copy the data in a temporary stack array, vs. copying the data as values directly. Is there a difference?

It can be a tradeoff between latency and throughput, because separate stores into an array will cause a store-forwarding stall when you do a vector load. So it's high latency, but throughput could still be ok, and it doesn't compete with surrounding code for the vector shuffle execution unit. So it can be a throughput win if the surrounding code also has shuffle operations, vs. 3 shuffles to insert 3 elements into an XMM register after a scalar load of the first one. Either way it's still a lot of total uops, and that's another throughput bottleneck.
Most compilers like gcc and clang do a pretty good job with _mm_set_ps () when optimizing with -O3, whether the inputs are in memory or registers. I'd recommend it, except in some special cases.
The most common missed-optimization with _mm_set is when there's some locality between the inputs. e.g. don't do _mm_set_ps(a[i+2], a[i+3], a[i+0], a[i+1]), because many compilers will use their regular pattern without taking advantage of the fact that 2 pairs of elements are contiguous in memory. In that case, use (the intrinsics for) movsd and movhps to load in two 64-bit chunks. (Not movlps: it merges into an existing register instead of zeroing the high elements, so it has a false dependency on the old contents while movsd zeros the high half.) Or a shufps if some reordering is needed between or within the 64-bit chunks.
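A rough sketch of the two-chunk idea with SSE intrinsics follows. The function names are made up. I use _mm_loadl_pi on a zeroed register rather than _mm_load_sd, purely to avoid casting a float pointer to a double pointer; whether the compiler turns that into a single zeroing load is something to check in the asm output.

```c
#include <xmmintrin.h>

/* Build {lo[0], lo[1], hi[0], hi[1]} from two contiguous pairs with two
 * 64-bit loads instead of four scalar loads plus shuffles. */
__m128 load_two_pairs(const float *lo, const float *hi)
{
    __m128 v = _mm_loadl_pi(_mm_setzero_ps(), (const __m64 *)lo);
    return _mm_loadh_pi(v, (const __m64 *)hi);
}

/* Element-by-element equivalent; _mm_set_ps takes the highest element first. */
__m128 load_two_pairs_set(const float *lo, const float *hi)
{
    return _mm_set_ps(hi[1], hi[0], lo[1], lo[0]);
}

/* Same result as _mm_set_ps(a[2], a[3], a[0], a[1]): one unaligned load,
 * then one shufps that swaps the elements within each pair. */
__m128 swap_within_pairs(const float *a)
{
    __m128 v = _mm_loadu_ps(a);
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
}
```

If all four floats are contiguous, as in the last function, a single load plus at most one shuffle should beat anything built from scalar pieces.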
The "regular pattern" that compilers use will usually be movss / insertps from memory if compiling with SSE4, or movss loads and unpcklps shuffles to combine pairs and then another unpcklps, unpcklpd, or movlhps to shuffle into one register. Or a shufps or shufpd if the compiler likes to waste code-size on immediate shuffle-control operands instead of using fixed shuffles intelligently.
See also Agner Fog's optimization guides for some handy tables of data-movement instructions to get a better idea of what the compiler has to work with, and how stuff performs. Note that Haswell and later can only do 1 shuffle per clock. Also other links in the x86 tag wiki.
There's no really cheap way for a compiler or human to do this, in the general case when you have 4 separate scalars that aren't contiguous in memory at all. Or for register inputs, where it can't optimize the way they're generated in registers in the first place to have some of them already packed together. (e.g. for function args passed in registers to a function that can't / doesn't inline.)
Anyway, it's not a big deal unless you have this inside an inner loop. In that case, definitely worry about it (and check the compiler's asm output to see if it made a mess or could do better if you program the gather yourself with intrinsics that map to single instructions like _mm_load_ss / _mm_shuffle_ps).
If possible, rearrange your data layout to make data contiguous in at least small chunks / stripes. (See https://stackoverflow.com/tags/sse/info, specifically these slides.) But sometimes one part of the program needs the data one way, and the other needs another. Choose the layout that's good for the case that needs to be faster, or that runs more often, or whatever, and suck it up and do the best you can for the other part of the program. :P Possibly transpose / convert once to set up for multiple SIMD operations, but extra passes over data with no computation just suck up time and can hurt your computational intensity (how much ALU work you do for each time you load data into registers) more than they help.
And BTW, actual gather instructions (like AVX2 vgatherdps) are not very fast; even on Skylake it's probably not worth using a gather instruction for four 32-bit elements at known locations. On Broadwell / Haswell, gather is definitely not worth using for this.

The importance of using a 16bit integer

How seriously do developers think about using a 16bit integer when writing code? I've been using 32bit integers ever since I've been programming and I don't really think about using 16bit.
It's so easy to declare a 32bit int because it's the default for most languages.
What's the upside of using a 16bit integer apart from a little memory saved?

Now that we have cars, we don't walk or ride horses as much, but we still do walk and ride horses.
There is less need to use shorts these days. In a lot of situations the cost of disk space and availability of RAM mean that we no longer need to squeeze every last bit of storage out of computers as we did 20 years ago, so we can sacrifice a bit of storage efficiency in order to save on development/maintenance costs.
However, where large amounts of data are used, or we are working with systems with small memories (e.g. embedded controllers) or when we are transmitting data over networks, using 32 or 64 bits to represent a 16-bit value is just a waste of memory/bandwidth. It doesn't matter how much memory you have, wasting half or three quarters of it would just be stupid.

APIs/interfaces (e.g. TCP/IP port numbers) and algorithms that require manipulation (e.g. rotation) of 16-bit values.

I was interested in the relative performance so I wrote this small test program to perform a very simple test of the speed of allocating, using, and freeing a significant amount of data in both int and short format.
I run the tests several times in case caching and so on are affected.
#include <iostream>
#include <windows.h>
using namespace std;
const int DATASIZE = 1000000;
template <typename DataType>
long long testCount()
{
    long long t1, t2;
    QueryPerformanceCounter((LARGE_INTEGER*)&t1);
    DataType* data = new DataType[DATASIZE];
    for(int i = 0; i < DATASIZE; i++) {
        data[i] = 0;
    }
    delete[] data;
    QueryPerformanceCounter((LARGE_INTEGER*)&t2);
    return t2-t1;
}
int main()
{
    cout << "Test using short : " << testCount<short>() << " ticks.\n";
    cout << "Test using int : " << testCount<int>() << " ticks.\n";
    cout << "Test using short : " << testCount<short>() << " ticks.\n";
    cout << "Test using int : " << testCount<int>() << " ticks.\n";
    cout << "Test using short : " << testCount<short>() << " ticks.\n";
    cout << "Test using int : " << testCount<int>() << " ticks.\n";
    cout << "Test using short : " << testCount<short>() << " ticks.\n";
}
and here are the results on my system (64 bit quad core system running windows7 64 bit, but the program is a 32 bit program built using VC++ express 2010 beta in release mode)
Test using short : 3672 ticks.
Test using int : 7903 ticks.
Test using short : 4321 ticks.
Test using int : 7936 ticks.
Test using short : 3697 ticks.
Test using int : 7701 ticks.
Test using short : 4222 ticks.
This seems to show that there are significant performance advantages at least in some cases to using short instead of int when there is a large amount of data. I realise that this is far from being a comprehensive test, but it's some evidence that not only do they use less space but they can be faster to process too at least in some applications.

When there are memory constraints, short can help a lot. For example, when coding for embedded systems, you need to consider memory.

16-bit values are still in great demand (though unsigned would do - don't really need signed).
For example,
16 bit Unicode - UTF-16/UCS-2.
16 bit graphics - especially for embedded devices.
16 bit checksums - for UDP headers and similar.
16 bit devices - e.g. many NOR flash devices are 16 bit.

You might need to wrap at 65535.
You might need to work with a message sent from a device which includes fields which are 16 bit. Using 32 bit integers in this case would cause you to be accessing bits at the wrong offset in the message.
You might be working on an embedded 16 bit micro, or an embedded 8 bit micro. Hint: not all processors are x86, 32 bit.
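A small sketch of the first two points. The message layout, field names and little-endian wire order are assumptions made up for illustration:

```c
#include <stdint.h>

/* Sequence counter that wraps from 65535 back to 0 simply because the
 * type is 16 bits wide. */
uint16_t next_seq(uint16_t seq)
{
    return (uint16_t)(seq + 1u);
}

/* Hypothetical 6-byte device message: three consecutive 16-bit fields. */
struct device_msg {
    uint16_t func_id;
    uint16_t status;
    uint16_t offset;
};

/* Decode the fields at byte offsets 0, 2 and 4 (little-endian assumed).
 * Treating them as 32-bit values would read from offsets 0, 4 and 8. */
struct device_msg parse_msg(const uint8_t *raw)
{
    struct device_msg m;
    m.func_id = (uint16_t)(raw[0] | (raw[1] << 8));
    m.status  = (uint16_t)(raw[2] | (raw[3] << 8));
    m.offset  = (uint16_t)(raw[4] | (raw[5] << 8));
    return m;
}
```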

This is really important in database development, because people sometimes use a lot more space than is really needed (e.g. using int when a smaller type would have been sufficient). When you have tables with millions of rows, this can be an important factor in e.g. database size and queries. I would recommend always using the appropriate data type for columns.
I also try to use the correct data type in other development. I know it can be a pain dealing with long and small types (it's pretty convenient to have everything int), but I think it pays off in the end, for example when serializing objects.

You ask: Any good reason to keep them around?
Since you say 'language-agnostic' the answer is a 'certainly yes'.
The computer CPU still works with bytes, words, full registers and whatnot, no matter how much these 'data types' are abstracted by some programming languages. There will always be situations where the code needs to 'touch the metal'.

It's hardly a little memory saved [read: 50%] when you allocate memory for a large number of numeric values. Common uses are:
COM and external device interop
Reducing memory consumption for large arrays where each number will never exceed a couple of thousand in magnitude
Unique hashes for pairs of objects, where no more than ~65K objects are needed (hash values can only be 32-bit ints, but note that hash table types must transform the value for internal representations so collisions are still likely, but equality can be based on exact hash matches)
Speed up algorithms that rely on structs (smaller sized value types translates to increased performance when they are copied around in memory)

In large arrays, "little memory saved" could instead be "much memory saved".

The use of 16 bit integers is primarily for when you need to encode things for transmission over a network, for saving on hard disk, etc. without using up any more space than necessary. It might also occasionally be useful to save memory if you have a very large array of integers, or a lot of objects that contain integers.
Use of 16 bit integers without there being a good memory saving reason is pretty pointless. And 16 bit local variables are most often silently implemented with 32 or 64 bit integers anyway.

You have probably been using the 16 bit datatype more often than you knew. The char datatype in both C# and Java is 16 bit. Unicode is typically stored in a 16bit datatype.

The question should really be why we need a 16-bit primitive data type, and the answer would be that there is an awful lot of data out there which is naturally represented in 16 bits. One ubiquitous example is audio, e.g. CD audio is represented as streams of 16 bit signed integers.

16 bits is still plenty big enough to hold pixel channel values (e.g. R, G, or B). Most pixels only use 8 bits to store a channel, but Photoshop has a 16-bit mode that professionals use.
In other words, a pixel might be defined as struct Pixel16 { short R, G, B, A; } or an image might be defined as separate channels of struct Channel16 { short channel[]; }

I think most people use the default int on their platform. However there are times when you have to communicate with older systems or libraries that are expecting 16 bit or even eight bit integers (thank god we don't have to worry about 12 bit integers any more). This is especially true for databases. Also, if you're doing bit masking or bit shifting, you might have an algorithm that specifies the length of the integer. By default, and on platforms where memory is cheap, you should probably use integers sized to your processor.

Those 2 bytes add up. Your data types eventually become part of arrays, databases or messages, and they go into data files. It adds up to a lot of wasted space, and on embedded systems it can make a huge difference.
When we do peer review of our code at work, if something is sized incorrectly, it will be written as a discrepancy and must be corrected. If we find something that has a range of 1-1000 using an int32_t, it has to be corrected. The range must also be documented in a comment. Our department does not allow use of int, long, etc, we must use int32_t, int16_t, uint16_t, etc. so that the expected size is documented.
uint16_t conicAngle; // angle in tenths of a degree (range 0..3599)
or in Ada:
type Amplitude is range 0 .. 255; -- signal amplitude from FPGA
Get in the habit of using what you need and no more and documenting what you need (if the language doesn't support it).
We are currently in the process of fixing a performance problem by resizing the data types in several messages; they have 32 bit fields that could be 8 or 16 bit. By resizing them appropriately we can reduce the message rate in half and improve our data throughput to meet the requirements.
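To give a feel for the numbers, here is a sketch with a made-up message (the struct and field names are hypothetical and are not our actual messages; the sizes in the comments assume a typical ABI with no unusual padding):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message with every field stored as 32 bits: 16 bytes. */
struct msg_wide {
    uint32_t conic_angle;
    uint32_t flags;
    uint32_t amplitude;
    uint32_t channel;
};

/* The same fields sized to their documented ranges: 6 bytes. */
struct msg_sized {
    uint16_t conic_angle; /* tenths of a degree, range 0..3599 */
    uint16_t flags;       /* bit mask */
    uint8_t  amplitude;   /* range 0..255 */
    uint8_t  channel;     /* range 0..15 */
};

/* Bytes needed to hold count messages of each kind. */
size_t wide_bytes(size_t count)  { return count * sizeof(struct msg_wide); }
size_t sized_bytes(size_t count) { return count * sizeof(struct msg_sized); }
```

With this layout a thousand messages shrink from 16000 bytes to 6000, before any change to the code that processes them.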

Once upon a time, in the land of Earth, there existed devices called computers.
In the early days following the invention of "computers," there was limited storage in memory for fancy things like numbers and strings.
Billy, a programmer, was encouraged by the evil Wizard (his boss) to use the least amount of memory that he could!
Then one day, memory sizes got large enough that everyone could use 32-bit numbers if they wanted!
I could continue on, but all the other obvious things were already covered.

Reading from 16-bit hardware registers

On an embedded system we have a setup that allows us to read arbitrary data over a command-line interface for diagnostic purposes. For most data, this works fine: we use memcpy() to copy data at the requested address and send it back across a serial connection.
However, for 16-bit hardware registers, memcpy() causes some problems. If I try to access a 16-bit hardware register using two 8-bit accesses, the high-order byte doesn't read correctly.
Has anyone encountered this issue? I'm a 'high-level' (C#/Java/Python/Ruby) guy that's moving closer to the hardware and this is alien territory.
What's the best way to deal with this? I see some info, specifically, a somewhat confusing [to me] post here. The author of this post has exactly the same issue I do but I hate to implement a solution without fully understanding what I'm doing.
Any light you can shed on this issue is much appreciated. Thanks!

In addition to what Eddie said, you typically need to use a volatile pointer to read a hardware register (assuming a memory mapped register, which is not the case for all systems, but it sounds like is true for yours). Something like:
// using types from stdint.h to ensure particular size values
// most systems that access hardware registers will have typedefs
// for something similar (for 16-bit values might be uint16_t, INT16U,
// or something)
uint16_t volatile* pReg = (uint16_t volatile*) 0x1234abcd; // whatever the reg address is
uint16_t val = *pReg; // read the 16-bit wide register
Here's a series of articles by Dan Saks that should give you pretty much everything you need to know to be able to effectively use memory mapped registers in C/C++:
"Mapping memory"
"Mapping memory efficiently"
"More ways to map memory"
"Sizing and aligning device registers"
"Use volatile judiciously"
"Place volatile accurately"
"Volatile as a promise"

Each register in this hardware is exposed as a two-byte array, the first element is aligned at a two-byte boundary (its address is even). memcpy() runs a cycle and copies one byte at each iteration, so it copies from these registers this way (all loops unrolled, char is one byte):
*((char*)target) = *((char*)reg); // evenly aligned - address is always even
*((char*)target + 1) = *((char*)reg + 1); // oddly aligned - address is always odd
However the second line works incorrectly for some hardware specific reasons. If you copy two bytes at a time instead of one at a time, it is instead done this way (short int is two bytes):
*((short int*)target) = *((short int*)reg); // evenly aligned
Here you copy two bytes in one operation and the first byte is evenly aligned. Since there's no separate copying from an oddly aligned address, it works.
The modified memcpy checks whether the addresses are evenly aligned and copies in two-byte chunks if they are.

If you require access to hardware registers of a specific size, then you have two choices:
Understand how your C compiler generates code so you can use the appropriate integer type to access the memory, or
Embed some assembly to do the access with the correct byte or word size.
Reading hardware registers can have side effects, depending on the register and its function, of course, so it's important to access hardware registers with the proper sized access so you can read the entire register in one go.

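Usually it's sufficient to use an integer type that is the same size as your register. On most compilers, a short is 16 bits.
/* Copy bytecount bytes as 16-bit words. The volatile qualifiers ask the
   compiler to perform each 16-bit access as written. */
void wordcpy(volatile short *dest, const volatile short *src, size_t bytecount)
{
    size_t i;
    for (i = 0; i < bytecount / 2; ++i)
        *dest++ = *src++;
}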

I think all the detail is contained in that thread you posted, so I'll try and break it down a little.
Specifically:
If you access a 16-bit hardware register using two 8-bit
accesses, the high-order byte doesn't read correctly (it
always read as 0xFF for me). This is fair enough since
TI's docs state that 16-bit hardware registers must be
read and written using 16-bit-wide instructions, and
normally would be, unless you're using memcpy() to
read them.
So the problem here is that the hardware registers only report the correct value if their values are read in a single 16-bit read. This would be equivalent to doing:
uint16 value = *(regAddress);
This reads from the address into value using a single 16-bit read. On the other hand you have memcpy, which copies data a single byte at a time. Something like:
while (n--)
{
    *(uint8*)pDest++ = *(uint8*)pSource++;
}
So this causes the registers to be read 8 bits (1 byte) at a time, resulting in the values being invalid.
The solution posted in that thread is to use a version of memcpy that will copy the data using 16-bit reads wherever the source and destination are 16-bit aligned.
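A minimal sketch of such a routine follows. This is my own illustration, not the code from the linked thread, and the name copy16 is made up. It takes the 16-bit path only when both addresses and the length are even, and otherwise falls back to plain byte copies, which is fine for ordinary RAM but not for the registers in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy bytes from src to dst. Uses 16-bit accesses when both pointers are
 * 2-byte aligned and the length is even; otherwise copies byte by byte.
 * Returns 1 if the 16-bit path was taken, 0 if not. */
int copy16(volatile void *dst, const volatile void *src, size_t bytes)
{
    if ((((uintptr_t)dst | (uintptr_t)src | bytes) & 1u) == 0) {
        volatile uint16_t *d = dst;
        const volatile uint16_t *s = src;
        for (size_t i = 0; i < bytes / 2; i++)
            d[i] = s[i];            /* one 16-bit read, one 16-bit write */
        return 1;
    }
    volatile uint8_t *d = dst;
    const volatile uint8_t *s = src;
    for (size_t i = 0; i < bytes; i++)
        d[i] = s[i];                /* byte fallback, unsafe for 16-bit registers */
    return 0;
}
```

The return value lets the diagnostic command report when a request could not be served with word accesses, rather than silently returning a bad high byte.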

What do you need to know? You've already found a separate post explaining it. Apparently the CPU documentation requires that 16-bit hardware registers are accessed with 16-bit reads and writes, but your implementation of memcpy uses 8-bit reads/writes. So they don't work together.
The solution is simply not to use memcpy to access this register.
Instead, write your own routine which copies 16-bit values.

Not sure exactly what the question is - I think that post has the right solution.
As you stated, the issue is that the standard memcpy() routine reads a byte at a time, which does not work correctly for memory mapped hardware registers. That is a limitation of the processor - there's simply no way to get a valid value reading a byte at a time.
The suggested solution is to write your own memcpy() which only works on word-aligned addresses, and reads 16-bit words at a time. This is fairly straightforward - the link gives both a C and an assembly version. The only gotcha is to make sure you always do the 16 bit copies from a validly aligned address. You can do that in 2 ways: either use linker commands or pragmas to make sure things are aligned, or add a special case for the extra byte at the front of an unaligned buffer.

How could I guess a checksum algorithm?

Let's assume that I have some packets with a 16-bit checksum at the end. I would like to guess which checksum algorithm is used.
For a start, from dump data I can see that one byte change in the packet's payload totally changes the checksum, so I can assume that it isn't some kind of simple XOR or sum.
Then I tried several variations of CRC16, but without much luck.
This question might be more biased towards cryptography, but I'm really interested in any easy to understand statistical tools to find out which CRC this might be. I might even turn to drawing different CRC algorithms if everything else fails.
Background story: I have a serial RFID protocol with some kind of checksum. I can replay messages without problems and interpret the results (without checking the checksum), but I can't send modified packets because the device drops them on the floor.
Using existing software, I can change the payload of the RFID chip. However, the unique serial number is immutable, so I can't check every possible combination. I could generate dumps of values incrementing by one, but not enough of them to make an exhaustive search applicable to this problem.
Dump files with data are available if the question itself isn't enough :-)
Need reference documentation? A PAINLESS GUIDE TO CRC ERROR DETECTION ALGORITHMS is great reference which I found after asking question here.
In the end, after the very helpful hint in the accepted answer that it's CCITT, I used this CRC calculator and XORed the generated checksum with the known checksum to get 0xffff, which led me to the conclusion that the final XOR is 0xffff instead of CCITT's 0x0000.
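The XOR trick can be sketched like this, assuming polynomial 0x1021, MSB-first processing and an initial value of 0xFFFF (the parameters that matched in my case; the function names are made up):

```c
#include <stddef.h>
#include <stdint.h>

/* CRC-16 with polynomial 0x1021, MSB first, no final XOR. */
uint16_t crc16_ccitt(uint16_t init, const uint8_t *data, size_t len)
{
    uint16_t crc = init;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

/* XOR of the computed and the observed checksum. If this comes out as the
 * same constant for several different packets, that constant is very
 * likely the final XOR value. */
uint16_t xorout_guess(const uint8_t *payload, size_t len, uint16_t observed)
{
    return (uint16_t)(crc16_ccitt(0xFFFFu, payload, len) ^ observed);
}
```

If the result varies from packet to packet, one of the other parameters (polynomial, initial value, bit order, or which bytes are covered) is still wrong.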

There are a number of variables to consider for a CRC:
Polynomial
No of bits (16 or 32)
Normal (LSB first) or Reverse (MSB first)
Initial value
How the final value is manipulated (e.g. subtracted from 0xffff), or is a constant value
Typical CRCs:
LRC: Polynomial=0x81; 8 bits; Normal; Initial=0; Final=as calculated
CRC16: Polynomial=0xa001; 16 bits; Normal; Initial=0; Final=as calculated
CCITT: Polynomial=0x1021; 16 bits; reverse; Initial=0xffff; Final=0x1d0f
Xmodem: Polynomial=0x1021; 16 bits; reverse; Initial=0; Final=0x1d0f
CRC32: Polynomial=0xedb88320; 32 bits; Normal; Initial=0xffffffff; Final=inverted value
ZIP32: Polynomial=0x04c11db7; 32 bits; Normal; Initial=0xffffffff; Final=as calculated
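The variables above can be captured in one parametric routine and tried against a captured packet. The sketch below is only an illustration: the struct, the function names and the short candidate table are my own, and the table uses parameter sets as they are commonly catalogued rather than the exact entries listed above.

```c
#include <stddef.h>
#include <stdint.h>

struct crc16_params {
    uint16_t poly;    /* polynomial in MSB-first form */
    uint16_t init;    /* initial register value */
    int      reflect; /* nonzero: bytes are processed LSB first and the result is reflected */
    uint16_t xorout;  /* final XOR */
};

/* A few commonly catalogued 16-bit parameter sets. */
static const struct crc16_params candidates[] = {
    {0x8005, 0x0000, 1, 0x0000}, /* CRC-16/ARC */
    {0x1021, 0xFFFF, 0, 0x0000}, /* CRC-16/CCITT-FALSE */
    {0x1021, 0x0000, 0, 0x0000}, /* CRC-16/XMODEM */
    {0x1021, 0xFFFF, 0, 0xFFFF}, /* CRC-16/GENIBUS */
    {0x1021, 0xFFFF, 1, 0xFFFF}, /* CRC-16/X-25 */
};

/* Reverse the order of the low `width` bits of v. */
static uint16_t reflect_bits(uint16_t v, int width)
{
    uint16_t r = 0;
    for (int i = 0; i < width; i++)
        if (v & (1u << i))
            r |= (uint16_t)(1u << (width - 1 - i));
    return r;
}

uint16_t crc16_generic(const struct crc16_params *p, const uint8_t *data, size_t len)
{
    uint16_t crc = p->init;
    for (size_t i = 0; i < len; i++) {
        uint16_t byte = p->reflect ? reflect_bits(data[i], 8) : data[i];
        crc ^= (uint16_t)(byte << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ p->poly)
                                  : (uint16_t)(crc << 1);
    }
    if (p->reflect)
        crc = reflect_bits(crc, 16);
    return (uint16_t)(crc ^ p->xorout);
}

/* Index of the first candidate that reproduces `observed`, or -1 if none does. */
int crc16_find(const uint8_t *data, size_t len, uint16_t observed)
{
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
        if (crc16_generic(&candidates[i], data, len) == observed)
            return (int)i;
    return -1;
}
```

A single match on one packet can be a coincidence, so confirm any hit on several packets, and repeat the search with the first or last few bytes excluded in case a header byte is not covered by the checksum.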
The first thing to do is to get some samples by changing say the last byte. This will assist you to figure out the number of bytes in the CRC.
Is this a "homemade" algorithm? If so, it may take some time. Otherwise try the standard algorithms.
Try changing either the msb or the lsb of the last byte, and see how this changes the CRC. This will give an indication of the direction.
To make it more difficult, there are implementations that manipulate the CRC so that it will not affect the communications medium (protocol).
From your comment about RFID, it implies that the CRC is communications related. Usually CRC16 is used for communications, though CCITT is also used on some systems.
On the other hand, if this is UHF RFID tagging, then there are a few CRC schemes - a 5 bit one and some 16 bit ones. These are documented in the ISO standards and the IPX data sheets.
IPX: Polynomial=0x8005; 16 bits; Reverse; Initial=0xffff; Final=as calculated
ISO 18000-6B: Polynomial=0x1021; 16 bits; Reverse; Initial=0xffff; Final=as calculated
ISO 18000-6C: Polynomial=0x1021; 16 bits; Reverse; Initial=0xffff; Final=as calculated
Data must be padded with zeroes to make a multiple of 8 bits
ISO CRC5: Polynomial=custom; 5 bits; Reverse; Initial=0x9; Final=shifted left by 3 bits
Data must be padded with zeroes to make a multiple of 8 bits
EPC class 1: Polynomial=custom 0x1021; 16 bits; Reverse; Initial=0xffff; Final=post processing of 16 zero bits
Here is your answer!!!!
Having worked through your logs, the CRC is the CCITT one. The first byte 0xd6 is excluded from the CRC.

It might not be a CRC, it might be an error correcting code like Reed-Solomon.
ECC codes are often a substantial fraction of the size of the original data they protect, depending on the error rate they want to handle. If the size of the messages is more than about 16 bytes, 2 bytes of ECC wouldn't be enough to be useful. So if the message is large, you're most likely correct that it's some sort of CRC.

I'm trying to crack a similar problem here and I found a pretty neat website that will take your file and run checksums on it with 47 different algorithms and show the results. If the algorithm used to calculate your checksum is any of these algorithms, you would simply find it among the list of checksums produced with a simple text search.
The website is https://defuse.ca/checksums.htm

You would have to try every possible checksum algorithm and see which one generates the same result. However, there is no guarantee as to what content was included in the checksum. For example, some algorithms skip white space, which leads to different results.
I really don't see why would somebody want to know that though.
