Deflate and fixed Huffman codes

I'm trying to implement a deflate compressor and I have to decide whether to
compress a block using the static Huffman code or create a dynamic one.
What is the rationale behind the code lengths assigned by the static code?
(This is the table included in the RFC.)
Lit Value    Bits
---------    ----
  0 - 143     8
144 - 255     9
256 - 279     7
280 - 287     8
I thought the static code would be biased towards plain ASCII text; instead it
looks like it slightly favours the LZ77 match-length codes.
What is a good heuristic for deciding whether to use the static code?
I was thinking of building a probability distribution from a sample of the
input data and computing a distance (maybe EMD?) from the distribution implied
by the static code.

I would guess that the creator of the code took a large sample of literals and lengths from compressed data, likely including executables along with text, and found typical code lengths over the large set. They were then approximated with the table shown. However the author passed away many years ago, so we'll never know for sure.
You don't need a heuristic. Once you have done the work to find matching strings, it is comparatively very fast to compute the number of bits in the block for both a dynamic and static representation. Then simply pick the smaller one. Or the static one if equal (decodes faster).
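As a rough sketch of that comparison (an illustration only, not zlib's actual code), the cost of the block body under the static code can be computed straight from the block's symbol frequencies and the fixed code lengths of the RFC table; the dynamic cost is the same kind of sum, with lengths taken from a freshly built Huffman code, plus the size of the dynamic header:

#include <stdint.h>

/* extra bits following each length code 257..285 and distance code 0..29
   (RFC 1951, section 3.2.5); these are the same for static and dynamic codes */
static const int len_extra[29]  = {0,0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,
                                   3,3,3,3,4,4,4,4,5,5,5,5,0};
static const int dist_extra[30] = {0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,
                                   7,7,8,8,9,9,10,10,11,11,12,12,13,13};

static int static_litlen_bits(int sym)   /* the fixed code lengths from the RFC table */
{
    if (sym <= 143) return 8;
    if (sym <= 255) return 9;
    if (sym <= 279) return 7;
    return 8;
}

/* bits needed for the block body if the static (fixed) codes are used;
   lit_freq[256] (the end-of-block symbol) should be 1 */
uint64_t static_cost(const uint64_t lit_freq[286], const uint64_t dist_freq[30])
{
    uint64_t bits = 0;
    for (int s = 0; s < 286; s++) {
        bits += lit_freq[s] * (uint64_t)static_litlen_bits(s);
        if (s >= 257)
            bits += lit_freq[s] * (uint64_t)len_extra[s - 257];
    }
    for (int d = 0; d < 30; d++)          /* fixed distance codes are 5 bits each */
        bits += dist_freq[d] * (uint64_t)(5 + dist_extra[d]);
    return bits;
}

The dynamic cost is the analogous sum using the lengths of a Huffman code built from the same frequencies, plus the dynamic header. Whichever total, plus the 3-bit block header, is smaller wins; the extra bits are identical under both codings, so only the code-length sums and the dynamic header really differ.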

I don't know about rationale, but there was a small amount of irrationale in choosing the static code lengths:
In the table in your question, the maximum static code number there is 287, but the DEFLATE specification only allows up to code 285, meaning code lengths have wastefully been assigned to two invalid codes. (And not even the longest ones either!) It's a similar story with the table for distance codes, with 32 codes having lengths assigned, but only 30 valid.
So there are some easy improvements that could have been made, but that said, without some prior knowledge of the data, it's not really possible to produce anything that's massively more efficient in general. The "flatness" of the table (no code longer than 9 bits) limits the worst case to 1 extra bit per byte of incompressible data.
I think the main rationale behind the groupings is that by keeping group sizes to a multiple of 8, it's possible to tell which group a code belongs to by looking at its 5 most significant bits, which also tells you its length, along with the value to add to immediately get the corresponding symbol value:
00000 00    .. 00101 11      7 bits  +256 -> (256..279)
00110 000   .. 10111 111     8 bits   -48 -> (  0..143)
11000 000   .. 11000 111     8 bits   +88 -> (280..287)
11001 0000  .. 11111 1111    9 bits  -256 -> (144..255)
So in theory you could set up a lookup table with 32 entries to quickly read in the codes, but it's an uncommon case and probably not worth optimising for.
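For illustration, such a table and the decode step could look like this sketch (my own toy code, not from any particular decoder; `window` is assumed to already hold the next 9 bits of the stream with the most significant bit of the Huffman code first, which is the order deflate packs code bits in):

#include <assert.h>

struct fixed_group { int len; int add; };            /* symbol = code value + add */

static const struct fixed_group fixed_lut[32] = {
    {7, 256}, {7, 256}, {7, 256}, {7, 256}, {7, 256}, {7, 256},   /* 00000-00101: 7 bits */
    {8, -48}, {8, -48}, {8, -48}, {8, -48}, {8, -48}, {8, -48},   /* 00110-10111: 8 bits */
    {8, -48}, {8, -48}, {8, -48}, {8, -48}, {8, -48}, {8, -48},
    {8, -48}, {8, -48}, {8, -48}, {8, -48}, {8, -48}, {8, -48},
    {8,  88},                                                     /* 11000:       8 bits */
    {9,-256}, {9,-256}, {9,-256}, {9,-256}, {9,-256}, {9,-256},   /* 11001-11111: 9 bits */
    {9,-256}
};

/* window: next 9 code bits, MSB of the code first, in the low 9 bits;
   returns the literal/length symbol, stores the consumed bit count in *len */
static int fixed_litlen_symbol(unsigned window, int *len)
{
    struct fixed_group g = fixed_lut[window >> 4];   /* top 5 bits select the group */
    *len = g.len;
    return (int)(window >> (9 - g.len)) + g.add;
}

int main(void)
{
    int len;
    assert(fixed_litlen_symbol(0x060, &len) ==   0 && len == 8);  /* 00110000 + 1 pad bit */
    assert(fixed_litlen_symbol(0x000, &len) == 256 && len == 7);  /* 0000000 + 2 pad bits */
    assert(fixed_litlen_symbol(0x184, &len) == 282 && len == 8);  /* 11000010 + 1 pad bit */
    assert(fixed_litlen_symbol(0x1FF, &len) == 255 && len == 9);  /* 111111111            */
    return 0;
}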
There are only really two cases (with some overlap) where Fixed Huffman blocks are likely to be the most efficient:
where the input size in bytes is very small, Static Huffman can be more efficient than Uncompressed, because Uncompressed uses a 32-bit header while Fixed Huffman needs only a 7-bit footer, plus 1 bit of potential overhead per byte.
where the output size is very small (i.e. smallish, highly compressible data), Static Huffman can be more efficient than Dynamic Huffman - again because Dynamic Huffman uses a certain amount of space for an additional header. (A practical minimum header size is difficult to calculate, but I'd say at least 64 bits, probably more.)
That said, I've found they are actually helpful from a developer's perspective, because it's very easy to implement a Deflate-compatible function using Static Huffman blocks, and to iteratively improve from there to get more efficient algorithms working.

Related

32 bit multiplication on 24 bit ALU

I want to port a 32 by 32 bit unsigned multiplication to a 24-bit DSP (it's a Linear Congruential Generator, so I'm not allowed to truncate; also, I don't want to replace the current LCG with a 24-bit one yet). The available data types are 24- and 48-bit ints.
Only the low 32 bits of the result are needed. Do you know any hacks to implement this in fewer multiplies, masks and shifts than the usual way?
The line looks like this:
//val is an int(32 bit)
val = (1664525 * val) + 1013904223;
An outline would be (in my current compiler style):
static uint48_t val = SEED;
...
val = 0xFFFFFFFFUL & ((1664525UL * val) + 1013904223UL);
and hopefully the compiler will recognise:
it can use a multiply and accumulate command
it only needs a reduced multiply algorithm due to the "high word" of the constant being zero
the AND could be effected by resetting the upper bits or multiplying a constant and restoring
...other stuff depends on your {mystery dsp} target
Note: if you scale up the coefficients by 2^16, you can get truncation for free, but due to lack of info you will have to explore/decide whether it is better overall.
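One way to read that note (a host-side sketch of the identity, not tested on any real DSP): keep the state and the additive constant scaled by 2^16, so the natural mod-2^48 wrap of the 48-bit type performs the mod-2^32 truncation for free:

#include <stdint.h>
#include <assert.h>

#define MASK48 ((1ULL << 48) - 1)                 /* emulate the 48-bit type on a host */

int main(void)
{
    uint32_t ref  = 12345u;                       /* ordinary 32-bit LCG              */
    uint64_t wide = (uint64_t)ref << 16;          /* same state, kept scaled by 2^16  */

    for (int i = 0; i < 1000; i++) {
        ref  = 1664525u * ref + 1013904223u;                              /* wraps mod 2^32 */
        wide = (1664525ULL * wide + ((uint64_t)1013904223u << 16)) & MASK48;
        assert(wide == (uint64_t)ref << 16);      /* the sequences stay in lock step */
    }
    return 0;
}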
(This is more an elaboration of why two 24×24→n multiplications, 31<n, are enough for 32×32→min(n, 40).)
The question discloses amazingly little about the capabilities to build a method
32×21→32 in fewer [24×24] multiplies, masks and shifts than the usual way on:
24 and 48 bit ints & DSP (I read high throughput, non-high latency 24×24→48).
As long as there indeed is a 24×24→48 multiply (or even a 24×24+56→56 MAC) and one factor is less than 24 bits, the question is pointless, a second multiply being the compelling solution.
The usual composition of a 24<n<48×24<m<48→24<p multiply from 24×24→48 uses three of the latter; a compiler should know as well as a coder that "the fourth multiply" would yield bits with a significance/position exceeding the combined lengths of the lower parts of the factors.
So, is it possible to generate "the long product" using just a second 24×24→48?
Let the (bytes of the) factors be w_xyz and W_XYZ, respectively; the underscores suggesting "the Ws" being the lower significance bits in the higher significance words/ints if interpreted as 24bit ints. The first 24×24→48 gives the sum of
      zX
   yX zY
xX yY zZ
   xY yZ
      xZ
(each column is one byte position); what is still needed on top of that is
   wZ +
   zW.
This can be computed using one combined multiplication of
((w<<16)|(z & 0xff)) × ((W<<16)|(Z & 0xff)). (Never mind the 17th bit of wZ+zW "running" into wW.)
(In the first revision of this answer, I foolishly produced wZ and zW separately - their sum is wanted in the end, anyway.)
(Annoyingly, this is about all you can do for 24×24→24 as a base operation too - beyond this "combining multiplication", you need four instead of one.)
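As a host-side sketch of this scheme (my own illustration; mul24() merely stands in for the DSP's 24×24→48 instruction, and main() just checks the identity against ordinary 32-bit multiplication):

#include <stdint.h>
#include <assert.h>

static uint64_t mul24(uint32_t a, uint32_t b)        /* stands in for the 24x24 -> 48 instruction */
{
    return (uint64_t)(a & 0xFFFFFF) * (b & 0xFFFFFF);
}

/* low 32 bits of a 32x32 product using only two 24x24 -> 48 multiplies */
static uint32_t mul32_lo(uint32_t A, uint32_t B)
{
    uint32_t w = A >> 24, xyz = A & 0xFFFFFF, z = A & 0xFF;   /* A = w|x|y|z */
    uint32_t W = B >> 24, XYZ = B & 0xFFFFFF, Z = B & 0xFF;   /* B = W|X|Y|Z */

    uint64_t low  = mul24(xyz, XYZ);                          /* first multiply */
    /* second multiply packs w with z and W with Z, so bits 16..23 of the
       product hold wZ + zW; only the low 8 bits of that sum still matter */
    uint64_t comb = mul24((w << 16) | z, (W << 16) | Z);
    uint32_t hi8  = (uint32_t)(comb >> 16) & 0xFF;

    return (uint32_t)(low + ((uint64_t)hi8 << 24));           /* wraps mod 2^32 */
}

int main(void)
{
    uint32_t a = 0xDEADBEEFu, b = 0x12345678u;
    for (int i = 0; i < 1000; i++) {
        assert(mul32_lo(a, b) == (uint32_t)((uint64_t)a * b));
        a = a * 1664525u + 1013904223u;                       /* churn the test values */
        b = b * 22695477u + 1u;
    }
    return 0;
}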
Another angle to explore is choosing a different PRNG.
It may have to be >24 bits (tell!).
On a 24 bit machine, XorShift* (or even XorShift+) 48/32 seems worth a look.

How much storage would be required to store a human genome?

I'm looking for the amount of storage in bytes (MB, GB, TB, etc.) required to store a single human genome. I read a few articles on Wikipedia about DNA, chromosomes, base pairs and genes, and have a rough guess, but before disclosing anything I'd like to see how others would approach this issue.
An alternative question would be how many atoms are there in human DNA, but that would be off topic for this site.
I understand that this will be an approximation, so I'm looking for the minimal value that would be able to store DNA of any human.
If you trust such things, here is what Wikipedia claims (from http://en.wikipedia.org/wiki/Human_genome#Information_content):
The 2.9 billion base pairs of the haploid human genome correspond to a
maximum of about 725 megabytes of data, since every base pair can be
coded by 2 bits. Since individual genomes vary by less than 1% from
each other, they can be losslessly compressed to roughly 4 megabytes.
You do not store all the DNA in one stream; rather, most of the time it is stored per chromosome.
A large chromosome takes about 300 MB and a small one about 50 MB.
Edit:
I think the first reason why it is not saved as 2 bits per base pair is that it would be a hurdle to work with the data. Most people would not know how to convert it. And even if a conversion program were provided, a lot of people in large companies or research institutes are not allowed to install programs, would need to ask, or would not know how...
1 GB of storage costs nothing, and even downloading 3 GB takes only 4 minutes at 100 Mbit/s, and most companies have faster connections.
Another point is that the data isn't as simple as you get told.
e.g. The sequencing method invented by Craig Venter was a great breakthrough but has its downsides. It could not separate long chains of the same base pair, so it is not always 100% clear whether there are 8 A's or 9 A's. Things you have to take care of later on...
Another example is DNA methylation, because you can't store this information in a 2-bit representation.
Basically, each base pair takes 2 bits (you can use 00, 01, 10, 11 for T, G, C, and A). Since there are about 2.9 billion base pairs in the human genome, (2 * 2.9 billion) bits ~= 691 megabytes.
I'm no expert, however, the Human Genome page on Wikipedia states the following:
Raw MB:
Male (XY): 770MB
Female (XX): 756MB
I'm not sure where their variance comes from, but I'm sure you can figure it out.
Yes, the minimum storage space needed for whole human DNA is about 770 MB.
However, the 2-bit representation is impractical. It is hard to search through or to do computations on it. Therefore, some mathematicians designed more effective ways to store those sequences of bases and use them in search and comparison algorithms. One such example is GARLI.
This application runs on my PC right now, and I have the human genome stored in 1563 MB.
The human genome contains over 3 billion base pairs. So if you represented each base pair as two bits then it would take over 6.15 × 10⁹ bits or approximately 770 MB.
Just did it too. The raw sequence is ~700 MB. If one uses a fixed storage sequence or a fixed sequence-storage algorithm, and the fact that the changes are 1%, I calculated ~120 MB with a per-chromosome sequence-offset state-delta storage. That's it for the storage.
There are 4 nucleotide bases that make up our DNA (A, C, G, T), so each base in the DNA takes up 2 bits. There are around 2.9 billion bases, so that's around 700 megabytes. The weird thing is that it would fill a normal data CD! Coincidence?!?
All answers are leaving off the fact that nuDNA is not the only DNA that defines a human genome. mtDNA is also inherited and it contributes an additional 16,500 base pairs to a human genome, bringing it more in line with the Wikipedia guess of 770MB for males, and 756MB for females.
This does not mean that a human genome can easily be stored on a 4 GB USB stick. Bits do not represent information by themselves; it is the combination of bits that represents information. So in the case of nuDNA and mtDNA, the bits are encoded (not to be confused with compressed) to represent proteins and enzymes that in themselves would require many MBs of raw data to represent, especially in terms of functionality.
Food for thought: 80% of the human genome is called "non-coding" DNA, so did you really believe that the entire human body and brain can be represented in a mere 151 to 154 MB of raw data?
Most answers, except those from users slayton, rauchen and Paul Amstrong, are dead wrong if this is about pure one-to-one storage without compression techniques.
The human genome, with about 3 billion nucleotides, corresponds to 3 GB of bytes and not ~750 MB. The constructed "haploid" genome according to NCBI is currently 3,436,687 kB, or about 3.44 GB, in size. Check here for yourself.
Haploid = single copy of a chromosome.
Diploid = two versions of haploid.
Humans have 22 unique chromosomes x 2 = 44.
Male 23rd chromosome is X, Y and makes 46 in total.
Females 23rd chrom. is X, X and thus makes 46 in total.
For males it would be 23 + 1 chromosome in data storage on a HDD and for females 23 chromosomes, explaining the little differences mentioned now and then in answers. The X chrom. from males is equal to X chrom. from the females.
Thus loading the genome (23 + 1 chromosomes) into memory is done in parts via BLAST, using databases constructed from FASTA files. Zipped or not, nucleotides hardly compress. Back in the early days one of the tricks used was to replace tandem repeats (GACGACGAC with a shorter coding, e.g. "3GAC"; 9 bytes to 4 bytes). The reason was to save hard-drive space (the era of 500 MB-2 GB HDD platters at 7,200 rpm with SCSI connectors). For sequence searching this was also done with the query.
If "coded nucleotide" storage would be 2-bit per letter then you get for a byte:
A = 00
C = 01
G = 10
T = 11
Only this way do you fully profit from all 8 bit positions in a byte of coding. For example, the combination 00.01.10.11 (as byte 00011011) would then correspond to "ACGT" (and would show up in a text file as an unrecognizable character). This alone is responsible for the four-fold reduction in file size we see in other answers. Thus 3.4 GB would be downsized to 0.85917175 GB... ~860 MB, including a then-required conversion program (23 kB-4 MB).
But... in biology you want to be able to read the data, so gzip compression is more than enough: unzipped you can still read it. If this byte packing were used, it would become harder to read the data. That's why FASTA files are in reality plain-text files.
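For what it's worth, that packing is only a few lines of C (a toy sketch; real tools also have to deal with N, ambiguity codes, lower case and line breaks):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

static int base2bits(char b)
{
    switch (b) {
    case 'A': return 0; case 'C': return 1;
    case 'G': return 2; case 'T': return 3;
    default:  return 0;   /* real FASTA also has N, gaps, lower case, ... */
    }
}

/* pack an ACGT string, four bases per byte, most significant pair first */
static size_t pack(const char *seq, uint8_t *out)
{
    size_t n = strlen(seq), bytes = (n + 3) / 4;
    memset(out, 0, bytes);
    for (size_t i = 0; i < n; i++)
        out[i / 4] |= (uint8_t)(base2bits(seq[i]) << (6 - 2 * (i % 4)));
    return bytes;
}

int main(void)
{
    uint8_t buf[1];
    size_t n = pack("ACGT", buf);
    printf("%zu byte(s), first byte = 0x%02X\n", n, buf[0]);  /* 1 byte, 0x1B */
    return 0;
}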
There are only 2 types of base pairs: cytosine can only bind to guanine, and adenine can only bind to thymine,
so each base pair can be considered a single bit.
This means that an entire strand of human DNA, ~3 billion "bits", would be right around ~350 megabytes.
One base -- T, C, A, G (in the base-4 number system: 0, 1, 2, 3) -- is encoded as two bits (not one), so one base pair is encoded by four bits.

Why is the smallest value that can be stored a byte (8 bits) and not a bit (1 bit)?

Why is the smallest value that can be stored in memory a byte (8 bits) and not a bit (1 bit)?
Even booleans are stored as bytes. Will we ever bump the smallest unit up to 32 or 64 bits, like the registers on the CPU?
EDIT: To clarify, since many answers seemed confused about the nature of the question: this is about why a byte isn't 7-bit, 1-bit, 32-bit, etc. (not about why lower-bit primitives must fit within the hardware's byte at minimum). Is the 8-bit byte simply historical, given that some hardware has 10-bit bytes for example? Or is there a mathematical reason why 8 bits is ideal versus, say, 10 bits for general processing?
The hardware is built to read data in blocks (bytes, later words and dwords). This provides greater efficiency, than accessing individual bits, and also offers more addressing range. So most data is aligned to at least byte boundary. There exist encodings that operate with bit sequences, rather than bytes, but they are quite rare.
Nowadays the data is most often aligned to dword (32-bits) boundary anyway. Moreover, some hardware (ARM, for example), can't access misaligned multibyte variables, i.e. 16-bit word can't "cross" dword boundary - exception will be thrown.
Because computers address memory at the byte level, so anything smaller than a byte is not addressable.
The underlying methods of processor access are limited to the size of the smallest usable register. On most architectures, that size is 8 bits. You can use smaller portions of these; for instance, C has the bitfield feature in structs that will allow combining fields that only need to be certain bit lengths. Access will still require that the whole byte be read.
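For example, a C bitfield struct like the one below packs several small fields into what is typically a single byte of storage (the exact layout is implementation-defined), even though the hardware still reads and writes that whole byte:

#include <stdio.h>

struct packed_flags {
    unsigned char ready   : 1;   /* three 1-bit flags ...                */
    unsigned char dirty   : 1;
    unsigned char locked  : 1;
    unsigned char retries : 5;   /* ... and a 5-bit counter share a byte */
};

int main(void)
{
    struct packed_flags f = { .ready = 1, .retries = 17 };
    printf("sizeof(struct packed_flags) = %zu\n", sizeof(struct packed_flags));
    printf("ready=%d retries=%d\n", f.ready, f.retries);
    return 0;
}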
Some older, exotic architectures actually did have a different "word size." In these machines, 10 bits might be the common size.
Lastly, processors are almost always backwards compatible. Intel, for instance, has maintained complete instruction compatibility from the 386 on up. If you take a program compiled for the 386, it will still run on an i7 processor. Changing the word size would break compatibility. So while it is possible, no manufacturer will ever do it.
Assume that we have a native language that consists of 2 characters, say a and b.
To distinguish the two characters we need at least 1 bit, for example 0 to represent a and 1 to represent b.
If we count the characters, special characters and symbols, there are 128 of them; to distinguish one character from another you need log2(128) = 7 bits, plus an 8th bit for transmission.

Memory Units, calculating sizes, help!

I am preparing for a quiz in my computer science class, but I am not sure how to find the correct answers. The questions come in 4 varieties, such as--
Assume the following system:
Auxiliary memory containing 4 gigabytes,
Memory block equivalent to 4 kilobytes,
Word size equivalent to 4 bytes.
1. How many words are in a block, expressed as 2^_? (write the exponent)
2. What is the number of bits needed to represent the address of a word in the auxiliary memory of this system?
3. What is the number of bits needed to represent the address of a byte in a block of this system?
4. If a file contains 32 megabytes, how many blocks are contained in the file, expressed as 2^_?
Any ideas how to find the solutions? The teacher hasn't given us any examples with solutions, so I haven't been able to figure out how to do this by working backwards, and I haven't found any good resources online.
Any thoughts?
Questions like these basically boil down to working with exponents and knowing how the different pieces fit together. For example, from your sample questions, we would do:
How many words are in a block, expressed as 2^_? (write the exponent)
From your description we know that a word is 4 bytes (2^2 bytes) and that a block is 4 kilobytes (2^12 bytes). To find the number of words in one block we simply divide the size of a block by the size of a word (2^12 / 2^2) which tells us that there are 2^10 words per block.
What is the number of bits needed to represent the address of a word in the auxiliary memory of this system?
This type of question is essentially an extension of the previous one. First you need to find the number of words contained in the memory; from that you can get the number of bits required to address a word in memory. We are told that memory contains 4 gigabytes (2^32 bytes) and that a word is 4 bytes (2^2 bytes); therefore the number of words in memory is 2^32 / 2^2 = 2^30 words. From this we can deduce that 30 bits are required to address a word in memory, because each additional address bit doubles the number of addressable locations and we need to distinguish 2^30 of them.
Since this is tagged as homework I will leave the remaining questions as exercises :)
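Just to make the arithmetic of the two worked examples concrete (this doesn't solve the remaining exercises), everything reduces to subtracting exponents:

#include <stdio.h>

int main(void)
{
    int block_exp = 12;   /* 4 KB = 2^12 bytes */
    int word_exp  = 2;    /* 4 B  = 2^2  bytes */
    int mem_exp   = 32;   /* 4 GB = 2^32 bytes */

    printf("words per block   = 2^%d\n", block_exp - word_exp);  /* 2^10 */
    printf("word address bits = %d\n",   mem_exp - word_exp);    /* 30   */
    return 0;
}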
Work backwards. This is actually pretty simple mathematics. (Ignore the word "auxiliary".)
1. How much is a kilobyte? How much is 4 kilobytes? Try putting in some numbers in 2^x, say x == 4. How much is 2^4 words? 2^8?
2. If you have 4GB of memory, what is the highest address? How large a number can you express with 8 bits? 16 bits? Hint: 4GB is an even power of 2. Which?
3. This is really the same question as 2, but with different input parameters.
4. How many kilobytes are in a megabyte? Express 32 megabytes in kilobytes. Division will be useful.

How could I guess a checksum algorithm?

Let's assume that I have some packets with a 16-bit checksum at the end. I would like to guess which checksum algorithm is used.
For a start, from dump data I can see that one byte change in the packet's payload totally changes the checksum, so I can assume that it isn't some kind of simple XOR or sum.
Then I tried several variations of CRC16, but without much luck.
This question might be more biased towards cryptography, but I'm really interested in any easy to understand statistical tools to find out which CRC this might be. I might even turn to drawing different CRC algorithms if everything else fails.
Background story: I have a serial RFID protocol with some kind of checksum. I can replay messages without problem, and interpret results (without checking the checksum), but I can't send modified packets because the device drops them on the floor.
Using existing software, I can change the payload of the RFID chip. However, the unique serial number is immutable, so I don't have the ability to check every possible combination. Although I could generate dumps of values incrementing by one, it would not be enough to make an exhaustive search applicable to this problem.
Dump files with data are available if the question itself isn't enough :-)
Need reference documentation? A PAINLESS GUIDE TO CRC ERROR DETECTION ALGORITHMS is a great reference which I found after asking the question here.
In the end, after the very helpful hint in the accepted answer that it's CCITT, I used this CRC calculator, and XORed the generated checksum with the known checksum to get 0xffff, which led me to the conclusion that the final XOR is 0xffff instead of CCITT's 0x0000.
There are a number of variables to consider for a CRC:
Polynomial
No of bits (16 or 32)
Normal (LSB first) or Reverse (MSB first)
Initial value
How the final value is manipulated (e.g. subtracted from 0xffff), or is a constant value
Typical CRCs:
LRC: Polynomial=0x81; 8 bits; Normal; Initial=0; Final=as calculated
CRC16: Polynomial=0xa001; 16 bits; Normal; Initial=0; Final=as calculated
CCITT: Polynomial=0x1021; 16 bits; reverse; Initial=0xffff; Final=0x1d0f
Xmodem: Polynomial=0x1021; 16 bits; reverse; Initial=0; Final=0x1d0f
CRC32: Polynomial=0xedb88320; 32 bits; Normal; Initial=0xffffffff; Final=inverted value
ZIP32: Polynomial=0x04c11db7; 32 bits; Normal; Initial=0xffffffff; Final=as calculated
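For what it's worth, a minimal bit-at-a-time CRC-16 that takes these variables as parameters (a sketch of mine, not a reference implementation) is enough to try candidate combinations against a captured packet with a known checksum; pass the polynomial in reversed form (e.g. 0xA001) when computing LSB-first, and in normal form (e.g. 0x1021) when computing MSB-first:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

static uint16_t crc16(const uint8_t *data, size_t len,
                      uint16_t poly, uint16_t init, int lsb_first, uint16_t xorout)
{
    uint16_t crc = init;
    for (size_t i = 0; i < len; i++) {
        if (lsb_first) {                         /* reflected: feed each byte LSB first */
            crc ^= data[i];
            for (int b = 0; b < 8; b++)
                crc = (crc & 1) ? (uint16_t)((crc >> 1) ^ poly) : (uint16_t)(crc >> 1);
        } else {                                 /* feed each byte MSB first */
            crc ^= (uint16_t)(data[i] << 8);
            for (int b = 0; b < 8; b++)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ poly) : (uint16_t)(crc << 1);
        }
    }
    return (uint16_t)(crc ^ xorout);
}

int main(void)
{
    /* sanity check against the conventional check string: poly 0x1021, init 0xFFFF,
       MSB first, no final XOR (CRC-16/CCITT-FALSE) of "123456789" should print 0x29B1 */
    const uint8_t msg[] = "123456789";
    printf("0x%04X\n", crc16(msg, sizeof msg - 1, 0x1021, 0xFFFF, 0, 0x0000));
    return 0;
}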
The first thing to do is to get some samples by changing say the last byte. This will assist you to figure out the number of bytes in the CRC.
Is this a "homemade" algorithm. In this case it may take some time. Otherwise try the standard algorithms.
Try changing either the msb or the lsb of the last byte, and see how this changes the CRC. This will give an indication of the direction.
To make it more difficult, there are implementations that manipulate the CRC so that it will not affect the communications medium (protocol).
From your comment about RFID, it implies that the CRC is communications related. Usually CRC16 is used for communications, though CCITT is also used on some systems.
On the other hand, if this is UHF RFID tagging, then there are a few CRC schemes - a 5 bit one and some 16 bit ones. These are documented in the ISO standards and the IPX data sheets.
IPX: Polynomial=0x8005; 16 bits; Reverse; Initial=0xffff; Final=as calculated
ISO 18000-6B: Polynomial=0x1021; 16 bits; Reverse; Initial=0xffff; Final=as calculated
ISO 18000-6C: Polynomial=0x1021; 16 bits; Reverse; Initial=0xffff; Final=as calculated
Data must be padded with zeroes to make a multiple of 8 bits
ISO CRC5: Polynomial=custom; 5 bits; Reverse; Initial=0x9; Final=shifted left by 3 bits
Data must be padded with zeroes to make a multiple of 8 bits
EPC class 1: Polynomial=custom 0x1021; 16 bits; Reverse; Initial=0xffff; Final=post processing of 16 zero bits
Here is your answer!!!!
Having worked through your logs, the CRC is the CCITT one. The first byte 0xd6 is excluded from the CRC.
It might not be a CRC, it might be an error correcting code like Reed-Solomon.
ECC codes are often a substantial fraction of the size of the original data they protect, depending on the error rate they want to handle. If the size of the messages is more than about 16 bytes, 2 bytes of ECC wouldn't be enough to be useful. So if the message is large, you're most likely correct that it's some sort of CRC.
I'm trying to crack a similar problem here and I found a pretty neat website that will take your file and run checksums on it with 47 different algorithms and show the results. If the algorithm used to calculate your checksum is any of these algorithms, you would simply find it among the list of checksums produced with a simple text search.
The website is https://defuse.ca/checksums.htm
You would have to try every possible checksum algorithm and see which one generates the same result. However, there is no guarantee as to what content was included in the checksum. For example, some algorithms skip whitespace, which leads to different results.
I really don't see why somebody would want to know that, though.
