Extended Huffman code - huffman-code

I have this homework: finding the code words for the symbols in any given alphabet. It says I have to use binary Huffman on groups of three symbols. What does that mean exactly? Do I use regular Huffman on [alphabet]^3? If so, how do I then tell the difference between the 3 symbols in a group?

I can't quite tell, because your description of the problem isn't all that detailed, but I would guess they mean that instead of encoding each symbol in your alphabet individually, you are supposed to treat each triple of symbols as a group.
So, for instance, if your alphabet consists of a, b, and c, instead of generating an encoding for each of those individually, you would create an encoding for aaa, aab, aac, etc. Each one of these strings would be treated as a separate symbol in the Huffman algorithm; you can tell them apart simply by doing string comparison on them. If you need to accept input of arbitrary length, you will also need to include in your alphabet symbols that are strings of length 1 or 2. For instance, if you're encoding the string aabacab, you would need to break that down into the symbols aab, aca, and b.
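The grouping described above can be sketched in a few lines of Python. This is only a minimal sketch under some assumptions that are not in the homework statement: the helper names (`chunk`, `huffman_code`, `encode`, `decode`) are made up for illustration, the code is built from the groups that actually occur in the input rather than from all of [alphabet]^3, and a shorter trailing group is simply treated as one more symbol.

```python
import heapq
from collections import Counter
from itertools import count


def chunk(text, n=3):
    """Split text into groups of n symbols; the last group may be shorter."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def huffman_code(weights):
    """Build a binary Huffman code for a non-empty {symbol: weight} mapping."""
    if len(weights) == 1:
        # Degenerate case: give the lone symbol a one-bit codeword.
        return {next(iter(weights)): "0"}
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least likely entries into one "super-symbol".
        w0, _, left = heapq.heappop(heap)
        w1, _, right = heapq.heappop(heap)
        merged = {sym: "0" + bits for sym, bits in left.items()}
        merged.update({sym: "1" + bits for sym, bits in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]


def encode(text, n=3):
    """Return (bit string, code table) for text grouped n symbols at a time."""
    groups = chunk(text, n)
    code = huffman_code(Counter(groups))
    return "".join(code[g] for g in groups), code


def decode(bits, code):
    """Invert encode(); relies on the code being prefix-free."""
    inverse = {b: g for g, b in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


bits, code = encode("aabacab")
print(chunk("aabacab"), code, decode(bits, code))
```

Because each codeword maps back to a whole group such as aab, the decoder recovers the three individual symbols just by reading the characters of the decoded string.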
Does that help answer your question? I wasn't quite sure what you're looking for, so please feel free to edit your question or reply in a comment if this hasn't cleared anything up.

Food for thought: what about shorter strings, and permutations of "block boundaries"? What about 1 and 2 character strings? Do you just count off 3, 6, 9, 12, ... chars into your input text and then null pad any uneven lengths at the end?
If the chunks can be of variable size, then it gets really interesting to find the best fit. I suspect it degenerates into a traveling salesman kind of problem, but maybe there's a neat "theorem" or other tool out there for this kind of thing.
Perhaps try all permutations of 3 chars, saving the most frequently used, then try to come up with a good fit for the 1 and 2 char long gaps? Hmm, sounds like it might be really slow, but doable using some kind of recursive divide and conquer approach: pull out the long strings of block length N, then recurse into encoding the gaps as length N - 1.
More questions than answers, I'm afraid.

Related

ASCII Representation of Hexadecimal

I have a string that, by using string.format("%02X", char), I've received the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would help me to achieve my desired result provided below.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
    local convString = ""
    for char in input:gmatch("(..)") do
        convString = convString..(string.char("0x"..char))
    end
    return convString
end
It appeared to work, but I didn't think about characters above 127. Rookie mistake. Now I'm unsure how to get the additional characters, up to 255, to display as their ASCII values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
    input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
    print(input)
end
I did a few gsub calls to substitute in other characters, and my file comes back with the replacement strings. But when I ran into characters in the extended ASCII table, it all fell apart.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to decode a base16-encoded string is just this:
function unhex( input )
    -- replace every pair of hex digits with the byte it encodes
    return (input:gsub( "..", function(c)
        return string.char( tonumber( c, 16 ) )
    end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input; the two should then be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example, it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (e.g. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.

Extended Huffman Coding

I know this is not a coding issue, but since I found some Huffman questions here I am posting it here, as I still need this for my implementation. When doing extended Huffman coding, I understand that you form, for example, a1a1, a1a2, a1a3, etc. and multiply their probabilities. However, how do you get the codeword? For example, from the image below, how do you get that 0.6400 = 0 and 0.0160 = 10101, etc.?
First, let me describe how a Huffman tree works; then I will explain how extended Huffman encoding works.
Some terms: a codeword is a sequence of bits in our encoded (compressed) output.
Terms like a1, a2 or a3 are our input characters; we can think of them as letters for now.
We have two rules:
More common letters map to codewords no longer than those of less common letters.
The two least likely letters have codewords of the same length.
These two requirements lead to a simple way of building a binary tree describing an optimum prefix code - THE Huffman Code.
Start with the two least likely letters. We know their codewords will be p0 and p1 for some prefix p. Now merge them, treat them as one super-letter whose probability is the sum of the two, and find the two least likely letters again.
Repeat until only one super-letter remains; at that point the prefix is empty.
Right, now for the extended code, we just group a sequence of letters, pairs in your example, and treat them as one letter in a much larger alphabet.
Source: http://www.ws.binghamton.edu/fowler/fowler%20personal%20page/EE523_files/Ch_03%20Huffman%20&%20Extended%20Huffman%20%28PPT%29.pdf
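To make the procedure concrete, here is a minimal Python sketch of it applied to pairs of letters. The letter probabilities 0.8, 0.02 and 0.18 are an assumption: they are not stated in the question, but they are consistent with the quoted values 0.6400 (0.8 x 0.8) and 0.0160 (0.8 x 0.02). The function name `huffman_code` is likewise just an illustrative choice.

```python
import heapq
from itertools import count, product


def huffman_code(weights):
    """Build a binary Huffman code for a {symbol: probability} mapping (2+ symbols)."""
    tiebreak = count()  # avoids comparing dicts when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two least likely (super-)letters and merge them.
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        merged = {sym: "0" + bits for sym, bits in left.items()}
        merged.update({sym: "1" + bits for sym, bits in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


# Assumed single-letter probabilities (not given in the question).
probs = {"a1": 0.8, "a2": 0.02, "a3": 0.18}

# Extended alphabet: every pair is one "letter" whose probability is the product.
pairs = {x + y: probs[x] * probs[y] for x, y in product(probs, repeat=2)}

code = huffman_code(pairs)
avg_bits_per_pair = sum(pairs[s] * len(code[s]) for s in pairs)

for sym in sorted(pairs, key=pairs.get, reverse=True):
    print(sym, round(pairs[sym], 4), code[sym])
print("average bits per pair:", round(avg_bits_per_pair, 4))
```

The exact bit patterns depend on how ties are broken and on which branch is labelled 0, so they need not match a textbook table bit for bit; the codeword lengths are what matter. Under these assumed probabilities the pair code averages about 1.72 bits per pair, i.e. about 0.86 bits per original letter, against 1.2 bits per letter when the three letters are Huffman-coded individually.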

Huffman code for a single character?

Let's say I have a massive string of just a single character, say x. I need to use Huffman encoding.
A Huffman encoding is a full binary tree. So how does one create a Huffman code for just a single character, when we don't need two leaves at all?
jbr's answer is fine; this is just a longer version of it.
Huffman is meant to produce a minimal-length sequence of bits that contains all the information in the original sequence of symbols, assuming that the decoder already knows the set of symbols. If there's only one symbol, the input data contains no information except its length.
In Huffman-based data formats, length is usually encoded separately, not as part of the Huffman-encoded bit sequence itself. The decoder of a single-symbol Huffman code therefore has all the information it needs to reconstruct the input without needing to read anything from the Huffman-encoded bit sequence. It is logical, then, that the Huffman encoder's output should be 0 bits long.
If you don't have a length encoded separately, then you must have a symbol to represent End Of Sequence so the decoder knows when to stop reading. Then your Huffman tree will have 2 leaves and you won't run into this special case.
If you only have one symbol, then you only need 1 bit per symbol. So you really don't have to do anything except count the number of bits and translate each into your symbol.
You could simply add an edge case in your code.
For example:
Check whether there is only one character in your hash table; in that case the tree consists of only the root, without any leaves. You could then assign a code, like 0, to this root node in your encoding function.
In the encoding function, you should refer to this edge case too.
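Both ways of handling the single-symbol case mentioned in the answers above can be sketched as follows. This is a minimal illustration, not any particular format's layout: the function names and the (symbol, length, payload) tuple are assumptions made up for the example.

```python
def encode_single(text):
    """Zero-bit variant: the length is stored separately, the payload is empty."""
    symbols = set(text)
    if len(symbols) != 1:
        raise ValueError("expected exactly one distinct symbol")
    # The symbol table and the length carry all the information.
    return next(iter(symbols)), len(text), ""


def decode_single(symbol, length, payload=""):
    """Rebuild the input without reading any payload bits."""
    return symbol * length


def encode_single_one_bit(text):
    """Edge-case variant: the lone symbol gets the codeword 0 (1 bit per symbol)."""
    if len(set(text)) != 1:
        raise ValueError("expected exactly one distinct symbol")
    return "0" * len(text)


print(encode_single("xxxxx"))
print(encode_single_one_bit("xxxxx"))
```

The first variant matches the argument that the Huffman payload can be 0 bits long when the length is known; the second costs one bit per symbol but needs no separate length field if the bit count is available.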

Is The Effectiveness Of Huffman Coding Limited?

My problem is that I have 100,000+ different elements, and as I understand it Huffman works by assigning the most common element the code 0, the next 10, the next 110, then 1110, 11110 and so on. My question is, if the code for the nth element is n bits long, then surely once I have passed the 32nd term it is more space efficient to just send 32-bit data types as they are, such as ints for example? Have I missed something in the methodology?
Many thanks for any help you can offer. My current implementation works by doing
code = (code << 1) + 2;
to generate each new code (which seems to be correct!), but the only way I could encode over 100,000 elements would be to have an int[] in a makeshift new data type, where to access the value we would read from the int array as one continuous long symbol... that's not as space efficient as just transporting a 32-bit int? Or is the value of Huffman more in its prefix codes, which let each unique value in a continuous bit stream be determined unambiguously?
Thanks
Your understanding is a bit off - take a look at http://en.wikipedia.org/wiki/Huffman_coding. And you have to pack the encoded bits into machine words in order to get compression - Huffman encoded data can best be thought of as a bit-stream.
You seem to understand the principle of prefix codes.
Could you tell us a little more about these 100,000+ different elements you mention?
The fastest prefix codes -- universal codes -- do, in fact, involve a series of bit sequences that can be pre-generated without regard to the actual symbol frequencies. Compression programs that use these codes, as you mentioned, associate the most-frequent input symbol to the shortest bit sequence, the next-most-frequent input symbol to the next-shortest bit sequence, and so on.
What you describe is one particular kind of prefix code: unary coding.
Another popular variant of the unary coding system assigns elements in order of frequency to the fixed codes
"1", "01", "001", "0001", "00001", "000001", etc.
Some compression programs use another popular prefix code: Elias gamma coding.
The Elias gamma coding assigns elements in order of frequency to the fixed set of codewords
1
010
011
00100
00101
00110
00111
0001000
0001001
0001010
0001011
0001100
0001101
0001110
0001111
000010000
000010001
000010010
...
The 32nd Elias gamma codeword is 11 bits long, roughly a third as long as the 32nd unary codeword (32 bits).
The 100,000th Elias gamma codeword is 33 bits long.
If you look carefully, you can see that each Elias gamma codeword can be split into 2 parts -- the first part is more or less the unary code you are familiar with. That unary code tells the decoder how many more bits follow afterward in the rest of that particular Elias gamma codeword.
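The two-part structure is easy to see in code. The following Python sketch generates the unary codewords from the question and the Elias gamma codewords listed above, and decodes a gamma bit stream; the function names are illustrative, and the decoder assumes a well-formed input.

```python
def unary(n):
    """n-th codeword (n >= 1) of the question's unary code: 0, 10, 110, ..."""
    return "1" * (n - 1) + "0"


def elias_gamma(n):
    """Elias gamma codeword for a positive integer n."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = bin(n)[2:]
    # The leading zeros say how many bits follow the first 1 bit.
    return "0" * (len(binary) - 1) + binary


def decode_gamma(bits):
    """Decode a concatenation of Elias gamma codewords (assumed well-formed)."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


for n in (1, 2, 3, 4, 32, 100000):
    print(n, len(unary(n)), len(elias_gamma(n)))
```

A gamma codeword for n is 2 * floor(log2 n) + 1 bits long, so its length grows logarithmically, whereas the unary codeword for n is n bits long.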
There are many other kinds of prefix codes.
Many people (confusingly) refer to all prefix codes as "Huffman codes".
When compressing some particular data file, some prefix codes do better at compression than others.
How do you decide which one to use?
Which prefix code is the best for some particular data file?
The Huffman algorithm -- if you neglect the overhead of the Huffman frequency table -- chooses exactly the best prefix code for each data file.
There is no singular "the" Huffman code that can be pre-generated without regard to the actual symbol frequencies.
The prefix code chosen by the Huffman algorithm is usually different for different files.
The Huffman algorithm doesn't compress very well when we really do have 100,000+ unique elements --
the overhead of the Huffman frequency table becomes so large that we often can find some other "suboptimal" prefix code that actually gives better net compression.
Or perhaps some entirely different data compression algorithm might work even better in your application.
The "Huffword" implementation seems to work with around 32,000 or so unique elements,
but the overwhelming majority of Huffman code implementations I've seen work with around 257 unique elements (the 256 possible byte values, and the end-of-text indicator).
You might consider somehow storing your data on a disk in some raw "uncompressed" format.
(With 100,000+ unique elements, you will inevitably end up storing many of those elements in 3 or more bytes).
Those 257-value implementations of Huffman compression will be able to compress that file;
they re-interpret the bytes of that file as 256 different symbols.
My question is, if the code for the nth element is n-bits long then
surely once I have passed the 32nd term it is more space efficient to
just send 32-bit data types as they are, such as ints for example?
Have I missed something in the methodology?
One of the more counter-intuitive features of prefix codes is that some symbols (the rare symbols) are "compressed" into much longer bit sequences. If you actually have 2^8 unique symbols (all possible 8 bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 8 bits or less. By allowing the compressor to expand rare values -- to use more than 8 bits to store a rare symbol that we know can be stored in 8 bits -- that frees up the compressor to use less than 8 bits to store the more-frequent symbols.
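One way to check this trade-off numerically is the Kraft inequality, which is not named in the answer above but is the standard result behind it: a set of codeword lengths can belong to a binary prefix code only if the sum of 2^(-length) over all symbols is at most 1. A minimal Python sketch, with a made-up helper name `kraft_sum`:

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Sum of 2**-length over all codewords; a prefix code needs this <= 1."""
    return sum(Fraction(1, 2 ** length) for length in lengths)


# 256 symbols, all coded in exactly 8 bits: the budget is used up completely.
fixed = [8] * 256
print(kraft_sum(fixed))       # 1

# Shortening one symbol to 7 bits while capping the rest at 8 overshoots.
too_short = [7] + [8] * 255
print(kraft_sum(too_short))   # 257/256, so no such prefix code exists

# Letting two rare symbols grow to 9 bits pays for the 7-bit codeword.
balanced = [7] + [8] * 253 + [9] * 2
print(kraft_sum(balanced))    # 1
```

With every length capped at 8 bits, each of the 256 terms is at least 1/256, so the only way to stay within the budget is for all lengths to be exactly 8, which is no compression at all.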
related:
Maximum number of different numbers, Huffman Compression

Avoiding real English words in "short URLs", without sacrificing too much headroom

Assuming here that the language in question is English, and the character sets used are basic ASCII / latin alphabet.
When generating "Short URLs", the first thought is often to use a large "code set"/alphabet to convert an integer (possibly an ID referencing the long URL in your database) to a high "base" (URL-friendly Base-64, for example). In my specific case, I first opted to normalize to Base-36 (numbers, latin letters, not case-sensitive).
However, upon closer inspection, one might find their Short URL generator eventually spitting out naughty words, or other common words, which may be quite undesirable.
One option to avoid generating "real words" would be to just strip out all of the common vowels.
Are there other/better workarounds that don't sacrifice too much headroom?
I think your idea to strip out the vowels will be your best bet here.
Anything else, like blacklists, dictionary lookups, etc., will just be incredibly tedious, require a lot of maintenance and, ultimately, be fallible.
You could normalize to base-30 [0-9bcdfghj-np-tvwxz], which will simply never generate vowels and thus not generate real words.
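A minimal Python sketch of that base-30 conversion follows. The alphabet is the character class given above, spelled out; the function names `encode_id` and `decode_id` are hypothetical, and the integer is assumed to be something like a database ID.

```python
# [0-9bcdfghj-np-tvwxz] written out: digits plus consonants, no a/e/i/o/u/y.
ALPHABET = "0123456789bcdfghjklmnpqrstvwxz"
BASE = len(ALPHABET)  # 30


def encode_id(n):
    """Convert a non-negative integer to a vowel-free base-30 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_id(short):
    """Convert a base-30 string produced by encode_id back to the integer."""
    n = 0
    for ch in short:
        n = n * BASE + ALPHABET.index(ch)
    return n


print(encode_id(123456789), decode_id(encode_id(123456789)))
```

In terms of headroom, each base-30 character carries roughly 4.9 bits against roughly 5.2 bits for base-36, so the short codes grow only slightly longer.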
You could separate your vowels and consonants (xxxddd_eeeaaa). If it's always longer than three letters you're probably safe with curse words.
Or you could insert numbers randomly.
Or you could create a filter.
Of the three, I'd probably stick with the first.
In order to sacrifice only a little information per digit but at the same time prevent as much meaning as possible, you should probably leave out the most frequent letters in English. This will be slightly more efficient than simply skipping all vowels.

Resources