Huffman tree, is this correct?

I'm trying to create a correct Huffman tree and was wondering if this is correct. The top number is the frequency/weight and the bottom number is the ASCII code. The string is
"hhiiiisssss". If I entered this into a text file, there would be only one LF, correct? I'm not sure why my program is reading in two.
            14
            -1
           /  \
          9    5
         -1   s(115)
        /  \
       5    4
      -1   i(105)
     /  \
    3    2
h(104)  LF(10)

Correct: in a text file there would be only one LF if there is only one line of text.
Something else is wrong, though. There are only two 'h's in your string, but your tree shows three, and a total of 14 characters where the string plus one LF is only 12. I'm guessing it's a typo?
Aside from that it looks OK, and your Huffman codes would be (depending on whether you pick '0' for left or right):
s: 1
i: 01
LF: 001
h: 000
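If you want to sanity-check the frequencies and code lengths mechanically, here is a minimal sketch (Python, with names of my own choosing; the exact bit patterns depend on tie-breaking, so compare code lengths rather than the bits themselves):

import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries are (weight, tiebreak, tree); a tree is either a single
    # character or a (left, right) pair of subtrees.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)                       # two lightest trees...
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))   # ...get merged
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")                       # '0' for left here
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

# the string from the question, plus a single LF
print(huffman_codes("hhiiiisssss\n"))

With the actual counts (s=5, i=4, h=2, LF=1) the code lengths come out as 1, 2, 3, 3, which matches the shape of your tree even though your weights are off.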


Lua patterns - why does custom set '[+-_]' match alphanumeric characters?

I was playing around with some patterns today to try to match some specific characters in a string, and ran into something unusual that I'm hoping someone can explain.
I had created a set looking for a list of characters within some strings, and noticed I was getting back some unexpected results. I eliminated the characters in the set until I got down to just three, and it seems to be these three that are responsible:
string = "alpha.5dc1704B40bc7f.beta.123456789.gamma.987654321.delta.abc123ABC321"
result = ""
for a in string.gmatch(string, '[+-_]') do
    result = result .. a .. " "
end
> print(result)
. 5 1 7 0 4 B 4 0 7 . . 1 2 3 4 5 6 7 8 9 . . 9 8 7 6 5 4 3 2 1 . . 1 2 3 A B C 3 2 1
Why are these characters getting returned here (looks like any number or uppercase letter, plus dots)? I note that if I change up the order of the set, I don't get the same output - '[_+-]' or '[-_+]' or '[+_-]' or '[-+_]' all return nothing, as expected.
What is it about '[+-_]' that's causing a match here? I can't figure out what I'm telling lua that is being interpreted as instructions to match these characters.
When a - is between two other characters inside square brackets, it means everything between those two. For example, [a-z] is all of the lowercase letters, and [A-F] is A, B, C, D, E, and F. [+-_] means every ASCII character between + (code 43) and _ (code 95), which includes the dot, all the digits, all the uppercase letters, and a fair amount of other punctuation.
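To see exactly what falls inside that range, here is a quick sketch (Python, purely for illustration):

# Everything the Lua set [+-_] actually matches: the ASCII range from
# '+' (code 43) up to '_' (code 95).
print("".join(chr(c) for c in range(ord("+"), ord("_") + 1)))
# +,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_

To match the three literal characters, escape the dash (in Lua patterns: '[+%-_]') or move it to the first or last position of the set, as you discovered.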

Adding tabs to non delimited text file with empty and variable length columns

I have a non-delimited text file and want to parse it to add tabs at specific spots to delimit columns. The columns are sometimes empty or vary in length, which is why I need to add the tabs at those specific spots. I found the answer to this once a couple of years ago on the net using batch, but now I can't find it or the code. I already have the following code to replace runs of spaces in the file, but it doesn't account for when the columns are empty.
gc $FileToOpen | % { $_ -replace ' +',"`t" } | set-content $FileToSave
So I need to read each line, take it a portion (a certain number of characters) at a time, and add a tab after each portion.
Here is a sample of the data file, the top row is the header and the data rows have no blank lines in between them:
MRUN Number Name X Exception Reason Data CDM# Quantity D.O.S
000000 00000000 Name W MODIFIER CANNOT BE FILED WITHOUT 08/13/2015 0000000 0 08/13/2015
000000 00000000 Name W MODIFIER CANNOT BE FILED WITHOUT 0000000 0 08/13/2015
The second data row is missing Data.
Using Ansgar's answer, here is my code that now handles the empty fields:
gc $FileToOpen |
? { $_ -match '^(.{8})(.{12})(.{20})(.{3})(.{34})(.{62})(.{10})(.{22})(.{10})$' } |
% { "{0}`t{1}`t{2}`t{3}`t{4}`t{5}`t{6}`t{7}`t{8}" -f $matches[1].Trim(), $matches[2].Trim(), $matches[3].Trim(), $matches[4].Trim(), $matches[5].Trim(), $matches[6].Trim(), $matches[7].Trim(), $matches[8].Trim(), $matches[9].Trim() } |
Set-Content $FileToSave
Thanks for your patience, Ansgar, I know I tried it! I really do appreciate the help!
Since you seem to have an input file with fixed-width columns, you should probably use a regular expression for transforming the input into a tab-delimited format.
Assume the following input file:
A     B   C
foo   13  22
bar   4   17
baz   142 23
The file has 3 columns. The first column is 6 characters wide, and the other two columns are 4 characters each.
The transformation could be done with a regular expression like this:
Get-Content 'C:\path\to\input.txt' |
? { $_ -match '^(.{6})(.{4})(.{4})$' } |
% { "{0}`t{1}`t{2}" -f $matches[1].Trim(), $matches[2].Trim(), $matches[3].Trim() } |
Set-Content 'C:\path\to\output.txt'
The regular expression defines the columns by character count and captures them in groups (parentheses). The groups can then be accessed as the indexes 1 and above of the resulting $matches collection. Trimming removes the leading/trailing whitespace. The format operator (-f) then inserts the trimmed values into the tab-separated format string.
If the last column has a variable width (because its values are aligned to the left and don't have trailing spaces) you may need to change the regular expression to ^(.{6})(.{4})(.{0,4})$ to take care of that. The quantifier {0,4} means "up to four repetitions" of the preceding expression. (Note that .NET regular expressions don't recognize the {,4} shorthand with an omitted lower bound, so spell it out as {0,4}.)
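For comparison, the same fixed-width split can be done without a regular expression by slicing each line at the known column boundaries; a rough sketch of the idea in Python (the widths and file names are just the ones from the example above):

# Fixed-width to tab-delimited by slicing at known column boundaries.
widths = [6, 4, 4]                # 3-column example: 6, 4, 4 characters

def split_fixed(line, widths):
    fields, pos = [], 0
    for w in widths:
        fields.append(line[pos:pos + w].strip())   # empty columns become ""
        pos += w
    return fields

with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        dst.write("\t".join(split_fixed(line.rstrip("\n"), widths)) + "\n")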

Why 255 is the limit

I've seen lots of places say:
The maximum number of characters is 255.
where characters are ASCII. Is there a technical reason for that?
EDIT: I know ASCII is represented by 8 bits and so there are 256 different characters. The question is why the maximum NUMBER of characters (with duplicates) is specified as 255.
I assume the limit you're referring to is on the length of a string of ASCII characters.
The limit occurs due to an optimization technique where smaller strings are stored with the first byte holding the length of the string. Since a byte can only hold 256 different values, and the first byte is reserved for storing the length, the maximum string length is 255.
Some older database systems and programming languages therefore had this restriction on their native string types.
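This layout is commonly called a "Pascal string": one length byte followed by the text. A small illustration of the byte layout (Python, illustration only):

import struct

def pack_pascal(s: str) -> bytes:
    # Length-prefixed ("Pascal") string: 1 length byte, then the bytes.
    data = s.encode("ascii")
    if len(data) > 255:
        raise ValueError("a single length byte can only encode 0..255")
    return struct.pack("B", len(data)) + data

buf = pack_pascal("hello")
print(buf)      # b'\x05hello'
print(buf[0])   # 5, the stored length; 255 is the largest length that fits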
Extended ASCII is an 8-bit character set. (Original ASCII is 7-bit, but that's not relevant here.)
8 bits means that 2^8 different characters can be referenced.
2^8 equals 256, and as counting starts with 0, the maximum ASCII char code has the value 255.
Thus, the statement:
The maximum number of characters is 255.
is wrong; it should read:
The maximum number of characters is 256, the highest possible character code is 255.
To understand better how characters are mapped to the numbers from 0 to 255, see the 8-bit ASCII table.
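If you would rather generate the mapping than look it up, a tiny sketch (Python):

# Print code -> character for the printable 7-bit ASCII range (32..126).
for code in range(32, 127):
    print(code, chr(code))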
The limit is 255 because 9 + 36 + 84 + 126 = 255. The 256th character (which is really the first character) is zero.
Using the combinatoric formula C(n,k) = n!/(k!(n-k)!) to find the number of non-repeating combinations for 1, 2, 3, 4, 5, 6, 7, 8 digits, you get this:
# of digits:       1   2   3   4    5    6   7   8
# of combinations: 9   36  84  126  126  84  36  9
It is unnecessary to include 5-8 digits since it's a symmetric group of M. In other words, a 4-element generator is a group operation for an octet, and its group action has 255 permutations.
Interestingly, it only requires 3 digits to "count" to 1000 (after 789 the rest of the numbers are repetitions of previous combinations).
The total number of characters in the ASCII table is 256 (0 to 255). Codes 0 to 31 (32 characters in total) are the ASCII control characters, codes 32 to 127 are the ASCII printable characters, and codes 128 to 255 are the extended ASCII codes.
The ASCII values of a-z are 97-122.
The ASCII values of A-Z are 65-90.
The ASCII values of 0-9 are 48-57.
Is there a technical reason for that?
Yes, there is. The early ASCII encoding standard is 7 bits long, which can represent 2^7 = 128 (0 .. 127) different character codes.
What you are talking about here is a variant of ASCII encoding developed later, which is 8 bits long and can hold 2^8 = 256 (0 .. 255) character codes.
See Wikipedia for more information on the same.

How to find the trailer dictionary?

Going through the PDF spec, it says that the trailer precedes the startxref. To me, that says the xref can appear anywhere in the document, but the trailer still appears before the startxref. This makes sense until you have to parse it: because you have to parse in reverse, you can't take comments or strings into account. Let's get a little more wacky, then.
trailer<< %\
/Size 4 %\
/Root 1 0 R %\
/Info 4 0 R %\
/Key (\
trailer<< %\
/Size 4 %\
/Root 2 0 R %\
/Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
15
%%EOF
Which is a perfectly valid trailer. The first one is the real trailer, but the second one is in a "string". In this case, reverse parsing is going to fail to catch the comments. Looking for the string trailer is going to fail if it's part of a comment or string. I was wondering what the best way of finding out where the trailer starts is?
Update - This trailer seems to open in Acrobat Reader
%PDF-1.3
%âãÏÓ
xref
0 4
00000000 65535 f
00000110 00000 n
00000250 00000 n
00000315 00000 n
00000576 00000 n
1 0 obj <<
/Type /Catalog
/Pages 2 0 R
/OpenAction [ 3 0 R /XYZ null null null ]
/PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj <<
/Type /Page
/Parent 2 0 R
/Resources << >>
/MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
/Producer (Me)
/CreationDate (D:20110626000000Z)
>>
endobj
trailer<< %\
/Size 4 %\
/Root 1 0 R %\
/Info 4 0 R %\
/Key (\
trailer<< %\
/Size 4 %\
/Root 2 0 R %\
/Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
15
%%EOF
As far as syntax goes, this conforms to spec. Somehow they seem to be able to know whether they are in a comment or a string. Parsing L-R, the second trailer is in a string with a % tailed on, with a comment after the trailer. But parsing R-L, you have no idea if the first ) is part of a comment or the end of a string definition.
Another Example:
%PDF-1.3
%âãÏÓ
xref
0 8
0000000000 65535 f
0000000210 00000 n
0000000357 00000 n
0000000428 00000 n
0000000533 00000 n
0000000612 00000 n
0000000759 00000 n
0000000830 00000 n
0000000935 00000 n
1 0 obj <<
/Type /Catalog
/Pages 2 0 R
/OpenAction [ 3 0 R /XYZ null null null ]
/PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj <<
/Type /Page
/Parent 2 0 R
/Resources << >>
/MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
/Producer (Me)
/CreationDate (D:20110626000000Z)
>>
endobj
5 0 obj <<
/Type /Catalog
/Pages 6 0 R
/OpenAction [ 7 0 R /XYZ null null null ]
/PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
6 0 obj <<
/Type /Pages
/Kids [ 7 0 R ]
/Count 1
>>
endobj
7 0 obj <<
/Type /Page
/Parent 6 0 R
/Resources << >>
/MediaBox [ 0 0 100 100 ]
>>
endobj
8 0 obj <<
/Producer (Me)
/CreationDate (D:20110626000000Z)
>>
endobj
trailer<< %\
/Size 8 %\
/Root 1 0 R %\
/Info 4 0 R %\
/Key (\
trailer<< %\
/Size 8 %\
/Root 5 0 R %\
/Info 8 0 R %\
>>%)
>>%)
% test test )
startxref
17
%%EOF
This example is displayed correctly in Adobe. In my last case, you claimed it would fail because the "root" node is invalid; but in this new sample the root is valid, yet it's never actually used. So shouldn't it display a 100x100 window instead of the 8.5"x11"?
In regard to the Resources
(Required; inheritable) A dictionary containing any resources required by the page
(see Section 3.7.2, “Resource Dictionaries”). If the page requires no resources, the
value of this entry should be an empty dictionary. Omitting the entry entirely
indicates that the resources are to be inherited from an ancestor node in the page
tree.
The startxref statement usually is at the end of the file, with the trailer preceding it.
Update: The introductory sentence above was not formulated clearly enough, as Jeremy Walton correctly observed (though later comments in my answer hinted at the exceptions). It should have read: "The startxref statement usually appears at the end of the file as a single instance, with the trailer preceding it (unless your file has undergone incremental updates, in which case you may have several instances of cross-references with assorted trailers)."
If there are comments sprinkled into the PDF, they count the same as "real" PDF page description code when it comes to byte counting for the xref table byte-offset calculations. Therefore, it is not a problem to parse it correctly.
To quote straight "from the horse's mouth" (PDF specification ISO 32000-1, Section 7.5.5):
"The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets [...]"
The key expression to take into account here is "LAST cross-reference section".
If you have updated trailers in mind, then have a look at Section 7.5.6.
Yes, you have to parse in reverse. The first cross-reference section to read is the last one appearing in the file, and it will have a preceding last trailer. The second one to read is the last-but-one appearing in the file, with a preceding last-but-one trailer. And so on. If you have to read more than one trailer/xref section, each one you read has to contain a reference to the next one to read.
Should you think of "comments" as something you can freely insert into the PDF without corrupting its structure: think again. Once you have inserted comments, you have to update at least the xref table (and maybe the /Length keys of objects).
Update 2: The trailer<<...>> dictionary Jeremy constructed is probably not even a valid dictionary at all, and therefore it's also not a valid trailer dictionary.
Anyway, according to the spec, the trailer dictionary must consist of "a series of key-value pairs". The 'legal' keys in the trailer dictionary are limited to a quite narrow set, some of which are even optional (see Table 15 in Section 7.5.5).
Jeremy seems to have constructed his example in a way so as to (mis-)understand this snippet as a potentially valid trailer dictionary:
trailer<<%) >>
% test test )
Which of course isn't a dictionary at all, since we don't see any key-value pair here.
His full example isn't valid either, because the "key" called /Key isn't amongst the valid key names for the trailer (which are, according to Table 15: /Size, /Prev, /Root, /Encrypt, /Info, /ID, /XRefStm).
So Jeremy should do in his PDF-parsing code the same thing that all sane and even most insane PDF processing libraries do: give up on obviously invalid constructs instead of searching for sense in them, and tell the user that "your damn PDF is corrupt because we cannot identify valid keys in the supposed trailer section of the file".
Q: Doc, it hurts when I do this.
A: Don't do that.
The correct way to parse the end of a PDF goes something like this:
Find the last startxref
Back up to that byte offset and start parsing xref table entries
After the last xref table, parse out the trailer.
You don't really have to parse out the object numbers and byte offsets and so forth if you're just trying to find the trailer. All you need to do is look to see how many entries are in a given subsection of the xref, skip 20*N bytes, and check for another subsection (or "trailer"). When you finally hit "trailer" instead of numbers, you're there.
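A rough sketch of those three steps (Python; simplified, assuming a classic cross-reference table rather than a cross-reference stream, no error recovery, and a hypothetical file name):

def find_trailer_offset(data: bytes) -> int:
    # Step 1: the token after the last 'startxref' is the xref offset.
    sx = data.rfind(b"startxref")
    xref_offset = int(data[sx + len(b"startxref"):].split()[0])

    # Step 2: skip the 'xref' keyword, then each subsection in turn.
    pos = xref_offset + len(b"xref")
    while True:
        rest = data[pos:].lstrip()
        here = len(data) - len(rest)        # absolute offset of next token
        if rest.startswith(b"trailer"):
            return here                     # step 3: we're there
        # A subsection header is "first_object_number entry_count"; every
        # entry after it is exactly 20 bytes, so skip them all at once.
        header = rest.split(b"\n", 1)[0]
        count = int(header.split()[1])
        pos = here + len(header) + 1 + count * 20

data = open("input.pdf", "rb").read()       # hypothetical file name
t = find_trailer_offset(data)
print(data[t:t + 40])

Note that the count * 20 skip is exactly what breaks on the malformed examples above, whose entries are shorter than 20 bytes each.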
So why on Earth do you just want the trailer?
When I went hunting through the PDF Reference, I expected to find some line of text stating that the header/body/xref/trailer had to be in that order. I did not.
What I DID find, was this:
A basic conforming PDF file shall be constructed of following four elements (see Figure 2):
- A one-line header...
- A body...
- A cross-reference table...
- A trailer...
There are bullets in front of these sections, not numbers.
So that all hints that a conforming PDF can get away with swapping the order of the body and the xref. On the other hand, the header is required to be first, the trailer is required to be last, and all the sections of a PDF are listed in that order. This implies order, but won't hold up in court.
But if you look at Figure 2 (of chapter 7, section 5.1), entitled "Initial Structure of a PDF file", you'll see the order defined visually. That's a tad thin, but I'll cling to it anyway.
I wouldn't be at all surprised to find that a PDF that put its body after the xref table broke some PDF viewers (particularly a malformed PDF where the program tried to fix it).
I've been working with PDF files for well over a decade. In all that time, I have never seen a PDF where the xref came before the body. And I've seen some REALLY screwed up PDFs.
So while my "correct way to parse a PDF" may not be Iron Clad, it's still pretty durable.
And if you absolutely insist on backing up to find the keyword "trailer", then you can look for "close an array or dictionary" tokens after you parse out the trailer you found. If it were wrapped in a string, all the name slashes would have to be escaped, leading to Bad Parsing. You can't have spaces in a Name... so that leaves just array and dictionary.
But the odds of you ever encountering this problem in Real Life are astronomically small, unless you set out to break PDF software and create these PDFs yourself. That would bring your motives into question.
Jeremy has repeatedly edited his question and example code. This made my original answer and some of my original comments partially invalid and missing the point.
Fact is (and it's a well-known one amongst people in the prepress trade and industry): in quite a few instances Adobe silently, and without a warning, processes and displays PDF files which do not pass a strict validity check.
Jeremy seems to have constructed such a case. His latest example would make any PDF parser interpret the following snippet as being the trailer (I stripped the comments):
trailer<<
/Size 4
/Root 2 0 R
/Info 3 0 R
>>
However, taking the info in this trailer will lead the parser to look for the /Root at object 2 (while object 2 is in fact of /Type /Pages, when it should be of /Type /Catalog to be the root object).
As a consequence, the PDF interpreter would have to
(a) either continue searching for another instance of a trailer on the chance that the next one does contain legitimate PDF info,
(b) or give up on processing the file and throw an error.
Adobe seems to follow alternative (a).
Ghostscript seems to follow alternative (b).
Note that, according to my byte-counting, Jeremy's PDF example has one more problem: its xref table is invalid. It has only 16 bytes per line instead of 20. From the PDF spec document:
[....] the cross-reference entries themselves, one per line. Each entry shall be exactly 20 bytes long, including the end-of-line marker. There are two kinds of cross-reference entries: one for objects that are in use and another for objects that have been deleted and therefore are free. Both types of entries have similar basic formats, distinguished by the keyword n (for an in-use entry) or f (for a free entry). The format of an in-use entry shall be:
nnnnnnnnnn ggggg n eol
where:
nnnnnnnnnn shall be a 10-digit byte offset in the decoded stream
ggggg shall be a 5-digit generation number
n shall be a keyword identifying this as an in-use entry
eol shall be a 2-character end-of-line sequence
The byte offset in the decoded stream shall be a 10-digit number, padded with leading zeros if necessary, giving the number of bytes from the beginning of the file to the beginning of the object.
So to make Jeremy's xref table a valid one, each entry should be padded with 2 more leading '0's and read:
xref
0 4
0000000000 65535 f
0000000110 00000 n
0000000250 00000 n
0000000315 00000 n
0000000576 00000 n
However, adding these 2 '0's to each xref line also offsets each object by 10 more bytes, so the nnnnnnnnnn figures should be corrected as well (being lazy, I didn't do it).
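For reference, emitting a conforming entry programmatically is trivial; a tiny sketch (Python, function name is mine):

def xref_entry(offset: int, gen: int = 0, kind: str = "n") -> str:
    # 10-digit offset, 5-digit generation, 'n'/'f', 2-byte EOL: 20 bytes total
    entry = f"{offset:010d} {gen:05d} {kind}\r\n"
    assert len(entry) == 20
    return entry

print(repr(xref_entry(576)))   # '0000000576 00000 n\r\n'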
So Acrobat did open Jeremy's constructed file (without any warning)
(1) despite the invalid trailer definition, and
(2) despite the glaringly non-compliant xref table.
This adds two more proofs to what I stated in my second paragraph: Adobe's PDF parsing accepts files which violate Adobe's own PDF standard.
This is unfortunate. It lets lazy developers who write sloppy code emitting non-compliant PDF files get away without punishment. The fact that Adobe doesn't outright reject such crappy files may be in the interest of "user friendliness", but it promotes violations of the standard. At the very least, Adobe should always issue warnings when encountering such stuff.
Since Jeremy seems set on writing a PDF parser that covers all corner cases, his users should hope that it at least warns them when it encounters shitty PDFs.
In any case: I've seen a lot of non-compliant PDF files emitted by crappy PDF generators. But so far I have never encountered one which had comments sprinkled into its trailer section. So trying to cover corner cases should possibly start with lower-hanging fruit than this.
I think I have found the solution. After extensive testing with Adobe, I have found that what Adobe does is find the last known construct that can be parsed, and work from there, forward. Then it finds the last trailer that can be parsed correctly. So even if a trailer with a correct root node comes before the last valid trailer, if the root in the last trailer is invalid, it will still fail.
It is also worth noting that this is still token-based parsing forward: trailers between ( and ) are ignored, and so are trailers between stream/endstream, unless that stream has an invalid length or a length specified in an obj after the stream (as these objects are not specified in the xref table). Adobe seems to take it one step further, by actually finding trailers in "gaps" in the xref table as well; this doesn't conform to the current spec model, where the trailer is found at the end, not in the body or xref table.
So what I think is the best model is to take the largest offset in the xref table and the location of the xref table itself; if the xref table is after the largest object offset, then use that, and work forward from there. This will allow me to correctly parse strings and comments without worrying. Thanks for everyone's help in this matter. Hopefully this helps people build a more robust PDF parser as well.
The trailer dictionary follows the xref section. Based on the startxref value, you jump to the beginning of xref section. After you read the xref section, you will reach the trailer dictionary. The trailer keyword is always the first on its line (white spaces are allowed in front of it). PDF files allow incremental updates, so you can encounter PDF files with multiple xref sections and trailers, but the processing rule is the same, first process the xref section and then the trailer. If the file includes incremental updates, the trailer section will include a reference to the previous xref section.
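To make the incremental-update chain concrete, here is a naive sketch (Python; it scans for the trailer keyword directly, so it deliberately ignores the comment/string pitfalls debated above, and the function name is mine):

import re

def trailer_chain(data: bytes):
    # Yield each raw trailer dictionary, newest first, following /Prev.
    sx = data.rfind(b"startxref")
    offset = int(data[sx + len(b"startxref"):].split()[0])
    while offset is not None:
        t = data.index(b"trailer", offset)        # naive keyword scan
        body = data[t:data.index(b"startxref", t)]
        yield body
        m = re.search(rb"/Prev\s+(\d+)", body)    # older xref section, if any
        offset = int(m.group(1)) if m else None

data = open("input.pdf", "rb").read()             # hypothetical file name
for i, t in enumerate(trailer_chain(data)):
    print(i, t[:60])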

Finding the correct formula for encoded hex value in decimal

I have a case here where I am trying to figure out how a hex number is converted into a decimal number.
I had a similar case before and found that if I reversed the hex string and swapped each second value (little-endian), then converted it back to a decimal value, I got what I wanted; but this one is different.
Here are the values we received.
Value nr. 1 is
Dec: 1348916578
Hex: 0a66ab46
I just have this one decimal/hex for now but I am trying to get more values to compare results.
I hope some math genius out there will be able to see what formula might have been used here :)
thanks
1348916578
= 5 0 6 6 D 5 6 2 hex
= 0101 0000 0110 0110 1101 0101 0110 0010
0a66ab46
= 0 A 6 6 A B 4 6 hex
= 0000 1010 0110 0110 1010 1011 0100 0110
So, if a number is like this, in hex digits:
AB CD EF GH
Then a possible conversion is:
rev(B) rev(A) rev(D) rev(C) rev(F) rev(E) rev(H) rev(G)
where rev reverses the order of bits in the nibble; though I can see that the reversal could be done on a byte-wise basis also.
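That hypothesis is easy to check mechanically. A sketch (Python) that applies the swap-and-bit-reverse rule to each byte of the received hex string:

def rev_nibble(n: int) -> int:
    # Reverse the 4 bits of a nibble, e.g. 0b0010 -> 0b0100.
    return int(f"{n:04b}"[::-1], 2)

def decode(hexstr: str) -> int:
    out = []
    for i in range(0, len(hexstr), 2):            # one byte = two hex digits
        a, b = (int(d, 16) for d in hexstr[i:i + 2])
        # Swapping the two nibbles and bit-reversing each one is the same
        # as reversing all 8 bits of the byte.
        out.append(f"{rev_nibble(b):x}{rev_nibble(a):x}")
    return int("".join(out), 16)

print(decode("0a66ab46"))   # 1348916578, the reported decimal value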
Interesting.... I expanded the decimal and hex into binary, and this is what you get, respectively:
1010000011001101101010101100010
1010011001101010101101000110
Slide the bottom one over by padding with some 0s, then split into 8-bit blocks.
10100000 1100110 11010101 01100010
10100 1100110 10101011 01000110
It seems to start to line up. Let's make the bottom look like the top.
Pad the first block with 0s and it's equal.
The second block is ok.
Switch the 3rd block around (reverse it) and 10101011 becomes 11010101.
10100000 1100110 11010101 01000110
Likewise with the 4th.
10100000 1100110 11010101 01100010
Now they're the same.
10100000 1100110 11010101 01100010
10100000 1100110 11010101 01100010
Will this work for all cases? Impossible to know.
The decimal value of 0x0a66ab46 is 174500678 or 1185637898 (depending on which endianness you use, with 8-, 16- or 32-bit access). There seems to be no direct connection between either of these values and 1348916578. Maybe you just have the pair wrong? It would help if you posted some code showing how you generate these value pairs.
BTW, Delphi has a fine little method for this: SysUtils.IntToHex
What we found was that our mini USB reader that gave the 10 bit decimal format is actually not showing the whole binary code. The hexadecimal reader finds the full binary code, so essentially it is possible to convert from the hexadecimal value to the 10 bit decimal by taking off 9 characters after the binary conversion.
But this does not work the other way around (unless we strip away 2 characters from the hexadecimal value, the 10 bit decimal code will only show part of the full binary code).
So, case closed.
