How to consider syntax-whitespace while parsing PDF data?

My question is about the syntax used in PDF files. The documentation (PDF32000_2008.pdf, pdf_reference_1-7.pdf) states what whitespace is:
white-space character: characters that separate PDF syntactic constructs such as names and numbers from each other; white-space characters are HORIZONTAL TAB (09h), LINE FEED (0Ah), FORM FEED (0Ch), CARRIAGE RETURN (0Dh), SPACE (20h); (see Table 1 in 7.2.2, "Character Set")
Note: Be advised that whitespace refers to the data/content of the PDF file (i.e. what you see when you open it in an editor such as vim), not the rendered presentation (i.e. what you see in a PDF reader).
To my understanding, that would mean that this is a valid PDF object:
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
>>
endobj
where, between the two name objects /Type and /Catalog, there is a SPACE (20h) character, fulfilling the quoted purpose of separating "[those two] PDF syntactic constructs".
However, it turns out that I am able to omit the whitespace while still producing the same rendered result (in pdf.js and Evince). Hence my question: is this an equivalent alternative to the code shown above?
1 0 obj
<< /Type/Catalog/Pages 2 0 R>>
endobj

Yes, it is legal.
Right after the description of whitespace characters, you will find the following (emphasis added):
The delimiter characters (, ), <, >, [, ], {, }, / and % are special. They delimit syntactic entities such as strings, arrays, names, and comments. Any of these characters terminates the entity preceding it and is not included in the entity.
So you don't need whitespace before the /.
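To make the rule concrete, here is a minimal tokenizer sketch in Python (my own illustration, not code from the spec; it ignores strings, comments, and multi-byte delimiters such as <<). It shows that a delimiter terminates the preceding token all by itself:
# Sketch: why "/Type/Catalog" tokenizes the same as "/Type /Catalog".
# Whitespace and delimiters both end the current token; delimiters are
# additionally tokens in their own right.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "   # NUL, HT, LF, FF, CR, SP (Table 1)
PDF_DELIMITERS = b"()<>[]{}/%"

def tokenize(data):
    tokens, current = [], bytearray()
    for byte in data:
        c = bytes([byte])
        if c in PDF_WHITESPACE or c in PDF_DELIMITERS:
            if current:                  # end the token in progress
                tokens.append(bytes(current))
                current = bytearray()
            if c in PDF_DELIMITERS:      # the delimiter itself is a token
                tokens.append(c)
        else:
            current += c
    if current:
        tokens.append(bytes(current))
    return tokens

print(tokenize(b"/Type /Catalog"))   # [b'/', b'Type', b'/', b'Catalog']
print(tokenize(b"/Type/Catalog"))    # identical output, no whitespace
Both calls yield the same token stream, which is exactly why the two versions render identically.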

Related

Space delimited, except inside braces in a log file - Python

I'm a long time reader, first time asker (please be gentle).
I've been doing this with a pretty messy while read loop in Unix Bash, but I'm learning Python and would like to try to make a more effective parser routine.
So I have a bunch of log files which are mostly space delimited, but contain square brackets within which there may also be spaces. How do I ignore content within the brackets when looking for delimiters?
(I'm assuming the re library is necessary to do this.)
i.e. sample input:
[21/Sep/2014:13:51:12 +0000] serverx 192.0.0.1 identity 200 8.8.8.8 - 500 unavailable RESULT 546 888 GET http ://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium [somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; colon:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy) DoesanyonerememberAOL/1.0]
Desired output:
'21/Sep/2014:13:51:12 +0000'; 'serverx'; '192.0.0.1'; 'identity'; '200'; '8.8.8.8'; '-'; '500'; 'unavailable'; 'RESULT'; '546'; '888'; 'GET'; 'htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium'; 'somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; rev:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy DoesanyonerememberAOL/1.0'
If you'll notice the first and last fields (the ones that were in the square braces) still have spaces intact.
Bonus points
The 14th field (URL) is always in one of these formats:
htp://google.com/path-data-might-be-here-and-can-contain-special-characters
google.com/path-data-might-be-here-and-can-contain-special-characters
xyz.abc.www.google.com/path-data-might-be-here-and-can-contain-special-characters
google.com:443
google.com
I'd like to add an additional column to the data which includes just the domain (i.e. xyz.abc.www.google.com or google.com).
Until now, I've been taking the parsed output and using Unix AWK with an IF statement to split this field by '/' and check whether the third field is blank. If it is, I return the first field (up until the ':' if present); otherwise, I return the third field. If there is a better way to do this--preferably in the same routine as above--I'd love to hear it, so my final output could be:
'21/Sep/2014:13:51:12 +0000'; 'serverx'; '192.0.0.1'; 'identity'; '200'; '8.8.8.8'; '-'; '500'; 'unavailable'; 'RESULT'; '546'; '888'; 'GET'; 'htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium'; 'somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; rev:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy DoesanyonerememberAOL/1.0'; **'www.google.com'**
Footnote: I changed http to htp in the sample, so it wouldn't create a bunch of distracting links.
The regular expression pattern \[[^\]]*\]|\S+ will tokenize your data, though it doesn't strip off the brackets from the multi-word values. You'll need to do that in a separate step:
import re

def parse_line(line):
    values = re.findall(r'\[[^\]]*\]|\S+', line)
    values = [v.strip("[]") for v in values]
    return values
Here's a more verbose version of the regular expression pattern:
pattern = r"""(?x) # turn on verbose mode (ignores whitespace and comments)
\[ # match a literal open bracket '['
[^\]]* # match zero or more characters, as long as they are not ']'
\] # match a literal close bracket ']'
| # alternation, match either the section above or the section below
\S+ # match one or more non-space characters
"""
values = re.findall(pattern, line) # findall returns a list with all matches it finds
If your server logs have JSON in them you can include a match for curly braces as well:
\[[^\]]*\]|\{[^\}]*\}|\S+
https://regexr.com/
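For the bonus question, the same splitting logic described for AWK can live in the Python routine; a sketch (the helper name and the hard-coded field index are mine):
def extract_domain(url_field):
    # Mirror the AWK logic: split on '/'; if the third piece exists
    # (the field has a scheme like 'htp://'), it is the domain;
    # otherwise take the first piece, trimmed at any ':' (port).
    parts = url_field.split("/")
    if len(parts) >= 3 and parts[2]:
        return parts[2]
    return parts[0].split(":")[0]

values = parse_line(line)                   # parse_line/line from above
values.append(extract_domain(values[13]))   # the 14th field is index 13
This covers all five listed formats: a scheme yields a non-empty third piece, while bare domains (with or without a port or path) fall through to the first piece.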

How to mask specific elements in HL7?

Currently I am learning how to work with HL7 and how to parse it in Python. Now I was wondering: what happens if a value in an HL7 segment contains a pipe sign, e.g. '|'? How is this sign handled? If there were no masking, it would lead to a crash of the HL7 parser. Is there a masking possibility?
\F\
You should read the relevant sections of chapter 2 of the version 2 standard about how escaping works in version 2.
The HL7 structure has defined escape sequences for the separators like |.
When you look at an HL7 message, the five delimiters used are right after the MSH:
MSH|^~\&
| is the field separator, F
^ is the component separator, S
~ is the repetition separator (for the second-level elements), R
\ is the escape character, E
& is the sub-component separator, T
So to escape one of the special characters like |, you take the escape character, add the defined letter (F, S, etc.), and close with the escape character again.
So in the above case, to escape the | you would put \F\. Escaping the escape character itself is \E\.
If you like you can also change the delimiters after the MSH completely, but I don't recommend that.
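A sketch of that substitution in Python (mine; the table follows the letters above and assumes the default delimiters MSH|^~\&):
# Escape reserved characters in a field value before it is placed in
# an HL7 v2 segment. The escape character must be handled first, so
# the sequences we insert are not escaped a second time.
HL7_ESCAPES = [
    ("\\", "\\E\\"),   # escape character
    ("|",  "\\F\\"),   # field separator
    ("^",  "\\S\\"),   # component separator
    ("~",  "\\R\\"),   # repetition separator
    ("&",  "\\T\\"),   # sub-component separator
]

def hl7_escape(value):
    for char, seq in HL7_ESCAPES:
        value = value.replace(char, seq)
    return value

print(hl7_escape("Smith & Jones | Ltd."))   # Smith \T\ Jones \F\ Ltd.
Note that a correct unescape needs a single left-to-right scan rather than chained replaces, since the escape sequences can overlap.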

list of garbage characters like ’

I am using librets to retrieve data from my RETS server. Somehow librets' encoding method is not working and I am receiving some weird characters in my output. I noticed that characters like '’' are replaced with ’. I am unable to find a fix for librets, so I decided to replace such garbage characters with the actual values after downloading the data. What I need is a list of such garbage strings and their equivalent characters. I googled for this but found no resource. Can anyone point me to a list of such garbage letters and their actual values, or a piece of code which can generate them?
Thanks
Search for the term "UTF-8", because that's what you're seeing.
UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used in all human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.
In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as ’. (Search for "CP-1252" to see a table of bytes and characters.)
I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.
If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.
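The round trip described above is easy to reproduce; a short Python illustration (mine, not anything from librets):
# How the "garbage" arises: UTF-8 bytes read as CP-1252 text.
s = "\u2019"                        # RIGHT SINGLE QUOTATION MARK (8217)
utf8_bytes = s.encode("utf-8")      # b'\xe2\x80\x99' -> bytes 226, 128, 153
print(utf8_bytes.decode("cp1252"))  # ’  (the three "garbage letters")

# And the repair, when text was mangled in exactly this way:
print("’".encode("cp1252").decode("utf-8"))   # ’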
Question reminder:
"...I noticed that characters like '’' are replaced with ’... I decided to replace such garbage characters with the actual values after downloading the data. What I need is a list of such garbage strings and their equivalent characters."
Strictly dealing with this part:
"What I need is a list of such garbage strings and their equivalent characters."
Using PHP, you can generate these characters and their equivalents. Working with all 1,111,998 Unicode points or 109,449 UTF-8 symbols is impractical. You may use the range between &#128; and &#257; in the following loop, or another range that is more relevant to your context.
<?php
// Print a reference table: numeric entity, the "garbage" it displays
// as, and the intended symbol.
$tmp1 = "";
for ($i = 128; $i < 258; $i++)
    $tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";", ENT_NOQUOTES, "utf-8")."</td><td>&#".$i.";</td></tr>";
echo "<table border=1>
<tr><td>&#</td><td>\"Garbage\"</td><td>symbol</td></tr>";
echo $tmp1;
echo "</table>";
?>
From experience, in an ASCII context, most "garbage" symbols originate in the range &#128; to &#257;, plus (seldom) &#8129; to &#8246;.
In order for the "garbage" symbols to display, the HTML page charset must be set to ISO-8859-1 (or whichever other charset caused the problem in the first place). They will not show if the charset is set to UTF-8.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
"i decided to replace such garbage characeters with actual values
after downloading data"
You CANNOT undo the "garbage" with PHP's utf8_decode(), which would actually create more "garbage" on top of the existing "garbage". But you may use PHP's simple and fast search-and-replace function str_replace().
First, generate two arrays for each set of "garbage" symbols you wish to replace. The first array is the search term:
<?php
//ISO 8859-1 (Latin-1) special chars are found in the range 128 to 257
$tmp1 = "\$SearchArr = array(";
for ($i=128; $i<258; $i++)
$tmp1 .= "\"".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."\", ";
$tmp1 = substr($tmp1,0,strlen($tmp1)-2);//erases last comma
$tmp1 .= ");";
$tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>
The second array is the replacement term:
<?php
//Adapt for your relevant range.
$tmp2 = "\$ReplaceArr = array(\n";
for ($i=128; $i<258; $i++)
$tmp2 .= "\"&#".$i.";\", ";
$tmp2 = substr($tmp2,0,strlen($tmp2)-2);//erases last comma
$tmp2 .= ");";
echo $tmp1."\n<br><br>\n";
echo $tmp2."\n";
?>
Now you've got two arrays that you can copy, paste, and reuse to clean any of your infected strings, like this:
$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);
Note: utf8_decode() is of no help for cleaning up "garbage" symbols, but it can be used to prevent further contamination. Alternatively, one of the mb_* functions can be useful.
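The same mapping can also be generated programmatically rather than hand-maintained; a sketch in Python (my translation of the idea above, using the same 128-257 range):
# Build a search/replace mapping: key = the mojibake a character turns
# into when its UTF-8 bytes are read as CP-1252, value = the character.
fixes = {}
for cp in range(128, 258):
    char = chr(cp)
    try:
        garbage = char.encode("utf-8").decode("cp1252")
    except UnicodeDecodeError:
        continue    # a few CP-1252 byte values are undefined
    fixes[garbage] = char

def clean(text):
    for garbage, char in fixes.items():
        text = text.replace(garbage, char)
    return text

print(clean("don’t"))   # don't (with a proper smart quote)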

RAILS 3 CSV "Illegal quoting" is a lie

I've hit a problem during parsing of a CSV file where I get the following error:
CSV::MalformedCSVError: Illegal quoting on line 3.
RAILS code in question:
csv = CSV.read(args.local_file_path, col_sep: "\t", headers: true)
Line 3 in the CSV file is:
A-067067 VO VIA CE 0 8 8 SWCH Ter 4, Loc Is Here, Mne, Per Fl Auia/Sey IMAC NEK_HW 2011-03-09 09:47:44 2011-03-09 11:50:26 2011-01-13 10:49:17 2011-02-14 14:02:43 2011-02-14 14:02:44 0 0 771 771 46273 "[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC" SOME_TEXT SOME_TEXT N/A Name Here RESOLVED_CLOSED RESOLVED_CLOSED
UPDATE: Tabs don't appear to have come across above. See pastebin RAW TEXT: http://pastebin.com/4gj7iUpP
I've read numerous threads all over Stack Overflow and Google about why this happens, and I understand that. But the CSV row above has perfectly legal quoting, does it not?
The CSV is tab delimited and there is only a tab followed by the quote on either side of the column in question. There is 1 quote in that field and it is double quoted to escape it. So what gives? I can't work it out. :(
Assuming I've got something wrong here, I'd like the solution to include a way to work around the issue as I don't have control over how the CSV is constructed.
This part of your CSV is at fault:
46273 "[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC" SOME_TEXT
At least one of these parts has a stray space:
46273 "
" SOME_TEXT
I'd guess that the "3" and the double quote are supposed to be separated by one or more tabs, but there is a space before the quote. Or there is a space after the quote on the other end, where there are only supposed to be tabs between the closing quote and the "S".
CSV escapes double quotes by doubling them, so this:
"[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC"
is supposed to be a single field that contains an embedded quote:
[O/H 15/02] B270 W31 "TEXT TEXT 2 X TEXT SWITC
If you have a space before the first quote or after the last quote then, since your fields are tab delimited, you have an unescaped double quote inside a field and that's where your "illegal quoting" error comes from.
Try sending your CSV file through cat -t (which should represent tabs as ^I) to find where the stray space is.
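If you'd rather locate the stray character programmatically than eyeball cat -t output, here is a small diagnostic sketch (Python rather than Ruby; it assumes quoted fields contain no embedded tabs, the file path is a placeholder):
def find_bad_quoting(path):
    # In a tab-delimited CSV, a field containing a double quote must be
    # fully quoted (start AND end with one). Fields that fail this test
    # are exactly where "Illegal quoting" errors come from.
    with open(path, encoding="utf-8") as f:
        for lineno, raw in enumerate(f, 1):
            for col, field in enumerate(raw.rstrip("\r\n").split("\t"), 1):
                if '"' in field and not (
                    len(field) >= 2
                    and field.startswith('"')
                    and field.endswith('"')
                ):
                    print("line %d, column %d: %r" % (lineno, col, field))

find_bad_quoting("report.tsv")
A field like ' "[O/H 15/02] ...' (stray space before the opening quote) fails the test and gets printed with its line and column.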

How to find the trailer dictionary?

Going through the PDF spec, it says that the trailer precedes the startxref. To me, that says that the xref can appear anywhere in the document, but the trailer still appears before the startxref. This makes sense until you have to parse it, because when you parse in reverse you can't take comments or strings into account. Let's get a little more wacky then.
trailer<< %\
/Size 4 %\
/Root 1 0 R %\
/Info 4 0 R %\
/Key (\
trailer<< %\
/Size 4 %\
/Root 2 0 R %\
/Info 3 0 R %\
>>%)
>>&)
% test test )
startxref
15
%%EOF
This is a perfectly valid trailer. The first one is the real trailer, but the second one is in a "string". In this case, reverse parsing is going to fail to catch the comments. Looking for the string "trailer" is going to fail if it's part of a comment or string. I was wondering: what is the best way of finding out where the trailer starts?
Update - This trailer seems to open in Acrobat Reader
%PDF-1.3
%âãÏÓ
xref
0 4
00000000 65535 f
00000110 00000 n
00000250 00000 n
00000315 00000 n
00000576 00000 n
1 0 obj <<
/Type /Catalog
/Pages 2 0 R
/OpenAction [ 3 0 R /XYZ null null null ]
/PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj <<
/Type /Page
/Parent 2 0 R
/Resources << >>
/MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
/Producer (Me)
/CreationDate (D:20110626000000Z)
>>
endobj
trailer<< %\
/Size 4 %\
/Root 1 0 R %\
/Info 4 0 R %\
/Key (\
trailer<< %\
/Size 4 %\
/Root 2 0 R %\
/Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
15
%%EOF
As far as syntax goes, this conforms to the spec. Somehow they seem to be able to know whether they are in a comment or a string. Parsing left-to-right, the second trailer is in a string with a % tailed on, with a comment after the trailer. But parsing right-to-left, you have no idea whether the first ) is part of a comment or the end of a string definition.
Another Example:
%PDF-1.3
%âãÏÓ
xref
0 8
0000000000 65535 f
0000000210 00000 n
0000000357 00000 n
0000000428 00000 n
0000000533 00000 n
0000000612 00000 n
0000000759 00000 n
0000000830 00000 n
0000000935 00000 n
1 0 obj <<
/Type /Catalog
/Pages 2 0 R
/OpenAction [ 3 0 R /XYZ null null null ]
/PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj <<
/Type /Page
/Parent 2 0 R
/Resources << >>
/MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
/Producer (Me)
/CreationDate (D:20110626000000Z)
>>
endobj
5 0 obj <<
/Type /Catalog
/Pages 6 0 R
/OpenAction [ 7 0 R /XYZ null null null ]
/PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
6 0 obj <<
/Type /Pages
/Kids [ 7 0 R ]
/Count 1
>>
endobj
7 0 obj <<
/Type /Page
/Parent 6 0 R
/Resources << >>
/MediaBox [ 0 0 100 100 ]
>>
endobj
8 0 obj <<
/Producer (Me)
/CreationDate (D:20110626000000Z)
>>
endobj
trailer<< %\
/Size 8 %\
/Root 1 0 R %\
/Info 4 0 R %\
/Key (\
trailer<< %\
/Size 8 %\
/Root 5 0 R %\
/Info 8 0 R %\
>>%)
>>%)
% test test )
startxref
17
%%EOF
This example is displayed correctly in Adobe. In my last case, you claimed it would fail because the "root" node is invalid; in this new sample the root is valid, but it's never actually used. So shouldn't it display a 100x100 window instead of 8.5"x11"?
In regard to the Resources
(Required; inheritable) A dictionary containing any resources required by the page (see Section 3.7.2, "Resource Dictionaries"). If the page requires no resources, the value of this entry should be an empty dictionary. Omitting the entry entirely indicates that the resources are to be inherited from an ancestor node in the page tree.
The startxref statement usually is at the end of the file, with the trailer preceding it.
Update: The introductory sentence above was not formulated clearly enough, as Jeremy Walton correctly observed (though later comments in my answer hinted at the exceptions). It should have read: "The startxref statement usually appears at the end of the file as a single instance, with the trailer preceding it (unless your file has undergone incremental updates, in which case you may have different instances of cross-references with assorted trailers)."
If there are comments sprinkled into the PDF, they count the same as "real" PDF page description code when it comes to byte counting for the xref table's byte-offset calculations. Therefore, it is not a problem to parse it correctly.
To quote straight "from the horse's mouth" (PDF specification ISO 32000-1, Section 7.5.5):
"The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets [...]"
The key expression to take into account here is "LAST cross-reference section".
If you have updated trailers in mind, have a look at Section 7.5.6.
Yes, you have to parse in reverse. The first cross-reference section to read is the last one appearing in the file -- and it will have a preceding last trailer. The second one to read is the last-but-one appearing in the file -- with a preceding last-but-one trailer. And so on. If you have to read more than one trailer/xref section, each one you read has to contain a reference to the next one to read.
Should you think of "comments" as something you can freely insert into the PDF without corrupting its structure, think again. Once you have inserted comments, you have to update at least the xref table (and maybe the /Length keys of objects).
Update 2: The trailer<<...>> dictionary Jeremy constructed is probably not even a valid dictionary at all; therefore it is also not a valid trailer dictionary.
Anyway, according to the spec, the trailer dictionary must consist of "a series of key-value pairs". The 'legal' keys in the trailer dictionary are limited to a quite narrow set, some of which are even optional (see Table 15 in Section 7.5.5).
Jeremy seems to have constructed his example in a way so as to (mis-)understand this snippet as a potentially valid trailer dictionary:
trailer<<%) >>
% test test )
Which of course isn't a dictionary at all, since we don't see any key-value pair here.
His full example isn't valid either, because the "key" called /Key isn't amongst the valid key names for the trailer (which are, according to Table 15: /Size, /Prev, /Root, /Encrypt, /Info, /ID, /XRefStm).
So Jeremy should do in his PDF parsing code what all sane and even most insane PDF processing libraries do: give up on obviously invalid constructs instead of searching for sense in them, and tell the user that "your damn PDF is corrupt because we cannot identify valid keys in the supposed trailer section of the file".
Q: Doc, it hurts when I do this.
A: Don't do that.
The correct way to parse the end of a PDF goes something like this:
Find the last startxref
Back up to that byte offset and start parsing xref table entries
After the last xref table, parse out the trailer.
You don't really have to parse out the object numbers and byte offsets and so forth if you're just trying to find the trailer. All you need to do is look to see how many entries are in a given subsection of the xref, skip 20*N bytes, and check for another subsection (or "trailer"). When you finally hit "trailer" instead of numbers, you're there.
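As a sketch of those three steps in Python (mine; it assumes a classic, well-formed xref table and ignores cross-reference streams):
import re

def find_trailer(data):
    # 1. Find the last 'startxref' and the byte offset that follows it.
    sx = data.rfind(b"startxref")
    xref_off = int(data[sx + len(b"startxref"):].split()[0])

    # 2. Skip the 'xref' keyword, then each subsection: a 'first count'
    #    header followed by exactly count 20-byte entries.
    pos = xref_off + len(b"xref")
    while True:
        m = re.match(rb"\s*(\d+)\s+(\d+)\s*", data[pos:])
        if not m:
            break                        # no more subsections
        pos += m.end() + 20 * int(m.group(2))

    # 3. The next keyword is 'trailer'; return its byte offset.
    return data.index(b"trailer", pos)
Step 2 never interprets the entries, it only counts past them, which is the whole point of the fixed 20-byte format.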
So why on Earth do you just want the trailer?
When I went hunting through the PDF Reference, I expected to find some line of text stating that the header/body/xref/trailer had to be in that order. I did not.
What I DID find, was this:
A basic conforming PDF file shall be constructed of following four elements (see Figure 2):
- A one-line header...
- A body...
- A cross-reference table...
- A trailer...
There are bullets in front of these sections, not numbers.
So all of that hints that a conforming PDF can get away with swapping the order of the body and xref. On the other hand, the header is required to be first, the trailer is required to be last, and all the sections of a PDF are listed in that order. This implies order, but won't hold up in court.
But if you look at Figure 2 (of chapter 7, section 5.1), entitled "Initial Structure of a PDF file", you'll see the order defined visually. That's a tad thin, but I'll cling to it anyway.
I wouldn't be at all surprised to find that a PDF that put its body after the xref table broke some PDF viewers (particularly a malformed PDF where the program tried to fix it).
I've been working with PDF files for well over a decade. In all that time, I have never seen a PDF where the xref came before the body. And I've seen some REALLY screwed up PDFs.
So while my "correct way to parse a PDF" may not be Iron Clad, it's still pretty durable.
And if you absolutely insist on backing up to find the keyword "trailer", then you can look for "close an array or dictionary" tokens after you parse out the trailer you found. If it were wrapped in a string, all the name slashes would have to be escaped, leading to Bad Parsing. You can't have spaces in a Name... so that leaves just array and dictionary.
But the odds of you ever encountering this problem in Real Life are astronomically small, unless you set out to break PDF software and create these PDFs yourself. That would bring your motives into question.
Jeremy has repeatedly edited his question and example code. This made my original answer and some of my original comments partially invalid and missing the point.
It is a fact (and a well-known one amongst people in the prepress trade and industry): in quite a few instances, Adobe silently and without warning processes and displays PDF files which do not pass a strict validity check.
Jeremy seems to have constructed such a case. His latest example would make any PDF parser interpret the following snippet as the trailer (I stripped the comments):
trailer<<
/Size 4
/Root 2 0 R
/Info 3 0 R
>>
However, taking the info in this trailer will lead the parser to look for the /Root in object 2 (while object 2 is in fact of /Type /Pages, when it should be of /Type /Catalog to be the root object).
As a consequence, the PDF interpreter would have to
(a) either continue searching for another instance of a trailer on the chance that the next one does contain legitimate PDF info,
(b) or give up on processing the file and throw an error.
Adobe seems to follow alternative (a).
Ghostscript seems to follow alternative (b).
Note that, according to my byte-counting, Jeremy's PDF example has one more problem: its xref table is invalid. Its entries are 2 bytes short of the required 20 (the offsets have only 8 digits instead of 10). From the PDF spec document:
[....] the cross-reference entries themselves, one per line. Each entry shall be exactly 20 bytes long, including the end-of-line marker. There are two kinds of cross-reference entries: one for objects that are in use and another for objects that have been deleted and therefore are free. Both types of entries have similar basic formats, distinguished by the keyword n (for an in-use entry) or f (for a free entry). The format of an in-use entry shall be:
nnnnnnnnnn ggggg n eol
where:
nnnnnnnnnn shall be a 10-digit byte offset in the decoded stream
ggggg shall be a 5-digit generation number
n shall be a keyword identifying this as an in-use entry
eol shall be a 2-character end-of-line sequence
The byte offset in the decoded stream shall be a 10-digit number, padded with leading zeros if necessary, giving the number of bytes from the beginning of the file to the beginning of the object.
So to make Jeremy's xref table valid, each entry should be padded with 2 more leading '0' digits and read:
xref
0 4
0000000000 65535 f
0000000110 00000 n
0000000250 00000 n
0000000315 00000 n
0000000576 00000 n
However, adding these 2 '0' digits to each xref line also offsets each object by 10 more bytes, so the nnnnnnnnnn figures should be corrected as well (being lazy, I didn't do it).
So Acrobat did open Jeremy's constructed file (without any warning)
(1) despite the invalid trailer definition, and
(2) despite of the glaringly un-compliant xref table.
This adds two more proofs to what I stated in my second paragraph: Adobe's PDF parsing accepts files which violate Adobe's own PDF standard.
This is unfortunate. It lets lazy developers who write sloppy code emitting non-compliant PDF files get away without punishment. The fact that Adobe doesn't outright reject such crappy files may be in the interest of "user friendliness", but it promotes violations of the standard. At the very least, Adobe should always issue warnings when encountering such stuff.
Since Jeremy seems set on writing a PDF parser that covers all corner cases, his users should hope that it at least warns them when it encounters shitty PDFs.
In any case: I've seen a lot of non-compliant PDF files emitted by crappy PDF generators. But so far I have never encountered one which had comments sprinkled into its trailer section. So trying to cover corner cases should possibly start with lower-hanging fruit than this.
I think I have found the solution. After extensive testing against Adobe, I have found that what Adobe does is find the last known construct that can be parsed, and work forward from there. It then finds the last trailer that can be parsed correctly. So even if there is a correct root node in a trailer before the last valid trailer, if the root in the last trailer is invalid, it will still fail.
It is also worth noting that this is still token-based parsing going forward: trailers between ( ) are ignored, and so are trailers between stream/endstream pairs, unless that stream has an invalid length, or a length specified in an obj after the stream (as such objects are not specified in the xref table). Adobe seems to take it one step further by actually finding trailers in "gaps" in the xref table as well; this doesn't conform to the current spec model, as the trailer is found at the end, not in the body or the xref table.
So what I think is the best model is this: take the largest offset in the xref table and the location of the xref table itself; if the xref table comes after the largest object offset, use that and work forward from there. This will allow me to correctly parse strings and comments without worrying. Thanks for everyone's help in this matter. Hopefully this helps people build a more robust PDF parser as well.
The trailer dictionary follows the xref section. Based on the startxref value, you jump to the beginning of the xref section. After you read the xref section, you reach the trailer dictionary. The trailer keyword is always the first on its line (white space is allowed in front of it). PDF files allow incremental updates, so you can encounter PDF files with multiple xref sections and trailers, but the processing rule is the same: first process the xref section, then the trailer. If the file includes incremental updates, the trailer section will include a reference to the previous xref section (the /Prev key).
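For incrementally updated files, that rule becomes a loop over the /Prev links; a sketch in Python (mine; trailer_dict_at is a hypothetical helper that parses the trailer dictionary of the xref section at a given offset, e.g. along the lines of the find_trailer sketch earlier):
def collect_trailers(data, trailer_dict_at):
    # Walk the trailer chain, newest first: start from the offset after
    # the last 'startxref', then follow each trailer's /Prev key to the
    # preceding xref section until the chain ends.
    offset = int(data[data.rfind(b"startxref") + 9:].split()[0])
    trailers, seen = [], set()
    while offset not in seen:               # guard against /Prev cycles
        seen.add(offset)
        d = trailer_dict_at(data, offset)   # hypothetical helper
        trailers.append(d)
        if "/Prev" not in d:
            break
        offset = int(d["/Prev"])
    return trailers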
