How to find the trailer dictionary? - parsing

Going through the PDF spec, it says that the trailer precedes the startxref. To me, that says the xref can appear anywhere in the document, but the trailer still appears before the startxref. This makes sense until you have to parse it: because you have to parse in reverse, you can't take comments or strings into account. Let's get a little more wacky, then.
trailer<< %\
/Size 4 %\
/Root 1 0 R %\
/Info 4 0 R %\
/Key (\
trailer<< %\
/Size 4 %\
/Root 2 0 R %\
/Info 3 0 R %\
>>%)
>>&)
% test test )
startxref
15
%%EOF
Which is a perfectly valid trailer. The first one is the real trailer, but the second one is in a "string". In this case, reverse parsing is going to fail to catch the comments, and looking for the string trailer is going to fail if it's part of a comment or string. What is the best way of finding out where the trailer starts?
Update - This trailer seems to open in Acrobat Reader
%PDF-1.3
%âãÏÓ
xref
0 4
00000000 65535 f
00000110 00000 n
00000250 00000 n
00000315 00000 n
00000576 00000 n
1 0 obj <<
/Type /Catalog
/Pages 2 0 R
/OpenAction [ 3 0 R /XYZ null null null ]
/PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj <<
/Type /Page
/Parent 2 0 R
/Resources << >>
/MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
/Producer (Me)
/CreationDate (D:20110626000000Z)
>>
endobj
trailer<< %\
/Size 4 %\
/Root 1 0 R %\
/Info 4 0 R %\
/Key (\
trailer<< %\
/Size 4 %\
/Root 2 0 R %\
/Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
15
%%EOF
As far as syntax goes, this conforms to the spec. Somehow Acrobat seems to be able to tell whether it is in a comment or a string. Parsing left to right, the second trailer is in a string with a % tacked on, with a comment after the trailer. But parsing right to left, you have no idea whether the first ) is part of a comment or the end of a string definition.
Another Example:
%PDF-1.3
%âãÏÓ
xref
0 8
0000000000 65535 f
0000000210 00000 n
0000000357 00000 n
0000000428 00000 n
0000000533 00000 n
0000000612 00000 n
0000000759 00000 n
0000000830 00000 n
0000000935 00000 n
1 0 obj <<
/Type /Catalog
/Pages 2 0 R
/OpenAction [ 3 0 R /XYZ null null null ]
/PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj <<
/Type /Page
/Parent 2 0 R
/Resources << >>
/MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
/Producer (Me)
/CreationDate (D:20110626000000Z)
>>
endobj
5 0 obj <<
/Type /Catalog
/Pages 6 0 R
/OpenAction [ 7 0 R /XYZ null null null ]
/PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
6 0 obj <<
/Type /Pages
/Kids [ 7 0 R ]
/Count 1
>>
endobj
7 0 obj <<
/Type /Page
/Parent 6 0 R
/Resources << >>
/MediaBox [ 0 0 100 100 ]
>>
endobj
8 0 obj <<
/Producer (Me)
/CreationDate (D:20110626000000Z)
>>
endobj
trailer<< %\
/Size 8 %\
/Root 1 0 R %\
/Info 4 0 R %\
/Key (\
trailer<< %\
/Size 8 %\
/Root 5 0 R %\
/Info 8 0 R %\
>>%)
>>%)
% test test )
startxref
17
%%EOF
This example is displayed correctly in Adobe. For my previous example, you claimed it would fail because the "root" node is invalid, but in this new sample the root is valid; it is just never actually used. So shouldn't it display a 100x100 window instead of the 8.5"x11" one?
In regard to the Resources
(Required; inheritable) A dictionary containing any resources required by the page
(see Section 3.7.2, “Resource Dictionaries”). If the page requires no resources, the
value of this entry should be an empty dictionary. Omitting the entry entirely
indicates that the resources are to be inherited from an ancestor node in the page
tree.

The startxref statement usually is at the end of the file, with the trailer preceding it.
Update: The introductory sentence above was not formulated clearly enough, as Jeremy Walton correctly observed (though later comments in my answer hinted at the exceptions). It should have read: "The startxref statement usually appears at the end of the file as a single instance, with the trailer preceding it (unless your file has undergone incremental updates, in which case you may have several instances of cross-reference sections with their associated trailers)."
If there are comments sprinkled into the PDF, they count the same as "real" PDF page description code when it comes to byte counting for the xref table byte-offset calculations. Therefore, it is not a problem to parse it correctly.
To quote straight "from the horse's mouth" (PDF specification ISO 32000-1, Section 7.5.5):
"The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets [...]"
The key expression to take into account here is "LAST cross-reference section".
If you have updated trailers in mind, then have a look at Section 7.5.6.
Yes, you have to parse in reverse. The first cross-reference section to read is the last one appearing in the file, and it will have a preceding last trailer. The second one to read is the last-but-one appearing in the file, with a preceding last-but-one trailer. And so on. If you have to read more than one trailer/xref section, each one you read has to contain a reference to the next one to read.
Should you think of "comments" as something you can freely insert into the PDF without corrupting its structure, then think again. Once you have inserted comments, you have to update at least the xref table (and maybe the /Length keys of objects).
Update 2: The trailer<<...>> dictionary Jeremy constructed is probably not even a valid dictionary at all, therefore it's also not a valid trailer dictionary...
Anyway, according to the spec, the trailer dictionary must consist of "a series of key-value pairs". The 'legal' keys in the trailer dictionary are limited to a quite narrow set, some of which are even optional (see Table 15 in Section 7.5.5).
Jeremy seems to have constructed his example in such a way as to (mis-)understand this snippet as a potentially valid trailer dictionary:
trailer<<%) >>
% test test )
Which of course isn't a dictionary at all, since we don't see any key-value pair here.
His full example isn't valid either, because the "key" called /Key isn't amongst the valid key names for the trailer (which are, according to Table 15: /Size, /Prev, /Root, /Encrypt, /Info, /ID, /XRefStm).
So Jeremy should do in his PDF parsing code the same thing that all sane and even most insane PDF processing libraries do: give up on obviously invalid constructs instead of searching for sense in them, and tell the user that "your damn PDF is corrupt because we cannot identify valid keys in the supposed trailer section of the file".

Q: Doc, it hurts when I do this.
A: Don't do that.
The correct way to parse the end of a PDF goes something like this:
Find the last startxref
Back up to that byte offset and start parsing xref table entries
After the last xref table, parse out the trailer.
You don't really have to parse out the object numbers and byte offsets and so forth if you're just trying to find the trailer. All you need to do is look to see how many entries are in a given subsection of the xref, skip 20*N bytes, and check for another subsection (or "trailer"). When you finally hit "trailer" instead of numbers, you're there.
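The steps above can be sketched roughly as follows. This is only an illustration in Python, not a hardened parser: the function name find_trailer_offset is made up for this example, and it assumes a classic (non-stream) xref table whose entries really are 20 bytes each.

```python
import re

# Hypothetical helper for illustration; assumes a classic xref table
# with well-formed 20-byte entries.
_EOL = rb"[ \t]*(?:\r\n|\r|\n)"
_STARTXREF = re.compile(rb"startxref\s+(\d+)")
_XREF = re.compile(rb"xref" + _EOL)
_SUBSECTION = re.compile(rb"(\d+) (\d+)" + _EOL)
_WHITESPACE = re.compile(rb"\s*")


def find_trailer_offset(data: bytes) -> int:
    """Return the byte offset of the 'trailer' keyword that follows the
    cross-reference table named by the last startxref in the file."""
    pos = data.rfind(b"startxref")          # step 1: last startxref
    m = _STARTXREF.match(data, pos) if pos >= 0 else None
    if m is None:
        raise ValueError("no usable startxref found")

    m = _XREF.match(data, int(m.group(1)))  # step 2: jump to the offset
    if m is None:
        raise ValueError("startxref does not point at an xref keyword")
    pos = m.end()

    # Step 3: each subsection header is "first count"; skip 20 * count bytes
    # instead of parsing the individual entries.
    while True:
        m = _SUBSECTION.match(data, pos)
        if m is None:
            break
        pos = m.end() + 20 * int(m.group(2))

    pos = _WHITESPACE.match(data, pos).end()
    if not data.startswith(b"trailer", pos):
        raise ValueError("expected 'trailer' after the xref table")
    return pos
```

Because the walk only ever moves forward from a known offset, a second "trailer<<" hidden inside a string or comment further down is never looked at.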
So why on Earth do you just want the trailer?
When I went hunting through the PDF Reference, I expected to find some line of text stating that the header/body/xref/trailer had to be in that order. I did not.
What I DID find, was this:
A basic conforming PDF file shall be constructed of following four elements (see Figure 2):
- A one-line header...
- A body...
- A cross-reference table...
- A trailer...
There are bullets in front of these sections, not numbers.
So all of that hints that a conforming PDF can get away with swapping the order of the body and xref. On the other hand, the header is required to be first, the trailer is required to be last, and all the sections of a PDF are listed in that order. This implies order, but won't hold up in court.
But if you look at Figure 2 (of chapter 7, section 5.1), entitled "Initial Structure of a PDF file", you'll see the order defined visually. That's a tad thin, but I'll cling to it anyway.
I wouldn't be at all surprised to find that a PDF that put its body after the xref table broke some PDF viewers (particularly a malformed PDF where the program tried to fix it).
I've been working with PDF files for well over a decade. In all that time, I have never seen a PDF where the xref came before the body. And I've seen some REALLY screwed up PDFs.
So while my "correct way to parse a PDF" may not be Iron Clad, it's still pretty durable.
And if you absolutely insist on backing up to find the keyword "trailer", then you can look for "close an array or dictionary" tokens after you parse out the trailer you found. If it were wrapped in a string, all the name slashes would have to be escaped, leading to Bad Parsing. You can't have spaces in a Name... so that leaves just array and dictionary.
But the odds of you ever encountering this problem in Real Life are astronomically small, unless you set out to break PDF software and create these PDFs yourself. That would bring your motives into question.

Jeremy has repeatedly edited his question and example code. This made my original answer and some of my original comments partially invalid and missing the point.
Fact is (and a well-known one amongst people in the prepress trade and industry): Adobe does in quite a few instances silently and without a warning process and display PDF files which do not pass a strict validity checker.
Jeremy seems to have constructed such a case. His latest example would make any PDF parser interpret the following snippet as being the trailer (I stripped comments):
trailer<<
/Size 4
/Root 2 0 R
/Info 3 0 R
>>
However, taking the info in this trailer will lead to the parser looking for the /Root at object 2 (while object 2 in fact is of /Type /Pages when it should be of /Type /Catalog for being the root object).
As a consequence, the PDF interpreter would have to
(a) either continue searching for another instance of a trailer on the chance that the next one does contain legitimate PDF info,
(b) or give up on processing the file and throw an error.
Adobe seems to follow alternative (a).
Ghostscript seems to follow alternative (b).
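Note that according to my byte-counting, Jeremy's PDF example has one more problem: its xref table is invalid. Each entry has only 16 characters before its end-of-line marker, where a compliant entry has 18 (20 bytes including the 2-character end-of-line sequence). From the PDF spec document: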
[....] the cross-reference entries themselves, one per line. Each entry shall be exactly 20 bytes long, including the end-of-line marker. There are two kinds of cross-reference entries: one for objects that are in use and another for objects that have been deleted and therefore are free. Both types of entries have similar basic formats, distinguished by the keyword n (for an in-use entry) or f (for a free entry). The format of an in-use entry shall be:
nnnnnnnnnn ggggg n eol
where:
nnnnnnnnnn shall be a 10-digit byte offset in the decoded stream
ggggg shall be a 5-digit generation number
n shall be a keyword identifying this as an in-use entry
eol shall be a 2-character end-of-line sequence
The byte offset in the decoded stream shall be a 10-digit number, padded with leading zeros if necessary, giving the number of bytes from the beginning of the file to the beginning of the object.
So to make Jeremy's xref table a valid one, it should be padded with 2 more leading '0' and read:
xref
0 4
0000000000 65535 f
0000000110 00000 n
0000000250 00000 n
0000000315 00000 n
0000000576 00000 n
However, adding these two '0' characters to each xref line also shifts every object by 10 more bytes, so the nnnnnnnnnn figures should be corrected as well (being lazy, I didn't do it).
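To make the 20-byte rule concrete, here is a small Python sketch that formats one entry. The helper name xref_entry is made up for this illustration, and it assumes a space followed by a line feed as the 2-character end-of-line sequence.

```python
def xref_entry(offset: int, generation: int = 0, in_use: bool = True) -> bytes:
    """Format one cross-reference entry (hypothetical helper).

    Layout: 10-digit offset, space, 5-digit generation, space,
    keyword n or f, then a 2-character end of line (space + LF assumed here).
    """
    keyword = b"n" if in_use else b"f"
    entry = b"%010d %05d %s \n" % (offset, generation, keyword)
    if len(entry) != 20:
        raise ValueError("offset or generation has too many digits")
    return entry
```

For example, xref_entry(110) yields the bytes of "0000000110 00000 n" followed by a space and a line feed. The offset 110 is simply copied from the example above, not recomputed.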
So Acrobat did open Jeremy's constructed file (without any warning)
(1) despite the invalid trailer definition, and
(2) despite the glaringly non-compliant xref table.
This adds two more proofs to what I stated in my second paragraph: Adobe's PDF parsing accepts files which violate Adobe's own PDF standard.
This is unfortunate. It lets lazy developers get away with writing sloppy code which emits non-compliant PDF files, without punishment. The fact that Adobe doesn't outright reject such crappy files may be in the interest of "user friendliness", but it promotes violations of the standard. At the very least, Adobe should always issue warnings when encountering such stuff.
Since Jeremy seems to be writing a PDF parser that wants to cover all corner cases, his users should hope that it at least warns them if it encounters shitty PDFs.
In any case: I've seen a lot of non-compliant PDF files emitted by crappy PDF generators. But so far I have never encountered one which had comments sprinkled into its trailer section. So trying to cover corner cases should possibly start with lower-hanging fruit than this.

I think I have found the solution. After extensive testing with Adobe, I have found that what Adobe does is find the last known construct that can be parsed, and work forward from there. Then it finds the last trailer that can be parsed correctly. So even if there is a correct root node in a trailer before the last trailer that can be parsed, it will still fail if the root in that last trailer is invalid.
It is also worth noting that this is still token-based forward parsing: trailers between () are ignored, and so are trailers between stream/endstream, unless that stream has an invalid length, or a length specified in an obj after the stream (as these objects are not specified in the xref table).
Adobe seems to take it one step further by actually finding trailers in "gaps" in the xref table as well. This doesn't conform to the current spec model, as the trailer is found at the end, and not in the body or xref table.
So what I think is the best model is to get the largest offset in the xref table and the location of the xref table. If the xref table is after the largest offset of an object, then use that, and work forward from there. This will allow me to correctly parse strings and comments without worrying. Thanks for everyone's help in this matter. Hopefully this helps people build a more robust PDF parser as well.

The trailer dictionary follows the xref section. Based on the startxref value, you jump to the beginning of xref section. After you read the xref section, you will reach the trailer dictionary. The trailer keyword is always the first on its line (white spaces are allowed in front of it). PDF files allow incremental updates, so you can encounter PDF files with multiple xref sections and trailers, but the processing rule is the same, first process the xref section and then the trailer. If the file includes incremental updates, the trailer section will include a reference to the previous xref section.


How to consider syntax-whitespace while parsing PDF data?

My question is about the syntax used in pdf files. Documentation (PDF32000_2008.pdf,pdf_reference_1-7.pdf) states what whitespace is:
white-space character
characters that separate PDF syntactic
constructs such as names and numbers from each other; white space
characters are HORIZONTAL TAB (09h), LINE FEED (0Ah), FORM FEED (0Ch),
CARRIAGE RETURN (0Dh), SPACE (20h); (see Table 1 in 7.2.2, “Character
Set”)
Note: Be advised that whitespace refers to the data/content of the PDF file (i.e. when opened with an editor such as vim) and not the rendered presentation (i.e. when viewed in a PDF reader).
To my understanding that would mean that this is a valid PDF object
1 0 obj
<< /Type /Catalog
/Pages 2 0 R
>>
endobj
where between the two objects of type (name): /Type and /Catalog there is a SPACE (20h) character, fulfilling the quoted purpose to "separate [those two] PDF syntactic constructs".
However, it turns out that I am able to omit the whitespace while still producing the same rendered results (in pdf.js and evince). Hence my question: is this an equivalent alternative to the code shown above?
1 0 obj
<< /Type/Catalog/Pages 2 0 R>>
endobj
Yes, it is legal.
Right after the description of whitespace characters, you will find the following: (emphasis added)
The delimiter characters (, ), <, >, [, ], {, }, / and % are special. They delimit syntactic entities such as strings, arrays, names, and comments. Any of these characters terminates the entity preceding it and is not included in the entity.
So you don't need whitespace before the /.
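A tiny tokenizer makes the rule easy to see. The following Python sketch is an illustration only: the function name tokenize is made up, the whitespace set is the one listed in the quoted definition, and strings, hex strings and comments are deliberately not handled.

```python
WHITESPACE = b"\t\n\x0c\r "       # as listed in the quoted definition
DELIMITERS = b"()<>[]{}/%"


def tokenize(data: bytes) -> list:
    """Split a fragment of PDF syntax into tokens (illustrative sketch;
    no support for strings, hex strings or comments)."""
    tokens = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch in WHITESPACE:
            i += 1
        elif data[i:i + 2] in (b"<<", b">>"):
            tokens.append(data[i:i + 2])
            i += 2
        else:
            j = i + 1
            # A name keeps its leading '/', and a regular token runs on,
            # until whitespace OR any delimiter ends it.
            if ch == ord("/") or ch not in DELIMITERS:
                while (j < len(data) and data[j] not in WHITESPACE
                       and data[j] not in DELIMITERS):
                    j += 1
            tokens.append(data[i:j])
            i = j
    return tokens
```

Feeding it the compact form << /Type/Catalog/Pages 2 0 R>> and the spaced-out form from the question produces the same token list, because each / ends the name before it.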

Read a single number from a text file and advance stream position in Julia

I understand that Julia has a complete set of low-level tools for interfacing with binary files on the one hand, and some powerful utilities such as readdlm to load text files containing rectangular data into Array structures on the other.
What I cannot discover in the standard library docs, however, is how to easily get input from less structured text files. In particular, what would be the Julia equivalent of the C++ idiom
some_input_stream >> a_variable_int_perhaps;
Given this is such a common usage scenario I am surprised something like this does not feature prominently in the standard library...
You can use readuntil http://docs.julialang.org/en/latest/stdlib/io-network/#Base.readuntil
shell> cat test.txt
1 2 3 4
julia> i,j = open("test.txt") do f
parse(Int, readuntil(f," ")), parse(Int, readuntil(f," "))
end
(1,2)
EDIT: To address comments
To get the last integer on a line of an irregularly formatted ASCII file, you could use split if you know the character preceding the integer (I've used a blank space here)
shell> cat test.txt
1.0, two five:$#!() + 4
last line 3
julia> i = open("test.txt") do f
parse(Int, split(readline(f), " ")[end])
end
4
As far as code length is concerned, the above examples are completely self contained and the file is opened and closed in an exception safe manner (i.e. wrapped in a try-finally block). To do the same in C++ would be quite verbose.

How can I extract some data out of the middle of a noisy file using Perl 6?

I would like to do this using idiomatic Perl 6.
I found a wonderful contiguous chunk of data buried in a noisy output file.
I would like to simply print out the header line starting with Cluster Unique and all of the lines following it, up to, but not including, the first occurrence of an empty line. Here's what the file looks like:
</path/to/projects/projectname/ParameterSweep/1000.1.7.dir> was used as the working directory.
....
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
Input file: "../../filename.count.fa"
...
Here's what I want parsed out:
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
One-liner version
.say if /Cluster \s+ Unique/ ff^ /^\s*$/ for lines;
In English
Print every line from the input file starting with the one containing the phrase Cluster Unique and ending just before the next empty line.
Same code with comments
.say # print the default variable $_
if # do the previous action (.say) "if" the following term is true
/Cluster \s+ Unique/ # Match $_ if it contains "Cluster Unique"
ff^ # Flip-flop operator: false until the preceding term becomes true,
# then true until the term after it becomes true (that line is excluded)
/^\s*$/ # Match $_ if it is empty or contains only whitespace
for # Create a loop placing each element of the following list into $_
lines # Create a list of all of the lines in the file
; # End of statement
Expanded version
for lines() {
.say if (
$_ ~~ /Cluster \s+ Unique/ ff^ $_ ~~ /^\s*$/
)
}
lines() is like <> in perl5. Each line from each file listed on the command line is read in one at a time. Since this is in a for loop, each line is placed in the default variable $_.
say is like print except that it also appends a newline. When written with a starting ., it acts directly on the default variable $_.
$_ is the default variable, which in this case contains one line from the file.
~~ is the match operator that is comparing $_ with a regular expression.
// Create a regular expression between the two forward slashes
\s+ matches one or more spaces
ff is the flip-flop operator. It is false as long as the expression to its left is false. It becomes true when the expression to its left evaluates as true. It becomes false again when the expression to its right becomes true, and stays false until the left expression matches again. In this case, if we used ^ff^ instead of ff^, then the header would not be included in the output.
When ^ comes before (or after) ff, it modifies ff so that it is also false on the iteration in which the expression to its left (or right) becomes true.
/^\s*$/ matches an empty line
^ matches the beginning of a string
\s* matches zero or more spaces
$ matches the end of a string
By the way, the flip-flop operator in Perl 5 is .. when it is in a scalar context (it's the range operator in list context). But its features are not quite as rich as in Perl 6, of course.
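For readers who find the operator easier to follow as explicit state, here is a rough Python sketch of what ff^ does in the one-liner. The function name lines_between is invented for this illustration. Unlike ff, the sketch does not test the stop pattern on the line that matched the start pattern, which makes no difference for this input.

```python
import re

START = re.compile(r"Cluster\s+Unique")
STOP = re.compile(r"^\s*$")


def lines_between(lines, start=START, stop=STOP):
    """Yield lines from the first start match up to, but not including,
    the next line matching stop (illustrative stand-in for ff^)."""
    active = False
    for line in lines:
        if not active:
            if start.search(line):
                active = True      # flip on; the header line is included
                yield line
        elif stop.search(line):
            active = False         # flop off; the empty line is excluded
        else:
            yield line
```

Calling list(lines_between(open("data.txt").read().splitlines())) on a file like the one in the question would return the header and the table rows; the file name is just a placeholder.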
I would like to do this using idiomatic Perl 6.
In Perl, the idiomatic way to locate a chunk in a file is to read the file in paragraph mode, then stop reading the file when you find the chunk you are interested in. If you are reading a 10GB file, and the chunk is found at the top of the file, it's inefficient to continue reading the rest of the file--much less perform an if test on every line in the file.
In Perl 6, you can read a paragraph at a time like this:
my $fname = 'data.txt';
my $infile = open(
$fname,
nl => "\n\n", #Set what perl considers the end of a line.
); #Removed die() per Brad Gilbert's comment.
for $infile.lines() -> $para {
if $para ~~ /^ 'Cluster Unique'/ {
say $para.chomp;
last; #Quit reading the file.
}
}
$infile.close;
# ^ Match start of string.
# 'Cluster Unique' By default, whitespace is insignificant in a perl6 regex. Quotes are one way to make whitespace significant.
However, in perl6 rakudo/moarVM the open() function does not read the nl argument correctly, so you currently can't set paragraph mode.
Also, there are certain idioms that are considered by some to be bad practice, like:
Postfix if statements, e.g. say 'hello' if $y == 0.
Relying on the implicit $_ variable in your code, e.g. .say
So, depending on what side of the fence you live on, that would be considered a bad practice in Perl.

Huffman tree, is this correct?

I'm trying to create a correct Huffman tree and was wondering if this one is correct. The top number is the frequency/weight and the bottom number is the ASCII code. The string is
"hhiiiisssss". If I entered this into a text file, there would be only one LF, correct? I'm not sure why my program is reading in two.
14
-1
/ \
9 5
-1 s(115)
/ \
5 4
-1 i(105)
/ \
3 2
h(104) LF(10)
In a text file there would only be one LF if there is only one line of text, correct.
Something else is wrong though. There are only two 'h' in your string but your tree shows three, and a total of 14 characters. I'm guessing it's a typo?
Aside from that it looks ok and your huffman codes would be (depending on whether you pick '0' for left or right):
s: 1
i: 01
LF: 001
h: 000
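One way to double-check a hand-drawn tree is to compute the code lengths programmatically. Below is a minimal Python sketch using the standard heapq module; the function name huffman_code_lengths is made up for this example, and it assumes the text file contains the string followed by a single LF.

```python
import heapq
from collections import Counter


def huffman_code_lengths(text):
    """Return {symbol: code length in bits} for a Huffman code of text
    (illustrative sketch; ties are broken by insertion order)."""
    freq = Counter(text)
    # Heap items: (weight, tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


lengths = huffman_code_lengths("hhiiiisssss\n")
```

For that input the lengths come out as 1 bit for s, 2 for i, and 3 each for h and LF, which matches the shape of the codes listed above.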

parsing input file in fortran

This is a continuation of my older thread.
I have a file from different code, that I should parse to use as my input.
A snippet from it looks like:
GLOBAL SYSTEM PARAMETER
NQ 2
NT 2
NM 2
IREL 3
*************************************
BEXT 0.00000000000000E+00
SEMICORE F
LLOYD F
NE 32 0
IBZINT 2
NKTAB 936
XC-POT VWN
SCF-ALG BROYDEN2
SCF-ITER 29
SCF-MIX 2.00000000000000E-01
SCF-TOL 1.00000000000000E-05
RMSAVV 2.11362995016878E-06
RMSAVB 1.25411205586140E-06
EF 7.27534671479201E-01
VMTZ -7.72451391270293E-01
*************************************
And so on.
Currently I am reading it line by line, as:
Program readpot
use iso_fortran_env
Implicit None
integer ::i,filestat,nq
character(len=120):: rdline
character(10)::key!,dimension(:),allocatable ::key
real,dimension(:),allocatable ::val
i=0
open(12,file="FeRh.pot_new",status="old")
readline:do
i=i+1
read(12,'(A)',iostat=filestat) rdline!(i)
if (filestat /= 0) then
if (filestat == iostat_end ) then
exit readline
else
write ( *, '( / "Error reading file: ", I0 )' ) filestat
stop
endif
end if
if (rdline(1:2)=="NQ") then
read(rdline(19:20),'(i)'),nq
write(*,*)nq
end if
end do readline
End Program readpot
So, I have to read every line, manually find the value column corresponding to the key, and write that out (for brevity, I have shown this for one value only).
My question is: is this the proper way of doing this, or is there a simpler way? Kindly let me know.
If the file has no variability you scarcely need to parse it at all. Let's suppose that you have declared variables for all the interesting data items in the file and that those variables have the names shown on the lines of the file. For example
INTEGER :: nq , nt, nm, irel
REAL :: bext, scf_mix, scf_tol ! '-' not allowed in Fortran names
CHARACTER(len=48) :: label, text
LOGICAL :: semicore, lloyd
! Complete this as you wish
Then write a block of code like this
OPEN(12,file="FeRh.pot_new",status="old")
READ(12,*) ! Not interested in the 1st line
READ(12,*) label, nq
READ(12,*) label, nt
READ(12,*) label, nm
READ(12,*) label, irel
READ(12,*) ! Not interested in this line
READ(12,*) label, bext
READ(12,*) label, semicore
! Other lines to write
CLOSE(12)
Fortran's list-directed input understands blanks in lines to separate values. It will not read those blanks as part of a character variable. That behaviour can be changed but in your case you don't need to. Note that it will also understand the character F to mean .false. when read into a logical variable.
My code snippet just ignores the labels and lines of explanation. If you are of a nervous disposition you could process them, perhaps
IF (label/='NE') STOP
or whatever you wish.

Resources