How to parse a binary PDF stream of unknown length?

From the PDF docs: "The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes."
As the contents may be binary, an occurrence of endstream does not necessarily indicate the end of the stream. Now when considering this stream:
%PDF-1.4
%307쏢
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
x234+T03203T0^#A(235234˥^_d256220^314^U310^E^#[364^F!endstream
endobj
6 0 obj
30
endobj
The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed.
I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect. While it may help PDF writers to output PDFs sequentially, it makes parsing quite difficult for PDF readers. Considering that a PDF file is read more often than it is written, I don't understand this choice.
So how can such a stream be parsed correctly?

The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed.
This is an understandable conclusion if one assumes that the file is to be read sequentially beginning to end.
This assumption is incorrect, though, because parsing a PDF from the front and determining the PDF objects on the fly is not the recommended way of parsing a PDF.
While ISO 32000-1 is a bit vague here and merely says
Conforming readers should read a PDF file from its end.
(ISO 32000-1, section 7.5.5 File Trailer)
ISO 32000-2 clearly specifies:
With the exception of linearized PDF files, all PDF files should be read using the trailer and cross-reference table as described in the following subclauses. Reading a non-linearized file in a serial manner is not reliable because of the way objects are to be processed after an incremental update. (See 6.3.2, "Conformance of PDF processors".)
(ISO 32000-2, section 7.5 File structure)
Thus, in case of your PDF excerpt, a PDF processor trying to read object 5 0 (see the sketch after this list)
1. looks up object 5 0 in the cross references and gets its offset in the file,
2. goes to that offset and starts reading the object, first parsing the stream dictionary,
3. at the stream keyword recognizes that the object is a stream and retrieves its Length value, which happens to be an indirect reference to 6 0,
4. looks up object 6 0 in the cross references and gets its offset in the file,
5. goes to that offset and reads the object, the number 30,
6. reads the stream content of the stream object 5 0, knowing its length is 30.
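For illustration only, here is a minimal Python sketch of those six steps. It assumes the cross-reference table has already been parsed into a dictionary mapping (object number, generation) pairs to byte offsets; the function name and the 1024-byte read window are my own choices, not part of any library or of the standard.

import re

# Minimal sketch, not a complete PDF parser: resolve a possibly indirect
# /Length through an already parsed cross-reference table, then return the
# raw (still /FlateDecode-encoded) stream bytes.
def read_stream_object(data: bytes, xref: dict, obj_num: int, gen: int = 0) -> bytes:
    offset = xref[(obj_num, gen)]                     # steps 1-2: offset of "5 0 obj"
    head = data[offset:offset + 1024]
    m = re.search(rb"/Length\s+(\d+)(?:\s+(\d+)\s+R)?", head)
    if m.group(2) is not None:                        # step 3: /Length 6 0 R
        len_offset = xref[(int(m.group(1)), int(m.group(2)))]   # step 4
        num = re.search(rb"obj\s+(\d+)", data[len_offset:len_offset + 64])
        length = int(num.group(1))                    # step 5: the number 30
    else:
        length = int(m.group(1))                      # direct /Length value
    start = data.index(b"stream", offset) + len(b"stream")
    if data[start:start + 2] == b"\r\n":              # EOL after the stream keyword
        start += 2
    elif data[start:start + 1] == b"\n":
        start += 1
    return data[start:start + length]                 # step 6: exactly Length bytes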
An approach such as yours is explicitly considered "not reliable".
I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect.
If there were no cross references, you'd be correct. That also is why the FDF format (which does not have mandatory cross references) specifies:
FDF is based on PDF; it uses the same syntax and has essentially the same file structure (7.5, "File structure"). However, it differs from PDF in the following ways:
[...]
The length of a stream shall not be specified by an indirect object.
(ISO 32000-2, section 12.7.8 Forms data format)
Concerning the comments:
So I'm correct that PDF cannot be parsed sequentially,
While the very original design of PDF probably was meant for sequential parsing, it has been further developed with only access via cross references in mind. PDF simply is not meant to be parsed sequentially anymore. And that was already the case when I started dealing with PDFs in the late 90s.
and the only reason is that the required length of binary streams may be defined after the stream.
That's by far not the only reason; there are more situations that require a cross-reference lookup to parse correctly.
As @mkl indicated, a parser has to read somewhere before the end of the PDF file to get startxref, hoping that it does not start parsing in the middle of a binary stream.
That's not correct. The PDF must end with "%%EOF" plus optionally an end-of-line. Before that there must be an end-of-line, before that a number, before that an end-of-line, before that startxref.
This is already expressed clearly in ISO 32000-1:
The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section.
(ISO 32000-1, section 7.5.5 File Trailer)
Thus, no danger of being "in the middle of a binary stream" if the PDF is valid.
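As a rough illustration of that trailer layout (a sketch under the assumption of a valid, undamaged file; the 1024-byte tail window is an arbitrary choice, not something the standard prescribes), a reader can locate the last cross-reference section like this:

# Sketch: find the byte offset of the last cross-reference section by reading
# the file tail, relying on the startxref / offset / %%EOF layout quoted above.
def find_last_xref_offset(data: bytes) -> int:
    tail = data[-1024:]                       # the trailer sits near the end
    idx = tail.rfind(b"startxref")
    if idx < 0:
        raise ValueError("startxref not found in the file trailer")
    after = tail[idx + len(b"startxref"):]
    offset_line = after.split(b"%%EOF")[0].strip()
    return int(offset_line)                   # offset of the last xref section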
The other thing I dislike about the format of PDF is this: when developing a parser, you usually create test files with some elements you are working on. This approach seems to work with everything but streams. The absolute file positions of syntax elements and the requirement for multiple random accesses make this task harder.
You seem to be subject to the misconception that the PDF format is a tagged text format like HTML. This is not the case. Even though numerous syntactical elements are defined using some ASCII keyword and there are "lines", PDF is a binary format, the cross reference tables are not a gimmick but the central access hub to the objects, and optimization for random access is done by design.

Related

How to detect if user selected .txt file is Unicode/UTF-8 format and Convert to ANSI

My non-Unicode Delphi 7 application allows users to open .txt files.
Sometimes they try to open UTF-8/Unicode .txt files, which causes problems.
I need a function that detects whether the user is opening a .txt file with UTF-8 or Unicode encoding and converts it to the system's default code page (ANSI) encoding automatically when possible, so that it can be used by the app.
In cases when converting is not possible, the function should return an error.
The ReturnAsAnsiText(filename) function should open the .txt file and perform the detection and conversion in steps like this:
If the byte stream has no byte values over 0x7F, it's ANSI; return it as is.
If the byte stream has byte values over 0x7F, convert from UTF-8.
If the stream has a BOM, try Unicode conversion.
If conversion to the system's current code page is not possible, return NULL to indicate an error.
It is an acceptable limitation for this function that users can only open files that match their region/code page (the Control Panel Region setting for non-Unicode apps).
The conversion function ReturnAsAnsiText, as you designed it, will have a number of issues:
The Delphi 7 application may not be able to open files whose filenames use UTF-8 or UTF-16.
UTF-8 (and other Unicode) usage has increased significantly since 2019. Current web pages are between 98% and 100% UTF-8, depending on the language.
Your design will incorrectly translate some text that a standards-compliant handler would process correctly.
Creating the ReturnAsAnsiText is beyond the scope of an answer, but you should look at locating a library you can use instead of creating a new function. I haven't used Delphi 2005 (I believe that is 7), but I found this MIT licensed library that may get you there. It has a number of caveats:
It doesn't support all forms of BOM.
It doesn't support all encodings.
There is no universal "best-fit" behavior for single-byte character sets.
There are other issues that are tangentially described in this question. You wouldn't use an external command, but I used one here to demonstrate the point:
% iconv -f utf-8 -t ascii//TRANSLIT < hello.utf8
^h'elloe
iconv: (stdin):1:6: cannot convert
% iconv -f utf-8 -t ascii < hello.utf8
iconv: (stdin):1:0: cannot convert
Enabling TRANSLIT in standards-based libraries supports converting characters like é to the ASCII e, but it still fails on characters like π, since there is no ASCII character of similar form.
Your required answer would need massive UTF-8 and UTF-16 translation tables for every supported code page and BMP, and would still be unable to reliably detect the source encoding.
Notepad has trouble with this issue.
The solution as requested, would probably entail more effort than you put into the original program.
Possible solutions
Add a text editor into your program. If you write it, you will be able to read it.
The following solution pushes the translation to established tables provided by Windows.
Use native Win32 API calls to translate strings with functions like WideCharToMultiByte, but even this has its drawbacks (from the referenced page, the note is more relevant to the topic, but the caution is important for security):
Caution  Using the WideCharToMultiByte function incorrectly can compromise the security of your application. Calling this function can easily cause a buffer overrun because the size of the input buffer indicated by lpWideCharStr equals the number of characters in the Unicode string, while the size of the output buffer indicated by lpMultiByteStr equals the number of bytes. To avoid a buffer overrun, your application must specify a buffer size appropriate for the data type the buffer receives.
Data converted from UTF-16 to non-Unicode encodings is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data. For more information, see Security Considerations: International Features.
Note  The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.
This solution still has the guess-the-encoding problem, but if a BOM is present, this is one of the best translators possible (a rough sketch follows after this list).
Simply require the text file to be saved in the local code page.
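For what it's worth, here is a very rough Python analogue of the BOM-plus-fallback idea, standing in for the Delphi/Win32 code (the cp1252 target code page is an assumption; a real Windows application would query the system's ANSI code page, e.g. via GetACP, and all the caveats quoted above still apply):

import codecs

# Rough sketch of the detection/conversion rules discussed above.
def return_as_ansi_text(filename: str, ansi_codepage: str = "cp1252"):
    raw = open(filename, "rb").read()
    if raw.startswith(codecs.BOM_UTF8):
        text = raw[len(codecs.BOM_UTF8):].decode("utf-8")
    elif raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        text = raw.decode("utf-16")            # the BOM selects the byte order
    else:
        try:
            text = raw.decode("utf-8")         # also covers pure ASCII
        except UnicodeDecodeError:
            return raw                         # guess: already in the ANSI code page
    try:
        return text.encode(ansi_codepage)      # convert to the local code page
    except UnicodeEncodeError:
        return None                            # not representable -> signal an error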
Other thoughts:
ANSI, ASCII, and UTF-8 are all separate encodings above 127 and the control characters are handled differently.
In UTF-16, every other byte (zero first) of ASCII-encoded text is 0. This is not covered in your "rules" (see the small heuristic sketched after this list).
You simply have to search for the Turkish i to understand the complexities of Unicode translations and comparisons.
Leverage any expectations of the file contents to establish a coherent baseline comparison to make an educated guess.
For example, if it is a .csv file, find a comma in the various formats...
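A tiny sketch of that zero-byte hint, with a threshold that is purely an arbitrary guess:

# Heuristic only: ASCII text stored as UTF-16 has a NUL in (roughly) every
# other byte, which ANSI or UTF-8 text essentially never has.
def looks_like_utf16(raw: bytes) -> bool:
    sample = raw[:4096]
    return bool(sample) and sample.count(0) > len(sample) // 4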
Bottom Line
There is no perfect general solution, only specific solutions tailored to your specific needs, which were extremely broad in the question.

Mixing ASCII and Binary for record delimiters

My requirements are to write binary records to a file. The binary records can be thought of as raw bytes in memory. I need a way to delimit each record, so that I can do something similar to a binary search on the file: for example, start in the middle of the file, find the next record delimiter, and start the search from there.
My question is: can an ASCII marker such as "START-RECORD" be used to delimit the binary records?
START-RECORD, data-length, .......binary data...........START-RECORD, data-length, .......binary data...........
When starting from an arbitrary position within the file, I can simply search for the ASCII string "START-RECORD". Is this approach feasible?
Not in a single "text" pass, since you are reading in binary mode either way: if you insert some string or other pattern as a "delimiter", you need to search for its byte representation while reading the file.
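A minimal Python sketch of that search, assuming a hypothetical record layout in which a 4-byte big-endian length field follows the marker (the marker text, chunk size, and length format are assumptions taken from the question, not a defined format). Note that, just as with the endstream keyword in the first question above, the marker bytes can themselves occur inside a binary payload, so a robust format should escape them or verify that another marker follows the payload.

import struct

MARKER = b"START-RECORD"

# Sketch: scan forward from an arbitrary offset for the marker, then read one
# length-prefixed binary record. Returns (record_offset, payload) or None.
def next_record(f, start_offset: int, chunk_size: int = 1 << 16):
    f.seek(start_offset)
    pos = start_offset                        # absolute offset of the start of buf
    buf = b""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return None                       # no further marker in the file
        buf += chunk
        idx = buf.find(MARKER)
        if idx >= 0:
            record_offset = pos + idx
            f.seek(record_offset + len(MARKER))
            (length,) = struct.unpack(">I", f.read(4))   # the data-length field
            return record_offset, f.read(length)
        keep = len(MARKER) - 1                # keep an overlap so a marker split
        if len(buf) > keep:                   # across two chunks is still found
            pos += len(buf) - keep
            buf = buf[-keep:]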

Determine if received data is PostScript or PCL

I have a service that receives printer data via tcp/ip. When the data is received, is there reliable, efficient way to examine the data stream and determine if the data is PostScript vs PCL data? For example, are there characters I could look for at the beginning of the data stream to indicate the format?
I would probably just count the number of escape characters in the file. PCL will have gobs of them. Postscript will have gobs of % signs. That isn't a perfect solution, but it's dead simple and I'll bet it would actually be quite reliable.
The only "real" way I can see doing this is to actually parse the PCL and parse the postscript and see which one works.
I'll add my 2¢.
Like others have mentioned here, your first stab at programmatically identifying the document would be to look at the first two characters. If it starts with %!, it is PostScript; if it starts with an escape character (hex 1B, octal 033, ASCII 27), it is PCL, as a PCL file will very likely start with PCL commands. This will likely resolve 99% of the documents you need to process. If the format still isn't known, you can search the document for a showpage string: if it's PostScript, it has to have a showpage to render the page. If you can't find one, and there are any escape characters in the file, you know it is PCL. And you can err on the side of PCL if there is no showpage and there are no escape characters, because raw text files are valid PCL and printers can blort them out as they come.
PostScript data must begin with "%!ps" or "%!PS" - it may be a longer readable string like "%!PS-Adobe-3.0" - but that is basically it.
Most likely PCL has a similar signature - I remember seeing it in the past.
According to the PCL 5 General Printing FAQs, PCL files should start with ESC "E". I assume another ESC sequence must follow. So my guess is that files starting with the bytes 1B 45 1B are most likely PCL files.
This leaves unrecognized those PCL files which don't adhere to this rule.
In my use case it's macOS that always produces PCL with the ESC E at the beginning.
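Pulling the checks from the answers above together, a hedged Python sketch might look like this (the sniff length and the order of the fallbacks are my own choices, not from any specification):

# Heuristic sniffer combining the checks described above: magic bytes first,
# then the showpage / escape-character fallbacks.
def sniff_print_format(data: bytes) -> str:
    if data.startswith(b"%!"):                # e.g. "%!PS" or "%!PS-Adobe-3.0"
        return "postscript"
    if data.startswith(b"\x1b"):              # ESC: very likely a PCL command
        return "pcl"
    head = data[:65536]
    if b"showpage" in head:                   # PostScript has to showpage a page
        return "postscript"
    if b"\x1b" in head:                       # escape characters but no showpage
        return "pcl"
    return "pcl"                              # raw text is valid PCL; err that way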

Parsing PDF files

I'm finding it difficult to parse a PDF file that was created in a non-English language. I used pdfbox and itext but couldn't find anything in there that could help parse this file. Here's the PDF file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says that it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (Also Known As code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced, the second is a Name of a character that replaces the original value at that code point.
There are no such character names as BB, BP, BQ, C9... and so on. So when you copy-paste that text, you get the above garbage.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
- Render it to PDF by itself using the same LaTeX/Ghostscript versions.
- Open the PDF and find the CharProc for that particular known character.
- Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
- Get the glyph name for the given byte based on the existing encoding array.
- Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...
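If you do go down that road, the table-building part could look roughly like the following Python sketch. Everything here is hypothetical: it assumes you have already used a PDF library to extract the decoded CharProc streams, both from your reference PDFs (known, mapping a real character to its stream bytes) and from the broken PDF (encoding, mapping byte values to glyph names, and charprocs, mapping glyph names to their stream bytes). It also matches streams by exact byte equality, which may be too strict in practice.

import hashlib

# Sketch: build a byte-value -> character translation table by matching the
# broken PDF's Type 3 CharProc streams against streams of known characters.
def build_translation_table(known: dict, charprocs: dict, encoding: dict) -> dict:
    by_digest = {hashlib.sha1(stream).hexdigest(): char
                 for char, stream in known.items()}
    table = {}
    for byte_value, glyph_name in encoding.items():
        stream = charprocs.get(glyph_name)
        if stream is None:
            continue                          # glyph never defined; skip it
        char = by_digest.get(hashlib.sha1(stream).hexdigest())
        if char is not None:
            table[byte_value] = char          # unmatched bytes need manual review
    return table

def translate(text_bytes: bytes, table: dict) -> str:
    return "".join(table.get(b, "\ufffd") for b in text_bytes)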

Erlang, reading a file with character offset

I have code to find a specific occurrence of text in a file and give me an offset, so I know where this occurrence ends. Now I want to read the file from that offset to the end of the file. The file contains binary data as well as text. How do I do this in Erlang?
Use file:pread/3 (see the Erlang documentation on the file module): open the file in [read, binary] mode and read from your offset; to read up to the end, get the file size first (for example with filelib:file_size/1). You have to take care of any character encoding yourself, as the function deals only with bytes.
