OCaml file reader - parsing

I am defining a language in OCaml with ocamllex and ocamlyacc. The input for this language is a stream of ints read from a file, for example:
1
2
3

open_in takes a file name and returns a channel for this file. Here you give it stdin which is already a channel.
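The same distinction between a file name and an already-open channel exists in most languages. As a rough illustration in Python rather than OCaml (the helper name read_ints is hypothetical and not part of the question's code), the function below expects an open stream, so standard input is passed as-is while a named file has to be opened first:

```python
import io

def read_ints(stream):
    """Read one integer per line from an already-open text stream."""
    return [int(line) for line in stream if line.strip()]

# An in-memory stream stands in for the input file from the question.
sample = io.StringIO("1\n2\n3\n")
print(read_ints(sample))

# Standard input is already a stream:  read_ints(sys.stdin)
# A named file must be opened first:   with open(name) as f: read_ints(f)
```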

Related

How to parse a binary PDF stream of unknown length?

From the PDF docs: "The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes."
As the contents may be binary, an occurrence of endstream does not necessarily indicate the end of the stream. Now when considering this stream:
%PDF-1.4
%307쏢
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
x234+T03203T0^#A(235234˥^_d256220^314^U310^E^#[364^F!endstream
endobj
6 0 obj
30
endobj
The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed.
I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect. While it may help PDF writers to output PDFs sequentially, it makes parsing quite difficult for PDF readers. Considering that a PDF file is read far more often than it is written, I don't understand this.
So how can such a stream be parsed correctly?
"The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed."
This is an understandable conclusion if one assumes that the file is to be read sequentially beginning to end.
This assumption is incorrect, though, because parsing a PDF from the front and determining the PDF objects on the run is not the recommended way of parsing a PDF.
While ISO 32000-1 is a bit vague here and merely says
"Conforming readers should read a PDF file from its end."
(ISO 32000-1, section 7.5.5 File Trailer)
ISO 32000-2 clearly specifies:
"With the exception of linearized PDF files, all PDF files should be read using the trailer and cross-reference table as described in the following subclauses. Reading a non-linearized file in a serial manner is not reliable because of the way objects are to be processed after an incremental update. (See 6.3.2, "Conformance of PDF processors".)"
(ISO 32000-2, section 7.5 File structure)
Thus, in the case of your PDF excerpt, a PDF processor trying to read object 5 0
1. looks up object 5 0 in the cross references and gets its offset in the file,
2. goes to that offset and starts reading the object, first parsing the stream dictionary,
3. at the stream keyword recognizes that the object is a stream and retrieves its Length value, which happens to be an indirect reference to 6 0,
4. looks up object 6 0 in the cross references and gets its offset in the file,
5. goes to that offset and reads the object, the number 30,
6. reads the stream content of the stream object 5 0, knowing its length is 30.
An approach like yours is explicitly considered "not reliable".
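The lookup sequence described above can be sketched in Python. This is only an illustration under simplifying assumptions: the offsets dictionary is a hypothetical stand-in for an already-parsed cross-reference table, the regular expressions handle only flat dictionaries like the one in the excerpt, and the sample data is invented (its payload deliberately contains the bytes "endstream" to show why scanning for the keyword is unsafe).

```python
import re

# Simplified patterns: they cover the excerpt's shape, not the full PDF grammar.
STREAM_HEAD = re.compile(rb"(\d+) (\d+) obj\s*<<(.*?)>>\s*stream(?:\r\n|\n)", re.S)
INT_OBJECT = re.compile(rb"(\d+) (\d+) obj\s+(\d+)\s+endobj")
LENGTH_ENTRY = re.compile(rb"/Length\s+(\d+)(?:\s+(\d+)\s+R)?")
STREAM_END = re.compile(rb"\r?\n?endstream")

def read_integer_object(data, offsets, num):
    """Jump to an object via the offset table and read its integer body."""
    m = INT_OBJECT.match(data, offsets[num])
    if m is None or int(m.group(1)) != num:
        raise ValueError(f"object {num} is not a plain integer object")
    return int(m.group(3))

def read_stream(data, offsets, num):
    """Read the raw bytes of stream object `num`, resolving an indirect Length."""
    head = STREAM_HEAD.match(data, offsets[num])
    if head is None:
        raise ValueError(f"object {num} is not a stream object")
    entry = LENGTH_ENTRY.search(head.group(3))
    if entry is None:
        raise ValueError("stream dictionary has no /Length entry")
    if entry.group(2) is not None:
        # "/Length 6 0 R": look the referenced object up by its offset.
        length = read_integer_object(data, offsets, int(entry.group(1)))
    else:
        # "/Length 30": the value is given directly.
        length = int(entry.group(1))
    start = head.end()
    if STREAM_END.match(data, start + length) is None:
        raise ValueError("endstream not found where Length says it should be")
    return data[start:start + length]

def build_sample():
    """Build invented sample data plus the offsets a cross-reference table would give."""
    payload = b"x\x9c" + b"endstream" + bytes(19)  # 30 bytes, keyword included
    header = b"%PDF-1.4\n"
    obj5 = (b"5 0 obj\n<</Length 6 0 R/Filter /FlateDecode>>\nstream\n"
            + payload + b"endstream\nendobj\n")
    obj6 = b"6 0 obj\n30\nendobj\n"
    offsets = {5: len(header), 6: len(header) + len(obj5)}
    return header + obj5 + obj6, offsets, payload

data, offsets, payload = build_sample()
print(read_stream(data, offsets, 5) == payload)
```

Because the reader jumps to object 6 0 by offset, it does not matter that the length object is stored after the stream in the file.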
"I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect."
If there were no cross references, you would be correct. That is also why the FDF format (which does not have mandatory cross references) specifies:
"FDF is based on PDF; it uses the same syntax and has essentially the same file structure (7.5, "File structure"). However, it differs from PDF in the following ways:
[...]
The length of a stream shall not be specified by an indirect object."
(ISO 32000-2, section 12.7.8 Forms data format)
Concerning the comments:
"So I'm correct that PDF cannot be parsed sequentially,"
While the very original design of PDF was probably meant for sequential parsing, it has been developed further with only access via cross references in mind. PDF simply is not meant to be parsed sequentially anymore. That was already the case when I started dealing with PDFs in the late 90s.
"and the only reason is that the required length of binary streams may be defined after the stream."
That is by far not the only reason; there are more situations that require a cross-reference lookup to parse correctly.
"As @mkl indicated, a parser has to read somewhere before the end of the PDF file to get startxref, hoping that it does not start parsing in the middle of a binary stream."
That's not correct. The PDF must end with "%%EOF" plus optionally an end-of-line. Before that there must be an end-of-line, before that a number, before that an end-of-line, before that startxref.
This is already expressed clearly in ISO 32000-1:
"The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section."
(ISO 32000-1, section 7.5.5 File Trailer)
Thus, no danger of being "in the middle of a binary stream" if the PDF is valid.
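As a minimal sketch of that rule in Python, the function below looks only at the tail of the file and expects startxref, a number, and %%EOF as the final lines. The function name find_startxref, the 1024-byte tail size, and the tiny sample file are assumptions made for this illustration; real-world readers are usually more tolerant of trailing garbage.

```python
import re

# startxref, the byte offset, and %%EOF must be the last three lines.
TRAILER_END = re.compile(rb"startxref\s+(\d+)\s+%%EOF\s*$")

def find_startxref(data, tail_size=1024):
    """Return the cross-reference offset announced at the end of the file."""
    m = TRAILER_END.search(data[-tail_size:])
    if m is None:
        raise ValueError("file does not end with startxref / offset / %%EOF")
    return int(m.group(1))

def build_sample():
    """Assemble a tiny invented file whose startxref value is computed, not guessed."""
    body = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
    xref_offset = len(body)
    xref = (b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
            b"trailer\n<</Size 2>>\n")
    tail = b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
    return body + xref + tail, xref_offset

data, expected = build_sample()
offset = find_startxref(data)
print(offset == expected, data[offset:offset + 4])
```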
"The other thing I dislike about the format of PDF is this: When developing a parser, you usually create test files with some elements you are working on. This approach seems to work with everything but streams. The absolute file positions of syntax elements and the requirement for multiple random accesses makes this task harder."
You seem to be subject to the misconception that the PDF format is a tagged text format like HTML. This is not the case. Even though numerous syntactical elements are defined using some ASCII keyword and there are "lines", PDF is a binary format, the cross reference tables are not a gimmick but the central access hub to the objects, and optimization for random access is done by design.

How to print numbers read from a binary file in lua?

I have a binary file and I want to read its contents with Lua. I know that it contains float numbers represented as 4 bytes each, with no delimiters between them. So I open the file and do t=file:read(4). Now I want to print the non-binary representation of the number, but if I do print(t), I only get something like x98xC1x86. What should I do?
If you're running Lua 5.3, try this code:
t = file:read(4)
-- string.unpack takes the format string first, then the binary data
t = string.unpack("f", t)
print(t)
The library function string.unpack converts binary data to Lua types.
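For comparison, the same idea can be sketched in Python with the standard struct module. The helper name read_floats is made up for this example, and little-endian byte order is an assumption; the question does not say how the file was written.

```python
import io
import struct

def read_floats(stream, byte_order="<"):
    """Read consecutive 4-byte floats from a binary stream until it runs out."""
    values = []
    while True:
        chunk = stream.read(4)
        if len(chunk) < 4:
            break  # end of data (or a trailing partial value)
        (value,) = struct.unpack(byte_order + "f", chunk)
        values.append(value)
    return values

# Invented sample data: three floats packed back to back, no delimiters.
sample = io.BytesIO(struct.pack("<3f", 1.5, -2.25, 0.0))
print(read_floats(sample))
```

If the numbers come out as nonsense, the byte order is the first thing to check; passing ">" switches the sketch to big-endian.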

Is it possible to pipe HDF5 formatted data?

Is it possible to write HDF5 to stdout and read it from stdin (via H5::File file("/dev/stdout",H5F_ACC_RDONLY) or otherwise)?
What I want is to have a program foo to write to an HDF5 file (taken to be its first argument, say) and another program bar to read from an HDF5 file and then instead of
command_prompt> foo temp.h5
command_prompt> bar temp.h5
command_prompt> rm temp.h5
simply say
command_prompt> foo - | bar -
where the programs foo and bar understand the special file name - to mean stdout and stdin respectively. In order to write those programs, I want to know 1) whether this is at all possible and 2) how I implement this, i.e. what to pass to H5Fcreate() and H5Fopen(), respectively, in case file name = -.
I tried, and it seems impossible (not a big surprise). HDF5 only has H5Fcreate(), H5Fopen(), and H5Freopen(), none of which seems to support I/O to stdin/stdout.
I do not think you can use stdin as an HDF5 input file. The library needs to seek around between the header contents and the data, and you cannot do that with stdin when it is a pipe.
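The seeking limitation itself is easy to observe without HDF5. The following Python sketch (function names are made up for this illustration) asks the operating system whether a pipe and a regular file support seeking; it says nothing about HDF5's own API, only about the kind of file descriptor that foo - | bar - would hand to the reader.

```python
import os
import tempfile

def pipe_is_seekable():
    """Report whether the read end of an anonymous pipe supports seeking."""
    read_fd, write_fd = os.pipe()
    with os.fdopen(read_fd, "rb") as reader, os.fdopen(write_fd, "wb"):
        return reader.seekable()

def regular_file_is_seekable():
    """Report whether an ordinary temporary file supports seeking."""
    with tempfile.TemporaryFile() as f:
        return f.seekable()

print("pipe:", pipe_is_seekable())
print("regular file:", regular_file_is_seekable())
```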

Read/Parse Binary files with Powershell

I'm trying to parse a binary file, and I need some help on where to go. I've been looking online for "parsing binary files", "reading binary files", "reading text inside binaries", etc. and I haven't had any luck.
For example, how would I read this text out of this binary file? Any help would be MUCH appreciated. I am using powershell.
It seems that you have a binary file with text at a fixed or otherwise deducible position. Get-Content might help, but it will try to parse the entire file into an array of strings, creating an array of "garbage". You also would not know which file position a particular run of characters came from.
You can try .NET classes File to read and Encoding to decode. It's just a line for each call:
# Read the entire file to an array of bytes.
$bytes = [System.IO.File]::ReadAllBytes("path_to_the_file")
# Decode first 12 bytes to a text assuming ASCII encoding.
$text = [System.Text.Encoding]::ASCII.GetString($bytes, 0, 12)
In your real case you would probably loop through the array of bytes to find the start and end of a particular string sequence, and then pass those indices to GetString to specify the range of bytes to decode.
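The scanning loop described above might look like the following, sketched in Python rather than PowerShell. The function name find_ascii_strings, the printable-ASCII range, the minimum length of 4, and the sample bytes are all assumptions chosen for the illustration; a regular expression does the work of finding each run's start and end.

```python
import re

def find_ascii_strings(data, min_len=4):
    """Return (offset, text) pairs for runs of printable ASCII in a byte string."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]

# Invented sample: text fragments embedded between non-printable bytes.
data = b"\x00\x01HEADER v1.0\x00\xff\xfeab\x00payload\x00"
for offset, text in find_ascii_strings(data):
    print(offset, text)
```

Keeping the offset next to each string addresses the problem of not knowing where in the file a piece of text came from.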
The .NET methods I mentioned are available in .NET Framework 2.0 or higher. If you installed PowerShell 2.0 you already have it.
If you're just looking for strings, check out the strings.exe utility from SysInternals.
You can read in the file via Get-Content -Encoding Byte. I'm not sure how to parse it, though.

Erlang, reading a file with character offset

I have code that finds a specific occurrence of text in a file and gives me an offset, so I know where this occurrence ends. Now I want to read the file from that offset to the end of the file. The file contains binary data as well as text. How do I do this in Erlang?
Use pread (see the Erlang documentation on the file module). You have to take care of any character encoding yourself, as the function deals only with bytes.
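The underlying operation is a positioned read on a file opened in binary mode. A minimal sketch of the same idea in Python (not Erlang; the function name read_from_offset, the marker, and the sample file contents are invented for the illustration):

```python
import os
import tempfile

def read_from_offset(path, offset):
    """Return all bytes from `offset` to the end of the file, undecoded."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read()

# Invented sample: binary data with a text marker somewhere in the middle.
content = b"\x00\x01header MARKER\xff\xfebinary tail"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    path = tmp.name

# The offset just past the marker plays the role of the offset from the question.
offset = content.index(b"MARKER") + len(b"MARKER")
print(read_from_offset(path, offset))
os.remove(path)
```

As in the Erlang answer, the result is raw bytes; any text in it still has to be decoded with whatever encoding the file actually uses.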

Resources