Is it possible to parse a big file with ANTLR?

Is it possible to instruct ANTLR not to load the entire file into memory? Can it apply rules one by one and generate the topmost list of nodes sequentially while reading the file? And is it possible to drop already-analyzed nodes somehow?

Yes, you can use:
UnbufferedCharStream for your character stream (passed to the lexer)
UnbufferedTokenStream for your token stream (passed to the parser)
This token stream implementation doesn't differentiate on token channels, so make sure to use ->skip instead of ->channel(HIDDEN) as the command in lexer rules for tokens that shouldn't be sent to the parser.
Make sure to call setBuildParseTree(false) on your parser or a giant parse tree will be created for the entire file.
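For example, a minimal sketch of this setup for the ANTLR 4 Java target (MyLexer, MyParser, topRule and the file name are placeholders for your generated classes and start rule; the token-factory call is an extra detail not mentioned above, but it is generally needed with unbuffered streams so token text survives buffer release):

import java.io.FileReader;
import org.antlr.v4.runtime.*;

public class BigFileParse {
    public static void main(String[] args) throws Exception {
        // Read characters on demand instead of loading the whole file up front.
        CharStream input = new UnbufferedCharStream(new FileReader("huge-input.txt"));
        MyLexer lexer = new MyLexer(input);
        // Copy token text eagerly so tokens remain valid after the char stream releases its buffer.
        lexer.setTokenFactory(new CommonTokenFactory(true));
        // Token stream that does not buffer the entire token sequence.
        TokenStream tokens = new UnbufferedTokenStream(lexer);
        MyParser parser = new MyParser(tokens);
        parser.setBuildParseTree(false);   // don't build a giant tree for the whole file
        parser.topRule();                  // placeholder start rule; react via listeners/actions instead
    }
}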
Edit with some additional commentary:
I put quite a bit of work into making sure UnbufferedCharStream and UnbufferedTokenStream operate in the most "sane" manner possible, especially in relation to the mark, release, seek, and getText methods. My goal was to preserve as much of the functionality of those methods as possible without compromising the ability of the stream to release unused memory.
ANTLR 4 allows for true unlimited lookahead. If your grammar requires lookahead to EOF to make a decision, then you would not be able to avoid loading the entire input into memory. You'll have to take great care to avoid this situation when writing your grammar.

There is a Wiki page buried somewhere on Antlr.org that speaks to your question; I cannot seem to find it just now.
In substance, the lexer reads data using a standard InputStream interface, specifically ANTLRInputStream.java. The typical implementation is ANTLRFileStream.java, which preemptively reads the entire input data file into memory. What you need to do is write your own buffered version - "ANTLRBufferedFileStream.java" - that reads from the source file as needed. Or just set a standard BufferedInputStream/FileInputStream as the data source for the ANTLRInputStream.
One caveat is that Antlr4 has the potential for doing an unbounded lookahead. That is unlikely to be a problem with a reasonably sized buffer in normal operation; it is more likely when the parser attempts error recovery. Antlr4 allows for tailoring of the error recovery strategy, so the problem is manageable.
Additional detail:
In effect, Antlr implements a pull-parser. When you call the first parser rule, the parser requests tokens from the lexer, which requests character data from the input stream. The parser/lexer interface is implemented by a buffered token stream, nominally BufferedTokenStream.
The parse tree is little more than a tree data structure of tokens. Well, a lot more, but not in terms of data size. Each token is an INT value backed typically by a fragment of the input data stream that matched the token definition. The lexer itself does not require a full copy of the lex'd input character stream to be kept in memory. And, the token text fragments could be zero'd out. The critical memory requirement for the lexer is the input character stream lookahead scan, given a buffered file input stream.
Depending on your needs, the in-memory parse tree can be small even given a 100GB+ input file.
To help further, you need to explain more what it is you are trying to do in Antlr and what defines your minimum critical memory requirement. That will guide which additional strategies can be recommended. For example, if the source data is amenable, you can use multiple lexer/parser runs, each time subselecting in the lexer different portions of the source data to process. Compared to file reads and DB writes, even with fast disks, Antlr execution will likely be barely noticeable.

Related

Why would I use a lexer and not directly parse code?

I am trying to create a simple programming language from scratch (an interpreter), but I wonder why I should use a lexer.
To me, it looks like it would be easier to create a parser that directly parses the code. What am I overlooking?
I think you'll agree that most languages (likely including the one you are implementing) have conceptual tokens:
operators, e.g., * (usually multiply), '(', ')', ';'
keywords, e.g., "IF", "GOTO"
identifiers, e.g., FOO, count, ...
numbers, e.g., 0, -527.23E-41
comments, e.g., /* this text is ignored in your file */
whitespace, e.g., sequences of blanks, tabs and newlines, that are ignored
As a practical matter, it takes a specific chunk of code to scan for and collect the characters that make up each individual token. You'll need such a code chunk for each type of token your language has.
If you write a parser without a lexer, at each point where your parser is trying to decide what comes next, you'll have to have ALL the code that recognizes the tokens that might occur at that point in the parse. At the next parser point, you'll need all the code to recognize the tokens that are possible there. This gives you an immense amount of code duplication; how many times do you want the code for blanks to occur in your parser?
If you think that's not a good way, the obvious cure is to remove all the duplication: place the code for each token in a subroutine for that token, and at each parser place, call the subroutines for the tokens. At this point, in some sense, you already have a lexer: an isolated collection of code to recognize tokens. You can code perfectly fine recursive descent parsers this way.
The next thing you'll discover is that you call the token subroutines for many of the tokens at each parser point. Even that seems like a lot of work and duplication. So, replace all the calls with a single "GetNextToken" call that itself invokes the token-recognizing code for all tokens and returns an enum identifying the specific token encountered. Now your parser starts to look reasonable: at each parser point, it makes one call to GetNextToken and then branches on the enum returned. This is basically the interface that people have standardized on as a "lexer".
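To make that concrete, here is a hypothetical hand-rolled GetNextToken in Java (the token kinds and the tiny per-token scanning loops are illustrative only, not a complete language):

enum TokenKind { NUMBER, IDENTIFIER, LPAREN, RPAREN, STAR, EOF }

class HandLexer {
    private final String src;
    private int pos = 0;

    HandLexer(String src) { this.src = src; }

    TokenKind getNextToken() {
        // Whitespace is skipped here, once, instead of at every parser point.
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        if (pos >= src.length()) return TokenKind.EOF;
        char c = src.charAt(pos);
        if (Character.isDigit(c)) {
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
            return TokenKind.NUMBER;
        }
        if (Character.isLetter(c)) {
            while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
            return TokenKind.IDENTIFIER;
        }
        pos++;
        switch (c) {
            case '(': return TokenKind.LPAREN;
            case ')': return TokenKind.RPAREN;
            case '*': return TokenKind.STAR;
            default:  throw new IllegalArgumentException("unexpected character: " + c);
        }
    }
}

The parser then becomes a loop of getNextToken() calls plus a branch on the returned enum; merging these per-token loops into one finite state machine (by hand or with a tool like FLEX) is the next step described below.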
One thing you will discover is that the individual token recognizers sometimes have trouble with overlaps; keywords and identifiers usually have this trouble. It is actually easier to merge all the token recognizers into a single finite state machine, which can then distinguish the tokens more easily. This also turns out to be spectacularly fast when processing programming language source text. Your toy language may never parse more than 100 lines, but real compilers process millions of lines of code a day, and most of that time is spent doing token recognition ("lexing"), especially white-space suppression.
You can code this state machine by hand. This isn't hard, but it is rather tedious. Or you can use a tool like FLEX to do it for you; that's just a matter of convenience. As the number of different kinds of tokens in your language grows, the FLEX solution gets more and more attractive.
TLDR: Your parser is easier to write, and less bulky, if you use a lexer. In addition, if you compile the individual lexemes into a state machine (by hand or using a "lexer generator"), it will run faster and that's important.
Well, for an intelligently simplified programming language you can get away without either a lexer or a parser :-) Not kidding. Look up Forth. You can start with the tags here on SO (gforth is GNU's) and then go to the Standard's site, which has pointers to a few interpreters, sites and its Glossary.
Then you can check out Win32Forth and that should keep you busy for quite a while :-)
The interpreter also compiles (when you invoke words that switch the system into compilation context). All without a distinct parser. Lookahead is actually lookbehind :-) - not kidding. It only rarely absorbs the one following word (i.e. lookahead is at most 1). The "words" (aka tokens) are at the same time keywords and variable names, and they all live in a Dictionary. There's a whole online book at that site (plus a PDF).
Control structures are also just words (they compile a few addresses and jumps on the fly).
You can find old Journals there as well, covering a wide spectrum from machine-code generation to object-oriented extensions. Yes, still without a parser, believe it or not.
There used to be more sophisticated (commercial) Forth systems that reduced words to machine call instructions with immediate addressing (which makes the engine run 2-4 times faster), but even plain interpreters were always considered fast. One is apparently still active - SwiftForth - but don't expect any freebies there.
There's one Forth on GitHub, CiForth, which is quite spartan but has builds and releases for Windows, Linux and Mac, 32- and 64-bit, so you can just download and run it. It claims to have a 16-bit build as well :-) for embedded systems, I suppose.

Scanner and parser interaction

I am new to flex/bison. Reading books, it seems that in nearly all compiler implementations the parser interacts with the scanner in a "coroutine" manner: whenever the parser needs a token, it calls the scanner to get one, and leaves the scanner aside while it is busy with shift/reduce actions. A natural question is why not let the scanner produce the token stream (from the input byte stream) as a whole, and then pass the entire token stream to the parser, so that there is no explicit interaction between the two? I can imagine that there are some drawbacks to this approach, and I can also see some benefits of doing so.
My question is: is there a sort of "comprehensive" discussion of that aspect, or is there any compiler implementation that uses a scanner/parser interaction scheme other than the "coroutine" manner?
In the traditional arrangement, the parser calls the scanner whenever it needs a token.
That's the same logic as used in the scanner (and many other programs), which calls the I/O library every time it needs more input. That's not usually described as a coroutine, and I'm not convinced it's an accurate description of the parser/scanner interaction either.
In coroutine control flow, two functions call each other in tandem. That's not usually the way I/O is handled. The fread() interface does maintain state for the next call (the file position, at least, and maybe a buffer), but the calls are self-contained.
In a sense, there is no difference between calling yylex() to get the next token and calling scanf() to get the next data value.
This is not always the most convenient architecture for a scanner. Sometimes, it would be convenient for the scanner to be able to feed tokens into the parser. A typical use case is when the scanner is generating tokens, for example through macro expansion, but sometimes it is just that the match of a single scanner pattern contains more than one token.
Many parser generators, including Bison, can generate callable parsers, usually called "push parsers". In this model, the scanner calls the parser with each successive token. This is still not a coroutine model, really; it is just control-flow inversion. In the analogy with ordinary I/O, it's the equivalent of taking a data processor which called fgets() to read each input line and rewriting it as a process_line() function which is given a line of data to process (and thus does not interact with the I/O library). An early implementation of push parsing can be found in the Lemon parser generator.
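To illustrate the difference in control flow, here is a hedged Java-flavored sketch of the idea only (Bison's real push interface is the generated yypush_parse() function in C; the interface names here are hypothetical):

interface Token { }
interface ParseResult { }

interface Scanner {
    Token nextToken();                    // returns an EOF token (or null) at end of input
}

// Pull style: the parser owns the loop and asks the scanner for each token.
interface PullParser {
    ParseResult parse(Scanner scanner);   // internally calls scanner.nextToken() until done
}

// Push style (inverted control flow): the caller owns the loop and feeds tokens in;
// the parser keeps its state (e.g. its LR stack) between calls.
interface PushParser {
    void push(Token token);               // advance the parse by one token
    ParseResult finish();                 // signal end of input and retrieve the result
}

Either shape carries the same information; what changes is which side owns the main loop.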
Coroutine-like control flow could be useful for creating a parser whose eventual input stream must be handled asynchronously. But that doesn't really require coroutining between the parser and the scanner; rather, it requires coroutining between the scanner and the input stream. Again, coroutining is not really necessary and might be overkill: inverting control flow should suffice. Flex does not provide a "push scanner" interface, but other scanner generators do. I believe this feature is supported by Re2c, for example.

Designing a Language Lexer

I'm currently in the process of creating a programming language. I've laid out my entire design and am in progress of creating the Lexer for it. I have created numerous lexers and lexer generators in the past, but have never come to adopt the "standard", if one exists.
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
I ask because the way I design mine, they look like the following:
Code:
int main() {
printf("Hello, World!");
}
Lexer:
[
KEYWORD:INT, IDENTIFIER:"main", LEFT_ROUND_BRACKET, RIGHT_ROUND_BRACKET, LEFT_CURLY_BRACKET,
IDENTIFIER:"printf", LEFT_ROUND_BRACKET, STRING:"Hello, World!", RIGHT_ROUND_BRACKET, SEMICOLON,
RIGHT_CURLY_BRACKET
]
Is this the way lexers should be made? Also, as a side note, what should my next step be after creating a lexer? I don't really want to use something such as ANTLR or Lex+Yacc or Flex+Bison, etc.; I'm doing it from scratch.
If you don't want to use a parser generator [Note 1], then it is absolutely up to you how your lexer provides information to your parser.
Even if you do use a parser generator, there are many details which are going to be project-dependent. Sometimes it is convenient for the lexer to call the parser with each token; other times it is easier if the parser calls the lexer; in some cases, you'll want to have a driver which interacts separately with each component. And clearly, the precise datatype(s) of your tokens will vary from project to project, which can have an impact on how you communicate as well.
Personally, I would avoid use of global variables (as in the original yacc/lex protocol), but that's a general style issue.
Most lexers work in streaming mode, rather than tokenizing the entire input and then handing the vector of tokens to some higher power. Tokenizing one token at a time has a number of advantages, particularly if the tokenization is context-dependent, and, let's face it, almost all languages have some impurity somewhere in their syntax. But, again, that's entirely up to you.
Good luck with your project.
Notes:
1. Do you also forgo the use of compilers and write all your code from scratch in assembler or even binary?
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
In the lexers I've looked at, the canonical API is pretty minimal. It's basically:
Token readNextToken();
The lexer maintains a reference to the source text and an internal pointer to where it is currently looking. Then, every time you call that, it scans and returns the next token.
The Token type usually has:
A "type" enum for which kind of token it is: string, operator, identifier, etc. There are usually special kinds for "EOF", meaning a special terminator token that is produced after the end of the input, and "ERROR" for the rare cases where a syntax error comes from the lexical grammar. This is mainly just unterminated string literals or totally unknown characters in the source.
The source text of the token.
Sometimes literals are converted to their proper value representation during lexing in which case you'll have that value too. So a number token would have "123" as text but also have the numeric value 123. Or you can do that during parsing/compilation.
Location within the source file of the token. This is for error reporting. Usually 1-based line and column, but can also just be start and end byte offsets. The latter is a little faster to produce and can be converted to line and column lazily if needed.
Depending on your grammar, you may need to be able to rewind the lexer too.
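Putting those pieces together, here is a hedged Java sketch of that minimal API (the names and fields are illustrative, not a standard interface):

enum TokenType { KEYWORD, IDENTIFIER, OPERATOR, STRING, NUMBER, EOF, ERROR }

final class Token {
    final TokenType type;
    final String text;       // the matched source text
    final Object value;      // e.g. Integer 123 for the text "123"; null if conversion happens later
    final int start, end;    // byte/char offsets; convert to line/column lazily for error reporting

    Token(TokenType type, String text, Object value, int start, int end) {
        this.type = type; this.text = text; this.value = value;
        this.start = start; this.end = end;
    }
}

interface Lexer {
    Token readNextToken();   // keeps returning an EOF token once input is exhausted
    void pushBack(Token t);  // optional, for grammars that need to rewind
}

A parser using it just calls readNextToken() in a loop and branches on type, optionally pushing a token back when the grammar needs to rewind.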

How to parse a very large file in F# using FParsec

I'm trying to parse a very large file using FParsec. The file's size is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>), rather than a list, if possible. Can this be done with FParsec? (I've come up with a jerry-rigged implementation that actually does this, but it doesn't work well in practice due to the O(n) performance of CharStream.Seek.)
The file is line-oriented (one record per line), which should make it possible in theory to parse in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says:
If you’re dealing with large input files or very slow parsers, it might also be worth trying to parse multiple sections within a single file in parallel. For this to be efficient there must be a fast way to find the start and end points of such sections. For example, if you are parsing a large serialized data structure, the format might allow you to easily skip over segments within the file, so that you can chop up the input into multiple independent parts that can be parsed in parallel. Another example could be a programming language whose grammar makes it easy to skip over a complete class or function definition, e.g. by finding the closing brace or by interpreting the indentation. In this case it might be worth not parsing the definitions directly when they are encountered, but instead skipping over them, pushing their text content into a queue and then processing that queue in parallel.
This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?
FWIW, the file I'm trying to parse is here if anyone wants to give it a try with me. :)
The "obvious" thing that comes to mind, would be pre-processing the file using something like File.ReadLines and then parsing one line at a time.
If this doesn't work (your PDF looked, like a record is a few lines long), then you can make a seq of records or 1000 records or something like that using normal FileStream reading. This would not need to know details of the record, but it would be convenient, if you can at least delimit the records.
Either way, you end up with a lazy seq that the parser can then read.
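For concreteness, here is the same lazy-batching idea sketched in Java, since FParsec itself is F#-specific; this only shows the shape of the approach. parseBatch is a hypothetical stand-in for running the real record parser (e.g. FParsec's runParserOnString over each record), and the file name and batch size are made up:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class BatchedRecordReader {
    // Placeholder: run the real per-record parser here, possibly on a worker thread.
    static void parseBatch(List<String> records) { }

    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("huge-file.txt"))) {
            List<String> batch = new ArrayList<>(1000);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);                    // only the current batch is held in memory
                if (batch.size() == 1000) {
                    parseBatch(batch);
                    batch = new ArrayList<>(1000);
                }
            }
            if (!batch.isEmpty()) parseBatch(batch);
        }
    }
}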

Using Haskell's Parsec to parse binary files?

Parsec is designed to parse textual information, but it occurs to me that Parsec could also be suitable to do binary file format parsing for complex formats that involve conditional segments, out-of-order segments, etc.
Is there an ability to do this or a similar, alternative package that does this? If not, what is the best way in Haskell to parse binary file formats?
The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Binary is the most general solution, Cereal can be great for limited data sizes, and attoparsec is perfectly fine for e.g. packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on hackage as well.
You might be interested in AttoParsec, which was designed for this purpose, I think.
I've used Data.Binary successfully.
It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input, and they can be fed chunks of data incrementally as they become available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.
The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats that are primarily to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
On the other hand if your binary data doesn't fit into this kind of world then you will need attoparsec. I've never used that, so I can't comment further, but this blog post is very positive.
