How to parse an input piece by piece with ANTLR?

How to parse an input piece by piece with ANTLR? - parsing

I am parsing an input that is unknown, so the parser might fail, but I want to parse it as much as possible.
Also, the input can be very big (> 1 Go).
Let's say the parser parses items (represented by letters) and this is the input :
A
B
C
D
E
I want to parse this input piece by piece. I can't give it the entire input because :
it can be too big
a failure at an item could provoke failures on the item(s) following.
I don't want to cut the input arbitrarily because :
If I cut at the wrong place, it will create errors (cutting in the middle of B for example).
If I try to not cut at the wrong place, I end up "preparsing" the input. (preparsing means that I will have the same issues for the preparsing as I have for the parsing, and the grammar is complex, items can be nested, so preparsing is complicated)
My current solution is to setup my grammar this way :
blind_parsing
: blind_statement swallow_to_eof
;
swallow_to_eof
: ~(EOF)*
;
The parser parses ONE item and swallows the rest in the rule swallow_to_eof.
I give the parser a partial input and complete it, piece by piece.
I don't like this solution :
the items can vary greatly (thousands to millions of characters), so I give the parser big pieces to make sure I don't accidentally cut the biggest items in two.
the performances are poor :
the size of the parsing inputs are big (previous point)
we keep parsing the same elements, dumping them in the swallow_to_eof rule (the example above is parsed in 5 times if everything goes well, and that sounds very inefficient :
A BCDE
B CDE
C DE
D E
E
Maybe there is an obvious solution to this issue and I missed it.
How do you solve this problem ?
Thank you :)

This is known as incremental parsing and no, ANTLR4 does not support this out of the box. In the past there were a number of discussions about this matter, but I don't remember having seen a reliable solution yet.

Related

ANTLR4 - Parse subset of a language (e.g. just query statements)

I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!

Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind however, that subrules usually are not termined with the EOF token (as the top level rule should be). This will cause no syntax error if more than the subelement is in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you then add a copy of the subrule you wanna parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.

This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)

Searching/predicting next terminal/non-terminal by CFG/Tree?

I'm looking for algorithm to help me predict next token given a string/prefix and Context free grammar.
First question is what is the exact structure representing CFG. It seems it is a tree, but what type of tree ? I'm asking because the leaves are always ordered , is there a ordered-tree ?
May be if i know the correct structure I can find algorithm for bottom-up search !
If it is not exactly a Search problem, then the next closest thing it looks like Parsing the prefix-string and then Generating the next-token ? How do I do that ?
any ideas
my current generated grammar is simple it has no OR rules (except when i decide to reuse the grammar for new sequences, i will be). It is generated by Sequitur algo and is so called SLG(single line grammar) .. but if I generate it using many seq's the TOP rule will be Ex:>
S : S1 z S3 | u S2 .. S5 S1 | S4 S2 .. |... | Sn
S1 : a b
S2 : h u y
...
..i.e. top-heavy SLG, except the top rule all others do not have OR |
As a side note I'm thinking of a ways to convert it to Prolog and/or DCG program, where may be there is easier way to do what I want easily ?! what do you think ?

TL;DR: In abstract, this is a hard problem. But it can be pretty simple for given grammars. Everything depends on the nature of the grammar.
The basic algorithm indeed starts by using some parsing algorithm on the prefix. A rough prediction can then be made by attempting to continue the parse with each possible token, retaining only those which do not produce immediate errors.
That will certainly give you a list which includes all of the possible continuations. But the list may also include tokens which cannot appear in a correct input. Indeed, it is possible that the correct list is empty (because the given prefix is not the prefix of any correct input); this will happen if the parsing algorithm is unable to correctly verify whether a token sequence is a possible prefix.
In part, this will depend on the grammar itself. If the grammar is LR(1), for example, then the LR(1) parsing algorithm can precisely identify the continuation set. If the grammar is LR(k) for some k>1, then it is theoretically possible to produce an LR(1) grammar for the same language, but the resulting grammar might be impractically large. Otherwise, you might have to settle for "false positives". That might be acceptable if your goal is to provide tab-completion, but in other circumstances it might not be so useful.
The precise datastructure used to perform the internal parse and exploration of alternatives will depend on the parsing algorithm used. Many parsing algorithms, including the standard LR parsing algorithm whose internal data structure is a simple stack, feature a mutable internal state which is not really suitable for the exploration step; you could adapt such an algorithm by making a copy of the entire internal data structure (that is, the stack) before proceeding with each trial token. Alternatively, you could implement a copy-on-write stack. But the parser stack is not usually very big, so copying it each time is generally feasible. (That's what Bison does to produce expanded error messages with an "expected token" list, and it doesn't seem to trigger unacceptable runtime overhead in practice.)
Alternatively, you could use some variant of CYK chart parsing (or a GLR algorithm like the Earley algorithm), whose internal data structures can be implemented in a way which doesn't involve destructive modification. Such algorithms are generally used for grammars which are not LR(1), since they can cope with any CFG although highly ambiguous grammars can take a long time to parse (proportional to the cube of the input length). As mentioned above, though, you will get false positives from such algorithms.
If false positives are unacceptable, then you could use some kind of heuristic search to attempt to find an input sequence which completes the trial prefix. This can in theory take quite a long time, but for many grammars a breadth-first search can find a completion within a reasonable time, so you could terminate the search after a given maximum time. This will not produce false positives, but the time limit might prevent it from finding the complete set of possible continuations.

How to read through an LL parsing ?

I read that the LL parser is a Top down parser. So logically I suppose that we read throughout from the top to the down.
However, there's many way for read from the top to the down.
I found on wikipedia a page which talk about the depth first which speak of the course in an tree data structure (binary tree).
Otherwise, there is 3 kind of depth first: Pre-order, In-order, Post-order.
In my mind, I suppose that I need to use the Post-order one but how to be sure ?
how to know which kind of depth first need I to use for the LL parsing ?
depth first : https://en.wikipedia.org/wiki/Tree_traversal
Thank's

There's usually an infinite number of ways to traverse a grammar, just like there's an infinite number of possible inputs that adhere to the grammar.
When you walk the grammar you typically don't do it like you would a traditional tree or graph structure. Rather, your walk is dictated by an input stream of tokens coming from the lexer.
E.g. if you find yourself at a place in the grammar where it has has a production where either an identifier or an integer literal may occur, the branch taken is dictated by whether the current token is one or the other (or something else, which would then be a syntax error for that input).

LALR parsers and look-ahead

I'm implementing the automatic construction of an LALR parse table for no reason at all. There are two flavors of this parser, LALR(0) and LALR(1), where the number signifies the amount of look-ahead.
I have gotten myself confused on what look-ahead means.
If my input stream is 'abc' and I have the following production, would I need 0 look-ahead, or 1?
P :== a E
Same question, but I can't choose the correct P production in advance by only looking at the 'a' in the input.
P :== a b E
| a b F
I have additional confusion in that I don't think the latter P-productions really happen in when building a LALR parser generator. The reason is that the grammar is effectively left-factored automatically as we compute the closures.
I was working through this page and was ok until I got to the first/follow section. My issue here is that I don't know why we are calculating these things, so I am having trouble abstracting this in my head.
I almost get the idea that the look-ahead is not related to shifting input, but instead in deciding when to reduce.
I've been reading the Dragon book, but it is about as linear as a Tarantino script. It seems like a great reference for people who already know how to do this.

The first thing you need to do when learning about bottom-up parsing (such as LALR) is to remember that it is completely different from top-down parsing. Top-down parsing starts with a nonterminal, the left-hand-side (LHS) of a production, and guesses which right-hand-side (RHS) to use. Bottom-up parsing, on the other hand, starts by identifying the RHS and then figures out which LHS to select.
To be more specific, a bottom-up parser accumulates incoming tokens into a queue until a right-hand side is at the right-hand end of the queue. Then it reduces that RHS by replacing it with the corresponding LHS, and checks to see whether an appropriate RHS is at the right-hand edge of the modified accumulated input. It keeps on doing that until it decides that no more reductions will take place at that point in the input, and then reads a new token (or, in other words, takes the next input token and shifts it onto the end of the queue.)
This continues until the last token is read and all possible reductions are performed, at which point if what remains is the single non-terminal which is the "start symbol", it accepts the parse.
It is not obligatory for the parser to reduce a RHS just because it appears at the end of the current queue, but it cannot reduce a RHS which is not at the end of the queue. That means that it has to decide whether to reduce or not before it shifts any other token. Since the decision is not always obvious, it may examine one or more tokens which it has not yet read ("lookahead tokens", because it is looking ahead into the input) in order to decide. But it can only look at the next k tokens for some value of k, typically 1.
Here's a very simple example; a comma separated list:
1. Start -> List
2. List -> ELEMENT
3. List -> List ',' ELEMENT
Let's suppose the input is:
ELEMENT , ELEMENT , ELEMENT
At the beginning, the input queue is empty, and since no RHS is empty the only alternative is to shift:
queue remaining input action
---------------------- --------------------------- -----
ELEMENT , ELEMENT , ELEMENT SHIFT
At the next step, the parser decides to reduce using production 2:
ELEMENT , ELEMENT , ELEMENT REDUCE 2
Now there is a List at the end of the queue, so the parser could reduce using production 1, but it decides not to based on the fact that it sees a , in the incoming input. This goes on for a while:
List , ELEMENT , ELEMENT SHIFT
List , ELEMENT , ELEMENT SHIFT
List , ELEMENT , ELEMENT REDUCE 3
List , ELEMENT SHIFT
List , ELEMENT SHIFT
List , ELEMENT -- REDUCE 3
Now the lookahead token is the "end of input" pseudo-token. This time, it does decide to reduce:
List -- REDUCE 1
Start -- ACCEPT
and the parse is successful.
That still leaves a few questions. To start with, how do we use the FIRST and FOLLOW sets?
As a simple answer, the FOLLOW set of a non-terminal cannot be computed without knowing the FIRST sets for the non-terminals which might follow that non-terminal. And one way we can decide whether or not a reduction should be performed is to see whether the lookahead is in the FOLLOW set for the target non-terminal of the reduction; if not, the reduction certainly should not be performed. That algorithm is sufficient for the simple grammar above, for example: the reduction of Start -> List is not possible with a lookahead of ,, because , is not in FOLLOW(Start). Grammars whose only conflicts can be resolved in this way are SLR grammars (where S stands for "Simple", which it certainly is).
For most grammars, that is not sufficient, and more analysis has to be performed. It is possible that a symbol might be in the FOLLOW set of a non-terminal, but not in the context which lead to the current stack configuration. In order to determine that, we need to know more about how we got to the current configuration; the various possible analyses lead to LALR, IELR and canonical LR parsing, amongst other possibilities.

"Sub-parsers" in pipes-attoparsec

I'm trying to parse binary data using pipes-attoparsec in Haskell. The reason pipes (proxies) are involved is to interleave reading with parsing to avoid high memory use for large files. Many binary formats are based on blocks (or chunks), and their sizes are often described by a field in the file. I'm not sure what a parser for such a block is called, but that's what i mean by "sub-parser" in the title. The problem I have is to implement them in a concise way without a potentially large memory footprint. I've come up with two alternatives that each fail in some regard.
Alternative 1 is to read the block into a separate bytestring and start a separate parser for it. While concise, a large block will cause high memory use.
Alternative 2 is to keep parsing in the same context and track the number of bytes consumed. This tracking is error-prone and seems to infest all the parsers that compose into the final blockParser. For a malformed input file it could also waste time by parsing further than indicated by the size field before the tracked size can be compared.
import Control.Proxy.Attoparsec
import Control.Proxy.Trans.Either
import Data.Attoparsec as P
import Data.Attoparsec.Binary
import qualified Data.ByteString as BS
parser = do
size <- fromIntegral <$> anyWord32le
-- alternative 1 (ignore the Either for simplicity):
Right result <- parseOnly blockParser <$> P.take size
return result
-- alternative 2
(result, trackedSize) <- blockparser
when (size /= trackedSize) $ fail "size mismatch"
return result
blockParser = undefined
main = withBinaryFile "bin" ReadMode go where
go h = fmap print . runProxy . runEitherK $ session h
session h = printD <-< parserD parser <-< throwParsingErrors <-< parserInputD <-< readChunk h 128
readChunk h n () = runIdentityP go where
go = do
c <- lift $ BS.hGet h n
unless (BS.null c) $ respond c *> go

I like to call this a "fixed-input" parser.
I can tell you how pipes-parse will do it. You can see a preview of what I'm about to describe in pipes-parse in the parseN and parseWhile functions of the library. Those are actually for generic inputs, but I wrote similar ones for example String parsers as well here and here.
The trick is really simple, you insert a fake end of input marker where you want the parser to stop, run the parser (which will fail if it hits the fake end of input marker), then remove the end of input marker.
Obviously, that's not as easy as I make it sound, but it's the general principle. The tricky parts are:
Doing it in such a way that it still streams. The one I linked doesn't do that, yet, but the way you do this in a streaming way is to insert a pipe upstream that counts bytes flowing through it and then inserts the end-of-input marker at the correct spot.
Not interfering with existing end of input markers
This trick can be adapted for pipes-attoparsec, but I think the best solution would be for attoparsec to directly include this feature. However, if that solution is not available, then we can restrict the input that is fed to the attoparsec parser.

Ok, so I finally figured out how to do this and I've codified this pattern in the pipes-parse library. The pipes-parse tutorial explains how to do this, specifically in the "Nesting" section.
The tutorial only explains this for datatype-agnostic parsing (i.e. a generic stream of elements), but you can extend it to work with ByteStrings to work, too.
The two key tricks that make this work are:
Fixing StateP to be global (in pipes-3.3.0)
Embedding the sub-parser in a transient StateP layer so that it uses a fresh leftovers context
The pipes-attoparsec is going to release an update soon that builds on pipes-parse so that you can use these tricks in your own code.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart