Parsing ASN1 with ANTLR - parsing

I have an assignment for my formal languages course. The teacher asked us to develop a lexer for ASN1 DER encoded file using a lexer generator. I chose ANTLR 3.5.2 since the lexer is basically a pushdown automata, instead of a FSA. I know that ASN1 is context-sensitive because of the length field, hence our teacher gives an upper bound on the length of the parsable ASN1 files of 512 bytes. Even if the counting is finite, I'm still stuck on it, because the only way I know to count using a context-free grammar is using a grammar with n rules, if I have to count n bytes. Moreover, 512 bytes is just an upper bound, hence in a length field there can be every number between 1 and 512, meaning that I have to count every number in that range. Thus I would need 512^2 rules in the grammar, which seems way too big. Do you know a smarter way to perform this task?

Related

How can I span all vaild input space using antlr4 instead of parsing a input?

In the other words, how can i get ast tree but with control of recursion depth.
e.g.
grammar: Expr
start: expression EOF;
expression
: expression (+|-) expression
| NUMBER
;
NUMBER: [0-9]+;
all valid input space with depth = 3:
depth 1: 0, 10, ...
depth 2: 0 + 2, 3 - 4, ...
depth 3: 0 + 2 - 4, ....
There's no feature built-in to Antlr parsers to do this, as far as I know, and a general solution would be complicated.
Even in your simple example, your question is hard to interpret. What is the recursion depth of a parse of a NUMBER? Always 1? One per iteration as though the repetition operator were expanded into BNF? Or was your intention to ignore the lexical analysis? (Remember that Antlr does not require that grammars be divided into lexical and syntactic components.)
Your grammar does not include any predicates but it's hard to write an Antlr grammar for real-world problems without using predicates; they are what give Antlr much of its power. But inverting a grammar which uses predicates is probably only possible by generating all candidates and then testing each one.
If we remove Antlr from the equation leaving only BNF, and don't attempt to generate tokens beyond their token type, then it's pretty simple to enumerate possible sentences using a breadth-first search, which will naturally work in order by recursion depth. The result is rarely useful because of exponential explosion in the number of possibilities, but there are certainly implementations.
A more common problem is that of generating useful test cases for a parser or language processor. The classic algorithm was presented in 1972 by Paul Purdom; although it is not entirely satisfactory, it forms the basis for many grammar testers. Open source implementations are available; if you want to write your own, try to find a copy of Malloy & Power's An interpretation of Purdom's Algorithm for Automatic Generation of Test Cases, first published in 2001.

Parsing special cases

If I understand correctly, parsing turns a sequence of symbols into a tree. My question is, is it possible to use some standard procedure (LR, LL, PEG, ..?) to parse the following two examples or is it necessary to write a specialized parser by hand?
Python source code, i.e. the whitespace-indented blocks
I think I read somewhere that the parser keeps track of the number of leading spaces, and pretends to replace them with curly brackets to delimitate the blocks. Is it fundamentally required because the standard parsing techniques are not powerful enough or is it for performance reasons?
PNG image format, where a block starts with a header and block size, after which there is the content of the block
The content could contain bytes which resemble some header so it is necessary to "know" that the next x bytes are not to be "parsed", i.e. they should be skipped. How to express this, say, with PEG? In other words, the "closing bracket" is represented by the length of the content.
Neither of the examples in the question are context-free, so strictly speaking they cannot be parsed with context-free grammars. But in practical terms, they are both pretty easy to parse.
The python algorithm is well-described in the Python reference manual (although you need to read that in context.) What's described there is a pre-processing step in which whitespace at the beginning of a line is systematically replaced with INDENT and DEDENT tokens.
To clarify: It's not really a preprocessing step, and it's important to observe that it happens after implicit and explicit line joining. (There are previous sections in the reference manual which describe these procedures.) In particular, lines are implicitly joined inside parentheses, braces and brackets, so the process is intertwined with parsing.
In practical terms, both the line-joining and indentation algorithms can be accomplished programmatically; typically, these would be done inside a custom scanner (tokenizer) which maintains both a stack of parentheses and indent levels. The token stream can then be parsed with normal context-free algorithms, but the tokenizer -- although it might use regular expressions -- needs context-sensitive logic (counting spaces, for example). [Note 1]
Similarly, formats which contain explicit sizes (such as most serialization formats, including PNG files, Google protobufs, and HTTP chunked encoding) are not context-free, but are obviously easy to tokenize since the tokenizer simply has to read the length and then read that many bytes.
There are a variety of context-sensitive formalisms, and these definitely have their uses, but in practical parsing the most common strategy is to use a Turing-equivalent formalism (such as any programming language, possibly augmented with a scanner-generator like flex) for the tokenizer and a context-free formalism for the parser. [Note 2]
Notes:
It may not be immediately obvious that Python indenting is not context-free, since context-free grammars can accept some categories of agreement. For example, {ωω-1 | ω∈Σ*} (the language of all even-length palindromes) is context-free, as is {anbn}.
However, these examples can't be extended, because the only count-agreement possible in a context-free language is bracketing. So while palindromes are context-free (you can implement the check with a single stack), the apparently very similar {ωω | ω∈Σ*} is not, and neither is {anbncn}
One such formalism is back-references in "regular" expressions, which might be available in some PEG implementation. Back-references allow the expression of a variety of context-sensitive languages, but do not allow the expression of all context-free languages. Unfortunately, regular expressions with back-references really suck in practice, because the problem of determining whether a string matches a regex with back-references is NP complete. You might find this question on a sister SE site interesting. (And you might want to reformulate your question in a way that could be asked on that site, http://cs.stackexchange.com.)
As a practical matter, almost all parser construction requires some clever hacks around the edges to overcome the limitations of the parsing machinery.
Pure context free parsers can't do Python; all the parser technologies you have listed are weaker than pure-context free, so they can't do it either. A hack in the lexer to keep track of indentation, and generate INDENT/DEDENT tokens, turns the indenting problem into explicit "parentheses", which are easily handled by context-free parsers.
Most binary files can't be processed either, as they usually contain, somewhere, a list of length N, where N is provided before the list body is encountered (this is kind of the example you gave). Again, you can get around this, with a more complicated hack; something must keep a stack of nested list lengths, and the parser has to signal when it moves from one list element to the next. The top-most length counter gets decremented, and the parser gets back a signal "reduce" or "shift". Other more complex linked structures are generally pretty hard to parse this way. Getting the parser to cooperate this way isn't always easy.

Does the recognition of numbers belong in the scanner or in the parser?

When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(Definitions were made on the fly, I have probably made a mistake in them).
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete and it is up to the implementer to realize that they should actually be in the scanner?
Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: check out backtracking recursive-descent parsers. For such a parser, tokens are defined in a similar way to non-tokens. pyparsing is a parser generator for such parsers.
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides but generally it is pretty obvious. You want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?

Two level grammar

I am trying to determine whether suggested changes to the EcmaScript grammar introduce ambiguities.
The grammar is odd in a few ways
There is no regular or context free lexical grammar meaning there is no way to break the input into a series of tokens which can be fed to a tree builder, though at a given parser state there is a context free grammar which can be used to fetch the next token.
Some tokens are implicit. Specifically semicolons are inserted in some places when not present in the source text. This only requires one non-ignorable token of lookahead but since ignorable tokens can be of arbitrary length prevents non-finite lookahead.
There is no translation simpler than a full parse that allows removal or collapsing of ignorable tokens.
Line terminators tokens (and multiline comments that are equivalent to line terminators) are ignorable in most contexts but are significant in some.
I know that proving no ambiguity is not doable in general, but I'd like to be able to achieve a simpler goal:
A test that is true if and only if there is no string such that two different paths through the candidate grammar might produce two different trees where each path involves breaking the string into less than k tokens.
I would be very happy if I could prove such a thing for a candidate grammar to k of 50.
Is there any literature on detecting ambiguity within such limits?

Implementing "*?" (lazy "*") regexp pattern in combinatorial GLR parsers

I have implemented combinatorial GLR parsers. Among them there are:
char(·) parser which consumes specified character or range of characters.
many(·) combinator which repeats specified parser from zero to infinite times.
Example: "char('a').many()" will match a string with any number of "a"-s.
But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.
I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?
Edit:
May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.
Parser = f(Step) -> Collection of TreeNodes of Steps.
So when I parse input, I:
Compose simple predefined Parser functions to get complex Parser function representing required grammar.
Form initial Step from the input.
Give the initial Step to the complex Parser function.
Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).
Gather information from Steps which were left.
I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.
I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to definining a grammer in terms of terminal characters, where "char('a').many" corresponds to grammar rules:
char = "a" ;
char = char "a" ;
GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its psuedo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parsess, e.g., ambiguous parses) will survive.
If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints what you have implemented is not a GLR parser.
Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in psuedo-parallel, the GLR parsing machinery makes this easy if it it bent properly) all the following: a) deleting the offending token, b) inserting all tokens that essentially are FOLLOW(x) where x is live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.
Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)
(This answer based on my dim memory and vague possible misunderstanding of GLR parsing. Hopefully someone expert will come by.)
Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?
Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?
This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.
One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parsers. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.
Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".
Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.

Resources