I'm writing a grammar for a formal language. Ideally I'd want that grammar to be unambiguous, but that might not be possible. In either case, I want to know about all possible ambiguities while developing the grammar. How can I do that?
So far, most of the time when I developed a language, I'd turn to Bison, write an LR(1) grammar for it, run Bison in verbose mode, look at all the shift-reduce and reduce-reduce conflicts it reports, and make sure that I agree with its choice in each case.
But now I'm in a project where Bison doesn't have a code generator for one of the required target languages, and where ANTLR is already being used. Plus the language isn't LR(1), and rewriting it as LR(1) would entail additional syntax checks after the parser is done, thus reducing the expressiveness of the grammar as a tool to describe the language.
So I'm now working with ANTLR, fed it my grammar, and all seems to be working well. But ANTLR doesn't seem to check for ambiguities at compile time. For example, the following grammar is ambiguous:
grammar test;
lst: '(' ')' {System.out.println("a");}
| '(' elts ')' {System.out.println("b");} ;
elts: elt (',' elt)* ;
elt: 'x' | /* empty */ ;
The input () could be interpreted as the empty list, or it could be interpreted as a list consisting of a single empty element. The generated parser chooses the former interpretation, but I'd like to be able to manually verify that choice.
Is there a command line switch I can use to get ANTLR to tell me about ambiguities?
Or perhaps an option I can set in the grammar file?
Or should I use some other tool to check the grammar for ambiguities?
If so, is there one which can already read ANTLR syntax, or do I have to strip out all the actions and boil this down to BNF myself?
The ANTLRErrorListener.reportAmbiguity method suggests that ANTLR might be able to perform some ambiguity testing at runtime. But I guess that's only going to tell you whether the parsing of a given input is ambiguous. Is there some strategy by which I could leverage this to detect all ambiguities, using a carefully selected set of inputs?
Well, as far as I know, ANTLR has no real option to check for ambiguity, other than the errors it produces if you write an ambiguous grammar and feed it an input that triggers the ambiguity. I do, however, know a few tools which can check for ambiguity. They all have different syntax, and I don't know of any tool which uses ANTLR grammar syntax.
1. A software called AtoCC has a tool called KfG which can check for ambiguity.
2. ACLA (Ambiguity Checking with Language Approximations).
3. Context Free Grammar Tool.
Personally, I find tool 3 the easiest to use, but it is also the most limited. It is important to note, however, that none of these tools can be 100% sure: if a tool says your grammar is ambiguous, it is ambiguous; but if it says your grammar is unambiguous, it might still be ambiguous, since ambiguity of context-free grammars is undecidable in general and the tools have no way of testing the infinite number of inputs your language allows.
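If you just want a quick sanity check during development, a brute-force search can also surface ambiguities on short inputs. The sketch below is a minimal Python illustration of that idea, not one of the tools above: the grammar is my hand translation of the one from the question (with the actions stripped and the EBNF repetition rewritten as recursion), and the depth bound is an arbitrary choice. It enumerates all derivations up to that depth and reports any string with more than one parse tree:

```python
from itertools import product

# The question's grammar, hand-translated to a plain CFG.
# Dict keys are nonterminals; everything else is a terminal.
GRAMMAR = {
    "LST":  [["(", ")"], ["(", "ELTS", ")"]],
    "ELTS": [["ELT"], ["ELT", ",", "ELTS"]],
    "ELT":  [["x"], []],          # [] is the empty alternative
}

def derivations(symbol, depth):
    """Yield (string, tree) pairs for all derivations of `symbol`
    up to the given recursion depth."""
    if symbol not in GRAMMAR:               # terminal: derives itself
        yield symbol, symbol
        return
    if depth == 0:
        return
    for i, rhs in enumerate(GRAMMAR[symbol]):
        # Expand every symbol of this alternative independently,
        # then combine the expansions in all possible ways.
        parts = [list(derivations(s, depth - 1)) for s in rhs]
        for combo in product(*parts):
            text = "".join(t for t, _ in combo)
            tree = (symbol, i, tuple(tr for _, tr in combo))
            yield text, tree

def find_ambiguities(start, depth=6):
    """Map each derivable string to its distinct parse trees,
    keeping only the strings with more than one tree."""
    seen = {}
    for text, tree in derivations(start, depth):
        seen.setdefault(text, set()).add(tree)
    return {s: trees for s, trees in seen.items() if len(trees) > 1}

ambiguous = find_ambiguities("LST")
for s, trees in sorted(ambiguous.items()):
    print(repr(s), "has", len(trees), "parse trees")
```

Run on this grammar it reports exactly the ambiguity from the question: '()' has two parse trees (the empty list, and a list of one empty element). Like the tools above, it can only confirm ambiguity, never rule it out, and the exhaustive enumeration only scales to toy grammars.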
Hope this helps.
According to the ECMAScript spec:
There are several situations where the identification of lexical input
elements is sensitive to the syntactic grammar context that is
consuming the input elements. This requires multiple goal symbols for
the lexical grammar.
Two such symbols are InputElementDiv and InputElementRegExp.
In ECMAScript, the meaning of / depends on the context in which it appears. Depending on the context, a / can either be a division operator, the start of a regex literal or a comment delimiter. The lexer cannot distinguish between a division operator and regex literal on its own, so it must rely on context information from the parser.
I'd like to understand why this requires the use of multiple goal symbols in the lexical grammar. I don't know much about language design so I don't know if this is due to some formal requirement of a grammar or if it's just convention.
Questions
Why not just use a single goal symbol like so:
InputElement ::
[...]
DivPunctuator
RegularExpressionLiteral
[...]
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
What are some other languages that use multiple goal symbols in their lexical grammar?
How would we classify the ECMAScript lexical grammar? It's not context-sensitive in the sense of the formal definition of a CSG (i.e. the LHS of its productions are not surrounded by a context of terminal and nonterminal symbols).
Saying that the lexical production is "sensitive to the syntactic grammar context that is consuming the input elements" does not make the grammar context-sensitive, in the formal-languages definition of that term. Indeed, there are productions which are "sensitive to the syntactic grammar context" in just about every non-trivial grammar. It's the essence of parsing: the syntactic context effectively provides the set of potentially expandable non-terminals, and those will differ in different syntactic contexts, meaning that, for example, in most languages a statement cannot be entered where an expression is expected (although it's often the case that an expression is one of the manifestations of a statement).
However, the difference does not involve different expansions for the same non-terminal. What's required in a "context-free" language is that the set of possible derivations of a non-terminal is the same set regardless of where that non-terminal appears. So the context can provide a different selection of non-terminals, but every non-terminal can be expanded without regard to its context. That is the sense in which the grammar is free of context.
As you note, context-sensitivity is usually expressed in a grammar by allowing a pattern on the left-hand side of a production rather than a single non-terminal. In the original definition, the context --everything other than the non-terminal to be expanded-- needed to be passed through the production untouched; only a single non-terminal could be expanded, but the possible expansions depend on the context, as indicated by the productions. Implicit in the above is that there are grammars which can be written in BNF which don't even conform to that rule for context-sensitivity (or some other equivalent rule). So it's not a binary division, either context-free or context-sensitive. It's possible for a grammar to be neither (and, since the empty context is still a context, any context-free grammar is also context-sensitive). The bottom line is that when mathematicians talk, the way they use words is sometimes unexpected. But it always has a clear underlying definition.
In formal language theory, there are not lexical and syntactic productions; just productions. If both the lexical productions and the syntactic productions are free of context, then the total grammar is free of context. From a practical viewpoint, though, combined grammars are harder to parse, for a variety of reasons which I'm not going to go into here. It turns out that it is somewhat easier to write the grammars for a language, and to parse them, with a division between lexical and syntactic parsers.
In the classic model, the lexical analysis is done first, so that the parser doesn't see individual characters. Rather, the syntactic analysis is done with an "alphabet" (in a very expanded sense) of "lexical tokens". This is very convenient -- it means, for example, that the lexical analysis can simply drop whitespace and comments, which greatly simplifies writing a syntactic grammar. But it also reduces generality, precisely because the syntactic parser cannot "direct" the lexical analyser to do anything. The lexical analyser has already done what it is going to do before the syntactic parser is aware of its needs.
If the parser were able to direct the lexical analyser, it would do so in the same way as it directs itself. In some productions, the token non-terminals would include InputElementDiv and while in other productions InputElementRegExp would be the acceptable non-terminal. As I noted, that's not context-sensitivity --it's just the normal functioning of a context-free grammar-- but it does require a modification to the organization of the program to allow the parser's goals to be taken into account by the lexical analyser. This is often referred to (by practitioners, not theorists) as "lexical feedback" and sometimes by terms which are rather less value neutral; it's sometimes considered a weakness in the design of the language, because the neatly segregated lexer/parser architecture is violated. C++ is a pretty intense example, and indeed there are C++ programs which are hard for humans to parse as well, which is some kind of indication. But ECMAScript does not really suffer from that problem; human beings usually distinguish between the division operator and the regexp delimiter without exerting any noticeable intellectual effort. And, while the lexical feedback required to implement an ECMAScript parser does make the architecture a little less tidy, it's really not a difficult task, either.
Anyway, a "goal symbol" in the lexical grammar is just a phrase which the authors of the ECMAScript reference decided to use. Those "goal symbols" are just ordinary lexical non-terminals, like any other production, so there's no difference between saying that there are "multiple goal symbols" and saying that the "parser directs the lexer to use a different production", which I hope addresses the question you asked.
Notes
The lexical difference in the two contexts is not just that / has a different meaning. If that were all that it was, there would be no need for lexical feedback at all. The problem is that the tokenization itself changes. If an operator is possible, then the /= in
a /=4/gi;
is a single token (a compound assignment operator), and gi is a single identifier token. But if a regexp literal were possible at that point (and it's not, because regexp literals cannot follow identifiers), then /=4/gi would be lexed as a single regular-expression-literal token (with body =4 and flags gi), so neither the compound operator /= nor the identifier gi would exist.
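To make the difference concrete, here is a toy tokenizer for this one example, sketched in Python. It is in no way an ECMAScript lexer: the token classification, the regex-literal scanning, and the regex_allowed flag (applied uniformly to the whole input, where a real parser would decide it per token via lexical feedback) are all simplifications of my own:

```python
import re

def tokenize(src, regex_allowed):
    """Toy tokenizer: when regex_allowed is True, a '/' starts a
    regular-expression literal /body/flags; otherwise it starts an
    operator ('/' or '/=')."""
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c == "/" and regex_allowed:
            j = src.index("/", i + 1)          # closing slash of the body
            k = j + 1
            while k < len(src) and src[k].isalpha():
                k += 1                          # trailing flags
            tokens.append(("regex", src[i:k]))
            i = k
        elif c == "/":
            op = "/=" if src[i:i + 2] == "/=" else "/"
            tokens.append(("op", op))
            i += len(op)
        else:
            text = re.match(r"[A-Za-z_]\w*|\d+|.", src[i:]).group()
            kind = ("ident" if text[0].isalpha() or text[0] == "_"
                    else "num" if text[0].isdigit() else "punct")
            tokens.append((kind, text))
            i += len(text)
    return tokens

# operator context: a, /=, 4, /, gi, ;
print(tokenize("a /=4/gi;", regex_allowed=False))
# regex context:    a, /=4/gi, ;
print(tokenize("a /=4/gi;", regex_allowed=True))
```

The same nine characters produce six tokens in one mode and three in the other, which is exactly why the decision cannot be postponed until after lexing.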
Parsers which are built from a single set of productions are preferred by some programmers (but not the one who is writing this :-) ); they are usually called "scannerless parsers". In a scannerless parser for ECMAScript there would be no lexical feedback because there is no separate lexical analysis.
There really is a breach between the theoretical purity of formal language theory and the practical details of writing a working parser of a real-life programming language. The theoretical models are really useful, and it would be hard to write a parser without knowing something about them. But very few parsers rigidly conform to the model, and that's OK. Similarly, the things which are popularly called "regular expressions" aren't regular at all, in a formal language sense; some "regular expression" operators aren't even context-free (back-references). So it would be a huge mistake to assume that some theoretical result ("regular expressions can be identified in linear time and constant space") is actually true of a "regular expression" library. I don't think parsing theory is the only branch of computer science which exhibits this dichotomy.
Why not just use a single goal symbol like so:
InputElement ::
...
DivPunctuator
RegularExpressionLiteral
...
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
Note that DivPunctuator and RegExLiteral aren't productions per se, rather they're nonterminals. And in this context, they're right-hand-sides (alternatives) in your proposed production for InputElement. So I'd rephrase your question as: Why not have the syntactic parser tell the lexical parser which of those two alternatives to use? (Or equivalently, which of those two to suppress.)
In the ECMAScript spec, there's a mechanism to accomplish this: grammatical parameters (explained in section 5.1.5).
E.g., you could define the parameter Div, where:
+Div means "a slash should be recognized as a DivPunctuator", and
~Div means "a slash should be recognized as the start of a RegExLiteral".
So then your production would become
InputElement[Div] ::
...
[+Div] DivPunctuator
[~Div] RegularExpressionLiteral
...
But notice that the syntactic parser still has to tell the lexical parser to use either InputElement[+Div] or InputElement[~Div] as the goal symbol, so you arrive back at the spec's current solution, modulo renaming.
What are some other languages that use multiple goal symbols in their lexical grammar?
I think most don't try to define a single symbol that derives all tokens (or input elements), let alone have to divide it up into variants like ECMAScript's InputElementFoo, so it might be difficult to find another language with something similar in its specification.
Instead, it's pretty common to simply define rules for the syntax of different kinds of tokens (e.g. Identifier, NumericLiteral) and then reference them from the syntactic productions. So that's kind of like having multiple lexical goal symbols, but not (I would say) in the sense you were asking about.
How would we classify the ECMAScript lexical grammar?
It's basically context-free, plus some extensions.
Yes, I'm one of those insane people who have a parser-generator project. Minimal-LR(1) with operator-precedence was fairly straightforward. GLR support is next, preferably without making a mess of the corner cases around precedence and associativity (P&A).
Suppose you have an R/R conflict between rules with different precedence levels. A deterministic parser can safely choose the (first) rule of highest precedence. A parser designed to handle local ambiguity might not be sure, especially if the involved rules reduce to different non-terminals.
Suppose you have an R/R conflict between rules with and without precedence characteristics. A deterministic parser can reasonably choose the former. If you ask for GLR, do you mean to entertain both, or should the former clearly dominate the latter? Or is this scenario sufficiently weird as to justify rejecting the grammar?
Suppose you have an S/R/R conflict where only some of the participating rules have precedence, and maybe the look-ahead token does or doesn't have precedence. If P&A is all about what to do in front of the lookahead, then a token without precedence should perhaps mean all options stay viable. But is that really the intended semantic here?
Suppose you have a nonassoc declaration on a terminal, and an S/R/R conflict where only ONE of the participating production rules hits the same non-associative precedence level. Then the other rule is clearly still viable to reduce, but what of the shift? Should we take it? What if we're mid-rule in a manner that doesn't trigger the same non-associativity problem? What if the look-ahead token is higher precedence than the remaining reduce, or the remaining reduce doesn't have precedence? How can we avoid accidentally constructing an invalid parse this way? Is there some trick with the parse-items to construct a shift-state that can't go wrong, or is this kind of thing beyond the scope of GLR parsing?
Also, how should semantic predicates interact with such ugly corner cases?
The simplest-thing-that-might-work is to treat anything involving operator-precedence in the same manner as a deterministic table-generator. But is that the intended semantic? Or perhaps: what kinds of declarations might grammar authors want to exert control over these weird cases?
Traditional yacc-style precedence rules cannot be used to resolve reduce/reduce conflicts.
Yacc/bison "resolve" reduce/reduce conflicts by choosing the first production in the grammar file. This has nothing to do with precedence, and in the grammars where you would want to use a GLR parser, it is almost certainly not correct; you want the GLR parser to pursue all possible paths.
The bison GLR parser requires that ambiguity be resolved; that is, that the grammar be unambiguous. However, it has two "outs": first, it lets you use "dynamic precedence" declarations (which is a completely different concept, although it happens to use the same word); second, if that's not enough, it lets you provide your own resolution function.
Amongst other possibilities, a custom resolution function can accept both reductions, for example by inserting a branch in the AST. There are some theoretical issues with this approach for general parsing, but it works fine with real programming languages, which tend to not be ambiguous, or at least "not very ambiguous".
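As an illustration of that idea, a custom resolver can simply record both candidate subtrees under a single ambiguity node in the AST. Everything below (the tuple representation, the "amb" tag, the example input) is invented for the sketch; it is not bison's actual resolver interface:

```python
def merge_alternatives(a, b):
    """Combine two competing reductions into one AST node, flattening
    nested ambiguity nodes so an n-way ambiguity stays one level deep."""
    alts = []
    for tree in (a, b):
        alts.extend(tree[1] if tree[0] == "amb" else [tree])
    return ("amb", alts)

# Two plausible reductions of the same C++-style input "T x(y);":
decl = ("declaration", "T x(y);")
stmt = ("expression-statement", "T x(y);")
print(merge_alternatives(decl, stmt))
```

A later pass (name lookup, type checking, or a textual rule like the C++ one cited below) can then walk the tree and keep exactly one branch of each "amb" node.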
A typical case for dynamic precedence is implementing a (textual) rule like C++'s §9.8/1:
There is an ambiguity in the grammar involving expression-statements and declarations: An expression-statement with a function-style explicit type conversion (8.2.3) as its leftmost subexpression can be indistinguishable from a declaration where the first declarator starts with a (. In those cases the statement is a declaration.
This rule cannot be expressed by a context-free grammar -- or, at least not in a way which would be readable -- but it is trivially expressible as a dynamic precedence rule.
As its name implies, dynamic precedence is dynamic; it's a rule applied at parse time by the parser. Bison's GLR algorithm only applies these rules if forced to; the parser handles multiple possible reductions normally (by maintaining all of them as possibilities). It is forced to apply dynamic precedence only when both possible reductions in a reduce/reduce conflict reduce to the same non-terminal.
By contrast, the yacc precedence algorithm, which as I mentioned only resolves shift/reduce conflicts, is static: it is compiled at generation time into the parse automaton (in effect, by removing actions from the transition tables), so the parser no longer sees the conflict.
This algorithm has been (justifiably) criticised for a variety of reasons, one of which is the odd behaviour of non-associative declarations in corner cases. Also, precedence rules do not compose well; because they are not scoped, they might end up accidentally applying to productions for which they were not intended. Not infrequently, they facilitate grammar bugs by hiding a conflict which should have been resolved by the grammar writer.
Best practice, therefore, is to avoid corner cases :-) Static precedence should be restricted to its originally-intended use cases: simple operator precedence and, possibly, documenting the "shift preferred" heuristic which resolves the dangling-else ambiguity and certain grouped operator parses (iirc, there's a good example of this in the dragon book).
If you implement dynamic precedence -- and, honestly, there are good reasons not to -- then it should be applied to simple, easily expressed rules like the C++ rule cited above: "if it looks like a declaration, it's a declaration." Even better would be to avoid writing ambiguous grammars; that particular C++ feature leads to the infamous "most vexing parse", which has probably at some point bitten every one of us who have tried writing C++ programs.
I am writing a parser in Bison for a language which has the following constructs, among others:
self-dispatch: [identifier arguments]
dispatch: [expression . identifier arguments]
string slicing: expression[expression,expression] - similar to Python.
arguments is a comma-separated list of expressions, which can be empty too. All of the above are expressions on their own, too.
My problem is that I am not sure how to parse both [method [other_method]] and [someString[idx1, idx2].toInt] or if it is possible to do this at all with an LALR(1) parser.
To be more precise, let's take the following example: [a[b]] (call method a with the result of method b). When it reaches the state [a . [b]] (the lookahead is the second [), it won't know whether to reduce a (which has already been reduced to identifier) to expression because something like a[b,c] might follow (which could itself be reduced to expression and continue with the second construct from above) or to keep it identifier (and shift it) because a list of arguments will follow (such as [b] in this case).
Is this shift/reduce conflict due to the way I expressed this grammar or is it not possible to parse all of these constructs with an LALR(1) parser?
And, a more general question, how can one prove that a language is/is not parsable by a particular type of parser?
Assuming your grammar is unambiguous (which the part you describe appears to be) then your best bet is to specify a %glr-parser. Since in most cases, the correct parse will be forced after only a few tokens, the overhead should not be noticeable, and the advantage is that you do not need to complicate either the grammar or the construction of the AST.
The one downside is that bison cannot verify that the grammar is unambiguous -- in general, this is not possible -- and it is not easy to prove. If it turns out that some input is ambiguous, the GLR parser will generate an error, so a good test suite is important.
Proving that the language is not LR(1) would be tricky, and I suspect that it would be impossible, because the language probably is recognizable with an LALR(1) parser. (Impossible to tell without seeing the entire grammar, though.) But parsing (outside of CS theory) needs to create a correct parse tree in order to be useful, and the sort of modifications required to produce an LR grammar will also modify the AST, requiring a post-parse fixup. The difficulty in creating a correct AST springs from the difference in precedence between
a[b[c],d]
and
[a[b[c],d]]
In the first (subset) case, b binds to its argument list [c] and the comma has lower precedence; in the end, b[c] and d are sibling children of the slice. In the second case (method invocation), the comma is part of the argument list and binds more tightly than the method application; b, [c] and d are siblings in a method application. But you cannot decide the shape of the parse tree until an arbitrarily long input (since d could be any expression).
That's all a bit hand-wavey, since "precedence" is not formally definable, and there are hacks which could make it possible to adjust the tree. Since the LR property is not really composable, it is not really possible to provide a more rigorous analysis. But regardless, the GLR parser is likely to be the simplest and most robust solution.
One small point for future reference: CFGs are not just a programming tool; they also serve the purpose of clearly communicating the grammar in question. Normally, if you want to describe your language, you are better off using a clear CFG than trying to describe it informally. Of course, meaningful non-terminal names will help, and a few examples never hurt, but the essence of the grammar is in the formal description, and omitting that makes it harder for others to "be helpful".
Can I write a parser for Communicating Sequential Processes (CSP) in ANTLR? I think it uses left recursion, as in the statement
VMS = (coin → (choc → VMS))
The complete language specification can be found at CSPM: A Reference Manual,
so it is not an LL grammar. Am I right?
In general, even if you have a grammar with left recursion, you can refactor the grammar to remove it. So ANTLR is reasonably likely to be able to process your grammar. There's no a priori reason you can't write a CSP grammar for ANTLR.
Whether the one you have is suitable is another question.
If your quoted phrase is a grammar rule, it doesn't have left recursion. (If it is, I don't understand the syntax of your grammar rules, especially why the parentheses [terminals?] would be unbalanced; that's pretty untraditional.)
So ANTLR should be able to process it, modulo converting it to ANTLR grammar rule syntax.
You didn't show the rest of the grammar, so one can't have an opinion about the rest of it.
The case above does not have left recursion. In ANTLR it would look something like the following. Note this is a simplified version; CSP is much more complicated. I'm just showing that it is possible.
assignment : PROCNAME EQ process
;
process : LPAREN EVENT ARROW process RPAREN
| PROCNAME
;
Besides, you can remove left recursion with ANTLRWorks' 'Remove Left Recursion' function.
CSP grammars are definitely possible in ANTLR; check http://www.carnap.info/ for an example.
I'm currently trying to write a (very) small interpreter/compiler for a programming language. I have set the syntax for the language, and I now need to write down the grammar for the language. I intend to use an LL(1) parser because, after a bit of research, it seems that it is the easiest to use.
I am new to this domain, but from what I gathered, formalising the syntax using BNF or EBNF is highly recommended. However, it seems that not all grammars are suitable for implementation using an LL(1) parser. Therefore, I was wondering what was the correct (or recommended) approach to writing grammars in LL(1) form.
Thank you for your help,
Charlie.
PS: I intend to write the parser using Haskell's Parsec library.
EDIT: Also, according to SK-logic, Parsec can handle an infinite lookahead (LL(k) ?) - but I guess the question still stands for that type of grammar.
I'm not an expert on this as I have only made a similar small project with an LR(0) parser. The general approach I would recommend:
Get the arithmetic working. By this I mean: write rules and derivations for +, -, /, *, etc., and make sure that the parser produces a working abstract syntax tree. Test and evaluate the tree on different inputs to ensure that it does the arithmetic correctly.
Make things step by step. If you encounter any conflict, resolve it first before moving on.
Get simpler constructs like if-then-else or case expressions working.
Going further depends more on the language you're writing the grammar for.
Definitely check out other programming languages' grammars as a reference (unfortunately I did not quickly find any full LL grammar for a language online, but LR grammars should be useful as a reference too). For example:
ANSI C grammar
Python grammar
and of course the small examples of LL grammars in the Wikipedia article on LL parsers, which you have probably already checked out.
I hope you find some of this stuff useful
There are algorithms for determining whether a grammar is LL(k), and parser generators implement them. There are also heuristics for converting a grammar to LL(k), where possible.
But you don't need to restrict your simple language to LL(1), because most modern parser generators (JavaCC, ANTLR, Pyparsing, and others) can handle any k in LL(k).
More importantly, it is very likely that the syntax you consider best for your language requires a k between 2 and 4, because several common programming constructs do.
So first off, you don't necessarily want your grammar to be LL(1). It makes writing a parser simpler and potentially offers better performance, but it does mean that your language will likely end up more verbose than commonly used languages (which generally aren't LL(1)).
If that's ok, your next step is to mentally step through the grammar, imagine all possibilities that can appear at that point, and check if they can be distinguished by their first token.
There are two main rules of thumb for making a grammar LL(1):
1. If multiple choices can appear at a given point and they can start with the same token, add a keyword in front telling you which choice was taken.
2. If you have an optional or repeated part, make sure it is followed by an ending token which can't appear as the first token of the optional/repeated part.
Avoid optional parts at the beginning of a production wherever possible. It makes the first two steps a lot easier.
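Both rules of thumb amount to keeping the FIRST sets of competing alternatives disjoint, and that check is easy to automate while you design the grammar. Below is a minimal Python sketch under assumptions of my own: the toy grammar is invented for illustration, and it has no empty alternatives, so the epsilon and FOLLOW-set bookkeeping of the full LL(1) test is deliberately omitted:

```python
# A hypothetical toy grammar (my own example): dict keys are
# nonterminals; every other symbol is a terminal token.
toy_grammar = {
    "stmt": [["if", "expr", "then", "stmt"], ["print", "expr"]],
    "expr": [["num"], ["(", "expr", "+", "expr", ")"]],
}

def first(symbol, grammar, seen=None):
    """FIRST set of a single symbol (no empty alternatives here,
    so only the first symbol of each alternative matters)."""
    if symbol not in grammar:          # terminal
        return {symbol}
    seen = seen if seen is not None else set()
    if symbol in seen:                 # guard against left recursion
        return set()
    seen.add(symbol)
    out = set()
    for rhs in grammar[symbol]:
        out |= first(rhs[0], grammar, seen)
    return out

def ll1_conflicts(grammar):
    """Pairs of alternatives whose FIRST sets overlap; any hit means
    one token of lookahead cannot choose between them."""
    conflicts = []
    for nt, alts in grammar.items():
        firsts = [first(alt[0], grammar) for alt in alts]
        for i in range(len(alts)):
            for j in range(i + 1, len(alts)):
                if firsts[i] & firsts[j]:
                    conflicts.append((nt, i, j))
    return conflicts

# prints [] because every alternative starts with a distinct token,
# which is exactly what rule-of-thumb 1 (a leading keyword) buys you
print(ll1_conflicts(toy_grammar))
```

When a conflict does show up, applying rule 1 (prefix one alternative with a keyword) or left-factoring the shared prefix makes it disappear.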