Can I write a parser for Communicating sequential processes(CSP) in ANTLR? I think it uses left recursion like in statement
VMS = (coin → (choc → VMS))
complete language specification can be found at CSPM : A Reference Manua
so it is not a LL grammer. Am I right?
In general, even if you have a grammar with left recursion, you can refactor the grammar to remove it. So ANTLR is reasonably likely to be able to process your grammar. There's no a priori reason you can't write a CSP grammar for ANTLR.
Whether the one you have is suitable is another question.
If your quoted phrase is a grammar rule, it doesn't have left recursion. (If it is, I don't understand the syntax of your grammar rules, especially why the parentheses [terminals?] would be unbalanced; that's pretty untradional.)
So ANTLR should be able to process it, modulo converting it to ANTRL grammar rule syntax.
You didn't show the rest of the grammar, so one can't have an opinion about the rest of it.
In the case above does not have left recursion. It would looks something like. Note this is a simplified version, CSP is much more complicated. I'm just showing it is possible.
assignment : PROCNAME EQ process
;
process : LPAREN EVENT ARROW process RPAREN
| PROCNAME
;
Besides, you can factor out left recursion with ANTLRWorks 'Remove Left Recursion' function.
CSP's are definitely possible in ANTLR, check http://www.carnap.info/ for an example.
Related
Is there an easy way to tell whether a simple grammar is suitable for recursive descent? Is eliminating left recursion and left factoring the grammar enough to achieve this ?
Not necessarily.
To build a recursive descent parser (without backtracking), you need to eliminate or resolve all predict conflicts. So one definitive test is to see if the grammar is LL(1); LL(1) grammars have no predict conflicts, by definition. Left-factoring and left-recursion elimination are necessary for this task, but they might not be sufficient, since a predict conflict might be hiding behind two competing non-terminals:
list ::= item list'
list' ::= ε
| ';' item list'
item ::= expr1
| expr2
expr1 ::= ID '+' ID
expr2 ::= ID '(' list ')
The problem with the above (or, at least, one problem) is that when the parser expects an item and sees an ID, it can't know which of expr1 and expr2 to try. (That's a predict conflict: Both non-terminals could be predicted.) In this particular case, it's pretty easy to see how to eliminate that conflict, but it's not really left-factoring since it starts by combining two non-terminals. (And in the full grammar this might be excerpted from, combining the two non-terminals might be much more difficult.)
In the general case, there is no algorithm which can turn an arbitrary grammar into an LL(1) grammar, or even to be able to say whether the language recognised by that grammar has an LL(1) grammar as well. (However, it's easy to tell whether the grammar itself is LL(1).) So there's always going to be some art and/or experimentation involved.
I think it's worth adding that you don't really need to eliminate left-recursion in a practical recursive descent parser, since you can usually turn it into a while-loop instead of recursion. For example, leaving aside the question of the two expr types above, the original grammar in an extended BNF with repetition operators might be something like
list ::= item (';' item)*
Which translates into something like:
def parse_list():
parse_item()
while peek(';'):
match(';')
parse_item()
(Error checking and AST building omitted.)
I'm writing a grammar for a formal language. Ideally I'd want that grammar to be unambiguous, but that might not be possible. In either case, I want to know about all possible ambiguities while developing the grammar. How can I do that?
So far, most of the time when I developed a language, I'd turn to Bison, write a LR(1) grammar for it, run Bison in verbose mode and have a look at all the shift-reduce and reduce-reduce conflicts it tells me about. Make sure that I agree with its choice in each case.
But now I'm in a project where Bison doesn't have a code generator for one of the required target languages, and where ANTLR is already being used. Plus the language isn't LR(1), and rewriting it as LR(1) would entail additional syntax checks after the parser is done, thus reducing the expressiveness of the grammar as a tool to describe the language.
So I'm now working with ANTLR, fed it my grammar, and all seems to be working well. But ANTLR doesn't seem to check for ambiguities at compile time. For example, the following grammar is ambiguous:
grammar test;
lst: '(' ')' {System.out.println("a");}
| '(' elts ')' {System.out.println("b");} ;
elts: elt (',' elt)* ;
elt: 'x' | /* empty */ ;
The input () could be interpreted as the empty list, or it could be interpreted as a list consisting of a single empty element. The generated parser chooses the former interpretation, but I'd like to be able to manually verify that choice.
Is there command line switch I can use to get ANTLR to tell me about ambiguities?
Or perhaps an option I can set in the grammar file?
Or should I use some other tool to check the grammar for ambiguities?
If so, is there one which can already read ANTLR syntax, or do I have to strip out all the actions and boil this down to BNF myself?
The ANTLRErrorListener.reportAmbiguity method suggests that ANTLR might be able to perform some ambiguity testing at runtime. But I guess that's only going to tell you whether the parsing of a given input is ambiguous. Is there some strategy how I could leverage this to detect all ambiguities, using a carefully selected set of inputs?
Well, as far as I know, ANTLR has no real option to check for ambiguity, other than the errors it produced IF you write an ambiguous grammar and feed an input that triggers the ambiguity. I do, however know a few tools which can check for ambiguity. They all have different syntax, and I don't know any tool which uses ANTLR grammar.
A software called AtoCC has a tool called KfG which can check for ambiguity.
ACLA (Ambiguity Checking with Language Approximations).
Context Free Grammar Tool.
Personally, I find tool 3 easiest to use, but is the most limited as well. It is important, however to note that none of the tools can be 100% sure; if the tools says you're grammar is ambiguous, it is ambiguous, however if they say you're grammar is unambiguous, they might still be ambiguous, as they have no way of testing an infinite number of ways, that your language can be written.
Hope this helps.
I want to solve this Grammar.
S->SS+
S->SS*
S->a
I want to construct SLR sets of items and parsing table with action and goto.
Can this grammar parse without eliminate left recursion.
Is this Grammar SLR.
No, this grammar is not SLR. It is ambiguous.
Left recursion is not a problem for LR parsers. Left recursion elimination is only necessary for LL parsers.
I am not entirely sure about this, but I think this grammar is actually SLR(1). I constructed by hand the SLR(1) table and I obtained one with no conflicts (having added a 0-transition from S' (new start symbol) -> S).
Can somebody provide a sentence that can be derived in two different ways from this grammar? I was able to get a parser for it in Bison without any warning. Are you sure it is ambiguous?
I'm currently trying to write a (very) small interpreter/compiler for a programming language. I have set the syntax for the language, and I now need to write down the grammar for the language. I intend to use an LL(1) parser because, after a bit of research, it seems that it is the easiest to use.
I am new to this domain, but from what I gathered, formalising the syntax using BNF or EBNF is highly recommended. However, it seems that not all grammars are suitable for implementation using an LL(1) parser. Therefore, I was wondering what was the correct (or recommended) approach to writing grammars in LL(1) form.
Thank you for your help,
Charlie.
PS: I intend to write the parser using Haskell's Parsec library.
EDIT: Also, according to SK-logic, Parsec can handle an infinite lookahead (LL(k) ?) - but I guess the question still stands for that type of grammar.
I'm not an expert on this as I have only made a similar small project with an LR(0) parser. The general approach I would recommend:
Get the arithmetics working. By this, make rules and derivations for +, -, /, * etc and be sure that the parser produces a working abstract syntax tree. Test and evaluate the tree on different input to ensure that it does the arithmetic correctly.
Make things step by step. If you encounter any conflict, resolve it first before moving on.
Get simper constructs working like if-then-else or case expressions working.
Going further depends more on the language you're writing the grammar for.
Definetly check out other programming language grammars as an reference (unfortunately I did not find in 1 min any full LL grammar for any language online, but LR grammars should be useful as an reference too). For example:
ANSI C grammar
Python grammar
and of course some small examples in Wikipedia about LL grammars Wikipedia LL Parser that you probably have already checked out.
I hope you find some of this stuff useful
There are algorithms both for determining if a grammar is LL(k). Parser generators implement them. There are also heuristics for converting a grammar to LL(k), if possible.
But you don't need to restrict your simple language to LL(1), because most modern parser generators (JavaCC, ANTLR, Pyparsing, and others) can handle any k in LL(k).
More importantly, it is very likely that the syntax you consider best for your language requires a k between 2 and 4, because several common programming constructs do.
So first off, you don't necessarily want your grammar to be LL(1). It makes writing a parser simpler and potentially offers better performance, but it does mean that you're language will likely end up more verbose than commonly used languages (which generally aren't LL(1)).
If that's ok, your next step is to mentally step through the grammar, imagine all possibilities that can appear at that point, and check if they can be distinguished by their first token.
There's two main rules of thumb to making a grammar LL(1)
If of multiple choices can appear at a given point and they can
start with the same token, add a keyword in front telling you which
choice was taken.
If you have an optional or repeated part, make
sure it is followed by an ending token which can't appear as the first token of the optional/repeated part.
Avoid optional parts at the beginning of a production wherever possible. It makes the first two steps a lot easier.
What's the difference between this grammar:
...
if_statement : 'if' condition 'then' statement 'else' statement 'end_if';
...
and this:
...
if_statement : IF condition THEN statement ELSE statement END_IF;
...
IF : 'if';
THEN: 'then';
ELSE: 'else';
END_IF: 'end_if';
....
?
If there is any difference, as this impacts on performance ...
Thanks
In addition to Will's answer, it's best to define your lexer tokens explicitly (in your lexer grammar). In case you're mixing them in your parser grammar, it's not always clear in what order the tokens are tokenized by the lexer. When defining them explicitly, they're always tokenized in the order they've been put in the lexer grammar (from top to bottom).
The biggest difference is one that may not matter to you. If your Lexer rules are in the lexer then you can use inheritance to have multiple lexer's share a common set of lexical rules.
If you just use strings in your parser rules then you can not do this. If you never plan to reuse your lexer grammar then this advantage doesn't matter.
Additionally I, and I'm guessing most Antlr veterans, are more accustom to finding the lexer rules in the actual lexer grammar rather than mixed in with the parser grammar, so one could argue the readability is increased by putting the rules in the lexer.
There is no runtime performance impact after the Antlr parser has been built to either approach.
The only difference is that in your first production rule, the keyword tokens are defined implicitly. There is no run-time performance implication for tokens defined implicitly vs. explicitly.
Yet another difference: when you explicitly define your lexer rules you can access them via the name you gave them (e.g. when you need to check for a specific token type). Otherwise ANTLR will use arbitrary numbers (with a prefix).