How to remove left recursion in this ANTLR grammar?

I am trying to parse CSP (Communicating Sequential Processes) as described in the CSP Reference Manual. I have defined the following grammar rules.
assignment
: IDENT '=' processExpression
;
processExpression
: ( STOP
| SKIP
| chaos
| prefix
| prefixWithValue
| seqComposition
| interleaving
| externalChoice
....
seqComposition
: processExpression ';' processExpression
;
interleaving
: processExpression '|||' processExpression
;
externalChoice
: processExpression '[]' processExpression
;
Now ANTLR reports that
seqComposition
interleaving
externalChoice
are left recursive. Is there any way to remove this, or would I be better off using Bison/Flex for this type of grammar? (There are many such rules.)

Define a processTerm. Then write rules looking like
assignment
: IDENT '=' processExpression
;
processTerm
: ( STOP
| SKIP
| chaos
| prefix
...
processExpression
: ( processTerm
| processTerm ';' processExpression
| processTerm '|||' processExpression
| processTerm '[]' processExpression
....
If you want to have things like seqComposition still defined, I think that would be OK as well. But you need to make sure that parsing a processExpression always consumes more text as you proceed through your rules.
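To see why the rewritten rule is safe, here is a minimal hand-written sketch in Python of the same shape, `processExpression : processTerm (op processExpression)?`. It is not ANTLR output; the function names and the two-entry `TERMS` set are assumptions made for illustration, and the real CSP grammar has many more alternatives.

```python
# Hypothetical token sets; the real grammar has many more process terms.
TERMS = {"STOP", "SKIP"}
OPERATORS = {";", "|||", "[]"}


def parse_process_expression(tokens, pos=0):
    """processExpression : processTerm (op processExpression)? ;

    Returns (tree, next_pos). A processTerm is always consumed before the
    rule recurses, so the recursion cannot loop without making progress.
    """
    left, pos = parse_process_term(tokens, pos)
    if pos < len(tokens) and tokens[pos] in OPERATORS:
        op = tokens[pos]
        right, pos = parse_process_expression(tokens, pos + 1)
        return (op, left, right), pos
    return left, pos


def parse_process_term(tokens, pos):
    """processTerm : STOP | SKIP ; (other alternatives omitted)"""
    if pos < len(tokens) and tokens[pos] in TERMS:
        return tokens[pos], pos + 1
    found = tokens[pos] if pos < len(tokens) else "end of input"
    raise SyntaxError(f"expected a process term at position {pos}, found {found!r}")


tree, end = parse_process_expression(["STOP", ";", "SKIP", "[]", "STOP"])
print(tree)  # (';', 'STOP', ('[]', 'SKIP', 'STOP'))
```

Note that this shape makes every operator right-associative with equal precedence. If the operators should bind differently, each precedence level would need its own rule.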

Read the guide to removing left recursion on the ANTLR wiki. It helped me a lot.

Related

ANTLR grammar for multi-level text segmentation

I want to create a grammar that will parse a text file and create a tree of levels according to configurable "segmentors". This is what I have created so far, it kind of works, but will halt when a "segmentor" appears in the beginning of a text. For example, text "and location" will fail to parse. Any ideas?
Also, I'm pretty certain that the grammar could be greatly improved, so any suggestions are welcome.
grammar DocSegmentor;
@header {
package segmentor.antlr;
}
// PARSER RULES
levelOne: (levelTwo LEVEL1_SEG*)+ ;
levelTwo: (levelThree+ LEVEL2_SEG?)+ ;
levelThree: (levelFour+ LEVEL3_SEG?)+ ;
levelFour: (levelFive+ LEVEL4_SEG?)+ ;
levelFive: tokens;
tokens: (DELIM | PAREN | TEXT | WS)+ ;
// LEXER RULES
LEVEL1_SEG : '\r'? '\n'| EOF ;
LEVEL2_SEG : '.' ;
LEVEL3_SEG : ',' ;
LEVEL4_SEG : 'and' | 'or' ;
DELIM : '`' | '"' | ';' | '/' | ':' | '’' | '‘' | '=' | '?' | '-' | '_';
PAREN : '(' | ')' | '[' | ']' | '{' | '}' ;
TEXT : (('a'..'z') | ('A'..'Z') | ('0'..'9'))+ ;
WS : [ \t]+ ;
I'd definitely go with a Scala parser combinator library.
https://lihaoyi.github.io/fastparse/
https://github.com/scala/scala-parser-combinators
Those are just two examples; you could also write such a library by hand with little effort and tune it to whatever you need. If you do write a parser monad on your own, I'd suggest building on Scalaz (https://github.com/scalaz/scalaz).
I wouldn't use a parser at all for that task. All you need is keyword spotting.
It's much easier and more flexible if you just scan your text for the "segmentors" by walking over the input. This also lets you handle text of any size (e.g. by using memory-mapped files), whereas parsers usually (ANTLR for sure) load the entire text into memory and tokenize it fully before parsing begins.
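A rough sketch of that keyword-spotting idea in Python, assuming the four segmentor levels from the question. The `SEGMENTORS` list and the `segment` function are names made up for this example, and it works on an in-memory string rather than a memory-mapped file.

```python
import re

# Segmentors per level, outermost first. They mirror the LEVEL*_SEG lexer
# rules in the question and are meant to be configurable.
SEGMENTORS = [
    r"\r?\n",            # level 1: line breaks
    r"\.",               # level 2: full stop
    r",",                # level 3: comma
    r"\b(?:and|or)\b",   # level 4: whole-word keywords
]


def segment(text, patterns=SEGMENTORS):
    """Split text into a nested list, one nesting level per segmentor."""
    if not patterns:
        return text.strip()
    parts = re.split(patterns[0], text)
    # Drop empty pieces, so a segmentor at the very start or end is harmless.
    return [segment(part, patterns[1:]) for part in parts if part.strip()]


print(segment("and location"))  # [[[['location']]]]
```

The `\b` word boundaries keep "and" from matching inside longer words, which the grammar got for free from the TEXT rule. A leading segmentor just produces an empty piece that is filtered out, so the "and location" input from the question is handled.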

How to write non-ambiguous grammar for LTL formula in bison

I am writing a CFG for LTL formulas, where atomic propositions are expressed directly by logic formulas. However, I am getting ambiguity in my grammar when I try to implement parentheses for both logic and LTL formulas (parentheses for logic formulas should have higher priority). Here's my grammar; when I uncomment the parenthesis rule in the ltl nonterminal, I get a shift/reduce conflict. How can I solve it?
%left TPLUS TMINUS
%left TMUL TDIV
%left TAND TOR TIMP
%left TRSHIFT TLSHIFT
%left TEQUAL TCNE TCGE TCGT TCLE TCLT
%left TUNTIL TWEAK TFUT TGLOB TREL TNEG
%start ltlformula
%%
ltlformula
: ltl {}
formula
: lexpr {}
;
lterm
: TLPAREN lexpr TRPAREN {}
| arexpr binary_la_oper arexpr {}
;
lnterm
: lterm {}
| TNEG lnterm {}
;
lexpr
: lterm {}
| lexpr binary_ll_oper lnterm {}
;
ltl
: formula {}
| TFUT ltl {}
| TGLOB ltl {}
| ltl TUNTIL ltl {}
| ltl TREL ltl {}
| ltl TWEAK ltl {}
| TNEG ltl {}
// | TLPAREN ltl TRPAREN { } - here comes the trouble...
;
The basic problem is that an ltl can match a parenthesized lexpr in two ways:
        ltl                        ltl
       / | \                        |
TLPAREN ltl TRPAREN              formula
         |                          |
      formula                     lexpr
         |                          |
       lexpr                      lterm
                                  / | \
                          TLPAREN lexpr TRPAREN
If you want to fix this so that the second parse is not possible, you need to un-factor the grammar so that an ltl cannot expand into an lterm that expands into a parenthesized expression. This involves splitting (duplicating) all the rules along that path:
ltl: formula_no_paren
| ..other ltl rules
formula_no_paren: lexpr_no_paren ;
lexpr_no_paren
: lterm_no_paren
| ... all other lexpr rules
lterm_no_paren: ... all lterm rules that don't start with TLPAREN
You can then refactor the other rules to use these no_paren rules to avoid duplicating all the actions:
lterm_paren : TLPAREN lexpr TRPAREN ;
lterm : lterm_paren | lterm_no_paren ;
lexpr_paren : lterm_paren ;
lexpr : lexpr_paren | lexpr_no_paren ;
You can make this a bit simpler by getting rid of the useless formula rule first.
Alternately, you can (ab)use bison's precedence resolution rules by giving the formula: lexpr rule an explicit precedence (with %prec) that is higher than the precedence of TRPAREN.
If you want to prefer the second parse, you don't need to do anything, as that is what the default conflict resolution (prefer shift over reduce) will do. You can shut up the warning message by giving the formula: lexpr rule an explicit precedence that is lower than the precedence of TRPAREN.
I have not used Bison, so my answer is predicated on my general knowledge of parsing.
The shift/reduce conflict arises because the ltl production can match on TLPAREN in two possible ways. The first is the rule you're attempting to add. The other is when the parser follows these non-terminals: formula -> lexpr -> lterm.
This has to do with the lookahead properties of the parser. The link below is to the Bison documentation regarding lookahead and handling shift/reduce conflicts.
http://www.gnu.org/software/bison/manual/bison.html#Lookahead

ANTLR lexer disabling tokens then reenabling them not working as expected

So I have a lexer with a token that is enabled or disabled depending on a boolean property.
I create an input stream and parse a text. My token is called PHRASE_TEXT and should match anything falling within this pattern '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?
I tokenize "foo bar" and as expected I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text I get "foo , bar" so 2 tokens instead of one. This is also expected behavior.
The problem comes when setting the property to true again. I would expect the same text to tokenize to the whole 1 token "foo bar" but instead is tokenized to the 2 tokens from before. Is this a bug on my part? What am I doing wrong here? I tried using new instances of the tokenizer and reusing the same instance but it doesn't seem to work either way. Thanks in advance.
Edit : Part of my grammar follows below
grammar LuceneQueryParser;
@header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}
@lexer::members {
public boolean phrases = true;
}
@parser::members {
public boolean phraseQueries = true;
}
mainQ : LPAREN query RPAREN
| query
;
query : not ((AND|OR)? not)* ;
andClause : AND ;
orClause : OR ;
not : NOT? modifier? clause;
clause : qualified
| unqualified
;
unqualified : LBRACK range_in RBRACK
| LCURL range_out RCURL
| truncated
| {phraseQueries}? quoted
| LPAREN query RPAREN
| normal
;
truncated : TERM_TEXT_TRUNCATED;
range_in : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
qualified : TERM_TEXT COLON unqualified ;
normal : TERM_TEXT;
quoted : PHRASE_TEXT;
modifier : PLUS
| MINUS
;
PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;
TERM_TEXT : (TERM_CHAR|ESCAPE)+;
TERM_CHAR : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '*' | '|' | '&' | '?' );
ESCAPE : '\\' ~[];
The problem seems to be that after I set phrases to false and then to true again, no more tokens are recognized as PHRASE_TEXT. I know that, as a guideline, I should define my grammars to be unambiguous, but this is basically how it has to end up looking: tokenizing a string with quotes in two different modes, depending on the situation.
I have to update this with an answer that a colleague of mine helpfully pointed out. The generated lexer class has a static DFA[] array shared between all instances of the class. Once the property was set to false instead of the default true, the decision tree was apparently changed for all object instances. The fix was to have two separate DFA[] arrays, one for the true and one for the false value of the property I was modifying. I think making that array non-static would be too expensive, and I really can't think of another fix.

ANTLR doesn't find the defined start rule

I'm facing a strange ANTLR issue with a grammar that should just output an AST.
grammar ltxt.g;
options
{
language=CSharp3;
}
prog : start
;
start : '{Start 'loopname'}'statement'{Ende 'loopname'}'
| statement
;
loopname : (('a'..'z')|('A'..'Z')|('1'..'9'))*;
statement : '<%' table_ref '>'
| start;
table_ref : '{'format'}'ID;
format : FSTRING
| FSTRING OFSTRING{0,5}
;
FSTRING : '#F'
| '#D'
| '#U'
| '#K'
;
OFSTRING: 'F'
| 'D'
| 'U'
| 'K'
//| 1..65536
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
When I try to code-gen this I get
error(100):LTXT.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52). I didn't declare any 74 or 52.
Also, I do not get a syntax diagram, since rule "start" cannot be found as a start state...
I know that this isn't pretty, but I thought it would work at least :)
Best,
wishi
There are four errors that I see.
A grammar name can't contain a period. That's the syntax error you're getting. The 74!=52 error message is a hint telling you that ANTLR found token id 74 when it was expecting token id 52, which in this case just translates to "it found one thing when it expected something else."
The grammar name ("ltxt") and the file name before the extension ("LTXT") need to match exactly.
The grammar won't produce an AST unless you specify output=AST; in the options section.
format's second alternative (FSTRING OFSTRING{0,5}) won't do what I think you think it's going to do. ANTLR doesn't support an arbitrary number of matches such as "match zero to five OFSTRINGs". You'll need to redefine the rule using semantic predicates that count occurrences for you. They aren't hard to use, but they're one of the trickier parts of ANTLR.
I hope that helps get you started.
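The counting idea from the fourth point can be sketched outside ANTLR. This is a hand-written Python illustration, not ANTLR predicate syntax; `parse_format` and `MAX_OFSTRINGS` are names assumed for the example, and the token sets are copied from the FSTRING and OFSTRING rules in the question.

```python
FSTRINGS = {"#F", "#D", "#U", "#K"}
OFSTRINGS = {"F", "D", "U", "K"}
MAX_OFSTRINGS = 5


def parse_format(tokens, pos=0):
    """format : FSTRING followed by zero to five OFSTRINGs.

    Returns the position just after the match, or raises SyntaxError.
    """
    if pos >= len(tokens) or tokens[pos] not in FSTRINGS:
        raise SyntaxError(f"expected FSTRING at position {pos}")
    pos += 1
    count = 0
    # The loop condition plays the role of the counting predicate:
    # stop matching OFSTRING once the counter reaches the limit.
    while pos < len(tokens) and tokens[pos] in OFSTRINGS and count < MAX_OFSTRINGS:
        pos += 1
        count += 1
    return pos
```

With this sketch, a sixth OFSTRING is simply left unconsumed, so whatever rule comes next has to accept or reject it.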

match a BEGIN and END in antlr

How can I tell ANTLR that if it sees a 'BEGIN', it must also see an 'END'?
Here is my code (I only need the BEGIN/END when I have multiple statements):
whileStatement
: 'WHILE' expression 'DO'
'BEGIN'?
statement
'END'?
;
and my statements
statement
: assignmentStatement
| ifStatement
| doLoopStatement
| whileStatement
| procedureCallStatement
;
No experience with ANTLR, but generally in BNF/context-free grammars you'd express this as
whileStatement
: 'WHILE' expr 'DO'
statementBlock
;
statementBlock
: statement
| 'BEGIN' statement* 'END'
;
or add statementBlock as an alternative in the definition of statement.
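To show what the statementBlock rule enforces, here is a small hand-written sketch in Python. Both function names are assumptions for the example, and `parse_statement` is only a placeholder that treats any single non-keyword token as a statement; a real parser would dispatch to the assignment, if, loop and call rules instead.

```python
def parse_statement_block(tokens, pos=0):
    """statementBlock : statement | 'BEGIN' statement* 'END' ;

    Returns (statements, next_pos).
    """
    if pos < len(tokens) and tokens[pos] == "BEGIN":
        pos += 1
        statements = []
        while pos < len(tokens) and tokens[pos] != "END":
            statement, pos = parse_statement(tokens, pos)
            statements.append(statement)
        if pos >= len(tokens):
            raise SyntaxError("'BEGIN' without a matching 'END'")
        return statements, pos + 1  # skip over 'END'
    statement, pos = parse_statement(tokens, pos)
    return [statement], pos


def parse_statement(tokens, pos):
    """Placeholder: any single non-keyword token counts as one statement."""
    if pos >= len(tokens) or tokens[pos] in ("BEGIN", "END"):
        raise SyntaxError(f"expected a statement at position {pos}")
    return tokens[pos], pos + 1
```

Because 'BEGIN' and 'END' sit in the same alternative, one cannot appear without the other: a lone statement needs neither, and an opened block is rejected unless it is closed.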

Resources