ANTLR assignment expression disambiguation - parsing

The following grammar works, but also gives a warning:
test.g
grammar test;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
}
program
: expr ';'!
;
term: ID | INT
;
assign
: term ('='^ expr)?
;
add : assign (('+' | '-')^ assign)*
;
expr: add
;
// T O K E N S
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS :
( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
DOT : '.' ;
fragment
LETTER : ('a'..'z'|'A'..'Z') ;
fragment
DIGIT : '0'..'9' ;
Warning
[15:08:20] warning(200): C:\Users\Charles\Desktop\test.g:21:34:
Decision can match input such as "'+'..'-'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Again, it does produce a tree the way I want:
Input: 0 + a = 1 + b = 2 + 3;

ANTLR produces       | ... but I think it
this tree:           | gives the warning
                     | because it _could_
    +                | also be parsed this
   / \               | way:
  0   =              |
     / \             |         +
    a   +            |       /   \
       / \           |      +     3
      1   =          |    /   \
         / \         |   +     =
        b   +        |  / \   / \
           / \       | 0   = b   2
          2   3      |    / \
                     |   a   1
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?

Charles wrote:
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?
You shouldn't create two separate rules for assign and add. As your rules are written, assign has higher precedence than add, which you don't want: judging by your desired AST, they should have equal precedence. So you need to handle the operators +, - and = in a single rule:
program
: expr ';'!
;
expr
: term (('+' | '-' | '=')^ expr)*
;
But now the grammar is still ambiguous. You'll need to "help" the parser look beyond this ambiguity, to ensure there really is an operator followed by an expr ahead when parsing (('+' | '-' | '=') expr)*. This can be done using a syntactic predicate, which looks like this:
(look_ahead_rule(s)_in_here)=> rule(s)_to_actually_parse
(the ( ... )=> is the predicate syntax)
A little demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
program
: expr ';'!
;
expr
: term ((op expr)=> op^ expr)*
;
op
: '+'
| '-'
| '='
;
term
: ID
| INT
;
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z'|'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with the class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "0 + a = 1 + b = 2 + 3;";
testLexer lexer = new testLexer(new ANTLRStringStream(source));
testParser parser = new testParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.program().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
The output of the Main class is a DOT description of the AST, and the grammar is processed without any warnings from ANTLR.
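
To see why a single right-recursive rule yields the tree on the left, here is a minimal hand-written sketch in plain Java (no ANTLR runtime). It is not the generated parser: the class name `RightRecursiveExpr` and its methods are made up for illustration, and it uses `(op expr)?` instead of `(op expr)*`, which is equivalent here because the recursive call already consumes everything to the right.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of: expr : term (op^ expr)? ;  op : '+' | '-' | '=' ; */
final class RightRecursiveExpr {
    private final List<String> tokens;
    private int pos = 0;

    private RightRecursiveExpr(List<String> tokens) {
        this.tokens = tokens;
    }

    /** Splits the input into ID/INT words and single-character symbols. */
    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetterOrDigit(c) || c == '_') {
                int start = i;
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) {
                    i++;
                }
                out.add(src.substring(start, i));
            } else {
                out.add(String.valueOf(c));
                i++;
            }
        }
        return out;
    }

    private static boolean isOp(String t) {
        return t.equals("+") || t.equals("-") || t.equals("=");
    }

    /** expr : term (op expr)? ; the operator becomes the root, like ^ in ANTLR 3. */
    private String expr() {
        if (pos >= tokens.size() || isOp(tokens.get(pos)) || tokens.get(pos).equals(";")) {
            throw new IllegalArgumentException("expected ID or INT at token " + pos);
        }
        String left = tokens.get(pos++);          // term : ID | INT
        if (pos < tokens.size() && isOp(tokens.get(pos))) {
            String op = tokens.get(pos++);
            String right = expr();                // everything to the right nests deeper
            return "(" + op + " " + left + " " + right + ")";
        }
        return left;
    }

    /** program : expr ';' ; returns the tree in LISP-style notation. */
    static String parse(String src) {
        RightRecursiveExpr p = new RightRecursiveExpr(tokenize(src));
        String tree = p.expr();
        if (p.pos >= p.tokens.size() || !p.tokens.get(p.pos).equals(";")) {
            throw new IllegalArgumentException("expected ';'");
        }
        return tree;
    }

    public static void main(String[] args) {
        // prints (+ 0 (= a (+ 1 (= b (+ 2 3)))))
        System.out.println(parse("0 + a = 1 + b = 2 + 3;"));
    }
}
```

Because every operator recurses into `expr` for its right operand, all three operators end up with the same precedence and associate to the right, which is the shape of the desired AST.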

Related

How to consume the minimal input with fuzzy parsing with ANTLR 4.4+

I am trying to extract the condition between two keywords (IF & THEN in this example) without specifying the full grammar.
The input to the parser begins with the first keyword.
An input example could be: "IF A < 10 OR B> 5 THEN A = A + 1; B=6; ENDIF; IF A < 10 THEN A = 100 ENDIF"
From that input, I want to extract the condition: "A < 10 OR B> 5".
We did it with ANTLR 3.5 but are unable to make it work with ANTLR 4.4 & 4.5.
** 3.5 Grammar **
grammar FuzzyTest3;
options
{
output=AST;
language=Java;
}
@header
{package fuzzytest;}
@lexer::header
{package fuzzytest;}
ifrule: IF .* THEN;
IF : 'IF';
THEN : 'THEN';
IDENTIFIER : ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
SEPARATOR : ( '<' | '>' | ':' '(' | ')' | '-' | '+' | '=' | ';' );
WS : ( ' ' | '\t' | '\r' | '\n' | '\u000C')+
{
{ $channel = HIDDEN; }
};
** 4.4 Grammar **
grammar FuzzyTest4;
ifrule: IF (.)*? THEN;
//ifrule: IF .* THEN; //same result
IF : 'IF';
THEN : 'THEN';
IDENTIFIER : ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
SEPARATOR : ( '<' | '>' | ':' '(' | ')' | '-' | '+' | '=' | ';' );
WS : ( ' ' | '\t' | '\r' | '\n' | '\u000C') -> channel(HIDDEN);
With ANTLR 3.5:
ParserRuleReturnScope rulereturn = parser.ifrule();
result = parser.input.toString(rulereturn.start, rulereturn.stop);
System.out.println("TOKENS: "+result);
My output is :
"TOKENS: IF A < 10 OR B> 5 THEN"
With ANTLR 4.4:
ParserRuleContext rulereturn = parser.ifrule();
result = parser.getInputStream().getText(rulereturn.start, rulereturn.stop);
System.out.println("TOKENS: "+result);
My output is :
"line 2:76 no viable alternative at input '<EOF>'
TOKENS: IF A < 10 OR B> 5 THEN A = A + 1; B=6; ENDIF; IF A < 10 THEN A = 100 ENDIF"
Anyone have an idea? suggestion?
One way to do it (with that example) is:
ifrule: IF condition;
condition: ~(THEN|IF) condition | ~(THEN|IF);
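
The idea behind `~(THEN|IF)` is to keep consuming tokens only until the first THEN shows up. As a rough illustration of that "stop at the first THEN" behaviour outside ANTLR, here is a small plain-Java sketch. The class `ConditionExtractor` and its token pattern are assumptions made for this example and are much cruder than the lexer rules in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: extract the text between the first IF and the first THEN. */
final class ConditionExtractor {
    // One token is either a word (keyword, identifier, number) or a single symbol.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\S");

    /** Returns the condition text, or null when no IF ... THEN pair is found. */
    static String extractCondition(String input) {
        Matcher m = TOKEN.matcher(input);
        int condStart = -1;                       // offset just after the IF keyword
        while (m.find()) {
            String tok = m.group();
            if (condStart < 0) {
                if (tok.equals("IF")) {
                    condStart = m.end();
                }
            } else if (tok.equals("THEN")) {
                // Stop at the first THEN: this is the "consume minimal input" part.
                return input.substring(condStart, m.start()).trim();
            } else if (tok.equals("IF")) {
                return null;                      // mirrors ~(THEN|IF): no IF inside a condition
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String input = "IF A < 10 OR B> 5 THEN A = A + 1; B=6; ENDIF; "
                + "IF A < 10 THEN A = 100 ENDIF";
        // prints A < 10 OR B> 5
        System.out.println(extractCondition(input));
    }
}
```

Note that ENDIF is a single word token here, so it is not mistaken for IF.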

ANTLR parse assignments

I want to parse some assignments, where I only care about the assignment as a whole, not about what's inside the assignment. An assignment is indicated by ':='. (EDIT: Before and after the assignments other things may come)
Some examples:
a := TRUE & FALSE;
c := a ? 3 : 5;
b := case
a : 1;
!a : 0;
esac;
Currently I make a difference between assignments containing a 'case' and other assignments. For simple assignments I tried something like ~('case' | 'esac' | ';') but then antlr complained about unmatched tokens (like '=').
assignment :
NAME ':='! expression ;
expression :
( simple_expression | case_expression) ;
simple_expression :
((OPERATOR | NAME) & ~('case' | 'esac'))+ ';'! ;
case_expression :
'case' .+ 'esac' ';'! ;
I tried replacing it with the following, since the Eclipse interpreter did not seem to like the 'and' in ((OPERATOR | NAME) & ~('case' | 'esac'))+ ';'! ;
(~(OPERATOR | ~NAME | ('case' | 'esac')) |
~(~OPERATOR | NAME | ('case' | 'esac')) |
~(~OPERATOR | ~NAME | ('case' | 'esac'))) ';'!
But this does not work. I get
"error(139): /AntlrTutorial/src/foo/NusmvInput.g:78:5: set complement is empty |---> ~(~OPERATOR | ~NAME | ('case' | 'esac'))) EOC! ;"
How can I parse it?
There are a couple of things going wrong here:
- you're using & in your grammar, while it should have quotes around it: '&';
- unless you know exactly what you're doing, don't use ~ and . (especially not .+ !) inside parser rules: use them in lexer rules only;
- create lexer rules instead of defining 'case' and 'esac' in your parser rules (it's safe to use literal tokens in your parser rules if no other lexer rule can potentially match them, but 'case' and 'esac' look a lot like NAME and they could end up in your AST, in which case it's better to explicitly define them yourself in the lexer).
Here's a quick demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
CASES;
CASE;
}
parse
: (assignment SCOL)* EOF -> ^(ROOT assignment*)
;
assignment
: NAME ASSIGN^ expression
;
expression
: ternary_expression
;
ternary_expression
: or_expression (QMARK^ ternary_expression COL! ternary_expression)?
;
or_expression
: unary_expression ((AND | OR)^ unary_expression)*
;
unary_expression
: NOT^ atom
| atom
;
atom
: TRUE
| FALSE
| NUMBER
| NAME
| CASE single_case+ ESAC -> ^(CASES single_case+)
| '(' expression ')' -> expression
;
single_case
: expression COL expression SCOL -> ^(CASE expression expression)
;
TRUE : 'TRUE';
FALSE : 'FALSE';
CASE : 'case';
ESAC : 'esac';
ASSIGN : ':=';
AND : '&';
OR : '|';
NOT : '!';
QMARK : '?';
COL : ':';
SCOL : ';';
NAME : ('a'..'z' | 'A'..'Z')+;
NUMBER : ('0'..'9')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse your input:
a := TRUE & FALSE;
c := a ? 3 : 5;
b := case
a : 1;
!a : 0;
esac;
as follows:
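
For the narrower goal stated in the question (only the assignment as a whole matters), a hand-written scan may already be enough. The following plain-Java sketch is an alternative to the grammar, not a rendering of it: the class `AssignmentSplitter` is hypothetical, and it assumes the input contains only assignments and that `case`/`esac` are balanced. It treats `case` and `esac` as whole-word tokens, in the same spirit as giving them their own lexer rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: split input into whole assignments, ignoring ';' inside case ... esac. */
final class AssignmentSplitter {
    // ':=' first, then whole words, numbers, and finally any single symbol.
    private static final Pattern TOKEN = Pattern.compile(":=|[A-Za-z_]\\w*|\\d+|\\S");

    static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int start = -1;   // offset of the first token of the current assignment
        int depth = 0;    // nesting level of case ... esac
        while (m.find()) {
            String tok = m.group();
            if (start < 0) {
                start = m.start();
            }
            if (tok.equals("case")) {
                depth++;
            } else if (tok.equals("esac")) {
                depth--;
            } else if (tok.equals(";") && depth == 0) {
                // A ';' outside any case block ends the assignment.
                result.add(input.substring(start, m.start()).trim());
                start = -1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String src = "a := TRUE & FALSE;\n"
                + "c := a ? 3 : 5;\n"
                + "b := case\n a : 1;\n !a : 0;\nesac;";
        for (String assignment : split(src)) {
            System.out.println("[" + assignment + "]");
        }
    }
}
```

Because words are matched as a whole, a name such as `cases` is not confused with the keyword `case`.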

Antlr parsing matching fixed string length instead of rule

Below is a cut-down version of a grammar that parses an input assembly file. Everything in my grammar is fine until I use labels that have 3 characters (i.e. the same length as an OPCODE in my grammar), so I'm assuming ANTLR is matching it as an OPCODE rather than a LABEL. But how do I say "in this position, it should be a LABEL, not an OPCODE"?
Trial input:
set a, label1
set b, abc
Output from a standard rig gives:
line 2:5 missing EOF at ','
(OP_BAS set a (REF label1)) (OP_SPE set b)
When I step through it in the ANTLRWorks debugger, I see it start down instruction rule 2, but at the reference to "abc" it jumps to rule 3 and then fails at the ",".
I can solve this with massive left factoring, but it makes the grammar incredibly unreadable. I'm trying to find a compromise (there isn't so much input that the global backtrack is a hit on performance) between readability and functionality.
grammar TestLabel;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens {
NEGATION;
OP_BAS;
OP_SPE;
OP_CMD;
REF;
DEF;
}
program
: instruction* EOF!
;
instruction
: LABELDEF -> ^(DEF LABELDEF)
| OPCODE dst_op ',' src_op -> ^(OP_BAS OPCODE dst_op src_op)
| OPCODE src_op -> ^(OP_SPE OPCODE src_op)
| OPCODE -> ^(OP_CMD OPCODE)
;
operand
: REG
| LABEL -> ^(REF LABEL)
| expr
;
dst_op
: PUSH
| operand
;
src_op
: POP
| operand
;
term
: '('! expr ')'!
| literal
;
unary
: ('+'! | negation^ )* term
;
negation
: '-' -> NEGATION
;
mult
: unary ( ( '*'^ | '/'^ ) unary )*
;
expr
: mult ( ( '+'^ | '-'^ ) mult )*
;
literal
: number
| CHAR
;
number
: HEX
| BIN
| DECIMAL
;
REG: ('A'..'C'|'I'..'J'|'X'..'Z'|'a'..'c'|'i'..'j'|'x'..'z') ;
OPCODE: LETTER LETTER LETTER;
HEX: '0x' ( 'a'..'f' | 'A'..'F' | DIGIT )+ ;
BIN: '0b' ('0'|'1')+;
DECIMAL: DIGIT+ ;
LABEL: ( '.' | LETTER | DIGIT | '_' )+ ;
LABELDEF: ':' ( '.' | LETTER | DIGIT | '_' )+ {setText(getText().substring(1));} ;
STRING: '\"' .* '\"' {setText(getText().substring(1, getText().length()-1));} ;
CHAR: '\'' . '\'' {setText(getText().substring(1, 2));} ;
WS: (' ' | '\n' | '\r' | '\t' | '\f')+ { $channel = HIDDEN; } ;
fragment LETTER: ('a'..'z'|'A'..'Z') ;
fragment DIGIT: '0'..'9' ;
fragment PUSH: ('P'|'p')('U'|'u')('S'|'s')('H'|'h');
fragment POP: ('P'|'p')('O'|'o')('P'|'p');
The parser has no influence on what tokens the lexer produces. So, the input "abc" will always be tokenized as an OPCODE, no matter what the parser tries to match.
What you can do is create a label parser rule that matches either a LABEL or an OPCODE, and then use this label rule in your operand rule:
label
: LABEL
| OPCODE
;
operand
: REG
| label -> ^(REF label)
| expr
;
resulting in the following AST for your example input:
This will only match OPCODE, but will not change the type of the token. If you want the type to be changed as well, add a bit of custom code to the rule that changes it to type LABEL:
label
: LABEL
| t=OPCODE {$t.setType(LABEL);}
;
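
To make the "lexer decides first, parser accepts later" point concrete, here is a rough plain-Java model of it. It is only a sketch: `LexerPriorityDemo`, `classify` and `label` are invented names, and the classification merely approximates how the token rules in the question compete (an earlier rule wins when several match the same text).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical model: tokens are typed without any knowledge of the parser. */
final class LexerPriorityDemo {
    enum Type { REG, OPCODE, LABEL, COMMA }

    static final class Token {
        Type type;                // mutable so the parser side can retype it
        final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() { return type + ":" + text; }
    }

    private static final Pattern WORD_OR_COMMA = Pattern.compile("[.\\w]+|,");

    /** Rules are tried in grammar order: REG, then OPCODE, then LABEL. */
    static Type classify(String text) {
        if (text.equals(",")) return Type.COMMA;
        if (text.matches("[A-CIJX-Za-cijx-z]")) return Type.REG;
        if (text.matches("[A-Za-z]{3}")) return Type.OPCODE;   // any three letters
        return Type.LABEL;
    }

    static List<Token> tokenize(String src) {
        List<Token> out = new ArrayList<>();
        Matcher m = WORD_OR_COMMA.matcher(src);
        while (m.find()) {
            out.add(new Token(classify(m.group()), m.group()));
        }
        return out;
    }

    /** Mirrors: label : LABEL | t=OPCODE {$t.setType(LABEL);} ; */
    static Token label(Token t) {
        if (t.type == Type.OPCODE) {
            t.type = Type.LABEL;  // accept the OPCODE token and retype it
        } else if (t.type != Type.LABEL) {
            throw new IllegalArgumentException("not a label: " + t);
        }
        return t;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("set b, abc");
        System.out.println(tokens);                 // "abc" comes out as OPCODE
        System.out.println(label(tokens.get(3)));   // retyped to LABEL where a label is expected
    }
}
```

The tokenizer never looks at the position of "abc" in the instruction; only the `label` step, which plays the role of the parser rule, knows that a label is expected there.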

Assignment as expression in Antlr grammar

I'm trying to extend the grammar of the Tiny Language to treat assignment as expression. Thus it would be valid to write
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
a = 1 = 2; // invalid
Assignment differs from other operators in two aspects. It's right associative (not a big deal), and its left-hand side has to be a variable. So I changed the grammar like this
statement: assignmentExpr | functionCall ...;
assignmentExpr: Identifier indexes? '=' expression;
expression: assignmentExpr | condExpr;
It doesn't work, because it contains a non-LL(*) decision. I also tried this variant:
assignmentExpr: Identifier indexes? '=' (expression | condExpr);
but I got the same error. I am interested in:
- this specific question;
- given a grammar with a non-LL(*) decision, how to find the two paths that cause the problem;
- how to fix it.
I think you can change your grammar like this to achieve the same, without using syntactic predicates:
statement: Expr ';' | functionCall ';'...;
Expr: Identifier indexes? '=' Expr | condExpr ;
condExpr: .... and so on;
I altered Bart's example with this idea in mind:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
And for the input:
a=b=4;
a = 2 * (b = 1);
you get the following parse tree:
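
The rule `expr : Id Assign expr | add` needs two tokens of lookahead: an Id followed by '=' selects the assignment alternative, anything else falls through to `add`. A minimal recursive-descent sketch of that decision in plain Java follows. The class `AssignExprParser` is hypothetical, it prints trees in LISP-style notation rather than building ANTLR ASTs, and its tokenizer silently skips characters it does not know (comments are not handled).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of: expr : Id '=' expr | add ; with add, mult and atom as in the grammar. */
final class AssignExprParser {
    private static final Pattern TOKEN = Pattern.compile("[a-z]+|[0-9]+|[-+*/=();]");
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private AssignExprParser(String src) {
        Matcher m = TOKEN.matcher(src);
        while (m.find()) tokens.add(m.group());
    }

    private String peek(int k) {
        int i = pos + k;
        return i < tokens.size() ? tokens.get(i) : "<EOF>";
    }

    private String match(String expected) {
        if (!peek(0).equals(expected)) {
            throw new IllegalStateException("expected " + expected + " but found " + peek(0));
        }
        return tokens.get(pos++);
    }

    private String expr() {
        // Two-token lookahead: Id followed by '=' means assignment.
        if (peek(0).matches("[a-z]+") && peek(1).equals("=")) {
            String id = tokens.get(pos++);
            match("=");
            return "(= " + id + " " + expr() + ")";   // right-recursive, so right-associative
        }
        return add();
    }

    private String add() {
        String left = mult();
        while (peek(0).equals("+") || peek(0).equals("-")) {
            String op = tokens.get(pos++);
            left = "(" + op + " " + left + " " + mult() + ")";
        }
        return left;
    }

    private String mult() {
        String left = atom();
        while (peek(0).equals("*") || peek(0).equals("/")) {
            String op = tokens.get(pos++);
            left = "(" + op + " " + left + " " + atom() + ")";
        }
        return left;
    }

    private String atom() {
        String t = peek(0);
        if (t.equals("(")) {
            match("(");
            String inner = expr();
            match(")");
            return inner;
        }
        if (t.matches("[a-z]+|[0-9]+")) {
            pos++;
            return t;
        }
        throw new IllegalStateException("unexpected token " + t);
    }

    /** stat : expr ';' ; */
    static String parseStatement(String src) {
        AssignExprParser p = new AssignExprParser(src);
        String tree = p.expr();
        p.match(";");
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parseStatement("a=b=4;"));            // (= a (= b 4))
        System.out.println(parseStatement("a = 2 * (b = 1);"));  // (= a (* 2 (= b 1)))
    }
}
```

With this shape, `a = 1 = 2;` is rejected: after `a = 1` the parser expects ';' and finds a second '=' instead, because the left-hand side of an assignment must be an Id.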
The key here is that you need to "assure" the parser that inside an expression, there is something ahead that satisfies the expression. This can be done using a syntactic predicate (the ( ... )=> parts in the add and mult rules).
A quick demo:
grammar TL;
options {
output=AST;
}
tokens {
ROOT;
ASSIGN;
}
parse
: stat* EOF -> ^(ROOT stat*)
;
stat
: expr ';' -> expr
;
expr
: add
;
add
: mult ((('+' | '-') mult)=> ('+' | '-')^ mult)*
;
mult
: atom ((('*' | '/') atom)=> ('*' | '/')^ atom)*
;
atom
: (Id -> Id) ('=' expr -> ^(ASSIGN Id expr))?
| Num
| '(' expr ')' -> expr
;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse the input:
a = b = 1; // -> a = (b = 1)
a = 2 * (b = 1); // contrived but valid
into the following AST:

BibTex grammar for ANTLR

I'm looking for a BibTeX grammar in ANTLR to use in a hobby project. I don't want to spend my time writing an ANTLR grammar (this may take some time for me because it will involve a learning curve). So I'd appreciate any pointers.
Note: I've found BibTeX grammars for bison and yacc but couldn't find any for ANTLR.
Edit: As Bart pointed out, I don't need to parse the preambles and TeX in the quoted strings.
Here's a (very) rudimentary BibTex grammar that emits an AST (contrary to a simple parse tree):
grammar BibTex;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BIBTEXFILE;
TYPE;
STRING;
PREAMBLE;
COMMENT;
TAG;
CONCAT;
}
//////////////////////////////// Parser rules ////////////////////////////////
parse
: (entry (Comma? entry)* Comma?)? EOF -> ^(BIBTEXFILE entry*)
;
entry
: Type Name Comma tags CloseBrace -> ^(TYPE Name tags)
| StringType Name Assign QuotedContent CloseBrace -> ^(STRING Name QuotedContent)
| PreambleType content CloseBrace -> ^(PREAMBLE content)
| CommentType -> ^(COMMENT CommentType)
;
tags
: (tag (Comma tag)* Comma?)? -> tag*
;
tag
: Name Assign content -> ^(TAG Name content)
;
content
: concatable (Concat concatable)* -> ^(CONCAT concatable+)
| Number
| BracedContent
;
concatable
: QuotedContent
| Name
;
//////////////////////////////// Lexer rules ////////////////////////////////
Assign
: '='
;
Concat
: '#'
;
Comma
: ','
;
CloseBrace
: '}'
;
QuotedContent
: '"' (~('\\' | '{' | '}' | '"') | '\\' . | BracedContent)* '"'
;
BracedContent
: '{' (~('\\' | '{' | '}') | '\\' . | BracedContent)* '}'
;
StringType
: '@' ('s'|'S') ('t'|'T') ('r'|'R') ('i'|'I') ('n'|'N') ('g'|'G') SP? '{'
;
PreambleType
: '@' ('p'|'P') ('r'|'R') ('e'|'E') ('a'|'A') ('m'|'M') ('b'|'B') ('l'|'L') ('e'|'E') SP? '{'
;
CommentType
: '@' ('c'|'C') ('o'|'O') ('m'|'M') ('m'|'M') ('e'|'E') ('n'|'N') ('t'|'T') SP? BracedContent
| '%' ~('\r' | '\n')*
;
Type
: '@' Letter+ SP? '{'
;
Number
: Digit+
;
Name
: Letter (Letter | Digit | ':' | '-')*
;
Spaces
: SP {skip();}
;
//////////////////////////////// Lexer fragments ////////////////////////////////
fragment Letter
: 'a'..'z'
| 'A'..'Z'
;
fragment Digit
: '0'..'9'
;
fragment SP
: (' ' | '\t' | '\r' | '\n' | '\f')+
;
(if you don't want the AST, remove all -> and everything to the right of it and remove both the options{...} and tokens{...} blocks)
which can be tested with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// parse the file 'test.bib'
BibTexLexer lexer = new BibTexLexer(new ANTLRFileStream("test.bib"));
BibTexParser parser = new BibTexParser(new CommonTokenStream(lexer));
// you can use the following tree in your code
// see: http://www.antlr.org/api/Java/classorg_1_1antlr_1_1runtime_1_1tree_1_1_common_tree.html
CommonTree tree = (CommonTree)parser.parse().getTree();
// print a DOT tree of our AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and the following example Bib-input (file: test.bib):
@PREAMBLE{
"\newcommand{\noopsort}[1]{} "
# "\newcommand{\singleletter}[1]{#1} "
}
@string {
me = "Bart Kiers"
}
@ComMENt{some comments here}
% or some comments here
@article{mrx05,
auTHor = me # "Mr. X",
Title = {Something Great},
publisher = "nob" # "ody",
YEAR = 2005,
x = {{Bib}\TeX},
y = "{Bib}\TeX",
z = "{Bib}" # "\TeX",
},
@misc{ patashnik-bibtexing,
author = "Oren Patashnik",
title = "BIBTEXing",
year = "1988"
} % no comma here
@techreport{presstudy2002,
author = "Dr. Diessen, van R. J. and Drs. Steenbergen, J. F.",
title = "Long {T}erm {P}reservation {S}tudy of the {DNEP} {P}roject",
institution = "IBM, National Library of the Netherlands",
year = "2002",
month = "December",
}
Run the demo
If you now generate a parser & lexer from the grammar:
java -cp antlr-3.3.jar org.antlr.Tool BibTex.g
and compile all .java source files:
javac -cp antlr-3.3.jar *.java
and finally run the Main class:
*nix/MacOS
java -cp .:antlr-3.3.jar Main
Windows
java -cp .;antlr-3.3.jar Main
You'll see some output on your console which corresponds to the AST of the input (the original image was generated with graphviz-dev.appspot.com).
To emphasize: I did not properly test the grammar! I wrote it a while back and never really used it in any project.
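
The BracedContent lexer rule in the grammar is recursive: inside the braces it allows ordinary characters, a backslash followed by any character, or another BracedContent. As a rough illustration of what that rule matches, here is a plain-Java sketch that uses a depth counter instead of recursion. The class `BracedContentScanner` and the method `matchBraces` are hypothetical names, not part of the grammar or of ANTLR.

```java
/** Hypothetical sketch of the text matched by: '{' (~('\\'|'{'|'}') | '\\' . | BracedContent)* '}' */
final class BracedContentScanner {
    /**
     * Returns the index just past the '}' that closes the '{' at position open,
     * or -1 if the braces are unbalanced.
     */
    static int matchBraces(String s, int open) {
        if (open < 0 || open >= s.length() || s.charAt(open) != '{') {
            throw new IllegalArgumentException("no '{' at index " + open);
        }
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++;                      // '\\' . : skip the escaped character
            } else if (c == '{') {
                depth++;                  // nested BracedContent starts
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return i + 1;         // the outermost brace is closed
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String tagValue = "{{Bib}\\TeX},";        // the text {{Bib}\TeX},
        int end = matchBraces(tagValue, 0);
        // prints {{Bib}\TeX}
        System.out.println(tagValue.substring(0, end));
    }
}
```

An escaped brace such as `\}` does not close the block, which matches the `'\\' .` alternative in the lexer rule.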

Resources