Matching of tokens with Antlr4 - parsing

I am a an Antlr4 newbie and have problems with a relatively simple grammar. The grammar is given at the bottom at the end. (This is a fragment from a grammar for parsing description of biological sequence variants).
I am trying to parse the string "p.A3L" in the following unit test.
#Test
public void testProteinSubtitutionWithoutRef() {
ANTLRInputStream inputStream = new ANTLRInputStream("p.A3L");
HGVSLexer l = new HGVSLexer(inputStream);
HGVSParser p = new HGVSParser(new CommonTokenStream(l));
p.setTrace(true);
p.addErrorListener(new BaseErrorListener() {
#Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
int charPositionInLine, String msg, RecognitionException e) {
throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
}
});
p.hgvs();
}
The test fails with the message "line 1:2 mismatched input 'A3L' expecting AA". I assume that this is related to lexing, i.e. splitting "A3L" into the three tokens A, 3, and L, such that the parser can then generate the corresponding syntax subtree containing the three terminals from it.
What is going wrong here and where can I learn how to fix this?
The grammar
grammar HGVS;
hgvs: protein_var
;
// Basix lexemes
AA: AA1
| AA3
| 'X';
AA1: 'A'
| 'R'
| 'N'
| 'D'
| 'C'
| 'Q'
| 'E'
| 'G'
| 'H'
| 'I'
| 'L'
| 'K'
| 'M'
| 'F'
| 'P'
| 'S'
| 'T'
| 'W'
| 'Y'
| 'V';
AA3: 'Ala'
| 'Arg'
| 'Asn'
| 'Asp'
| 'Cys'
| 'Gln'
| 'Glu'
| 'Gly'
| 'His'
| 'Ile'
| 'Leu'
| 'Lys'
| 'Met'
| 'Phe'
| 'Pro'
| 'Ser'
| 'Thr'
| 'Trp'
| 'Tyr'
| 'Val';
NUMBER: [0-9]+;
NAME: [a-zA-Z0-9_]+;
// Top-level Rule
/** Variant in a protein. */
protein_var: 'p.' AA NUMBER AA
;

There are two problems:
Define the rule for protein_var ahead of the lexer rules (should work now to, but is not easy to read because the other parser rule is ahead).
Remove the rule for NAME. A3L is not (as you probably expected) AA NUMBER AA but NAME <= ANTLR always prefers the longest matching lexer rule
The resulting grammar should look like:
grammar HGVS;
hgvs
: protein_var
;
protein_var
: 'p.' AA NUMBER AA
;
AA: ...;
AA3: ...;
AA1: ...;
NUMBER: [0-9]+;
If you need NAME for other purposes, you will have to disambiguate it in the lexer (by a prefix that NAMEs and AA do not have in common or by using lexer modes).

Related

Parsing Decaf grammar in Antlr4

I am creating parser and lexer rules for Decaf programming language written in ANTLR4. There is a parser test file I am trying to run to get the parser tree for it by printing the visited nodes on the terminal window and paste them into D3_parser_tree.html class. The current parser tree is missing the right square brackets with the number 10 according to this testing file : class program { int i [10]; }
The error I am getting : mismatched input '10' expecting INT_LITERAL
I am not sure why I am getting this error although I have declared a lexer rule for INT_LITERAL and then called it in a parser rule within field_decl according to the given Decaf spec :
** Parser rules **
<program> → class Program ‘{‘ <field_decl>* <method_decl>* ‘}’
<field_decl> → <type> { <id> | <id> ‘[‘ <int_literal> ‘]’ }+, ;
<method_decl> → { <type> | void } <id> ( [ { <type> <id> }+, ] ) <block>
<digit> → 0 | 1 | 2 | … | 9
<block> → ‘{‘ <var_decl>* <statement>* ‘}’
<literal> → <int_literal> | <char_literal> | <bool_literal>
<hex_digit> → <digit> | a | b | c | … | f | A | B | C | … | F
<int_literal> → <decimal_literal> | <hex_literal>
<decimal_literal> → <digit> <digit>*
<hex_literal> → 0x <hex_digit> <hex_digit>*
Related Lexer rules :
NUMBER : [0-9]+;
fragment ALPHA : [_a-zA-Z0-9];
fragment DIGIT : [0-9];
fragment DECIMAL_LITERAL : DIGIT+;
CHAR_LITERAL : '\'' CHAR '\'';
STRING_LITERAL : '"' CHAR+ '"' ;
COMMENT : '//' ~('\n')* '\n' -> skip;
WS : (' ' | '\n' | '\t' | '\r') + -> skip;
Related Parser rules :
program : CLASS VAR LCURLYBRACE field_decl*method_decl* RCURLYBRACE EOF;
field_decl : data_type field ( COMMA field )* SEMICOLON;
Please let me know if you need further details & I appreciate your help a lot.
The following rules conflict:
VAR : ALPHA+;
...
NUMBER : [0-9]+;
...
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
They all match 10, but the lexer will always choose VAR since that is the rule defined first.
This is just how ANTLR's lexer works: it tries to match the most characters as possible, and when two (or more) rules all match the same amount of characters, the one defined first "wins".
You will see that it parses correctly if you change field into:
field : VAR | VAR LSQUAREBRACE VAR RSQUAREBRACE;

Ignore lexer rule after initial match

thanks for taking a look at my question.
So I have the grammar and lexer rules that I use to parse user input on a grocery list.
The grammar matches sentences such as '10 Pound beef' which has the tokens 'amount unit ware'. The ware token matches any valid unicode string but I cannot enter strings matched by the unit token as they are caught by the unit rule. So my question is, can i instruct my lexer to ignore the unit rule after the first match such that I can input '10 Pound Pound' with the tokens 'amount unit ware' without errors?
Grammar:
grammar Shopping;
import lexerrules;
parse : item EOF ;
item : (amount (SPACE* unit)? SPACE+)? ware | (unit (SPACE* amount)? SPACE+)? ware ;
ware : STRING (SPACE+ STRING)* ;
amount : NUM ;
unit : UNIT ;
Lexer rules:
lexer grammar lexerrules;
NUM : [0-9]+(('.'|',')[0-9]+)? ;
UNIT : WEIGHT | LENGTH | VOLUME | MISC ;
STRING : CHAR+
SPACE : ' ' ;
WS : [\u000C\f\t\r\n]+ -> skip ;
CHAR : '\u0041' .. '\uFFFF' ;
WEIGHT : [Kk]'g' | [Kk]'ilo' | [Kk]'ilogram' | [Gg] | [Gg]'ram' |
[Dd]'ecigram' | [Oo]'unce' | [Oo]'z' | [Pp]'ound' | [Ll]'b' ;
LENGTH : [Mm] | [Mm]'eter' | [Cc]'m' | [Cc]'entimeter' |
[Ii]'nch' | [Ii][Nn] ;
VOLUME : [Ll] | [Ll]'iter' | [Dd]'l' | [Dd]'eciliter' | [Cc][Ll] |
[Cc]'entiliter' ;

Flex and Yacc Grammar Issue

Edit #1: I think the problem is in my .l file. I don't think the rules are being treated as rules, and I'm not sure how to treat the terminals of the rules as strings.
My last project for a compilers class is to write a .l and a .y file for a simple SQL grammar. I have no experience with Flex or Yacc, so everything I have written I have pieced together. I only have a basic understanding of how these files work, so if you spot my problem can you also explain what that section of the file is supposed to do? I'm not even sure what the '%' symbols do.
Basically some rules just do not work when I try to parse something. Some rules hang and others reject when they should accept. I need to implement the following grammar:
start
::= expression
expression
::= one-relation-expression | two-relation-expression
one-relation-expression
::= renaming | restriction | projection
renaming
::= term RENAME attribute AS attribute
term
::= relation | ( expression )
restriction
::= term WHERE comparison
projection
::= term | term [ attribute-commalist ]
attribute-commalist
::= attribute | attribute , attribute-commalist
two-relation-expression
::= projection binary-operation expression
binary-operation
::= UNION | INTERSECT | MINUS | TIMES | JOIN | DIVIDEBY
comparison
::= attribute compare number
compare
::= < | > | <= | >= | = | <>
number
::= val | val number
val
::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
attribute
::= CNO | CITY | CNAME | SNO | PNO | TQTY |
SNAME | QUOTA | PNAME | COST | AVQTY |
S# | STATUS | P# | COLOR | WEIGHT | QTY
relation
::= S | P | SP | PRDCT | CUST | ORDERS
Here is my .l file:
%{
#include <stdio.h>
#include "p5.tab.h"
%}
binaryOperation UINION|INTERSECT|MINUS|TIMES|JOIN|DIVIDEBY
compare <|>|<=|>=|=|<>
attribute CNO|CITY|CNAME|SNO|PNO|TQTY|SNAME|QUOTA|PNAME|COST|AVQTY|S#|STATUS|P#|COLOR|WEIGHT|QTY
relation S|P|SP|PRDCT|CUST|ORDERS
%%
[ \t\n]+ ;
{binaryOperation} return binaryOperation;
{compare} return compare;
[0-9]+ return val;
{attribute} return attribute;
{relation} return relation;
"RENAME" return RENAME;
"AS" return AS;
"WHERE" return WHERE;
"(" return '(';
")" return ')';
"[" return '[';
"]" return ']';
"," return ',';
. {printf("REJECT\n");
exit(0);}
%%
Here is my .y file:
%{
#include <stdio.h>
#include <stdlib.h>
%}
%token RENAME attribute AS relation WHERE binaryOperation compare val
%%
start:
expression {printf("ACCEPT\n");}
;
expression:
oneRelationExpression
| twoRelationExpression
;
oneRelationExpression:
renaming
| restriction
| projection
;
renaming:
term RENAME attribute AS attribute
;
term:
relation
| '(' expression ')'
;
restriction:
term WHERE comparison
;
projection:
term
| term '[' attributeCommalist ']'
;
attributeCommalist:
attribute
| attribute ',' attributeCommalist
;
twoRelationExpression:
projection binaryOperation expression
;
comparison:
attribute compare number
;
number:
val
| val number
;
%%
yyerror() {
printf("REJECT\n");
exit(0);
}
main() {
yyparse();
}
yywrap() {}
Here is my makefile:
p5: p5.tab.c lex.yy.c
cc -o p5 p5.tab.c lex.yy.c
p5.tab.c: p5.y
bison -d p5.y
lex.yy.c: p5.l
flex p5.l
This works:
S RENAME CNO AS CITY
These do not:
S
S WHERE CNO = 5
I have not tested everything, but I think there is a common problem for these issues.
Your grammar is correct, the problem is that you are running interactively. When you call yyparse() it will attempt to read all input. Because the input
S
could be followed by either RENAME or WHERE it won't accept. Similarly,
S WHERE CNO = 5
could be followed by one or more numbers, so yyparse won't accept until it gets an EOF or an unexpected token.
What you want to do is follow the advice here and change p5.l to have these lines:
[ \t]+ ;
\n if (yyin==stdin) return 0;
That way when you are running interactively it will take the ENTER key to be the end of input.
Also, you want to use left recursion for number:
number:
val
| number val
;

ANTLR assignment expression disambiguation

The following grammar works, but also gives a warning:
test.g
grammar test;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
}
program
: expr ';'!
;
term: ID | INT
;
assign
: term ('='^ expr)?
;
add : assign (('+' | '-')^ assign)*
;
expr: add
;
// T O K E N S
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS :
( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
DOT : '.' ;
fragment
LETTER : ('a'..'z'|'A'..'Z') ;
fragment
DIGIT : '0'..'9' ;
Warning
[15:08:20] warning(200): C:\Users\Charles\Desktop\test.g:21:34:
Decision can match input such as "'+'..'-'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Again, it does produce a tree the way I want:
Input: 0 + a = 1 + b = 2 + 3;
ANTLR produces | ... but I think it
this tree: | gives the warning
| because it _could_
+ | also be parsed this
/ \ | way:
0 = |
/ \ | +
a + | / \
/ \ | + 3
1 = | / \
/ \ | + =
b + | / \ / \
/ \ | 0 = b 2
2 3 | / \
| a 1
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?
Charles wrote:
How can I explicitly tell ANTLR that I want it to create the AST on the left, thus making my intent clear and silencing the warning?
You shouldn't create two separate rules for assign and add. As your rules are now, assign has precedence over add, which you don't want: they should have equal precedence by looking at your desired AST. So, you need to wrap all operators +, - and = in one rule:
program
: expr ';'!
;
expr
: term (('+' | '-' | '=')^ expr)*
;
But now the grammar is still ambiguous. You'll need to "help" the parser to look beyond this ambiguity to assure there really is operator expr ahead when parsing (('+' | '-' | '=') expr)*. This can be done using a syntactic predicate, which looks like this:
(look_ahead_rule(s)_in_here)=> rule(s)_to_actually_parse
(the ( ... )=> is the predicate syntax)
A little demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
program
: expr ';'!
;
expr
: term ((op expr)=> op^ expr)*
;
op
: '+'
| '-'
| '='
;
term
: ID
| INT
;
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
WS : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z'|'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with the class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "0 + a = 1 + b = 2 + 3;";
testLexer lexer = new testLexer(new ANTLRStringStream(source));
testParser parser = new testParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.program().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
And the output of the Main class corresponds to the following AST:
which is created without any warnings from ANTLR.

Parse tree generation with Java CUP

I am using CUP with JFlex to validate expression syntax. I have the basic functionality working: I can tell if an expression is valid or not.
Next step is to implement simple arithmetic operations, such as "add 1". For example, if my expression is "1 + a", the result should be "2 + a". I need access to parse tree to do that, because simply identifying a numeric term won't do it: the result of adding 1 to "(1 + a) * b" should be "(1 + a) * b + 1", not "(2 + a) * b".
Does anyone have a CUP example that generates a parse tree? I think I will be able to take it from there.
As an added bonus, is there a way to get a list of all tokens in expression using JFlex? Seems like a typical use case, but I cannot figure out how to do it.
Edit: Found a promising clue on stack overflow:
Create abstract tree problem from parser
Discussion of CUP and AST:
http://pages.cs.wisc.edu/~fischer/cs536.s08/lectures/Lecture16.4up.pdf
Specifically, this paragraph:
The Symbol returned by the parser is associated with the grammar’s start
symbol and contains the AST for the whole source program
This does not help. How to traverse the tree given Symbol instance, if Symbol class does not have any navigation pointers to its children? In other words, it does not look or behave like a tree node:
package java_cup.runtime;
/**
* Defines the Symbol class, which is used to represent all terminals
* and nonterminals while parsing. The lexer should pass CUP Symbols
* and CUP returns a Symbol.
*
* #version last updated: 7/3/96
* #author Frank Flannery
*/
/* ****************************************************************
Class Symbol
what the parser expects to receive from the lexer.
the token is identified as follows:
sym: the symbol type
parse_state: the parse state.
value: is the lexical value of type Object
left : is the left position in the original input file
right: is the right position in the original input file
******************************************************************/
public class Symbol {
/*******************************
Constructor for l,r values
*******************************/
public Symbol(int id, int l, int r, Object o) {
this(id);
left = l;
right = r;
value = o;
}
/*******************************
Constructor for no l,r values
********************************/
public Symbol(int id, Object o) {
this(id, -1, -1, o);
}
/*****************************
Constructor for no value
***************************/
public Symbol(int id, int l, int r) {
this(id, l, r, null);
}
/***********************************
Constructor for no value or l,r
***********************************/
public Symbol(int sym_num) {
this(sym_num, -1);
left = -1;
right = -1;
value = null;
}
/***********************************
Constructor to give a start state
***********************************/
Symbol(int sym_num, int state)
{
sym = sym_num;
parse_state = state;
}
/*. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .*/
/** The symbol number of the terminal or non terminal being represented */
public int sym;
/*. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .*/
/** The parse state to be recorded on the parse stack with this symbol.
* This field is for the convenience of the parser and shouldn't be
* modified except by the parser.
*/
public int parse_state;
/** This allows us to catch some errors caused by scanners recycling
* symbols. For the use of the parser only. [CSA, 23-Jul-1999] */
boolean used_by_parser = false;
/*******************************
The data passed to parser
*******************************/
public int left, right;
public Object value;
/*****************************
Printing this token out. (Override for pretty-print).
****************************/
public String toString() { return "#"+sym; }
}
Ok, I got it. But unfortunately I cannot publish all my code here as-is. I will try to outline solution anyway, and please ask questions if something is not clear.
JFlex uses its own Symbol class. Look here:
JFlex.jar/java_cup.runtime/Symbol.class
You will see a couple of constructors added:
public Symbol(int id, Symbol left, Symbol right, Object o){
this(id,left.left,right.right,o);
}
public Symbol(int id, Symbol left, Symbol right){
this(id,left.left,right.right);
}
The key here is Object o, which is the value of Symbol.
Define your own class to represent an AST tree node, and another one to represent lexer token. Granted, you can use the same class, but I found it more clear to use different classes to distinguish between the two. Both JFlex and CUP will generate java code, and it is easy to get your tokens and nodes mixed-up.
Then, in your parser.flex, in the lexical rules sections, you want to do something like this for each token:
{float_lit} { return symbol(sym.NUMBER, createToken(yytext(), yycolumn)); }
Do this for all your tokens. Your createToken could be something like this:
%{
private LexerToken createToken(String val, int start) {
LexerToken tk = new LexerToken(val, start);
addToken(tk);
return tk;
}
}%
Now let's move on to parser.cup. Declare all your terminals to be of type LexerToken, and all your non-terminals to be of type Node. You want to read CUP manual, but for quick refresher, a terminal would be anything recognized by the lexer (e.g. numbers, variables, operators), and non-terminal would be parts of your grammar (e.g. expression, factor, term...).
Finally, this all comes together in the grammar definition. Consider the following example:
factor ::= factor:f TIMES:times term:t
{: RESULT = new Node(times.val, f, t, times.start); :}
|
factor:f DIVIDE:div term:t
{: RESULT = new Node(div.val, f, t, div.start); :}
|
term:t
{: RESULT = t; :}
;
Syntax factor:f means you alias the factor's value to be f, and you can refer to it in the following section {: ... :}. Remember, our terminals have values of type LexerToken, and our non-terminals have values that are Nodes.
Your term in expression may have the following definition:
term ::= LPAREN expr:e RPAREN
{: RESULT = new Node(e.val, e.start); :}
|
NUMBER:n
{: RESULT = new Node(n.val, n.start); :}
;
When you successfully generate the parser code, you will see in your parser.java the part where the parent-child relationship between nodes is established:
case 16: // term ::= UFUN LPAREN expr RPAREN
{
Node RESULT =null;
int ufleft = ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-3)).left;
int ufright = ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-3)).right;
LexerToken uf = (LexerToken)((java_cup.runtime.Symbol) CUP$parser$stack.elementAt(CUP$parser$top-3)).value;
int eleft = ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-1)).left;
int eright = ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-1)).right;
Node e = (Node)((java_cup.runtime.Symbol) CUP$parser$stack.elementAt(CUP$parser$top-1)).value;
RESULT = new Node(uf.val, e, null, uf.start);
CUP$parser$result = parser.getSymbolFactory().newSymbol("term",0, ((java_cup.runtime.Symbol)CUP$parser$stack.elementAt(CUP$parser$top-3)), ((java_cup.runtime.Symbol)CUP$parser$stack.peek()), RESULT);
}
return CUP$parser$result;
I am sorry that I cannot publish complete code example, but hopefully this will save someone a few hours of trial and error. Not having complete code is also good because it won't render all those CS homework assignments useless.
As a proof of life, here's a pretty-print of my sample AST.
Input expression:
T21 + 1A / log(max(F1004036, min(a1, a2))) * MIN(1B, 434) -LOG(xyz) - -3.5+10 -.1 + .3 * (1)
Resulting AST:
|--[+]
|--[-]
| |--[+]
| | |--[-]
| | | |--[-]
| | | | |--[+]
| | | | | |--[T21]
| | | | | |--[*]
| | | | | |--[/]
| | | | | | |--[1A]
| | | | | | |--[LOG]
| | | | | | |--[MAX]
| | | | | | |--[F1004036]
| | | | | | |--[MIN]
| | | | | | |--[A1]
| | | | | | |--[A2]
| | | | | |--[MIN]
| | | | | |--[1B]
| | | | | |--[434]
| | | | |--[LOG]
| | | | |--[XYZ]
| | | |--[-]
| | | |--[3.5]
| | |--[10]
| |--[.1]
|--[*]
|--[.3]
|--[1]

Resources