Unable to parse a specific expression using the ANTLR4 parser

I have just started using the ANTLR4 parser and am a beginner.
I want to parse strings of the following format (input):
"mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff"
"mem_bank[0].su_dccm_cgc::rucklhdr::enable"
The grammar is written like this:
file : boolean EOF;
// ------------------------------------------ BOOLEAN
boolean
: NOT boolean
| logic relop logic
| numeric relop numeric
| logic EQ logic
| numeric EQ numeric
| boolean EQ boolean
| logic NEQ logic
| numeric NEQ numeric
| boolean NEQ boolean
| boolean booleanop=AND boolean
| boolean booleanop=OR boolean
| booleanAtom
| logic
| numeric
| LPAREN boolean RPAREN
| boolean bitSelect
;
booleanAtom
: booleanConstant
| booleanVariable
;
booleanConstant
: BOOLEAN
;
booleanVariable
: '<' variable ',bool>'
;
variable
: VARIABLE
;
VARIABLE
: ('::')? (VALID_ID_START) (VALID_ID_CHAR)*
;
fragment VALID_ID_START
: 'P' (('a' .. 'z')| ('A' .. 'Z') | ('_'))
| (('a' .. 'z')| ('A' .. 'O')| ('Q' .. 'W')| ('Y' .. 'Z') | ('_')) ;
fragment VALID_ID_CHAR
: ('a' .. 'z')
| ('A' .. 'Z')
| ('0' .. '9')
| ('.')
| ('_')
| ('::')
;
Using the above grammar, I ran into the following errors:
line 1:24 no viable alternative at input '<mem_bank<'
line 1:27 token recognition error at: '.'
[ERROR] 17:13:32 - File: /home/harm/src/antlr4/propositionParser/handler/src/PropositionParserHandler.cc at line 528 Message:
Antlr parse error: < In formula <mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff,bool>
I then modified the grammar as follows. Only the modified part is shown; the rest is the same as above.
fragment VALID_ID_CHAR
: ('a' .. 'z')
| ('A' .. 'Z')
| ('0' .. '9')
| ('.')
| ('_')
| ('::')
| ('[')
| (']')
| ('<')
| ('>')
;
Now I am able to parse expressions like "mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff" (with angle brackets) successfully.
But I still get an error for the expression with square brackets:
"mem_bank[0].su_dccm_cgc::rucklhdr::enable"
error:
line 1:0 mismatched input 'mem_bank[0].su_dccm_cgc::rvclkhdr::enable' expecting {'[', '(', NUMERIC, VERILOG_BINARY, GCC_BINARY, HEX, BOOLEAN, '<', '~', '!'}
[ERROR] 17:20:12 - File: antlr4/propositionParser/handler/src/PropositionParserHandler.cc at line 528
Message: Antlr parse error: mem_bank[0].su_dccm_cgc::rvclkhdr::enable
In formula: mem_bank[0].su_dccm_cgc::rvclkhdr::enable*
To troubleshoot further, I tried to inspect the parse tree for both expressions.
I am able to get the parse tree of the expression with angle brackets (mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff).
Parse tree in LISP format:
(file (boolean (booleanAtom (booleanVariable < (variable mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff) ,bool>))) )
However, I am unable to do the same for the other expression with the square brackets.
What could be going wrong here? Any input would be appreciated.
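One way to check what the VARIABLE rule consumes before and after the change is to mimic it with a regular expression. This is only a rough sketch: the two patterns are hand-written approximations of VARIABLE, VALID_ID_START is simplified to "letter or underscore" (an assumption, the real rule is stricter), and the names ORIGINAL, EXTENDED and variable_prefix are made up for illustration.

```python
import re

# Simplified stand-ins for the VARIABLE lexer rule. VALID_ID_START is reduced
# to "letter or underscore" here, which is an assumption made for brevity.
ORIGINAL = re.compile(r"(?:::)?[A-Za-z_](?:[A-Za-z0-9._]|::)*")
EXTENDED = re.compile(r"(?:::)?[A-Za-z_](?:[A-Za-z0-9._\[\]<>]|::)*")


def variable_prefix(pattern, text):
    """Return the leading part of text that the VARIABLE stand-in consumes."""
    match = pattern.match(text)
    return match.group(0) if match else ""


angle = "mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff"
square = "mem_bank[0].su_dccm_cgc::rucklhdr::enable"

# With the original character set the token stops at the first bracket.
print(variable_prefix(ORIGINAL, angle))
# With the extended character set both inputs are consumed as one token.
print(variable_prefix(EXTENDED, angle))
print(variable_prefix(EXTENDED, square))
```

If this approximation holds, the modified lexer treats both inputs as a single VARIABLE token, which matches the second error message (the whole string is reported as one mismatched token). That would point at the parser side rather than the lexer: the logged formula for the failing case appears without the '<' ... ',bool>' wrapper that booleanVariable requires, so that may be worth checking.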

Related

How to parse decimal values correctly?

I'm using ANTLR with the Presto grammar to parse SQL queries.
I'm having an issue with parsing a decimal number. I have the following definitions:
number
: decimalValue #decimalLiteral
| DOUBLE_VALUE #doubleLiteral
| INTEGER_VALUE #integerLiteral
;
decimalValue
: INTEGER_VALUE '.' INTEGER_VALUE?
| '.' INTEGER_VALUE
;
DOUBLE_VALUE
: DIGIT+ ('.' DIGIT*)? EXPONENT
| '.' DIGIT+ EXPONENT
;
IDENTIFIER
// : (LETTER | '_' | DIGIT) (LETTER | DIGIT | '_' | '#' | ':' | '.')*
: (LETTER | DIGIT | '_' | '#' | ':' | '-' )+
;
This works ok for most cases. However, it has an issue with parsing decimal values.
select x/(0.3-0.2)
from table1
It fails to parse. The reason is that the lexer treats "3-0" as a single IDENTIFIER token, because IDENTIFIER accepts both digits and '-'.
When I change the query to be something like:
select x/(0.3 - 0.2)
from table1
it works.
Any ideas how I can handle the original query (without, of course, causing a regression)?
Thanks,
Nir.
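The behaviour can be reproduced outside ANTLR with a tiny longest-match tokenizer. This is only a sketch under assumptions: the rule list is a cut-down, hypothetical version of the grammar (LETTER is taken to be A-Z and a-z, and MINUS and DOT stand in for the '-' and '.' literals), and ties are broken in favour of the rule listed first, which is how an ANTLR lexer chooses between equally long matches.

```python
import re

# Hypothetical, cut-down rule list. Order matters only to break ties between
# matches of equal length, mirroring how an ANTLR lexer picks a rule.
RULES = [
    ("MINUS", re.compile(r"-")),
    ("DOT", re.compile(r"\.")),
    ("INTEGER_VALUE", re.compile(r"[0-9]+")),
    ("IDENTIFIER", re.compile(r"[A-Za-z0-9_#:\-]+")),
    ("WS", re.compile(r"\s+")),
]


def tokenize(text):
    """Longest match wins; on a tie the rule listed first wins. WS is dropped."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            if match and len(match.group(0)) > len(best_text):
                best_name, best_text = name, match.group(0)
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":
            tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens


# Without spaces, "3-0" is swallowed by IDENTIFIER because it is the longest match.
print(tokenize("0.3-0.2"))
# With spaces, each piece is matched on its own and MINUS is recognised.
print(tokenize("0.3 - 0.2"))
```

In this sketch the only difference between the two inputs is whether IDENTIFIER gets the chance to match across the '-', so any fix would need to stop IDENTIFIER from competing with the number and operator tokens at that position.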

antlr4 doesn't parse obvious tree

I want to create a Grammar that will parse the input statement
myvar is 43+23
and
otherVar of myvar is "hallo"
But the parser doesn't recognize anything here.
(Sorry, I am not allowed to post images. Imagine a statement node with the tokens
[myvar] [is] [43] [+] [23] as children, all marked red. The same goes for the other statement.)
I'm getting error messages that confuse me:
line 2:7 no viable alternative at input 'myvaris'
line 3:19 no viable alternative at input 'otherVarofmyvaris'
Where have the spaces gone? I assume it's something with my lexer, but I can't see what the problem is. Just in case, here is the grammar for these statements:
statement
: envCall #call_Environment_Function
| identifier IS expression # assignment_statement // This one should be used
| loopHeader statement_block # loop_statement
etc...
expression
: '(' expression ')' #bracket_Expression
| mathExpression #math_Expression
| identifier #identifier_Expression // this one should be used
| objectExpression #object_Expression
etc ...
identifier //both of these should be used
: selector=IDENTIFIER OF object=expression #ofIdentifier
| selector=IDENTIFIER #idLocal
;
Here are all the lexer rules I have so far:
IdentifierNamespace: IDENTIFIER '.' IDENTIFIER;
FromIn: FROM | IN;
OPENBLOCK: NEWLINE? '{';
CLOSEBLOCK: '}' NEWLINE;
NEWLINE: ['\n''\t']+;
NUMBER: INT | FLOAT;
INT: [0-9]+;
FLOAT: [0-9]* '.' [0-9]+;
IsAre: IS | ARE;
OF: 'of';
IS: 'is';
ARE: 'are';
DO: 'do';
FROM: 'from';
IN: 'in';
IDENTIFIER : [a-zA-Z]+ ;
//WHITESPACE: [ \t]+ -> skip;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
STRING : '"' (ESC | ~["\\])* '"' ;
END: 'END'[.]* EOF;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;
OK, found it. There was a compOP rule defined in the parser, and it was messing up the tree generation.
compOP: '<'
| '>'
| '=' // the programmers '=='
| '>='
| '<='
| '<>'
| '!='
| 'in'
| 'not' 'in'
| 'is' <- removed this one and it works now
;
So: never define the same keyword in both the parser and the lexer, I guess.

Matching of tokens with Antlr4

I am an ANTLR4 newbie and have problems with a relatively simple grammar. The grammar is given at the end. (This is a fragment from a grammar for parsing descriptions of biological sequence variants.)
I am trying to parse the string "p.A3L" in the following unit test.
@Test
public void testProteinSubtitutionWithoutRef() {
    ANTLRInputStream inputStream = new ANTLRInputStream("p.A3L");
    HGVSLexer l = new HGVSLexer(inputStream);
    HGVSParser p = new HGVSParser(new CommonTokenStream(l));
    p.setTrace(true);
    // Fail the test on the first syntax error instead of only logging it.
    p.addErrorListener(new BaseErrorListener() {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line,
                int charPositionInLine, String msg, RecognitionException e) {
            throw new IllegalStateException("failed to parse at line " + line + " due to " + msg, e);
        }
    });
    p.hgvs();
}
The test fails with the message "line 1:2 mismatched input 'A3L' expecting AA". I assume that this is related to lexing, i.e. splitting "A3L" into the three tokens A, 3, and L, such that the parser can then generate the corresponding syntax subtree containing the three terminals from it.
What is going wrong here and where can I learn how to fix this?
The grammar
grammar HGVS;
hgvs: protein_var
;
// Basic lexemes
AA: AA1
| AA3
| 'X';
AA1: 'A'
| 'R'
| 'N'
| 'D'
| 'C'
| 'Q'
| 'E'
| 'G'
| 'H'
| 'I'
| 'L'
| 'K'
| 'M'
| 'F'
| 'P'
| 'S'
| 'T'
| 'W'
| 'Y'
| 'V';
AA3: 'Ala'
| 'Arg'
| 'Asn'
| 'Asp'
| 'Cys'
| 'Gln'
| 'Glu'
| 'Gly'
| 'His'
| 'Ile'
| 'Leu'
| 'Lys'
| 'Met'
| 'Phe'
| 'Pro'
| 'Ser'
| 'Thr'
| 'Trp'
| 'Tyr'
| 'Val';
NUMBER: [0-9]+;
NAME: [a-zA-Z0-9_]+;
// Top-level Rule
/** Variant in a protein. */
protein_var: 'p.' AA NUMBER AA
;
There are two problems:
1. Define the rule for protein_var ahead of the lexer rules. (The current order should work too, but it is hard to read because the other parser rule comes before the lexer rules.)
2. Remove the rule for NAME. A3L is not lexed as AA NUMBER AA (as you probably expected) but as a single NAME token, because ANTLR always prefers the lexer rule that matches the longest input.
The resulting grammar should look like:
grammar HGVS;
hgvs
: protein_var
;
protein_var
: 'p.' AA NUMBER AA
;
AA: ...;
AA3: ...;
AA1: ...;
NUMBER: [0-9]+;
If you need NAME for other purposes, you will have to disambiguate it in the lexer (by a prefix that NAMEs and AA do not have in common or by using lexer modes).
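The effect of the NAME rule can be illustrated with a small longest-match tokenizer. This is a sketch, not ANTLR itself: make_rules and tokenize are hypothetical helpers, the regular expressions are hand-written stand-ins for the lexer rules above, and ties between equally long matches go to the rule listed first.

```python
import re

AA1 = "[ARNDCQEGHILKMFPSTWYV]"
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
       "Phe|Pro|Ser|Thr|Trp|Tyr|Val")


def make_rules(with_name):
    """Build the stand-in lexer rules, optionally including NAME."""
    rules = [
        ("'p.'", re.compile(r"p\.")),
        # Three-letter codes come first so the regex tries the longer form first.
        ("AA", re.compile(f"{AA3}|{AA1}|X")),
        ("NUMBER", re.compile(r"[0-9]+")),
    ]
    if with_name:
        rules.append(("NAME", re.compile(r"[a-zA-Z0-9_]+")))
    return rules


def tokenize(text, rules):
    """Longest match wins; on a tie the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_text = None, ""
        for name, pattern in rules:
            match = pattern.match(text, pos)
            if match and len(match.group(0)) > len(best_text):
                best_name, best_text = name, match.group(0)
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens


# With NAME present, "A3L" becomes one NAME token, as in the error message.
print(tokenize("p.A3L", make_rules(with_name=True)))
# Without NAME, the lexer produces AA NUMBER AA as the parser rule expects.
print(tokenize("p.A3L", make_rules(with_name=False)))
```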

ANTLR parse assignments

I want to parse some assignments, where I only care about the assignment as a whole, not about what's inside it. An assignment is indicated by ':='. (EDIT: other things may come before and after the assignments.)
Some examples:
a := TRUE & FALSE;
c := a ? 3 : 5;
b := case
a : 1;
!a : 0;
esac;
Currently I distinguish between assignments containing a 'case' and other assignments. For simple assignments I tried something like ~('case' | 'esac' | ';'), but then ANTLR complained about unmatched tokens (like '=').
assignment :
NAME ':='! expression ;
expression :
( simple_expression | case_expression) ;
simple_expression :
((OPERATOR | NAME) & ~('case' | 'esac'))+ ';'! ;
case_expression :
'case' .+ 'esac' ';'! ;
I tried replacing it with the following, because the Eclipse interpreter did not seem to like ((OPERATOR | NAME) & ~('case' | 'esac'))+ ';'! ; because of the 'and':
(~(OPERATOR | ~NAME | ('case' | 'esac')) |
~(~OPERATOR | NAME | ('case' | 'esac')) |
~(~OPERATOR | ~NAME | ('case' | 'esac'))) ';'!
But this does not work. I get
"error(139): /AntlrTutorial/src/foo/NusmvInput.g:78:5: set complement is empty |---> ~(~OPERATOR | ~NAME | ('case' | 'esac'))) EOC! ;"
How can I parse it?
There are a couple of things going wrong here:
- you're using & in your grammar, while it should have quotes around it: '&';
- unless you know exactly what you're doing, don't use ~ and . (especially not .+) inside parser rules: use them in lexer rules only;
- create lexer rules instead of defining 'case' and 'esac' in your parser rules. It's safe to use literal tokens in your parser rules if no other lexer rule can potentially match them, but 'case' and 'esac' look a lot like NAME, and they could end up in your AST, in which case it's better to define them explicitly in the lexer.
Here's a quick demo:
grammar T;
options {
output=AST;
}
tokens {
ROOT;
CASES;
CASE;
}
parse
: (assignment SCOL)* EOF -> ^(ROOT assignment*)
;
assignment
: NAME ASSIGN^ expression
;
expression
: ternary_expression
;
ternary_expression
: or_expression (QMARK^ ternary_expression COL! ternary_expression)?
;
or_expression
: unary_expression ((AND | OR)^ unary_expression)*
;
unary_expression
: NOT^ atom
| atom
;
atom
: TRUE
| FALSE
| NUMBER
| NAME
| CASE single_case+ ESAC -> ^(CASES single_case+)
| '(' expression ')' -> expression
;
single_case
: expression COL expression SCOL -> ^(CASE expression expression)
;
TRUE : 'TRUE';
FALSE : 'FALSE';
CASE : 'case';
ESAC : 'esac';
ASSIGN : ':=';
AND : '&';
OR : '|';
NOT : '!';
QMARK : '?';
COL : ':';
SCOL : ';';
NAME : ('a'..'z' | 'A'..'Z')+;
NUMBER : ('0'..'9')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
which will parse your input:
a := TRUE & FALSE;
c := a ? 3 : 5;
b := case
a : 1;
!a : 0;
esac;

Bison: Conflicts: 1 shift/reduce error

I'm trying to build a parser with bison and have narrowed all my errors down to one difficult one.
Here's the debug output of bison with the state where the error lies:
state 120
12 statement_list: statement_list . SEMICOLON statement
24 if_statement: IF conditional THEN statement_lists ELSE statement_list .
SEMICOLON shift, and go to state 50
SEMICOLON [reduce using rule 24 (if_statement)]
$default reduce using rule 24 (if_statement)
Here are the translation rules in the parser.y source
%%
program : ID COLON block ENDP ID POINT
;
block : CODE statement_list
| DECLARATIONS declaration_block CODE statement_list
;
declaration_block : id_list OF TYPE type SEMICOLON
| declaration_block id_list OF TYPE type SEMICOLON
;
id_list : ID
| ID COMMA id_list
;
type : CHARACTER
| INTEGER
| REAL
;
statement_list : statement
| statement_list SEMICOLON statement
;
statement_lists : statement
| statement_list SEMICOLON statement
;
statement : assignment_statement
| if_statement
| do_statement
| while_statement
| for_statement
| write_statement
| read_statement
;
assignment_statement : expression OUTPUTTO ID
;
if_statement : IF conditional THEN statement_lists ENDIF
| IF conditional THEN statement_lists ELSE statement_list
;
do_statement : DO statement_list WHILE conditional ENDDO
;
while_statement : WHILE conditional DO statement_list ENDWHILE
;
for_statement : FOR ID IS expression BY expressions TO expression DO statement_list ENDFOR
;
write_statement : WRITE BRA output_list KET
| NEWLINE
;
read_statement : READ BRA ID KET
;
output_list : value
| value COMMA output_list
;
condition : expression comparator expression
;
conditional : condition
| NOT conditional
| condition AND conditional
| condition OR conditional
;
comparator : ASSIGNMENT
| BETWEEN
| LT
| GT
| LESSEQUAL
| GREATEREQUAL
;
expression : term
| term PLUS expression
| term MINUS expression
;
expressions : term
| term PLUS expressions
| term MINUS expressions
;
term : value
| value MULTIPLY term
| value DIVIDE term
;
value : ID
| constant
| BRA expression KET
;
constant : number_constant
| CHARCONST
;
number_constant : NUMBER
| MINUS NUMBER
| NUMBER POINT NUMBER
| MINUS NUMBER POINT NUMBER
;
%%
When I remove the if_statement rule there are no errors, so I've narrowed it down considerably, but still can't solve the error.
Thanks for any help.
Consider this statement: if condition then s1 else s2; s3
There are two interpretations. In the first one, s3 follows the whole if statement:
if condition then
    s1
else
    s2;
s3
In the other one, s3 is part of the else branch:
if condition then
    s1
else
    s2;
    s3
In the first one, the statement list is composed of an if statement and s3, while in the other the statement list is composed of only one if statement. That's where the ambiguity comes from. Bison prefers shift over reduce when a shift/reduce conflict exists, so in the above case the parser will choose to shift the SEMICOLON, which makes s3 part of the else branch.
Since you have an ENDIF in your if-then statement, consider introducing an ENDIF in your if-then-else statement as well; that resolves the conflict.
I think you are missing ENDIF in the IF-THEN-ELSE-ENDIF rule.
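To make the two readings concrete, here is a small hand-written recursive-descent sketch. It is not Bison and not the grammar above: it only models a simplified statement_list (statements separated by ';') and an if/then/else without ENDIF, and the flag shift_in_else is a made-up switch that mimics either the shift choice or the reduce choice after the else branch.

```python
def parse_list(toks, i, shift_in_else):
    """statement_list: statement (';' statement)*"""
    stmt, i = parse_stmt(toks, i, shift_in_else)
    items = [stmt]
    while i < len(toks) and toks[i] == ";":
        stmt, i = parse_stmt(toks, i + 1, shift_in_else)
        items.append(stmt)
    return items, i


def parse_stmt(toks, i, shift_in_else):
    """statement: 'if' cond 'then' list 'else' list | simple statement"""
    if toks[i] == "if":
        cond = toks[i + 1]
        assert toks[i + 2] == "then"
        then_part, i = parse_list(toks, i + 3, shift_in_else)
        assert toks[i] == "else"
        if shift_in_else:
            # Shift: keep absorbing '; statement' into the else branch.
            else_part, i = parse_list(toks, i + 1, shift_in_else)
        else:
            # Reduce: close the if statement after one statement.
            stmt, i = parse_stmt(toks, i + 1, shift_in_else)
            else_part = [stmt]
        return ("if", cond, then_part, else_part), i
    return toks[i], i + 1


tokens = "if c then s1 else s2 ; s3".split()

# Shift (Bison's default): s3 ends up inside the else branch.
print(parse_list(tokens, 0, shift_in_else=True)[0])
# Reduce: s3 becomes a second statement after the if statement.
print(parse_list(tokens, 0, shift_in_else=False)[0])
```

Both calls accept the same token sequence but build different trees, which is the ambiguity the conflict report is pointing at. A closing ENDIF token would give the parser an unambiguous end for the else branch.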

Resources