My task is to write a grammar for a custom query language in which users can write some basic queries.
My grammar so far:
grammar EAQL;
prog: cond;
cond: cond logical_operator cond | elexpr comparison_operator VALUE;
elexpr: ELSTEREOTYPE '.' eattribute;
conexpr: CSTEREOTYPE '.' cattribute;
eattribute: 'Name' | 'Path' | 'GUID' | conexpr;
cattribute: 'Name' | 'GUID' | elexpr;
VALUE: QUOTATION ( ~([QUOTATION]) | ~('\n'))+ QUOTATION;
ELSTEREOTYPE: 'EG_ApplicationComponent' | 'EG_ApplicationFunction';
CSTEREOTYPE: 'EG_Flow';
SPACE: ' ';
QUOTATION: '"';
EOL: '\n';
WS : (' ' | '\t')+ -> channel(HIDDEN);
AND: 'AND';
OR: 'OR';
logical_operator: AND | OR;
EQUALS: '=';
GREATER_THAN: '>';
SMALLER_THAN: '<';
comparison_operator: GREATER_THAN | SMALLER_THAN | EQUALS;
When I try to parse this string:
EG_ApplicationComponent.Name= "name1" AND EG_ApplicationFunction.Name="name2"
ANTLR creates the following children in the tree:
'EG_ApplicationComponent'
'.'
'Name'
'='
'"name1" AND EG_ApplicationFunction.Name= "name2"'
I am an absolute beginner at writing parsers, and I do not understand why VALUE matches greedily until the end of the string when I specified that matching should end when QUOTATION is found. I expected it to match "name1" as VALUE in the first branch of the tree and then create another branch in which EG_ApplicationFunction.Name= "name2" is parsed the same way.
This would be my expected result:
'EG_ApplicationComponent'
'.'
'Name'
'='
'"name1"'
AND
EG_ApplicationFunction
'.'
'Name'
'='
'"name2"'
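Inside a character set, QUOTATION is not a reference to the QUOTATION rule: ~[QUOTATION] matches any character other than Q, U, O, T, A, I and N. What you need is ~["]. The second alternative, ~('\n'), is a problem as well: it matches every character except a newline, including the quote itself, so the loop runs straight past the closing quote.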
Your VALUE rule could look like this:
VALUE
: QUOTATION ~["\r\n]* QUOTATION
;
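To see the difference outside of ANTLR, here is a minimal sketch that uses Python regular expressions as a stand-in for the two lexer rules. The regexes are only analogues of the ANTLR rules (an assumption made for illustration, not how ANTLR is implemented), but the character-set behaviour is the same:

```python
import re

text = 'EG_ApplicationComponent.Name= "name1" AND EG_ApplicationFunction.Name="name2"'

# Analogue of the original rule: QUOTATION ( ~[QUOTATION] | ~'\n' )+ QUOTATION.
# [^QUOTATION] excludes the letters Q, U, O, T, A, I, N (not the quote),
# and [^\n] accepts every character except a newline, quotes included.
original = re.compile(r'"(?:[^QUOTATION]|[^\n])+"')

# Analogue of the fixed rule: QUOTATION ~["\r\n]* QUOTATION.
fixed = re.compile(r'"[^"\r\n]*"')

original_match = original.search(text).group(0)
fixed_matches = fixed.findall(text)

print(original_match)  # runs from the first quote to the last one
print(fixed_matches)   # one match per quoted value
```

With the fixed pattern, each quoted value becomes its own match, which is what the expected parse tree in the question needs.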
I have just started using the ANTLR4 parser (I am a beginner).
I want to parse strings of the following format (input):
"mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff"
"mem_bank[0].su_dccm_cgc::rucklhdr::enable"
The Grammar is written like this:
file : boolean EOF;
// ------------------------------------------ BOOLEAN
boolean
: NOT boolean
| logic relop logic
| numeric relop numeric
| logic EQ logic
| numeric EQ numeric
| boolean EQ boolean
| logic NEQ logic
| numeric NEQ numeric
| boolean NEQ boolean
| boolean booleanop=AND boolean
| boolean booleanop=OR boolean
| booleanAtom
| logic
| numeric
| LPAREN boolean RPAREN
| boolean bitSelect
;
booleanAtom
: booleanConstant
| booleanVariable
;
booleanConstant
: BOOLEAN
;
booleanVariable
: '<' variable ',bool>'
;
variable
: VARIABLE
;
VARIABLE
: ('::')? (VALID_ID_START) (VALID_ID_CHAR)*
;
fragment VALID_ID_START
: 'P' (('a' .. 'z')| ('A' .. 'Z') | ('_'))
| (('a' .. 'z')| ('A' .. 'O')| ('Q' .. 'W')| ('Y' .. 'Z') | ('_')) ;
fragment VALID_ID_CHAR
: ('a' .. 'z')
| ('A' .. 'Z')
| ('0' .. '9')
| ('.')
| ('_')
| ('::')
;
Using the above grammar, I ran into the following issues:
Error:
line 1:24 no viable alternative at input '<mem_bank<'
line 1:27 token recognition error at: '.'
[ERROR] 17:13:32 - File: /home/harm/src/antlr4/propositionParser/handler/src/PropositionParserHandler.cc at line 528 Message:
Antlr parse error: < In formula <mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff,bool>
I then modified the grammar as shown below. Only the modified part is shown; the rest is the same as above.
fragment VALID_ID_CHAR
: ('a' .. 'z')
| ('A' .. 'Z')
| ('0' .. '9')
| ('.')
| ('_')
| ('::')
| ('[')
| (']')
| ('<')
| ('>')
;
Now I am able to parse expressions like "mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff" (with angle brackets) successfully.
But I still get an error when handling square brackets in an expression:
"mem_bank[0].su_dccm_cgc::rucklhdr::enable"
error:
line 1:0 mismatched input 'mem_bank[0].su_dccm_cgc::rvclkhdr::enable' expecting {'[', '(', NUMERIC, VERILOG_BINARY, GCC_BINARY, HEX, BOOLEAN, '<', '~', '!'}
[ERROR] 17:20:12 - File: antlr4/propositionParser/handler/src/PropositionParserHandler.cc at line 528
Message: Antlr parse error: mem_bank[0].su_dccm_cgc::rvclkhdr::enable
In formula: mem_bank[0].su_dccm_cgc::rvclkhdr::enable*
To troubleshoot further, I tried to access the parse tree for both expressions.
I am able to access the parse tree of the expression with angle brackets (mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff).
Parse tree in LISP format:
(file (boolean (booleanAtom (booleanVariable < (variable mem_bank<0>.su_dccm_cgc::rucklhdr::en_ff) ,bool>))) )
However, I am unable to do the same for the other expression, the one with square brackets.
What could be going wrong here? I would appreciate some input.
I'm writing a parser for Chrome DEPS files. How do I match any one of the alternatives in the definition of the rightexpr rule? My grammar is
the following:
grammar Depsgrammar;
prog: expr+ EOF;
expr: varline
;
varline:
ID EQ rightexpr
;
rightexpr :
basicvalue | bentukonejsonval| bentuktwojsonval
;
bentukonejsonval :
'[' string? (COMMA string )* COMMA? ']'
;
bentuktwojsonval :
'{' singledictexpr? (COMMA singledictexpr )* COMMA? '}'
;
singledictexpr :
string ':' basicvalue
;
basicvalue :
True
| False
| string
| NUM
| varfunc
;
varfunc :
Var '(' string ')'
;
string :
SIMPLESTRINGEXPRDOUBLEQUOTE
| SIMPLESTRINGEXPRSINGLEQUOTE
;
Var : 'Var' ;
COMMA : ',' ;
NUM : [0-9]+;
ID : [a-zA-Z0-9_]+;
True : [tT] [Rr] [Uu] [Ee];
False: [Ff] [Aa] [Ll] [Ss] [Ee];
fragment SIMPLESTRINGEXPRDOUBLEQUOTEBASE : ~ ( '\n' | '\r' | '"' )* ;
SIMPLESTRINGEXPRDOUBLEQUOTE: '"' SIMPLESTRINGEXPRDOUBLEQUOTEBASE '"' ;
fragment SIMPLESTRINGEXPRSINGLEQUOTEBASE : ~ ( '\n' | '\r' | '\'' )* ;
SIMPLESTRINGEXPRSINGLEQUOTE : '\'' SIMPLESTRINGEXPRSINGLEQUOTEBASE '\'' ;
EQ : '=';
COMMENT:
'#' ~ ( '\n' | '\r' )* '\n' -> skip ;
WS : [ \n\t\r]+ -> skip ;
I want the user to be able to enter this input:
#adas21 #FS;SFD33
_as= Var('das') # somelongth comment
_as_0= FALSE # somelongth comment
_as_0= 'as!' # somelongth comment
gclient_gn_args = [
#ad as!~;
'checkout_libaom',
'checkout_nacl',
'"{cros_board}" == "amd64-generic"',
'checkout_oculus_sdk',
]
vars = {
'checkout_libaom':1,
'checkout_nacl': "SS",
'checkout_oculus_sdk': FalSe,
'checkout_oculus_sdk':'',
}
s=[
]
Whenever I enter simple input in grun, such as
sa=true
it always gives me line 1:3 mismatched input 'true' expecting blah..(rightexpr def). I'm missing something basic about how ANTLR4 decides between alternatives. Could you please explain it to me?
Thanks
Whenever you have an error where the list of expected tokens seemingly includes the unexpected token, it is a good idea to list the generated tokens. You can do that by passing the -tokens option to grun. If you do this for your input, you'll see that true is interpreted as an ID token, not a True token.
The reason for that is that when multiple lexer rules would match on the current input and produce a match of the same size, the one that's defined earlier in the grammar is chosen. So because ID is defined before True, it takes precedence. Generally all keywords should be defined before the ID rule to prevent exactly this issue.
In other words, moving the True and False rules before ID will solve your issue.
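The tie-breaking rule can be illustrated with a toy lexer. This is only a sketch of the "longest match wins, earliest rule wins ties" behaviour described above, not ANTLR's actual implementation; the `tokenize` helper and the rule lists are hypothetical names introduced for the example:

```python
import re

def tokenize(text, rules):
    """Toy lexer: the longest match wins; on a tie, the rule listed first wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in rules:
            match = re.compile(pattern).match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"token recognition error at offset {pos}")
        if best_name != "WS":  # whitespace is skipped, as in the grammar
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

ID = ("ID", r"[a-zA-Z0-9_]+")
TRUE = ("True", r"[tT][rR][uU][eE]")
OTHERS = [("EQ", r"="), ("WS", r"[ \n\t\r]+")]

# Same order as the grammar in the question: ID is defined before True.
id_first = tokenize("sa=true", [ID, TRUE] + OTHERS)
# Keyword rules moved before ID, as suggested.
keyword_first = tokenize("sa=true", [TRUE, ID] + OTHERS)

print(id_first)
print(keyword_first)
```

In the first ordering `true` comes out as an ID token, which is why the parser reports a mismatch; in the second it comes out as a True token. A longer identifier such as `trueish` still becomes an ID in both orderings, because the longer match always wins before order is considered.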
I developed this small grammar, and I have an issue with it:
grammar test;
term : above_term | below_term;
above_term :
<assoc=right> 'forall' binders ',' forall_term
| <assoc=right> above_term '->' above_term
| <assoc=right> above_term '->' below_term
| <assoc=right> below_term '->' above_term
| <assoc=right> below_term '->' below_term
;
below_term :
<assoc = right> below_term arg (arg)*
| '#' qualid (term)*
| below_term '%' IDENT
| qualid
| sort
| '(' term ')'
;
forall_term : term;
arg : term| '(' IDENT ':=' term ')';
binders : binder (binder)*;
binder : name |<assoc=right>name (name)* ':' term | '(' name (name)* ':' term ')' |<assoc=right> name (':' term)? ':=' term;
name : IDENT | '_';
qualid : IDENT | qualid ACCESS_IDENT;
sort : 'Prop' | 'Set' | 'Type' ;
/**************************************
* LEXER RULES
**************************************/
/*
* STRINGS
*/
STRING : '"' (~["])* '"';
/*
* IDENTIFIER AND ACCESS IDENTIFIER
*/
ACCESS_IDENT : '.' IDENT;
IDENT : FIRST_LETTER (SUBSEQUENT_LETTER)*;
fragment FIRST_LETTER : [a-z] | [A-Z] | '_' | UNICODE_LETTER;
fragment SUBSEQUENT_LETTER : [a-z] | [A-Z] | DIGIT | '_' | '"' | UNICODE_LETTER | UNICODE_ID_PART;
fragment UNICODE_LETTER : '\\' 'u' HEX HEX HEX HEX;
fragment UNICODE_ID_PART : '\\' 'u' HEX HEX HEX HEX;
fragment HEX : [0-9a-fA-F];
/*
* NATURAL NUMBERS AND INTEGERS
*/
NUM : DIGIT (DIGIT)*;
INTEGER : ('-')? NUM;
fragment DIGIT : [0-9];
WS : [ \n\t\r] -> skip;
You can copy this grammar and test it with ANTLR if you want; it will work. Now for my question:
Let's consider an expression like this: a b -> c d -> forall n:nat, c.
According to my grammar, the "->" rule (right after the forall rule) has the highest precedence.
For this reason I want the term to be parsed so that both "->" rules are at the top of the parse tree, like this (please note that this is an abstract view; I know that there are many above and below terms between the leaves):
Sadly, it is not parsed this way, but this way:
How come the parser doesn't put both "->" rules at the top of the parse tree? Is this a precedence issue?
Changing term to below_term in the arg rule fixes the problem: arg : below_term| '(' IDENT ':=' term ')'; .
Let's take this expression as an example: a b c.
Once the parser sees that the pattern a b matches the rule below_term arg (arg)*, it takes a as a below_term and tries to match b with the arg rule. However, since arg now points to the below_term rule, no above_term is allowed there unless it is parenthesized. This solved my problem.
The term a b -> a b c -> forall n:nat, n now gets parsed this way:
I want to create a grammar that will parse the input statement
myvar is 43+23
and
otherVar of myvar is "hallo"
But the parser doesn't recognize anything here.
(Sorry, I am not allowed to post images. Imagine a statement node with the tokens
[myvar] [is] [43] [+] [23] as children, all marked red. The same goes for the other statement.)
I'm getting error messages that confuse me:
line 2:7 no viable alternative at input 'myvaris'
line 3:19 no viable alternative at input 'otherVarofmyvaris'
Where have the spaces gone? I assume it has something to do with my lexer, but I can't see what the problem is. Just in case, here is the grammar for these statements:
statement
: envCall #call_Environment_Function
| identifier IS expression # assignment_statement // This one should be used
| loopHeader statement_block # loop_statement
etc...
expression
: '(' expression ')' #bracket_Expression
| mathExpression #math_Expression
| identifier #identifier_Expression // this one should be used
| objectExpression #object_Expression
etc ...
identifier //both of these should be used
: selector=IDENTIFIER OF object=expression #ofIdentifier
| selector=IDENTIFIER #idLocal
;
Here are all the lexer rules I have so far:
IdentifierNamespace: IDENTIFIER '.' IDENTIFIER;
FromIn: FROM | IN;
OPENBLOCK: NEWLINE? '{';
CLOSEBLOCK: '}' NEWLINE;
NEWLINE: ['\n''\t']+;
NUMBER: INT | FLOAT;
INT: [0-9]+;
FLOAT: [0-9]* '.' [0-9]+;
IsAre: IS | ARE;
OF: 'of';
IS: 'is';
ARE: 'are';
DO: 'do';
FROM: 'from';
IN: 'in';
IDENTIFIER : [a-zA-Z]+ ;
//WHITESPACE: [ \t]+ -> skip;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
STRING : '"' (ESC | ~["\\])* '"' ;
END: 'END'[.]* EOF;
WHITESPACE : ( '\t' | ' ' )+ -> skip ;
OK, found it. There was a compOP rule defined for the parser, and it was messing up the tree generation.
compOP: '<'
| '>'
| '=' // the programmers '=='
| '>='
| '<='
| '<>'
| '!='
| 'in'
| 'not' 'in'
| 'is' <- removed this one and it works now
;
So, I guess: never define the same keyword both as a literal in a parser rule and as a lexer rule.
Below is a cut-down version of a grammar that parses an input assembly file. Everything in my grammar is fine until I use labels that have 3 characters (i.e. the same length as an OPCODE in my grammar), so I'm assuming ANTLR is matching such a label as an OPCODE rather than a LABEL. How do I say "in this position, it should be a LABEL, not an OPCODE"?
Trial input:
set a, label1
set b, abc
Output from a standard rig gives:
line 2:5 missing EOF at ','
(OP_BAS set a (REF label1)) (OP_SPE set b)
When I step through it in the ANTLRWorks debugger, I see it start down instruction rule 2, but at the reference to "abc" it jumps to rule 3 and then fails at the ",".
I can solve this with massive left factoring, but that makes the grammar incredibly unreadable. I'm trying to find a compromise between readability and functionality (there isn't so much input that the global backtrack is a hit on performance).
grammar TestLabel;
options {
language = Java;
output = AST;
ASTLabelType = CommonTree;
backtrack = true;
}
tokens {
NEGATION;
OP_BAS;
OP_SPE;
OP_CMD;
REF;
DEF;
}
program
: instruction* EOF!
;
instruction
: LABELDEF -> ^(DEF LABELDEF)
| OPCODE dst_op ',' src_op -> ^(OP_BAS OPCODE dst_op src_op)
| OPCODE src_op -> ^(OP_SPE OPCODE src_op)
| OPCODE -> ^(OP_CMD OPCODE)
;
operand
: REG
| LABEL -> ^(REF LABEL)
| expr
;
dst_op
: PUSH
| operand
;
src_op
: POP
| operand
;
term
: '('! expr ')'!
| literal
;
unary
: ('+'! | negation^ )* term
;
negation
: '-' -> NEGATION
;
mult
: unary ( ( '*'^ | '/'^ ) unary )*
;
expr
: mult ( ( '+'^ | '-'^ ) mult )*
;
literal
: number
| CHAR
;
number
: HEX
| BIN
| DECIMAL
;
REG: ('A'..'C'|'I'..'J'|'X'..'Z'|'a'..'c'|'i'..'j'|'x'..'z') ;
OPCODE: LETTER LETTER LETTER;
HEX: '0x' ( 'a'..'f' | 'A'..'F' | DIGIT )+ ;
BIN: '0b' ('0'|'1')+;
DECIMAL: DIGIT+ ;
LABEL: ( '.' | LETTER | DIGIT | '_' )+ ;
LABELDEF: ':' ( '.' | LETTER | DIGIT | '_' )+ {setText(getText().substring(1));} ;
STRING: '\"' .* '\"' {setText(getText().substring(1, getText().length()-1));} ;
CHAR: '\'' . '\'' {setText(getText().substring(1, 2));} ;
WS: (' ' | '\n' | '\r' | '\t' | '\f')+ { $channel = HIDDEN; } ;
fragment LETTER: ('a'..'z'|'A'..'Z') ;
fragment DIGIT: '0'..'9' ;
fragment PUSH: ('P'|'p')('U'|'u')('S'|'s')('H'|'h');
fragment POP: ('P'|'p')('O'|'o')('P'|'p');
The parser has no influence on what tokens the lexer produces. So, the input "abc" will always be tokenized as an OPCODE, no matter what the parser tries to match.
What you can do is create a label parser rule that matches either a LABEL or an OPCODE and then use this label rule in your operand rule:
label
: LABEL
| OPCODE
;
operand
: REG
| label -> ^(REF label)
| expr
;
resulting in the following AST for your example input:
This will only match OPCODE, but will not change the type of the token. If you want the type to be changed as well, add a bit of custom code to the rule that changes it to type LABEL:
label
: LABEL
| t=OPCODE {$t.setType(LABEL);}
;
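The idea can be sketched with a toy model in Python. It is not ANTLR and only approximates how the lexer picks a token type; `lex_asm`, `label` and the simplified `RULES` table are hypothetical names made up for this illustration. The point it shows is that the token type is fixed before the parser runs, so the parser-side rule has to accept both types and may retype the token afterwards:

```python
import re

# Simplified versions of the lexer rules, in the same relative order as the grammar.
RULES = [
    ("REG", r"[A-Ca-cI-Ji-jX-Zx-z]"),
    ("OPCODE", r"[A-Za-z]{3}"),
    ("LABEL", r"[.A-Za-z0-9_]+"),
    ("COMMA", r","),
    ("WS", r"\s+"),
]

def lex_asm(text):
    """Toy lexer: longest match wins, earliest rule wins ties. No parser involved."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            match = re.compile(pattern).match(text, pos)
            if match and (best is None or match.end() > best[1]):
                best = (name, match.end())
        if best is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        name, end = best
        if name != "WS":
            tokens.append([name, text[pos:end]])  # mutable so the type can be changed
        pos = end
    return tokens

def label(token):
    """Counterpart of the label rule: accept LABEL or OPCODE, retype OPCODE to LABEL."""
    if token[0] == "OPCODE":
        token[0] = "LABEL"  # plays the role of {$t.setType(LABEL);}
    if token[0] != "LABEL":
        raise SyntaxError(f"expected a label, got {token[0]}")
    return ("REF", token[1])

tokens = lex_asm("set b, abc")
print(tokens)            # 'abc' arrives as an OPCODE token
print(label(tokens[3]))  # the parser-side rule accepts it and retypes it
```

A six-character name such as `label1` is still lexed as a LABEL here, because only three-letter words tie with the OPCODE rule.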