Parsing exception in Xtext grammar

I have created an Xtext grammar (see below), but when I run Eclipse I get the following exception:
required (...)+ loop did not match anything at input 'IPADDR_eth0'
I know the problem is probably in the grammar, but I don't know where.
So, where am I going wrong?
Regards,
grammar com.iamsoft.net.Validate with org.eclipse.xtext.common.Terminals
generate validate "http://www.iamsoft.com/net/Validate"
Model:
netDescription+=DescriptionPair+;
DescriptionPair:
ipaddr | netmask | speed | mtu | tso | gateway | router | subnet | no_vlans | vlan;
vlan:
'VLAN_' name=ID '='
value=IntList;
IntList:
valueList+=INT+ | '"' valueList+=INT+ '"';
IPAddrList:
ipNum1=INT '.' ipNum2=INT '.' ipNum3=INT '.' ipNum4=INT;
//List of numbers with 3 digit
no_vlans:
'NO_VLANS_' name=ID '=' list+=IntList;
subnet:
'SUBNET_' name=ID '=' list+=IPWithQuotes;
router:
'ROUTER_' name=ID '=' list+=IPWithQuotes;
gateway:
'GATEWAY_' name=ID '=' list+=IPWithQuotes;
mtu:
'MTU_' name=ID '=' val=IntWithQuotes;
tso:
'TSO_' name=ID '=' '"' value=ON_OFF '"';
terminal ON_OFF:
'on' | 'off';
netmask:
'NETMASK_' name=ID '=' list+=IPWithQuotes;
speed:
'SPEED_' name=ID '=' value=IntWithQuotes;
ipaddr:
'IPADDR_' name=ID '=' list+=IPWithQuotes;
IPWithQuotes:
IPAddrList | '"' IPAddrList '"';
IntWithQuotes:
value=INT | '"' value=INT '"';

The problem is that your IPADDR_eth0 is lexed as a single ID token, not as the keyword 'IPADDR_' followed by an ID.
It may help to change the ID rule:
terminal ID : '^'?('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'0'..'9')*;
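
To see why the whole of IPADDR_eth0 ends up as one ID, here is a minimal Python sketch of the choice the lexer faces at the start of the input. It assumes a longest-match lexer in which a keyword wins only when it is at least as long as the competing terminal; the regex is an approximation of the ID terminal above, and the helper name longest_candidate is made up for this illustration.

```python
import re

# Approximation of the ID terminal shown above (an assumption, not generated by Xtext).
ID = re.compile(r"\^?[A-Za-z_][A-Za-z_0-9]*")
KEYWORD = "IPADDR_"

def longest_candidate(text):
    """Return (kind, lexeme) for the longest match at the start of text."""
    candidates = []
    if text.startswith(KEYWORD):
        candidates.append(("keyword", KEYWORD))
    m = ID.match(text)
    if m:
        candidates.append(("ID", m.group()))
    # max() keeps the first of equally long candidates, so the keyword wins ties.
    return max(candidates, key=lambda c: len(c[1]))

print(longest_candidate("IPADDR_eth0"))   # ('ID', 'IPADDR_eth0')
print(longest_candidate("IPADDR_ eth0"))  # ('keyword', 'IPADDR_')
```

Because '_' and letters are both legal ID characters, the ID match runs past the end of the keyword and is longer, so the keyword never gets a chance.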

Related

Why are these bison rules useless?

I am trying to create a simple compiler using flex and bison, but bison reports every rule I've written as useless: "14 useless nonterminals and 66 useless rules". What makes a rule useless, and is there a way to fix it?
File: Decl_Class'*' 'EOF'
Decl_Class: CLASS IDENT '(' EXTENDS IDENT ')' '?' {Field'*'}
Field: Variable
|Constructor
|Method
Variable: Modifier'*' Expr_type Decl_variables
Decl_variables: Decl_variable
|Decl_variable ',' Decl_variables
Decl_variable: IDENT
|IDENT '=' Expr
Constructor: Modifier'*' IDENT '(' Expr_type | VOID ')' IDENT '(' Params '?' ')' {Instructions'*'}
Method: Modifier'*' '(' Expr_type | VOID ')' IDENT '(' Params '?' ')' {Instructions'*'}
Modifier: IDENT
Expr: IDENT
Params: '(' Expr_type ')' IDENT | '(' Expr_type ')' IDENT ',' Params
Expr_type: BOOLEAN | INT | DOUBLE | IDENT | INTEGER | REAL | TRU | FALS
| THIS | NULLVAL
| '(' Expr ')'
| Access|Access '=' Expr|Access '(' L_Expr '?' ')'
| NEW IDENT '(' L_Expr '?' ')'
| '+''+'Expr | '-''-'Expr | Expr'+''+' | Expr'-''-'
| '!'Expr | '-'Expr | '+'Expr
| Expr Operator Expr
| '(' Expr_type ')' Expr_type
Operator: "==" | "!=" | "<" | "<=" | ">" | ">=" | "+" | "-" | "*" | "/" | "%" | "&&" | "||"
Access: IDENT | Expr '.' IDENT
L_Expr: Expr | Expr ',' L_Expr
Instruction: ';'
| Expr';'
| Expr_type Decl_variables';'
| IF '(' Expr ')' Instruction
| IF '(' Expr ')' Instruction ELSE Instruction
| WHILE '(' Expr ')' Instruction
| FOR '(' L_Expr '?' ';' Expr '?' ';' L_Expr '?'')' Instruction
| FOR '(' Expr ')' Decl_variables ';' Expr '?' ';' L_Expr '?'')' Instruction { Instructions'*' }
| RETURN Expr '?'';'
Decl_Class: CLASS IDENT '(' EXTENDS IDENT ')' '?' {Field'*'}
Here the part inside the {} is a code action (which will produce a syntax error when compiled), not a reference to the Field non-terminal. So Field is never actually used, and neither are the non-terminals it references. That's what makes them useless: they're never used.
PS: At various places in your grammar you're using '*' and '?' in a way that suggests the intention may be to match zero or more or zero or one items respectively. Be aware that all '*' and '?' do is to match a token with the given value. There is no syntactic shortcut to repeat something or make it optional in bison - you'll need to define separate non-terminals for that.
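
As a rough illustration of the "separate non-terminals" mentioned above, the following Python sketch prints the helper rules one would typically write by hand for X* and X?. The _list and _opt suffixes and both function names are hypothetical conventions chosen for this example, not something bison provides.

```python
def zero_or_more(symbol):
    """Helper rule for a possibly empty sequence of `symbol` (stands in for X*)."""
    name = f"{symbol}_list"  # hypothetical naming convention
    # Left recursion is the usual choice for sequences in an LALR grammar.
    return name, f"{name}: /* empty */ | {name} {symbol} ;"

def optional(symbol):
    """Helper rule for an optional `symbol` (stands in for X?)."""
    name = f"{symbol}_opt"  # hypothetical naming convention
    return name, f"{name}: /* empty */ | {symbol} ;"

for helper, symbol in [(zero_or_more, "Field"), (optional, "Params")]:
    name, rule = helper(symbol)
    print(rule)
# Field_list: /* empty */ | Field_list Field ;
# Params_opt: /* empty */ | Params ;
```

In the grammar itself, Field'*' would then become Field_list and Params '?' would become Params_opt.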
PPS: In most (all?) languages that have ++ and -- operators, each of those is a single token, not two consecutive '+' or '-' tokens (so - -x would be double negation, and only --x without a space between the -s would be a decrement). So your rules for the decrement and increment operators are unusual in that regard.

missing EOF at 'Say'

I am writing a grammar to recognize the following input:
Say Hello Boss
Hello friend
Here is my complete grammar
grammar org.xtext.example.second.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/second/MyDsl"
Example:
statements+=Statement*;
Statement:
(IDLABEL)? Directives;
Directives:
TAG1 | TAG2 | TAG3 | TAG4;
TAG1: tag=('Hi'|'Hello') IDLABEL;
TAG2: tag=('Tag2') IDLABEL;
TAG3: tag=('Tag3') IDLABEL;
TAG4: tag=('Tag4') IDLABEL;
STRING_OPERANDS hidden(WS):
("*"|UNQUOTED|QUOTED)+;
terminal QUOTED:
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'";
terminal UNQUOTED:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '-' | '*' | "/" | "\\" | '(' | ')' | '$' | '=' |'#' |'.' | '"' |'#'|'+'|"'"|'<'|'>')*;
terminal IDLABEL:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9'|'='|'#')*;
For the input Say Hello Boss
I get the error "missing EOF at Say",
and for the input Hello Boss
I get the error "mismatched input 'Boss' expecting RULE_IDLABEL".
What is wrong with this grammar?
Boss matches both the rule IDLABEL and UNQUOTED. In cases where two rules can match the current input and both rules match the same prefix, the tokenizer uses the rule that comes first. So the input Boss produces an UNQUOTED token, not an IDLABEL token.
In fact all valid IDLABELs are also valid UNQUOTEDs, so you'll never get any IDLABEL tokens.
To fix this, you can change the order of UNQUOTED and IDLABEL, so that IDLABEL comes first.
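
The overlap can be checked mechanically. The sketch below is a simplified Python model, not Xtext itself: it transcribes the character sets of the two terminals from the question and picks the first rule whose set covers a whole word, which mimics the "first rule wins on a tie" behaviour for single words. The function name first_matching_rule is made up for this example.

```python
import string

# Character sets transcribed from the IDLABEL and UNQUOTED terminals in the question.
IDLABEL_CHARS = set(string.ascii_letters + string.digits + "_=#")
UNQUOTED_CHARS = set(string.ascii_letters + string.digits + "_-*/\\()$=#.\"+'<>")

def first_matching_rule(text, rules):
    """Return the name of the first rule whose character set covers all of text."""
    for name, chars in rules:
        if set(text) <= chars:
            return name
    return None

original_order = [("UNQUOTED", UNQUOTED_CHARS), ("IDLABEL", IDLABEL_CHARS)]
swapped_order = [("IDLABEL", IDLABEL_CHARS), ("UNQUOTED", UNQUOTED_CHARS)]

print(IDLABEL_CHARS <= UNQUOTED_CHARS)              # True: every IDLABEL is also an UNQUOTED
print(first_matching_rule("Boss", original_order))  # UNQUOTED
print(first_matching_rule("Boss", swapped_order))   # IDLABEL
```

With the order swapped, a word such as a-b would still come out as UNQUOTED, because '-' is not an IDLABEL character.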

ANTLR4 parser rules with other parser rules as arguments (meta-rules)

I would like to be able to write a "meta-rule" in ANTLR4 that takes a rule as an input argument and performs a set modification to that rule. Here's an example grammar:
grammar G;
WS: [ \t\n\r] + -> skip;
CHAR: [a-z];
term: (CHAR)+;
sum: term ('+' term)+;
pterm: '(' term ')' | '(' pterm ')';
psum: '(' sum ')' | '(' psum ')';
expr: term | sum | pterm | psum;
The rules for pterm and psum perform the same action on term and sum, enclosing them in possibly nested parentheses. I would like to be able to replace the last three lines above with something like the following:
enclose[rule]: '(' rule ')' | '(' enclose(rule) ')';
expr: term | sum | enclose(term) | enclose(sum);
Is there a way to construct a meta-rule like this?
The short answer is no.
It is better to refactor the grammar around the structurally significant terms:
expr: sum | LPAREN expr RPAREN ;
sum : term ('+' term)* ; // changed to Kleene star
term: CHAR+ ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : [a-z] ;
WS : [ \t\n\r]+ -> skip ;
The sum rule will consume all terms, so the expr rule only needs to handle sums.
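
To check which inputs such a refactoring should accept, here is a small hand-written recognizer in Python for the language from the question: a term or a sum, optionally wrapped in any number of nested parentheses. It is only a sketch for experimenting with inputs, not code generated by ANTLR, and the function name accepts is made up for this example.

```python
def accepts(text):
    """Recognize: expr := sum | '(' expr ')' ; sum := term ('+' term)* ; term := [a-z]+"""
    chars = [c for c in text if c not in " \t\n\r"]  # whitespace is skipped, as in WS
    pos = 0

    def peek():
        return chars[pos] if pos < len(chars) else None

    def term():
        nonlocal pos
        if peek() is None or not ("a" <= peek() <= "z"):
            return False
        while peek() is not None and "a" <= peek() <= "z":
            pos += 1
        return True

    def sum_():
        nonlocal pos
        if not term():
            return False
        while peek() == "+":
            pos += 1
            if not term():
                return False
        return True

    def expr():
        nonlocal pos
        if peek() == "(":
            pos += 1
            if not expr() or peek() != ")":
                return False
            pos += 1
            return True
        return sum_()

    # The whole input must be consumed, like an implicit EOF.
    return expr() and pos == len(chars)

print(accepts("a + b"))      # True
print(accepts("((a + b))"))  # True
print(accepts("(a) + b"))    # False: parentheses may only wrap the whole expression
```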

ANTLR v4 different tokens for the same symbols

How can I recognize different tokens for the same symbol in ANTLR v4? For example, in selected = $("library[title='compiler'] isbn"); the first = is an assignment, whereas the second = is an operator.
Here are the relevant lexer rules:
EQUALS
:
'='
;
OP
:
'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;
And here is the parser rule for that line:
assign
:
ID EQUALS DOLLAR OPEN_PARENTHESIS QUOTES ID selector ID QUOTES
CLOSE_PARENTHESIS SEMICOLON
;
selector
:
OPEN_BRACKET ID OP APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;
This correctly parses the line, as long as I use an OP different than =.
Here is the error log:
JjQueryParser::init:34:29: mismatched input '=' expecting OP
JjQueryParser::init:34:39: mismatched input ''' expecting '\"'
JjQueryParser::init:34:46: mismatched input '"' expecting '='
The problem cannot be solved in the lexer, since the lexer always returns the same token type for the same string. But it is quite easy to resolve in the parser. Just rewrite the rules as parser rules (lower case):
equals
: '='
;
op
:'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;
I had the same issue. Resolved in the lexer as follows:
EQUALS: '=';
OP : '|' EQUALS
| '*' EQUALS
| '~' EQUALS
| '$' EQUALS
| '!' EQUALS
| '^' EQUALS
;
This guarantees that the symbol '=' is represented by a single token all the way. Don't forget to update the relevant rule as follows:
selector
:
OPEN_BRACKET ID (OP|EQUALS) APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;

Characters Matching Multiple Lexer Rules in ANTLR

I've defined multiple lexer rules that potentially match the same character sequence. For example:
LBRACE: '{' ;
RBRACE: '}' ;
LPARENT: '(' ;
RPARENT: ')' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
SEMICOLON: ';' ;
ASTERISK: '*' ;
AMPERSAND: '&' ;
IGNORED_SYMBOLS: ('!' | '#' | '%' | '^' | '-' | '+' | '=' |
'\\'| '|' | ':' | '"' | '\''| '<' | '>' | ',' | '.' |'?' | '/' ) ;
// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' .* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' .* '\r'? '\n' {$channel=HIDDEN;};
STRING_LITERAL: '"' (STR_ESC | ~( '"' ))* '"';
fragment STR_ESC: '\\' '"' ;
CHAR_LITERAL : '\'' (CH_ESC | ~( '\'' )) '\'' ;
fragment CH_ESC : '\\' '\'';
My IGNORED_SYMBOLS and ASTERISK rules match /, " and *. Since they are (unintentionally) placed before my comment and string literal rules, which also start with /* and ", I expected the comment and string literal rules to be effectively disabled. Surprisingly, the ML_COMMENT, SL_COMMENT and STRING_LITERAL rules still work correctly.
This is somewhat confusing. Shouldn't a /, whether it is part of /* or a standalone /, always be matched and consumed by IGNORED_SYMBOLS before ML_COMMENT has any chance to match it?
What is the way the lexer decides which rules to apply if the characters match more than one rule?
What is the way the lexer decides which rules to apply if the characters match more than one rule?
Lexer rules are matched from top to bottom. In case two (or more) rules match the same number of characters, the one that is defined first has precedence over the one(s) later defined in the grammar. In case a rule matches N number of characters and a later rule matches the same N characters plus 1 or more characters, then the later rule is matched (greedy match).
Take the following rules for example:
DO : 'do';
ID : 'a'..'z'+;
The input "do" would obviously be matched by the rule DO.
And input like: "done" would be greedily matched by ID. It is not tokenized as the 2 tokens: [DO:"do"] followed by [ID:"ne"].
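
The two rules of thumb above (longest match first, earlier rule on a tie) can be modelled in a few lines of Python. This is only a sketch of the selection logic with regular expressions standing in for lexer rules; it is not how ANTLR implements its lexer, and the names tokenize, RULES and SYMBOL_RULES are made up for this example.

```python
import re

def tokenize(text, rules):
    """Longest match wins; on a tie, the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strictly greater: an earlier rule keeps the win on equal length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

RULES = [("DO", r"do"), ("ID", r"[a-z]+"), ("WS", r" +")]
print(tokenize("do", RULES))    # [('DO', 'do')]
print(tokenize("done", RULES))  # [('ID', 'done')]

# The situation from the question: a one-character symbol rule listed before a comment rule.
SYMBOL_RULES = [("IGNORED_SYMBOLS", r"[/*]"), ("ML_COMMENT", r"/\*[\s\S]*?\*/")]
print(tokenize("/* x */", SYMBOL_RULES))  # [('ML_COMMENT', '/* x */')]
```

The last call shows why the comment rule still works: it matches seven characters where the symbol rule matches one, so its position in the grammar does not matter.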

Resources