I'm looking for a way to prevent KEYWORDS matching at a place where those KEYWORDS are not expected.
Take a look at the following grammar. Both 'APPLY' and 'OUTPUT' are keywords.
'OUTPUT' has an argument that contains any characters.
Everything works fine but if this argument contains the word APPLY, an error is raised (extraneous input APPLY expecting RULE_END).
Is there a way to solve this issue?
Thanks.
Sample text
APPLY, 'an id' $
OUTPUT, A text $
OUTPUT, A text with the word APPLY $
DSL
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
statement+=Statement*;
Statement:
ApplyStatement | OutputStatement;
OutputStatement:
'OUTPUT' ',' out+=EXTENDLABEL* end=END;
ApplyStatement:
'APPLY' ',' id=LABELIDENTIFIER end=END;
terminal fragment LETTER:
'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T'
| 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' |
'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z';
terminal LABELIDENTIFIER:
"'"->"'";
terminal EXTENDLABEL:
(LETTER) (LETTER)*;
terminal END:
'$' !('\n' | '\r')*;
I see a few different ways to handle this. First, you could escape keywords where they appear as ordinary text: the Xbase language, for example, uses '^' as an escape character, so a keyword that would otherwise cause a problem can be written with a '^' prefix. Similarly, enclosing the text in delimiters such as apostrophes would help a lot. Of course, these solutions require changing the language itself, which may or may not be an option for you.
You might also replace your EXTENDLABEL terminal with a datatype rule. This allows greater flexibility with regard to conflict resolution; in the worst case you could add the language keywords as alternatives. A tangentially related case in the Eclipse forums pointed me to this route.
Another solution is to change the type (ID) of the token before your parser uses it. Tokens are produced by the lexer, and the parser consumes them to build your AST. So the idea is to modify the tokens before they are passed to the parser.
To do it you need to declare your own parser:
@Override
public Class<? extends IParser> bindIParser() {
return ModelParser.class;
}
Note: your parser will extend the parser generated from your grammar.
Then you need to override the following method to introduce your own token stream:
override protected XtextTokenStream createTokenStream(TokenSource tokenSource) {
return new ModelTokenStream(tokenSource, getTokenDefProvider());
}
Your own token stream class (called ModelTokenStream here; the name is only illustrative) needs to extend 'XtextTokenStream'.
Next, override the method 'LT' as follows:
override LT(int k) {
var Token token = super.LT(k)
if(token != null && token.text != null) token.tokenOverride(k);
token
}
Then you just need to change the ID :
def void tokenOverride(Token token, int index){
switch (token.text){
case "APPLY" : {
overrideType(token, InternalModelParser.RULE_ID);
}
}
}
def void overrideType(Token token, int i) {
token.type = i
}
Note: remember to add your own condition before changing the type of the token; in this example every 'APPLY' token becomes an ID.
And of course, inside the switch you can use the type of the 'APPLY' token instead of its text.
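The idea of retyping tokens between lexer and parser can be sketched outside Xtext. The following is a minimal, language-neutral illustration in Python, not the Xtext API: the token type names, the `lex` helper and the "keyword only at statement start" condition are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

# Hypothetical token type names, chosen for this sketch only.
KEYWORD, COMMA, LABEL, END = "KEYWORD", "COMMA", "LABEL", "END"
KEYWORDS = {"APPLY", "OUTPUT"}


@dataclass
class Token:
    type: str
    text: str


def lex(line: str) -> List[Token]:
    """Naive context-free lexer: every keyword-looking word becomes KEYWORD."""
    tokens = []
    for word in line.replace(",", " , ").split():
        if word == ",":
            tokens.append(Token(COMMA, word))
        elif word == "$":
            tokens.append(Token(END, word))
        elif word in KEYWORDS:
            tokens.append(Token(KEYWORD, word))
        else:
            tokens.append(Token(LABEL, word))
    return tokens


def retype_keywords(tokens: Iterable[Token]) -> Iterator[Token]:
    """Retype keywords that do not start a statement, before the parser sees them."""
    at_statement_start = True
    for token in tokens:
        # The condition guards the override: only misplaced keywords change type.
        if token.type == KEYWORD and not at_statement_start:
            token = Token(LABEL, token.text)
        at_statement_start = token.type == END
        yield token


result = list(retype_keywords(lex("OUTPUT, A text with the word APPLY $")))
```

Here the leading OUTPUT keeps its keyword type, while the APPLY inside the argument reaches the parser as an ordinary label. In the Xtext version the same decision would live in the overridden 'LT' method.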
Related
I am writing grammar to recognize following input
Say Hello Boss
Hello friend
Here is my complete grammar
grammar org.xtext.example.second.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/second/MyDsl"
Example:
statements+=Statement*;
Statement:
(IDLABEL)? Directives;
Directives:
TAG1 | TAG2 | TAG3 | TAG4;
TAG1: tag=('Hi'|'Hello') IDLABEL;
TAG2: tag=('Tag2') IDLABEL;
TAG3: tag=('Tag3') IDLABEL;
TAG4: tag=('Tag4') IDLABEL;
STRING_OPERANDS hidden(WS):
("*"|UNQUOTED|QUOTED)+;
terminal QUOTED:
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'";
terminal UNQUOTED:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '-' | '*' | "/" | "\\" | '(' | ')' | '$' | '=' |'#' |'.' | '"' |'#'|'+'|"'"|'<'|'>')*;
terminal IDLABEL:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9'|'='|'#')*;
For the input, Say Hello Boss
I am getting an error "missing EOF at Say"
and for the input Hello Boss
I am getting an error "mismatched input 'Boss' expecting RULE_IDLABEL"
What is wrong with this grammar?
Boss matches both the rule IDLABEL and UNQUOTED. In cases where two rules can match the current input and both rules match the same prefix, the tokenizer uses the rule that comes first. So the input Boss produces an UNQUOTED token, not an IDLABEL token.
In fact all valid IDLABELs are also valid UNQUOTEDs, so you'll never get any IDLABEL tokens.
To fix this, you can change the order of UNQUOTED and IDLABEL, so that IDLABEL comes first.
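The tie-breaking behaviour described above can be reproduced with a small stand-alone matcher. This is only a sketch in Python under simplifying assumptions: the character classes are shortened versions of the two terminals, `+` is used instead of `*` to avoid empty matches, and `next_token` is a hypothetical helper, not part of Xtext or ANTLR.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, str]  # (token name, regular expression)

# Simplified versions of the two terminals; the character classes are
# shortened for the sketch and are not the full sets from the grammar.
UNQUOTED = ("UNQUOTED", r"[A-Za-z0-9_\-*/()$=#.+<>]+")
IDLABEL = ("IDLABEL", r"[A-Za-z0-9_=#]+")


def next_token(rules: List[Rule], text: str) -> Tuple[str, str]:
    """Return the longest match; on a tie, the rule listed first wins."""
    best_name, best_text = "", ""
    for name, pattern in rules:
        match = re.match(pattern, text)
        # Strictly longer only: an equally long later rule never replaces an earlier one.
        if match and len(match.group()) > len(best_text):
            best_name, best_text = name, match.group()
    return best_name, best_text


original_order = next_token([UNQUOTED, IDLABEL], "Boss")
swapped_order = next_token([IDLABEL, UNQUOTED], "Boss")
```

With the original order, "Boss" comes out as UNQUOTED; with IDLABEL listed first it comes out as IDLABEL. Input that only UNQUOTED can match in full, such as "a.b", still becomes UNQUOTED in either order, because the longer match wins.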
Grammatical rules are defined as:
an integer literal is a sequence of digits;
a boolean literal is one of true or false;
a keyword is one of if, while, or the boolean literals;
a variable is a string that starts with a letter and is followed by letters or digits, and
is not a keyword;
an operator is one of <= >= == != && || = + - * < >
punctuation is one of the ( ) { } , ; characters.
Based on the description I wrote out the grammar in EBNF notation as follows:
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
int literal = digit {digit} ;
bool = "true" | "false" ;
keyword = "if" | "while" | bool ;
letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
"Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" |
"h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
"y" | "z" ;
variable = (letter {digit | letter}) -keyword ;
operator = "<=" |">=" | "==" | "!=" | "&&" | "||" | "=" | "+" | "-" | "*" | "<" | ">" | "!" ;
punctuation = "(" | ")" | "{" | "}" | "," | ";" ;
Now I want to calculate the FIRST, FOLLOW and PREDICT sets, but I'm not sure how to do it from EBNF notation. Should I first change it to Chomsky normal form? If so, then how? Would this be right?
DIGIT -> 0 1 2 3 ...
INT -> DIGIT | DIGIT DIGIT
BOOL -> true false
KEYWORD -> if while BOOL
LETTER -> A B C D ...
VARIABLE -> LETTER | LETTER DIGIT | LETTER LETTER
First and follow are pretty straightforward, even with EBNF. In this case, they are even easier, since you have no nullable non-terminals. You do need to watch out for repetition groups, since the repetition count can be 0. If you have:
... A { X ... } Y ...
then FOLLOW(A) must include both FIRST(X) and FIRST(Y). And if you have
C -> A { X }
then FOLLOW(A) must include FOLLOW(C).
None of this should be complicated if you're doing the computation by hand. For an automated solution, I would probably unroll the repetition operators into unextended BNF by creating new non-terminals, but you could do the computation directly on the EBNF as well.
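As a rough sketch of the automated route, the repetition group in `C -> A { X }` can be unrolled with a helper non-terminal and the sets computed by fixed-point iteration. The toy grammar below (`S -> C y`, with terminals `a`, `x`, `y` and the helper `R`) is invented purely for illustration and is not the grammar from the question.

```python
from typing import Dict, List, Set

EPS = "ε"
Grammar = Dict[str, List[List[str]]]

# S -> C y ; C -> A { X }   unrolled as   C -> A R ; R -> X R | ε
GRAMMAR: Grammar = {
    "S": [["C", "y"]],
    "C": [["A", "R"]],
    "R": [["X", "R"], []],  # helper non-terminal for the repetition group
    "A": [["a"]],
    "X": [["x"]],
}


def first_of_sequence(seq: List[str], first: Dict[str, Set[str]]) -> Set[str]:
    """FIRST of a symbol sequence; contains EPS if every symbol is nullable."""
    result: Set[str] = set()
    for symbol in seq:
        symbol_first = first.get(symbol, {symbol})  # a terminal's FIRST is itself
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)
    return result


def compute_first(grammar: Grammar) -> Dict[str, Set[str]]:
    first: Dict[str, Set[str]] = {nt: set() for nt in grammar}
    changed = True
    while changed:  # iterate until no set grows any more
        changed = False
        for nt, productions in grammar.items():
            for production in productions:
                new = first_of_sequence(production, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar: Grammar, first: Dict[str, Set[str]], start: str) -> Dict[str, Set[str]]:
    follow: Dict[str, Set[str]] = {nt: set() for nt in grammar}
    follow[start].add("$")  # end-of-input marker
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for production in productions:
                for i, symbol in enumerate(production):
                    if symbol not in grammar:
                        continue  # terminals have no FOLLOW set
                    rest = first_of_sequence(production[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:  # everything after `symbol` can vanish
                        new |= follow[nt]
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow


FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST, "S")
```

In this toy grammar FOLLOW(A) ends up containing both `x` (from FIRST(X)) and `y` (from FOLLOW(C)), which is the rule stated above for repetition groups.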
The one wrinkle is your use of the set difference operator -, in
variable = (letter {digit | letter}) - keyword ;
In this particular case, it does not create any difficulties, but the general solution is tricky. In fact, since there is no guarantee that the difference between two context-free languages is context-free, it will not really be possible to find a truly general solution.
Predict sets are another story. Indeed, I'm not even 100% sure what a predict set would be for EBNF, since you need to be able to predict repetition of a subpattern, not just derivations. Again, expanding to BNF might help, but it can happen that the expansion creates a predict conflict which didn't exist in the original grammar.
The grammar you present is incomplete, so I don't know how useful computing LL(1) sets will be. I suppose that it is intended to be just the lexical part of the grammar, but really there is a reason why lexical analysis is usually done with regular expressions rather than context-free parsing.
Several reasons, really: aside from the fact that lexical analysis usually involves reasonably readable regular expressions, there is also the important fact that lexical analysis does not usually involve parsing the internal structure of a token. That lets you choose to simply recognize a repeated element rather than worrying about whether the parse tree for the repetition should be left- or right-leaning.
The key insight about computing FIRST and FOLLOW sets is that they mean just what their names indicate. The FIRST set of a non-terminal is precisely the set of tokens which can begin a complete derivation from the non-terminal; similarly, the FOLLOW set is precisely the set of tokens which might immediately follow the non-terminal during a derivation from the start symbol. In many simple grammars, these sets can be computed by inspection; that certainly should be the case for your grammar, at least for the FIRST sets.
The fact that you have no start symbol here is another indication that you are probably not solving the right problem; without a start symbol, there is no meaningful definition of FOLLOW.
If you are trying to do lexical analysis, you might be able to get away with:
start -> { token }
token -> int literal | keyword | identifier | ...
Although to be formally correct, you'd also need to handle "ignored tokens" such as comments and whitespace.
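To illustrate why regular expressions are the usual tool here, the token classes from the question can be recognized with one combined pattern. This is a sketch in Python, assuming the operator and punctuation lists from the prose description; the group names and the `tokenize` helper are made up for the example. The "variable minus keyword" difference is handled by a simple lookup after matching.

```python
import re
from typing import List, Tuple

KEYWORDS = {"if", "while", "true", "false"}

# Longer operators are listed before their one-character prefixes.
TOKEN_PATTERN = re.compile(r"""
    (?P<ws>\s+)                                   # ignored token
  | (?P<int>[0-9]+)
  | (?P<word>[A-Za-z][A-Za-z0-9]*)                # keyword or variable
  | (?P<operator><=|>=|==|!=|&&|\|\||[=+\-*<>])
  | (?P<punctuation>[(){},;])
""", re.VERBOSE)


def tokenize(source: str) -> List[Tuple[str, str]]:
    """start -> { token }, with whitespace recognized and then dropped."""
    tokens = []
    position = 0
    while position < len(source):
        match = TOKEN_PATTERN.match(source, position)
        if match is None:
            raise ValueError(f"unexpected character {source[position]!r} at {position}")
        kind, text = match.lastgroup, match.group()
        position = match.end()
        if kind == "ws":
            continue
        if kind == "word":
            # "identifier shape minus keyword", done as a set lookup
            kind = "keyword" if text in KEYWORDS else "variable"
        tokens.append((kind, text))
    return tokens


tokens = tokenize("while (x1 <= 10) { x1 = x1 + 1; }")
```

No internal structure of a token is parsed: a variable is just "a letter followed by letters or digits", recognized as one repeated element.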
How can I recognize different tokens for the same symbol in ANTLR v4? For example, in selected = $("library[title='compiler'] isbn"); the first = is an assignment, whereas the second = is an operator.
Here are the relevant lexer rules:
EQUALS
:
'='
;
OP
:
'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;
And here is the parser rule for that line:
assign
:
ID EQUALS DOLLAR OPEN_PARENTHESIS QUOTES ID selector ID QUOTES
CLOSE_PARENTHESIS SEMICOLON
;
selector
:
OPEN_BRACKET ID OP APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;
This correctly parses the line, as long as I use an OP different than =.
Here is the error log:
JjQueryParser::init:34:29: mismatched input '=' expecting OP
JjQueryParser::init:34:39: mismatched input ''' expecting '\"'
JjQueryParser::init:34:46: mismatched input '"' expecting '='
The problem cannot be solved in the lexer, since the lexer always returns the same token type for the same string. But it is quite easy to resolve in the parser. Just rewrite the rules in lower case, which makes them parser rules:
equals
: '='
;
op
:'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;
I had the same issue. Resolved in the lexer as follows:
EQUALS: '=';
OP : '|' EQUALS
| '*' EQUALS
| '~' EQUALS
| '$' EQUALS
| '!' EQUALS
| '^' EQUALS
;
This guarantees that the symbol '=' is represented by a single token all the way. Don't forget to update the relevant rule as follows:
selector
:
OPEN_BRACKET ID (OP|EQUALS) APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;
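The principle behind both answers (one token type per string, with the choice between "assignment" and "operator" made where the token is used) can be sketched without ANTLR. The Python below is an illustration only; the regular expression, `lex` and `parse_selector` are hypothetical stand-ins for the generated lexer and the selector rule.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (type, text)

# "=" on its own is always EQUALS; OP covers only the two-character operators.
TOKEN_RE = re.compile(
    r"(?P<OP>[|*~$!^]=)|(?P<EQUALS>=)|(?P<ID>[A-Za-z_]\w*)"
    r"|(?P<OPEN_BRACKET>\[)|(?P<CLOSE_BRACKET>\])|(?P<APOSTROPHE>')"
)


def lex(text: str) -> List[Token]:
    tokens, position = [], 0
    while position < len(text):
        match = TOKEN_RE.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character at {position}")
        tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


def parse_selector(tokens: List[Token]) -> Tuple[str, str, str]:
    """selector : OPEN_BRACKET ID (OP | EQUALS) APOSTROPHE ID APOSTROPHE CLOSE_BRACKET"""
    expected = [{"OPEN_BRACKET"}, {"ID"}, {"OP", "EQUALS"}, {"APOSTROPHE"},
                {"ID"}, {"APOSTROPHE"}, {"CLOSE_BRACKET"}]
    if len(tokens) != len(expected):
        raise SyntaxError("wrong number of tokens")
    for (kind, text), allowed in zip(tokens, expected):
        if kind not in allowed:
            raise SyntaxError(f"mismatched input {text!r} expecting {sorted(allowed)}")
    return tokens[1][1], tokens[2][1], tokens[4][1]


plain = parse_selector(lex("[title='compiler']"))
prefixed = parse_selector(lex("[title~='compiler']"))
```

Both selectors parse, because the position in the rule accepts either token type while the lexer never has to guess what a lone '=' means.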
So I have a lexer with a token that is enabled or disabled depending on a boolean property.
I create an input stream and parse a text. My token is called PHRASE_TEXT and should match anything falling within this pattern '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?
I tokenize "foo bar" and as expected I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text I get "foo , bar" so 2 tokens instead of one. This is also expected behavior.
The problem comes when setting the property to true again. I would expect the same text to tokenize to the whole 1 token "foo bar" but instead is tokenized to the 2 tokens from before. Is this a bug on my part? What am I doing wrong here? I tried using new instances of the tokenizer and reusing the same instance but it doesn't seem to work either way. Thanks in advance.
Edit : Part of my grammar follows below
grammar LuceneQueryParser;
@header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}
@lexer::members {
public boolean phrases = true;
}
@parser::members {
public boolean phraseQueries = true;
}
mainQ : LPAREN query RPAREN
| query
;
query : not ((AND|OR)? not)* ;
andClause : AND ;
orClause : OR ;
not : NOT? modifier? clause;
clause : qualified
| unqualified
;
unqualified : LBRACK range_in RBRACK
| LCURL range_out RCURL
| truncated
| {phraseQueries}? quoted
| LPAREN query RPAREN
| normal
;
truncated : TERM_TEXT_TRUNCATED;
range_in : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
qualified : TERM_TEXT COLON unqualified ;
normal : TERM_TEXT;
quoted : PHRASE_TEXT;
modifier : PLUS
| MINUS
;
PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;
TERM_TEXT : (TERM_CHAR|ESCAPE)+;
TERM_CHAR : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '*' | '|' | '&' | '?' );
ESCAPE : '\\' ~[];
The problem seems to be that after I set phrases to false, and then to true again, no more tokens are recognized as PHRASE_TEXT. I know that as a guideline I should define my grammars to be unambiguous, but this is basically the way it has to end up looking: tokenizing a string with quotes in 2 different modes, depending on the situation.
I'm going to update this with an answer that a colleague of mine helpfully pointed out. The generated lexer class has a static DFA[] array shared between all instances of the class. Once the property was set to false instead of the default true, the decision tree was apparently changed for all object instances. The fix was to keep two separate DFA[] arrays, one for the true and one for the false value of the property I was modifying. I think making that array non-static would be too expensive, and I really can't think of another fix.
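The general hazard, and the shape of the fix, can be shown with a toy analogy. This Python sketch is not ANTLR's implementation and does not reproduce the exact sequence from the question; `scan`, both classes and their caches are invented to show how state shared between instances can ignore a per-instance flag, and how keeping one cache per flag value avoids that.

```python
from typing import Dict, List, Tuple


def scan(text: str, phrases: bool) -> List[str]:
    """Toy scanner: a quoted phrase is one token only when `phrases` is on."""
    if phrases and len(text) >= 2 and text[0] == text[-1] == '"':
        return [text]
    return text.split()


class SharedCacheLexer:
    # Class attribute: one dictionary for all instances, like a static field.
    cache: Dict[str, List[str]] = {}

    def __init__(self, phrases: bool = True) -> None:
        self.phrases = phrases

    def tokenize(self, text: str) -> List[str]:
        if text not in self.cache:  # the flag is not part of the key
            self.cache[text] = scan(text, self.phrases)
        return self.cache[text]


class PerFlagCacheLexer(SharedCacheLexer):
    # Still shared between instances, but with one entry per flag value.
    flag_cache: Dict[Tuple[bool, str], List[str]] = {}

    def tokenize(self, text: str) -> List[str]:
        key = (self.phrases, text)
        if key not in self.flag_cache:
            self.flag_cache[key] = scan(text, self.phrases)
        return self.flag_cache[key]


query = '"foo bar"'
stale = [SharedCacheLexer(flag).tokenize(query) for flag in (False, True)]
fresh = [PerFlagCacheLexer(flag).tokenize(query) for flag in (False, True)]
```

In `stale`, the second lexer is a brand-new instance with the flag on, yet it still returns the two-token result cached by the first one. In `fresh`, each flag value gets its own entry, so the phrase is recognized again.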
I'm working on an SQL grammar in ANTLR which allows quoted identifiers (table names, field names, etc), as well as quoted literal strings.
The problem is that this grammar seems to always match quoted inputs as "QUOTED_LITERAL", and never as IDs wrapped in quotes.
Here are my results:
input: 'blahblah' result: string_literal as expected.
input: field1 result: column_name as expected
input: table.field1 result: column_spec as expected
input: 'table'.'field1' result: string_literal, MissingTokenException
Below is my simplified grammar for the expression portion of the SQL grammar, if anybody can help identify what is needed to match quoted rules other than the quoted literal, thanks.
grammar test;
expression
:
simpleExpression EOF!
;
simpleExpression
:
column_spec
| literal_value
;
column_spec
:
(table_name '.')? column_name
| ('\''table_name '\'''.')? '\'' column_name '\''
| ('\"'table_name '\"' '.')? '\"' column_name '\"'
;
string_literal: QUOTED_LITERAL ;
boolean_literal: 'TRUE' | 'FALSE' ;
literal_value :
(
string_literal
| boolean_literal
)
;
table_name :ID;
column_name :ID;
QUOTED_LITERAL:
( '\''
( ('\\' '\\') | ('\'' '\'') | ('\\' '\'') | ~('\'') )*
'\'' )
|
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
;
ID
:
( 'A'..'Z' | 'a'..'z' ) ( 'A'..'Z' | 'a'..'z' | '_' | '0'..'9'| '::' )*
;
WHITE_SPACE : ( ' '|'\r'|'\t'|'\n' ) {$channel=HIDDEN;} ;
In case anybody is interested, I removed a little bit of the flexibility from the quoted literal strings. Literal strings can only be quoted by single quotes, and identifiers can be optionally quoted by double quotes. As long as the literal quote and the identifier quote is well defined and they don't overlap, the grammar is trivial.
This policy makes the grammar much cleaner, and doesn't remove the ability to quote identifiers. I make use of the JDBC method getIdentifierQuote to report which quote can be used to wrap identifiers.
This is your classical shift/reduce conflict. (Except that ANTLR does not shift or reduce, since it is not a stack automaton.)
You have the following problem:
When you are in the simpleExpression state, you need to decide which branch to take with one token of lookahead. In the case of ANTLR, since no distinction is made between lexer and parser, that one token is a single character. (You should see a warning from ANTLR about the conflict.)
It gets even better: what is the difference between "Bob Dillan" and "table1"? From the parser's point of view, none. So how do you expect to distinguish between:
('\"'table_name '\"' '.')? '\"' column_name '\"'
and
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
I strongly suggest rewriting the simpleExpression rule to:
simpleExpression:
IDENTIFIER |
IDENTIFIER '.' IDENTIFIER |
QUOTED_LITERAL |
QUOTED_LITERAL '.' QUOTED_LITERAL |
boolean_literal;
And then decide in the action code of simpleExpression what to do. Especially since I am quite sure that you can reference a table with a quoted name; nevertheless, "users" and "Bob Dillan" are syntactically equal.
It also depends on the greater grammar; you may also be able to resolve the ambiguity on a higher level.
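What "deciding in the action code" might look like can be sketched as a small classification step. The Python below is only an illustration under an assumed policy (two dotted parts are a column reference, a lone quoted token is a string literal); `unquote` and `classify` are hypothetical helpers, and the real decision depends on the rest of the grammar.

```python
from typing import List, Optional, Tuple


def unquote(text: str) -> str:
    """Drop one pair of matching single or double quotes, if present."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    return text


def classify(parts: List[str]) -> Tuple[str, Optional[str], str]:
    """Classify the token texts of one simpleExpression (dots already dropped)."""
    if len(parts) == 2:
        # Assumed policy: anything of the form X.Y is a table/column reference.
        table, column = parts
        return ("column_spec", unquote(table), unquote(column))
    (only,) = parts
    if only in ("TRUE", "FALSE"):
        return ("boolean_literal", None, only)
    if unquote(only) != only:
        # Assumed policy: a lone quoted token is a string literal.
        return ("string_literal", None, unquote(only))
    return ("column_spec", None, only)


dotted = classify(["'table'", "'field1'"])
literal = classify(["'blahblah'"])
```

The grammar stays unambiguous because it no longer tries to tell quoted identifiers and quoted literals apart; the surrounding shape of the expression does that afterwards.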
The ANTLR lexer is greedy, in that when there are two possible token matches, it will match the longest possible one.
When the lexer sees 'some_id', it can match the first quote as just a quote, or a quoted literal. The literal is longer, so that matches.
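The longest-match behaviour is easy to reproduce in isolation. This is a minimal Python sketch with three simplified rules (no escape handling, single quotes only); `longest_match` is a hypothetical helper, not ANTLR code.

```python
import re
from typing import Optional, Tuple

# Simplified rules for the sketch: no escapes, single quotes only.
RULES = [
    ("QUOTE", r"'"),
    ("ID", r"[A-Za-z]+"),
    ("QUOTED_LITERAL", r"'[^']*'"),
]


def longest_match(text: str, position: int = 0) -> Optional[Tuple[str, str]]:
    """Try every rule at `position` and keep the longest match."""
    best: Optional[Tuple[str, str]] = None
    for name, pattern in RULES:
        match = re.compile(pattern).match(text, position)
        if match and (best is None or len(match.group()) > len(best[1])):
            best = (name, match.group())
    return best


quoted = longest_match("'some_id'")
dangling = longest_match("'unterminated")
```

For 'some_id' both QUOTE and QUOTED_LITERAL match at the first character, and the nine-character literal wins. Only when no closing quote exists does the single QUOTE token come out.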
As a side note, you generally do not want lexer rules that can match nothing (like ID), and parser rules should reference token names rather than use string constants.
What you want to do is something like this.
QUOTE: '\'';
ID: ('a'..'z' | 'A'..'Z')+; // Must have at least one character
QUOTED_LITERAL: QUOTE ( (ID QUOTE) => { $type=QUOTE; } ) | .* QUOTE;
id: ID | QUOTE ID QUOTE;
quoted_literal: QUOTED_LITERAL | QUOTE ID QUOTE;
If the lexer sees something that looks like a quoted id, it cannot tell which to use, so it breaks it up into smaller tokens. In your parser, you use id where you expect a possibly quoted ID, and quoted_literal where you expect a QUOTED_LITERAL.
The syntactical predicate in QUOTED_LITERAL prevents it from matching the full quote when the input is ambiguous.
Looking at this, it will fail to correctly parse lines like
'tag' text 'second'
as ' text ' will be parsed as a QUOTED_LITERAL. If that is a valid input, then you would need something like
fragment QUOTED_ID;
QUOTED_LITERAL: QUOTE ( ID {$type=QUOTED_ID} | .* ) QUOTE;
id: ID | QUOTED_ID;
quoted_literal: QUOTED_LITERAL | QUOTED_ID;
(My example does not cover all the cases in your input, but extending it should be obvious. You also probably need some actions to either generate the correct tokens in your AST or add/remove quotes from the text, depending on what you do after you parse.)