ANTLR v4 different tokens for the same symbols - parsing

How can I recognize different tokens for the same symbol in ANTLR v4? For example, in selected = $("library[title='compiler'] isbn"); the first = is an assignment, whereas the second = is an operator.
Here are the relevant lexer rules:
EQUALS
:
'='
;
OP
:
'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;
And here is the parser rule for that line:
assign
:
ID EQUALS DOLLAR OPEN_PARENTHESIS QUOTES ID selector ID QUOTES
CLOSE_PARENTHESIS SEMICOLON
;
selector
:
OPEN_BRACKET ID OP APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;
This correctly parses the line, as long as I use an OP different than =.
Here is the error log:
JjQueryParser::init:34:29: mismatched input '=' expecting OP
JjQueryParser::init:34:39: mismatched input ''' expecting '\"'
JjQueryParser::init:34:46: mismatched input '"' expecting '='

The problem cannot be solved in the lexer, since the lexer does always return one token type for the same string. But it would be quite easy to resolve it in the parser. Just rewrite the rules lower case:
equals
: '='
;
op
:'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;

I had the same issue. Resolved in the lexer as follows:
EQUALS: '=';
OP : '|' EQUALS
| '*' EQUALS
| '~' EQUALS
| '$' EQUALS
| '!' EQUALS
| '^' EQUALS
;
This guarantees that the symbol '=' is represented by a single token all the way. Don't forget to update the relevant rule as follows:
selector
:
OPEN_BRACKET ID (OP|EQUALS) APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;

Related

Why doesn't Bison accept this grammar file?

When I use the command bison -d -o parser.java parser.y to generate a parser from my grammar file parser.y, Bison produces the following error:
:8.8-10: syntax error, unexpected string, expecting char or identifier or type
Here is the file parser.y:
%{
import java.util.;
import java.io.;
%}
%start PROGRAM
%token number identifier function break call if else let read return while write
%token "(" ")" "{" "}" ";" "=" "+" "-" "" "/" "%" "<" ">" " <= " " >= " "==" "!=" "&" "|" "~" "!"
%left "+" "-"
%left "" "/" "%"
%left "&" "|"
%nonassoc "!"
%type <Node> PROGRAM FUNCTION PARAMLIST BLOCK STATEMENT IF ELSE EXPR
%type <String> identifier
%type <Integer> number
%union {
Node node;
String identifier;
int number;
}
%%
PROGRAM:
| PROGRAM FUNCTION
| BLOCK
;
FUNCTION:
function identifier '(' PARAMLIST ')' BLOCK
;
PARAMLIST:
identifier
| identifier ',' PARAMLIST
|
;
BLOCK:
'{' STATEMENT '}'
;
STATEMENT:
BREAK
| CALL ';'
| IF
| LET
| READ
| RETURN
| WHILE
| WRITE
;
BREAK:
break ';'
;
CALL:
call identifier '(' ARGLIST ')'
;
ARGLIST:
EXPR
| EXPR ',' ARGLIST
|
;
IF:
if EXPR BLOCK ELSE
;
ELSE:
else BLOCK
|
;
LET:
let identifier '=' EXPR ';'
| let identifier '=' CALL ';'
;
READ:
read identifier ';'
;
RETURN:
return EXPR ';'
;
WHILE:
while EXPR BLOCK
;
WRITE:
write EXPR ';'
;
EXPR:
number
| identifier
| '(' EXPR ')'
| '!' EXPR
| '~' EXPR
| EXPR '+' EXPR
| EXPR '-' EXPR
| EXPR '*' EXPR
| EXPR '/' EXPR
| EXPR '%' EXPR
| EXPR '&' EXPR
| EXPR '|' EXPR
| EXPR '<' EXPR
| EXPR '>' EXPR
| EXPR "<=" EXPR
| EXPR ">=" EXPR
| EXPR "==" EXPR
| EXPR "!=" EXPR
;
%%
int yyerror(String s) {
System.err.println("error: " + s);
}
Bison doesn't allow you to declare quoted token names (such as "(") with the %token declaration. It knows they are tokens; they cannot be anything else.
You use the %token declaration to declare symbolic names for tokens, which you will find useful when writing your lexer. In the declaration, the symbolic name comes first, optionally followed by the double-quoted alias. You can repeat that as often as you like. For example, you could write:
%token TK_LE "<=" TK_GE ">="
You can then use either the symbolic name or the alias in your grammar, but using the alias makes your grammar more readable. Also, Bison uses the alias when constructing error messages, which is a good thing since "expecting TK_SEMIC" is not a great way to communicate with a user that a ";" was required.
Keep in mind that a single-quoted single character token, such as '(', is not the same token as the double-quoted alias. In your grammar, you use '(' but attempt to declare "(". Had you succeeded in declaring "(", you would have gotten an "unused token" warning. Since '(' doesn't require a symbolic name, you can just remove the declaration. You will only need them for multicharacter tokens like "<=". (Note that spaces are significant inside quotes. " <= " is not the same as "<=".)
Symbolic token names are used as Java values, so their names cannot conflict with variables or Java keywords. You cannot, for example, use break as a symbolic token name. Trying to do so will cause compilation errors.
For this reason, it's customary to write token names in ALL_CAPS, and non-terminals in lower case. Non-terminals names are not used in the generated code, so you can use whatever names you wish.
You reverse this convention, which will cause a variety of errors when you compile the generated parser, and which is hard to read for those of us accustomed to the standard style.
A couple of other notes:
The bison Java interface does not use a %union declaration. The %type declarations are sufficient.
You are missing precedence declarations for many operators, particularly comparison operators. That will lead to a large number of parser conflicts. Make sure you write the precedence levels in the correct order.

missing EOF at 'Say'

I am writing grammar to recognize following input
Say Hello Boss
Hello friend
Here is my complete grammar
grammar org.xtext.example.second.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/second/MyDsl"
Example:
statements+=Statement*;
Statement:
(IDLABEL)? Directives;
Directives:
TAG1 | TAG2 | TAG3 | TAG4;
TAG1: tag=('Hi'|'Hello') IDLABEL;
TAG2: tag=('Tag2') IDLABEL;
TAG3: tag=('Tag3') IDLABEL;
TAG4: tag=('Tag4') IDLABEL;
STRING_OPERANDS hidden(WS):
("*"|UNQUOTED|QUOTED)+;
terminal QUOTED:
"'" ( '\\' . /* 'b'|'t'|'n'|'f'|'r'|'u'|'"'|"'"|'\\' */ | !('\\'|"'") )* "'";
terminal UNQUOTED:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '-' | '*' | "/" | "\\" | '(' | ')' | '$' | '=' |'#' |'.' | '"' |'#'|'+'|"'"|'<'|'>')*;
terminal IDLABEL:
('a'..'z' | 'A'..'Z' | '_' | '0'..'9'|'='|'#')*;
For the input, Say Hello Boss
I am getting an error "missing EOF at Say"
and for the input Hello Boss
I am getting an error "mismatched input 'Boss' expecting RULE_IDLABEL"
What is wrong with this grammar?
Boss matches both the rule IDLABEL and UNQUOTED. In cases where two rules can match the current input and both rules match the same prefix, the tokenizer uses the rule that comes first. So the input Boss produces an UNQUOTED token, not an IDLABEL token.
In fact all valid IDLABELs are also valid UNQUOTEDs, so you'll never get any IDLABEL tokens.
To fix this, you can change the order of UNQUOTED and IDLABEL, so that IDLABEL comes first.

ANTLR lexer disabling tokens then reenabling them not working as expected

So i have a lexer with a token defined so that on a boolean property it is enabled/disabled
I create an input stream and parse a text. My token is called PHRASE_TEXT and should match anything falling within this pattern '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?
I tokenize "foo bar" and as expected I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text I get "foo , bar" so 2 tokens instead of one. This is also expected behavior.
The problem comes when setting the property to true again. I would expect the same text to tokenize to the whole 1 token "foo bar" but instead is tokenized to the 2 tokens from before. Is this a bug on my part? What am I doing wrong here? I tried using new instances of the tokenizer and reusing the same instance but it doesn't seem to work either way. Thanks in advance.
Edit : Part of my grammar follows below
grammar LuceneQueryParser;
#header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}
#lexer::members {
public boolean phrases = true;
}
#parser::members {
public boolean phraseQueries = true;
}
mainQ : LPAREN query RPAREN
| query
;
query : not ((AND|OR)? not)* ;
andClause : AND ;
orClause : OR ;
not : NOT? modifier? clause;
clause : qualified
| unqualified
;
unqualified : LBRACK range_in LBRACK
| LCURL range_out RCURL
| truncated
| {phraseQueries}? quoted
| LPAREN query RPAREN
| normal
;
truncated : TERM_TEXT_TRUNCATED;
range_in : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
qualified : TERM_TEXT COLON unqualified ;
normal : TERM_TEXT;
quoted : PHRASE_TEXT;
modifier : PLUS
| MINUS
;
PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;
TERM_TEXT : (TERM_CHAR|ESCAPE)+;
TERM_CHAR : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '*' | '|' | '&' | '?' );
ESCAPE : '\\' ~[];
The problem seems to be that after i set the phrases to false, and then to true again, no more tokens seem to be recognized as PHRASE_TEXT. I know that as a guideline i should define my grammars to be unambiguous but this is basically the way it has to end up looking : tokenizing a string with quotes in 2 different modes, depending on the situation.
I'm gonna have to update this with an answer a colleague of mine helpfully pointed out. The lexer generated class has a static DFA[] array shared between all instances of the class. Once the property was set to false instead of the default true the decision tree was apparently changed for all object instances. A fix for this was to have to separate DFA[] arrays for both the true and false instances of the property i was modifying. I think making that array not static would be too expensive and i really can't think about another fix.

ANTLR Parsing Literals and Quoted IDs

I'm working on an SQL grammar in ANTLR which allows quoted identifiers (table names, field names, etc), as well as quoted literal strings.
The problem is that this grammar seems to always match quoted inputs as "QUOTED_LITERAL", and never as IDs wrapped in quotes.
Here are my results:
input: 'blahblah' result: string_literal as expected.
input: field1 restul: column_name as expected
input: table.field1 result: column_spec as expected
input: 'table'.'field1' result: string_literal, MissingTokenException
Below is my simplified grammar for the expression portion of the SQL grammar, if anybody can help identify what is needed to match quoted rules other than the quoted literal, thanks.
grammar test;
expression
:
simpleExpression EOF!
;
simpleExpression
:
column_spec
| literal_value
;
column_spec
:
(table_name '.')? column_name
| ('\''table_name '\'''.')? '\'' column_name '\''
| ('\"'table_name '\"' '.')? '\"' column_name '\"'
;
string_literal: QUOTED_LITERAL ;
boolean_literal: 'TRUE' | 'FALSE' ;
literal_value :
(
string_literal
| boolean_literal
)
;
table_name :ID;
column_name :ID;
QUOTED_LITERAL:
( '\''
( ('\\' '\\') | ('\'' '\'') | ('\\' '\'') | ~('\'') )*
'\'' )
|
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
;
ID
:
( 'A'..'Z' | 'a'..'z' ) ( 'A'..'Z' | 'a'..'z' | '_' | '0'..'9'| '::' )*
;
WHITE_SPACE : ( ' '|'\r'|'\t'|'\n' ) {$channel=HIDDEN;} ;
In case anybody is interested, I removed a little bit of the flexibility from the quoted literal strings. Literal strings can only be quoted by single quotes, and identifiers can be optionally quoted by double quotes. As long as the literal quote and the identifier quote is well defined and they don't overlap, the grammar is trivial.
This policy makes the grammar much cleaner, and doesn't remove the ability to quote identifiers. I make use of the JDBC method getIdentifierQuote to report which quote can be used to wrap identifiers.
This is your classical shift/reduce conflict. (Except that ANTLR does not shift or reduce; since it is not a stack automaton.)
You have the following problem:
When you are in the simpleExpression state you need to decide what branch to take with one token lookahead. In the case of ANTLR, since no difference is done between lexer and parser the one token is a single character. (You should see a warning from ANTLR about the conflict.)
It gets even better, what is the difference between "Bob Dillan" and "table1"? From the parsers point of view, none. So how do you expect to make a difference between:
('\"'table_name '\"' '.')? '\"' column_name '\"'
and
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
I strongly suggest to rewrite the simpleExpression rule to:
simpleExpression:
IDENTIFIER |
IDENTIFIER . IDENTIFIER |
QUOTED_LITERAL |
QUOTED_LITERAL . QUOTED_LITERAL |
boolean_literal;
And then decide in the action code of simpleExpression what to do. Especially since I am quite sure that you can reference a table with a quoted name; never the less "users" and "Bod Dillan" are syntactically equal.
It also depends on the grater grammar, you may also be able to resolve the amiability on a higher level.
The antlr lexer is greedy, in that when there are two possible token matches, it will match the longest possible one.
When the lexer sees 'some_id', it can match the first quote as just a quote, or a quoted literal. The literal is longer, so that matches.
As a side note, you generally do not want lexer rules that can match nothing (like ID) or to uses string constants in the parser rules, but only reference token names.
What you want to do is something like this.
QUOTE: '\'';
ID: ('a'..'z' | 'A'..'Z')+; // Must have at least one character
QUOTED_LITERAL: QUOTE ( (ID QUOTE) => { $type=QUOTE; } ) | .* QUOTE;
id: ID | QUOTE ID QUOTE;
quoted_literal: QUOTED_LITERAL | QUOTE ID QUOTE;
If the lexer sees something that looks like a quoted id, it cannot tell which to use, so it breaks it up into smaller tokens. In your parser, you use id where you expect a possibly quoted ID, and quoted_literal where you expect a QUOTED_LITERAL.
The syntactical predicate in QUOTED_LITERAL prevents it from matching the full quote when the input is ambiguous.
Looking that this, it will fail to correctly parse lines like
'tag' text 'second'
as ' text ' will be parsed as a QUOTED_LITERAL. If that is a valid input, then you would need something like
fragment QUOTED_ID;
QUOTED_LITERAL: QUOTE ( ID {$type=QUOTED_ID} | .* ) QUOTE;
id: ID | QUOTED_ID;
quoted_literal: QUOTED_LITERAL | QUOTED_ID;
(My example does not cover all the cases in your input, but extending it should be obvious. You also probably need some actions to either generate the correct tokens in your AST or add/remove quotes from the text, depending one what you do after you parse.)

Characters Matching Multiple Lexer Rules in ANTLR

I've defined multiple lexer rules that potentially matches the same character sequence. For example:
LBRACE: '{' ;
RBRACE: '}' ;
LPARENT: '(' ;
RPARENT: ')' ;
LBRACKET: '[' ;
RBRACKET: ']' ;
SEMICOLON: ';' ;
ASTERISK: '*' ;
AMPERSAND: '&' ;
IGNORED_SYMBOLS: ('!' | '#' | '%' | '^' | '-' | '+' | '=' |
'\\'| '|' | ':' | '"' | '\''| '<' | '>' | ',' | '.' |'?' | '/' ) ;
// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' .* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' .* '\r'? '\n' {$channel=HIDDEN;};
STRING_LITERAL: '"' (STR_ESC | ~( '"' ))* '"';
fragment STR_ESC: '\\' '"' ;
CHAR_LITERAL : '\'' (CH_ESC | ~( '\'' )) '\'' ;
fragment CH_ESC : '\\' '\'';
My IGNORED_SYMBOLS and ASTERISK match /, " and * respectively. Since they're placed (unintentionally) before my comment and string literal rules which also match /* and ", I expect the comment and string literal rules would be disabled (unintentionally) . But surprisely, the ML_COMMENT, SL_COMMENT and STRING_LITERAL rules still work correctly.
This is somewhat confusing. Isn't that a /, whether it is part of /* or just a standalone /, will always be matched and consumed by the IGNORED_SYMBOLS first before it has any chance to be matched by the ML_COMMENT?
What is the way the lexer decides which rules to apply if the characters match more than one rule?
What is the way the lexer decides which rules to apply if the characters match more than one rule?
Lexer rules are matched from top to bottom. In case two (or more) rules match the same number of characters, the one that is defined first has precedence over the one(s) later defined in the grammar. In case a rule matches N number of characters and a later rule matches the same N characters plus 1 or more characters, then the later rule is matched (greedy match).
Take the following rules for example:
DO : 'do';
ID : 'a'..'z'+;
The input "do" would obviously be matched by the rule DO.
And input like: "done" would be greedily matched by ID. It is not tokenized as the 2 tokens: [DO:"do"] followed by [ID:"ne"].

Resources