ANTLR lexer disabling tokens then reenabling them not working as expected - parsing

So i have a lexer with a token defined so that on a boolean property it is enabled/disabled
I create an input stream and parse a text. My token is called PHRASE_TEXT and should match anything falling within this pattern '"' ('\\' ~[] |~('\"'|'\\')) '"' {phraseEnabled}?
I tokenize "foo bar" and as expected I get a single token. After setting the property to false on the lexer and calling setInputStream on it with the same text I get "foo , bar" so 2 tokens instead of one. This is also expected behavior.
The problem comes when setting the property to true again. I would expect the same text to tokenize to the whole 1 token "foo bar" but instead is tokenized to the 2 tokens from before. Is this a bug on my part? What am I doing wrong here? I tried using new instances of the tokenizer and reusing the same instance but it doesn't seem to work either way. Thanks in advance.
Edit : Part of my grammar follows below
grammar LuceneQueryParser;
#header{package com.amazon.platformsearch.solr.queryparser.psclassicqueryparser;}
#lexer::members {
public boolean phrases = true;
}
#parser::members {
public boolean phraseQueries = true;
}
mainQ : LPAREN query RPAREN
| query
;
query : not ((AND|OR)? not)* ;
andClause : AND ;
orClause : OR ;
not : NOT? modifier? clause;
clause : qualified
| unqualified
;
unqualified : LBRACK range_in LBRACK
| LCURL range_out RCURL
| truncated
| {phraseQueries}? quoted
| LPAREN query RPAREN
| normal
;
truncated : TERM_TEXT_TRUNCATED;
range_in : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
range_out : (TERM_TEXT|STAR) TO (TERM_TEXT|STAR);
qualified : TERM_TEXT COLON unqualified ;
normal : TERM_TEXT;
quoted : PHRASE_TEXT;
modifier : PLUS
| MINUS
;
PHRASE_TEXT : '"' (ESCAPE|~('\"'|'\\'))+ '"' {phrases}?;
TERM_TEXT : (TERM_CHAR|ESCAPE)+;
TERM_CHAR : ~(' ' | '\t' | '\n' | '\r' | '\u3000'
| '\\' | '\'' | '(' | ')' | '[' | ']' | '{' | '}'
| '+' | '-' | '!' | ':' | '~' | '^'
| '*' | '|' | '&' | '?' );
ESCAPE : '\\' ~[];
The problem seems to be that after i set the phrases to false, and then to true again, no more tokens seem to be recognized as PHRASE_TEXT. I know that as a guideline i should define my grammars to be unambiguous but this is basically the way it has to end up looking : tokenizing a string with quotes in 2 different modes, depending on the situation.

I'm gonna have to update this with an answer a colleague of mine helpfully pointed out. The lexer generated class has a static DFA[] array shared between all instances of the class. Once the property was set to false instead of the default true the decision tree was apparently changed for all object instances. A fix for this was to have to separate DFA[] arrays for both the true and false instances of the property i was modifying. I think making that array not static would be too expensive and i really can't think about another fix.

Related

antlr error "no viable alternative at input"

I'm following the example given here-
https://datapsyche.wordpress.com/2014/10/23/back-to-learning-grammar-with-antlr/
which basically has following grammar-
grammar Simpleql;
statement : expr command* ;
expr : expr ('AND' | 'OR' | 'NOT') expr # expopexp
| expr expr # expexp
| predicate # predicexpr
| text # textexpr
| '(' expr ')' # exprgroup
;
predicate : text ('=' | '!=' | '>=' | '<=' | '>' | '<') text ;
command : '| show' text* # showcmd
| '| show' text (',' text)* # showcsv
;
text : NUMBER # numbertxt
| QTEXT # quotedtxt
| UQTEXT # unquotedtxt
;
AND : 'AND' ;
OR : 'OR' ;
NOT : 'NOT' ;
EQUALS : '=' ;
NOTEQUALS : '!=' ;
GREQUALS : '>=' ;
LSEQUALS : '<=' ;
GREATERTHAN : '>' ;
LESSTHAN : '<' ;
NUMBER : DIGIT+
| DIGIT+ '.' DIGIT+
| '.' DIGIT+
;
QTEXT : '"' (ESC|.)*? '"' ;
UQTEXT : ~[ ()=,<>!\r\n]+ ;
fragment
DIGIT : [0-9] ;
fragment
ESC : '\\"' | '\\\\' ;
WS : [ \t\r\n]+ -> skip ;
When I pass input like this-
Abishek AND (country=India OR city=NY) LOGIN 404 | show name city
I get error- line 1:65 no viable alternative at input '<EOF>'
I went through a couple of SO posts related to the error but can't seem to be able to figure out what is wrong with the grammar.
I tried running your example but was thrown a number of errors in antlrworks 2. However i was able to run it without any errors in the test rig getting the following output:
(statement (expr (expr (expr (text Abishek)) AND (expr ( (expr (expr (predicate (text country) = (text India))) OR (expr (predicate (text city) = (text NY)))) ))) (expr (expr (text LOGIN)) (expr (text 404)))) (command | show (text name) (text city)))
And the same output of the tree shown on the website.
My opinion on what's wrong may be your actual input, iv had problems in the past with ANTLR reading text from a file if the file was not encoded to be ascii/ansi/utf-8 or whatever works for the os you are using. I encountered this when i saved a file on linux from a linux text editor and tried to run it on windows with the same generated parser. So my recommendation is try re-saving your text input - 'Abishek AND (country=India OR city=NY) LOGIN 404 | show name city' and make sure the encoding is different each time incase this is the cause.
Note you can also specify the encoding like this or similar ways :
CharStream charStream = new ANTLRInputStream(inputStream, "UTF-8");
Since having an encoding error will cause it to try and parse irrelevant of encoding and result in no matches being found.
Let me know if it works after saving encoded in a few different ways and i'll try and help further. Hope this helps.

ANTLR grammar for multi-level text segmentation

I want to create a grammar that will parse a text file and create a tree of levels according to configurable "segmentors". This is what I have created so far, it kind of works, but will halt when a "segmentor" appears in the beginning of a text. For example, text "and location" will fail to parse. Any ideas?
Also, I'm pretty certain that the grammar could be greatly improved, so any suggestions are welcome.
grammar DocSegmentor;
#header {
package segmentor.antlr;
}
// PARSER RULES
levelOne: (levelTwo LEVEL1_SEG*)+ ;
levelTwo: (levelThree+ LEVEL2_SEG?)+ ;
levelThree: (levelFour+ LEVEL3_SEG?)+ ;
levelFour: (levelFive+ LEVEL4_SEG?)+ ;
levelFive: tokens;
tokens: (DELIM | PAREN | TEXT | WS)+ ;
// LEXER RULES
LEVEL1_SEG : '\r'? '\n'| EOF ;
LEVEL2_SEG : '.' ;
LEVEL3_SEG : ',' ;
LEVEL4_SEG : 'and' | 'or' ;
DELIM : '`' | '"' | ';' | '/' | ':' | '’' | '‘' | '=' | '?' | '-' | '_';
PAREN : '(' | ')' | '[' | ']' | '{' | '}' ;
TEXT : (('a'..'z') | ('A'..'Z') | ('0'..'9'))+ ;
WS : [ \t]+ ;
I'd definitely go with a Scala parser combinator library.
https://lihaoyi.github.io/fastparse/
https://github.com/scala/scala-parser-combinators
Those are just two examples for a library you can write by hand with little effort and tune to whatever you need. I should mention that you should go with Scalaz (https://github.com/scalaz/scalaz) if you're writing a parser monad on your own.
I wouldn't use a parser at all for that task. All you need is keyword spotting.
It's much easier and more flexibel if you just scan your text for the "segmentators" by walking over the input. This also allows to handle text of any size (e.g. by using memory mapped files) while parsers usually (ANTLR for sure) load the entire text into memory and tokenize it fully, before it comes to parsing.

Xtext : prevent for matching keywords from another rule

I'm looking for a way to prevent KEYWORDS matching at a place where those KEYWORDS are not expected.
Take a look at the following grammar. Both 'APPLY' and 'OUTPUT' are keywords.
'OUTPUT' has an argument that contains any characters.
Everything works fine but if this argument contains the word APPLY, an error is raised (extraneous input APPLY expecting RULE_END).
Is there a way to solve this issue?
Thanks.
Sample text
APPLY, 'an id' $
OUTPUT, A text $
OUTPUT, A text with the word APPLY $
DSL
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
Model:
statement+=Statement*;
Statement:
ApplyStatement | OutputStatement;
OutputStatement:
'OUTPUT' ',' out+=EXTENDLABEL* end=END;
ApplyStatement:
'APPLY' ',' id=LABELIDENTIFIER end=END;
terminal fragment LETTER:
'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T'
| 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' |
'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z';
terminal LABELIDENTIFIER:
"'"->"'";
terminal EXTENDLABEL:
(LETTER) (LETTER)*;
terminal END:
'$' !('\n' | '\r')*;
I see a few different ways your issue can be handled. First of all, you could escape the keywords appearing, e.g. the Xbase language uses the '^' character as an escape character; if for any reason there is a problem with writing a keyword, you can prefix it with '^', and it would work. Similarly, if you would put your string inside specific symbols, e.g. apostrophes, it would help a lot. Of course, these solutions require to change your language itself, which you may or may not do.
You might also replace your EXTENDLABEL terminal with a datatype rule. This allows greater flexibility with regards to conflict resolution; worst case you could add the language keywords as options. I was suggested this route by a tangentially related case in the Eclipse forums.
an other solution is to change the ID of your token before that your parser used it. Token are provided by the lexer and your parser will take these tokens in input to produce your AST. So the idea is to change the tokens before to pass them to your parser.
To do it you need to declare your own parser:
#Override
public Class<? extends IParser> bindIParser() {
return ModelParser.class;
}
Note : your parser will extends the generated parser of your grammar.
Then you need to override the following method to introduce your own TokenSource:
override protected XtextTokenStream createTokenStream(TokenSource tokenSource) {
return new TokenSource(tokenSource, getTokenDefProvider());
}
You own token source need to extend 'XtextTokenStream'.
After you need to override the method 'LT' as following :
override LT(int k) {
var Token token = super.LT(k)
if(token != null && token.text != null) token.tokenOverride(k);
token
}
Then you just need to change the ID :
def void tokenOverride(Token token, int index){
switch (token.text){
case "APPLY" : {
overrideType(t_parameter, InternalModelParser.RULE_ID);
}
}
}
def void overrideType(Token token, int i) {
token.type = i
}
Note : don't forget to add your condition before to change the ID of your token, in this example all token 'APPLY' will become an ID.
And of course inside the switch you can use the ID of the token 'APPLY' instead the text of your token.

ANTLR v4 different tokens for the same symbols

How can I recognize different tokens for the same symbol in ANTLR v4? For example, in selected = $("library[title='compiler'] isbn"); the first = is an assignment, whereas the second = is an operator.
Here are the relevant lexer rules:
EQUALS
:
'='
;
OP
:
'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;
And here is the parser rule for that line:
assign
:
ID EQUALS DOLLAR OPEN_PARENTHESIS QUOTES ID selector ID QUOTES
CLOSE_PARENTHESIS SEMICOLON
;
selector
:
OPEN_BRACKET ID OP APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;
This correctly parses the line, as long as I use an OP different than =.
Here is the error log:
JjQueryParser::init:34:29: mismatched input '=' expecting OP
JjQueryParser::init:34:39: mismatched input ''' expecting '\"'
JjQueryParser::init:34:46: mismatched input '"' expecting '='
The problem cannot be solved in the lexer, since the lexer does always return one token type for the same string. But it would be quite easy to resolve it in the parser. Just rewrite the rules lower case:
equals
: '='
;
op
:'|='
| '*='
| '~='
| '$='
| '='
| '!='
| '^='
;
I had the same issue. Resolved in the lexer as follows:
EQUALS: '=';
OP : '|' EQUALS
| '*' EQUALS
| '~' EQUALS
| '$' EQUALS
| '!' EQUALS
| '^' EQUALS
;
This guarantees that the symbol '=' is represented by a single token all the way. Don't forget to update the relevant rule as follows:
selector
:
OPEN_BRACKET ID (OP|EQUALS) APOSTROPHE ID APOSTROPHE CLOSE_BRACKET
;

ANTLR Parsing Literals and Quoted IDs

I'm working on an SQL grammar in ANTLR which allows quoted identifiers (table names, field names, etc), as well as quoted literal strings.
The problem is that this grammar seems to always match quoted inputs as "QUOTED_LITERAL", and never as IDs wrapped in quotes.
Here are my results:
input: 'blahblah' result: string_literal as expected.
input: field1 restul: column_name as expected
input: table.field1 result: column_spec as expected
input: 'table'.'field1' result: string_literal, MissingTokenException
Below is my simplified grammar for the expression portion of the SQL grammar, if anybody can help identify what is needed to match quoted rules other than the quoted literal, thanks.
grammar test;
expression
:
simpleExpression EOF!
;
simpleExpression
:
column_spec
| literal_value
;
column_spec
:
(table_name '.')? column_name
| ('\''table_name '\'''.')? '\'' column_name '\''
| ('\"'table_name '\"' '.')? '\"' column_name '\"'
;
string_literal: QUOTED_LITERAL ;
boolean_literal: 'TRUE' | 'FALSE' ;
literal_value :
(
string_literal
| boolean_literal
)
;
table_name :ID;
column_name :ID;
QUOTED_LITERAL:
( '\''
( ('\\' '\\') | ('\'' '\'') | ('\\' '\'') | ~('\'') )*
'\'' )
|
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
;
ID
:
( 'A'..'Z' | 'a'..'z' ) ( 'A'..'Z' | 'a'..'z' | '_' | '0'..'9'| '::' )*
;
WHITE_SPACE : ( ' '|'\r'|'\t'|'\n' ) {$channel=HIDDEN;} ;
In case anybody is interested, I removed a little bit of the flexibility from the quoted literal strings. Literal strings can only be quoted by single quotes, and identifiers can be optionally quoted by double quotes. As long as the literal quote and the identifier quote is well defined and they don't overlap, the grammar is trivial.
This policy makes the grammar much cleaner, and doesn't remove the ability to quote identifiers. I make use of the JDBC method getIdentifierQuote to report which quote can be used to wrap identifiers.
This is your classical shift/reduce conflict. (Except that ANTLR does not shift or reduce; since it is not a stack automaton.)
You have the following problem:
When you are in the simpleExpression state you need to decide what branch to take with one token lookahead. In the case of ANTLR, since no difference is done between lexer and parser the one token is a single character. (You should see a warning from ANTLR about the conflict.)
It gets even better, what is the difference between "Bob Dillan" and "table1"? From the parsers point of view, none. So how do you expect to make a difference between:
('\"'table_name '\"' '.')? '\"' column_name '\"'
and
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
I strongly suggest to rewrite the simpleExpression rule to:
simpleExpression:
IDENTIFIER |
IDENTIFIER . IDENTIFIER |
QUOTED_LITERAL |
QUOTED_LITERAL . QUOTED_LITERAL |
boolean_literal;
And then decide in the action code of simpleExpression what to do. Especially since I am quite sure that you can reference a table with a quoted name; never the less "users" and "Bod Dillan" are syntactically equal.
It also depends on the grater grammar, you may also be able to resolve the amiability on a higher level.
The antlr lexer is greedy, in that when there are two possible token matches, it will match the longest possible one.
When the lexer sees 'some_id', it can match the first quote as just a quote, or a quoted literal. The literal is longer, so that matches.
As a side note, you generally do not want lexer rules that can match nothing (like ID) or to uses string constants in the parser rules, but only reference token names.
What you want to do is something like this.
QUOTE: '\'';
ID: ('a'..'z' | 'A'..'Z')+; // Must have at least one character
QUOTED_LITERAL: QUOTE ( (ID QUOTE) => { $type=QUOTE; } ) | .* QUOTE;
id: ID | QUOTE ID QUOTE;
quoted_literal: QUOTED_LITERAL | QUOTE ID QUOTE;
If the lexer sees something that looks like a quoted id, it cannot tell which to use, so it breaks it up into smaller tokens. In your parser, you use id where you expect a possibly quoted ID, and quoted_literal where you expect a QUOTED_LITERAL.
The syntactical predicate in QUOTED_LITERAL prevents it from matching the full quote when the input is ambiguous.
Looking that this, it will fail to correctly parse lines like
'tag' text 'second'
as ' text ' will be parsed as a QUOTED_LITERAL. If that is a valid input, then you would need something like
fragment QUOTED_ID;
QUOTED_LITERAL: QUOTE ( ID {$type=QUOTED_ID} | .* ) QUOTE;
id: ID | QUOTED_ID;
quoted_literal: QUOTED_LITERAL | QUOTED_ID;
(My example does not cover all the cases in your input, but extending it should be obvious. You also probably need some actions to either generate the correct tokens in your AST or add/remove quotes from the text, depending one what you do after you parse.)

Resources