ANTLR Lexer matching the wrong rule - parsing

I'm working on a lexer and parser for an old object-oriented chat system (MOO, in case any readers are familiar with its language). Within this language, all of the examples below are valid floating point numbers:
2.3
3.
.2
3e+5
The language also implements an indexing syntax for extracting one or more characters from a string, or elements from a list (a list being a set of comma-separated expressions enclosed in curly braces). The problem arises from the fact that the language supports a range operator inside the index brackets. For example:
a = foo[1..3];
I understand that ANTLR wants to match the longest possible input first. Unfortunately this results in the lexer seeing '1..3' as two floating point numbers (1. and .3) rather than two integers with a range operator ('..') between them. Is there any way to solve this short of using lexer modes? Given that the values inside an indexing expression can be any valid expression, I would have to duplicate a lot of token rules (essentially everything except the floating point numbers, as I understand it). Granted, I'm new to ANTLR, so I'm sure I'm missing something; any help is much appreciated. My lexer grammar is below:
lexer grammar MooLexer;
channels { COMMENTS_CHANNEL }
SINGLE_LINE_COMMENT
: '//' INPUT_CHARACTER* -> channel(COMMENTS_CHANNEL);
DELIMITED_COMMENT
: '/*' .*? '*/' -> channel(COMMENTS_CHANNEL);
WS
: [ \t\r\n] -> channel(HIDDEN)
;
IF
: I F
;
ELSE
: E L S E
;
ELSEIF
: E L S E I F
;
ENDIF
: E N D I F
;
FOR
: F O R;
ENDFOR
: E N D F O R;
WHILE
: W H I L E
;
ENDWHILE
: E N D W H I L E
;
FORK
: F O R K
;
ENDFORK
: E N D F O R K
;
RETURN
: R E T U R N
;
BREAK
: B R E A K
;
CONTINUE
: C O N T I N U E
;
TRY
: T R Y
;
EXCEPT
: E X C E P T
;
ENDTRY
: E N D T R Y
;
IN
: I N
;
SPLICER
: '#';
UNDERSCORE
: '_';
DOLLAR
: '$';
SEMI
: ';';
COLON
: ':';
DOT
: '.';
COMMA
: ',';
BANG
: '!';
OPEN_QUOTE
: '`';
SINGLE_QUOTE
: '\'';
LEFT_BRACKET
: '[';
RIGHT_BRACKET
: ']';
LEFT_CURLY_BRACE
: '{';
RIGHT_CURLY_BRACE
: '}';
LEFT_PARENTHESIS
: '(';
RIGHT_PARENTHESIS
: ')';
PLUS
: '+';
MINUS
: '-';
STAR
: '*';
DIV
: '/';
PERCENT
: '%';
PIPE
: '|';
CARET
: '^';
ASSIGNMENT
: '=';
QMARK
: '?';
OP_AND
: '&&';
OP_OR
: '||';
OP_EQUALS
: '==';
OP_NOT_EQUAL
: '!=';
OP_LESS_THAN
: '<';
OP_GREATER_THAN
: '>';
OP_LESS_THAN_OR_EQUAL_TO
: '<=';
OP_GREATER_THAN_OR_EQUAL_TO
: '>=';
RANGE
: '..';
ERROR
: 'E_NONE'
| 'E_TYPE'
| 'E_DIV'
| 'E_PERM'
| 'E_PROPNF'
| 'E_VERBNF'
| 'E_VARNF'
| 'E_INVIND'
| 'E_RECMOVE'
| 'E_MAXREC'
| 'E_RANGE'
| 'E_ARGS'
| 'E_NACC'
| 'E_INVARG'
| 'E_QUOTA'
| 'E_FLOAT'
;
OBJECT
: '#' DIGIT+
| '#-' DIGIT+
;
STRING
: '"' ( ESC | [ !] | [#-[] | [\]-~] | [\t] )* '"';
INTEGER
: DIGIT+;
FLOAT
: DIGIT+ [.] (DIGIT*)? (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| [.] DIGIT+ (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| DIGIT+ EXPONENTNOTATION EXPONENTSIGN DIGIT+
;
IDENTIFIER
: (LETTER | DIGIT | UNDERSCORE)+
;
LETTER
: LOWERCASE
| UPPERCASE
;
/*
* fragments
*/
fragment LOWERCASE
: [a-z] ;
fragment UPPERCASE
: [A-Z] ;
fragment EXPONENTNOTATION
: ('E' | 'e');
fragment EXPONENTSIGN
: ('-' | '+');
fragment DIGIT
: [0-9] ;
fragment ESC
: '\\"' | '\\\\' ;
fragment INPUT_CHARACTER
: ~[\r\n\u0085\u2028\u2029];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];

No, AFAIK there is no way to solve this with lexer modes. You'll need a predicate with a bit of target-specific code. If Java is your target, that might look like this:
lexer grammar RangeTestLexer;
FLOAT
: [0-9]+ '.' [0-9]+
| [0-9]+ '.' {_input.LA(1) != '.'}?
| '.' [0-9]+
;
INTEGER
: [0-9]+
;
RANGE
: '..'
;
SPACES
: [ \t\r\n] -> skip
;
If you run the following Java code:
Lexer lexer = new RangeTestLexer(CharStreams.fromString("1 .2 3. 4.5 6..7 8 .. 9"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-20s `%s`\n", RangeTestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
you'll get the following output:
INTEGER `1`
FLOAT `.2`
FLOAT `3.`
FLOAT `4.5`
INTEGER `6`
RANGE `..`
INTEGER `7`
INTEGER `8`
RANGE `..`
INTEGER `9`
EOF `<EOF>`
The { ... }? is the predicate, and the embedded code must evaluate to a boolean. In my example, the Java expression _input.LA(1) != '.' returns true if the character one step ahead of the current position in the input stream is not a '.' character.
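Carrying the same idea over to the MooLexer grammar in the question, the FLOAT rule could be restructured so that only the bare trailing-dot alternative carries the guard (a sketch only; it reuses the question's fragments, keeps the mandatory exponent sign, assumes a Java target, and is untested):
FLOAT
 : DIGIT+ '.' DIGIT+ (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?   // 2.3, 2.3e+5
 | DIGIT+ '.' EXPONENTNOTATION EXPONENTSIGN DIGIT+             // 3.e+5
 | DIGIT+ '.' {_input.LA(1) != '.'}?                           // 3.  (but not the '1.' in 1..3)
 | '.' DIGIT+ (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?          // .2, .2e+5
 | DIGIT+ EXPONENTNOTATION EXPONENTSIGN DIGIT+                 // 3e+5
 ;
With that in place, foo[1..3] should tokenise as IDENTIFIER LEFT_BRACKET INTEGER RANGE INTEGER RIGHT_BRACKET, because the only alternative that could swallow '1.' is rejected when the next character is another '.'.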

Related

Match any printable letter-like characters in ANTLR4 with Go as target

This is freaking me out; I just can't find a solution to it. I have a grammar for search queries and would like to match any searchterm in a query composed of printable characters, except for the special characters "(" and ")". Strings enclosed in quotes are handled separately and work.
Here is a somewhat working grammar:
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start
: searchclause EOF
;
searchclause
: table expr
;
expr
: fieldsearch
| searchop fieldsearch
| unop expr
| expr relop expr
| lparen expr relop expr rparen
;
lparen
: '('
;
rparen
: ')'
;
unop
: NOT
;
relop
: AND
| OR
;
searchop
: NO
| EVERY
;
fieldsearch
: field EQ searchterm
;
field
: ID
;
table
: ID
;
searchterm
:
| STRING
| ID+
| DIGIT+
| DIGIT+ ID+
;
STRING
: '"' ~('\n'|'"')* ('"' )
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '='
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: ('0' .. '9')
;
/*
NOT_SPECIAL
: ~(' ' | '\t' | '\n' | '\r' | '\'' | '"' | ';' | '.' | '=' | '(' | ')' )
; */
WS
: [ \r\n\t] + -> skip
;
The problem is that searchterm is too restrictive. It should match any character allowed by the commented-out NOT_SPECIAL rule, i.e. valid queries would be:
Person Name=%
Person Address=^%Street%%%$^&*#^
But whenever I try to work NOT_SPECIAL into the definition of searchterm, it doesn't work. I have also tried inlining it into the rule (with NOT_SPECIAL commented out) and many other things, but nothing works. In most of my attempts the parser just complained about extraneous input after "=" and said it was expecting EOF. I also cannot put EOF into NOT_SPECIAL.
Is there any way I can simply parse every text after "=" in rule fieldsearch until there is a whitespace or ")", "("?
N.B. The STRING rule works fine, but the user ought not to be required to use quotes every time: this is a command-line tool, and the quotes would need to be escaped.
Target language is Go.
You could solve this by introducing a lexical mode that you enter whenever you match an EQ token. Once in that mode, you either match a (, a ) or a whitespace character (in which case you pop out of the mode), or you keep matching NOT_SPECIAL characters.
When using lexical modes, you must define your lexer and parser rules in separate files. Be sure to use lexer grammar ... and parser grammar ... instead of the grammar ... you would use in a combined .g4 file.
A quick demo:
lexer grammar MdbLexer;
STRING
: '"' ~[\r\n"]* '"'
;
OPAR
: '('
;
CPAR
: ')'
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '=' -> pushMode(NOT_SPECIAL_MODE)
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: [0-9]
;
WS
: [ \r\n\t]+ -> skip
;
fragment VALID_ID_START
: [a-zA-Z_]
;
fragment VALID_ID_CHAR
: [a-zA-Z_0-9]
;
mode NOT_SPECIAL_MODE;
OPAR2
: '(' -> type(OPAR), popMode
;
CPAR2
: ')' -> type(CPAR), popMode
;
WS2
: [ \t\r\n] -> skip, popMode
;
NOT_SPECIAL
: ~[ \t\r\n()]+
;
Your parser grammar would start like this:
parser grammar MdbParser;
options {
tokenVocab=MdbLexer;
}
start
: searchclause EOF
;
// your other parser rules
My Go is a bit rusty, but a small Java test:
String source = "Person Address=^%Street%%%$^&*#^()";
MdbLexer lexer = new MdbLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-15s %s\n", MdbLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
prints the following:
ID Person
ID Address
EQ =
NOT_SPECIAL ^%Street%%%$^&*#^
OPAR (
CPAR )
EOF <EOF>

ANTLR parse key value list

I'm new to ANTLR and I am trying to parse something like
ref:something title:(something else) blah ref other
and to obtain a list like
KEY = ref VALUE = something
KEY = title VALUE = something else
KEY = null VALUE = blah
KEY = null VALUE = ref // same ref string as item 1 key
KEY = null VALUE = other
The grammar I have is
searchCriteriaList
locals[List<object> s = new List<object>()]
: t+=criteriaBean (WS t+=criteriaBean)* { $s.addAll($t); }
;
criteriaBean : (KEY ':' WS* expression)
| expression ;
expression : '(' WORD (WS WORD)* ')'
| WORD ;
/*
* Lexer Rules
*/
fragment A : ('A'|'a') ;
fragment B : ('B'|'b') ;
fragment C : ('C'|'c') ;
fragment D : ('D'|'d') ;
fragment E : ('E'|'e') ;
fragment F : ('F'|'f') ;
fragment G : ('G'|'g') ;
fragment H : ('H'|'h') ;
fragment I : ('I'|'i') ;
fragment J : ('J'|'j') ;
fragment K : ('K'|'k') ;
fragment L : ('L'|'l') ;
fragment M : ('M'|'m') ;
fragment N : ('N'|'n') ;
fragment O : ('O'|'o') ;
fragment P : ('P'|'p') ;
fragment Q : ('Q'|'q') ;
fragment R : ('R'|'r') ;
fragment S : ('S'|'s') ;
fragment T : ('T'|'t') ;
fragment U : ('U'|'u') ;
fragment V : ('V'|'v') ;
fragment W : ('W'|'w') ;
fragment X : ('X'|'x') ;
fragment Y : ('Y'|'y') ;
fragment Z : ('Z'|'z') ;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
TITLE : T I T L E ;
MESSAGE : M E S S A G E ;
REF : R E F ;
KEY : TITLE | MESSAGE | REF ;
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
WS : [ \t\u000C\r\n] ;
When I try parsing the string I get 2 exceptions, and in the addAll method I end up with 3 elements rather than 5.
Can someone point me in the right direction? What am I doing wrong?
Thanks,
S
PS: The exception I am getting is:
Exception of type 'Antlr4.Runtime.InputMismatchException' was thrown.
InputStream: {ref:something title:(something else) blah ref other }
OffendingToken: {[#0,0:2='ref',<5>,1:0]}
The lexer tries to match as many characters as possible when constructing tokens. When two or more lexer rules match the same characters, the rule defined first "wins". With this in mind, a KEY token will never be created, since TITLE, MESSAGE and REF are defined above it:
TITLE : T I T L E ;
MESSAGE : M E S S A G E ;
REF : R E F ;
KEY : TITLE | MESSAGE | REF ;
WORD : (LOWERCASE | UPPERCASE | '_')+ ;
So the input ref will always become a REF token, never a KEY or a WORD. What you need to do instead is turn KEY into a parser rule.
Also, since you want a WORD to also match your keywords, you should not do this:
expression
: '(' WORD (WS WORD)* ')'
| WORD
;
but something like this instead:
expression
: '(' word (WS word)* ')'
| word
;
word
: key
| WORD
;
key
: TITLE
| MESSAGE
| REF
;
Oh, and this:
fragment Z : ('Z'|'z') ;
can be rewritten as:
fragment Z : [Zz] ;
And is there a particular reason you're littering your parser rules with WS tokens? You could just remove them during tokenisation:
WS : [ \t\u000C\r\n] -> skip;
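Putting those suggestions together, the parser rules from the question might end up looking something like this (a sketch only; it assumes WS is skipped as shown above, drops the locals/addAll action, and turns KEY into the key parser rule described earlier):
searchCriteriaList
 : criteriaBean+
 ;
criteriaBean
 : key ':' expression
 | expression
 ;
expression
 : '(' word+ ')'
 | word
 ;
word
 : key
 | WORD
 ;
key
 : TITLE
 | MESSAGE
 | REF
 ;
With this shape, ref:something is a criteriaBean with a key, while a bare ref falls through to expression and is treated as an ordinary word, which corresponds to the "KEY = null" rows in the desired output.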

How to make certain rules mandatory in Antlr

I wrote the following grammar, which should validate a conditional expression.
The examples below show what I want to achieve with it:
test -> invalid
test = 1 -> valid
test = 1 and another_test>=0.2 -> valid
test = 1 kasd y = 1 -> invalid (two conditions MUST be separated by AND/OR)
a = 1 or (b=1 and c) -> invalid (a lone identifier like 'c' is not allowed; a condition should always be a triplet, i.e. literal operator literal)
grammar expression;
expr
: literal_value
| expr ( '='|'<>'| '<' | '<=' | '>' | '>=' ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
| '(' expr ')'
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
| IDENTIFIER
;
keyword
: K_AND
| K_OR
;
name
: any_name
;
function_name
: any_name
;
database_name
: any_name
;
table_name
: any_name
;
column_name
: any_name
;
any_name
: IDENTIFIER
| keyword
| STRING_LITERAL
| '(' any_name ')'
;
K_AND : A N D;
K_OR : O R;
IDENTIFIER
: '"' (~'"' | '""')* '"'
| '`' (~'`' | '``')* '`'
| '[' ~']'* ']'
| [a-zA-Z_] [a-zA-Z_0-9]*
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( ~'\'' | '\'\'' )* '\''
;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
WS: [ \n\t\r]+ -> skip;
So my question is: how can I get the grammar to work for the examples mentioned above? Can certain words be made mandatory between two triplets (literal operator literal)? Essentially I'm trying to get a parser that validates a WHERE-clause condition, where only simple conditions and functions are permitted. I also want a visitor that retrieves values such as functions, parentheses and literals in Java; how can I achieve that?
Yes and no.
You can change your grammar to only allow expressions that are comparisons and logical operations on the same:
expr
: term ( '='|'<>'| '<' | '<=' | '>' | '>=' ) term
| expr K_AND expr
| expr K_OR expr
| '(' expr ')'
;
term
: literal_value
| function_name '(' ( expr ( ',' expr )* | '*' )? ')'
;
The issue comes if you want to allow boolean variables or functions: you would need to classify the functions/variables in your lexer and have a different terminal for each, which is tricky and error-prone.
Instead, it is generally better NOT to do this kind of checking in the parser: have your parser be permissive and accept anything expression-like, and generate an expression tree for it. Then have a separate pass over the tree (called a type checker) that checks the types of the operands of each operation and of the arguments to functions.
This latter approach (with a separate type checker) generally ends up being much simpler, clearer, more flexible, and gives better error messages (rather than just 'syntax error').
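For the visitor part of the question, that separate checking pass could be built on the visitor ANTLR generates for the grammar above (a minimal, untested sketch, assuming a Java target with the -visitor option; the class names expressionParser and expressionBaseVisitor are the ones ANTLR derives from "grammar expression", and the logic is deliberately simplified, e.g. it does not inspect the operands of a comparison any further):
import java.util.ArrayList;
import java.util.List;

public class ConditionChecker extends expressionBaseVisitor<Boolean> {

    // Complaints about operands that are plain values where a condition is required.
    final List<String> errors = new ArrayList<>();

    // Returns true when the subtree is a condition (a comparison, or AND/OR of conditions),
    // false when it is a plain value (literal, identifier or function call).
    @Override
    public Boolean visitExpr(expressionParser.ExprContext ctx) {
        if (ctx.K_AND() != null || ctx.K_OR() != null) {
            requireCondition(ctx.expr(0));
            requireCondition(ctx.expr(1));
            return true;
        }
        if (ctx.function_name() == null && ctx.expr().size() == 2) {
            return true;                   // expr ('='|'<>'|'<'|'<='|'>'|'>=') expr
        }
        if (ctx.function_name() == null && ctx.expr().size() == 1) {
            return visit(ctx.expr(0));     // '(' expr ')'
        }
        return false;                      // literal_value or function call
    }

    private void requireCondition(expressionParser.ExprContext operand) {
        if (!visit(operand)) {
            errors.add("not a condition: " + operand.getText());
        }
    }
}
After parsing, visit the root expr context and accept the input only if the visit returns true and errors stays empty. An input like a = 1 or (b=1 and c) should then be rejected because the lone 'c' is flagged, and the same pass is a natural place to collect the functions, parentheses and literals the question asks about.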

ANTLR position specific Tokens

We want an ANTLR grammar where a "0" in the first position of a line is treated differently from a "0" in any other position.
Sample Input
0 Test
World 0
Is there any way to identify a position- or context-specific token?
Currently we are writing code to change the token type.
Grammar
// Define a grammar called Hello
grammar Hello;
r : (r2 | r1)+ ;
r2 : 'World' Num ;
r1 : keyZero 'Test';
keyZero : {_input.LT(1).getCharPositionInLine() == 0}? Num ;
Num : [0-9]+ ;
Ws : [ ] -> skip;
Eol : [\r\n]->skip;
It could be handled in the lexer:
grammar Hello;
r : (r2 | r1)+ EOF;
r2 : 'World' NUM ;
r1 : KEY_ZERO 'Test';
KEY_ZERO : {getCharPositionInLine() == 0}? [0-9]+;
NUM : [0-9]+;
SPACE : [ \r\n] -> skip;
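To see the distinction, a small harness in the same style as the earlier examples could be used (a sketch; HelloLexer is the lexer generated from "grammar Hello" with a Java target, and I haven't run it):
HelloLexer lexer = new HelloLexer(CharStreams.fromString("0 Test\nWorld 0"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
    // getDisplayName falls back to the literal text for the implicit 'Test'/'World' tokens
    System.out.printf("%-10s %s\n", HelloLexer.VOCABULARY.getDisplayName(t.getType()), t.getText());
}
The leading 0 should come out as KEY_ZERO (its predicate sees char position 0), while the 0 after World should fall through to NUM.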

Alternatives in ANTLR rewrite rule

I'm writing a grammar that supports arbitrary boolean expressions. The grammar is used to represent a program, which is later passed through a static analysis tool. The static analysis tool has certain limitations, so I want to apply the following rewrite rules:
Strict inequalities are approximated with epsilon:
expression_a > expression_b -> expression_a >= expression_b + EPSILON
Inequality is approximated using an "or" statement:
expression_a != expression_b -> expression_a > expression_b || expression_a < expression_b
Is there any easy way to do it using ANTLR? Currently my grammar looks like so:
comparison : expression ('=='^|'<='^|'>='^|'!='^|'>'^|'<'^) expression;
I'm not sure how to apply a different rewrite rule depending on what the operator is. I want the tree to stay as it is if the operator is "==", "<=" or ">=", and to recursively transform it otherwise, according to the rules defined above.
[...] and to recursively transform it otherwise, [...]
You can do it partly.
You can't tell ANTLR to rewrite a > b to ^('>=' a ^('+' b epsilon)) and then define a != b to become ^('||' ^('>' a b) ^('<' a b)) and then have ANTLR automatically rewrite both ^('>' a b) and ^('<' a b) to ^('>=' a ^('+' b epsilon)) and ^('<=' a ^('-' b epsilon)) respectively.
A bit of manual work is needed here. The catch is that you can't just use a token like >= in a rewrite if that token isn't actually matched in the rule; the solution is to use imaginary tokens.
A quick demo:
grammar T;
options {
output=AST;
}
tokens {
AND;
OR;
GTEQ;
LTEQ;
SUB;
ADD;
EPSILON;
}
parse
: expr
;
expr
: logical_expr
;
logical_expr
: comp_expr ((And | Or)^ comp_expr)*
;
comp_expr
: (e1=add_expr -> $e1) ( Eq e2=add_expr -> ^(AND ^(GTEQ $e1 $e2) ^(LTEQ $e1 $e2))
| LtEq e2=add_expr -> ^(LTEQ $e1 $e2)
| GtEq e2=add_expr -> ^(GTEQ $e1 $e2)
| NEq e2=add_expr -> ^(OR ^(GTEQ $e1 ^(ADD $e2 EPSILON)) ^(LTEQ $e1 ^(SUB $e2 EPSILON)))
| Gt e2=add_expr -> ^(GTEQ $e1 ^(ADD $e2 EPSILON))
| Lt e2=add_expr -> ^(LTEQ $e1 ^(SUB $e2 EPSILON))
)?
;
add_expr
: mult_expr ((Add | Sub)^ mult_expr)*
;
mult_expr
: atom ((Mult | Div)^ atom)*
;
atom
: Num
| Id
| '(' expr ')'
;
Eq : '==';
LtEq : '<=';
GtEq : '>=';
NEq : '!=';
Gt : '>';
Lt : '<';
Or : '||';
And : '&&';
Mult : '*';
Div : '/';
Add : '+';
Sub : '-';
Num : '0'..'9'+ ('.' '0'..'9'+)?;
Id : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
The parser generated from the grammar above rewrites input as follows:
a == b   becomes   (AND (GTEQ a b) (LTEQ a b))
a != b   becomes   (OR (GTEQ a (ADD b EPSILON)) (LTEQ a (SUB b EPSILON)))
a > b    becomes   (GTEQ a (ADD b EPSILON))
a < b    becomes   (LTEQ a (SUB b EPSILON))
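To print these trees yourself, a small test rig could look like this (a sketch only: it assumes the ANTLR 3 Java runtime, since the grammar uses output=AST, the class names TLexer/TParser follow from "grammar T", and the snippet belongs in a method that declares throws RecognitionException):
TLexer lexer = new TLexer(new ANTLRStringStream("a != b"));
TParser parser = new TParser(new CommonTokenStream(lexer));
// parse() returns a parse_return object whose getTree() holds the rewritten AST
CommonTree tree = (CommonTree) parser.parse().getTree();
System.out.println(tree.toStringTree());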
