Don't passed string in the antlr4 grammar - parsing

I have create the grammar(see it bellow) and when I try to validate the string
Time;15 16 * * 1-5; 'muni_eval_comments_'yyyyMMdd_HHmmss'.csv';
have following error message
line 1:, charPositionInLine:57, msg: extraneous input '.csv' expecting {';', '-', ''', '.', '_', ID}
Where I'm wrong and how to fix it?
Regards,
Vladimir
lexer grammar FileTriggerLexer;
#header {
}
STEP
:
'/' INTEGER
;
SCHEDULE
:
'Schedule'
;
SEMICOLON
:
';'
;
ASTERISK
:
'*'
;
CRON
:
'cron'
;
MARKET_CRON
:
'marketCron'
;
COMBINED
:
'combined'
;
FILE_FEED
:
'FileFeed'
;
TIME: 'Time';
LBRACKET
:
'('
;
RBRACKET
:
')'
;
PERCENT
:
'%'
;
INTEGER
:
[0-9]+
;
DASH
:
'-'
;
DOUBLE_QUOTE
:
'"'
;
QUOTE
:
'\''
;
SLASH
:
'/'
;
DOT
:
'.'
;
COMMA
:
','
;
UNDERSCORE
:
'_'
;
ID
:
[a-zA-Z] [a-zA-Z0-9]*
;
REGEX
:
(
ID
| DOT
| ASTERISK
| INTEGER
| PERCENT
)+
;
WS
:
[ \t\r\n]+ -> skip
;
/**
* Define a grammar called Hello
*/
grammar FileTriggerValidator;
options
{
tokenVocab = FileTriggerLexer;
}
r
:
(
schedule
| file_feed
| time_feed
)+
;
time_feed
:
TIME SEMICOLON cron_part SEMICOLON file_name SEMICOLON
;
file_feed
:
file_feed_name SEMICOLON source_file SEMICOLON source_host SEMICOLON
source_host SEMICOLON regEx SEMICOLON regEx
(
SEMICOLON source_host
)*
;
formatString
:
source_host
(
'%' source_host?
)* DOT source_host
;
regEx
:
REGEX
;
source_host
:
ID
(
DASH ID
)*
;
file_feed_name
:
FILE_FEED
;
source_file
:
(
ID
| DASH
| UNDERSCORE
)+
;
schedule
:
SCHEDULE SEMICOLON schedule_defining SEMICOLON file_name SEMICOLON timezone
(
SEMICOLON INTEGER
)?
;
schedule_defining
:
cron
| market_cron
| combined_cron
;
cron
:
CRON LBRACKET DOUBLE_QUOTE cron_part timezone DOUBLE_QUOTE RBRACKET
;
market_cron
:
MARKET_CRON LBRACKET DOUBLE_QUOTE cron_part timezone DOUBLE_QUOTE COMMA
DOUBLE_QUOTE ID DOUBLE_QUOTE RBRACKET
;
combined_cron
:
COMBINED LBRACKET cron_list_element
(
COMMA cron_list_element
)* RBRACKET
;
mic_defining
:
ID
;
file_name
:
(
ID
| DOT
| QUOTE
| DASH
| UNDERSCORE
)+
;
cron_list_element
:
cron
| market_cron
;
//
schedule_defined_string
:
cron
;
//
cron_part
:
minutes hours days_of_month month week_days
;
//
minutes
:
INTEGER
| with_step_value
;
//
hours
:
INTEGER
| with_step_value
;
//
int_list
:
INTEGER
(
COMMA INTEGER
)*
;
interval
:
INTEGER DASH INTEGER
;
//
days_of_month
:
INTEGER
| with_step_value
;
//
month
:
INTEGER
| with_step_value
;
//
week_days
:
INTEGER
| with_step_value
;
//
timezone
:
timezone_part
(
SLASH timezone_part
)?
;
//
timezone_part
:
ID
(
UNDERSCORE ID
)?
;
//
with_step_value
:
(
int_list
| interval
| ASTERISK
) STEP?
;

Your .csv code fragment has been recognized as REGEX token:
TIME SEMICOLON(;) INTEGER(15) INTEGER(16) ASTERISK(*) ASTERISK(*)
INTEGER(1) DASH(-) INTEGER(5) SEMICOLON(;) QUOTE(') ID(muni) UNDERSCORE(_)
ID(eval) UNDERSCORE(_) ID(comments) UNDERSCORE(_) QUOTE(') ID(yyyyMMdd)
UNDERSCORE(_) ID(HHmmss) QUOTE(') REGEX(.csv) QUOTE(') SEMICOLON(;) EOF(<EOF>) EOF
But file_name does not contain REGEX token:
file_name
:
(
ID
| DOT
| QUOTE
| DASH
| UNDERSCORE
)+
;
Try to include REGEX to file_name rule or remove REGEX and use regEx parser rule instead:
regEx
:
(
ID
| DOT
| ASTERISK
| INTEGER
| PERCENT
)+
;

Related

Match any printable letter-like characters in ANTLR4 with Go as target

This is freaking me out, I just can't find a solution to it. I have a grammar for search queries and would like to match any searchterm in a query composed out of printable letters except for special characters "(", ")". Strings enclosed in quotes are handled separately and work.
Here is a somewhat working grammar:
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start
: searchclause EOF
;
searchclause
: table expr
;
expr
: fieldsearch
| searchop fieldsearch
| unop expr
| expr relop expr
| lparen expr relop expr rparen
;
lparen
: '('
;
rparen
: ')'
;
unop
: NOT
;
relop
: AND
| OR
;
searchop
: NO
| EVERY
;
fieldsearch
: field EQ searchterm
;
field
: ID
;
table
: ID
;
searchterm
:
| STRING
| ID+
| DIGIT+
| DIGIT+ ID+
;
STRING
: '"' ~('\n'|'"')* ('"' )
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '='
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: ('0' .. '9')
;
/*
NOT_SPECIAL
: ~(' ' | '\t' | '\n' | '\r' | '\'' | '"' | ';' | '.' | '=' | '(' | ')' )
; */
WS
: [ \r\n\t] + -> skip
;
The problem is that searchterm is too restricted. It should match any character that is in the commented out NOT_SPECIAL, i.e., valid queries would be:
Person Name=%
Person Address=^%Street%%%$^&*#^
But whenever I try to put NOT_SPECIAL in any way into the definition of searchterm it doesn't work. I have tried putting it literally into the rule, too (commenting out NOT_SPECIAL) and many others things, but it just doesn't work. In most of my attempts the grammar just complained about extraneous input after "=" and said it was expecting EOF. But I also cannot put EOF into NOT_SPECIAL.
Is there any way I can simply parse every text after "=" in rule fieldsearch until there is a whitespace or ")", "("?
N.B. The STRING rule works fine, but the user ought not be required to use quotes every time, because this is a command line tool and they'd need to be escaped.
Target language is Go.
You could solve that by introducing a lexical mode that you'll enter whenever you match an EQ token. Once in that lexical mode, you either match a (, ) or a whitespace (in which case you pop out of the lexical mode), or you keep matching your NOT_SPECIAL chars.
By using lexical modes, you must define your lexer- and parser rules in their own files. Be sure to use lexer grammar ... and parser grammar ... instead of the grammar ... you use in a combined .g4 file.
A quick demo:
lexer grammar MdbLexer;
STRING
: '"' ~[\r\n"]* '"'
;
OPAR
: '('
;
CPAR
: ')'
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '=' -> pushMode(NOT_SPECIAL_MODE)
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: [0-9]
;
WS
: [ \r\n\t]+ -> skip
;
fragment VALID_ID_START
: [a-zA-Z_]
;
fragment VALID_ID_CHAR
: [a-zA-Z_0-9]
;
mode NOT_SPECIAL_MODE;
OPAR2
: '(' -> type(OPAR), popMode
;
CPAR2
: ')' -> type(CPAR), popMode
;
WS2
: [ \t\r\n] -> skip, popMode
;
NOT_SPECIAL
: ~[ \t\r\n()]+
;
Your parser grammar would start like this:
parser grammar MdbParser;
options {
tokenVocab=MdbLexer;
}
start
: searchclause EOF
;
// your other parser rules
My Go is a bit rusty, but a small Java test:
String source = "Person Address=^%Street%%%$^&*#^()";
MdbLexer lexer = new MdbLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-15s %s\n", MdbLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
print the following:
ID Person
ID Address
EQ =
NOT_SPECIAL ^%Street%%%$^&*#^
OPAR (
CPAR )
EOF <EOF>

ANTLR4 precedence of operator

This is my grammar:
grammar FOOL;
#header {
import java.util.ArrayList;
}
#lexer::members {
public ArrayList<String> lexicalErrors = new ArrayList<>();
}
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
prog : exp SEMIC #singleExp
| let exp SEMIC #letInExp
| (classdec)+ SEMIC (let)? exp SEMIC #classExp
;
classdec : CLASS ID ( EXTENDS ID )? (LPAR (vardec ( COMMA vardec)*)? RPAR)? (CLPAR ((fun SEMIC)+)? CRPAR)?;
let : LET (dec SEMIC)+ IN ;
vardec : type ID ;
varasm : vardec ASM exp ;
fun : type ID LPAR ( vardec ( COMMA vardec)* )? RPAR (let)? exp ;
dec : varasm #varAssignment
| fun #funDeclaration
;
type : INT
| BOOL
| ID
;
exp : left=term (operator=(PLUS | MINUS) right=term)*
;
term : left=factor (operator=(TIMES | DIV) right=factor)*
;
factor : left=value (operator=(EQ | LESSEQ | GREATEREQ | GREATER | LESS | AND | OR ) right=value)*
;
value : MINUS?INTEGER #intVal
| (NOT)? ( TRUE | FALSE ) #boolVal
| LPAR exp RPAR #baseExp
| IF cond=exp THEN CLPAR thenBranch=exp CRPAR (ELSE CLPAR elseBranch=exp CRPAR)? #ifExp
| MINUS?ID #varExp
| THIS #thisExp
| funcall #funExp
| (ID | THIS) DOT funcall #methodExp
| NEW ID ( LPAR (exp (COMMA exp)* )? RPAR)? #newExp
| PRINT ( exp ) #print
;
/* PRINT LPAR exp RPAR */
funcall
: ID ( LPAR (exp (COMMA exp)* )? RPAR )
;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
SEMIC : ';' ;
COLON : ':' ;
COMMA : ',' ;
EQ : '==' ;
ASM : '=' ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIV : '/' ;
TRUE : 'true' ;
FALSE : 'false' ;
LPAR : '(' ;
RPAR : ')' ;
CLPAR : '{' ;
CRPAR : '}' ;
IF : 'if' ;
THEN : 'then' ;
ELSE : 'else' ;
PRINT : 'print' ;
LET : 'let' ;
IN : 'in' ;
VAR : 'var' ;
FUN : 'fun' ;
INT : 'int' ;
BOOL : 'bool' ;
CLASS : 'class' ;
EXTENDS : 'extends' ;
THIS : 'this' ;
NEW : 'new' ;
DOT : '.' ;
LESSEQ : ('<=' | '=<') ;
GREATEREQ : ('>=' | '=>') ;
GREATER: '>' ;
LESS : '<' ;
AND : '&&' ;
OR : '||' ;
NOT : '!' ;
//Numbers
fragment DIGIT : '0'..'9';
INTEGER : DIGIT+;
//IDs
fragment CHAR : 'a'..'z' |'A'..'Z' ;
ID : CHAR (CHAR | DIGIT)* ;
//ESCAPED SEQUENCES
WS : (' '|'\t'|'\n'|'\r')-> skip;
LINECOMENTS : '//' (~('\n'|'\r'))* -> skip;
BLOCKCOMENTS : '/*'( ~('/'|'*')|'/'~'*'|'*'~'/'|BLOCKCOMENTS)* '*/' -> skip;
ERR_UNKNOWN_CHAR
: . { lexicalErrors.add("UNKNOWN_CHAR " + getText()); }
;
I think that there is a problem in the grammar concerning the precedence of operator.
In particular, this one
let
int x = (5-2)+4;
in
print x;
prints 7, while this one:
let
int x = 5-2+4;
in
print x;
prints 9.
Why the first one works? How can I make the second one working, only changing the grammar?
I think there is something to change in exp, term or factor.
This is the first parse tree http://it.tinypic.com/r/2nj8tqw/9 .
This is the second parse tree http://it.tinypic.com/r/2iv02z6/9 .
exp : left=term (operator=(PLUS | MINUS) right=exp)?
This produces parse tree that is causing it. Simply put, 5 - 2 + 4 will be parsed as:
term PLUS exp
2 term MINUS exp
2 term
4
This should help, although you'll have to change the evaluation logic:
exp : left=term (operator=(PLUS | MINUS) right=term)*
Same for factor and any other possible binary operations.

Antlr not recognizing number

I have 3 types of numbers defined, number, decimal and percentage.
Percentage : (Sign)? Digit+ (Dot Digit+)? '%' ;
Number : Sign? Digit+;
Decimal : Sign? Digit+ Dot Digit*;
Percentage and decimal work fine but when I assign a number, unless I put a sign (+ or -) in front of the number, it doesn't recognize it as a number.
number foo = +5 // does recognize
number foo = 5; // does not recognize
It does recognize it in an evaluation expression.
if (foo == 5 ) // does recognize
Here is my language (I took out the functions and left only the language recognition).
grammar Fetal;
transaction : begin statements end;
begin : 'begin' ;
end : 'end' ;
statements : (statement)+
;
statement
: declaration ';'
| command ';'
| assignment ';'
| evaluation
| ';'
;
declaration : type var;
var returns : identifier;
type returns
: DecimalType
| NumberType
| StringType
| BooleanType
| DateType
| ObjectType
| DaoType
;
assignment
: lharg Equals rharg
| lharg unaryOP rharg
;
assignmentOp : Equals
;
unaryOP : PlusEquals
| MinusEquals
| MultiplyEquals
| DivideEquals
| ModuloEquals
| ExponentEquals
;
expressionOp : arithExpressOp
| bitwiseExpressOp
;
arithExpressOp : Multiply
| Divide
| Plus
| Minus
| Modulo
| Exponent
;
bitwiseExpressOp
: And
| Or
| Not
;
comparisonOp : IsEqualTo
| IsLessThan
| IsLessThanOrEqualTo
| IsGreaterThan
| IsGreaterThanOrEqualTo
| IsNotEqualTo
;
logicExpressOp : AndExpression
| OrExpression
| ExclusiveOrExpression
;
rharg returns
: rharg expressionOp rharg
| '(' rharg expressionOp rharg ')'
| var
| literal
| assignmentCommands
;
lharg returns : var;
identifier : Identifier;
evaluation : IfStatement '(' evalExpression ')' block (Else block)?;
block : OpenBracket statements CloseBracket;
evalExpression
: evalExpression logicExpressOp evalExpression
| '(' evalExpression logicExpressOp evalExpression ')'
| eval
| '(' eval ')'
;
eval : rharg comparisonOp rharg ;
assignmentCommands
: GetBalance '(' stringArg ')'
| GetVariableType '(' var ')'
| GetDescription
| Today
| GetDays '(' startPeriod=dateArg ',' endPeriod=dateArg ')'
| DayOfTheWeek '(' dateArg ')'
| GetCalendarDay '(' dateArg ')'
| GetMonth '(' dateArg ')'
| GetYear '(' dateArg ')'
| Import '(' stringArg ')' /* Import( path ) */
| Lookup '(' sql=stringArg ',' argumentList ')' /* Lookup( table, SQL) */
| List '(' sql=stringArg ',' argumentList ')' /* List( table, SQL) */
| invocation
;
command : Print '(' rharg ')'
| Credit '(' amtArg ',' stringArg ')'
| Debit '(' amtArg ',' stringArg ')'
| Ledger '(' debitOrCredit ',' amtArg ',' acc=stringArg ',' desc=stringArg ')'
| Alias '(' account=stringArg ',' name=stringArg ')'
| MapFile ':' stringArg
| invocation
| Update '(' sql=stringArg ',' argumentList ')'
;
invocation
: o=objectLiteral '.' m=identifier '('argumentList? ')'
| o=objectLiteral '.' m=identifier '()'
;
argumentList
: rharg (',' rharg )*
;
amtArg : rharg ;
stringArg : rharg ;
numberArg : rharg ;
dateArg : rharg ;
debitOrCredit : charLiteral ;
literal
: numericLiteral
| doubleLiteral
| booleanLiteral
| percentLiteral
| stringLiteral
| dateLiteral
;
fileName : '<' fn=Identifier ('.' ft=Identifier)? '>' ;
charLiteral : ('D' | 'C');
numericLiteral : Number ;
doubleLiteral : Decimal ;
percentLiteral : Percentage ;
booleanLiteral : Boolean ;
stringLiteral : String ;
dateLiteral : Date ;
objectLiteral : Identifier ;
daoLiteral : Identifier ;
//Below are Token definitions
// Data Types
DecimalType : 'decimal' ;
NumberType : 'number' ;
StringType : 'string' ;
BooleanType : 'boolean' ;
DateType : 'date' ;
ObjectType : 'object' ;
DaoType : 'dao' ;
/******************************************************************
* Assignmnt operator
******************************************************************/
Equals : '=' ;
/*****************************************************************
* Unary operators
*****************************************************************/
PlusEquals : '+=' ;
MinusEquals : '-=' ;
MultiplyEquals : '*=' ;
DivideEquals : '/=' ;
ModuloEquals : '%=' ;
ExponentEquals : '^=' ;
/*****************************************************************
* Binary operators
*****************************************************************/
Plus : '+' ;
Minus : '-' ;
Multiply : '*' ;
Divide : '/' ;
Modulo : '%' ;
Exponent : '^' ;
/***************************************************************
* Bitwise operators
***************************************************************/
And : '&' ;
Or : '|' ;
Not : '!' ;
/*************************************************************
* Compariso operators
*************************************************************/
IsEqualTo : '==' ;
IsLessThan : '<' ;
IsLessThanOrEqualTo : '<=' ;
IsGreaterThan : '>' ;
IsGreaterThanOrEqualTo : '>=' ;
IsNotEqualTo : '!=' ;
/*************************************************************
* Expression operators
*************************************************************/
AndExpression : '&&' ;
OrExpression : '||' ;
ExclusiveOrExpression : '^^' ;
// Reserve words (Assignment Commands)
GetBalance : 'getBalance';
GetVariableType : 'getVariableType' ;
GetDescription : 'getDescription' ;
Today : 'today';
GetDays : 'getDays' ;
DayOfTheWeek : 'dayOfTheWeek' ;
GetCalendarDay : 'getCalendarDay' ;
GetMonth : 'getMonth' ;
GetYear : 'getYear' ;
Import : 'import' ;
Lookup : 'lookup' ;
List : 'list' ;
// Reserve words (Commands)
Credit : 'credit';
Debit : 'debit';
Ledger : 'ledger';
Alias : 'alias' ;
MapFile : 'mapFile' ;
Update : 'update' ;
Print : 'print';
IfStatement : 'if';
Else : 'else';
OpenBracket : '{';
CloseBracket : '}';
Percentage : (Sign)? Digit+ (Dot Digit+)? '%' ;
Boolean : 'true' | 'false';
Number : Sign? Digit+;
Decimal : Sign? Digit+ Dot Digit*;
Date : Year '-' Month '-' Day;
Identifier
: IdentifierNondigit
( IdentifierNondigit
| Digit
)*
;
String: '"' ( ESC | ~[\\"] )* '"';
/************************************************************
* Fragment Definitions
************************************************************/
fragment
ESC : '\\' [abtnfrv"'\\]
;
fragment
IdentifierNondigit
: Nondigit
//| // other implementation-defined characters...
;
fragment
Nondigit
: [a-zA-Z_]
;
fragment
Digit
: [0-9]
;
fragment
Sign : Plus | Minus;
fragment
Digits
: [-+]?[0-9]+
;
fragment
Year
: Digit Digit Digit Digit;
fragment
Month
: Digit Digit;
fragment
Day
: Digit Digit;
fragment Dot : '.';
fragment
SCharSequence
: SChar+
;
fragment
SChar
: ~["\\\r\n]
| SimpleEscapeSequence
| '\\\n' // Added line
| '\\\r\n' // Added line
;
fragment
CChar
: ~['\\\r\n]
| SimpleEscapeSequence
;
fragment
SimpleEscapeSequence
: '\\' ['"?abfnrtv\\]
;
ExtendedAscii
: [\x80-\xfe]+
-> skip
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
BlockComment
: '/*' .*? '*/'
-> skip
;
LineComment
: '//' ~[\r\n]*
-> skip
;
I have a hunch that this use of a fragment is incorrect:
fragment Sign : Plus | Minus;
I couldn't find anything in the reference book, but I think it needs to be changed to something like this:
fragment Sign : [+-];
I found the issue. I was using version 4.5.2-1 because every attempt to upgrade to 4.7 caused more errors and I didn't want to cause more errors while trying to solve another. I finally broke down and upgraded the libraries to 4.7, fixed the errors and the number recognition issue disappeared. It was a bug in the library, all this time.

Antlr3 - Non Greedy Double Quoted String with Escaped Double Quote

The following Antlr3 Grammar file doesn't cater for escaped double quotes as part of the STRING lexer rule. Any ideas why?
Expressions working:
\"hello\"
ref(\"hello\",\"hello\")
Expressions NOT working:
\"h\"e\"l\"l\"o\"
ref(\"hello\", \"hel\"lo\")
Antlr3 grammar file runnable in AntlrWorks:
grammar Grammar;
options
{
output=AST;
ASTLabelType=CommonTree;
language=CSharp3;
}
public oaExpression
: exponentiationExpression EOF!
;
exponentiationExpression
: equalityExpression ( '^' equalityExpression )*
;
equalityExpression
: relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
;
relationalExpression
: additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
;
additiveExpression
: multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
;
multiplicativeExpression
: primaryExpression ( ( '*' | '/' ) primaryExpression )*
;
primaryExpression
: '(' exponentiationExpression ')' | value | identifier (arguments )?
;
value
: STRING
;
identifier
: ID
;
expressionList
: exponentiationExpression ( ',' exponentiationExpression )*
;
arguments
: '(' ( expressionList )? ')'
;
/*
* Lexer rules
*/
ID
: LETTER (LETTER | DIGIT)*
;
STRING
: '"' ( options { greedy=false; } : ~'"' )* '"'
;
WS
: (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=Hidden;}
;
/*
* Fragment Lexer rules
*/
fragment
LETTER
: 'a'..'z'
| 'A'..'Z'
| '_'
;
fragment
EXPONENT
: ('e'|'E') ('+'|'-')? ( DIGIT )+
;
fragment
HEX_DIGIT
: ( DIGIT |'a'..'f'|'A'..'F')
;
fragment
DIGIT
: '0'..'9'
;
Try this:
STRING
: '"' // a opening quote
( // start group
'\\' ~('\r' | '\n') // an escaped char other than a line break char
| // OR
~('\\' | '"'| '\r' | '\n') // any char other than '"', '\' and line breaks
)* // end group and repeat zero or more times
'"' // the closing quote
;
When I test the 4 different test cases from your comment:
"\"hello\""
"ref(\"hello\",\"hello\")"
"\"h\"e\"l\"l\"o\""
"ref(\"hello\", \"hel\"lo\")"
with the lexer rule I suggested:
grammar T;
parse
: string+ EOF
;
string
: STRING
;
STRING
: '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ {skip();}
;
ANTLRWorks' debugger produces the following parse tree:
In other words: it works just fine (on my machine :)).
EDIT II
And I've also used your grammar (making some small changes to make it Java compatible) where I replaced the incorrect STRING rule into the one I suggested:
oaExpression
: STRING+ EOF!
//: exponentiationExpression EOF!
;
exponentiationExpression
: equalityExpression ( '^' equalityExpression )*
;
equalityExpression
: relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
;
relationalExpression
: additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
;
additiveExpression
: multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
;
multiplicativeExpression
: primaryExpression ( ( '*' | '/' ) primaryExpression )*
;
primaryExpression
: '(' exponentiationExpression ')' | value | identifier (arguments )?
;
value
: STRING
;
identifier
: ID
;
expressionList
: exponentiationExpression ( ',' exponentiationExpression )*
;
arguments
: '(' ( expressionList )? ')'
;
/*
* Lexer rules
*/
ID
: LETTER (LETTER | DIGIT)*
;
//STRING
// : '"' ( options { greedy=false; } : ~'"' )* '"'
// ;
STRING
: '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
;
WS
: (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;} /*{$channel=Hidden;}*/
;
/*
* Fragment Lexer rules
*/
fragment
LETTER
: 'a'..'z'
| 'A'..'Z'
| '_'
;
fragment
EXPONENT
: ('e'|'E') ('+'|'-')? ( DIGIT )+
;
fragment
HEX_DIGIT
: ( DIGIT |'a'..'f'|'A'..'F')
;
fragment
DIGIT
: '0'..'9'
;
which parses the input from my previous example in an identical parse tree.
This is how I do this with strings that can contain escape sequences (not just \" but any):
DOUBLE_QUOTED_TEXT
#init { int escape_count = 0; }:
DOUBLE_QUOTE
(
DOUBLE_QUOTE DOUBLE_QUOTE { escape_count++; }
| ESCAPE_OPERATOR . { escape_count++; }
| ~(DOUBLE_QUOTE | ESCAPE_OPERATOR)
)*
DOUBLE_QUOTE
{ EMIT(); LTOKEN->user1 = escape_count; }
;
The rule additionally counts the escapes and stores them in the token. This allows the receiver to quickly see if it needs to do anything with the string (if user1 > 0). If you don't need that remove the #init part and the actions.

antlrworks with multiple equals

I'm having a problem with my grammar.
It says Decision can match input such as "{EQUAL, GREATER..GREATER_EQUAL, LOWER..LOWER_EQUAL, NOT_EQUAL}" using multiple alternatives: 1, 2, Although all the trees of rules are correct.
Anyone to help?!
grammar test;
//parser : .* EOF;
program :T_PROGRAM ID T_LEFTPAR identifier_list T_RIGHTPAR T_SEMICOLON declarations subprogram_declarations compound_statement T_DOT;
identifier_list :(ID) (',' ID)*;
declarations :() (T_VAR identifier_list COLON type T_SEMICOLON)* ;
type : standard_type|
T_ARRAY T_LEFTBRACK NUM T_TO NUM T_RIGHTBRACK T_OF standard_type ;
standard_type : INT
| FLOAT ;
subprogram_declarations :() (subprogram_declaration T_SEMICOLON)* ;
subprogram_declaration : subprogram_head declarations compound_statement;
subprogram_head :T_FUNCTION ID arguments COLON standard_type |
T_PROCEDURE ID arguments ;
arguments :T_LEFTPAR parameter_list T_RIGHTPAR | ;
parameter_list :(identifier_list COLON type) (T_SEMICOLON identifier_list COLON type)*;
compound_statement : T_BEGIN optional_statements T_END;
optional_statements :statement_list | ;
statement_list :(statement) (T_SEMICOLON statement)*;
statement :
variable ASSIGN expression
| procedure_statement
| compound_statement
| T_IF expression T_THEN statement T_ELSE statement
| T_WHILE expression T_DO statement
;
procedure_statement :ID
| ID T_LEFTPAR expression_list T_RIGHTPAR;
expression_list : (expression) (',' expression)*;
variable : ID T_LEFTBRACK expression T_RIGHTBRACK | ;
expression :( () |simple_expression) (( LOWER | LOWER_EQUAL | GREATER | GREATER_EQUAL | EQUAL | NOT_EQUAL ) simple_expression)* ;
simple_expression :
( () | sign ) term (( PLUS | MINUS | T_OR ) term)*;
term :
(factor) (( CROSS | DIVIDE | MOD | T_AND ) factor)*;
factor :
variable
|ID T_LEFTPAR expression_list T_RIGHTPAR
| NUM
| T_LEFTPAR expression T_RIGHTPAR
| T_NOT factor;
sign :
'+'
| '-';
/********/
T_PROGRAM : 'program';
T_FUNCTION : 'function';
T_PROCEDURE : 'procedure';
T_READ : 'read';
T_WRITE : 'write';
T_OF : 'of';
T_ARRAY : 'array';
T_VAR : 'var';
T_FLOAT : 'float';
T_INT : 'int';
T_CHAR : 'char';
T_STRING : 'string';
T_BEGIN : 'begin';
T_END : 'end';
T_IF : 'if';
T_THEN : 'then';
T_ELSE : 'else';
T_WHILE : 'while';
T_DO : 'do';
T_NOT : 'not';
NUM : INT
| FLOAT;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
LOWER : '<';
LOWER_EQUAL : '<=';
GREATER : '>';
GREATER_EQUAL : '>=';
EQUAL : '=';
NOT_EQUAL : '<>';
ASSIGN :
':=';
COLON :
':';
PLUS : '+';
MINUS : '-';
T_OR : 'OR';
CROSS : '*';
DIVIDE : '/';
MOD : 'MOD';
T_AND : 'AND';
T_LEFTPAR
: '(';
T_RIGHTPAR
: ')';
T_LEFTBRACK
: '[';
T_RIGHTBRACK
: ']';
T_TO
: '..';
T_DOT : '.';
T_SEMICOLON
: ';';
T_COMMA
: ',';
T_BADNUM
: (NUM)(CHAR)*;
T_BADSTRING
: '"' ( ESC_SEQ | ~('\\'|'"') )*WS;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
fragment
INT : '0'..'9'+
;
fragment
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
Problem 1
The ambiguity (the errors/warnings that certain input can be matched in more than 1 way) originate mainly from the fact that both your variable and expression rules can match empty input:
variable
: ID T_LEFTBRACK expression T_RIGHTBRACK
| /* nothing */
;
expression
: (
(/* nothing */)
| simple_expression
)
(
(LOWER | LOWER_EQUAL | GREATER | GREATER_EQUAL | EQUAL | NOT_EQUAL) simple_expression
)* /* because of the `*`, this can also match nothing */
;
(I added the /* ... */ comments and reformatted the rules to make them more readable)
Fix 1
You probably want to do it like this instead:
variable
: ID T_LEFTBRACK expression T_RIGHTBRACK
;
expression
: simple_expression ((LOWER | LOWER_EQUAL | GREATER | GREATER_EQUAL | EQUAL | NOT_EQUAL) simple_expression)*
;
Problem 2
Another problem (one that will show up once you would have resolved the ambiguities, is that you defined the tokens MOD, T_AND and T_OR after your ID rule:
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
...
MOD : 'MOD';
T_AND : 'AND';
T_OR : 'OR';
This will cause MOD, T_AND and T_OR to be never created since ID matches these characters too.
Fix 2
Place MOD, T_AND and T_OR before your ID rule:
MOD : 'MOD';
T_AND : 'AND';
T_OR : 'OR';
...
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;

Resources