Mandatory and optional spaces - parsing

I need to parse strings like this:
"qqq www eee" -> "qqq", "www", "eee" (case A)
"qqq   www  eee" -> "qqq", "www", "eee" (case B)
Here's the grammar I currently have:
grammar Query;
SHORT_NAME : ('a'..'z')+ ;
name returns [String s]: SHORT_NAME { $s = $SHORT_NAME.text; };
names
returns [List<String> v]
@init { $v = new ArrayList<String>(); }
: name1 = name { $v.add($name1.s); }
(' ' name2 = name { $v.add($name2.s); })*;
It works fine for case A, but fails for case B:
line 1:4 missing SHORT_NAME at ' '
line 1:5 extraneous input ' ' expecting SHORT_NAME
line 1:10 extraneous input ' ' expecting SHORT_NAME
Any ideas how to make it work?

Remove the literal ' ' from your names rule and replace it with a SPACES token:
grammar Query;
SPACES
: (' ' | '\t')+
;
SHORT_NAME
: ('a'..'z')+
;
name returns [String s]
: SHORT_NAME { $s = $SHORT_NAME.text; }
;
names returns [List<String> v]
@init { $v = new ArrayList<String>(); }
: a=name { $v.add($a.s); } (SPACES b=name { $v.add($b.s); })*
;
Or simply discard the spaces at the lexer-level so that you don't need to put them in your parser rules:
grammar Query;
SPACES
: (' ' | '\t')+ {skip();}
;
SHORT_NAME
: ('a'..'z')+
;
name returns [String s]
: SHORT_NAME { $s = $SHORT_NAME.text; }
;
names returns [List<String> v]
@init { $v = new ArrayList<String>(); }
: (name { $v.add($name.s); })+
;
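To make the effect of the SPACES rule concrete, here is a minimal plain-Java sketch of the same idea: names separated by one or more spaces or tabs. It is a hand-written stand-in, not ANTLR-generated code, and the class and method names (NamesSketch, names) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NamesSketch {
    // Hand-written stand-in for the `names` rule: lowercase names separated
    // by one or more spaces or tabs, mirroring SPACES : (' ' | '\t')+ .
    static List<String> names(String input) {
        List<String> v = new ArrayList<>();
        for (String part : input.split("[ \\t]+")) {
            // Mirrors SHORT_NAME : ('a'..'z')+ ; anything else is rejected.
            if (!part.matches("[a-z]+")) {
                throw new IllegalArgumentException("not a SHORT_NAME: '" + part + "'");
            }
            v.add(part);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(names("qqq www eee"));    // single spaces
        System.out.println(names("qqq   www  eee")); // runs of spaces
    }
}
```

Both calls yield the same three names, which is the behaviour the grammar with a SPACES token is meant to have.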

Related

Antlr4 parsing issue

When I run my message.expr file against the Zmes.g4 grammar with antlr-4.7.1-complete, only the first line is parsed and nothing is reported for the second one. The grammar is:
grammar Zmes;
prog : stat+;
stat : (message|define);
message : 'MSG' MSGNUM TEXT;
define : 'DEF:' ('String '|'int ') ID ( ',' ('String '|'Int ') ID)* ';';
fragment QUOTE : '\'';
MSGNUM : [0-9]+;
TEXT : QUOTE ~[']* QUOTE;
MODULE : [A-Z][A-Z][A-Z] ;
ID : [A-Z]([A-Za-z0-9_])*;
SKIPS : (' '|'\t'|'\r'?'\n'|'\r')+ -> skip;
and message.expr is
MSG 100 'MESSAGE YU';
DEF: String Svar1,Int Intv1;
When I run it from cmd like this, I get:
grun Zmes prog -tree message.expr
(prog (stat (message MSG 100 'MESSAGE YU')))
and nothing is printed for the second line. Why is that?
Your message rule should include ';' at the end:
message : 'MSG' MSGNUM TEXT ';';
Also, in your define rule you have 'int ', which should probably be 'Int' (no space and a capital i).
I'd go for something like this:
grammar Zmes;
prog : stat+ EOF;
stat : (message | define) SCOL;
message : MSG MSGNUM TEXT;
define : DEF COL type ID (COMMA type ID)*;
type : STRING | INT;
MSG : 'MSG';
DEF : 'DEF';
STRING : 'String';
INT : 'Int';
COL : ':';
SCOL : ';';
COMMA : ',';
MSGNUM : [0-9]+;
TEXT : '\'' ~[']* '\'';
MODULE : [A-Z] [A-Z] [A-Z] ;
ID : [A-Z] [A-Za-z0-9_]*;
SKIPS : (' '|'\t'|'\r'?'\n'|'\r')+ -> skip;
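With this grammar both lines of message.expr are parsed.
You should also add EOF if you want to parse the entire input, e.g.
prog : stat+ EOF;
Without EOF, the parser stops at the first token that cannot start another stat and returns without reporting the rest of the input, which is why nothing was printed for the second line.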
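The difference EOF makes can be sketched with a regex analogy in plain Java: a rule without EOF behaves like a prefix match, while a rule ending in EOF must consume the whole input. This is only an analogy, not how ANTLR parses; the pattern and the names EofSketch, parsePrefix and parseAll are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class EofSketch {
    // Regex stand-in for `stat+`: one or more "MSG <number> ;" statements.
    private static final Pattern STATS =
            Pattern.compile("(?:\\s*MSG\\s+\\d+\\s*;)+\\s*");

    // Like `prog : stat+ ;` : succeeds as soon as a prefix of the input matches.
    static boolean parsePrefix(String input) {
        return STATS.matcher(input).lookingAt();
    }

    // Like `prog : stat+ EOF ;` : succeeds only if the whole input matches.
    static boolean parseAll(String input) {
        return STATS.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "MSG 100; oops";
        System.out.println(parsePrefix(input)); // prefix is accepted, "oops" is ignored
        System.out.println(parseAll(input));    // trailing "oops" makes this fail
    }
}
```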

Resolving ANTLR ambiguity while matching specific Types

I'm starting to explore ANTLR and I'm trying to match this format: (test123 A0020 )
Where :
test123 is an Identifier of max 10 characters ( letters and digits )
A : Time indicator ( for Am or Pm ), one letter can be either "A" or "P"
0020 : 4 digit format representing the time.
I tried this grammar :
IDENTIFIER
:
( LETTER | DIGIT ) +
;
INT
:
DIGIT+
;
fragment
DIGIT
:
[0-9]
;
fragment
LETTER
:
[A-Z]
;
WS : [ \t\r\n(\s)+]+ -> channel(HIDDEN) ;
formatter: '(' information ')';
information :
information '/' 'A' INT
|IDENTIFIER ;
How can I resolve the ambiguity and get the time format matched as 'A' INT not as IDENTIFIER?
Also how can I add checks like length of token to the identifier?
I know that this doesn't work in ANTLR: IDENTIFIER : (DIGIT | LETTER ) {2,10}
UPDATE:
I changed the rules to use semantic checks, but I still have the same ambiguity between the identifier and the time format. Here are the modified rules:
formatter
: information
| information '-' time
;
time :
timeMode timeCode;
timeMode:
{ getCurrentToken().getText().matches("[A,C]")}? MOD
;
timeCode: {getCurrentToken().getText().matches("[0-9]{4}")}? INT;
information: {getCurrentToken().getText().length() <= 10 }? IDENTIFIER;
MOD: 'A' | 'C';
The parse tree illustrates the problem: A0023 is matched as timeMode, and the parser complains that timeCode is missing.
Here is a way to handle it:
grammar Test;
@lexer::members {
private boolean isAhead(int maxAmountOfCharacters, String pattern) {
final Interval ahead = new Interval(this._tokenStartCharIndex, this._tokenStartCharIndex + maxAmountOfCharacters - 1);
return this._input.getText(ahead).matches(pattern);
}
}
parse
: formatter EOF
;
formatter
: information ( '-' time )?
;
time
: timeMode timeCode
;
timeMode
: TIME_MODE
;
timeCode
: {getCurrentToken().getType() == IDENTIFIER_OR_INTEGER && getCurrentToken().getText().matches("\\d{4}")}?
IDENTIFIER_OR_INTEGER
;
information
: {getCurrentToken().getType() == IDENTIFIER_OR_INTEGER && getCurrentToken().getText().matches("\\w*[a-zA-Z]\\w*")}?
IDENTIFIER_OR_INTEGER
;
IDENTIFIER_OR_INTEGER
: {!isAhead(6, "[AP]\\d{4}(\\D|$)")}? [a-zA-Z0-9]+
;
TIME_MODE
: [AP]
;
SPACES
: [ \t\r\n] -> skip
;
A small test class:
import org.antlr.v4.runtime.*;
public class Main {
private static void indent(String lispTree) {
int indentation = -1;
for (final char c : lispTree.toCharArray()) {
if (c == '(') {
indentation++;
for (int i = 0; i < indentation; i++) {
System.out.print(i == 0 ? "\n " : " ");
}
}
else if (c == ')') {
indentation--;
}
System.out.print(c);
}
}
public static void main(String[] args) throws Exception {
TestLexer lexer = new TestLexer(new ANTLRInputStream("1P23 - A0023"));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
indent(parser.parse().toStringTree(parser));
}
}
will print:
(parse
(formatter
(information 1P23) -
(time
(timeMode A)
(timeCode 0023))) <EOF>)
for the input "1P23 - A0023".
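The lookahead check behind the lexer predicate can be tried out without ANTLR. The sketch below reimplements the idea of isAhead on a plain String, assuming the character stream clamps the requested interval to the end of the input; the class name LookaheadSketch and the extra parameters are made up for illustration.

```java
public class LookaheadSketch {
    // Plain-String version of the isAhead(...) idea: take at most maxChars
    // characters starting at `start` and test them against the pattern.
    static boolean isAhead(String input, int start, int maxChars, String pattern) {
        int end = Math.min(input.length(), start + maxChars);
        return input.substring(start, end).matches(pattern);
    }

    public static void main(String[] args) {
        // Same pattern as in the grammar: A or P, four digits, then a
        // non-digit or the end of the text.
        String timePattern = "[AP]\\d{4}(\\D|$)";
        String input = "1P23 - A0023";
        // At offset 0 the text is "1P23 -": not a time, so the generic token may match.
        System.out.println(isAhead(input, 0, 6, timePattern));
        // At offset 7 the text is "A0023": a time, so the generic token is blocked
        // and the lexer can fall back to TIME_MODE for the single 'A'.
        System.out.println(isAhead(input, 7, 6, timePattern));
    }
}
```

Note that the predicate in the grammar negates this check, so IDENTIFIER_OR_INTEGER is only allowed where the lookahead does not look like a time.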
EDIT
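ANTLR can also display the parse tree in a GUI component. If you do this instead:
import java.util.Arrays;
// TreeViewer lives in org.antlr.v4.gui from ANTLR 4.5.1 on (shipped in the tool jar)
import org.antlr.v4.gui.TreeViewer;
import org.antlr.v4.runtime.*;
public class Main {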
public static void main(String[] args) throws Exception {
TestLexer lexer = new TestLexer(new ANTLRInputStream("1P23 - A0023"));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
new TreeViewer(Arrays.asList(TestParser.ruleNames), parser.parse()).open();
}
}
the following dialog will appear:
Tested with ANTLR version 4.5.2-1
Using semantic predicates, you can define parser rules for your specific model with logic checks that decide whether the input can be parsed. The predicates below are placed in parser rules rather than lexer rules.
information
: information '/' meridien time
| text
;
meridien
: am
| pm
;
am: {_input.LT(1).getText().equals("A")}? IDENTIFIER;
pm: {_input.LT(1).getText().equals("P")}? IDENTIFIER;
time: {_input.LT(1).getText().length() == 4}? INT;
text: {_input.LT(1).getText().length() <= 10}? IDENTIFIER;
compileUnit
: alfaNum time
;
alfaNum : (ALFA | MOD | NUM)+;
time : MOD NUM+;
MOD: 'A' | 'P';
ALFA: [a-zA-Z];
NUM: [0-9];
WS
: ' ' -> channel(HIDDEN)
;
The ambiguity is avoided by including MOD in the alfaNum rule.

antlr lexer rule that may match almost anything

I have a problem that I'm sure comes from how ANTLR works and from me doing it all wrong, but I have read a lot of docs and tutorials and I still don't fully understand it.
My symptom is that my grammar stops working when I add (because I need it) a lexer rule that may match things it should not. It should only be applied in the right context.
I need the ATTR rule because description, and other rules that will follow, need to capture a string from that keyword to the end of the line.
This is the conflicting rule:
ATTR
: (~('\r'| '\n'))*
;
It seems it matches anything so it 'eats' the text that should match different tokens. It makes sense, but I need it or I need another solution.
This is my current example input:
; This is a comment
; Comment 2
audit-template this is the id {
description a description may include any char but {line break}
}
For reference this is my current complete grammar:
grammar Grammar;
options {
superClass = AbstractTParser;
}
@header {
package antlrTest;
}
@lexer::header {
package antlrTest;
}
@lexer::members {
private void debug(String str) {
System.err.println("DEBUG(L) " + str);
}
}
@members {
private void debug(String str) {
System.err.println("DEBUG(P) " + str);
}
}
parse
: (template|'**TODO**') EOF { debug ("EOF"); }
;
template : 'audit-template' id=IDENTIFIER OB content=templateContent CB { debug("template id=" + $id.text); }
;
templateContent:
description?
;
description : 'description' ATTR
;
//
COMMENT
: ';' ~( '\r' | '\n' )* {$channel=HIDDEN; debug("COMMENT");}
;
SPACES : ( '\t' | '\f' | ' ' | '\n'| '\r' ) {$channel=HIDDEN;}
;
OB : '{' { debug("OB"); }
;
CB : '}' ('\r'| '\n')+ { debug("CB(lf)"); }
| '}' EOF { debug("CB(eof)"); }
;
IDENTIFIER
: ( 'a'..'z' | 'A'..'Z' | '_' )
( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '.' | ' ' | '\t')*
;
ATTR
: (~('\r'| '\n'))*
;
I am getting this error (the error handling is a bit tweaked):
Line 4 char 0
at antlrTest.AbstractTParser.reportError(AbstractTParser.java:45)
at antlrTest.GrammarParser.parse(GrammarParser.java:106)
at antlrTest.TemplateParser.parseInput(TemplateParser.java:15)
at antlrTest.Main.testFile(Main.java:32)
at antlrTest.Main.main(Main.java:11)
Caused by: NoViableAltException(7@[])
at antlrTest.GrammarParser.parse(GrammarParser.java:71)
... 3 more
Solution? Alternatives? Thank you.
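The observation that a catch-all rule "eats" text meant for other tokens can be reproduced with a simplified longest-match tokenizer in plain Java. This is a rough model only, not ANTLR's actual lexer algorithm (ANTLR 3 predicts lexer rules with lookahead, so the details differ); the regexes are hypothetical equivalents of the rules in the grammar, and the names LongestMatchSketch and winner are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchSketch {
    // Simplified token rules in declaration order (assumed regex equivalents).
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("KEYWORD", Pattern.compile("audit-template"));
        RULES.put("IDENTIFIER", Pattern.compile("[a-zA-Z_][a-zA-Z_0-9. \\t]*"));
        RULES.put("ATTR", Pattern.compile("[^\\r\\n]*"));
    }

    // Returns the rule with the longest match at the start of `input`;
    // on a tie the rule declared first wins.
    static String winner(String input) {
        String best = null;
        int bestLen = -1;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            if (m.lookingAt() && m.end() > bestLen) {
                best = rule.getKey();
                bestLen = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // ATTR matches the whole line, so it beats the keyword.
        System.out.println(winner("audit-template this is the id {"));
        // With nothing after the keyword the lengths tie and KEYWORD wins.
        System.out.println(winner("audit-template"));
    }
}
```

In this model the first line is claimed entirely by ATTR, so the parser never sees the 'audit-template' token it expects.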

Error generating files in ANTLR

So I'm trying to write a parser in ANTLR, this is my first time using it and I'm running into a problem that I can't find a solution for, apologies if this is a very simple problem. Anyway, the error I'm getting is:
"(100): Expr.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52)"
The code I'm currently using is:
grammar Expr.g;
options{
output=AST;
}
tokens{
MAIN = 'main';
OPENBRACKET = '(';
CLOSEBRACKET = ')';
OPENCURLYBRACKET = '{';
CLOSECURLYBRACKET = '}';
COMMA = ',';
SEMICOLON = ';';
GREATERTHAN = '>';
LESSTHAN = '<';
GREATEROREQUALTHAN = '>=';
LESSTHANOREQUALTHAN = '<=';
NOTEQUAL = '!=';
ISEQUALTO = '==';
WHILE = 'while';
IF = 'if';
ELSE = 'else';
READ = 'read';
OUTPUT = 'output';
PRINT = 'print';
RETURN = 'return';
READC = 'readc';
OUTPUTC = 'outputc';
PLUS = '+';
MINUS = '-';
DIVIDE = '/';
MULTIPLY = '*';
PERCENTAGE = '%';
}
@header {
//package test;
import java.util.HashMap;
}
@lexer::header {
//package test;
}
@members {
/** Map variable name to Integer object holding value */
HashMap memory = new HashMap();
}
prog: stat+ ;
stat: expr NEWLINE {System.out.println($expr.value);}
| ID '=' expr NEWLINE
{memory.put($ID.text, new Integer($expr.value));}
| NEWLINE
;
expr returns [int value]
: e=multExpr {$value = $e.value;}
( '+' e=multExpr {$value += $e.value;}
| '-' e=multExpr {$value -= $e.value;}
)*
;
multExpr returns [int value]
: e=atom {$value = $e.value;} ('*' e=atom {$value *= $e.value;})*
;
atom returns [int value]
: INT {$value = Integer.parseInt($INT.text);}
| ID
{
Integer v = (Integer)memory.get($ID.text);
if ( v!=null ) $value = v.intValue();
else System.err.println("undefined variable "+$ID.text);
}
| '(' e=expr ')' {$value = $e.value;}
;
IDENT : ('a'..'z'^|'A'..'Z'^)+ ; : .;
INT : '0'..'9'+ ;
NEWLINE:'\r'? '\n' ;
WS : (' '|'\t')+ {skip();} ;
Thanks for any help.
EDIT: Well, I'm an idiot, it's just a formatting error. Thanks for the responses from those who helped out.
You have some illegal characters after your IDENT token:
IDENT : ('a'..'z'^|'A'..'Z'^)+ ; : .;
The : .; are invalid there. And you're also trying to mix the tree-rewrite operator ^ inside a lexer rule, which is illegal: remove them. Lastly, you've named it IDENT while in your parser rules, you're using ID.
It should be:
ID : ('a'..'z' | 'A'..'Z')+ ;

ANTLR grammar: Add "dynamic" proximity operator

For a study project, I am using the following ANTLR grammar to parse query strings containing some simple boolean operators like AND, NOT and others:
grammar SimpleBoolean;
options { language = CSharp2; output = AST; }
tokens { AndNode; }
@lexer::namespace { INR.Infrastructure.QueryParser }
@parser::namespace { INR.Infrastructure.QueryParser }
LPARENTHESIS : '(';
RPARENTHESIS : ')';
AND : 'AND';
OR : 'OR';
ANDNOT : 'ANDNOT';
NOT : 'NOT';
PROX : **?**
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'ä'|'Ä'|'ü'|'Ü'|'ö'|'Ö');
fragment QUOTE : ('"');
fragment SPACE : (' '|'\n'|'\r'|'\t'|'\u000C');
WS : (SPACE) { $channel=Hidden; };
WORD : (~( ' ' | '\t' | '\r' | '\n' | '/' | '(' | ')' ))*;
PHRASE : (QUOTE)(CHARACTER)+((SPACE)+(CHARACTER)+)+(QUOTE);
startExpression : andExpression;
andExpression : (andnotExpression -> andnotExpression) (AND? a=andnotExpression -> ^(AndNode $andExpression $a))*;
andnotExpression : orExpression (ANDNOT^ orExpression)*;
proxExpression : **?**
orExpression : notExpression (OR^ notExpression)*;
notExpression : (NOT^)? atomicExpression;
atomicExpression : PHRASE | WORD | LPARENTHESIS! andExpression RPARENTHESIS!;
Now I would like to add an operator for so-called proximity queries. For example, the query "A /5 B" should return everything that contains A with B following within the next 5 words. The number 5 could be any other positive int of course. In other words, a proximity query should result in the following syntax tree:
http://graph.gafol.net/pic/ersaDEbBJ.png
Unfortunately, I don't know how to (syntactically) add such a "PROX" operator to my existing ANTLR grammar.
Any help is appreciated. Thanks!
You could do that like this:
PROX : '/' '0'..'9'+;
...
startExpression : andExpression;
andExpression : (andnotExpression -> andnotExpression) (AND? a=andnotExpression -> ^(AndNode $andExpression $a))*;
andnotExpression : proxExpression (ANDNOT^ proxExpression)*;
proxExpression : orExpression (PROX^ orExpression)*;
orExpression : notExpression (OR^ notExpression)*;
notExpression : (NOT^)? atomicExpression;
atomicExpression : PHRASE | WORD | LPARENTHESIS! andExpression RPARENTHESIS!;
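To make the intended meaning of a PROX node concrete, here is a minimal Java sketch of how such a node might be evaluated against a list of words. It is an illustration under assumptions, not part of the grammar: the names ProximitySketch, distance and matches are made up, and "within the next N words" is read as "at most N positions after A".

```java
import java.util.Arrays;
import java.util.List;

public class ProximitySketch {
    // Extracts the distance from a PROX token's text such as "/5".
    static int distance(String proxToken) {
        return Integer.parseInt(proxToken.substring(1));
    }

    // True if `b` occurs at most `distance` words after some occurrence of `a`.
    static boolean matches(List<String> words, String a, String b, int distance) {
        for (int i = 0; i < words.size(); i++) {
            if (!words.get(i).equals(a)) {
                continue;
            }
            int last = Math.min(words.size() - 1, i + distance);
            for (int j = i + 1; j <= last; j++) {
                if (words.get(j).equals(b)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("A", "x", "y", "B");
        System.out.println(matches(doc, "A", "B", distance("/5"))); // B is 3 words after A
        System.out.println(matches(doc, "A", "B", distance("/2"))); // B is too far away
    }
}
```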
If you parse the input:
A /500 B OR D NOT E AND F
the following AST is created:
