I am trying to parse a message using ANTLR4:
:12B:DOCUMENT:some nice text
DOCUMENT2:some nice text
This is the expected output from the parser:
heading -> 12B
subheading -> DOCUMENT
subheading -> DOCUMENT2
TEXT -> some nice text
TEXT -> some nice text
But when I try to extract the heading with the following grammar:
grammar Simple;
para : heading* EOF;
header : heading text ;
heading : COLEN HEAD COLEN;
text : TEXT;
/* tokens */
TEXT : ~[\t]+ ;
HEAD : [0-9A-Z]+ ;
COLEN : ':';
On supplying the input, I get the following error:
line 1:0 mismatched input ':12:nithin\n' expecting ':'
Could someone please tell me the possible cause and solution to parse the same?
I'm trying to use nearley.js to write a parser for INI-like files, with the difference that a string value may contain special control symbols. For example, ^y means the text after this symbol must be yellow, ^b means blue, and &i means italic.
I use the nearley playground (http://omrelli.ug/nearley-playground/) and started with a very basic grammar for the value:
VALUE -> FONT_MODIFIER | COLOR_MODIFIER | TEXT
TEXT -> [^\n\^\&]:+
FONT_MODIFIER -> "&" [iIbBsS]
COLOR_MODIFIER -> "^" [aAbBcCdDfFgGiIkKmMoOpPrRsSwWyYnN]
But after I add a test with random text (just letters, like "asdassad"), after a few seconds it gives me the error "Possible infinite loop detected! Check your grammar for infinite recursion."
What am I doing wrong? I just can't see where the loop comes from.
I'm trying to understand how ANTLR grammars work and I've come across a situation where it behaves unexpectedly and I can't explain why or figure out how to fix it.
Here's the example:
root : title '\n' fields EOF;
title : STR;
fields : field_1 field_2;
field_1 : 'a' | 'b' | 'c';
field_2 : 'd' | 'e' | 'f';
STR : [a-z]+;
There are two parts:
A title that is a lowercase string with no special characters
A two character string representing a set of possible configurations
When I go to test the grammar, here's what happens: first I write the title and, on a new line, give the character for the first field. So far so good. The parse tree looks as I would expect up to this point.
When I add the next field is when the problem comes up. ANTLR decides to reinterpret the line as an instance of STR instead of a concatenation of the fields that I was expecting.
I do not understand why ANTLR tries to force an unrelated terminal expression when it wasn't specified as an option by the grammar. Shouldn't it know to only look for characters matching the field rules since it is descended from the fields node in the parse tree? What's going on here and how do I write my ANTLR grammars so they don't have this problem?
I've read that ANTLR tries to match the format greedily from the top of the grammar to the bottom, but that doesn't explain what is happening here, because the STR terminal is the very last line in the file. If ANTLR gives special precedence to matching terminals, how do I format the grammar so that it interprets the input properly? As far as I understand, regexes do not work for non-terminals, so it seems that I have to define it the way it is now.
A note of clarification: this is just an example of a possible grammar that I'm trying to make work with the text format as is, so I'm not looking for answers like adding a space between the fields or changing the title to be uppercase.
What I didn't understand before is that there are two steps in generating a parser:
Scanning the input for a list of tokens using the lexer rules (uppercase statements) and then...
Constructing a parse tree using the parser rules (lowercase statements) and generated tokens
My problem was that there was no way for ANTLR to know I wanted that specific string to be interpreted differently when it was generating the tokens. To fix this problem, I wrote a new lexer rule for the fields string so that it would be identifiable as a token. The key was making the FIELDS rule appear before the STR rule, because when two lexer rules match the same number of characters, ANTLR picks the one that appears first.
root : title '\n' FIELDS EOF;
title : STR;
FIELDS : [a-c] [d-f];
STR : [a-z]+;
Note: I had to bite the bullet and read the ANTLR Mega Tutorial to figure this out.
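The two steps can be imitated outside ANTLR. The following is a minimal Python sketch, not ANTLR's actual implementation: it assumes the lexer picks the longest match and breaks ties in favor of the rule listed first. The `tokenize` helper and the `NL` rule are assumptions added for illustration.

```python
import re

# Lexer rules in declaration order; FIELDS and STR mirror the grammar above.
# NL is an assumption added so the sketch can consume the line break.
RULES = [
    ("FIELDS", re.compile(r"[a-c][d-f]")),
    ("STR", re.compile(r"[a-z]+")),
    ("NL", re.compile(r"\n")),
]


def tokenize(text, rules=RULES):
    """Return (rule name, lexeme) pairs: longest match wins, ties go to the earlier rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            match = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# FIELDS is listed before STR, so the tie on "ad" goes to FIELDS.
print(tokenize("title\nad"))
# With STR listed first, the same input is tokenized as STR again.
print(tokenize("title\nad", rules=[RULES[1], RULES[0], RULES[2]]))
```

Note that a longer match still beats rule order in this sketch: "adx" becomes a single STR token, because STR matches three characters and FIELDS only two.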
I've the following island grammar that works fine (and I think as expected):
lexer grammar FastTestLexer;
// Default mode rules (the SEA)
OPEN1 : '#' -> mode(ISLAND) ; // switch to ISLAND mode
OPEN2 : '##' -> mode(ISLAND);
OPEN3 : '###' -> mode(ISLAND);
OPEN4 : '####' -> mode(ISLAND);
LISTING_OPEN : '~~~~~' -> mode(LISTING);
NL : [\r\n]+;
TEXT : ~('#'|'~')+; // ~('#'|'~')+ ; // clump all text together
mode ISLAND;
CLOSE1 : '#' -> mode(DEFAULT_MODE) ; // back to SEA mode
CLOSE2 : '##' -> mode(DEFAULT_MODE) ; // back to SEA mode
CLOSE3 : '###' -> mode(DEFAULT_MODE) ; // back to SEA mode
CLOSE4 : '####' -> mode(DEFAULT_MODE) ; // back to SEA mode
INLINE : ~'#'+ ; // clump all text together
mode LISTING;
LISTING_CLOSE : '~~~~~' -> mode(DEFAULT_MODE);
INLINE_LISTING : ~'~'+; //~('~'|'#')+;
And the parser grammar:
parser grammar FastTextParser;
options { tokenVocab=FastTestLexer; } // use tokens from FastTestLexer.g4
dnpMD
: subheadline NL headline NL lead (subheading | listing | text | NL)*
;
headline
: OPEN1 INLINE CLOSE1
;
subheadline
: OPEN2 INLINE CLOSE2
;
lead
: OPEN3 INLINE CLOSE3
;
subheading
: OPEN4 INLINE CLOSE4
;
listing
: LISTING_OPEN INLINE_LISTING LISTING_CLOSE
;
text
: TEXT
;
Input text like this works fine:
## Heading2 ##
# Heading1 #
### Heading3 ###
fffff
#### Heading4 ####
I'm a line.
~~~~~
ffffff
~~~~~
I'm a line, too.
#### Heading4a ####
The TEXT lexer token matches all the text except '#' and '~', so the parser knows when headings and listings are coming.
My problem is that within the text both characters '#' and '~' should be allowed. The single '#' is only needed for the heading and this parser rule is not active within the body (just one heading at the beginning of the document).
Is there a way to allow '#' and '~' within the text without escaping? My first thought was to disallow '##' within the text:
TEXT : ~('##'|'~')+;
But multiple characters are not allowed there. :(
Maybe someone can give me a hint. But I think this isn't solvable at all. Not solvable with ANTLR4 I mean. Maybe there's another technology.
You could try to do more work in the parser and less in the lexer. Allow # and ~ inside text and not inside TEXT, something similar to:
text
: TEXT
| OPEN1
| TEXT text
| OPEN1 text
;
Adjust the rules for the headlines etc. accordingly.
That way, the lexer does not have to decide what a # (or ~) means, which can be relatively hard because the lexer does not know the context; it only reports that it has seen a hash sign. The parser then decides what the sign means, since it knows the context in which the sign appears.
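To illustrate that division of labor, here is a hand-written Python sketch rather than an ANTLR grammar. It handles only '#', and the names `lex`, `parse`, and `classify` as well as the line-based heading shape are assumptions made for the example. The lexer reports hash runs without interpreting them, and the parsing step decides from the shape of the line whether they delimit a heading or are ordinary text.

```python
import re

# Context-free token patterns: the lexer only reports what it saw.
TOKEN_RE = re.compile(r"(?P<HASH>#+)|(?P<NL>\r\n|\r|\n)|(?P<TEXT>[^#\r\n]+)")


def lex(source):
    """Return (kind, lexeme) pairs without deciding what a hash sign means."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]


def classify(line):
    """A line shaped HASH TEXT HASH with equal hash runs is a heading; anything else is text."""
    kinds = [kind for kind, _ in line]
    if kinds == ["HASH", "TEXT", "HASH"] and line[0][1] == line[2][1]:
        return ("heading", len(line[0][1]), line[1][1].strip())
    return ("text", "".join(lexeme for _, lexeme in line))


def parse(source):
    """Group tokens into lines and let the line's token shape decide its meaning."""
    nodes, line = [], []
    # A trailing NL is appended so the last line is always flushed.
    for kind, lexeme in lex(source) + [("NL", "\n")]:
        if kind != "NL":
            line.append((kind, lexeme))
            continue
        if line:
            nodes.append(classify(line))
        line = []
    return nodes


print(parse("#### Heading4 ####\nIssue #42 is fixed\n"))
```

The hash sign in "Issue #42" produces the same HASH token as the heading delimiters; only the parsing step treats the two cases differently.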
I have a very simple example text which I want to parse with ANTLR, and yet I'm getting wrong results due to ambiguous definition of the rule.
Here is the grammar:
grammar SimpleExampleGrammar;
prog : event EOF;
event : DEFINE EVT_HEADER eventName=eventNameRule;
eventNameRule : DIGIT+;
DEFINE : '#define';
EVT_HEADER : 'EVT_';
DIGIT : [0-9a-zA-Z_];
WS : ('' | ' ' | '\r' | '\n' | '\t') -> channel(HIDDEN);
First text example:
#define EVT_EX1
Second text example:
#define EVT_EX1
#define EVT_EX2
So, the first example is parsed correctly.
However, the second example doesn't work: eventNameRule matches the next "#define ..." and the parse tree is incorrect.
Appreciate any help to change the grammar to parse this correctly.
Thanks,
Busi
Besides the missing loop specifier, you also have a problem in your WS rule: the first alternative ('') is an empty literal. Remove it. And, by the way, give your DIGIT rule a different name, since it matches more than just digits.
As Adrian pointed out, my main mistake was that in the initial rule (prog) I used "event" instead of "event+". Changing that solves the issue.
Thanks Adrian.
I'm trying to build a grammar for a recognizer of a SPICE-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master's thesis in which the student did the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler-writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends the LEX and YACC code files of the parser to the work.
I'm basing my work on this thesis to create a simpler grammar for a recognizer of a more restrictive SPICE language.
I read the ANTLR manual, but could not figure out how to solve two problems, which the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE matches a subset of what the VALUE token matches. The grammar correctly interprets an input like "R1 5 0 1.1\n", but cannot interpret an input like "R1 5 0 1\n", because it maps "1" to the token NODE instead of VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone have an idea how I can map the "1" to the correct token VALUE, or a suggestion for how to alter the grammar so that it interprets the input correctly?
The second problem is a comment at the end of a line. The NEWLINE token delimits both (1) the end of a comment and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are needed for the parser to recognize the line of code correctly; otherwise, just one is needed. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as many characters as possible. If two rules match the same number of characters, the rule defined first "wins". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
In other words, I removed VALUE, added value, and removed fragment from REAL.
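The effect of the new parser rule can be checked on token kinds alone. The sketch below is a hand-written Python approximation of the resistor rule, not code generated by ANTLR; the function name `matches_resistor` and the representation of tokens as plain strings are assumptions for illustration.

```python
def matches_resistor(kinds):
    """Check a token-kind sequence against: RES NODE NODE value 'G2'? COMMENT? NEWLINE."""
    value = {"NODE", "REAL"}  # value : NODE | REAL
    rest = list(kinds)
    if rest[:3] != ["RES", "NODE", "NODE"]:
        return False
    rest = rest[3:]
    # The fourth token may be either kind, so an integer like "1" is accepted.
    if not rest or rest[0] not in value:
        return False
    rest = rest[1:]
    # 'G2' and COMMENT are optional, in that order.
    for optional in ("G2", "COMMENT"):
        if rest and rest[0] == optional:
            rest = rest[1:]
    return rest == ["NEWLINE"]


# "R1 5 0 1\n": the lexer produces NODE for "1", and value accepts it.
print(matches_resistor(["RES", "NODE", "NODE", "NODE", "NEWLINE"]))
# "R1 5 0 1.1\n": the lexer produces REAL for "1.1".
print(matches_resistor(["RES", "NODE", "NODE", "REAL", "NEWLINE"]))
```

The lexer still tokenizes "1" as NODE; the difference is that the parser no longer insists on a token kind the lexer can never produce for that input.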
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.
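The difference between the two COMMENT rules can be seen with ordinary regular expressions. This is a rough Python sketch: the two patterns approximate the ANTLR rules rather than reproduce them, and the sample line and its comment text are made up for illustration.

```python
import re

# Approximation of the original rule: the comment swallows the line break.
OLD_COMMENT = re.compile(r"%.*?\r?\n", re.DOTALL)
# Approximation of the fixed rule: the comment stops before the line break.
NEW_COMMENT = re.compile(r"%[^\r\n]*")

line = "R1 5 0 1.1 % load resistor\n"
start = line.index("%")

old = OLD_COMMENT.match(line, start)
new = NEW_COMMENT.match(line, start)

# Whatever follows the comment is what remains for the NEWLINE token.
print(repr(line[old.end():]))
print(repr(line[new.end():]))
```

With the old pattern nothing is left after the comment, which is why a second line break was needed; with the new pattern the single line break remains available to end the line of code.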