ANTR3 set the number of accepted characters for a token - parsing

I have to create a Lexer which will accept for example an integer only if it has a maximum of 8 digits. Is here an alternative to do it rather than just writing it like this?
INTEGER : (DIG | DIG DIG | DIG DIG DIG | ...)

This can be done using a Gated Semantic Predicates like this:
INTEGER
#init{int n = 1;}
: ({n <= 8}?=> DIGIT {n++;})+
;
fragment DIGIT : '0'..'9';
Details about this kind of predicate, see: What is a 'semantic predicate' in ANTLR?

Related

How to differentiate between a parser and lexer rule in the following?

Sometimes I get a bit confused between a lexing rule vs. a parsing rule, and there's been a nice thread on it here. For example in the following:
value
: string CAST_OPERATOR type
;
string
: S_QUOTE STRING_VALUE S_QUOTE
;
# <-- what is this?
type
: 'date' | 'string'
;
STRING_VALUE
: [a-zA-Z0-9-]+
;
CAST_OPERATOR
: '::'
;
For the type -- this is either the string (or character stream) date or string. Should that be a lexing rule or a parsing rule? I suppose I could break it down even more into:
type
: DATE_TYPE | STRING_TYPE
;
DATE_TYPE
: 'date'
;
STRING_TYPE
: 'string'
;
But still I'm not quite sure which of the above is preferable, and why it would be so. The first two rules -- value and string seem clear to me to be parsing rules -- and the last two rules -- STRING_VALUE and CAST_OPERATOR seem clear to me to be lexing rules (only by intuition though, I could not give a proper explanation). So why would the type be one way or the other?
Literally the only practical difference I've found is that a lexing rule can include a character class and a parsing rule cannot.
Update: I suppose another thing is a lexing rule is terminal, it won't provide any subdivision of parts. For example in the following we can break down $55 into $ and 55:
But if we set the cost as a lexing rule, it will not break it down any further:
So basically a lexing rule is atomic and terminal, whereas a parsing rule is more like a molecule that consists of various parts (atoms) that can be seen within it. Is that a good description/understanding of it?
Your "Update" is on the right track. That's a definite distinction.
You also need to understand the ANTLR pipeline. I.e. that the stream of characters is processed by the Lexer rules to produce a stream of tokens (atoms, in you analogy). It does not do that with recursive descent rule matching, but rather attempts to match you input against all of the Lexer rules. Where:
The rule that matches the longest sequence of input characters will "win"
In the event that multiple Lexer rules match the same length character sequence, then the rule that occurs first will "win"
Once you've got you stream of "atoms" (aka Tokens), then ANTLR uses the parser rules (recursively from the start rule) to try to match sequences of tokens.

How to understand ANTLRWorks 1.5.2 MismatchedTokenException(80!=21)

I'm testing a simple grammar (shown below) with simple input strings and get the following error message from the Antlrworks interpreter: MismatchedTokenException(80!=21).
My input (abc45{r24}) means "repeat the keys a, b, c, 4 and 5, 24 times."
ANTLRWorks 1.5.2 Grammar:
expr : '(' (key)+ repcount ')' EOF;
key : KEY | digit ;
repcount : '{' 'r' count '}';
count : (digit)+;
digit : DIGIT;
DIGIT : '0'..'9';
KEY : ('a'..'z'|'A'..'Z') ;
Inputs:
(abc4{r4}) - ok
(abc44{r4}) - fails NoViableAltException
(abc4 4{r4}) - ok
(abc4{r45}) - fails MismatchedTokenException(80!=21)
(abc4{r4 5}) - ok
The parse succeeds with input (abc4{r4}) (single digits only).
The parse fails with input (abc44{r4}) (NoViableAltException).
The parse fails with input (abc4{r45}) (MismatchedTokenException(80!=21)).
The parse errors go away if I put a space between 44 or 45 to separate the individual digits.
Q1. What does NoViableAltException mean? How can I interpret it to look for a problem in the grammar/input pair?
Q2. What does the expression 80!=21 mean? Can I do anything useful with the information to look for a problem in the grammar/input pair?
I don't understand why the grammar has a problem reading successive digits. I thought my expressions (key)+ and (digit)+ specify that successive digits are allowed and would be read as successive individual digits.
If someone could explain what I'm doing wrong, I would be grateful. This seems like a simple problem, but hours later, I still don't understand why and how to solve it. Thank you.
UPDATE:
Further down in my simple grammar file I had a lexer rule for FLOAT copied from another grammar. I did not think to include it above (or check it as a source of the errors) because it was not used by any parser rule and would never match my input characters. Here is the FLOAT grammar rule (which contains sequences of DIGITs):
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
If I delete the whole rule, all my test cases above parse successfully. If I leave any one of the three FLOAT clauses in the grammar/lexer file, the parses fail as shown above.
Q3. Why does the FLOAT rule cause failures in the parse? The DIGIT lexer rule appears first, and so should "win" and be used in preference to the FLOAT rule. Besides, the FLOAT rule doesn't match the input stream.
I hazard a guess that the lexer is skipping the DIGIT rule getting stuck in the FLOAT rule, even though FLOAT comes after DIGIT in the input file.
SCREENSHOTS
I took these two screenshots after Bart's comment below to show the parse failures that I am experiencing. Not that it matters, but ANTLRWorks 1.5.2 will not accept the syntax SPACE : [ \t\r\n]+; regular expression syntax in Bart's kind replies. Maybe the screenshots will help. They show all the rules in my grammar file.
The only difference in the two screenshots is that one input has two sets of multiple digits and the other input string has only set of multiple digits. Maybe this extra info will help somehow.
If I remember correctly, ANTLR's v3 lexer is less powerful than v4's version. When the lexer gets the input "123x", this first 3 chars (123) are consumed by the lexer rule FLOAT, but after that, when the lexer encounters the x, it knows it cannot complete the FLOAT rule. However, the v3 lexer does not give up on its partial match and tries to find another rule, below it, that matches these 3 chars (123). Since there is no such rule, the lexer throws an exception. Again, not 100% sure, this is how I remember it.
ANTLRv4's lexer will give up on the partial 123 match and will return 23 to the char stream to create a single KEY token for the input 1.
I highly suggest you move away from v3 and opt for the more powerful v4 version.

Ambiguous ANTLR parser rule

I have a very simple example text which I want to parse with ANTLR, and yet I'm getting wrong results due to ambiguous definition of the rule.
Here is the grammar:
grammar SimpleExampleGrammar;
prog : event EOF;
event : DEFINE EVT_HEADER eventName=eventNameRule;
eventNameRule : DIGIT+;
DEFINE : '#define';
EVT_HEADER : 'EVT_';
DIGIT : [0-9a-zA-Z_];
WS : ('' | ' ' | '\r' | '\n' | '\t') -> channel(HIDDEN);
First text example:
#define EVT_EX1
Second text example:
#define EVT_EX1
#define EVT_EX2
So, the first example is parsed correctly.
However, the second example doesn't work, as the eventNameRule matches the next "#define ..." and the parse tree is incorrect
Appreciate any help to change the grammar to parse this correctly.
Thanks,
Busi
Beside the missing loop specifier you also have a problem in your WS rule. The first alt matches anything. Remove that. And, btw, give your DIGIT rule a different name. It matches more than just digits.
As Adrian pointed out, my main mistake here is that in the initial rule (prog) I used "event" and not "event+" this will solve the issue.
Thanks Adrian.

Lexer predicate for XPath 3 comments

I am trying to implement an XPath 3 parser in Antlr 4. In the EBNF given in the XPath specification it makes use of - to indicate that something should be excluded, if I understand correctly, then in Antlr I can use a predicate instead to achieve the same behaviour.
I am struggling with implementing CommentContents from the EBNF, as I am not quite sure how to construct the predicate. This is what I have so far:
/** [2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
*
* //any Unicode character, excluding the surrogate blocks, FFFE, and FFFF
*/
Char : '\u0001'..'\uD7FF' | '\uE000'..'\uFFFD' | '\u10000'..'\u10FFFF' ;
/** [108] CommentContents ::= (Char+ - (Char* ('(:' | ':)') Char*)) */
CommentContents : Char+ { $Char+.text.indexOf("(:") + $Char+.text.indexOf(":)") == 0 } ;
Can someone confirm if I have the predicate for CommentContents correct so that it matches the intention of the EBNF?
You need {...}? not {...}. Also $Char+.text won't work. I suggest making Java function that does the test, returning boolean then just call it.

Grammar for a recognizer of a spice-like language

I'm trying to build a grammar for a recognizer of a spice-like language using Antlr-3.1.3 (I use this version because of the Python target). I don't have experience with parsers. I've found a master thesis where the student has done the syntactic analysis of the SPICE 2G6 language and built a parser using the LEX and YACC compiler writing tools. (http://digitool.library.mcgill.ca/R/?func=dbin-jump-full&object_id=60667&local_base=GEN01-MCG02) In chapter 4, he describes a grammar in Backus-Naur form for the SPICE 2G6 language, and appends to the work the LEX and YACC code files of the parser.
I'm basing myself in this work to create a simpler grammar for a recognizer of a more restrictive spice language.
I read the Antlr manual, but could not figure out how to solve two problems, that the code snippet below illustrates.
grammar Najm_teste;
resistor
: RES NODE NODE VALUE 'G2'? COMMENT? NEWLINE
;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+; // non-negative integer
VALUE : REAL; // non-negative real
fragment
SIG : '+'|'-';
fragment
DIG : '0'..'9';
fragment
EXP : ('E'|'e') SIG? DIG+;
fragment
FLT : (DIG+ '.' DIG+)|('.' DIG+)|(DIG+ '.');
fragment
REAL : (DIG+ EXP?)|(FLT EXP?);
COMMENT : '%' ( options {greedy=false;} : . )* NEWLINE;
NEWLINE : '\r'? '\n';
WS : (' '|'\t')+ {$channel=HIDDEN;};
// END:tokens
In the grammar above, the token NODE is a subset of the set represented by the VALUE token. The grammar correctly interprets an input like "R1 5 0 1.1/n", but cannot interpret an input like "R1 5 0 1/n", because it maps "1" to the token NODE, instead of mapping it to the token VALUE, as NODE comes before VALUE in the tokens section. Given such inputs, does anyone has an idea of how can I map the "1" to the correct token VALUE, or a suggestion of how can I alter the grammar so that I can correctly interpret the input?
The second problem is the presence of a comment at the end of a line. Because the NEWLINE token delimits: (1) the end of a comment; and (2) the end of a line of code. When I include a comment at the end of a line of code, two newline characters are necessary to the parser correctly recognize the line of code, otherwise, just one newline character is necessary. How could I improve this?
Thanks!
Problem 1
The lexer does not "listen" to the parser. The lexer simply creates tokens that contain as much characters as possible. In case two tokens match the same amount of characters, the token defined first will "win". In other words, "1" will always be tokenized as a NODE, even though the parser is trying to match a VALUE.
You can do something like this instead:
resistor
: RES NODE NODE value 'G2'? COMMENT? NEWLINE
;
value : NODE | REAL;
// START:tokens
RES : ('R'|'r') DIG+;
NODE : DIG+;
REAL : (DIG+ EXP?) | (FLT EXP?);
...
E.g., I removed VALUE, added value and removed fragment from REAL
Problem 2
Do not let the comment match the line break:
COMMENT : '%' ~('\r' | '\n')*;
where ~('\r' | '\n')* matches zero or more chars other than line break characters.

Resources