ANTLR4: How can I recognize words from an alphabet? - parsing

I am new to ANTLR4. I have an ANTLR grammar file that consists of something similar to:
consonant : 'b' | 'c' | 'd' | 'f' ;
vowel : 'a' | 'e' | 'i' ;
connector : ':' | '-' ;
cseq : (consonant)+ ;
vseq : (vowel)+ ;
prefix : cseq vseq ;
word : (cseq vseq | cseq)+ ;
From my understanding, even though these lines are at the bottom of a file, they're still considered rules. My parse tree captures each individual letter instead of treating them as lexical items - or words. How can I change these rules into lexer statements?

A couple of things to keep in mind.
Parser rules are rules whose names begin with a lowercase letter.
Lexer rules are rules whose names begin with an uppercase letter (a fairly common convention is to make them all uppercase).
If you put a literal character in a parser rule (all of your rules are parser rules, as they begin with lowercase letters), ANTLR will synthesize an implicit token for each of those literals.
Since it appears that you want a word to be a lexical item (i.e. Token), you could do something along the lines of:
fragment CONSONANT : 'b' | 'c' | 'd' | 'f' ;
fragment VOWEL : 'a' | 'e' | 'i' ;
CONNECTOR : ':' | '-' ; // not sure what you intend for this
fragment CSEQ: CONSONANT+ ;
fragment VSEQ : VOWEL+ ;
PREFIX : CSEQ VSEQ ; // not sure what you intend for this
WORD : (CSEQ VSEQ | CSEQ)+ ;
(That's making quite a few assumptions about your intention.)
The main point: if you want WORDs to be single tokens, WORD needs to be defined as a lexer rule.
If you want to build lexer rules out of smaller pieces, you can define fragment rules. Fragments can be referenced from lexer rules but are never recognized as tokens themselves.
With the changes here, you should be able to use WORD in a parser rule, and have all the characters that make up your WORD in a single Token.
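A minimal sketch of how the pieces above could fit into one combined grammar. The grammar name Words, the text parser rule, and the WS rule are assumptions added for illustration and are not from the question. PREFIX is left out here because it is declared before WORD and matches a subset of the same input: for an input such as ba both rules match the same length, and ANTLR then picks the rule declared first.

```antlr
grammar Words;

// Parser rule (lowercase name); hypothetical, for illustration only.
text : WORD (CONNECTOR WORD)* EOF ;

// Lexer rules (uppercase names): each WORD arrives as a single token.
WORD      : (CSEQ VSEQ | CSEQ)+ ;
CONNECTOR : ':' | '-' ;
WS        : [ \t\r\n]+ -> skip ;   // assumption: whitespace is insignificant

// Fragments only help build other lexer rules; they never become tokens.
fragment CSEQ      : CONSONANT+ ;
fragment VSEQ      : VOWEL+ ;
fragment CONSONANT : 'b' | 'c' | 'd' | 'f' ;
fragment VOWEL     : 'a' | 'e' | 'i' ;
```

With a grammar along these lines, an input such as bad-bid should produce three tokens (WORD, CONNECTOR, WORD) rather than one token per letter.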

Related

Does antlr automatically discard whitespace?

I've written the following arithmetic grammar:
grammar Calc;
program
: expressions
;
expressions
: expression (NEWLINE expression)*
;
expression
: '(' expression ')' // parenExpression has highest precedence
| expression MULDIV expression // then multDivExpression
| expression ADDSUB expression // then addSubExpression
| OPERAND // finally the operand itself
;
MULDIV
: [*/]
;
ADDSUB
: [-+]
;
// 12 or .12 or 2. or 2.38
OPERAND
: [0-9]+ ('.' [0-9]*)?
| '.' [0-9]+
;
NEWLINE
: '\n'
;
And I've noticed that regardless of how I space the tokens I get the same result, for example:
1+2
2+3
Or:
1 +2
2+3
Still give me the same thing. Also I've noticed that adding in the following rule does nothing for me:
WS
: [ \r\n\t]+ -> skip
;
Which makes me wonder whether skipping whitespace is the default behavior of antlr4?
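When the lexer meets a character that no lexer rule matches (the space here), it reports a token recognition error, drops the character, and carries on. ANTLR4-based parsers can likewise skip over a single unwanted token or supply a single missing one and continue parsing if possible, which is why the result looks the same. There is no default that ignores whitespace: you always have to specify a whitespace rule that either skips it or puts it on a hidden channel.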
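The two options mentioned above could look roughly like this for the Calc grammar. This is a sketch under one assumption: because NEWLINE separates expressions in that grammar, the whitespace rule deliberately leaves '\n' out so that newlines still reach the parser.

```antlr
// Option 1: discard spaces, tabs and carriage returns entirely.
WS : [ \t\r]+ -> skip ;

// Option 2 (use instead of option 1): keep the whitespace tokens,
// but route them to the hidden channel, where the parser ignores them.
// WS : [ \t\r]+ -> channel(HIDDEN) ;

// Newlines remain significant, as in the original grammar.
NEWLINE : '\n' ;
```

The hidden channel is the usual choice when a later tool (a formatter, for instance) still needs to see the original spacing.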

Why use the fragment modifier here?

I've seen the use of fragment quite frequently within a Lexing rule, but not quite sure what its use is, or why it cannot just be removed. For example in the following rule:
NUMBER
: DECIMAL ([Ee] [+-]?[0-9]+)?
;
fragment DECIMAL
: [0-9]+ ('.' [0-9]*)? | '.' [0-9]+
;
When I remove the fragment I still get the same parse tree. So what exactly is the use of using fragment or is it mainly an annotative type of thing?
As another example from this tutorial page:
Fragments are reusable parts of lexer rules which cannot match on their own - they need to be referenced from a lexer rule.
INTEGER: DIGIT+
| '0' [Xx] HEX_DIGIT+
;
fragment DIGIT: [0-9];
fragment HEX_DIGIT: [0-9A-Fa-f];
I see no difference between that version and the same rules written without the fragment modifiers.
Could someone please explain why these would be useful then?
The fragment declaration prevents the part from being recognized as a token. That might not be necessary very often, but it can definitely save you from hard-to-find bugs.
Let's take the second example in your post, without the fragment modifiers:
expression: INTEGER ;
INTEGER: DIGIT+
| '0' [Xx] HEX_DIGIT+
;
DIGIT: [0-9];
HEX_DIGIT: [0-9A-Fa-f];
Now, we decide that we want to add variables to the grammar:
expression: INTEGER | IDENTIFIER ;
INTEGER: DIGIT+
| '0' [Xx] HEX_DIGIT+
;
DIGIT: [0-9];
HEX_DIGIT: [0-9A-Fa-f];
IDENTIFIER: LETTER (LETTER | DIGIT)* ;
LETTER: [A-Za-z] ;
Do you see the bug?
The parser won't handle the input a, although it has no trouble with ax or i. That's because the tokeniser will interpret a as a HEX_DIGIT, not an IDENTIFIER.
Of course, I could have prevented that by putting HEX_DIGIT after IDENTIFIER, but that's more thinking about lexer rule ordering than I really want to do. I'd like the implementation details of IDENTIFIER and INTEGER to not interfere with each other, thank you very much.
Correctly flagging non-token fragments, like LETTER, DIGIT and HEX_DIGIT, saves me from having to think about whether a fragment might somehow manage to hijack a token definition somewhere else in the file.
Here's a possibly more pernicious example, based on your first example:
NUMBER : DECIMAL EXPONENT? ;
EXPONENT: [Ee] [+-]? [0-9]+ ;
DECIMAL : [0-9]+ ('.' [0-9]*)? | '.' [0-9]+ ;
Once I add expressions to that grammar, I'll find that f+17 is fine, but e+17 is a syntax error. Why? Because it is recognised as an EXPONENT, rather than being parsed as an expression. No reordering of lexical rules will prevent that. But adding the fragment modifiers does the trick.
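For comparison, here are the same two examples with the helper rules marked as fragments. Only the fragment keyword has been added (and the identifier body is written with *, so that a single letter counts as an identifier); nothing else differs from the rules shown above.

```antlr
// First example: e+17 can no longer be lexed as an EXPONENT token,
// because EXPONENT now exists only as a building block of NUMBER.
NUMBER            : DECIMAL EXPONENT? ;
fragment EXPONENT : [Ee] [+-]? [0-9]+ ;
fragment DECIMAL  : [0-9]+ ('.' [0-9]*)? | '.' [0-9]+ ;

// Second example: a single letter such as a is now an IDENTIFIER,
// since HEX_DIGIT and LETTER can no longer produce tokens of their own.
INTEGER            : DIGIT+
                   | '0' [Xx] HEX_DIGIT+
                   ;
IDENTIFIER         : LETTER (LETTER | DIGIT)* ;
fragment DIGIT     : [0-9] ;
fragment HEX_DIGIT : [0-9A-Fa-f] ;
fragment LETTER    : [A-Za-z] ;
```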

ANTLR4 - How to tokenize differently inside quotes?

I am defining an ANTLR4 grammar and I'd like it to tokenize certain - but not all - things differently when they appear inside double-quotes than when they appear outside double-quotes. Here's the grammar I have so far:
grammar SimpleGrammar;
AND: '&';
TERM: TERM_CHAR+;
PHRASE_TERM: (TERM_CHAR | '%' | '&' | ':' | '$')+;
TRUNCATION: TERM '!';
WS: WS_CHAR+ -> skip;
fragment TERM_CHAR: 'a' .. 'z' | 'A' .. 'Z';
fragment WS_CHAR: [ \t\r\n];
// Parser rules
expr:
expr AND expr
| '"' phrase '"'
| TERM
| TRUNCATION
;
phrase:
(TERM | PHRASE_TERM | TRUNCATION)+
;
The above grammar works when parsing a! & b, which correctly parses to:
AND
/ \
/ \
a! b
However, when I attempt to parse "a! & b", I get:
line 1:4 extraneous input '&' expecting {'"', TERM, PHRASE_TERM, TRUNCATION}
The error message makes sense, because the & is getting tokenized as AND. What I would like to do, however, is have the & get tokenized as a PHRASE_TERM when it appears inside of double-quotes (inside a "phrase"). Note, I do want the a! to tokenize as TRUNCATION even when it appears inside the phrase.
Is this possible?
It is possible if you use lexer modes, which let the lexer switch to a different set of rules after it encounters a specific token. Modes are only allowed in a separate lexer grammar, not in a combined grammar.
In your case, you would switch to a second mode when you encounter the opening quote and switch back to the default mode when you encounter the closing quote. The general pattern looks like this:
LBRACK : '[' -> pushMode(CharSet);
RBRACK : ']' -> popMode;
For more information google 'ANTLR lexer Mode'
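Applied to the grammar in the question, the idea might be sketched as follows. The grammar names SimpleLexer and SimpleParser, the mode name IN_PHRASE, and the token names QUOTE, END_QUOTE, PHRASE_TRUNCATION and PHRASE_WS are all hypothetical; treat this as one possible layout rather than the definitive solution.

```antlr
// File SimpleLexer.g4 (modes require a separate lexer grammar)
lexer grammar SimpleLexer;

AND        : '&' ;
TRUNCATION : TERM '!' ;
TERM       : TERM_CHAR+ ;
QUOTE      : '"' -> pushMode(IN_PHRASE) ;   // opening quote: enter phrase mode
WS         : [ \t\r\n]+ -> skip ;

fragment TERM_CHAR : [a-zA-Z] ;

mode IN_PHRASE;

// a! is still reported as TRUNCATION inside a phrase (longest match wins).
PHRASE_TRUNCATION : TERM_CHAR+ '!' -> type(TRUNCATION) ;
// Everything else, including '&', becomes a PHRASE_TERM.
PHRASE_TERM       : (TERM_CHAR | '%' | '&' | ':' | '$')+ ;
PHRASE_WS         : [ \t\r\n]+ -> skip ;
// Closing quote: reuse the QUOTE token type and return to the default mode.
END_QUOTE         : '"' -> type(QUOTE), popMode ;

// File SimpleParser.g4
parser grammar SimpleParser;
options { tokenVocab = SimpleLexer; }

expr
    : expr AND expr
    | QUOTE phrase QUOTE
    | TERM
    | TRUNCATION
    ;

phrase
    : (PHRASE_TERM | TRUNCATION)+
    ;
```

With this layout, the & in "a! & b" is only ever seen by the IN_PHRASE rules, so it should come out as a PHRASE_TERM, while a! & b outside quotes still yields TRUNCATION, AND, TERM.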

Ignoring whitespace (in certain parts) in Antlr4

I am not so familiar with antlr. I am using version 4 and I have a grammar where whitespace is not important in some parts (but it might be in others, or rather its luck).
So say we have the following grammar
grammar Foo;
program : A* ;
A : ID '#' ID '(' IDList ')' ';' ;
ID : [a-zA-Z]+ ;
IDList : ID (',' IDList)* ;
WS : [ \t\r\n]+ -> skip ;
and a test input
foo#bar(X,Y);
foo#baz ( z,Z) ;
The first line is parsed correctly whereas the second one is not.
I don't want to pollute my rules with markers for the places where whitespace is not relevant, since my actual grammar is more complicated than the toy example. In case it's not clear, the part ID '#' ID should not contain any whitespace. Whitespace in any other position shouldn't matter at all.
Even though you are skipping WS, lexer rules are still sensitive to the existence of the whitespace characters. Skip simply means that no token is generated for consumption by the parser. Thus, the lexer Addr rule explicitly does not permit any interior whitespace characters.
Conversely, the a and idList parser rules never see interior whitespace tokens so those rules are insensitive to the occurrence of whitespace characters occurring between the generated tokens.
grammar Foo;
program : a* EOF ; // EOF will require parsing the entire input
a : Addr LParen idList RParen Semi ;
idList : ID (Comma ID)* ; // simpler equivalent construct
Addr : ID '#' ID ; // lexer rule, so no whitespace is allowed around '#'
LParen : '(' ;
RParen : ')' ;
Comma : ',' ;
Semi : ';' ;
ID : [a-zA-Z]+ ;
WS : [ \t\r\n]+ -> skip ;
Define ID '#' ID as a single lexer token rather than spelling it out in a parser rule.
a : AID '(' idList ')' ';' ;
AID : [a-zA-Z]+ '#' [a-zA-Z]+ ;
Other options:
enable or disable whitespace tokens in your token stream
enable or disable whitespace with lexer modes (this may be a problem because lexer modes are triggered by context, which is not easy to determine in your case)

Antlr grammar, implicit token definition in parser rule

A weird thing is going on. I defined the grammar and this is an excerpt.
name
: Letter
| Digit name
| Letter name
;
numeral
: Digit
| Digit numeral
;
fragment
Digit
: [0-9]
;
fragment
Letter
: [a-zA-Z]
;
So why does it show warnings for just two lines (Letter and Digit name) where I referenced a fragment, while the others below are completely fine?
Lexer rules you mark as fragments can only be used by other lexer rules, not by parser rules. Fragment rules never become a token of their own.
Be sure you understand the difference: What does "fragment" mean in ANTLR?
EDIT
Also, I now see that you're doing too much in the parser. The rules name and numeral should really be lexer rules:
Name
: ( Digit | Letter)* Letter
;
Numeral
: Digit+
;
in which case you don't need to account for a Space rule in any of your parser rules (this is about your last question which was just removed).
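Put together, the suggestion might look like the sketch below. The grammar name Names, the item parser rule, and the skipped Space rule are assumptions added for illustration; only Name, Numeral, Digit and Letter come from the thread.

```antlr
grammar Names;

// Hypothetical parser rule: it refers to tokens only, never to fragments.
item : Name | Numeral ;

Name    : (Digit | Letter)* Letter ;
Numeral : Digit+ ;
Space   : [ \t\r\n]+ -> skip ;   // assumption: whitespace is insignificant

// Fragments may be referenced from lexer rules only.
fragment Digit  : [0-9] ;
fragment Letter : [a-zA-Z] ;
```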
Just in case you are using an older version of ANTLR (v3 or earlier):
[0-9]
and
[a-zA-Z]
are not valid character-set syntax there.
Replace them with
'0'..'9'
and
('a'..'z' | 'A'..'Z')
and your issues should go away.
