ANTLR grammar to recognize DIGIT keys and INTEGERS too - parsing

I'm trying to create an ANTLR grammar to parse sequences of keys that optionally have a repeat count. For example, (a b c r5) means "repeat keys a, b, and c five times."
I have the grammar working for KEYS : ('a'..'z'|'A'..'Z').
But when I try to add digit keys KEYS : ('a'..'z'|'A'..'Z'|'0'..'9') with an input expression like (a 5 r5), the parse fails on the middle 5 because it can't tell if the 5 is an INTEGER or a KEY. (Or so I think; the error messages are difficult to interpret "NoViableAltException").
I have tried these grammatical forms, which work ('r' means "repeatcount"):
repeat : '(' LETTERKEYS INTEGER ')' - works for a-zA-Z
repeat : '(' LETTERKEYS 'r' INTEGER ')'; - works for a-zA-Z
But I fail with
repeat : '(' LETTERSandDIGITKEYS INTEGER ')' - fails on '(a 5 r5)'
repeat : '(' LETTERSandDIGITKEYS 'r' INTEGER ')'; - fails on '(a 5 r5)'
Maybe the grammar can't do the recognition; maybe I need to recognize all the 5's keys in the same way (as KEYS or DIGITS or INTEGERS) and in the parse tree visitor interpret the middle DIGIT instances as keys, and the last set of DIGITS as an INTEGER count?
Is it possible to define a grammar that allows me to repeat digit keys as well as letter keys so that expressions like (a 5 123 r5) will be recognized correctly? (That is, "repeat keys a,5,1,2,3 five times.") I'm not tied to that specific syntax, although it would be nice to use something similar.
Thank you.

the parse fails on the middle 5 because it can't tell if the 5 is an INTEGER or a KEY.
If you have defined the following rules:
INTEGER : [0-9]+;
KEY : [a-zA-Z0-9];
then a single digit, like 5 in your example, will always become an INTEGER token. Even if
the parser is trying to match a KEY token, the 5 will become an INTEGER. There is nothing
you can do about that: this is the way ANTLR's lexer works. The lexer works in the following way:
try to consume as many characters as possible (the longest match wins)
if 2 or more rules match the same characters (like INTEGER and KEY in case of 5), let the rule defined first "win"
If you want a 5 to be an INTEGER, but sometimes a KEY, do something like this instead:
key : KEY | SINGLE_DIGIT | R;
integer : INTEGER | SINGLE_DIGIT;
repeat : R integer;
SINGLE_DIGIT : [0-9];
INTEGER : [0-9]+;
R : 'r';
KEY : [a-zA-Z];
and in your parser rules, you use key and integer instead of KEY and INTEGER.

You can split your grammar into two parts. One to be the lexer grammar, one to be the parser grammar. The lexer grammar splits the input characters into tokens. The parser grammar uses the string of tokens to parse and build a syntax tree. I work on Tunnel Grammar Studio (TGS) that can generate parsers with this two ABNF (RFC 5234) like grammars:
key = 'a'-'z' / 'A'-'Z' / '0'-'9'
repeater = 'r' 1*('0'-'9')
That is the lexer grammar that has two rules. Each character that is not processed by the lexer grammar, is converted to a token, made from the character itself. Meaning that a is a key, r11 is a repeater and ( for example will be transferred to the parser as a token (.
document = *ws repeat
repeat = '(' *ws *({key} *ws) [{repeater} *ws] ')' *ws
ws = ' ' / %x9 / %xA / %xD
This is the parser grammar, that has 3 rules. The document rule accepts (recognizes) white space at first, then one repeat rule. The repeat rule starts with a scope, followed by any number of white space. After that is a list of keys maybe separated by white space and after all keys there is an optional repeater token followed by optional white space, closing scope and again optional white space. The white space is space tab carriage return and line feed in that order.
The runtime of this parsing is linear for both the lexer and the parser because both grammars are LL(1). Bellow is the generic parse tree from the TGS online laboratory, where you can run this grammars for input (a 5 r5) and you will get this tree:
If you want to have more complex key, then you may use this:
key = 1*('a'-'z' / 'A'-'Z' / '0'-'9')
In this case however, the key and repeater lexer rules will both recognize the sequence r7, but because the repeater rule is defined later it will take precedence (i.e. overwrites the previous rule). With other words r7 will be a repeater token, and the parsing will still be linear. You will get a warning from TGS if your lexer rules overwrite one another.

Related

Use character as operator between numbers, otherwise treat it as a token ANTLR4

I'm making a language in ANTLR, where a sequence of digits is a number. A sequence of digits, letters and an underscore is, however, an identifier. So, for example:
These are numbers: 234, 0243, 0, 11
These are identifiers: foo, 2foo, foo2, 2y8
However, I also have operators, like multiplication, addition, division... They all work fine, except for one operator, which is the scientific operator e (or E). Unlike in most other languages, where the e for a scientific number is considered part of the number itself (like 2e3), in my language the e is considered to be an operator itself. So for example, (2+5)e4 is valid.
However, this brings an issue: since e is a letter, unless I separe my scientific number into spaces, ANTLR recognizes 2e3 as an identifier instead of the operation 2-e-3.
I'd like the language to always treat e as an operator, and not as part of an identifier, if what's on both sides of the e is something other than a letter or an underscore, or if it's alone. So, for example:
The following are treated as operations: 2e3, 2 e 3, 2E3, 2e 3, 2 e3, 23.4e78.9, 5e($my_var), .0e4, -7e3, 7e+9, 0e-4, (2)e(3), 2e(hello)
The following are treated as identifiers: hello, 2ey9, w3e6, 6ee7, 7ep, e9, e.
I have the following minimal reproduction example:
grammar ScientificExample;
file: expr* EOF;
expr
: UNARY expr
| L_PAREN expr R_PAREN
| expr SCIENTIFIC expr
| atom;
atom: NUMBER | IDENTIFIER;
WS : [ \r\t\n]+ -> skip ;
SCIENTIFIC: [eE];
NUMBER: ( [0-9]* '.' )? [0-9]+;
L_PAREN: '(';
R_PAREN: ')';
IDENTIFIER: [a-zA-Z0-9_]+;
UNARY: [+-];
As I understand it, ANTLR gives priority to whichever parsing rule produces the largest possible result, which might be why it's not giving priority to scientific being first. Any ideas how I might engineer this so priority is given to valid scientific expressions, as long as both members are an expression that isn't a simple IDENTIFIER atom, or the lack thereof?
I've thought of over-engineering the IDENTIFIER lexing rule, however since ANTLR doesn't really have lookahead and lookbehind expressions, I'm not completely sure how I'd achieve that.

ANTLR: Why is this grammar rule for a tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.

Tokenize abstract terminals in LL grammar

I am currently writing a basic parser for an XML flavor. As an exercise, I am implementing an LL table-driven parser.
This is my example of BNF grammar:
%token name data string
%% /* LL(1) */
doc : elem
elem : "<" open_tag
open_tag : name attr close_tag
close_tag : ">" elem_or_data "</" name ">"
| "/>"
;
elem_or_data : "<" open_tag elem_or_data
| data elem_or_data
| /* epsilon */
;
attr : name ":" string attr
| /* epsilon */
;
Is this grammar correct ?
Each terminal literal is between quotes. The abstract terminals are specified by %token.
I am coding an hand-written lexer to convert my input into a tokens list. How would I tokenize the abstract terminals ?
The classic approach would be to write a regular expression (or other recogniser) for each possible terminal.
What you call "abstract" terminals, which are perfectly concrete, are actually terminals whose associated patterns recognise more than one possible input string. The string actually recognised (or some computed function of that string) should be passed to the parser as the semantic value of the token.
Nominally, at each point in the input string, the tokeniser will run all recognisers and choose the one with the longest match. (This is the so-called "maximal munch" rule.) This can usually be optimised, particularly if all the patterns are regular expressions. (F)lex will do that optimisation for you, for example.
A complication in your case is that the tokenisation of your language is context-dependent. In particular, when the target is elem_or_data, the only possible tokens are <, </ and "data". However, inside a tag, "data" is not possible, and "name" and "string" tags are possible (among others).
It is also possible that the value of an attribute could have the same lexical form as the key (i.e. a name). In XML itself, the attribute value must be a quoted string and the use of an unquoted string will be flagged as an error, but there are certainly "XML-like" languages (such as HTML) in which attribute values without whitespace can be inserted unquoted.
Since the lexical analysis depends on context, the lexical analyser must be passed (or have access to) an additional piece of information defining the lexical context. This is usually represented as a single enumeration value, which could be computed based on the last few tokens returned, or based on the FIRST set of the current parser stack.

Overloading multiplication using menhir and OCaml

I have written a lexer and parser to analyze linear algebra statements. Each statement consists of one or more expressions followed by one or more declarations. I am using menhir and OCaml to write the lexer and parser.
For example:
Ax = b, where A is invertible.
This should be read as A * x = b, (A, invertible)
In an expression all ids must be either an uppercase or lowercase symbol. I would like to overload the multiplication operator so that the user does not have to type in the '*' symbol.
However, since the lexer also needs to be able to read strings (such as "invertible" in this case), the "Ax" portion of the expression is sent over to the parser as a string. This causes a parser error since no strings should be encountered in the expression portion of the statement.
Here is the basic idea of the grammar
stmt :=
| expr "."
| decl "."
| expr "," decl "."
expr :=
| term
| unop expr
| expr binop expr
term :=
| <int> num
| <char> id
| "(" expr ")"
decl :=
| id "is" kinds
kinds :=
| <string> kind
| kind "and" kinds
Is there some way to separate the individual characters and tell the parser that they should be treated as multiplication? Is there a way to change the lexer so that it is smart enough to know that all character clusters before a comma are ids and all clusters after should be treated as strings?
It seems to me you have two problems:
You want your lexer to treat sequences of characters differently in different places.
You want multiplication to be indicated by adjacent expressions (no operator in between).
The first problem I would tackle in the lexer.
One question is why you say you need to use strings. This implies that there is a completely open-ended set of things you can say. It might be true, but if you can limit yourself to a smallish number, you can use keywords rather than strings. E.g., invertible would be a keyword.
If you really want to allow any string at all in such places, it's definitely still possible to hack a lexer so that it maintains a state describing what it has seen, and looks ahead to see what's coming. If you're not required to adhere to a pre-defined grammar, you could adjust your grammar to make this easier. (E.g., you could use commas for only one purpose.)
For the second problem, I'd say you need to add adjacency to your grammar. I.e., your grammar needs a rule that says something like term := term term. I suspect it's tricky to get this to work correctly, but it does work in OCaml (where adjacent expressions represent function application) and in awk (where adjacent expressions represent string concatenation).

Invisible multiplication in algeabric expressions

Parsing math expressions, would be better treat invisible multiplication (e.g. ab, meaning a times b, or (a-b)c, or (a-b)(c+d) ecc. ecc.) at level of the lexer or of the parser ?
Implicit multiplication is a grammatical construct. Lexing is purely about recognizing the individual symbols. The fact that two adjacent expressions should be multiplied is not a lexical notion, as the lexer does not know about "expressions". The parser does.
If the lexer were responsible, you'd have to add lots of rules relating to adjacent tokens. For instance, insert a × token between two IDENTIFIERs, or an IDENTIFIER and a NUMBER, or a NUMBER and an IDENTIFIER, or between ) and IDENTIFIER, or IDENTIFIER and (... except uh oh, IDENTIFIER ( could be a function call, so maybe I need to look up IDENTIFIER in the symbol table to see if it's a function name...
What a mess!
The parser, on the other hand, can do this with a single grammar rule.
E → E '×' E
| E E

Resources