ANTLR rule for function call - parsing

I'm new to ANTLR4 and I'm trying to parse this input:
X = 1 2 A(2) B (2)
In this input, A should be a function call, while B should be a variable named B. But I have a rule in the lexer that skips whitespace.
How can I write the parser rule for this input while keeping the rule that skips whitespace?
Thanks in advance.

The solution is to define a function introducer in the lexer, where you have control over the whitespace, and then continue the function call in the parser for flexible parameter handling:
FUNCTION_START: ID OPEN_PAR;
function: FUNCTION_START parameters CLOSE_PAR;
The key point here is that in the lexer the whitespace rule doesn't kick in while you are in another lexer rule, hence the FUNCTION_START rule will only accept input of the form identifier( with no whitespace in between. It will not match B (.
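For reference, a minimal combined grammar built around that idea might look like the sketch below; everything except FUNCTION_START, OPEN_PAR, and CLOSE_PAR is an assumption about the rest of the language, not taken from the question:
grammar Calls;

stat       : ID '=' expr+ EOF ;
expr       : function
           | ID
           | INT
           | OPEN_PAR expr CLOSE_PAR
           ;
function   : FUNCTION_START parameters CLOSE_PAR ;
parameters : (expr (',' expr)*)? ;

FUNCTION_START : ID OPEN_PAR ;   // only matches "identifier(" with no whitespace
OPEN_PAR       : '(' ;
CLOSE_PAR      : ')' ;
ID             : [a-zA-Z] [a-zA-Z0-9]* ;
INT            : [0-9]+ ;
WS             : [ \t\r\n]+ -> skip ;
With this sketch, X = 1 2 A(2) B (2) lexes A( as a single FUNCTION_START token (the longest match wins), while B and the following ( stay separate tokens because of the intervening space.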

Related

ANTLR grammar to recognize DIGIT keys and INTEGERS too

I'm trying to create an ANTLR grammar to parse sequences of keys that optionally have a repeat count. For example, (a b c r5) means "repeat keys a, b, and c five times."
I have the grammar working for KEYS : ('a'..'z'|'A'..'Z').
But when I try to add digit keys KEYS : ('a'..'z'|'A'..'Z'|'0'..'9') with an input expression like (a 5 r5), the parse fails on the middle 5 because it can't tell if the 5 is an INTEGER or a KEY. (Or so I think; the error messages, like "NoViableAltException", are difficult to interpret.)
I have tried these grammatical forms, which work ('r' means "repeat count"):
repeat : '(' LETTERKEYS INTEGER ')' - works for a-zA-Z
repeat : '(' LETTERKEYS 'r' INTEGER ')'; - works for a-zA-Z
But I fail with
repeat : '(' LETTERSandDIGITKEYS INTEGER ')' - fails on '(a 5 r5)'
repeat : '(' LETTERSandDIGITKEYS 'r' INTEGER ')'; - fails on '(a 5 r5)'
Maybe the grammar can't do the recognition; maybe I need to recognize all the 5s in the same way (as KEYS or DIGITS or INTEGERS) and, in the parse tree visitor, interpret the middle DIGIT instances as keys and the final DIGITS as an INTEGER count?
Is it possible to define a grammar that allows me to repeat digit keys as well as letter keys so that expressions like (a 5 123 r5) will be recognized correctly? (That is, "repeat keys a,5,1,2,3 five times.") I'm not tied to that specific syntax, although it would be nice to use something similar.
Thank you.
the parse fails on the middle 5 because it can't tell if the 5 is an INTEGER or a KEY.
If you have defined the following rules:
INTEGER : [0-9]+;
KEY : [a-zA-Z0-9];
then a single digit, like the 5 in your example, will always become an INTEGER token. Even if the parser is trying to match a KEY token, the 5 will become an INTEGER. There is nothing you can do about that: this is the way ANTLR's lexer works:
1) try to consume as many characters as possible (the longest match wins);
2) if 2 or more rules match the same characters (like INTEGER and KEY in the case of 5), let the rule defined first "win".
If you want a 5 to be an INTEGER, but sometimes a KEY, do something like this instead:
key : KEY | SINGLE_DIGIT | R;
integer : INTEGER | SINGLE_DIGIT;
repeat : R integer;
SINGLE_DIGIT : [0-9];
INTEGER : [0-9]+;
R : 'r';
KEY : [a-zA-Z];
and in your parser rules, you use key and integer instead of KEY and INTEGER.
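Putting that together, a minimal sketch of a complete grammar might look like this; the sequence and element rules and the WS handling are my additions, not part of the original answer:
grammar Keys;

sequence : '(' element* ')' EOF ;
element  : repeat      // tried first, so r followed by a number binds as a repeat
         | key
         ;
repeat   : R integer ;
key      : KEY | SINGLE_DIGIT | R ;
integer  : INTEGER | SINGLE_DIGIT ;

SINGLE_DIGIT : [0-9] ;
INTEGER      : [0-9]+ ;
R            : 'r' ;
KEY          : [a-zA-Z] ;
WS           : [ \t\r\n]+ -> skip ;
This parses (a 5 r5) with the middle 5 as a key and r5 as the repeat. Two caveats of the sketch: an r followed by a number binds as a repeat wherever it appears, not just before the closing parenthesis; and a multi-digit run like the 123 in (a 5 123 r5) still lexes as one INTEGER token, so to mean the keys 1, 2, 3 you would have to either write it with spaces (1 2 3) or split the INTEGER into digits in your visitor.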
You can split your grammar into two parts: a lexer grammar and a parser grammar. The lexer grammar splits the input characters into tokens. The parser grammar uses the string of tokens to parse and build a syntax tree. I work on Tunnel Grammar Studio (TGS), which can generate parsers from these two ABNF-like (RFC 5234) grammars:
key = 'a'-'z' / 'A'-'Z' / '0'-'9'
repeater = 'r' 1*('0'-'9')
This is the lexer grammar, which has two rules. Each character that is not processed by the lexer grammar is converted to a token made from the character itself. That means a is a key, r11 is a repeater, and ( for example will be transferred to the parser as a token (.
document = *ws repeat
repeat = '(' *ws *({key} *ws) [{repeater} *ws] ')' *ws
ws = ' ' / %x9 / %xA / %xD
This is the parser grammar, which has 3 rules. The document rule accepts (recognizes) white space first, then one repeat rule. The repeat rule starts with an opening parenthesis, followed by any number of white space characters. After that is a list of keys, possibly separated by white space, and after all keys there is an optional repeater token followed by optional white space, the closing parenthesis, and again optional white space. The white space is space, tab, line feed, and carriage return, in that order.
The runtime of this parsing is linear for both the lexer and the parser because both grammars are LL(1). You can run these grammars for the input (a 5 r5) in the TGS online laboratory and inspect the generic parse tree it produces.
If you want to have a more complex key, then you may use this:
key = 1*('a'-'z' / 'A'-'Z' / '0'-'9')
In this case, however, the key and repeater lexer rules will both recognize the sequence r7, but because the repeater rule is defined later, it will take precedence (i.e., it overwrites the previous rule). In other words, r7 will be a repeater token, and the parsing will still be linear. You will get a warning from TGS if your lexer rules overwrite one another.

Why are the newline and the white space treated differently in the lexer specification?

I am using F#'s FsLex to generate a lexer. I am having difficulty understanding the following two lines from a textbook. Why is the newline (\n) treated differently from the other white space? In particular, what does lexbuf.EndPos <- lexbuf.EndPos.NextLine do differently from Tokenize lexbuf?
rule Tokenize = parse
| [' ' '\t' '\r'] { Tokenize lexbuf }
| '\n' { lexbuf.EndPos <- lexbuf.EndPos.NextLine; Tokenize lexbuf }
A rule is essentially a function that takes a lexer buffer as an argument. Each case on the left side of your rule matches a given character (e.g., '\n') or class of characters ([' ' '\t' '\r']) in your input. The expression on the right side of the rule, inside the curly braces { ... }, defines an action. The definition you pasted appears to be the start of a tokenizer.
The expression Tokenize lexbuf is a recursive call to the Tokenize rule. In essence, this rule ignores whitespace characters. Why? Because tokenizers aim to simplify the input. Whitespace typically has no meaning in a programming language, so this rule filters it out. Tokenized input generally makes writing your parser simpler later on. You'll eventually want to add other cases to your Tokenize rule (e.g., for keywords, assignment statements, and other expressions) to produce a complete lexer definition.
The second rule, the one that matches \n, also ignores the whitespace, but as you correctly point out, it does something different: it updates the lexer's current position (lexbuf.EndPos) to the first position of the next line (lexbuf.EndPos.NextLine) before recursively calling Tokenize again. Why? Presumably so that line information is correct on subsequent calls; the lexer only counts characters on its own, so it has to bump the line counter itself whenever it consumes a newline.
Since you're only showing a lexer fragment here, I can only guess what lexbuf.EndPos is used for, but it's pretty common to keep that information around for diagnostic purposes.
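For illustration only, a fuller rule might look like the sketch below. INT, ID, and EOF are assumed to be token constructors from a companion fsyacc parser, and LexemeString comes from FSharp.Text.Lexing; none of this is from the textbook:
rule Tokenize = parse
| [' ' '\t' '\r']    { Tokenize lexbuf }
| '\n'               { lexbuf.EndPos <- lexbuf.EndPos.NextLine; Tokenize lexbuf }
| ['0'-'9']+         { INT (int (LexBuffer<_>.LexemeString lexbuf)) }
| ['a'-'z' 'A'-'Z']+ { ID (LexBuffer<_>.LexemeString lexbuf) }
| eof                { EOF }
| _                  { failwithf "unexpected character on line %d" (lexbuf.EndPos.Line + 1) }  // Line is zero-based here
Without the NextLine update in the '\n' case, the error message in the last case would always report line 1, no matter how far into the input the bad character appears.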

Tokenize abstract terminals in LL grammar

I am currently writing a basic parser for an XML flavor. As an exercise, I am implementing an LL table-driven parser.
This is my example BNF grammar:
%token name data string
%% /* LL(1) */
doc : elem
elem : "<" open_tag
open_tag : name attr close_tag
close_tag : ">" elem_or_data "</" name ">"
| "/>"
;
elem_or_data : "<" open_tag elem_or_data
| data elem_or_data
| /* epsilon */
;
attr : name ":" string attr
| /* epsilon */
;
Is this grammar correct?
Each terminal literal is between quotes. The abstract terminals are specified by %token.
I am coding a hand-written lexer to convert my input into a list of tokens. How would I tokenize the abstract terminals?
The classic approach would be to write a regular expression (or other recogniser) for each possible terminal.
What you call "abstract" terminals, which are perfectly concrete, are actually terminals whose associated patterns recognise more than one possible input string. The string actually recognised (or some computed function of that string) should be passed to the parser as the semantic value of the token.
Nominally, at each point in the input string, the tokeniser will run all recognisers and choose the one with the longest match. (This is the so-called "maximal munch" rule.) This can usually be optimised, particularly if all the patterns are regular expressions. (F)lex will do that optimisation for you, for example.
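To make "longest match wins" concrete, here is a sketch in Java; the three patterns are placeholders roughly matching your grammar's terminals, not a complete or authoritative set:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MaximalMunch {
    // One pattern per terminal; at each input offset, the longest match wins,
    // with earlier patterns breaking ties (rule-order priority).
    static final Pattern[] PATTERNS = {
        Pattern.compile("[A-Za-z][A-Za-z0-9]*"), // name
        Pattern.compile("\"[^\"]*\""),           // string
        Pattern.compile("</|/>|<|>|:"),          // punctuation
    };
    static final String[] KINDS = { "name", "string", "punct" };

    /** Returns {kind, text} of the token at offset, or null if nothing matches. */
    static String[] nextToken(String input, int offset) {
        int bestLen = -1;
        String bestKind = null, bestText = null;
        for (int i = 0; i < PATTERNS.length; i++) {
            Matcher m = PATTERNS[i].matcher(input).region(offset, input.length());
            if (m.lookingAt() && m.end() - m.start() > bestLen) {
                bestLen = m.end() - m.start();
                bestKind = KINDS[i];
                bestText = m.group();
            }
        }
        return bestKind == null ? null : new String[] { bestKind, bestText };
    }
}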
A complication in your case is that the tokenisation of your language is context-dependent. In particular, when the target is elem_or_data, the only possible tokens are <, </ and "data". However, inside a tag, "data" is not possible, while "name" and "string" tokens are possible (among others).
It is also possible that the value of an attribute could have the same lexical form as the key (i.e. a name). In XML itself, the attribute value must be a quoted string and the use of an unquoted string will be flagged as an error, but there are certainly "XML-like" languages (such as HTML) in which attribute values without whitespace can be inserted unquoted.
Since the lexical analysis depends on context, the lexical analyser must be passed (or have access to) an additional piece of information defining the lexical context. This is usually represented as a single enumeration value, which could be computed based on the last few tokens returned, or based on the FIRST set of the current parser stack.
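As a sketch of that idea (the class names and the two-state split are assumptions, not from the question), a hand-written lexer might switch on an explicit context that the parser updates, or that the lexer flips itself when it sees the tag delimiters:
enum LexContext { CONTENT, IN_TAG }

record Token(String kind, String text) {}

class XmlishLexer {
    private final String input;
    private int pos = 0;
    private LexContext context = LexContext.CONTENT; // the extra piece of information

    XmlishLexer(String input) { this.input = input; }

    Token next() {
        if (context == LexContext.CONTENT) {
            if (pos >= input.length()) return new Token("EOF", "");
            // In content position only "<", "</" and data are possible.
            if (input.startsWith("</", pos)) { pos += 2; context = LexContext.IN_TAG; return new Token("</", "</"); }
            if (input.charAt(pos) == '<')    { pos += 1; context = LexContext.IN_TAG; return new Token("<", "<"); }
            int start = pos;
            while (pos < input.length() && input.charAt(pos) != '<') pos++;
            return new Token("data", input.substring(start, pos));
        }
        // Inside a tag: names, ":", strings, ">" and "/>" are possible; data is not.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        if (pos >= input.length()) return new Token("EOF", "");
        if (input.startsWith("/>", pos)) { pos += 2; context = LexContext.CONTENT; return new Token("/>", "/>"); }
        char c = input.charAt(pos);
        if (c == '>') { pos++; context = LexContext.CONTENT; return new Token(">", ">"); }
        if (c == ':') { pos++; return new Token(":", ":"); }
        if (c == '"') {                 // quoted string
            int start = ++pos;
            while (pos < input.length() && input.charAt(pos) != '"') pos++;
            String s = input.substring(start, pos);
            pos++;                      // skip the closing quote
            return new Token("string", s);
        }
        int start = pos;                // name: letters only, for brevity
        while (pos < input.length() && Character.isLetter(input.charAt(pos))) pos++;
        if (pos == start) throw new IllegalStateException("unexpected character: " + c);
        return new Token("name", input.substring(start, pos));
    }
}
Here the lexer flips the context itself at < and >; in a table-driven parser you could instead compute it from the FIRST set of the current parser stack, as described above.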

How to adapt this LL(1) parser to a LL(k) parser?

In the appendices of the Dragon book, an LL(1) front end is given as an example. I think it is very helpful. However, I found that for the context-free grammar below, at least an LL(2) parser is needed instead.
statement : variable ':=' expression
| functionCall
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
variable : ID
| ID'.'variable
| ID '[' expression ']'
;
How could I adapt the lexer for the LL(1) parser to support k lookahead tokens?
Are there some elegant ways?
I know I can add some buffers for tokens. I'd like to discuss some details of the programming.
This is the Parser:
class Parser
{
    private Lexer lex;
    private Token look;

    public Parser(Lexer l)
    {
        lex = l;
        move();
    }

    private void move()
    {
        look = lex.scan();
    }
}
and the Lexer.scan() returns the next token from the stream.
In effect, you need to buffer k lookahead tokens in order to do LL(k) parsing. If k is 2, then you just need to extend your current method, which buffers one token in look, using another private member look2 or some such. For larger k, you could use a ring buffer.
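For example, a sketch along those lines, keeping the question's Lexer and Token names (the ring-buffer details are my own, not from the Dragon book):
class Parser
{
    private final Lexer lex;
    private final Token[] lookahead;   // ring buffer holding the next k tokens
    private int p = 0;                 // index of LA(1) within the buffer
    private final int k;

    public Parser(Lexer l, int k)
    {
        lex = l;
        this.k = k;
        lookahead = new Token[k];
        for (int i = 0; i < k; i++)    // prime the buffer
            lookahead[i] = lex.scan();
    }

    private Token LA(int i)            // i-th token of lookahead, 1 <= i <= k
    {
        return lookahead[(p + i - 1) % k];
    }

    private void move()                // consume LA(1) and refill its slot
    {
        lookahead[p] = lex.scan();
        p = (p + 1) % k;
    }
}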
In practice, you don't need the full lookahead all the time. Most of the time, one-token lookahead is sufficient. You should structure the code as a decision tree, where future tokens are only consulted if necessary to resolve ambiguity. (It's often useful to provide a special token type, "unknown", which can be assigned to the buffered token list to indicate that the lookahead hasn't reached that point yet. Alternatively, you can just always maintain k tokens of lookahead; for hand-built parsers, that can be simpler.)
Alternatively, you can use a fallback structure where you simply try one alternative and if that doesn't work, instead of reporting a syntax error, restore the state of the parser and lexer to the next alternative. In this model, the lexer takes as an explicit argument the current input buffer position, and the input buffer needs to be rewindable. However, you can use a lookahead buffer to effectively memoize the lexer function, which can avoid rewinding and rescanning. (Scanning is usually fast enough that occasional rescans don't matter, so you might want to put off adding code complexity until your profiling indicates that it would be useful.)
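A sketch of that memoized-lexer variant, with Token and Lexer as in the question (the mark/rewind names are assumptions):
import java.util.ArrayList;
import java.util.List;

class BacktrackingParser
{
    private final Lexer lex;
    private final List<Token> buffer = new ArrayList<>(); // each token is scanned exactly once
    private int pos = 0;                                  // parser's current index into buffer

    BacktrackingParser(Lexer l) { lex = l; }

    private Token LA(int i)            // i-th lookahead token, scanned lazily on demand
    {
        while (buffer.size() < pos + i)
            buffer.add(lex.scan());
        return buffer.get(pos + i - 1);
    }

    private void move()        { pos++; }    // consume LA(1)
    private int  mark()        { return pos; }
    private void rewind(int m) { pos = m; }  // retry from a saved point without rescanning
}
A rule with alternatives calls mark(), tries the first alternative, and on a syntax error calls rewind(m) and tries the next. In a production parser you would also prune consumed tokens from the buffer once no mark is outstanding, so it doesn't grow without bound.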
Two notes:
1) I'm skeptical about the rule:
functionCall : ID'(' (expression ( ',' expression )*)* ')'
;
That would allow, for example:
function(a[3], b[2] c[x] d[y], e.foo)
which doesn't look right to me. Normally, you'd mark the contents of the () as optional instead of repeatable, e.g., using an optional marker ? instead of the second Kleene star *:
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
2) In my opinion, you really should consider using bottom-up parsing for an expression language, either a generated LR(1) parser or a hand-built Pratt parser. LL(1) is rarely adequate. Of course, if you're using a parser generator, you can use tools like ANTLR which effectively implement LL(∞); that will take care of the lookahead for you.

YACC: Stop parsing specific path

I'm using Python PLY to parse a specific language. For a grammar like:
IF LPAREN condition RPAREN LBRACE stmtlist RBRACE ELSE LBRACE stmtlist RBRACE
When I know the condition value, say True, is there a way to stop parsing the stmtlist in the ELSE path?
Thanks,
You will have to continue the parse, because you need to find the end of the block enclosed by the second RBRACE; in other words, you need to parse to find the beginning of the next statement.
That said, when you analyze the results of the parse (to generate code, construct an AST, whatever you need to do), if you can determine that condition always evaluates to true (perhaps it is the expression 1 = 1), then you can suppress generating code for the second stmtlist.
Update:
Your grammar (the syntax of your language) is specified non-procedurally, so there's no place for you to attach conditional logic.
On the other hand, you specify semantic actions to take when a particular syntactic element of your grammar is matched, and you do this procedurally. In PLY, you do this by coding the body of grammar rule function(s). In the grammar rule function that matches the second stmtlist, you can write conditional code to skip over code generation, based on other information you have figured out about the input program (the input to your compiled language processor).
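For instance, a sketch of such a rule function; the is_statically_true/is_statically_false helpers are hypothetical, and how you detect a constant condition is up to your own analysis:
# Sketch: suppress code generation for the dead branch in a PLY semantic action.
def p_statement_if_else(p):
    '''statement : IF LPAREN condition RPAREN LBRACE stmtlist RBRACE ELSE LBRACE stmtlist RBRACE'''
    cond = p[3]
    if is_statically_true(cond):          # hypothetical constant-condition check
        p[0] = p[6]                       # keep only the THEN block's result
    elif is_statically_false(cond):       # hypothetical as well
        p[0] = p[10]                      # keep only the ELSE block's result
    else:
        p[0] = ('if', cond, p[6], p[10])  # truly dynamic: keep both branches
Note that both stmtlist subtrees (p[6] and p[10]) are still parsed either way; the conditional logic only decides what to do with them afterwards.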
