I have strange shift-reduce warnings no matter what I change. Reduced grammar:
expr : NUMBER
| NUMBER ',' expr
| expr ',' NUMBER ',' NUMBER
Bison reports a shift/reduce conflict on the 2nd rule, at the comma. I tried to set precedence:
%nonassoc num_p
%nonassoc exp_p
expr : NUMBER %prec num_p
| NUMBER ',' expr %prec exp_p
| expr ',' NUMBER ',' NUMBER
but the warning stays the same. Can someone explain what I am missing here?
It is clear that the following is ambiguous:
expr : NUMBER %prec num_p
| NUMBER ',' expr %prec exp_p
| expr ',' NUMBER ',' NUMBER
since any list of three or more numbers can be parsed in various ways. Roughly speaking, we can take single numbers off the beginning of the list or pairs of numbers off the end of the list, until we meet somewhere in the middle; however, there is no definition of where the middle point might be.
Consider, for example, the various parse trees which could produce 1, 2, 3, 4, 5. Here are just two (with numbers indicating which production was used to expand expr):
expr(2)                      expr(3)
+-- 1                        +-- expr(3)
+-- ,                        |   +-- expr(1)
+-- expr(2)                  |   |   +-- 1
    +-- 2                    |   +-- ,
    +-- ,                    |   +-- 2
    +-- expr(3)              |   +-- ,
        +-- expr(1)          |   +-- 3
        |   +-- 3            +-- ,
        +-- ,                +-- 4
        +-- 4                +-- ,
        +-- ,                +-- 5
        +-- 5
Both of the above trees are, in some sense, maximal. The one on the left takes as many single NUMBERs as possible using production 2, until only two NUMBERs are left for production 3. The one on the right applies production 3 as many times as possible. (It would have needed a single application of production 2 if the list of numbers had an even length.)
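To get a feel for how quickly the ambiguity grows, here is a small Python sketch that counts the parse trees the three productions allow for a list of n NUMBERs. The function name count_parses is made up for this illustration; only the three productions come from the question.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parses(n: int) -> int:
    """Count distinct parse trees for a comma-separated list of n NUMBERs."""
    if n < 1:
        return 0
    total = 0
    if n == 1:
        total += 1                    # production 1: expr -> NUMBER
    if n >= 2:
        total += count_parses(n - 1)  # production 2: expr -> NUMBER ',' expr
    if n >= 3:
        total += count_parses(n - 2)  # production 3: expr -> expr ',' NUMBER ',' NUMBER
    return total


for n in range(1, 7):
    print(n, count_parses(n))
```

Lists of one or two numbers have a single parse; from three numbers on there is more than one, and the count follows the Fibonacci recurrence.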
In order to resolve the ambiguity, we need a clear statement of the intent. But it seems to me unlikely that it can be resolved with a precedence declaration. Remember that precedence relations are always between a possible reduction (on the top of the parser stack) and a lookahead symbol. They never compare two lookahead symbols nor two productions. If the lookahead symbol wins, it is shifted onto the stack; if the reduction wins, the stack is reduced. It is no more complicated than that.
So if precedence could help, the relevant token must be ',', not NUMBER. NUMBER must always be shifted onto the parse stack. Since no production ends with ',', it is never possible to reduce the stack when NUMBER is the lookahead symbol. By contrast, when ',' is the lookahead symbol, it is usually possible either to reduce the top of the parser stack or to shift the ',' in preparation for a different reduction.
The only place where this decision is even possible is in the case where we have seen NUMBER and are looking at a ',', so we have to decide whether to apply production 1, in preparation for production 3, or shift the ',', leaving production 2 as the only option. Neither of these decisions can prosper: If we choose to reduce, it is possible that production 3 will turn out to be impossible because there are not enough numbers in the list; if we choose to shift, then production 3 will never be used.
In a left-to-right parse, the algorithm which produces the right-hand parse above is not possible, because we cannot know whether the list is of even or odd length until we reach the end, at which point it is far too late to retroactively reduce the first two numbers. On the other hand, the left-hand parse will require a lookahead of four tokens, not one, in order to decide at which point to stop using production 2. That makes it possible to construct an LR(1) grammar, because there is a way to rewrite any LR(k) grammar as an LR(1) grammar, but the resulting grammar is usually complicated.
I suspect that neither of these was really your intention. In order to recommend a resolution, it will be necessary to know what the precise intention is.
One possibility (motivated by a comment) is that expr also includes something which is neither a number nor a list of numbers:
expr: NUMBER
| complex_expression
In that case, it might be that the grammar intends to capture two possibilities:
a list containing NUMBERs, possibly with a complex_expression at the end;
a list containing an even number of NUMBERs, possibly with a complex_expression at the beginning.
What is left ambiguous in the above formulation is a list consisting only of NUMBERs, since either the first or last number could be parsed as an expr. Here there are only a couple of reasonable possible resolutions:
a list of NUMBERs is always parsed as the first option (expr at the end)
a list of NUMBERs is parsed as the second option (expr at the beginning) if and only if there are an odd number of elements in the list.
The first resolution is much easier, so we can start with it. In this case, the first element in the list essentially determines how the list will be parsed, so it is not possible to reduce the first NUMBER to expr. We can express that by separating the different types of expr:
expr: list_starting_expr | list_ending_expr
list_starting_expr: complex_expression ',' NUMBER ',' NUMBER
| list_starting_expr ',' NUMBER ',' NUMBER
list_ending_expr : complex_expression
| NUMBER ',' list_ending_expr
| NUMBER
The last production in the above example allows for a list entirely of NUMBERs to be parsed as a list_ending_expr.
It also allows a list containing only a single complex_expression to be parsed as a list_ending_expr, while a list_starting_expr is required to have at least three elements. Without that, a list consisting only of a complex_expression would have been ambiguous. In the example grammar in the question, the list containing only a complex_expression is implicitly forbidden; that could be reproduced by changing the base production for list_ending_expr from list_ending_expr: complex_expression to list_ending_expr: NUMBER ',' complex_expression.
But what if we wanted the second resolution? We can still recognize the language, but constructing a correct parse tree may require some surgery. We can start by separating out the case where the list consists only of NUMBERs.
expr: list_starting_expr | list_ending_expr | ambiguous_list
list_starting_expr: complex_expression ',' NUMBER ',' NUMBER
| list_starting_expr ',' NUMBER ',' NUMBER
list_ending_expr : NUMBER ',' complex_expression
| NUMBER ',' list_ending_expr
ambiguous_list : NUMBER
| NUMBER ',' ambiguous_list
Despite the frequently-repeated claim that right-recursion should be avoided in bottom-up grammars, here it is essential that ambiguous_list and list_ending_expr be right-recursive, because we cannot distinguish between the two possibilities until we reach the end of the list.
Now we could use a semantic action to simply count the number of elements in the list. That action needs to be associated with the reduction of ambiguous_list to expr:
expr: list_starting_expr | list_ending_expr
| ambiguous_list {
if (count_elements($1) % 2 == 1) {
$$ = make_list_starting_expr($1);
}
else {
$$ = make_list_ending_expr($1);
}
}
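As a cross-check of which lists belong to which alternative, here is a minimal Python sketch of the classification under the second resolution (an all-NUMBER list with an odd count is treated as having the expr at the beginning). The function classify and the string tags "NUMBER" and "COMPLEX" are hypothetical names for this illustration, not part of any bison API.

```python
def classify(items):
    """Return the production name for a list of "NUMBER" / "COMPLEX" tags.

    Raises ValueError for lists that the grammar does not accept.
    """
    n = len(items)
    if n == 0:
        raise ValueError("empty list")
    if all(tag == "NUMBER" for tag in items):
        # Second resolution: parity decides which reading is used.
        return "list_starting_expr" if n % 2 == 1 else "list_ending_expr"
    head_is_complex = items[0] == "COMPLEX"
    tail_is_complex = items[-1] == "COMPLEX"
    if head_is_complex and n >= 3 and (n - 1) % 2 == 0 \
            and all(tag == "NUMBER" for tag in items[1:]):
        # complex_expression followed by one or more pairs of NUMBERs
        return "list_starting_expr"
    if tail_is_complex and n >= 2 \
            and all(tag == "NUMBER" for tag in items[:-1]):
        # one or more NUMBERs followed by complex_expression
        return "list_ending_expr"
    raise ValueError("not a valid list")


print(classify(["NUMBER", "NUMBER", "NUMBER"]))
print(classify(["NUMBER", "NUMBER", "COMPLEX"]))
```

A list consisting of a single complex_expression is rejected here, matching the grammar in the question.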
But we can actually distinguish the two cases grammatically, precisely because of the right recursion:
expr: list_starting_expr
| list_ending_expr
| even_list_of_numbers
| odd_list_of_numbers
list_starting_expr : complex_expression ',' NUMBER ',' NUMBER
| list_starting_expr ',' NUMBER ',' NUMBER
list_ending_expr : NUMBER ',' complex_expression
| NUMBER ',' list_ending_expr
odd_list_of_numbers : NUMBER
| NUMBER ',' NUMBER ',' odd_list_of_numbers
even_list_of_numbers: NUMBER ',' NUMBER
| NUMBER ',' NUMBER ',' even_list_of_numbers
It might be more meaningful to write this as:
expr: expr_type_one | expr_type_two
expr_type_one: list_starting_expr | odd_list_of_numbers
expr_type_two: list_ending_expr | even_list_of_numbers
/* Remainder as above */
Note: The above grammars, like the grammar in the original question, do not allow expr to consist only of a complex_expression. If it were desired to handle the case of only a single complex_expression, then that syntax could be added directly to the productions for expr (or to whichever of expr_type_one or expr_type_two were appropriate).
Sometimes it helps to rewrite left recursion as right recursion, something like this:
expr : NUMBER
| expr ',' NUMBER
;
The theoretical background can be found here: https://cs.stackexchange.com/questions/9963/why-is-left-recursion-bad
I'm trying to make an expression parser and although it works, it does calculations chronologically rather than by BIDMAS; 1 + 2 * 3 + 4 returns 15 instead of 11. I've rewritten the parser to use recursive descent parsing and a proper grammar which I thought would work, but it makes the same mistake.
My grammar so far is:
exp ::= term op exp | term
op ::= "/" | "*" | "+" | "-"
term ::= number | (exp)
It also lacks other features, but right now I'm not sure how to make division precede multiplication, etc. How should I modify my grammar to implement operator precedence?
Try this:
exp ::= add
add ::= mul (("+" | "-") mul)*
mul ::= term (("*" | "/") term)*
term ::= number | "(" exp ")"
Here ()* means zero or more times. This grammar is deterministic and unambiguous. Multiplication and division have the same priority, as do addition and subtraction. The repetition by itself yields a flat list of operands, so associativity is decided by how the parser folds that list; folding from left to right gives the usual left-associative trees.
I am trying to create a parser for the MAXScript language using the official grammar description of the language. I use flex and bison to create the lexer and parser.
However, I have run into the following problem. In traditional languages (e.g. C), statements are separated by a special token (; in C). But in MAXScript, expressions inside a compound expression can be separated either by ; or by a newline. There are other languages whose parsers treat whitespace characters as significant, like Python. But Python is much stricter about the placement of newlines, and the following program is invalid in Python:
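A minimal recursive-descent sketch of this grammar in Python, with one method per rule. The class and function names are invented for the illustration, and it evaluates directly instead of building a tree. Each ()* loop folds its operands from left to right.

```python
import re


def tokenize(text):
    """Split into numbers, operators and parentheses; any other
    non-space character becomes a token the parser will reject."""
    return re.findall(r"\d+|[-+*/()]|\S", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def exp(self):                      # exp ::= add
        return self.add()

    def add(self):                      # add ::= mul (("+" | "-") mul)*
        value = self.mul()
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.mul()
            value = value + rhs if op == "+" else value - rhs
        return value

    def mul(self):                      # mul ::= term (("*" | "/") term)*
        value = self.term()
        while self.peek() in ("*", "/"):
            op = self.advance()
            rhs = self.term()
            value = value * rhs if op == "*" else value / rhs
        return value

    def term(self):                     # term ::= number | "(" exp ")"
        tok = self.advance()
        if tok == "(":
            value = self.exp()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdecimal():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def evaluate(text):
    parser = Parser(text)
    value = parser.exp()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("1 + 2 * 3 + 4"))  # 11
print(evaluate("8 - 3 - 2"))      # 3
```

Because mul is called from inside add, multiplication and division bind tighter than addition and subtraction without any precedence table.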
# compile error
def
foo(x):
print(x)
# compile error
def bar
(x):
foo(x)
However, in MAXScript the following program is valid:
fn
foo x =
( // parenthesis start the compound expression
a = 3 + 2; // the semicolon is optional
print x
)
fn bar
x =
foo x
And you can even write things like this:
for
x
in
#(1,2,3,4)
do
format "%," x
Which will evaluate fine and print 1,2,3,4, to the output. So newlines can be inserted into many places with no special meaning.
However if you insert one more newline in the program like this:
for
x
in
#(1,2,3,4)
do
format "%,"
x
You will get a runtime error, as the format function expects to have more than one parameter passed.
Here is part of the bison input file that I have:
expr:
simple_expr
| if_expr
| while_loop
| do_loop
| for_loop
| expr_seq
expr_seq:
"(" expr_semicolon_list ")"
expr_semicolon_list:
expr
| expr TK_SEMICOLON expr_semicolon_list
| expr TK_EOL expr_semicolon_list
if_expr:
"if" expr "then" expr "else" expr
| "if" expr "then" expr
| "if" expr "do" expr
// etc.
This will parse only programs which use newlines solely as expression separators; it will not accept newlines scattered in other places in the program.
My question is: Is there some way to tell bison to treat a token as an optional token? For bison it would mean this:
If you find newline token and you can shift with it or reduce, then do so.
Otherwise just discard the newline token and continue parsing.
Because if there is no way to do this, the only other solution I can think of is modifying the bison grammar file so that it expects those newlines everywhere. And bump the precedence of the rule where newline acts as an expression separator. Like this:
%precedence EXPR_SEPARATOR // high precedence
%%
// w = sequence of whitespace tokens
w: %empty // either nothing
| TK_EOL w // or newline followed by other whitespace tokens
expr:
w simple_expr w
| w if_expr w
| w while_loop w
| w do_loop w
| w for_loop w
| w expr_seq w
expr_seq:
w "(" w expr_semicolon_list w ")" w
expr_semicolon_list:
expr
| expr w TK_SEMICOLON w expr_semicolon_list
| expr TK_EOL w expr_semicolon_list %prec EXPR_SEPARATOR
if_expr:
w "if" w expr w "then" w expr w "else" w expr w
| w "if" w expr w "then" w expr w
| w "if" w expr w "do" w expr w
// etc.
However, this looks very ugly and error-prone, and I would like to avoid such a solution if possible.
My question is: Is there some way to tell bison to treat a token as an optional token?
No, there isn't. (See below for a longer explanation with diagrams.)
Still, the workaround is not quite as ugly as you think, although it's not without its problems.
In order to simplify things, I'm going to assume that the lexer can be convinced to produce only a single '\n' token regardless of how many consecutive newlines appear in the program text, including the case where there are comments scattered among the blank lines. That could be achieved with a complex regular expression, but a simpler way to do it is to use a start condition to suppress \n tokens until a regular token is encountered. The lexer's initial start condition should be the one which suppresses newline tokens, so that blank lines at the beginning of the program text won't confuse anything.
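The effect of that start condition can be sketched in Python as a filter over a raw token stream. This is only an illustration of the intended behaviour, not flex code; the name collapse_newlines is hypothetical, and it assumes comments have already been dropped by the raw tokenizer.

```python
NEWLINE = "\n"


def collapse_newlines(raw_tokens):
    """Yield tokens, passing through at most one NEWLINE after each
    regular token and none before the first regular token."""
    suppress = True  # initial state: newlines are swallowed
    for tok in raw_tokens:
        if tok == NEWLINE:
            if not suppress:
                yield NEWLINE
                suppress = True   # swallow the rest of this run
        else:
            yield tok
            suppress = False      # the next newline is significant


raw = [NEWLINE, NEWLINE, "fn", NEWLINE, NEWLINE, NEWLINE, "foo", "x", "=", NEWLINE]
print(list(collapse_newlines(raw)))
```

In flex the same two states would be start conditions, with the suppressing one active at the start of the input.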
Now, the key insight is that we don't have to insert "maybe a newline" markers all over the grammar, since every newline must appear right after some real token. And that means that we can just add one non-terminal for every terminal:
tok_id: ID | ID '\n'
tok_if: "if" | "if" '\n'
tok_then: "then" | "then" '\n'
tok_else: "else" | "else" '\n'
tok_do: "do" | "do" '\n'
tok_semi: ';' | ';' '\n'
tok_dot: '.' | '.' '\n'
tok_plus: '+' | '+' '\n'
tok_dash: '-' | '-' '\n'
tok_star: '*' | '*' '\n'
tok_slash: '/' | '/' '\n'
tok_caret: '^' | '^' '\n'
tok_open: '(' | '(' '\n'
tok_close: ')' | ')' '\n'
tok_openb: '[' | '[' '\n'
tok_closeb: ']' | ']' '\n'
/* Etc. */
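Since these wrapper rules are purely mechanical, they could be generated instead of typed by hand. A small Python sketch, assuming you keep your own table of wrapper names and terminal spellings (the function wrapper_rules and the table are hypothetical, not a bison feature):

```python
def wrapper_rules(terminals):
    """Build one 'tok_X: X | X newline' rule per (name, spelling) pair."""
    lines = []
    for name, spelling in terminals:
        # The doubled backslash emits the two characters \n in the output,
        # which is how the newline token is spelled in the grammar file.
        lines.append(f"tok_{name}: {spelling} | {spelling} '\\n'")
    return "\n".join(lines)


table = [("id", "ID"), ("if", '"if"'), ("semi", "';'"), ("open", "'('")]
print(wrapper_rules(table))
```

The output can be pasted into the grammar file, or included from a build step.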
Now, it's just a question of replacing the use of a terminal with the corresponding non-terminal defined above. (No w non-terminal is required.) Once we do that, bison will report a number of shift-reduce conflicts in the non-terminal definitions just added; any terminal which can appear at the end of an expression will instigate a conflict, since the newline could be absorbed either by the terminal's non-terminal wrapper or by the expr_semicolon_list production. We want the newline to be part of expr_semicolon_list, so we need to add precedence declarations starting with newline, so that it is lower precedence than any other token.
That will most likely work for your grammar, but it is not 100% certain. The problem with precedence-based solutions is that they can have the effect of hiding real shift-reduce conflict issues. So I'd recommend running bison on the grammar and verifying that all the shift-reduce conflicts appear where expected (in the wrapper productions) before adding the precedence declarations.
Why token fallback is not as simple as it looks
In theory, it would be possible to implement a feature similar to the one you suggest. [Note 1]
But it's non-trivial, because of the way the LALR parser construction algorithm combines states. The result is that the parser might not "know" that the lookahead token cannot be shifted until it has done one or more reductions. So by the time it figures out that the lookahead token is not valid, it has already performed reductions which would have to be undone in order to continue the parse without the lookahead token.
Most parser generators compound the problem by removing error actions corresponding to a lookahead token if the default action in the state for that token is a reduction. The effect is again to delay detection of the error until after one or more futile reductions, but it has the benefit of significantly reducing the size of the transition table (since default entries don't need to be stored explicitly). Since the delayed error will be detected before any more input is consumed, the delay is generally considered acceptable. (Bison has an option to prevent this optimisation, however.)
As a practical illustration, here's a very simple expression grammar with only two operators:
prog: expr '\n' | prog expr '\n'
expr: prod | expr '+' prod
prod: term | prod '*' term
term: ID | '(' expr ')'
That leads to this state diagram [Note 2]:
Let's suppose that we wanted to ignore newlines pythonically, allowing the input
(
a + b
)
That means that the parser must ignore the newline after the b, since the input might be
(
a + b
* c
)
(Which is fine in Python but not, if I understand correctly, in MAXScript.)
Of course, the newline would be recognised as a statement separator if the input were not parenthesized:
a + b
Looking at the state diagram, we can see that the parser will end up in State 15 after the b is read, whether or not the expression is parenthesized. In that state, a newline is marked as a valid lookahead for the reduction action, so the reduction action will be performed, presumably creating an AST node for the sum. Only after this reduction will the parser notice that there is no action for the newline. If it now discards the newline character, it's too late; there is now no way to reduce b * c in order to make it an operand of the sum.
Bison does allow you to request a Canonical LR parser, which does not combine states. As a result, the state machine is much, much bigger; so much so that Canonical-LR is still considered impractical for non-toy grammars. In the simple two-operator expression grammar above, asking for a Canonical LR parser only increases the state count from 16 to 26, as shown here:
In the Canonical LR parser, there are two different states for the reduction expr: expr '+' prod. State 16 applies at the top level, and thus the lookahead includes newline but not ). Inside parentheses the parser will instead reach state 26, where ) is a valid lookahead but newline is not. So, at least in some grammars, using a Canonical LR parser could make the prediction more precise. But features which are dependent on the use of a mammoth parsing automaton are not particularly practical.
One alternative would be for the parser to react to the newline by first simulating the reduction actions to see if a shift would eventually succeed. If you request Lookahead Correction (%define parse.lac full), bison will insert code to do precisely this. This code can create significant overhead, but many people request it anyway because it makes verbose error messages more accurate. So it would certainly be possible to repurpose this code to do token fallback handling, but no-one has actually done so, as far as I know.
Notes:
1. A similar question which comes up from time to time is whether you can tell bison to cause a token to be reclassified to a fallback token if there is no possibility to shift the token. (That would be useful for parsing languages like SQL which have a lot of non-reserved keywords.)
2. I generated the state graphs using Bison's -g option:
bison -o ex.tab.c --report=all -g ex.y
dot -Tpng -oex.png ex.dot
To produce the Canonical LR, I defined lr.type to be canonical-lr:
bison -o ex_canon.c --report=all -g -Dlr.type=canonical-lr ex.y
dot -Tpng -oex_canon.png ex_canon.dot
From this question, a grammar for expressions involving binary operators (+ - * /) which disallows outer parentheses:
top_level : expression PLUS term
| expression MINUS term
| term TIMES factor
| term DIVIDE factor
| NUMBER
expression : expression PLUS term
| expression MINUS term
| term
term : term TIMES factor
| term DIVIDE factor
| factor
factor : NUMBER
| LPAREN expression RPAREN
This grammar is LALR(1). I have therefore been able to use PLY (a Python implementation of yacc) to create a bottom-up parser for the grammar.
For comparison purposes, I would now like to try building a top-down recursive-descent parser for the same language. I have transformed the grammar, removing left-recursion and applying left-factoring:
top_level : expression top_level1
| term top_level2
| NUMBER
top_level1 : PLUS term
| MINUS term
top_level2 : TIMES factor
| DIVIDE factor
expression : term expression1
expression1 : PLUS term expression1
| MINUS term expression1
| empty
term : factor term1
term1 : TIMES factor term1
| DIVIDE factor term1
| empty
factor : NUMBER
| LPAREN expression RPAREN
Without the top_level rules this grammar is LL(1), so writing a recursive-descent parser would be fairly straightforward. Unfortunately, including top_level, the grammar is not LL(1).
Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
Is it possible to simplify this grammar to ease the recursive-descent approach?
The grammar is not LL with finite lookahead, but the language is LL(1) because an LL(1) grammar exists. Pragmatically, a recursive descent parser is easy to write even without modifying the grammar.
Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
If α is a derivation of expression, β of term and γ of factor, then top_level can derive both the sentence (α)+β and the sentence (α)*γ (but it cannot derive the sentence (α).) However, (α) is a possible derivation of both expression and term, so it is impossible to decide which production of top_level to use until the symbol following the ) is encountered. Since α can be of arbitrary length, there is no k for which a lookahead of k is sufficient to distinguish the two productions. Some people might call that LL(∞), but that doesn't seem to be a very useful grammatical category to me. (LL(*) is, afaik, the name of a parsing strategy invented by Terence Parr, and not an accepted name for a class of grammars.) I would simply say that the grammar is not LL(k) for any k.
Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
Sure. It's not even that difficult.
The first symbol must either be a NUMBER or a (. If it is a NUMBER, we predict (call) expression. If it is (, we consume it, call expression, consume the following ) (or declare an error, if the next symbol is not a close parenthesis), and then either call expression1 or term1 and then expression1, depending on what the next symbol is. Again, if the next symbol doesn't match the FIRST set of either expression1 or term1, we declare a syntax error. Note that the above strategy does not require the top_level* productions at all.
Since that will clearly work without backtracking, it can serve as the basis for how to write an LL(1) grammar.
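A minimal Python sketch of the strategy just described, written as an evaluator so that the result can be checked. All names are invented for this illustration, and expression1 and term1 are written as loops rather than as tail-recursive calls, which accepts the same input.

```python
import re


class Parser:
    def __init__(self, text):
        # Unknown non-space characters become tokens that are rejected later.
        self.tokens = re.findall(r"\d+|[-+*/()]|\S", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect(self, wanted):
        if self.advance() != wanted:
            raise SyntaxError(f"expected {wanted!r}")

    def top_level(self):
        if self.peek() == "(":
            self.advance()
            value = self.expression()
            self.expect(")")
            if self.peek() in ("*", "/"):
                value = self.expression1(self.term1(value))
            elif self.peek() in ("+", "-"):
                value = self.expression1(value)
            else:
                raise SyntaxError("outer parentheses are not allowed")
        else:
            value = self.expression()   # must start with NUMBER
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return value

    def expression(self):               # expression : term expression1
        return self.expression1(self.term())

    def expression1(self, left):        # (PLUS term | MINUS term)*
        while self.peek() in ("+", "-"):
            op = self.advance()
            right = self.term()
            left = left + right if op == "+" else left - right
        return left

    def term(self):                     # term : factor term1
        return self.term1(self.factor())

    def term1(self, left):              # (TIMES factor | DIVIDE factor)*
        while self.peek() in ("*", "/"):
            op = self.advance()
            right = self.factor()
            left = left * right if op == "*" else left / right
        return left

    def factor(self):                   # factor : NUMBER | LPAREN expression RPAREN
        tok = self.advance()
        if tok == "(":
            value = self.expression()
            self.expect(")")
            return value
        if tok is not None and tok.isdecimal():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


print(Parser("(1+2)*3+4").top_level())  # 13
```

Only one token of lookahead is ever inspected, and nothing is re-read, so no backtracking is involved.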
Is it possible to simplify this grammar to ease the recursive-descent approach?
I'm not sure if the following grammar is any simpler, but it does correspond to the recursive descent parser described above.
top_level : NUMBER optional_expression_or_term_1
| LPAREN expression RPAREN expression_or_term_1
optional_expression_or_term_1: empty
| expression_or_term_1
expression_or_term_1
: PLUS term expression1
| MINUS term expression1
| TIMES factor term1 expression1
| DIVIDE factor term1 expression1
expression : term expression1
expression1 : PLUS term expression1
| MINUS term expression1
| empty
term : factor term1
term1 : TIMES factor term1
| DIVIDE factor term1
| empty
factor : NUMBER
| LPAREN expression RPAREN
I'm left with two observations, both of which you are completely free to ignore (particularly the second one which is 100% opinion).
The first is that it seems odd to me to ban (1+2) but allow (((1)))+2, or ((1+2))+3. But no doubt you have your reasons. (Of course, you could easily ban the redundant double parentheses by replacing expression with top_level in the second production for factor.)
Second, it seems to me that the hoop-jumping involved in the LL(1) grammar in the third section is just one more reason to ask why there is any reason to use LL grammars. The LR(1) grammar is easier to read, and its correspondence with the language's syntactic structure is clearer. The logic of the generated recursive-descent parser may be easier to understand, but to me that seems secondary.
To make the grammar LL(1) you need to finish left-factoring top_level.
You stopped at:
top_level : expression top_level1
| term top_level2
| NUMBER
expression and term both have NUMBER in their FIRST sets, so they must first be substituted to left-factor:
top_level : NUMBER term1 expression1 top_level1
| NUMBER term1 top_level2
| NUMBER
| LPAREN expression RPAREN term1 expression1 top_level1
| LPAREN expression RPAREN term1 top_level2
which you can then left-factor to
top_level : NUMBER term1 top_level3
| LPAREN expression RPAREN term1 top_level4
top_level3 : expression1 top_level1
| top_level2
| empty
top_level4 : expression1 top_level1
| top_level2
Note that this still is not LL(1), as there are epsilon rules (term1, expression1) with overlapping FIRST and FOLLOW sets. So you need to factor those out too to make it LL(1).
I have a working calculator apart from one thing: unary operator '-'.
It has to be evaluated and dealt with in 2 different cases:
When there is some expression further like so -(3+3)
When there isn't: -3
For case 1, I want to get a postfix output 3 3 + -
For case 2, I want to get just correct value of this token in this field, so for example in Z10 it's 10-3 = 7.
My current idea:
E: ...
| '-' NUM %prec NEGATIVE { $$ = correct(-yylval); appendNumber($$); }
| '-' E %prec NEGATIVE { $$ = correct(P-$2); strcat(rpn, "-"); }
| NUM { appendNumber(yylval); $$ = correct(yylval); }
Where NUM is a token, but obviously the compiler says there is a reduce/reduce conflict, as E can also be a NUM in some cases. Although it works, I want to get rid of the compiler warning, and I ran out of ideas.
It has to be evaluated and dealt with in 2 different cases:
No it doesn't. The cases are not distinct.
Both - E and - NUM are incorrect. The correct grammar would be something like:
primary
: NUM
| '-' primary
| '+' primary /* for completeness */
| '(' expression ')'
;
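To show that a single '-' primary rule covers both cases, here is a minimal Python sketch of an evaluator for this grammar with arithmetic modulo P. It is not bison code; the name evaluate_mod is hypothetical, division is left out, and characters other than digits, operators and parentheses are silently skipped by the tokenizer.

```python
import re


def evaluate_mod(text, modulus):
    """Evaluate +, -, * and unary +/- over integers modulo `modulus`."""
    tokens = re.findall(r"\d+|[-+*()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expression():
        value = product()
        while peek() in ("+", "-"):
            op = advance()
            rhs = product()
            value = (value + rhs if op == "+" else value - rhs) % modulus
        return value

    def product():
        value = primary()
        while peek() == "*":
            advance()
            value = (value * primary()) % modulus
        return value

    def primary():
        tok = advance()
        if tok == "-":                  # '-' primary, whatever follows
            return (-primary()) % modulus
        if tok == "+":                  # '+' primary
            return primary()
        if tok == "(":                  # '(' expression ')'
            value = expression()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdecimal():
            return int(tok) % modulus   # NUM
        raise SyntaxError(f"unexpected token {tok!r}")

    result = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result


print(evaluate_mod("-3", 10))      # 7
print(evaluate_mod("-(3+3)", 10))  # 4
```

The same primary function handles -3 and -(3+3); no separate rule for '-' NUM is needed.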
Normally, this should be implemented as two rules (pseudocode, I don't know bison syntax):
This is the likely rule for the 'terminal' element of an expression. Naturally, a parenthesized expression leads to a recursion to the top rule:
Element => Number
| '(' Expression ')'
The unary minus (and also the unary plus!) are just one level up in the stack of productions (grammar rules):
Term => '-' Element
| '+' Element
| Element
Naturally, this can unbundle into all possible combinations such as '-' Number, '-' '(' Expression ')', likewise with '+' and without any unary operator at all.
Suppose we want addition / subtraction, and multiplication / division. Then the rest of the grammar would look like this:
Expression => Expression '+' MultiplicationExpr
| Expression '-' MultiplicationExpr
| MultiplicationExpr
MultiplicationExpr => MultiplicationExpr '*' Term
| MultiplicationExpr '/' Term
| Term
For the sake of completeness:
Terminals:
Number
Non-terminals:
Expression
Element
Term
MultiplicationExpr
Number, which is a terminal, shall match a regexp like this [0-9]+. In other words, it does not parse a minus sign — it's always a positive integer (or zero). Negative integers are calculated by matching a '-' Number sequence of tokens.
How can I improve my parser grammar so that, instead of creating an AST that contains a couple of decFunc rules for my testing code, it creates only one and sum becomes the second root? I tried to solve this problem in several different ways, but I always get a left-recursion error.
This is my testing code :
f :: [Int] -> [Int] -> [Int]
f x y = zipWith (sum) x y
sum :: [Int] -> [Int]
sum a = foldr(+) a
This image shows the tree with two decFunc nodes:
http://postimg.org/image/w5goph9b7/
This is my grammar:
prog : stat+;
stat : decFunc | impFunc ;
decFunc : ID '::' formalType ( ARROW formalType )* NL impFunc
;
anotherFunc : ID+;
formalType : 'Int' | '[' formalType ']' ;
impFunc : ID+ '=' hr NL
;
hr : 'map' '(' ID* ')' ID*
| 'zipWith' '(' ('*' |'/' |'+' |'-') ')' ID+ | 'zipWith' '(' anotherFunc ')' ID+
| 'foldr' '(' ('*' |'/' |'+' |'-') ')' ID+
| hr op=('*'| '/' | '.&.' | 'xor' ) hr | DIGIT
| 'shiftL' hr hr | 'shiftR' hr hr
| hr op=('+'| '-') hr | DIGIT
| '(' hr ')'
| ID '(' ID* ')'
| ID
;
Your test input contains two instances of content that will match the decFunc rule. The generated parse-tree shows exactly that: two sub-trees, each having a decFunc as the root.
Antlr v4 will not produce a true AST where f and sum are the roots of separate sub-trees.
Is there anything I can do with the grammar to make both f and sum roots? – Jonny Magnam
Not directly in an Antlr v4 grammar. You could:
switch to Antlr v3, or another parser tool, and define the generated AST as you wish.
walk the Antlr v4 parse-tree and create a separate AST of your desired form.
just use the parse-tree directly with the realization that it is informationally equivalent to a classically defined AST and the implementation provides a number of practical benefits.
Specifically, the standard academic AST is mutable, meaning that every (or all but the first) visitor is custom, not generated, and that any change in the underlying grammar or an interim structure of the AST will require reconsideration and likely changes to every subsequent visitor and their implemented logic.
The Antlr v4 parse-tree is essentially immutable, allowing decorations to be accumulated against tree nodes without loss of relational integrity. Visitors all use a common base structure, greatly reducing brittleness due to grammar changes and effects of prior executed visitors. As a practical matter, tree-walks are easily constructed, fast, and mutually independent except where expressly desired. They can achieve a greater separation of concerns in design and easier code maintenance in practice.
Choose the right tool for the whole job, in whatever way you define it.