I'm currently writing a simple grammar that requires operator precedence and mixed associativities in one expression. An example expression would be a -> b ?> C ?> D -> e, which should be parsed as a -> (((b ?> C) ?> D) -> e). That is, the ?> operator is a high-precedence left-associative operator whereas the -> operator is a lower-precedence right-associative operator.
I'm prototyping the grammar in ANTLR 3.5.1 (via ANTLRWorks 1.5.2) and find that it can't handle the following grammar:
prog : expr EOF;
expr : term '->' expr
| term;
term : ID rest;
rest : '?>' ID rest
| ;
It produces the error "rule expr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2".
The term and rest productions worked fine in isolation when I tested them, so I assumed this happened because the parser was getting confused by expr. To get around that, I did the following refactor:
prog : expr EOF;
expr : term exprRest;
exprRest
: '->' expr
| ;
term : ID rest;
rest : '?>' ID rest
| ;
This works fine. However, because of this refactor I now need to check for empty exprRest nodes in the output parse tree, which is non-ideal. Is there a way to make ANTLR work around the ambiguity in the initial declaration of expr? I would have assumed that the generated parser would fully match term and then do a lookahead search for "->" and either continue parsing or return the lone term. What am I missing?
As stated, the problem is in this rule:
expr : term '->' expr
| term;
The problematic part is the term which is common to both alternatives.
An LL(1) grammar doesn't allow this at all (unless term could only ever match zero tokens - but such a rule would be pointless), because the parser cannot decide which alternative to use when it can only see one token ahead (that's the 1 in LL(1)).
An LL(k) grammar would only allow this if the term rule could match at most k - 1 tokens.
An LL(*) grammar, which ANTLR 3.5 uses, does some tricks that allow it to handle rules that match any number of tokens (the ANTLR author calls this "variable look-ahead").
However, one thing these tricks cannot handle is a recursive rule, i.e. one that references itself in any way (directly or indirectly, including through other rules it invokes) - and that is exactly what your term rule does:
term : ID rest;
rest : '?>' ID rest
| ;
- the rule rest, referenced from term, recursively references itself. Thus the error message:
rule expr has non-LL(*) decision due to recursive rule invocations ...
The way to solve this limitation of LL grammars is called left-factoring:
expr : term
( '->' expr )?
;
What I did here is say "match term first" (since you want to match it in both alternatives, there's no point in deciding which alternative to match it in), then decide whether to match '->' expr (this can be decided just by looking at the very next token - if it's ->, use it - so this is even an LL(1) decision).
This is very similar to what you came to as well, but the parse tree should look very much like you intended with the original grammar.
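For illustration, here is roughly what the corresponding hand-written parser does, as a minimal Python sketch (the token-list handling and function names are mine for illustration, not ANTLR-generated code):

def parse_expr(tokens):
    # expr : term ('->' expr)? ;  -- '->' ends up right-associative
    # because the optional tail recurses into parse_expr itself.
    left = parse_term(tokens)
    if tokens and tokens[0] == '->':
        tokens.pop(0)
        return ('->', left, parse_expr(tokens))
    return left

def parse_term(tokens):
    # term : ID ('?>' ID)* ;  -- '?>' ends up left-associative because
    # each iteration folds the chain into the accumulator.
    node = tokens.pop(0)                     # ID
    while tokens and tokens[0] == '?>':
        tokens.pop(0)
        node = ('?>', node, tokens.pop(0))   # fold to the left
    return node

print(parse_expr(['a', '->', 'b', '?>', 'C', '?>', 'D', '->', 'e']))
# ('->', 'a', ('->', ('?>', ('?>', 'b', 'C'), 'D'), 'e'))
# i.e. a -> (((b ?> C) ?> D) -> e), as intended.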
I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.
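As a sanity check, here is the left-factored rule written as hand-written LL(1) recursive descent in Python (the token names and helpers are illustrative, not ANTLR output); note that every decision looks at exactly one token:

def parse_tuple(toks, i=0):
    # tuple : '(' (idents)? ')' ;  -- the optional idents also permits ()
    assert toks[i] == '('; i += 1
    if toks[i] == 'IDENT':
        i = parse_idents(toks, i)
    assert toks[i] == ')'
    return i + 1

def parse_idents(toks, i):
    # idents : IDENT ( COMMA ( idents )? )? ;
    assert toks[i] == 'IDENT'; i += 1
    if toks[i] == ',':
        i += 1                         # consume the COMMA
        if toks[i] == 'IDENT':         # one token of lookahead decides:
            i = parse_idents(toks, i)  # another item vs. trailing comma
    return i

for case in (['(', ')'],
             ['(', 'IDENT', ')'],
             ['(', 'IDENT', ',', ')'],
             ['(', 'IDENT', ',', 'IDENT', ',', ')']):
    parse_tuple(case)                  # all accepted, no backtracking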
I was having some trouble with Bison creating an operator as such:
<- is an identity postfix operator with a low precedence, used to force evaluation of what's on the left first, e.g. 1+2<-*3 (equivalent to (1+2)*3); -> is a prefix operator which does the same thing but to the right.
I was not able to get the syntax to work properly and tested with Python using - not False, which resulted in a syntax error (in Python, - has a greater precedence than not). However, this is not a problem in C or C++, where - and !/not have the same precedence.
Of course, the difference in precedence has nothing to do with a relationship between the two operators themselves, only with their relationships to other operators, which result in their relative precedences.
Why is chaining prefix or postfix operators with different precedences a problem when parsing, and how can I implement the <- and -> operators while still having higher-precedence operators like !, ++, NOT, etc.?
Obligatory Bison (this pattern is repeated for all operators, where copy has greater precedence than post_unary):
post_unary:
copy
| post_unary "++"
| post_unary "--"
| post_unary '!'
;
Chaining operators in this category, e.g. x ! -- ! works fine syntactically.
Ok, let me suggest a possible erroneous grammar based on your sketch:
low_postfix:
mid_infix
| low_postfix "<-"
mid_infix:
high_postfix
| mid_infix '+' high_postfix
high_postfix:
term
| high_postfix "++"
term:
ID
| '(' expr ')'
It should be clear just looking at those productions that var <- ++ is not part of the language. The only things that can be used as an operand to ++ are terms and other applications of ++. var <- is neither of these things.
On the other hand, var ++ <- is fine, because the operand to <- can be a mid_infix which can be a high_postfix which is an application of the ++ operator.
If the intention were to allow both of those postfix sequences, then that grammar is incorrect.
A version of that cascade is present in the Python grammar (albeit using prefix operators) which is why not - False is OK, but - not False is a syntax error. I'm reluctant to call that a bug because it may have been intentional. (Really, neither of those expressions makes much sense.) We could disagree about the value of such an intention but not on SO, which prefers to avoid opinionated discussions.
Note that what we might call "strict precedence" in this grammar and the Python grammar is by no means restricted to combinations of unary operators. Here's another one which you have likely never tried:
$ python3 -c 'print(41 + not False)'
File "<string>", line 1
print(41 + not False)
^
SyntaxError: invalid syntax
So, how can we fix that?
On some level, it would be nice to be able to just write an unambiguous grammar which conveyed our intention. And it is certainly possible to write an unambiguous grammar, which would convey the intention to bison. But it's at least an open question as to whether it would convey anything to a human reader, because the massive clutter of multiple rules necessary in order to keep track of what is and is not an acceptable grouping would be pretty daunting.
On the other hand, it's dead simple to do with bison/yacc precedence declarations. We just list the operators in order, and the parser generator resolves all the ambiguities accordingly. [See Note 1 below]
Here's a similar grammar to the above, with precedence declarations. (I left the actions in place in case you want to play with it, although it's by no means a Reproducible Example; the infrastructure it relies upon is much bigger than the grammar itself, and of little use to anyone other than me. So you'll have to define the three functions and fill in some of the bison type declarations. Or just delete the AST functions and use your own.)
%left ','
%precedence "<-"
%precedence "->"
%left '+'
%left '*'
%precedence NEG
%right "++" '('
%%
expr: expr ',' expr { $$ = make_binop(OP_LIST, $1, $3); }
| "<-" expr { $$ = make_unop(OP_LARR, $2); }
| expr "->" { $$ = make_unop(OP_RARR, $1); }
| expr '+' expr { $$ = make_binop(OP_ADD, $1, $3); }
| expr '*' expr { $$ = make_binop(OP_MUL, $1, $3); }
| '-' expr %prec NEG { $$ = make_unop(OP_NEG, $2); }
| expr '(' expr ')' %prec '(' { $$ = make_binop(OP_CALL, $1, $3); }
| "++" expr { $$ = make_unop(OP_PREINC, $2); }
| expr "++" { $$ = make_unop(OP_POSTINC, $1); }
| VALUE { $$ = make_ident($1); }
| '(' expr ')' { $$ = $2; }
;
A couple of notes:
I used %prec NEG on the unary minus production in order to separate that production from the subtraction production. I also used a %prec declaration to modify the precedence of the call production (the default would be ')'), although in this particular case that's unnecessary. It is necessary to put '(' into the precedence list, though. ( is the lookahead symbol which is used in precedence comparisons.
For many unary operators, I used bison's %precedence declaration in the precedence list, rather than %right or %left. Really, there is no such thing as associativity with unary operators, so I think it's more self-documenting to use %precedence, which doesn't resolve conflicts between reductions and shifts at the same precedence level. However, even though there is no such thing as associativity between unary operators, the nature of the precedence-resolution algorithm is that you can put prefix operators and postfix operators in the same precedence level, and choose whether the postfix or prefix operators have priority by using %right or %left, respectively. %right is almost always correct. I did that with ++, because I was a bit lazy by the time I got to that point.
This does "work" (I think). It certainly resolves all the conflicts; bison happily produces a parser without warnings. And the tests that I tried worked at least as I expected them to:
? a++->
=> [-> [++/post a]]
? a->++
=> [++/post [-> a]]
? 3*f(a)+2
=> [+ [* 3 [CALL f a]] 2]
? 3*f(a)->+2
=> [+ [-> [* 3 [CALL f a]]] 2]
? 2+<-f(a)*3
=> [+ 2 [<- [* [CALL f a] 3]]]
? 2+<-f(a)*3->
=> [+ 2 [<- [-> [* [CALL f a] 3]]]]
But there are some expressions where the operator precedence, while "correct", might not be easily explained to a novice user. For example, although the arrow operators look somewhat like parentheses, they don't group that way. Furthermore, the decision as to which of the two operators has higher precedence seems to me to be totally arbitrary (and indeed I might have done it differently from what you expected). Consider:
? <-2*f(a)->+3
=> [<- [+ [-> [* 2 [CALL f a]]] 3]]
? <-2+f(a)->*3
=> [<- [* [-> [+ 2 [CALL f a]]] 3]]
? 2+<-f(a)->*3
=> [+ 2 [<- [* [-> [CALL f a]] 3]]]
There's also something a bit odd about how the arrow operators override normal operator precedence, so that you can't just drop them into a formula without changing its meaning:
? 2+f(a)*3
=> [+ 2 [* [CALL f a] 3]]
? 2+f(a)->*3
=> [* [-> [+ 2 [CALL f a]]] 3]
If that's your intention, fine. It's your language.
Note that there are operator precedence problems which are not quite so easy to solve by just listing operators in precedence order. Sometimes it would be convenient for a binary operator to have different binding power on the left- and right-hand sides.
A classic (but perhaps controversial) case is the assignment operator, if it is an operator. Assignment must associate to the right (because parsing a = b = 0 as (a = b) = 0 would be ridiculous), and the usual expectation is that it greedily accepts as much to the right as possible. If assignment had consistent precedence, then it would also accept as much to the left as possible, which seems a bit strange, at least to me. If a = 2 + b = 7 is meaningful, my intuitions say that its meaning should be a = (2 + (b = 7)) [Note 2]. That would require differential precedence, which is a bit complicated but not unheard of. C solves this problem by restricting the left-hand side of the assignment operators to (syntactic) lvalues, which cannot be binary operator expressions. But in C++, it really does mean a = ((2 + b) = 7), which is semantically valid if 2 + b has been overloaded by a function which returns a reference.
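For what it's worth, this kind of differential binding power is easy to experiment with in a hand-written Pratt parser. Here is a Python sketch; the token format and the specific binding-power numbers are assumptions for illustration, not any real language's rules:

LBP = {'=': 60, '+': 50}   # how strongly an operator pulls on its left
RBP = {'=': 5,  '+': 50}   # how strongly it holds on to its right

def parse(tokens, min_bp=0):
    left = tokens.pop(0)                 # an operand (IDENT or INT)
    while tokens and LBP[tokens[0]] > min_bp:
        op = tokens.pop(0)
        left = (op, left, parse(tokens, RBP[op]))
    return left

print(parse(['a', '=', '2', '+', 'b', '=', '7']))
# ('=', 'a', ('+', '2', ('=', 'b', '7')))  -- i.e. a = (2 + (b = 7))
print(parse(['a', '=', 'b', '=', '0']))
# ('=', 'a', ('=', 'b', '0'))              -- right-associative, since RBP < LBP

Because '=' has a high left binding power but a low right binding power, it takes as little as possible on the left and as much as possible on the right, matching the intuition above.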
Notes
1. Precedence declarations do not really add any power to the parser generator. The languages it can produce a parser for are exactly the same languages; it produces the same sort of parsing machine (a pushdown automaton); and it is at least theoretically possible to take that pushdown automaton and reverse-engineer a grammar out of it. (In practice, the grammars produced by this process are usually monstrous. But they exist.)
All that the precedence declarations do is resolve parsing conflicts (typically in an ambiguous grammar) according to some user-supplied rules. So it's worth asking why it's so much simpler with precedence declarations than by writing an unambiguous grammar.
The simple hand-waving answer is that precedence rules only apply when there is a conflict. If the parser is in a state where only one action is possible, that's the action which remains, regardless of what the precedence rules might say. In a simple expression grammar, an infix operator followed by a prefix operator is not at all ambiguous: the prefix operator must be shifted, because there is no reduce action for a partial sequence ending with an infix operator.
But when we're writing a grammar, we have to specify explicitly what constructs are possible at each point in the grammar, which we usually do by defining a bunch of non-terminals, each corresponding to some parsing state. An unambiguous grammar for expressions has already split the expression non-terminal into a cascading series of non-terminals, one for each operator precedence value. But unary operators do not have the same binding power on both sides (since, as noted above, one side of the unary operator cannot take an operand). That means that a binary operator could well be able to accept a unary operator for one of its operands, and not be able to accept the same unary operator for its other operand. Which in turn means that we need to split all of our non-terminals again, according to whether the non-terminal appears on the left or the right side of a binary operator.
That's a lot of work, and it's really easy to make a mistake. If you're lucky, the mistake will result in a parsing conflict; but equally it could result in the grammar not being able to recognise a particular construct which you would never think of trying, but which some irate language user feels is an absolute necessity. (Like 41 + not False)
2. It's possible that my intuitions have been permanently marked by learning APL at a very early age. In APL, all operators associate to the right, basically without any precedence differences.
I'm writing a grammar for a toy language in Yacc (the one packaged with Go) and I have an expected shift-reduce conflict due to the following pseudo-issue. I have distilled the problem grammar down to the following.
start:
stmt_list
expr:
INT | IDENT | lambda | '(' expr ')' { $$ = $2 }
lambda:
'(' params ')' '{' stmt_list '}'
params:
expr | params ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
A lambda function looks something like this:
map((v) { v * 2 }, collection)
My parser emits:
conflicts: 1 shift/reduce
Given the input:
(a)
It correctly parses an expr by the '(' expr ')' rule. However given an input of:
(a) { a }
(Which would be a lambda for the identity function, returning its input). I get:
syntax error: unexpected '{'
This is because when (a) is read, the parser is choosing to reduce it as '(' expr ')', rather than consider it to be '(' params ')'. Given this conflict is a shift-reduce and not a reduce-reduce, I'm assuming this is solvable. I just don't know how to structure the grammar to support this syntax.
EDIT | It's ugly, but I'm considering defining a token so that the lexer can recognize the ')' '{' sequence and send it through as a single token to resolve this.
EDIT 2 | Actually, better still, I'll make lambdas require syntax like ->(a, b) { a * b} in the grammar, but have the lexer emit the -> rather than it being in the actual source code.
Your analysis is indeed correct; although the grammar is not ambiguous, it is impossible for the parser to decide, with the input reduced to ( <expr> and with lookahead ), whether the expr should be reduced to params before shifting the ) or whether the ) should be shifted as part of a lambda. If the next token were visible, the decision could be made, so the grammar is LR(2), which is outside of the competence of go/yacc.
If you were using bison, you could easily solve this problem by requesting a GLR parser, but I don't believe that go/yacc provides that feature.
There is an LR(1) grammar for the language (there is always an LR(1) grammar corresponding to any LR(k) grammar for any value of k) but it is rather annoying to write by hand. The essential idea of the LR(k) to LR(1) transformation is to shift the reduction decisions k-1 tokens forward by accumulating k-1 tokens of context into each production. So in the case that k is 2, each production P: N → α will be replaced with productions TNU → Tα U for each T in FIRST(α) and each U in FOLLOW(N). [See Note 1] That leads to a considerable blow-up of non-terminals in any non-trivial grammar.
Rather than pursuing that idea, let me propose two much simpler solutions, both of which you seem to be quite close to.
First, in the grammar you present, the issue really is simply the need for a two-token lookahead when the two tokens are ){. That could easily be detected in the lexer, and leads to a solution which is still hacky but a simpler hack: Return ){ as a single token. You need to deal with intervening whitespace, etc., but it doesn't require retaining any context in the lexer. This has the added bonus that you don't need to define params as a list of exprs; they can just be a list of IDENT (if that's relevant; a comment suggests that it isn't).
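As a sketch of that hack, a scanner along these lines would work; this is Python for brevity (a real go/yacc lexer would do the same in Go), and the token names are made up for illustration:

import re

def tokens(src):
    i, n = 0, len(src)
    while i < n:
        if src[i].isspace():
            i += 1
            continue
        if src[i] == ')':
            # Look past intervening whitespace for '{'.
            j = i + 1
            while j < n and src[j].isspace():
                j += 1
            if j < n and src[j] == '{':
                yield 'RPAREN_LBRACE'        # ){ as a single token
                i = j + 1
                continue
        m = re.match(r'[A-Za-z_]\w*', src[i:])
        if m:
            yield ('IDENT', m.group())
            i += m.end()
        else:
            yield src[i]                     # single-character token
            i += 1

print(list(tokens('(a) { a }')))
# ['(', ('IDENT', 'a'), 'RPAREN_LBRACE', ('IDENT', 'a'), '}']

The lambda rule then becomes '(' params RPAREN_LBRACE stmt_list '}', and the grammar needs only one token of lookahead.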
The alternative, which I think is a bit cleaner, is to extend the solution you already seem to be proposing: accept a little too much and reject the errors in a semantic action. In this case, you might do something like:
start:
stmt_list
expr:
INT
| IDENT
| lambda
| '(' expr_list ')'
{ // If $2 has more than one expr, report error
$$ = $2
}
lambda:
'(' expr_list ')' '{' stmt_list '}'
{ // If anything in expr_list is not a valid param, report error
$$ = make_lambda($2, $4)
}
expr_list:
expr | expr_list ',' expr
stmt:
/* empty */ | expr
stmt_list:
stmt | stmt_list ';' stmt
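The checks in those two actions could look something like this (a Python sketch with a hypothetical tuple-based node representation; a real go/yacc action would be written in Go):

def expr_action(expr_list):
    # '(' expr_list ')' used as a parenthesized expression: exactly one
    # expression is allowed between the parentheses.
    if len(expr_list) != 1:
        raise SyntaxError("expected a single parenthesized expression")
    return expr_list[0]

def lambda_action(expr_list, stmt_list):
    # '(' expr_list ')' '{' stmt_list '}' used as a lambda: every element
    # must be a bare identifier to be a valid parameter.
    for e in expr_list:
        if not (isinstance(e, tuple) and e[0] == 'IDENT'):
            raise SyntaxError("invalid lambda parameter: %r" % (e,))
    return ('LAMBDA', [name for _, name in expr_list], stmt_list)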
Notes
1. That's only an outline; the complete algorithm includes the mechanism to recover the original parse tree. If k is greater than 2, then T and U are strings in the FIRST(k-1) and FOLLOW(k-1) sets.
If it really is a shift-reduce conflict, and you want only the shift behavior, your parser generator may give you a way to prefer a shift vs. a reduce. This is classically how the conflict for grammar rules for "if-then-stmt" and "if-then-stmt-else-stmt" is resolved, when the if statement can also be a statement.
See http://www.gnu.org/software/bison/manual/html_node/Shift_002fReduce.html
You can get this effect two ways:
a) Count on the accidental behavior of the parsing engine.
If an LALR parser handles shifts first, and then reductions only if there are no shifts, then you'll get this "prefer shift" behavior for free. All the parser generator has to do is build the parse tables anyway, even if there is a detected conflict.
b) Enforce the accidental behavior. Design (or get a) parser generator to accept "prefer shift on token T". Then one can suppress the ambiguity. One still has to implement the parsing engine as in a), but that's pretty easy.
I think this is easier/cleaner than abusing the lexer to make strange tokens (and that doesn't always work anyway).
Obviously, you could make a preference for reductions to turn it the other way. With some extra hacking, you could make shift-vs-reduce specific to the state in which the conflict occurred; you can even make it specific to the pair of conflicting rules, but now the parsing engine needs to keep preference data around for nonterminals. That still isn't hard. Finally, you could add a predicate for each nonterminal which is called when a shift-reduce conflict is about to occur, and have it provide a decision.
The point is you don't have to accept "pure" LALR parsing; you can bend it easily in a variety of ways, if you are willing to modify the parser generator/engine a little bit. This gives a really good reason to understand how these tools work; then you can abuse them to your benefit.
In ANTLR v3, syntactic predicates could be used to resolve ambiguities, i.e., to explicitly tell ANTLR which alternative should be chosen. ANTLR4 seems to simply accept grammars with similar ambiguities, but during parsing it reports these ambiguities. It produces a parse tree despite these ambiguities (by choosing the first alternative, according to the documentation). But what can I do if I want it to choose some other alternative? In other words, how can I explicitly resolve ambiguities?
(For the simple case of the dangling else problem see: What to use in ANTLR4 to resolve ambiguities (instead of syntactic predicates)?)
A more complex example:
If I have a rule like this:
expr
: expr '[' expr? ']'
| ID expr
| '[' expr ']'
| ID
| INT
;
This will parse foo[4] as (expr foo (expr [ (expr 4) ])). But I may want to parse it as (expr (expr foo) [ (expr 4) ]). (I.e., always take the first alternative if possible. It is the first alternative, so according to the documentation, it should have higher precedence. So why does it build this tree?)
If I understand correctly, I have 2 solutions:
Basically implement the syntactic predicate with a semantic predicate (however, I'm not sure how, in this case).
Restructure the grammar.
For example, replace expr with e:
e : expr | pe
;
expr
: expr '[' expr? ']'
| ID expr
| ID
| INT
;
pe : '[' expr ']'
;
This seems to work, although the grammar became more complex.
I may have misunderstood some things, but both of these solutions seem less elegant and more complicated than syntactic predicates. Although, I like the solution for the dangling else problem with the ?? operator. But I'm not sure how to use it in this case. Is it possible?
You may be able to resolve this by placing the ID alternative above ID expr. When left-recursion is eliminated, all of your alternatives which are not left recursive are parsed before your alternatives which are left recursive.
For your example, the first non-left-recursive alternative ID expr matches the entire expression, so there is nothing left to parse afterwards.
To get this expression (expr (expr foo) [ (expr 4) ]), you can use
top : expr EOF;
expr : expr '[' expr? ']' | ID | INT ;
So I have been reading a bit on lexers, parsers, interpreters and even compiling.
For a language I'm trying to implement I settled on a Recursive Descent Parser. Since the original grammar of the language had left-recursion, I had to rewrite it slightly.
Here's a simplified version of the grammar I had (note that it's not any standard format grammar, but somewhat pseudo, I guess, it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
string term = ExpressionTerm();
if (term != null)
{
while (PeekToken() == OperatorToken)
{
term += ReadToken() + Expression();
}
}
return term;
}
public string ExpressionTerm()
{
//PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: If I would create AST nodes rather than a string in these subroutines, and evaluate the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), won't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve, or even a non-problem, as it seems? Or is it really the structure of the resulting AST people are concerned about (rather than what it evaluates to)? Could anyone shed some light? I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
Sum
|
+---+---+
| |
Var Prod
| |
a +---+---+
| |
Var Const
| |
b 3
If your language obeys the usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse or abstract trees be directly evaluatable. You can expand the rule hierarchy to include power and bitwise operators, or make it part of a hierarchy for logical expressions with and, or, and comparisons.
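As a concrete (if simplified) illustration, here is a direct evaluator for that hierarchy in Python; the tokens are assumed to be already split into integers and operator strings. One caveat: as written, sum = addend '+' sum is right-recursive, so '+' groups to the right; that is harmless for addition, but a subtraction operator would need a left-folding loop instead:

def parse_sum(toks, i):
    # sum = addend '+' sum | addend
    left, i = parse_addend(toks, i)
    if i < len(toks) and toks[i] == '+':
        right, i = parse_sum(toks, i + 1)
        return left + right, i
    return left, i

def parse_addend(toks, i):
    # addend = term '*' addend | term
    left, i = parse_term(toks, i)
    if i < len(toks) and toks[i] == '*':
        right, i = parse_addend(toks, i + 1)
        return left * right, i
    return left, i

def parse_term(toks, i):
    # term = '(' expre ')' | '-' integer | '+' integer | integer
    if toks[i] == '(':
        val, i = parse_sum(toks, i + 1)
        return val, i + 1                  # skip the ')'
    if toks[i] == '-':
        return -toks[i + 1], i + 2
    if toks[i] == '+':
        return toks[i + 1], i + 2
    return toks[i], i + 1

print(parse_sum([2, '+', 3, '*', 4], 0)[0])            # 14: '*' binds tighter
print(parse_sum(['(', 2, '+', 3, ')', '*', 4], 0)[0])  # 20: parentheses group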