I have a mathematical expression parser that should handle +, -, *, /, ^, (-), functions, and of course atoms (such as x, 1, pi, etc.). The parser is basically designed according to Wikipedia's operator precedence parser, which I've reproduced below; parse_primary() is defined elsewhere.
parse_expression ()
return parse_expression_1 (parse_primary (), 0)
parse_expression_1 (lhs, min_precedence)
while the next token is a binary operator whose precedence is >= min_precedence
op := next token
rhs := parse_primary ()
while the next token is a binary operator whose precedence is greater
than op's, or a right-associative operator
whose precedence is equal to op's
lookahead := next token
rhs := parse_expression_1 (rhs, lookahead's precedence)
lhs := the result of applying op with operands lhs and rhs
return lhs
How can I modify this parser to handle (-) correctly? Better yet, how can I implement a parser with support for all infix and postfix operators (! for instance) that I might want? Finally, how should functions be treated?
I should note that (-) negation is distinguished in the lexer from - "subtraction", so it can be treated as a different token.
All unary operators and function call stuff basically belongs in the parse_primary, which should accept a legal unary term.
Related
I was having some trouble with Bison creating an operator as such:
<- = identity postfix operator with a low precedence to force evaluation of what's on the left first, e.g. 1+2<-*3 (equivalent (1+2)*3) as well as -> which is a prefix operator which does the same thing but to the right.
I was not able to get the syntax to work properly and tested with Python using - not False, which resulted in a syntax error (in Python, - has a greater precedence than not). However, this is not a problem in C or C++, where - and !/not have the same precedence.
Of course, the difference in precedence has nothing to do with the relationship between the 2 operators, only a relationship with other operators that result in the relative precedences between them.
Why is chaining prefix or postfix operators with different precedences a problem when parsing and how can implement the <- and -> operators while still having higher-precedence operators like !, ++, NOT, etc.?
Obligatory Bison (this pattern is repeated for all operators, where copy has greater precedence than post_unary):
post_unary:
copy
| post_unary "++"
| post_unary "--"
| post_unary '!'
;
Chaining operators in this category, e.g. x ! -- ! works fine syntactically.
Ok, let me suggest a possible erroneous grammar based on your sketch:
low_postfix:
mid_infix
| low_postfix "<-"
mid_infix:
high_postfix
| mid_infix '+' high_postfix
high_postfix:
term
| high_postfix "++"
term:
ID
'(' expr ')'
It should be clear just looking at those productions that var <- ++ is not part of the language. The only things that can be used as an operand to ++ are terms and other applications of ++. var <- is neither of these things.
On the other hand, var ++ <- is fine, because the operand to <- can be a mid_infix which can be a high_postfix which is an application of the ++ operator.
If the intention were to allow both of those postfix sequences, then that grammar is incorrect.
A version of that cascade is present in the Python grammar (albeit using prefix operators) which is why not - False is OK, but - not False is a syntax error. I'm reluctant to call that a bug because it may have been intentional. (Really, neither of those expressions makes much sense.) We could disagree about the value of such an intention but not on SO, which prefers to avoid opinionated discussions.
Note that what we might call "strict precedence" in this grammar and the Python grammar is by no means restricted to combinations of unary operators. Here's another one which you have likely never tried:
$ python3 -c 'print(41 + not False)'
File "<string>", line 1
print(41 + not False)
^
SyntaxError: invalid syntax
So, how can we fix that?
On some level, it would be nice to be able to just write an unambiguous grammar which conveyed our intention. And it is certainly possible to write an unambiguous grammar, which would convey the intention to bison. But it's at least an open question as to whether it would convey anything to a human reader, because the massive clutter of multiple rules necessary in order to keep track of what is and is not an acceptable grouping would be pretty daunting.
On the other hand, it's dead simple to do with bison/yacc precedence declarations. We just list the operators in order, and the parser generator resolves all the ambiguities accordingly. [See Note 1 below]
Here's a similar grammar to the above, with precedence declarations. (I left the actions in place in case you want to play with it, although it's by no means a Reproducible Example; the infrastructure it relies upon is much bigger than the grammar itself, and of little use to anyone other than me. So you'll have to define the three functions and fill in some of the bison type declarations. Or just delete the AST functions and use your own.)
%left ','
%precedence "<-"
%precedence "->"
%left '+'
%left '*'
%precedence NEG
%right "++" '('
%%
expr: expr ',' expr { $$ = make_binop(OP_LIST, $1, $3); }
| "<-" expr { $$ = make_unop(OP_LARR, $2); }
| expr "->" { $$ = make_unop(OP_RARR, $1); }
| expr '+' expr { $$ = make_binop(OP_ADD, $1, $3); }
| expr '*' expr { $$ = make_binop(OP_MUL, $1, $3); }
| '-' expr %prec NEG { $$ = make_unop(OP_NEG, $2); }
| expr '(' expr ')' %prec '(' { $$ = make_binop(OP_CALL, $1, $3); }
| "++" expr { $$ = make_unop(OP_PREINC, $2); }
| expr "++" { $$ = make_unop(OP_POSTINC, $1); }
| VALUE { $$ = make_ident($1); }
| '(' expr ')' { $$ = $2; }
A couple of notes:
I used %prec NEG on the unary minus production in order to separate that production from the subtraction production. I also used a %prec declaration to modify the precedence of the call production (the default would be ')'), although in this particular case that's unnecessary. It is necessary to put '(' into the precedence list, though. ( is the lookahead symbol which is used in precedence comparisons.
For many unary operators, I used bison %precedence declaration in the precedence list, rather than %right or %left. Really, there is no such thing as associativity with unary operators, so I think that it's more self-documenting to use %precedence, which doesn't resolve conflicts involving reductions and shifts in the same precedence level. However, even though there is no such thing as associativity between unary operators, the nature of the precedence resolution algorithm is that you can put prefix operators and postfix operators in the same precedence level and choose whether the postfix or prefix operators have priority by using %right or %left, respectively. %right is almost always correct. I did that with ++, because I was a bit lazy by the time I got to that point.
This does "work" (I think). It certainly resolves all the conflicts; bison happily produces a parser without warnings. And the tests that I tried worked at least as I expected them to:
? a++->
=> [-> [++/post a]]
? a->++
=> [++/post [-> a]]
? 3*f(a)+2
=> [+ [* 3 [CALL f a]] 2]
? 3*f(a)->+2
=> [+ [-> [* 3 [CALL f a]]] 2]
? 2+<-f(a)*3
=> [+ 2 [<- [* [CALL f a] 3]]]
? 2+<-f(a)*3->
=> [+ 2 [<- [-> [* [CALL f a] 3]]]]
But there are some expressions where the operator precedence, while "correct", might not be easily explained to a novice user. For example, although the arrow operators look somewhat like parentheses, they don't group that way. Furthermore, the decision as to which of the two operators has higher precedence seems to me to be totally arbitrary (and indeed I might have done it differently from what you expected). Consider:
? <-2*f(a)->+3
=> [<- [+ [-> [* 2 [CALL f a]]] 3]]
? <-2+f(a)->*3
=> [<- [* [-> [+ 2 [CALL f a]]] 3]]
? 2+<-f(a)->*3
=> [+ 2 [<- [* [-> [CALL f a]] 3]]]
There's also something a bit odd about how the arrow operators override normal operator precedence, so that you can't just drop them into a formula without changing its meaning:
? 2+f(a)*3
=> [+ 2 [* [CALL f a] 3]]
? 2+f(a)->*3
=> [* [-> [+ 2 [CALL f a]]] 3]
If that's your intention, fine. It's your language.
Note that there are operator precedence problems which are not quite so easy to solve by just listing operators in precedence order. Sometimes it would be convenient for a binary operator to have different binding power on the left- and right-hand sides.
A classic (but perhaps controversial) case is the assignment operator, if it is an operator. Assignment must associate to the right (because parsing a = b = 0 as (a = b) = 0 would be ridiculous), and the usual expectation is that it greedily accepts as much to the right as possible. If assignment had consistent precedence, then it would also accept as much to the left as possible, which seems a bit strange, at least to me. If a = 2 + b = 7 is meaningful, my intuitions say that its meaning should be a = (2 + (b = 7)) [Note 2]. That would require differential precedence, which is a bit complicated but not unheard of. C solves this problem by restricting the left-hand side of the assignment operators to (syntactic) lvalues, which cannot be binary operator expressions. But in C++, it really does mean a = ((2 + b) = 7), which is semantically valid if 2 + b has been overloaded by a function which returns a reference.
Notes
Precedence declarations do not really add any power to the parser generator. The languages it can produce a parser for are exactly the same languages; it produces the same sort of parsing machine (a pushdown automaton); and it is at least theoretically possible to take that pushdown automaton and reverse engineer a grammar out of it. (In practice, the grammars produced by this process are usually monstrous. But they exist.)
All that the precedence declarations do is resolve parsing conflicts (typically in an ambiguous grammar) according to some user-supplied rules. So it's worth asking why it's so much simpler with precedence declarations than by writing an unambiguous grammar.
The simple hand-waving answer is that precedence rules only apply when there is a conflict. If the parser is in a state where only one action is possible, that's the action which remains, regardless of what the precedence rules might say. In a simple expression grammar, an infix operator followed by a prefix operator is not at all ambiguous: the prefix operator must be shifted, because there is no reduce action for a partial sequence ending with an infix operator.
But when we're writing a grammar, we have to specify explicitly what constructs are possible at each point in the grammar, which we usually do by defining a bunch of non-terminals, each corresponding to some parsing state. An unambiguous grammar for expressions already has split the expression non-terminal into a cascading series of non-terminals, one for each operator precedence value. But unary operators do not have the same binding power on both sides (since, as noted above, one side of the unary operator cannot take an operand). That means that a binary operator could well be able to accept a unary operator for one of its operands, and not be able to accept the same unary operator for its other operand. Which in turn means that we need to split all of our non-terminals again, corresponding to whether the non-terminal appears on the left or the right side of a binary operator.
That's a lot of work, and it's really easy to make a mistake. If you're lucky, the mistake will result in a parsing conflict; but equally it could result in the grammar not being able to recognise a particular construct which you would never think of trying, but which some irate language user feels is an absolute necessity. (Like 41 + not False)
It's possible that my intuitions have been permanently marked by learning APL at a very early age. In APL, all operators associate to the right, basically without any precedence differences.
Assume I build a Abstract Syntax Tree of simple arithmetic operators, like Div(left,right), Add(left,right), Prod(left,right),Sum(left,right), Sub(left,right).
However when I want to convert the AST to string, I found it is hard to remove those unnecessary parathesis.
Notice the output string should follow the normal math operator precedence.
Examples:
Prod(Prod(1,2),Prod(2,3)) let's denote this as ((1*2)*(2,3))
make it to string, it should be 1*2*2*3
more examples:
(((2*3)*(3/5))-4) ==> 2*3*3/5 - 4
(((2-3)*((3*7)/(1*5))-4) ==> (2-3)*3*7/(1*5) - 4
(1/(2/3))/5 ==> 1/(2/3)/5
((1/2)/3))/5 ==> 1/2/3/5
((1-2)-3)-(4-6)+(1-3) ==> 1-2-3-(4-6)+1-3
I find the answer in this question.
Although the question is a little bit different from the link above, the algorithm still applies.
The rule is: if the children of the node has lower precedence, then a pair of parenthesis is needed.
If the operator of a node one of -, /, %, and if the right operand equals its parent node's precedence, it also needs parenthesis.
I give the pseudo-code (scala like code) blow:
def toString(e:Expression, parentPrecedence:Int = -1):String = {
e match {
case Sub(left2,right2) =>
val p = 10
val left = toString(left2, p)
val right = toString(right, p + 1) // +1 !!
val op = "-"
lazy val s2 = left :: right :: Nil mkString op
if (parentPrecedence > p )
s"($s2)"
else s"$s2"
//case Modulus and divide is similar to Sub except for p
case Sum(left2,right2) =>
val p = 10
val left = toString(left2, p)
val right = toString(right, p) //
val op = "-"
lazy val s2 = left :: right :: Nil mkString op
if (parentPrecedence > p )
s"($s2)"
else s"$s2"
//case Prod is similar to Sum
....
}
}
For a simple expression grammar, you can eliminate (most) redundant parentheses using operator precedence, essentially the same way that you parse the expression into an AST.
If you're looking at a node in an AST, all you need to do is to compare the precedence of the node's operator with the precedence of the operator for the argument, using the operator's associativity in the case that the precedences are equal. If the node's operator has higher precedence than an argument, the argument does not need to be surrounded by parentheses; otherwise it needs them. (The two arguments need to be examined independently.) If an argument is a literal or identifier, then of course no parentheses are necessary; this special case can easily be handled by making the precedence of such values infinite (or at least larger than any operator precedence).
However, your example includes another proposal for eliminating redundant parentheses, based on the mathematical associativity of the operator. Unfortunately, mathematical associativity is not always applicable in a computer program. If your expressions involving floating point numbers, for example, a+(b+c) and (a+b)+c might have very different values:
(gdb) p (1000000000000000000000.0 + -1000000000000000000000.0) + 2
$1 = 2
(gdb) p 1000000000000000000000.0 + (-1000000000000000000000.0 + 2)
$2 = 0
For this reason, it's pretty common for compilers to avoid rearranging the order of application of multiplication and addition, at least for floating point arithmetic, and also for integer arithmetic in the case of languages which check for integer overflow.
But if you do really want to rearrange based on mathematical associativity, you'll need an additional check during the walk of the AST; before checking precedence, you'll want to check whether the node you're visiting and its left argument use the same operator, where that operator is known to be mathematically associative. (This assumes that only operators which group to the left are mathematically associative. In the unlikely case that you have a mathematically associative operator which groups to the right, you'll want to check the visited node and its right-hand argument.)
If that condition is met, you can rotate the root of the AST, turning (for example) PROD(PROD(a,b),□)) into PROD(a,PROD(b,□)). That might lead to additional rotations in the case that a is also a PROD node.
I have a boolean expression in prefix notation. Lets say it is or and A B or or C D E. When I convert it to infix notation I end up with
((A and B) or ((C or D) or E)). I want to reduce it to (A and B) or C or D or E. Should I reduce infix notation or is it actually easier to get reduced equation from prefix notation. What algorithms should I use?
Paranthesis can be removed in expression X % (X1 ? X2 ? .. ? Xn) % X(n+1) where Xi is a parenthesized expression or boolean value "?" and "%" are operators if and only if each "?" operator has precedence higher or equal to "%" operator.
For infix notation you would find innermost expression, check if parenthesis can be removed, save result, process parent expresion and continue until all parenthesis checks are done.
This turns into a mapping problem. Postfix notation makes parenthesis elimination easy. Translation between prefix, infix and postfix notations is trivial.
I'm trying to parse expressions for a simple scripting language, but I'm confused. Right now, only numbers and string literals can be parsed as an expression:
int x = 5;
double y = 3.4;
str t = "this is a string";
However, I am confused on parsing more complex expressions:
int a = 5 + 9;
int x = (5 + a) - (a ^ 2);
I'm thinking I would implement it like:
do {
// no clue what I would do here?
if (current token is a semi colon) break;
}
while (true);
Any help would be great, I have no clue where to start. Thanks.
EDIT:
My parser is a Recursive Descent Parser
My expression "class" is as follows:
typedef struct s_Expression {
char type;
Token *value;
struct s_Expression *leftHand;
char operand;
struct s_Expression *rightHand;
} ExpressionNode;
Someone mentioned that a Recursive Descent Parser is capable of parsing expressions without doing infix, to postfix. Preferably, I would like an Expression like this:
For example this:
int x = (5 + 5) - (a / b);
Would be parsed into this:
Note: this isn't valid C, this is just some pseudo ish code to get my point across simply :)
ExpressionNode lh;
lh.leftHand = new ExpressionNode(5);
lh.operand = '+'
lh.rightHand = new ExpressionNode(5);
ExpressionNode rh;
rh.leftHand = new ExpressionNode(a);
rh.operand = '/';
rh.rightHand = new ExpressionNode(b);
ExpressionNode res;
res.leftHand = lh;
res.operand = '-';
res.rightHand = rh;
I asked the question pretty late at night, so sorry if I wasn't clear and that I completely forgot what my original goal was.
There are multiple methods to do this. One I've used in the past is to read the input string (that is, the programming language code) and convert expressions from infix to reverse-polish notation. Here's a really good post about doing just that:
http://andreinc.net/2010/10/05/converting-infix-to-rpn-shunting-yard-algorithm/
Note that = is an operator also, so your parsing should really include the whole file of code, and not just certain expressions.
Once in reverse-polish, expressions are super-easy to evaluate. You just pop the stack, storing operands as you go, until you hit an operator. Pop as many operands from the stack as required by the operator and perform your operation.
The method I ended up using is Operator precedence parsing.
parse_expression_1 (lhs, min_precedence)
lookahead := peek next token
while lookahead is a binary operator whose precedence is >= min_precedence
op := lookahead
advance to next token
rhs := parse_primary ()
lookahead := peek next token
while lookahead is a binary operator whose precedence is greater
than op's, or a right-associative operator
whose precedence is equal to op's
rhs := parse_expression_1 (rhs, lookahead's precedence)
lookahead := peek next token
lhs := the result of applying op with operands lhs and rhs
return lhs
If you are building a recursive descent parser, implementing a shunting yard is a lot of unnecessary work because recursive descent is perfectly capable of evaluating expressions or outputting RPN by itself.
You haven't told us anything about your programming language, your Token structs, or even what you are trying to achieve, so it is really hard to give you sample code. Take a look at Eric White's implementation of a recursive descent parser in C#. If you provide more details we will be able to help more.
In the appendices of the Dragon-book, a LL(1) front end was given as a example. I think it is very helpful. However, I find out that for the context free grammar below, a at least LL(2) parser was needed instead.
statement : variable ':=' expression
| functionCall
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
variable : ID
| ID'.'variable
| ID '[' expression ']'
;
How could I adapt the lexer for LL(1) parser to support k look ahead tokens?
Are there some elegant ways?
I know I can add some buffers for tokens. I'd like to discuss some details of programming.
this is the Parser:
class Parser
{
private Lexer lex;
private Token look;
public Parser(Lexer l)
{
lex = l;
move();
}
private void move()
{
look = lex.scan();
}
}
and the Lexer.scan() returns the next token from the stream.
In effect, you need to buffer k lookahead tokens in order to do LL(k) parsing. If k is 2, then you just need to extend your current method, which buffers one token in look, using another private member look2 or some such. For larger k, you could use a ring buffer.
In practice, you don't need the full lookahead all the time. Most of the time, one-token lookahead is sufficient. You should structure the code as a decision tree, where future tokens are only consulted if necessary to resolve ambiguity. (It's often useful to provide a special token type, "unknown", which can be assigned to the buffered token list to indicate that the lookahead hasn't reached that point yet. Alternatively, you can just always maintain k tokens of lookahead; for handbuilt parsers, that can be simpler.)
Alternatively, you can use a fallback structure where you simply try one alternative and if that doesn't work, instead of reporting a syntax error, restore the state of the parser and lexer to the next alternative. In this model, the lexer takes as an explicit argument the current input buffer position, and the input buffer needs to be rewindable. However, you can use a lookahead buffer to effectively memoize the lexer function, which can avoid rewinding and rescanning. (Scanning is usually fast enough that occasional rescans don't matter, so you might want to put off adding code complexity until your profiling indicates that it would be useful.)
Two notes:
1) I'm skeptical about the rule:
functionCall : ID'(' (expression ( ',' expression )*)* ')'
;
That would allow, for example:
function(a[3], b[2] c[x] d[y], e.foo)
which doesn't look right to me. Normally, you'd mark the contents of the () as optional instead of repeatable, eg. using an optional marker ? instead of the second Kleene star *:
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
2) In my opinion, you really should consider using bottom-up parsing for an expression language, either a generated LR(1) parser or a hand-built Pratt parser. LL(1) is rarely adequate. Of course, if you're using a parser generator, you can use tools like ANTLR which effectively implement LL(∞); that will take care of the lookahead for you.