Handling extra operators in Shunting-yard - parsing

Given an input like this: 3+4+
Algorithm turns it in to 3 4 + +
I can find the error when it's time to execute the postfix expression.
But, is it possible to spot this during the conversion?
(Wikipedia article and internet articles I've read do not handle this case)
Thank you

Valid expressions can be validated with a regular expression, aside from parenthesis mismatching. (Mismatched parentheses will be caught by the shunting-yard algorithm as indicated in the wikipedia page, so I'm ignoring those.)
The regular expression is as follows:
PRE* OP POST* (INF PRE* OP POST*)*
where:
PRE is a prefix operator or (
POST is a postfix operator or )
INF is an infix operator
OP is an operand (a literal or a variable name)
The Wikipedia page you linked includes function calls but not array subscripting; if you want to handle these cases, then:
PRE is a prefix operator or (
POST is a postfix operator or ) or ]
INF is an infix operator or ( or ,
OP is an operand, including function names
One thing to note in the above is that PRE and INF are in disjoint contexts; that is, if PRE is valid, then INF is not, and vice versa. This implies that using the same symbol for both a prefix operator and an infix operator is unambiguous. Also, PRE and POST are in disjoint contexts, so you can use the same symbol for prefix and postfix operators. However, you cannot use the same symbol for postfix and infix operators. As examples, consider the C/C++ operators:
- prefix or infix
+ prefix or infix
-- prefix or postfix
++ prefix or postfix
I implicitly used this feature above to allow ( to be used either as an expression grouper (effectively prefix) and as a function call (infix, because it comes between the function name and the argument list; the operator is "call".)
It's most common to implement that regular expression as a state machine, because there are only two states:
+-----+ +-----+
|State|-----OP---->|State|
| 1 |<----INF----| 2 |
| |---+ | |---+
+-----+ | +-----+ |
^ PRE ^ POST
| | | |
+------+ +------+
We could call State 1 "want operand" and State 2 "have operand". A simple implementation would just split the shunting yard algorithm as presented in wikipedia into two loops, something like this (if you don't like goto, it can be eliminated, but it really is the clearest way to present a state machine).
want_operand:
read a token. If there are no more tokens, announce an error.
if the token is an prefix operator or an '(':
mark it as prefix and push it onto the operator stack
goto want_operand
if the token is an operand (identifier or variable):
add it to the output queue
goto have_operand
if the token is anything else, announce an error and stop. (**)
have_operand:
read a token
if there are no more tokens:
pop all operators off the stack, adding each one to the output queue.
if a `(` is found on the stack, announce an error and stop.
if the token is a postfix operator:
mark it as postfix and add it to the output queue
goto have_operand.
if the token is a ')':
while the top of the stack is not '(':
pop an operator off the stack and add it to the output queue
if the stack becomes empty, announce an error and stop.
if the '(' is marked infix, add a "call" operator to the output queue (*)
pop the '(' off the top of the stack
goto have_operand
if the token is a ',':
while the top of the stack is not '(':
pop an operator off the stack and add it to the output queue
if the stack becomes empty, announce an error
goto want_operand
if the token is an infix operator:
(see the wikipeda entry for precedence handling)
(normally, all prefix operators are considered to have higher precedence
than infix operators.)
got to want_operand
otherwise, token is an operand. Announce an error
(*) The procedure as described above does not deal gracefully with parameter lists;
when the postfix expression is being evaluated and a "call" operator is found, it's
not clear how many arguments need to be evaluated. It might be that function names
are clearly identifiable, so that the evaluator can just attach arguments to the
"call" until a function name is found. But a cleaner solution is for the "," handler
to increment the argument count of the "call" operator (that is, the open
parenthesis marked as "infix"). The ")" handler also needs to increment the
argument count.
(**) The state machine as presented does not correctly handle function calls with 0
arguments. This call will show up as a ")" read in the want_operand state and with
a "call" operator on top of the stack. If the "call" operator is marked with an
argument count, as above, then the argument count must be 0 when the ")" is read,
and in this case, unlike the have_operand case, the argument count must not be
incremented.

Related

How is the conditional operator parsed?

So, the cppreference claims:
The expression in the middle of the conditional operator (between ? and :) is parsed as if parenthesized: its precedence relative to ?: is ignored.
However, it appears to me that the part of the expression after the ':' operator is also parsed as if it were between parentheses. I've tried to implement the ternary operator in my programming language (and you can see the results of parsing expressions here), and my parser pretends that the part of the expression after ':' is also parenthesized. For example, for the expression (1?1:0?2:0)-1, the interpreter for my programming language outputs 0, and this appears to be compatible with C. For instance, the C program:
#include <stdio.h>
int main() {
printf("%d\n",(1?1:0?2:0)-1);
}
Outputs 0.
Had I programmed the parser of my programming language that, when parsing the ternary operators, simply take the first already parsed node after ':' and take it as the third operand to '?:', it would output the same as ((1?1:0)?2:0)-1, that is 1.
My question is whether this would (pretending that the expression after the ':' is parenthesized) always be compatible with C?
"Pretends that it is parenthesised" is some kind of description of operator parenthesis. But of course that has to be interpreted relative to precedence relations (including associativity). So in a-b*c and a*b-c, the subtraction effectively acts as though its arguments are parenthesised, only the left-hand argument is treated that way in a-b-c and it is the comparison operator which causes grouping in a<b-c and a-b<c.
I'm sure you know all that since your parser seems to work for all these cases, but I say that because the ternary operator is right-associative and of lower precedence than any other operator [Note 1]. That means that the pseudo-parentheses imposed by operator precedence surround the right-hand argument (regardless of its dominating operator, since all operators have higher precedence), and also the left-hand argument unless its dominating operator is another conditional operator. But that wouldn't be the case in C, where the comma operator has lower precedence and would not be enclosed by the imaginary parentheses following the :.
It's important to understand what is meant by the precedence of a complex operator. In effect, to compute the precedence relations we first collapse the operator to a simple ?: which includes the enclosed (second) argument. This is not "as if the expression were parenthesized", because it is parenthesized. It is parenthesized between ? and :, which in this context are syntactically parenthetic.
In this sense, it is very similar to the usual analysis of the subscript operator as a postfix operator, although the brackets of the subscript operator enclose a second argument. The precedence of the subscript operator is logically what would result from considering it to be a single [], abstracting away the expression contained inside. This is also the same as the function call operator. That happens to be written with parentheses, but the precise symbols are not important: it is possible to imagine an alternative language in which function calls are written with different symbols, perhaps { and }. That wouldn't affect the grammar at all.
It might seem odd to think of ? and : to be "parenthetic", since they don't look parenthetic. But a parser doesn't see the shapes of the symbols. It is satisfied by being told that a ( is closed by a ) and, in this case, that a ? is closed by a :. [Note 2]
Having said all that, I tried your compiler on the conditional expression
d = 0 ? 0 : n / d
It parses this expression correctly, but the compiled code computes n / d before verifying whether d = 0 is true. That's not the way the conditional operator should work; in this case, it will lead to an unexpected divide by 0 exception. The conditional operator must first evaluate its left-hand argument, and then evaluate exactly one of the other two expressions.
Notes:
In C, this is not quite correct. The comma operator has lower precedence, and there is a more complex interaction with assignment operators, which logically have the same precedence and are also right-associative.
In C-like languages those symbols are not used for any other purpose, so it's OK to just regard them as strange-looking parentheses and leave it at that. But as the case of the function-call operator shows (or, for that matter, the unary - operator), it is sometimes possible to reuse operator symbols for more than one purpose.
As a curiosity, it is not strictly necessary that open and close parentheses be different symbols, as long as they are not used for any other purpose. So, for example, if | is not used as an operator symbol (as it is in C), then you could use | a | to mean the absolute value of a without creating any ambiguities.
A precise analysis of the circumstances in which symbol reuse leads to actual ambiguities is beyond the scope of this answer.

Error: Unexpected infix operator in expression, about a successfully compiled prefix operator

Playing around a little bit with infix operators, I was surprised about the following:
let (>~~~) = function null -> String.Empty | s -> s // compiles fine, see screenshot
match >~~~ input with .... // error: Unexpected infix operator in expression
and:
Changing the first characters of the prefix operator (to !~~~ for instance) fixes it. That I get an error that the infix operator is unexpected is rather weird. Hovering shows the definition to be string -> string.
I'm not too surprised about the error, F# requires (iirc) that the first character of a prefix operator must itself be one of the predefined prefix operators. But why does it compile just fine, and when I use it, the compiler complains?
Update: the F# compiler seems to know in other cases just fine when I use an invalid character in my operator definition, it says "Invalid operator definition. Prefix operator definitions must use a valid prefix operator name."
The rules for custom operators in F# are quite tight - so even though you can define custom operators, there is a lot of rules about how they will behave and you cannot change those. In particular:
Only some operators (mainly those with ! and ~) can be used as prefix operators. With ~ you can also overload unary operators +, -, ~ and ~~, so if you define an operator named ~+., you can then use it as e.g. +. 42.
Other operators (including those starting with >) can only be used as infix. You can turn any operator into ordinary function using parentheses, which is why e.g. (+) 1 2 is valid.
The ? symbols is special (it is used for dynamic invocation) and cannot appear as the first symbol of a custom operator.
I think the most intuitive way of thinking about this is that custom operators will behave like standard F# operators, but you can add additional symbols after the standard operator name.

ANTLR4 Context-sensitive rule: unexpected parsing/resynchronization when failing semantic predicate

I'm working on a grammar that is context-sensitive. Here is its description:
It describes the set of expressions.
Each expression contains one or more parts separated by logical operator.
Each part consists of optional field identifier followed by some comparison operator (that is also optional) and the list of values.
Values are separated by logical operator as well.
By default value is a sequence of characters. Sometimes (depending on context) set of possible characters for each value can be extended. It even can consume comparison operator (that is used for separating of field identifiers from list of values, according to 3rd rule) to treat it as value's character.
Here's the simplified version of a grammar:
grammar TestGrammar;
#members {
boolean isValue = false;
}
exprSet: (expr NL?)+;
expr: expr log_op expr
| part
| '(' expr ')'
;
part: (fieldId comp_op)? values;
fieldId: STRNG;
values: values log_op values
| value
| '(' values ')'
;
value: strng;
strng: ( STRNG
| {isValue}? comp_op
)+;
log_op: '&' '&';
comp_op: '=';
NL: '\r'? '\n';
WS: ' ' -> channel(HIDDEN);
STRNG: CHR+;
CHR: [A-Za-z];
I'm using semantic predicate in strng rule. It should extend the set of possible tokens depending on isValue variable;
The problem occurs when semantic predicate evaluates to false. I expect that 2 STRNG tokens with '=' token between them will be treated as part node. Instead of it, it parses each STRNG token as a value, and throws out '=' token when re-synchronizing.
Here's the input string and the resulting expression tree that is incorrect:
a && b=c
To look at correct expression tree it's enough to remove an alternative with semantic predicate from strng rule (that makes it static and so is inappropriate for my solution):
strng: ( STRNG
// | {isValue}? comp_op
)+;
Here's resulting expression tree:
BTW, when semantic predicate evaluates to true - the result is as expected: strng rule matches an extended set of tokens:
strng: ( STRNG
| {!isValue}? comp_op
)+;
Please explain why this happens in such way, and help to find out correct solution. Thanks!
What about removing one option from values? Otherwise the text a && b may be either a
expr -> expr log_op expr
or
expr -> part -> values log_op values
.
It seems Antlr resolves it by using the second option!
values
: //values log_op values
value
| '(' values ')'
;
I believe your expr rule is written in the wrong order. Try moving the binary expression to be the last alternative instead of the first.
Ok, I've realized that current approach is inappropriate for my task.
I've chosen another approach based on overriding of Lexer's nextToken() and emit() methods, as described in ANTLR4: How to inject tokens .
It has given me almost full control on the stream of tokens. I got following advantages:
assigning required types to tokens;
postpone sending tokens with yet undefined type to parser (by sending fake tokens on hidden channel);
possibility to split and merge tokens;
possibility to organize postponed tokens into queues.
Having all these possibilities I'm able to resolve all the ambiguities in the parser.
P.S. Thanks to everyone who tried to help, I appreciate it!

What exactly makes grammar rules left-recursive in antlr?

So, I was wondering what makes a parser like:
line : expression EOF;
expression : m_expression (PLUS m_expression)?;
m_expression: basic (TIMES basic)?;
basic : NUMBER | VARIABLE | (OPENING expression CLOSING) | expression;
left recursive and invalid, while a parser like
line : expression EOF;
expression : m_expression (PLUS m_expression)?;
m_expression: basic (TIMES basic)?;
basic : NUMBER | VARIABLE | (OPENING expression CLOSING);
is valid and works even though the definition of 'basic' still refers to 'expression'. In particular, I'd like to be able to parse expressions in the form of
a+b+c
without introducing operations acting on more than two operands.
line calls expression calls m_expression calls basic which calls expression... that is indirectly left recursive and bad for both v3 antlr and v4. The definition of left recursion means you can get back to the same rule without consuming a token. You have the OPENING token in front of expression in the second instance.

Bison Shift/Reduce conflict for simple grammar

I'm building a parser for a language I've designed, in which type names start with an upper case letter and variable names start with a lower case letter, such that the lexer can tell the difference and provide different tokens. Also, the string 'this' is recognised by the lexer (it's an OOP language) and passed as a separate token. Finally, data members can only be accessed on the 'this' object, so I built the grammar as so:
%token TYPENAME
%token VARNAME
%token THIS
%%
start:
Expression
;
Expression:
THIS
| THIS '.' VARNAME
| Expression '.' TYPENAME
;
%%
The first rule of Expression allows the user to pass 'this' around as a value (for example, returning it from a method or passing to a method call). The second is for accessing data on 'this'. The third rule is for calling methods, however I've removed the brackets and parameters since they are irrelevant to the problem. The originally grammar was clearly much larger than this, however this is the smallest part that generates the same error (1 Shift/Reduce conflict) - I isolated it into its own parser file and verified this, so the error has nothing to do with any other symbols.
As far as I can see, the grammar given here is unambiguous and so should not produce any errors. If you remove any of the three rules or change the second rule to
Expression '.' VARNAME
there is no conflict. In any case, I probably need someone to state the obvious of why this conflict occurs and how to resolve it.
The problem is that the grammar can only look one ahead. Therefore when you see a THIS then a ., are you in line 2(Expression: THIS '.' VARNAME) or line 3 (Expression: Expression '.' TYPENAME, via a reduction according to line 1).
The grammar could reduce THIS. to Expression. and then look for a TYPENAME or shift it to THIS. and look for a VARNAME, but it has to decide when it gets to the ..
I try to avoid y.output but sometimes it does help. I looked at the file it produced and saw.
state 1
2 Expression: THIS. [$end, '.']
3 | THIS . '.' VARNAME
'.' shift, and go to state 4
'.' [reduce using rule 2 (Expression)]
$default reduce using rule 2 (Expression)
Basically it is saying it sees '.' and can reduce or it can shift. Reduce makes me anrgu sometimes because they are hard to fine. The shift is rule 3 and is obvious (but the output doesnt mention the rule #). The reduce where it see's '.' in this case is the line
| Expression '.' TYPENAME
When it goes to Expression it looks at the next letter (the '.') and goes in. Now it sees THIS | so when it gets to the end of that statement it expects '.' when it leaves or an error. However it sees THIS '.' while its between this and '.' (hence the dot in the out file) and it CAN reduce a rule so there is a path conflict. I believe you can use %glr-parser to allow it to try both but the more conflicts you have the more likely you'll either get unexpected output or an ambiguity error. I had ambiguity errors in the past. They are annoying to deal with especially if you dont remember what rule caused or affected them. it is recommended to avoid conflicts.
I highly recommend this book before attempting to use bison.
I cant think of a 'great' solution but this gives no conflicts
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
| THIS //trick is moving this AWAY so it doesnt reduce
rval:
THIS '.' VARNAME
Alternative you can make it reduce later by adding more to the rule so it doesnt reduce as soon or by adding a token after or before to make it clear which path to take or fails (remember, it must know BEFORE reducing ANY path)
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
rval
| rval '.' TYPENAME
rval:
THIS '#'
| THIS '.' VARNAME
%%
-edit- note if i want to do func param and type varname i cant because type according to the lexer func is a Var (which is A-Za-z09_) as well as type. param and varname are both var's as well so this will cause me a reduce/reduce conflict. You cant write this as what they are, only what they look like. So keep that in mind when writing. You'll have to write a token to differentiate the two or write it as one of the two but write additional logic in code (the part that is in { } on the right side of the rules) to check if it is a funcname or a type and handle both those case.

Resources