How can I check whether the given expression is an infix expression, postfix expression or prefix expression?

I need an algorithm that will check whether a given expression is an infix, postfix or prefix expression.
I have tried a method that checks the first or last term of the string, e.g.:
+AB: if there is an operator at the very first index of the string, then it is a prefix expression
AB+: if there is an operator at the very last index of the string, then it is a postfix expression
otherwise it is an infix expression.
But it doesn't feel appropriate, so kindly suggest a better algorithm.

1. If it starts with a valid prefix operator it's prefix, unless you're going to allow unary operators.
2. If it ends with a valid postfix operator it's postfix.
3. Otherwise it is either infix or invalid.
Note that (3) includes the case you mentioned in comments of an expression in parentheses. There are no parentheses in prefix or postfix. That's why they exist. (3) also includes the degenerate case of a single term, e.g. 1, but in that case it doesn't matter how you parse it.
You can only detect an invalid expression by parsing it fully.
If you're going to allow unary operators in infix notation I can only suggest that you try all three parses and stop when you get a success. Very possibly this is the strategy you should follow anyway.
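As a rough illustration of rules (1)-(3), here is a small Python sketch (my own, not part of the answer above) that classifies a tokenized expression by its first and last tokens. The operator set is an assumption, and, as noted, invalid input can only be rejected by a full parse.

OPERATORS = {"+", "-", "*", "/"}   # assumed binary operator set

def classify(tokens):
    """Guess the notation of a token list using rules (1)-(3).

    A full parse is still needed to reject invalid input.
    """
    if not tokens:
        raise ValueError("empty expression")
    if tokens[0] in OPERATORS:      # rule (1): leading operator => prefix
        return "prefix"
    if tokens[-1] in OPERATORS:     # rule (2): trailing operator => postfix
        return "postfix"
    return "infix"                  # rule (3): infix (or invalid)

print(classify(list("+AB")))   # prefix
print(classify(list("AB+")))   # postfix
print(classify(list("A+B")))   # infix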

Check the first two elements of the string:
1- if the first element is an operator, then it is for sure a prefix expression
2- else, check the second element; if it is an operator, then it is for sure an infix expression
3- else, it is for sure a postfix expression

Related

Validating expressions in the parser

I am working on a SQL grammar where pretty much anything can be an expression, even places where you might not realize it. Here are a few examples:
-- using an expression on the list indexing
SELECT ([1,2,3])[(select 1) : (select 1 union select 1 limit 1)];
Of course this is an extreme example, but my point is that in many places in SQL you can use an arbitrarily nested expression (even where it would seem "oh, that is probably just going to allow a number or string constant").
Because of this, I currently have one long expression rule that may reference itself; the following is a pared-down example:
grammar DBParser;
options { caseInsensitive=true; }

statement: select_statement EOF;

select_statement
    : 'SELECT' expr
      'WHERE' expr      // The WHERE clause should only allow a BoolExpr
    ;

expr
    : expr '=' expr     # EqualsExpr
    | expr 'OR' expr    # BoolExpr
    | ATOM              # ConstExpr
    ;

ATOM: [0-9]+ | '\'' [a-z]+ '\'';
WHITESPACE: [ \t\r\n] -> skip;
A sample input is SELECT 1 WHERE 'abc' OR 1=2. However, one place where I do want to limit which expressions are allowed is the WHERE (and HAVING) clause, where the expression must be a boolean expression; in other words, WHERE 1=1 is valid but WHERE 'abc' is not. In practical terms this means the top node of the expression must be a BoolExpr. Is this something I should enforce in my parser rules, or should I do this validation downstream, for example in a semantic validation phase? Doing it downstream would probably be quite a bit simpler (even if the grammar rules stay a bit lax), because encoding it in the grammar would involve so much indirection, and probably indirect left recursion, that it would become incredibly convoluted. What would be a good approach here?
Your intuition is correct that breaking this out would probably create indirect left recursion. Also, is it possible that an IDENTIFIER could represent a boolean value?
This is the point of #user207421's comment. You can't fully capture types (i.e. whether an expression is boolean or not) in the parser.
The parser's job (in the Lexer & Parser sense), put fairly simply, is to convert your input stream of characters into a parse tree that you can work with. As long as it produces a parse tree that is the only possible way to interpret the input (whether or not it is semantically valid), it has served its purpose. Once you have a parse tree, then during semantic validation you can examine the expression attached to your WHERE clause and determine whether or not it has a boolean value (this may even require consulting a symbol table to determine the type of an identifier). Just as your semantic validation of an OR expression will need to determine that both the lhs and rhs are themselves boolean expressions.
Also consider that even if you could torture the parser into catching some of your type exceptions, the error messages you produce from semantic validation are almost guaranteed to be more useful than the generated syntax errors. The parser only catches syntax errors, and it should probably feel a bit "odd" to consider a non-boolean expression to be a "syntax error".
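To make the suggested semantic phase a bit more concrete, here is a minimal Python sketch. It does not use ANTLR's generated visitor API; the node classes below are hypothetical stand-ins for the labeled alternatives EqualsExpr, BoolExpr and ConstExpr, and the check simply requires the top node of a WHERE clause to be boolean.

from dataclasses import dataclass

# Hypothetical AST nodes standing in for the grammar's labeled alternatives.
@dataclass
class ConstExpr:
    value: str

@dataclass
class EqualsExpr:
    lhs: object
    rhs: object

@dataclass
class BoolExpr:          # an OR expression
    lhs: object
    rhs: object

BOOLEAN_NODES = (EqualsExpr, BoolExpr)

def validate_where(expr):
    """Semantic check: the top node of a WHERE clause must be boolean."""
    if not isinstance(expr, BOOLEAN_NODES):
        raise TypeError("WHERE clause must be a boolean expression")
    if isinstance(expr, BoolExpr):
        # The operands of OR must themselves be boolean expressions.
        validate_where(expr.lhs)
        validate_where(expr.rhs)

validate_where(EqualsExpr(ConstExpr("1"), ConstExpr("2")))    # WHERE 1=2: fine
try:
    validate_where(ConstExpr("'abc'"))                        # WHERE 'abc'
except TypeError as err:
    print(err)

A side benefit, as noted above, is that the error message ("WHERE clause must be a boolean expression") is far more useful than whatever syntax error a contorted grammar would produce.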

How is the conditional operator parsed?

So, the cppreference claims:
The expression in the middle of the conditional operator (between ? and :) is parsed as if parenthesized: its precedence relative to ?: is ignored.
However, it appears to me that the part of the expression after the ':' operator is also parsed as if it were between parentheses. I've tried to implement the ternary operator in my programming language (and you can see the results of parsing expressions here), and my parser pretends that the part of the expression after ':' is also parenthesized. For example, for the expression (1?1:0?2:0)-1, the interpreter for my programming language outputs 0, and this appears to be compatible with C. For instance, the C program:
#include <stdio.h>

int main() {
    printf("%d\n", (1?1:0?2:0)-1);
}
Outputs 0.
Had I programmed the parser of my programming language so that, when parsing the ternary operator, it simply took the first already-parsed node after ':' as the third operand of '?:', it would output the same as ((1?1:0)?2:0)-1, that is, 1.
My question is whether this would (pretending that the expression after the ':' is parenthesized) always be compatible with C?
"Pretends that it is parenthesised" is some kind of description of operator parenthesis. But of course that has to be interpreted relative to precedence relations (including associativity). So in a-b*c and a*b-c, the subtraction effectively acts as though its arguments are parenthesised, only the left-hand argument is treated that way in a-b-c and it is the comparison operator which causes grouping in a<b-c and a-b<c.
I'm sure you know all that since your parser seems to work for all these cases, but I say that because the ternary operator is right-associative and of lower precedence than any other operator [Note 1]. That means that the pseudo-parentheses imposed by operator precedence surround the right-hand argument (regardless of its dominating operator, since all operators have higher precedence), and also the left-hand argument unless its dominating operator is another conditional operator. But that wouldn't be the case in C, where the comma operator has lower precedence and would not be enclosed by the imaginary parentheses following the :.
It's important to understand what is meant by the precedence of a complex operator. In effect, to compute the precedence relations we first collapse the operator to a simple ?: which includes the enclosed (second) argument. This is not "as if the expression were parenthesized", because it is parenthesized. It is parenthesized between ? and :, which in this context are syntactically parenthetic.
In this sense, it is very similar to the usual analysis of the subscript operator as a postfix operator, although the brackets of the subscript operator enclose a second argument. The precedence of the subscript operator is logically what would result from considering it to be a single [], abstracting away the expression contained inside. This is also the same as the function call operator. That happens to be written with parentheses, but the precise symbols are not important: it is possible to imagine an alternative language in which function calls are written with different symbols, perhaps { and }. That wouldn't affect the grammar at all.
It might seem odd to think of ? and : to be "parenthetic", since they don't look parenthetic. But a parser doesn't see the shapes of the symbols. It is satisfied by being told that a ( is closed by a ) and, in this case, that a ? is closed by a :. [Note 2]
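To illustrate that grammar shape, here is a minimal recursive-descent sketch in Python (my own simplification: only numbers, '+' and '-', and it ignores C's comma and assignment operators from Note 1). The middle operand is parsed as a full expression, while the right operand is parsed as another conditional, giving the right-associative grouping described above.

import re

def tokenize(src):
    return re.findall(r"\d+|[()+\-?:]", src)

class Parser:
    """Tiny C-like expression parser: numbers, '+', '-', and '?:'.

    conditional := additive ('?' expression ':' conditional)?
    The middle operand is a full expression (as if parenthesized);
    the right operand is another conditional, hence right associativity.
    """
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, tok=None):
        t = self.toks[self.pos]
        assert tok is None or t == tok, f"expected {tok!r}, got {t!r}"
        self.pos += 1
        return t

    def expression(self):
        return self.conditional()

    def conditional(self):
        cond = self.additive()
        if self.peek() == "?":
            self.eat("?")
            middle = self.expression()    # parsed as if parenthesized
            self.eat(":")
            right = self.conditional()    # right-associative arm
            return ("?:", cond, middle, right)
        return cond

    def additive(self):
        node = self.primary()
        while self.peek() in ("+", "-"):
            op = self.eat()
            node = (op, node, self.primary())
        return node

    def primary(self):
        if self.peek() == "(":
            self.eat("(")
            node = self.expression()
            self.eat(")")
            return node
        return int(self.eat())

def evaluate(node):
    if isinstance(node, int):
        return node
    op = node[0]
    if op == "?:":
        _, cond, a, b = node
        return evaluate(a) if evaluate(cond) else evaluate(b)
    a, b = (evaluate(x) for x in node[1:])
    return a + b if op == "+" else a - b

ast = Parser(tokenize("(1?1:0?2:0)-1")).expression()
print(ast)            # ('-', ('?:', 1, 1, ('?:', 0, 2, 0)), 1)
print(evaluate(ast))  # 0, matching the C program above

Note that evaluate only evaluates one of the two arms once the condition is known, which also matters for expressions like the d = 0 ? 0 : n / d example discussed below.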
Having said all that, I tried your compiler on the conditional expression
d = 0 ? 0 : n / d
It parses this expression correctly, but the compiled code computes n / d before verifying whether d = 0 is true. That's not the way the conditional operator should work; in this case, it will lead to an unexpected divide by 0 exception. The conditional operator must first evaluate its left-hand argument, and then evaluate exactly one of the other two expressions.
Notes:
1. In C, this is not quite correct. The comma operator has lower precedence, and there is a more complex interaction with assignment operators, which logically have the same precedence and are also right-associative.
2. In C-like languages those symbols are not used for any other purpose, so it's OK to just regard them as strange-looking parentheses and leave it at that. But as the case of the function-call operator shows (or, for that matter, the unary - operator), it is sometimes possible to reuse operator symbols for more than one purpose. As a curiosity, it is not strictly necessary that open and close parentheses be different symbols, as long as they are not used for any other purpose. So, for example, if | were not used as an operator symbol (as it is in C), you could use | a | to mean the absolute value of a without creating any ambiguities. A precise analysis of the circumstances in which symbol reuse leads to actual ambiguities is beyond the scope of this answer.

Error: Unexpected infix operator in expression, about a successfully compiled prefix operator

Playing around a little bit with infix operators, I was surprised about the following:
let (>~~~) = function null -> String.Empty | s -> s // compiles fine, see screenshot
match >~~~ input with .... // error: Unexpected infix operator in expression
Changing the first characters of the prefix operator (to !~~~ for instance) fixes it. It is rather weird that the error complains about an unexpected infix operator. Hovering shows the definition to be string -> string.
I'm not too surprised about the error; F# requires (iirc) that the first character of a prefix operator must itself be one of the predefined prefix operators. But why does it compile just fine, and only complain when I use it?
Update: in other cases the F# compiler seems to know just fine when I use an invalid character in my operator definition; it says "Invalid operator definition. Prefix operator definitions must use a valid prefix operator name."
The rules for custom operators in F# are quite tight - so even though you can define custom operators, there are a lot of rules about how they behave, and you cannot change those. In particular:
Only some operators (mainly those with ! and ~) can be used as prefix operators. With ~ you can also overload unary operators +, -, ~ and ~~, so if you define an operator named ~+., you can then use it as e.g. +. 42.
Other operators (including those starting with >) can only be used as infix. You can turn any operator into ordinary function using parentheses, which is why e.g. (+) 1 2 is valid.
The ? symbol is special (it is used for dynamic invocation) and cannot appear as the first symbol of a custom operator.
I think the most intuitive way of thinking about this is that custom operators will behave like standard F# operators, but you can add additional symbols after the standard operator name.

YACC: Stop parsing specific path

I'm using Python PLY to parse a specific language. For a grammar like:
IF LPAREN condition RPAREN LBRACE stmtlist RBRACE ELSE LBRACE stmtlist RBRACE
When I know the condition value, say True, then is there a way to stop parsing the stmtlist in the ELSE path?
Thanks,
You will have to continue the parse, because you need to find the end of the block enclosed by the second RBRACE; in other words, you need to parse to find the beginning of the next statement.
That said, when you analyze the results of the parse (to generate code, construct an AST, whatever you need to do), if you can determine that condition always evaluates to true (perhaps it is the expression 1 = 1), then you can suppress generating code for the second stmtlist.
Update:
Your grammar (the syntax of your language) is specified non-procedurally, so there's no place for you to attach conditional logic.
On the other hand, you specify semantic actions to take when a particular syntactic element of your grammar is matched, and you do this procedurally. In PLY, you do this by coding the body of grammar rule function(s). In the grammar rule function that matches the second stmtlist, you can write conditional code to skip over code generation, based on other information you have figured out about the input program (the input to your compiled language processor).
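As a sketch of that second point, this is roughly the shape such a PLY rule function could take. The lexer, the remaining grammar rules, and the real is_constant_true/gen_code helpers are assumed to exist elsewhere; the trivial stand-ins below are only there so the fragment is self-contained.

# Trivial stand-ins; a real compiler would have proper versions of these
# (hypothetical helpers, not part of PLY).
def is_constant_true(node):
    return node == ("const", True)

def gen_code(node):
    return node

def p_if_statement(p):
    """statement : IF LPAREN condition RPAREN LBRACE stmtlist RBRACE ELSE LBRACE stmtlist RBRACE"""
    condition, then_block, else_block = p[3], p[6], p[10]
    if is_constant_true(condition):
        # The parser still has to match the second stmtlist to find its
        # closing RBRACE, but we can skip generating code for it here.
        p[0] = gen_code(then_block)
    else:
        p[0] = ("if", gen_code(condition),
                gen_code(then_block), gen_code(else_block))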

Handling extra operators in Shunting-yard

Given an input like this: 3+4+
the algorithm turns it into: 3 4 + +
I can find the error when it's time to execute the postfix expression.
But is it possible to spot this during the conversion?
(The Wikipedia article and the other articles I've read do not handle this case.)
Thank you
Expressions can be validated with a regular expression, aside from parenthesis mismatching. (Mismatched parentheses will be caught by the shunting-yard algorithm as indicated in the Wikipedia page, so I'm ignoring those.)
The regular expression is as follows:
PRE* OP POST* (INF PRE* OP POST*)*
where:
PRE is a prefix operator or (
POST is a postfix operator or )
INF is an infix operator
OP is an operand (a literal or a variable name)
The Wikipedia page you linked includes function calls but not array subscripting; if you want to handle these cases, then:
PRE is a prefix operator or (
POST is a postfix operator or ) or ]
INF is an infix operator or ( or ,
OP is an operand, including function names
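As a quick way to prototype the pattern (my own illustration, not part of the shunting-yard algorithm), you can map each token to a single character for its class and let Python's re module do the matching. A fixed per-token table like this only works when each symbol belongs to exactly one class; symbols that can be both prefix and infix, such as -, need the context-sensitive handling discussed next.

import re

# Map each token to its class letter: P = prefix or '(', O = operand,
# T = postfix or ')', I = infix.  The token sets here are assumptions,
# and '-' is arbitrarily treated as infix-only in this fixed table.
def classify(token):
    if token in {"(", "!", "~"}:
        return "P"
    if token in {")", "++", "--"}:
        return "T"
    if token in {"+", "-", "*", "/", "<"}:
        return "I"
    return "O"                     # literal or variable name

PATTERN = re.compile(r"P*OT*(IP*OT*)*")

def looks_valid(tokens):
    """Check the token sequence, ignoring parenthesis matching."""
    return PATTERN.fullmatch("".join(map(classify, tokens))) is not None

print(looks_valid(["3", "+", "4"]))              # True
print(looks_valid(["3", "+", "4", "+"]))         # False: trailing '+'
print(looks_valid(["(", "3", "+", "4", ")"]))    # True (nesting not checked)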
One thing to note in the above is that PRE and INF are in disjoint contexts; that is, if PRE is valid, then INF is not, and vice versa. This implies that using the same symbol for both a prefix operator and an infix operator is unambiguous. Also, PRE and POST are in disjoint contexts, so you can use the same symbol for prefix and postfix operators. However, you cannot use the same symbol for postfix and infix operators. As examples, consider the C/C++ operators:
- prefix or infix
+ prefix or infix
-- prefix or postfix
++ prefix or postfix
I implicitly used this feature above to allow ( to be used both as an expression grouper (effectively prefix) and as a function call (infix, because it comes between the function name and the argument list; the operator is "call").
It's most common to implement that regular expression as a state machine, because there are only two states:
+-----+            +-----+
|State|-----OP---->|State|
|  1  |<----INF----|  2  |
|     |---+        |     |---+
+-----+   |        +-----+   |
   ^     PRE          ^     POST
   |      |           |      |
   +------+           +------+
We could call State 1 "want operand" and State 2 "have operand". A simple implementation would just split the shunting-yard algorithm as presented in Wikipedia into two loops, something like this (if you don't like goto, it can be eliminated, but it really is the clearest way to present a state machine).
want_operand:
  read a token. If there are no more tokens, announce an error.
  if the token is a prefix operator or an '(':
    mark it as prefix and push it onto the operator stack
    goto want_operand
  if the token is an operand (identifier or variable):
    add it to the output queue
    goto have_operand
  if the token is anything else, announce an error and stop. (**)

have_operand:
  read a token
  if there are no more tokens:
    pop all operators off the stack, adding each one to the output queue.
    if a '(' is found on the stack, announce an error and stop.
  if the token is a postfix operator:
    mark it as postfix and add it to the output queue
    goto have_operand
  if the token is a ')':
    while the top of the stack is not '(':
      pop an operator off the stack and add it to the output queue
      if the stack becomes empty, announce an error and stop.
    if the '(' is marked infix, add a "call" operator to the output queue (*)
    pop the '(' off the top of the stack
    goto have_operand
  if the token is a ',':
    while the top of the stack is not '(':
      pop an operator off the stack and add it to the output queue
      if the stack becomes empty, announce an error
    goto want_operand
  if the token is an infix operator:
    (see the Wikipedia entry for precedence handling)
    (normally, all prefix operators are considered to have higher precedence
     than infix operators.)
    goto want_operand
  otherwise, the token is an operand; announce an error
(*) The procedure as described above does not deal gracefully with parameter lists; when the postfix expression is being evaluated and a "call" operator is found, it's not clear how many arguments need to be evaluated. It might be that function names are clearly identifiable, so that the evaluator can just attach arguments to the "call" until a function name is found. But a cleaner solution is for the "," handler to increment the argument count of the "call" operator (that is, the open parenthesis marked as "infix"). The ")" handler also needs to increment the argument count.

(**) The state machine as presented does not correctly handle function calls with 0 arguments. This call will show up as a ")" read in the want_operand state with a "call" operator on top of the stack. If the "call" operator is marked with an argument count, as above, then the argument count must be 0 when the ")" is read, and in this case, unlike the have_operand case, the argument count must not be incremented.
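Finally, here is a small executable Python sketch of just the two-state check (validation only: no output queue, no precedence handling, and none of the function-call extensions above). It answers the original question by rejecting 3+4+ while the tokens are still being scanned. The token sets are assumptions for the demo.

# '-' appears as both prefix and infix, and '++'/'--' as both prefix and
# postfix; the two states keep the uses apart, as discussed above.
PREFIX = {"-", "!", "++", "--"}
POSTFIX = {"++", "--"}
INFIX = {"+", "-", "*", "/", "<"}

def validate(tokens):
    """Two-state scan: True if the token sequence is well formed."""
    depth = 0                  # open parentheses (matching checked separately)
    want_operand = True        # State 1 in the diagram above
    for tok in tokens:
        if want_operand:
            if tok == "(":
                depth += 1
            elif tok in PREFIX:
                pass                       # stay in "want operand"
            elif tok in INFIX or tok in POSTFIX or tok == ")":
                return False               # operator where an operand is needed
            else:
                want_operand = False       # operand read: State 2
        else:
            if tok == ")":
                if depth == 0:
                    return False
                depth -= 1                 # stay in "have operand"
            elif tok in POSTFIX:
                pass                       # stay in "have operand"
            elif tok in INFIX:
                want_operand = True
            else:
                return False               # two operands in a row, or stray '('
    return not want_operand and depth == 0

print(validate(list("3+4")))                               # True
print(validate(list("3+4+")))                              # False: ends wanting an operand
print(validate(["-", "3", "*", "(", "4", "+", "5", ")"]))  # True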
