Why do we have to reverse the string when converting from infix to prefix - stack

In the first step itself of converting an infix to prefix can someone explain in simple terms why should we reverse the string? Is there any alternative method to convert?

Yes, you are absolutely right that if you have to convert infix to prefix then you have to scan the string from right to left.
Why not from left to right?
If you scan from left to right then you will require future knowledge of operators in the string.
Example 1 :
Infix : 2+3
Prefix : +23
Now, to produce the prefix form while reading from left to right, you would already need to know about the + that has yet to appear in the string. This looks simple in this particular example, so consider the one given below.
Example 2:
Infix : 2+3*4/5-6*7
Prefix : -+2/*345*67
Now, if you scan from left to right, then at index 0 of the string the program would already need to know about the - that only appears at index 7, which is awkward to arrange.
So the safest approach is to scan the string from right to left.
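As a rough illustration of the right-to-left scan, here is a minimal sketch. The names infix_to_prefix and PRECEDENCE are made up for this example, and it assumes single-character operands and only the left-associative operators + - * /:

```python
# Hypothetical helper for illustration; assumes single-character operands.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def infix_to_prefix(expr):
    output = []   # collected in reverse order, flipped at the end
    ops = []      # operator stack
    for ch in reversed(expr.replace(" ", "")):
        if ch.isalnum():
            output.append(ch)
        elif ch == ")":
            # Scanning right to left, ')' opens a group.
            ops.append(ch)
        elif ch == "(":
            # '(' closes the group: flush operators back to the matching ')'.
            while ops[-1] != ")":
                output.append(ops.pop())
            ops.pop()
        else:
            # Pop only strictly higher precedence so that operators of equal
            # precedence stay left-associative.
            while ops and ops[-1] != ")" and PRECEDENCE[ops[-1]] > PRECEDENCE[ch]:
                output.append(ops.pop())
            ops.append(ch)
    while ops:
        output.append(ops.pop())
    return "".join(reversed(output))

print(infix_to_prefix("2+3"))          # +23
print(infix_to_prefix("2+3*4/5-6*7"))  # -+2/*345*67
```

Because the scan starts at the right end, every operator is seen after its right operand and before its left one, so nothing about the "future" of the string has to be guessed.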

In the first step itself of converting an infix to prefix can someone explain in simple terms why should we reverse the string?
Without stating which algorithm you're referring to you leave us to guessing, but a simple guess would be:
1. Read the prefix expression in reverse order (from right to left).
2. If the symbol is an operand, push it onto the stack. Otherwise, if the symbol is an operator, pop two operands from the stack and create a string by concatenating the two operands with the operator between them: string = (operand1 + operator + operand2)
3. Push the resulting string back onto the stack.
4. Repeat the above steps until the end of the prefix expression.
5. At the end the stack will hold only one string, i.e. the resulting string.
Note that in step 2 operand1 is the first to be popped.
Why you can't read the string left-to-right is rather obvious in this algorithm: the first thing you'd read is an operator, and you wouldn't have any operands to pop.
The algorithm works because reversing kind of transforms the expression to postfix, and it then uses the algorithm for converting postfix to infix. "Kind of" because reversing doesn't only put the operator after the operands, it also reverses the order of the operands (which is why in step 2 you consider the top operand to be the left operand).
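The stack-based steps described above could be sketched like this. It is only an illustration: the name prefix_to_infix and the OPERATORS set are assumptions of this example, and tokens are taken to be single characters:

```python
# Hypothetical sketch; assumes single-character tokens.
OPERATORS = set("+-*/")

def prefix_to_infix(prefix):
    stack = []
    for ch in reversed(prefix):      # read right to left
        if ch in OPERATORS:
            left = stack.pop()       # first popped is the left operand
            right = stack.pop()
            stack.append("(" + left + ch + right + ")")
        else:
            stack.append(ch)
    return stack.pop()

print(prefix_to_infix("+23"))          # (2+3)
print(prefix_to_infix("-+2/*345*67"))  # ((2+((3*4)/5))-(6*7))
```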
Is there any alternative method to convert?
Yes, as the above algorithm is basically to abstractly evaluate a postfix expression and express it as infix. You can do the same thing directly with a prefix expression.
1. Output an open parenthesis
2. Read the first token; if it's an operand, output it and we're done. Otherwise:
3. Convert the input prefix expression (recursively using this algorithm)
4. Output the token read in 2
5. Convert the input prefix expression (recursively using this algorithm)
6. Output a close parenthesis
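A possible recursive rendering of this left-to-right idea is sketched below. The function name is hypothetical, tokens are assumed to be single characters, and this sketch emits parentheses only around operator applications so that a bare operand is not wrapped:

```python
# Hypothetical sketch; assumes single-character tokens.
OPERATORS = set("+-*/")

def prefix_to_infix_recursive(prefix):
    tokens = iter(prefix)

    def convert():
        token = next(tokens)
        if token not in OPERATORS:
            return token             # an operand stands on its own
        left = convert()             # first sub-expression
        right = convert()            # second sub-expression
        return "(" + left + token + right + ")"

    return convert()

print(prefix_to_infix_recursive("-+2/*345*67"))  # ((2+((3*4)/5))-(6*7))
```

No reversal is needed here because the recursion itself remembers the operator while the two operands are being converted.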

How to parse dot operator in language syntax?

Let's say I'm writing a parser that parses the following syntax:
foo.bar().baz = 5;
The grammar rules look something like this:
program: one or more statement
statement: expression followed by ";"
expression: one of:
- identifier (\w+)
- number (\d+)
- func call: expression "(" ")"
- dot operator: expression "." identifier
Two expressions have a problem, the func call and the dot operator. This is because the expressions are recursive and look for another expression at the start, causing a stack overflow. I will focus on the dot operator for this question.
We face a similar problem with the plus operator. However, rather than using an expression you would do something like this to solve it (look for a "term" instead):
add operation: term "+" term
term: one of:
- number (\d+)
- "(" expression ")"
The term then includes everything except the add operation itself. To ensure that multiple plus operators can be chained together without using parentheses, one would rather do:
add operation: term, one or more of ("+" followed by term)
I was thinking a similar solution could work for the dot operator or for function calls.
However, the dot operator works a little differently. We always evaluate from left-to-right and need to allow full expressions so that you can do function calls etc. in-between. With parentheses, an example might be:
(foo.bar()).baz = 5;
Unfortunately, I do not want to require parentheses. This would end up being the case if following the method used for the plus operator.
How could I go about implementing this?
Currently my parser never peeks ahead, but even if I do look ahead, it still seems tricky to accomplish.
The easy solution would be to use a bottom-up parser which doesn't drop into a bottomless pit on left recursion, but I suppose you have already rejected that solution.
I don't understand your objection to using a looping construct, though. Postfix modifiers like field lookup and function call are not really different from binary operators like addition (except, of course, for the fact that they will not need to claim an eventual right operand). Plus and minus intermingle freely, which you can parse with a repetition like:
additive: term ( '+' term | '-' term )*
Similarly, postfix modifiers can be easily parsed with something like:
postfixed: atom ( '.' ID | '(' opt-expr-list ')' )*
I'm using a form of extended BNF: parentheses group; | separates alternatives and binds less strongly than concatenation; and * means "zero or more repetitions" of the atom on its left.
Another postfix operator which falls into the same category is array/map subscripting ('[' expr ']'), although you might also have other postfix operators.
Note that like the additive syntax above, selecting the appropriate alternative does not require looking beyond the next token. It's hard to parse without being able to peek one token into the future. Fortunately, that's very little overhead.
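To make the loop concrete, here is a minimal sketch of a hand-written parser for the postfixed rule. Everything in it is an assumption made for illustration: the tokenize and parse_postfixed names, the tuple-shaped tree nodes, and the restriction to empty argument lists as in the question's grammar:

```python
import re

def tokenize(text):
    # Identifiers/numbers, or one of the punctuation tokens . ( ) = ;
    return re.findall(r"\w+|[().=;]", text)

def parse_postfixed(tokens):
    """postfixed: atom ( '.' ID | '(' ')' )*   (empty argument lists only)"""
    pos = 0
    node = tokens[pos]               # atom: an identifier or a number
    pos += 1
    while pos < len(tokens):
        # Deciding which alternative applies needs only the next token.
        if tokens[pos] == "." and pos + 1 < len(tokens):
            node = ("dot", node, tokens[pos + 1])
            pos += 2
        elif tokens[pos:pos + 2] == ["(", ")"]:
            node = ("call", node)
            pos += 2
        else:
            break                    # not a postfix modifier; stop looping
    return node, pos

tree, consumed = parse_postfixed(tokenize("foo.bar().baz = 5;"))
print(tree)  # ('dot', ('call', ('dot', 'foo', 'bar')), 'baz')
```

Each iteration wraps the tree built so far, which gives the left-to-right nesting of (foo.bar()).baz without any left-recursive call.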
One way could be for the dot operator to parse a non-dot expression, that is, a rule that is the same as expression but without the dot operator. This prevents recursion.
Then, when the non-dot expression has been parsed, check if a dot and an identifier follows. If this is not the case, we are done. If this is the case, wrap the current node up in a dot operation node. Then, keep track of the entire string text that has been parsed for this operation so far. Then revert everything back to before the operation was being parsed, and now re-parse a "custom expression", where the first directly-nested expression would really be trying to match the exact string that was parsed before rather than a real expression. Repeat until there are no more dot-identifier pairs (this should happen automatically by the new "custom expression").
This is messy, complicated and possibly slow, and I'm not entirely sure if it'll work but I'll try it out. I'd appreciate alternative solutions.

ANTLR Tries to Match an Expression That Wasn't Specified as an Option

I'm trying to understand how ANTLR grammars work and I've come across a situation where it behaves unexpectedly and I can't explain why or figure out how to fix it.
Here's the example:
root : title '\n' fields EOF;
title : STR;
fields : field_1 field_2;
field_1 : 'a' | 'b' | 'c';
field_2 : 'd' | 'e' | 'f';
STR : [a-z]+;
There are two parts:
- A title that is a lowercase string with no special characters
- A two-character string representing a set of possible configurations
When I go to test the grammar, here's what happens: first I write the title and, on a new line, give the character for the first field. So far so good. The parse tree looks as I would expect up to this point.
When I add the next field is when the problem comes up. ANTLR decides to reinterpret the line as an instance of STR instead of a concatenation of the fields that I was expecting.
I do not understand why ANTLR tries to force an unrelated terminal expression when it wasn't specified as an option by the grammar. Shouldn't it know to only look for characters matching the field rules since it is descended from the fields node in the parse tree? What's going on here and how do I write my ANTLR grammars so they don't have this problem?
I've read that ANTLR tries to match the format greedily from the top of the grammar to the bottom, but this doesn't explain why this is happening because the STR terminal is the very last line in the file. If ANTLR gives special precedence to matching terminals, how do I format the grammar so that it interprets it properly? As far as I understand, regexes do not work for non-terminals, so it seems that I have to define it the way it is now.
A note of clarification: this is just an example of a possible grammar that I'm trying to make work with the text format as is, so I'm not looking for answers like adding a space between the fields or changing the title to be uppercase.
What I didn't understand before is that there are two steps in generating a parser:
1. Scanning the input into a list of tokens using the lexer rules (uppercase names), and then...
2. Constructing a parse tree using the parser rules (lowercase names) and the generated tokens.
My problem was that there was no way for ANTLR to know I wanted that specific string to be interpreted differently when it was generating the tokens. To fix this problem, I wrote a new lexer rule for the fields string so that it would be identifiable as a token. The key was making the FIELDS rule appear before the STR rule: both rules match the same two characters, and when several lexer rules match the same longest stretch of input, ANTLR picks the one defined first.
root : title FIELDS EOF;
title : STR;
FIELDS : [a-c] [d-f];
STR : [a-z]+;
Note: I had to bite the bullet and read the ANTLR Mega Tutorial to figure this out.
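The effect of rule order can be imitated with a toy lexer. This is not ANTLR code, just a sketch of the "longest match, then first rule" behaviour; the lex function, the NL token for the newline, and the use of Python regexes for the rules are all assumptions of this example:

```python
import re

def lex(text, rules):
    """Toy lexer: longest match wins, ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] == "\n":
            tokens.append(("NL", "\n"))
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer only, so an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError("no rule matches at position %d" % pos)
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

only_str = [("STR", r"[a-z]+")]
with_fields = [("FIELDS", r"[a-c][d-f]"), ("STR", r"[a-z]+")]

print(lex("title\nad", only_str))
# [('STR', 'title'), ('NL', '\n'), ('STR', 'ad')]
print(lex("title\nad", with_fields))
# [('STR', 'title'), ('NL', '\n'), ('FIELDS', 'ad')]
```

Note that the lexer never consults the parser rules: the token types are fixed before any parse tree is built, which is why the original grammar saw STR where fields were expected.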

Why do we need both a lookahead symbol and a read-ahead symbol in a compiler?

I was reading about some common parsing concepts in compilers and came across the lookahead symbol and the read-ahead symbol. I searched and read about them, but I am stuck on why we need both of them. I would be grateful for any suggestions.
Lookahead symbol: when the node being considered in the parse tree is for a terminal, and the terminal matches the lookahead symbol, then we advance in both the parse and the input.
Read-ahead symbol: the lexical analyzer may need to read some characters before it can decide on the token to be returned.
One of these is about parsing and refers to the next token to be produced by the lexical scanner. The other one, which is less formal, is about lexical analysis and refers to the next character in the input stream. It should be clear which is which.
Note that while most parsers only require a single lookahead token, it is not uncommon for lexical analysis to have to backtrack, which is equivalent to examining several unconsumed input characters.
I hope I got your question right.
Consider C.
It has several punctuators that begin the same way:
+, ++, +=
-, --, -=, ->
<, <=, <<, <<=
...
In order to figure out which one it is when you see the first + or - or <, you need to look ahead one character in the input (and then maybe one more for <<=).
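A small sketch of that character-level read-ahead, limited to the punctuators listed above. The name read_punctuator and the PUNCTUATORS set are invented for this example:

```python
# Only the punctuators from the list above; a real C lexer has many more.
PUNCTUATORS = {"+", "++", "+=", "-", "--", "-=", "->", "<", "<=", "<<", "<<="}

def read_punctuator(text, pos):
    # Try the longest candidate first: this is the "read ahead" by up to
    # two extra characters before deciding which token to return.
    for length in (3, 2, 1):
        candidate = text[pos:pos + length]
        if candidate in PUNCTUATORS:
            return candidate
    return None

print(read_punctuator("a<<=b", 1))  # <<=
print(read_punctuator("a<<b", 1))   # <<
print(read_punctuator("a<b", 1))    # <
```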
A similar thing can happen at a higher level:
{
ident1 ident2;
ident3;
ident4:;
}
Here ident1, ident3 and ident4 can begin a declaration, an expression or a label. You can't tell which one immediately. You can consult your existing declarations to see if ident1 or ident3 is already known (as a type or variable/function/enumeration), but it's still ambiguous because a colon may follow and if it does, it's a label because it's permitted to use the same identifier for both a label and a type/variable/function/enumeration (those two name spaces do not intersect), e.g.:
{
typedef int ident1;
ident1 ident2; // same as int ident2
int ident3 = 0;
ident3; // unused expression of value 0
ident1:; // unused label
ident2:; // unused label
ident3:; // unused label
}
So, you may very well need to look ahead a character or a token (or "unread" one) to deal with situations like these.
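At the token level, the decision for the block above could be sketched roughly as follows. This is a deliberately simplified illustration, not how any real C compiler is organised: classify_statement is a made-up name, type_names stands in for the symbol table, and each statement is assumed to arrive as a list of at least two tokens:

```python
def classify_statement(tokens, type_names):
    """Classify a statement that starts with an identifier.

    Peeks at the token after the identifier before deciding.
    """
    first, second = tokens[0], tokens[1]
    if second == ":":
        # Labels live in their own name space, so this wins even if
        # `first` is also a known type or variable.
        return "label"
    if first in type_names:
        return "declaration"
    return "expression"

types = {"ident1"}  # as if `typedef int ident1;` had been seen
print(classify_statement(["ident1", "ident2", ";"], types))  # declaration
print(classify_statement(["ident3", ";"], types))            # expression
print(classify_statement(["ident1", ":", ";"], types))       # label
```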

How does groovy distinguish division from strings?

Groovy supports / as a division operator:
groovy> 1 / 2
===> 0.5
It supports / as a string delimiter, which can even be multiline:
groovy> x = /foo/
===> foo
groovy:000> x = /foo
groovy:001> bar/
===> foo
bar
Given this, why can't I evaluate a slashy-string literal in groovysh?
groovy:000> /foo/
groovy:001>
Clearly groovysh thinks this is unterminated for some reason.
How does groovy avoid getting confused between division and strings? What does this code mean:
groovy> f / 2
Is this a function call f(/2 .../) where / is beginning a multiline slashy-string, or f divided by 2?
How does Groovy distinguish division from strings?
I'm not entirely sure how Groovy does it, but I'll describe how I'd do it, and I'd be very surprised if Groovy didn't work in a similar way.
Most parsing algorithms I've heard of (Shunting-yard, Pratt, etc) recognize two distinct kinds of tokens:
- Those that expect to be preceded by an expression (infix operators, postfix operators, closing parentheses, etc). If one of these is not preceded by an expression, it's a syntax error.
- Those that do not expect to be preceded by an expression (prefix operators, opening parentheses, identifiers, literals, etc). If one of these is preceded by an expression, it's a syntax error.
To make things easier, from this point onward I'm going to refer to the former kind of token as an operator and the latter as a non-operator.
Now, the interesting thing about this distinction is that it's made not based on what the token actually is, but rather on the immediate context, particularly the preceding tokens. Because of this, the same token can be interpreted very differently depending on its position in the code, and whether the parser classifies it as an operator or a non-operator. For example, the '-' token, if in an operator position, denotes a subtraction, but the same token in a non-operator position is a negation. There is no issue deciding whether a '-' is a subtraction operator or not, because you can tell based on its context.
The same is, in general, true for the '/' character in Groovy. If preceded by an expression, it's interpreted as an operator, which means it's a division. Otherwise, it's a non-operator, which makes it a string literal. So, you can generally tell if a '/' is a division or not, by looking at the token that immediately precedes it:
- The '/' is a division if it follows an identifier, literal, postfix operator, closing parenthesis, or other token that denotes the end of an expression.
- The '/' begins a string if it follows a prefix operator, infix operator, opening parenthesis, or other such token, or if it begins a line.
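The rule of thumb above can be sketched as a toy tokenizer. This is only my illustration of the idea, not Groovy's actual lexer; the tokenize function, the token names, and the single prev_ends_expr flag are all assumptions:

```python
def tokenize(src):
    tokens = []
    pos = 0
    prev_ends_expr = False   # does the previous token end an expression?
    while pos < len(src):
        ch = src[pos]
        if ch.isspace():
            pos += 1
        elif ch == "/":
            if prev_ends_expr:
                # Operator position: this slash is a division.
                tokens.append(("DIV", "/"))
                pos += 1
                prev_ends_expr = False
            else:
                # Non-operator position: this slash opens a string.
                end = src.index("/", pos + 1)  # ValueError if unterminated
                tokens.append(("STRING", src[pos + 1:end]))
                pos = end + 1
                prev_ends_expr = True
        elif ch.isalnum() or ch == "_":
            start = pos
            while pos < len(src) and (src[pos].isalnum() or src[pos] == "_"):
                pos += 1
            tokens.append(("ATOM", src[start:pos]))
            prev_ends_expr = True
        elif ch == ")":
            tokens.append(("RPAREN", ch))
            pos += 1
            prev_ends_expr = True
        else:
            tokens.append(("OP", ch))
            pos += 1
            prev_ends_expr = False
    return tokens

print(tokenize("1 / 2"))      # [('ATOM', '1'), ('DIV', '/'), ('ATOM', '2')]
print(tokenize("x = /foo/"))  # [('ATOM', 'x'), ('OP', '='), ('STRING', 'foo')]
```

Under this simple rule f / 2 comes out as a division, since f ends an expression.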
Of course, it isn't quite so simple in practice. Groovy is designed to be flexible in the face of various styles and uses, and therefore things like semicolons or parentheses are often optional. This can make parsing somewhat ambiguous at times. For example, say our parser comes across the following line:
println / foo
This is most likely an attempt to print a multiline string: foo is the beginning of a string being passed to println as an argument, and the optional parentheses around the argument list are left out. Of course, to a simple parser it looks like a division. I expect the Groovy parser can tell the difference by reading ahead to the following lines to see which interpretation does not give an error, but for something like groovysh that is literally impossible (since, as a repl, it doesn't yet have access to more lines), so it's forced to just guess.
Why can't I evaluate a slashy-string literal in groovysh?
As before, I don't know the exact reason, but I do know that because groovysh is a repl, it's bound to have more trouble with the more ambiguous rules. Even so, a simple single-line slashy-string is pretty unambiguous, so I believe something else may be going on here. Here is the result of me playing with various forms in groovysh:
> /foo - unexpected char: '/' # line 2, column 1.
> /foo/ - awaits further input
> /foo/bar - unexpected char: '/' # line 2, column 1.
> /foo/bar/ - awaits further input
> /foo/ + 'bar' - unexpected char: '/' # line 2, column 1.
> 'foo' + /bar/ - evaluates to 'foobar'
>  /foo/ - evaluates to 'foo'
>  /foo - awaits further input
>  /foo/bar - Unknown property: bar
It appears that something strange happens when a '/' character is the first character in a line. The pattern it appears to follow (as far as I can tell) is this:
- A slash as the first character of a line begins a strange parsing mode.
- In this mode, every line that ends with a slash followed by nothing but whitespace causes the repl to await further lines.
- On the first line that ends with something other than a slash (or whitespace following a slash), the error unexpected char: '/' # line 2, column 1. is printed.
I've also noticed a couple of interesting points regarding this:
- Both forward slashes (/) and backslashes (\) appear to count, and seem to be completely interchangeable, in this special mode.
- This does not appear to happen at all in groovyConsole or in actual Groovy files.
- Putting any whitespace before the opening slash character causes groovysh to interpret it correctly, but only if the opening slash is a forward slash, not a backslash.
So, I personally expect that this is just a quirk of groovysh, either a bug or some under-documented feature I haven't heard about.

(F) Lex, how do I match negation?

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means: match anything that is not one of the rules inside the parentheses. Now, I know in flex I can negate character classes (e.g. [^ab]), but some of the rules I want to negate could be more complicated than a single character, so I don't think I could use character classes for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure how to do that in flex.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that:
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
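As a quick sanity check, the pattern can be tried on the example input with Python's re module. This is only a sketch: string_body is a made-up helper, and Python's regex engine is backtracking rather than longest-match like flex, although for this particular pattern the alternatives never overlap, so the greedy match is also the longest one:

```python
import re

# Same pattern as above; the backslashes before " are not needed in Python.
BODY = re.compile(r'([^"]|"([^"]|"[^"]))*')

def string_body(text):
    """Return the contents of a leading triple-quoted string, or None."""
    if not text.startswith('"""'):
        return None
    end = BODY.match(text, 3).end()     # scan the body after the opening """
    if text[end:end + 3] != '"""':
        return None                     # no proper terminator follows
    return text[3:end]

print(string_body('"""This string might contain " or ""."""'))
# This string might contain " or "".
```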
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for is
and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))*
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E in state 1:
(E|NE)*: stay in state 1
[^EN]: back to state 0
N[^ED]: back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
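One way to gain some confidence in such a hand-derived expression is to compare it with the state machine by brute force over short strings. The sketch below does that in Python. The helper names are hypothetical, the test alphabet E, N, D, x is an arbitrary choice, and the corrected pattern is written with a trailing * so that it can repeat like the earlier patterns:

```python
import re
from itertools import product

NAIVE = re.compile(r"([^E]|E([^N]|N[^D]))*")
FIXED = re.compile(r"([^E]|E(E|NE)*([^EN]|N[^ED]))*")

def ends_in_state_0(s):
    """Simulate the automaton: state 1 = seen E, state 2 = seen EN."""
    state = 0
    for ch in s:
        if state == 2 and ch == "D":
            return False                 # END found: reject
        if ch == "E":
            state = 1
        elif ch == "N" and state == 1:
            state = 2
        else:
            state = 0
    return state == 0

def disagreements(pattern, max_len=6):
    """Strings over {E, N, D, x} where regex and automaton differ."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("ENDx", repeat=n):
            s = "".join(chars)
            if bool(pattern.fullmatch(s)) != ends_in_state_0(s):
                bad.append(s)
    return bad

print(disagreements(FIXED))             # []
print("ENEND" in disagreements(NAIVE))  # True
```

An exhaustive check like this is not a proof, but it catches the kind of slip shown in the naive pattern very quickly.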
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize Python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start}        { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n  { /* Append the next token to yytext instead of
                  * replacing yytext with the next token
                  */
                 yymore();
                 /* No return yet, flex continues */
               }
<TRIPLEQ>{end} { /* We've found the end of the string, but
                  * we need to get rid of the terminating """
                  */
                 yylval.str = malloc(yyleng - 2);
                 memcpy(yylval.str, yytext, yyleng - 3);
                 yylval.str[yyleng - 3] = 0;
                 /* Leave the exclusive start condition so that the
                  * rest of the input is scanned normally
                  */
                 BEGIN( INITIAL );
                 return STRING;
               }
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)
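For readers who do not have flex at hand, the behaviour of the start-condition approach can be imitated in a few lines of Python. This is only an analogy, not generated-scanner code: scan_delimited is a made-up name, and the start and end delimiters are plain parameters in the same spirit as the two definitions above:

```python
def scan_delimited(text, start, end):
    """Collect the bodies of all start...end sections in text."""
    bodies = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            return bodies
        body_start = begin + len(start)
        # Like the scanner in the start condition: consume characters
        # until the first occurrence of the terminator.
        finish = text.find(end, body_start)
        if finish == -1:
            raise ValueError("unterminated section")
        bodies.append(text[body_start:finish])
        pos = finish + len(end)

print(scan_delimited('a """x "" y""" b', '"""', '"""'))        # ['x "" y']
print(scan_delimited('<![CDATA[x ]] y]]>', '<![CDATA[', ']]>'))  # ['x ]] y']
```

Switching from triple quotes to CDATA only changes the two arguments, which mirrors how little of the flex ruleset has to change.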
