How to parse dot operator in language syntax? - parsing

Let's say I'm writing a parser that parses the following syntax:
foo.bar().baz = 5;
The grammar rules look something like this:
program: one or more statement
statement: expression followed by ";"
expression: one of:
- identifier (\w+)
- number (\d+)
- func call: expression "(" ")"
- dot operator: expression "." identifier
Two expressions have a problem, the func call and the dot operator. This is because the expressions are recursive and look for another expression at the start, causing a stack overflow. I will focus on the dot operator for this quesition.
We face a similar problem with the plus operator. However, rather than using an expression you would do something like this to solve it (look for a "term" instead):
add operation: term "+" term
term: one of:
- number (\d+)
- "(" expression ")"
The term then includes everything except the add operation itself. To ensure that multiple plus operators can be chained together without using parenthesis, one would rather do:
add operation: term, one or more of ("+" followed by term)
I was thinking a similar solution could for for the dot operator or for function calls.
However, the dot operator works a little differently. We always evaluate from left-to-right and need to allow full expressions so that you can do function calls etc. in-between. With parenthesis, an example might be:
(foo.bar()).baz = 5;
Unfortunately, I do not want to require parenthesis. This would end up being the case if following the method used for the plus operator.
How could I go about implementing this?
Currently my parser never peeks ahead, but even if I do look ahead, it still seems tricky to accomplish.

The easy solution would be to use a bottom-up parser which doesn't drop into a bottomless pit on left recursion, but I suppose you have already rejected that solution.
I don't understand your objection to using a looping construct, though. Postfix modifiers like field lookup and function call are not really different from binary operators like addition (except, of course, for the fact that they will not need to claim an eventual right operand). Plus and minus intermingle freely, which you can parse with a repetition like:
additive: term ( '+' term | '-' term )*
Similarly, postfix modifiers can be easily parsed with something like:
postfixed: atom ( '.' ID | '(' opt-expr-list `)` )*
I'm using a form of extended BNF: parentheses group; | separates alternatives and binds less stringly than concatenation; and * means "zero or more repetitions" of the atom on its left.
Another postfix operator which falls into the same category is array/map subscripting ('[' expr ']'), although you might also have other postfix operators.
Note that like the additive syntax above, selecting the appropriate alternative does not require looking beyond the next token. It's hard to parse without being able to peek one token into the future. Fortunately, that's very little overhead.

One way could be for the dot operator to parse a non-dot expression, that is, a rule that is the same as expression but without the dot operator. This prevents recursion.
Then, when the non-dot expression has been parsed, check if a dot and an identifier follows. If this is not the case, we are done. If this is the case, wrap the current node up in a dot operation node. Then, keep track of the entire string text that has been parsed for this operation so far. Then revert everything back to before the operation was being parsed, and now re-parse a "custom expression", where the first directly-nested expression would really be trying to match the exact string that was parsed before rather than a real expression. Repeat until there are no more dot-identifier pairs (this should happen automatically by the new "custom expression").
This is messy, complicated and possibly slow, and I'm not entirely sure if it'll work but I'll try it out. I'd appreciate alternative solutions.

Related

Is there a guide/convention to the markup being used in the lua official documentation?

I consistently have a really hard time reading official documentation when it's related to coding. I generally don't understand it unless it's paired with an example. I am seeking clarification on what kind of conventions are inplace when reading docs, if any. Take the example below from the lua manual(https://www.lua.org/manual/5.1/manual.html#2.1)
:
stat ::= if exp then block {elseif exp then block} [else block] end
The first word, Stat, is defined as a statement and "this set includes assignments, control structures, function calls, and variable declarations."
::= Is not defined in the docs, it can be googled thankfully.
Exp is linked and explained.
Block has a section as well.
But then they do {} and []. They literally stated "Square brackets are used to index a table" just a few lines above. And that squiggly brackets are for writing a table. So what am I supposed to deduce from this? That {} and [] are being used to denote separate sections as a markup to make it easier to see certain components? Or that {elseif exp then block} is a table with those values inside of itself and [else block] is a key-value indexing a table? If I was writing a doc where that was indeed the case, wouldn't I write it this way?
Then I see
var ::= prefixexp `[´ exp `]´`
' ' defines a string, but I have to make the assumption that '[' ']' is used as a way to highlight the fact that because they were talking about what square brackets do in the previous section they are simply highlighting their position and this should not be included in the code. I only know to make this assumption though cause I know it doesn't work when you put them in there.
But then I see this:
chunk ::= {stat [`;´]}
Similarly they are talking about the placement of the semicolon before listing that code, but the entirety of the line of code was also newly explained and being talked about. Why would I assume that its written without the parenthesis if its written with the parenthesis? And I see they are using {} and [] again, and I have no idea what they are referencing because its not stated explicitly that we're talking about a table...its simply using the code itself to explain whether its talking about a table or not with the {}, but we have that first set of code where {} is being used and its not talking about a table.
What is the convention being used? What are they actually trying to do/show by using {} and [] in the first line of code?
As stated both at the beginning of the Lua documentation and in the section on the Lua grammar, Lua presents its grammar in extended BNF format.
EBNF has its own punctuation with its own meaning, like ::= as you discovered. But as a grammar, there needs to be a distinction between the EBNF meaning of a piece of punctuation and "this punctuation appears in the language defined by the grammar". The former meaning is therefore always assumed; the latter meaning can only be achieved by quoting the punctuation.
So this:
var ::= prefixexp `[´ exp `]´`
Means a prefixexp followed by an open bracket followed by exp followed by a close bracket.
By contrast, this:
funcname ::= Name {`.´ Name} [`:´ Name]
Means Name followed by zero or more sub-sequences of . followed by Name, followed by an optional sub-sequence of : followed by Name. Because those are what {} and [] mean to EBNF.

Lex won't recognize double operators- !=, :=, <<, etc. Can I give a Lex expression precedence?

Trying to parse operators (+, -, =, <<, !=), using states like
%{
%}
OP ["+"|";"|":"|","|"*"|"/"|"="|"("|")"|"{"|"}"|"*"|"#"|"$"|
"<"|">"|"&"|"|"|"!"|]
DOUBOP [":="|".."|"<<"|">>"|"<>"|"<="|">="|"=>"|"**"|"!="|"{:"|"}:"|"\-"]
and later on
{DOUBOP} { printf("%s (operator)\n", yytext); }
{OP} { printf("%s (operator)\n", yytext); }
but Lex is identifying operators like "<<" as "<" and "<". I thought since it was in double quotes this would work, but I see that's not the case.
Is there anyway I can give a regular expression precedence, ie have lex check for a double operator first, and then a single operator?
Thanks in advance.
[...] is a character class, not an eccentric type of parenthesis. If you want to parenthesize a sub-expression in a pattern, use ordinary parentheses. In this case, however, parentheses are not necessary. (Indeed, most of the quotes aren't necessary either, but they don't hurt and some of them would be useful.)
"==" recognises the two character-sequence consisting of two equal signs. "=="|"++" recognizes either two equal signs or two plus signs.
By contrast, ["=="] recognises a single character, which could be either a quote or an equals sign. Since a character class is a set, the fact that each of those appears twice is irrelevant (although I think it would save a lot of grief if flex issued a warning). Similarly, ["=="|"<<"] recognises a single character if it is a quote, an equals sign, a vertical bar or a less than sign.
Flex pattern syntax is documented in the flex manual. It differs in a few ways from regexes in other systems, so it's worth reading the short document. However, character classes are mostly the same in all regex syntaxes in common use, especially the use of square brackets to delimit the set.
An easier way is to put all single characters together, and run the * command on the end up curly braces.
i.e.
OP ["+"|";"|":"|","|"*"|"/"|"="|"("|")"|"{"|"}"|"*"|"#"|"$"|
"<"|">"|"&"|"|"|"!"|]*

Parsing optional semicolon at statement end

I was writing a parser to parse C-like grammars.
First, it could now parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
Where the new line is processed by the lexer that written by myself(the code are simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to change the lexer's state dynamically by the parser correspondingly.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.
It's strange that it doesn't actually work with following code:
a = 1
b = 2
I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.
And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line
After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.
My question is: how to make it work as expected? I have no idea on this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.

Help with Shift/Reduce conflict - Trying to model (X A)* (X B)*

Im trying to model the EBNF expression
("declare" "namespace" ";")* ("declare" "variable" ";")*
I have built up the yacc (Im using MPPG) grammar, which seems to represent this, but it fails to match my test expression.
The test case i'm trying to match is
declare variable;
The Token stream from the lexer is
KW_Declare
KW_Variable
Separator
The grammar parse says there is a "Shift/Reduce conflict, state 6 on KW_Declare". I have attempted to solve this with "%left PrologHeaderList PrologBodyList", but neither solution works.
Program : Prolog;
Prolog : PrologHeaderList PrologBodyList;
PrologHeaderList : /*EMPTY*/
| PrologHeaderList PrologHeader;
PrologHeader : KW_Declare KW_Namespace Separator;
PrologBodyList : /*EMPTY*/
| PrologBodyList PrologBody;
PrologBody : KW_Declare KW_Variable Separator;
KW_Declare KW_Namespace KW_Variable Separator are all tokens with values "declare", "naemsapce", "variable", ";".
It's been a long time since I've used anything yacc-like, but here are a couple of suggestions that may or may not help.
It seems that you need a 2-token lookahead in this situation. The parser gets to the last PrologHeader, and it has to decide whether the next construct is a PrologHeader or a PrologBody, and it can't tell that from the KW_Declare. If there's a directive to increase lookahead in this situation, it will probably solve the problem.
You could also introduce context into your actions: rather than define PrologHeaderList and PrologBodyList, define PrologRuleList and have the actions throw an error if a header appears after a body. Ugly, but sometimes you have to do it: what appears simple in a grammar may not be simple in the generated parser.
A hackish approach might be to combine the tokens: rather than KW_Declare and KW_Variable, have your lexer recognize the space and use KW_Declare_Variable. Since both are keywords, you're not going to run into namespace collision problems.
The grammar at the top is regular so IIRC you can plot it out as a DFA (or a NDA and convert it to a DFA) and then convert the DFA to a grammar. It's bean a while so I'll leave the work as an exercise for the reader.

Create a Print Function

I'm learning Bison and at this time the only thing that I do was the rpcalc example, but now I want to implement a print function(like printf of C), but I don't know how to do this and I'm planning to have a syntax like this print ("Something here");, but I don't know how to build the print function and I don't know how to create that ; as a end of line. Thanks for your help.
You first need to ask yourself:
What are the [sub-]parts of my 'print ("something");' syntax ?
Once you identify these parts, "simply" describe them in the form of grammar syntax rules, along with applicable production rules. And then let Bison generate the parser for you; that's about it.
To put you on your way:
The semi-column is probably a element you will use to separate statemements (such a one "call" to print from another).
'print' itself is probably a keyword, or preferably a native function name of your language.
The print statement appears to take a literal string as [one of] its arguments. a literal string starts and ends with a double quote (and probably allow for escaped quotes within itself)
etc.
The bolded and italic expressions above are some of the entities (the 'symbols' in parser lingo) you'll likely need to define in the syntax for your language. For that you'll use Bison grammar rules, such as
stmt : print_stmt ';' | input_stmt ';'| some_other_stmt ';' ;
prnt_stmt : print '(' args ')'
{ printf( $3 ); }
;
args : arg ',' args;
...
Since the question asked about the semi-column, maybe some confusion was from the different uses thereof; see for example above how the ';' belong to your language's syntax whereby the ; (no quotes) at the end of each grammar rule are part of Bison's language.
Note: this is of course a simplistic implementation, aimed at showing the essential. Also the Bison syntax may be a tat off (been there / done it, but a long while back ;-) I then "met" ANTLR never to return to Bison, although I do see how its lightweight and fully self contained nature can make it appropriate in some cases)

Resources