Can alternative rules overlap in their initial tokens in PEG? - parsing

I think I just realized that a problem I've run into several times and for which I've always found some odd work-around might just be fundamental. I'm hoping someone can either validate that understanding or show me what I'm doing wrong.
I'm trying to create a parser that will handle several variants of a syntax. But I guess there is ambiguity because in the simplest case, each variant allows the same input. Since I was mapping them to the same syntax tree, I had hoped that it wouldn't matter. Of course it does.
I'm using Peg.js and trying to build a grammar that will accept, for simplicity's sake, inputs like abc/cd/e, x/yz, or just foo, but that also accepts abc.cd.e, x.yz, and, of course, still foo. I'm guessing the problem is that both of my variants accept foo.
This is what I was trying to do:
Expression = SlashTerm / DotTerm
SlashTerm = Name ('/' SlashTerm)?
DotTerm = Name ('.' DotTerm)?
Name = [a-z0-9_\-]i+
And this recognizes foo/bar, but not foo.bar. If I switch to Expression = DotTerm / SlashTerm, of course it recognizes foo.bar and not foo/bar.
So the question is: is there any better way of handling this than manually forcing the choice with something like
Expression = DotTerm / SlashTerm
SlashTerm = Name ('/' SlashTerm)?
DotTerm = Name '.' DotNode
DotNode = Name ('.' DotNode)?
Name = [a-z0-9_\-]i+
While there is no issue adding the additional rule, I don't like that this required me to switch from Expression = SlashTerm / DotTerm to Expression = DotTerm / SlashTerm. I thought either one should work, since there is no ambiguity. Perhaps I misunderstand the choice operation, but my impression was that with Expression = SlashTerm / DotTerm, when it tried to parse foo.bar, it would get through foo, hit the ., fail to find a match, and backtrack to the top of SlashTerm; since SlashTerm wouldn't match, it would move on to DotTerm, where it would find a match. Since that doesn't work, my understanding is incorrect, and I'm hoping someone can explain why.
I'd also love to hear about better ways to do this.
My real grammar is of course much more complex; SlashTerm is much more involved. But the DotTerm equivalent really is that simple. The DotTerm equivalent syntax (which I can obviously handle easily without a full-fledged parser) is the only supported syntax now, but I'm trying to expand it to a much more robust language. I'd like to not branch the code based on the choice of syntax, and want to easily be able to deprecate it later. Including a few rules in my grammar seems the simplest way to do so. But if there is another simple way to do this, I'd love to hear it.

But SlashTerm does match. It doesn't match the entire input, but it successfully matches a part of the input. Consequently, SlashTerm / DotTerm successfully matches, too. And PEG doesn't backtrack over a successfully matched component. For PEG to backtrack, the component being matched must fail.
So one solution is to insist that the alternatives match the first delimiter, unless there isn't one:
Expression = SlashTerm / DotTerm / Name
SlashTerm = Name ('/' Name)+
DotTerm = Name ('.' Name)+
Name = [a-z0-9_\-]i+
In that case, neither SlashTerm nor DotTerm will match a single word, so the choice operator must fall through to the third alternative. Similarly, foo.bar cannot match SlashTerm, so it will fail and the choice operator will fall back to the second alternative, as expected.
In the above, I replaced the recursion with an iteration, which seemed to me to be simpler. If that doesn't fit your grammar model, you can adapt your original solution with one extra non-terminal:
Expression = SlashTerm / DotTerm / Name
SlashTerm = Name '/' SlashRest
DotTerm = Name '.' DotRest
SlashRest = Name ('/' SlashRest)?
DotRest = Name ('.' DotRest)?
Name = [a-z0-9_\-]i+
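Either grammar can be checked quickly by generating a parser and feeding it a few inputs. Below is a minimal sketch, assuming the pegjs npm package (its successor, peggy, exposes the same generate/parse API); the embedded grammar is the recursive version just above. Note that a generated PEG.js parser only succeeds when the start rule consumes the whole input, which is also why the original Expression = SlashTerm / DotTerm reported an error at the '.' instead of retrying with DotTerm.
// Minimal sketch, assuming the "pegjs" npm package is installed.
const peg = require("pegjs");

const grammar = `
Expression = SlashTerm / DotTerm / Name
SlashTerm  = Name '/' SlashRest
DotTerm    = Name '.' DotRest
SlashRest  = Name ('/' SlashRest)?
DotRest    = Name ('.' DotRest)?
Name       = [a-z0-9_\\-]i+
`;

const parser = peg.generate(grammar);

for (const input of ["foo", "abc/cd/e", "x/yz", "abc.cd.e", "foo.bar"]) {
  try {
    parser.parse(input);                      // throws a SyntaxError on failure
    console.log(input, "-> accepted");
  } catch (e) {
    console.log(input, "-> rejected:", e.message);
  }
}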

Related

How to parse dot operator in language syntax?

Let's say I'm writing a parser that parses the following syntax:
foo.bar().baz = 5;
The grammar rules look something like this:
program: one or more statement
statement: expression followed by ";"
expression: one of:
- identifier (\w+)
- number (\d+)
- func call: expression "(" ")"
- dot operator: expression "." identifier
Two of these expressions are problematic: the func call and the dot operator. This is because the expressions are recursive and look for another expression at the start, causing a stack overflow. I will focus on the dot operator for this question.
We face a similar problem with the plus operator. However, rather than using an expression, you would do something like this to solve it (look for a "term" instead):
add operation: term "+" term
term: one of:
- number (\d+)
- "(" expression ")"
The term then includes everything except the add operation itself. To ensure that multiple plus operators can be chained together without using parentheses, one would rather do:
add operation: term, one or more of ("+" followed by term)
I was thinking a similar solution could work for the dot operator or for function calls.
However, the dot operator works a little differently. We always evaluate from left to right and need to allow full expressions so that you can do function calls etc. in between. With parentheses, an example might be:
(foo.bar()).baz = 5;
Unfortunately, I do not want to require parentheses. This would end up being the case if I followed the method used for the plus operator.
How could I go about implementing this?
Currently my parser never peeks ahead, but even if I do look ahead, it still seems tricky to accomplish.
The easy solution would be to use a bottom-up parser which doesn't drop into a bottomless pit on left recursion, but I suppose you have already rejected that solution.
I don't understand your objection to using a looping construct, though. Postfix modifiers like field lookup and function call are not really different from binary operators like addition (except, of course, for the fact that they will not need to claim an eventual right operand). Plus and minus intermingle freely, which you can parse with a repetition like:
additive: term ( '+' term | '-' term )*
Similarly, postfix modifiers can be easily parsed with something like:
postfixed: atom ( '.' ID | '(' opt-expr-list ')' )*
I'm using a form of extended BNF: parentheses group; | separates alternatives and binds less tightly than concatenation; and * means "zero or more repetitions" of the item on its left.
Another postfix operator which falls into the same category is array/map subscripting ('[' expr ']'), although you might also have other postfix operators.
Note that, as with the additive syntax above, selecting the appropriate alternative does not require looking beyond the next token. It's hard to parse without being able to peek one token into the future; fortunately, that's very little overhead.
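In a hand-written recursive-descent parser, that repetition becomes an ordinary loop: parse an atom, then keep consuming postfix modifiers while the next token is '.' or '('. The sketch below is illustrative JavaScript; peek, next, expect, parseAtom and parseExprList are hypothetical helpers of the surrounding parser, not part of any particular library.
// Sketch of `postfixed: atom ( '.' ID | '(' opt-expr-list ')' )*`
// using assumed helpers: peek(), next(), expect(), parseAtom(), parseExprList().
function parsePostfixed() {
  let node = parseAtom();                    // identifier, number, '(' expression ')', ...
  for (;;) {
    const tok = peek();                      // one token of lookahead is enough
    if (tok === ".") {
      next();                                // consume '.'
      const field = expect("ID");            // the identifier after the dot
      node = { kind: "member", object: node, field };
    } else if (tok === "(") {
      next();                                // consume '('
      const args = peek() === ")" ? [] : parseExprList();
      expect(")");
      node = { kind: "call", callee: node, args };
    } else {
      return node;                           // no more postfix modifiers
    }
  }
}
Because each iteration wraps the node built so far, foo.bar().baz comes out left-associated as ((foo.bar)()).baz, with no left recursion and only one token of lookahead.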
One way could be for the dot operator to parse a non-dot expression, that is, a rule that is the same as expression but without the dot operator. This prevents recursion.
Then, when the non-dot expression has been parsed, check whether a dot and an identifier follow. If this is not the case, we are done. If it is, wrap the current node up in a dot operation node. Then keep track of the entire string of text that has been parsed for this operation so far, revert everything back to before the operation was being parsed, and re-parse a "custom expression", where the first directly-nested expression would really be trying to match the exact string that was parsed before rather than a real expression. Repeat until there are no more dot-identifier pairs (this should happen automatically with the new "custom expression").
This is messy, complicated and possibly slow, and I'm not entirely sure if it'll work but I'll try it out. I'd appreciate alternative solutions.

Retaining separator while imploding

I have a syntax definition that looks like this
keyword LegendOperation = 'or' | 'and';
syntax LegendData
= legend_data: LegendKey '=' {ID LegendOperation}+ Newlines
;
I need to implode this in a way that allows me to retain the information on whether the separator for the ID is 'or' or 'and', but I didn't find anything in the docs on whether the separator is retained and if it can be used by implode. Initially, I did something like the below to try and keep that information.
syntax LegendData
= legend_data_or: LegendKey '=' {ID 'or'}+ Newlines
> legend_data_and: LegendKey '=' {ID 'and'}+ Newlines
;
The issue that I run into is that there are three forms of text that it needs to be able to parse
. = Background
# = Crate and Target
# = Crate or Wall
And when it tries to parse the first line, it hits an ambiguity error when it should instead parse it as a legend_data_or with a single element (perhaps I misunderstood how to use the priorities). Honestly, I would prefer to be able to use that second format, but is there a way to disambiguate it?
Either a way to implode the first syntax while retaining the separator or a way to disambiguate the second format would help me with this issue.
I did not manage to come up with an elegant solution in the end. Discussing it with others, the best we could come up with was:
syntax LegendOperation
= legend_or: 'or' ID
| legend_and: 'and' ID
;
syntax LegendData
= legend_data: LegendKey '=' ID LegendOperation* Newlines
;
This works and allows us to retain the information on the separator, but it requires post-processing to turn the result into a usable datatype.
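For illustration, that post-processing could look roughly like the sketch below. In the real project it would be a Rascal function over the imploded tree; this is just a language-agnostic outline written in JavaScript, and the node shapes (first, rest, op, id) are assumptions about how the imploded data might be represented.
// Hypothetical fold: `first` is the leading ID, `rest` is the LegendOperation*
// list, e.g. [{ op: "and", id: "Target" }]. Shapes are illustrative only.
function foldLegendData(first, rest) {
  if (rest.length === 0) {
    return { op: "or", ids: [first] };       // single symbol: pick a default operator
  }
  const ops = new Set(rest.map(r => r.op));
  if (ops.size > 1) {
    throw new Error("mixed 'and'/'or' separators in one legend entry");
  }
  return { op: rest[0].op, ids: [first, ...rest.map(r => r.id)] };
}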

In a tree-sitter grammar, how do I match strings except for reserved keywords in identifiers?

This might be related to me not understanding the Keyword Extraction feature, which from the docs seems to be about avoiding an issue where no space exists between a keyword and the following expression. But say I have a fairly standard identifier regex for variable names, function names, etc.:
/\w*[A-Za-z]\w*/
How do I keep this from matching a reserved keyword like IF or ELSE or something like that? So this expression would produce an error:
int IF = 5;
while this would not:
int x = 5;
There has been a pull request pending since 2019 to add an EXCLUDE feature, but it is not implemented as of the time of writing (April 2021 - if some time has passed and you're reading this, please do re-check!). And since tree-sitter also does not support negative lookbehind in its regular expressions, this has to be handled at the semantic level. One thing you can do to make this check easier is to enumerate all your reserved words and add them as an alternative to your identifier regex:
keyword: $ => choice('IF', 'THEN', 'ELSE'),
name: $ => /\w*[A-Za-z]\w*/,
identifier: $ => choice($.keyword, $.name)
According to rule 4 of tree-sitter's match rules, in the expression int IF = 5; the IF token would match (identifier keyword) rather than (identifier name), since it is a more specific match. This means you can do an easy query for illegal (identifier keyword) nodes and surface the error to the user in your language server or from wherever it is you're using the tree-sitter grammar.
Note that this approach does run the risk of creating many conflicts between your (identifier keyword) match and the actual language constructs that use those keywords. If so, you'll have to handle the whole thing at the semantic level: scan all identifiers to check whether they're a reserved word.
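As a sketch of what that query could look like from JavaScript, assuming the web-tree-sitter bindings and that the grammar above has been compiled to a .wasm file (the file name, language name and capture name below are placeholders):
// Report identifiers that are really reserved keywords, e.g. the IF in "int IF = 5;".
const Parser = require("web-tree-sitter");

async function findReservedIdentifiers(source) {
  await Parser.init();
  const MyLang = await Parser.Language.load("tree-sitter-mylang.wasm"); // placeholder path
  const parser = new Parser();
  parser.setLanguage(MyLang);

  const tree = parser.parse(source);
  // An (identifier ...) whose child is a (keyword ...) is an illegal name.
  const query = MyLang.query("(identifier (keyword)) @reserved");
  return query.captures(tree.rootNode).map(({ node }) => ({
    text: node.text,
    row: node.startPosition.row,
    column: node.startPosition.column,
  }));
}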

Lexer rule optional suffix not matching, when it should match

Using ANTLR 3, my lexer has the rule
SELECT_ASSIGN:
'SELECT' WS+ IDENTIFIER WS+ 'ASSIGN' WS+ (('TO'|'USING') WS+)?
Using this, these match correctly:
SELECT VAR1 ASSIGN TO
SELECT VAR1 ASSIGN USING
and this also matches
SELECT VAR1 ASSIGN FOO
However, this does not match:
SELECT VAR1 ASSIGN TWO
even though I have marked ('TO'|'USING') as optional in the rule.
From the generated Java code I see that when the lexer notices the T of TWO, it goes into match('TO'), but since it does not find an O after the T, it generates a failure and returns all the way out of the rule -- hence not matching it.
How do I get my lexer rule to match when the input has a word that merely starts with the characters of the rule's optional suffix?
Basically, I want my rule to also match this (besides what it already matches, as listed at the start):
SELECT VAR1 ASSIGN TWO
Kindly suggest how I should approach/resolve this situation.
NOTE:
Such rules are usually recommended in the parser, but I have this in the lexer because I do not want to parse the entire input with the parser; I only want to parse the content of interest. Using such rules in the lexer, I locate the sections that I really want the parser to parse.
UPDATE 1
I could circumvent this problem by making 2 rules, like so:
SELECT_ASSIGN_USING_TO
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN' WS+ ('USING'|'TO')
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
But is it possible to achieve the desired behaviour in one lexer rule?
An approach to get this in one rule, suggested by my senior, is to use a syntactic predicate:
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
(
(WS+ ('TO'|'USING') WS+)=> (WS+ ('TO'|'USING') WS+)
| (WS+)
)
A token matches a complete character sequence or nothing at all; it cannot match partially, and the grammar rule determines exactly which sequence that is. You cannot expect a rule for TO to match TWO. If you want TWO to match too, you have to add it to your lexer rule.
A few notes here:
The solution your "senior" gave you makes no sense at all. A syntactic predicate is a kind of lookahead to guide the parser in case of ambiguities. There are no ambiguities involved here.
Writing the entire SELECT_ASSIGN rule as a lexer rule is very uncommon and not flexible. A lexer rule should not be used for entire sentences, but only for a small set of characters to find tokens and assign them a type (usually elementary structures of a language like strings, numbers, comments etc.).
ANTLR 3 is totally outdated and I wonder why it is still used in your class. ANTLR 4 has been out for 5 years and should be the choice for any new project.

Xtext grammar with chosen predicates

I am trying to understand an Xtext grammar I have found (below). I have two questions:
The XFeatureCall rule has return type XExpression, but this is overruled by {XFeatureCall}, so could I set "returns XFeatureCall" as well? Or is it actually necessary to do it this way?
Lines 8 and 14 start with "=>". Are these "chosen predicates" or something else that has not come to my attention so far? I could not find this variation of chosen predicates in the Xtext documentation, so I would appreciate clarification on its application.
xtext grammar:
StaticEquals:':=';
XFeatureCall returns XExpression:
// Same as Xbase...
{XFeatureCall}
(declaringType=[JvmDeclaredType|StaticQualifier])?
('<' typeArguments+=JvmArgumentTypeReference (',' typeArguments+=JvmArgumentTypeReference)* '>')?
(feature=[JvmIdentifiableElement|IdOrSuper]|'class')
(=>explicitOperationCall?='('
(
featureCallArguments+=XShortClosure
| featureCallArguments+=XExpression (',' featureCallArguments+=XExpression)*
)?
')')?
=>featureCallArguments+=XClosure?
// ... Except with this additional optional clause that allows static members to be set with := operator
({XAssignment.assignable = current} StaticEquals value = XAssignment)?;
First question: in fact, in this case your rule returns an XFeatureCall, but XFeatureCall has XExpression as its supertype. This is useful, for example, if you have:
SomeRule: (parts+=XFeatureCall)* (parts+=XOtherFeatureCall)*
Let XOtherFeatureCall also extend XExpression, and parts be a list of XExpressions.
Second question: it is a priority operator and means that what follows should be parsed now, even if there are other parsing solutions. See this classic example:
if a
if b
do;
else
doelse;
The else could be parsed as belonging to the inner if or the outer if. Of course we want it to belong to the inner if. Setting a rule such as:
=>'else' else=ElseExpression
forces the grammar to parse the else as soon as it finds it, instead of returning to the outer rule, which could consume an else too. So it solves an ambiguity.
