reduce/reduce conflicts using ocamlyacc

I am struggling with a grammar that involves typed expressions as well as variable access. The result type of such an access cannot be determined during parsing and is resolved in a second step. That second step is not a problem, but it seems hard to write unambiguous parser rules.
All operations that work on different types (e.g. comparison operators) produce a reduce/reduce conflict. Clearly, this is because the parser cannot decide whether "x.a = y.b" should be parsed as "bool_expr EQUAL bool_expr" or as "num_expr EQUAL num_expr", since the type is uncertain. However, the result type of the comp_op rule is certain (it is always a boolean value).
Is there any solution to this problem that doesn't throw all type information away during parsing and check everything in the evaluation phase?
Here is a shortened grammar example (using ocamllex and ocamlyacc):
comp_op:
| bool_expr EQUAL bool_expr { T.Equiv (T.Wrapper $1, T.Wrapper $3) }
| num_expr EQUAL num_expr { T.Equiv (T.Wrapper $1, T.Wrapper $3) }
/* the wrapper type basically just wraps the expressions to get a common type */
bool_expr:
| TRUE { T.Bool true }
| FALSE { T.Bool false }
| access { T.BoolAccess $1 }
num_expr:
| NUMBER { T.Num $1 }
| access { T.NumAccess $1 }
access:
/* some more complex rules describing the access to variables */
Thanks for your help.

As ygrek says, you should not try to mix parsing and typing. It's much easier to write your parser with only one syntactic category for expressions, and then to have a separate type-checking pass that will sort that out.
Theoretically, this comes from the fact that the distinctions made by typing rules are much finer than what traditional parsing technologies can express. There have been attempts at specifying typing rules more declaratively using, for example, attribute grammars, but your usual LL/LR technology is certainly not a good fit; it's like parsing nested parentheses with a regular expression.
Finally, you should use menhir instead of ocamlyacc, because it's just better. You will have more readable and expressive grammars (named parameters, parametrized rules...), better error reporting and grammar debugging features.
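For illustration only, here is a rough sketch of what a single expression category could look like for the grammar above; the untyped T.Access and T.Equal constructors and the %nonassoc declaration are assumptions, not part of the original code:
/* sketch: one syntactic category, types sorted out in a later pass */
%nonassoc EQUAL
%%
expr:
| TRUE            { T.Bool true }
| FALSE           { T.Bool false }
| NUMBER          { T.Num $1 }
| access          { T.Access $1 }      /* hypothetical untyped constructor */
| expr EQUAL expr { T.Equal ($1, $3) } /* likewise hypothetical */
A later pass walks this tree, looks up the type of each access, and rejects Equal nodes whose operand types disagree; that is where the "x.a = y.b" case gets its error message.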

As already said, you will have a hard time writing a "type-correct parser"; depending on your language, this might even be impossible.
Anyway, the problem here is that your grammar does not know the type of the "access" production; as far as I understood, this production resembles reading from variables, whose types are unknown at parse time.
The way I see it, you either abandon 100% type-correct parsing or you find a way of "magically" knowing the types of your variables.
You could keep track of type declarations and let the lexer look up the type of each variable it encounters; the lexer would then emit a different identifier token depending on the variable's type.
I'm not sure if this approach works as I don't know what your language looks like.
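If you do go down that road, one possible shape for it is sketched below in OCaml; the table, the token constructors and the way declarations fill the table are all assumptions, not taken from the question:
(* variable types recorded when their declarations are lexed/parsed *)
type ty = TyBool | TyNum

let var_types : (string, ty) Hashtbl.t = Hashtbl.create 16

type token =
  | BOOL_IDENT of string   (* identifier known to be boolean *)
  | NUM_IDENT of string    (* identifier known to be numeric *)
  | IDENT of string        (* identifier of unknown type *)

(* called from the ocamllex action for the identifier regexp *)
let ident_token name =
  match Hashtbl.find_opt var_types name with
  | Some TyBool -> BOOL_IDENT name
  | Some TyNum -> NUM_IDENT name
  | None -> IDENT name
The grammar could then put BOOL_IDENT under bool_expr and NUM_IDENT under num_expr, at the price of coupling the lexer to declaration handling.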

Related

Better to be more explicit or less explicit in a parsing grammar?

Let's say I have a SQL-like language that supports testing whether two expressions are equal only if they are the same type; if they're not the same type, it raises an error. Examples would be:
1 = 1 # true
1 = 1.2 # false
1 = '1' # error
1 = '1'::int # true
What would the best production for this be?
EQ: expr '=' expr
Or something much more detailed, which would attempt to catch type errors at the lexing (parsing?) stage, such as:
EQ: numeric_expr '=' numeric_expr
| string_expr '=' string_expr
| ...etc
SO is notoriously hostile to questions that ask for an opinion. But in this case, I think there is an objective preference, so I'll hazard an answer.
You cannot, in general, detect a type mismatch in a context-free grammar, for the simple reason that the type of a variable (or fieldname, or whatever) is not part of its syntax. Looking up the type of a variable is, pretty well by definition, not context-free.
Of course, you can do some very limited typechecking on constant expressions, but that's not a particularly interesting case. Most SQL queries do not involve comparing two literal values.
Trying to classify expression productions syntactically by type will almost certainly lead to grammar conflicts. But even if you somehow manage to do it, you will then have to try to construct a sensible error message, since flagging a type mismatch as "Syntax error" is highly misleading to the programmer. And producing good error messages during a parse is much harder than you might think.
By contrast, it is very easy to catch type errors by analysing the parse tree created by the parser, at least if there is some mechanism to identify the type of a named object. Furthermore, if you are using a typechecker to check for type mismatches, rather than a general-purpose parser, it is almost trivial to produce a meaningful error message.
So it's not really a question of "explicitness". It's a question of detecting errors at the correct point during program analysis.
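As a rough sketch of what that later analysis can look like (OCaml, with a hypothetical AST, and simplified so that any type mismatch is an error):
type ty = TInt | TFloat | TString | TBool

let show_ty = function
  | TInt -> "int" | TFloat -> "float" | TString -> "string" | TBool -> "bool"

type expr =
  | IntLit of int
  | FloatLit of float
  | StrLit of string
  | Cast of expr * ty        (* e.g. '1'::int *)
  | Eq of expr * expr

exception Type_error of string

let rec type_of = function
  | IntLit _ -> TInt
  | FloatLit _ -> TFloat
  | StrLit _ -> TString
  | Cast (_, t) -> t
  | Eq (l, r) ->
      let tl = type_of l and tr = type_of r in
      if tl <> tr then
        raise (Type_error
                 (Printf.sprintf "cannot compare %s with %s"
                    (show_ty tl) (show_ty tr)))
      else TBool

(* type_of (Eq (IntLit 1, StrLit "1")) raises
   Type_error "cannot compare int with string" *)
A single expr production in the grammar plus a pass like this gives both an unambiguous grammar and a readable error message.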

Advantages and disadvantages of creating named tokens for single characters

Let's say I have a simple syntax where you can assign a number to an identifier using the = sign.
I can write the parser in two ways.
I can include the character token = directly in the rule or I can create a named token for it and use the lexer to recognize it.
Option #1:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
// parser
%token IDENTIFIER NUMBER
%%
assignment : IDENTIFIER '=' NUMBER ;
Option #2:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
= { return EQUAL_SIGN; }
// parser
%token IDENTIFIER NUMBER EQUAL_SIGN
%%
assignment : IDENTIFIER EQUAL_SIGN NUMBER ;
Both ways of writing the parser work, and I cannot find much information about good practices for this situation.
The first snippet seems to be more readable, but that is not my highest concern.
Is either of these two options faster or beneficial in some other way? Are there technical reasons (other than readability) to prefer one over the other?
Is there maybe a third, better way?
I'm asking about problems concerning huge parsers, where optimization may be a real issue, not just toy examples like the one shown here.
Aside from readability, it basically makes no difference. There really is no optimisation issue, no matter how big your grammar is. Once the grammar has been compiled, tokens are small integers, and one small integer is pretty well the same as any other small integer.
But I wouldn't underrate the importance of readability. For one thing, it's harder to see bugs in unreadable code. It's surprisingly common for a grammar bug to be the result of simply typing the wrong name for some punctuation character. It's much easier to see that '{' expr '{' is wrong than if it were written T_LBRC expr T_LBRC, and furthermore the named symbols are much harder to interpret for someone whose first language isn't English.
Bison's parse table compression requires token numbers to be consecutive integers; this is achieved by passing incoming token codes through a lookup table, which is hardly a major overhead. Not using character codes doesn't avoid this lookup, though, because token numbers 1 through 255 are reserved anyway.
However, Bison's C++ API with named token constructors requires token names, and single-character token codes cannot be used as token names (they're not forbidden, though, since you're not required to use named constructors).
Given that use case, Bison recently introduced an option which generates consecutively numbered token codes in order to avoid the recoding; this option is not compatible with single-character tokens being represented as themselves. It's possible that not having to recode tokens is a marginal speed-up, but it's hard to believe it's significant; still, if you're not going to use single-quoted tokens, you might as well go for it.
Personally, I don't think the extra complexity is justified, at least for the C API. If you do choose to go with token names, perhaps in order to use the C++ API's named constructors, I'd strongly recommend using Bison aliases in order to write your grammar with double-quoted tokens (also recommended for multi-character operator and keyword tokens).
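For example, a sketch of the alias approach (not taken from the question's code): the lexer still returns a named token, but the grammar can spell it as a double-quoted literal:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
= { return EQUAL_SIGN; }
// parser
%token IDENTIFIER NUMBER
%token EQUAL_SIGN "="    /* "=" is now an alias for EQUAL_SIGN */
%%
assignment : IDENTIFIER "=" NUMBER ;
The rule reads almost like the single-quoted version, while the API still sees a named token it can use for constructors.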

Handling unambiguous yet breaking syntax in expression parsing

Context
I've recently run into an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component in a compiler I'm building and the question is in regards to the expression parsing necessary in programming language parsing.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using ordinary expression grammar rules, and I've eliminated left recursion in all of them, but there is one syntactic "ambiguity" which my parser simply can't handle, and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now, because generic expressions have higher precedence (they are language constructs, not mathematical expressions), the generic parser will attempt to parse expressions when it shouldn't.
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail as it can't find the closing tag.
What is the solution to such a scenario? Especially in languages like C++, where generics can also contain expressions; if I'm not mistaken, arr<1<2> might be legal syntax.
Is this a special edge case, or does it require a modification to the syntax definition that I'm not aware of?
Thank you
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail as it can't find the closing tag
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
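A rough sketch of the backtracking variant, as standalone OCaml over a plain token list (so "going back" is just reusing the saved suffix); the token and AST types are hypothetical, and only enough of the grammar is covered to show the fallback:
type token = Ident of string | Int of int | Lt | Gt | Comma | LParen | RParen

type ast =
  | Id of string
  | Num of int
  | Less of ast * ast
  | GenericCall of string * ast list

exception Parse_error

let parse_primary = function
  | Ident x :: rest -> (Id x, rest)
  | Int n :: rest -> (Num n, rest)
  | _ -> raise Parse_error

let rec parse_args toks =
  let arg, rest = parse_primary toks in
  match rest with
  | Comma :: rest' ->
      let args, rest'' = parse_args rest' in
      (arg :: args, rest'')
  | _ -> ([arg], rest)

(* called after "name <" has been consumed; toks is what follows the '<' *)
let parse_generic_or_comparison name toks =
  let as_generic =
    try
      let args, rest = parse_args toks in
      (match rest with
       | Gt :: LParen :: RParen :: rest' ->
           Some (GenericCall (name, args), rest')
       | _ -> None)
    with Parse_error -> None
  in
  match as_generic with
  | Some result -> result                 (* e.g. a<b>() parses as a generic call *)
  | None ->
      let rhs, rest = parse_primary toks in
      (Less (Id name, rhs), rest)         (* e.g. a<2 falls back to a comparison *)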
However that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C# for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparisons, as it makes no sense to compare the result of a comparison with something else (in many languages, using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: either a is a generic function called with the generic argument b<c, or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve the ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So, in summary, supporting expressions as generic arguments requires you to either keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context sensitive) or generate multiple possible ASTs. Without expressions you can keep the grammar context free and unambiguous, but it will require backtracking or arbitrary lookahead (meaning it's LL(*)).
Since neither of those are ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().

How do you deal with keywords in Lex?

Suppose you have a language which allows a production like this: optional optional = 42, where the first "optional" is a keyword, and the second "optional" is an identifier.
On one hand, I'd like to have a Lex rule like optional { return OPTIONAL; }, which would later be used in YACC like this, for example:
optional : OPTIONAL identifier '=' expression ;
If I then define identifier as, say:
identifier : OPTIONAL | FIXED32 | FIXED64 | ... /* couple dozens of keywords */
| IDENTIFIER ;
It just feels bad... besides, I would need two kinds of identifiers, one for when keywords are allowed as identifiers, and another one for when they aren't...
Is there an idiomatic way to solve this?
Is there an idiomatic way to solve this?
Other than the solution you have already found, no. Semi-reserved keywords are definitely not an expected use case for lex/yacc grammars.
The lemon parser generator has a fallback declaration designed for cases like this, but as far as I know, that useful feature has never been added to bison.
You can use a GLR grammar to avoid having to figure out all the different subsets of identifier. But of course there is a performance penalty.
You've already discovered the most common way of dealing with this in lex/yacc, and, while not pretty, it's not too bad. Normally you call your rule that matches an identifier or (set of) keywords whateverName, and you may have more than one of them, as different contexts may have different sets of keywords they can accept as a name.
Another way that may work if you have keywords that are only recognized as such in easily identifiable places (such as at the start of a line) is to use a lex start state so as to only return a KEYWORD token if the keyword is in that context. In any other context, the keyword will just be returned as an identifier token. You can even use yacc actions to set the lexer state for somewhat complex contexts, but then you need to be aware of the possible one-token lexer lookahead done by the parser (rules might not run until after the token after the action is already read).
This is a case where the keywords are not reserved. A few programming languages allowed this: PL/I, FORTRAN. It's not a lexer problem, because the lexer should always know which IDENTIFIERs are keywords. It's a parser problem. It usually causes too much ambiguity in the language specification and parsing becomes a nightmare. The grammar would have this:
identifier : keyword | IDENTIFIER ;
keyword : OPTIONAL | FIXED32 | FIXED64 | ... ;
If you have no conflicts in the grammar, then you are OK. If you have conflicts, then you need a more powerful parser generator, such as LR(k) or GLR.

What is a "primary expression"?

In the PowerShell Language Specification, the grammar has a term called "Primary Expression".
primary-expression:
value
member-access
element-access
invocation-expression
post-increment-expression
post-decrement-expression
Semantically, what is a primary expression intended to describe?
As I understand it there are two parts to this:
Formal grammars break up things like expressions so operator precedence is implicit.
Eg. if the grammar had
expression:
value
expression * expression
expression + expression
…
There would need to be a separate mechanism to define * as having higher precedence than +. This becomes significant when using tools to directly transform the grammar into a tokeniser/parser (see the stratified sketch at the end of this answer).[1]
There are so many different specific rules (consider the number of operators in PowerShell) that using fewer rules would be harder to understand because each rule would be so long.
[1] I suspect this is not the case with PowerShell because it is highly context sensitive (expression mode vs command mode, and then consider calling non-built-in executables). But such grammars across languages tend to have a lot in common, so style can also be carried over (don't re-invent the wheel).
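To make the first point concrete, here is the kind of stratified sketch referred to above (not taken from the PowerShell specification):
expression:
    additive-expression
additive-expression:
    multiplicative-expression
    additive-expression + multiplicative-expression
multiplicative-expression:
    primary-expression
    multiplicative-expression * primary-expression
primary-expression:
    value
    ( expression )
Here * binds tighter than + simply because a multiplicative-expression can appear inside an additive-expression but not the other way around (except through parentheses), so no separate precedence mechanism is needed.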
My understanding is that an expression is an arrangement of commands/arguments and/or operators/operands that, taken together, will execute and effect an action or produce a result (even if it's $null) that can be assigned to a variable or sent down the pipeline. The term "primary expression" is used to differentiate the whole expression from any sub-expressions $() that may be contained within it.
My very limited experience says that primary refers to expressions (or sub-expressions) that can't be parsed in more than one way, regardless of precedence or associativity. They have enough syntactic anchors that they are really non-negotiable: ID '(' ID ')', for example, can only match that. The parens make it guaranteed, and there are no operators to allow for any decision on precedence.
In matching an expression tree, it's common to have a set of these sub-expressions under a "primary_expr" rule. Any sub-expressions can be matched any way you like because they're all absolutely determined by syntax. Everything after that will have to be ordered carefully in the grammar to guarantee correct precedence and associativity.
There's some support for this interpretation here.
I think $( lots of statements ) would be like a complex expression, and in PowerShell most things can be an expression, for example:
$a = if ($y) { 1 } else { 2 }
So a primary expression is likely the lowest common denominator of classic expressions: a single thing that returns a value, whether an explicit value, calling something, getting a variable, a property from a variable such as $x.y, or the result of an increment operation such as $z++.
But even a simple math "expression" like 4 + 2 + (3 / 4) wouldn't match this; that would be more of a complex expression.
So I was thinking about the WHY behind it. At first it was suggested that it could be to help determine command/expression mode, but on investigation that wasn't it. Then I thought maybe it was what could be passed explicitly in command mode as an argument, but that's not it either, because if you pass $x++ as a parameter to a cmdlet you get a string.
I think it's probably just the lowest common denominator of expressions, a building block, so the next question is: what other expressions does the grammar contain, and where is this one used?
