How do you deal with keywords in Lex?

Suppose you have a language which allows a production like this: optional optional = 42, where the first "optional" is a keyword and the second "optional" is an identifier.
On one hand, I'd like to have a Lex rule like optional { return OPTIONAL; }, which would later be used in YACC like this, for example:
optional : OPTIONAL identifier '=' expression ;
If I then define identifier as, say:
identifier : OPTIONAL | FIXED32 | FIXED64 | ... /* couple dozens of keywords */
| IDENTIFIER ;
It just feels bad... besides, I would need two kinds of identifiers, one for when keywords are allowed as identifiers, and another one for when they aren't...
Is there an idiomatic way to solve this?

Is there an idiomatic way to solve this?
Other than the solution you have already found, no. Semi-reserved keywords are definitely not an expected use case for lex/yacc grammars.
The lemon parser generator has a fallback declaration designed for cases like this, but as far as I know, that useful feature has never been added to bison.
You can use a GLR grammar to avoid having to figure out all the different subsets of identifier. But of course there is a performance penalty.

You've already discovered the most common way of dealing with this in lex/yacc, and, while not pretty, it's not too bad. Normally you call the rule that matches an identifier or a set of keywords whateverName, and you may have more than one of them, since different contexts may accept different sets of keywords as a name.
Another way that may work, if you have keywords that are only recognized as such in easily identifiable places (such as at the start of a line), is to use a lex start state so as to only return a KEYWORD token if the keyword is in that context. In any other context, the keyword will just be returned as an identifier token. You can even use yacc actions to set the lexer state for somewhat complex contexts, but then you need to be aware of the possible one-token lookahead done by the parser (an action might not run until the token following it has already been read).
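The start-state idea can be sketched outside of lex as well. The following is a minimal Python sketch, not lex code: the flag at_statement_start stands in for a lex start condition, and the keyword set, the ";" statement terminator, and the token names are assumptions made for illustration.

```python
import re

# Hypothetical keyword set; a real language would list all of its keywords.
KEYWORDS = {"optional", "fixed32", "fixed64"}

TOKEN_RE = re.compile(
    r"\s*(?:(?P<word>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<punct>[=;]))"
)

def tokenize(src):
    """Return (kind, text) pairs; keywords are only recognized at statement start."""
    tokens = []
    at_statement_start = True  # plays the role of a lex start condition
    src = src.rstrip()
    pos = 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.group("word"):
            word = m.group("word")
            if at_statement_start and word in KEYWORDS:
                tokens.append((word.upper(), word))
            else:
                tokens.append(("IDENTIFIER", word))
            at_statement_start = False
        elif m.group("num"):
            tokens.append(("NUMBER", m.group("num")))
            at_statement_start = False
        else:
            punct = m.group("punct")
            tokens.append((punct, punct))
            # Assumption: ';' ends a statement, so keywords are live again.
            at_statement_start = punct == ";"
    return tokens
```

With this sketch, tokenize("optional optional = 42;") yields an OPTIONAL token followed by an IDENTIFIER token for the second word, so the parser never sees a keyword where a name is expected.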

This is a case where the keywords are not reserved. A few programming languages allowed this: PL/I, FORTRAN. It's not a lexer problem, because the lexer should always know which IDENTIFIERs are keywords. It's a parser problem. It usually causes too much ambiguity in the language specification and parsing becomes a nightmare. The grammar would have this:
identifier : keyword | IDENTIFIER ;
keyword : OPTIONAL | FIXED32 | FIXED64 | ... ;
If you have no conflicts in the grammar, then you are OK. If you have conflicts, then you need a more powerful parser generator, such as LR(k) or GLR.
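To make the effect of the identifier : keyword | IDENTIFIER rule concrete, here is a small hand-written Python sketch of the same idea. It is not yacc output; the token representation, the function names, and the use of NUMBER in place of a full expression are assumptions for illustration.

```python
# The lexer is assumed to always tag keywords as keywords.
KEYWORDS = {"OPTIONAL", "FIXED32", "FIXED64"}

def parse_identifier(tokens, pos):
    """identifier : keyword | IDENTIFIER"""
    kind, text = tokens[pos]
    if kind == "IDENTIFIER" or kind in KEYWORDS:
        return text, pos + 1
    raise SyntaxError(f"expected a name, got {kind}")

def expect(tokens, pos, kind):
    """Consume one token of the given kind and return its text."""
    if pos >= len(tokens) or tokens[pos][0] != kind:
        raise SyntaxError(f"expected {kind} at token {pos}")
    return tokens[pos][1], pos + 1

def parse_optional_field(tokens):
    """optional : OPTIONAL identifier '=' NUMBER  (NUMBER stands in for expression)"""
    _, pos = expect(tokens, 0, "OPTIONAL")
    name, pos = parse_identifier(tokens, pos)
    _, pos = expect(tokens, pos, "=")
    value, pos = expect(tokens, pos, "NUMBER")
    return name, int(value)
```

Here the decision is made entirely in the parser: the second OPTIONAL token is accepted as a name because it appears in a position where only a name can follow.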

Related

Advantages and disadvantages of creating named tokens for single characters

Let's say I have a simple syntax where you can assign a number to an identifier using = sign.
I can write the parser in two ways.
I can include the character token = directly in the rule or I can create a named token for it and use the lexer to recognize it.
Option #1:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
// parser
%token IDENTIFIER NUMBER
%%
assignment : IDENTIFIER '=' NUMBER ;
Option #2:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
= { return EQUAL_SIGN; }
// parser
%token IDENTIFIER NUMBER EQUAL_SIGN
%%
assignment : IDENTIFIER EQUAL_SIGN NUMBER ;
Both ways of writing the parser work, and I cannot find any information about good practices for this situation.
The first snippet seems to be more readable, but this is not my highest concern.
Is either of these two options faster or beneficial in some other way? Are there technical reasons (other than readability) to prefer one over the other?
Is there maybe a third, better way?
I'm asking about problems concerning huge parsers, where optimization may be a real issue, not just toy examples like the one shown here.
Aside from readability, it basically makes no difference. There really is no optimisation issue, no matter how big your grammar is. Once the grammar has been compiled, tokens are small integers, and one small integer is pretty well the same as any other small integer.
But I wouldn't underrate the importance of readability. For one thing, it's harder to see bugs in unreadable code. It's surprisingly common for a grammar bug to be the result of simply typing the wrong name for some punctuation character. It's much easier to see that '{' expr '{' is wrong than if it were written T_LBRC expr T_LBRC, and furthermore the named symbols are much harder to interpret for someone whose first language isn't English.
Bison's parse table compression requires token numbers to be consecutive integers, which is done by passing incoming token codes through a lookup table, hardly a major overhead. Not using character codes doesn't avoid this lookup, though, because the token numbers 1 through 255 are reserved anyway.
However, Bison's C++ API with named token constructors requires token names, and single-character token codes cannot be used as token names (although they're not forbidden, since you're not required to use named constructors).
Given that use case, Bison recently introduced an option which generates consecutively numbered token codes in order to avoid the recoding; this option is not compatible with single-character tokens being represented as themselves. Not having to recode the token may be a marginal speed-up, though it's hard to believe that it's significant. Still, if you're not going to use single-quoted tokens, you might as well go for it.
Personally, I don't think the extra complexity is justified, at least for the C API. If you do choose to go with token names, perhaps in order to use the C++ API's named constructors, I'd strongly recommend using Bison aliases in order to write your grammar with double-quoted tokens (also recommended for multi-character operator and keyword tokens).
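The recoding step mentioned in this answer amounts to a single table lookup per token. The following Python sketch only illustrates that idea; it does not reproduce Bison's actual tables, and the token codes and internal numbering are assumptions chosen for the example.

```python
# External token codes, as a yylex-style function would return them.
# Single-character tokens use their character code; named tokens get
# codes above 255. The specific numbers here are illustrative assumptions.
IDENTIFIER = 258
NUMBER = 259
END_OF_INPUT = 0

# Internal symbol numbers are dense: 0 for end of input, 1 for
# "unknown token", then one slot per token the grammar actually uses.
UNKNOWN = 1
USED_TOKENS = [ord("="), IDENTIFIER, NUMBER]

TRANSLATE = [UNKNOWN] * (max(USED_TOKENS) + 1)
TRANSLATE[END_OF_INPUT] = 0
for internal, external in enumerate(USED_TOKENS, start=2):
    TRANSLATE[external] = internal

def to_internal(external_code):
    """Map a lexer token code to a dense internal symbol number."""
    if 0 <= external_code < len(TRANSLATE):
        return TRANSLATE[external_code]
    return UNKNOWN
```

Whether '=' arrives as a character code or as a named EQUAL_SIGN token, it goes through the same lookup, which is why the choice has no measurable effect on speed.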

Handling unambiguous yet breaking syntax in expression parsing

Context
I've recently come across an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component in a compiler I'm building and the question is in regards to the expression parsing necessary in programming language parsing.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using ordinary grammar rules and have eliminated left recursion in all of them, but there is one syntactic "ambiguity" which my parser simply can't handle, and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now because generic expressions have higher precedence as they are language constructs and not mathematical expressions, it causes a scenario where the generic parser will attempt to parse expressions when it shouldn't.
For example in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that and fail as it can't find the closing tag.
What is the solution to such a scenario? This matters especially in languages like C++, where generics can also have expressions in them; if I'm not mistaken, arr<1<2> might be legal syntax.
Is this a special edge case, or does it require a modification to the syntax definition that I'm not aware of?
Thank you
For example in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that and fail as it can't find the closing tag.
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
However that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C# for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparisons, as it makes no sense to compare the result of a comparison with something else (in many languages using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: either a is a generic function called with the generic argument b<c, or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve that ambiguity with syntactic rules alone.
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So in summary supporting expressions as generic arguments requires you to keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context sensitive) or generate multiple possible ASTs. Without expressions you can keep it context free and unambiguous, but will require backtracking or arbitrary lookahead (meaning it's LL(*)).
Since neither of those is ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().
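For the simpler case where generic arguments are plain type names, the backtracking variant described earlier in this answer might look roughly like the following Python sketch. The class and method names are made up for illustration, the grammar is reduced to identifiers, numbers, "<", ">" and ",", and it does not attempt to handle the genuinely ambiguous cases such as x<y>z.

```python
import re

def tokenize(src):
    return re.findall(r"[A-Za-z_]\w*|\d+|[<>,]", src)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def try_generic_args(self):
        """Try '<' name (',' name)* '>'. On failure, rewind and return None."""
        start = self.pos
        if self.peek() != "<":
            return None
        self.advance()
        args = []
        while True:
            tok = self.peek()
            if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
                self.pos = start  # backtrack: this was not a generic
                return None
            args.append(self.advance())
            if self.peek() != ",":
                break
            self.advance()
        if self.peek() != ">":
            self.pos = start  # backtrack: no closing '>'
            return None
        self.advance()
        return args

    def primary(self):
        tok = self.advance()
        if tok is None or tok in ("<", ">", ","):
            raise SyntaxError("expected identifier or number")
        return tok

    def generic(self):
        node = self.primary()
        args = self.try_generic_args()
        return ("generic", node, args) if args is not None else node

    def comparison(self):
        left = self.generic()
        while self.peek() in ("<", ">"):
            op = self.advance()
            left = (op, left, self.generic())
        return left

def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.comparison()
    if parser.peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return tree
```

In this sketch, a<2 fails the generic attempt at the token 2, the position is restored, and the input is then parsed as a comparison, while a<T, U> succeeds as a generic.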

reduce/reduce conflicts using ocamlyacc

I am struggling with a grammar that involves typed expressions as well as variable access. The result type of this access is not ascertainable during parsing and is evaluated in a second step. This evaluation is not a problem, but it seems hard to write unambiguous parser rules.
All operations that work on different types (e.g. comparison operators) produce a reduce/reduce conflict. Clearly, this is because the parser cannot decide whether "x.a = y.b" should be parsed as "bool_expr EQUAL bool_expr" or as "num_expr EQUAL num_expr", since the type is uncertain. However, the result type of the comp_op rule is certain (it is always a boolean value).
Is there any solution to this problem without throwing all type information away during parsing and always checking it in the evaluation phase?
Here is a shortened grammar example (using ocamllex and ocamlyacc):
comp_op:
| bool_expr EQUAL bool_expr { T.Equiv (T.Wrapper $1, T.Wrapper $3) }
| num_expr EQUAL num_expr { T.Equiv (T.Wrapper $1, T.Wrapper $3) }
/* the wrapper type basically just wraps the expressions to get a common type */
bool_expr:
| TRUE { T.Bool true }
| FALSE { T.Bool false }
| access { T.BoolAccess $1 }
num_expr:
| NUMBER { T.Num $1 }
| access { T.NumAccess $1 }
access:
/* some more complex rules describing the access to variables */
Thanks for your help.
As ygrek says, you should not try to mix parsing and typing. It's much easier to write your parser with only one syntactic category for expressions, and then to have a separate type-checking pass that will sort that out.
Theoretically, this comes from the fact that the distinctions made by typing rules are much finer than what traditional parsing technologies can express. There have been attempts at specifying typing rules more declaratively using, for example, attribute grammars, but your usual LL/LR technology is certainly not a good fit; it's like parsing nested parentheses with a regular expression.
Finally, you should use menhir instead of ocamlyacc, because it's just better. You will have more readable and expressive grammars (named parameters, parametrized rules...), better error reporting and grammar debugging features.
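The single-category approach recommended in this answer can be sketched as a parser that builds untyped expression nodes, followed by a separate checking pass. The sketch below is in Python rather than OCaml, and the tuple-based node shapes and the env mapping are assumptions for illustration, not the asker's T module.

```python
def type_of(expr, env):
    """Return "bool" or "num" for an expression tree, or raise TypeError.

    expr is a tuple: ("bool", value), ("num", value),
    ("access", path) or ("equal", left, right).
    env maps access paths to their types and is filled in after parsing.
    """
    kind = expr[0]
    if kind == "bool":
        return "bool"
    if kind == "num":
        return "num"
    if kind == "access":
        path = expr[1]
        if path not in env:
            raise TypeError(f"unknown variable {path}")
        return env[path]
    if kind == "equal":
        left = type_of(expr[1], env)
        right = type_of(expr[2], env)
        if left != right:
            raise TypeError(f"cannot compare {left} with {right}")
        return "bool"  # a comparison always yields a boolean
    raise ValueError(f"unknown node kind {kind}")
```

The grammar then needs only one rule of the form expr EQUAL expr, so the reduce/reduce conflict disappears, and x.a = y.b is accepted or rejected once the types of x.a and y.b are known.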
As already said, you will have a hard time writing a "type-correct parser" -- depending on your language this might even be impossible.
Anyway, the problem here is that your grammar does not know the type of the "access" production; as far as I understand, this production represents reading from variables, whose type is unknown at parse time.
The way I see it, you either abandon the 100% type-correct parsing OR you find a way of "magically" knowing the type of your variables.
You could keep track of type declarations and let the lexer look up the type of a variable it encounters; the lexer then would send a variable-identifier-token based on the type of the variable.
I'm not sure if this approach works as I don't know what your language looks like.
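If declarations always precede uses, the lexer-lookup idea from this answer could be sketched as follows. This is a hypothetical Python illustration; the token names BOOL_VAR and NUM_VAR and the declared_types table are invented for the example and are not part of ocamllex.

```python
def classify_identifier(name, declared_types):
    """Pick a token kind for an identifier based on its declared type.

    declared_types is a dict filled in while earlier declarations are
    processed; names that are not in it fall back to a generic token.
    """
    kind = declared_types.get(name)
    if kind == "bool":
        return ("BOOL_VAR", name)
    if kind == "num":
        return ("NUM_VAR", name)
    return ("IDENTIFIER", name)
```

The grammar could then use BOOL_VAR in bool_expr and NUM_VAR in num_expr, at the cost of coupling the lexer to the symbol table.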

Is this the job of the lexer?

Let's say I was lexing a ruby method definition:
def print_greeting(greeting = "hi")
end
Is it the lexer's job to maintain state and emit relevant tokens, or should it be relatively dumb? Notice in the above example the greeting param has a default value of "hi". In a different context, greeting = "hi" is variable assignment which sets greeting to "hi". Should the lexer emit generic tokens such as IDENTIFIER EQUALS STRING, or should it be context-aware and emit something like PARAM_NAME EQUALS STRING?
I tend to make the lexer as stupid as I possibly can and would thus have it emit the IDENTIFIER EQUALS STRING tokens. At lexical analysis time there is (most of the time) no information available about what the tokens should represent. Having grammar rules like this in the lexer only pollutes it with (very) complex syntax rules, and that is the parser's job.
I think that lexer should be "dumb" and in your case should return something like this: DEF IDENTIFIER OPEN_PARENTHESIS IDENTIFIER EQUALS STRING CLOSE_PARENTHESIS END.
The parser should do the validation; why split responsibilities?
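A "dumb" lexer of this kind can be sketched in a few lines of Python. The token names follow the ones used in this answer; the regular expressions and the RESERVED table are simplifying assumptions and cover only the constructs in the example, not Ruby as a whole.

```python
import re

TOKEN_SPEC = [
    ("STRING", r'"[^"\n]*"'),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPEN_PARENTHESIS", r"\("),
    ("CLOSE_PARENTHESIS", r"\)"),
    ("EQUALS", r"="),
    ("SKIP", r"\s+"),
]
# Reserved words are recognized by spelling alone, with no context.
RESERVED = {"def": "DEF", "end": "END"}

MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(src):
    """Return the list of token kinds, ignoring whitespace."""
    kinds = []
    pos = 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "IDENTIFIER":
            kind = RESERVED.get(m.group(), kind)
        kinds.append(kind)
    return kinds
```

Both occurrences of greeting come out as IDENTIFIER; whether one of them is a parameter name with a default value is left for the parser to decide.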
I don't work with Ruby, but I do work with compiler and programming language design.
Both approaches work, but in real life, using generic identifiers for variables, parameters and reserved words is easier (the "dumb lexer" or "dumb scanner" approach).
Later, you can "cast" those generic identifiers into other tokens. Sometimes this happens in your parser.
Sometimes the lexer / scanner, rather than the parser, has a code section that can perform several "semantic" operations, including casting a generic identifier into a keyword, variable, type identifier, or whatever. The lexer rule detects a generic identifier token but returns a different token to the parser.
Another similar, common case is when you have an expression or language that uses "+" and "-" both as binary operators and as unary sign operators.
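The "+" and "-" case can be illustrated with a small Python sketch in which the lexer emits the same token for both uses and the parser decides from position whether it is unary or binary. The function names and the restriction to integers, "+", "-" and parentheses are assumptions made to keep the example short.

```python
import re

def tokenize(src):
    # The lexer does not distinguish unary from binary '-'.
    return re.findall(r"\d+|[-+()]", src)

def evaluate(src):
    tokens = tokenize(src)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def unary():
        nonlocal pos
        tok = peek()
        if tok in ("+", "-"):  # sign operator: seen where an operand is expected
            pos += 1
            value = unary()
            return -value if tok == "-" else value
        if tok == "(":
            pos += 1
            value = additive()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError("expected a number")
        pos += 1
        return int(tok)

    def additive():
        nonlocal pos
        value = unary()
        while peek() in ("+", "-"):  # binary operator: seen after an operand
            op = tokens[pos]
            pos += 1
            rhs = unary()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = additive()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return result
```

For the input 3 - -2 the first "-" is consumed in additive as a binary operator and the second in unary as a sign, even though the lexer produced identical tokens for both.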
The distinction between lexical analysis and parsing is an arbitrary one. In many cases you wouldn't want a separate step at all. That said, since performance is usually the most important issue (otherwise parsing would be a mostly trivial task), you need to decide, and probably measure, whether additional processing during lexical analysis is justified or not. There is no general answer.

ANTLR grammar: parser- and lexer literals

What's the difference between this grammar:
...
if_statement : 'if' condition 'then' statement 'else' statement 'end_if';
...
and this:
...
if_statement : IF condition THEN statement ELSE statement END_IF;
...
IF : 'if';
THEN: 'then';
ELSE: 'else';
END_IF: 'end_if';
....
?
If there is any difference, does it have an impact on performance?
Thanks
In addition to Will's answer, it's best to define your lexer tokens explicitly (in your lexer grammar). In case you're mixing them in your parser grammar, it's not always clear in what order the tokens are tokenized by the lexer. When defining them explicitly, they're always tokenized in the order they've been put in the lexer grammar (from top to bottom).
The biggest difference is one that may not matter to you. If your lexer rules are in the lexer then you can use inheritance to have multiple lexers share a common set of lexical rules.
If you just use strings in your parser rules then you cannot do this. If you never plan to reuse your lexer grammar then this advantage doesn't matter.
Additionally I, and I'm guessing most ANTLR veterans, are more accustomed to finding the lexer rules in the actual lexer grammar rather than mixed in with the parser grammar, so one could argue that readability is increased by putting the rules in the lexer.
There is no runtime performance impact after the Antlr parser has been built to either approach.
The only difference is that in your first production rule, the keyword tokens are defined implicitly. There is no run-time performance implication for tokens defined implicitly vs. explicitly.
Yet another difference: when you explicitly define your lexer rules you can access them via the name you gave them (e.g. when you need to check for a specific token type). Otherwise ANTLR will use arbitrary numbers (with a prefix).
