Handling identifiers that begin with a reserved word - parsing

I am presently writing my own lexer and am wondering how to correctly handle the situation where an identifier begins with a reserved word. Presently the lexer matches the whole first part as a reserved word and then the rest separately, because at that point the reserved word is the longest match ('self' vs 's' in the example below).
For example with the rules:
RESERVED_WORD := self
IDENTIFIER_CHAR := [A-Z]|[a-z]
Applied to:
selfIdentifier
'self' is matched as RESERVED_WORD, and 'I' onwards is matched as IDENTIFIER_CHARs, when the whole string should be matched as a single run of IDENTIFIER_CHARs (one identifier)

The standard answer in most lexer generators is that the regex that matches the longest sequence of input wins. A tie between two regexes that match exactly the same length is broken by preferring the regex that appears first in your definitions file.
You can simulate this effect in your hand-written lexer; then "selfIdentifier" would be treated as an identifier, because the identifier rule matches more characters than the reserved word does (see the sketch below).
If you are writing an efficient lexer, you'll have a single finite state machine that branches from one state to another based on the current character class. In this case, you'll have several states that can be terminal states, and they are terminal states if the FSA cannot shift to another state. You can assign a token type to each such terminal state; the token type will be unique.
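Here is a minimal Python sketch of maximal munch with rule-order tie-breaking, using the two rules from the question (illustrative only; a real lexer would use a single FSM rather than trying each regex at every position):

import re

# Rules in priority order: on a tie in match length, the earlier wins.
RULES = [
    ('RESERVED_WORD', re.compile(r'self')),
    ('IDENTIFIER', re.compile(r'[A-Za-z]+')),
]

def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, rx in RULES:
            m = rx.match(text, pos)
            # strictly longer matches win; ties keep the earlier rule
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError(f'unexpected character at {pos}')
        out.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return out

print(tokenize('selfIdentifier'))  # [('IDENTIFIER', 'selfIdentifier')]
print(tokenize('self'))            # [('RESERVED_WORD', 'self')]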

Related

Does placement of tokens in the Lexer matter?

Is there any difference in either semantics or performance depending on where tokens appear in the lexer file? For example:
EQUAL : '='; // Equal, also var:=val which is unsupported
NAMED_ARGUMENT : ':='; // often used when calling custom Macro/Function
Vs.
NAMED_ARGUMENT : ':='; // often used when calling custom Macro/Function
EQUAL : '='; // Equal, also var:=val which is unsupported
In this example, the order won't matter. If the Lexer finds :=, then it will generate a NAMED_ARGUMENT token (because it is a longer sequence of characters than =).
The Lexer will prefer the rule that matches the longest sequence of input characters.
The only time order matters is if multiple Lexer rules match the same length sequence of characters; then the Lexer will generate a token for the first Lexer rule in the grammar (so, for example, be sure to put keywords before something like an ID rule, as it's quite likely that your keyword will also match the ID rule, but by occurring before ID, the keyword will be selected).
-- EDIT --
All that said... as @rici mentions in his comment, in this particular case, order is unimportant for an entirely different reason.
The Lexer attempts to match input at the beginning of the file (or at the character after the last recognized token).
I think of it like this: The Lexer chooses a character and then rules out all the Lexer rules that can't start with this character. Then it looks at the sequence of that character and the next character and rules out all the Lexer rules that can't begin with that sequence of characters. It does this repeatedly, until it has a sequence of characters that can't match any Lexer rule. At that point, we know that everything up to (but excluding) that character had to match one or more Lexer rules. If there's only one rule, then that's the generated token. If there are multiple Lexer rules that matched, then the first one is selected.
In your case, the ':' would have immediately ruled out a match of the EQUAL rule (it can't begin with a ':'), but would still leave open the possibility that it might match the NAMED_ARGUMENT rule. If the next character is a '=', then it knows that it could match the NAMED_ARGUMENT rule (but maybe you have other rules that could start with ":="), so it looks at the next character (we'll guess it's a space). ":= " does not match the NAMED_ARGUMENT rule and, for this example, doesn't match ANY rules. Now it backs up and says ":=" matched the NAMED_ARGUMENT rule (and no others), so it creates a NAMED_ARGUMENT token, and starts the whole process again with the space that it couldn't match.
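A toy Python illustration of that narrowing (rule names from the question above; this is a sketch of the principle, not ANTLR's implementation):

import re

# Try every rule at the current position; the longest match wins, and
# rule order breaks ties, so ':=' beats '=' regardless of ordering.
RULES = [('NAMED_ARGUMENT', re.compile(r':=')),
         ('EQUAL', re.compile(r'='))]

def next_token(text, pos):
    best = None
    for name, rx in RULES:
        m = rx.match(text, pos)
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    return best

print(next_token('x := 1', 2))   # ('NAMED_ARGUMENT', 4)
print(next_token('x = 1', 2))    # ('EQUAL', 3)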

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to stay as faithful to as I can), there are situations where spaces cannot be inserted, for example "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace not being important everywhere else? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A validation function would simply check that the starting position of each token (available as p.lexpos(i), where p is the action function's parameter and i is the index of the token in the production's RHS) is precisely the starting position of the previous token plus the length of the previous token, as in the sketch below.
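A rough Ply sketch of that check (the production and token names are hypothetical; this assumes every RHS symbol is a plain token whose value is exactly the matched text, so len() gives its length):

def rhs_is_adjacent(p):
    # True iff each token starts exactly where the previous one ended.
    for i in range(2, len(p)):
        if p.lexpos(i) != p.lexpos(i - 1) + len(p[i - 1]):
            return False
    return True

def p_selector(p):
    "selector : IDENT DOT IDENT"
    if not rhs_is_adjacent(p):
        raise SyntaxError("no whitespace allowed in a member selector")
    p[0] = ('member', p[1], p[3])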
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a bit of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
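In Ply, the single-token approach could be sketched like this (the token name SELECTOR is illustrative and must appear in your tokens list; define this function before any plain '.' rule, since Ply tries function rules in definition order):

def t_SELECTOR(t):
    r'\.[a-zA-Z_][a-zA-Z0-9_]*'
    t.value = t.value[1:]   # keep just the member name, drop the '.'
    return t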
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from its use in a function call (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
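As a sketch (the token names DOT and DOT_TIGHT are hypothetical, and both must be declared in tokens), a Ply rule using exactly those tests might look like:

def t_DOT(t):
    r'\.'
    data = t.lexer.lexdata
    before_ws = t.lexpos == 0 or data[t.lexpos - 1].isspace()
    after_ws = (t.lexer.lexpos == len(data)
                or data[t.lexer.lexpos].isspace())
    # distinguish 'obj.field' from 'obj . field' and friends
    t.type = 'DOT' if (before_ws or after_ws) else 'DOT_TIGHT'
    return t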
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each such token, representing all of its whitespace-context variants; then you only need to use the specific variants in productions where the distinction matters.
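For example (continuing with the hypothetical DOT/DOT_TIGHT split), a cover non-terminal lets productions that don't care about spacing stay clean:

def p_dot(p):
    """dot : DOT
           | DOT_TIGHT"""
    p[0] = p[1]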

What is a valid character in an identifier called?

Identifiers typically consist of underscores, digits, and uppercase and lowercase letters, where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit or is_alnum. If one were to implement such a function to scan a character used in an identifier, what would it be called? Clearly, is_identifier is wrong, as that would be the entire token that the lexer scans and not the individual character. I suppose is_alnum_or_underscore would be accurate though quite verbose. For something as common as this, I feel like there should be a single word for it.
Unicode Annex 31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the definition of the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C++23. So I guess it's pretty well mainstream.
UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.
That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):
<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
A programming language which claims conformance with UAX31 does not need to accept the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)
That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories: identifier-start and identifier-continue, where the first one is a subset of the second one.
You'll see that in a number of grammar documents:
Python:
identifier ::= xid_start xid_continue*
Rust:
IDENTIFIER_OR_KEYWORD : XID_Start XID_Continue*
                      | _ XID_Continue+
C++:
identifier:
    identifier-start
    identifier identifier-continue
So that's what I'd suggest. But there are many other possibilities:
Swift: calls the sets identifier-head and identifier-characters.
Java: calls them JavaLetter and JavaLetterOrDigit.
C: defines identifier-nondigit and identifier-digit; Continue would be the union of the two sets.
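In code, the two-category scheme might look like this ASCII-only sketch (the real XID_Start/XID_Continue sets are far larger; the third-party regex module exposes them as \p{XID_Start} and \p{XID_Continue}):

def is_identifier_start(ch: str) -> bool:
    # letters and underscore; extend with XID_Start for full Unicode
    return ch.isalpha() or ch == '_'

def is_identifier_continue(ch: str) -> bool:
    # everything that can start an identifier, plus digits
    return is_identifier_start(ch) or ch.isdigit()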

context sensitive tokenization of code

I am working on a parser for a language that has
identifiers (say, a letter followed by a number of alphanumeric characters or an underscore),
integers (any number of digits and possibly carets ^),
some operators,
filename (a number of alphanumeric characters and possibly slashes, and dots)
Apparently filename overlaps integers and identifiers, so in general I cannot decide if I have a filename or, say, an identifier unless the filename contains a slash or a dot.
But filename can only follow a specific operator.
My question is how this situation is usually handled during tokenization? I have a table driven tokenizer (lexer), but I am not sure how to tell a filename from either an integer or an identifier. How is this done?
If filename was a superset of integers and identifiers then I probably could have grammar productions that could handle that, but the tokens overlap...
Flex and other lexers have the concept of start conditions. Essentially the lexer is a state machine and its exact behaviour will depend on its current state.
In your example, when your lexer encounters the operator preceding a filename it should switch to a FilenameMode state (or whatever) and then switch back once it has produced the filename token it was expecting.
EDIT:
Just to give some concrete code this side of the hyperlink:
You would trigger your FILENAME_MODE when you encounter the operator...
{FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }
You would define your rule to parse a filename:
<FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }
...switching back to the INITIAL state in the action.
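If you're using Ply rather than flex, lexer states give you the same mechanism. A rough sketch, where LOAD is a hypothetical operator that introduces a filename:

import ply.lex as lex

tokens = ('LOAD', 'FILENAME', 'IDENT')

# 'fname' is an exclusive state: while it is active, only rules
# prefixed with t_fname_ can match.
states = (('fname', 'exclusive'),)

def t_LOAD(t):
    r'load\b'                  # hypothetical filename-prefix operator
    t.lexer.begin('fname')     # switch to filename mode
    return t

def t_fname_FILENAME(t):
    r'[A-Za-z0-9_./]+'
    t.lexer.begin('INITIAL')   # back to normal tokenization
    return t

def t_IDENT(t):
    r'[A-Za-z][A-Za-z0-9_]*'
    return t

t_ignore = ' \t'
t_fname_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

def t_fname_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('load src/main.c x1')
for tok in lexer:
    print(tok.type, tok.value)
# LOAD load / FILENAME src/main.c / IDENT x1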

How do you write a lexer parser where identifiers may begin with keywords?

Suppose you have a language where identifiers might begin with keywords. For example, suppose "case" is a keyword, but "caser" is a valid identifier. Suppose also that the lexer rules can only handle regular expressions. Then it seems that I can't place keyword rules ahead of the identifier rule in the lexer, because this would parse "caser" as "case" followed by "r". I also can't place keyword lexing rules after the identifier rule, since the identifier rule would match the keywords, and the keyword rules would never trigger.
So, instead, I could make a keyword_or_identifier rule in the lexer, and have the parser determine if a keyword_or_identifier is a keyword or an identifier. Is this what is normally done?
I realize that "use a different lexer that has lookahead" is an answer (kind of), but I'm also interested in how this is done in a traditional DFA-based lexer, since my current lexer seems to work that way.
Most lexers, starting with the original lex, match alternatives as follows:
Use the longest match.
If there are two or more alternatives which tie for the longest match, use the first one in the lexer definition.
This allows the following style:
"case" { return CASE; }
[[:alpha:]][[:alnum:]]* { return ID; }
If the input pattern is caser, then the second alternative will be used because it's the longest match. If the input pattern is case r, then the first alternative will be used because both of them match case, and by rule (2) above, the first one wins.
Although this may seem a bit arbitrary, it's consistent with the DFA approach, mostly. First of all, a DFA doesn't stop the first time it reaches an accepting state. If it did, then patterns like [[:alpha:]][[:alnum:]]* would be useless, because they enter an accepting state on the first character (assuming it's alphabetic). Instead, DFA-based lexers continue until there are no possible transitions from the current state, and then they back up until the last accepting state. (See below.)
A given DFA state may be accepting because of two different rules, but that's not a problem either; only the first accepting rule is recorded.
To be fair, this is slightly different from the mathematical model of a DFA, which has a transition for every symbol in every state (although many of them may be transitions to a "sink" state), and which matches a complete input depending on whether or not the automaton is in an accepting state when the last symbol of the input is read. The lexer model is slightly different, but can easily be formalized as well.
The only difficulty in the theoretical model is "back up to the last accepting state". In practice, this is generally done by remembering the state and input position every time an accepting state is reached. This does mean that it may be necessary to rewind the input stream, possibly by an arbitrary amount.
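In code, the backup loop of such a tokenizer might look like this sketch (names are illustrative: edges maps a (state, character) pair to the next state, and accepting maps a state to the first rule that made it accepting):

def longest_match(edges, accepting, start, text, pos):
    state, i = start, pos
    last = None                       # (end, rule) at the last accept
    while i < len(text):
        nxt = edges.get((state, text[i]))
        if nxt is None:
            break                     # no transition: stop scanning
        state = nxt
        i += 1
        if state in accepting:
            last = (i, accepting[state])
    # back up to the last accepting state; None means no token here
    return last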
Most languages do not require backup very often, and very few require indefinite backup. Some lexer generators can generate faster code if there are no backup states. (flex will do this if you use -Cf or -CF.)
One common case which leads to indefinite backup is failing to provide an appropriate error return for string literals:
["][^"\n]*["] { return STRING; }
/* ... */
. { return INVALID; }
Here, the first pattern will match a string literal starting with " if there is a matching " on the same line. (I omitted \-escapes for simplicity.) If the string literal is unterminated, the last pattern will match, but the input will need to be rewound to the ". In most cases, it's pointless trying to continue lexical analysis by ignoring an unmatched "; it would make more sense to just ignore the entire remainder of the line. So not only is backing up inefficient, it is also likely to lead to an explosion of false error messages. A better solution might be:
["][^"\n]*["] { return STRING; }
["][^"\n]* { return INVALID_STRING; }
Here, the second alternative can only succeed if the string is unterminated, because if the string is terminated, the first alternative will match one more character. Consequently, it doesn't even matter in which order these alternatives appear, although everyone I know would put them in the same order I did.