context sensitive tokenization of code

context sensitive tokenization of code - parsing

I am working on a parser for a language that has
identifiers (say, a letter followed by a number of alphanumeric characters or an underscore),
integers (any number of digits and possibly carets ^),
some operators,
filename (a number of alphanumeric characters and possibly slashes, and dots)
Apparently filename overlaps integers and identifiers, so in general I cannot decide if I have a filename or, say, an identifier unless the filename contains a slash or a dot.
But filename can only follow a specific operator.
My question is how this situation is usually handled during tokenization? I have a table driven tokenizer (lexer), but I am not sure how to tell a filename from either an integer or an identifier. How is this done?
If filename was a superset of integers and identifiers then I probably could have grammar productions that could handle that, but the tokens overlap...

Flex and other lexers have the concept of start conditions. Essentially the lexer is a state machine and its exact behaviour will depend on its current state.
In your example, when your lexer encounters the operator preceding a filename it should switch to a FilenameMode state (or whatever) and then switch back once it has produced the filename token it was expecting.
EDIT:
Just to give some concrete code this side of the hyperlink:
You would trigger your FILENAME_MODE when you encounter the operator...
{FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }
You would define your rule to parse a filename:
<FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }
...switching back to the INITIAL state in the action.

Related

Does placement of tokens in the Lexer matter?

Is there any difference in either semantics or performance in where tokens are included in the `lexer file? For example:
EQUAL : '=' // Equal, also var:=val which is unsupported
NAMED_ARGUMENT : ':='; // often used when calling custom Macro/Function
Vs.
NAMED_ARGUMENT : ':='; // often used when calling custom Macro/Function
EQUAL : '=' // Equal, also var:=val which is unsupported

In this example, the order won’t matter. If the Lexer finds :=, then it will generate a NAMED_EQUAL token (because it is a longer sequence of characters than =).
The Lexer will prefer the rule that matches the longest sequence of input characters.
The only time order matters is if multiple Lexer rules match the same length sequence of characters, and then the Lexer will generate a token for the first Lexer rule in the grammar (so, for example, be sure to put keywords before something like an ID rule, as it’s quite likely that your keyword will also match the ID rule, but by occurring before ID, the keyword will be selected.
-- EDIT --
All that said... as #rici mentions in his comment, in this particular case, order is unimportant for an entirely different reason.
The Lexer attempts to match input at the beginning of the file (or at the character after the last recognized token.
I think of it like this: The Lexer chooses a character and then rules out all the Lexer rules that can't start with this character. Then is looks at the sequence of that character and the next character and rules out all the Lexer rules that can't begin with that sequence of characters. If does this repeatedly, until it has a sequence of characters that can't match any Lexer rule. At that point, we know that everything up to (but excluding) that character had to match one or more Lexer rules. If there's only one rule then that's the generated token. IF there are multiple Lexer rules that matched, then the first one is selected.
In your case the ':' would have immediately ruled out a match of the EQUAL token (it can't begin with a ':'), but will still leave open the possibility that it might match the NAMED_EQUAL token. If the next character is a '=' then it knows that it could match the NAMED_EQUAL rule (but maybe you have other rules that could start with ":=", so it looks at the next character (we'll guess it's a space). ":+ "does not match theNAMED_EQUALrule, and for this example, doesn't match ANY rules. Now it backs up and says ":+" matched theNAMED_EQUALSrule (and no others), so it creates aNAMED_EQUALS` token, and starts the whole process again starting with the space that it's couldn't match.

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to keep this as accurate to as I can), there are situations where spaces cannot be inserted, for example "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace not being important everywhere else? Or am I better off building these items in the lexer?

Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would simply check that the starting position of each token (available as p.lexpos(i) where p is the action function's parameter and i is the index of the token the the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a big of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call. (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.

What is a valid character in an identifier called?

Identifiers typically consist of underscores, digits; and uppercase and lowercase characters where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit or is_alnum. If one were to implement such a function to scan a character used in an identifier, what would it be called? Clearly, is_identifier is wrong as that would be the entire token that the lexer scans and not the individual character. I suppose is_alnum_or_underscore would be accurate though quite verbose. For something as common as this, I feel like there should be a single word for it.

Unicode Annex 31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the definition of the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C++23. So I guess it's pretty well mainstream.
UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.
That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):
<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
A programming language which claims conformance with UAX31 does not need to accept the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)
That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories: identifier-start and identifier-continue, where the first one is a subset of the second one.
You'll see that in a number of grammar documents:
Pythonidentifier ::= xid_start xid_continue*
RustIDENTIFIER_OR_KEYWORD : XID_Start XID_Continue*
| _ XID_Continue+
C++identifier:
identifier-start
identifier identifier-continue
So that's what I'd suggest. But there are many other possibilities:
SwiftCalls the sets identifier-head and identifier-characters
JavaCalls them JavaLetter and JavaLetterOrDigit
CDefines identifier-nondigit and identifier-digit; Continue would be the union of the two sets.

Handling identifiers that begin with a reserved word

I am presently writing my own lexer and am wondering how to correctly handle the situation where an identifier begins with a reserved word. Presently the the lexer matches the whole first part as a reserved word and then the rest separately because the reserved word is the longest match ('self' vs 's' in the example below).
For example with the rules:
RESERVED_WORD := self
IDENTIFIER_CHAR := [A-Z]|[a-z]
Applied to:
selfIdentifier
'self' is matched as RESERVED_WORD and 'I' and onwards is matched as IDENTIFIER_CHAR when the whole string should be matched as IDENTIFIER_CHARs

The standard answer in most lexer generators is that the regex that matches the longest sequence wins. To break a tie between two regexes that match the exact same amount, is to prefer the first regex in the order in which they appear in your definitions file.
You can simulate this effect in your lexer. Then "selfIdentifier" would be treated as an identifier.
If you are writing a efficient lexer, you'll have a single finite state machine that branches from one state to another based on the current character class. In this case, you'll have several states that can be terminal states, and are terminal states if the FSA cannot shift to another state. You can assign a token type to each such terminal state; the token type will be unique.

Tokenizing numbers for a parser

I am writing my first parser and have a few questions conerning the tokenizer.
Basically, my tokenizer exposes a nextToken() function that is supposed to return the next token. These tokens are distinguished by a token-type. I think it would make sense to have the following token-types:
SYMBOL (such as <, :=, ( and the like
WHITESPACE (tab, newlines, spaces...)
REMARK (a comment between /* ... */ or after // through the new line)
NUMBER
IDENT (such as the name of a function or a variable)
STRING (Something enclosed between "....")
Now, do you think this makes sense?
Also, I am struggling with the NUMBER token-type. Do you think it makes more sense to further split it up into a NUMBER and a FLOAT token-type? Without a FLOAT token-type, I'd receive NUMBER (eg 402), a SYMBOL (.) followed by another NUMBER (eg 203) if I were about to parse a float.
Finally, what do you think makes more sense for the tokenizer to return when it encounters a -909? Should it return the SYMBOL - first, followed by the NUMBER 909 or should it return a NUMBER -909 right away?

It depends upon your target language.
The point behind a lexer is to return tokens that make it easy to write a parser for your language. Suppose your lexer returns NUMBER when it sees a symbol that matches "[0-9]+". If it sees a non-integer number, such as "3.1415926" it will return NUMBER . NUMBER. While you could handle that in your parser, if your lexer is doing an appropriate job of skipping whitespace and comments (since they aren't relevant to your parser) then you could end up incorrectly parsing things like "123 /* comment / . \n / other comment */ 456" as floating point numbers.
As for lexing "-[0-9]+" as a NUMBER vs MINUS NUMBER again, that depends upon your target language, but I would usually go with MINUS NUMBER, otherwise you would end up lexing "A = 1-2-3-4" as SYMBOL = NUMBER NUMBER NUMBER NUMBER instead of SYMBOL = NUMBER MINUS NUMBER MINUS NUMBER MINUS NUMBER.
While we're on the topic, I'd strongly recommend the book Language Implementation Patterns, by Terrance Parr, the author of ANTLR.

You are best served by making your token types closely match your grammar's terminal symbols.
Without knowing the language/grammar, I expect you would be better served by having token types for "LESS_THAN", "LESS_THAN_OR_EQUAL" and also "FLOAT", "DOUBLE", "INTEGER", etc.

From my experience with actual lexers:
Make sure to check if you actually need comment / whitespace tokens. Compilers typically don't need them, while IDEs often do (to color comments green, for example).
Usually there's no single "operator" token; instead, there's a token for each distinct operator. So there's a PLUS token and AMPERSAND token and LESSER_THAN token etc.. This means that you only care about the lexeme (the actual text matched) when the token is an identifier or some sort of literal.
Avoid splitting literals. If "hello world" is a string literal, parse it as a single token. If -3.058e18 is a float literal, parse it as a single token as well. Lexers usually rely on regular expressions, which are expressive enough for all these things, and more. Of course, if the literals are complex enough you have to split them (e.g. the block literal in Smalltalk).

I think that the answer to your question is strictly tied to the semantic of NUMBER.
What a NUMBER should be? An always positive integer, a float...
I'd like to suggest you to lookup to the flex and yacc (aka lex & bison) tools of the U**x operating systems: these are powerful parsers and scanners generators that take a grammar and output a compilable and readily usable program.

It depends on how you are taking in tokens, if you are doing it character by character, then it might be a bit tricky, but if you are doing it word by word i.e.
int a = a + 2.0
then the tokens would be (discarding whitespace)
int
a
=
a
+
2.0
So you wouldn't run into the situation where you interpret the . as at token but rather take the whole string in - which is where you can determine if it's a FLOAT or NUMBER or whatever you want.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart