Tokenizing numbers for a parser - parsing

I am writing my first parser and have a few questions conerning the tokenizer.
Basically, my tokenizer exposes a nextToken() function that is supposed to return the next token. These tokens are distinguished by a token-type. I think it would make sense to have the following token-types:
SYMBOL (such as <, :=, ( and the like
WHITESPACE (tab, newlines, spaces...)
REMARK (a comment between /* ... */ or after // through the new line)
NUMBER
IDENT (such as the name of a function or a variable)
STRING (Something enclosed between "....")
Now, do you think this makes sense?
Also, I am struggling with the NUMBER token-type. Do you think it makes more sense to further split it up into a NUMBER and a FLOAT token-type? Without a FLOAT token-type, I'd receive NUMBER (eg 402), a SYMBOL (.) followed by another NUMBER (eg 203) if I were about to parse a float.
Finally, what do you think makes more sense for the tokenizer to return when it encounters a -909? Should it return the SYMBOL - first, followed by the NUMBER 909 or should it return a NUMBER -909 right away?

It depends upon your target language.
The point behind a lexer is to return tokens that make it easy to write a parser for your language. Suppose your lexer returns NUMBER when it sees a symbol that matches "[0-9]+". If it sees a non-integer number, such as "3.1415926" it will return NUMBER . NUMBER. While you could handle that in your parser, if your lexer is doing an appropriate job of skipping whitespace and comments (since they aren't relevant to your parser) then you could end up incorrectly parsing things like "123 /* comment / . \n / other comment */ 456" as floating point numbers.
As for lexing "-[0-9]+" as a NUMBER vs MINUS NUMBER again, that depends upon your target language, but I would usually go with MINUS NUMBER, otherwise you would end up lexing "A = 1-2-3-4" as SYMBOL = NUMBER NUMBER NUMBER NUMBER instead of SYMBOL = NUMBER MINUS NUMBER MINUS NUMBER MINUS NUMBER.
While we're on the topic, I'd strongly recommend the book Language Implementation Patterns, by Terrance Parr, the author of ANTLR.

You are best served by making your token types closely match your grammar's terminal symbols.
Without knowing the language/grammar, I expect you would be better served by having token types for "LESS_THAN", "LESS_THAN_OR_EQUAL" and also "FLOAT", "DOUBLE", "INTEGER", etc.

From my experience with actual lexers:
Make sure to check if you actually need comment / whitespace tokens. Compilers typically don't need them, while IDEs often do (to color comments green, for example).
Usually there's no single "operator" token; instead, there's a token for each distinct operator. So there's a PLUS token and AMPERSAND token and LESSER_THAN token etc.. This means that you only care about the lexeme (the actual text matched) when the token is an identifier or some sort of literal.
Avoid splitting literals. If "hello world" is a string literal, parse it as a single token. If -3.058e18 is a float literal, parse it as a single token as well. Lexers usually rely on regular expressions, which are expressive enough for all these things, and more. Of course, if the literals are complex enough you have to split them (e.g. the block literal in Smalltalk).

I think that the answer to your question is strictly tied to the semantic of NUMBER.
What a NUMBER should be? An always positive integer, a float...
I'd like to suggest you to lookup to the flex and yacc (aka lex & bison) tools of the U**x operating systems: these are powerful parsers and scanners generators that take a grammar and output a compilable and readily usable program.

It depends on how you are taking in tokens, if you are doing it character by character, then it might be a bit tricky, but if you are doing it word by word i.e.
int a = a + 2.0
then the tokens would be (discarding whitespace)
int
a
=
a
+
2.0
So you wouldn't run into the situation where you interpret the . as at token but rather take the whole string in - which is where you can determine if it's a FLOAT or NUMBER or whatever you want.

Related

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to keep this as accurate to as I can), there are situations where spaces cannot be inserted, for example "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace not being important everywhere else? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would simply check that the starting position of each token (available as p.lexpos(i) where p is the action function's parameter and i is the index of the token the the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a big of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call. (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.

What is a valid character in an identifier called?

Identifiers typically consist of underscores, digits; and uppercase and lowercase characters where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit or is_alnum. If one were to implement such a function to scan a character used in an identifier, what would it be called? Clearly, is_identifier is wrong as that would be the entire token that the lexer scans and not the individual character. I suppose is_alnum_or_underscore would be accurate though quite verbose. For something as common as this, I feel like there should be a single word for it.
Unicode Annex 31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the definition of the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C++23. So I guess it's pretty well mainstream.
UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.
That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):
<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
A programming language which claims conformance with UAX31 does not need to accept the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)
That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories: identifier-start and identifier-continue, where the first one is a subset of the second one.
You'll see that in a number of grammar documents:
Pythonidentifier ::= xid_start xid_continue*
RustIDENTIFIER_OR_KEYWORD : XID_Start XID_Continue*
| _ XID_Continue+
C++identifier:
identifier-start
identifier identifier-continue
So that's what I'd suggest. But there are many other possibilities:
SwiftCalls the sets identifier-head and identifier-characters
JavaCalls them JavaLetter and JavaLetterOrDigit
CDefines identifier-nondigit and identifier-digit; Continue would be the union of the two sets.

Advantages and disadvantages of creating named tokens for single characters

Let's say I have a simple syntax where you can assign a number to an identifier using = sign.
I can write the parser in two ways.
I can include the character token = directly in the rule or I can create a named token for it and use the lexer to recognize it.
Option #1:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
// parser
%token IDENTIFIER NUMBER
%%
assignment : IDENTIFIER '=' NUMBER ;
Option #2:
// lexer
[A-Za-z_][A-Za-z_0-9]* { return IDENTIFIER; }
[0-9]+ { return NUMBER; }
= { return EQUAL_SIGN; }
// parser
%token IDENTIFIER NUMBER EQUAL_SIGN
%%
assignment : IDENTIFIER EQUAL_SIGN NUMBER ;
Both ways of writing the parser work and I cannot quite find an information about good practices concerning such situation.
The first snippet seems to be more readable but this is not my highest concern.
Is any of these two options faster or beneficial in other way? Are they technical reasons (other than readability) to prefer one over another?
Is there maybe a third, better way?
I'm asking about problems concerning huge parsers, where optimization may be a real issue, not just such toy examples as is shown here.
Aside from readability, it basically makes no difference. There really is no optimisation issue, no matter how big your grammar is. Once the grammar has been compiled, tokens are small integers, and one small integer is pretty well the same as any other small integer.
But I wouldn't underrate the importance of readability. For one thing, it's harder to see bugs in unreadable code. It's surprisingly common for a grammar bug to be the result of simply typing the wrong name for some punctuation character. It's much easier to see that '{' expr '{' is wrong than if it were written T_LBRC expr T_LBRC, and furthermore the named symbols are much harder to interpret for someone whose first language isn't English.
Bison's parse table compression requires token numbers to be consecutive integers, which is done by passing incoming token codes through a lookup table, hardly a major overhead. Not using character codes doesn't avoid this lookup, though, because the token numbers 1 through 255 are reserved anyway.
However, Bison's C++ API using named token constructors requires token names and single-character token codes cannot be used as token names (although they're not forbidden, since you're not required to use named constructors).
Given that use case, Bison recently introduced an option which generates consecutively numbered token codes in order to avoid the recoding; this option is not compatible with single-character tokens being represented as themselves. It's possible that not having to recode the token is a marginal speed-up, but it's hard to believe that it's significant, but if you're not going to use single-quoted tokens, you might as well go for it.
Personally, I don't think the extra complexity is justified, at least for the C API. If you do choose to go with token names, perhaps in order to use the C++ API's named constructors, I'd strongly recommend using Bison aliases in order to write your grammar with double-quoted tokens (also recommended for multi-character operator and keyword tokens).

context sensitive tokenization of code

I am working on a parser for a language that has
identifiers (say, a letter followed by a number of alphanumeric characters or an underscore),
integers (any number of digits and possibly carets ^),
some operators,
filename (a number of alphanumeric characters and possibly slashes, and dots)
Apparently filename overlaps integers and identifiers, so in general I cannot decide if I have a filename or, say, an identifier unless the filename contains a slash or a dot.
But filename can only follow a specific operator.
My question is how this situation is usually handled during tokenization? I have a table driven tokenizer (lexer), but I am not sure how to tell a filename from either an integer or an identifier. How is this done?
If filename was a superset of integers and identifiers then I probably could have grammar productions that could handle that, but the tokens overlap...
Flex and other lexers have the concept of start conditions. Essentially the lexer is a state machine and its exact behaviour will depend on its current state.
In your example, when your lexer encounters the operator preceding a filename it should switch to a FilenameMode state (or whatever) and then switch back once it has produced the filename token it was expecting.
EDIT:
Just to give some concrete code this side of the hyperlink:
You would trigger your FILENAME_MODE when you encounter the operator...
{FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }
You would define your rule to parse a filename:
<FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }
...switching back to the INITIAL state in the action.

Is the word "lexer" a synonym for the word "parser"?

The title is the question: Are the words "lexer" and "parser" synonyms, or are they different? It seems that Wikipedia uses the words interchangeably, but English is not my native language so I can't be sure.
A lexer is used to split the input up into tokens, whereas a parser is used to construct an abstract syntax tree from that sequence of tokens.
Now, you could just say that the tokens are simply characters and use a parser directly, but it is often convenient to have a parser which only needs to look ahead one token to determine what it's going to do next. Therefore, a lexer is usually used to divide up the input into tokens before the parser sees it.
A lexer is usually described using simple regular expression rules which are tested in order. There exist tools such as lex which can generate lexers automatically from such a description.
[0-9]+ Number
[A-Z]+ Identifier
+ Plus
A parser, on the other hand, is typically described by specifying a grammar. Again, there exist tools such as yacc which can generate parsers from such a description.
expr ::= expr Plus expr
| Number
| Identifier
No. Lexer breaks up input stream into "words"; parser discovers syntactic structure between such "words". For instance, given input:
velocity = path / time;
lexer output is:
velocity (identifier)
= (assignment operator)
path (identifier)
/ (binary operator)
time (identifier)
; (statement separator)
and then the parser can establish the following structure:
= (assign)
lvalue: velocity
rvalue: result of
/ (division)
dividend: contents of variable "path"
divisor: contents of variable "time"
No. A lexer breaks down the source text into tokens, whereas a parser interprets the sequence of tokens appropriately.
They're different.
A lexer takes a stream of input characters as input, and produces tokens (aka "lexemes") as output.
A parser takes tokens (lexemes) as input, and produces (for example) an abstract syntax tree representing statements.
The two are enough alike, however, that quite a few people (especially those who've never written anything like a compiler or interpreter) treat them as the same, or (more often) use "parser" when what they really mean is "lexer".
As far as I know, lexer and parser are allied in meaning but are not exact synonyms. Though many sources do use them as similar a lexer (abbreviation of lexical analyser) identifies tokens relevant to the language from the input; while parsers determine whether a stream of tokens meets the grammar of the language under consideration.

Resources