Does placement of tokens in the Lexer matter?

Is there any difference in either semantics or performance in where tokens are placed in the lexer file? For example:
EQUAL : '='; // Equal, also var:=val which is unsupported
NAMED_ARGUMENT : ':='; // often used when calling custom Macro/Function
Vs.
NAMED_ARGUMENT : ':='; // often used when calling custom Macro/Function
EQUAL : '='; // Equal, also var:=val which is unsupported

In this example, the order won’t matter. If the Lexer finds :=, then it will generate a NAMED_ARGUMENT token (because := is a longer sequence of characters than =).
The Lexer will prefer the rule that matches the longest sequence of input characters.
The only time order matters is when multiple Lexer rules match the same-length sequence of characters; in that case, the Lexer generates a token for the first matching rule in the grammar. (So, for example, be sure to put keywords before something like an ID rule: it’s quite likely that a keyword will also match the ID rule, but by occurring before ID, the keyword rule will be selected.)
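Those two rules (longest match wins; ties go to the earlier rule) can be sketched in plain Python as a toy maximal-munch tokenizer. This is not ANTLR's actual algorithm, and the rule set is illustrative:

```python
import re

# Toy lexer rules, in grammar-file order. Longest match wins;
# on a tie in length, the earlier rule wins.
RULES = [
    ("NAMED_ARGUMENT", r":="),
    ("EQUAL", r"="),
    ("COLON", r":"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("WS", r"\s+"),
]

def next_token(text, pos):
    """Return (token_name, lexeme) for the best match at `pos`."""
    best = None  # (match_length, token_name)
    for name, pattern in RULES:
        m = re.match(pattern, text[pos:])
        if m is None:
            continue
        length = len(m.group(0))
        # Only a strictly longer match displaces the current best,
        # so equal-length ties keep the earlier rule.
        if best is None or length > best[0]:
            best = (length, name)
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best[1], text[pos:pos + best[0]]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        name, lexeme = next_token(text, pos)
        if name != "WS":  # skip whitespace tokens
            tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens
```

On input `a := b`, the `:=` produces NAMED_ARGUMENT regardless of where that rule sits relative to EQUAL, because the two-character match beats any one-character match.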
-- EDIT --
All that said... as #rici mentions in his comment, in this particular case, order is unimportant for an entirely different reason.
The Lexer attempts to match input at the beginning of the file (or at the character after the last recognized token).
I think of it like this: The Lexer chooses a character and then rules out all the Lexer rules that can't start with this character. Then it looks at the sequence of that character and the next character and rules out all the Lexer rules that can't begin with that sequence of characters. It does this repeatedly, until it has a sequence of characters that can't match any Lexer rule. At that point, we know that everything up to (but excluding) that character had to match one or more Lexer rules. If there's only one rule, then that's the generated token. If there are multiple Lexer rules that matched, then the first one is selected.
In your case, the ':' would have immediately ruled out a match of the EQUAL rule (it can't begin with a ':'), but would still leave open the possibility of matching the NAMED_ARGUMENT rule. If the next character is a '=', the Lexer knows it could match the NAMED_ARGUMENT rule (but maybe you have other rules that could start with ":=", so it looks at the next character; we'll guess it's a space). ":= " does not match the NAMED_ARGUMENT rule and, for this example, doesn't match ANY rule. Now the Lexer backs up and says ":=" matched the NAMED_ARGUMENT rule (and no others), so it creates a NAMED_ARGUMENT token, and starts the whole process again beginning with the space it couldn't match.


How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to stay as faithful to as I can), there are situations where spaces cannot be inserted, for example "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to treat whitespace as significant everywhere else? Or am I better off handling these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would simply check that the starting position of each token (available as p.lexpos(i), where p is the action function's parameter and i is the index of the token in the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
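A sketch of that validation, written against plain (start_position, lexeme) pairs rather than Ply's p.lexpos(i) API so it runs standalone (the helper name is hypothetical):

```python
def tokens_are_adjacent(tokens):
    """Check that each token starts exactly where the previous one ended.

    `tokens` is a list of (start_position, lexeme) pairs in the order they
    were matched; any gap between consecutive tokens (i.e. whitespace)
    makes the check fail.
    """
    for (prev_pos, prev_text), (pos, _) in zip(tokens, tokens[1:]):
        if pos != prev_pos + len(prev_text):
            return False
    return True
```

For "this.field", the tokens (0, "this"), (4, "."), (5, "field") pass, while inserting a space before the dot shifts its start position and fails the check.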
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a bit of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon, but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
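Those two whitespace tests can be pulled into a standalone helper over the input string and a token's span (a sketch; in Ply the same data lives on the token and lexer objects as described above):

```python
def whitespace_context(lexdata, tok_start, tok_end):
    """Return (preceded_by_ws, followed_by_ws) for the token occupying
    lexdata[tok_start:tok_end]; the start and end of the input count
    as whitespace boundaries, matching the expressions in the text."""
    preceded = tok_start == 0 or lexdata[tok_start - 1].isspace()
    followed = tok_end == len(lexdata) or lexdata[tok_end].isspace()
    return preceded, followed
```

For the input "a .b", the dot at index 2 is preceded but not followed by whitespace, so a lexer action could emit a distinct token type for that variant.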
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.

context sensitive tokenization of code

I am working on a parser for a language that has
identifiers (say, a letter followed by a number of alphanumeric characters or an underscore),
integers (any number of digits and possibly carets ^),
some operators,
filename (a number of alphanumeric characters and possibly slashes, and dots)
Apparently filename overlaps integers and identifiers, so in general I cannot decide if I have a filename or, say, an identifier unless the filename contains a slash or a dot.
But filename can only follow a specific operator.
My question is how this situation is usually handled during tokenization? I have a table driven tokenizer (lexer), but I am not sure how to tell a filename from either an integer or an identifier. How is this done?
If filename were a superset of integers and identifiers, then I probably could have grammar productions that could handle that, but the tokens overlap...
Flex and other lexers have the concept of start conditions. Essentially the lexer is a state machine and its exact behaviour will depend on its current state.
In your example, when your lexer encounters the operator preceding a filename it should switch to a FilenameMode state (or whatever) and then switch back once it has produced the filename token it was expecting.
EDIT:
Just to give some concrete code this side of the hyperlink:
You would trigger your FILENAME_MODE when you encounter the operator...
{FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }
You would define your rule to parse a filename:
<FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }
...switching back to the INITIAL state in the action.
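The same start-condition idea can be sketched in plain Python with an explicit mode variable; the '@load' operator and the token names here are invented for illustration:

```python
import re

# Hypothetical operator that must be followed by a filename.
FILENAME_PREFIX = "@load"

def tokenize(text):
    tokens = []
    mode = "INITIAL"  # the lexer's current start condition
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        if mode == "FILENAME_MODE":
            # In filename mode, match filename characters (alphanumerics,
            # slashes, dots), then switch back -- BEGIN(INITIAL).
            m = re.match(r"[A-Za-z0-9./]+", text[pos:])
            tokens.append(("FILENAME", m.group(0)))
            mode = "INITIAL"
        elif text.startswith(FILENAME_PREFIX, pos):
            # Seeing the operator switches state -- BEGIN(FILENAME_MODE).
            tokens.append(("FILENAME_PREFIX", FILENAME_PREFIX))
            mode = "FILENAME_MODE"
        else:
            m = re.match(r"[A-Za-z_][A-Za-z0-9_]*", text[pos:])
            if m:
                tokens.append(("IDENTIFIER", m.group(0)))
            else:
                tokens.append(("OPERATOR", text[pos]))
        pos += len(tokens[-1][1])
    return tokens
```

Outside filename mode, "a/b.txt" would be split into identifiers and operators; after the prefix operator, the same characters form a single FILENAME token.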

Grammar: Precedence of grammar alternatives

This is a very basic question about grammar alternatives. If you have the following alternative:
Myalternative: 'a' | .;
Myalternative2: 'a' | 'b';
Would 'a' have higher priority over the '.' and over 'b'?
I understand that this may also depend on the behaviour of the parser generated by this syntax but in pure theoretical grammar terms could you imagine these rules being matched in parallel i.e. test against 'a' and '.' at the same time and select the one with highest priority? Or is the 'a' and . ambiguous due to the lack of precedence in grammars?
The answer depends primarily on the tool you are using, and what the semantics of that tool is. As written, this is not a context-free grammar in canonical form, and you'd need to produce that to get a theoretical answer, because only in that way can you clarify the intended semantics.
Since the question is tagged antlr, I'm going to guess that this is part of an Antlr lexical definition in which . is a wildcard character. In that case, 'a' | . means exactly the same thing as ..
Since MyAlternative matches everything that MyAlternative2 matches, and since MyAlternative comes first in the Antlr lexical definition, MyAlternative2 can never match anything. Any single character will be matched by MyAlternative (unless there is some other lexical rule which matches a longer sequence of input characters).
If you put the definition of MyAlternative2 first in the grammar file, then a or b would be matched as MyAlternative2, while any other character would be matched as MyAlternative.
The question of precedence within alternatives is meaningless. It doesn't matter whether MyAlternative considers the match of an a to be a match of a or a match of .. It is, in the end, a match of MyAlternative, and that symbol can only have one associated action.
Between lexical rules, there is a precedence relationship: The first one wins. (More accurately, as far as I know, Antlr obeys the usual convention that the longest match wins; between two rules which both match the same longest sequence, the first one in the grammar file wins.) That is not in any way influenced by alternative bars in the rules themselves.

Handling identifiers that begin with a reserved word

I am presently writing my own lexer and am wondering how to correctly handle the situation where an identifier begins with a reserved word. Presently the lexer matches the whole first part as a reserved word and then the rest separately, because the reserved word is the longest match ('self' vs 's' in the example below).
For example with the rules:
RESERVED_WORD := self
IDENTIFIER_CHAR := [A-Z]|[a-z]
Applied to:
selfIdentifier
'self' is matched as RESERVED_WORD, and 'I' onwards is matched as IDENTIFIER_CHARs, when the whole string should be matched as a single identifier.
The standard answer in most lexer generators is that the regex that matches the longest sequence wins. A tie between two regexes that match the exact same amount is broken by preferring the regex that appears first in your definitions file.
You can simulate this effect in your lexer. Then "selfIdentifier" would be treated as an identifier.
If you are writing an efficient lexer, you'll have a single finite state machine that branches from one state to another based on the current character class. In this case, you'll have several states that can be terminal states, and they are terminal states if the FSA cannot shift to another state. You can assign a token type to each such terminal state; the token type will be unique.
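Here is the tie-break applied to the self example, simulating the FSA with regexes rather than hand-coded states (a sketch, not an efficient lexer):

```python
import re

# Rules in definition order. RESERVED_WORD is listed first, so it wins
# over IDENTIFIER only when both match exactly the same length.
RULES = [
    ("RESERVED_WORD", r"self"),
    ("IDENTIFIER", r"[A-Za-z]+"),
]

def next_token(text):
    """Return (token_name, lexeme) for the longest match at the start of
    `text`, preferring the earlier rule on equal-length matches."""
    best = None  # (match_length, token_name)
    for name, pattern in RULES:
        m = re.match(pattern, text)
        # Replace only on a strictly longer match, so ties keep the
        # earlier rule.
        if m and (best is None or len(m.group(0)) > best[0]):
            best = (len(m.group(0)), name)
    return best[1], text[:best[0]]
```

"selfIdentifier" becomes a single IDENTIFIER because the identifier regex matches 14 characters against RESERVED_WORD's 4, while a bare "self" is still a RESERVED_WORD via the tie-break.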

Practical difference between parser rules and lexer rules in ANTLR?

I understand the theory behind separating parser rules and lexer rules, but what are the practical differences between these two statements in ANTLR:
my_rule: ... ;
MY_RULE: ... ;
Do they result in different AST trees? Different performance? Potential ambiguities?
... what are the practical differences between these two statements in ANTLR ...
MY_RULE will be used to tokenize your input source. It represents a fundamental building block of your language.
my_rule is called from the parser, it consists of zero or more other parser rules or tokens produced by the lexer.
That's the difference.
Do they result in different AST trees? Different performance? ...
The parser builds the AST using tokens produced by the lexer, so the questions make no sense (to me). A lexer merely "feeds" the parser a one-dimensional stream of tokens.
This post may be helpful:
The lexer is responsible for the first step, and it's only job is to
create a "token stream" from text. It is not responsible for
understanding the semantics of your language, it is only interested in
understanding the syntax of your language.
For example, syntax is the rule that an identifier must only use
characters, numbers and underscores - as long as it doesn't start with
a number. The responsibility of the lexer is to understand this rule.
In this case, the lexer would accept the sequence of characters
"asd_123" but reject the characters "12dsadsa" (assuming that there
isn't another rule in which this text is valid). When seeing the valid
text example, it may emit a token into the token stream such as
IDENTIFIER(asd_123).
Note that I said "identifier" which is the general term for things
like variable names, function names, namespace names, etc. The parser
would be the thing that would understand the context in which that
identifier appears, so that it would then further specify that token
as being a certain thing's name.
(sidenote: the token is just a unique name given to an element of the
token stream. The lexeme is the text that the token was matched from.
I write the lexeme in parentheses next to the token. For example,
NUMBER(123). In this case, this is a NUMBER token with a lexeme of
'123'. However, with some tokens, such as operators, I omit the lexeme
since it's redundant. For example, I would write SEMICOLON for the
semicolon token, not SEMICOLON( ; )).
From ANTLR - When to use Parser Rules vs Lexer Rules?
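The identifier rule from the quoted post can be sketched as a single anchored regex (a toy illustration, not what ANTLR generates):

```python
import re

# Letters, digits, and underscores, not starting with a digit.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

def lex_identifier(lexeme):
    """Return an (token_type, lexeme) pair, e.g. IDENTIFIER(asd_123),
    or None if the text is not a valid identifier."""
    if IDENTIFIER.match(lexeme):
        return ("IDENTIFIER", lexeme)
    return None
```

This accepts "asd_123" and rejects "12dsadsa", mirroring the example in the quoted post.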
