How do you write a lexer parser where identifiers may begin with keywords? - parsing

Suppose you have a language where identifiers might begin with keywords. For example, suppose "case" is a keyword, but "caser" is a valid identifier. Suppose also that the lexer rules can only handle regular expressions. Then it seems that I can't place keyword rules ahead of the identifier rule in the lexer, because this would parse "caser" as "case" followed by "r". I also can't place keyword lexing rules after the identifier rule, since the identifier rule would match the keywords, and the keyword rules would never trigger.
So, instead, I could make a keyword_or_identifier rule in the lexer, and have the parser determine if a keyword_or_identifier is a keyword or an identifier. Is this what is normally done?
I realize that "use a different lexer that has lookahead" is an answer (kind of), but I'm also interested in how this is done in a traditional DFA-based lexer, since my current lexer seems to work that way.

Most lexers, starting with the original lex, match alternatives as follows:
Use the longest match.
If there are two or more alternatives which tie for the longest match, use the first one in the lexer definition.
This allows the following style:
"case" { return CASE; }
[[:alpha:]][[:alnum:]]* { return ID; }
If the input pattern is caser, then the second alternative will be used because it's the longest match. If the input pattern is case r, then the first alternative will be used because both of them match case, and by rule (2) above, the first one wins.
Although this may seem a bit arbitrary, it's consistent with the DFA approach, mostly. First of all, a DFA doesn't stop the first time it reaches an accepting state. If it did, then patterns like [[:alpha:]][[:alnum:]]* would be useless, because they enter an accepting state on the first character (assuming its alphabetic). Instead, DFA-based lexers continue until there are no possible transitions from the current state, and then they back up until the last accepting state. (See below.)
A given DFA state may be accepting because of two different rules, but that's not a problem either; only the first accepting rule is recorded.
To be fair, this is slightly different from the mathematical model of a DFA, which has a transition for every symbol in every state (although many of them may be transitions to a "sink" state), and which matches a complete input depending on whether or not the automaton is in an accepting state when the last symbol of the input is read. The lexer model is slightly different, but can easily be formalized as well.
The only difficulty in the theoretical model is "back up to the last accepting state". In practice, this is generally done by remembering the state and input position every time an accepting state is reached. This does mean that it may be necessary to rewind the input stream, possibly by an arbitrary amount.
Most languages do not require backup very often, and very few require indefinite backup. Some lexer generators can generate faster code if there are no backup states. (flex will do this if you use -Cf or -CF.)
One common case which leads to indefinite backup is failing to provide an appropriate error return for string literals:
["][^"\n]*["] { return STRING; }
/* ... */
. { return INVALID; }
Here, the first pattern will match a string literal starting with " if there is a matching " on the same line. (I omitted \-escapes for simplicity.) If the string literal is unterminated, the last pattern will match, but the input will need to be rewound to the ". In most cases, it's pointless trying to continue lexical analysis by ignoring an unmatched "; it would make more sense to just ignore the entire remainder of the line. So not only is backing up inefficient, it also is likely to lead to an explosion of false error messages. A better solution might be:
["][^"\n]*["] { return STRING; }
["][^"\n]* { return INVALID_STRING; }
Here, the second alternative can only succeed if the string is unterminated, because if the string is terminated, the first alternative will match one more character. Consequently, it doesn't even matter in which order these alternatives appear, although everyone I know would put them in the same order I did.

Related

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to keep this as accurate to as I can), there are situations where spaces cannot be inserted, for example "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace not being important everywhere else? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would simply check that the starting position of each token (available as p.lexpos(i) where p is the action function's parameter and i is the index of the token the the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a big of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call. (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.

Is my lexer doing too much -- is it doing the work of the parser?

My input consists of a series of names, each on a new line. Each name consists of a firstname, optional middle initial, and lastname. The name fields are separated by tabs. Here is a sample input:
Sally M. Smith
Tom V. Jones
John Doe
Below are the rules for my Flex lexer. It works fine but I am concerned that my lexer is doing too much: it is determining that a token is a firstname or a middle initial or a lastname. Should that determination be done in the parser, not the lexer? Am I abusing the Flex state capability? What I am seeking is a critique of my lexer. I am just a beginner, how would a parsing expert create lexer rules for this input?
<INITIAL>{
[a-zA-Z]+ { yylval.strval = strdup(yytext); return(FIRSTNAME); }
\t { BEGIN MI_STATE; }
. { BEGIN JUNK_STATE; }
}
<MI_STATE>{
[A-Z]\. { yylval.strval = strdup(yytext); return(MI); }
\t { BEGIN LASTNAME_STATE; }
. { BEGIN JUNK_STATE; }
}
<LASTNAME_STATE>{
[a-zA-Z]+ { yylval.strval = strdup(yytext); return(LASTNAME); }
\n { BEGIN INITIAL; return EOL; }
. { BEGIN JUNK_STATE; }
}
<JUNK_STATE>. { printf("JUNK: %s\n", yytext); }
You can use lexer states as you do in this question. But it's better to use them as a means to conditionally activate rules. For examples, think of handling multi-line comments or here documents or (for us silverbacks) embedded SQL.
In your question, there's no lexical difference between a given name and a family name -- they both are matched by [a-zA-Z]+, as would be middle names, if you were to extend your lexer.
Short answer: yes, lex NAME tokens and let the parser determine whether you have three NAME tokens on a line.
Yes; your lexer is parsing. The main evidence is that it's implementing identical rules in different start states. Two rules have exactly the same pattern.
The purpose of start states in the context of lexing is to modify the lexical grammar in order to shield the parser from certain differences. It works with the parser. For instance, say you had some document language in which $ shifts into math expression mode, which has different tokenizing rules. The lexer still just returns tokens in math mode; it doesn't try to parse the math expressions. It is the parser which will determine that, if the brackets are balanced, then another $ can shift out of math mode.
In your code the rules for returning a last name and first name are identical; you have used to start state to handle phrase structure syntax: the fact that the last name comes later than the first name.
Another bit of telltale evidence that the lexer is parsing is that the lexer itself is handling all of the start condition changes. In our $...$ math mode example, we might have the lexer shift into a start state when it sees the $ symbol. However, if the lexer also recognizes the end of math mode, then that is evidence it is parsing the math mode expression. The end can only be recognized by following the nested phrase structure grammar of math mode expressions. The way you would handle that would be for the lexer to expose a function lex_end_math_mode(). When the parser processes and reduces the entire math mode expression, it calls this function to tell the lexer to switch back to the lexical syntax outside of math mode. The math-mode-terminating dollar sign would likely also appear as a token visible to the parser, though the leading one might not. So that is to say, the parser parses math_mode_expr : expr '$': a math mode expression followed by a required dollar sign to end math mode. The action for that rule would include the call to lex_end_math_mode, so the lexer returns to the tokenization rules outside of math mode for scanning the next token after the closing $.
There is no right or wrong answer because it's all parsing. Every grammar that is divided into tokens and phrase structure rules could be expressed by a unified grammar which includes the rules for the token structure.
Why we often use a design which separates lexical scanning and parsing is that:
Unifying the lexical and phrase structure grammar into one will turn LL(1) into LL(k). A recursive-descent parser then needs to look k symbols ahead to make parsing decisions. For instance if you're parsing C with this holistic approach, you need to treat int as a reserved keyword. That requires four symbols of lookahead: you have to recognize i, n, t, and then if the next symbol indicates that the token has ended, you treat that as the keyword, otherwise an identifier.
Performance: lexical scanning uses efficient techniques tailored to that task, which take advantage of the restriction that the lexical grammar is regular.
Correspondence to spec: if you have a language whose specification is described in terms of a lexical grammar separate from a phrase structure grammar, then if you implement it that way, features of your code are more readily traceable to features of requirement spec. You may be able to write unit tests which separately show that the lexing and parsing obeys the spec.
Schooling: people who went through a CS program that included a course on compiler construction had separate lexing and parsing drilled into their heads, and whenever it comes up in their subsequent career, they just lean on that wisdom. They are never confronted with situations in which they recognize it as not being a good approach, and don't question it.
Whatever works in your individual situations with whatever you're parsing overrules the theory. If it's convenient for you to recognize some phrase-like fragments in the lexer, and you're able to convince yourself that it's a clean approach, then by all means do it that way.

Writing a lexer for a context sensitive markup language, that has recursive structures such as nested lists

I'm working on a reStructuredText transpiler in Rust, and am in need of some advice concerning how lexing should be structured in languages that have recursive structures. For example lists within lists are possible in rST:
* This is a list item
* This is a sub list item
* And here we are at the preceding indentation level again.
The default docutils.parsers.rst took the approach of scanning the input one line at a time:
The reStructuredText parser is implemented as a state machine, examining its
input one line at a time.
The state machine mentioned basically operates on a set of states of the form (regex, match_method, next_state). It tries to match the current line to the regex based on the current state and runs match_method while transitioning to the next_state if a match succeeds, doing this until it runs out of lines to scan.
My question then is, is this the best approach to scanning a language such as rST? My approach thus far has been to create a Chars iterator of the source and eat away at the source while trying to match against structures at the current Unicode scalar. This works to some extent when all I'm doing is scanning inline content, but I've now run into the realization that handling recursive body level structures like nested lists is going to be a pain in the butt. It feels like I'm going to need a whole bunch of states with duplicate regexes and related methods in many states for matching against indentations before new lines and such.
Would it be better to simply have and iterator of the lines of the source and match on a per-line basis, and if a line such as
* this is an indented list item
is encountered in State::Body, simply transition to a state such as State::BulletList and start lexing lines based on the rules specified there? The above line could be lexed for example as a sequence
TokenType::Indent, TokenType::Bullet, TokenType::BodyText
Any thoughts on this?
I don't know much about rST. But you say it has "recursive" structures. If that's that case, you can't fully lex it as a recursive structure using just state machines or regexes or even lexer generators.
But this the wrong way to think about it. The lexer's job is to identify the atoms of the language. A parser's job is to recognize structure, especially if it is recursive (yes, parsers often build trees recording the recursive structures they found).
So build the lexer ignoring context if you can, and use a parser to pick up the recursive structures if you need them. You can read more about the distinction in my SO answer about Parsers Vs. Lexers https://stackoverflow.com/a/2852716/120163
If you insist on doing all of this in the lexer, you'll need to augment it with a pushdown stack to track the recursive structures. Then what are you building is a sloppy parser disguised as lexer. (You will probably still want a real parser to process the output of this "lexer").
Having a pushdown stack actually useful if the language has different atoms in different contexts especially if the contexts nest; in this case what you want is mode stack that you change as the lexer encounters tokens that indicate a switch from one mode to another. A really useful extension of this idea is to have mode changes select what amounts to different lexers, each of which produces lexemes unique to that mode.
As an example you might do this to lex a language that contains embedded SQL. We build parsers for JavaScript; our lexer uses a pushdown stack to process the content of regexp literals and track nesting of { ... } [...] and (... ). (This has arguably a downside: it rejects versions of JQuery.js that contain malformed regexes [yes, they exist]. Javascript doesn't care if you define a bad regex literal and never use it, but that seems pretty pointless.)
A special case of the stack occurs if you only have track single "(" ... ")" pairs or the equivalent. In this case you can use a counter to record how many "pushes" or "pop" you might have done on a real stack. If you have two or more pairs of tokens like this, counters don't work.

Checking input grammar and deciding a result

Say I have a string "abacabacabadcdcdcd" and I want to apply a simple set of rules:
abaca->a
dcd->d
From left to right s.t. the string ends up being "abad". This output will be used to make a decision. After the rules are applied, if the output string does not match preset strings such as "abad", the original string would be discarded. ex. Every string should distill down to "abad", kick if it doesn't.
I have this hard-coded right now as regex, but there are many instances of these small rule sets. I am looking for something that will take a set of simple rules and compile (or just a function?) into something I can feed the string to and retrieve a result. The rule sets are independent of each other.
The input is tightly controlled, and the rules in use will be simple. Speed is the most important aspect.
I've looked at Bison and ANTLR, but I don't think I need anything nearly that powerful...
What am I looking for?
Edit: Should mention that the strings are made up of a couple letters. Usually 5, i.e. "abcde". There are no spaces, etc. Just letters.
If it is going to go fast, you can start out with a map, that contains your rules as key value pairs of strings. You can then compile this map to a sort of state machine, a tree with char keys, where the associated value is either a replacement string, or another tree.
You then go char by char through your string. Look up the current char in the tree. If you find another tree, look up the next character in that tree, etc.
At some point, either:
the lookup will fail, and then you know that the string you've seen so far is not the prefix of any rule. You can skip the current character and continue with the next.
or you get a replacement string. In that case, you can replace the characters between the current char and the last one you looked up inclusive by the replacement string.
The only difficulty is if the replacement can itself be part of a pattern to replace. Example:
ab -> e
cd -> b
The input:
acd -> ab (by rule 2)
ab -> e (by rule 1) ????
Now the question is if you want to reconsider ab to give e?
If this is so, you must start over from the beginning after each replacement. In addition, it will be hard to tell whether the replacement ever ends, except if all the rules you have are such that the right hand side is shorter than the left hand side. For, in that case, a finite string will get reduced in a finite amount of time.
But if we don't need to reconsider, the algorithm above will go straight through the string.

Is it acceptable to store the previous state as a global variable?

One of the biggest problems with designing a lexical analyzer/parser combination is overzealousness in designing the analyzer. (f)lex isn't designed to have parser logic, which can sometimes interfere with the design of mini-parsers (by means of yy_push_state(), yy_pop_state(), and yy_top_state().
My goal is to parse a document of the form:
CODE1 this is the text that might appear for a 'CODE' entry
SUBCODE1 the CODE group will have several subcodes, which
may extend onto subsequent lines.
SUBCODE2 however, not all SUBCODEs span multiple lines
SUBCODE3 still, however, there are some SUBCODES that span
not only one or two lines, but any number of lines.
this makes it a challenge to use something like \r\n
as a record delimiter.
CODE2 Moreover, it's not guaranteed that a SUBCODE is the
only way to exit another SUBCODE's scope. There may
be CODE blocks that accomplish this.
In the end, I've decided that this section of the project is better left to the lexical analyzer, since I don't want to create a pattern that matches each line (and identifies continuation records). Part of the reason is that I want the lexical parser to have knowledge of the contents of each line, without incorporating its own tokenizing logic. That is to say, if I match ^SUBCODE[ ][ ].{71}\r\n (all records are blocked in 80-character records) I would not be able to harness the power of flex to tokenize the structured data residing in .{71}.
Given these constraints, I'm thinking about doing the following:
Entering a CODE1 state from the <INITIAL> start condition results
in calls to:
yy_push_state(CODE_STATE)
yy_push_state(CODE_CODE1_STATE)
(do something with the contents of the CODE1 state identifier, if such contents exist)
yy_push_state(SUBCODE_STATE) (to tell the analyzer to expect SUBCODE states belonging to the CODE_CODE1_STATE. This is where the analyzer begins to masquerade as a parser.
The <SUBCODE1_STATE> start condition is nested as follows: <CODE_STATE>{ <CODE_CODE1_STATE> { <SUBCODE_STATE>{ <SUBCODE1_STATE> { (perform actions based on the matching patterns) } } }. It also sets the global previous_state variable to yy_top_state(), to wit SUBCODE1_STATE.
Within <SUBCODE1_STATE>'s scope, \r\n will call yy_pop_state(). If a continuation record is present (which is a pattern at the highest scope against which all text is matched), yy_push_state(continuation_record_states[previous_state]) is called, bringing us back to the scope in 2. continuation_record_states[] maps each state with its continuation record state, which is used by the parser.
As you can see, this is quite complicated, which leads me to conclude that I'm massively over-complicating the task.
Questions
For states lacking an extremely clear token signifying the end of its scope, is my proposed solution acceptable?
Given that I want to tokenize the input using flex, is there any way to do so without start conditions?
The biggest problem I'm having is that each record (beginning with the (SUB)CODE prefix) is unique, but the information appearing after the (SUB)CODE prefix is not. Therefore, it almost appears mandatory to have multiple states like this, and the abstract CODE_STATE and SUBCODE_STATE states would act as groupings for each of the concrete SUBCODE[0-9]+_STATE and CODE[0-9]+_STATE states.
I would look at how the OMeta parser handles these things.

Resources