Context dependent lexer - flex-lexer

I see the following in bash's parse.y. This means that the lexical analysis will be context dependent. How can I use flex to do this kind of context-dependent analysis? Will this kind of context-dependent requirement make the flex code too messy? Thanks.
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n3006
/* Handle special cases of token recognition:
IN is recognized if the last token was WORD and the token
before that was FOR or CASE or SELECT.
DO is recognized if the last token was WORD and the token
before that was FOR or SELECT.
ESAC is recognized if the last token caused `esacs_needed_count'
to be set
`{' is recognized if the last token was WORD and the token
before that was FUNCTION, or if we just parsed an arithmetic
`for' command.
`}' is recognized if there is an unclosed `{' present.
`-p' is returned as TIMEOPT if the last read token was TIME.
`--' is returned as TIMEIGN if the last read token was TIMEOPT.
']]' is returned as COND_END if the parser is currently parsing
a conditional expression ((parser_state & PST_CONDEXPR) != 0)
`time' is returned as TIME if and only if it is immediately
preceded by one of `;', `\n', `||', `&&', or `&'.
*/

(F)lex provides start conditions to allow for context-dependent lexical analysis.
If you avoid the temptation to reproduce the parsing logic as a hand-written state machine in the lexical scanner, then start conditions can certainly simplify the implementation of context-dependent scanners.
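For instance, here is a minimal flex sketch (an illustration only, not bash's actual scanner) in which ]] is returned as COND_END only while the scanner is in an exclusive start condition entered at [[; the token names and the parser.tab.h header are assumptions:
%{
/* Minimal sketch, not bash's real scanner.  Token codes such as
   COND_START, COND_END and WORD are assumed to come from a
   bison-generated header. */
#include "parser.tab.h"
%}
%option noyywrap
%x COND
%%
"[["                    { BEGIN(COND); return COND_START; }
<COND>"]]"              { BEGIN(INITIAL); return COND_END; }
<COND>[^] \t\n]+        { return WORD; }   /* inside [[ ... ]], almost anything is a WORD */
<COND>.                 { return yytext[0]; }
[A-Za-z_][A-Za-z0-9_]*  { return WORD; }
<*>[ \t\n]+             ;                  /* skip whitespace in every start condition */
.                       { return yytext[0]; }
%%
/* Parser actions can also switch the context themselves through
   helpers like these. */
void begin_cond(void) { BEGIN(COND); }
void end_cond(void)   { BEGIN(INITIAL); }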
For the particular application of conditionally-recognised keywords -- often called "semi-reserved words" -- context-dependent lexical analysis is often not the best solution. Instead, consider writing the scanner to always recognise the keywords and then add rules in the grammar to treat the words as identifiers in contexts in which the keyword is not possible. See this answer for an example.
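In outline, the grammar-level approach might look like this hypothetical yacc/bison fragment (not bash's actual grammar): the scanner always returns IN, DO and ESAC as keyword tokens, and the grammar lets them serve as ordinary words where the keyword reading is impossible. Depending on the rest of the grammar, some of these alternatives can introduce conflicts which have to be resolved; that is exactly the trade-off being weighed here.
/* Wherever an ordinary word is expected, the semi-reserved keywords
   are also accepted.  WORD, IN and ESAC are assumed token names. */
word
    : WORD
    | IN
    | ESAC
    ;

simple_command
    : word
    | simple_command word
    ;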

Related

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to stay as faithful to as I can), there are situations where spaces cannot be inserted: for example, "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace everywhere else, where it isn't important? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A validation function would simply check that the starting position of each token (available as p.lexpos(i), where p is the action function's parameter and i is the index of the token in the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
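As a rough PLY sketch of that validation (the rule name is made up, the DOT and IDENTIFIER token names are assumptions, and it assumes token values are the matched lexemes so that len(p[i]) is the token's length; a real grammar would report the error through its normal error channel):
def p_member_selector(p):
    '''member_selector : DOT IDENTIFIER'''
    # The identifier must begin exactly where the '.' ends.
    if p.lexpos(2) != p.lexpos(1) + len(p[1]):
        raise SyntaxError("whitespace is not allowed after '.' here")
    p[0] = p[2]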
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a bit of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
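In PLY, that single-token approach is just one more token rule (a sketch; the SELECTOR token name is illustrative). PLY tries function-defined token rules before string-defined ones, so this rule wins over a bare '.' token:
def t_SELECTOR(t):
    r'\.[a-zA-Z_][a-zA-Z0-9_]*'
    t.value = t.value[1:]   # keep only the member name, dropping the '.'
    return t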
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or in distinguishing the use of | as a boolean operator from its use as an absolute value expression (uncommon, but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even in distinguishing the use of (...) for grouping expressions from its use in a function call (awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
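Here is that test written out as a PLY sketch (the helper names and the TIGHT_DOT token are inventions for illustration; TIGHT_DOT would also have to be listed in the tokens tuple):
def preceded_by_space(t):
    return t.lexpos == 0 or t.lexer.lexdata[t.lexpos - 1].isspace()

def followed_by_space(t):
    return (t.lexer.lexpos == len(t.lexer.lexdata)
            or t.lexer.lexdata[t.lexer.lexpos].isspace())

def t_DOT(t):
    r'\.'
    # A '.' with no whitespace on either side becomes a distinct token type.
    if not preceded_by_space(t) and not followed_by_space(t):
        t.type = 'TIGHT_DOT'
    return t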
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.

ANTLR4 - Parse subset of a language (e.g. just query statements)

I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
    : queryStatement
    | undefinedStatement
    ;
// ...
undefinedStatement
    : (.)+?
    ;
// ...
UndefinedToken
    : (.)+?
    ;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind, however, that subrules usually are not terminated with the EOF token (as the top level rule should be). This means no syntax error will be reported if there is more in the token stream than the subelement (the parser just stops when the subrule has matched completely). If that's a problem for you, then add a copy of the subrule you want to parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be a sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)
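As a sketch of that last suggestion (assuming the lexer defines a SEMI token and tokenises the whole input), the catch-all parser rule can consume any tokens up to the terminating semicolon:
undefinedStatement
    : ( ~SEMI )+ SEMI   // any tokens other than ';', ended by ';'
    ;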

Designing a Language Lexer

I'm currently in the process of creating a programming language. I've laid out my entire design and am in progress of creating the Lexer for it. I have created numerous lexers and lexer generators in the past, but have never come to adopt the "standard", if one exists.
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
Because the way I design mine, they look like the following:
Code:
int main() {
    printf("Hello, World!");
}
Lexer:
[
KEYWORD:INT, IDENTIFIER:"main", LEFT_ROUND_BRACKET, RIGHT_ROUND_BRACKET, LEFT_CURLY_BRACKET,
IDENTIFIER:"printf", LEFT_ROUND_BRACKET, STRING:"Hello, World!", RIGHT_ROUND_BRACKET, SEMICOLON,
RIGHT_CURLY_BRACKET
]
Is this the way lexers should be made? Also, as a side-note, what should my next step be after creating a lexer? I don't really want to use something such as ANTLR or Lex+Yacc or Flex+Bison, etc. I'm doing it from scratch.
If you don't want to use a parser generator [Note 1], then it is absolutely up to you how your lexer provides information to your parser.
Even if you do use a parser generator, there are many details which are going to be project-dependent. Sometimes it is convenient for the lexer to call the parser with each token; other times it is easier if the parser calls the lexer; in some cases, you'll want to have a driver which interacts separately with each component. And clearly, the precise datatype(s) of your tokens will vary from project to project, which can have an impact on how you communicate as well.
Personally, I would avoid use of global variables (as in the original yacc/lex protocol), but that's a general style issue.
Most lexers work in streaming mode, rather than tokenizing the entire input and then handing the vector of tokens to some higher power. Tokenizing one token at a time has a number of advantages, particularly if the tokenization is context-dependent, and, let's face it, almost all languages have some impurity somewhere in their syntax. But, again, that's entirely up to you.
Good luck with your project.
Notes:
1. Do you also forgo the use of compilers and write all your code from scratch in assembler or even binary?
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
In the lexers I've looked at, the canonical API is pretty minimal. It's basically:
Token readNextToken();
The lexer maintains a reference to the source text and its internal pointers into where it is currently looking. Then, every time you call that, it scans and returns the next token.
The Token type usually has:
A "type" enum for which kind of token it is: string, operator, identifier, etc. There are usually special kinds for "EOF", meaning a special terminator token that is produced after the end of the input, and "ERROR" for the rare cases where a syntax error comes from the lexical grammar. This is mainly just unterminated string literals or totally unknown characters in the source.
The source text of the token.
Sometimes literals are converted to their proper value representation during lexing in which case you'll have that value too. So a number token would have "123" as text but also have the numeric value 123. Or you can do that during parsing/compilation.
Location within the source file of the token. This is for error reporting. Usually 1-based line and column, but can also just be start and end byte offsets. The latter is a little faster to produce and can be converted to line and column lazily if needed.
Depending on your grammar, you may need to be able to rewind the lexer too.
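Putting that together as an interface sketch in C (names and fields are illustrative, not a standard; the body of lexer_next is the actual scanner and is omitted here):
#include <stddef.h>

/* Which kind of token was scanned. */
typedef enum {
    TOK_IDENTIFIER, TOK_KEYWORD, TOK_STRING, TOK_NUMBER, TOK_OPERATOR,
    TOK_EOF,    /* produced once the input is exhausted */
    TOK_ERROR   /* unterminated string, unknown character, ... */
} TokenType;

typedef struct {
    TokenType   type;
    const char *start;   /* points into the source text */
    size_t      length;  /* length of the lexeme */
    int         line;    /* 1-based, for error reporting */
    int         column;
} Token;

typedef struct {
    const char *source;  /* the whole input, NUL-terminated */
    const char *pos;     /* next character to examine */
    int         line;
    int         column;
} Lexer;

/* Scan and return the next token; implemented by the actual scanner. */
Token lexer_next(Lexer *lx);

/* Lookahead (or a one-token rewind) can be as simple as scanning a
   copy of the small lexer state and throwing the copy away. */
Token lexer_peek(Lexer lx) { return lexer_next(&lx); }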

How to write parser which handles import statements?

I'm using lex & yacc to write a VHDL parser. VHDL has some languages features which make it context sensitive in a manner similar to C. For example, typedef-like constructs which impact whether the parser should tokenize something as an IDENTIFIER vs. TYPEDEF_NAME.
The difficulty comes in when you need to build a symbol table based on another file which is referenced by "use" statements (similar to "import" in Java or Python).
library ieee;
use ieee.std_logic_1164.all;
-- code which uses something defined in ieee.std_logic_1164 package
In C, this is fairly straight-forward because the preprocessor has already combined all of the header files into a single translation unit which can be scanned from top to bottom. But 'use' statements in VHDL are not preprocessor commands.
So, somehow, as I'm parsing the file, I have to recognize when I see a use statement and then go off and parse the relevant file, and then continue parsing the original file with that symbol table.
Is there an elegant way to do this with lex/yacc? I know there is yyrestart but I'm not sure if that's going down the right track.
If you are using flex, then it is pretty easy.
The basic mechanism (including two functioning code samples) is described in the "Multiple Input Buffers" chapter of the flex manual. You can also take a glance at this question on SO.
The parser (yacc/bison) reduction which recognizes the use construction can include the code which calls yypush_buffer_state. In the example code, the end of the included file is recognized by the scanner (lex/flex), which simply pops the buffer stack.
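A hedged sketch of the mechanism, assuming a conventional bison/flex pair (open_use_file() and push_include() are hypothetical helpers; the assert refers to the lookahead caveat discussed below, and <assert.h>/<stdio.h> would need to be included in the prologue):
/* parser (.y) fragment */
use_clause
    : USE selected_name ';'
        {
            assert(yychar == YYEMPTY);     /* no lookahead has been read yet */
            FILE *f = open_use_file($2);   /* map the name to a file, somehow */
            if (f != NULL)
                push_include(f);           /* defined in the scanner, below */
        }
    ;

/* scanner (.l) fragments */
%%
<<EOF>>     {
                yypop_buffer_state();
                if (!YY_CURRENT_BUFFER)
                    yyterminate();         /* no more buffers: real end of input */
            }
%%
void push_include(FILE *f) {
    yypush_buffer_state(yy_create_buffer(f, YY_BUF_SIZE));
}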
Depending on the formal rules of file inclusion, you might want the parser to know that the included file has finished, in order to avoid having syntactic constructs which start in the included file and continue in the includer. (C allows this, even though it is almost always an error; I don't know anything about VHDL, but there are definitely languages which do not allow it.) One possibility is to recursively call the parser in order to parse the included file, which will require a re-entrant ("pure") parser. In that case, the scanner should return an end-of-included-file token when it hits the end of the included file, because your included file grammar production will need to be terminated with such a token.
You may need to worry about the possibility that the parser has already requested the next input token. Most LALR(1) grammars do not depend on the lookahead token for semi-colon terminated statements, and bison usually doesn't request a lookahead token in a context in which it doesn't need it. But that behaviour is not guaranteed by all Posix-compatible yacc implementations and you might be using one which doesn't.
In that case, you would have to preserve the lookahead token so that you can reread it after the included file has been parsed. That would most conveniently be done by stashing the lookahead token somewhere the scanner can see it, and having the scanner return that token (if set) when it sees the end of the included file. In a bison action, you can find the lookahead token in yychar and its semantic value and location (if locations are enabled) are in yylval and yylloc. If bison has not read the lookahead token, the value of yychar will be YYEMPTY, and the simplest possible bison implementation would assert(yychar == YYEMPTY) when it is about to push the input buffer. If the assert fails, you'll need to implement a more sophisticated strategy.

Representing statement-terminating newlines in a grammar?

A lot of programming languages have statements terminated by line-endings. Usually, though, line endings are allowed in the middle of a statement if the parser can't make sense of the line; for example,
a = 3 +
4
...will be parsed in Ruby and Python* as the statement a = 3+4, since a = 3+ doesn't make any sense. In other words, the newline is ignored since it leads to a parsing error.
My question is: how can I simply/elegantly accomplish that same behavior with a tokenizer and parser? I'm using Lemon as a parser generator, if it makes any difference (though I'm also tagging this question as yacc since I'm sure the solution applies equally to both programs).
Here's how I'm doing it now: allow a statement terminator to occur optionally in any case where there wouldn't be syntactic ambiguity. In other words, something like
expression ::= identifier PLUS identifier statement_terminator.
expression ::= identifier PLUS statement_terminator identifier statement_terminator.
... in other words, it's ok to use a newline after the plus because that won't have any effect on the ambiguity of the grammar. My worry is that this would balloon the size of the grammar and give me a lot of opportunities to miss cases or introduce subtle bugs in the grammar. Is there an easier way to do this?
EDIT*: Actually, that code example won't work for Python. Python does in fact ignore the newline if you pass in something like this, though:
print (1, 2,
3)
You could probably make a parser generator get this right, but it would probably require modifying the parser generator's skeleton.
There are three plausible algorithms I know of; none is perfect.
1. Insert an explicit statement terminator at the end of the line if:
a. the previous token wasn't a statement terminator, and
b. it would be possible to shift the statement terminator.
2. Insert an explicit statement terminator prior to an unshiftable token (the "offending token", in Ecmascript speak) if:
a. the offending token is at the beginning of a line, or is a } or is the end-of-input token, and
b. shifting a statement terminator will not cause a reduction by the empty-statement production. [1]
3. Make an inventory of all token pairs. For every token pair, decide whether it is appropriate to replace a line-end with a statement terminator. You might be able to compute this table by using one of the above algorithms.
Algorithm 3 is the easiest to implement, but the hardest to work out. And you may need to adjust the table every time you modify the grammar, which will considerably increase the difficulty of modifying the grammar. If you can compute the table of token pairs, then inserting statement terminators can be handled by the lexer. (If your grammar is an operator precedence grammar, then you can insert a statement terminator between any pair of tokens which do not have a precedence relationship. However, even then you may wish to make some adjustments for restricted contexts.)
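A sketch of algorithm 3 as a wrapper around the real scanner (next_raw_token() and can_insert_terminator() are hypothetical, and NEWLINE and TERMINATOR stand for whatever token codes the grammar defines). It is written in the yacc calling convention for concreteness; with Lemon, the same logic would sit in the driver loop that feeds Parse():
/* The real scanner, which also returns NEWLINE tokens. */
int next_raw_token(void);
/* The token-pair table: should a terminator be inserted between these? */
int can_insert_terminator(int before, int after);

static int pending  = 0;   /* token buffered while deciding about a newline */
static int previous = 0;   /* last token actually delivered to the parser */

int yylex(void) {
    int tok;
    if (pending) { tok = pending; pending = 0; }
    else         { tok = next_raw_token(); }

    while (tok == NEWLINE) {
        int next = next_raw_token();
        if (can_insert_terminator(previous, next)) {
            pending = next;              /* deliver it on the next call */
            previous = TERMINATOR;
            return TERMINATOR;
        }
        tok = next;                      /* otherwise the newline is simply dropped */
    }
    previous = tok;
    return tok;
}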
Algorithms 1 and 2 can be implemented in the parser if you can query the parser about the shiftability of a token without destroying the context. Recent versions of bison allow you to specify what they call "LAC" (LookAhead Correction), which involves doing just that. Conceptually, the parser stack is copied and the parser attempts to handle a token; if the token is eventually shifted, possibly after some number of reductions, without triggering an error production, then the token is part of the valid lookahead. I haven't looked at the implementation, but it's clear that it's not actually necessary to copy the stack to compute shiftability. Regardless, you'd have to reverse-engineer the facility into Lemon if you wanted to use it, which would be an interesting exercise, probably not too difficult. (You'd also need to modify the bison skeleton to do this, but it might be easier starting with the LAC implementation. LAC is currently only used by bison to generate better error messages, but it does involve testing shiftability of every token.)
One thing to watch out for, in all of the above algorithms, is statements which may start with parenthesized expressions. Ecmascript, in particular, gets this wrong (IMHO). The Ecmascript example, straight out of the report:
a = b + c
(d + e).print()
Ecmascript will parse this as a single statement, because c(d + e) is a syntactically valid function call. Consequently, ( is not an offending token, because it can be shifted. It's pretty unlikely that the programmer intended that, though, and no error will be produced until the code is executed, if it is executed.
Note that Algorithm 1 would have inserted a statement terminator at the end of the first line, but similarly would not flag the ambiguity. That's more likely to be what the programmer intended, but the unflagged ambiguity is still annoying.
Lua 5.1 would treat the above example as an error, because it does not allow new lines in between the function object and the ( in a call expression. However, Lua 5.2 behaves like Ecmascript.
Another classical ambiguity is return (and possibly other statements) which have an optional expression. In Ecmascript, return <expr> is a restricted production; a newline is not permitted between the keyword and the expression, so a return at the end of a line has a semicolon automatically inserted. In Lua, it's not ambiguous because a return statement cannot be followed by another statement.
Notes:
1. Ecmascript also requires that the statement terminator token be parsed as a statement terminator, although it doesn't quite say that; it does not allow the semicolons in the iterator clause of a for statement to be inserted automatically. Its algorithm also includes mandatory semicolon insertion in two contexts: after a return/throw/continue/break token which appears at the end of a line, and before a ++/-- token which appears at the beginning of a line.
