Representing statement-terminating newlines in a grammar? - parsing

A lot of programming languages have statements terminated by line-endings. Usually, though, line endings are allowed in the middle of a statement if the parser can't make sense of the line; for example,
a = 3 +
4
...will be parsed in Ruby and Python* as the statement a = 3+4, since a = 3+ doesn't make any sense. In other words, the newline is ignored since it leads to a parsing error.
My question is: how can I simply/elegantly accomplish that same behavior with a tokenizer and parser? I'm using Lemon as a parser generator, if it makes any difference (though I'm also tagging this question as yacc since I'm sure the solution applies equally to both programs).
Here's how I'm doing it now: allow a statement terminator to occur optionally in any case where there wouldn't be syntactic ambiguity. In other words, something like
expression ::= identifier PLUS identifier statement_terminator.
expression ::= identifier PLUS statement_terminator identifier statement_terminator.
... in other words, it's ok to use a newline after the plus because that won't have any effect on the ambiguity of the grammar. My worry is that this would balloon the size of the grammar and I have a lot of opportunities to miss cases or introduce subtle bugs in the grammar. Is there an easier way to do this?
EDIT*: Actually, that code example won't work for Python. Python does in fact ignore the newline if you pass in something like this, though:
print (1, 2,
3)

You could probably make a parser generator get this right, but it would probably require modifying the parser generator's skeleton.
There are three plausible algorithms I know of; none is perfect.
Insert an explicit statement terminator at the end of the line if:
a. the previous token wasn't a statement terminator, and
b. it would be possible to shift the statement terminator.
Insert an explicit statement terminator prior to an unshiftable token (the "offending token", in Ecmascript speak) if:
a. the offending token is at the beginning of a line, or is a } or is the end-of-input token, and
b. shifting a statement terminator will not cause a reduction by the empty-statement production. [1]
Make an inventory of all token pairs. For every token pair, decide whether it is appropriate to replace a line-end with a statement terminator. You might be able to compute this table by using one of the above algorithms.
Algorithm 3 is the easiest to implement, but the hardest to work out. And you may need to adjust the table every time you modify the grammar, which will considerably increase the difficulty of modifying the grammar. If you can compute the table of token pairs, then inserting statement terminators can be handled by the lexer. (If your grammar is an operator precedence grammar, then you can insert a statement terminator between any pair of tokens which do not have a precedence relationship. However, even then you may wish to make some adjustments for restricted contexts.)
Algorithms 1 and 2 can be implemented in the parser if you can query the parser about the shiftability of a token without destroying the context. Recent versions of bison allow you to specify what they call "LAC" (LookAhead Correction), which involves doing just that. Conceptually, the parser stack is copied and the parser attempts to handle a token; if the token is eventually shifted, possibly after some number of reductions, without triggering an error production, then the token is part of the valid lookahead. I haven't looked at the implementation, but it's clear that it's not actually necessary to copy the stack to compute shiftability. Regardless, you'd have to reverse-engineer the facility into Lemon if you wanted to use it, which would be an interesting exercise, probably not too difficult. (You'd also need to modify the bison skeleton to do this, but it might be easier starting with the LAC implementation. LAC is currently only used by bison to generate better error messages, but it does involve testing shiftability of every token.)
One thing to watch out for, in all of the above algorithms, is statements which may start with parenthesized expressions. Ecmascript, in particular, gets this wrong (IMHO). The Ecmascript example, straight out of the report:
a = b + c
(d + e).print()
Ecmascript will parse this as a single statement, because c(d + e) is a syntactically valid function call. Consequently, ( is not an offending token, because it can be shifted. It's pretty unlikely that the programmer intended that, though, and no error will be produced until the code is executed, if it is executed.
Note that Algorithm 1 would have inserted a statement terminator at the end of the first line, but similarly would not flag the ambiguity. That's more likely to be what the programmer intended, but the unflagged ambiguity is still annoying.
Lua 5.1 would treat the above example as an error, because it does not allow new lines in between the function object and the ( in a call expression. However, Lua 5.2 behaves like Ecmascript.
Another classical ambiguity is return (and possibly other statements) which have an optional expression. In Ecmascript, return <expr> is a restricted production; a newline is not permitted between the keyword and the expression, so a return at the end of a line has a semicolon automatically inserted. In Lua, it's not ambiguous because a return statement cannot be followed by another statement.
Notes:
Ecmascript also requires that the statement terminator token be parsed as a statement terminator, although it doesn't quite say that; it does not allow the semicolons in the iterator clause of a for statement to be inserted automatically. Its algorithm also includes mandatory semicolon insertion in two context: after a return/throw/continue/break token which appears at the end of a line, and before a ++/-- token which appears at the beginning of a line.

Related

How would you re-write the production rules for an existing grammar so that semi-colons were optional?

Suppose that there was a programming language Mod(C) just like C++ except that it was white-space sensitive.
That is, parsers and compilers written for Mod(C) did not ignore line-feeds, spaces, etc...
Also suppose that someone had already written down the production rules for describing a formal grammar for this modified version of C++
My question is, how would you modify the production rules so that semi-colons were optional in the event that the semi-colon was followed by a line-break?
Actually, the semi-colon would optional if some <optional_semi_colon> token is followed by:
one or more spaces and tabs
zero or one line-comments
a line-break (\n\r or \r\n or \n or \r)
The following piece of code would compile just fine:
#include <iostream>
using namespace std;
int main() {
for (int i = 1; i <= 5; ++i) {
cout << i << " "; // there is a semi-colon here. that's okay
}
return 0 // no semi-colon on this line
}
It's not at all clear what you mean by "white-space sensitive" and your example doesn't show any white-space sensitivity other than the possible optionality of semicolons at the end of a line. That is, you don't seem to be looking for an implementation of the off-side rule, as with Python or Haskell, where indentation indicates block structure. [Note 1]
Presumably, you are not just asking how to turn any newline into a semicolon, since that would be trivial. So I assume that you want something like JavaScript's automatic semicolon insertion (ASI), which automatically inserts a semicolon at the end of a line if:
the line doesn't already end with a newline, and
the current parse cannot be extended with the first token of the next line, either because
2.1. there is no item in the current state's itemset which would allow the next token to be shifted, or
2.2. the items which might allow a shift have been marked as not allowing a newline.
[Note 2]
Provision 2.1 prevents the incorrect semicolon insertion in:
let x = a
+ b
On the other hand, there are cases when you really want the newline to end the statement, in order to avoid silent bugs, such as:
yield
/* The above yield always produces undefined. */
console.log("We've been resumed");
yield -- and other statements with optional operands -- are annotated in the grammar with [No LineTerminator here], which triggers provision 2.2 and thus allows ASI even though the next token (console in the above example) could otherwise have extended the parse. Another context where newlines are banned is between a value and the postfix ++ operator, so that
let v = 2 * a
++v
is accepted with the (presumably) intended meaning. (Without ASI, there would be a syntax error after the ++, where ASI is not allowed because there's no newline character.)
JavaScript is not the only language with optional command terminators, but it's probably the language with the most elaborate set of rules controlling the parse. Other languages include:
Python, which in addition to the off-side rule, ignores semicolons inside parentheses, braces and brackets;
Awk, which treats newlines as statement terminators unless the newline follows one of the tokens ,, {, ?, :, ||, &&, do, or else. [Note 3]
Bash, in which a newline is a command terminator except in specific contexts, such as after a |, || or && operator or a keyword like do, then and else, and inside an array literal or an arithmetic expansion. [Note 4]
Kotlin, whose rules I don't know. And undoubtedly other languages as well.
JavaScript (and, I believe, Kotlin) suffer from an ambiguity with function calls, because
let a = b
((x)=>console.log(x))(42)
is parsed as calling b with the argument (x)=>console.log(x) and then calling the result with the argument 42. (Which is a runtime error, because the result of console.log is undefined, which cannot be called.)
There are also languages like Lua, in which semicolons are always optional, even if you run statements together on a single line (a = 3 b = 4), and which therefore also suffers from the misinterpretation of function call expressions. However, unlike JavaScript, Lua requires that the ( be on the same line as the function expression, and therefore flags the equivalent of the above example as a syntax error. (The check is not part of the grammar, for what it's worth: After the function call is parsed, a semantic check is performed to verify that the line number of the ( token is the same as the line number of the last token in the function expression.)
I went to the trouble of enumerating all of the above examples by way of illustrating the fact that optional semicolons are not a simple grammatical transformation, and that there is no simple rule which can determine the precise circumstances in which a newline is an implicit statement terminator. Realistic implementations of the feature are non-trivial, and differ in their details; the algorithm chosen needs to be tested against a variety of realistic code samples, and it needs to be carefully documented so that programmers using the language don't find themselves surprised by the results. If you get it wrong, but your language nonetheless becomes popular, you'll find projects with style guides which require semicolons even in context in which they were optional. None of that is intended to imply that you shouldn't pursue the idea; only that it is perhaps more complicated than it looks at first glance.
Having said all that, I don't believe that any of the above examples require a context-sensitive grammar (unless you want to implement the off-side rule). Even in the case of JavaScript, possibly with some minor exceptions, a parser can be created by starting with an LALR automaton and then adjusting the transition rules, state by state, in order to either ignore a newline token or reduce it to a statement terminator (as well as implementing the lookahead restrictions in certain rules). Most of these modifications will effectively be simply the deletion of one conflicting parser actions, similar to the operator-precedence-based resolution of ambiguous expression grammars. (And it's worth noting that most parser generators make no attempt to rewrite the original grammar after processing of the precedence declarations.)
However, while the existence of a PDA demonstrates the existence of a context-free grammar (at least for a context-free superset of the target language [Note 5]), it does not demonstrate the existence of a simple or elegant grammar. It seems to me likely that recreating a grammar from the modified PDA will produce a bloated monster without much value as a discursive tool. The modified PDA itself is sufficient to perform the parse, so reconstructing a grammar is not of much practical value.
Notes
That's perhaps just as well, because the off-side rule is not context-free, and thus cannot be implemented with a context free grammar. Although there are well-known techniques for implementing in a lexical scanner.
That's a slight oversimplification of the ASI rules. In some cases, JavaScript also allows a semicolon to be inserted before a }, even if there is no newline at that point. But that's not a significant complication.
? and : are a Gnu AWK extension.
That's not a complete list, by any means. See the Posix shell grammar and the bash manual for more details.
While the published grammars for the languages mentioned above, like the grammars for C and C++, are nominally context-free grammars, they do not actually encompass the entirety of the well-formedness rules in the respective language standards and/or manuals, which include constraints (or "early errors", in the terms of ECMA-262), "that can be detected and reported prior to the evaluation of any construct". Many of these rules are clearly context-sensitive (such as prohibitions on multiple definitions of the same name in a lexical scope).
Context-sensitive parsing is not necessarily a bad thing. Sometimes it's a lot simpler than trying to achieve the same result with a context-free grammar (as in the case of Lua function calls mentioned above). But it's certainly convenient to parse as much as possible using a generated parser, since such a parser can more easily be repurposed for other applications, such as linters, code browsers, syntax highlighters, and so on, not all of which need to be as precise as a compiler.

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to keep this as accurate to as I can), there are situations where spaces cannot be inserted, for example "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace not being important everywhere else? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would simply check that the starting position of each token (available as p.lexpos(i) where p is the action function's parameter and i is the index of the token the the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a big of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call. (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.

ANTLR4 - Parse subset of a language (e.g. just query statements)

I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind however, that subrules usually are not termined with the EOF token (as the top level rule should be). This will cause no syntax error if more than the subelement is in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you then add a copy of the subrule you wanna parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)

How to handle assignment and variable syntax in an interpreter

Most interpreters let you type the following at their console:
>> a = 2
>> a+3
5
>>
My question is what mechanisms are usually used to handle this syntax? Somehow the parser is able to distinguish between an assignment and an expression even though they could both start with a digit or letter. It's only when we retrieve the second token that you know if you have an assignment or not. In the past, I've looked ahead two tokens and if the second token isn't an equals I push the tokens back into the lexical stream and assume it's an expression. I suppose one could treat the assignment as an expression which I think some languages do. I thought of using left-factoring but I can't see it working.
eg
assignment = variable A
A = '=' expression | empty
Update I found this question on StackOverflow which address the same question: How to modify parsing grammar to allow assignment and non-assignment statements?
From how you're describing your approach - doing a few tokens of lookahead to decide how to handle things - it sounds like you're trying to write some sort of top-down parser along the lines of an LL(1) or an LL(2) parser, and you're trying to immediately decide whether the expression you're parsing is a variable assignment or an arithmetical expression. There are several ways that you could parse expressions like these quite naturally, and they essentially involve weakening one of those two assumptions.
The first way we could do this would be to switch from using a top-down parser like an LL(1) or LL(2) parser to something else like an LR(0) or SLR(1) parser. Those parsers work bottom-up by reading larger prefixes of the input string before deciding what they're looking at. In your case, a bottom-up parser might work by seeing the variable and thinking "okay, I'm either going to be reading an expression to print or an assignment statement, but with what I've seen so far I can't commit to either," then scanning more tokens to see what comes next. If they see an equals sign, great! It's an assignment statement. If they see something else, great! It's not. The nice part about this is that if you're using a standard bottom-up parsing algorithm like LR(0), SLR(1), LALR(1), or LR(1), you should probably find that the parser generally handles these sorts of issues quite well and no special-casing logic is necessary.
The other option would be to parse the entire expression assuming that = is a legitimate binary operator like any other operation, and then check afterwards whether what you parsed is a legal assignment statement or not. For example, if you use Dijkstra's shunting-yard algorithm to do the parsing, you can recover a parse tree for the overall expression, regardless of whether it's an arithmetical expression or an assignment. You could then walk the parse tree to ask questions like
if the top-level operation is an assignment, is the left-hand side a single variable?
if the top-level operation isn't an assignment, are there nested assignment statements buried in here that we need to get rid of?
In other words, you'd parse a broader class of statements than just the ones that are legal, and then do a postprocessing step to toss out anything that isn't valid.

Is it logical for several parser reduction functions to get called along the way of reducing one expression?

I wrote a parser that analyzes code and reduces it as would by lex and yacc (pretty much.)
I am wondering about one aspect of the matter. If I have a set of rules such as the following:
unary: IDENTIFIER
| IDENTIFIER '(' expr_list ')'
The first rule with just an IDENTIFIER can be reduced as soon as an identifier is found. However, the second rule can only be reduced if the input also includes a valid list of expressions written between parenthesis.
How is a parser expected to work in a case like this one?
If I reduce the first identifier immediately, I can keep the result and throw it away if I realize that the second rule does match. If the second rule does not match, then I can return the result of the early reduction.
This also means that both reduction functions are going to be called if the second rule applies.
Are we instead expected to hold on the early reduction and apply it only if the second, longer rule applies?
For those interested, I put a more complete version of my parser grammar in this answer: https://codereview.stackexchange.com/questions/41769/numeric-expression-parser-calculator/41786#41786
Bottom-up parsers (like bison and yacc) do not reduce until they reach the end of the production. They don't have to guess which reduction they will use until they need it. So it's really not an issue to have two productions with the same prefix. In this sense, the algorithm is radically different from a top-down algorithm, used in recursive descent parsing, for example.
In order for the fragment which you provide to be parseable by an LALR(1) parser-generator -- that is, a bottom-up parser with the ability to examine only one (1) token beyond the end of a production -- the grammar must be such that there is no place in which unary might be followed by (. As long as that is true, the fact that the parser can see a ( is sufficient to prevent the unit reduction unary: IDENTIFIER from happening in a context in which it should reduce with the other unary production.
(That's a bit of an oversimplification, but I don't think that it would be correct to reproduce a standard text on LALR parsing here on SO.)

Resources