How is newline handled in whitespace sensitive languages? - parsing

I studied and built a toy java compiler recently, which discards all comment tokens and whitespace tokens before parsing stage. However, I am curious about how whitespace sensitive languages, such as Python and Swift, handle newlines. Those languages terminate with newlines for statements, so the newline tokens cannot be simply discarded. But how do they handle the situation below?
foo(
bar
)
Do they have to make lots of grammar cases for it? Like foo ( bar ), foo NEWLINE ( bar NEWLINE ) etc.?
In the java compiler I built, this is handled by eliminating newline tokens and they all become foo ( bar ). But how is this handled in whitespace sensitive languages?

Since no one has answered, I did another round of googling and found this blog. According to it, Python maintains a stack for brackets ()[], and a newline is ignored if there are incomplete brackets. So this can be achieved:
foo(
bar
)
Also, for empty lines or lines with only comments, the newline is skipped too. Or when the newline is escaped, no newline token will be generated. This achieves
# comment only
# only spaces
foo(\ # escaped
bar\
)
Otherwise, the tokenizer will generate a NEWLINE token.
The blog also pointed me to the source code tokenizer.c, where I find the above logic at line 1660.
As for other languages, I am not sure, but I think the stack strategy is general enough to be adopted by all languages.

Related

How would you re-write the production rules for an existing grammar so that semi-colons were optional?

Suppose that there was a programming language Mod(C) just like C++ except that it was white-space sensitive.
That is, parsers and compilers written for Mod(C) did not ignore line-feeds, spaces, etc...
Also suppose that someone had already written down the production rules for describing a formal grammar for this modified version of C++
My question is, how would you modify the production rules so that semi-colons were optional in the event that the semi-colon was followed by a line-break?
Actually, the semi-colon would optional if some <optional_semi_colon> token is followed by:
one or more spaces and tabs
zero or one line-comments
a line-break (\n\r or \r\n or \n or \r)
The following piece of code would compile just fine:
#include <iostream>
using namespace std;
int main() {
for (int i = 1; i <= 5; ++i) {
cout << i << " "; // there is a semi-colon here. that's okay
}
return 0 // no semi-colon on this line
}
It's not at all clear what you mean by "white-space sensitive" and your example doesn't show any white-space sensitivity other than the possible optionality of semicolons at the end of a line. That is, you don't seem to be looking for an implementation of the off-side rule, as with Python or Haskell, where indentation indicates block structure. [Note 1]
Presumably, you are not just asking how to turn any newline into a semicolon, since that would be trivial. So I assume that you want something like JavaScript's automatic semicolon insertion (ASI), which automatically inserts a semicolon at the end of a line if:
the line doesn't already end with a newline, and
the current parse cannot be extended with the first token of the next line, either because
2.1. there is no item in the current state's itemset which would allow the next token to be shifted, or
2.2. the items which might allow a shift have been marked as not allowing a newline.
[Note 2]
Provision 2.1 prevents the incorrect semicolon insertion in:
let x = a
+ b
On the other hand, there are cases when you really want the newline to end the statement, in order to avoid silent bugs, such as:
yield
/* The above yield always produces undefined. */
console.log("We've been resumed");
yield -- and other statements with optional operands -- are annotated in the grammar with [No LineTerminator here], which triggers provision 2.2 and thus allows ASI even though the next token (console in the above example) could otherwise have extended the parse. Another context where newlines are banned is between a value and the postfix ++ operator, so that
let v = 2 * a
++v
is accepted with the (presumably) intended meaning. (Without ASI, there would be a syntax error after the ++, where ASI is not allowed because there's no newline character.)
JavaScript is not the only language with optional command terminators, but it's probably the language with the most elaborate set of rules controlling the parse. Other languages include:
Python, which in addition to the off-side rule, ignores semicolons inside parentheses, braces and brackets;
Awk, which treats newlines as statement terminators unless the newline follows one of the tokens ,, {, ?, :, ||, &&, do, or else. [Note 3]
Bash, in which a newline is a command terminator except in specific contexts, such as after a |, || or && operator or a keyword like do, then and else, and inside an array literal or an arithmetic expansion. [Note 4]
Kotlin, whose rules I don't know. And undoubtedly other languages as well.
JavaScript (and, I believe, Kotlin) suffer from an ambiguity with function calls, because
let a = b
((x)=>console.log(x))(42)
is parsed as calling b with the argument (x)=>console.log(x) and then calling the result with the argument 42. (Which is a runtime error, because the result of console.log is undefined, which cannot be called.)
There are also languages like Lua, in which semicolons are always optional, even if you run statements together on a single line (a = 3 b = 4), and which therefore also suffers from the misinterpretation of function call expressions. However, unlike JavaScript, Lua requires that the ( be on the same line as the function expression, and therefore flags the equivalent of the above example as a syntax error. (The check is not part of the grammar, for what it's worth: After the function call is parsed, a semantic check is performed to verify that the line number of the ( token is the same as the line number of the last token in the function expression.)
I went to the trouble of enumerating all of the above examples by way of illustrating the fact that optional semicolons are not a simple grammatical transformation, and that there is no simple rule which can determine the precise circumstances in which a newline is an implicit statement terminator. Realistic implementations of the feature are non-trivial, and differ in their details; the algorithm chosen needs to be tested against a variety of realistic code samples, and it needs to be carefully documented so that programmers using the language don't find themselves surprised by the results. If you get it wrong, but your language nonetheless becomes popular, you'll find projects with style guides which require semicolons even in context in which they were optional. None of that is intended to imply that you shouldn't pursue the idea; only that it is perhaps more complicated than it looks at first glance.
Having said all that, I don't believe that any of the above examples require a context-sensitive grammar (unless you want to implement the off-side rule). Even in the case of JavaScript, possibly with some minor exceptions, a parser can be created by starting with an LALR automaton and then adjusting the transition rules, state by state, in order to either ignore a newline token or reduce it to a statement terminator (as well as implementing the lookahead restrictions in certain rules). Most of these modifications will effectively be simply the deletion of one conflicting parser actions, similar to the operator-precedence-based resolution of ambiguous expression grammars. (And it's worth noting that most parser generators make no attempt to rewrite the original grammar after processing of the precedence declarations.)
However, while the existence of a PDA demonstrates the existence of a context-free grammar (at least for a context-free superset of the target language [Note 5]), it does not demonstrate the existence of a simple or elegant grammar. It seems to me likely that recreating a grammar from the modified PDA will produce a bloated monster without much value as a discursive tool. The modified PDA itself is sufficient to perform the parse, so reconstructing a grammar is not of much practical value.
Notes
That's perhaps just as well, because the off-side rule is not context-free, and thus cannot be implemented with a context free grammar. Although there are well-known techniques for implementing in a lexical scanner.
That's a slight oversimplification of the ASI rules. In some cases, JavaScript also allows a semicolon to be inserted before a }, even if there is no newline at that point. But that's not a significant complication.
? and : are a Gnu AWK extension.
That's not a complete list, by any means. See the Posix shell grammar and the bash manual for more details.
While the published grammars for the languages mentioned above, like the grammars for C and C++, are nominally context-free grammars, they do not actually encompass the entirety of the well-formedness rules in the respective language standards and/or manuals, which include constraints (or "early errors", in the terms of ECMA-262), "that can be detected and reported prior to the evaluation of any construct". Many of these rules are clearly context-sensitive (such as prohibitions on multiple definitions of the same name in a lexical scope).
Context-sensitive parsing is not necessarily a bad thing. Sometimes it's a lot simpler than trying to achieve the same result with a context-free grammar (as in the case of Lua function calls mentioned above). But it's certainly convenient to parse as much as possible using a generated parser, since such a parser can more easily be repurposed for other applications, such as linters, code browsers, syntax highlighters, and so on, not all of which need to be as precise as a compiler.

Is there a way to do context sensitive parsing in tatsu

context sensitive '%' ..... eol comments
I'm starting with the grammar for PDF described here
https://github.com/caradoc-org/caradoc/blob/master/doc/grammar/grammar.pdf
which seems to lack the definition of eol comments.
PDF has end of line comments which start with the '%' character except inside string_literal (and another rule stream).
string_literal = "(" string_content ")";
where string_content can include the '%' character and also eol, but not "()" etc. The PDF language also has some special cases which otherwise look like comments eg
'%PDF-1.5' eol;
or
"%%EOF" [eol];
is there a way to handle the context sensitivity in a tatsu grammar?
I'll stay away from "Context Sensitive" in this answer, because the phrase has meaning in Language Theory.
PEG is perfectly capable of parsing a sub-language (say, Python string formatting expressions) within another language.
In fact, the original PEG definition does not use a tokenizer, because PEG grammars can parse the token sub-language.
If you think of sub-grammars, then the context is provided by the rule that knows that a sub-grammar has to be invoked.
With TatSu, there are features that allow tokenization to happen before the parsing (the Buffer class) for efficiency, and convenience, but using those features is not mandatory.
The only cases that cannot be handled easily as a grammar-within-a-grammar are preprocessing with macro capabilities, because those require an interpretation phase before the text for the inner grammar can be parsed.

Common Lisp lexer generator that allows state variables

Neither of the two main lexer generators commonly referenced, cl-lex and lispbuilder-lexer allow for state variables in the "action blocks", making it impossible to recognize a c-style multi-line comment, for example.
What is a lexer generator in Common Lisp that can recognize a c-style multi-line comment as a token?
Correction: This lexer actually needs to recognize nested, balanced multiline comments (not exactly C-style). So I can't do away with state-variables.
You can recognize a C-style multiline comment with the following regular expression:
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]
It should work with any library which uses Posix-compatible extended regex syntax; although a bit hard to read because * is extensively used both as an operator and as a literal character, it uses no non-regular features. It does rely on inverted character classes ([^*], for example) matching the newline character, but afaik that is pretty well universal, even for regex engines in which a wildcard does not match newline.

Which style of multiline comments used on Dart?

Which style of multiline comments used on Dart?
I know the C-style of the multiline comments. This style does not allow multiline comments inside multiline comments (nested comments).
That is the 'C' style comments end at the first */ encountered in multiline comments.
Examples:
Vaild C-style comment:
/*
*/
Not valid C-style comment:
/*
/**/
*/
In Dart both styles are valid but as I know in most popular languages ​​used only the C-style comments.
Here is my question.
From whence this style in Dart language? From a historical point of view and practical.
P.S.
I am writing PEG parser for Dart and was surprised when I found it in the grammar.
This rule does not allow in my parser auto recognize multilne comment as terminal because it recursive call himself.
MULTI_LINE_COMMENT <- '/*' (MULTI_LINE_COMMENT / !'*/' .)* '*/' ;
Also how this multiline comment can be described in Bison/Flex terminology?
This question arrives because in PEG parser terminology the comments are part of white spaces. And the white spaces in most cases can be assumed as terminals because they does not change behaviour (they does not branch and are not recursive by human logic, i.e produced directly into tokens by lexical scaners).
I know that in PEG parsers there is no division on terminals and not-terminals but for better error reporting some euristic analysis of grammar rules never prevents
From whence this style in Dart language?
I believe they added this because it makes it easier to comment out large blocks of code which may already contain block comments. Most other grammatical constructs nest, so it always seemed strange that C-style block comments did not to me.
I think C originally worked that way because it made it easier to lex on old PDP-11s with almost no memory. We don't have that limitation anymore, so we can have a more user-friendly comment syntax.

What to do when unescapable character(s) are escaped?

In designing of a (mini)language:
When there are certain characters that should be escaped to lose special meanings (like quotes in some programming languages), what should be done, especially from a security perspective, when characters that are not escapable (e.g. normal characters which never have special meaning) are escaped? Should an error be "error"ed, or should the character be discarded, or should it be in the output the same as if it was not escaped?
Example:
In a simple language where strings are delimited by double-quotes("), and any quotes in a given string are escaped with a back-slash(\): for input "We \said, \"We want Moshiach Now\"" -- what would should be done with the letter s in said which is escaped?
I prefer the lexer to whine when this occurs. A lexer/parser should be tight about syntax; one can always loosen it up later. If you are sloppy, you'll find you can't retract a decision you didn't think you made.
Assume that you initially decide to treat " backslash not-an-escape " as that pair of characters, and the "T" is
not-an-escape today. Sometime later you decide to extend the language, and want "\T" to mean something special, and you change your language.
You'll find an angry mob of programmers storming your design castle,
because for them, "\T" means "\" "T" (or "T" depending on your default decision),
and you just broke their code. You hang your head in shame, retract the decision,
and then realize... oops, there are no more available escape characters!
This lesson goes for any piece of syntax that isn't well defined in your language. If it isn't explicitly legal, it should be implicitly illegal and your compiler should check it. Or you'll never be able to extend your successful language.
If your language isn't going to be successful, you may not care as much.
Well, one way to solve the problem is for the backslash to just mean backslash when it precedes a non-escapable character. That's what Python does:
>>> print "a\tb"
a b
>>> print "a\tb\Rc"
a b\Rc
Obviously, most systems take the escape character to mean "take the next character verbatim", so escaping a "non-escapable" character is usually harmless. The problem later happens when you get to comparisons and such, where the literal text does not represent the actual value (that's where you see a lot of issues securitywise, especially with things like URLs).
So on the one hand, you can only accept a limited number of escaped characters. In that sense, you have an "escape sequence", rather than an escaped character (the \x is the entire sequence rather than a \ followed by an x). That's like the most safe mechanism, and it's not really burdensome to write.
The other option is to ensure that you you "canonicalizing" everything you compare, through some ruleset. This typically means removing all of the escape sequences properly up front, before comparison and comparing only the final values rather than the literals.
Most systems interpret the slash as Will Hartung says, except for alphanumerics which are variously used as aliases for control codes, character classes, word boundaries, the start of hex sequences, case region markers, hex or octal digits, etc. \s in particular often means white-space in perl5 style regexs. JavaScript, which interprets it as 's' in one context and as whitespace in another suffers from subtle bugs because of this choice. Consider /foo\sbar/ vs new RegExp('foo\sbar').

Resources