Parser Combinators: Handling Whitespace In Parser without excessive Backtracking - parsing

I am moving from a separate lexer and parser to a combined parser that operates on an array of characters. One problem is how to properly handle whitespace.
Problem
Take a language consisting of a sequence of characters 'a' and 'b'. Whitespace is allowed in the input but does not effect the meaning of the program.
My current approach to parse such a language is:
var token = function(p) {
return attempt(next(
parse.many(whitespace),
p));
};
var a = token(char('a'));
var b = token(char('b'));
var prog = many(either(a, b));
This works, but requires unnecessary backtracking. For a program such as '___ba_alb' (Whitespace was getting stripped in the post so let _ be a space), in matching 'b', the whitespace is parsed twice, first for a and when a fails, again for b. Simply removing attempt does not work as either will never reach b if any whitespace is consumed.
Attempts
My first though was to move the token call outside of the either:
var a = char('a');
var b = char('b');
var prog = many(token(either(a, b)));
This works, but now prog cannot be reused easily. In building a parser library, this seems to require defining parsers twice: one version that actually consumes 'a' or 'b' and can be used in other parsers, and one version that correctly handles whitespace. It also clutters up parser definitions by requiring them to have explicit knowledge of how each parser operates and how it handles whitespace.
Question
The intended behavior is that an arbitrary amount of whitespace can be consumed before a token. If parsing the token fails, it backtracks to the start of the token instead of the start of the whitespace.
How can this be expressed without preprocessing the input to produce a token stream? Are there any good, real world code examples of this? The closest I found was Higher-Order Functions for Parsing but this does not seem to address my specific concern.

I solved this problem in a JSON parser that I built. Here's the key parts of my solution, in which I followed the 'write the parsers twice' approach somewhat:
define the basic parsers for each token -- number, string, etc.
define a token combinator -- that takes in a basic token parser and outputs a whitespace-munching parser. The munching should occur after so that whitespace is only parsed once, as you noted:
function token(parser) {
// run the parser, then munch whitespace
}
use the token combinator to produce the munching token parsers
use the parsers from 3. to build the parsers for the rest of the language
I didn't have a problem with having two similar versions of the parsers, since each version was in fact distinct. It was slightly annoying, but a tiny cost. You can check out the full code here.

Related

How would you re-write the production rules for an existing grammar so that semi-colons were optional?

Suppose that there was a programming language Mod(C) just like C++ except that it was white-space sensitive.
That is, parsers and compilers written for Mod(C) did not ignore line-feeds, spaces, etc...
Also suppose that someone had already written down the production rules for describing a formal grammar for this modified version of C++
My question is, how would you modify the production rules so that semi-colons were optional in the event that the semi-colon was followed by a line-break?
Actually, the semi-colon would optional if some <optional_semi_colon> token is followed by:
one or more spaces and tabs
zero or one line-comments
a line-break (\n\r or \r\n or \n or \r)
The following piece of code would compile just fine:
#include <iostream>
using namespace std;
int main() {
for (int i = 1; i <= 5; ++i) {
cout << i << " "; // there is a semi-colon here. that's okay
}
return 0 // no semi-colon on this line
}
It's not at all clear what you mean by "white-space sensitive" and your example doesn't show any white-space sensitivity other than the possible optionality of semicolons at the end of a line. That is, you don't seem to be looking for an implementation of the off-side rule, as with Python or Haskell, where indentation indicates block structure. [Note 1]
Presumably, you are not just asking how to turn any newline into a semicolon, since that would be trivial. So I assume that you want something like JavaScript's automatic semicolon insertion (ASI), which automatically inserts a semicolon at the end of a line if:
the line doesn't already end with a newline, and
the current parse cannot be extended with the first token of the next line, either because
2.1. there is no item in the current state's itemset which would allow the next token to be shifted, or
2.2. the items which might allow a shift have been marked as not allowing a newline.
[Note 2]
Provision 2.1 prevents the incorrect semicolon insertion in:
let x = a
+ b
On the other hand, there are cases when you really want the newline to end the statement, in order to avoid silent bugs, such as:
yield
/* The above yield always produces undefined. */
console.log("We've been resumed");
yield -- and other statements with optional operands -- are annotated in the grammar with [No LineTerminator here], which triggers provision 2.2 and thus allows ASI even though the next token (console in the above example) could otherwise have extended the parse. Another context where newlines are banned is between a value and the postfix ++ operator, so that
let v = 2 * a
++v
is accepted with the (presumably) intended meaning. (Without ASI, there would be a syntax error after the ++, where ASI is not allowed because there's no newline character.)
JavaScript is not the only language with optional command terminators, but it's probably the language with the most elaborate set of rules controlling the parse. Other languages include:
Python, which in addition to the off-side rule, ignores semicolons inside parentheses, braces and brackets;
Awk, which treats newlines as statement terminators unless the newline follows one of the tokens ,, {, ?, :, ||, &&, do, or else. [Note 3]
Bash, in which a newline is a command terminator except in specific contexts, such as after a |, || or && operator or a keyword like do, then and else, and inside an array literal or an arithmetic expansion. [Note 4]
Kotlin, whose rules I don't know. And undoubtedly other languages as well.
JavaScript (and, I believe, Kotlin) suffer from an ambiguity with function calls, because
let a = b
((x)=>console.log(x))(42)
is parsed as calling b with the argument (x)=>console.log(x) and then calling the result with the argument 42. (Which is a runtime error, because the result of console.log is undefined, which cannot be called.)
There are also languages like Lua, in which semicolons are always optional, even if you run statements together on a single line (a = 3 b = 4), and which therefore also suffers from the misinterpretation of function call expressions. However, unlike JavaScript, Lua requires that the ( be on the same line as the function expression, and therefore flags the equivalent of the above example as a syntax error. (The check is not part of the grammar, for what it's worth: After the function call is parsed, a semantic check is performed to verify that the line number of the ( token is the same as the line number of the last token in the function expression.)
I went to the trouble of enumerating all of the above examples by way of illustrating the fact that optional semicolons are not a simple grammatical transformation, and that there is no simple rule which can determine the precise circumstances in which a newline is an implicit statement terminator. Realistic implementations of the feature are non-trivial, and differ in their details; the algorithm chosen needs to be tested against a variety of realistic code samples, and it needs to be carefully documented so that programmers using the language don't find themselves surprised by the results. If you get it wrong, but your language nonetheless becomes popular, you'll find projects with style guides which require semicolons even in context in which they were optional. None of that is intended to imply that you shouldn't pursue the idea; only that it is perhaps more complicated than it looks at first glance.
Having said all that, I don't believe that any of the above examples require a context-sensitive grammar (unless you want to implement the off-side rule). Even in the case of JavaScript, possibly with some minor exceptions, a parser can be created by starting with an LALR automaton and then adjusting the transition rules, state by state, in order to either ignore a newline token or reduce it to a statement terminator (as well as implementing the lookahead restrictions in certain rules). Most of these modifications will effectively be simply the deletion of one conflicting parser actions, similar to the operator-precedence-based resolution of ambiguous expression grammars. (And it's worth noting that most parser generators make no attempt to rewrite the original grammar after processing of the precedence declarations.)
However, while the existence of a PDA demonstrates the existence of a context-free grammar (at least for a context-free superset of the target language [Note 5]), it does not demonstrate the existence of a simple or elegant grammar. It seems to me likely that recreating a grammar from the modified PDA will produce a bloated monster without much value as a discursive tool. The modified PDA itself is sufficient to perform the parse, so reconstructing a grammar is not of much practical value.
Notes
That's perhaps just as well, because the off-side rule is not context-free, and thus cannot be implemented with a context free grammar. Although there are well-known techniques for implementing in a lexical scanner.
That's a slight oversimplification of the ASI rules. In some cases, JavaScript also allows a semicolon to be inserted before a }, even if there is no newline at that point. But that's not a significant complication.
? and : are a Gnu AWK extension.
That's not a complete list, by any means. See the Posix shell grammar and the bash manual for more details.
While the published grammars for the languages mentioned above, like the grammars for C and C++, are nominally context-free grammars, they do not actually encompass the entirety of the well-formedness rules in the respective language standards and/or manuals, which include constraints (or "early errors", in the terms of ECMA-262), "that can be detected and reported prior to the evaluation of any construct". Many of these rules are clearly context-sensitive (such as prohibitions on multiple definitions of the same name in a lexical scope).
Context-sensitive parsing is not necessarily a bad thing. Sometimes it's a lot simpler than trying to achieve the same result with a context-free grammar (as in the case of Lua function calls mentioned above). But it's certainly convenient to parse as much as possible using a generated parser, since such a parser can more easily be repurposed for other applications, such as linters, code browsers, syntax highlighters, and so on, not all of which need to be as precise as a compiler.

Is my lexer doing too much -- is it doing the work of the parser?

My input consists of a series of names, each on a new line. Each name consists of a firstname, optional middle initial, and lastname. The name fields are separated by tabs. Here is a sample input:
Sally M. Smith
Tom V. Jones
John Doe
Below are the rules for my Flex lexer. It works fine but I am concerned that my lexer is doing too much: it is determining that a token is a firstname or a middle initial or a lastname. Should that determination be done in the parser, not the lexer? Am I abusing the Flex state capability? What I am seeking is a critique of my lexer. I am just a beginner, how would a parsing expert create lexer rules for this input?
<INITIAL>{
[a-zA-Z]+ { yylval.strval = strdup(yytext); return(FIRSTNAME); }
\t { BEGIN MI_STATE; }
. { BEGIN JUNK_STATE; }
}
<MI_STATE>{
[A-Z]\. { yylval.strval = strdup(yytext); return(MI); }
\t { BEGIN LASTNAME_STATE; }
. { BEGIN JUNK_STATE; }
}
<LASTNAME_STATE>{
[a-zA-Z]+ { yylval.strval = strdup(yytext); return(LASTNAME); }
\n { BEGIN INITIAL; return EOL; }
. { BEGIN JUNK_STATE; }
}
<JUNK_STATE>. { printf("JUNK: %s\n", yytext); }
You can use lexer states as you do in this question. But it's better to use them as a means to conditionally activate rules. For examples, think of handling multi-line comments or here documents or (for us silverbacks) embedded SQL.
In your question, there's no lexical difference between a given name and a family name -- they both are matched by [a-zA-Z]+, as would be middle names, if you were to extend your lexer.
Short answer: yes, lex NAME tokens and let the parser determine whether you have three NAME tokens on a line.
Yes; your lexer is parsing. The main evidence is that it's implementing identical rules in different start states. Two rules have exactly the same pattern.
The purpose of start states in the context of lexing is to modify the lexical grammar in order to shield the parser from certain differences. It works with the parser. For instance, say you had some document language in which $ shifts into math expression mode, which has different tokenizing rules. The lexer still just returns tokens in math mode; it doesn't try to parse the math expressions. It is the parser which will determine that, if the brackets are balanced, then another $ can shift out of math mode.
In your code the rules for returning a last name and first name are identical; you have used to start state to handle phrase structure syntax: the fact that the last name comes later than the first name.
Another bit of telltale evidence that the lexer is parsing is that the lexer itself is handling all of the start condition changes. In our $...$ math mode example, we might have the lexer shift into a start state when it sees the $ symbol. However, if the lexer also recognizes the end of math mode, then that is evidence it is parsing the math mode expression. The end can only be recognized by following the nested phrase structure grammar of math mode expressions. The way you would handle that would be for the lexer to expose a function lex_end_math_mode(). When the parser processes and reduces the entire math mode expression, it calls this function to tell the lexer to switch back to the lexical syntax outside of math mode. The math-mode-terminating dollar sign would likely also appear as a token visible to the parser, though the leading one might not. So that is to say, the parser parses math_mode_expr : expr '$': a math mode expression followed by a required dollar sign to end math mode. The action for that rule would include the call to lex_end_math_mode, so the lexer returns to the tokenization rules outside of math mode for scanning the next token after the closing $.
There is no right or wrong answer because it's all parsing. Every grammar that is divided into tokens and phrase structure rules could be expressed by a unified grammar which includes the rules for the token structure.
Why we often use a design which separates lexical scanning and parsing is that:
Unifying the lexical and phrase structure grammar into one will turn LL(1) into LL(k). A recursive-descent parser then needs to look k symbols ahead to make parsing decisions. For instance if you're parsing C with this holistic approach, you need to treat int as a reserved keyword. That requires four symbols of lookahead: you have to recognize i, n, t, and then if the next symbol indicates that the token has ended, you treat that as the keyword, otherwise an identifier.
Performance: lexical scanning uses efficient techniques tailored to that task, which take advantage of the restriction that the lexical grammar is regular.
Correspondence to spec: if you have a language whose specification is described in terms of a lexical grammar separate from a phrase structure grammar, then if you implement it that way, features of your code are more readily traceable to features of requirement spec. You may be able to write unit tests which separately show that the lexing and parsing obeys the spec.
Schooling: people who went through a CS program that included a course on compiler construction had separate lexing and parsing drilled into their heads, and whenever it comes up in their subsequent career, they just lean on that wisdom. They are never confronted with situations in which they recognize it as not being a good approach, and don't question it.
Whatever works in your individual situations with whatever you're parsing overrules the theory. If it's convenient for you to recognize some phrase-like fragments in the lexer, and you're able to convince yourself that it's a clean approach, then by all means do it that way.

"concurrently" lexing and parsing vs lexing then parsing?

I've seen a few languages that will eat a token, then parse the token and then when they need to check the next token whilst parsing, they request it from the lexer.
So you have if (x == 3) you lex, check what it is in this case an if, lex again and make sure its a (, parse an expression which requests 3 in this case till it finishes parsing an expression, and then you lex and expect a closing parenthesis.
The other alternative is you lex this input stream as keyword, symbol, identifier, equality, number, symbol and then you give that token list to the parser which will parse it into an AST.
What are the pros/cons of these two techniques?
For most grammars, it doesn't really matter whether you lex the entire input into a token list as a first pass, then take tokens from the list during a parse pass, or lex on demand. The second method avoids the need for an in-memory token list, the first method means that you can parse several times a bit faster, which you might want to do in an interpreter.
However if the grammar require more than one token of lookahead or isn't left-right then you might need to lex more. Whilst natural languages have some odd parse rules ("time flies like an arrow, fruit flies like bananas"), computer languages are usually designed to be parseable with a simple recursive descent parser with one token of lookahead.

Practical difference between parser rules and lexer rules in ANTLR?

I understand the theory behind separating parser rules and lexer rules in theory, but what are the practical differences between these two statements in ANTLR:
my_rule: ... ;
MY_RULE: ... ;
Do they result in different AST trees? Different performance? Potential ambiguities?
... what are the practical differences between these two statements in ANTLR ...
MY_RULE will be used to tokenize your input source. It represents a fundamental building block of your language.
my_rule is called from the parser, it consists of zero or more other parser rules or tokens produced by the lexer.
That's the difference.
Do they result in different AST trees? Different performance? ...
The parser builds the AST using tokens produced by the lexer, so the questions make no sense (to me). A lexer merely "feeds" the parser a 1 dimensional stream of tokens.
This post may be helpful:
The lexer is responsible for the first step, and it's only job is to
create a "token stream" from text. It is not responsible for
understanding the semantics of your language, it is only interested in
understanding the syntax of your language.
For example, syntax is the rule that an identifier must only use
characters, numbers and underscores - as long as it doesn't start with
a number. The responsibility of the lexer is to understand this rule.
In this case, the lexer would accept the sequence of characters
"asd_123" but reject the characters "12dsadsa" (assuming that there
isn't another rule in which this text is valid). When seeing the valid
text example, it may emit a token into the token stream such as
IDENTIFIER(asd_123).
Note that I said "identifier" which is the general term for things
like variable names, function names, namespace names, etc. The parser
would be the thing that would understand the context in which that
identifier appears, so that it would then further specify that token
as being a certain thing's name.
(sidenote: the token is just a unique name given to an element of the
token stream. The lexeme is the text that the token was matched from.
I write the lexeme in parentheses next to the token. For example,
NUMBER(123). In this case, this is a NUMBER token with a lexeme of
'123'. However, with some tokens, such as operators, I omit the lexeme
since it's redundant. For example, I would write SEMICOLON for the
semicolon token, not SEMICOLON( ; )).
From ANTLR - When to use Parser Rules vs Lexer Rules?

Implementing "*?" (lazy "*") regexp pattern in combinatorial GLR parsers

I have implemented combinatorial GLR parsers. Among them there are:
char(·) parser which consumes specified character or range of characters.
many(·) combinator which repeats specified parser from zero to infinite times.
Example: "char('a').many()" will match a string with any number of "a"-s.
But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.
I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?
Edit:
May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.
Parser = f(Step) -> Collection of TreeNodes of Steps.
So when I parse input, I:
Compose simple predefined Parser functions to get complex Parser function representing required grammar.
Form initial Step from the input.
Give the initial Step to the complex Parser function.
Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).
Gather information from Steps which were left.
I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.
I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to definining a grammer in terms of terminal characters, where "char('a').many" corresponds to grammar rules:
char = "a" ;
char = char "a" ;
GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its psuedo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parsess, e.g., ambiguous parses) will survive.
If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints what you have implemented is not a GLR parser.
Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in psuedo-parallel, the GLR parsing machinery makes this easy if it it bent properly) all the following: a) deleting the offending token, b) inserting all tokens that essentially are FOLLOW(x) where x is live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.
Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)
(This answer based on my dim memory and vague possible misunderstanding of GLR parsing. Hopefully someone expert will come by.)
Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?
Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?
This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.
One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parsers. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.
Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".
Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.

Resources