Common Lisp lexer generator that allows state variables

Neither of the two main lexer generators commonly referenced, cl-lex and lispbuilder-lexer, allows for state variables in the "action blocks", making it impossible to recognize a C-style multi-line comment, for example.
What is a lexer generator in Common Lisp that can recognize a C-style multi-line comment as a token?
Correction: this lexer actually needs to recognize nested, balanced multiline comments (not exactly C-style), so I can't do away with state variables.

You can recognize a C-style multiline comment with the following regular expression:
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]
It should work with any library which uses Posix-compatible extended regex syntax; although it's a bit hard to read, because * is used extensively both as an operator and as a literal character, it uses no non-regular features. It does rely on inverted character classes ([^*], for example) matching the newline character, but as far as I know that is pretty well universal, even for regex engines in which a wildcard does not match newline.
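For what it's worth, here's that same pattern dropped into Python's re module as a quick check (the backslash escapes replace the character-class quoting of / and *; the test strings are made up for illustration):

import re

# /* ... */ comments, including embedded newlines and stray stars.
c_comment = re.compile(r'/\*[^*]*\*+(?:[^*/][^*]*\*+)*/')

# In Python, negated classes like [^*] match newlines even without re.S.
print(c_comment.search('x = 1; /* a\n multi-line\n comment */ y = 2;').group())
print(c_comment.search('/* stars ** inside **/ rest').group())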

Related

How would you re-write the production rules for an existing grammar so that semi-colons were optional?

Suppose that there was a programming language Mod(C) just like C++ except that it was white-space sensitive.
That is, parsers and compilers written for Mod(C) did not ignore line-feeds, spaces, etc.
Also suppose that someone had already written down the production rules for describing a formal grammar for this modified version of C++.
My question is, how would you modify the production rules so that semi-colons were optional in the event that the semi-colon was followed by a line-break?
Actually, the semi-colon would be optional if some <optional_semi_colon> token is followed by:
one or more spaces and tabs
zero or one line-comments
a line-break (\n\r or \r\n or \n or \r)
The following piece of code would compile just fine:
#include <iostream>
using namespace std;

int main() {
    for (int i = 1; i <= 5; ++i) {
        cout << i << " "; // there is a semi-colon here. that's okay
    }
    return 0 // no semi-colon on this line
}
It's not at all clear what you mean by "white-space sensitive" and your example doesn't show any white-space sensitivity other than the possible optionality of semicolons at the end of a line. That is, you don't seem to be looking for an implementation of the off-side rule, as with Python or Haskell, where indentation indicates block structure. [Note 1]
Presumably, you are not just asking how to turn any newline into a semicolon, since that would be trivial. So I assume that you want something like JavaScript's automatic semicolon insertion (ASI), which automatically inserts a semicolon at the end of a line if:
1. the line doesn't already end with a semicolon, and
2. the current parse cannot be extended with the first token of the next line, either because
2.1. there is no item in the current state's itemset which would allow the next token to be shifted, or
2.2. the items which might allow a shift have been marked as not allowing a newline.
[Note 2]
Provision 2.1 prevents the incorrect semicolon insertion in:
let x = a
+ b
On the other hand, there are cases when you really want the newline to end the statement, in order to avoid silent bugs, such as:
yield
/* The above yield always produces undefined. */
console.log("We've been resumed");
yield -- and other statements with optional operands -- are annotated in the grammar with [No LineTerminator here], which triggers provision 2.2 and thus allows ASI even though the next token (console in the above example) could otherwise have extended the parse. Another context where newlines are banned is between a value and the postfix ++ operator, so that
let v = 2 * a
++v
is accepted with the (presumably) intended meaning. (Without the newline restriction, there would be a syntax error at the v after the ++, a point where ASI is not allowed because there's no newline character there.)
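Taken together, provisions 2.1 and 2.2 amount to something like the following toy decision procedure (a Python sketch; the state name and tables are invented for illustration, and the real ECMA-262 algorithm has additional cases):

# shiftable[state]: tokens the LR automaton could shift in that state.
# restricted: (state, token) pairs annotated [no LineTerminator here].
shiftable = {"after_expression": {"+", "-", "*", "/", "(", "++"}}
restricted = {("after_expression", "++")}

def should_insert_semicolon(state, next_token, newline_before):
    if not newline_before:
        return False                                # only consider line breaks
    if next_token not in shiftable.get(state, set()):
        return True                                 # provision 2.1
    if (state, next_token) in restricted:
        return True                                 # provision 2.2
    return False

# "let x = a" / "+ b": '+' can extend the parse, so no semicolon is inserted.
print(should_insert_semicolon("after_expression", "+", True))    # False
# "let v = 2 * a" / "++v": '++' is shiftable but newline-restricted.
print(should_insert_semicolon("after_expression", "++", True))   # True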
JavaScript is not the only language with optional command terminators, but it's probably the language with the most elaborate set of rules controlling the parse. Other languages include:
Python, which in addition to the off-side rule, ignores semicolons inside parentheses, braces and brackets;
Awk, which treats newlines as statement terminators unless the newline follows one of the tokens ,, {, ?, :, ||, &&, do, or else. [Note 3]
Bash, in which a newline is a command terminator except in specific contexts, such as after a |, || or && operator or a keyword like do, then and else, and inside an array literal or an arithmetic expansion. [Note 4]
Kotlin, whose rules I don't know. And undoubtedly other languages as well.
JavaScript (and, I believe, Kotlin) suffer from an ambiguity with function calls, because
let a = b
((x)=>console.log(x))(42)
is parsed as calling b with the argument (x)=>console.log(x) and then calling the result with the argument 42. (Which is a runtime error, because the result of console.log is undefined, which cannot be called.)
There are also languages like Lua, in which semicolons are always optional, even if you run statements together on a single line (a = 3 b = 4), and which therefore also suffers from the misinterpretation of function call expressions. However, unlike JavaScript, Lua requires that the ( be on the same line as the function expression, and therefore flags the equivalent of the above example as a syntax error. (The check is not part of the grammar, for what it's worth: After the function call is parsed, a semantic check is performed to verify that the line number of the ( token is the same as the line number of the last token in the function expression.)
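Lua's check can be sketched in a few lines (Python here; the token objects are hypothetical, and the message is modeled on Lua's actual error text):

class Token:
    def __init__(self, text, line):
        self.text, self.line = text, line

def check_call(func_last_token, open_paren):
    # Run after the call is parsed: the '(' must open on the same line
    # as the last token of the function expression.
    if open_paren.line != func_last_token.line:
        raise SyntaxError("ambiguous syntax (function call x new statement)")

check_call(Token("b", 1), Token("(", 2))   # raises SyntaxError, like Lua would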
I went to the trouble of enumerating all of the above examples by way of illustrating the fact that optional semicolons are not a simple grammatical transformation, and that there is no simple rule which can determine the precise circumstances in which a newline is an implicit statement terminator. Realistic implementations of the feature are non-trivial, and differ in their details; the algorithm chosen needs to be tested against a variety of realistic code samples, and it needs to be carefully documented so that programmers using the language don't find themselves surprised by the results. If you get it wrong, but your language nonetheless becomes popular, you'll find projects with style guides which require semicolons even in contexts in which they were optional. None of that is intended to imply that you shouldn't pursue the idea; only that it is perhaps more complicated than it looks at first glance.
Having said all that, I don't believe that any of the above examples require a context-sensitive grammar (unless you want to implement the off-side rule). Even in the case of JavaScript, possibly with some minor exceptions, a parser can be created by starting with an LALR automaton and then adjusting the transition rules, state by state, in order to either ignore a newline token or reduce it to a statement terminator (as well as implementing the lookahead restrictions in certain rules). Most of these modifications will effectively be simply the deletion of one of two conflicting parser actions, similar to the operator-precedence-based resolution of ambiguous expression grammars. (And it's worth noting that most parser generators make no attempt to rewrite the original grammar after processing the precedence declarations.)
However, while the existence of a PDA demonstrates the existence of a context-free grammar (at least for a context-free superset of the target language [Note 5]), it does not demonstrate the existence of a simple or elegant grammar. It seems to me likely that recreating a grammar from the modified PDA will produce a bloated monster without much value as a discursive tool. The modified PDA itself is sufficient to perform the parse, so reconstructing a grammar is not of much practical value.
Notes
1. That's perhaps just as well, because the off-side rule is not context-free, and thus cannot be implemented with a context-free grammar. There are, however, well-known techniques for implementing it in a lexical scanner.
2. That's a slight oversimplification of the ASI rules. In some cases, JavaScript also allows a semicolon to be inserted before a }, even if there is no newline at that point. But that's not a significant complication.
3. ? and : are a Gnu AWK extension.
4. That's not a complete list, by any means. See the Posix shell grammar and the bash manual for more details.
5. While the published grammars for the languages mentioned above, like the grammars for C and C++, are nominally context-free grammars, they do not actually encompass the entirety of the well-formedness rules in the respective language standards and/or manuals, which include constraints (or "early errors", in the terms of ECMA-262) "that can be detected and reported prior to the evaluation of any construct". Many of these rules are clearly context-sensitive (such as prohibitions on multiple definitions of the same name in a lexical scope).
Context-sensitive parsing is not necessarily a bad thing. Sometimes it's a lot simpler than trying to achieve the same result with a context-free grammar (as in the case of Lua function calls mentioned above). But it's certainly convenient to parse as much as possible using a generated parser, since such a parser can more easily be repurposed for other applications, such as linters, code browsers, syntax highlighters, and so on, not all of which need to be as precise as a compiler.

Is it possible to remove the internal control of lexer by the parser for parsing heredoc in shell?

To deal with heredoc in shell (e.g., bash), the grammar rule will change the variable need_here_doc via push_heredoc().
| LESS_LESS WORD
        {
          source.dest = 0;
          redir.filename = $2;
          $$ = make_redirection (source, r_reading_until, redir, 0);
          push_heredoc ($$);
        }
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n539
static void
push_heredoc (r)
     REDIRECT *r;
{
  if (need_here_doc >= HEREDOC_MAX)
    {
      last_command_exit_value = EX_BADUSAGE;
      need_here_doc = 0;
      report_syntax_error (_("maximum here-document count exceeded"));
      reset_parser ();
      exit_shell (last_command_exit_value);
    }
  redir_stack[need_here_doc++] = r;
}
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n2794
need_here_doc is used in read_token(), which is called by yylex(). This makes the behavior of yylex() non-autonomous.
Is it normal to design a parser that can change how yylex() behaves?
Is it because the shell language is not LALR(1), so there is no way to avoid changing the behavior of yylex() by the grammar actions?
if (need_here_doc)
gather_here_documents ();
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n3285
current_token = read_token (READ);
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n2761
Is it normal to design a parser that can change how yylex() behaves?
Sure. It might not be ideal, but it's extremely common.
The Posix shell syntax is far from the ideal candidate for a flex/bison parser, and about the only thing you can say for the bash implementation using flex and bison is that it demonstrates how flexible those tools can be if pushed to their respective limits. Here-docs are not the only place where "lexical feedback" is necessary.
But even in more disciplined languages, lexical feedback can be useful. Or its alternative: writing partial parsing logic into the lexical scanner in order for it to know when the parse would require a different set of lexical rules.
Possibly the best-known (or most frequently commented) use of lexical feedback is the parsing of C-style cast expressions, which requires the lexer to know whether the foo in (foo) is a typename or not. (This is usually implemented by way of a symbol table shared between the parser and the lexer, but the precise implementation details are tricky.)
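A minimal sketch of that feedback loop, in Python (the names are invented for illustration; a real implementation also has to handle scopes, shadowing, and lookahead):

typedef_names = set()        # written by parser actions, read by the lexer

def lex_word(word):
    # The token kind depends on state maintained by the parser.
    kind = "TYPENAME" if word in typedef_names else "IDENTIFIER"
    return (kind, word)

def reduce_typedef(name):
    # A parser action for something like `typedef int foo;`
    typedef_names.add(name)

reduce_typedef("foo")
print(lex_word("foo"))   # ('TYPENAME', 'foo')  -- so (foo)x can lex as a cast
print(lex_word("bar"))   # ('IDENTIFIER', 'bar')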
Here are a few other examples, which might be considered relatively benign uses of lexical feedback, although they certainly increase the coupling between lexer and parser.
Python (and Haskell) require the lexical scanner to reformulate leading whitespace into INDENT or DEDENT tokens. But if the line break occurs within parentheses, the whitespace handling is suppressed (including the NEWLINE token itself).
ECMAScript (JavaScript) and other languages allow regular expression literals to be written surrounded by /s. But the / could also be a division operator or the first character in a /= mutation operator. The lexical decision depends on the parse context. (This could be guessed by the lexical scanner from the recent token history, which would count as reproducing part of the parsing logic in the lexical scanner.)
Similar to the above, many languages overload < in ways which complicate the logic in the lexical scanner. The use as a template bracket rather than a comparison operator might be dealt with in the scanner -- in C++, for example, it will depend on features like whether the preceding identifier was a template or not -- but that doesn't actually change lexical context. However, the use of an angle bracket to indicate the start of an X/HTML literal (or template) definitely changes lexical context. As with the regex example above, it is necessary to know whether or not a comparison operator would be syntactically valid at that point.
Is it because the shell language is not LALR(1), so there is no way to avoid changing the behavior of yylex() by the grammar actions?
The Posix shell syntax is most certainly not LALR(1), or even context-free. But most languages could not be parsed scannerlessly with an LALR(1) parser, and many languages turn out not to have context-free grammars if you take all syntactic considerations into account. (Cf. C-style cast expressions, above.) Perhaps shell is further from the platonic ideal than most. But then, it grew over the years from a kernel intended to be simple to type, rather than formally analysable. (No comment from me about whether this excuse can be extended to Perl, which I don't plan to discuss here.)
What I'd say in general is that languages which embed other languages (regular expressions, HTML fragments, Flex/Bison semantic actions, shell arithmetic expansions, etc., etc.) present challenges for a simplistic parser/scanner model. Despite lots of interesting work and solid experimentation, my sense is that language embedding still lacks a good implementable formal structure. And since most languages do have embedded sublanguages, there is and will continue to be a certain adhockery in their parser implementations. In part, that's what makes this field of study so much fun.

Making a Lua pattern case insensitive with LPeg

I have an app that (among other things) supports plain-text searches and searches using Lua patterns. As a convenience, the app supports case-insensitive searches.
The code that transforms the given Lua pattern into a case-insensitive Lua pattern isn't too pretty. It basically worries about whether or not a character is preceded by an odd or even number of escapes (%) and whether or not it is located inside of square brackets. For example, one transformed pattern comes out as %a[bB][bB]%%[cC][%abB%%cC]
I haven't had a chance to learn LPeg yet, and I suppose this could be my motivator.
My question is whether this is something that LPeg could have handled easily?
Yes, but for an easier entry into the LPeg world, consider LPeg's "re" module, which gives you a regex-like syntax and lets you specify a set of rules, as in a grammar (think Yacc, etc.). You'd basically write rules for escaped characters, bracket groups, and regular characters. Then you could associate functions with the rules that emit either the same text they consumed as input or the case-insensitive modified version.
The structure of your rules would take care of the even-odd distinction automatically, bracket context, etc. LPeg uses "ordered choice", so if you add your escape rule first, it will handle %[ correctly and avoid mixing it up with the brackets rule, for example.
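LPeg aside, the shape of those rules can be mocked up in plain Python to show how ordered choice sidesteps the odd/even escape bookkeeping (a simplified sketch: it assumes bracket sets contain no %]):

def ci_char(c):
    # A plain letter becomes a two-character set; everything else is kept.
    return '[' + c.lower() + c.upper() + ']' if c.isalpha() else c

def ci_set(body):
    # Inside [...]: keep %-escapes, double-case the letters.
    out, i = [], 0
    while i < len(body):
        if body[i] == '%' and i + 1 < len(body):
            out.append(body[i:i + 2]); i += 2
        elif body[i].isalpha():
            out.append(body[i].lower() + body[i].upper()); i += 1
        else:
            out.append(body[i]); i += 1
    return '[' + ''.join(out) + ']'

def case_insensitive(pat):
    out, i = [], 0
    while i < len(pat):
        if pat[i] == '%' and i + 1 < len(pat):   # escape rule tried first
            out.append(pat[i:i + 2]); i += 2
        elif pat[i] == '[':                      # then the bracket rule
            j = pat.index(']', i + 1)
            out.append(ci_set(pat[i + 1:j])); i = j + 1
        else:                                    # then ordinary characters
            out.append(ci_char(pat[i])); i += 1
    return ''.join(out)

print(case_insensitive('%abb%%c[%abc]'))   # %a[bB][bB]%%[cC][%abBcC]

Because the escape alternative is tried before the bracket alternative, %[ never opens a set, which is exactly the benefit of ordered choice described above.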

EBNF Grammar for list of words separated by a space

I am trying to understand how to use EBNF to define a formal grammar, in particular a sequence of words separated by a space, something like
<non-terminal> [<word>[ <word>[ <word>[ ...]]]] <non-terminal>
What is the correct way to define a word terminal?
What is the correct way to represent required whitespace?
How are optional, repetitive lists represented?
Are there any show-by-example tutorials on EBNF anywhere?
Many thanks in advance!
You have to decide whether your lexical analyzer is going to return a token (terminal) for the spaces. You also have to decide how it (the lexical analyzer) is going to define words, or whether your grammar is going to do that (in which case, what is the lexical analyzer going to return as terminals?).
For the rest, it is mostly a question of understanding the niceties of EBNF notation, which is an ISO standard (ISO 14977:1996 — and it is available as a free download from Freely Available Standards, which you can also get to from ISO), but it is a standard that is largely ignored in practice. (The languages I deal with — C, C++, SQL — use a BNF notation in the defining documents, but it is not EBNF in any of them.)
1. Whatever you want to make the correct definition of a word. You need to think about how you'd want to treat the name P. J. O'Neill, for example. What tokens will the lexical analyzer return for that?
2. This is closely related to the previous issue: what are the terminals that the lexical analyzer is going to return?
3. Optional repetitive lists are enclosed in { and } braces, or you can use the Kleene star notation.
4. There is a paper Extended BNF — A generic base standard by R. S. Scowen that explains EBNF. There's also the Wikipedia entry on EBNF.
I think that a non-empty, space-separated word list might be defined using:
non_empty_word_list = word { space word }
where all the names there are non-terminals. You'd need to define those in terms of the relevant terminals of your system.
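For instance, sticking to ISO-style EBNF (commas for concatenation, (* *) for comments), the terminals might be defined along these lines; the letter set here is deliberately tiny and would need to be extended:

non_empty_word_list = word, { space, word } ;
word = letter, { letter } ;
space = " ", { " " } ;
letter = "a" | "b" | "c" | "d" | "e" ;  (* ...and so on through the alphabet *)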

Combined unparser/parser generator

Is there a parser generator that also implements the inverse direction, i.e. unparsing domain objects (a.k.a. pretty-printing) from the same grammar specification? As far as I know, ANTLR does not support this.
I have implemented a set of invertible parser combinators in Java and Kotlin. A parser is written pretty much in LL(1) style, and it provides a parse method and a print method, where the latter provides the pretty-printer.
You can find the project here: https://github.com/searles/parsing
Here is a tutorial: https://github.com/searles/parsing/blob/master/tutorial.md
And here is a parser/pretty printer for mathematical expressions: https://github.com/searles/parsing/blob/master/src/main/java/at/searles/demo/DemoInvert.kt
Take a look at Invertible syntax descriptions: Unifying parsing and pretty printing.
There are several parser generators that include an implementation of an unparser. One of them is the nearley parser generator for context-free grammars.
It is also possible to implement bidirectional transformations of source code using definite clause grammars. In SWI-Prolog, the phrase/2 predicate can convert an input text into a parse tree and vice-versa.
Our DMS Software Reengineering Toolkit does precisely this (and provides a lot of additional support for analyzing/transforming code). It does this by decorating a language grammar with additional attributes, producing what is called an attribute grammar. We use a special DSL to write these rules to make them convenient to write.
It helps to know that DMS produces a tree based directly on the grammar.
Each DMS grammar rule is paired with a so-called "prettyprinting" rule. Each prettyprinting rule describes how to "prettyprint" the syntactic element and sub-elements recognized by its corresponding grammar rule. The prettyprinting process essentially manufactures or combines rectangular boxes of text horizontally or vertically (with optional indentation), with leaves producing unit-height boxes containing the literal value of the leaf (keyword, operator, identifier, constant, etc.).
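To give a feel for the box model (a toy Python mock-up, not DMS's actual machinery; boxes are just lists of lines): H composes boxes side by side, V stacks them, and I indents.

def H(*boxes):
    # Horizontal composition: pad every box to a common height and width,
    # then glue the rows together, top-aligned.
    height = max(len(b) for b in boxes)
    cols = []
    for b in boxes:
        w = max((len(line) for line in b), default=0)
        cols.append([line.ljust(w) for line in b] + [' ' * w] * (height - len(b)))
    return [''.join(row).rstrip() for row in zip(*cols)]

def V(*boxes):
    # Vertical composition: stack the boxes.
    return [line for b in boxes for line in b]

def I(b, indent=2):
    # Indent a box by a fixed amount.
    return [' ' * indent + line for line in b]

body = V(['a[x]+=3;'], ['b[x]=a[x]-1;'])
print('\n'.join(V(H(['for '], ['(i=x*2; i>-2*x; i--)']),
                  H(['{'], I(body)),
                  ['}'])))
# for (i=x*2; i>-2*x; i--)
# {  a[x]+=3;
#    b[x]=a[x]-1;
# }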
As an example, one might write the following DMS grammar rule and matching prettyprinting rule:
statement = 'for' '(' assignment ';' assignment ';' conditional_expression ')'
            '{' sequence_of_statements '}' ;
<<PrettyPrinter>>:
    { V(H('for','(',assignment[1],';',assignment[2],';',conditional_expression,')'),
        H('{', I(sequence_of_statements)),
        '}');
    }
This will parse the following:
for ( i=x*2;
i--; i>-2*x ) { a[x]+=3;
b[x]=a[x]-1; }
(using additional grammar rules for statements and expressions) and prettyprint it (using additional prettyprinting rules for those additional grammar rules) as follows:
for (i=x*2;i--;i>-2*x)
{ a[x]+=3;
b[x]=a[x]-1;
}
DMS also captures comments, attaches them to AST nodes, and regenerates them on output. The implementation is a bit exotic because most parsers don't handle comments, but utilization is easy, even "free"; comments will be automatically inserted in the prettyprinted result in their original places.
DMS can also print in "fidelity" mode. In this form, it tries to preserve the shape of each token (e.g., number radix, identifier character capitalization, which keyword spelling was used) and the column offset (into the line) of each parsed token. This causes the original text (or something so close that you don't think it is different) to be regenerated.
More details about what prettyprinters must do are provided in my SO answer on Compiling an AST back to source code. DMS addresses all of those topics cleanly.
This capability has been used by DMS on some 40+ real languages, including full IBM COBOL, PL/SQL, Java 1.8, C# 5.0, C (many dialects) and C++14.
By writing a sufficiently interesting set of prettyprinter rules, you can build things like JavaDoc extended to include hyperlinked source code.
It is not possible in general.
What makes a print pretty? A print is pretty if spaces, tabs, and newlines are at the positions that make the output look nice.
But most grammars ignore white space, because in most languages white space is not significant. There are exceptions, like Python, but in general the question of whether it is a good idea to use white space as syntax is still controversial. And therefore most grammars do not use white space as syntax.
And if the abstract syntax tree does not contain white space, because the parser has thrown it away, no generator can use it to pretty-print an AST.

Resources