Which style of multiline comments is used in Dart? - parsing

Which style of multiline comments is used in Dart?
I know the C style of multiline comments. This style does not allow multiline comments inside multiline comments (nested comments).
That is, C-style comments end at the first */ encountered inside the comment.
Examples:
Valid C-style comment:
/*
*/
Not a valid C-style comment:
/*
/**/
*/
In Dart nested comments are valid, but as far as I know most popular languages allow only the non-nesting C-style comments.
Here is my question.
Where does this style in Dart come from, from a historical and a practical point of view?
P.S.
I am writing a PEG parser for Dart and was surprised when I found this in the grammar.
This rule does not allow my parser to automatically recognize a multiline comment as a terminal, because the rule calls itself recursively.
MULTI_LINE_COMMENT <- '/*' (MULTI_LINE_COMMENT / !'*/' .)* '*/' ;
Also, how can this kind of multiline comment be described in Bison/Flex terminology?
This question arises because in PEG parser terminology comments are part of the whitespace, and whitespace can in most cases be treated as a terminal because it does not change behaviour (it does not branch and is not recursive by human logic, i.e. it is produced directly into tokens by lexical scanners).
I know that in PEG parsers there is no division into terminals and non-terminals, but for better error reporting some heuristic analysis of the grammar rules never hurts.
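For what it's worth, the recursion in that rule is easy to mirror in a hand-written scanner. Here is a minimal Python sketch of the MULTI_LINE_COMMENT rule (the function name match_multi_line_comment is my own):

```python
def match_multi_line_comment(text, pos=0):
    """Return the end offset of a (possibly nested) /* ... */ comment
    starting at `pos`, or None if no well-formed comment starts there.
    Mirrors the PEG rule:
      MULTI_LINE_COMMENT <- '/*' (MULTI_LINE_COMMENT / !'*/' .)* '*/'
    """
    if not text.startswith('/*', pos):
        return None
    pos += 2
    while pos < len(text):
        if text.startswith('*/', pos):
            return pos + 2                   # close this nesting level
        if text.startswith('/*', pos):
            inner = match_multi_line_comment(text, pos)  # recurse for nesting
            if inner is None:
                return None                  # unterminated inner comment
            pos = inner
        else:
            pos += 1                         # any other character
    return None                              # ran off the end: unterminated
```

With non-nesting C-style rules the input `/* a /* b */ c */` would end at the first `*/`; this scanner instead consumes the whole string, which is the Dart behaviour.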

From whence this style in Dart language?
I believe they added this because it makes it easier to comment out large blocks of code that may already contain block comments. Most other grammatical constructs nest, so it always seemed strange to me that C-style block comments did not.
I think C originally worked that way because it made it easier to lex on old PDP-11s with almost no memory. We don't have that limitation anymore, so we can have a more user-friendly comment syntax.

Related

Common Lisp lexer generator that allows state variables

Neither of the two main lexer generators commonly referenced, cl-lex and lispbuilder-lexer, allows for state variables in the "action blocks", making it impossible to recognize a C-style multiline comment, for example.
What is a lexer generator in Common Lisp that can recognize a c-style multi-line comment as a token?
Correction: this lexer actually needs to recognize nested, balanced multiline comments (not exactly C-style), so I can't do away with state variables.
You can recognize a C-style multiline comment with the following regular expression:
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]
It should work with any library that uses POSIX-compatible extended regex syntax. Although it is a bit hard to read, because * is used extensively both as an operator and as a literal character, it uses no non-regular features. It does rely on inverted character classes ([^*], for example) matching the newline character, but as far as I know that is pretty much universal, even for regex engines in which a wildcard does not match newline.
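For instance, the expression can be used verbatim with Python's re module (the test strings below are my own):

```python
import re

# The POSIX-style ERE from the answer, usable as-is in Python.
C_COMMENT = re.compile(r'[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]')

print(bool(C_COMMENT.fullmatch('/**/')))               # True: empty comment
print(bool(C_COMMENT.fullmatch('/* a ** b */')))       # True: internal stars
print(bool(C_COMMENT.fullmatch('/* multi\nline */')))  # True: [^*] spans newlines
print(bool(C_COMMENT.fullmatch('/* unterminated *')))  # False: no closing /
```

Note that, being a regular expression, it recognizes only the non-nesting C flavour; the nested, balanced variant from the correction above is not a regular language and genuinely needs state (or recursion).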

Multiline comments in a recursive descent parser

I'm trying to wrap my head around how to handle C-style multiline comments (/* */) with a recursive descent parser. Because these comments can appear anywhere, how do you account for them? For example, suppose you're parsing a sentence into word tokens, what do we do if there's a comment inside a word?
Ex.
This is a sentence = word word word word
vs
This is a sen/*sible*/tence = ???
Thanks!
In C, as in pretty much every other programming language, a comment is effectively whitespace; a comment cannot occur within a token.
So comments cannot interrupt the parsing of a token, and thus only need to be recognized and ignored.
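A minimal sketch of that idea in Python (the tokenize helper and its word-splitting rules are hypothetical, purely to illustrate comments acting as token separators):

```python
def tokenize(text):
    """Split `text` into word tokens, treating /* ... */ comments
    (and blanks) as token separators -- never as part of a token."""
    tokens, pos, n = [], 0, len(text)
    while pos < n:
        # Skip whitespace and comments between tokens.
        while pos < n:
            if text[pos].isspace():
                pos += 1
            elif text.startswith('/*', pos):
                end = text.find('*/', pos + 2)
                if end < 0:
                    raise SyntaxError('unterminated comment')
                pos = end + 2
            else:
                break
        # Scan one word token; a comment opener ends the token.
        start = pos
        while pos < n and not text[pos].isspace() and not text.startswith('/*', pos):
            pos += 1
        if pos > start:
            tokens.append(text[start:pos])
    return tokens
```

On the sen/*sible*/tence example this yields the two tokens sen and tence, which is exactly the "comment is whitespace" behaviour described above: the comment splits the word rather than hiding inside it.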

Combined unparser/parser generator

Is there a parser generator that also implements the inverse direction, i.e. unparsing domain objects (a.k.a. pretty-printing) from the same grammar specification? As far as I know, ANTLR does not support this.
I have implemented a set of Invertible Parser Combinators in Java and Kotlin. A parser is written pretty much in LL(1) style, and it provides a parse method and a print method, where the latter provides the pretty printer.
You can find the project here: https://github.com/searles/parsing
Here is a tutorial: https://github.com/searles/parsing/blob/master/tutorial.md
And here is a parser/pretty printer for mathematical expressions: https://github.com/searles/parsing/blob/master/src/main/java/at/searles/demo/DemoInvert.kt
Take a look at Invertible syntax descriptions: Unifying parsing and pretty printing.
There are several parser generators that include an implementation of an unparser. One of them is the nearley parser generator for context-free grammars.
It is also possible to implement bidirectional transformations of source code using definite clause grammars. In SWI-Prolog, the phrase/2 predicate can convert an input text into a parse tree and vice-versa.
Our DMS Software Reengineering Toolkit does precisely this (and provides a lot of additional support for analyzing/transforming code). It does this by decorating a language grammar with additional attributes, producing what is called an attribute grammar. We use a special DSL to write these rules to make them convenient to write.
It helps to know that DMS produces a tree based directly on the grammar.
Each DMS grammar rule is paired with a so-called "prettyprinting" rule. Each prettyprinting rule describes how to "prettyprint" the syntactic element and sub-elements recognized by its corresponding grammar rule. The prettyprinting process essentially manufactures or combines rectangular boxes of text horizontally or vertically (with optional indentation), with leaves producing unit-height boxes containing the literal value of the leaf (keyword, operator, identifier, constant, etc.).
As an example, one might write the following DMS grammar rule and matching prettyprinting rule:
statement = 'for' '(' assignment ';' assignment ';' conditional_expression ')'
'{' sequence_of_statements '}' ;
<<PrettyPrinter>>:
{ V(H('for','(',assignment[1],';',assignment[2],';',conditional_expression,')'),
    H('{', I(sequence_of_statements)),
    '}');
}
This will parse the following:
for ( i=x*2;
i--; i>-2*x ) { a[x]+=3;
b[x]=a[x]-1; }
(using additional grammar rules for statements and expressions) and prettyprint it (using additional prettyprinting rules for those additional grammar rules) as follows:
for (i=x*2;i--;i>-2*x)
{ a[x]+=3;
b[x]=a[x]-1;
}
DMS also captures comments, attaches them to AST nodes, and regenerates them on output. The implementation is a bit exotic because most parsers don't handle comments, but utilization is easy, even "free"; comments will be automatically inserted in the prettyprinted result in their original places.
DMS can also print in "fidelity" mode. In this form, it tries to preserve the shape of the token (e.g., number radix, identifier character capitalization, which keyword spelling was used) and the column offset (into the line) of each parsed token. This causes the original text (or something so close that you don't think it is different) to be regenerated.
More details about what prettyprinters must do are provided in my SO answer on Compiling an AST back to source code. DMS addresses all of those topics cleanly.
This capability has been used by DMS on some 40+ real languages, including full IBM COBOL, PL/SQL, Java 1.8, C# 5.0, C (many dialects) and C++14.
By writing a sufficiently interesting set of prettyprinter rules, you can build things like JavaDoc extended to include hyperlinked source code.
It is not possible in general.
What makes a print pretty? A print is pretty if spaces, tabs, or newlines appear at the positions that make the output look nice.
But most grammars ignore whitespace, because in most languages whitespace is not significant. There are exceptions like Python, but in general the question of whether it is a good idea to use whitespace as syntax is still controversial, and therefore most grammars do not treat whitespace as syntax.
And if the abstract syntax tree does not contain whitespace, because the parser has thrown it away, no generator can use it to pretty print the AST.

How to parse comments with EBNF grammars

When defining the grammar for a language parser, how do you deal with things like comments (eg /* .... */) that can occur at any point in the text?
Building up your grammar from tags within tags seems to work great when things are structured, but comments seem to throw everything off.
Do you just have to parse your text in two steps? First to remove these items, then to pick apart the actual structure of the code?
Thanks
Normally, comments are handled by the lexical analyzer, outside the scope of the main grammar. In effect, they are (usually) treated as if they were blanks.
One approach is to use a separate lexer. Another, much more flexible way is to amend all your token-like entries (keywords, lexical elements, etc.) with an implicit whitespace prefix, valid for the current context. This is how most modern Packrat parsers deal with whitespace.
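A sketch of the implicit-whitespace-prefix idea in Python (the token/WS names are mine, and the comment pattern here is the non-nesting C flavour):

```python
import re

# Implicit prefix: any run of blanks and /* ... */ comments.
WS = re.compile(r'(?:\s+|/\*.*?\*/)*', re.S)

def token(pattern):
    """Wrap a token regex so that leading whitespace and comments are
    consumed first -- the 'implicit whitespace prefix' idea."""
    tok = re.compile(pattern)
    def parse(text, pos):
        pos = WS.match(text, pos).end()      # skip the implicit prefix
        m = tok.match(text, pos)
        return (m.group(), m.end()) if m else None
    return parse

word = token(r'[A-Za-z]+')
print(word('  /* hello */  world', 0))       # ('world', 20)
```

Because every token rule carries the prefix, comments may appear between any two tokens without the main grammar ever mentioning them, which is the point of the technique.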

Why do interpreted/scripting languages rarely have multi-line comments?

Of the interpreted languages I know (Python, Perl, R, bash), multi-line comments seem to usually involve some misuse of another feature of the language (e.g. multiline strings).
Is there something inherent to the type of parsing which interpreters do that makes multiline comments hard? It doesn't seem like it should be significantly different from, say, multiline strings.
No, there's no reason for scripting languages not to support multiline comments. JavaScript, Groovy, Lua, PHP, REXX, Smalltalk, and Dart all support them.
In truth, I'm sure there is no particular difficulty in implementing any form of multiline comment (given the methods most parsers use to read and execute a script). In my personal opinion, scripts need multiline comments the most, for distribution and explanation (most high-level languages are compiled, and only a fraction of that code is open-sourced anyway). I do know that Lua, a scripting language, does provide multiline comments:
--[==[
COMMENT
]==]
I'm sure it's just a fluke that many languages don't support this. It is commonly accepted to just use single-line comments to build up a multiline comment.
//*****************************************\\
//** **\\
//** JOHN SMITH **\\
//** COPYRIGHT 2008-2011 **\\
//** **\\
//*****************************************\\
Lots of people will also use single-line comments to create a cool image (ASCII art), using the comment characters to frame the image (like the block displayed above, where // starts each comment line).
In Python we simply use a multiline string as a comment.
Why have two kinds of syntax when one will do?
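A tiny sketch of that idiom (the area function is just an illustrative example):

```python
def area(r):
    """A leading string is a docstring (kept as documentation)."""
    '''
    Any other bare string expression is a statement with no effect:
    the interpreter evaluates and discards it, so a triple-quoted
    string works as a multiline comment.
    '''
    return 3.14159 * r * r

print(area(2))
```

One caveat worth knowing: unlike a real comment, such a string is still parsed (and in some positions kept as a docstring), so it must be syntactically valid where it appears.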
Not all interpreted languages lack multi-line comments. Ruby, for example, has them.
I suspect this is largely the preference of the language designer. Some just don't see multi-line comments as a necessary feature. (Many code editors these days offer shortcuts that comment/uncomment large blocks of code using single-line comments.)
Also, multi-line comments add complexity to a parser. It has to deal with the possibility of nested comments, for example. Why add complexity if you don't have to?
I suspect it's just a bit of language bias. C-style multiline comments tend to show up in C-esque languages, but most other languages don't have an equivalent.
As for the parsers themselves, there's no reason I can think of that they couldn't implement multiline comments, other than that the language designer just didn't want to. Most parser generators can easily handle the construct.
There are two wrong assumptions in your question.
The first wrong assumption is that “interpreted/scripting languages rarely have multi-line comments”. You're suffering from confirmation bias. There are compiled languages with single-line comments (e.g. Fortran, many Lisp dialects) and interpreted languages with multi-line comments (e.g. Perl, Python).
The second wrong assumption is that “misuse of another feature” is involved. Languages are better designed as a whole, there's no need to introduce an extra feature for multiline comments if some feature that exists anyway will do the trick. For example, in Python, multiline strings exist, and an instruction consisting only of a string is a no-op, so multiline strings make fine comments. In Perl, one way to have multiline comments is through Pod, a documentation format; comments are a kind of documentation, so it's quite natural to use =pod … =cut for multiline comments (multiline strings, through here documents <<'EOF'; … EOF, are another method).
