Is the alternative operator in ABNF commutative?

Is the alternative operator in ABNF commutative? - parsing

Is the alternative operator (/) in Augmented Backus-Naur Form commutative?
For example, is s = a / b the same as s = b / a?

I haven't found any primary sources on BNF or ABNF which explicitly specify / semantics when both sides would yield valid matches. They don't allude to context-free grammars and their allowance for non-determinism either. If anyone knows of clarifying references please share.
EDIT: Tony's answer points out RFC 3501 from 2003 specifies the semantics of ABNF alternation, at least as it's used in that document.
RFC 5234: Augmented BNF for Syntax Specifications: ABNF (2008)
The introduction contrasts BNF and ABNF (with emphasis added here):
Over the years, a modified version of Backus-Naur Form (BNF), called Augmented BNF (ABNF), has been popular among many Internet specifications. It balances compactness and simplicity with reasonable representational power. In the early days of the Arpanet, each specification contained its own definition of ABNF. This included the email specifications, RFC733 and then RFC822 , which came to be the common citations for defining ABNF. The current document separates those definitions to permit selective reference.
The differences between standard BNF and ABNF involve naming rules, repetition, alternatives, order-independence, and value ranges.
"Selective reference" and "order-independence" may relate to alternation ordering semantics, but it's unclear.
RFC 822: Standard for the Format of ARPA Internet Text Messages (1982)
Unless I'm missing something, the cited RFCs don't specify / semantics either. Section 2.2 evades the problem.
2.2. RULE1 / RULE2: ALTERNATIVES
Elements separated by slash ("/") are alternatives. There-
fore "foo / bar" will accept foo or bar.
Various rule definitions show they recognize the practical importance of avoiding ambiguity. For example, here's how RFC 822 defines optional-field and its dependencies:
optional-field =
/ "Message-ID" ":" msg-id
/ "Resent-Message-ID" ":" msg-id
/ "In-Reply-To" ":" *(phrase / msg-id)
/ "References" ":" *(phrase / msg-id)
/ "Keywords" ":" #phrase
/ "Subject" ":" *text
/ "Comments" ":" *text
/ "Encrypted" ":" 1#2word
/ extension-field ; To be defined
/ user-defined-field ; May be pre-empted
extension-field =
<Any field which is defined in a document
published as a formal extension to this
specification; none will have names beginning
with the string "X-">
user-defined-field =
<Any field which has not been defined
in this specification or published as an
extension to this specification; names for
such fields must be unique and may be
pre-empted by published extensions>
The Syntax and Semantics of the Proposed International Algebraic Language of the Zurich ACM-GAMM Conference (Backus 1958)
BNF comes from IAL notation. The paper introduces ̅o̅r "metalinguistic connective", which is intuitively related to /. However, it also dodges the ambiguous choice problem and presumably just uses it carefully.
Recommendation
Due to unspecified semantics my suggestion is to treat every possible match in an alternation rule as valid. When the grammar isn't carefully designed to avoid ambiguity this interpretation can result in multiple valid parse trees for the same input. Addressing ambiguous parses as they occur is safer than forging ahead with an unintentionally valid parse tree.
Alternatively, if you have influence over how the grammar is specified you could consider a notation with clearer semantics. For example, Parsing Expression Grammar: A Recognition-Based Syntactic Foundation (Ford 2004) gives alternatives deterministic prioritized choice semantics (left-most match wins).

Some RFCs clarify this explicitly, with for example IMAPv4's RFC3501 including specification of PEG-like behaviour in RFC 3501 section 9:
In the case of alternative or optional rules in which a later rule
overlaps an earlier rule, the rule which is listed earlier MUST take
priority. For example, "\Seen" when parsed as a flag is the \Seen
flag name and not a flag-extension, even though "\Seen" can be parsed
as a flag-extension. Some, but not all, instances of this rule are
noted below.
I don't know how common such disambiguation (hah) is, though. Many other RFCs I've looked at (I've been implementing an ABNF parser library in recent days) just leave it unspecified. Many RFC ABNF grammars are unambiguous (e.g. RFC8259 (JSON)); however, many are ambiguous (e.g. RFC5322 (Internet Messages)) and require fixups to work with an ambiguity-preserving parser :-(

Related

How would you re-write the production rules for an existing grammar so that semi-colons were optional?

Suppose that there was a programming language Mod(C) just like C++ except that it was white-space sensitive.
That is, parsers and compilers written for Mod(C) did not ignore line-feeds, spaces, etc...
Also suppose that someone had already written down the production rules for describing a formal grammar for this modified version of C++
My question is, how would you modify the production rules so that semi-colons were optional in the event that the semi-colon was followed by a line-break?
Actually, the semi-colon would optional if some <optional_semi_colon> token is followed by:
one or more spaces and tabs
zero or one line-comments
a line-break (\n\r or \r\n or \n or \r)
The following piece of code would compile just fine:
#include <iostream>
using namespace std;
int main() {
for (int i = 1; i <= 5; ++i) {
cout << i << " "; // there is a semi-colon here. that's okay
}
return 0 // no semi-colon on this line
}

It's not at all clear what you mean by "white-space sensitive" and your example doesn't show any white-space sensitivity other than the possible optionality of semicolons at the end of a line. That is, you don't seem to be looking for an implementation of the off-side rule, as with Python or Haskell, where indentation indicates block structure. [Note 1]
Presumably, you are not just asking how to turn any newline into a semicolon, since that would be trivial. So I assume that you want something like JavaScript's automatic semicolon insertion (ASI), which automatically inserts a semicolon at the end of a line if:
the line doesn't already end with a newline, and
the current parse cannot be extended with the first token of the next line, either because
2.1. there is no item in the current state's itemset which would allow the next token to be shifted, or
2.2. the items which might allow a shift have been marked as not allowing a newline.
[Note 2]
Provision 2.1 prevents the incorrect semicolon insertion in:
let x = a
+ b
On the other hand, there are cases when you really want the newline to end the statement, in order to avoid silent bugs, such as:
yield
/* The above yield always produces undefined. */
console.log("We've been resumed");
yield -- and other statements with optional operands -- are annotated in the grammar with [No LineTerminator here], which triggers provision 2.2 and thus allows ASI even though the next token (console in the above example) could otherwise have extended the parse. Another context where newlines are banned is between a value and the postfix ++ operator, so that
let v = 2 * a
++v
is accepted with the (presumably) intended meaning. (Without ASI, there would be a syntax error after the ++, where ASI is not allowed because there's no newline character.)
JavaScript is not the only language with optional command terminators, but it's probably the language with the most elaborate set of rules controlling the parse. Other languages include:
Python, which in addition to the off-side rule, ignores semicolons inside parentheses, braces and brackets;
Awk, which treats newlines as statement terminators unless the newline follows one of the tokens ,, {, ?, :, ||, &&, do, or else. [Note 3]
Bash, in which a newline is a command terminator except in specific contexts, such as after a |, || or && operator or a keyword like do, then and else, and inside an array literal or an arithmetic expansion. [Note 4]
Kotlin, whose rules I don't know. And undoubtedly other languages as well.
JavaScript (and, I believe, Kotlin) suffer from an ambiguity with function calls, because
let a = b
((x)=>console.log(x))(42)
is parsed as calling b with the argument (x)=>console.log(x) and then calling the result with the argument 42. (Which is a runtime error, because the result of console.log is undefined, which cannot be called.)
There are also languages like Lua, in which semicolons are always optional, even if you run statements together on a single line (a = 3 b = 4), and which therefore also suffers from the misinterpretation of function call expressions. However, unlike JavaScript, Lua requires that the ( be on the same line as the function expression, and therefore flags the equivalent of the above example as a syntax error. (The check is not part of the grammar, for what it's worth: After the function call is parsed, a semantic check is performed to verify that the line number of the ( token is the same as the line number of the last token in the function expression.)
I went to the trouble of enumerating all of the above examples by way of illustrating the fact that optional semicolons are not a simple grammatical transformation, and that there is no simple rule which can determine the precise circumstances in which a newline is an implicit statement terminator. Realistic implementations of the feature are non-trivial, and differ in their details; the algorithm chosen needs to be tested against a variety of realistic code samples, and it needs to be carefully documented so that programmers using the language don't find themselves surprised by the results. If you get it wrong, but your language nonetheless becomes popular, you'll find projects with style guides which require semicolons even in context in which they were optional. None of that is intended to imply that you shouldn't pursue the idea; only that it is perhaps more complicated than it looks at first glance.
Having said all that, I don't believe that any of the above examples require a context-sensitive grammar (unless you want to implement the off-side rule). Even in the case of JavaScript, possibly with some minor exceptions, a parser can be created by starting with an LALR automaton and then adjusting the transition rules, state by state, in order to either ignore a newline token or reduce it to a statement terminator (as well as implementing the lookahead restrictions in certain rules). Most of these modifications will effectively be simply the deletion of one conflicting parser actions, similar to the operator-precedence-based resolution of ambiguous expression grammars. (And it's worth noting that most parser generators make no attempt to rewrite the original grammar after processing of the precedence declarations.)
However, while the existence of a PDA demonstrates the existence of a context-free grammar (at least for a context-free superset of the target language [Note 5]), it does not demonstrate the existence of a simple or elegant grammar. It seems to me likely that recreating a grammar from the modified PDA will produce a bloated monster without much value as a discursive tool. The modified PDA itself is sufficient to perform the parse, so reconstructing a grammar is not of much practical value.
Notes
That's perhaps just as well, because the off-side rule is not context-free, and thus cannot be implemented with a context free grammar. Although there are well-known techniques for implementing in a lexical scanner.
That's a slight oversimplification of the ASI rules. In some cases, JavaScript also allows a semicolon to be inserted before a }, even if there is no newline at that point. But that's not a significant complication.
? and : are a Gnu AWK extension.
That's not a complete list, by any means. See the Posix shell grammar and the bash manual for more details.
While the published grammars for the languages mentioned above, like the grammars for C and C++, are nominally context-free grammars, they do not actually encompass the entirety of the well-formedness rules in the respective language standards and/or manuals, which include constraints (or "early errors", in the terms of ECMA-262), "that can be detected and reported prior to the evaluation of any construct". Many of these rules are clearly context-sensitive (such as prohibitions on multiple definitions of the same name in a lexical scope).
Context-sensitive parsing is not necessarily a bad thing. Sometimes it's a lot simpler than trying to achieve the same result with a context-free grammar (as in the case of Lua function calls mentioned above). But it's certainly convenient to parse as much as possible using a generated parser, since such a parser can more easily be repurposed for other applications, such as linters, code browsers, syntax highlighters, and so on, not all of which need to be as precise as a compiler.

Does context-sensitive tokenisation require multiple goal symbols in the lexical grammar?

According to the ECMAScript spec:
There are several situations where the identification of lexical input
elements is sensitive to the syntactic grammar context that is
consuming the input elements. This requires multiple goal symbols for
the lexical grammar.
Two such symbols are InputElementDiv and InputElementRegExp.
In ECMAScript, the meaning of / depends on the context in which it appears. Depending on the context, a / can either be a division operator, the start of a regex literal or a comment delimiter. The lexer cannot distinguish between a division operator and regex literal on its own, so it must rely on context information from the parser.
I'd like to understand why this requires the use of multiple goal symbols in the lexical grammar. I don't know much about language design so I don't know if this is due to some formal requirement of a grammar or if it's just convention.
Questions
Why not just use a single goal symbol like so:
InputElement ::
[...]
DivPunctuator
RegularExpressionLiteral
[...]
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
What are some other languages that use multiple goal symbols in their lexical grammar?
How would we classify the ECMAScript lexical grammar? It's not context-sensitive in the sense of the formal definition of a CSG (i.e. the LHS of its productions are not surrounded by a context of terminal and nonterminal symbols).

Saying that the lexical production is "sensitive to the syntactic grammar context that is consuming the input elements" does not make the grammar context-sensitive, in the formal-languages definition of that term. Indeed, there are productions which are "sensitive to the syntactic grammar context" in just about every non-trivial grammar. It's the essence of parsing: the syntactic context effectively provides the set of potentially expandable non-terminals, and those will differ in different syntactic contexts, meaning that, for example, in most languages a statement cannot be entered where an expression is expected (although it's often the case that an expression is one of the manifestations of a statement).
However, the difference does not involve different expansions for the same non-terminal. What's required in a "context-free" language is that the set of possible derivations of a non-terminal is the same set regardless of where that non-terminal appears. So the context can provide a different selection of non-terminals, but every non-terminal can be expanded without regard to its context. That is the sense in which the grammar is free of context.
As you note, context-sensitivity is usually abstracted in a grammar by a grammar with a pattern on the left-hand side rather than a single non-terminal. In the original definition, the context --everything other than the non-terminal to be expanded-- needed to be passed through the production untouched; only a single non-terminal could be expanded, but the possible expansions depend on the context, as indicated by the productions. Implicit in the above is that there are grammars which can be written in BNF which don't even conform to that rule for context-sensitivity (or some other equivalent rule). So it's not a binary division, either context-free or context-sensitive. It's possible for a grammar to be neither (and, since the empty context is still a context, any context-free grammar is also context-sensitive). The bottom line is that when mathematicians talk, the way they use words is sometimes unexpected. But it always has a clear underlying definition.
In formal language theory, there are not lexical and syntactic productions; just productions. If both the lexical productions and the syntactic productions are free of context, then the total grammar is free of context. From a practical viewpoint, though, combined grammars are harder to parse, for a variety of reasons which I'm not going to go into here. It turns out that it is somewhat easier to write the grammars for a language, and to parse them, with a division between lexical and syntactic parsers.
In the classic model, the lexical analysis is done first, so that the parser doesn't see individual characters. Rather, the syntactic analysis is done with an "alphabet" (in a very expanded sense) of "lexical tokens". This is very convenient -- it means, for example, that the lexical analysis can simply drop whitespace and comments, which greatly simplifies writing a syntactic grammar. But it also reduces generality, precisely because the syntactic parser cannot "direct" the lexical analyser to do anything. The lexical analyser has already done what it is going to do before the syntactic parser is aware of its needs.
If the parser were able to direct the lexical analyser, it would do so in the same way as it directs itself. In some productions, the token non-terminals would include InputElementDiv and while in other productions InputElementRegExp would be the acceptable non-terminal. As I noted, that's not context-sensitivity --it's just the normal functioning of a context-free grammar-- but it does require a modification to the organization of the program to allow the parser's goals to be taken into account by the lexical analyser. This is often referred to (by practitioners, not theorists) as "lexical feedback" and sometimes by terms which are rather less value neutral; it's sometimes considered a weakness in the design of the language, because the neatly segregated lexer/parser architecture is violated. C++ is a pretty intense example, and indeed there are C++ programs which are hard for humans to parse as well, which is some kind of indication. But ECMAScript does not really suffer from that problem; human beings usually distinguish between the division operator and the regexp delimiter without exerting any noticeable intellectual effort. And, while the lexical feedback required to implement an ECMAScript parser does make the architecture a little less tidy, it's really not a difficult task, either.
Anyway, a "goal symbol" in the lexical grammar is just a phrase which the authors of the ECMAScript reference decided to use. Those "goal symbols" are just ordinary lexical non-terminals, like any other production, so there's no difference between saying that there are "multiple goal symbols" and saying that the "parser directs the lexer to use a different production", which I hope addresses the question you asked.
Notes
The lexical difference in the two contexts is not just that / has a different meaning. If that were all that it was, there would be no need for lexical feedback at all. The problem is that the tokenization itself changes. If an operator is possible, then the /= in
a /=4/gi;
is a single token (a compound assignment operator), and gi is a single identifier token. But if a regexp literal were possible at that point (and it's not, because regexp literals cannot follow identifiers), then the / and the = would be separate tokens, and so would g and i.
Parsers which are built from a single set of productions are preferred by some programmers (but not the one who is writing this :-) ); they are usually called "scannerless parsers". In a scannerless parser for ECMAScript there would be no lexical feedback because there is no separate lexical analysis.
There really is a breach between the theoretical purity of formal language theory and the practical details of writing a working parser of a real-life programming language. The theoretical models are really useful, and it would be hard to write a parser without knowing something about them. But very few parsers rigidly conform to the model, and that's OK. Similarly, the things which are popularly calle "regular expressions" aren't regular at all, in a formal language sense; some "regular expression" operators aren't even context-free (back-references). So it would be a huge mistake to assume that some theoretical result ("regular expressions can be identified in linear time and constant space") is actually true of a "regular expression" library. I don't think parsing theory is the only branch of computer science which exhibits this dichotomy.

Why not just use a single goal symbol like so:
InputElement ::
...
DivPunctuator
RegularExpressionLiteral
...
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
Note that DivPunctuator and RegExLiteral aren't productions per se, rather they're nonterminals. And in this context, they're right-hand-sides (alternatives) in your proposed production for InputElement. So I'd rephrase your question as: Why not have the syntactic parser tell the lexical parser which of those two alternatives to use? (Or equivalently, which of those two to suppress.)
In the ECMAScript spec, there's a mechanism to accomplish this: grammatical parameters (explained in section 5.1.5).
E.g., you could define the parameter Div, where:
+Div means "a slash should be recognized as a DivPunctuator", and
~Div means "a slash should be recognized as the start of a RegExLiteral".
So then your production would become
InputElement[Div] ::
...
[+Div] DivPunctuator
[~Div] RegularExpressionLiteral
...
But notice that the syntactic parser still has to tell the lexical parser to use either InputElement[+Div] or InputElement[~Div] as the goal symbol, so you arrive back at the spec's current solution, modulo renaming.
What are some other languages that use multiple goal symbols in their lexical grammar?
I think most don't try to define a single symbol that derives all tokens (or input elements), let alone have to divide it up into variants like ECMAScript's InputElementFoo, so it might be difficult to find another language with something similar in its specification.
Instead, it's pretty common to simply define rules for the syntax of different kinds of tokens (e.g. Identifier, NumericLiteral) and then reference them from the syntactic productions. So that's kind of like having multiple lexical goal symbols, but not (I would say) in the sense you were asking about.
How would we classify the ECMAScript lexical grammar?
It's basically context-free, plus some extensions.

What is the definitive EBNF syntax?

I am currently using the lark parser for python to try and read in some problem specifications. I am getting confused about what the "proper" syntax is for Extended Backus-Naur form, especially about how the LHS and RHS are separated. The wikipedia page uses an equals = sign, lark expects just a colon; see lark cheat sheet. Other sources use the ::= separator - e.g. the atom ebnf package.
Is there a definitive answer? The official ISO spec seems to suggest that the "defining-symbol" should be = but there seems to be wriggle room in the spec. So why all the different versions?

Since the world hasn't yet appointed a Lord High Commissioner of Grammar Formalisms, there is no definitive syntax. You're certainly free to use the ISO "Extended BNF" standard, particularly if you're writing some other ISO standard, but don't expect it to be implemented by a parser generator, even one which extends normal BNF. (There's no definitive standard for BNF, either.)
I have no way of knowing what was going on in the minds of the authors of the ISO standard, but I suspect that their expectations were realistic: it's intended to allow precise description of syntaxes for standards documents, but there are many features which are not suitable for automated implementation (including a way of writing rule restrictions in English to be used when the formalism isn't sufficiently general). It's often possible to automatically extract (most of) a grammar from an ISO standard, but the task is neither simple nor -- as far as I can see -- intended to be simple, since most ISO standards are not distributed as plain text documents and extracting formatted text from either PDF or HTML formats presents its own challenges.
The options you present for punctuation are most of the common ones, although mathematicians often write BNF using ⇒ to separate left- and right-hand sides. (Unfortunately, most keyboards lack that useful character.)
I'm personally not fond of the ::= separator, although it is used by various parser generators. It seems to me to be way too much typing for a simple punctuator, and it is also annoyingly difficult to align with alternatives flagged with |. But to each their own.

EBNF Grammar for list of words separated by a space

I am trying to understand how to use EBNF to define a formal grammar, in particular a sequence of words separated by a space, something like
<non-terminal> [<word>[ <word>[ <word>[ ...]]] <non-terminal>
What is the correct way to define a word terminal?
What is the correct way to represent required whitespace?
How are optional, repetitive lists represented?
Are there any show-by-example tutorials on EBNF anywhere?
Many thanks in advance!

You have to decide whether your lexical analyzer is going to return a token (terminal) for the spaces. You also have to decide how it (the lexical analyzer) is going to define words, or whether your grammar is going to do that (in which case, what is the lexical analyzer going to return as terminals?).
For the rest, it is mostly a question of understanding the niceties of EBNF notation, which is an ISO standard (ISO 14977:1996 — and it is available as a free download from Freely Available Standards, which you can also get to from ISO), but it is a standard that is largely ignored in practice. (The languages I deal with — C, C++, SQL — use a BNF notation in the defining documents, but it is not EBNF in any of them.)
Whatever you want to make the correct definition of a word. You need to think about how you'd want to treat the name P. J. O'Neill, for example. What tokens will the lexical analyzer return for that?
This is closely related to the previous issue; what are the terminals that lexical analyzer is going to return.
Optional repetitive lists are enclosed in { and } braces, or you can use the Kleene Star notation.
There is a paper Extended BNF — A generic base standard by R. S. Scowen that explains EBNF. There's also the Wikipedia entry on EBNF.
I think that a non-empty, space-separated word list might be defined using:
non_empty_word_list = word { space word }
where all the names there are non-terminals. You'd need to define those in terms of the relevant terminals of your system.

Combined unparser/parser generator

Is there a parser generator that also implements the inverse direction, i.e. unparsing domain objects (a.k.a. pretty-printing) from the same grammar specification? As far as I know, ANTLR does not support this.

I have implemented a set of Invertible Parser Combinators in Java and Kotlin. A parser is written pretty much in LL-1 style and it provides a parse- and a print-method where the latter provides the pretty printer.
You can find the project here: https://github.com/searles/parsing
Here is a tutorial: https://github.com/searles/parsing/blob/master/tutorial.md
And here is a parser/pretty printer for mathematical expressions: https://github.com/searles/parsing/blob/master/src/main/java/at/searles/demo/DemoInvert.kt

Take a look at Invertible syntax descriptions: Unifying parsing and pretty printing.

There are several parser generators that include an implementation of an unparser. One of them is the nearley parser generator for context-free grammars.
It is also possible to implement bidirectional transformations of source code using definite clause grammars. In SWI-Prolog, the phrase/2 predicate can convert an input text into a parse tree and vice-versa.

Our DMS Software Reengineering Toolkit does precisely this (and provides a lot of additional support for analyzing/transforming code). It does this by decorating a language grammar with additional attributes, producing what is called an attribute grammar. We use a special DSL to write these rules to make them convenient to write.
It helps to know that DMS produces a tree based directly on the grammar.
Each DMS grammar rule is paired with with so-called "prettyprinting" rule. Each prettyprinting rule describes how to "prettyprint" the syntactic element and sub-elements recognized by its corresponding grammar rule. The prettyprinting process essentially manufactures or combines rectangular boxes of text horizontally or vertically (with optional indentation), with leaves producing unit-height boxes containing the literal value of the leaf (keyword, operator, identifier, constant, etc.
As an example, one might write the following DMS grammar rule and matching prettyprinting rule:
statement = 'for' '(' assignment ';' assignment ';' conditional_expression ')'
'{' sequence_of_statements '}' ;
<<PrettyPrinter>>:
{ V(H('for','(',assignment[1],';','assignment[2],';',conditional_expression,')'),
H('{', I(sequence_of_statements)),
'}');
This will parse the following:
for ( i=x*2;
i--; i>-2*x ) { a[x]+=3;
b[x]=a[x]-1; }
(using additional grammar rules for statements and expressions) and prettyprint it (using additional prettyprinting rules for those additional grammar rules) as follows:
for (i=x*2;i--;i>-2*x)
{ a[x]+=3;
b[x]=a[x]-1;
}
DMS also captures comments, attaches them to AST nodes, and regenerates them on output. The implementation is a bit exotic because most parsers don't handle comments, but utilization is easy, even "free"; comments will be automatically inserted in the prettyprinted result in their original places.
DMS can also print in "fidelity" mode. In this form, it tries to preserve the shape of the toke (e.g., number radix, identifier character capitalization, which keyword spelling was used) the column offset (into the line) of a parsed token. This would cause the original text (or something so close that you don't think it is different) to get regenerated.
More details about what prettyprinters must do are provided in my SO answer on Compiling an AST back to source code. DMS addresses all of those topics cleanly.
This capability has been used by DMS on some 40+ real languages, including full IBM COBOL, PL/SQL, Java 1.8, C# 5.0, C (many dialects) and C++14.
By writing a sufficiently interesting set of prettyprinter rules, you can build things like JavaDoc extended to include hyperlinked source code.

It is not possible in general.
What makes a print pretty? A print is pretty, if spaces, tabs or newlines are at those positions, which make the print looking nicely.
But most grammars ignore white spaces, because in most languages white spaces are not significant. There are exceptions like Python but in general the question, whether it is a good idea to use white spaces as syntax, is still controversial. And therefor most grammars do not use white spaces as syntax.
And if the abstract syntax tree does not contain white spaces, because the parser has thrown them away, no generator can use them to pretty print an AST.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart