Combined unparser/parser generator

Combined unparser/parser generator - parsing

Is there a parser generator that also implements the inverse direction, i.e. unparsing domain objects (a.k.a. pretty-printing) from the same grammar specification? As far as I know, ANTLR does not support this.

I have implemented a set of Invertible Parser Combinators in Java and Kotlin. A parser is written pretty much in LL-1 style and it provides a parse- and a print-method where the latter provides the pretty printer.
You can find the project here: https://github.com/searles/parsing
Here is a tutorial: https://github.com/searles/parsing/blob/master/tutorial.md
And here is a parser/pretty printer for mathematical expressions: https://github.com/searles/parsing/blob/master/src/main/java/at/searles/demo/DemoInvert.kt

Take a look at Invertible syntax descriptions: Unifying parsing and pretty printing.

There are several parser generators that include an implementation of an unparser. One of them is the nearley parser generator for context-free grammars.
It is also possible to implement bidirectional transformations of source code using definite clause grammars. In SWI-Prolog, the phrase/2 predicate can convert an input text into a parse tree and vice-versa.

Our DMS Software Reengineering Toolkit does precisely this (and provides a lot of additional support for analyzing/transforming code). It does this by decorating a language grammar with additional attributes, producing what is called an attribute grammar. We use a special DSL to write these rules to make them convenient to write.
It helps to know that DMS produces a tree based directly on the grammar.
Each DMS grammar rule is paired with with so-called "prettyprinting" rule. Each prettyprinting rule describes how to "prettyprint" the syntactic element and sub-elements recognized by its corresponding grammar rule. The prettyprinting process essentially manufactures or combines rectangular boxes of text horizontally or vertically (with optional indentation), with leaves producing unit-height boxes containing the literal value of the leaf (keyword, operator, identifier, constant, etc.
As an example, one might write the following DMS grammar rule and matching prettyprinting rule:
statement = 'for' '(' assignment ';' assignment ';' conditional_expression ')'
'{' sequence_of_statements '}' ;
<<PrettyPrinter>>:
{ V(H('for','(',assignment[1],';','assignment[2],';',conditional_expression,')'),
H('{', I(sequence_of_statements)),
'}');
This will parse the following:
for ( i=x*2;
i--; i>-2*x ) { a[x]+=3;
b[x]=a[x]-1; }
(using additional grammar rules for statements and expressions) and prettyprint it (using additional prettyprinting rules for those additional grammar rules) as follows:
for (i=x*2;i--;i>-2*x)
{ a[x]+=3;
b[x]=a[x]-1;
}
DMS also captures comments, attaches them to AST nodes, and regenerates them on output. The implementation is a bit exotic because most parsers don't handle comments, but utilization is easy, even "free"; comments will be automatically inserted in the prettyprinted result in their original places.
DMS can also print in "fidelity" mode. In this form, it tries to preserve the shape of the toke (e.g., number radix, identifier character capitalization, which keyword spelling was used) the column offset (into the line) of a parsed token. This would cause the original text (or something so close that you don't think it is different) to get regenerated.
More details about what prettyprinters must do are provided in my SO answer on Compiling an AST back to source code. DMS addresses all of those topics cleanly.
This capability has been used by DMS on some 40+ real languages, including full IBM COBOL, PL/SQL, Java 1.8, C# 5.0, C (many dialects) and C++14.
By writing a sufficiently interesting set of prettyprinter rules, you can build things like JavaDoc extended to include hyperlinked source code.

It is not possible in general.
What makes a print pretty? A print is pretty, if spaces, tabs or newlines are at those positions, which make the print looking nicely.
But most grammars ignore white spaces, because in most languages white spaces are not significant. There are exceptions like Python but in general the question, whether it is a good idea to use white spaces as syntax, is still controversial. And therefor most grammars do not use white spaces as syntax.
And if the abstract syntax tree does not contain white spaces, because the parser has thrown them away, no generator can use them to pretty print an AST.

Related

Predictive editor for Rascal grammar

I'm trying to write a predictive editor for a grammar written in Rascal. The heart of this would be a function taking as input a list of symbols and returning as output a list of symbol types, such that an instance of any of those types would be a syntactically legal continuation of the input symbols under the grammar. So if the input list was [4,+] the output might be [integer]. Is there a clever way to do this in Rascal? I can think of imperative programming ways of doing it, but I suspect they don't take proper advantage of Rascal's power.

That's a pretty big question. Here's some lead to an answer but the full answer would be implementing it for you completely :-)
Reify an original grammar for the language you are interested in as a value using the # operator, so that you have a concise representation of the grammar which can be queried easily. The representation is defined over the modules Type, ParseTree which extends Type and Grammar.
Construct the same representation for the input query. This could be done in many ways. A kick-ass, language-parametric, way would be to extend Rascal's parser algorithm to return partial trees for partial input, but I believe this would be too much hassle now. An easier solution would entail writing a grammar for a set of partial inputs, i.e. the language grammar with at specific points shorter rules. The grammar will be ambiguous but that is not a problem in this case.
Use tags to tag the "short" rules so that you can find them easily later: syntax E = #short E "+";
Parse with the extended and now ambiguous grammar;
The resulting parse trees will contain the same representation as in ParseTree that you used to reify the original grammar, except in that one the rules are longer, as in prod(E, [E,+,E],...)
then select the trees which serve you best for the goal of completion (which use the #short tag), and extract their productions "prod", which look like this prod(E,[E,+],...). For example using the / operator: [candidate : /candidate:prod(_,_,/"short") := trees], and you could use a cursor position to find candidates which are close by instead of all short trees in there.
Use list matching to find prefixes in the original grammar, like if (/match:prod(_,[*prefix, predicted, *postfix],_) := grammar) ..., prefix is your query as extracted from the #short rules. predicted is your answer and postfix is whatever would come after.
yield the predicted symbol back as a type for the user to read: "<type(predicted, ())>" (will pretty print it nicely even if it's some complex regexp type and does the quoting right etc.)

Trying to understand lexers, parse trees and syntax trees

I'm reading the "Dragon Book" and I think I understand the main point of a lexer, parse tree and syntax tree and what errors they're typically supposed to catch (assuming we're using a context-free language), but I need someone to catch me if I'm wrong. My understanding is that a lexer simply tokenizes the input and catches errors that have to do with invalid constructs in code, such as a semi-colon being passed in language that doesn't contain semi-colons. the parse tree is used to verify that the syntax is followed and the code is in the correct order, and the syntax tree is used to actually evaluate the statements and expressions in the code and generate things like 3-address code or machine code. Are any of these correct?
Side-note: Are a concrete-syntax tree and a parse tree the same thing?
Side-Side note: When constructing the AST, does the entire program get built into one giant AST, or is a different AST constructed per statement/expression?

Strictly speaking, lexers are parsers too. The difference between lexers and parsers is in what they operate on. In the world of a lexer, everything is made of individual characters, which it then tokenizes by matching them to the regular grammar it understands. To a parser, the world is made up of tokens which it makes a syntax tree out of by matching them to the context-free grammar it understands. In this sense, they are both doing the same kind of thing but on different levels. In fact, you could build parsers on top of parsers, operating at higher and higher levels so one symbol in the highest level grammar could represent something incredibly complex at the lowest-level.
To your other questions:
Yes, a concrete syntax tree is a parse tree.
Generally, parsers make
one tree for the entire file, since that represents a sentence in the
CFG. This isn't necessarily always the case though.

EBNF Grammar for list of words separated by a space

I am trying to understand how to use EBNF to define a formal grammar, in particular a sequence of words separated by a space, something like
<non-terminal> [<word>[ <word>[ <word>[ ...]]] <non-terminal>
What is the correct way to define a word terminal?
What is the correct way to represent required whitespace?
How are optional, repetitive lists represented?
Are there any show-by-example tutorials on EBNF anywhere?
Many thanks in advance!

You have to decide whether your lexical analyzer is going to return a token (terminal) for the spaces. You also have to decide how it (the lexical analyzer) is going to define words, or whether your grammar is going to do that (in which case, what is the lexical analyzer going to return as terminals?).
For the rest, it is mostly a question of understanding the niceties of EBNF notation, which is an ISO standard (ISO 14977:1996 — and it is available as a free download from Freely Available Standards, which you can also get to from ISO), but it is a standard that is largely ignored in practice. (The languages I deal with — C, C++, SQL — use a BNF notation in the defining documents, but it is not EBNF in any of them.)
Whatever you want to make the correct definition of a word. You need to think about how you'd want to treat the name P. J. O'Neill, for example. What tokens will the lexical analyzer return for that?
This is closely related to the previous issue; what are the terminals that lexical analyzer is going to return.
Optional repetitive lists are enclosed in { and } braces, or you can use the Kleene Star notation.
There is a paper Extended BNF — A generic base standard by R. S. Scowen that explains EBNF. There's also the Wikipedia entry on EBNF.
I think that a non-empty, space-separated word list might be defined using:
non_empty_word_list = word { space word }
where all the names there are non-terminals. You'd need to define those in terms of the relevant terminals of your system.

Parsing Context Sensitive Language

i am reading the Definitive ANTLR reference by Terence Parr, where he says:
Semantic predicates are a powerful
means of recognizing context-sensitive
language structures by allowing
runtime information to drive
recognition
But the examples in the book are very simple. What i need to know is: can ANTLR parse context-sensitive rules like:
xAy --> xBy
If ANTLR can't parse these rules, is there is another tool that deals with context-sensitive grammars?

ANTLR parses only grammars which are LL(*). It can't parse using grammars for full context-sensitive languages such as the example you provided. I think what Parr meant was that ANTLR can parse some languages that require some (left) context constraints.
In particular, one can use semantic predicates on "reduction actions" (we do this for GLR parsers
used by our DMS Software Reengineering Toolkit but the idea is similar for ANTLR, I think) to inspect any data collected by the parser so far, either as ad hoc side effects of other semantic actions, or in a partially-built parse tree.
For our DMS-based DMS-based Fortran front end, there's a context-sensitive check to ensure that DO-loops are properly lined up. Consider:
DO 20, I= ...
DO 10, J = ...
...
20 CONTINUE
10 CONTINUE
From the point of view of the parser, the lexical stream looks like this:
DO <number> , <variable> = ...
DO <number> , <variable> = ...
...
<number> CONTINUE
<number> CONTINUE
How can the parser then know which DO statement goes with which CONTINUE statement?
(saying that each DO matches its closest CONTINUE won't work, because FORTRAN can
share a CONTINUE statement with multiple DO-heads).
We use a semantic predicate "CheckMatchingNumbers" on the reduction for the following rule:
block = 'DO' <number> rest_of_do_head newline
block_of_statements
<number> 'CONTINUE' newline ; CheckMatchingNumbers
to check that the number following the DO keyword, and the number following the CONTINUE keyword match. If the semantic predicate says they match, then a reduction for this rule succeeds and we've aligned the DO head with correct CONTINUE. If the predicate fails, then no reduction is proposed (and this rule is removed from candidates for parsing the local context); some other set of rules has to parse the text.
The actual rules and semantic predicates to handle FORTRAN nesting with shared continues is more complex than this but I think this makes the point.
What you want is full context-sensitive parsing engine. I know people have built them, but I don't know of any full implementations, and don't expect them to be fast.
I did follow Quinn Taylor Jackson's MetaS grammar system for awhile; it sounded like a practical attempt to come close.

It is comparatively easy to write a context-sensitive parser in Prolog. This program parses the string [a,is,less,than,b,and,b,is,less,than,c], converting it into [a,<,b,<,c]:
:- initialization(main).
:- set_prolog_flag('double_quotes','chars').
main :-
rewrite_system([a,is,less,than,b,and,b,is,less,than,c],X),writeln('\nFinal output:'),writeln(X).
rewrite_rule([[A,<,B],and,[B,<,C]],[A,<,B,<,C]).
rewrite_rule([A,is,less,than,B],[A,<,B]).
rewrite_rule([[A,<,B],and,C,than,D],[[A,<,B],and,A,is,C,than,D]).
rewrite_rule([A,<,B],[[A,<,B]]).
rewritten(A) :- atom(A);bool(A).
bool(A) :- atom(A).
bool([A,<,B,<,C]) :- atom(A),atom(B),atom(C).
bool([A,and,B]) :- bool(A),bool(B).
% this predicate is from https://stackoverflow.com/a/8312742/975097
replace(ToReplace, ToInsert, List, Result) :-
once(append([Left, ToReplace, Right], List)),
append([Left, ToInsert, Right], Result).
rewrite_system(Input,Output) :-
rewritten(Input),Input=Output;
rewrite_rule(A,B),
replace(A,B,Input,Input1),
writeln(Input1),
rewrite_system(Input1,Output).
Using the same algorithm, I also wrote an adaptive parser that "learns" new rewrite rules from its input.

Looking for a clear definition of what a "tokenizer", "parser" and "lexers" are and how they are related to each other and used?

I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program will go through c/h source files to extract data declaration and definitions.
I have been looking for examples and can find some info, but I really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax tree and how they interrelate to each other. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, 2) are there common implementations.
I have been looking at Wikipedia on these topics and programs like Lex and Yacc, but having never gone through a compiler class (EE major) I am finding it difficult to fully understand what is going on.

A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.
Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".

Example:
int x = 1;
A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.
A parser will take those tokens and use them to understand in some way:
we have a statement
it's a definition of an integer
the integer is called 'x'
'x' should be initialised with the value 1

I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.

(adding to the given answers)
Tokenizer will also remove any comments, and only return tokens to the Lexer.
Lexer will also define scopes for those tokens (variables/functions)
Parser then will build the code/program structure

Using
"Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, AKA the Purple Dragon Book
a related answer of mine What is the difference between a token and a lexeme?
As with my other answer such questions as this make more sense when a specific goal is desired.
In your case the specific goal is
Create a program will go through c/h source files to extract data declaration and definitions.
If the goal is to create Abstract Syntax Trees (AST) then those are created using a Parser and a Parser is commonly feed a list of Tokens from the Lexer. Notice that Tokenizer is deliberately not mentioned.
Another way to think of the relation between a Lexer and Parser is that a Lexer creates a linear structure (list/stream of tokens) and a Parser converts the tokens into an tree structure (Abstract Syntax Tree).
If you read the Dragon book you will notice that the word Analysis appears often which is to say that analysis is one of the key functions at the various stages. This is because when working with Lexers and Parsers they are designed to work with formal languages and a determination needs to be made if the input adheres to the formal language.
From page 5
character stream
|
V
Lexical Analyzer
(token stream)
|
V
Syntax Analyzer
(syntax tree)
|
V
Semantic Analyzer
(syntax tree)
|
V
...
In the above diagram the Lexer is associated with Lexical Analyzer and I would associate Syntax Analyzer and Semantic Analyzer with Parser but YMMV.
AFAIK Tokenizer has no official definition in the Dragon book, not even noted in the index. I don't have an electronic copy of the book so could not do an automated search.
One common reference that notes Tokenizer is Anatomy of a Compiler but the Dragon books are the reference of choice by many in the field.
However if your only goal is to create a list of tokens and then do something else other than semantic analysis then calling the module/function/... a tokenizer might be the right name.
I use Lexer with Parser and don't use Tokenizer with Parser.
Another thought to keep in mind is that if no useful information should be lost in the transformations. In other words if one of your goals is to be able to recreate the input from the AST then the AST needs to capture the extraneous information like whitespace, which then means the Lexer also needs to capture the extraneous information. One reason to go through such effort is to create useful error messages or for Edit code and continue Debugging.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart