How can I detect dead productions from an ANTLR4 grammar?

I have an ANTLR4 grammar containing a large number of productions I don't want to use. I'd like to clean them out of the grammar file. ANTLR4 doesn't seem to allow you to specify a "goal" symbol, but if it could, I'd like to identify and remove any productions that aren't reachable from that goal symbol.
Is there a way to identify these kinds of unused productions so I can remove them from the grammar file?

There’s no such functionality in ANTLR itself. However, the ANTLR plugin for IntelliJ gives a warning when productions aren’t used.

Use Visual Studio Code along with my antlr4-vscode extension and enable code lenses (preferences: antlr4.referencesCodeLens.enabled). It will give you a reference count for each rule.
Or you can directly run AntlrLanguageSupport.countReferences(fileName, symbol) from the underlying antlr4-graps library in a Node.js shell. More API details can be found in the API doc file.
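If you prefer something scriptable and don't mind a heuristic, a rough reachability check over the grammar text is easy to write yourself. The sketch below is hypothetical, not part of any ANTLR tooling: it assumes each rule starts at the beginning of a line and ends at a ';', and its crude regex stripping of comments, literals, and actions will be fooled by exotic grammars.

import java.nio.file.*;
import java.util.*;
import java.util.regex.*;

public class DeadRules {
    public static void main(String[] args) throws Exception {
        String text = Files.readString(Path.of(args[0]));  // grammar file
        String goal = args[1];                             // goal rule name

        // Crude preprocessing: drop comments, literals, and actions so
        // quoted text and action code are not mistaken for rule references.
        text = text.replaceAll("//[^\n]*", " ")
                   .replaceAll("(?s)/\\*.*?\\*/", " ")
                   .replaceAll("'(?:\\\\.|[^'\\\\])*'", " ")
                   .replaceAll("(?s)\\{.*?\\}", " ");

        // A rule definition: an identifier at the start of a line followed
        // by ':'; its body runs to the terminating ';'.
        Matcher def = Pattern.compile("(?m)^(\\w+)\\s*:(.*?);", Pattern.DOTALL).matcher(text);
        Map<String, String> rules = new LinkedHashMap<>();
        while (def.find()) rules.put(def.group(1), def.group(2));

        // Breadth-first search over rule references from the goal symbol.
        Set<String> reachable = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(goal));
        while (!work.isEmpty()) {
            String r = work.pop();
            if (!reachable.add(r) || !rules.containsKey(r)) continue;
            Matcher ref = Pattern.compile("\\w+").matcher(rules.get(r));
            while (ref.find())
                if (rules.containsKey(ref.group())) work.push(ref.group());
        }

        for (String r : rules.keySet())
            if (!reachable.contains(r))
                System.out.println("unreachable: " + r);
    }
}

Usage would be something like java DeadRules MyGrammar.g4 startRule. For anything serious, the IDE support above is more robust.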

Related

Testing grammar for ambiguities

I'm writing a grammar for a formal language. Ideally I'd want that grammar to be unambiguous, but that might not be possible. In either case, I want to know about all possible ambiguities while developing the grammar. How can I do that?
So far, most of the time when I developed a language, I'd turn to Bison, write an LR(1) grammar for it, run Bison in verbose mode, and look at all the shift-reduce and reduce-reduce conflicts it tells me about, making sure that I agree with its choice in each case.
But now I'm in a project where Bison doesn't have a code generator for one of the required target languages, and where ANTLR is already being used. Plus the language isn't LR(1), and rewriting it as LR(1) would entail additional syntax checks after the parser is done, thus reducing the expressiveness of the grammar as a tool to describe the language.
So I'm now working with ANTLR: I fed it my grammar, and all seems to be working well. But ANTLR doesn't seem to check for ambiguities at compile time. For example, the following grammar is ambiguous:
grammar test;
lst: '(' ')' {System.out.println("a");}
| '(' elts ')' {System.out.println("b");} ;
elts: elt (',' elt)* ;
elt: 'x' | /* empty */ ;
The input () could be interpreted as the empty list, or it could be interpreted as a list consisting of a single empty element. The generated parser chooses the former interpretation, but I'd like to be able to manually verify that choice.
Is there a command-line switch I can use to get ANTLR to tell me about ambiguities?
Or perhaps an option I can set in the grammar file?
Or should I use some other tool to check the grammar for ambiguities?
If so, is there one which can already read ANTLR syntax, or do I have to strip out all the actions and boil this down to BNF myself?
The ANTLRErrorListener.reportAmbiguity method suggests that ANTLR might be able to perform some ambiguity testing at runtime. But I guess that's only going to tell you whether the parsing of a given input is ambiguous. Is there some strategy by which I could leverage this to detect all ambiguities, using a carefully selected set of inputs?
Well, as far as I know, ANTLR has no real option to check for ambiguity, other than the errors it produces at runtime if you write an ambiguous grammar and feed it an input that triggers the ambiguity (see the sketch after this answer for how to surface those reports). I do, however, know a few tools which can check for ambiguity. They all have different syntax, and I don't know of any tool which reads ANTLR grammars.
A program called AtoCC includes a tool called KfG which can check for ambiguity.
ACLA (Ambiguity Checking with Language Approximations).
Context Free Grammar Tool.
Personally, I find tool 3 the easiest to use, but it is also the most limited. It is important to note, however, that none of these tools can be 100% sure: if a tool says your grammar is ambiguous, it is ambiguous; but if it says your grammar is unambiguous, it might still be ambiguous, since there is no way to test the infinite number of inputs your language allows.
Hope this helps.
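That said, if you mainly want to surface the runtime reports mentioned above, ANTLR 4's generated (Java) parser can be put into exact ambiguity detection mode with the built-in DiagnosticErrorListener; reportAmbiguity then fires for every ambiguous decision on the input you feed it. The class and rule names below are the ones ANTLR generates for the test grammar in the question.

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.atn.PredictionMode;

public class AmbiguityCheck {
    public static void main(String[] args) {
        // testLexer/testParser are what ANTLR generates for "grammar test;".
        testLexer lexer = new testLexer(CharStreams.fromString("()"));
        testParser parser = new testParser(new CommonTokenStream(lexer));

        // Report every ambiguity exactly, instead of silently taking
        // the first viable alternative.
        parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
        parser.addErrorListener(new DiagnosticErrorListener());

        parser.lst();  // prints an ambiguity report for the input "()"
    }
}

This is still per-input: it proves ambiguity only for the sentences you try, so it complements rather than replaces the whole-grammar tools listed above.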

How to write parser which handles import statements?

I'm using lex & yacc to write a VHDL parser. VHDL has some language features which make it context-sensitive in a manner similar to C. For example, typedef-like constructs which impact whether the parser should tokenize something as an IDENTIFIER vs. a TYPEDEF_NAME.
The difficulty comes in when you need to build a symbol table based on another file which is referenced by "use" statements (similar to "import" in Java or Python).
library ieee;
use ieee.std_logic_1164.all;
-- code which uses something defined in ieee.std_logic_1164 package
In C, this is fairly straightforward because the preprocessor has already combined all of the header files into a single translation unit which can be scanned from top to bottom. But 'use' statements in VHDL are not preprocessor commands.
So, somehow, as I'm parsing the file, I have to recognize when I see a use statement and then go off and parse the relevant file, and then continue parsing the original file with that symbol table.
Is there an elegant way to do this with lex/yacc? I know there is yyrestart but I'm not sure if that's going down the right track.
If you are using flex, then it is pretty easy.
The basic mechanism (including two functioning code samples) is described in the "Multiple Input Buffers" chapter of the flex manual. You can also take a glance at this question on SO.
The parser (yacc/bison) reduction which recognizes the use construction can include the code which calls yypush_buffer_state. In the example code, the end of the included file is recognized by the scanner (lex/flex), which simply pops the buffer stack.
Depending on the formal rules of file inclusion, you might want the parser to know that the included file has finished, in order to avoid having syntactic constructs which start in the included file and continue in the includer. (C allows this, even though it is almost always an error; I don't know anything about VHDL, but there are definitely languages which do not allow it.) One possibility is to recursively call the parser in order to parse the included file, which will require a re-entrant ("pure") parser. In that case, the scanner should return an end-of-included-file token when it hits the end of the included file, because your included file grammar production will need to be terminated with such a token.
You may need to worry about the possibility that the parser has already requested the next input token. Most LALR(1) grammars do not depend on the lookahead token for semicolon-terminated statements, and bison usually doesn't request a lookahead token in a context in which it doesn't need it. But that behaviour is not guaranteed by all POSIX-compatible yacc implementations, and you might be using one which always reads ahead.
In that case, you would have to preserve the lookahead token so that you can reread it after the included file has been parsed. That would most conveniently be done by stashing the lookahead token somewhere the scanner can see it, and having the scanner return that token (if set) when it sees the end of the included file. In a bison action, you can find the lookahead token in yychar and its semantic value and location (if locations are enabled) are in yylval and yylloc. If bison has not read the lookahead token, the value of yychar will be YYEMPTY, and the simplest possible bison implementation would assert(yychar == YYEMPTY) when it is about to push the input buffer. If the assert fails, you'll need to implement a more sophisticated strategy.
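The mechanism itself is not flex-specific. Below is a self-contained Java sketch of the same idea: a stack of input sources, a push when the parser recognizes a use clause, and an explicit end-of-included-file token so no construct can straddle a file boundary. The whitespace-splitting "scanner" and the token names are stand-ins for a real lexer.

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of the buffer-stack idea (the flex version keeps this stack
// inside the scanner via yypush_buffer_state / yypop_buffer_state).
class StackedLexer {
    record Token(String type, String text) {}

    private final Deque<Scanner> stack = new ArrayDeque<>();

    StackedLexer(Path mainFile) throws IOException {
        stack.push(new Scanner(Files.newBufferedReader(mainFile)));
    }

    // The parser action that recognizes `use <file>;` calls this.
    void pushInclude(Path file) throws IOException {
        stack.push(new Scanner(Files.newBufferedReader(file)));
    }

    Token nextToken() {
        Scanner current = stack.peek();
        if (current.hasNext())
            return new Token("WORD", current.next());
        if (stack.size() > 1) {
            stack.pop().close();                  // like yypop_buffer_state
            // Hand the grammar an explicit marker so a construct cannot
            // start inside the included file and end outside it.
            return new Token("END_OF_INCLUDE", "");
        }
        return new Token("EOF", "");
    }
}

Note that this sketch sidesteps the lookahead problem described above only if the parser calls pushInclude before fetching its next token; otherwise the stashed-lookahead strategy still applies.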

How to write a language with Python-like indentation in syntax?

I'm writing a tool with its own built-in language, similar to Python. I want to make indentation meaningful in the syntax (so that tabs and spaces at the beginning of a line would represent nesting of commands).
What is the best way to do this?
I've written recursive-descent and finite automata parsers before.
CPython's parser has historically been generated by a tool called pgen from its Grammar file; the ASDL file describes the shape of the AST, not the parser.
Regarding the indentation you're asking for, it's done using special lexer tokens called INDENT and DEDENT. To replicate that, just implement those tokens in your lexer (that is pretty easy if you use a stack to store the starting columns of previous indented lines), and then plug them into your grammar as usual (like any other keyword or operator token).
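Here is a minimal Java sketch of that stack technique (token names and the line-by-line handling are illustrative, not tied to any particular generator; tab handling and indentation-error checks are left out):

import java.util.*;

// Emit INDENT/DEDENT tokens by comparing each line's leading spaces
// with a stack of currently open indentation widths.
class IndentLexer {
    static List<String> tokens(String source) {
        Deque<Integer> indents = new ArrayDeque<>(List.of(0));
        List<String> out = new ArrayList<>();
        for (String line : source.split("\n")) {
            if (line.isBlank()) continue;        // blank lines don't affect nesting
            int width = 0;
            while (width < line.length() && line.charAt(width) == ' ') width++;
            if (width > indents.peek()) {        // deeper: open a block
                indents.push(width);
                out.add("INDENT");
            }
            while (width < indents.peek()) {     // shallower: close blocks
                indents.pop();
                out.add("DEDENT");
            }
            out.add("LINE(" + line.strip() + ")");
        }
        while (indents.pop() > 0) out.add("DEDENT");  // close anything still open
        return out;
    }

    public static void main(String[] args) {
        tokens("if x:\n    y = 1\n    if y:\n        z = 2\nprint(x)\n")
            .forEach(System.out::println);
    }
}

The INDENT and DEDENT markers then slot into the grammar exactly like opening and closing braces would in a C-style language.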
Check out the old (Python 2) compiler package and in particular compiler.parse.
I'd suggest ANTLR for any lexer/parser generation ( http://www.antlr.org ).
Also, this website ( http://erezsh.wordpress.com/2008/07/12/python-parsing-1-lexing/ ) has some more information, in particular:
Python’s indentation cannot be solved with a DFA. (I’m still perplexed at whether it can even be solved with a context-free grammar).
PyPy produced an interesting post about lexing Python (they intend to solve it by post-processing the lexer output)
CPython’s tokenizer is written in C. It’s ad-hoc, hand-written, and complex. It is the only official implementation of Python lexing that I know of.

Combined unparser/parser generator

Is there a parser generator that also implements the inverse direction, i.e. unparsing domain objects (a.k.a. pretty-printing) from the same grammar specification? As far as I know, ANTLR does not support this.
I have implemented a set of Invertible Parser Combinators in Java and Kotlin. A parser is written pretty much in LL(1) style, and it provides a parse and a print method, where the latter is the pretty printer.
You can find the project here: https://github.com/searles/parsing
Here is a tutorial: https://github.com/searles/parsing/blob/master/tutorial.md
And here is a parser/pretty printer for mathematical expressions: https://github.com/searles/parsing/blob/master/src/main/java/at/searles/demo/DemoInvert.kt
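To give a flavor of the approach, here is a toy Java sketch of the core idea (loosely inspired by such libraries, not the actual API of the project above): a single description exposes both directions.

import java.util.Optional;

// One description, two directions: text -> value and value -> text.
interface Syntax<T> {
    Optional<T> parse(String input);   // empty on failure
    String print(T value);             // the pretty printer
}

// Example instance: integer literals, where print inverts parse.
class IntSyntax implements Syntax<Integer> {
    public Optional<Integer> parse(String input) {
        try { return Optional.of(Integer.valueOf(input.trim())); }
        catch (NumberFormatException e) { return Optional.empty(); }
    }
    public String print(Integer value) { return Integer.toString(value); }
}

The real libraries build such pairs compositionally, so a whole grammar yields a parser and a printer from one specification.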
Take a look at Invertible syntax descriptions: Unifying parsing and pretty printing.
There are several parser generators that include an implementation of an unparser. One of them is the nearley parser generator for context-free grammars.
It is also possible to implement bidirectional transformations of source code using definite clause grammars. In SWI-Prolog, the phrase/2 predicate can convert an input text into a parse tree and vice-versa.
Our DMS Software Reengineering Toolkit does precisely this (and provides a lot of additional support for analyzing/transforming code). It does this by decorating a language grammar with additional attributes, producing what is called an attribute grammar. We use a special DSL to make these rules convenient to write.
It helps to know that DMS produces a tree based directly on the grammar.
Each DMS grammar rule is paired with a so-called "prettyprinting" rule. Each prettyprinting rule describes how to "prettyprint" the syntactic element and sub-elements recognized by its corresponding grammar rule. The prettyprinting process essentially manufactures or combines rectangular boxes of text horizontally or vertically (with optional indentation), with leaves producing unit-height boxes containing the literal value of the leaf (keyword, operator, identifier, constant, etc.).
As an example, one might write the following DMS grammar rule and matching prettyprinting rule:
statement = 'for' '(' assignment ';' assignment ';' conditional_expression ')'
'{' sequence_of_statements '}' ;
<<PrettyPrinter>>:
{ V(H('for','(',assignment[1],';',assignment[2],';',conditional_expression,')'),
H('{', I(sequence_of_statements)),
'}'); }
This will parse the following:
for ( i=x*2;
i--; i>-2*x ) { a[x]+=3;
b[x]=a[x]-1; }
(using additional grammar rules for statements and expressions) and prettyprint it (using additional prettyprinting rules for those additional grammar rules) as follows:
for (i=x*2;i--;i>-2*x)
{ a[x]+=3;
b[x]=a[x]-1;
}
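To make the box algebra described above concrete, here is a toy Java sketch of H/V/I combinators; it illustrates the general box model only, not DMS's actual machinery.

import java.util.*;

// H glues boxes left to right, aligning continuation lines under the
// point where each box starts; V stacks boxes; I indents a box.
class Box {
    final List<String> lines;
    Box(List<String> lines) { this.lines = lines; }

    static Box leaf(String s) { return new Box(List.of(s)); }

    static Box V(Box... boxes) {
        List<String> out = new ArrayList<>();
        for (Box b : boxes) out.addAll(b.lines);
        return new Box(out);
    }

    static Box H(Box... boxes) {
        List<String> out = new ArrayList<>(List.of(""));
        for (Box b : boxes) {
            int last = out.size() - 1;
            String pad = " ".repeat(out.get(last).length());
            out.set(last, out.get(last) + b.lines.get(0));
            for (int i = 1; i < b.lines.size(); i++)
                out.add(pad + b.lines.get(i));
        }
        return new Box(out);
    }

    static Box I(Box b) {  // indent by two spaces
        List<String> out = new ArrayList<>();
        for (String l : b.lines) out.add("  " + l);
        return new Box(out);
    }

    public String toString() { return String.join("\n", lines); }

    public static void main(String[] args) {
        Box forLoop = V(
            H(leaf("for"), leaf("("), leaf("i=x*2;"), leaf("i--;"), leaf("i>-2*x"), leaf(")")),
            H(leaf("{ "), I(V(leaf("a[x]+=3;"), leaf("b[x]=a[x]-1;")))),
            leaf("}"));
        System.out.println(forLoop);  // prints roughly the layout shown above
    }
}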
DMS also captures comments, attaches them to AST nodes, and regenerates them on output. The implementation is a bit exotic because most parsers don't handle comments, but using it is easy, even "free": comments will be automatically inserted in the prettyprinted result in their original places.
DMS can also print in "fidelity" mode. In this form, it tries to preserve the shape of each token (e.g., number radix, identifier capitalization, which keyword spelling was used) and the column offset (into the line) of each parsed token. This causes the original text (or something so close that you don't think it is different) to be regenerated.
More details about what prettyprinters must do are provided in my SO answer on Compiling an AST back to source code. DMS addresses all of those topics cleanly.
This capability has been used by DMS on some 40+ real languages, including full IBM COBOL, PL/SQL, Java 1.8, C# 5.0, C (many dialects) and C++14.
By writing a sufficiently interesting set of prettyprinter rules, you can build things like JavaDoc extended to include hyperlinked source code.
It is not possible in general.
What makes a print pretty? A print is pretty if spaces, tabs, and newlines appear at the positions that make the output look nice.
But most grammars ignore whitespace, because in most languages whitespace is not significant. There are exceptions, like Python, but in general the question of whether it is a good idea to use whitespace as syntax is still controversial, and therefore most grammars do not treat whitespace as syntax.
And if the abstract syntax tree does not contain whitespace, because the parser has thrown it away, no generator can use it to pretty-print the AST.

Enable/disable grammar rules in Yacc/Bison

Like the title says, I would like to enable/disable certain grammar rules in a yacc or bison grammar file.
Is there a way to do so?
If you mean at compile time, yacc uses standard C /* */ comment syntax, so you can simply comment the rules out.
If you mean at run time, you still have to work with the tables you have, so they need to include the entire grammar, optional phrases included.
So I would suggest making a fake terminal symbol. Rules that are optional would be preceded by the fake terminal, and the lexer would only return this terminal if you were including the optional productions.
A variation on this approach involves defining two versions of a terminal that actually exists. This only works for grammars whose phrases are led by terminals, but if yours is such a grammar, then one token type can appear in both sets of rules while the other appears only in the always-enabled set, that is:
T_A dynamic_phrase_in_grammar;
always_on static_phrase_in_grammar;
always_on: T_A | T_B;
So, to enable the dynamic phrase, the real terminal is returned as T_A; to disable it, it is returned as T_B.
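In code, the switch amounts to remapping the token type in the scanner before the parser sees it; a trivial Java sketch (token numbering and field names made up):

// The scanner decides at runtime which terminal the parser gets to see.
class TokenSwitch {
    static final int T_A = 1;           // enables the dynamic phrases
    static final int T_B = 2;           // matched only by always_on
    boolean dynamicRulesEnabled;

    // Applied to each scanned token before it is handed to the parser.
    int remap(int scannedType) {
        return (scannedType == T_A && !dynamicRulesEnabled) ? T_B : scannedType;
    }
}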
