How to combine ANTLR parser and lexer files

Essentially I would like to combine these g4 files:
https://github.com/apache/groovy/tree/master/src/antlr
into a single file which I can use with this clojure library:
https://github.com/aphyr/clj-antlr
which currently requires a combined parser/lexer file. How does one hack the files so they form a correct grammar in a single file? I've ruled out concatenating the files and removing the lexer and parser grammar prefixes as described here: https://github.com/antlr/antlr4/blob/master/doc/grammars.md

You cannot combine these grammars, because the lexer grammar uses its own superclass and lexer modes. Neither is possible in a combined grammar.

Related

Difference between Parser.tokens and Lexer.tokens?

Normally when I export or grun a grammar to a target language it gives me two .tokens files. For example in the following:
lexer grammar TestLexer;
NUM : [0-9]+;
OTHER : ABC;
fragment NEWER : [xyz]+;
ABC : [abc]+;
I get a token for each non-fragment, and I get two identical files:
# Parser.tokens
NUM=1
OTHER=2
ABC=3
# Lexer.tokens
NUM=1
OTHER=2
ABC=3
Are these files always the same? I tried defining a token in the parser but since I've defined it as parser grammar it doesn't allow that, so I would assume these two files would always be the same, correct?
Grammars are always processed as individual lexer and parser grammars. If a combined grammar is used, it is temporarily split into two grammars that are processed individually. Each processing step produces a tokens file (the list of lexer tokens found). The tokens file is the link between lexers and parsers. When you set a tokenVocab value, it is actually the tokens file that is used. That also means you don't need a lexer grammar if you have a tokens file.
I'm not sure about the parser.tokens file. It might be useful for grammar imports.
You can also specify a tokenVocab for lexer grammars, which allows you to explicitly assign number values to tokens. That can come in handy if you have to check for token ranges (e.g. all keywords) in platform code. I cannot check this currently, but it might be that using this feature leads to tokens files with different content.
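To check whether the two files really are the same for a given grammar, one could compare them programmatically. The following is a minimal sketch, assuming the tokens file consists of NAME=number lines as in the listing above (literal tokens appear as quoted names such as '+'=4). The function names read_tokens and diff_vocab are made up for this example.

```python
def read_tokens(text):
    """Parse the contents of a .tokens file into a {name: token type} dict."""
    vocab = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '=' because a literal name such as '='=5 contains one.
        name, _, number = line.rpartition("=")
        vocab[name] = int(number)
    return vocab


def diff_vocab(lexer_vocab, parser_vocab):
    """Return names that are missing on one side or mapped to different types."""
    names = set(lexer_vocab) | set(parser_vocab)
    return sorted(n for n in names if lexer_vocab.get(n) != parser_vocab.get(n))


# Contents taken from the question; in practice, read the two generated files.
lexer_text = "NUM=1\nOTHER=2\nABC=3\n"
parser_text = "NUM=1\nOTHER=2\nABC=3\n"
print(diff_vocab(read_tokens(lexer_text), read_tokens(parser_text)))  # prints []
```

An empty list means both vocabularies agree; any name printed is either missing from one file or numbered differently.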

How can I detect dead productions from an ANTLR4 grammar?

I have an ANTLR4 grammar containing a large number of productions I don't want to use. I'd like to clean them out of the grammar file. ANTLR4 doesn't seem to allow you to specify a "goal" symbol, but if it could, I'd like to identify and remove any productions that aren't reachable from that goal symbol.
Is there a way to identify these kinds of unused productions so I can remove them from the grammar file?
There’s no such functionality in ANTLR itself. However, the ANTLR plugin for IntelliJ gives a warning when productions aren’t used.
Use Visual Studio Code along with my antlr4-vscode extension and enable code lenses (preferences: antlr4.referencesCodeLens.enabled). It will give you a reference count for each rule.
Or you can directly run AntlrLanguageSupport.countReferences(fileName, symbol) from the underlying antlr4-graps library in a node shell. More API details in the API doc file.
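Reference counts alone can miss rules that are referenced only by other dead rules. If you can extract which rules each rule refers to, reachability from a chosen goal rule is a plain graph search. Here is a rough sketch, assuming the references have already been collected into a dict by some other means; the rule names and the unreachable_rules function are hypothetical and not part of any ANTLR tool.

```python
from collections import deque


def unreachable_rules(references, goal):
    """Return the rules that cannot be reached from `goal`.

    `references` maps each rule name to the rule names used in its body.
    """
    seen = {goal}
    queue = deque([goal])
    while queue:
        rule = queue.popleft()
        for target in references.get(rule, ()):
            # Names that are not keys (e.g. lexer tokens) are ignored here.
            if target in references and target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(set(references) - seen)


# Hypothetical grammar: legacyItem has a reference count of 1,
# but only from legacyBlock, which is itself unused.
references = {
    "program": ["statement"],
    "statement": ["expr"],
    "expr": ["expr", "atom"],
    "atom": [],
    "legacyBlock": ["legacyItem"],
    "legacyItem": ["atom"],
}
print(unreachable_rules(references, "program"))  # prints ['legacyBlock', 'legacyItem']
```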

I have written a grammar in Antlr and I want to check if some expressions do or do not parse according to the Grammar

I have written a BNF Grammar in Antlr4. Using Antlr commands I managed to run it and compile it. The outputs are all the necessary files that Antlr generates (Lexers, Parsers, Listeners). I am not sure if the BNF grammar I created is semantically correct, but at least it is syntactically correct, since no errors appear.
At this point, I have to check if some existing expressions parse according to that grammar, but I have no idea how to do that.
I'm making the following assumptions:
antlr-4.1-complete.jar is in your CLASSPATH
Your grammar is called 'Test'
Your starting rule is called 'parse'
Then do the following:
$ java org.antlr.v4.runtime.misc.TestRig Test parse -tree
Type expressions here
CTL^D
If you have an example expression in a file, you can pipe the contents through the parser:
$ cat fileName | java org.antlr.v4.runtime.misc.TestRig Test parse -tree

Which parser generator would be useful for manipulating the productions themselves?

Similar to Generating n statements from context-free grammars, I want to randomly generate sentences from a grammar.
What is a good parser generator for manipulating the actual grammar productions themselves? I want the parser generator to actually give me access to the productions (production objects?).
If I had a grammar akin to:
start_symbol ::= foo
foo ::= bar | baz
What is a good parser generator for:
giving me the starting production symbol
allowing me to choose one production from the RHS of the start symbol (foo in this case)
giving me the production options for foo
Clearly every parser has internal representations for productions and methods of associating a production with its RHS, but which parser generator makes it easy to manipulate these internals?
Note: the blog entry linked to from the other SO question I mentioned has some sort of custom CFG parser. I want to use an actual grammar for a real parser, not generate my own grammar parser.
It should be pretty easy to write a grammar that matches the grammar syntax a parser generator accepts. (With an open source parser generator, you ought to be able to fetch such a grammar from the parser generator's source code; they all tend to have self-grammars.) With that, you can then parse any grammar the parser generator accepts.
If you want to manipulate the parsed grammar, you'll need an abstract syntax tree of same. You can make most parser generators build a tree, either by using built-in mechanisms or ad hoc code you add.
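Once the grammar is available as a data structure, random sentence generation is a short recursive walk. The sketch below assumes the productions from the question have already been loaded into a plain dict (done by hand here, not by a parser generator), and it treats any symbol without productions, such as bar and baz, as a terminal.

```python
import random

# Each nonterminal maps to a list of alternatives; each alternative is a
# list of symbols. This mirrors the two productions in the question.
GRAMMAR = {
    "start_symbol": [["foo"]],
    "foo": [["bar"], ["baz"]],
}


def generate(symbol, grammar, rng):
    """Expand `symbol` into a list of terminals, picking alternatives at random."""
    if symbol not in grammar:
        return [symbol]  # no productions: treat as a terminal
    production = rng.choice(grammar[symbol])
    words = []
    for part in production:
        words.extend(generate(part, grammar, rng))
    return words


rng = random.Random(0)  # seeded so runs are repeatable
print(" ".join(generate("start_symbol", GRAMMAR, rng)))
```

Note that a recursive grammar can make this walk arbitrarily deep, so a real generator would probably want a depth limit or weighted choices.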

Does the recognition of numbers belong in the scanner or in the parser?

When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(Definitions were made on the fly, I have probably made a mistake in them).
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete and it is up to the implementer to realize that they should actually be in the scanner?
Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: check out backtracking recursive-descent parsers. For such a parser, tokens are defined in a similar way to non-tokens. pyparsing is a parser generator for such parsers.
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides but generally it is pretty obvious. You want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?
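To illustrate handling numbers at the character level, here is a small regex-based scanner. It is only a sketch: the integer and real patterns follow the EBNF in the question, while the token names and the operator set are assumptions added for the example.

```python
import re

# REAL is listed before INTEGER so that "3.14" is not split into "3", ".", "14".
TOKEN_SPEC = [
    ("REAL", r"\d+\.\d+(?:[eE]\d+)?"),
    ("INTEGER", r"\d+"),
    ("OP", r"[-+*/()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Turn a character string into a list of (token type, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace is matched but not emitted
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("3.14e2 + 42"))
# prints [('REAL', '3.14e2'), ('OP', '+'), ('INTEGER', '42')]
```

A parser built on top of this only ever sees REAL and INTEGER tokens and never has to inspect individual digits.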
