how to convert antlr4 grammar file to tree-sitter grammar file? - parsing

Does anyone know of any tool(s) that can convert ANTLR v4 grammar files (.g4 extension) to tree-sitter grammar files (.js extension)? It would also be fine if I had to chain a couple conversion tools together. For example, going from foo.g4 (antlr4) to foo.ebnf (intermediary format) to foo.js (tree-sitter). Thank you!
I tried using this tool to go from g4 to ebnf, and then this tool to go from ebnf to tree-sitter js, but to no avail. The first tool seemed to create some junk at the bottom of the file which gave the second tool trouble. Additionally, the second tool seems to expect each definition to be completely on one line (and the first tool breaks each definition up into multiple lines for readability).

Related

How can I detect dead productions from an ANTLR4 grammar?

I have an ANTLR4 grammar containing a large number of productions I don't want to use. I'd like to clean them out of the grammar file. ANTLR4 doesn't seem to allow you to specify a "goal" symbol, but if it could, I'd like to identify and remove any productions that aren't reachable from that goal symbol.
Is there a way to identify these kinds of unused productions so I can remove them from the grammar file?
There’s no such functionality in ANTLR itself. However, the ANTLR plugin for IntelliJ gives a warning when productions aren’t used:
Use Visual Studio Code along with my antlr4-vscode extension and enable code lenses (preferences: antlr4.referencesCodeLens.enabled). It will give you a reference count for each rule:
Or you can directly run AntlrLanguageSupport.countReferences(fileName, symbol) from the underlying antlr4-graps library in a node shell. More API details in the API doc file.

Parsing and pretty printing the same file format in Haskell

I was wondering, if there is a standard, canonical way in Haskell to write not only a parser for a specific file format, but also a writer.
In my case, I need to parse a data file for analysis. However, I also simulate data to be analyzed and save it in the same file format. I could now write a parser using Parsec or something equivalent and also write functions that perform the text output in the way that it is needed, but whenever I change my file format, I would have to change two functions in my code. Is there a better way to achieve this goal?
Thank you,
Dominik
The BNFC-meta package https://hackage.haskell.org/package/BNFC-meta-0.4.0.3
might be what you looking for
"Specifically, given a quasi-quoted LBNF grammar (as used by the BNF Converter) it generates (using Template Haskell) a LALR parser and pretty pretty printer for the language."
update: found this package that also seems to fulfill the objective (not tested yet) http://hackage.haskell.org/package/syntax

POSIX sh EBNF grammar

Is there an existing POSIX sh grammar available or do I have to figure it out from the specification directly?
Note I'm not so much interested in a pure sh; an extended but conformant sh is also more than fine for my purposes.
The POSIX standard defines the grammar for the POSIX shell. The definition includes an annotated Yacc grammar. As such, it can be converted to EBNF more or less mechanically.
If you want a 'real' grammar, then you have to look harder. Choose your 'real shell' and find the source and work out what the grammar is from that.
Note that EBNF is not used widely. It is of limited practical value, not least because there are essentially no tools that support it. Therefore, you are unlikely to find an EBNF grammar (of almost anything) off-the-shelf.
I have done some more digging and found these resources:
An sh tutorial located here
A Bash book containing Bash 2.0's BNF grammar (gone from here) with the relevant appendix still here
I have looked through the sources of bash, pdksh, and posh but haven't found anything remotely at the level of abstraction I need.
I've had multiple attempts at writing my own full blown Bash interpreters over the past year, and I've also reached at some point the same book appendix reference stated in the marked answer (#2), but it's not completely correct/updated (for example it doesn't define production rules using the 'coproc' reserved keyword and has a duplicate production rule definition for a redirection using '<&', might be more problems but those are the ones I've noticed).
The best way i've found was to go to http://ftp.gnu.org/gnu/bash/
Download the current bash version's sources
Open the parse.y file (which in this case is the YACC file that basically contains all the parsing logic that bash uses) and just copy paste the lines between '%%' in your favorite text editor, those define the grammar's production rules
Then, using a little bit of regex (which I'm terrible at btw) we can delete the extra code logic that are in between '{...}' to make the grammar look more BNF-like.
The regex i used was :
(\{(\s+.*?)+\})\s+([;|])
It matches any line non greedily .*? including spaces and new lines \s+ that are between curly braces, and specifically the last closing brace before a ; or | character. Then i just replaced the matched strings to \3 (e.g. the result of the third capturing group, being either ; or |).
Here's the grammar definition that I managed to extract at the time of posting https://pastebin.com/qpsK4TF6
I'd expect that sh, csh, ash, bash, would contain parsers. GNU versions of these are open source; you might just go check there.

Go uses Go to parse itself?

I am starting a class project that regards adding some functionality to Go.
However, I am thoroughly confused on the structure of Go. I was under the impression that Go used flex and bison but I can't find anything familiar in the Go source code.
On the other hand, the directory go/src/pkg/go has folders with familiar names (ast, token, parser, etc.) but all they contain are .go files. I'm confused!
My request is, of anyone familiar with Go, can you give me an overview of how Go is lexed, parsed, etc. and where to find the files to edit the grammar and whatnot?
The directory structure:
src/cmd/5* ARM
src/cmd/6* amd64 (x86-64)
src/cmd/8* i386 (x86-32)
src/cmd/cc C compiler (common part)
src/cmd/gc Go compiler (common part)
src/cmd/ld Linker (common part)
src/cmd/6c C compiler (amd64-specific part)
src/cmd/6g Go compiler (amd64-specific part)
src/cmd/6l Linker (amd64-specific part)
Lexer is written in pure C (no flex). Grammar is written in Bison:
src/cmd/gc/lex.c
src/cmd/gc/go.y
Many directories under src/cmd contain a doc.go file with short description of the directory's contents.
If you are planning to modify the grammar, it should be noted that the Bison grammar sometimes does not distinguish between expressions and types.
lex.c
go.y
The Go compilers are written in c, which is why you need flex and bison. The Go package for parsing is not used. If you wanted to write a self hosting compiler in Go, you could use the Go parsing package.

VBScript Partial Parser

I am trying to create a VBScript parser. I was wondering what is the best way to go about it. I have researched and researched. The most popular way seems to be going for something like Gold Parser or ANTLR.
The feature I want to implement is to do dynamic checking of Syntax Errors in VBScript. I do not want to compile the entire VBS every time some text changes. How do I go about doing that? I tried to use Gold Parser, but i assume there is no incremental way of doing parsing through it, something like partial parse trees...Any ideas on how to implement a partial parse tree for such a scenario?
I have implemented VBscript Parsing via GOLD Parser. However it is still not a partial parser, parses the entire script after every text change. Is there a way to build such a thing.
thks
If you really want to do incremental parsing, consider this paper by Tim Wagner.
It is brilliant scheme to keep existing parse trees around, shuffling mixtures of string fragments at the points of editing and parse trees representing the parts of the source text that hasn't changed, and reintegrating the strings into the set of parse trees. It is done using an incremental GLR parser.
It isn't easy to implement; I did just the GLR part and never got around to the incremental part.
The GLR part was well worth the trouble.
There are lots of papers on incremental parsing. This is one of the really good ones.
I'd first look for an existing VBScript parser instead of writing your own, which is not a trivial task!
Theres a VBScript grammar in BNF format on this page: http://rosettacode.org/wiki/BNF_Grammar which you can translate into a ANTLR (or some other parser generator) grammar.
Before trying to do fancy things like re-parsing only a part of the source, I recommend you first create a parser that actually works.
Best of luck!

Resources