Writing a subshell parsing rule on ANTLR - parsing

I'm trying to create a simple BaSH-like grammar on ANTLRv3 but haven't been able to parse (and check) input inside subshell commands.
Further explanation:
I want to parse the following input:
$(command parameters*)
`command parameters`
"some text $(command parameters*)"
And be able to check it's contents as I would with simple input such as: command parameters.
i.e.:
Parsing it would generate a tree like (SUBSHELL (CMD command (PARAM parameters*))) (tokens are in upper-case)
I'm able to ignore '$('s and '`'s, but that won't cover the cases where the subshells are used inside double-quoted strings, like:
$ echo "String test $(ls -l) end"
So... any tips on how do I achieve this?

I'm not very familiar with the details of Antlr v3, but I can tell you that you can't handle bash-style command substitution inside double-quoted strings in a traditional-style lexer, as the nesting cannot be expressed using a regular grammar. Most traditional compiler-compilers restrict lexers to use regular grammars so that efficient DFAs can be constructed for them. (Lexers, which irreducibly have to scan every single character of the source, have historically been one of the slowest parts of a compiler.)
You must either parse " as a token and (ideally) use a different lexer or lexer mode for the internals of strings, so that most shell metacharacters, e.g. '{', aren't parsed as tokens but as text; or alternatively, do away with the lexer-parser division and use a scannerless approach, so that the "lexer" rule for double-quoted strings can call into the "parser" rule for command substitutions.
I would favour the scannerless approach. I would investigate how well Antlr v3 supports writing grammars that work directly over a character stream, rather than using a token stream.

Related

Is there a way to do context sensitive parsing in tatsu

context sensitive '%' ..... eol comments
I'm starting with the grammar for PDF described here
https://github.com/caradoc-org/caradoc/blob/master/doc/grammar/grammar.pdf
which seems to lack the definition of eol comments.
PDF has end of line comments which start with the '%' character except inside string_literal (and another rule stream).
string_literal = "(" string_content ")";
where string_content can include the '%' character and also eol, but not "()" etc. The PDF language also has some special cases which otherwise look like comments eg
'%PDF-1.5' eol;
or
"%%EOF" [eol];
is there a way to handle the context sensitivity in a tatsu grammar?
I'll stay away from "Context Sensitive" in this answer, because the phrase has meaning in Language Theory.
PEG is perfectly capable of parsing a sub-language (say, Python string formatting expressions) within another language.
In fact, the original PEG definition does not use a tokenizer, because PEG grammars can parse the token sub-language.
If you think of sub-grammars, then the context is provided by the rule that knows that a sub-grammar has to be invoked.
With TatSu, there are features that allow tokenization to happen before the parsing (the Buffer class) for efficiency, and convenience, but using those features is not mandatory.
The only cases that cannot be handled easily as a grammar-within-a-grammar are preprocessing with macro capabilities, because those require an interpretation phase before the text for the inner grammar can be parsed.

What exactly is a lexer's job?

I'm writing a lexer for Markdown. In the process, I realized that I do not fully understand what its core responsibility should be.
The most common definition of a lexer is that it translates an input stream of characters into an output stream of tokens.
Input → Output
(characters) (tokens)
That sounds quite simple at first, but the question that arises here is how much semantic interpretation the lexer should do before handing over its output of tokens to the parser.
Take this example of Markdown syntax:
### Headline
*This* is an emphasized word.
It might be translated by a lexer into the following series of tokens:
Lexer 1 Output
.headline("Headline")
.emphasis("This")
.text"(" is an emphasized word.")
But it might as well be translated on a more granular level, depending on the grammar (or the set of lexemes) used:
Lexer 2 Output
.controlSymbol("#")
.controlSymbol("#")
.controlSymbol("#")
.text(" Headline")
.controlSymbol("*")
.text("This")
.controlSymbol("*")
.text"(" is an emphasized word.")
It seems a lot more practical to have the lexer produce an output similar to that of Lexer 1, because the parser will then have an easier job. But it also means that the lexer needs to semantically understand what the code means. It's not merely mapping a sequence of characters to a token. It needs to look ahead and identify patterns. (For example, it needs to be able to be able to distinguish between **Hey* you* and **Hey** you. It cannot simply translate a double asterisk ** into .openingEmphasis, because that depends on the following context.)
According to this Stackoverflow post and the CommonMark definition, it seems to make sense to first break down the Markdown input into a number of blocks (representing one or more lines) and then analyze the contents of each block in a second step. With the example above, this would mean the following:
.headlineBlock("Headline")
.paragraphBlock("*This* is an emphasized word.")
But this wouldn't count as a valid sequence of tokens because some of the lexemes ("*") have not been parsed yet and it wouldn't be right to pass this paragraphBlock to the parser.
So here's my question:
Where do you draw the line?
How much semantic work should the lexer do? Is there some hard cut in the definition of a lexer that I am not aware of?
What would be the best way to define a grammar for the lexer?
BNF is used to describe many languages / create lexers and parsers
MOST use a Look right 1 to define a unambiguous format.
Recently I was looking at playing with SQL BNF
https://github.com/ronsavage/SQL/blob/master/sql-92.bnf
I made the decision that my lexer would return only terminal token strings. Similar to your option 1.
'('
')'
'KEWORDS'
'-- comment eol'
'12.34'
...
Any rule that defined the syntax tree would be left to the parser.
<Document> := <lines>
<Lines> := <line> [<Lines>]
<line> := ...

POSIX sh EBNF grammar

Is there an existing POSIX sh grammar available or do I have to figure it out from the specification directly?
Note I'm not so much interested in a pure sh; an extended but conformant sh is also more than fine for my purposes.
The POSIX standard defines the grammar for the POSIX shell. The definition includes an annotated Yacc grammar. As such, it can be converted to EBNF more or less mechanically.
If you want a 'real' grammar, then you have to look harder. Choose your 'real shell' and find the source and work out what the grammar is from that.
Note that EBNF is not used widely. It is of limited practical value, not least because there are essentially no tools that support it. Therefore, you are unlikely to find an EBNF grammar (of almost anything) off-the-shelf.
I have done some more digging and found these resources:
An sh tutorial located here
A Bash book containing Bash 2.0's BNF grammar (gone from here) with the relevant appendix still here
I have looked through the sources of bash, pdksh, and posh but haven't found anything remotely at the level of abstraction I need.
I've had multiple attempts at writing my own full blown Bash interpreters over the past year, and I've also reached at some point the same book appendix reference stated in the marked answer (#2), but it's not completely correct/updated (for example it doesn't define production rules using the 'coproc' reserved keyword and has a duplicate production rule definition for a redirection using '<&', might be more problems but those are the ones I've noticed).
The best way i've found was to go to http://ftp.gnu.org/gnu/bash/
Download the current bash version's sources
Open the parse.y file (which in this case is the YACC file that basically contains all the parsing logic that bash uses) and just copy paste the lines between '%%' in your favorite text editor, those define the grammar's production rules
Then, using a little bit of regex (which I'm terrible at btw) we can delete the extra code logic that are in between '{...}' to make the grammar look more BNF-like.
The regex i used was :
(\{(\s+.*?)+\})\s+([;|])
It matches any line non greedily .*? including spaces and new lines \s+ that are between curly braces, and specifically the last closing brace before a ; or | character. Then i just replaced the matched strings to \3 (e.g. the result of the third capturing group, being either ; or |).
Here's the grammar definition that I managed to extract at the time of posting https://pastebin.com/qpsK4TF6
I'd expect that sh, csh, ash, bash, would contain parsers. GNU versions of these are open source; you might just go check there.

antlr: how to [sometimes] parse things in quotes?

I have a situation where my language allows quotes strings but sometimes I want to interpret the contents of the quoted string as language constructs. Think of it as, say, eval function.
So to support quoted strings i need a lexer rule and it overrides my attempts to have a grammar rule evaluating things in quotes if prefixed with 'eval'. Is there any way to deal with this in the grammar?
IMO you should not try to handle this case directly through the lexer.
I think I would leave the string as it in the lexer and add some code in the eval rule of the parser that calls a sub-parser on the string content.
If you want to implement an eval function, you're really looking for a runtime interpreter.
The only time you need an "eval" function is when you want to build up the content to compile at runtime. If you have the content available at compile-time, you can parse it without it being a string...
So... keep it as a string, and then use the same parser at runtime to parse/compile its contents.

Combined unparser/parser generator

Is there a parser generator that also implements the inverse direction, i.e. unparsing domain objects (a.k.a. pretty-printing) from the same grammar specification? As far as I know, ANTLR does not support this.
I have implemented a set of Invertible Parser Combinators in Java and Kotlin. A parser is written pretty much in LL-1 style and it provides a parse- and a print-method where the latter provides the pretty printer.
You can find the project here: https://github.com/searles/parsing
Here is a tutorial: https://github.com/searles/parsing/blob/master/tutorial.md
And here is a parser/pretty printer for mathematical expressions: https://github.com/searles/parsing/blob/master/src/main/java/at/searles/demo/DemoInvert.kt
Take a look at Invertible syntax descriptions: Unifying parsing and pretty printing.
There are several parser generators that include an implementation of an unparser. One of them is the nearley parser generator for context-free grammars.
It is also possible to implement bidirectional transformations of source code using definite clause grammars. In SWI-Prolog, the phrase/2 predicate can convert an input text into a parse tree and vice-versa.
Our DMS Software Reengineering Toolkit does precisely this (and provides a lot of additional support for analyzing/transforming code). It does this by decorating a language grammar with additional attributes, producing what is called an attribute grammar. We use a special DSL to write these rules to make them convenient to write.
It helps to know that DMS produces a tree based directly on the grammar.
Each DMS grammar rule is paired with with so-called "prettyprinting" rule. Each prettyprinting rule describes how to "prettyprint" the syntactic element and sub-elements recognized by its corresponding grammar rule. The prettyprinting process essentially manufactures or combines rectangular boxes of text horizontally or vertically (with optional indentation), with leaves producing unit-height boxes containing the literal value of the leaf (keyword, operator, identifier, constant, etc.
As an example, one might write the following DMS grammar rule and matching prettyprinting rule:
statement = 'for' '(' assignment ';' assignment ';' conditional_expression ')'
'{' sequence_of_statements '}' ;
<<PrettyPrinter>>:
{ V(H('for','(',assignment[1],';','assignment[2],';',conditional_expression,')'),
H('{', I(sequence_of_statements)),
'}');
This will parse the following:
for ( i=x*2;
i--; i>-2*x ) { a[x]+=3;
b[x]=a[x]-1; }
(using additional grammar rules for statements and expressions) and prettyprint it (using additional prettyprinting rules for those additional grammar rules) as follows:
for (i=x*2;i--;i>-2*x)
{ a[x]+=3;
b[x]=a[x]-1;
}
DMS also captures comments, attaches them to AST nodes, and regenerates them on output. The implementation is a bit exotic because most parsers don't handle comments, but utilization is easy, even "free"; comments will be automatically inserted in the prettyprinted result in their original places.
DMS can also print in "fidelity" mode. In this form, it tries to preserve the shape of the toke (e.g., number radix, identifier character capitalization, which keyword spelling was used) the column offset (into the line) of a parsed token. This would cause the original text (or something so close that you don't think it is different) to get regenerated.
More details about what prettyprinters must do are provided in my SO answer on Compiling an AST back to source code. DMS addresses all of those topics cleanly.
This capability has been used by DMS on some 40+ real languages, including full IBM COBOL, PL/SQL, Java 1.8, C# 5.0, C (many dialects) and C++14.
By writing a sufficiently interesting set of prettyprinter rules, you can build things like JavaDoc extended to include hyperlinked source code.
It is not possible in general.
What makes a print pretty? A print is pretty, if spaces, tabs or newlines are at those positions, which make the print looking nicely.
But most grammars ignore white spaces, because in most languages white spaces are not significant. There are exceptions like Python but in general the question, whether it is a good idea to use white spaces as syntax, is still controversial. And therefor most grammars do not use white spaces as syntax.
And if the abstract syntax tree does not contain white spaces, because the parser has thrown them away, no generator can use them to pretty print an AST.

Resources