Tatsu: Rule Ordering - tatsu

I am playing around with Tatsu to implement a parser for a language used in the semiconductor industry. This language requires that variables be defined before usage. So for example:
SignalGroup { A: In; B: Out};
Pattern {
V {A=1, B=1 }
V {A=1, B=0 }
};
In this case, the SignalGroup block must come before the Pattern block. How do I enforce/implement this "ordering" when writing the grammar in TatSu?

Although for some languages it is possible to write grammars that verify if the same symbol appears on different places, the grammars usually end up being too complicated to be useful.
Compilers (translators) are usually implemented with separate lexical, syntactical, and semantic analyzer components. There are several reasons for that:
Each component is so well focused that it is clearer and easier to write.
Each component is very efficient
The most common errors (which are exactly lexical, syntactical, and semantic) can be reported earlier
With those components in mind, checking if a symbol has been previously defined belongs to the semantic (meaning) aspect of the program, and the way to check is to keep a symbol table that is filled when the definition parts of the input are being parsed, and queried when the use parts of the input are being parsed.
In TatSu in particular the different components are well separated, yet run in parallel. For your requirement you just need to use the simplest grammar that allows for the semantic actions that store and query the symbols. By raising FailedSemantics from within semantic actions, any semantic errors will be reported exactly as the lexical and syntactical ones so the user doesn't have to think about which component flagged each error.
If you use the Python parser generation in TatSu, the translator will generate the skeleton of a semantic actions class as part of the output.
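As a rough sketch of the symbol-table idea, the class below stores signal names on definition and checks them on use. The rule names signal_decl and signal_ref and the AST shapes they receive are assumptions made for illustration; they are not taken from the question's grammar. The import fallback is only there so the sketch runs without TatSu installed.

```python
# Sketch only: `signal_decl` and `signal_ref` are hypothetical rule names.
# TatSu calls the semantics method named after each grammar rule.
try:
    from tatsu.exceptions import FailedSemantics
except ImportError:  # stand-in so the sketch runs without TatSu installed
    class FailedSemantics(Exception):
        pass


class SignalSemantics:
    """Fills a symbol table on definitions and queries it on uses."""

    def __init__(self):
        self.signals = {}  # name -> direction, e.g. {"A": "In"}

    def signal_decl(self, ast):
        # Assumes the rule yields a (name, direction) pair.
        name, direction = ast
        if name in self.signals:
            raise FailedSemantics(f"signal {name!r} already defined")
        self.signals[name] = direction
        return ast

    def signal_ref(self, ast):
        # Assumes the rule yields the bare signal name.
        if ast not in self.signals:
            raise FailedSemantics(f"signal {ast!r} used before definition")
        return ast


# Driving the actions by hand, in the order a parser would call them.
semantics = SignalSemantics()
semantics.signal_decl(("A", "In"))
semantics.signal_decl(("B", "Out"))
semantics.signal_ref("A")  # fine: A was declared first
```

With a real grammar, an instance of such a class would be passed through the semantics argument of the parse call, so a Pattern block that appears before its SignalGroup fails on the first unknown signal.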

Related

Is Pug context free?

I was thinking of making a Pug parser. Besides the indentation, which is well known to be context-sensitive (though that can be trivially hacked with a lexer feedback loop to make it almost context-free, as Python does), what else makes it not context-free?
XML tags are definitely not context-free, since each start tag needs to match an end tag, but Pug has no such restriction, which makes me wonder whether we could just parse each starting identifier as a production for a tag root.
The main thing that Pug seems to be missing, at least from a casual scan of its website, is a formal description of its syntax. Or even an informal description. Perhaps I wasn't looking in the right places.
Still, based on the examples, it doesn't look awful. There will be some challenges; in particular, it does not have a uniform tokenisation context, so the scanner is going to be complicated, not just because of the indentation issue. (I got the impression from the section on whitespace that the indentation rule is much stricter than Python's, but I didn't find a specification of what it is exactly. It appeared to me that leading whitespace after the two-character indent is significant whitespace. But that doesn't complicate things much; it might even simplify the task.)
What will prove interesting is handling embedded JavaScript. You will at least need to tokenise the embedded JS, and the corner cases in the JS spec make it non-trivial to tokenise without parsing. Anyway, just tokenising isn't sufficient to know where the embedded code terminates. (For the lexical challenge, consider the correct identification of regular expression literals. /= might be the start of a regex or it might be a divide-and-assign operator; how a subsequent { is tokenised will depend on that decision.) Template strings present another challenge (recursive embedding). However, JavaScript parsers do exist, so you might be able to leverage one.
In other words, recognising tag nesting is not going to be the most challenging part of your project. Once you've identified that a given token is a tag, the nesting part is trivial (and context-free) because it is precisely defined by the indentation, so a DEDENT token will terminate the tag.
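A minimal sketch of how a scanner can turn indentation into INDENT and DEDENT tokens, assuming space-only indentation and ignoring blank lines. It does not implement Pug's actual whitespace rules, which the answer notes are not clearly specified.

```python
def indent_tokens(lines):
    """Yield ("INDENT",), ("DEDENT",) and ("LINE", text) tuples.

    Assumes indentation uses spaces only; blank lines are skipped.
    """
    stack = [0]  # indentation widths of the currently open blocks
    for raw in lines:
        if not raw.strip():
            continue
        width = len(raw) - len(raw.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT",)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT",)
        if width != stack[-1]:
            raise ValueError(f"inconsistent dedent: {raw!r}")
        yield ("LINE", raw.strip())
    while len(stack) > 1:  # close blocks still open at end of input
        stack.pop()
        yield ("DEDENT",)


tokens = list(indent_tokens(["ul", "  li one", "  li two", "p done"]))
```

Once the token stream carries explicit INDENT and DEDENT markers, tag nesting can be described by an ordinary context-free rule.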
However, it is worth noting that tag parsing is not particularly challenging for XML (or XML-like HTML variants). If you adopt the XML rule that close tags cannot be omitted (except for self-closing tags), then the tagname in a close tag does not influence the parse of a correct input. (If the tagname in the close tag does not match the tagname in the corresponding open tag, then the input is invalid. But the correspondence between open and close tags doesn't change.) Even if you adopt the HTML-5 rule that close tags cannot be omitted except in the case of a finite list of special-case tagnames, then you could theoretically do the parse with a CFG. (However, the various error recovery rules in HTML-5 are far from context free, so that would only work for input which did not require rematching of close tags.)
Ira Baxter makes precisely this point in the cross-linked post he references in a comment: you can often implement context-sensitive aspects of a language by ignoring them during the parse and detecting them in a subsequent analysis, or even in a semantic predicate during the parse. Correct matching of open- and close tagnames would fall into this category, as would the "declare-before-use" rule in languages where the declaration of an identifier does not influence the parse. (Not true of C or C++, but true in many other languages.)
Even if these aspects cannot be ignored -- as with C typedefs, for example -- the simplest solution might be to use an ambiguous CFG and a parsing technology which produces all possible parses. After the parse forest is generated, you could walk the alternatives and reject the ones which are inconsistent. (In the case of C, that would include an alternative parse in which a name was typedef'd and then used in a context where a typename is not valid.)

Compiler Design : Is "variable not declared" a syntactic error or semantic error?

Is such type of an error produced during type checking or when input is being parsed?
Under what type should the error be addressed?
The way I see it, it is a semantic error, because your language parses just fine even though you are using an identifier which you haven't previously bound--i.e. syntactic analysis only checks the program for well-formedness. Semantic analysis actually checks that your program has a valid meaning--e.g. bindings, scoping or typing. As @pst said, you can do scope checking during parsing, but this is an implementation detail. AFAIK old compilers used to do this to save some time and space, but I think today such an approach is questionable if you don't have some hard performance/memory constraints.
The program conforms to the language grammar, so it is syntactically correct. A language grammar doesn't contain any statements like 'the identifier must be declared', and indeed doesn't have any way of doing so. An attempt to build a two-level grammar along these lines failed spectacularly in the Algol-68 project, and it has not been attempted since to my knowledge.
The meaning, if any, of each is a semantic issue. Frank deRemer called issues like this 'static semantics'.
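To make the "static semantics" view concrete, here is a small sketch of a separate pass that walks an already-built tree and reports uses without a visible declaration. The tuple-based node shapes ("block", "declare", "use") are invented for this example.

```python
def undeclared_names(node, scopes=None):
    """Return names used without a visible declaration, in source order.

    Hypothetical node shapes:
      ("block", [statements])   opens a new scope
      ("declare", name)
      ("use", name)
    """
    scopes = [set()] if scopes is None else scopes
    kind = node[0]
    if kind == "block":
        scopes.append(set())
        errors = [err for stmt in node[1] for err in undeclared_names(stmt, scopes)]
        scopes.pop()
        return errors
    if kind == "declare":
        scopes[-1].add(node[1])
        return []
    if kind == "use":
        declared = any(node[1] in scope for scope in scopes)
        return [] if declared else [node[1]]
    raise ValueError(f"unknown node kind: {kind!r}")


program = ("block", [
    ("declare", "x"),
    ("use", "x"),
    ("block", [("use", "x"), ("use", "y")]),
    ("use", "y"),
])
errors = undeclared_names(program)
```

The parser that produced the tree never needed to know whether x or y was declared; the check is entirely a property of this later pass.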
In my opinion, this is not strictly a syntax error - nor a semantic one. If I were to implement this for a statically typed, compiled language (like C or C++), then I would not put the check into the parser (because the parser is practically incapable of checking for this mistake), rather into the code generator (the part of the compiler that walks the abstract syntax tree and turns it into assembly code). So in my opinion, it lies between syntax and semantic errors: it's a syntax-related error that can only be checked by performing semantic analysis on the code.
If we consider a primitive scripting language however, where the AST is directly executed (without compilation to bytecode and without JIT), then it's the evaluator/executor function itself that walks the AST and finds the undeclared variable - in this case, it will be a runtime error. The difference lies between the "AST_walk()" routine being in different parts of the program lifecycle (compilation time and runtime), should the language be a scripting or a compiled one.
In the case of languages -- and there are many -- which require identifiers to be declared, a program with undeclared identifiers is ill-formed and thus a missing declaration is clearly a syntax error.
The usual way to deal with this is to incorporate information about symbols in a symbol table, so that the parse can use this information.
Here are a few examples of how identifier type affects parsing:
C / C++
A classic case:
(a)-b;
Depending on a, that's either a cast or a subtraction:
#include <stdio.h>
#if TYPEDEF
typedef double a;
#else
double a = 3.0;
#endif
int main() {
int b = 3;
printf("%g\n", (a)-b);
return 0;
}
Consequently, if a hadn't been declared at all, the compiler must reject the program as syntactically ill-formed (and that is precisely the word the standard uses.)
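The dependence on declarations can be sketched as follows (in Python, for brevity). The function and the tree shapes it returns are hypothetical; the point is only that the same token sequence yields different trees depending on what the symbol table says about a.

```python
def parse_paren_minus(a, b, typedefs, variables):
    """Parse the C expression shape `(a)-b` using symbol-table knowledge.

    `typedefs` and `variables` are the sets of names declared so far.
    """
    if a in typedefs:
        return ("cast", a, ("negate", b))  # (type) -b
    if a in variables:
        return ("subtract", a, b)          # (value) - b
    raise SyntaxError(f"{a!r} is not declared; cannot parse ({a})-{b}")


as_cast = parse_paren_minus("a", "b", typedefs={"a"}, variables={"b"})
as_subtraction = parse_paren_minus("a", "b", typedefs=set(), variables={"a", "b"})
```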
XML
This one is simple:
<block>Hello, world</blob>
That's ill-formed XML, but it cannot be detected with a CFG. (Nonetheless, all XML parsers will correctly reject it as ill-formed.) In the case of HTML/SGML, where end-tags may be omitted under some well-defined circumstances, parsing is trickier but nonetheless deterministic; again, the precise declaration of a tag will determine the parse of a valid input, and it's easy to come up with inputs which parse differently depending on declaration.
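A sketch of how a parser can reject the mismatched example by keeping a stack of open tag names. The regular expression is a deliberate simplification (no attributes, comments, or CDATA) and is not how a real XML parser tokenises its input.

```python
import re

# Simplified tag pattern: <name>, </name> or <name/> with no attributes.
TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)\s*(/?)>")


def check_tags(text):
    """Return None if tags nest properly, otherwise an error message."""
    stack = []  # names of the tags that are currently open
    for match in TAG.finditer(text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue
        if not closing:
            stack.append(name)
        elif not stack:
            return f"unexpected </{name}>"
        elif stack[-1] != name:
            return f"expected </{stack[-1]}> but found </{name}>"
        else:
            stack.pop()
    return f"unclosed <{stack[-1]}>" if stack else None


problem = check_tags("<block>Hello, world</blob>")
```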
English
OK, not a programming language. I have lots of other programming language examples, but I thought this one might trigger some other intuitions.
Consider the two grammatically correct sentences:
The sheep is in the meadow.
The sheep are in the meadow.
Now, what about:
The cow is in the meadow.
(*) The cow are in the meadow.
The second sentence is intelligible, albeit ambiguous (is the noun or the verb wrong?) but it is certainly not grammatically correct. But in order to know that (and other similar examples), we have to know that sheep has an unmarked plural. Indeed, many animals have unmarked plurals, so I recognize all the following as grammatical:
The caribou are in the meadow.
The antelope are in the meadow.
The buffalo are in the meadow.
But definitely not:
(*) The mouse are in the meadow.
(*) The bird are in the meadow.
etc.
It seems that there is a common misconception that because the syntactic analyzer uses a context free grammar parser, that syntactic analysis is restricted to parsing a context free grammar. This is simply not true.
In the case of C (and family), the syntax analyzer uses a symbol table to help it parse. In the case of XML, it uses the tag stack, and in the case of generalized SGML (including HTML) it also uses tag declarations. Consequently, the syntax analyzer considered as a whole is more powerful than the CFG, which is just a part of the analysis.
The fact that a given program passes the syntax analysis does not mean that it is semantically correct. For example, the syntax analyser needs to know whether a is a type or not in order to correctly parse (a)-b, but it does not need to know whether the cast is in fact possible, in the case that a is a type, or that a and b can meaningfully be subtracted, in the case that a is a variable. These verifications can happen during type analysis after the parse tree is built, but they are still compile-time errors.

Using ANTLR to analyze and modify source code; am I doing it wrong?

I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
CallExpression
Identifier('foo')
ArgumentList
ArrayLiteral
StringLiteral('a')
StringLiteral('b')
StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming a parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This is harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling than "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems typically include source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
For instance, if you had source-to-source transformation capabilities (of our tool, the DMS Software Reengineering Toolkit), you'd be able to write parts of your example code changes using these DMS transforms:
domain ECMAScript.
tag replace; -- says this is a special kind of temporary tree
rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
expression->expression
= " \function_name ( '[' \list ']' ) "
-> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";
rule replace_unit_list(s:character_literal):
expression_list -> expression_list
replace(s) -> compute_index_for(s);
rule replace_long_list(s:character_list, list:expression_list):
expression_list -> expression_list
"\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";
with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo" [I'm guessing you want to do this]), and "compute_index_for", which, given a string literal, knows what integer to replace it with.
Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and a right-hand side acting as replacement, both usually quoted in metaquotes " which separate rewrite-rule language text from target-language (e.g. JavaScript) text. There are lots of meta-escapes \ found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note that its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represents future intent to do more transformations.
This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;
var dependencies = new List<IModule>();
var functionCall = (
from callExpression in program.Children.OfType<CallExpression>()
where callExpression.Children[0].Text == "foo"
select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;
tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
tokenStream.Replace(
(array.Children[i] as StringLiteral).Token.TokenIndex,
i.ToString());
}
var rewrittenCode = tokenStream.ToString();
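The idea behind this approach, recording edits against token indexes and only applying them when the text is rendered, can be sketched independently of ANTLR. The class below is a simplified stand-in written for illustration (in Python); it is not ANTLR's API, and the hand-written token list replaces what a lexer would produce.

```python
class TokenRewriter:
    """Records edits by token index and applies them only when rendering."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.replacements = {}  # token index -> replacement text
        self.insertions = {}    # token index -> text inserted after it

    def replace(self, index, text):
        self.replacements[index] = text

    def insert_after(self, index, text):
        self.insertions[index] = self.insertions.get(index, "") + text

    def render(self):
        parts = []
        for i, token in enumerate(self.tokens):
            parts.append(self.replacements.get(i, token))
            parts.append(self.insertions.get(i, ""))
        return "".join(parts)


# Tokens of foo(['a', 'b', 'c']); with whitespace kept as tokens.
tokens = ["foo", "(", "[", "'a'", ",", " ", "'b'", ",", " ", "'c'", "]", ")", ";"]
rewriter = TokenRewriter(tokens)
rewriter.insert_after(1, "'bar', ")            # right after "("
for new_value, index in enumerate((3, 6, 9)):  # the three string literals
    rewriter.replace(index, str(new_value))
rewritten = rewriter.render()
```

Because untouched tokens are emitted verbatim, the original spacing and comments survive, which is why no prettyprinter is needed for edits this local.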
Have you looked at the string template library? It is by the same person who wrote ANTLR, and the two are intended to work together. It sounds like it would do what you're looking for, i.e. output matched grammar rules as formatted text.
Here is an article on translation via ANTLR

Alpha renaming in many languages

I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.
For example, here is some python code:
def foo(x):
def bar(y):
return x+y
return bar
An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:
def foo(y):
def bar(y1):
return y+y1
return bar
See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.
I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.
Any tips?
To do this safely, you need to be able to determine
all the identifiers (and those things that are not, e.g., the middle of a comment) in your code
the scopes of validity for each identifier
the ability to substitute a new identifier for an old one in the text
the ability to determine if renaming an identifier causes another name to be shadowed
To determine identifiers accurately, you need at least a language-accurate lexer. Identifiers in PHP look different than they do in COBOL.
To determine scopes of validity, you have to be able to determine program structure in practice, since most "scopes" are defined by such structure. This means you need a language-accurate parser; scopes in PHP are different than scopes in COBOL.
To determine which names are valid in which scopes, you need to know the language scoping rules. Your language may insist that the identifier X will refer to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, default types all will pretty much require you to build a model of the scopes for the programs, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigate all of these. These structures are different for PHP and COBOL, but they share lots of common ideas so you likely need a library with the common concept support.
To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully. Modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifying the tree you have to regenerate the source text. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preserving all of the stuff you reasonably suggest should be preserved.
(Your other choice is to keep track in the AST of where the text for the string is, and then read/patch/write the file.)
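A minimal sketch of the read/patch/write alternative, assuming the tree has recorded character offsets for each identifier occurrence. The offsets in the example were worked out by hand for this one string.

```python
def apply_patches(source, patches):
    """Apply (start, end, replacement) patches given as character offsets.

    Patches must not overlap. They are applied from right to left so that
    the offsets of patches earlier in the text stay valid.
    """
    for start, end, replacement in sorted(patches, reverse=True):
        source = source[:start] + replacement + source[end:]
    return source


code = "y = 2; z = y"
renamed = apply_patches(code, [(0, 1, "w"), (11, 12, "w")])
```

Everything outside the patched spans, including comments and spacing, is left exactly as it was.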
Before you update the file, you need to check that you haven't shadowed something. Consider this code:
{ local x;
x=1;
{local y;
y=2;
{local z;
z=y
print(x);
}
}
}
We agree this code prints "1". Now we decide to rename y to x. We've broken the scoping, and now the print statement, which referred conceptually to the outer x, refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means that one must check all the other identifiers in scopes in which the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (This would be legal if the print statement printed z).
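The capture check for this example can be sketched with a toy scope tree. The tuple shapes are invented for illustration, and the check covers only the direction described above (a use of the new name that currently resolves to an enclosing declaration); a real tool would also need the opposite direction and the language's actual scoping rules.

```python
def uses_resolving_outward(node, name):
    """Count uses of `name` under `node` not rebound by an inner scope.

    Hypothetical node shapes: ("use", name) and ("scope", local, children).
    """
    if node[0] == "use":
        return 1 if node[1] == name else 0
    _, local, children = node
    if local == name:
        return 0  # an inner declaration shadows the name from here down
    return sum(uses_resolving_outward(child, name) for child in children)


def rename_captures(scope, new_name):
    """True if renaming this scope's local to `new_name` would capture a
    use of `new_name` that currently refers to an enclosing declaration."""
    _, _, children = scope
    return any(uses_resolving_outward(child, new_name) for child in children)


# The nested blocks from the example: z = y; print(x) sits in the z scope.
z_scope = ("scope", "z", [("use", "y"), ("use", "x")])
y_scope = ("scope", "y", [("use", "y"), z_scope])
x_scope = ("scope", "x", [("use", "x"), y_scope])

unsafe = rename_captures(y_scope, "x")  # print(x) would be captured
safe = rename_captures(y_scope, "w")    # no use of w anywhere
```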
This is a lot of machinery.
Yes, there is a framework that has almost all of this as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinters to produce text back from ASTs, generic symbol table management machinery (including support for multiple inheritance), and AST visiting/modification machinery. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g. instantiating symbol table scopes and identifier to symbol table entry mappings); it has front ends for many other languages that don't have scoping implemented yet.
We've just finished an exercise in implementing "rename" for Java. (All the above issues of course appeared). We are about to start one for C++.
You could try to create Xtext based implementations for the involved languages. The Xtext framework provides reliable infrastructure for cross language rename refactoring. However, you'll have to provide a grammar and at least a "good enough" scope resolution for each language.
Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:
cp file file.orig
sed -i 's/\bnewTokenName\b/TEMPTOKEN/g' file
sed -i 's/\boldTokenName\b/newTokenName/g' file
With GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$-_][^a-zA-Z0-9~$-_]) would work on most C, Java, PHP, and Python, but not Perl (you would need to add # and % to the token characters). Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable and function naming rules will be incompatible, and at that point, you'll need to do more and more in the plugin.

Better way to test (automatically) a parser?

I’ve recently been writing a small programming language, and have finished writing its parser. I want to write an automated test for the parser (whose result is an abstract syntax tree), but I’m not sure which way is better.
The first thing I tried was to serialize the AST to S-expression text and compare it to the expected output text I wrote by hand, but it has some problems:
There are trivial meaningless differences between a serialized text and the expected output like whitespaces. For example, there is no difference between:
(attribute (symbol str) (symbol length))
(that is serialized) and:
(attribute (symbol str)
(symbol length))
(that is handwritten by me) in their meanings, but string comparison distinguishes them of course. Okay, I could resolve it by normalization.
When a test fails, it doesn’t show the difference between the actual tree and the expected tree concisely. I want to show only the differing node, not the whole tree.
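For the first problem, the normalization mentioned above could look roughly like this (sketched in Python for brevity, although the question uses Java). It assumes the serialized form contains no string literals with embedded whitespace or parentheses.

```python
import re


def normalize_sexpr(text):
    """Collapse layout so that only the token sequence matters."""
    text = " ".join(text.split())        # any whitespace run -> one space
    text = re.sub(r"\(\s+", "(", text)   # no space after "("
    text = re.sub(r"\s+\)", ")", text)   # no space before ")"
    return text


serialized = "(attribute (symbol str) (symbol length))"
handwritten = "(attribute (symbol str)\n           (symbol length))"
same = normalize_sexpr(serialized) == normalize_sexpr(handwritten)
```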
The second thing I tried was to write an S-expression parser and compare the AST that the parser under test generates to the AST that the S-expression parser (which I just implemented) generates from the handwritten expected output. However, I realized that the S-expression parser would have to be tested as well, which would defeat the purpose.
I wonder what is the typical and easy way to test the parser.
PS. I am using Java, and don’t want any dependencies on third-party libraries.
Providing you are looking for a completely automated and extensible unit testing framework for your parser I'd recommend the following approach:
Incorrect input
Create a set of samples of incorrect inputs. Then feed the parser with each of them, making sure the parser rejects them. It's a good idea to provide metadata for each test case that defines the expected output: the specific error code / message the parser is supposed to produce.
Correct input
As in the previous case, create a set of samples representing various correct inputs. Besides the simple validation that the parser accepts all inputs, there's still the problem of validating that the actual Abstract Syntax Tree makes sense.
To address this problem I'd do the following: describe the expected AST for each test case in some well-known format that can be safely parsed (deserialized into the actual in-memory AST structures) by a 3rd party parser considered bug-free (for your case). The natural choice is XML since most languages / programming frameworks cover XML support and provide the respective (de)serialization facilities. The best solution would be to deserialize right into the AST node types. Since convenient visual editing tools for XML exist it's feasible to construct even large test cases.
Then I'd construct an AST comparer using the visitor pattern which pairs up the nodes of the two ASTs and compares both nodes in each pair for equality. However, equality is a per-AST-node-type specific operation.
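A rough sketch of such a comparer, written in Python for brevity and using plain recursion over nested tuples rather than a full visitor; the same structure carries over to Java node classes. It reports only the path to the first differing node. Per-node-type equality is reduced here to comparing labels and leaf strings, which is an assumption of the sketch.

```python
def first_difference(expected, actual, path="root"):
    """Return a description of the first differing node, or None if equal.

    Trees are non-empty nested tuples (label, child, ...); leaves are strings.
    """
    if isinstance(expected, str) or isinstance(actual, str):
        if expected != actual:
            return f"{path}: expected {expected!r}, got {actual!r}"
        return None
    if expected[0] != actual[0]:
        return f"{path}: expected node {expected[0]!r}, got {actual[0]!r}"
    here = f"{path}/{expected[0]}"
    if len(expected) != len(actual):
        return f"{here}: expected {len(expected) - 1} children, got {len(actual) - 1}"
    for index, (exp, act) in enumerate(zip(expected[1:], actual[1:])):
        found = first_difference(exp, act, f"{here}[{index}]")
        if found:
            return found
    return None


expected = ("attribute", ("symbol", "str"), ("symbol", "length"))
actual = ("attribute", ("symbol", "str"), ("symbol", "size"))
report = first_difference(expected, actual)
```

A failing test can then print just this one line instead of both trees.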
Notes:
This approach would work with most unit-testing frameworks like JUnit.
AST to XML serialization is a welcome tool for debugging the compiler.
The visitor pattern implementation can easily serve as the backbone for multiple processing stages within the compiler.
There are compiler test suites freely available that can provide some inspiration for your project; see for example the Ada Conformity Assessment Test Suite for the Ada programming language, although this test suite deals with higher-level testing, not just parser testing.
Here's the thing. A grammar defines a language. The language is the set of strings that the grammar generates, or that a parser for the grammar accepts.
Given that, more than testing if the ASTs seem right, it's important to test that the parser accepts strings intended to be in the language and rejects strings that in your mind shouldn't belong to it.
In that sense, a simple accept or reject (bonus point for input position for the rejection) is enough to build a nice and large set of test cases.
Examples:
()
(a)
((((((((((a))))))))))
((((((((((a)))))))))
(a (a (a (a (a (a (b)))))))
(((((((b) a) a) a) a) a) a)
(((((((b a) a) a) a) a) a)
((a)(a)(a)(a))
((a)(a a)(a))
(())
(()())
((()())(()())(()()))
((()())()()(()()))
...
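A table-driven accept/reject test along these lines might look like the sketch below. The recognizer is for a made-up toy language (a single parenthesized list of letters and nested lists), so the expected results apply only to that toy definition and not to the examples above, whose intended grammar is not stated.

```python
def check(text):
    """Return None if `text` is one well-formed list, else the error offset."""
    depth = 0
    for pos, ch in enumerate(text):
        if ch == "(":
            if depth == 0 and pos != 0:
                return pos  # a second top-level list
            depth += 1
        elif ch == ")":
            if depth == 0:
                return pos  # nothing left to close
            depth -= 1
        elif ch.isalpha() or ch == " ":
            if depth == 0:
                return pos  # atom or space outside any list
        else:
            return pos      # character not in the toy alphabet
    return None if depth == 0 and text else len(text)


# (input, expected error offset or None for "accept")
CASES = [
    ("()", None),
    ("(a)", None),
    ("((a)(a a)(a))", None),
    ("(a", 2),
    (")", 0),
    ("()()", 2),
]
failures = [(text, want, check(text)) for text, want in CASES if check(text) != want]
```

Each new bug found in the parser becomes one more row in the table.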
