Better way to test (automatically) a parser? - parsing

I’m recently writing a small programming language, and have finished writing its parser. I want to write an automated test for the parser (that its result is an abstract syntax tree), but I’m not sure which way is better.
First what I tried is just to serialize AST to S-expression text and compare it to the expected output text I wrote by hand, but it has some problems:
There are trivial meaningless differences between a serialized text and the expected output like whitespaces. For example, there is no difference between:
(attribute (symbol str) (symbol length))
(that is serialized) and:
(attribute (symbol str)
(symbol length))
(that is handwritten by me) in their meanings, but string comparison distincts them of course. Okay, I could resolve it by normalization.
When a test fails, it doesn’t show the difference between actual tree and expected tree concisely. I want to show only a difference node, not whole tree.
Second what I tried is to write S-expression parser and compare AST that parser (to be tested) generates to AST that S-expression parser (that I just implemented) generates from the handwritten expected output. However I realized that S-expression have to be tested also and it could be really nonsense.
I wonder what is the typical and easy way to test the parser.
PS. I am using Java, and dont’t want any dependencies to third-party libraries.

Providing you are looking for a completely automated and extensible unit testing framework for your parser I'd recommend the following approach:
Incorrect input
Create a set of samples of incorrect inputs. Then feed the parse with each of them making sure the parser rejects them. I's a good idea to provide metadata for each test case that defines the expected output — the specific error code / message the parser is supposed to produce.
Correct input
As in the previous case, create a set of samples representing various correct inputs. Besides the simple validation that the parser accepts all inputs, there's still the problem of validating that the actual Abstract Syntax Tree makes sense.
To address this problem I'd do the following: Describing the expected AST for each test case in some well-known format that can be safely parsed — deserialized into the actual in-memory AST structures — by a 3rd party parser considered bug-free (for your case). The natural choice is XML since most languages / programming frameworks cover XML support and provide the respective (de)serialization facilities. The best solution would be to deserialize right into the AST node types. Since convenient visual editing tools for XML exist it's feasible to construct even large test cases.
Then I'd construct an AST comparer using the visitor pattern which pair-up the two ASTs and compare both nodes in each pair for equality. However, equality is a per-AST-node-type specific operation.
Notes:
This approach would work with most unit-testing frameworks like JUnit.
AST to XML serialization is a welcome tool for debugging the compiler.
The visitor pattern implementation can easily serve as the backbone for multiple processing stages within the compiler.
There are compiler test suites freely available that can provide some inspiration to your project — see for example the Ada Conformity Assesment Test Suite for the Ada programming language, although this test suite deals with higher-level testing, not just parser testing.

Here's what. A grammar defines a language. The language is the set of string that the grammar generates, or that a parser for the grammar accepts.
Given that, more than testing if the ASTs seem right, it's important to test that the parser accepts strings intended to be in the language and rejects strings that in your mind shouldn't belong to it.
In that sense, a simple accept or reject (bonus point for input position for the rejection) is enough to build a nice and large set of test cases.
Examples:
()
(a)
((((((((((a))))))))))
((((((((((a)))))))))
(a (a (a (a (a (a (b)))))))
(((((((b) a) a) a) a) a) a)
(((((((b a) a) a) a) a) a)
((a)(a)(a)(a))
((a)(a a)(a))
(())
(()())
((()())(()())(()()))
((()())()()(()()))
...

Related

Tatsu: Rule Ordering

I am playing around with Tatsu to implement a parser for a language used in the semiconductor industry. This language requires that variables be defined before usage. So for example:
SignalGroup { A: In; B: Out};
Pattern {
V {A=1, B=1 }
V {A=1, B=0 }
};
In this case, the SignalGroup block must come before the Pattern block. How do I enforce/implement this "ordering" when writing the grammer in TatSu?
Although for some languages it is possible to write grammars that verify if the same symbol appears on different places, the grammars usually end up being too complicated to be useful.
Compilers (translators) are usually implemented with separate lexical, syntactical, and semantic analyzer components. There are several reasons for that:
Each component is so well focused that it is clearer and easier to write.
Each component is very efficient
The most common errors (which are exactly lexical, syntactical, and semantic) can be reported earlier
With those components in mind, checking if a symbol has ben previously defined belongs to the semantic (meaning) aspect of the program, and the way to check is to keep a symbol table that is filled when the definition parts of the input are being parsed, and queried on the use parts of the input are being parsed.
In TatSu in particular the different components are well separated, yet run in parallel. For your requirement you just need to use the simplest grammar that allows for the semantic actions that store and query the symbols. By raising FailedSemantics from within semantic actions, any semantic errors will be reported exactly as the lexical and syntactical ones so the user doesn't have to think about which component flagged each error.
If you use the Python parser generation in TatSu, the translator will generate the skeleton of a semantic actions class as part of the output.

How is Coq's parser implemented?

I was entirely amazed by how Coq's parser is implemented. e.g.
https://softwarefoundations.cis.upenn.edu/lf-current/Imp.html#lab347
It's so crazy that the parser seems ok to take any lexeme by giving notation command and subsequent parser is able to parse any expression as it is. So what it means is the grammar must be context sensitive. But this is so flexible that it absolutely goes beyond my comprehension.
Any pointers on how this kind of parser is theoretically feasible? How should it work? Any materials or knowledge would work. I just try to learn about this type of parser in general. Thanks.
Please do not ask me to read Coq's source myself. I want to check the idea in general but not a specific implementation.
Indeed, this notation system is very powerful and it was probably one of the reasons of Coq's success. In practice, this is a source of much complication in the source code. I think that #ejgallego should be able to tell you more about it but here is a quick explanation:
At the beginning, Coq's documents were evaluated sentence by sentence (sentences are separated by dots) by coqtop. Some commands can define notations and these modify the parsing rules when they are evaluated. Thus, later sentences are evaluated with a slightly different parser.
Since version 8.5, there is also a mechanism (the STM) to evaluate a document fully (many sentences in parallel) but there is some special mechanism for handling these notation commands (basically you have to wait for these to be evaluated before you can continue parsing and evaluating the rest of the document).
Thus, contrary to a normal programming language, where the compiler will take a document, pass it through the lexer, then the parser (parse the full document in one go), and then have an AST to give to the typer or other later stages, in Coq each command is parsed and evaluated separately. Thus, there is no need to resort to complex contextual grammars...
I'll drop my two cents to complement #Zimmi48's excellent answer.
Coq indeed features an extensible parser, which TTBOMK is mainly the work of Hugo Herbelin, built on the CAMLP4/CAMLP5 extensible parsing system by Daniel de Rauglaudre. Both are the canonical sources for information about the parser, I'll try to summarize what I know but note indeed that my experience with the system is short.
The CAMLPX system basically supports any LL1 grammar. Coq exposes to the user the whole set of grammar rules, allowing the user to redefine them. This is the base mechanism on which extensible grammars are built. Notations are compiled into parsing rules in the Metasyntax module, and unfolded in a latter post-processing phase. And that really is AFAICT.
The system itself hasn't changed much in the whole 8.x series, #Zimmi48's comments are more related to the internal processing of commands after parsing. I recently learned that Coq v7 had an even more powerful system for modifying the parser.
In words of Hugo Herbelin "the art of extensible parsing is a delicate one" and indeed it is, but Coq's achieved a pretty great implementation of it.

Can this be parsed by a LALR(1) parser?

I am writing a parser in Bison for a language which has the following constructs, among others:
self-dispatch: [identifier arguments]
dispatch: [expression . identifier arguments]
string slicing: expression[expression,expression] - similar to Python.
arguments is a comma-separated list of expressions, which can be empty too. All of the above are expressions on their own, too.
My problem is that I am not sure how to parse both [method [other_method]] and [someString[idx1, idx2].toInt] or if it is possible to do this at all with an LALR(1) parser.
To be more precise, let's take the following example: [a[b]] (call method a with the result of method b). When it reaches the state [a . [b]] (the lookahead is the second [), it won't know whether to reduce a (which has already been reduced to identifier) to expression because something like a[b,c] might follow (which could itself be reduced to expression and continue with the second construct from above) or to keep it identifier (and shift it) because a list of arguments will follow (such as [b] in this case).
Is this shift/reduce conflict due to the way I expressed this grammar or is it not possible to parse all of these constructs with an LALR(1) parser?
And, a more general question, how can one prove that a language is/is not parsable by a particular type of parser?
Assuming your grammar is unambiguous (which the part you describe appears to be) then your best bet is to specify a %glr-parser. Since in most cases, the correct parse will be forced after only a few tokens, the overhead should not be noticeable, and the advantage is that you do not need to complicate either the grammar or the construction of the AST.
The one downside is that bison cannot verify that the grammar is unambiguous -- in general, this is not possible -- and it is not easy to prove. If it turns out that some input is ambiguous, the GLR parser will generate an error, so a good test suite is important.
Proving that the language is not LR(1) would be tricky, and I suspect that it would be impossible because the language probably is recognizable with an LALR(1) parser. (Impossible to tell without seeing the entire grammar, though.) But parsing (outside of CS theory) needs to create a correct parse tree in order to be useful, and the sort of modifications required to produce an LR grammar will also modify the AST, requiring a post-parse fixup. The difficultly in creating a correct AST spring from the difference in precedence between
a[b[c],d]
and
[a[b[c],d]]
In the first (subset) case, b binds to its argument list [c] and the comma has lower precedence; in the end, b[c] and d are sibling children of the slice. In the second case (method invocation), the comma is part of the argument list and binds more tightly than the method application; b, [c] and d are siblings in a method application. But you cannot decide the shape of the parse tree until an arbitrarily long input (since d could be any expression).
That's all a bit hand-wavey since "precedence" is not formally definable, and there are hacks which could make it possible to adjust the tree. Since the LR property is not really composable, it is really possible to provide a more rigorous analysis. But regardless, the GLR parser is likely to be the simplest and most robust solution.
One small point for future reference: CFGs are not just a programming tool; they also serve the purpose of clearly communicating the grammar in question. Nirmally, if you want to describe your language, you are better off using a clear CFG than trying to describe informally. Of course, meaningful non-terminal names will help, and a few examples never hurt, but the essence of the grammar is in the formal description and omitting that makes it harder for others to "be helpful".

Using ANTLR to analyze and modify source code; am I doing it wrong?

I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
CallExpression
Identifier('foo')
ArgumentList
ArrayLiteral
StringLiteral('a')
StringLiteral('b')
StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming is parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do that this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling that "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems include typically source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
For instance, if you had source-to-source transformation capabilities (of our tool, the DMS Software Reengineering Toolkit, you'd be able to write parts of your example code changes using these DMS transforms:
domain ECMAScript.
tag replace; -- says this is a special kind of temporary tree
rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
expression->expression
= " \function_name ( '[' \list ']' ) "
-> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";
rule replace_unit_list(s:character_literal):
expression_list -> expression_list
replace(s) -> compute_index_for(s);
rule replace_long_list(s:character_list, list:expression_list):
expression_list -> expression_list
"\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";
with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo" [I'm guessing you want to do this), and "compute_index_for" which given a string literals, knows what integer to replace it with.
Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and an right hand side acting as replacement, both usually quoted in metaquotes " which seperates rewrite-rule language text from target-language (e.g. JavaScript) text. There's lots of meta-escapes ** found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note the its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represent future intent to do more transformations.
This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;
var dependencies = new List<IModule>();
var functionCall = (
from callExpression in program.Children.OfType<CallExpression>()
where callExpression.Children[0].Text == "foo"
select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;
tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
tokenStream.Replace(
(array.Children[i] as StringLiteral).Token.TokenIndex,
i.ToString());
}
var rewrittenCode = tokenStream.ToString();
Have you looked at the string template library. It is by the same person who wrote ANTLR and they are intended to work together. It sounds like it would suit do what your looking for ie. output matched grammar rules as formatted text.
Here is an article on translation via ANTLR

Looking for a clear definition of what a "tokenizer", "parser" and "lexers" are and how they are related to each other and used?

I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program will go through c/h source files to extract data declaration and definitions.
I have been looking for examples and can find some info, but I really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax tree and how they interrelate to each other. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, 2) are there common implementations.
I have been looking at Wikipedia on these topics and programs like Lex and Yacc, but having never gone through a compiler class (EE major) I am finding it difficult to fully understand what is going on.
A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.
Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".
Example:
int x = 1;
A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.
A parser will take those tokens and use them to understand in some way:
we have a statement
it's a definition of an integer
the integer is called 'x'
'x' should be initialised with the value 1
I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.
(adding to the given answers)
Tokenizer will also remove any comments, and only return tokens to the Lexer.
Lexer will also define scopes for those tokens (variables/functions)
Parser then will build the code/program structure
Using
"Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, AKA the Purple Dragon Book
a related answer of mine What is the difference between a token and a lexeme?
As with my other answer such questions as this make more sense when a specific goal is desired.
In your case the specific goal is
Create a program will go through c/h source files to extract data declaration and definitions.
If the goal is to create Abstract Syntax Trees (AST) then those are created using a Parser and a Parser is commonly feed a list of Tokens from the Lexer. Notice that Tokenizer is deliberately not mentioned.
Another way to think of the relation between a Lexer and Parser is that a Lexer creates a linear structure (list/stream of tokens) and a Parser converts the tokens into an tree structure (Abstract Syntax Tree).
If you read the Dragon book you will notice that the word Analysis appears often which is to say that analysis is one of the key functions at the various stages. This is because when working with Lexers and Parsers they are designed to work with formal languages and a determination needs to be made if the input adheres to the formal language.
From page 5
character stream
|
V
Lexical Analyzer
(token stream)
|
V
Syntax Analyzer
(syntax tree)
|
V
Semantic Analyzer
(syntax tree)
|
V
...
In the above diagram the Lexer is associated with Lexical Analyzer and I would associate Syntax Analyzer and Semantic Analyzer with Parser but YMMV.
AFAIK Tokenizer has no official definition in the Dragon book, not even noted in the index. I don't have an electronic copy of the book so could not do an automated search.
One common reference that notes Tokenizer is Anatomy of a Compiler but the Dragon books are the reference of choice by many in the field.
However if your only goal is to create a list of tokens and then do something else other than semantic analysis then calling the module/function/... a tokenizer might be the right name.
I use Lexer with Parser and don't use Tokenizer with Parser.
Another thought to keep in mind is that if no useful information should be lost in the transformations. In other words if one of your goals is to be able to recreate the input from the AST then the AST needs to capture the extraneous information like whitespace, which then means the Lexer also needs to capture the extraneous information. One reason to go through such effort is to create useful error messages or for Edit code and continue Debugging.

Resources