comparing constructs in clang AST parser - clang

While parsing an ASTVisitor in Clang in sample codes, I see there constructs to verify statements eg.
isa<IfStmt>(statement)
isa<UnaryOperator>(Expression)
Is there a comprehensive list of such constructs which are used to evaluate a current expression/statement.
Thanks

First of all, there's nothing magical about isa, that's just LLVM's way of checking whether an object is a subtype of some class; the expression isa<IfStmt>(statement) is basically equivalent to this RTTI-enabled expression:
dynamic_cast<IfStmt*>(statement) != NULL
So your question really boils down to what the AST hierarchy is; and for that, it's best to check these four pages, with complete hierarchy chart:
Type's hierarchy
Decl's hierarchy
DeclContext's hierarchy
Stmt's hierarchy (this includes Expr and its children)

Related

How can I find all uses of a ValueDecl?

I'd like to take clang AST, analyze how a certain variable is used and do some
source-to-source transformation if a specific usage pattern is recognized.
Particularly, I'm looking for patterns like this:
void *h;
h = create_handler(...);
use_handler(h);
destroy_handler(h);
So far, I am able to detect ValueDecl corresponding to void *h. Next step
would be to find all uses of h and see if they are safe and if
create_handler/destroy_handler properly dominate/post-dominate one another.
Unfortunately, I have no idea how to iterate over h's uses, it seems that
there is no such interface in ValueDecl class.
I'd appreciate it if you could you either suggest how I could find all uses of a
variable in AST, or point me to some clang-based tool dealing with a similar problem.
Thank you!
One can match declRefExprs referencing the variable (using AST matchers). After that, ParentMap could be used to traverse AST backward and find recursively AST nodes which use those declRefExprs. Keep in mind that typically ParentMap is constructed not for the whole AST but for a subtree only (passed as a parameter into the constructor).

Handling in-ambiguous yet breaking syntax in expression parsing

Context
I've recently come up with an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component in a compiler I'm building and the question is in regards to the expression parsing necessary in programming language parsing.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using normal regular language parsing rules, I've eliminated left recursion in all my rules but there is one syntactic "ambiguity" which my parser simply can't handle and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now because generic expressions have higher precedence as they are language constructs and not mathematical expressions, it causes a scenario where the generic parser will attempt to parse expressions when it shouldn't.
For example in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that and fail as it can't find the closing tag.
What is the solution to such a scenario? Especially in languages like C++ where generics can also have expressions in them if I'm not mistaken arr<1<2> might be legal syntax.
Is this a special edge case or does it require a modification to the syntax definition that im not aware of?
Thank you
for example in a<2 it will parse "a" as a primary element of the identifier type, immideatly afterwards find the syntax for a generic type, parse that and fail as it cant find the closing tag
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
However that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C# for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparison as it makes no sense to compare the result of a comparison with something else (in many languages using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: Either a is a generic function called with the generic argument b<c or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve that ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So in summary supporting expressions as generic arguments requires you to keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context sensitive) or generate multiple possible ASTs. Without expressions you can keep it context free and unambiguous, but will require backtracking or arbitrary lookahead (meaning it's LL(*)).
Since neither of those are ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().

Syntax analysis and semantic analysis

I am wondering how the syntax analysis and semantic analysis work.
I have finished the lexer and the grammar construction of my interpreter.
Now I am going to implement a recursive descent (top down) parser for this grammar
For example, I have the following grammar:
<declaration> ::= <data_type> <identifier> ASSIGN <value>
so i coded it like this (in java):
public void declaration(){
data_type();
identifier();
if(token.equals("ASSIGN")){
lexer(); //calls next token
value();
} else {
error();
}
}
Assuming I have three data types: Int, String and Boolean. Since the values for each data types are different, (ex. true or false only in Boolean) how can I determine if it fits the data type correctly? What part of my code would determine that?
I am wondering where would I put the code to:
1.) call the semantic analysis part of my program.
2.) store my variables into the symbol table.
Do syntax analysis and semantic analysis happen at the same time?
or do i need to finish the syntax analysis first, then do the semantic analysis?
I am really confused. Please help.
Thank you.
You can do syntax analysis (parsing) and semantic analysis (e.g., checking for agreement between <data_type> and <value>) at the same time. For example, when declaration() calls data_type(), the latter could return something (call it DT) that indicates whether the declared type is Int, String or Boolean. Similarly, value() could return something (VT) that indicates the type of the parsed. Then declaration() would simply compare DT and VT, and raise an error if they don't match. (Alternatively, value() could take a parameter indicating the declared type, and it could do the check.)
However, you may find it easier to completely separate the two phases. To do so, you would typically have the parsing phase build a parse tree (or abstract syntax tree). So your top level would call (e.g.) program() to parse a whole program, which would return a tree representing (the syntax of) the program, and you would pass that tree to a semantic_analysis() routine, which would traverse the tree, extracting pertinent information and enforcing semantic constraints.
The short answer is: it depends on the definition of your programming language. And, since you only specified one derivation rule and three native types, there is no way of knowing. For example, if your programming language allows forward declarations like the c++ code below, then handling the derivation rule for function declaration (foo) is done without knowing the type of the variable serial
class Tree {
public:
int foo(void)
{
return serial;
}
int serial;
};
Indeed, modern compilers separate the syntax analysis phase from the semantic analysis phase. The syntax analysis phase is performed first, making sure the input program agrees with the context free grammar of the language. And, in addition, produces an Abstract Syntax Tree (AST). Note the difference between an AST and a parse tree, as discussed in this SO post. The semantic analysis phase then traverses the AST and checks for type mismatches among other things.
Having said that, toy programming languages can sometimes couple semantic and syntax analysis together. When a recursive descent parser is used,
you should have relevant recursive calls return a type.

Compiler Design : Is "variable not declared" a syntactic error or semantic error?

Is such type of an error produced during type checking or when input is being parsed?
Under what type should the error be addressed?
The way I see it it is a semantic error, because your language parses just fine even though your are using an identifier which you haven't previously bound--i.e. syntactic analysis only checks the program for well-formed-ness. Semantic analysis actually checks that your program has a valid meaning--e.g. bindings, scoping or typing. As #pst said you can do scope checking during parsing, but this is an implementation detail. AFAIK old compilers used to do this to save some time and space, but I think today such an approach is questionable if you don't have some hard performance/memory constraints.
The program conforms to the language grammar, so it is syntactically correct. A language grammar doesn't contain any statements like 'the identifier must be declared', and indeed doesn't have any way of doing so. An attempt to build a two-level grammar along these lines failed spectacularly in the Algol-68 project, and it has not been attempted since to my knowledge.
The meaning, if any, of each is a semantic issue. Frank deRemer called issues like this 'static semantics'.
In my opinion, this is not strictly a syntax error - nor a semantic one. If I were to implement this for a statically typed, compiled language (like C or C++), then I would not put the check into the parser (because the parser is practically incapable of checking for this mistake), rather into the code generator (the part of the compiler that walks the abstract syntax tree and turns it into assembly code). So in my opinion, it lies between syntax and semantic errors: it's a syntax-related error that can only be checked by performing semantic analysis on the code.
If we consider a primitive scripting language however, where the AST is directly executed (without compilation to bytecode and without JIT), then it's the evaluator/executor function itself that walks the AST and finds the undeclared variable - in this case, it will be a runtime error. The difference lies between the "AST_walk()" routine being in different parts of the program lifecycle (compilation time and runtime), should the language be a scripting or a compiled one.
In the case of languages -- and there are many -- which require identifiers to be declared, a program with undeclared identifiers is ill-formed and thus a missing declaration is clearly a syntax error.
The usual way to deal with this is to incorporate information about symbols in a symbol table, so that the parse can use this information.
Here are a few examples of how identifier type affects parsing:
C / C++
A classic case:
(a)-b;
Depending on a, that's either a cast or a subtraction:
#include <stdio.h>
#if TYPEDEF
typedef double a;
#else
double a = 3.0;
#endif
int main() {
int b = 3;
printf("%g\n", (a)-b);
return 0;
}
Consequently, if a hadn't been declared at all, the compiler must reject the program as syntactically ill-formed (and that is precisely the word the standard uses.)
XML
This one is simple:
<block>Hello, world</blob>
That's ill-formed XML, but it cannot be detected with a CFG. (Nonetheless, all XML parsers will correctly reject it as ill-formed.) In the case of HTML/SGML, where end-tags may be omitted under some well-defined circumstances, parsing is trickier but nonetheless deterministic; again, the precise declaration of a tag will determine the parse of a valid input, and it's easy to come up with inputs which parse differently depending on declaration.
English
OK, not a programming language. I have lots of other programming language examples, but I thought this one might trigger some other intuitions.
Consider the two grammatically correct sentences:
The sheep is in the meadow.
The sheep are in the meadow.
Now, what about:
The cow is in the meadow.
(*) The cow are in the meadow.
The second sentence is intelligible, albeit ambiguous (is the noun or the verb wrong?) but it is certainly not grammatically correct. But in order to know that (and other similar examples), we have to know that sheep has an unmarked plural. Indeed, many animals have unmarked plurals, so I recognize all the following as grammatical:
The caribou are in the meadow.
The antelope are in the meadow.
The buffalo are in the meadow.
But definitely not:
(*) The mouse are in the meadow.
(*) The bird are in the meadow.
etc.
It seems that there is a common misconception that because the syntactic analyzer uses a context free grammar parser, that syntactic analysis is restricted to parsing a context free grammar. This is simply not true.
In the case of C (and family), the syntax analyzer uses a symbol table to help it parse. In the case of XML, it uses the tag stack, and in the case of generalize SGML (including HTML) it also uses tag declarations. Consequently, the syntax analyzer considered as a whole is more powerful than the CFG, which is just a part of the analysis.
The fact that a given program passes the syntax analysis does not mean that it is semantically correct. For example, the syntax analyser needs to know whether a is a type or not in order to correctly parse (a)-b, but it does not need to know whether the cast is in fact possible, in the case that it a is a type, or that a and b can meaningfully be subtracted, in the case that a is a variable. These verifications can happen during type analysis after the parse tree is built, but they are still compile-time errors.

What is a tree parser in ANTLR and am I forced to write one?

I'm writing a lexer/parser for a small subset of C in ANTLR that will be run in a Java environment. I'm new to the world of language grammars and in many of the ANTLR tutorials, they create an AST - Abstract Syntax Tree, am I forced to create one and why?
Creating an AST with ANTLR is incorporated into the grammar. You don't have to do this, but it is a really good tool for more complicated requirements. This is a tutorial on tree construction you can use.
Basically, with ANTLR when the source is getting parsed, you have a few options. You can generate code or an AST using rewrite rules in your grammar. An AST is basically an in memory representation of your source. From there, there's a lot you can do.
There's a lot to ANTLR. If you haven't already, I would recommend getting the book.
I found this answer to the question on jGuru written by Terence Parr, who created ANTLR. I copied this explanation from the site linked here:
Only simple, so-called syntax directed translations can be done with actions within the parser. These kinds of translations can only spit out constructs that are functions of information already seen at that point in the parse. Tree parsers allow you to walk an intermediate form and manipulate that tree, gradually morphing it over several translation phases to a final form that can be easily printed back out as the new translation.
Imagine a simple translation problem where you want to print out an html page whose title is "There are n items" where n is the number of identifiers you found in the input stream. The ids must be printed after the title like this:
<html>
<head>
<title>There are 3 items</title>
</head>
<body>
<ol>
<li>Dog</li>
<li>Cat</li>
<li>Velociraptor</li>
</body>
</html>
from input
Dog
Cat
Velociraptor
So with simple actions in your grammar how can you compute the title? You can't without reading the whole input. Ok, so now we know we need an intermediate form. The best is usually an AST I've found since it records the input structure. In this case, it's just a list but it demonstrates my point.
Ok, now you know that a tree is a good thing for anything but simple translations. Given an AST, how do you get output from it? Imagine simple expression trees. One way is to make the nodes in the tree specific classes like PlusNode, IntegerNode and so on. Then you just ask each node to print itself out. For input, 3+4 you would have tree:
+
|
3 -- 4
and classes
class PlusNode extends CommonAST {
public String toString() {
AST left = getFirstChild();
AST right = left.getNextSibling();
return left + " + " + right;
}
}
class IntNode extends CommonAST {
public String toString() {
return getText();
}
}
Given an expression tree, you can translate it back to text with t.toString(). SO, what's wrong with this? Seems to work great, right? It appears to work well in this case because it's simple, but I argue that, even for this simple example, tree grammars are more readable and are formalized descriptions of precisely what you coded in the PlusNode.toString().
expr returns [String r]
{
String left=null, right=null;
}
: #("+" left=expr right=expr) {r=left + " + " + right;}
| i:INT {r=i.getText();}
;
Note that the specific class ("heterogeneous AST") approach actually encodes a complete recursive-descent parser for #(+ INT INT) by hand in toString(). As parser generator folks, this should make you cringe. ;)
The main weakness of the heterogeneous AST approach is that it cannot conveniently access context information. In a recursive-descent parser, your context is easily accessed because it can be passed in as a parameter. You also know precisely which rule can invoke which other rule (e.g., is this expression a WHILE condition or an IF condition?) by looking at the grammar. The PlusNode class above exists in a detached, isolated world where it has no idea who will invoke it's toString() method. Worse, the programmer cannot tell in which context it will be invoked by reading it.
In summary, adding actions to your input parser works for very straightforward translations where:
the order of output constructs is the same as the input order
all constructs can be generated from information parsed up to the point when you need to spit them out
Beyond this, you will need an intermediate form--the AST is the best form usually. Using a grammar to describe the structure of the AST is analogous to using a grammar to parse your input text. Formalized descriptions in a domain-specific high-level language like ANTLR are better than hand coded parsers. Actions within a tree grammar have very clear context and can conveniently access information passed from invoking rlues. Translations that manipulate the tree for multipass translations are also much easier using a tree grammar.
I think the creation of the AST is optional. The Abstract Syntax Tree is useful for subsequent processing like semantic analysis of the parsed program.
Only you can decide if you need to create one. If your only objective is syntactic validation then you don't need to generate one. In javacc (similar to ANTLR) there is a utility called JJTree that allows the generation of the AST. So I imagine this is optional in ANTLR as well.

Resources