Syntax analysis and semantic analysis - parsing

I am wondering how the syntax analysis and semantic analysis work.
I have finished the lexer and the grammar construction of my interpreter.
Now I am going to implement a recursive descent (top-down) parser for this grammar.
For example, I have the following grammar:
<declaration> ::= <data_type> <identifier> ASSIGN <value>
So I coded it like this (in Java):
public void declaration() {
    data_type();
    identifier();
    if (token.equals("ASSIGN")) {
        lexer(); // calls next token
        value();
    } else {
        error();
    }
}
Assuming I have three data types: Int, String and Boolean. Since the valid values are different for each data type (e.g., only true or false for Boolean), how can I determine whether a value fits the data type correctly? What part of my code would determine that?
I am wondering where would I put the code to:
1.) call the semantic analysis part of my program.
2.) store my variables into the symbol table.
Do syntax analysis and semantic analysis happen at the same time, or do I need to finish the syntax analysis first and then do the semantic analysis?
I am really confused. Please help.
Thank you.

You can do syntax analysis (parsing) and semantic analysis (e.g., checking for agreement between <declaration> and <value>) at the same time. For example, when declaration() calls data_type(), the latter could return something (call it DT) that indicates whether the declared type is Int, String or Boolean. Similarly, value() could return something (VT) that indicates the type of the parsed value. Then declaration() would simply compare DT and VT, and raise an error if they don't match. (Alternatively, value() could take a parameter indicating the declared type, and it could do the check.)
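For example, here is a minimal sketch of that single-pass approach, extending the declaration() method from the question. The Type enum and the token names INT_LITERAL, STRING_LITERAL, TRUE and FALSE are assumptions on top of the question's code; token, lexer(), error(), data_type() and identifier() are the members already present in the question's parser.

// Sketch only: data_type() and value() now return the type they recognized,
// and declaration() compares the two results.
enum Type { INT, STRING, BOOLEAN }

public void declaration() {
    Type declared = data_type();      // DT: the declared type
    identifier();
    if (token.equals("ASSIGN")) {
        lexer();                      // advance to the next token
        Type actual = value();        // VT: the type of the parsed value
        if (declared != actual) {
            error();                  // semantic error: type mismatch
        }
    } else {
        error();                      // syntax error: expected ASSIGN
    }
}

private Type value() {
    if (token.equals("TRUE") || token.equals("FALSE")) {
        lexer();
        return Type.BOOLEAN;
    } else if (token.equals("INT_LITERAL")) {
        lexer();
        return Type.INT;
    } else if (token.equals("STRING_LITERAL")) {
        lexer();
        return Type.STRING;
    }
    error();                          // syntax error: not a valid value
    return null;
}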
However, you may find it easier to completely separate the two phases. To do so, you would typically have the parsing phase build a parse tree (or abstract syntax tree). So your top level would call (e.g.) program() to parse a whole program, which would return a tree representing (the syntax of) the program, and you would pass that tree to a semantic_analysis() routine, which would traverse the tree, extracting pertinent information and enforcing semantic constraints.
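A rough sketch of that separated approach, assuming hypothetical node and analyzer classes (none of these names come from the question's code): the parser only builds tree nodes, and a later pass does all checking.

import java.util.HashMap;
import java.util.Map;

// All names here are illustrative. The parser's declaration() would build
// and return a DeclarationNode instead of checking anything itself.
enum Type { INT, STRING, BOOLEAN }

class ValueNode {
    Type type;            // set by the parser from the kind of literal token
}

class DeclarationNode {
    Type declaredType;    // from <data_type>
    String name;          // from <identifier>
    ValueNode value;      // from <value>
}

class SemanticAnalyzer {
    private final Map<String, Type> symbolTable = new HashMap<>();

    // Called while traversing the tree returned by the parser.
    void check(DeclarationNode decl) {
        if (symbolTable.containsKey(decl.name)) {
            throw new RuntimeException("redeclaration of " + decl.name);
        }
        if (decl.declaredType != decl.value.type) {
            throw new RuntimeException("type mismatch in declaration of " + decl.name);
        }
        symbolTable.put(decl.name, decl.declaredType); // store the variable here
    }
}

In this style, the answers to your two numbered questions are: semantic analysis is called once the tree has been built (question 1), and the symbol table is filled inside that pass (question 2). In the single-pass style shown earlier, both happen directly inside declaration().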

The short answer is: it depends on the definition of your programming language. And, since you only specified one derivation rule and three native types, there is no way of knowing. For example, if your programming language allows forward declarations like the C++ code below, then the derivation rule for the function declaration (foo) is handled without knowing the type of the variable serial.
class Tree {
public:
    int foo(void)
    {
        return serial;
    }
    int serial;
};
Indeed, modern compilers separate the syntax analysis phase from the semantic analysis phase. The syntax analysis phase is performed first, making sure the input program agrees with the context-free grammar of the language; in addition, it produces an Abstract Syntax Tree (AST). Note the difference between an AST and a parse tree, as discussed in this SO post. The semantic analysis phase then traverses the AST and checks for type mismatches, among other things.
Having said that, toy programming languages can sometimes couple semantic and syntax analysis together. When a recursive descent parser is used, you should have the relevant recursive calls return a type.

Related

Tatsu: Rule Ordering

I am playing around with Tatsu to implement a parser for a language used in the semiconductor industry. This language requires that variables be defined before usage. So for example:
SignalGroup { A: In; B: Out};
Pattern {
    V {A=1, B=1 }
    V {A=1, B=0 }
};
In this case, the SignalGroup block must come before the Pattern block. How do I enforce/implement this "ordering" when writing the grammar in TatSu?
Although for some languages it is possible to write grammars that verify whether the same symbol appears in different places, the grammars usually end up being too complicated to be useful.
Compilers (translators) are usually implemented with separate lexical, syntactical, and semantic analyzer components. There are several reasons for that:
Each component is so well focused that it is clearer and easier to write.
Each component is very efficient.
The most common errors (which are exactly lexical, syntactical, and semantic) can be reported earlier.
With those components in mind, checking if a symbol has been previously defined belongs to the semantic (meaning) aspect of the program, and the way to check is to keep a symbol table that is filled when the definition parts of the input are being parsed, and queried when the use parts of the input are being parsed.
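A minimal sketch of that symbol-table idea, shown here in Java rather than in TatSu's Python semantic actions (all names are illustrative):

import java.util.HashSet;
import java.util.Set;

// defineSymbol() would be called from the action for a definition
// (the SignalGroup block in the example above) and useSymbol() from
// the action for a use (inside the Pattern block).
class SymbolTable {
    private final Set<String> defined = new HashSet<>();

    void defineSymbol(String name) {
        defined.add(name);
    }

    void useSymbol(String name) {
        if (!defined.contains(name)) {
            throw new RuntimeException("'" + name + "' used before it was defined");
        }
    }
}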
In TatSu in particular the different components are well separated, yet run in parallel. For your requirement you just need to use the simplest grammar that allows for the semantic actions that store and query the symbols. By raising FailedSemantics from within semantic actions, any semantic errors will be reported exactly as the lexical and syntactical ones so the user doesn't have to think about which component flagged each error.
If you use the Python parser generation in TatSu, the translator will generate the skeleton of a semantic actions class as part of the output.

Parser AST - Advantage of Binary Expression vs Function

I've written a parser for a simple in house SQL style language. It's a typical recursive descent parser.
Naturally, we have expressions, and two of the possible forms of expressions I model are BinaryExpression and FunctionExpression. My question is, since a binary expression can be modelled as a function with two arguments, is there any advantage in keeping the distinction?
Perhaps function invocation is not normally modelled as an expression but as a statement, but here all my functions must produce a value.
How you choose to model your language is really up to you; it completely depends on how you intend to use the AST you construct.
Certainly there is no fundamental difference between evaluation of a binary operator and evaluation of a function with two arguments. On the other hand, there is a significant difference in the presentation (in most languages). Certain operators have very well understood properties which can be of use during static analysis, such as finding optimisations.
So both styles are certainly valid, and you will have to make the choice based on your knowledge of the intended use(s) of the AST.

Compiler Design : Is "variable not declared" a syntactic error or semantic error?

Is such type of an error produced during type checking or when input is being parsed?
Under what type should the error be addressed?
The way I see it, it is a semantic error, because your language parses just fine even though you are using an identifier which you haven't previously bound -- i.e. syntactic analysis only checks the program for well-formedness. Semantic analysis actually checks that your program has a valid meaning -- e.g. bindings, scoping or typing. As @pst said, you can do scope checking during parsing, but this is an implementation detail. AFAIK old compilers used to do this to save some time and space, but I think today such an approach is questionable if you don't have some hard performance/memory constraints.
The program conforms to the language grammar, so it is syntactically correct. A language grammar doesn't contain any statements like 'the identifier must be declared', and indeed doesn't have any way of doing so. An attempt to build a two-level grammar along these lines failed spectacularly in the Algol-68 project, and it has not been attempted since to my knowledge.
The meaning, if any, of each is a semantic issue. Frank DeRemer called issues like this 'static semantics'.
In my opinion, this is not strictly a syntax error - nor a semantic one. If I were to implement this for a statically typed, compiled language (like C or C++), then I would not put the check into the parser (because the parser is practically incapable of checking for this mistake), but rather into the code generator (the part of the compiler that walks the abstract syntax tree and turns it into assembly code). So in my opinion, it lies between syntax and semantic errors: it's a syntax-related error that can only be checked by performing semantic analysis on the code.
If we consider a primitive scripting language however, where the AST is directly executed (without compilation to bytecode and without JIT), then it's the evaluator/executor function itself that walks the AST and finds the undeclared variable - in this case, it will be a runtime error. The difference lies between the "AST_walk()" routine being in different parts of the program lifecycle (compilation time and runtime), should the language be a scripting or a compiled one.
In the case of languages -- and there are many -- which require identifiers to be declared, a program with undeclared identifiers is ill-formed and thus a missing declaration is clearly a syntax error.
The usual way to deal with this is to incorporate information about symbols in a symbol table, so that the parser can use this information.
Here are a few examples of how identifier type affects parsing:
C / C++
A classic case:
(a)-b;
Depending on a, that's either a cast or a subtraction:
#include <stdio.h>

#if TYPEDEF
typedef double a;
#else
double a = 3.0;
#endif

int main() {
    int b = 3;
    printf("%g\n", (a)-b);
    return 0;
}
Consequently, if a hadn't been declared at all, the compiler must reject the program as syntactically ill-formed (and that is precisely the word the standard uses).
XML
This one is simple:
<block>Hello, world</blob>
That's ill-formed XML, but it cannot be detected with a CFG. (Nonetheless, all XML parsers will correctly reject it as ill-formed.) In the case of HTML/SGML, where end-tags may be omitted under some well-defined circumstances, parsing is trickier but nonetheless deterministic; again, the precise declaration of a tag will determine the parse of a valid input, and it's easy to come up with inputs which parse differently depending on declaration.
English
OK, not a programming language. I have lots of other programming language examples, but I thought this one might trigger some other intuitions.
Consider the two grammatically correct sentences:
The sheep is in the meadow.
The sheep are in the meadow.
Now, what about:
The cow is in the meadow.
(*) The cow are in the meadow.
The second sentence is intelligible, albeit ambiguous (is the noun or the verb wrong?) but it is certainly not grammatically correct. But in order to know that (and other similar examples), we have to know that sheep has an unmarked plural. Indeed, many animals have unmarked plurals, so I recognize all the following as grammatical:
The caribou are in the meadow.
The antelope are in the meadow.
The buffalo are in the meadow.
But definitely not:
(*) The mouse are in the meadow.
(*) The bird are in the meadow.
etc.
It seems that there is a common misconception that, because the syntactic analyzer uses a context-free grammar parser, syntactic analysis is restricted to parsing a context-free grammar. This is simply not true.
In the case of C (and family), the syntax analyzer uses a symbol table to help it parse. In the case of XML, it uses the tag stack, and in the case of generalized SGML (including HTML) it also uses tag declarations. Consequently, the syntax analyzer considered as a whole is more powerful than the CFG, which is just a part of the analysis.
The fact that a given program passes the syntax analysis does not mean that it is semantically correct. For example, the syntax analyser needs to know whether a is a type or not in order to correctly parse (a)-b, but it does not need to know whether the cast is in fact possible, in the case that a is a type, or that a and b can meaningfully be subtracted, in the case that a is a variable. These verifications can happen during type analysis after the parse tree is built, but they are still compile-time errors.
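A rough sketch of that decision point, reduced to the bare minimum (the class and method names are invented for illustration and are not real C parser internals):

import java.util.HashSet;
import java.util.Set;

// Sketch: the parser records typedef names as it sees them and later
// consults that set to decide how to parse "(a)-b".
class CastOrSubtraction {
    private final Set<String> typedefNames = new HashSet<>();

    void declareTypedef(String name) {
        typedefNames.add(name);             // seen: typedef double a;
    }

    String classifyParenExpression(String identifier) {
        // Called when the parser has just read "(" identifier ")" followed by "-".
        return typedefNames.contains(identifier)
                ? "cast expression"         // (a)-b casts -b to type a
                : "subtraction";            // (a)-b subtracts b from variable a
    }
}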

What is a tree parser in ANTLR and am I forced to write one?

I'm writing a lexer/parser for a small subset of C in ANTLR that will be run in a Java environment. I'm new to the world of language grammars and in many of the ANTLR tutorials, they create an AST - Abstract Syntax Tree, am I forced to create one and why?
Creating an AST with ANTLR is incorporated into the grammar. You don't have to do this, but it is a really good tool for more complicated requirements. This is a tutorial on tree construction you can use.
Basically, with ANTLR when the source is getting parsed, you have a few options. You can generate code or an AST using rewrite rules in your grammar. An AST is basically an in-memory representation of your source. From there, there's a lot you can do.
There's a lot to ANTLR. If you haven't already, I would recommend getting the book.
I found this answer to the question on jGuru written by Terence Parr, who created ANTLR. I copied this explanation from the site linked here:
Only simple, so-called syntax directed translations can be done with actions within the parser. These kinds of translations can only spit out constructs that are functions of information already seen at that point in the parse. Tree parsers allow you to walk an intermediate form and manipulate that tree, gradually morphing it over several translation phases to a final form that can be easily printed back out as the new translation.
Imagine a simple translation problem where you want to print out an html page whose title is "There are n items" where n is the number of identifiers you found in the input stream. The ids must be printed after the title like this:
<html>
  <head>
    <title>There are 3 items</title>
  </head>
  <body>
    <ol>
      <li>Dog</li>
      <li>Cat</li>
      <li>Velociraptor</li>
    </ol>
  </body>
</html>
from input
Dog
Cat
Velociraptor
So with simple actions in your grammar how can you compute the title? You can't without reading the whole input. Ok, so now we know we need an intermediate form. The best is usually an AST I've found since it records the input structure. In this case, it's just a list but it demonstrates my point.
Ok, now you know that a tree is a good thing for anything but simple translations. Given an AST, how do you get output from it? Imagine simple expression trees. One way is to make the nodes in the tree specific classes like PlusNode, IntegerNode and so on. Then you just ask each node to print itself out. For input, 3+4 you would have tree:
+
|
3 -- 4
and classes
class PlusNode extends CommonAST {
    public String toString() {
        AST left = getFirstChild();
        AST right = left.getNextSibling();
        return left + " + " + right;
    }
}

class IntNode extends CommonAST {
    public String toString() {
        return getText();
    }
}
Given an expression tree, you can translate it back to text with t.toString(). So, what's wrong with this? Seems to work great, right? It appears to work well in this case because it's simple, but I argue that, even for this simple example, tree grammars are more readable and are formalized descriptions of precisely what you coded in PlusNode.toString().
expr returns [String r]
{
    String left=null, right=null;
}
    : #("+" left=expr right=expr) {r=left + " + " + right;}
    | i:INT {r=i.getText();}
    ;
Note that the specific class ("heterogeneous AST") approach actually encodes a complete recursive-descent parser for #(+ INT INT) by hand in toString(). As parser generator folks, this should make you cringe. ;)
The main weakness of the heterogeneous AST approach is that it cannot conveniently access context information. In a recursive-descent parser, your context is easily accessed because it can be passed in as a parameter. You also know precisely which rule can invoke which other rule (e.g., is this expression a WHILE condition or an IF condition?) by looking at the grammar. The PlusNode class above exists in a detached, isolated world where it has no idea who will invoke its toString() method. Worse, the programmer cannot tell in which context it will be invoked by reading it.
In summary, adding actions to your input parser works for very straightforward translations where:
the order of output constructs is the same as the input order
all constructs can be generated from information parsed up to the point when you need to spit them out
Beyond this, you will need an intermediate form--the AST is the best form usually. Using a grammar to describe the structure of the AST is analogous to using a grammar to parse your input text. Formalized descriptions in a domain-specific high-level language like ANTLR are better than hand coded parsers. Actions within a tree grammar have very clear context and can conveniently access information passed from invoking rules. Translations that manipulate the tree for multipass translations are also much easier using a tree grammar.
I think the creation of the AST is optional. The Abstract Syntax Tree is useful for subsequent processing like semantic analysis of the parsed program.
Only you can decide if you need to create one. If your only objective is syntactic validation then you don't need to generate one. In JavaCC (similar to ANTLR) there is a utility called JJTree that allows the generation of the AST. So I imagine this is optional in ANTLR as well.

Looking for a clear definition of what a "tokenizer", "parser" and "lexers" are and how they are related to each other and used?

I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program that will go through c/h source files to extract data declarations and definitions.
I have been looking for examples and can find some info, but I am really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax trees and how they interrelate to each other. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, and 2) are there common implementations?
I have been looking at Wikipedia on these topics and programs like Lex and Yacc, but having never gone through a compiler class (EE major) I am finding it difficult to fully understand what is going on.
A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program described by the original text.
Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".
Example:
int x = 1;
A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';' (a minimal lexer sketch follows the list below).
A parser will take those tokens and use them to understand in some way:
we have a statement
it's a definition of an integer
the integer is called 'x'
'x' should be initialised with the value 1
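Here is a minimal sketch of a hand-written lexer for exactly that example (the class name and the token handling are made up for illustration; a real lexer for C would need far more cases):

import java.util.ArrayList;
import java.util.List;

// Minimal, hypothetical lexer sketch for the "int x = 1;" example above.
// It only distinguishes identifiers/keywords, numbers and single-character symbols.
class TinyLexer {
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                     // skip spaces and newlines
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < source.length() && Character.isLetterOrDigit(source.charAt(i))) i++;
                tokens.add(source.substring(start, i));  // 'int', 'x'
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
                tokens.add(source.substring(start, i));  // '1'
            } else {
                tokens.add(String.valueOf(c));           // '=', ';'
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int x = 1;"));      // prints [int, x, =, 1, ;]
    }
}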
I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.
(adding to the given answers)
Tokenizer will also remove any comments, and only return tokens to the Lexer.
Lexer will also define scopes for those tokens (variables/functions).
Parser then will build the code/program structure.
Using
"Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, AKA the Purple Dragon Book
a related answer of mine What is the difference between a token and a lexeme?
As with my other answer, such questions as this make more sense when a specific goal is desired.
In your case the specific goal is
Create a program that will go through c/h source files to extract data declarations and definitions.
If the goal is to create Abstract Syntax Trees (AST) then those are created using a Parser, and a Parser is commonly fed a list of Tokens from the Lexer. Notice that Tokenizer is deliberately not mentioned.
Another way to think of the relation between a Lexer and Parser is that a Lexer creates a linear structure (list/stream of tokens) and a Parser converts the tokens into a tree structure (Abstract Syntax Tree).
If you read the Dragon book you will notice that the word Analysis appears often which is to say that analysis is one of the key functions at the various stages. This is because when working with Lexers and Parsers they are designed to work with formal languages and a determination needs to be made if the input adheres to the formal language.
From page 5
character stream
|
V
Lexical Analyzer
(token stream)
|
V
Syntax Analyzer
(syntax tree)
|
V
Semantic Analyzer
(syntax tree)
|
V
...
In the above diagram the Lexer is associated with Lexical Analyzer and I would associate Syntax Analyzer and Semantic Analyzer with Parser but YMMV.
AFAIK Tokenizer has no official definition in the Dragon book, not even noted in the index. I don't have an electronic copy of the book so could not do an automated search.
One common reference that notes Tokenizer is Anatomy of a Compiler but the Dragon books are the reference of choice by many in the field.
However, if your only goal is to create a list of tokens and then do something else other than semantic analysis, then calling the module/function/... a tokenizer might be the right name.
I use Lexer with Parser and don't use Tokenizer with Parser.
Another thought to keep in mind is that no useful information should be lost in the transformations. In other words, if one of your goals is to be able to recreate the input from the AST, then the AST needs to capture the extraneous information like whitespace, which means the Lexer also needs to capture that extraneous information. One reason to go through such effort is to create useful error messages or to support edit-and-continue debugging.

Resources