I've been studying the grammar and AST nodes for various languages with AST explorer.
With Python, I've noticed that some form of semantic analysis seems to take place during the parsing process. For example:
x = 2
x = 2
yields an AST consisting of a VariableDeclaration node and an ExpressionStatement node.
So when the first x = 2 line is parsed, the parser checks a symbol table for the existence of x, registers it, and produces a VariableDeclaration node. Then, when the second x = 2 line is parsed, it finds that x is already defined and produces an ExpressionStatement node.
However when I attempt to use the following semantically incorrect code:
2 + "string"
It accepts the code and produces an ExpressionStatement node, even though it is semantically incorrect (int + string), and it rightly produces an error when I attempt to execute it with a Python interpreter.
This suggests to me that semantic analysis takes place twice: once during the parsing process and once again while traversing the complete AST. Is this assumption correct? If so, why is this the case? Wouldn't it be simpler to do the entire semantic analysis phase during parsing instead of splitting it up?
The semantic error in the statement 2 + "string" is not detected in any semantic pass whatsoever. It is a runtime error and it is reported when you attempt to execute the statement. If the statement is never executed, no error is reported, as you can see by executing the script
if False:
    2 + "string"
print("All good!")
Resolving the first use of a global variable as a declaration is more of an optimisation than anything else, and it is common for compilers to execute multiple optimisation passes.
There is always a temptation to try to combine these multiple passes but it is a false economy: walking an AST is relatively low overhead, and code is much clearer and more maintainable when it only attempts to do one thing. Intertwining two unrelated optimization heuristics is poor design, just as is intertwining any set of unrelated procedures.
Related
I have detailed the specifications of the problem for reasons that will become clear after I ask my question, at the end. The program I am building is a parser in Java for a language with the following syntax (although this is not very relevant to the question):
<expr> ::= [<op> <expr> <expr>] | <symbol> | <value>
<symbol> ::= [a-zA-Z]+
<value> ::= [0-9]+
<op> ::= '+' | '*' | '==' | '<'
<assignment> ::= [= <symbol> <expr>]
<prog> ::= <assignment> |
[; <prog> <prog>] |
[if <expr> <prog> <prog>] |
[for <assignment> <expr> <assignment> <prog>] |
[assert <expr>] |
[return <expr>]
This is an example of code in said language:
[; [= x 0] [; [if [== x 5] [= x 7] [= x [+ x 1]]] [return x]]]
Which is equivalent to:
x = 0;
if (x == 5)
    x = 7;
else
    x = x + 1;
return x;
The code is guaranteed to be given in correct syntax; incorrectness of the given code is defined only by having:
a) A variable (symbol) used without being previously declared (where "declared" means something was assigned to it), even if the variable is used in a branch of an if or some other place that is never reached in the execution of the program;
b) A "return" instruction missing on some path the program could take; the program must not be able to end without returning on any execution path it may take.
The requirements are that my parser must:
a) Check for said correctness;
b) Parse the code and compute what is the returned value.
My take on this is:
1) Parse the code given into a tree of instructions and expressions;
2) Check for correctness by traversing the tree and seeing if a variable was declared in an upper scope before it was used;
3) Check for correctness by traversing the tree and seeing if any execution branch ends in a "return" instruction;
4) If all previous conditions hold, evaluate the returned value of the code by traversing the tree and remembering the value of all the variables in a HashMap or some other storage.
Now, my problem is that I must implement the parser using the Visitor and Observer design patterns. This is a key requirement for the project. I am fairly new to design patterns and only have a basic grasp about these two.
My question is: where should/can I use the Observer design pattern in my parser?
It makes sense to use a Visitor to visit the nodes of the tree for steps 2, 3 and 4. I cannot figure out where I must use the Observer pattern, though.
Is there any place I can use it in my implementation? From my understanding, the Observer pattern takes care of data that can be read and modified by many "observers", the central idea being that an object modifying the data will announce the other objects that may be affected by the modification.
The main data being modified in my program is the tree and the HashMap in which I store the values of the variables. Both of these are accessed in a linear fashion, by only one thing. The tree is built one node at a time, and no other node, or object for that matter, cares that a node is added or modified. In the evaluation phase, each node is visited and variables are added or modified in the hash table, but no object other than the current visitor at the current node cares about this. I suppose I could make each node an observer which, upon observing a change, does nothing, or something like that, forcing an Observer pattern in, but that isn't really useful.
So, is there an obvious answer which I am completely missing? Is there a not-so-obvious one that still gives a useful implementation of Observer? Can I use a half-useful, slightly forced Observer pattern somewhere in my algorithms, or is a fully forced, completely useless one the only way to implement it? Is there a completely different way of approaching the problem which will allow me to use the Visitor and, more importantly, the Observer pattern?
Notes:
I have yet to implement the evaluation of the tree (steps 2, 3 and 4) with a Visitor; I have only thought about how I should do it. I will implement it tomorrow and see if there is a way to use Observer somewhere, but having thought about it for a few hours, I still have no idea. I am hoping, though, that there is a way which I haven't been able to discover but which will become clear after writing that part.
I apologize for writing so much. I couldn't summarize it any better and still give details about the situation.
Also, I apologize if I am not clear in my explanations. It is quite late, I have thought about this for some hours and got tired, and I can't say I have a perfect grasp of the concepts. If anything is unclear or you want further details on some matter, don't hesitate to ask. Also, don't hesitate to highlight any mistakes or wrong paths in my judgement about the problem.
Here are some ideas how you could use well-known patterns and concepts to build an interpreter for your language:
Start processing an input stream by splitting it up into tokens ([;, =, x, 0, ], etc.). This first component (a.k.a. lexer, scanner, tokenizer) strips out irrelevant detail such as whitespace and produces tokens.
You could implement this as a simple state machine that is fed one character at a time. It can therefore be implemented as an observer of the input character sequence.
In the next stage (a.k.a. parsing) you build an abstract syntax tree (AST) from the generated tokens.
The parser is an observer of the tokenizer, i.e. it gets notified of any new tokens the tokenizer produces.(*) It is fed one token at a time. In the case of a fairly simple grammar, the parser itself can be a stack-based state machine. (For example, if it needs to match opening and closing brackets, it needs to be able to remember what context / state it was in outside the brackets, so it'll probably use some kind of stack for context management.)
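To make the lexer-as-subject idea concrete, here is a minimal sketch in Java for the bracket language above. TokenObserver and the token rules are illustrative assumptions, not a prescribed design:

```java
import java.util.ArrayList;
import java.util.List;

// The parser (or anything else) implements this to be notified of tokens.
interface TokenObserver {
    void onToken(String token);
}

class Tokenizer {
    private final List<TokenObserver> observers = new ArrayList<>();

    void addObserver(TokenObserver o) {
        observers.add(o);
    }

    // Feed the raw input; emit one token per bracket/operator or word.
    void tokenize(String input) {
        StringBuilder current = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c == '[' || c == ']' || c == ';' || c == '=') {
                flush(current);
                publish(String.valueOf(c));
            } else if (Character.isWhitespace(c)) {
                flush(current);          // whitespace is stripped, not published
            } else {
                current.append(c);
            }
        }
        flush(current);
    }

    private void flush(StringBuilder sb) {
        if (sb.length() > 0) {
            publish(sb.toString());
            sb.setLength(0);
        }
    }

    private void publish(String token) {
        for (TokenObserver o : observers) {
            o.onToken(token);            // the observer is notified here
        }
    }
}
```

A parser would register itself via addObserver and, for the input `[= x 0]`, receive the tokens `[`, `=`, `x`, `0`, `]` one at a time.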
Once the parser has built an AST, perhaps almost all subsequent steps can be implemented using the visitor pattern. Every algorithm that traverses the AST, locates specific nodes or subtrees, or transforms (parts of) the AST can be modelled as a visitor. Some visitors will only model actions that do not return a value, others return a new AST node for a visited node such that a new transformed AST can be put together. For example:
Check whether the AST or any subtree of it describes a valid program fragment (visit a (sub-) tree, reduce it to a boolean or a list of errors).
Simplify / optimize the program (visit a (sub-) tree, generate a smaller or otherwise more optimal subtree, finally reassemble a new tree from the transformed subtrees).
Execute the AST (visit and interpret each node, executing some code according to the node's meaning).
(*) Calling the parser an observer of the scanner is perhaps somewhat inaccurate. This Software Engineering SE post has a good summary of closely related design patterns. It could be that the scanner implements a strategy for dealing with the tokens.
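As an illustration of the "one algorithm = one visitor" idea, here is a minimal sketch in Java; the node and visitor names are made up for the example:

```java
// A tiny two-node expression AST with a generic visitor.
interface Expr {
    <R> R accept(ExprVisitor<R> v);
}

interface ExprVisitor<R> {
    R visitNumber(NumberExpr e);
    R visitAdd(AddExpr e);
}

class NumberExpr implements Expr {
    final int value;
    NumberExpr(int value) { this.value = value; }
    public <R> R accept(ExprVisitor<R> v) { return v.visitNumber(this); }
}

class AddExpr implements Expr {
    final Expr left, right;
    AddExpr(Expr left, Expr right) { this.left = left; this.right = right; }
    public <R> R accept(ExprVisitor<R> v) { return v.visitAdd(this); }
}

// One algorithm = one visitor: this one interprets the tree.
// A validity checker or an optimizer would be separate visitors
// over the same node types.
class Evaluator implements ExprVisitor<Integer> {
    public Integer visitNumber(NumberExpr e) { return e.value; }
    public Integer visitAdd(AddExpr e) {
        return e.left.accept(this) + e.right.accept(this);
    }
}
```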
I was given a fragment of code (a function called bubbleSort(), written in Java, for example). How can I, or rather my program, tell if a given piece of source code implements a particular sorting algorithm correctly (using the bubble method, for instance)?
I can force a user to provide a legitimate function by analyzing the function signature: making sure the argument and return value are arrays of integers. But I have no idea how to determine that the algorithm logic is implemented the right way. The input code could sort values correctly, but not with the aforementioned bubble method. How can my program discern that? I do realize a lot of code parsing would be involved, but maybe there's something else that I should know.
I hope I was somewhat clear.
I'd appreciate if someone could point me in the right direction or give suggestions on how to tackle such a problem. Perhaps there are tested ways that ease the evaluation of program logic.
In general, you can't do this because of the Halting problem. You can't even decide if the function will halt ("return").
As a practical matter, there's a bit more hope. If you are looking for a bubble sort, you can decide that it has a number of parts:
a to-be-sorted data type S with a partial order,
a container data type C with a single instance variable A ("the array")
that holds the to-be-sorted data,
a key type K ("array index"), with a partial order, used to access the container,
such that container[K] has type S,
a comparison of two members of the container, at key A and key B
such that A < B according to the key partial order, that determines
whether container[B] > container[A],
a swap operation on container[A], container[B] and some variable T of type S, conditionally dependent on the comparison,
a loop wrapped around the container that enumerates keys according to the partial order on K.
You can build bits of code that find each of these bits of evidence in your source code, and if you find them all, claim you have evidence of a bubble sort.
To do this concretely, you need standard program analysis machinery:
to parse the source code and build an abstract syntax tree
build symbol tables (ST) that know the type of each identifier where it is used
construct a control flow graph (CFG) so that you check that various recognized bits occur in appropriate ordering
construct a data flow graph (DFG), so that you can determine that values recognized in one part of the algorithm flow properly to another part
[That's a lot of machinery just to get started]
From here, you can write ad hoc procedural code to climb over the AST, ST, CFG and DFG to "recognize" each of the individual parts. This is likely to be pretty messy, as each recognizer will be checking these structures for evidence of its bit. But you can do it.
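As a toy illustration of the shape of one such recognizer, here is a sketch that spots the three-statement swap idiom (t = a; a = b; b = t) in a flat list of assignment strings. A real recognizer would of course walk the AST and consult the DFG instead of comparing strings:

```java
import java.util.List;

class SwapRecognizer {
    // Each statement is assumed to be "lhs=rhs" with single-identifier
    // sides; this stands in for AST nodes purely for illustration.
    static boolean looksLikeSwap(List<String> stmts) {
        for (int i = 0; i + 2 < stmts.size(); i++) {
            String[] s1 = stmts.get(i).split("=");
            String[] s2 = stmts.get(i + 1).split("=");
            String[] s3 = stmts.get(i + 2).split("=");
            boolean tempTakesFirst = s1[1].equals(s2[0]);    // t = a; a = ...
            boolean firstTakesSecond = s2[1].equals(s3[0]);  // a = b; b = ...
            boolean secondTakesTemp = s3[1].equals(s1[0]);   // b = t
            if (tempTakesFirst && firstTakesSecond && secondTakesTemp) {
                return true;   // evidence of a swap found
            }
        }
        return false;
    }
}
```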
This is messy enough, and interesting enough, that there are tools which can do much of this.
Our DMS Software Reengineering Toolkit is one. DMS already contains all the machinery to do standard program analysis for several languages. DMS also has a Dataflow pattern matching language, inspired by Rich and Water's 1980's "Programmer's Apprentice" ideas.
With DMS, you can express this particular problem roughly like this (untested):
dataflow pattern domain C;
dataflow pattern swap(in out v1:S, in out v2:S, T:S):statements =
" \T = \v1;
\v1 = \v2;
\v2 = \T;";
dataflow pattern conditional_swap(in out v1:S, in out v2:S,T:S):statements=
" if (\v1 > \v2)
\swap(\v1,\v2,\T);"
dataflow pattern container_access(inout container C, in key: K):expression
= " \container.body[\K] ";
dataflow pattern size(in container:C, out: integer):expression
= " \container . size "
dataflow pattern bubble_sort(in out container:C, k1: K, k2: K):function
  = " \k1 = \smallestK\(\);
      while (\k1 < \size\(container\)) {
          \k2 = \next\(k1\);
          while (\k2 <= \size\(container\)) {
              \conditional_swap\(\container_access\(\container\,\k1\),
                                 \container_access\(\container\,\k2\)\)
          }
      }
    ";
Within each pattern, you can write what amounts to the concrete syntax of the chosen programming language ("pattern domain"), referencing dataflows named in the pattern signature line. A subpattern can be mentioned inside another; one has to pass the dataflows to and from the subpattern by naming them. Unlike "plain old C", you have to pass the container explicitly rather than by implicit reference; that's because we are interested in the actual values that flow from one place in the pattern to another. (Just because two places in the code use the same variable, doesn't mean they see the same value).
Given these definitions, and asked to "match bubble_sort", DMS will visit the DFG (tied to the CFG/AST/ST) to try to match the pattern; where it matches, it will bind the pattern variables to the DFG entries. If it can't find a match for everything, the match fails.
To accomplish the match, each of the patterns above is converted essentially into its own DFG, and then each pattern is matched against the DFG for the code using what is called a subgraph isomorphism test. Constructing the DFG for the pattern takes a lot of machinery: parsing, name resolution, control and data flow analysis, applied to fragments of code in the original language, intermixed with various pattern meta-escapes. The subgraph isomorphism test is "sort of easy" to code, but can be very expensive to run. What saves the DMS pattern matchers is that most patterns have many, many constraints [tech point: and they don't have knots] and each attempted match tends to fail pretty fast, or succeed completely.
Not shown, but by defining the various bits separately, one can provide alternative implementations, enabling the recognition of variations.
We have used this to implement quite complete factory control model extraction tools from real industrial plant controllers for Dow Chemical, on their peculiar Dowtran language (which meant building parsers, etc., as above, for Dowtran). We have a version of this prototyped for C; the data flow analysis is harder.
I am wondering how the syntax analysis and semantic analysis work.
I have finished the lexer and the grammar construction of my interpreter.
Now I am going to implement a recursive descent (top-down) parser for this grammar.
For example, I have the following grammar:
<declaration> ::= <data_type> <identifier> ASSIGN <value>
So I coded it like this (in Java):
public void declaration() {
    data_type();
    identifier();
    if (token.equals("ASSIGN")) {
        lexer(); // advances to the next token
        value();
    } else {
        error();
    }
}
Assume I have three data types: Int, String and Boolean. Since the values for each data type are different (e.g. only true or false for Boolean), how can I determine whether the value fits the data type correctly? What part of my code would determine that?
I am wondering where I would put the code to:
1.) call the semantic analysis part of my program.
2.) store my variables into the symbol table.
Do syntax analysis and semantic analysis happen at the same time, or do I need to finish the syntax analysis first and then do the semantic analysis?
I am really confused. Please help.
Thank you.
You can do syntax analysis (parsing) and semantic analysis (e.g., checking for agreement between <data_type> and <value>) at the same time. For example, when declaration() calls data_type(), the latter could return something (call it DT) that indicates whether the declared type is Int, String or Boolean. Similarly, value() could return something (VT) that indicates the type of the parsed value. Then declaration() would simply compare DT and VT, and raise an error if they don't match. (Alternatively, value() could take a parameter indicating the declared type, and it could do the check itself.)
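A minimal sketch of that combined approach, with the token handling stubbed out into constructor arguments (DeclarationParser and the type tags are hypothetical names, loosely following the question's code):

```java
class DeclarationParser {
    private final String declaredType;  // what data_type() would consume
    private final String literal;       // what value() would consume

    DeclarationParser(String declaredType, String literal) {
        this.declaredType = declaredType;
        this.literal = literal;
    }

    // data_type() returns DT: the declared type tag.
    String dataType() { return declaredType; }

    // value() returns VT: a type tag inferred from the literal's shape.
    String valueType() {
        if (literal.equals("true") || literal.equals("false")) return "Boolean";
        if (literal.matches("-?\\d+")) return "Int";
        return "String";
    }

    // declaration() compares DT and VT; here a mismatch just returns
    // false, where a real parser would report an error.
    boolean declaration() {
        return dataType().equals(valueType());
    }
}
```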
However, you may find it easier to completely separate the two phases. To do so, you would typically have the parsing phase build a parse tree (or abstract syntax tree). So your top level would call (e.g.) program() to parse a whole program, which would return a tree representing (the syntax of) the program, and you would pass that tree to a semantic_analysis() routine, which would traverse the tree, extracting pertinent information and enforcing semantic constraints.
The short answer is: it depends on the definition of your programming language. And, since you only specified one derivation rule and three native types, there is no way of knowing. For example, if your programming language allows forward declarations like the C++ code below, then handling the derivation rule for the function declaration (foo) is done without knowing the type of the variable serial:
class Tree {
public:
    int foo(void)
    {
        return serial;
    }
    int serial;
};
Indeed, modern compilers separate the syntax analysis phase from the semantic analysis phase. The syntax analysis phase is performed first, making sure the input program agrees with the context free grammar of the language. And, in addition, produces an Abstract Syntax Tree (AST). Note the difference between an AST and a parse tree, as discussed in this SO post. The semantic analysis phase then traverses the AST and checks for type mismatches among other things.
Having said that, toy programming languages can sometimes couple semantic and syntax analysis together. When a recursive descent parser is used, you would have the relevant recursive calls return a type.
Is this type of error produced during type checking, or while the input is being parsed?
Under what category should the error be classified?
The way I see it, it is a semantic error, because your language parses it just fine even though you are using an identifier which you haven't previously bound; i.e. syntactic analysis only checks the program for well-formedness. Semantic analysis actually checks that your program has a valid meaning, e.g. bindings, scoping or typing. As @pst said, you can do scope checking during parsing, but this is an implementation detail. AFAIK, old compilers used to do this to save some time and space, but I think today such an approach is questionable unless you have some hard performance/memory constraints.
The program conforms to the language grammar, so it is syntactically correct. A language grammar doesn't contain any statements like 'the identifier must be declared', and indeed doesn't have any way of doing so. An attempt to build a two-level grammar along these lines failed spectacularly in the Algol-68 project, and it has not been attempted since to my knowledge.
The meaning, if any, of each is a semantic issue. Frank deRemer called issues like this 'static semantics'.
In my opinion, this is not strictly a syntax error - nor a semantic one. If I were to implement this for a statically typed, compiled language (like C or C++), then I would not put the check into the parser (because the parser is practically incapable of checking for this mistake), rather into the code generator (the part of the compiler that walks the abstract syntax tree and turns it into assembly code). So in my opinion, it lies between syntax and semantic errors: it's a syntax-related error that can only be checked by performing semantic analysis on the code.
If we consider a primitive scripting language, however, where the AST is directly executed (without compilation to bytecode and without JIT), then it's the evaluator/executor function itself that walks the AST and finds the undeclared variable; in this case, it will be a runtime error. The difference lies in the "AST_walk()" routine belonging to different parts of the program lifecycle (compile time vs. runtime), depending on whether the language is a scripting or a compiled one.
In the case of languages -- and there are many -- which require identifiers to be declared, a program with undeclared identifiers is ill-formed and thus a missing declaration is clearly a syntax error.
The usual way to deal with this is to incorporate information about symbols into a symbol table, so that the parser can use this information.
Here are a few examples of how identifier type affects parsing:
C / C++
A classic case:
(a)-b;
Depending on a, that's either a cast or a subtraction:
#include <stdio.h>
#if TYPEDEF
typedef double a;
#else
double a = 3.0;
#endif
int main() {
int b = 3;
printf("%g\n", (a)-b);
return 0;
}
Consequently, if a hadn't been declared at all, the compiler must reject the program as syntactically ill-formed (and that is precisely the word the standard uses.)
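The symbol-table feedback can be sketched as a toy classifier (not a real C parser; all names here are illustrative): the parser asks what kind of name it saw, and the answer decides which parse of (a)-b to take.

```java
import java.util.HashSet;
import java.util.Set;

class TypedefAwareClassifier {
    private final Set<String> typedefNames = new HashSet<>();
    private final Set<String> variableNames = new HashSet<>();

    void declareTypedef(String name) { typedefNames.add(name); }
    void declareVariable(String name) { variableNames.add(name); }

    // How should "(name)-b" parse, given what has been declared so far?
    String classify(String name) {
        if (typedefNames.contains(name)) return "cast";         // (a)-b casts -b
        if (variableNames.contains(name)) return "subtraction"; // (a)-b is a - b
        return "ill-formed";  // undeclared: no parse can be chosen
    }
}
```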
XML
This one is simple:
<block>Hello, world</blob>
That's ill-formed XML, but it cannot be detected with a CFG. (Nonetheless, all XML parsers will correctly reject it as ill-formed.) In the case of HTML/SGML, where end-tags may be omitted under some well-defined circumstances, parsing is trickier but nonetheless deterministic; again, the precise declaration of a tag will determine the parse of a valid input, and it's easy to come up with inputs which parse differently depending on declaration.
English
OK, not a programming language. I have lots of other programming language examples, but I thought this one might trigger some other intuitions.
Consider the two grammatically correct sentences:
The sheep is in the meadow.
The sheep are in the meadow.
Now, what about:
The cow is in the meadow.
(*) The cow are in the meadow.
The second sentence is intelligible, albeit ambiguous (is the noun or the verb wrong?) but it is certainly not grammatically correct. But in order to know that (and other similar examples), we have to know that sheep has an unmarked plural. Indeed, many animals have unmarked plurals, so I recognize all the following as grammatical:
The caribou are in the meadow.
The antelope are in the meadow.
The buffalo are in the meadow.
But definitely not:
(*) The mouse are in the meadow.
(*) The bird are in the meadow.
etc.
It seems that there is a common misconception that because the syntactic analyzer uses a context free grammar parser, that syntactic analysis is restricted to parsing a context free grammar. This is simply not true.
In the case of C (and family), the syntax analyzer uses a symbol table to help it parse. In the case of XML, it uses the tag stack, and in the case of generalize SGML (including HTML) it also uses tag declarations. Consequently, the syntax analyzer considered as a whole is more powerful than the CFG, which is just a part of the analysis.
The fact that a given program passes syntax analysis does not mean that it is semantically correct. For example, the syntax analyzer needs to know whether a is a type or not in order to correctly parse (a)-b, but it does not need to know whether the cast is in fact possible, in the case that a is a type, or that a and b can meaningfully be subtracted, in the case that a is a variable. These verifications can happen during type analysis after the parse tree is built, but they are still compile-time errors.
I've recently been writing a small programming language, and have finished writing its parser. I want to write an automated test for the parser (i.e., that its result is the expected abstract syntax tree), but I'm not sure which way is better.
First, what I tried was just to serialize the AST to S-expression text and compare it to expected output text I wrote by hand, but this has some problems:
There are trivial, meaningless differences between the serialized text and the expected output, like whitespace. For example, there is no difference between:
(attribute (symbol str) (symbol length))
(that is serialized) and:
(attribute (symbol str)
(symbol length))
(handwritten by me) in meaning, but string comparison distinguishes them, of course. Okay, I could resolve that by normalization.
When a test fails, it doesn't show the difference between the actual tree and the expected tree concisely. I want to show only the differing node, not the whole tree.
Second, what I tried was to write an S-expression parser and compare the AST that the parser under test generates to the AST that the S-expression parser (which I just implemented) generates from the handwritten expected output. However, I realized that the S-expression parser would itself have to be tested, which could be really nonsensical.
I wonder what is the typical and easy way to test the parser.
PS. I am using Java, and don't want any dependencies on third-party libraries.
Provided you are looking for a completely automated and extensible unit testing framework for your parser, I'd recommend the following approach:
Incorrect input
Create a set of samples of incorrect inputs. Then feed the parser each of them, making sure the parser rejects them. It's a good idea to provide metadata for each test case that defines the expected output: the specific error code / message the parser is supposed to produce.
Correct input
As in the previous case, create a set of samples representing various correct inputs. Besides the simple validation that the parser accepts all inputs, there's still the problem of validating that the actual Abstract Syntax Tree makes sense.
To address this problem I'd do the following: describe the expected AST for each test case in some well-known format that can be safely parsed (deserialized into the actual in-memory AST structures) by a parser considered bug-free for your purposes. The natural choice is XML, since most languages / programming frameworks cover XML support and provide the respective (de)serialization facilities. The best solution would be to deserialize right into the AST node types. Since convenient visual editing tools for XML exist, it's feasible to construct even large test cases.
Then I'd construct an AST comparer using the visitor pattern, which pairs up the two ASTs and compares the nodes in each pair for equality. Note that equality is a per-AST-node-type operation.
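Such a comparer could be sketched like this (Node and the path-reporting format are illustrative stand-ins for your actual AST types). Reporting only the first differing node also addresses the "show only the difference" wish from the question:

```java
import java.util.Collections;
import java.util.List;

// A generic labeled tree standing in for real AST node types.
class Node {
    final String label;
    final List<Node> children;
    Node(String label, List<Node> children) {
        this.label = label;
        this.children = children;
    }
    Node(String label) { this(label, Collections.emptyList()); }
}

class AstComparer {
    // Returns null if the trees are equal, otherwise a path like
    // "/attribute/length != size" locating the first mismatch.
    static String firstDifference(Node expected, Node actual, String path) {
        String here = path + "/" + expected.label;
        if (!expected.label.equals(actual.label)) {
            return here + " != " + actual.label;
        }
        if (expected.children.size() != actual.children.size()) {
            return here + " (child count differs)";
        }
        for (int i = 0; i < expected.children.size(); i++) {
            String d = firstDifference(expected.children.get(i),
                                       actual.children.get(i), here);
            if (d != null) return d;
        }
        return null;
    }
}
```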
Notes:
This approach would work with most unit-testing frameworks like JUnit.
AST to XML serialization is a welcome tool for debugging the compiler.
The visitor pattern implementation can easily serve as the backbone for multiple processing stages within the compiler.
There are compiler test suites freely available that can provide some inspiration for your project; see for example the Ada Conformity Assessment Test Suite for the Ada programming language, although this test suite deals with higher-level testing, not just parser testing.
Here's the thing: a grammar defines a language. The language is the set of strings that the grammar generates, or that a parser for the grammar accepts.
Given that, even more than testing whether the ASTs look right, it's important to test that the parser accepts strings intended to be in the language and rejects strings that, in your mind, shouldn't belong to it.
In that sense, a simple accept or reject (bonus points for reporting the input position of the rejection) is enough to build a nice and large set of test cases.
Examples:
()
(a)
((((((((((a))))))))))
((((((((((a)))))))))
(a (a (a (a (a (a (b)))))))
(((((((b) a) a) a) a) a) a)
(((((((b a) a) a) a) a) a)
((a)(a)(a)(a))
((a)(a a)(a))
(())
(()())
((()())(()())(()()))
((()())()()(()()))
...
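A table-driven accept/reject harness over such cases might look like the sketch below. The accepts method here is a stand-in that only checks bracket balance; in a real harness it would be your parser's entry point:

```java
class RejectionTester {
    // Stand-in recognizer: accepts iff parentheses are balanced and
    // never close before they open. A real test would call the parser.
    static boolean accepts(String input) {
        int depth = 0;
        for (char c : input.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) return false;  // closed too early: reject
            }
        }
        return depth == 0;                    // unclosed brackets: reject
    }

    // Runs both tables and counts cases where the recognizer disagrees
    // with the expected verdict.
    static int countFailures(String[] shouldAccept, String[] shouldReject) {
        int failures = 0;
        for (String s : shouldAccept) if (!accepts(s)) failures++;
        for (String s : shouldReject) if (accepts(s)) failures++;
        return failures;
    }
}
```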