Java Bison and Jflex error for redeclared/undeclared variables - parsing

I am making a compiler with Jflex and Bison. Jflex does the lexical analysis. Bison does the parsing.
The lexical analysis (in a .l file) is perfect. Tokenizes the input, and passes the input to the .y file for Bison to parse.
I need the parser to print an error for redeclared/undeclared variables. My thought is that it would need some sort of memory to remember all the variables declared so far, so that it can produce an error when a variable is redeclared or when it sees an undeclared variable being used. For example, given the tokens "bool", "test", "=", "true", ";", and on a new line, "test2", "=", "false", ";", the parser would need to remember "test", so that when it parses the second line it can consult that memory, see that "test2" is undeclared, and print an error.
What I'm confused about is how to make a memory like that with Bison using Java in the .y file. With C, you would use the -d flag and it would make 2 files with enum types and a header file which would keep track of the declared variables, but in Java I'm not sure I can do the same, as I can't structure the grammar in any way so that it will remember variable names.
I could make a symbol table in Java code to check for redeclared variables, but in the main() in the .y file I have
public static void main(String args[]) throws IOException {
    EXAMPLELexer lexer = new EXAMPLELexer(System.in);
    EXAMPLE parser = new EXAMPLE(lexer);
    if (parser.parse()) {
        System.out.println("VALID FROM PARSER");
    } else {
        System.out.println("ERROR FROM PARSER");
    }
    return;
}
There is no way to get the tokens individually and pass them into another Java instance or whatever. %union{} doesn't work with Java, so I don't know how this is even possible.
I can't find a single piece of documentation explaining this so I would love some answers!

It's actually a lot simpler to add your own data to a Bison-generated Java parser than it is to a C parser (or even a C++ parser).
Note that Bison's Java API does not have unions, mostly because Java doesn't have unions. All semantic values are non-primitive types, so they derive from Object. If you need to, you can cast them to a more precise type, or even a primitive type.
(There is an option to define a more precise base class for semantic value types, but Object is probably a good place to start.)
The %code { ... } blocks are just copied into the parser class. So you can add your own members, as well as methods to manipulate them. If you want a symbol table, just add it as a HashMap to the parser class, and then you can add whatever you like to it in your actions.
Since all the parser actions are within the parser class, they have direct access to whatever members and member functions you add to the parser. All of Bison's internal members and member functions have names starting with yy, except for the member functions documented in the manual, so you can use almost any names you want without fear of name collision.
You can also use %parse-param to add arguments to the constructor; each argument corresponds to a class member. But that's probably not necessary for this particular exercise.
Of course, you'll have to figure out what an appropriate value type for the symbol is; that depends completely on what you're trying to do with the symbols. If you only want to validate that the symbols are defined when they are used, I suppose you could get away with a HashSet, but I'm sure eventually you'll want to store some more useful information.
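As a minimal sketch of the symbol-table idea, the class below is plain Java that could be pasted into a %code { ... } block or kept in its own file. The names SymbolTable, declare, use and errors are hypothetical and not part of Bison's API. The grammar actions shown in the comments assume a grammar in which the type and identifier tokens carry String semantic values; adapt them to your own rules.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; none of these names come from Bison itself.
class SymbolTable {
    // Maps each declared variable name to its declared type name.
    private final Map<String, String> symbols = new HashMap<>();
    private final List<String> errors = new ArrayList<>();

    // Call from the declaration rule's action, e.g. (assumed grammar):
    //   decl: TYPE IDENT '=' expr ';' { table.declare((String) $2, (String) $1); }
    boolean declare(String name, String type) {
        if (symbols.containsKey(name)) {
            errors.add("redeclared variable: " + name);
            return false;
        }
        symbols.put(name, type);
        return true;
    }

    // Call from any rule that uses a variable, e.g. (assumed grammar):
    //   assign: IDENT '=' expr ';' { table.use((String) $1); }
    boolean use(String name) {
        if (!symbols.containsKey(name)) {
            errors.add("undeclared variable: " + name);
            return false;
        }
        return true;
    }

    List<String> errors() {
        return errors;
    }
}

class SymbolTableDemo {
    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        table.declare("test", "bool");   // bool test = true;
        table.use("test2");              // test2 = false;   -> undeclared
        table.declare("test", "bool");   // bool test = ...; -> redeclared
        table.errors().forEach(System.out::println);
    }
}
```

Storing the type as the map value is only one possible choice; a richer record (type, scope, declaration line) could replace the String once more information is needed.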

Related

Tatsu: Rule Ordering

I am playing around with Tatsu to implement a parser for a language used in the semiconductor industry. This language requires that variables be defined before usage. So for example:
SignalGroup { A: In; B: Out};
Pattern {
V {A=1, B=1 }
V {A=1, B=0 }
};
In this case, the SignalGroup block must come before the Pattern block. How do I enforce/implement this "ordering" when writing the grammar in TatSu?
Although for some languages it is possible to write grammars that verify if the same symbol appears on different places, the grammars usually end up being too complicated to be useful.
Compilers (translators) are usually implemented with separate lexical, syntactical, and semantic analyzer components. There are several reasons for that:
Each component is so well focused that it is clearer and easier to write.
Each component is very efficient.
The most common errors (which are exactly lexical, syntactical, and semantic) can be reported earlier.
With those components in mind, checking whether a symbol has been previously defined belongs to the semantic (meaning) aspect of the program. The way to check is to keep a symbol table that is filled when the definition parts of the input are being parsed, and queried when the use parts of the input are being parsed.
In TatSu in particular the different components are well separated, yet run in parallel. For your requirement you just need to use the simplest grammar that allows for the semantic actions that store and query the symbols. By raising FailedSemantics from within semantic actions, any semantic errors will be reported exactly as the lexical and syntactical ones so the user doesn't have to think about which component flagged each error.
If you use the Python parser generation in TatSu, the translator will generate the skeleton of a semantic actions class as part of the output.

Antlr: common token definitions

I'd like to define common token constants in a single central Antlr file. This way I can define several different lexers and parsers and mix and match them at runtime. If they all share a common set of token definitions, then they'll work fine.
In other words, I want to see public static final int WORD = 2; in each lexer, so they all agree that a "2" is a WORD.
I created a file named CommonTokenDefs.g4 and added a section like this:
tokens {
WORD, NUMBER
}
and included
options { tokenVocab = CommonTokenDefs; }
in each of my other .g4 files. It doesn't work. A .g4 file that includes the tokenVocab will assign a different constant int if it defines a token type, and worse, in its .tokens file it will include duplicate constants!
FOO=1
BAR=2
WORD=1
NUMBER=2
Doing an import CommonTokenDefs; doesn't work either, because if I define a token type in the lexer, and it's already in CommonTokenDefs then I get a "token name FOO is already defined" error.
How do I create a common vocabulary across lexers and parsers?
Importing a grammar means merging it. The imported grammar is not a separate instance but instead enriches the grammar it is imported into. The importing grammar numbers its tokens based on what is defined in it (and then adds tokens from the imported grammar).
The only solution I see here is to use a single lexer grammar in all your parsers, if that is possible. You can implement certain variations in your lexer by using different base lexers (ANTLR option: superClass), but that is of course limited and in particular doesn't allow adding more tokens.
Update
Actually, there is a way to make it work as you want it. In addition to the import statement (which is used to import grammars) there is the tokenVocab grammar option, which is used to load a *.tokens file with assignments of number values to tokens. By using such a token vocabulary you could predefine which value ANTLR should use for each token and can hence determine that certain tokens always get the same numeric value. See the generated *.tokens files for the required format.
I use *.tokens files to assign numeric values such that certain keywords are placed in a contiguous value range, which allows for efficient checks later, like:
if (token >= KW_1 && token < KW100) ...
which wouldn't be possible if ANTLR would freely assign values to each of the keywords.
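To make the range check concrete, here is a small sketch. The token names and numbers are invented for illustration; the comment shows the NAME=number layout that the generated *.tokens files use, with the keywords pinned to a contiguous block.

```java
// Hypothetical token constants, as they might come out when a hand-written
// .tokens file pins the keywords to a contiguous range:
//   KW_IF=10
//   KW_ELSE=11
//   KW_WHILE=12
//   IDENTIFIER=13
class TokenRanges {
    static final int KW_IF = 10;
    static final int KW_ELSE = 11;
    static final int KW_WHILE = 12;
    static final int IDENTIFIER = 13;

    // Because the keyword values are contiguous, one range test covers them all.
    static boolean isKeyword(int tokenType) {
        return tokenType >= KW_IF && tokenType <= KW_WHILE;
    }

    public static void main(String[] args) {
        System.out.println(isKeyword(KW_ELSE));     // prints true
        System.out.println(isKeyword(IDENTIFIER));  // prints false
    }
}
```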

Data Structure to store Token Properties

I am writing an interpreter for a mathematical language in Rust which is intended to be used to solve mathematical expressions.
When lexing, the program needs to know based on the characters used in a token, what type of token it is (for example is it a function or an operator).
Currently I use an enumeration to represent a type of token:
pub enum IdentifierType {
Function,
Variable,
Operator,
Integer,
}
To check the type of a token I use a function which takes an IdentifierType as input and matches based on input to return a bool. The data structures that could be used in this case are relatively simple as tokens only have a single property: allowed characters.
When parsing to an Abstract Syntax Tree (AST), I would like to know what specific operator or function is being used based on a token and to be able to add a reference to that operator and its associated functions to the AST.
When interpreting, I would like to be able to call execute on a node and have it know how to perform its own function.
I have tried to come up with a solution to store all of these related items, but none that I have encountered has felt satisfactory.
For example, I stored all of the operators in a TOML file (a type of configuration file that maps to a hash table), but storing enumerations (values that are constrained) is difficult and there is no way to store an operator's function. I also want to be able to search by multiple keys, such as operator associativity (e.g. get all operators that are right associative), which means storing within source code is not very satisfactory.
Other possible ideas I have had include using some kind of SQL hybrid system; however, that seems tough to implement.

Syntax analysis and semantic analysis

I am wondering how the syntax analysis and semantic analysis work.
I have finished the lexer and the grammar construction of my interpreter.
Now I am going to implement a recursive descent (top down) parser for this grammar
For example, I have the following grammar:
<declaration> ::= <data_type> <identifier> ASSIGN <value>
So I coded it like this (in Java):
public void declaration() {
    data_type();
    identifier();
    if (token.equals("ASSIGN")) {
        lexer(); // calls next token
        value();
    } else {
        error();
    }
}
Assume I have three data types: Int, String and Boolean. Since the valid values for each data type are different (e.g. only true or false for Boolean), how can I determine whether a value fits the declared data type? What part of my code would determine that?
I am wondering where would I put the code to:
1.) call the semantic analysis part of my program.
2.) store my variables into the symbol table.
Do syntax analysis and semantic analysis happen at the same time?
Or do I need to finish the syntax analysis first, then do the semantic analysis?
I am really confused. Please help.
Thank you.
You can do syntax analysis (parsing) and semantic analysis (e.g., checking for agreement between <data_type> and <value>) at the same time. For example, when declaration() calls data_type(), the latter could return something (call it DT) that indicates whether the declared type is Int, String or Boolean. Similarly, value() could return something (VT) that indicates the type of the parsed value. Then declaration() would simply compare DT and VT, and raise an error if they don't match. (Alternatively, value() could take a parameter indicating the declared type, and it could do the check.)
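The DT/VT comparison might look like the sketch below. It is not the asker's parser: the "lexer" is reduced to a list of strings, the class name DeclarationParser and the Type enum are invented for the example, and the rules for recognising values (digits, quoted text, true/false) are assumptions.

```java
import java.util.List;

// A sketch of checking types while parsing. All names are illustrative.
class DeclarationParser {
    enum Type { INT, STRING, BOOLEAN }

    private final List<String> tokens;
    private int pos = 0;

    DeclarationParser(List<String> tokens) {
        this.tokens = tokens;
    }

    private String next() {
        if (pos >= tokens.size()) {
            throw new IllegalStateException("unexpected end of input");
        }
        return tokens.get(pos++);
    }

    // <declaration> ::= <data_type> <identifier> ASSIGN <value>
    void declaration() {
        Type declared = dataType();          // DT
        String name = next();                // identifier
        if (!next().equals("ASSIGN")) {
            throw new IllegalStateException("syntax error: expected ASSIGN");
        }
        Type actual = value();               // VT
        if (declared != actual) {
            throw new IllegalStateException(
                "type mismatch for " + name + ": declared " + declared + ", got " + actual);
        }
    }

    private Type dataType() {
        String t = next();
        switch (t) {
            case "Int": return Type.INT;
            case "String": return Type.STRING;
            case "Boolean": return Type.BOOLEAN;
            default: throw new IllegalStateException("syntax error: unknown type " + t);
        }
    }

    // Returns the type implied by the shape of the value token.
    private Type value() {
        String v = next();
        if (v.equals("true") || v.equals("false")) return Type.BOOLEAN;
        if (v.matches("-?\\d+")) return Type.INT;
        if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) return Type.STRING;
        throw new IllegalStateException("syntax error: bad value " + v);
    }
}
```

With this sketch, parsing the tokens Boolean, flag, ASSIGN, true succeeds, while Int, x, ASSIGN, true throws an exception whose message starts with "type mismatch".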
However, you may find it easier to completely separate the two phases. To do so, you would typically have the parsing phase build a parse tree (or abstract syntax tree). So your top level would call (e.g.) program() to parse a whole program, which would return a tree representing (the syntax of) the program, and you would pass that tree to a semantic_analysis() routine, which would traverse the tree, extracting pertinent information and enforcing semantic constraints.
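A rough sketch of the two-phase approach follows. The node classes (Declare, Use, Block) and the SemanticAnalysis.analyze method are hypothetical stand-ins for whatever tree the parser builds, and the walk assumes a single flat scope for simplicity.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative AST node types; a real tree would have many more.
abstract class Node {}

class Declare extends Node {
    final String name;
    Declare(String name) { this.name = name; }
}

class Use extends Node {
    final String name;
    Use(String name) { this.name = name; }
}

class Block extends Node {
    final List<Node> children = new ArrayList<>();
    Block add(Node n) { children.add(n); return this; }
}

class SemanticAnalysis {
    // Walks the finished tree and reports every use of an undeclared name.
    static List<String> analyze(Node root) {
        List<String> errors = new ArrayList<>();
        visit(root, new HashSet<>(), errors);
        return errors;
    }

    private static void visit(Node node, Set<String> declared, List<String> errors) {
        if (node instanceof Declare) {
            declared.add(((Declare) node).name);
        } else if (node instanceof Use) {
            String name = ((Use) node).name;
            if (!declared.contains(name)) errors.add("undeclared: " + name);
        } else if (node instanceof Block) {
            for (Node child : ((Block) node).children) visit(child, declared, errors);
        }
    }
}
```

Here the parser's only job is to produce the tree; the analysis pass never looks at tokens, which keeps each phase small and testable on its own.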
The short answer is: it depends on the definition of your programming language. And, since you only specified one derivation rule and three native types, there is no way of knowing. For example, if your programming language allows forward declarations like the C++ code below, then handling the derivation rule for function declaration (foo) is done without knowing the type of the variable serial:
class Tree {
public:
    int foo(void)
    {
        return serial;
    }
    int serial;
};
Indeed, modern compilers separate the syntax analysis phase from the semantic analysis phase. The syntax analysis phase is performed first, making sure the input program agrees with the context free grammar of the language. And, in addition, produces an Abstract Syntax Tree (AST). Note the difference between an AST and a parse tree, as discussed in this SO post. The semantic analysis phase then traverses the AST and checks for type mismatches among other things.
Having said that, toy programming languages can sometimes couple semantic and syntax analysis together. When a recursive descent parser is used, you should have relevant recursive calls return a type.

Compiler Design : Is "variable not declared" a syntactic error or semantic error?

Is such type of an error produced during type checking or when input is being parsed?
Under what type should the error be addressed?
The way I see it, it is a semantic error, because your language parses just fine even though you are using an identifier which you haven't previously bound, i.e. syntactic analysis only checks the program for well-formedness. Semantic analysis actually checks that your program has a valid meaning, e.g. bindings, scoping or typing. As @pst said, you can do scope checking during parsing, but this is an implementation detail. AFAIK old compilers used to do this to save some time and space, but I think today such an approach is questionable if you don't have some hard performance/memory constraints.
The program conforms to the language grammar, so it is syntactically correct. A language grammar doesn't contain any statements like 'the identifier must be declared', and indeed doesn't have any way of doing so. An attempt to build a two-level grammar along these lines failed spectacularly in the Algol-68 project, and it has not been attempted since to my knowledge.
The meaning, if any, of each is a semantic issue. Frank deRemer called issues like this 'static semantics'.
In my opinion, this is not strictly a syntax error - nor a semantic one. If I were to implement this for a statically typed, compiled language (like C or C++), then I would not put the check into the parser (because the parser is practically incapable of checking for this mistake), rather into the code generator (the part of the compiler that walks the abstract syntax tree and turns it into assembly code). So in my opinion, it lies between syntax and semantic errors: it's a syntax-related error that can only be checked by performing semantic analysis on the code.
If we consider a primitive scripting language however, where the AST is directly executed (without compilation to bytecode and without JIT), then it's the evaluator/executor function itself that walks the AST and finds the undeclared variable - in this case, it will be a runtime error. The difference lies between the "AST_walk()" routine being in different parts of the program lifecycle (compilation time and runtime), should the language be a scripting or a compiled one.
In the case of languages -- and there are many -- which require identifiers to be declared, a program with undeclared identifiers is ill-formed and thus a missing declaration is clearly a syntax error.
The usual way to deal with this is to incorporate information about symbols in a symbol table, so that the parse can use this information.
Here are a few examples of how identifier type affects parsing:
C / C++
A classic case:
(a)-b;
Depending on a, that's either a cast or a subtraction:
#include <stdio.h>
#if TYPEDEF
typedef double a;
#else
double a = 3.0;
#endif
int main() {
    int b = 3;
    printf("%g\n", (a)-b);
    return 0;
}
Consequently, if a hadn't been declared at all, the compiler must reject the program as syntactically ill-formed (and that is precisely the word the standard uses.)
XML
This one is simple:
<block>Hello, world</blob>
That's ill-formed XML, but it cannot be detected with a CFG. (Nonetheless, all XML parsers will correctly reject it as ill-formed.) In the case of HTML/SGML, where end-tags may be omitted under some well-defined circumstances, parsing is trickier but nonetheless deterministic; again, the precise declaration of a tag will determine the parse of a valid input, and it's easy to come up with inputs which parse differently depending on declaration.
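The tag-matching check that a CFG cannot express is easy to do with a stack. The sketch below is deliberately tiny and is not a real XML parser: it assumes tags of the plain form <name> and </name> only, and the class name TagStackChecker is made up for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A toy well-formedness check: it only understands plain <name> and </name>
// tags (no attributes, comments, or self-closing tags).
class TagStackChecker {
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static boolean isWellFormed(String input) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2);
            if (!closing) {
                open.push(name);
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                return false;   // end tag does not match the innermost open tag
            }
        }
        return open.isEmpty();  // every start tag must have been closed
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<block>Hello, world</block>")); // prints true
        System.out.println(isWellFormed("<block>Hello, world</blob>"));  // prints false
    }
}
```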
English
OK, not a programming language. I have lots of other programming language examples, but I thought this one might trigger some other intuitions.
Consider the two grammatically correct sentences:
The sheep is in the meadow.
The sheep are in the meadow.
Now, what about:
The cow is in the meadow.
(*) The cow are in the meadow.
The second sentence is intelligible, albeit ambiguous (is the noun or the verb wrong?) but it is certainly not grammatically correct. But in order to know that (and other similar examples), we have to know that sheep has an unmarked plural. Indeed, many animals have unmarked plurals, so I recognize all the following as grammatical:
The caribou are in the meadow.
The antelope are in the meadow.
The buffalo are in the meadow.
But definitely not:
(*) The mouse are in the meadow.
(*) The bird are in the meadow.
etc.
It seems that there is a common misconception that because the syntactic analyzer uses a context free grammar parser, that syntactic analysis is restricted to parsing a context free grammar. This is simply not true.
In the case of C (and family), the syntax analyzer uses a symbol table to help it parse. In the case of XML, it uses the tag stack, and in the case of generalized SGML (including HTML) it also uses tag declarations. Consequently, the syntax analyzer considered as a whole is more powerful than the CFG, which is just a part of the analysis.
The fact that a given program passes the syntax analysis does not mean that it is semantically correct. For example, the syntax analyzer needs to know whether a is a type or not in order to correctly parse (a)-b, but it does not need to know whether the cast is in fact possible, in the case that a is a type, or that a and b can meaningfully be subtracted, in the case that a is a variable. These verifications can happen during type analysis after the parse tree is built, but they are still compile-time errors.
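The symbol-table feedback for (a)-b can be sketched as follows. This is only an illustration of the decision the analyzer has to make, not how any particular C compiler is written; the class CastOrSubtract and its methods are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// The choice between "cast" and "subtraction" for (a)-b depends on how
// a was declared. All names here are hypothetical.
class CastOrSubtract {
    enum Parse { CAST, SUBTRACTION, ILL_FORMED }

    private final Set<String> typeNames = new HashSet<>();
    private final Set<String> variables = new HashSet<>();

    void declareTypedef(String name) { typeNames.add(name); }
    void declareVariable(String name) { variables.add(name); }

    // Classifies an expression of the shape (name)-operand.
    Parse classify(String name) {
        if (typeNames.contains(name)) return Parse.CAST;          // (type) -operand
        if (variables.contains(name)) return Parse.SUBTRACTION;   // (value) - operand
        return Parse.ILL_FORMED;                                  // name was never declared
    }
}
```

Note that classify says nothing about whether the cast or the subtraction is valid for the types involved; that check belongs to the later type analysis.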

Resources