ANTLR: common token definitions

I'd like to define common token constants in a single central Antlr file. This way I can define several different lexers and parsers and mix and match them at runtime. If they all share a common set of token definitions, then they'll work fine.
In other words, I want to see public static final int WORD = 2; in each lexer, so they all agree that a "2" is a WORD.
I created a file named CommonTokenDefs.g4 and added a section like this:
tokens {
WORD, NUMBER
}
and included
options { tokenVocab = CommonTokenDefs; }
in each of my other .g4 files. It doesn't work. A .g4 file that includes the tokenVocab will assign a different constant int if it defines a token type, and worse, in its .tokens file it will include duplicate constants!
FOO=1
BAR=2
WORD=1
NUMBER=2
Doing an import CommonTokenDefs; doesn't work either, because if I define a token type in the lexer, and it's already in CommonTokenDefs then I get a "token name FOO is already defined" error.
How do I create a common vocabulary across lexers and parsers?

Importing a grammar means merging it. The imported grammar is not a separate instance; it enriches the grammar that imports it. The importing grammar numbers its tokens based on what is defined in it, and then adds the tokens from the imported grammar.
The only solution I see here is to use a single lexer grammar in all your parsers, if that is possible. You can implement certain variations in your lexer by using different base lexers (ANTLR option: superClass), but that is of course limited and, in particular, doesn't allow you to add more tokens.
Update
Actually, there is a way to make it work as you want. In addition to the import statement (which imports grammars) there is the tokenVocab grammar option, which loads a *.tokens file that assigns numeric values to tokens. By using such a token vocabulary you can predefine which value ANTLR should use for each token, and hence ensure that certain tokens always get the same numeric value. See the generated *.tokens files for the required format.
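As a minimal sketch (using the names from the question; how ANTLR finds the vocabulary file depends on your build setup), you could maintain a CommonTokenDefs.tokens file by hand:

WORD=1
NUMBER=2

and reference it from every lexer and parser grammar:

options { tokenVocab = CommonTokenDefs; }

ANTLR then reuses these numbers for WORD and NUMBER instead of assigning its own, and any additional tokens a grammar defines should receive values above the highest one in the vocabulary.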
I use *.tokens files to assign numeric values such that certain keywords are placed in a contiguous value range, which allows for efficient checks later, like:
if (token >= KW_1 && token <= KW_100) ...
which wouldn't be possible if ANTLR were free to assign each keyword whatever value it liked.
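As a sketch of that layout (the KW_ names, the values, and the MyLexer class name are purely illustrative), the hand-written .tokens file keeps the keywords contiguous:

KW_AND=30
KW_OR=31
KW_NOT=32
IDENTIFIER=40

so that, in the Java target, a range test against the generated constants is enough:

int type = token.getType();   // token is an org.antlr.v4.runtime.Token
if (type >= MyLexer.KW_AND && type <= MyLexer.KW_NOT) { /* it's a keyword */ }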

Related

Difference between Parser.tokens and Lexer.tokens?

Normally when I export or grun a grammar to a target language it gives me two .tokens files. For example in the following:
lexer grammar TestLexer;
NUM : [0-9]+;
OTHER : ABC;
fragment NEWER : [xyz]+;
ABC : [abc]+;
I get a token for each non-fragment, and I get two identical files:
# Parser.tokens
NUM=1
OTHER=2
ABC=3
# Lexer.tokens
NUM=1
OTHER=2
ABC=3
Are these files always the same? I tried defining a token in the parser but since I've defined it as parser grammar it doesn't allow that, so I would assume these two files would always be the same, correct?
Grammars are always processed as individual lexer and parser grammars. If a combined grammar is used, it is temporarily split into two grammars and processed individually. Each processing step produces a tokens file (the list of found lexer tokens). The tokens file is the link between lexers and parsers: when you set a tokenVocab value, it is actually the tokens file that is used. That also means you don't need a lexer grammar if you have a tokens file.
I'm not sure about the parser.tokens file. It might be useful for grammar imports.
You can also specify a tokenVocab for lexer grammars, which allows you to explicitly assign numeric values to tokens; that can come in handy if you have to check for token ranges (e.g. all keywords) in platform code. I cannot check this currently, but it might be that using this feature leads to tokens files with different content.
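As a small sketch of that link (TestLexer comes from the question above; TestParser and the expr rule are invented for illustration), a separate parser grammar picks up the lexer's token numbering through its tokens file:

parser grammar TestParser;
options { tokenVocab = TestLexer; } // loads TestLexer.tokens
expr : NUM | ABC ;

If TestLexer.tokens were replaced by a hand-written file with the same entries, the parser would build just the same, which is what makes explicit value assignment possible.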

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to stay as faithful to as I can), there are situations where spaces cannot be inserted; for example, "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without making whitespace significant everywhere else? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without spaces or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A validation function would simply check that the starting position of each token (available as p.lexpos(i), where p is the action function's parameter and i is the index of the token in the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a bit of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call. (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.

Java Bison and Jflex error for redeclared/undeclared variables

I am making a compiler with Jflex and Bison. Jflex does the lexical analysis. Bison does the parsing.
The lexical analysis (in a .l file) is perfect. Tokenizes the input, and passes the input to the .y file for Bison to parse.
I need the parser to print an error for redeclared/undeclared variables. My thoughts are that it would need some sort of memory to remember all the variables initialized so far, so that it can produce an error when it sees an undeclared variable being used. For example, given "bool", "test", "=", "true", ";" and, on a new line, "test2", "=", "false", ";", the parser would need to remember "test", and when it parses the second line it should be able to consult that memory again and report that "test2" is undeclared, hence printing an error.
What I'm confused about is how to build that kind of memory with Bison using Java in the .y file. With C, you would use the -d flag, and it would produce two files with enum types and a header file which would keep track of the declared variables; but in Java I'm not sure I can do the same, as I can't structure the grammar in any way that will remember variable names.
I could make a symbol table in Java code to check for redeclared variables, but in the main() in the .y file I have
public static void main(String args[]) throws IOException {
    EXAMPLELexer lexer = new EXAMPLELexer(System.in);
    EXAMPLE parser = new EXAMPLE(lexer);
    if (parser.parse()) {
        System.out.println("VALID FROM PARSER");
    } else {
        System.out.println("ERROR FROM PARSER");
    }
    return;
}
There is no way to get the tokens individually and pass them into another Java instance or whatever. %union{} doesn't work with Java, so I don't know how this is even possible.
I can't find a single piece of documentation explaining this so I would love some answers!
It's actually a lot simpler to add your own data to a Bison-generated Java parser than it is to a C parser (or even a C++ parser).
Note that Bison's Java API does not have unions, mostly because Java doesn't have unions. All semantic values are non-primitive types, so they derive from Object. If you need to, you can cast them to a more precise type, or even a primitive type.
(There is an option to define a more precise base class for semantic value types, but Object is probably a good place to start.)
The %code { ... } blocks are just copied into the parser class. So you can add your own members, as well as methods to manipulate them. If you want a symbol table, just add it as a HashMap to the parser class, and then you can add whatever you like to it in your actions.
Since all the parser actions are within the parser class, they have direct access to whatever members and member functions you add to the parser. All of Bison's internal members and member functions have names starting with yy, except for the member functions documented in the manual, so you can use almost any names you want without fear of name collision.
You can also use %parse-param to add arguments to the constructor; each argument corresponds to a class member. But that's probably not necessary for this particular exercise.
Of course, you'll have to figure out what an appropriate value type for the symbol is; that depends completely on what you're trying to do with the symbols. If you only want to validate that the symbols are defined when they are used, I suppose you could get away with a HashSet, but I'm sure eventually you'll want to store some more useful information.
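A minimal sketch of how that could look in the .y file (the rule names, the TYPE/IDENTIFIER/ASSIGN/SEMI tokens, the expression nonterminal, and the helper methods are all invented for illustration; it also assumes the JFlex lexer delivers the identifier's text as the token's semantic value):

%token TYPE IDENTIFIER ASSIGN SEMI

%code {
  // lives directly in the generated parser class
  private final java.util.HashSet<String> declared = new java.util.HashSet<>();

  private void declare(String name) {
    if (!declared.add(name))
      System.err.println("error: variable '" + name + "' is already declared");
  }

  private void use(String name) {
    if (!declared.contains(name))
      System.err.println("error: variable '" + name + "' is undeclared");
  }
}

%%

declaration : TYPE IDENTIFIER ASSIGN expression SEMI  { declare((String) $2); } ;
assignment  : IDENTIFIER ASSIGN expression SEMI       { use((String) $1); } ;

Because the actions are compiled into the same class, declare() and use() can touch the set directly; if you later need to store types or other attributes per symbol, swap the HashSet for a HashMap as suggested above.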

Using Lemon parser with custom token values

I am porting an old grammar to lemon and I have all the terminal symbols already defined in a header file; I would like to use them with those values instead of the ones generated in parser.h by lemon: is that possible?
Overwriting parser.h is completely useless, because that file is just a mirror of what happens internally; the values actually used would stay the same.
(Since lemon shares a lot of code with Bison I think that a solution for bison would solve the problem in lemon too)
With bison, you can manually assign values to tokens by declaring them in the %token directive (%token TOK 263, for example). However, that option is not available in lemon (as far as I know).
In any event, the above does not really meet your request, because it doesn't allow you to read the token values from an external header file. That would not be a trivial requirement for a parser generator. In order to build the parse tables, the parser generator, whether it is bison or lemon, must actually know the value associated with each token, and the task of parsing a header to extract that information is well beyond the complexity of a parser generator; it would need an embedded C parser.
I would recommend just letting the parser generator generate the header file, and then using it instead of the definitions in your existing header file. The only cost (afaics) is that you need to recompile any parts of the project which reference the token values, which would typically be limited to the lexer.

lex/yacc: why do lexers have to include a parser's header file?

I'm trying to learn a little more about compiler construction, so I've been toying with flexc++ and bisonc++; for this question I'll refer to flex/bison however.
In bison, one uses the %token declaration to define token names, for example
%token INTEGER
%token VARIABLE
and so forth. When bison is used on the grammar specification file, a header file y.tab.h is generated which has some define directives for each token:
#define INTEGER 258
#define VARIABLE 259
Finally, the lexer includes the header file y.tab.h so that it can return the right code for each token:
%{
#include "y.tab.h"
%}
%%
[a-z]         {
                  // some code
                  return VARIABLE;
              }
[1-9][0-9]*   {
                  // some code
                  return INTEGER;
              }
So the parser defines the tokens, then the lexer has to use that information to know which integer codes to return for each token.
Is this not totally bizarre? Normally, the compiler pipeline goes lexer -> parser -> code generator. Why on earth should the lexer have to include information from the parser? The lexer should define the tokens, then flex creates a header file with all the integer codes. The parser then includes that header file. These dependencies would reflect the usual order of the compiler pipeline. What am I missing?
As with many things, it's just a historical accident. It certainly would have been possible for the token declarations to have been produced by lex (but see below). Or it would have been possible to force the user to write their own declarations.
It is more convenient for yacc/bison to produce the token numberings, though, because:
The terminals need to be parsed by yacc because they are explicit elements in the grammar productions. In lex, on the other hand, they are part of the unparsed actions and lex can generate code without any explicit knowledge about token values; and
yacc (and bison) produce parse tables which are indexed by terminal and non-terminal numbers; the logic of the tables require that terminals and non-terminals have distinct codes. lex has no way of knowing what the non-terminals are, so it can't generate appropriate codes.
The second argument is a bit weak, because in practice bison-generated parsers renumber token ids to fit them into the id-numbering scheme. Even so, this is only possible if bison is in charge of the actual numbers. (The reason for the renumbering is to make the id values contiguous; by another historical accident, it's normal to reserve codes 0 through 255 for single-character tokens, and 0 for EOF; however, not all the 8-bit codes are actually used by most scanners.)
In the lexer, the tokens are only present in the return value: they are part of the target language (i.e., C++), and lex itself knows nothing about them.
In the parser, on the other hand, tokens are part of the definition language: you write them in the actual parser definition, and not just in the target language. So yacc has to know about these tokens.
The ordering of the phases is not necessarily reflected in the architecture of the compiler. The scanner is the first phase and the parser the second, so in a sense data flows from the scanner to the parser, but in a typical Bison/Flex-generated compiler it is the parser that controls everything, and it is the parser that calls the lexer as a helper subroutine when it needs a new token as input in the parsing process.
