Difference between Parser.tokens and Lexer.tokens?

Normally when I export a grammar to a target language, or run it with grun, I get two .tokens files. For example, with the following:
lexer grammar TestLexer;
NUM : [0-9]+;
OTHER : ABC;
fragment NEWER : [xyz]+;
ABC : [abc]+;
I get a token for each non-fragment, and I get two identical files:
# Parser.tokens
NUM=1
OTHER=2
ABC=3
# Lexer.tokens
NUM=1
OTHER=2
ABC=3
Are these files always the same? I tried defining a token in the parser but since I've defined it as parser grammar it doesn't allow that, so I would assume these two files would always be the same, correct?

Grammars are always processed as individual lexer and parser grammars. If a combined grammar is used, it is temporarily split into two grammars and processed individually. Each processing step produces a tokens file (the list of lexer tokens that were found). The tokens file is the link between lexers and parsers: when you set a tokenVocab value, it is actually the tokens file that is used. That also means you don't need a lexer grammar if you have a tokens file.
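For example, a parser grammar can pick up the tokens from the question's TestLexer through its tokens file like this (just a sketch; the parser grammar name and the start rule are made up for illustration):
parser grammar TestParser;
options { tokenVocab = TestLexer; }
// NUM, OTHER and ABC are resolved through TestLexer.tokens
start : (NUM | OTHER | ABC)* EOF ;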
I'm not sure about the parser.tokens file. It might be useful for grammar imports.
And then you can specify a tokenVocab for lexer grammars too, which allows you to explicitly assign numeric values to tokens; that can come in handy if you have to check for token ranges (e.g. all keywords) in platform code. I cannot check this currently, but it might be that using this feature leads to tokens files with different content.

Related

Antlr: common token definitions

I'd like to define common token constants in a single central Antlr file. This way I can define several different lexers and parsers and mix and match them at runtime. If they all share a common set of token definitions, then they'll work fine.
In other words, I want to see public static final int WORD = 2; in each lexer, so they all agree that a "2" is a WORD.
I created a file named CommonTokenDefs.g4 and added a section like this:
tokens {
WORD, NUMBER
}
and included
options { tokenVocab = CommonTokenDefs; }
in each of my other .g4 files. It doesn't work. A .g4 file that uses the tokenVocab assigns a different integer constant to any token type it defines itself, and worse, its .tokens file ends up containing duplicate constants:
FOO=1
BAR=2
WORD=1
NUMBER=2
Doing an import CommonTokenDefs; doesn't work either, because if I define a token type in the lexer, and it's already in CommonTokenDefs then I get a "token name FOO is already defined" error.
How do I create a common vocabulary across lexers and parsers?
Importing a grammar means merging it. The imported grammar is not a separate instance; instead, it enriches the grammar into which it is imported. And the importing grammar numbers its tokens based on what is defined in it (adding the tokens from the imported grammar).
The only solution I see here is to use a single lexer grammar in all your parsers, if that is possible. You can implement certain variations in your lexer by using different base lexers (ANTLR option: superClass), but that is of course limited and in particular doesn't allow you to add more tokens.
Update
Actually, there is a way to make it work as you want. In addition to the import statement (which is used to import grammars), there is the tokenVocab grammar option, which loads a *.tokens file that assigns numeric values to tokens. By using such a token vocabulary you can predefine which value ANTLR should use for each token, and hence ensure that certain tokens always get the same numeric value. See the generated *.tokens files for the required format.
I use *.tokens files to assign numeric values such that certain keywords are placed in a contiguous value range, which allows for efficient checks later, like:
if (token >= KW_1 && token < KW_100) ...
which wouldn't be possible if ANTLR were free to assign an arbitrary value to each of the keywords.
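As a sketch of that approach applied to the question above (the exact values are made up), a hand-maintained CommonTokenDefs.tokens file could pin the shared tokens and keep the keywords in one contiguous block:
WORD=1
NUMBER=2
KW_1=100
KW_2=101
KW_3=102
Every grammar that sets options { tokenVocab = CommonTokenDefs; } then reuses exactly these values; as far as I know, any tokens a grammar defines beyond the vocabulary are simply numbered after the highest value in the file.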

Using Lemon parser with custom token values

I am porting an old grammar to lemon and I have all the terminal symbols already defined in a header file; I would like to use them with those values instead of the ones generated in parser.h by lemon: is that possible?
Overwriting parser.h is completely useless because that's just a mirror of what happens internally; the matched values would keep being the same.
(Since lemon shares a lot of code with Bison I think that a solution for bison would solve the problem in lemon too)
With bison, you can manually assign values to tokens by declaring them in the %token directive (%token TOK 263, for example). However, that option is not available in lemon (as far as I know).
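For reference, such explicit assignments in a bison grammar look like this (the token names and values here are only an illustration):
%token PLUS 263
%token MINUS 264
%token NUMBER 300
Bison then uses those numbers in the generated header instead of choosing its own.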
In any event, the above does not really meet your request, because it doesn't allow you to read the token values from an external header file. That would not be a trivial requirement for a parser generator. In order to build the parse tables, the parser generator, whether it is bison or lemon, must actually know the value associated with each token, and the task of parsing a header to extract that information is well beyond the complexity of a parser generator; it would need an embedded C parser.
I would recommend just letting the parser generator generate the header file, and then using it instead of the definitions in your existing header file. The only cost (afaics) is that you need to recompile any parts of the project which reference the token values, which would typically be limited to the lexer.

lex/yacc: why do lexers have to include a parser's header file?

I'm trying to learn a little more about compiler construction, so I've been toying with flexc++ and bisonc++; for this question I'll refer to flex/bison however.
In bison, one uses the %token declaration to define token names, for example
%token INTEGER
%token VARIABLE
and so forth. When bison is used on the grammar specification file, a header file y.tab.h is generated which has some define directives for each token:
#define INTEGER 258
#define VARIABLE 259
Finally, the lexer includes the header file y.tab.h and returns the right code for each token:
%{
#include "y.tab.h"
%}
%%
[a-z] {
    // some code
    return VARIABLE;
}
[1-9][0-9]* {
    // some code
    return INTEGER;
}
So the parser defines the tokens, then the lexer has to use that information to know which integer codes to return for each token.
Is this not totally bizarre? Normally, the compiler pipeline goes lexer -> parser -> code generator. Why on earth should the lexer have to include information from the parser? The lexer should define the tokens, then flex creates a header file with all the integer codes. The parser then includes that header file. These dependencies would reflect the usual order of the compiler pipeline. What am I missing?
As with many things, it's just a historical accident. It certainly would have been possible for the token declarations to have been produced by lex (but see below). Or it would have been possible to force the user to write their own declarations.
It is more convenient for yacc/bison to produce the token numberings, though, because:
The terminals need to be parsed by yacc because they are explicit elements in the grammar productions. In lex, on the other hand, they are part of the unparsed actions and lex can generate code without any explicit knowledge about token values; and
yacc (and bison) produce parse tables which are indexed by terminal and non-terminal numbers; the logic of the tables require that terminals and non-terminals have distinct codes. lex has no way of knowing what the non-terminals are, so it can't generate appropriate codes.
The second argument is a bit weak because, in practice, bison-generated parsers renumber token ids to fit them into their internal numbering scheme. Even so, this is only possible if bison is in charge of the actual numbers. (The reason for the renumbering is to make the token values contiguous; by another historical accident, it's normal to reserve codes 0 through 255 for single-character tokens, and 0 for EOF; however, not all of the 8-bit codes are actually used by most scanners.)
In the lexer, the tokens are only present in the return value: they are part of the target language (i.e. C++), and lex itself knows nothing about them.
In the parser, on the other hand, tokens are part of the definition language: you write them in the actual parser definition, and not just in the target language. So yacc has to know about these tokens.
The ordering of the phases is not necessarily reflected in the architecture of the compiler. The scanner is the first phase and the parser the second, so in a sense data flows from the scanner to the parser, but in a typical Bison/Flex-generated compiler it is the parser that controls everything, and it is the parser that calls the lexer as a helper subroutine when it needs a new token as input in the parsing process.
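As a rough sketch of that control flow (this is hand-written illustration code, not what bison actually generates, and the table-driven details are elided):
int yylex(void);                /* implemented by the flex-generated scanner */

int parse_sketch(void)          /* stands in for the generated yyparse() */
{
    /* The parser is in charge: it pulls tokens from the lexer on demand. */
    for (int tok = yylex(); tok != 0; tok = yylex()) {   /* 0 means end of input */
        /* ... shift/reduce 'tok' according to the parse tables ... */
    }
    return 0;                   /* 0 means the input was accepted */
}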

Is the word "lexer" a synonym for the word "parser"?

The title is the question: Are the words "lexer" and "parser" synonyms, or are they different? It seems that Wikipedia uses the words interchangeably, but English is not my native language so I can't be sure.
A lexer is used to split the input up into tokens, whereas a parser is used to construct an abstract syntax tree from that sequence of tokens.
Now, you could just say that the tokens are simply characters and use a parser directly, but it is often convenient to have a parser which only needs to look ahead one token to determine what it's going to do next. Therefore, a lexer is usually used to divide up the input into tokens before the parser sees it.
A lexer is usually described using simple regular expression rules which are tested in order. There exist tools such as lex which can generate lexers automatically from such a description.
[0-9]+  Number
[A-Z]+  Identifier
\+      Plus
A parser, on the other hand, is typically described by specifying a grammar. Again, there exist tools such as yacc which can generate parsers from such a description.
expr ::= expr Plus expr
       | Number
       | Identifier
No. A lexer breaks up the input stream into "words"; a parser discovers the syntactic structure between such "words". For instance, given the input:
velocity = path / time;
lexer output is:
velocity (identifier)
= (assignment operator)
path (identifier)
/ (binary operator)
time (identifier)
; (statement separator)
and then the parser can establish the following structure:
= (assign)
    lvalue: velocity
    rvalue: result of
        / (division)
            dividend: contents of variable "path"
            divisor: contents of variable "time"
No. A lexer breaks down the source text into tokens, whereas a parser interprets the sequence of tokens appropriately.
They're different.
A lexer takes a stream of input characters as input, and produces tokens (aka "lexemes") as output.
A parser takes tokens (lexemes) as input, and produces (for example) an abstract syntax tree representing statements.
The two are enough alike, however, that quite a few people (especially those who've never written anything like a compiler or interpreter) treat them as the same, or (more often) use "parser" when what they really mean is "lexer".
As far as I know, lexer and parser are allied in meaning but are not exact synonyms. Though many sources do use them interchangeably, a lexer (an abbreviation of lexical analyser) identifies the tokens relevant to the language in the input, while a parser determines whether a stream of tokens meets the grammar of the language under consideration.

Practical difference between parser rules and lexer rules in ANTLR?

I understand the theory behind separating parser rules and lexer rules, but what are the practical differences between these two statements in ANTLR:
my_rule: ... ;
MY_RULE: ... ;
Do they result in different AST trees? Different performance? Potential ambiguities?
... what are the practical differences between these two statements in ANTLR ...
MY_RULE will be used to tokenize your input source. It represents a fundamental building block of your language.
my_rule is called from within the parser; it consists of zero or more other parser rules or tokens produced by the lexer.
That's the difference.
Do they result in different AST trees? Different performance? ...
The parser builds the AST using the tokens produced by the lexer, so the questions make no sense (to me). A lexer merely "feeds" the parser a one-dimensional stream of tokens.
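To make the naming convention concrete, here is a minimal made-up combined grammar; the upper-case rule is a lexer rule that produces tokens, and the lower-case rule is a parser rule that arranges those tokens into structure:
grammar Example;

my_rule : MY_RULE (',' MY_RULE)* ;   // parser rule: structure over tokens

MY_RULE : [a-z]+ ;                   // lexer rule: emits MY_RULE tokens
WS      : [ \t\r\n]+ -> skip ;       // lexer rule: whitespace never reaches the parser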
This post may be helpful:
The lexer is responsible for the first step, and its only job is to create a "token stream" from text. It is not responsible for understanding the semantics of your language; it is only interested in understanding the syntax of your language.
For example, syntax is the rule that an identifier must only use letters, numbers and underscores - as long as it doesn't start with a number. The responsibility of the lexer is to understand this rule. In this case, the lexer would accept the sequence of characters "asd_123" but reject the characters "12dsadsa" (assuming that there isn't another rule in which this text is valid). When seeing the valid text example, it may emit a token into the token stream such as IDENTIFIER(asd_123).
Note that I said "identifier", which is the general term for things like variable names, function names, namespace names, etc. The parser would be the thing that would understand the context in which that identifier appears, so that it would then further specify that token as being a certain thing's name.
(Sidenote: the token is just a unique name given to an element of the token stream. The lexeme is the text that the token was matched from. I write the lexeme in parentheses next to the token. For example, NUMBER(123). In this case, this is a NUMBER token with a lexeme of '123'. However, with some tokens, such as operators, I omit the lexeme since it's redundant. For example, I would write SEMICOLON for the semicolon token, not SEMICOLON(;).)
From ANTLR - When to use Parser Rules vs Lexer Rules?

Resources