I understand the theory behind separating parser rules and lexer rules, but what are the practical differences between these two statements in ANTLR:
my_rule: ... ;
MY_RULE: ... ;
Do they result in different AST trees? Different performance? Potential ambiguities?
... what are the practical differences between these two statements in ANTLR ...
MY_RULE will be used to tokenize your input source. It represents a fundamental building block of your language.
my_rule is called from the parser; it consists of zero or more other parser rules or tokens produced by the lexer.
That's the difference.
Do they result in different AST trees? Different performance? ...
The parser builds the AST using tokens produced by the lexer, so the questions make no sense (to me). A lexer merely "feeds" the parser a one-dimensional stream of tokens.
This post may be helpful:
The lexer is responsible for the first step, and its only job is to
create a "token stream" from text. It is not responsible for
understanding the semantics of your language; it is only interested in
understanding the syntax of your language.
For example, syntax is the rule that an identifier must only use
letters, numbers and underscores - as long as it doesn't start with
a number. The responsibility of the lexer is to understand this rule.
In this case, the lexer would accept the sequence of characters
"asd_123" but reject the characters "12dsadsa" (assuming that there
isn't another rule in which this text is valid). When seeing the valid
text example, it may emit a token into the token stream such as
IDENTIFIER(asd_123).
Note that I said "identifier" which is the general term for things
like variable names, function names, namespace names, etc. The parser
would be the thing that would understand the context in which that
identifier appears, so that it would then further specify that token
as being a certain thing's name.
(sidenote: the token is just a unique name given to an element of the
token stream. The lexeme is the text that the token was matched from.
I write the lexeme in parentheses next to the token. For example,
NUMBER(123). In this case, this is a NUMBER token with a lexeme of
'123'. However, with some tokens, such as operators, I omit the lexeme
since it's redundant. For example, I would write SEMICOLON for the
semicolon token, not SEMICOLON( ; )).
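The identifier rule described above can be sketched in plain Python (a hand-rolled stand-in, not ANTLR itself; the names IDENTIFIER and tokenize_word are made up for illustration):

```python
import re

# Sketch of the identifier rule from the answer above: letters, digits
# and underscores, as long as it doesn't start with a digit.
IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')

def tokenize_word(text):
    """Return a TOKEN(lexeme) string for a valid identifier, or reject it."""
    if IDENTIFIER.match(text):
        return f"IDENTIFIER({text})"
    raise ValueError(f"not a valid identifier: {text!r}")
```

With this sketch, "asd_123" is accepted and emitted as IDENTIFIER(asd_123), while "12dsadsa" is rejected, just as the answer describes.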
From ANTLR - When to use Parser Rules vs Lexer Rules?
Related
Normally when I generate a grammar for a target language (or test it with grun), it gives me two .tokens files. For example, with the following:
lexer grammar TestLexer;
NUM : [0-9]+;
OTHER : ABC;
fragment NEWER : [xyz]+;
ABC : [abc]+;
I get a token for each non-fragment, and I get two identical files:
# Parser.tokens
NUM=1
OTHER=2
ABC=3
# Lexer.tokens
NUM=1
OTHER=2
ABC=3
Are these files always the same? I tried defining a token in the parser, but since I've defined it as a parser grammar it doesn't allow that, so I would assume these two files are always the same, correct?
Grammars are always processed as individual lexer and parser grammars. If a combined grammar is used, it is temporarily split into two grammars and processed individually. Each processing step produces a tokens file (the list of found lexer tokens). The tokens file is the link between lexers and parsers. When you set a tokenVocab value, it is actually the tokens file that is used. That also means you don't need a lexer grammar if you have a tokens file.
I'm not sure about the parser.tokens file. It might be useful for grammar imports.
And then you can specify a tokenVocab for lexer grammars too, which allows you to explicitly assign number values to tokens; this can come in handy if you have to check for token ranges (e.g. all keywords) in platform code. I cannot check this currently, but it might be that using this feature leads to tokens files with different content.
I'm trying to learn a little more about compiler construction, so I've been toying with flexc++ and bisonc++; for this question I'll refer to flex/bison however.
In bison, one uses the %token declaration to define token names, for example
%token INTEGER
%token VARIABLE
and so forth. When bison is used on the grammar specification file, a header file y.tab.h is generated which has some define directives for each token:
#define INTEGER 258
#define VARIABLE 259
Finally, the lexer includes the header file y.tab.h by returning the right code for each token:
%{
#include "y.tab.h"
%}
%%
[a-z] {
// some code
return VARIABLE;
}
[1-9][0-9]* {
// some code
return INTEGER;
}
So the parser defines the tokens, then the lexer has to use that information to know which integer codes to return for each token.
Is this not totally bizarre? Normally, the compiler pipeline goes lexer -> parser -> code generator. Why on earth should the lexer have to include information from the parser? The lexer should define the tokens, then flex creates a header file with all the integer codes. The parser then includes that header file. These dependencies would reflect the usual order of the compiler pipeline. What am I missing?
As with many things, it's just a historical accident. It certainly would have been possible for the token declarations to have been produced by lex (but see below). Or it would have been possible to force the user to write their own declarations.
It is more convenient for yacc/bison to produce the token numberings, though, because:
The terminals need to be parsed by yacc because they are explicit elements in the grammar productions. In lex, on the other hand, they are part of the unparsed actions and lex can generate code without any explicit knowledge about token values; and
yacc (and bison) produce parse tables which are indexed by terminal and non-terminal numbers; the logic of the tables require that terminals and non-terminals have distinct codes. lex has no way of knowing what the non-terminals are, so it can't generate appropriate codes.
The second argument is a bit weak, because in practice bison-generated parsers renumber token ids to fit them into the id-numbering scheme. Even so, this is only possible if bison is in charge of the actual numbers. (The reason for the renumbering is to make the id values contiguous; by another historical accident, it's normal to reserve codes 0 through 255 for single-character tokens, and 0 for EOF; however, not all of the 8-bit codes are actually used by most scanners.)
In the lexer, the tokens are only present in the return value: they are part of the target language (ie. C++), and lex itself knows nothing about them.
In the parser, on the other hand, tokens are part of the definition language: you write them in the actual parser definition, and not just in the target language. So yacc has to know about these tokens.
The ordering of the phases is not necessarily reflected in the architecture of the compiler. The scanner is the first phase and the parser the second, so in a sense data flows from the scanner to the parser, but in a typical Bison/Flex-generated compiler it is the parser that controls everything, and it is the parser that calls the lexer as a helper subroutine when it needs a new token as input in the parsing process.
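That control flow can be sketched in Python (a toy stand-in for the generated yyparse()/yylex() pair; all names here are illustrative, and the "grammar" is just a sum of numbers):

```python
def lex(text):
    """Stands in for yylex(): yields one token at a time, on demand."""
    for word in text.split():
        yield ("NUMBER", word) if word.isdigit() else ("OP", word)

def parse_sum(text):
    """The parser is in charge: it calls the lexer only when it needs
    the next token, exactly as a Bison-generated parser calls yylex()."""
    tokens = lex(text)                # the parser owns the lexer...
    total = int(next(tokens)[1])      # ...and pulls the first NUMBER
    for kind, lexeme in tokens:       # then pulls more tokens as needed
        if kind == "OP":
            continue                  # expect '+', just skip it
        total += int(lexeme)
    return total
```

The point is the direction of the calls: although data flows lexer-to-parser, control flows parser-to-lexer.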
When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(Definitions were made on the fly, I have probably made a mistake in them).
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete and it is up to the implementer to realize that they should actually be in the scanner?
Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: check out backtracking recursive-descent parsers. For such a parser, tokens are defined in a similar way to non-tokens. pyparsing is a parser generator for such parsers.
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
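As a rough sketch of how the tokenization phase would handle the number definitions from the question (the regexes and token names are illustrative, assuming the EBNF given above):

```python
import re

# The EBNF rules rendered as lexer rules:
#   integer ::= digit digit*
#   real    ::= integer "." integer (('e'|'E') integer)?
REAL    = re.compile(r'\d+\.\d+([eE]\d+)?\Z')
INTEGER = re.compile(r'\d+\Z')

def classify(lexeme):
    """Return the token name for a lexeme. Try the longer rule first,
    since every real number starts with a valid integer."""
    if REAL.match(lexeme):
        return "REAL"
    if INTEGER.match(lexeme):
        return "INTEGER"
    return "UNKNOWN"
```

Note that, as the comment in the question's EBNF says, a leading 0 is accepted ("007" is still an INTEGER here).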
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides but generally it is pretty obvious. You want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?
The title is the question: Are the words "lexer" and "parser" synonyms, or are they different? It seems that Wikipedia uses the words interchangeably, but English is not my native language so I can't be sure.
A lexer is used to split the input up into tokens, whereas a parser is used to construct an abstract syntax tree from that sequence of tokens.
Now, you could just say that the tokens are simply characters and use a parser directly, but it is often convenient to have a parser which only needs to look ahead one token to determine what it's going to do next. Therefore, a lexer is usually used to divide up the input into tokens before the parser sees it.
A lexer is usually described using simple regular expression rules which are tested in order. There exist tools such as lex which can generate lexers automatically from such a description.
[0-9]+   Number
[A-Z]+   Identifier
\+       Plus
A parser, on the other hand, is typically described by specifying a grammar. Again, there exist tools such as yacc which can generate parsers from such a description.
expr ::= expr Plus expr
| Number
| Identifier
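Putting the two descriptions together, here is a toy Python version (illustrative only; real lex/yacc-generated code looks nothing like this). The lexer rules become an ordered list of regexes, and the left-recursive expr rule is rewritten as a loop, as a recursive-descent parser would require:

```python
import re

# The three lexer rules above, tried in order at each position.
RULES = [(r'[0-9]+', 'Number'), (r'[A-Z]+', 'Identifier'), (r'\+', 'Plus')]

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        for pattern, name in RULES:
            m = re.match(pattern, text[pos:])
            if m:
                tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"no rule matches at {text[pos:]!r}")
    return tokens

def parse_expr(tokens):
    """expr ::= (Number | Identifier) (Plus (Number | Identifier))*"""
    tree, rest = tokens[0], tokens[1:]
    while rest and rest[0][0] == 'Plus':
        tree = ('Plus', tree, rest[1])   # build a left-leaning tree
        rest = rest[2:]
    return tree
```

So for the input "1+ABC" the lexer produces the token stream Number, Plus, Identifier, and the parser turns it into a Plus node with two children.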
No. A lexer breaks up the input stream into "words"; a parser discovers the syntactic structure between those "words". For instance, given the input:
velocity = path / time;
lexer output is:
velocity (identifier)
= (assignment operator)
path (identifier)
/ (binary operator)
time (identifier)
; (statement separator)
and then the parser can establish the following structure:
= (assign)
    lvalue: velocity
    rvalue: result of
        / (division)
            dividend: contents of variable "path"
            divisor: contents of variable "time"
No. A lexer breaks down the source text into tokens, whereas a parser interprets the sequence of tokens appropriately.
They're different.
A lexer takes a stream of input characters as input, and produces tokens as output.
A parser takes tokens as input, and produces (for example) an abstract syntax tree representing statements.
The two are enough alike, however, that quite a few people (especially those who've never written anything like a compiler or interpreter) treat them as the same, or (more often) use "parser" when what they really mean is "lexer".
As far as I know, lexer and parser are allied in meaning but are not exact synonyms. Though many sources use them interchangeably, a lexer (short for lexical analyser) identifies the tokens relevant to the language in the input, while a parser determines whether a stream of tokens meets the grammar of the language under consideration.
What's the difference between this grammar:
...
if_statement : 'if' condition 'then' statement 'else' statement 'end_if';
...
and this:
...
if_statement : IF condition THEN statement ELSE statement END_IF;
...
IF : 'if';
THEN: 'then';
ELSE: 'else';
END_IF: 'end_if';
....
?
If there is any difference, does this impact performance?
Thanks
In addition to Will's answer, it's best to define your lexer tokens explicitly (in your lexer grammar). If you mix them into your parser grammar, it's not always clear in what order the tokens are tokenized by the lexer. When you define them explicitly, they're always tokenized in the order in which they appear in the lexer grammar (from top to bottom).
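A toy Python illustration of the ordering point (a stand-in for the lexer, not how ANTLR actually resolves matches; ANTLR prefers the longest match and uses rule order only as a tie-breaker):

```python
import re

def first_match(rules, text):
    """Return the name of the first rule that matches the whole text,
    mimicking a lexer that tries its rules top to bottom."""
    for pattern, name in rules:
        if re.fullmatch(pattern, text):
            return name
    return None

# Two orderings of the same two rules:
KEYWORD_FIRST = [(r'if', 'IF'), (r'[a-z]+', 'IDENTIFIER')]
IDENT_FIRST   = [(r'[a-z]+', 'IDENTIFIER'), (r'if', 'IF')]
```

With KEYWORD_FIRST the input "if" is tokenized as IF; with IDENT_FIRST the IF rule is unreachable and "if" comes out as IDENTIFIER, which is exactly the kind of surprise an explicit, ordered lexer grammar avoids.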
The biggest difference is one that may not matter to you: if your lexer rules are in a lexer grammar, you can use inheritance to have multiple lexers share a common set of lexical rules.
If you just use strings in your parser rules, you cannot do this. If you never plan to reuse your lexer grammar, this advantage doesn't matter.
Additionally, I (and, I'm guessing, most ANTLR veterans) am more accustomed to finding the lexer rules in the actual lexer grammar rather than mixed in with the parser grammar, so one could argue that readability is increased by putting the rules in the lexer.
There is no runtime performance difference between the two approaches once the ANTLR parser has been built.
The only difference is that in your first production rule, the keyword tokens are defined implicitly. There is no run-time performance implication for tokens defined implicitly vs. explicitly.
Yet another difference: when you explicitly define your lexer rules you can access them via the name you gave them (e.g. when you need to check for a specific token type). Otherwise ANTLR will use arbitrary numbers (with a prefix).