I'm working on a pet project and I have to implement a lexer/parser. I'm reading about the subject and prototyping a simple math parser to understand as I go along.
There is something I still don't understand quite right, and it's the boundary between the lexer and the parser. Consider the following syntactically wrong expression:
+1.234.2 + pi
I'm currently lexing this into the following:
UnaryPlusOperatorToken: Lexeme "+" Pos: 1
NumberToken: Lexeme "1.234.2" Pos: 2
TriviaToken: Lexeme " " Pos: 9
BinaryAddToken: Lexeme "+" Pos: 10
TriviaToken: Lexeme " " Pos: 11
ConstantToken. Lexeme "pi" Pos: 12
Ok, so a couple of things here:
Should the lexer really care about the erroneous extra decimal period in the NumberToken and add an error to the "error bag" I'm threading through the whole parsing context or should it be the parser that catches this later on?
Should the lexer distinguish '+' being a unary operator in one context or a binary operator in another? Or should it simply generate a PlusToken and then let the parser figure out what it is?
I'm finding it rather hard to understand where the limit between both is.
Here's my take on it:
Should the lexer really care about the erroneous extra decimal period in the NumberToken and add an error to the "error bag" I'm threading through the whole parsing context or should it be the parser that catches this later on?
Let's see what your options are and compare the pros and cons:
If you stop scanning at the second decimal, your parser will get the following two tokens: 1.234 .2 and will have to issue an error like "unexpected token".
If the lexer scans 1.234.2 as a single token, you'll be easily able to issue an error like "invalid number".
The second choice will issue a much better error message, and is as easy to implement as the first choice, so I'd say it's the better solution. You have an invalid token here, treat it as such.
Should the lexer distinguish '+' being a unary operator in one context or a binary operator in another? Or should it simply generate a PlusToken and then let the parser figure out what it is?
Nope. The lexer should just output a + token. It's the parser's role to further interpret that token's meaning in the context of a parser rule.
The limit between the lexer and parser can be quite fuzzy at times. While some concerns could easily go on either side without any problems, others will bite you later if you make the wrong call, so it's worth being extra careful while making design choices.
Related
I wrote a parser that analyzes code and reduces it as would by lex and yacc (pretty much.)
I am wondering about one aspect of the matter. If I have a set of rules such as the following:
unary: IDENTIFIER
| IDENTIFIER '(' expr_list ')'
The first rule with just an IDENTIFIER can be reduced as soon as an identifier is found. However, the second rule can only be reduced if the input also includes a valid list of expressions written between parenthesis.
How is a parser expected to work in a case like this one?
If I reduce the first identifier immediately, I can keep the result and throw it away if I realize that the second rule does match. If the second rule does not match, then I can return the result of the early reduction.
This also means that both reduction functions are going to be called if the second rule applies.
Are we instead expected to hold on the early reduction and apply it only if the second, longer rule applies?
For those interested, I put a more complete version of my parser grammar in this answer: https://codereview.stackexchange.com/questions/41769/numeric-expression-parser-calculator/41786#41786
Bottom-up parsers (like bison and yacc) do not reduce until they reach the end of the production. They don't have to guess which reduction they will use until they need it. So it's really not an issue to have two productions with the same prefix. In this sense, the algorithm is radically different from a top-down algorithm, used in recursive descent parsing, for example.
In order for the fragment which you provide to be parseable by an LALR(1) parser-generator -- that is, a bottom-up parser with the ability to examine only one (1) token beyond the end of a production -- the grammar must be such that there is no place in which unary might be followed by (. As long as that is true, the fact that the parser can see a ( is sufficient to prevent the unit reduction unary: IDENTIFIER from happening in a context in which it should reduce with the other unary production.
(That's a bit of an oversimplification, but I don't think that it would be correct to reproduce a standard text on LALR parsing here on SO.)
The title is the question: Are the words "lexer" and "parser" synonyms, or are they different? It seems that Wikipedia uses the words interchangeably, but English is not my native language so I can't be sure.
A lexer is used to split the input up into tokens, whereas a parser is used to construct an abstract syntax tree from that sequence of tokens.
Now, you could just say that the tokens are simply characters and use a parser directly, but it is often convenient to have a parser which only needs to look ahead one token to determine what it's going to do next. Therefore, a lexer is usually used to divide up the input into tokens before the parser sees it.
A lexer is usually described using simple regular expression rules which are tested in order. There exist tools such as lex which can generate lexers automatically from such a description.
[0-9]+ Number
[A-Z]+ Identifier
+ Plus
A parser, on the other hand, is typically described by specifying a grammar. Again, there exist tools such as yacc which can generate parsers from such a description.
expr ::= expr Plus expr
| Number
| Identifier
No. Lexer breaks up input stream into "words"; parser discovers syntactic structure between such "words". For instance, given input:
velocity = path / time;
lexer output is:
velocity (identifier)
= (assignment operator)
path (identifier)
/ (binary operator)
time (identifier)
; (statement separator)
and then the parser can establish the following structure:
= (assign)
lvalue: velocity
rvalue: result of
/ (division)
dividend: contents of variable "path"
divisor: contents of variable "time"
No. A lexer breaks down the source text into tokens, whereas a parser interprets the sequence of tokens appropriately.
They're different.
A lexer takes a stream of input characters as input, and produces tokens (aka "lexemes") as output.
A parser takes tokens (lexemes) as input, and produces (for example) an abstract syntax tree representing statements.
The two are enough alike, however, that quite a few people (especially those who've never written anything like a compiler or interpreter) treat them as the same, or (more often) use "parser" when what they really mean is "lexer".
As far as I know, lexer and parser are allied in meaning but are not exact synonyms. Though many sources do use them as similar a lexer (abbreviation of lexical analyser) identifies tokens relevant to the language from the input; while parsers determine whether a stream of tokens meets the grammar of the language under consideration.
Preamble
I have written GLR-parser with error recovery. When it encounters an error it splits into following alternatives:
Insert the expected element into input (may be the user just missed it) and proceed as usual.
Replace the erroneous element with expected one (may be the user just made a mistype) and proceed as usual.
Skip erroneous element and if next element is also erroneous then go to #2.
But if input has a lot of errors (for example, user has by mistake given JPEG file to the parser) a number of alternatives grows exponentially.
Example
Such a parser corresponding to the following grammar:
Program -> Identifier WS Identifier WS '=' WS Identifier
Identifier -> ('a'..'z' | 'A'..'Z' | '0'..'9')*
WS -> ' '*
applied to the following text:
x = "abc\"def"; y = "ghi\"jkl";
fails with "out of memory" on moderately modern desktop computer.
Question
How to reduce number of alternatives in case of errors in input?
Doing GLR (parsing therefore) error correction at the character level is possible but aggravates your problem.
The GLR error recovery procedure we use operates on tokens, so it isn't as bad.
But when the input has a huge number of errors, it is pretty hard to recover. More sophisticated error recovery schemes basically use the parser to identify valid substrings of the language in the input, and then attempts to patch the substrings together to get the result. That's pretty ambitious.
I've built GLR parsers with error recovery. I wasn't that ambitious. In general
the parser mostly just aborts when the number of live parsers gets above "a large number" (e.g., 10,000) or the number of syntax errors encountered exceed a threshold (e.g, 10 or 20). You might consider aborting the parser if it hasn't advanced the input stream in the last second, which is an indirect sign it has too many live parsers.
I understand the theory behind separating parser rules and lexer rules in theory, but what are the practical differences between these two statements in ANTLR:
my_rule: ... ;
MY_RULE: ... ;
Do they result in different AST trees? Different performance? Potential ambiguities?
... what are the practical differences between these two statements in ANTLR ...
MY_RULE will be used to tokenize your input source. It represents a fundamental building block of your language.
my_rule is called from the parser, it consists of zero or more other parser rules or tokens produced by the lexer.
That's the difference.
Do they result in different AST trees? Different performance? ...
The parser builds the AST using tokens produced by the lexer, so the questions make no sense (to me). A lexer merely "feeds" the parser a 1 dimensional stream of tokens.
This post may be helpful:
The lexer is responsible for the first step, and it's only job is to
create a "token stream" from text. It is not responsible for
understanding the semantics of your language, it is only interested in
understanding the syntax of your language.
For example, syntax is the rule that an identifier must only use
characters, numbers and underscores - as long as it doesn't start with
a number. The responsibility of the lexer is to understand this rule.
In this case, the lexer would accept the sequence of characters
"asd_123" but reject the characters "12dsadsa" (assuming that there
isn't another rule in which this text is valid). When seeing the valid
text example, it may emit a token into the token stream such as
IDENTIFIER(asd_123).
Note that I said "identifier" which is the general term for things
like variable names, function names, namespace names, etc. The parser
would be the thing that would understand the context in which that
identifier appears, so that it would then further specify that token
as being a certain thing's name.
(sidenote: the token is just a unique name given to an element of the
token stream. The lexeme is the text that the token was matched from.
I write the lexeme in parentheses next to the token. For example,
NUMBER(123). In this case, this is a NUMBER token with a lexeme of
'123'. However, with some tokens, such as operators, I omit the lexeme
since it's redundant. For example, I would write SEMICOLON for the
semicolon token, not SEMICOLON( ; )).
From ANTLR - When to use Parser Rules vs Lexer Rules?
I have implemented combinatorial GLR parsers. Among them there are:
char(·) parser which consumes specified character or range of characters.
many(·) combinator which repeats specified parser from zero to infinite times.
Example: "char('a').many()" will match a string with any number of "a"-s.
But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.
I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?
Edit:
May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.
Parser = f(Step) -> Collection of TreeNodes of Steps.
So when I parse input, I:
Compose simple predefined Parser functions to get complex Parser function representing required grammar.
Form initial Step from the input.
Give the initial Step to the complex Parser function.
Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).
Gather information from Steps which were left.
I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.
I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to definining a grammer in terms of terminal characters, where "char('a').many" corresponds to grammar rules:
char = "a" ;
char = char "a" ;
GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its psuedo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parsess, e.g., ambiguous parses) will survive.
If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints what you have implemented is not a GLR parser.
Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in psuedo-parallel, the GLR parsing machinery makes this easy if it it bent properly) all the following: a) deleting the offending token, b) inserting all tokens that essentially are FOLLOW(x) where x is live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.
Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)
(This answer based on my dim memory and vague possible misunderstanding of GLR parsing. Hopefully someone expert will come by.)
Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?
Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?
This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.
One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parsers. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.
Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".
Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.