How do lexers treat overlapping terminals? - parsing

From what I understand:
Lexers tokenize text by a set of terminals.
Terminals are regular expressions.
I'm about to implement a lexer for a simple parser generator I'm working on. I would build the lexer this way:
There is a set of terminals
All these terminals are combined into a NFA (remembering the final state of each terminal)
This NFA is then converted into a DFA (mapping the final state of each terminal)
Then if the lexer was about to lex a token, the implementation would:
Try the combined DFA on the next symbols until it fails
Find the last final state it entered
Map that state to the predefined terminal
(and feed terminal including the matched string to the parser).
The problem arises with this approach when there are at least two terminals defined where the languages intersect. Consider the following terminals:
Name = [a-z0-9]+
Number = [0-9]+
So language of number is actually a sublanguage of name.
If the lexer was about to analyze the string 123, the next token can either be a name or a number.
What is wrong with my design? Or is it needed to feed the parser multiple possible terminals?

Related

Is my lexer doing too much -- is it doing the work of the parser?

My input consists of a series of names, each on a new line. Each name consists of a firstname, optional middle initial, and lastname. The name fields are separated by tabs. Here is a sample input:
Sally M. Smith
Tom V. Jones
John Doe
Below are the rules for my Flex lexer. It works fine but I am concerned that my lexer is doing too much: it is determining that a token is a firstname or a middle initial or a lastname. Should that determination be done in the parser, not the lexer? Am I abusing the Flex state capability? What I am seeking is a critique of my lexer. I am just a beginner, how would a parsing expert create lexer rules for this input?
<INITIAL>{
[a-zA-Z]+ { yylval.strval = strdup(yytext); return(FIRSTNAME); }
\t { BEGIN MI_STATE; }
. { BEGIN JUNK_STATE; }
}
<MI_STATE>{
[A-Z]\. { yylval.strval = strdup(yytext); return(MI); }
\t { BEGIN LASTNAME_STATE; }
. { BEGIN JUNK_STATE; }
}
<LASTNAME_STATE>{
[a-zA-Z]+ { yylval.strval = strdup(yytext); return(LASTNAME); }
\n { BEGIN INITIAL; return EOL; }
. { BEGIN JUNK_STATE; }
}
<JUNK_STATE>. { printf("JUNK: %s\n", yytext); }
You can use lexer states as you do in this question. But it's better to use them as a means to conditionally activate rules. For examples, think of handling multi-line comments or here documents or (for us silverbacks) embedded SQL.
In your question, there's no lexical difference between a given name and a family name -- they both are matched by [a-zA-Z]+, as would be middle names, if you were to extend your lexer.
Short answer: yes, lex NAME tokens and let the parser determine whether you have three NAME tokens on a line.
Yes; your lexer is parsing. The main evidence is that it's implementing identical rules in different start states. Two rules have exactly the same pattern.
The purpose of start states in the context of lexing is to modify the lexical grammar in order to shield the parser from certain differences. It works with the parser. For instance, say you had some document language in which $ shifts into math expression mode, which has different tokenizing rules. The lexer still just returns tokens in math mode; it doesn't try to parse the math expressions. It is the parser which will determine that, if the brackets are balanced, then another $ can shift out of math mode.
In your code the rules for returning a last name and first name are identical; you have used to start state to handle phrase structure syntax: the fact that the last name comes later than the first name.
Another bit of telltale evidence that the lexer is parsing is that the lexer itself is handling all of the start condition changes. In our $...$ math mode example, we might have the lexer shift into a start state when it sees the $ symbol. However, if the lexer also recognizes the end of math mode, then that is evidence it is parsing the math mode expression. The end can only be recognized by following the nested phrase structure grammar of math mode expressions. The way you would handle that would be for the lexer to expose a function lex_end_math_mode(). When the parser processes and reduces the entire math mode expression, it calls this function to tell the lexer to switch back to the lexical syntax outside of math mode. The math-mode-terminating dollar sign would likely also appear as a token visible to the parser, though the leading one might not. So that is to say, the parser parses math_mode_expr : expr '$': a math mode expression followed by a required dollar sign to end math mode. The action for that rule would include the call to lex_end_math_mode, so the lexer returns to the tokenization rules outside of math mode for scanning the next token after the closing $.
There is no right or wrong answer because it's all parsing. Every grammar that is divided into tokens and phrase structure rules could be expressed by a unified grammar which includes the rules for the token structure.
Why we often use a design which separates lexical scanning and parsing is that:
Unifying the lexical and phrase structure grammar into one will turn LL(1) into LL(k). A recursive-descent parser then needs to look k symbols ahead to make parsing decisions. For instance if you're parsing C with this holistic approach, you need to treat int as a reserved keyword. That requires four symbols of lookahead: you have to recognize i, n, t, and then if the next symbol indicates that the token has ended, you treat that as the keyword, otherwise an identifier.
Performance: lexical scanning uses efficient techniques tailored to that task, which take advantage of the restriction that the lexical grammar is regular.
Correspondence to spec: if you have a language whose specification is described in terms of a lexical grammar separate from a phrase structure grammar, then if you implement it that way, features of your code are more readily traceable to features of requirement spec. You may be able to write unit tests which separately show that the lexing and parsing obeys the spec.
Schooling: people who went through a CS program that included a course on compiler construction had separate lexing and parsing drilled into their heads, and whenever it comes up in their subsequent career, they just lean on that wisdom. They are never confronted with situations in which they recognize it as not being a good approach, and don't question it.
Whatever works in your individual situations with whatever you're parsing overrules the theory. If it's convenient for you to recognize some phrase-like fragments in the lexer, and you're able to convince yourself that it's a clean approach, then by all means do it that way.

What type of parser is needed for this grammar?

I have a grammar that I do not know what type of parser I need in order to parse it other than I do not believe the grammar is LL(1). I am thinking I need a parser with backtracking or LL(*) of some sort. The grammar I have came up with (which may need some rewriting) is:
S: Rules
Rules: Rule | Rule Rules
Rule: id '=' Ids
Ids: id | Ids id
The language I am trying to generate looks something like this:
abc = def g hi jk lm
xy = aaa bbb ccc ddd eee fff jjj kkk
foo = bar ha ha
Zero or more Rule that contain a left identifier followed by an equal sign followed by one or more identifers. The part that I think I will have a problem writing a parser for is that the grammar allows any amount of id in a Rule and that the only way to tell when a new Rule starts is when it finds id =, which would require backtracking.
Does anyone know the classification of this grammar and the best method of parsing, for a hand written parser?
The grammar that generates an identifier followed by an equals sign followed by a finite sequence of identifiers is regular. This means that strings in the language can be parsed using a DFA or regular expression. No need for fancy nondeterministic or LL(*) parsers.
To see that the language is regular, let Id = U {a : a ∈ Γ}, where Γ ⊂ Σ is the set of symbols that can occur in identifiers. The language you are trying to generate is denoted by the regular expression
Id+ =( Id+)* Id+
Setting Γ = {a, b, ..., z}, examples of strings in the language of the regular expression are:
look = i am in a regular language
hey = that means i can be recognized by a dfa
cool = or even a regular expression
There is no need to parse your language using powerful parsing techniques. This is one case where parsing using regular expressions or DFA is both appropriate and optimal.
edit:
Call the above regular expression R. To parse R*, generate a DFA recognizing the language of R*. To do this, generate an NFA recognizing the language of R* using the algorithm obtainable from Kleene's theorem. Then convert the NFA into a DFA using the subset construction. The resultant DFA will recognize all strings in R*. Given a representation of the constructed DFA in your implementation language, the required actions - for instance,
Add the last identifier parsed to the right-hand side of the current declaration statement being parsed
Add the last declaration statement parsed to a list of parsed declarations, and use the last identifier parsed to begin parsing a new declaration statement
can be encoded into the states of the DFA. In reality, using Kleene's theorem and the subset construction is probably unnecessary for such a simple language. That is, you can probably just write a parser with the above two actions without implementing an automaton. Given a more complicated regular langauge (for instance, the lexical structure of a programming langauge), the conversion would be the best option.

Is the word "lexer" a synonym for the word "parser"?

The title is the question: Are the words "lexer" and "parser" synonyms, or are they different? It seems that Wikipedia uses the words interchangeably, but English is not my native language so I can't be sure.
A lexer is used to split the input up into tokens, whereas a parser is used to construct an abstract syntax tree from that sequence of tokens.
Now, you could just say that the tokens are simply characters and use a parser directly, but it is often convenient to have a parser which only needs to look ahead one token to determine what it's going to do next. Therefore, a lexer is usually used to divide up the input into tokens before the parser sees it.
A lexer is usually described using simple regular expression rules which are tested in order. There exist tools such as lex which can generate lexers automatically from such a description.
[0-9]+ Number
[A-Z]+ Identifier
+ Plus
A parser, on the other hand, is typically described by specifying a grammar. Again, there exist tools such as yacc which can generate parsers from such a description.
expr ::= expr Plus expr
| Number
| Identifier
No. Lexer breaks up input stream into "words"; parser discovers syntactic structure between such "words". For instance, given input:
velocity = path / time;
lexer output is:
velocity (identifier)
= (assignment operator)
path (identifier)
/ (binary operator)
time (identifier)
; (statement separator)
and then the parser can establish the following structure:
= (assign)
lvalue: velocity
rvalue: result of
/ (division)
dividend: contents of variable "path"
divisor: contents of variable "time"
No. A lexer breaks down the source text into tokens, whereas a parser interprets the sequence of tokens appropriately.
They're different.
A lexer takes a stream of input characters as input, and produces tokens (aka "lexemes") as output.
A parser takes tokens (lexemes) as input, and produces (for example) an abstract syntax tree representing statements.
The two are enough alike, however, that quite a few people (especially those who've never written anything like a compiler or interpreter) treat them as the same, or (more often) use "parser" when what they really mean is "lexer".
As far as I know, lexer and parser are allied in meaning but are not exact synonyms. Though many sources do use them as similar a lexer (abbreviation of lexical analyser) identifies tokens relevant to the language from the input; while parsers determine whether a stream of tokens meets the grammar of the language under consideration.

Practical difference between parser rules and lexer rules in ANTLR?

I understand the theory behind separating parser rules and lexer rules in theory, but what are the practical differences between these two statements in ANTLR:
my_rule: ... ;
MY_RULE: ... ;
Do they result in different AST trees? Different performance? Potential ambiguities?
... what are the practical differences between these two statements in ANTLR ...
MY_RULE will be used to tokenize your input source. It represents a fundamental building block of your language.
my_rule is called from the parser, it consists of zero or more other parser rules or tokens produced by the lexer.
That's the difference.
Do they result in different AST trees? Different performance? ...
The parser builds the AST using tokens produced by the lexer, so the questions make no sense (to me). A lexer merely "feeds" the parser a 1 dimensional stream of tokens.
This post may be helpful:
The lexer is responsible for the first step, and it's only job is to
create a "token stream" from text. It is not responsible for
understanding the semantics of your language, it is only interested in
understanding the syntax of your language.
For example, syntax is the rule that an identifier must only use
characters, numbers and underscores - as long as it doesn't start with
a number. The responsibility of the lexer is to understand this rule.
In this case, the lexer would accept the sequence of characters
"asd_123" but reject the characters "12dsadsa" (assuming that there
isn't another rule in which this text is valid). When seeing the valid
text example, it may emit a token into the token stream such as
IDENTIFIER(asd_123).
Note that I said "identifier" which is the general term for things
like variable names, function names, namespace names, etc. The parser
would be the thing that would understand the context in which that
identifier appears, so that it would then further specify that token
as being a certain thing's name.
(sidenote: the token is just a unique name given to an element of the
token stream. The lexeme is the text that the token was matched from.
I write the lexeme in parentheses next to the token. For example,
NUMBER(123). In this case, this is a NUMBER token with a lexeme of
'123'. However, with some tokens, such as operators, I omit the lexeme
since it's redundant. For example, I would write SEMICOLON for the
semicolon token, not SEMICOLON( ; )).
From ANTLR - When to use Parser Rules vs Lexer Rules?

What's the difference between a parser and a scanner?

I already made a scanner, now I'm supposed to make a parser. What's the difference?
A Scanner simply turns an input String (say a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators etc.
A parser converts this list of tokens into a Tree-like object to represent how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence).
In terms of programming language parsers, the output is usually referred to as an Abstract Syntax Tree (AST). Each node in the AST represents a different construct of the language, e.g. an IF statement would be a node with 2 or 3 sub nodes, a CONDITION node, a THEN node and potentially an ELSE node.
A parser does not give the nodes any meaning beyond structural cohesion. The next thing to do is extract meaning from this structure (sometimes called contextual analysis).
Parsing (in a general sense) is about turning the symbols (characters, digits, left parens, etc) into sentences of your grammar.
The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens. From there, the "parser" proper turns those whole tokens into sentences of your grammar.
Put another way, the lexer combines symbols into tokens, and the parser combines tokens to form sentences.

Resources