What exactly is a lexer's job? - parsing

I'm writing a lexer for Markdown. In the process, I realized that I do not fully understand what its core responsibility should be.
The most common definition of a lexer is that it translates an input stream of characters into an output stream of tokens.
Input → Output
(characters) (tokens)
That sounds quite simple at first, but the question that arises here is how much semantic interpretation the lexer should do before handing over its output of tokens to the parser.
Take this example of Markdown syntax:
### Headline
*This* is an emphasized word.
It might be translated by a lexer into the following series of tokens:
Lexer 1 Output
.headline("Headline")
.emphasis("This")
.text"(" is an emphasized word.")
But it might as well be translated on a more granular level, depending on the grammar (or the set of lexemes) used:
Lexer 2 Output
.controlSymbol("#")
.controlSymbol("#")
.controlSymbol("#")
.text(" Headline")
.controlSymbol("*")
.text("This")
.controlSymbol("*")
.text"(" is an emphasized word.")
It seems a lot more practical to have the lexer produce an output similar to that of Lexer 1, because the parser will then have an easier job. But it also means that the lexer needs to semantically understand what the code means. It's not merely mapping a sequence of characters to a token. It needs to look ahead and identify patterns. (For example, it needs to be able to be able to distinguish between **Hey* you* and **Hey** you. It cannot simply translate a double asterisk ** into .openingEmphasis, because that depends on the following context.)
According to this Stackoverflow post and the CommonMark definition, it seems to make sense to first break down the Markdown input into a number of blocks (representing one or more lines) and then analyze the contents of each block in a second step. With the example above, this would mean the following:
.headlineBlock("Headline")
.paragraphBlock("*This* is an emphasized word.")
But this wouldn't count as a valid sequence of tokens because some of the lexemes ("*") have not been parsed yet and it wouldn't be right to pass this paragraphBlock to the parser.
So here's my question:
Where do you draw the line?
How much semantic work should the lexer do? Is there some hard cut in the definition of a lexer that I am not aware of?
What would be the best way to define a grammar for the lexer?

BNF is used to describe many languages / create lexers and parsers
MOST use a Look right 1 to define a unambiguous format.
Recently I was looking at playing with SQL BNF
https://github.com/ronsavage/SQL/blob/master/sql-92.bnf
I made the decision that my lexer would return only terminal token strings. Similar to your option 1.
'('
')'
'KEWORDS'
'-- comment eol'
'12.34'
...
Any rule that defined the syntax tree would be left to the parser.
<Document> := <lines>
<Lines> := <line> [<Lines>]
<line> := ...

Related

Predictive editor for Rascal grammar

I'm trying to write a predictive editor for a grammar written in Rascal. The heart of this would be a function taking as input a list of symbols and returning as output a list of symbol types, such that an instance of any of those types would be a syntactically legal continuation of the input symbols under the grammar. So if the input list was [4,+] the output might be [integer]. Is there a clever way to do this in Rascal? I can think of imperative programming ways of doing it, but I suspect they don't take proper advantage of Rascal's power.
That's a pretty big question. Here's some lead to an answer but the full answer would be implementing it for you completely :-)
Reify an original grammar for the language you are interested in as a value using the # operator, so that you have a concise representation of the grammar which can be queried easily. The representation is defined over the modules Type, ParseTree which extends Type and Grammar.
Construct the same representation for the input query. This could be done in many ways. A kick-ass, language-parametric, way would be to extend Rascal's parser algorithm to return partial trees for partial input, but I believe this would be too much hassle now. An easier solution would entail writing a grammar for a set of partial inputs, i.e. the language grammar with at specific points shorter rules. The grammar will be ambiguous but that is not a problem in this case.
Use tags to tag the "short" rules so that you can find them easily later: syntax E = #short E "+";
Parse with the extended and now ambiguous grammar;
The resulting parse trees will contain the same representation as in ParseTree that you used to reify the original grammar, except in that one the rules are longer, as in prod(E, [E,+,E],...)
then select the trees which serve you best for the goal of completion (which use the #short tag), and extract their productions "prod", which look like this prod(E,[E,+],...). For example using the / operator: [candidate : /candidate:prod(_,_,/"short") := trees], and you could use a cursor position to find candidates which are close by instead of all short trees in there.
Use list matching to find prefixes in the original grammar, like if (/match:prod(_,[*prefix, predicted, *postfix],_) := grammar) ..., prefix is your query as extracted from the #short rules. predicted is your answer and postfix is whatever would come after.
yield the predicted symbol back as a type for the user to read: "<type(predicted, ())>" (will pretty print it nicely even if it's some complex regexp type and does the quoting right etc.)

Combined unparser/parser generator

Is there a parser generator that also implements the inverse direction, i.e. unparsing domain objects (a.k.a. pretty-printing) from the same grammar specification? As far as I know, ANTLR does not support this.
I have implemented a set of Invertible Parser Combinators in Java and Kotlin. A parser is written pretty much in LL-1 style and it provides a parse- and a print-method where the latter provides the pretty printer.
You can find the project here: https://github.com/searles/parsing
Here is a tutorial: https://github.com/searles/parsing/blob/master/tutorial.md
And here is a parser/pretty printer for mathematical expressions: https://github.com/searles/parsing/blob/master/src/main/java/at/searles/demo/DemoInvert.kt
Take a look at Invertible syntax descriptions: Unifying parsing and pretty printing.
There are several parser generators that include an implementation of an unparser. One of them is the nearley parser generator for context-free grammars.
It is also possible to implement bidirectional transformations of source code using definite clause grammars. In SWI-Prolog, the phrase/2 predicate can convert an input text into a parse tree and vice-versa.
Our DMS Software Reengineering Toolkit does precisely this (and provides a lot of additional support for analyzing/transforming code). It does this by decorating a language grammar with additional attributes, producing what is called an attribute grammar. We use a special DSL to write these rules to make them convenient to write.
It helps to know that DMS produces a tree based directly on the grammar.
Each DMS grammar rule is paired with with so-called "prettyprinting" rule. Each prettyprinting rule describes how to "prettyprint" the syntactic element and sub-elements recognized by its corresponding grammar rule. The prettyprinting process essentially manufactures or combines rectangular boxes of text horizontally or vertically (with optional indentation), with leaves producing unit-height boxes containing the literal value of the leaf (keyword, operator, identifier, constant, etc.
As an example, one might write the following DMS grammar rule and matching prettyprinting rule:
statement = 'for' '(' assignment ';' assignment ';' conditional_expression ')'
'{' sequence_of_statements '}' ;
<<PrettyPrinter>>:
{ V(H('for','(',assignment[1],';','assignment[2],';',conditional_expression,')'),
H('{', I(sequence_of_statements)),
'}');
This will parse the following:
for ( i=x*2;
i--; i>-2*x ) { a[x]+=3;
b[x]=a[x]-1; }
(using additional grammar rules for statements and expressions) and prettyprint it (using additional prettyprinting rules for those additional grammar rules) as follows:
for (i=x*2;i--;i>-2*x)
{ a[x]+=3;
b[x]=a[x]-1;
}
DMS also captures comments, attaches them to AST nodes, and regenerates them on output. The implementation is a bit exotic because most parsers don't handle comments, but utilization is easy, even "free"; comments will be automatically inserted in the prettyprinted result in their original places.
DMS can also print in "fidelity" mode. In this form, it tries to preserve the shape of the toke (e.g., number radix, identifier character capitalization, which keyword spelling was used) the column offset (into the line) of a parsed token. This would cause the original text (or something so close that you don't think it is different) to get regenerated.
More details about what prettyprinters must do are provided in my SO answer on Compiling an AST back to source code. DMS addresses all of those topics cleanly.
This capability has been used by DMS on some 40+ real languages, including full IBM COBOL, PL/SQL, Java 1.8, C# 5.0, C (many dialects) and C++14.
By writing a sufficiently interesting set of prettyprinter rules, you can build things like JavaDoc extended to include hyperlinked source code.
It is not possible in general.
What makes a print pretty? A print is pretty, if spaces, tabs or newlines are at those positions, which make the print looking nicely.
But most grammars ignore white spaces, because in most languages white spaces are not significant. There are exceptions like Python but in general the question, whether it is a good idea to use white spaces as syntax, is still controversial. And therefor most grammars do not use white spaces as syntax.
And if the abstract syntax tree does not contain white spaces, because the parser has thrown them away, no generator can use them to pretty print an AST.

Does the recognition of numbers belong in the scanner or in the parser?

When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(Definitions were made on the fly, I have probably made a mistake in them).
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete and it is up to the implementer to realize that they should actually be in the scanner?
Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: check out backtracking recursive-descent parsers. For such a parser, tokens are defined in a similar way to non-tokens. pyparsing is a parser generator for such parsers.
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides but generally it is pretty obvious. You want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?

Parsing Context Sensitive Language

i am reading the Definitive ANTLR reference by Terence Parr, where he says:
Semantic predicates are a powerful
means of recognizing context-sensitive
language structures by allowing
runtime information to drive
recognition
But the examples in the book are very simple. What i need to know is: can ANTLR parse context-sensitive rules like:
xAy --> xBy
If ANTLR can't parse these rules, is there is another tool that deals with context-sensitive grammars?
ANTLR parses only grammars which are LL(*). It can't parse using grammars for full context-sensitive languages such as the example you provided. I think what Parr meant was that ANTLR can parse some languages that require some (left) context constraints.
In particular, one can use semantic predicates on "reduction actions" (we do this for GLR parsers
used by our DMS Software Reengineering Toolkit but the idea is similar for ANTLR, I think) to inspect any data collected by the parser so far, either as ad hoc side effects of other semantic actions, or in a partially-built parse tree.
For our DMS-based DMS-based Fortran front end, there's a context-sensitive check to ensure that DO-loops are properly lined up. Consider:
DO 20, I= ...
DO 10, J = ...
...
20 CONTINUE
10 CONTINUE
From the point of view of the parser, the lexical stream looks like this:
DO <number> , <variable> = ...
DO <number> , <variable> = ...
...
<number> CONTINUE
<number> CONTINUE
How can the parser then know which DO statement goes with which CONTINUE statement?
(saying that each DO matches its closest CONTINUE won't work, because FORTRAN can
share a CONTINUE statement with multiple DO-heads).
We use a semantic predicate "CheckMatchingNumbers" on the reduction for the following rule:
block = 'DO' <number> rest_of_do_head newline
block_of_statements
<number> 'CONTINUE' newline ; CheckMatchingNumbers
to check that the number following the DO keyword, and the number following the CONTINUE keyword match. If the semantic predicate says they match, then a reduction for this rule succeeds and we've aligned the DO head with correct CONTINUE. If the predicate fails, then no reduction is proposed (and this rule is removed from candidates for parsing the local context); some other set of rules has to parse the text.
The actual rules and semantic predicates to handle FORTRAN nesting with shared continues is more complex than this but I think this makes the point.
What you want is full context-sensitive parsing engine. I know people have built them, but I don't know of any full implementations, and don't expect them to be fast.
I did follow Quinn Taylor Jackson's MetaS grammar system for awhile; it sounded like a practical attempt to come close.
It is comparatively easy to write a context-sensitive parser in Prolog. This program parses the string [a,is,less,than,b,and,b,is,less,than,c], converting it into [a,<,b,<,c]:
:- initialization(main).
:- set_prolog_flag('double_quotes','chars').
main :-
rewrite_system([a,is,less,than,b,and,b,is,less,than,c],X),writeln('\nFinal output:'),writeln(X).
rewrite_rule([[A,<,B],and,[B,<,C]],[A,<,B,<,C]).
rewrite_rule([A,is,less,than,B],[A,<,B]).
rewrite_rule([[A,<,B],and,C,than,D],[[A,<,B],and,A,is,C,than,D]).
rewrite_rule([A,<,B],[[A,<,B]]).
rewritten(A) :- atom(A);bool(A).
bool(A) :- atom(A).
bool([A,<,B,<,C]) :- atom(A),atom(B),atom(C).
bool([A,and,B]) :- bool(A),bool(B).
% this predicate is from https://stackoverflow.com/a/8312742/975097
replace(ToReplace, ToInsert, List, Result) :-
once(append([Left, ToReplace, Right], List)),
append([Left, ToInsert, Right], Result).
rewrite_system(Input,Output) :-
rewritten(Input),Input=Output;
rewrite_rule(A,B),
replace(A,B,Input,Input1),
writeln(Input1),
rewrite_system(Input1,Output).
Using the same algorithm, I also wrote an adaptive parser that "learns" new rewrite rules from its input.

Looking for a clear definition of what a "tokenizer", "parser" and "lexers" are and how they are related to each other and used?

I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program will go through c/h source files to extract data declaration and definitions.
I have been looking for examples and can find some info, but I really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax tree and how they interrelate to each other. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, 2) are there common implementations.
I have been looking at Wikipedia on these topics and programs like Lex and Yacc, but having never gone through a compiler class (EE major) I am finding it difficult to fully understand what is going on.
A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.
Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".
Example:
int x = 1;
A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.
A parser will take those tokens and use them to understand in some way:
we have a statement
it's a definition of an integer
the integer is called 'x'
'x' should be initialised with the value 1
I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.
(adding to the given answers)
Tokenizer will also remove any comments, and only return tokens to the Lexer.
Lexer will also define scopes for those tokens (variables/functions)
Parser then will build the code/program structure
Using
"Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, AKA the Purple Dragon Book
a related answer of mine What is the difference between a token and a lexeme?
As with my other answer such questions as this make more sense when a specific goal is desired.
In your case the specific goal is
Create a program will go through c/h source files to extract data declaration and definitions.
If the goal is to create Abstract Syntax Trees (AST) then those are created using a Parser and a Parser is commonly feed a list of Tokens from the Lexer. Notice that Tokenizer is deliberately not mentioned.
Another way to think of the relation between a Lexer and Parser is that a Lexer creates a linear structure (list/stream of tokens) and a Parser converts the tokens into an tree structure (Abstract Syntax Tree).
If you read the Dragon book you will notice that the word Analysis appears often which is to say that analysis is one of the key functions at the various stages. This is because when working with Lexers and Parsers they are designed to work with formal languages and a determination needs to be made if the input adheres to the formal language.
From page 5
character stream
|
V
Lexical Analyzer
(token stream)
|
V
Syntax Analyzer
(syntax tree)
|
V
Semantic Analyzer
(syntax tree)
|
V
...
In the above diagram the Lexer is associated with Lexical Analyzer and I would associate Syntax Analyzer and Semantic Analyzer with Parser but YMMV.
AFAIK Tokenizer has no official definition in the Dragon book, not even noted in the index. I don't have an electronic copy of the book so could not do an automated search.
One common reference that notes Tokenizer is Anatomy of a Compiler but the Dragon books are the reference of choice by many in the field.
However if your only goal is to create a list of tokens and then do something else other than semantic analysis then calling the module/function/... a tokenizer might be the right name.
I use Lexer with Parser and don't use Tokenizer with Parser.
Another thought to keep in mind is that if no useful information should be lost in the transformations. In other words if one of your goals is to be able to recreate the input from the AST then the AST needs to capture the extraneous information like whitespace, which then means the Lexer also needs to capture the extraneous information. One reason to go through such effort is to create useful error messages or for Edit code and continue Debugging.

Resources