What's the difference between a parser and a scanner?

I already made a scanner, now I'm supposed to make a parser. What's the difference?

A Scanner simply turns an input String (say a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators etc.
A parser converts this list of tokens into a Tree-like object to represent how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence).
In terms of programming language parsers, the output is usually referred to as an Abstract Syntax Tree (AST). Each node in the AST represents a different construct of the language, e.g. an IF statement would be a node with 2 or 3 sub nodes, a CONDITION node, a THEN node and potentially an ELSE node.
A parser does not give the nodes any meaning beyond structural cohesion. The next thing to do is extract meaning from this structure (sometimes called contextual analysis).

Parsing (in a general sense) is about turning the symbols (characters, digits, left parens, etc) into sentences of your grammar.
The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens. From there, the "parser" proper turns those whole tokens into sentences of your grammar.
Put another way, the lexer combines symbols into tokens, and the parser combines tokens to form sentences.
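To make the split concrete, here is a minimal sketch in Python. It assumes a toy arithmetic language that is not part of the question; the names `lex` and `parse` and the tuple-based tree are illustrative choices only. The lexer combines characters into tokens, and the parser combines tokens into a tree.

```python
import re

# One regex alternative per token class; whitespace is matched but dropped.
TOKEN_RE = re.compile(r"(?P<NUM>\d+)|(?P<OP>[-+*/()])|(?P<WS>\s+)")

def lex(text):
    """Lexer: characters in, flat list of (kind, text) tokens out."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def parse(tokens):
    """Parser: tokens in, nested tuples (op, left, right) out."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", "")

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def factor():
        kind, value = advance()
        if kind == "NUM":
            return int(value)
        if value == "(":
            node = expr()
            if advance()[1] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")

    def term():
        node = factor()
        while peek()[1] in ("*", "/"):
            op = advance()[1]
            node = (op, node, factor())
        return node

    def expr():
        node = term()
        while peek()[1] in ("+", "-"):
            op = advance()[1]
            node = (op, node, term())
        return node

    tree = expr()
    if peek()[0] != "EOF":
        raise SyntaxError("trailing input")
    return tree

tokens = lex("1 + 2 * 3")
tree = parse(tokens)
```

Note that the lexer knows nothing about operator precedence or nesting; that structure only appears in the parser's output.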


Writing a lexer for a context sensitive markup language, that has recursive structures such as nested lists

I'm working on a reStructuredText transpiler in Rust, and am in need of some advice concerning how lexing should be structured in languages that have recursive structures. For example lists within lists are possible in rST:
* This is a list item
  * This is a sub list item
* And here we are at the preceding indentation level again.
The default docutils.parsers.rst took the approach of scanning the input one line at a time:
The reStructuredText parser is implemented as a state machine, examining its input one line at a time.
The state machine mentioned basically operates on a set of states of the form (regex, match_method, next_state). It tries to match the current line to the regex based on the current state and runs match_method while transitioning to the next_state if a match succeeds, doing this until it runs out of lines to scan.
My question then is, is this the best approach to scanning a language such as rST? My approach thus far has been to create a Chars iterator of the source and eat away at the source while trying to match against structures at the current Unicode scalar. This works to some extent when all I'm doing is scanning inline content, but I've now run into the realization that handling recursive body level structures like nested lists is going to be a pain in the butt. It feels like I'm going to need a whole bunch of states with duplicate regexes and related methods in many states for matching against indentations before new lines and such.
Would it be better to simply have an iterator over the lines of the source and match on a per-line basis, and if a line such as
  * this is an indented list item
is encountered in State::Body, simply transition to a state such as State::BulletList and start lexing lines based on the rules specified there? The above line could be lexed, for example, as the sequence
TokenType::Indent, TokenType::Bullet, TokenType::BodyText
Any thoughts on this?
I don't know much about rST. But you say it has "recursive" structures. If that's the case, you can't fully lex it as a recursive structure using just state machines or regexes or even lexer generators.
But this is the wrong way to think about it. The lexer's job is to identify the atoms of the language. A parser's job is to recognize structure, especially if it is recursive (yes, parsers often build trees recording the recursive structures they found).
So build the lexer ignoring context if you can, and use a parser to pick up the recursive structures if you need them. You can read more about the distinction in my SO answer about Parsers Vs. Lexers https://stackoverflow.com/a/2852716/120163
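A rough sketch of that division of labour, in Python for brevity rather than the asker's Rust. It assumes a drastically simplified bullet-list syntax (real rST also needs blank lines and has many more constructs), and the names `lex` and `parse` are illustrative: the lexer emits one flat token per bullet line, and only the parser keeps a stack to recover the nesting.

```python
import re

# Assumed simplification: a bullet line is optional spaces, a bullet
# character, at least one space, then the item text.
BULLET_RE = re.compile(r"^(?P<indent> *)[*+-] +(?P<text>.*)$")

def lex(source):
    """Context-free pass: one (indent_width, text) token per bullet line."""
    tokens = []
    for line in source.splitlines():
        m = BULLET_RE.match(line)
        if m:
            tokens.append((len(m.group("indent")), m.group("text")))
    return tokens

def parse(tokens):
    """Recover the nesting from the flat tokens with a stack of open lists."""
    root = []
    stack = []  # entries are (indent, list_of_items_at_that_indent)
    for indent, text in tokens:
        # Close every list that is indented deeper than this item.
        while stack and stack[-1][0] > indent:
            stack.pop()
        if not stack or stack[-1][0] < indent:
            # Open a new list: at the root, or under the latest item.
            target = root if not stack else stack[-1][1][-1]["children"]
            stack.append((indent, target))
        stack[-1][1].append({"text": text, "children": []})
    return root

source = "* a\n  * b\n* c"
tree = parse(lex(source))
```

The lexer here has no states at all; the recursion lives entirely in the parser's stack.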
If you insist on doing all of this in the lexer, you'll need to augment it with a pushdown stack to track the recursive structures. Then what you are building is a sloppy parser disguised as a lexer. (You will probably still want a real parser to process the output of this "lexer").
Having a pushdown stack is actually useful if the language has different atoms in different contexts, especially if the contexts nest; in this case what you want is a mode stack that you change as the lexer encounters tokens that indicate a switch from one mode to another. A really useful extension of this idea is to have mode changes select what amounts to different lexers, each of which produces lexemes unique to that mode.
As an example you might do this to lex a language that contains embedded SQL. We build parsers for JavaScript; our lexer uses a pushdown stack to process the content of regexp literals and track nesting of { ... }, [ ... ] and ( ... ). (This arguably has a downside: it rejects versions of JQuery.js that contain malformed regexes [yes, they exist]. JavaScript doesn't care if you define a bad regex literal and never use it, but that seems pretty pointless.)
A special case of the stack occurs if you only have to track a single "(" ... ")" pair or the equivalent. In this case you can use a counter to record how many "pushes" or "pops" you might have done on a real stack. If you have two or more pairs of tokens like this, counters don't work.
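The counter-versus-stack point can be illustrated with a small Python sketch (the function names are made up for this example). With one bracket pair a depth counter is enough; with several pairs, independent counters would accept an interleaving such as "([)]", so a real stack is needed.

```python
def balanced_single(text):
    """One bracket pair: a depth counter stands in for the stack."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" with nothing open
                return False
    return depth == 0

CLOSERS = {")": "(", "]": "[", "}": "{"}

def balanced_multi(text):
    """Several bracket pairs: the stack remembers which opener is pending."""
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in CLOSERS:
            if not stack or stack.pop() != CLOSERS[ch]:
                return False
    return not stack
```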

Trying to understand lexers, parse trees and syntax trees

I'm reading the "Dragon Book" and I think I understand the main point of a lexer, parse tree and syntax tree and what errors they're typically supposed to catch (assuming we're using a context-free language), but I need someone to catch me if I'm wrong. My understanding is that a lexer simply tokenizes the input and catches errors that have to do with invalid constructs in code, such as a semi-colon appearing in a language that doesn't contain semi-colons. The parse tree is used to verify that the syntax is followed and the code is in the correct order, and the syntax tree is used to actually evaluate the statements and expressions in the code and generate things like 3-address code or machine code. Are any of these correct?
Side-note: Are a concrete-syntax tree and a parse tree the same thing?
Side-Side note: When constructing the AST, does the entire program get built into one giant AST, or is a different AST constructed per statement/expression?
Strictly speaking, lexers are parsers too. The difference between lexers and parsers is in what they operate on. In the world of a lexer, everything is made of individual characters, which it then tokenizes by matching them to the regular grammar it understands. To a parser, the world is made up of tokens which it makes a syntax tree out of by matching them to the context-free grammar it understands. In this sense, they are both doing the same kind of thing but on different levels. In fact, you could build parsers on top of parsers, operating at higher and higher levels so one symbol in the highest level grammar could represent something incredibly complex at the lowest-level.
To your other questions:
Yes, a concrete syntax tree is a parse tree.
Generally, parsers make one tree for the entire file, since that represents a sentence in the CFG. This isn't necessarily always the case though.

How can a lexer extract a token in ambiguous languages?

I wish to understand how a parser works. I have learnt about LL, LR(0) and LR(1) parsing, and how to build NFAs, DFAs, parse tables, etc.
Now the problem is, I know that in some situations a lexer should extract tokens only on the parser's demand, when it's not possible to extract all the tokens in one separate pass. I don't exactly understand this kind of situation, so I'm open to any explanation about this.
The question now is, how should a lexer do its job? Should it base its recognition on the current "context", i.e. the non-terminals the parser currently expects? Or is it something totally different?
What about GLR parsing: is it another case where a lexer could try different terminals, or is it purely a syntactic matter?
I would also like to understand what this is related to: for example, is it related to the kind of parsing technique (LL, LR, etc.) or only to the grammar?
Thanks a lot
The simple answer is that lexeme extraction has to be done in context. What one might consider to be lexemes in the language may vary considerably in different parts of the language. For example, in COBOL, the data declaration section has 'PIC' strings and location-sensitive level numbers 01-99 that do not appear in the procedure section.
The lexer thus has to somehow know what part of the language is being processed, to know what lexemes to collect. This is often handled by having lexing states which each process some subset of the entire language set of lexemes (often with considerable overlap in the subset; e.g., identifiers tend to be pretty similar in my experience). These states form a high level finite state machine, with transitions between them when phase changing lexemes are encountered, e.g., the keywords that indicate entry into the data declaration or procedure section of the COBOL program. Modern languages like Java and C# minimize the need for this but most other languages I've encountered really need this kind of help in the lexer.
So-called "scannerless" parsers (you are thinking "GLR") work by getting rid of the lexer entirely; now there's no need for the lexer to produce lexemes, and no need to track lexical states :-} Such parsers work by simply writing the grammar down to the level of individual characters; typically you find grammar rules that are the exact equivalent of what you'd write for a lexeme description. The question is then, why doesn't such a parser get confused as to which "lexeme" to produce? This is where the GLR part is useful. GLR parsers are happy to process many possible interpretations of the input ("locally ambiguous parses") as long as the choice eventually gets resolved. So what really happens in the case of "ambiguous tokens" is that the grammar rules for both "tokens" produce nonterminals for their respective "lexemes", and the GLR parser continues to parse until one of the parsing paths dies out or the parser terminates with an ambiguous parse.
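As a rough illustration of lexing states, here is a Python sketch for a made-up, COBOL-flavoured toy language (it is not real COBOL, and the state names, token names and rules are all assumptions of this example). Each state has its own rule list, and a section keyword switches the state, so the same text "01" is a level number in one section and a plain number in the other.

```python
import re

# One rule list per lexical state; rules are tried in order.
RULES = {
    "DATA": [
        ("SECTION", r"PROCEDURE\b"),
        ("LEVEL", r"\d{2}\b"),
        ("PICTURE", r"PIC\s+\S+"),
        ("IDENT", r"[A-Za-z][A-Za-z0-9-]*"),
    ],
    "PROC": [
        ("SECTION", r"DATA\b"),
        ("NUMBER", r"\d+"),
        ("IDENT", r"[A-Za-z][A-Za-z0-9-]*"),
    ],
}

def lex(text, state="DATA"):
    """Tokenize `text`, switching rule sets when a section keyword appears."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for kind, pattern in RULES[state]:
            m = re.compile(pattern).match(text, pos)
            if m:
                break
        else:
            raise SyntaxError(f"no rule in state {state} matches at {pos}")
        tokens.append((kind, m.group()))
        pos = m.end()
        if kind == "SECTION":
            # The "phase changing" lexeme: move to the other state.
            state = "PROC" if m.group() == "PROCEDURE" else "DATA"
    return tokens

tokens = lex("01 TOTAL PIC 9(4) PROCEDURE ADD 01 TOTAL")
```

The identifier rule is duplicated across states, which mirrors the overlap between lexeme subsets mentioned above.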
My company builds lots of parsers for languages. We use GLR parsers because they are very nice for handling complex languages; write the context-free grammar and you have a parser. We use lexical-state based lexeme extractors with the usual regular-expression specification of lexemes and lexical-state-transitions triggered by certain lexemes. We could arguably build scannerless GLR parsers (by making our lexers produce single characters as tokens :) but we find the efficiency of the state-based lexers to be worth the extra trouble.
As practical extensions, our lexers actually use push-down-stack automata for the high level state machine rather than mere finite state machines. This helps when one has a high level FSA whose substates are identical, and where it is helpful for the lexer to track nested structures (e.g., matching parentheses) in order to manage a mode switch (e.g., when the parentheses have all been matched).
A unique feature of our lexers: we also do a little tiny bit of what scannerless parsers do: sometimes when a keyword is recognized, our lexers will inject both a keyword and an identifier into the parser (simulates a scannerless parser with a grammar rule for each). The parser will of course only accept what it wants "in context" and simply throw away the wrong alternative. This gives us an easy way to handle "keywords in context, otherwise interpreted as identifiers", which occurs in many, many languages.
Ideally, the tokens themselves should be unambiguous; you should always be able to tokenise an input stream without the parser doing any additional work.
This isn't always so simple, so you have some tools to help you out:
Start conditions
A lexer action can change the scanner's start condition, meaning it can activate different sets of rules.
A typical example of this is string literal lexing; when you parse a string literal, the rules for tokenising usually become completely different to the language containing them. This is an example of an exclusive start condition.
You can separate ambiguous lexings if you can identify two separate start conditions for them and ensure the lexer enters them appropriately, given some preceding context.
Lexical tie-ins
This is a fancy name for carrying state in the lexer, and modifying it in the parser. If a certain action in your parser gets executed, it modifies some state in the lexer, which results in lexer actions returning different tokens. This should be avoided where possible, because it makes your lexer and parser both more difficult to reason about, and makes some things (like GLR parsers) impossible.
The upside is that you can do things that would require significant grammar changes with relatively minor impact on the code; you can use information from the parse to influence the behaviour of the lexer, which in turn can come some way to solving your problem of what you see as an "ambiguous" grammar.
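A classic use of a lexical tie-in is distinguishing type names from ordinary identifiers in C-like languages. The Python sketch below assumes a tiny made-up subset (only `typedef A B;` and `T name;` statements), and the class and token names are illustrative. The parser adds a name to `type_names` when it sees a typedef, and the lexer, which classifies words only when asked for the next token, returns a different token kind for that name afterwards.

```python
import re

class Lexer:
    def __init__(self, text):
        self.words = re.findall(r"[A-Za-z_]\w*|\S", text)
        self.pos = 0
        self.type_names = {"int"}  # lexer state that the parser may modify

    def next_token(self):
        """Classify the next word on demand, consulting the shared state."""
        if self.pos >= len(self.words):
            return ("EOF", "")
        word = self.words[self.pos]
        self.pos += 1
        if word == "typedef":
            return ("TYPEDEF", word)
        if word in self.type_names:
            return ("TYPE_NAME", word)
        if word[0].isalpha() or word[0] == "_":
            return ("IDENT", word)
        return ("PUNCT", word)

def expect_semicolon(lexer):
    if lexer.next_token() != ("PUNCT", ";"):
        raise SyntaxError("expected ';'")

def parse(text):
    lexer = Lexer(text)
    decls = []
    tok = lexer.next_token()
    while tok[0] != "EOF":
        if tok[0] == "TYPEDEF":
            base = lexer.next_token()
            alias = lexer.next_token()
            expect_semicolon(lexer)
            lexer.type_names.add(alias[1])  # the tie-in: parser updates lexer
            decls.append(("typedef", base[1], alias[1]))
        elif tok[0] == "TYPE_NAME":
            name = lexer.next_token()
            expect_semicolon(lexer)
            decls.append(("var", tok[1], name[1]))
        else:
            raise SyntaxError(f"unexpected token {tok[1]!r}")
        tok = lexer.next_token()
    return decls

decls = parse("typedef int myint; myint x;")
```

This only works because tokens are classified lazily; a lexer that tokenized the whole input up front would have labelled the second `myint` before the parser had a chance to register it.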
Logic, reasoning
It's probable that it is possible to lex it in one pass, and the above tools should come second to thinking about how you should be tokenising the input and trying to convert that into the language of lexical analysis. :)
The fact is, your input is comprised of tokens—whether you like it or not!—and all you need to do is find a way to make a program understand the rules you already know.

Two level grammar

I am trying to determine whether suggested changes to the EcmaScript grammar introduce ambiguities.
The grammar is odd in a few ways:
There is no regular or context free lexical grammar, meaning there is no way to break the input into a series of tokens which can be fed to a tree builder, though at a given parser state there is a context free grammar which can be used to fetch the next token.
Some tokens are implicit. Specifically, semicolons are inserted in some places when not present in the source text. This requires only one non-ignorable token of lookahead, but since ignorable tokens can be of arbitrary length, no fixed finite lookahead suffices.
There is no translation simpler than a full parse that allows removal or collapsing of ignorable tokens.
Line terminator tokens (and multiline comments that are equivalent to line terminators) are ignorable in most contexts but are significant in some.
I know that proving no ambiguity is not doable in general, but I'd like to be able to achieve a simpler goal:
A test that is true if and only if there is no string such that two different paths through the candidate grammar might produce two different trees, where each path involves breaking the string into fewer than k tokens.
I would be very happy if I could prove such a thing for a candidate grammar to k of 50.
Is there any literature on detecting ambiguity within such limits?
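For an ordinary context-free grammar, the bounded test described above can at least be brute-forced. The Python sketch below is only that: it assumes a plain CFG with no empty productions and no cycles of unit productions, it ignores the context-dependent tokenization and semicolon insertion that make the EcmaScript grammar hard, and its cost grows exponentially with k. The function names and grammar encoding are made up for this example.

```python
from functools import lru_cache
from itertools import product

def count_trees(grammar, start, tokens):
    """Number of distinct parse trees deriving `tokens` from `start`.

    `grammar` maps each nonterminal to a list of productions (tuples of
    symbols); any symbol that is not a key is treated as a terminal.
    Assumes no empty productions and no cycles of unit productions.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(symbols, i, j):
        # Ways the symbol sequence `symbols` derives tokens[i:j].
        if not symbols:
            return 1 if i == j else 0
        head, rest = symbols[0], symbols[1:]
        total = 0
        # Every symbol covers at least one token, so leave room for `rest`.
        for k in range(i + 1, j - len(rest) + 1):
            ways = symbol(head, i, k)
            if ways:
                total += ways * derive(rest, k, j)
        return total

    @lru_cache(maxsize=None)
    def symbol(sym, i, k):
        if sym not in grammar:  # terminal
            return 1 if k == i + 1 and tokens[i] == sym else 0
        return sum(derive(prod, i, k) for prod in grammar[sym])

    return symbol(start, 0, len(tokens))

def find_ambiguous(grammar, start, terminals, k):
    """Return some string of fewer than k tokens with more than one tree,
    or None if every such string has at most one."""
    for n in range(1, k):
        for candidate in product(terminals, repeat=n):
            if count_trees(grammar, start, candidate) > 1:
                return candidate
    return None

ambiguous = {"E": [("E", "+", "E"), ("a",)]}
unambiguous = {"E": [("E", "+", "T"), ("T",)], "T": [("a",)]}
witness = find_ambiguous(ambiguous, "E", ["a", "+"], 6)
```

With the values above, `witness` is the token string `a + a + a`, which the first grammar can group in two ways, while the second grammar yields no witness below the same bound.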

Looking for a clear definition of what a "tokenizer", "parser" and "lexers" are and how they are related to each other and used?

I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program that will go through c/h source files to extract data declarations and definitions.
I have been looking for examples and can find some info, but I am really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax trees and how they interrelate. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, and 2) are there common implementations?
I have been looking at Wikipedia on these topics and programs like Lex and Yacc, but having never gone through a compiler class (EE major) I am finding it difficult to fully understand what is going on.
A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.
Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".
int x = 1;
A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.
A parser will take those tokens and use them to understand in some way:
we have a statement
it's a definition of an integer
the integer is called 'x'
'x' should be initialised with the value 1
I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.
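The `int x = 1;` example can be sketched in a few lines of Python. This is only a toy: the regex and the single hard-coded declaration shape are assumptions of this example, nowhere near a real C front end.

```python
import re

def tokenize(source):
    """Split the text into identifiers, integer literals and punctuation."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def parse_declaration(tokens):
    """Interpret exactly one statement of the form `int NAME = INTEGER ;`."""
    if len(tokens) != 5:
        raise SyntaxError("expected: int NAME = INTEGER ;")
    type_name, name, equals, value, semicolon = tokens
    if (type_name != "int" or not name.isidentifier()
            or equals != "=" or not value.isdigit() or semicolon != ";"):
        raise SyntaxError("expected: int NAME = INTEGER ;")
    # The "understanding": a definition of an integer called NAME,
    # initialised with the given value.
    return {"kind": "definition", "type": type_name,
            "name": name, "value": int(value)}

tokens = tokenize("int x = 1;")
decl = parse_declaration(tokens)
```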
(adding to the given answers)
Tokenizer will also remove any comments, and only return tokens to the Lexer.
Lexer will also define scopes for those tokens (variables/functions)
Parser then will build the code/program structure
"Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, AKA the Purple Dragon Book
A related answer of mine: What is the difference between a token and a lexeme?
As with my other answer such questions as this make more sense when a specific goal is desired.
In your case the specific goal is
Create a program will go through c/h source files to extract data declaration and definitions.
If the goal is to create Abstract Syntax Trees (AST) then those are created using a Parser, and a Parser is commonly fed a list of Tokens from the Lexer. Notice that Tokenizer is deliberately not mentioned.
Another way to think of the relation between a Lexer and Parser is that a Lexer creates a linear structure (list/stream of tokens) and a Parser converts the tokens into a tree structure (Abstract Syntax Tree).
If you read the Dragon book you will notice that the word Analysis appears often which is to say that analysis is one of the key functions at the various stages. This is because when working with Lexers and Parsers they are designed to work with formal languages and a determination needs to be made if the input adheres to the formal language.
From page 5:
character stream -> Lexical Analyzer -> token stream -> Syntax Analyzer -> syntax tree -> Semantic Analyzer -> syntax tree
In the above diagram the Lexer is associated with Lexical Analyzer and I would associate Syntax Analyzer and Semantic Analyzer with Parser but YMMV.
AFAIK Tokenizer has no official definition in the Dragon book, not even noted in the index. I don't have an electronic copy of the book so could not do an automated search.
One common reference that notes Tokenizer is Anatomy of a Compiler but the Dragon books are the reference of choice by many in the field.
However if your only goal is to create a list of tokens and then do something else other than semantic analysis then calling the module/function/... a tokenizer might be the right name.
I use Lexer with Parser and don't use Tokenizer with Parser.
Another thought to keep in mind is that no useful information should be lost in the transformations. In other words, if one of your goals is to be able to recreate the input from the AST, then the AST needs to capture the extraneous information like whitespace, which in turn means the Lexer also needs to capture the extraneous information. One reason to go through such effort is to create useful error messages, or to support Edit and Continue debugging.
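A minimal Python sketch of a lexer that loses nothing, assuming a toy token set with `//` line comments (the names and rules are illustrative only). Whitespace and comments are kept as tokens of their own, so joining the token texts reproduces the input exactly, and a later stage can filter them out when it does not need them.

```python
import re

# Every character falls into exactly one alternative: "." matches anything
# except a newline, and newlines are already covered by the WS rule.
TOKEN_RE = re.compile(
    r"(?P<WS>\s+)|(?P<COMMENT>//[^\n]*)|(?P<WORD>\w+)|(?P<PUNCT>.)"
)

def lex_lossless(source):
    """Return (kind, text) tokens, including whitespace and comments."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]

def significant(tokens):
    """The view a parser would normally consume."""
    return [t for t in tokens if t[0] not in ("WS", "COMMENT")]

source = "int  x = 1; // init\n"
tokens = lex_lossless(source)
roundtrip = "".join(text for _, text in tokens)
```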
