Building a parser (Part I) - parsing

I'm making my own JavaScript-based programming language (yeah, it's crazy, but it's just for learning... maybe?). Well, I'm reading about parsers, and the first pass is to convert the source code to tokens, like:
if(x > 5)
return true;
The tokenizer turns that into:
T_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_RPAREN ")"
T_IDENTIFIER "return"
T_TRUE "true"
T_TERMINATOR ";"
I don't know if my logic is correct so far. My parser goes even further (or does it?) and translates it to this (yeah, a multidimensional array):
T_IF "if"
T_EXPRESSION ...
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_CLOSURE ...
T_IDENTIFIER "return"
T_TRUE "true"
I have some questions:
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP) instead of interpreted all the time.
After I tokenize, what exactly do I need to do? I'm really lost on this pass!
Are there any good tutorials for learning how to do this?
Well, that's it. Bye!

Generally, you want to separate the functions of the tokeniser (also called a lexer) from other stages of your compiler or interpreter. The reason for this is basic modularity: each pass consumes one kind of thing (e.g., characters) and produces another one (e.g., tokens).
So you’ve converted your characters to tokens. Now you want to convert your flat list of tokens to meaningful nested expressions, and this is what is conventionally called parsing. For a JavaScript-like language, you should look into recursive descent parsing. For parsing expressions with infix operators of different precedence levels, Pratt parsing is very useful, and you can fall back on ordinary recursive descent parsing for special cases.
Just to give you a more concrete example based on your case, I’ll assume you can write two functions: accept(token) and expect(token), which test the next token in the stream you’ve created. You’ll make a function for each type of statement or expression in the grammar of your language. Here’s Pythonish pseudocode for a statement() function, for instance:
def statement():
    if accept("if"):
        x = expression()
        y = statement()
        return IfStatement(x, y)
    elif accept("return"):
        x = expression()
        return ReturnStatement(x)
    elif accept("{"):
        xs = []
        while True:
            xs.append(statement())
            if not accept(";"):
                break
        expect("}")
        return Block(xs)
    else:
        error("Invalid statement!")
This gives you what’s called an abstract syntax tree (AST) of your program, which you can then manipulate (optimisation and analysis), output (compilation), or run (interpretation).

Most toolkits split the complete process into two separate parts:
lexer (aka. tokenizer)
parser (aka. grammar)
The tokenizer will split the input data into tokens. The parser will only operate on the token "stream" and build the structure.
Your question seems to be focused on the tokenizer. But your second solution mixes the grammar parser and the tokenizer into one step. Theoretically this is also possible, but for a beginner it is much easier to do it the same way most other tools/frameworks do: keep the steps separate.
As for your first solution: I would tokenize your example like this:
T_KEYWORD_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_LITERAL "5"
T_RPAREN ")"
T_KEYWORD_RET "return"
T_KEYWORD_TRUE "true"
T_TERMINATOR ";"
In most languages keywords cannot be used as method names, variable names and so on. This is reflected already on the tokenizer level (T_KEYWORD_IF, T_KEYWORD_RET, T_KEYWORD_TRUE).
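As a rough illustration of that keyword distinction, a hand-rolled tokenizer might look something like this Python sketch (the tiny operator set and helper names are my assumptions for the example, not part of the answer):

import re

KEYWORDS = {"if": "T_KEYWORD_IF", "return": "T_KEYWORD_RET", "true": "T_KEYWORD_TRUE"}
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(>|\(|\)|;))")

def tokenize(source):
    tokens, pos = [], 0
    source = source.strip()
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise SyntaxError("unexpected character at position %d" % pos)
        pos = m.end()
        number, word, symbol = m.groups()
        if number:
            tokens.append(("T_LITERAL", number))
        elif word:
            # the keyword/identifier split happens here, at the tokenizer level
            tokens.append((KEYWORDS.get(word, "T_IDENTIFIER"), word))
        else:
            names = {"(": "T_LPAREN", ")": "T_RPAREN", ">": "T_GT", ";": "T_TERMINATOR"}
            tokens.append((names[symbol], symbol))
    return tokens

# tokenize('if(x > 5) return true;') produces exactly the token stream shown above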
The next level would take this stream and - by applying a formal grammar - would build some datastructure (often called AST - Abstract Syntax Tree) which might look like this:
IfStatement:
    Expression:
        BinaryOperator:
            Operator: T_GT
            LeftOperand:
                IdentifierExpression:
                    "x"
            RightOperand:
                LiteralExpression:
                    5
    IfBlock:
        ReturnStatement:
            ReturnExpression:
                LiteralExpression:
                    "true"
    ElseBlock: (empty)
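In code, such a tree is typically a handful of small node classes. Here's a Python sketch; the class names just mirror the outline above and are otherwise invented for illustration:

class IfStatement:
    def __init__(self, condition, then_block, else_block=None):
        self.condition = condition      # an expression node
        self.then_block = then_block    # a statement (or list of statements)
        self.else_block = else_block    # None when there is no else

class BinaryOperator:
    def __init__(self, operator, left, right):
        self.operator, self.left, self.right = operator, left, right

class IdentifierExpression:
    def __init__(self, name): self.name = name

class LiteralExpression:
    def __init__(self, value): self.value = value

class ReturnStatement:
    def __init__(self, expression): self.expression = expression

# the example above then becomes:
tree = IfStatement(
    condition=BinaryOperator("T_GT", IdentifierExpression("x"), LiteralExpression(5)),
    then_block=ReturnStatement(LiteralExpression(True)),
)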
Implementing the parser is usually done with some framework. Implementing something like that by hand, and efficiently, usually takes the better part of a semester at a university. So you really should use some kind of framework.
The input for a grammar parser framework is usually a formal grammar in some kind of BNF. Your "if" part might look like this:
IfStatement: T_KEYWORD_IF T_LPAREN Expression T_RPAREN Statement ;
Expression: LiteralExpression | BinaryExpression | IdentifierExpression | ... ;
BinaryExpression: LeftOperand BinaryOperator RightOperand;
....
That's only to give you the idea. Parsing a real-world language like JavaScript correctly is not an easy task. But fun.

Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP) instead of interpreted all the time.
What's the original way? There are many different ways to implement languages. I think yours is fine, actually. I once tried to build a language myself that translated to C#, the hack programming language. Many language compilers translate to an intermediate language; it's quite common.
After I tokenize, what exactly do I need to do? I'm really lost on this pass!
After tokenizing, you need to parse it. Use some good lexer/parser framework, such as Boost.Spirit, or Coco, or whatever. There are hundreds of them. Or you can implement your own lexer, but that takes time and resources. There are many ways to parse code; I generally rely on recursive descent parsing.
Next you need to do code generation. That's the most difficult part in my opinion. There are tools for that too, but you can do it manually if you want to. I tried to do it in my project, but it was pretty basic and buggy.
Are there any good tutorials for learning how to do this?
As I suggested earlier, use tools to do it. There are a lot of pretty good, well-documented parser frameworks. For further information, you can try asking some people who know about this stuff. DeadMG, over at the C++ Lounge, is building a programming language called "Wide". You may try consulting him.

Let's say I have this statement in a programming language:
if (0 < 1) then
print("Hello")
The lexer will translate it into:
keyword: if
num: 0
op: <
num: 1
keyword: then
keyword: print
string: "Hello"
The parser will then take the information (aka "Token Stream") and make this:
if:
    expression:
        <:
            0, 1
    then:
        print:
            "Hello"
I don't know if this will help or not, but I hope it does.

Related

YACC|BISON: How do I manipulate the parse tree?

The goal of my application is to validate SQL code and, at the same time, generate from that code a formatted version with some modifications. For example, this where clause:
where e.student_name= c.contact_name and ( c.address = " nefta"
or c.address=" tozeur ") and e.age <18
will produce formatted output something like this:
where e.student_name= c.contact_name and (c.address=trim("nefta")
or c.address=trim("tozeur") ) and e.age <18
I hope I've explained my aim well.
The problem is that grammars may contain recursive rules, which makes the rewrite task unreliable; for instance, in my SQL grammar I have this:
search_condition : search_condition OR search_condition{clbck_or}
| search_condition AND search_condition{clbck_and}
| NOT search_condition {clbck_not}
| '(' search_condition ')'{clbck__}
| predicate {clbck_pre}
;
Note that I specified operator precedence to solve shift/reduce conflicts:
%left OR
%left AND
%left NOT
So back to the last example; my where clause will be consumed this way:
c.address="nefta"or c.address="tozeur" -> search_condition
(c.address="nefta"or c.address="tozeur")->search_condition
e.student_name= c.contact_name and (c.address="nefta"or c.address="tozeur")-> search_condition
... and e.age<18-> search_condition
You can see, by the way, that it's tough to rebuild the input stream from the callbacks triggered by each reduction, because the order is not the same.
Any help with this problem?
Your question is a bit vague, so I'm guessing that you actually print in your clbck_or(), etc. The "common" way to which wildplasser has alluded is to use "semantic values", i.e. (untested):
search_condition : search_condition OR search_condition{$$ = clbck_or($1, $3);}
| search_condition AND search_condition{$$ = clbck_and($1, $3);}
| NOT search_condition {$$ = clbck_not($2);}
| '(' search_condition ')'{$$ = clbck__($2);}
| predicate {$$ = clbck_pre($1);}
;
If you're using Bison, the manual has a fine example in the section "Infix Notation Calculator: `calc'". With strings and C, you will have to add some memory handling.
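Transposed into Python purely for illustration (Bison itself would do this in C with $$/$1/$3, and these helper names are invented), the semantic-value idea is that every reduction returns a node built from its children instead of printing as it goes; only the finished root is rendered, in source order:

# each "reduction" returns a node built from its children's values
def clbck_and(left, right):  return ("AND", left, right)
def clbck_or(left, right):   return ("OR", left, right)
def clbck_pre(predicate):    return ("PRED", predicate)

def render(node):
    kind = node[0]
    if kind == "PRED":
        return node[1]
    return "(%s %s %s)" % (render(node[1]), kind.lower(), render(node[2]))

# built bottom-up, in the same order the reductions fire:
cond = clbck_and(clbck_pre('e.student_name = c.contact_name'),
                 clbck_or(clbck_pre('c.address = trim("nefta")'),
                          clbck_pre('c.address = trim("tozeur")')))
print(render(cond))
# (e.student_name = c.contact_name and (c.address = trim("nefta") or c.address = trim("tozeur")))

Because rendering happens only at the root, the output order matches the input, which is exactly the problem the print-in-each-callback approach runs into.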
Bison is good at parsing, and with some manual help, good at building a custom syntax tree. After that, it's up to you to do what you want with the tree. The good news is you can do whatever you want. The bad news is you still have to build a lot of machinery to do what you want. Your basic problem of regenerating source code is called "prettyprinting"; see my SO answer on how to prettyprint to understand what it takes to do this, including all the peccadillos of lexical syntax (you don't want to lose the escapes in your literal strings, right?). You didn't address at all how to find the construct you want to change in the tree, or how you'd smash the tree to change it.
If you don't want to do all of that, then what you really want is a program transformation system, which is good at parsing, builds a syntax tree for you (so you don't have to think about it; SQL is a pretty big grammar), lets you find patterns in the tree in terms of the SQL syntax you are used to, lets you make tree changes without knowing much about the shape of the tree, and can finally regenerate valid source text by prettyprinting, as described in my answer linked above. (A program transformation system essentially includes a parser as a subroutine.)
Our DMS Software Reengineering Toolkit is such a program transformation system. It has a set of predefined language definitions including SQL2011 and means for configuring for a particular dialect.
Using DMS source-to-source syntax rules, you could carry out the change in your example with the following rule:
domain SQL;
rule trim_c_members(f: identifier, s: string):condition->condition
= " c.\f = \s " -> " c.\f = trim(\s) ";
This is DMS Rule Language (meta) syntax for describing a rewrite on ("domain") SQL code. The rule has a name (because in complex applications there are lots of rules) and it has syntactic placeholders "f" and "s"; it rewrites only conditions in the code. The quotes are RSL meta-quotes; stuff inside is SQL with RSL metavariables "\f" and "\s"; stuff outside is RSL rule syntax. What the rule says is: "for any condition on a variable explicitly named 'c', with any field f, if that field is compared by equality to some literal string, then replace the literal string by 'trim' applied to the literal string".
I left out some code that basically says, "apply this rule to the entire tree, and don't apply it twice in the same place". That "strategy" is one of many built into DMS.
There's the question of how the rule works. That is accomplished by DMS applying the SQL parser to the meta-quoted strings, to produce "pattern" syntax trees with placeholders where the metavariables are written. The left-hand pattern tree is then matched against the target tree, with placeholders referring to subtrees; the right-hand tree is spliced in where the left tree matched, and the placeholder subtrees are transferred. So you, the programmer, see surface syntax that you know and love; the tool works with trees, and so it isn't confused by the text.
Now, I don't think my rule matches exactly your intent, but that's partly because I can't guess your actual intent. You can write other rules if this isn't what you wanted.
This rule is purely driven by syntax; one can add a semantic predicate (not shown) if you want more complicated conditions to apply to the rule (e.g., the variable has to occur only in certain scopes you define), and that gets messier to say. But it is much simpler and far easier to read than C code that climbs over the AST (notice you didn't see the AST here?) and tries to figure all this out.
The parsing and prettyprinting happen before and after rule application; there's a lot of machinery required to implement all that, but that machinery is built into DMS (e.g., it has something like, but more powerful than, Bison built in), and for predefined domains such as SQL, all the prettyprinting work has been preconfigured, too.
If you want to get a better sense of what it takes to go full cycle with DMS (define your own language parser, define a pretty printer, define complicated rules), here's a nice and complete example of defining and symbolically simplifying calculus using DMS.

Is there a parser generator that can use the Wirth syntax?

ie: http://en.wikipedia.org/wiki/Wirth_syntax_notation
It seems like most use BNF / EBNF ...
The distinction made by the Wikipedia article looks to me like it is splitting hairs. "BNF/EBNF" has long meant writing grammar rules in roughly the following form:
nonterminal = right_hand_side end_rule_marker
As with other silly language differences ("{" in C, begin in Pascal, endif vs. fi), you can get very different-looking but identical meaning by choosing different, er, syntax for end_rule_marker and for what you are allowed to say in the right_hand_side.
Normally people allow literal tokens in (your choice of!) quotes, other nonterminal names, and, for EBNF, various "choice" operators: typically | or / for alternatives, * or + for "repeat", [ ... ] or ? for "optional", etc.
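For instance, here's the same argument-list rule in one common EBNF style (ISO-flavored, with commas for concatenation and ";" as terminator) and in WSN (juxtaposition and "." as terminator); the content is identical, only the markers differ:

EBNF: arg_list = expression , { "," , expression } ;
WSN:  arg_list = expression { "," expression } .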
Because people designing language syntax are playing with syntax, they seem to invent their own every time they write some down. (Check the various syntax formalisms in the language standards; none of them are the same). Yes, we'd all be better off if there were one standard way to write this stuff. But we don't do that for C or C++ or C# (not even Microsoft could follow their own standard); why should BNF be any different?
Because the people who build parser generators usually use them to parse their own syntax, they can, and easily do, define their own for each parser generator. I've never seen one that did precisely the "WSN" version at Wikipedia, and I doubt I ever will in practice.
Does it matter much? Well, no. What really matters is the power of the parser generator behind the notation. You have to bend most grammars to match the power (well, weakness) of the parser generator; for instance, for LL-style generators, your grammar rules can't have left recursion (WSN allows that according to my reading). People building parser generators also want to make it convenient to express non-parser issues (such as, "how to build tree nodes") and so they also add extra notation for non-parsing issues.
So what really drives the parser generator syntax are weaknesses of the parser generator to handle arbitrary languages, extra goodies deemed valuable for the parser generator, and the mood of the guy implementing it.

Parsing rules - how to make them play nice together

So I'm writing a parser, where I favor flexibility over speed, and I want it to be easy to write grammars for, e.g. no tricky workaround rules (fake rules to solve conflicts etc., like you have to do in yacc/bison).
There's a hand-coded lexer with a fixed set of tokens (e.g. PLUS, DECIMAL, STRING_LIT, NAME, and so on). Right now there are three types of rules:
TokenRule: matches a particular token
SequenceRule: matches an ordered list of rules
GroupRule: matches any rule from a list
For example, let's say we have the TokenRule 'varAccess', which matches token NAME (roughly /[A-Za-z][A-Za-z0-9_]*/), and the SequenceRule 'assignment', which matches [expression, TokenRule(EQUALS), expression].
Expression is a GroupRule matching either 'assignment' or 'varAccess' (the actual ruleset I'm testing with is a bit more complete, but that'll do for the example).
But now let's say I want to parse
var1 = var2
And let's say the parser begins with rule Expression (the order in which rules are defined shouldn't matter - priorities will be solved later). And let's say the GroupRule expression will first try 'assignment'. Then, since 'expression' is the first rule to be matched in 'assignment', it will try to parse an expression again, and so on, until the stack fills up and the computer - as expected - simply gives up in a sparkly segfault.
So what I did is: SequenceRules add themselves as 'leafs' to their first rule, and become non-root rules. Root rules are rules that the parser will try first. When one of those is applied and matches, it tries to sub-apply each of its leafs, one by one, until one matches. Then it tries the leafs of the matching leaf, and so on, until nothing matches anymore.
So that it can parse expressions like
var1 = var2 = var3 = var4
Just right =) Now the interesting stuff. This code:
var1 = (var2 + var3)
Won't parse. What happens is: var1 gets parsed (varAccess), assign is sub-applied, it looks for an expression, tries 'parenthesis', begins, looks for an expression after the '(', finds var2, and then chokes on the '+' because it was expecting a ')'.
Why doesn't it match 'var2 + var3'? (And yes, there's an 'add' SequenceRule, before you ask.) Because 'add' isn't a root rule (to avoid infinite recursion with the parse-expression-beginning-with-expression-etc.) and leafs aren't tested in SequenceRules, otherwise it would parse things like
reader readLine() println()
as
reader (readLine() println())
(e.g. '1 = 3' is the expression expected by add, the leaf of varAccess a)
whereas we'd like it to be left-associative, e.g. parsing as
(reader readLine()) println()
So anyway, now we've got this problem: we should be able to parse expressions such as '1 + 2' within SequenceRules. What to do? Add a special case so that when a SequenceRule begins with a TokenRule, the GroupRules it contains are tested for leafs? Would that even make sense outside that particular example? Or should one be able to specify, in each element of a SequenceRule, whether it should be tested for leafs or not? Tell me what you think (other than throw away the whole system - that'll probably happen in a few months anyway).
P.S.: Please, pretty please, don't answer something like "go read this 400-page book or you don't even deserve our time". If you feel the need to - just restrain yourself and go bash on Reddit. Okay? Thanks in advance.
LL(k) parsers (top down recursive, whether automated or written by hand) require refactoring of your grammar to avoid left recursion, and often require special specifications of lookahead (e.g. ANTLR) to be able to handle k-token lookahead. Since grammars are complex, you get to discover k by experimenting, which is exactly the thing you wish to avoid.
YACC/LALR(1) grammars avoid the problem of left recursion, which is a big step forward. The bad news is that there are no real programming languages (other than Wirth's original Pascal) that are LALR(1). Therefore you get to hack your grammar to change it from LR(k) to LALR(1), again forcing you to suffer the experiments that expose the strange cases, and to hack the grammar reduction logic to try to handle k-lookaheads when the parser generators (YACC, BISON, ... you name it) produce 1-lookahead parsers.
GLR parsers (http://en.wikipedia.org/wiki/GLR_parser) allow you to avoid almost all of this nonsense. If you can write a context free parser, under most practical circumstances, a GLR parser will parse it without further effort. That's an enormous relief when you try to write arbitrary grammars. And a really good GLR parser will directly produce a tree.
BISON has been enhanced to do GLR parsing, sort of. You still have to write complicated logic to produce your desired AST, and you have to worry about how to handle failed parses and cleaning up/deleting their corresponding (failed) trees. The DMS Software Reengineering Toolkit provides standard GLR parsers for any context-free grammar, and automatically builds ASTs without any additional effort on your part; ambiguous trees are automatically constructed and can be cleaned up by post-parsing semantic analysis. We've used this to define 30+ language grammars, including C and C++ (which is widely thought to be hard to parse [and it is almost impossible to parse with YACC], but is straightforward with real GLR); see the C++ front-end parser and AST builder based on DMS.
Bottom line: if you want to write grammar rules in a straightforward way and get a parser to process them, use GLR parsing technology. Bison almost works. DMS really works.
My favourite parsing technique is to create a recursive-descent (RD) parser from a PEG grammar specification. They are usually very fast, simple, and flexible. One nice advantage is that you don't have to worry about separate tokenization passes, and worrying about squeezing the grammar into some LALR form is non-existent. PEG libraries are available for most popular languages.
Sorry, I know this falls into "throw away the whole system", but you are barely out of the gate with your problem, and switching to a PEG RD parser would just eliminate your headaches now.
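To make the suggestion concrete: the infinite recursion in the question comes from a rule trying to match 'expression' as its own first element. A PEG-style recursive-descent parser sidesteps that by writing left-recursive operators as a loop instead. A rough Python sketch, with all names invented for the example (and both operators treated as left-associative for brevity; real assignment would instead recurse on the right):

def parse_expression(tokens):
    # expression <- primary (("+" / "=") primary)*   -- a loop, not left recursion
    left = parse_primary(tokens)
    while tokens and tokens[0] in ("+", "="):
        op = tokens.pop(0)
        right = parse_primary(tokens)
        left = (op, left, right)          # left-associative by construction
    return left

def parse_primary(tokens):
    if tokens and tokens[0] == "(":
        tokens.pop(0)
        inner = parse_expression(tokens)
        assert tokens.pop(0) == ")", "expected ')'"
        return inner
    return tokens.pop(0)                  # an identifier such as var1

print(parse_expression(["var1", "=", "(", "var2", "+", "var3", ")"]))
# ('=', 'var1', ('+', 'var2', 'var3'))

This parses the troublesome var1 = (var2 + var3) with no special cases, because '+' inside the parentheses is handled by the same loop rather than by a separate non-root rule.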

Any reason I couldn't create a language supporting infix, postfix, and prefix functions, and more?

I've been mulling over creating a language that would be extremely well suited to creation of DSLs, by allowing definitions of functions that are infix, postfix, prefix, or even consist of multiple words. For example, you could define an infix multiplication operator as follows (where multiply(X,Y) is already defined):
a * b => multiply(a,b)
Or a postfix "squared" operator:
a squared => a * a
Or a C or Java-style ternary operator, which involves two keywords interspersed with variables:
a ? b : c => if a==true then b else c
Clearly there is plenty of scope for ambiguities in such a language, but if it is statically typed (with type inference), then most ambiguities could be eliminated, and those that remain could be considered a syntax error (to be corrected by adding brackets where appropriate).
Is there some reason I'm not seeing that would make this extremely difficult, impossible, or just a plain bad idea?
Edit: A number of people have pointed me to languages that may do this or something like this, but I'm actually interested in pointers to how I could implement my own parser for it, or problems I might encounter if doing so.
This is not too hard to do. You'll want to assign each operator a fixity (infix, prefix, or postfix) and a precedence. Make the precedence a real number; you'll thank me later. Operators of higher precedence bind more tightly than operators of lower precedence; at equal levels of precedence, you can require disambiguation with parentheses, but you'll probably prefer to permit some operators to be associative so you can write
x + y + z
without parentheses. Once you have a fixity, a precedence, and an associativity for each operator, you'll want to write an operator-precedence parser. This kind of parser is fairly simple to write; it scans tokens from left to right and uses one auxiliary stack. There is an explanation in the Dragon Book, but I have never found it very clear, in part because the Dragon Book describes a very general case of operator-precedence parsing. But I don't think you'll find it difficult.
Another case you'll want to be careful of is when you have
prefix (e) postfix
where prefix and postfix have the same precedence. This case also requires parentheses for disambiguation.
My paper Unparsing Expressions with Prefix and Postfix Operators has an example parser in the back, and you can download the code, but it's written in ML, so its workings may not be obvious to the amateur. But the whole business of fixity and so on is explained in great detail.
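For flavor, here's a small Python sketch of the fixity-plus-precedence idea described above (a recursive formulation rather than the explicit stack, and not the ML code from the paper; the operator tables are invented for the example):

PREFIX  = {"-": 70}                     # operator -> precedence
POSTFIX = {"squared": 80}
INFIX   = {"+": 20, "*": 30, "?": 5}    # '?' stands in for more exotic ops

def parse(tokens, min_prec=0):
    t = tokens.pop(0)
    if t in PREFIX:
        left = ("prefix", t, parse(tokens, PREFIX[t]))
    else:
        left = t                                     # an atom
    while tokens:
        t = tokens[0]
        if t in POSTFIX and POSTFIX[t] >= min_prec:
            tokens.pop(0)
            left = ("postfix", t, left)
        elif t in INFIX and INFIX[t] >= min_prec:
            tokens.pop(0)
            right = parse(tokens, INFIX[t] + 1)      # +1 => left-associative
            left = ("infix", t, left, right)
        else:
            break
    return left

print(parse(["a", "*", "b", "squared"]))
# ('infix', '*', 'a', ('postfix', 'squared', 'b'))

Because 'squared' has a higher precedence than '*' in this table, a * b squared parses as a * (b squared); order of operations falls out of the numbers you assign.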
What are you going to do about order of operations?
a * b squared
You might want to check out Scala which has a kind of unique approach to operators and methods.
Haskell has just what you're looking for.

Looking for a clear definition of what a "tokenizer", "parser" and "lexers" are and how they are related to each other and used?

I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program that will go through C source/header files to extract data declarations and definitions.
I have been looking for examples and can find some info, but I'm really struggling to grasp the underlying concepts like grammar rules, parse trees and abstract syntax trees, and how they interrelate. Eventually these concepts need to be stored in an actual program, but 1) what do they look like, and 2) are there common implementations?
I have been looking at Wikipedia on these topics and programs like Lex and Yacc, but having never gone through a compiler class (EE major) I am finding it difficult to fully understand what is going on.
A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines).
A lexer is basically a tokenizer, but it usually attaches extra context to the tokens -- this token is a number, that token is a string literal, this other token is an equality operator.
A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the (usually) program represented by the original text.
Last I checked, the best book on the subject was "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".
Example:
int x = 1;
A lexer or tokeniser will split that up into tokens 'int', 'x', '=', '1', ';'.
A parser will take those tokens and use them to understand in some way:
we have a statement
it's a definition of an integer
the integer is called 'x'
'x' should be initialised with the value 1
I would say that a lexer and a tokenizer are basically the same thing, and that they smash the text up into its component parts (the 'tokens'). The parser then interprets the tokens using a grammar.
I wouldn't get too hung up on precise terminological usage though - people often use 'parsing' to describe any action of interpreting a lump of text.
(Adding to the given answers:)
A tokenizer will also remove any comments, and will only return tokens to the lexer.
A lexer will also define scopes for those tokens (variables/functions).
The parser will then build the code/program structure.
Using:
"Compilers: Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, AKA the Purple Dragon Book
a related answer of mine: What is the difference between a token and a lexeme?
As with my other answer, such questions make more sense when a specific goal is given.
In your case the specific goal is
Create a program that will go through C source/header files to extract data declarations and definitions.
If the goal is to create Abstract Syntax Trees (ASTs), then those are created using a parser, and a parser is commonly fed a list of tokens from the lexer. Notice that "tokenizer" is deliberately not mentioned.
Another way to think of the relation between a lexer and a parser is that a lexer creates a linear structure (a list/stream of tokens) and a parser converts the tokens into a tree structure (an Abstract Syntax Tree).
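Concretely, reusing the int x = 1; example from earlier (the node shapes here are invented for illustration):

# what the lexer hands to the parser: a flat, ordered list
tokens = ["int", "x", "=", "1", ";"]

# what the parser builds from it: nesting that encodes the structure
ast = ("declaration",
       {"type": "int", "name": "x",
        "initializer": ("literal", 1)})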
If you read the Dragon book you will notice that the word Analysis appears often which is to say that analysis is one of the key functions at the various stages. This is because when working with Lexers and Parsers they are designed to work with formal languages and a determination needs to be made if the input adheres to the formal language.
From page 5
character stream
       |
       V
Lexical Analyzer
       |
 (token stream)
       |
       V
Syntax Analyzer
       |
 (syntax tree)
       |
       V
Semantic Analyzer
       |
 (syntax tree)
       |
       V
      ...
In the above diagram the Lexer is associated with Lexical Analyzer and I would associate Syntax Analyzer and Semantic Analyzer with Parser but YMMV.
AFAIK, "tokenizer" has no official definition in the Dragon Book; it's not even noted in the index. I don't have an electronic copy of the book, so I could not do an automated search.
One common reference that notes Tokenizer is Anatomy of a Compiler but the Dragon books are the reference of choice by many in the field.
However if your only goal is to create a list of tokens and then do something else other than semantic analysis then calling the module/function/... a tokenizer might be the right name.
I use Lexer with Parser and don't use Tokenizer with Parser.
Another thought to keep in mind is that no useful information should be lost in the transformations. In other words, if one of your goals is to be able to recreate the input from the AST, then the AST needs to capture extraneous information like whitespace, which means the lexer also needs to capture that extraneous information. One reason to go through such effort is to create useful error messages, or for edit-and-continue debugging.
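A sketch of what "not losing information" means at the token level: each token can carry the whitespace and comments that preceded it, plus its source position (a common pattern, sometimes called full-fidelity or trivia-preserving lexing; the field names here are just illustrative):

from dataclasses import dataclass

@dataclass
class Token:
    kind: str        # e.g. "T_IDENTIFIER"
    text: str        # the exact lexeme, escapes and all
    trivia: str      # whitespace/comments preceding this token, kept verbatim
    line: int        # position, for error messages and debugging
    column: int

# joining trivia + text of every token in order reproduces the input exactly,
# so an AST that keeps the tokens at its leaves can recreate the source.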
