To compile the syntactic portion of a language, we need to parse it lexically and syntactically.
Then comes a step that is not trivial: adding semantics.
What I do is use DCGs with some predicates in Prolog for the first two steps, but now I want to know how it is done for the semantics.
How should I approach this task?
Should I keep it separate from the parser or mix it in with the grammar rules?...
EDIT :
Usually, the semantics of a simple expression can be written with a DCG as:
expr(Z) --> term(X), "+", expr(Y), { Z is X + Y }.
expr(Z) --> term(Z).
But if the language is based on logical formulas, as in the B-Method, the problem becomes more complicated, and what I need are some tips to overcome this.
Sorry about any incorrect expressions.
I would very much appreciate your opinions.
The grammar S -> a S a | a a generates all even-length strings of a's. We can devise a recursive-descent parser with backtracking for this grammar.
If we choose to expand by production S -> aa first, then we shall only recognize the string aa. Thus, any reasonable recursive-descent parser will try S -> aSa first.
Show that this recursive-descent parser recognizes inputs aa, aaaa, and aaaaaaaa, but not aaaaaa.
The parser will first try to invoke match(a); S(); match(a); rather than match(a); match(a); as described in the problem. Note that when you recursively invoke S() inside the block match(a); S(); match(a);, you have only invoked match(a) once, so the 'a' at the end has not yet been consumed.
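To make this concrete, here is a minimal Python sketch (the helper names are mine, not part of the exercise) of such a naive parser. Each call to S commits to the first alternative that succeeds and never revisits that choice:

def S(s, pos):
    # Try to match S starting at pos; return the end position of the first
    # successful match, or None. A committed choice is never retried.
    # Try S -> a S a first, as the problem prescribes.
    if pos < len(s) and s[pos] == 'a':
        mid = S(s, pos + 1)
        if mid is not None and mid < len(s) and s[mid] == 'a':
            return mid + 1
    # Fall back to S -> a a.
    if pos + 1 < len(s) and s[pos] == 'a' and s[pos + 1] == 'a':
        return pos + 2
    return None

def parse(s):
    return S(s, 0) == len(s)   # S must consume the whole input

for n in (2, 4, 6, 8):
    print(n, parse('a' * n))   # True, True, False, True

On aaaaaa the parser matches S against only the first four a's (a S a with the inner S matching aa), and since none of the committed choices can be revised to absorb the remaining two a's, the parse fails.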
I'll paraphrase myself from this other answer.
It's actually a property of the "singleton match strategy" usual in naive implementations of recursive descent parsers with backtracking.
Quoting this answer by Dr Adrian Johnstone:
The trick here is to realise that many backtracking parsers use what we call a singleton match strategy, in which as soon as the parse function for a rule finds a match, it returns. In general, parse functions need to return a set of putative matches. Just try working it through, and you'll see that a singleton match parser misses some of the possible derivations.
Also, the images available in this answer will help you visualize what's happening in the case you exemplified.
In my view, the first requirement for a recursive-descent parser is that the given grammar should be neither left-recursive nor nondeterministic. The grammar in the question is nondeterministic, so we need to convert it using left factoring; we get S -> aS', S' -> Sa | a, and this will work, I guess.
I'm looking for a mature parser library, either for Scala or Haskell.
The most important point is that the library can handle ambiguity.
If an expression is ambiguous, I want every possible abstract syntax tree, that matches the expression.
Simple example: The expression a ⊗ b ⊗ c can be seen as (a ⊗ b) ⊗ c or a ⊗ (b ⊗ c), and I need both variants.
Thanks!
I feel like the old guy for remembering when Wadler's papers like Comprehending Monads (the precursor to do notation) were exciting and new. The idea is that you (to quote) replace a failure by a list of successes, meaning you maintain a list of all the possible parses. At the end you normally just take the first match, but with this setup, you can take all of them.
These aren't all that efficient for a deterministic parser, which is why they're less in fashion, but they are what you need.
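As a sketch of the idea in Python (illustrative combinators of my own, not the polyparse API): a parser is a function from an input string to the list of all (result, remaining input) pairs, so ambiguity just means the list has more than one element:

def char(c):
    def p(s):
        return [(c, s[1:])] if s[:1] == c else []
    return p

def many(p):
    def q(s):
        results = [([], s)]   # zero repetitions always succeed
        for x, rest in p(s):
            for xs, rest2 in many(p)(rest):
                results.append(([x] + xs, rest2))
        return results
    return q

def pair(p1, p2):
    def q(s):
        return [((x, y), rest2)
                for x, rest in p1(s)
                for y, rest2 in p2(rest)]
    return q

# Every way of splitting "aaaa" between the two repetitions is kept:
for (a, b), rest in pair(many(char('a')), many(char('a')))('aaaa'):
    if rest == '':
        print(a, b)   # five complete parses, from 0+4 to 4+0 'a's

The Haskell testP below is the same check expressed against a real library.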
Have a look at polyparse, and in particular Text.ParserCombinators.HuttonMeijer and Text.ParserCombinators.HuttonMeijerWallace.
(Hutton & Meijer translated the parser library to Haskell (from Gofer) and Wallace added extra features.)
Make sure you check it out on simple cases like parsing "aaaa" with
testP = do
  a <- many $ char 'a'
  b <- many $ char 'a'
  return (a,b)
to see if it has the semantics you seek.
You asked for mature. These libraries are part of pure functional programming's heritage! Having said that, I'd call parsec more mature, even though it's younger.
(Speculation: I don't think parsec can do what you want. Its standard choice combinator is deterministic. I haven't looked into tweaking or replacing that behaviour, and I wouldn't want to, I'm afraid.)
This question immediately reminded me of the Yacc is dead / No, it's not debate from the end of 2010. The authors of the Yacc is dead paper provide a library in Scala (unmaintained), Haskell and Racket. In the Yacc is alive response, Russ Cox points out that the code runs in exponential time for ambiguous grammars.
It's well-known that it is possible to parse ambiguous grammars in O(n^3), although obviously it can take exponential time to enumerate all the parse trees in the case that there are exponentially many of them -- and there will be in the case of x1 + x2 + x3 ... + xn. bison implements the GLR algorithm which does so; unfortunately, while bison is certainly mature (if not actually moribund), it is written neither in Haskell nor in Scala.
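For a feel of the numbers: under the ambiguous rule E -> E + E | x, the expression x1 + x2 + ... + xn has one parse tree per bracketing, which is the Catalan number C(n-1). A quick Python check:

from math import comb

def catalan(n):
    # C(n) = (2n choose n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

for n in (3, 5, 10, 20):
    print(n, "operands:", catalan(n - 1), "parse trees")
# 3 operands: 2 parse trees ... 20 operands: 1767263190 parse trees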
Daniel Spiewak implemented a GLL parser in Scala IIRC, but last time I looked at it, it suffered from some performance issues. So I'm not sure that it could be described as mature, either.
I can't speak to how mature it is or give you any usage examples, but I've had the Scala gll-combinators library open in a tab for a few days. It handles ambiguous grammars and looks pretty nifty.
In the end, the choice fell on the Syntax Definition Formalism (SDF2), with an SDF table generator here and JSGLR as the parser.
I'm making my own JavaScript-based programming language (yeah, it's crazy, but it's for learning only... maybe?). Well, I'm reading about parsers, and the first pass is to convert the source code to tokens, like:
if(x > 5)
    return true;
The tokenizer turns it into:
T_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_RPAREN ")"
T_IDENTIFIER "return"
T_TRUE "true"
T_TERMINATOR ";"
I don't know if my logic is correct so far. In my parser it gets even better (or not?), and I translate it to this (yeah, a multidimensional array):
T_IF "if"
T_EXPRESSION ...
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_CLOSURE ...
T_IDENTIFIER "return"
T_TRUE "true"
I have some doubts:
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of being interpreted all the time.
After I tokenize, what exactly do I need to do? I'm really lost at this step!
Are there any good tutorials to learn how I can do it?
Well, that's it. Bye!
Generally, you want to separate the functions of the tokeniser (also called a lexer) from other stages of your compiler or interpreter. The reason for this is basic modularity: each pass consumes one kind of thing (e.g., characters) and produces another one (e.g., tokens).
So you’ve converted your characters to tokens. Now you want to convert your flat list of tokens to meaningful nested expressions, and this is what is conventionally called parsing. For a JavaScript-like language, you should look into recursive descent parsing. For parsing expressions with infix operators of different precedence levels, Pratt parsing is very useful, and you can fall back on ordinary recursive descent parsing for special cases.
Just to give you a more concrete example based on your case, I’ll assume you can write two functions: accept(token) and expect(token), which test the next token in the stream you’ve created. You’ll make a function for each type of statement or expression in the grammar of your language. Here’s Pythonish pseudocode for a statement() function, for instance:
def statement():
    if accept("if"):              # if <expression> <statement>
        x = expression()
        y = statement()
        return IfStatement(x, y)
    elif accept("return"):        # return <expression>
        x = expression()
        return ReturnStatement(x)
    elif accept("{"):             # { <statement> ; <statement> ; ... }
        xs = []
        while True:
            xs.append(statement())
            if not accept(";"):   # statements are separated by semicolons
                break
        expect("}")
        return Block(xs)
    else:
        error("Invalid statement!")
This gives you what’s called an abstract syntax tree (AST) of your program, which you can then manipulate (optimisation and analysis), output (compilation), or run (interpretation).
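For completeness, here is one minimal way (names and details are illustrative, not prescribed above) to implement the assumed accept and expect helpers over a flat token list produced by your tokenizer:

tokens = ["if", "x", "return", "y", ";"]   # placeholder tokenizer output
pos = 0

def accept(token):
    # Consume the next token if it matches; report whether it did.
    global pos
    if pos < len(tokens) and tokens[pos] == token:
        pos += 1
        return True
    return False

def expect(token):
    # Like accept, but a mismatch is a syntax error.
    if not accept(token):
        error("Expected %r" % token)

def error(message):
    raise SyntaxError(message)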
Most toolkits split the complete process into two separate parts:
lexer (aka. tokenizer)
parser (aka. grammar)
The tokenizer will split the input data into tokens. The parser will only operate on the token "stream" and build the structure.
Your question seems to be focused on the tokenizer. But your second solution mixes the grammar parser and the tokenizer into one step. Theoretically this is also possible, but for a beginner it is much easier to do it the same way most other tools/frameworks do: keep the steps separate.
To your first solution: I would tokenize your example like this:
T_KEYWORD_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_LITERAL "5"
T_RPAREN ")"
T_KEYWORD_RET "return"
T_KEYWORD_TRUE "true"
T_TERMINATOR ";"
In most languages keywords cannot be used as method names, variable names and so on. This is reflected already on the tokenizer level (T_KEYWORD_IF, T_KEYWORD_RET, T_KEYWORD_TRUE).
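Here is a tiny regex-based tokenizer sketch in Python (the token names match the list above; the code itself is illustrative, not taken from any framework), showing how keywords win over identifiers at the lexer level:

import re

TOKEN_SPEC = [
    ("T_KEYWORD_IF",   r"if\b"),
    ("T_KEYWORD_RET",  r"return\b"),
    ("T_KEYWORD_TRUE", r"true\b"),
    ("T_IDENTIFIER",   r"[A-Za-z_]\w*"),   # tried after the keywords, so "if" never lands here
    ("T_LITERAL",      r"\d+"),
    ("T_LPAREN",       r"\("),
    ("T_RPAREN",       r"\)"),
    ("T_GT",           r">"),
    ("T_TERMINATOR",   r";"),
    ("SKIP",           r"\s+"),            # whitespace produces no token
]
MASTER = re.compile("|".join("(?P<%s>%s)" % spec for spec in TOKEN_SPEC))

def tokenize(source):
    # Note: characters matching no rule are silently skipped in this sketch.
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

print(list(tokenize("if(x > 5) return true;")))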
The next level would take this stream and, by applying a formal grammar, build some data structure (often called an AST, an Abstract Syntax Tree) which might look like this:
IfStatement:
    Expression:
        BinaryOperator:
            Operator: T_GT
            LeftOperand:
                IdentifierExpression:
                    "x"
            RightOperand:
                LiteralExpression:
                    5
    IfBlock:
        ReturnStatement:
            ReturnExpression:
                LiteralExpression:
                    "true"
    ElseBlock: (empty)
Parsing is usually done using some framework; implementing something like that by hand, and efficiently, is usually done at a university in the better part of a semester. So you really should use some kind of framework.
The input for a grammar parser framework is usually a formal grammar in some kind of BNF. Your "if" part might look like this:
IfStatement: T_KEYWORD_IF T_LPAREN Expression T_RPAREN Statement ;
Expression: LiteralExpression | BinaryExpression | IdentifierExpression | ... ;
BinaryExpression: LeftOperand BinaryOperator RightOperand;
....
That's only to give you the idea. Parsing a real-world language like JavaScript correctly is not an easy task. But fun.
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of being interpreted all the time.
What's the original way? There are many different ways to implement languages. I think yours is actually fine; I once tried to build a language myself that translated to C#, the hack programming language. Many language compilers translate to an intermediate language; it's quite common.
After I tokenize, what exactly do I need to do? I'm really lost at this step!
After tokenizing, you need to parse it. Use a good lexer/parser framework, such as Boost.Spirit, or Coco, or whatever; there are hundreds of them. Or you can implement your own lexer, but that takes time and resources. There are many ways to parse code; I generally rely on recursive descent parsing.
Next you need to do code generation. That's the most difficult part in my opinion. There are tools for that too, but you can do it manually if you want to. I tried to do it in my project, but it was pretty basic and buggy; there's some helpful code here and here.
Are there any good tutorials to learn how I can do it?
As I suggested earlier, use tools to do it. There are a lot of pretty good, well-documented parser frameworks. For further information, you can try asking people who know about this stuff. #DeadMG, over at the Lounge C++, is building a programming language called "Wide". You may try consulting him.
Let's say I have this statement in a programming language:
if (0 < 1) then
print("Hello")
The lexer will translate it into:
keyword: if
num: 0
op: <
num: 1
keyword: then
keyword: print
string: "Hello"
The parser will then take the information (aka "Token Stream") and make this:
if:
    expression:
        <:
            0, 1
    then:
        print:
            "Hello"
I don't know if this will help or not, but I hope it does.
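If it helps further, here is a toy Python rendition (entirely illustrative, handling just this one statement shape) that turns the token stream above into the nested structure:

tokens = [("keyword", "if"), ("num", "0"), ("op", "<"), ("num", "1"),
          ("keyword", "then"), ("keyword", "print"), ("string", '"Hello"')]

def parse_if(ts):
    # if <num> <op> <num> then <name> <string>
    assert ts[0] == ("keyword", "if") and ts[4] == ("keyword", "then")
    condition = {ts[2][1]: [ts[1][1], ts[3][1]]}   # {"<": ["0", "1"]}
    body = {ts[5][1]: ts[6][1]}                    # {"print": '"Hello"'}
    return {"if": {"expression": condition, "then": body}}

print(parse_if(tokens))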
I'm trying to learn how to make a compiler. In order to do so, I have read a lot about context-free languages, but there are some things I cannot yet figure out by myself.
Since it's my first compiler, there are some practices I'm not aware of. My questions are asked with building a parser generator in mind, not a compiler nor a lexer. Some questions may be obvious.
Among my reads are: Bottom-Up Parsing, Top-Down Parsing, Formal Grammars. The picture shown comes from: Miscellaneous Parsing. All come from the Stanford CS143 class.
Here are the points:
0) How do (ambiguous/unambiguous) and (left-recursive/right-recursive) properties influence the need for one algorithm or another? Are there other ways to classify a grammar?
1) An ambiguous grammar is one that has several parse trees. But shouldn't the choice of a leftmost derivation or a rightmost derivation lead to a unique parse tree?
[EDIT: Answered here ]
2.1) But still, is the ambiguity of the grammar related to k? I mean, given an LR(2) grammar, is it ambiguous for an LR(1) parser and unambiguous for an LR(2) one?
[EDIT: No, it's not. An LR(2) grammar means that the parser will need two tokens of lookahead to choose the right rule to use. An ambiguous grammar, on the other hand, is one that can lead to several parse trees.]
2.2) So an LR(*) parser, if you can imagine one, would have no ambiguous grammars at all and could then parse the entire set of context-free languages?
[EDIT: Answered by Ira Baxter; LR(*) is less powerful than GLR, in that it can't handle multiple parse trees.]
3) Depending on the previous answers, what follows may be self-contradictory. Considering LR parsing, do ambiguous grammars trigger shift-reduce conflicts? Can an unambiguous grammar trigger one too? In the same way, what about reduce-reduce conflicts?
[EDIT: That's it: ambiguous grammars lead to shift-reduce and reduce-reduce conflicts. By contraposition, if there are no conflicts, the grammar is unambiguous.]
4) The ability to parse left-recursive grammars is an advantage of LR(k) parsers over LL(k); is that the only difference between them?
[EDIT: Yes.]
5) Given G1:
G1:
S -> S + S
S -> S - S
S -> a
5.1) G1 is left-recursive, right-recursive, and ambiguous, am I right? Is it an LR(2) grammar? One way to make it unambiguous:
G2:
S -> S + a
S -> S - a
S -> a
5.2) Is G2 still ambiguous? Does a parser for G2 need two tokens of lookahead? By factorisation we have:
G3:
S -> S V
V -> + a
V -> - a
S -> a
5.3) Now, does a parser for G3 need only one token of lookahead? What are the trade-offs of these transformations? Is LR(1) the minimal parser required?
5.4) G1 is left-recursive; in order to parse it with an LL parser, one needs to transform it into a right-recursive grammar:
G4:
S -> a + S
S -> a - S
S -> a
then
G5:
S -> a V
V -> - V
V -> + V
V -> a
5.5) Does G4 need at least an LL(2) parser? Is G5 the only one parsable by an LL(1) parser? G1-G5 all define the same language, namely ( a ((+|-) a)^n ). Is that true?
5.6) For each grammar G1 to G5, what is the smallest class to which it belongs?
6) Finally, since many different grammars may define the same language, how does one choose the grammar and the associated parser? Is the resulting parse tree important? What is the influence of the parse tree?
I'm asking a lot, and I don't really expect a complete answer; in any case, any help would be very much appreciated.
Thanks for reading!
"Many grammars may define the same langauge, how does one choose..."?
Usually, you choose the one that meets the following criteria:
conceptually as simple as you can make it (implication: smaller than others)
tracks the terminology in the language reference manual where possible
least amount of bending to meet the constraints of your parser generator
That last one can make a mess of your conceptual simplicity, and your chart of various parser styles shows the number of different issues you face depending on your choice of generator. This is aggravated by the fact that the choice is often made well before you actually choose the grammar.
One way to minimize grammar bending is to choose a parser generator which handles fully context-free grammars. GLR parsing has this very significant advantage. I've been using it for 15 years and have done dozens of real languages with it.
Every undergraduate Intro to Compilers course reviews the commonly implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that for any given k, each of those grammar classes is a subset of the next.
What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?
Two questions:
What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
In what ways, if any, do those constructions manifest as desirable language constructs?
There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.
edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:
When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
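The equivalence is easy to sanity-check in code; a quick Python illustration, with x = "ab" standing in for the big hairy expression:

import re

plus = re.compile(r"(?:ab)+\Z")     # x+
star = re.compile(r"ab(?:ab)*\Z")   # xx*

for s in ("", "ab", "abab", "ababab", "aba", "ba"):
    assert bool(plus.match(s)) == bool(star.match(s))
print("x+ and xx* agree on all samples")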
Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?
Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.
My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.
To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)
In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.
The difference between LL and LR is primarily in the lookahead mechanism. People generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar definition with S as the starting symbol:
A -> Ax | x
B -> Ay
C -> Az
S -> B | C
When k is a small fixed value, parsing a string like xxxxxxy is a task better suited to an LR parser: an LL(k) parser must commit to B or C before consuming the run of x's, yet the distinguishing y or z may be arbitrarily far ahead, whereas an LR parser can simply shift the x's and decide when it reaches the last symbol. However, these days popular LL parsers such as ANTLR do not restrict k to such small values, and most people no longer care.
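A hand-written recognizer (illustrative only, not generated from the grammar) makes the difference visible: it can shift the whole run of x's first and look at the final symbol before deciding between B and C, a decision an LL(k) parser must make within its first k tokens:

def recognize(s):
    # A -> Ax | x : consume the leading run of x's
    i = 0
    while i < len(s) and s[i] == 'x':
        i += 1
    # B -> Ay vs C -> Az : decide only now, at the last symbol
    return i >= 1 and i == len(s) - 1 and s[i] in ('y', 'z')

print(recognize('xxxxxxy'))   # True
print(recognize('xz'))        # True
print(recognize('y'))         # False: A requires at least one x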
I hope this is more or less in line with your question. Of course, Knuth showed that every deterministic context-free language can be recognized by some LR(1) grammar. However, in practice we are also concerned with translation.
As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.
This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left to right. But consider the pattern below:
. . . . . | . . . . .
I rather expect that with short patterns such as this one, people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather process the pattern in parallel, or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that an LL/LR parser employs.
Furthermore, if we can describe any deterministic context-free language using an LR(1) grammar, then it is clear that simply recognizing a string is not the same as "understanding" it.
Well, for one, left-recursive definitions are impossible in LL(k) grammars (as far as I know); I don't know about other classes. This doesn't make it impossible to define other things, just a massive pain to do them otherwise. For instance, putting expressions together can be easy in a left-recursive grammar (in pseudocode):
rule expression = other rules
                | expression
                | '(' expression ')' ;
As far as syntactically useful things that can be made with left recursion: do simpler grammars count as syntactically useful?
The capabilities of a language are not limited by its syntax and grammar.
It's possible to define any language feature with an LL(k) grammar; it just might not be very readable to humans.