Haskell/Parsec: How do you use the functions in Text.Parsec.Indent? - parsing

I'm having trouble working out how to use any of the functions in the Text.Parsec.Indent module provided by the indents package for Haskell, which is a sort of add-on for Parsec.
What do all these functions do? How are they to be used?
I can understand the brief Haddock description of withBlock, and I've found examples of how to use withBlock, runIndent and the IndentParser type here, here and here. I can also understand the documentation for the four parsers indentBrackets and friends. But many things are still confusing me.
In particular:
What is the difference between withBlock f a p and
do aa <- a
   pp <- block p
   return (f aa pp)
Likewise, what's the difference between withBlock' a p and do {a; block p}
In the family of functions indented and friends, what is ‘the level of the reference’? That is, what is ‘the reference’?
Again, with the functions indented and friends, how are they to be used? With the exception of withPos, it looks like they take no arguments and are all of type IParser () (IParser defined like this or this) so I'm guessing that all they can do is to produce an error or not and that they should appear in a do block, but I can't figure out the details.
I did at least find some examples on the usage of withPos in the source code, so I can probably figure that out if I stare at it for long enough.
<+/> comes with the helpful description “<+/> is to indentation sensitive parsers what ap is to monads” which is great if you want to spend several sessions trying to wrap your head around ap and then work out how that's analogous to a parser. The other three combinators are then defined with reference to <+/>, making the whole group unapproachable to a newcomer.
Do I need to use these? Can I just ignore them and use do instead?
The ordinary lexeme combinator and whiteSpace parser from Parsec will happily consume newlines in the middle of a multi-token construct without complaining. But in an indentation-style language, sometimes you want to stop parsing a lexical construct or throw an error if a line is broken and the next line is indented less than it should be. How do I go about doing this in Parsec?
In the language I am trying to parse, ideally the rules for when a lexical structure is allowed to continue on to the next line should depend on what tokens appear at the end of the first line or the beginning of the subsequent line. Is there an easy way to achieve this in Parsec? (If it is difficult then it is not something which I need to concern myself with at this time.)

So, the first hint is to take a look at IndentParser
type IndentParser s u a = ParsecT s u (State SourcePos) a
I.e. it's a ParsecT keeping an extra close watch on SourcePos, an abstract container which can be used to access, among other things, the current column number. So, it's probably storing the current "level of indentation" in SourcePos. That'd be my initial guess as to what "level of reference" means.
In short, indents gives you a new kind of Parsec which is context sensitive—in particular, sensitive to the current indentation. I'll answer your questions out of order.
(2) The "level of reference" is the position, stored in the parser's State, where the parser currently believes this indentation level starts. To make that clearer, let me give some test cases under (3).
(3) In order to start experimenting with these functions, we'll build a little test runner. It'll run the parser with a string that we give it and then unwrap the inner State part using an initialPos which we get to modify. In code
import Text.Parsec
import Text.Parsec.Pos
import Text.Parsec.Indent
import Control.Monad.State
testParse :: (SourcePos -> SourcePos)
          -> IndentParser String () a
          -> String -> Either ParseError a
testParse f p src = fst $ flip runState (f $ initialPos "") $ runParserT p () "" src
(Note that this is almost runIndent, except I gave a backdoor to modify the initialPos.)
Now we can take a look at indented. By examining the source, I can tell it does two things. First, it'll fail if the current SourcePos column number is less-than-or-equal-to the "level of reference" stored in the SourcePos stored in the State. Second, it somewhat mysteriously updates the State SourcePos's line counter (not column counter) to be current.
Only the first behavior is important, to my understanding. We can see the difference here.
>>> testParse id indented ""
Left (line 1, column 1): not indented
>>> testParse id (spaces >> indented) " "
Right ()
>>> testParse id (many (char 'x') >> indented) "xxxx"
Right ()
So, in order to have indented succeed, we need to have consumed enough whitespace (or anything else!) to push our column position out past the "reference" column position. Otherwise, it'll fail saying "not indented". Similar behavior exists for the next three functions: same fails unless the current position and reference position are on the same line, sameOrIndented fails if the current column is strictly less than the reference column, unless they are on the same line, and checkIndent fails unless the current and reference columns match.
withPos is slightly different. It's not just an IndentParser, it's an IndentParser-combinator: it transforms the input IndentParser into one that thinks the "reference column" (the SourcePos in the State) is exactly where it was when we called withPos.
This gives us another hint, btw. It lets us know we have the power to change the reference column.
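To make the "reference position" idea concrete outside of Haskell, here is a minimal model in Python. It is only a sketch of the checks as described above, not the indents library itself: the names Pos, indented, same, check_indent and block_ok are made up for illustration, and real parsers would of course consume input as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pos:
    """A source position: 1-based line and column (hypothetical model type)."""
    line: int
    column: int


def indented(cur: Pos, ref: Pos) -> bool:
    # Succeeds only when the current column is strictly past the reference column.
    return cur.column > ref.column


def same(cur: Pos, ref: Pos) -> bool:
    # Succeeds only when current and reference positions are on the same line.
    return cur.line == ref.line


def check_indent(cur: Pos, ref: Pos) -> bool:
    # Succeeds only when the current and reference columns match exactly.
    return cur.column == ref.column


def block_ok(starts: list) -> bool:
    # Models "withPos, then at least one item, each passing checkIndent":
    # the first item's position becomes the reference for all the others.
    if not starts:
        return False
    ref = starts[0]
    return all(check_indent(cur, ref) for cur in starts)
```

With the reference at line 1, column 1, a parser that has consumed nothing is "not indented", while one that has moved to column 2 is, which matches the testParse runs above.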
(1) So now let's take a look at how block and withBlock work using our new, lower level reference column operators. withBlock is implemented in terms of block, so we'll start with block.
-- simplified from the actual source
block p = withPos $ many1 (checkIndent >> p)
So, block resets the "reference column" to be whatever the current column is and then consumes at least one parse of p, so long as each one starts at exactly this newly set "reference column". Now we can take a look at withBlock
withBlock f a p = withPos $ do
  r1 <- a
  r2 <- option [] (indented >> block p)
  return (f r1 r2)
So, it resets the "reference column" to the current column, parses a single a parse, tries to parse an indented block of ps, then combines the results using f. Your implementation is almost correct, except that you need to use withPos to choose the correct "reference column".
Then, once you have withBlock, withBlock' = withBlock (\_ bs -> bs).
(5) So, indented and friends are exactly the tools for doing this: they'll cause a parse to immediately fail if it's indented incorrectly with respect to the "reference position" chosen by withPos.
(4) Yes, don't worry about these guys until you learn how to use Applicative style parsing in base Parsec. It's often a much cleaner, faster, simpler way of specifying parses. Sometimes they're even more powerful, but if you understand Monads then they're almost always completely equivalent.
(6) And this is the crux. The tools mentioned so far can only produce indentation failures if you can describe your intended indentation using withPos. As far as I can tell, it isn't possible to choose the withPos reference based on the success or failure of other parses, so you'll have to go a level deeper. Fortunately, the mechanism that makes IndentParsers work is obvious: it's just an inner State monad containing a SourcePos. You can use lift :: MonadTrans t => m a -> t m a to manipulate this inner state and set the "reference column" however you like.
Cheers!

Related

Starting a parser for scheme language

I am writing a basic parser for a Scheme interpreter and here are the definitions I have set up to define the various type of tokens:
# 1. Parens
Type:
PAREN
Subtype:
LEFT_PAREN
Value:
'('
# 2. Operators (<=, =, +, ...)
Type:
OPERATOR
Subtype:
EQUALS
Value:
'='
Arity:
2
# 3. Types (2.5, "Hello", #f, etc.)
Type:
DATA
Subtype:
NUMBER
Value:
2.4
# 4. Procedures, builtins, and such
Type:
KEYWORD
Subtype:
BUILTIN
Value:
"set"
Arity:
2
PROCEDURE:
... // probably need a new class for this
Does the above seem like it's a good starting place? Are there some obvious things I'm missing here, or does this give me a "good-enough" foundation?
Your approach makes distinctions which really don't exist in the syntax of the language, and also makes decisions far too early. For example consider this program:
(let ((x 1))
(with-assignment-notes
(set! x 2)
(set! x 3)
x))
When I run this:
> (let ((x 1))
(with-assignment-notes
(set! x 2)
(set! x 3)
x))
setting x to 2
setting x to 3
3
In order for this to work with-assignment-notes has to somehow redefine what (set! ...) means in its body. Here's a hacky and probably incorrect (Racket) implementation of that:
(define-syntax with-assignment-notes
  (syntax-rules (set!)
    [(_ form ...)
     (let-syntax ([rewrite/maybe
                   (syntax-rules (set!)
                     [(_ (set! var val))
                      (let ([r val])
                        (printf "setting ~A to ~A~%" 'var r)
                        (set! var r))]
                     [(_ thing)
                      thing])])
       (rewrite/maybe form) ...)]))
So the critical features of any parser for a Lisp-family language are:
it should not make any decision about the semantics of the language that it can avoid making;
the structure it constructs must be available to the language itself as first-class objects;
(and optionally) the parser should be modifiable from the language itself.
As examples:
it is probably inevitable that the parser needs to make decisions about what is and is not a number and what sort of number it is;
it would be nice if it had default handling for strings, but this should ideally be controllable by the user;
it should make no decision at all about what, say (< x y) means but rather should return a structure representing it for interpretation by the language.
The reason for the last, optional, requirement is that Lisp-family languages are used by people who are interested in using them for implementing languages. Allowing the reader to be altered from within the language makes that hugely easier, since you don't have to start from scratch each time you want to make a language which is a bit like the one you started with but not completely.
Parsing Lisp
The usual approach to parsing Lisp-family languages is to have machinery which will turn a sequence of characters into a sequence of s-expressions consisting of objects which are defined by the language itself, notably symbols and conses (but also numbers, strings &c). Once you have this structure you then walk over it to interpret it as a program: either evaluating it on the fly or compiling it. Critically, you can also write programs which manipulate this structure itself: macros.
In 'traditional' Lisps such as CL this process is explicit: there is a 'reader' which turns a sequence of characters into a sequence of s-expressions, and macros explicitly manipulate the list structure of these s-expressions, after which the evaluator/compiler processes them. So in a traditional Lisp (< x y) would be parsed as (a cons of a symbol < and (a cons of a symbol x and (a cons of a symbol y and the empty list object)), or (< . (x . (y . ()))), and this structure gets handed to the macro expander and hence to the evaluator or compiler.
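As a rough illustration of a reader that decides as little as possible, here is a minimal sketch in Python. It is an assumption-laden toy, not how any real Lisp reader is written: the names Symbol, tokenize, read_from and read are invented here, Python lists stand in for cons chains, and strings, quotes and dotted pairs are not handled. The only decision it makes is what counts as a number; everything else comes back as a symbol inside a nested structure.

```python
class Symbol(str):
    """A symbol is just a name; the reader attaches no meaning to it."""


def tokenize(text):
    # Pad parentheses with spaces so that split() separates them from atoms.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def atom(token):
    # The one unavoidable decision: is this token a number or a symbol?
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    return Symbol(token)


def read_from(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    token = tokens.pop(0)
    if token == ")":
        raise SyntaxError("unbalanced )")
    if token != "(":
        return atom(token)
    items = []
    while True:
        if not tokens:
            raise SyntaxError("unexpected end of input")
        if tokens[0] == ")":
            tokens.pop(0)
            return items
        items.append(read_from(tokens))


def read(text):
    return read_from(tokenize(text))
```

Reading "(< x y)" yields a three-element list of symbols, and reading "(set! x 2)" yields two symbols and the integer 2. Nothing in the result says that < is a comparison or that set! is an assignment; that is left to whatever walks the structure afterwards.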
In Scheme it is a little more subtle: macros are specified (portably, anyway) in terms of rules which turn a bit of syntax into another bit of syntax, and it's not (I think) explicit whether such objects are made of conses & symbols or not. But the structure which is available to syntax rules needs to be as rich as something made of conses and symbols, because syntax rules get to poke around inside it. If you want to write something like the following macro:
(define-syntax with-silly-escape
  (syntax-rules ()
    [(_ (escape) form ...)
     (call/cc (λ (c)
                (define (escape) (c 'escaped))
                form ...))]
    [(_ (escape val ...) form ...)
     (call/cc (λ (c)
                (define (escape) (c val ...))
                form ...))]))
then you need to be able to look into the structure of what came from the reader, and that structure needs to be as rich as something made of lists and conses.
A toy reader: reeder
Reeder is a little Lisp reader written in Common Lisp that I wrote a little while ago for reasons I forget (but perhaps to help me learn CL-PPCRE, which it uses). It is emphatically a toy, but it is also small enough and simple enough to understand: certainly it is much smaller and simpler than the standard CL reader, and it demonstrates one approach to solving this problem. It is driven by a table known as a reedtable which defines how parsing proceeds.
So, for instance:
> (with-input-from-string (in "(defun foo (x) x)")
(reed :from in))
(defun foo (x) x)
Reeding
To read (reed) something using a reedtable:
look for the next interesting character, which is the next character not defined as whitespace in the table (reedtables have a configurable list of whitespace characters);
if that character is defined as a macro character in the table, call its function to read something;
otherwise call the table's token reader to read and interpret a token.
Reeding tokens
The token reader lives in the reedtable and is responsible for accumulating and interpreting a token:
it accumulates a token in ways known to itself (but the default one does this by just trundling along the string handling single (\) and multiple (|) escapes defined in the reedtable until it gets to something that is whitespace in the table);
at this point it has a string and it asks the reedtable to turn this string into something, which it does by means of token parsers.
There is a small kludge in the second step: as the token reader accumulates a token it keeps track of whether it is 'denatured' which means that there were escaped characters in it. It hands this information to the token parsers, which allows them, for instance, to interpret |1|, which is denatured, differently to 1, which is not.
Token parsers are also defined in the reedtable: there is a define-token-parser form to define them. They have priorities, so that the highest priority one gets to be tried first and they get to say whether they should be tried for denatured tokens. Some token parser should always apply: it's an error if none do.
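The priority scheme just described can be sketched in a few lines of Python. This is a hypothetical model, not reeder's code: TokenTable, define, remove and parse are names invented for the illustration, and plain regular expressions stand in for CL-PPCRE parse trees.

```python
import re
from fractions import Fraction


class TokenTable:
    """Hypothetical stand-in for the token-parser part of a reedtable."""

    def __init__(self):
        # Entries are (priority, name, regex, allow_denatured, build).
        self._parsers = []

    def remove(self, name):
        self._parsers = [entry for entry in self._parsers if entry[1] != name]

    def define(self, name, priority, pattern, build, denatured=False):
        # Replace any parser of the same name, then keep highest priority first.
        self.remove(name)
        self._parsers.append((priority, name, re.compile(pattern), denatured, build))
        self._parsers.sort(key=lambda entry: entry[0], reverse=True)

    def parse(self, token, denatured=False):
        for _priority, _name, regex, allow_denatured, build in self._parsers:
            if denatured and not allow_denatured:
                continue  # this parser declined to handle escaped tokens
            match = regex.fullmatch(token)
            if match:
                return build(match)
        raise ValueError(f"no token parser applies to {token!r}")


table = TokenTable()
table.define("integer", 2, r"[+-]?\d+", lambda m: int(m.group(0)))
table.define("rational", 1, r"([+-]?\d+)/(\d+)",
             lambda m: Fraction(int(m.group(1)), int(m.group(2))))
table.define("symbol", 0, r".*", lambda m: ("symbol", m.group(0)), denatured=True)
```

In this model table.parse("1") gives an integer, while table.parse("1", denatured=True) falls through to the symbol parser, which mirrors the |1| versus 1 distinction. Swapping the fallback is then a remove followed by a define at the same priority.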
The default reedtable has token parsers which can parse integers and rational numbers, and a fallback one which parses a symbol. Here is an example of how you would replace this fallback parser so that instead of returning symbols it returns objects called 'cymbals' which might be the representation of symbols in some embedded language:
Firstly we want a copy of the reedtable, and we need to remove the symbol parser from that copy (having previously checked its name using reedtable-token-parser-names).
(defvar *cymbal-reedtable* (copy-reedtable nil))
(remove-token-parser 'symbol *cymbal-reedtable*)
Now here's an implementation of cymbals:
(defvar *namespace* (make-hash-table :test #'equal))
(defstruct cymbal
name)
(defgeneric ensure-cymbal (thing))
(defmethod ensure-cymbal ((thing string))
(or (gethash thing *namespace*)
(setf (gethash thing *namespace*)
(make-cymbal :name thing))))
(defmethod ensure-cymbal ((thing cymbal))
thing)
And finally here is the cymbal token parser:
(define-token-parser (cymbal 0 :denatured t :reedtable *cymbal-reedtable*)
((:sequence
:start-anchor
(:register (:greedy-repetition 0 nil :everything))
:end-anchor)
name)
(ensure-cymbal name))
An example of this. Before modifying the reedtable:
> (with-input-from-string (in "(x y . z)")
(reed :from in :reedtable *cymbal-reedtable*))
(x y . z)
After:
> (with-input-from-string (in "(x y . z)")
(reed :from in :reedtable *cymbal-reedtable*))
(#S(cymbal :name "x") #S(cymbal :name "y") . #S(cymbal :name "z"))
Macro characters
If something isn't the start of a token then it's a macro character. Macro characters have associated functions and these functions get called to read one object, however they choose to do that. The default reedtable has two-and-a-half macro characters:
" reads a string, using the reedtable's single & multiple escape characters;
( reads a list or a cons.
) is defined to raise an exception, as it can only occur if there are unbalanced parens.
The string reader is pretty straightforward (it has a lot in common with the token reader although it's not the same code).
The list/cons reader is mildly fiddly: most of the fiddliness is dealing with consing dots which it does by a slightly disgusting trick: it installs a secret token parser which will parse a consing dot as a special object if a dynamic variable is true, but otherwise will raise an exception. The cons reader then binds this variable appropriately to make sure that consing dots are parsed only where they are allowed. Obviously the list/cons reader invokes the whole reader recursively in many places.
And that's all the macro characters. So, for instance in the default setup, ' would read as a symbol (or a cymbal). But you can just install a macro character:
(defvar *qr-reedtable* (copy-reedtable nil))
(setf (reedtable-macro-character #\' *qr-reedtable*)
      (lambda (from quote table)
        (declare (ignore quote))
        (values `(quote ,(reed :from from :reedtable table))
                (inch from nil))))
And now 'x will read as (quote x) in *qr-reedtable*.
Similarly you could add a more complicated macro character on # to read objects depending on their next character in the way CL does.
An example of the quote reader. Before:
> (with-input-from-string (in "'(x y . z)")
(reed :from in :reedtable *qr-reedtable*))
\'
The object it has returned is a symbol whose name is "'", and it didn't read beyond that of course. After:
> (with-input-from-string (in "'(x y . z)")
(reed :from in :reedtable *qr-reedtable*))
'(x y . z)
Other notes
Everything works one-character-ahead, so all of the various functions get the stream being read, the first character they should be interested in and the reedtable, and return both their value and the next character. This avoids endlessly unreading characters (and probably tells you what grammar class it can handle natively; obviously macro character parsers can do whatever they like so long as things are sane when they return).
It probably doesn't use anything which isn't moderately implementable in non-Lisp languages. Some notes:
Macros will cause pain in the usual way, but the only one is define-token-parser. I think the solution to that is the usual expand-the-macro-by-hand-and-write-that-code, but you could probably help a bit by having an install-or-replace-token-parser function which dealt with the bookkeeping of keeping the list sorted etc.
You'll need a language with dynamic variables to implement something like the cons reeder.
It uses CL-PPCRE's s-expression representation of regexps. I'm sure other languages have something like this (Perl does) because no-one wants to write stringy regexps: they must have died out decades ago.
It's a toy: it may be interesting to read but it's not suitable for any serious use. I found at least one bug while writing this: there will be many more.

Eliminating Epsilon Production for Left Recursion Elimination

I'm following the algorithm for eliminating left recursion from a grammar. It says to remove the epsilon productions, if there are any.
I have the following grammar:
S-->Aa/b
A-->Ac/Sd/∈
After removing the epsilon productions, the grammar apparently becomes
1) S-->Aa/a/b
2) A-->Ac/Sd/c/d
I'm confused about where the a/b in 1) and the c/d in 2) come from.
Can someone explain this?
Let's look at the rule S->Aa. If A->∈, then S->∈a, which is just S->a, so together with the existing rules we get S->Aa|a|b.
Now apply the same reasoning to the rule A->Ac: with A->∈ it becomes A->∈c, which gives us A->c.
What about A->Sd? I don't see how you got A->d as a rule, since S cannot derive ∈. If A->d were a rule, the string "da" would be accepted by the new grammar (S->Aa and A->d give "da"). But try to construct this string with the original grammar: if you start with S and the string ends with "a", you must use S->Aa; then, to get a "d", you must use A->Sd, and that S forces another "a" or "b" before the "d". So "da" cannot be constructed, and the rule A->d is not correct.
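The reasoning above can be checked mechanically. The following Python sketch is one common way to do epsilon elimination, under a representation chosen here for illustration: the grammar is a dictionary from nonterminal to a list of bodies, each body a list of symbols, with the empty list standing for ∈. The function names are made up, and the sketch ignores the special case of a nullable start symbol.

```python
from itertools import product


def nullable_symbols(grammar):
    """Find every nonterminal that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            if head in nullable:
                continue
            # A head is nullable if some body consists only of nullable symbols
            # (an empty body qualifies trivially).
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable


def eliminate_epsilon(grammar):
    """Return an equivalent grammar without epsilon productions."""
    nullable = nullable_symbols(grammar)
    result = {}
    for head, bodies in grammar.items():
        new_bodies = []
        for body in bodies:
            # Each nullable symbol may be kept or dropped; others must be kept.
            options = [((sym,), ()) if sym in nullable else ((sym,),) for sym in body]
            for choice in product(*options):
                candidate = [sym for part in choice for sym in part]
                if candidate and candidate not in new_bodies:
                    new_bodies.append(candidate)
        result[head] = new_bodies
    return result


grammar = {
    "S": [["A", "a"], ["b"]],
    "A": [["A", "c"], ["S", "d"], []],
}
cleaned = eliminate_epsilon(grammar)
```

For this grammar only A is nullable, so cleaned["S"] contains Aa, a and b, and cleaned["A"] contains Ac, c and Sd. No A->d appears, because S is not nullable.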

Grammar: start: (a b)? a c; Input: a d. Which error message is correct at position 2: expected "b", "c" or expected "c"?

Grammar:
rule: (a b)? a c ;
Input:
a d
Question: Which error message is correct at position 2 for the given input?
1. expected "b", "c".
2. expected "c".
P.S.
I am writing a parser and I face a choice (a dilemma): whether or not to take into account that "b" is expected at this position.
Error #1 (expected "b", "c") says that the input "a b" is expected; because it is optional it is not strictly expected, but it is possible.
I don't know whether "possible" is the same as "expected" or not.
Which error message is better and correct, #1 or #2?
Thanks for answers.
P.S.
In the first case, I define the testing marker as a position limit.
if(_inputPos > testing) {
    _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
The limit is moved in optional expressions:
OPTIONAL_EXPRESSION:
testing = _inputPos;
The "b" expression moves _inputPos past the testing position, so a failure is added at _inputPos.
In the second case, I can define the testing marker as a boolean flag.
if(!testing) {
    _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
In this case the "b" expression does not add a failure, because it is being tested (it is inside an optional expression).
Which do you think is better and correct?
Testing defined as a specific position: if an expression fails beyond this position (_inputPos > testing), a failure is added, even inside an optional expression.
Testing defined as a flag: while the flag is set, failures are not taken into account. After executing an optional expression, the previous value of testing (true or false) is restored (not reset!).
Also, failures are not taken into account if the rule does not fail. They are reported only if parsing fails.
P.S.
Changes at 06 Jan 2014
This question arose because it touches on two different problems.
First problem:
A parsing expression grammar (PEG) describes only three atomic items of input:
terminal symbol
nonterminal symbol
empty string
Such a grammar has no lexical preprocessing step, and thus it has no notion of a token.
Second problem:
What is a grammar? Can two grammars be considered equal if they accept the same input but produce different results?
Assume we have two grammars:
Grammar 1
rule <- type? identifier
Grammar 2
rule <- type identifier / identifier
They both accept the same input but (in PEG) produce different results.
Grammar 1 results:
{type : type, identifier : identifier}
{type : null, identifier : identifier}
Grammar 2 results:
{type : type, identifier : identifier}
{identifier : identifier}
Questions:
Are the two grammars equal?
Is optimizing grammars painless?
My answer to both questions is negative: not equal, and not painless.
You may ask, "But why does this happen?"
My answer: "Because this is not a problem. It is a feature."
In a PEG parser, an expression ALWAYS consists of these parts:
ORDERED_CHOICE => SEQUENCE => EXPRESSION
And that is my answer to the question "But why does this happen?".
Another problem.
A PEG parser does not recognize WHITESPACE on its own, because it has no tokens and no token separators.
Now look at this grammar (abbreviated):
program <- WHITESPACE expr EOF
expr <- ruleX
ruleX <- 'X' WHITESPACE
WHITESPACE <- ' '?
EOF <- ! .
All PEG grammars are written in this manner: a WHITESPACE at the beginning, and then (often) a WHITESPACE at the end of each rule.
In this case the optional WHITESPACE has to be counted as expected.
But WHITESPACE does not mean only a space. It may be more complex, [\t\n\r], and may even include comments.
The main rule of error messages is the following:
If it is not possible to display all expected elements (or not possible to display at least one of the set of expected elements), it is more correct not to display any of them.
More precisely, an "unexpected" error message should be displayed instead.
How would you display an expected WHITESPACE in PEG?
Parser error: expected WHITESPACE
Parser error: expected ' ', '\t', '\n', '\r'
What about the start characters of comments? They may also be part of WHITESPACE in some grammars.
In this case the optional WHITESPACE will cause all other potentially expected elements to be rejected, because WHITESPACE is too complex to display correctly in an error message.
Is this good or bad?
I think this is not bad, but it requires some tricks to hide this aspect of PEG parsers.
So in my PEG parser I do not assume that the inner expression at the first position of an optional (optional & zero_or_more) expression must be treated as expected.
But all other inner expressions (those not at the first position) must be treated as expected.
Example 1:
List<int list; // type? ident
Here "List<int" is a "type". The missing ">" is not at the first position in the optional "type?".
This failure is taken into account and reported as "expected '>'".
This is because we do not skip "type" but enter it, and after the truly optional "List" we move from the first position to the next element, which is really "expected" (it is already past the testing position).
"List" was at the "testing" position.
If an inner expression (inside an optional expression) "fits within the limit", that is, does not continue to the next position, then it is not treated as expected input.
This assumption is what prompted the main question.
Just keep in mind that we are talking about PEG parsers and their error messages.
Here is your grammar:
What is clear here is that after the first a there are two possible inputs: b or c. Your error message should not prioritize one over the other.
The basic idea for producing an error message for invalid input is to find the farthest position at which parsing failed (if your grammar were d | (a b)? a c, d wouldn't be part of the error), determine all the possible inputs that would have let the parser advance from there, and say "expected '...' but got '...'". There are other approaches that try to recover so the parser can continue. For example, if there is only one possible expected token, the parser can temporarily insert it into the token stream and continue as if it had been there all along. This leads to better error detection, because you can find errors beyond the point where the parser first stopped.
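A minimal sketch of the "farthest failure" idea in Python, hard-coded for the rule (a b)? a c. Everything here is an assumption made for illustration: the input is a list of already separated terminals, positions are 0-based indexes, and the class and method names are invented. Every failed terminal match is recorded, but only the failures at the farthest position survive.

```python
class FarthestFailureParser:
    """Hand-written matcher for: rule <- (a b)? a c (illustrative only)."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.far_pos = -1      # farthest index at which a terminal failed
        self.expected = []     # terminals that were expected at far_pos

    def _fail(self, pos, what):
        if pos > self.far_pos:
            self.far_pos, self.expected = pos, [what]
        elif pos == self.far_pos and what not in self.expected:
            self.expected.append(what)

    def term(self, pos, terminal):
        if pos < len(self.tokens) and self.tokens[pos] == terminal:
            return pos + 1
        self._fail(pos, terminal)
        return None

    def seq(self, pos, *terminals):
        for terminal in terminals:
            pos = self.term(pos, terminal)
            if pos is None:
                return None
        return pos

    def rule(self, pos=0):
        after_optional = self.seq(pos, "a", "b")
        if after_optional is None:
            after_optional = pos  # the optional part did not match: backtrack
        return self.seq(after_optional, "a", "c")


parser = FarthestFailureParser(["a", "d"])
result = parser.rule()
```

For the input a d, result is None, parser.far_pos is 1 (the second token) and parser.expected is ["b", "c"]: both alternatives failed at the same farthest position, so both are reported. This corresponds to message #1 in the question.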

Why doesn't Haskell's Prelude.read return a Maybe?

Is there a good reason why the type of Prelude.read is
read :: Read a => String -> a
rather than returning a Maybe value?
read :: Read a => String -> Maybe a
Since the string might fail to be parseable Haskell, wouldn't the latter be more natural?
Or even an Either String a, where Left would contain the original string if it didn't parse, and Right the result if it did?
Edit:
I'm not trying to get others to write a corresponding wrapper for me. Just seeking reassurance that it's safe to do so.
Edit: As of GHC 7.6, readMaybe is available in the Text.Read module in the base package, along with readEither: http://hackage.haskell.org/packages/archive/base/latest/doc/html/Text-Read.html#v:readMaybe
Great question! The type of read itself isn't changing anytime soon because that would break lots of things. However, there should be a maybeRead function.
Why isn't there? The answer is "inertia". There was a discussion in '08 which got derailed by a discussion over "fail."
The good news is that folks were sufficiently convinced to start moving away from fail in the libraries. The bad news is that the proposal got lost in the shuffle. There should be such a function, although one is easy to write (and there are zillions of very similar versions floating around many codebases).
See also this discussion.
Personally, I use the version from the safe package.
Yeah, it would be handy with a read function that returns Maybe. You can make one yourself:
readMaybe :: (Read a) => String -> Maybe a
readMaybe s = case reads s of
                [(x, "")] -> Just x
                _         -> Nothing
Apart from inertia and/or changing insights, another reason might be that it's aesthetically pleasing to have a function that can act as a kind of inverse of show. That is, you want that read . show is the identity (for types which are an instance of Show and Read) and that show . read is the identity on the range of show (i.e. show . read . show == show)
Having a Maybe in the type of read breaks the symmetry with show :: a -> String.
As @augustss pointed out, you can make your own safe read function. However, his readMaybe isn't completely consistent with read, as it doesn't ignore whitespace at the end of a string. (I made this mistake once, I don't quite remember the context)
Looking at the definition of read in the Haskell 98 report, we can modify it to implement a readMaybe that is perfectly consistent with read, and this is not too inconvenient because all the functions it depends on are defined in the Prelude:
readMaybe :: (Read a) => String -> Maybe a
readMaybe s = case [x | (x,t) <- reads s, ("","") <- lex t] of
                [x] -> Just x
                _   -> Nothing
This function (called readMaybe) is now in base, in the Text.Read module (as of base 4.6). Note that it is not exported by the Prelude, so you need to import Text.Read to use it.

Gold Parsing System - What can it be used for in programming?

I have read the GOLD Homepage ( http://www.devincook.com/goldparser/ ) docs, FAQ and Wikipedia to find out what practical application there could possibly be for GOLD. I was thinking along the lines of having a programming language (easily) available to my systems such as ABAP on SAP or X++ on Axapta - but it doesn't look feasible to me, at least not easily - even if you use GOLD.
The final use of the parsed result produced by GOLD escapes me - what do you do with the result of the parse?
EDIT: A practical example (description) would be great.
Parsing really consists of two phases. The first is "lexing", which converts the raw string of characters into something that the program can more readily understand (commonly called tokens).
As a simple example, the lexer would convert:
if (a + b > 2) then
In to:
IF_TOKEN LEFT_PAREN IDENTIFIER(a) PLUS_SIGN IDENTIFIER(b) GREATER_THAN NUMBER(2) RIGHT_PAREN THEN_TOKEN
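The lexing step can be sketched with a handful of regular expressions. This Python example is only an illustration and has nothing to do with how GOLD itself builds its tokenizer: the token names follow the example above, while TOKEN_SPEC, MASTER and lex are names invented here, and only the few token kinds needed for this one line are defined.

```python
import re

# Order matters: keywords are tried before the general identifier pattern.
TOKEN_SPEC = [
    ("IF_TOKEN", r"if\b"),
    ("THEN_TOKEN", r"then\b"),
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("LEFT_PAREN", r"\("),
    ("RIGHT_PAREN", r"\)"),
    ("PLUS_SIGN", r"\+"),
    ("GREATER_THAN", r">"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(text):
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind = match.lastgroup
        if kind in ("NUMBER", "IDENTIFIER"):
            # Tokens that carry a value keep the matched text.
            tokens.append(f"{kind}({match.group()})")
        elif kind != "SKIP":
            tokens.append(kind)
        pos = match.end()
    return tokens


tokens = lex("if (a + b > 2) then")
```

Joining tokens with spaces reproduces the token stream shown above. Note that the lexer knows nothing about whether the tokens form a valid statement; that is the parser's job.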
The parser takes that stream of tokens and attempts to make yet more sense out of them. In this case, it would try to match up those tokens to an IF_STATEMENT. To the parser, the IF_STATEMENT may well look like this:
IF ( BOOLEAN_EXPRESSION ) THEN
Where the result of the lexing phase is a token stream, the result of the parsing phase is a Parse Tree.
So, a parser could convert the above into:
              if_statement
                   |
                   v
      boolean_expression.operator = GREATER_THAN
           |                |
           |                v
           v        numeric_constant.string="2"
    expression.operator = PLUS_SIGN
       |            |
       |            v
       v     identifier.string = "b"
identifier.string = "a"
Here you see we have an IF_STATEMENT. An IF_STATEMENT has a single argument, which is a BOOLEAN_EXPRESSION. This was explained in some manner to the parser. When the parser is converting the token stream, it "knows" what an IF looks like, and knows what a BOOLEAN_EXPRESSION looks like, so it can make the proper assignments when it sees the code.
For example, if you have just:
if (a + b) then
The parser could know that it's not a boolean expression (because the + is arithmetic, not a boolean operator) and the parse could throw an error at this point.
Next, we see that a BOOLEAN_EXPRESSION has 3 components: the operator (GREATER_THAN) and two sides, the left side and the right side.
On the left side, it points to yet another expression, the "a + b", while on the right it points to a NUMERIC_CONSTANT, in this case the string "2". Again, the parser "knows" this is a NUMERIC_CONSTANT because we told it about strings of digits. If it wasn't made of digits, it would be an IDENTIFIER (like "a" and "b" are).
Note, that if we had something like:
if (a + b > "XYZ") then
That "parses" just fine (expression on the left, string constant on the right). We don't know from looking at this whether this is a valid expression or not. We don't know if "a" or "b" reference Strings or Numbers at this point. So, this is something the parser can't decide for us and can't flag as an error, as it simply doesn't know. That will happen when we evaluate (either execute or try to compile into code) the IF statement.
If we did:
if [a > b ) then
The parser can readily see that syntax error as a problem, and will throw an error. That string of tokens doesn't look like anything it knows about.
So, the point being that when you get a complete parse tree, you have some assurance that at first cut the "code looks good". Now during execution, other errors may well come up.
To evaluate the parse tree, you just walk the tree. You'll have some code associated with the major nodes of the parse tree during the compile or evaluation part. Let's assume that we have an interpreter.
public void execute_if_statement(ParseTreeNode node) {
    // We already know we have an IF_STATEMENT node
    Value value = evaluate_expression(node.getBooleanExpression());
    if (value.getBooleanResult()) {
        // we do the "then" part of the code
    }
}

public Value evaluate_expression(ParseTreeNode node) {
    // Leaf nodes: constants and identifiers evaluate directly
    if (node.isConstant()) {
        return evaluate_constant(node);
    }
    if (node.isIdentifier()) {
        return lookupIdentifier(node);
    }
    // Everything else is an operator with a left and a right side
    Value leftSide = evaluate_expression(node.getLeftSide());
    Value rightSide = evaluate_expression(node.getRightSide());
    if (node.getOperator() == '+') {
        if (!leftSide.isNumber() || !rightSide.isNumber()) {
            throw new RuntimeError("Must have numbers for adding");
        }
        int l = leftSide.getIntValue();
        int r = rightSide.getIntValue();
        int sum = l + r;
        return new Value(sum);
    }
    if (node.getOperator() == '>') {
        if (leftSide.getType() != rightSide.getType()) {
            throw new RuntimeError("You can only compare values of the same type");
        }
        if (leftSide.isNumber()) {
            int l = leftSide.getIntValue();
            int r = rightSide.getIntValue();
            boolean greater = l > r;
            return new Value(greater);
        } else {
            // do string compare instead (assumes Value also exposes getStringValue())
            boolean greater = leftSide.getStringValue().compareTo(rightSide.getStringValue()) > 0;
            return new Value(greater);
        }
    }
    // Any operator we don't handle is an error rather than a silent fall-through
    throw new RuntimeError("Unknown operator: " + node.getOperator());
}
So, you can see that we have a recursive evaluator here. You see how we're checking the run time types, and performing the basic evaluations.
What will happen is that execute_if_statement will evaluate its main expression. Even though we wanted only a BOOLEAN_EXPRESSION in the parse, all expressions are mostly the same for our purposes. So, execute_if_statement calls evaluate_expression.
In our system, all expressions have an operator and a left and right side. Each side of an expression is also an expression, so you can see how we immediately try to evaluate those as well to get their real value. The one note is that if the expression consists of a CONSTANT, then we simply return the constant's value; if it's an identifier, we look it up as a variable (and that would be a good place to throw an "I can't find the variable 'a'" message); otherwise we're back to the left side/right side thing.
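For anyone who wants to run the idea end to end, here is a self-contained sketch of the same recursive evaluation. It is not the code above: the Expr and Evaluator classes are hypothetical stand-ins for ParseTreeNode and Value, plain Java objects (Integer, String, Boolean) play the role of runtime values, and variables live in an ordinary Map.

```java
import java.util.Map;

// Hypothetical expression node: a constant, an identifier, or an operator
// with a left and a right side. Exactly one of the three forms is populated.
final class Expr {
    final Object constant;    // Integer or String when this is a constant
    final String identifier;  // variable name when this is an identifier
    final char operator;      // '+' or '>' when this is an operator node
    final Expr left;
    final Expr right;

    private Expr(Object constant, String identifier, char operator, Expr left, Expr right) {
        this.constant = constant;
        this.identifier = identifier;
        this.operator = operator;
        this.left = left;
        this.right = right;
    }

    static Expr constant(Object value) { return new Expr(value, null, '\0', null, null); }
    static Expr identifier(String name) { return new Expr(null, name, '\0', null, null); }
    static Expr binary(char op, Expr l, Expr r) { return new Expr(null, null, op, l, r); }
}

final class Evaluator {
    static Object evaluate(Expr node, Map<String, Object> variables) {
        if (node.constant != null) {
            return node.constant;
        }
        if (node.identifier != null) {
            if (!variables.containsKey(node.identifier)) {
                throw new IllegalStateException("I can't find the variable '" + node.identifier + "'");
            }
            return variables.get(node.identifier);
        }
        // Operator node: evaluate both sides first, then check runtime types.
        Object l = evaluate(node.left, variables);
        Object r = evaluate(node.right, variables);
        switch (node.operator) {
            case '+': {
                if (!(l instanceof Integer) || !(r instanceof Integer)) {
                    throw new IllegalStateException("Must have numbers for adding");
                }
                int a = (Integer) l;
                int b = (Integer) r;
                return a + b;
            }
            case '>': {
                if (l.getClass() != r.getClass()) {
                    throw new IllegalStateException("You can only compare values of the same type");
                }
                if (l instanceof Integer) {
                    int a = (Integer) l;
                    int b = (Integer) r;
                    return a > b;
                }
                if (l instanceof String) {
                    return ((String) l).compareTo((String) r) > 0;
                }
                throw new IllegalStateException("Cannot compare values of this type");
            }
            default:
                throw new IllegalStateException("Unknown operator: " + node.operator);
        }
    }

    public static void main(String[] args) {
        // a + b > 2, with a = 1 and b = 2
        Expr sum = Expr.binary('+', Expr.identifier("a"), Expr.identifier("b"));
        Expr condition = Expr.binary('>', sum, Expr.constant(2));
        Map<String, Object> variables = Map.of("a", 1, "b", 2);
        System.out.println(evaluate(condition, variables));
    }
}
```

With a = 1 and b = 2 the condition evaluates to true. Swapping the constant 2 for the string "XYZ" still builds a perfectly good tree, but evaluation fails with the "same type" error, which is the runtime check the parser could not make.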
I hope you can see how a simple evaluator can work once you have a parse tree from a parser. Note how during evaluation, the major elements of the language are in place; otherwise we'd have got a syntax error and never got to this phase. We can simply expect to "know" that when we have, for example, a PLUS operator, we're going to have 2 expressions, the left and right side. Or when we execute an IF statement, that we already have a boolean expression to evaluate. The parse is what does that heavy lifting for us.
Getting started with a new language can be a challenge, but you'll find that once you get rolling, the rest becomes pretty straightforward and it's almost "magic" that it all works in the end.
Note, pardon the formatting, but underscores are messing things up -- I hope it's still clear.
I would recommend antlr.org for information; ANTLR is also the free tool I would use for any parsing work.
GOLD can be used for any kind of application where you have to apply context-free grammars to input.
elaboration:
Essentially, CFGs apply to all programming languages. So if you wanted to develop a scripting language for your company, you'd need to write a parser, or get a parsing program. Alternatively, if you wanted to have a semi-natural language for input for non-programmers in the company, you could use a parser to read that input and spit out more "machine-readable" data. Essentially, a context-free grammar allows you to describe far more inputs than a regular expression. The GOLD system apparently makes the parsing problem somewhat easier than lex/yacc (the UNIX standard programs for parsing).
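A standard illustration of "more than a regular expression" is arbitrarily nested parentheses, which a context-free grammar describes in one rule but a classical regular expression cannot. The sketch below is hand-written and does not use GOLD or any other tool; the Balanced class and its method names are hypothetical.

```java
// Recognizer for the context-free grammar
//   S := "(" S ")" S | (empty)
// written by hand as a recursive-descent routine.
final class Balanced {
    // Returns the index just past the S that starts at 'pos',
    // or -1 if an opening parenthesis is never closed.
    private static int parseS(String s, int pos) {
        while (pos < s.length() && s.charAt(pos) == '(') {
            int inner = parseS(s, pos + 1);  // the nested S
            if (inner < 0 || inner >= s.length() || s.charAt(inner) != ')') {
                return -1;
            }
            pos = inner + 1;                 // continue with the trailing S
        }
        return pos;
    }

    static boolean isBalanced(String s) {
        return parseS(s, 0) == s.length();
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(()())"));
        System.out.println(isBalanced("(()"));
    }
}
```

The recursion in parseS mirrors the recursion in the grammar rule, which is the part a regular expression has no way to express. A generator such as GOLD, ANTLR, or yacc produces this kind of code from the grammar so you don't write it by hand.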

Resources