I'm using TinyPG, an LL(1) parser generator, to parse lambda calculus. I'm trying to write a rule that will parse function application like (a b) or (a b c) and so on.
So far I have this rule (a bit simplified):
APPLICATION -> LPARENTHESES VARIABLE (SPACE+ VARIABLE)+ RPARENTHESES;
But that'd fail to parse a term that has spaces after the left bracket or before the right bracket, such as ( a b ). I can allow spaces after the opening bracket like this:
APPLICATION -> LPARENTHESES SPACE* VARIABLE (SPACE+ VARIABLE)+ RPARENTHESES;
But I'm having trouble setting it to allow spaces before the closing bracket. I came up with this, which seems to work:
ARG_LIST -> (RPARENTHESES | (SPACE+ (RPARENTHESES | (VARIABLE ARG_LIST))));
APPLICATION -> LPARENTHESES SPACE* VARIABLE ARG_LIST;
But it's messy and recursive, which will make the nodes hard to read and compile. Is there a non-recursive, or at least simpler, way to parse this?
There is no reason to confuse the parser with whitespace. It is sufficient to ignore it in the scanner using the [Skip] attribute, as shown in the tutorial:
[Skip] WHITESPACE -> #"\s+";
"Skip" does not mean "delete". It means that the scanner should recognize the token and then ignore it. If you skip whitespace, whitespace will still separate alphanumeric tokens just fine. You just don't need to include the space token in your grammar, leaving you with:
APPLICATION -> LPARENTHESES VARIABLE VARIABLE+ RPARENTHESES;
(Actually, empty application lists are usually allowed, so I'd write that with a * instead of a +. But it's your language.)
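Putting it together, the whole thing might look roughly like this (a sketch; the exact token definitions for the parentheses and variables are assumptions about the rest of your grammar):
[Skip] WHITESPACE -> #"\s+";
LPARENTHESES -> #"\(";
RPARENTHESES -> #"\)";
VARIABLE -> #"[a-z][a-zA-Z0-9]*";
APPLICATION -> LPARENTHESES VARIABLE VARIABLE+ RPARENTHESES;
With whitespace skipped in the scanner, (a b), ( a b ) and (a b c) all parse with the same rule, and no SPACE token appears anywhere in the grammar.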
Given a parser
newtype Parser a = Parser { parse :: String -> [(a,String)] }
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
p >>= f = Parser $ \s -> concat [ parse (f a) s' | (a, s') <- parse p s ]
return :: a -> Parser a
return a = Parser (\s -> [(a,s)])
item :: Parser Char
item = Parser $ \s -> case s of
  ""     -> []
  (c:cs) -> [(c,cs)]
We can see that item consumes part of the input string given to it ("abc" -> [('a', "bc")]). Is there ever a case where a parser would produce additional string output or replace/modify it (e.g. Parser $ \s -> [((), 'a':s)])? I suspect that this might be the case with context-sensitive grammars but have trouble coming up with a sensible example.
Is there a reason why it would make sense to do this for a real-world problem?
References
Monadic Parsing in Haskell
Here are a couple of cases where it is convenient to inject tokens into the input stream. (How this is actually integrated into the parsing pipeline is another question.)
Macro expansion, in the style of the C/C++ preprocessing phase. This is arguably not the best model for macro expansion; hygienic macros would more likely be expanded using a tree transformation, as with C++ template resolution. But the token-oriented preprocessor is not going away soon. Since it is not tightly coupled with the language syntax, the easiest implementation is to replace the macro (and its arguments, if applicable) with the tokens of its expansion.
Ecmascript-style automatic semicolon insertion (ASI). The language syntax requires a semicolon to be inserted into the token stream under certain precisely defined circumstances, which are difficult (at least) to incorporate in a CFG. Since ASI is only possible if the next token in the input stream cannot be shifted (among other conditions), it can certainly be integrated into the parser loop.
Similarly, indentation-aware block syntax (as in Haskell and Python, for example) can certainly be implemented by replacing leading whitespace with an injected INDENT token or some number of injected DEDENTs. Since this substitution is dependent on parse context (it isn't done inside parentheses, for example), injection inside the parser could be a reasonable approach.
That's not an exhaustive list, but it might be at least indicative. Not all of those cases necessarily involve context-sensitivity (I believe ASI could, in theory, be handled with a context-free grammar although I have no intention of trying) and not all instances of context-sensitivity necessarily require token injection (the ambiguity in C between type and variable names only requires selecting the correct token).
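To make the injection idea concrete in terms of the question's Parser type: a parser can succeed while pushing new text in front of the unconsumed input. Here is a minimal sketch (push and expandTwo are my names, purely illustrative; Parser is the newtype from the question):
push :: String -> Parser ()
push txt = Parser $ \s -> [((), txt ++ s)]

-- A toy macro expander: rewrite the name "TWO" into its expansion,
-- splicing the replacement text back into the stream via push.
-- parse expandTwo "TWO*3" == [((), "1+1*3")]
expandTwo :: Parser ()
expandTwo = Parser $ \s -> case splitAt 3 s of
  ("TWO", rest) -> parse (push "1+1") rest
  _             -> []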
Currently I'm trying to build a parser for VHDL, which has some of the problems C++ parsers have to face. The context-free grammar of VHDL produces a parse forest rather than a single parse tree because of its ambiguity regarding function calls and array subscriptions:
foo := fun(3) + arr(5);
This assignment can only be parsed unambiguously if the parser carries around a hierarchical, type-aware symbol table which it uses to resolve the ambiguities somewhat on the fly. I don't want to do this, because for statements like the one above, the parse forest does not grow exponentially but rather linearly with the number of function calls and array subscriptions.
(Except, of course, if one tortures the parser with statements like)
foo := fun(fun(fun(fun(fun(4)))));
Since bison forces the user to create just one single parse tree, I used %merge attributes to collect all subtrees recursively and added those subtrees under so-called AMBIG nodes in the singleton AST.
The result looks like this.
In order to produce the above, I parsed the token stream "I=I(N);". The substance of the grammar I used inside the parse.y file is collected below; it tries to resemble the ambiguous parts of VHDL:
start: prog
;
/* I cut out every semantic action to make this
text more readable */
prog: assignment ';'
| prog assignment ';'
;
assignment: 'I' '=' expression
;
expression: function_call %merge <stmtmerge2>
| array_indexing %merge <stmtmerge2>
| 'N'
;
function_call: 'I' '(' expression ')'
| 'I'
;
array_indexing: function_call '(' expression ')' %merge <stmtmerge>
| 'I' '(' expression ')' %merge <stmtmerge>
;
The whole source code can be read at this GitHub repository.
And now, let's get down to the actual problem.
As you can see in the generated parse tree above, the nodes FCALL1 and ARRIDX1 refer to the same single node EXPR1, which in turn refers to N1 twice. This, by all means, should not have happened, and I don't know why. Instead, there should be the paths
FCALL1 -> EXPR2 -> N2
ARRIDX1 -> EXPR1 -> N1
Do you have any idea why bison reuses the aforementioned nodes?
I also wrote a bug report on the official GNU mailing list for bison, though without a reply so far. Unfortunately, due to the restrictions on new Stack Overflow users, I can't provide a link to this bug report.
That behaviour is expected.
expression can be unambiguously reduced, and that reduced value is used by both possible ambiguous reductions which include the value. Remember that GLR, like LR, is a left-to-right parser. When a reduction action is executed, all of the child reductions have already happened. The effect is not different from the use of a terminal in a right-hand side; the terminal will not be artificially copied in order to produce different instances in the ambiguous productions which use it.
For most people, this would be a feature rather than a bug, and I don't mean that as a joke. Without the graph-structured stack, GLR has exponential run-time. If you really want to do a deep copy of shared AST nodes when you merge parse trees, you will have to do it yourself, but I suggest that you find a way to make use of the fact that the parse forest is really a directed acyclic graph rather than a tree; you will probably be able to take advantage of the lack of duplication.
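If it helps to picture it, here is a sketch in Haskell terms (the node types are hypothetical): the unambiguous reduction of expression builds one node, and both readings of the ambiguity hold a reference to that same node, which is what makes the forest a DAG rather than a tree.
data Node = N | Expr Node | FCall Node | ArrIdx Node | Ambig Node Node

forest :: Node
forest = let e = Expr N                  -- reduced once
         in Ambig (FCall e) (ArrIdx e)   -- shared by both readings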
There seems to be disagreement over whether MediaWiki markup (the markup language used to create and edit Wikipedia articles) is context-free or context-sensitive.
See http://www.mediawiki.org/wiki/User_talk:Kanor#Response_to_article_in_Meatball
I would argue that it is obviously context-sensitive. One example of this would be terminal characters in wikimarkup lists. Lists are formed like:
* One thing
* Another thing
* Yet another thing
The end of a list item is indicated by a carriage return.
However, if the list is nested in, say, a table or transclusion, then the end of a list item may either be a carriage return or a table/transclusion terminal symbol. For example, the following seems to be valid markup:
{{Infobox person
* One thing
* Another thing
* Yet another thing}}
However, when determining the end of the last list item, a parser would need to keep track of the context (e.g. the fact that it is currently nested within a transclusion) in order to accept the }} symbol instead of an end-of-line (carriage-return) character.
So... how is this possibly not context-sensitive?
"Context-sensitive" has a precise formal definition, and it does not appear to match your intuition. The grammar
S -> P | E
P -> '(' T '.' ')'
E -> '[' T '!' ']'
T -> <any context-free grammar fragment>
is context free (even regular, if T is regular), despite the fact that what comes after T (dot/exclamation mark) depends on the first character: There are no "context non-terminals" on the left hand side. Even arbitrary nesting isn't a problem:
S -> A | B
A -> '(' S ')'
B -> '[' S ']'
The parser has to remember which unmatched opening brackets it has seen so far, but it does not need context in the sense of context-free/-sensitive grammars. These particular grammars aren't even ambiguous (again a formal term, also used in the Wiki user page you link to).
Context free does not mean "parser needs no working memory", or equivalently "parser can be restricted to look at every single token in complete isolation".
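To make that concrete, here is a minimal recursive-descent recognizer in Haskell for the nesting grammar above (with an empty production S -> ε added by me so that derivations terminate): the "memory" lives entirely in the call stack, and no notion of grammatical context is involved.
recognizeS :: String -> Maybe String   -- returns the unconsumed input
recognizeS ('(':rest) = case recognizeS rest of
  Just (')':rest') -> Just rest'
  _                -> Nothing
recognizeS ('[':rest) = case recognizeS rest of
  Just (']':rest') -> Just rest'
  _                -> Nothing
recognizeS s = Just s                  -- S -> ε

-- recognizeS "([()])" == Just ""
-- recognizeS "([)]"   == Nothing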
The Read instance for Double behaves in a very straightforward way:
reads "34.567e8 foo" :: [(Double, String)] = [(3.4567e9," foo")]
However the Read instance for Scientific does something different:
reads "34.567e8 foo" :: [(Scientific, String)] =
[(34.0,".567e8 foo"),(34.567,"e8 foo"),(3.4567e9," foo")]
Strictly this is correct, in that it is presenting a list of possible parses of the input. In fact it could equally well have included (3.0, "4.567e8 foo") in the list, as well as some others. However the usual behaviour in cases like this (which the Double instance follows) is "maximal munch", meaning that the longest valid prefix is parsed.
I'm updating my Decimal library, which has a similar behaviour, and I'm wondering what the Right Thing is here. Both Scientific and Decimal are using Text.ParserCombinators.ReadP, which was designed to make it easy to write Read instances, and this seems to be a characteristic of ReadP parsers.
So my questions:
1: What is the Right Thing for "reads" to return in these cases? Should I file a bug for Data.Scientific?
2: If it should only return the maximal munch (like the Double instance does) then how do you get ReadP to do that?
I've decided that maximal munch is the Right Thing. Given "1.23" a parser that returns 1 is just wrong. I've been tripped up by this myself because I once tried to write a "maybeRead" looking like this:
maybeRead :: (Read a) => String -> Maybe a
maybeRead str = case reads str of
  [(v, "")] -> Just v
  _         -> Nothing
This worked fine for Double but failed for Decimal and Scientific. (Obviously it can be fixed to handle multiple return results, but I didn't expect to need to do this).
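For example, a version that handles multiple return results could look like this (a sketch; accepting trailing whitespace and requiring the full parse to be unique are my choices):
import Data.Char (isSpace)

maybeRead' :: Read a => String -> Maybe a
maybeRead' str = case [v | (v, rest) <- reads str, all isSpace rest] of
  [v] -> Just v
  _   -> Nothing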
The problem turned out to be the implementation of option in Text.ParserCombinators.ReadP. This uses the symmetric choice operator "+++", which returns the parses both with and without the optional component. Hence when I wrote something like
expPart <- option "" $ do {...}
the results included a parse without the expPart.
I wrote a different version of option using the left-biased choice operator:
myOpt :: a -> ReadP a -> ReadP a
myOpt d p = p <++ return d
If the parser "p" consumes any text then the default is not used. This does the Right Thing if you want maximal munch.
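To see the difference, consider this toy parser (a hypothetical example; it assumes myOpt from above and import Text.ParserCombinators.ReadP):
number :: ReadP String
number = do
  intPart <- munch1 (`elem` ['0'..'9'])
  expPart <- myOpt "" $ do
    e      <- char 'e'
    digits <- munch1 (`elem` ['0'..'9'])
    return (e : digits)
  return (intPart ++ expPart)

-- readP_to_S number "12e3" == [("12e3","")]
-- with option "" in place of myOpt "", the result would also
-- contain the partial parse ("12","e3")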
For #2, you could change the scientific package to use this parser defined in terms of the old one: scientificPmaxmuch = scientificP <* eof :: ReadP Scientific.
I don't think there is much of a convention for #1: it doesn't make a difference for people using read or Text.Read.readMaybe. readS_to_P reads :: ReadP Double is probably faster than readS_to_P reads :: ReadP Scientific, but if efficiency mattered at all you would keep everything as ReadP until the end.
Using Parsec, I'm able to write a function of type String -> Maybe MyType with relative ease. I would now like to create a Read instance for my type based on that; however, I don't understand how readsPrec works or what it is supposed to do.
My best guess right now is that readsPrec is used to build a recursive parser from scratch to traverse a string, building up the desired datatype in Haskell. However, I already have a very robust parser that does that very thing for me. So how do I tell readsPrec to use my parser? What is the "operator precedence" parameter it takes, and what is it good for in my context?
If it helps, I've created a minimal example on GitHub. It contains a type, a parser, and a blank Read instance, and reflects quite well where I'm stuck.
(Background: The real parser is for Scheme.)
However, I already have a very robust parser that does that very thing for me.
It's actually not that robust: your parser has problems with superfluous parentheses; it won't parse
((1) (2))
for example, and it will throw an exception on some malformed inputs, because
singleP = Single . read <$> many digit
may end up evaluating read "" :: Int, since many digit also succeeds after consuming no digits.
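A sketch of a fix for that last point: require at least one digit so that read is never applied to an empty string (Single is the constructor from the linked example; the surrounding types are assumed):
singleP = Single . read <$> many1 digit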
That out of the way: the precedence argument is used to determine whether parentheses are necessary in a given place, e.g. if you have
infixr 6 :+:
data a :+: b = a :+: b
data C = C Int
data D = D C
you don't need parentheses around a C 12 as an argument of (:+:), since the precedence of application is higher than that of (:+:), but you'd need parentheses around C 12 as an argument of D.
So you'd usually have something like
readsPrec p = needsParens (p >= precedenceLevel) someParser
where someParser parses a value from the input without enclosing parentheses, and needsParens True thing parses a thing between parentheses, while needsParens False thing parses a thing optionally enclosed in parentheses [you should always accept more parentheses than necessary, ((((((1)))))) should parse fine as an Int].
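For concreteness, here is one possible Parsec implementation of that needsParens helper (a sketch under my own assumptions; the whitespace handling in particular is a guess):
import Text.Parsec
import Text.Parsec.String (Parser)

-- True: at least one pair of parentheses is required (more are allowed).
-- False: parentheses are optional, but any number of extra pairs is fine.
needsParens :: Bool -> Parser a -> Parser a
needsParens True  p = parens (needsParens False p)
needsParens False p = try (parens (needsParens False p)) <|> p

parens :: Parser a -> Parser a
parens = between (char '(' <* spaces) (spaces *> char ')')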
Since the readsPrec p parsers are used to parse parts of the input as parts of the value when reading lists, tuples etc., they must return not only the parsed value, but also the remaining part of the input.
With that, a simple way to transform a parsec parser to a readsPrec parser would be
withRemaining :: Parser a -> Parser (a, String)
withRemaining p = (,) <$> p <*> getInput
parsecToReadsPrec :: Parser a -> Int -> ReadS a
parsecToReadsPrec parsecParser prec input
= case parse (withRemaining $ needsParens (prec >= threshold) parsecParser) "" input of
Left _ -> []
Right result -> [result]
If you're using GHC, it may however be preferable to use a ReadPrec / ReadP parser (built using Text.ParserCombinators.ReadP[rec]) instead of a parsec parser and define readPrec instead of readsPrec.
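For instance (a sketch; MyType and its parser are placeholders for your own type and parser):
import Text.ParserCombinators.ReadP (ReadP, munch1)
import Text.ParserCombinators.ReadPrec (lift)
import Data.Char (isDigit)

newtype MyType = MyType Int deriving Show

myTypeP :: ReadP MyType
myTypeP = MyType . read <$> munch1 isDigit

instance Read MyType where
  readPrec = lift myTypeP

After that, read "42" :: MyType works, and so do derived contexts such as read "[42,7]" :: [MyType], since the other methods of Read default to definitions in terms of readPrec.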