Non-greedy search in lpeg without consuming the end match - lua

This was spun off from the comments on this question.
As I understand it, in a PEG grammar it's possible to implement a non-greedy search by writing S <- E2 / E1 S (that is, match E2 if possible, otherwise match E1 and continue with S).
However, I don't want to capture E2 in the final pattern - I want to capture everything up to E2. When trying to implement this in LPeg I've run into several issues, including 'Empty loop in rule' errors when building this into a grammar.
How would we implement the following search in an LPeg grammar: [tag] foo [/tag], where we want to capture the contents of the tag in a capture table ('foo' in the example), but we want to terminate before the ending tag? As I understand from the comments on the other question, this should be possible, but I can't find an example in LPeg.
Here's a snippet from the test grammar
local tag_start = P"[tag]"
local tag_end = P"[/tag]"
G = P{'Pandoc',
...
NotTag = #tag_end + P"1" * V"NotTag"^0;
...
tag = tag_start * Ct(V"NotTag"^0) * tag_end;
}

It's me again. I think you need a better understanding of LPeg captures. A table capture (lpeg.Ct) gathers the captures made inside it into a table. Since there is no simple capture (lpeg.C) anywhere in the NotTag rule, the final capture would just be an empty table {}.
Once more, I recommend starting with lpeg.re, because its syntax is more intuitive.
local re = require('lpeg.re')
local inspect = require('inspect')
local g = re.compile[=[--lpeg
tag <- tag_start {| {NotTag} |} tag_end
NotTag <- &tag_end / . NotTag
tag_start <- '[tag]'
tag_end <- '[/tag]'
]=]
print(inspect(g:match('[tag] foo [/tag]')))
-- output: { " foo " }
Additionally, S <- E2 / E1 S is not the same as S <- E2 / E1 S*; these two are not equivalent.
However, if I were to do the same task, I wouldn't use a non-greedy match, as non-greedy matches are slower than greedy ones.
tag <- tag_start {| {( !tag_end . (!'[' .)* )*} |} tag_end
Combining the not-predicate with greedy matching is enough.

Related

John Hughes' Deterministic LL(1) parsing with Arrow and errors

I wanted to write a parser based on John Hughes' paper Generalizing Monads to Arrows. When reading through and trying to reimplement his code I realized there were some things that didn't quite make sense. In one section he lays out a parser implementation based on Swierstra and Duponcheel's paper Deterministic, error-correcting combinator parsers using Arrows. The parser type he describes looks like this:
data StaticParser ch = SP Bool [ch]
data DynamicParser ch a b = DP ((a, [ch]) -> (b, [ch]))
data Parser ch a b = P (StaticParser ch) (DynamicParser ch a b)
with the composition operator looking something like this:
(.) :: Parser ch b c -> Parser ch a b -> Parser ch a c
P (SP e2 st2) (DP f2) . P (SP e1 st1) (DP f1) =
  P (SP (e1 && e2) (st1 `union` if e1 then st2 else []))
    (DP $ f2 . f1)
The issue is that the composition of parsers q . p 'forgets' q's starting symbols. One possible interpretation I thought of is that Hughes expects all our DynamicParsers to be total, so that a symbol parser's type signature would be symbol :: ch -> Parser ch a (Maybe ch) instead of symbol :: ch -> Parser ch a ch. This still seems awkward, though, since we have to duplicate information, putting starting-symbol information in both the StaticParser and the DynamicParser.
Another issue is that almost all parsers will have the potential to throw, which means we will have to spend a lot of time inside Maybe or Either, creating what is essentially the "monads do not compose" problem. This could be remedied by rewriting DynamicParser itself to handle failure, or as an Arrow transformer, but this is straying quite a bit from the paper.
None of these issues are addressed in the paper, and the Parser is presented as if it obviously works, so I feel like I must be missing something basic. If someone can catch what I missed, that would be super helpful.
I think the deterministic parsers described by Swierstra and Duponcheel are a bit different from traditional parsers: they do not handle failure at all, only choice.
See also the invokeDet function in the S&D paper:
invokeDet :: Symbol s => DetPar s a -> Input s -> a
invokeDet (_, p) inp = case p inp [] of (a, _) -> a
This function clearly assumes it will always be able to find a valid parse.
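To make "only choice, no failure" concrete, here is my own rough reconstruction of what symbol and the choice combinator <+> could look like for the Parser type from the question. This is an illustrative sketch only, not necessarily what Hughes' paper or the gist below actually does:
-- Illustrative sketch only; it assumes the DP wrapper holds a function (a, [ch]) -> (b, [ch]).
-- The dynamic part of 'symbol' blindly consumes one character; it has no way to fail.
symbol :: Eq ch => ch -> Parser ch a ch
symbol c = P (SP False [c]) (DP (\(_, _:rest) -> (c, rest)))

-- Choice only consults the static starting-symbol sets to pick a branch;
-- if the next character is in neither set, it still commits to a branch.
(<+>) :: Eq ch => Parser ch a b -> Parser ch a b -> Parser ch a b
P (SP e1 st1) (DP f1) <+> P (SP e2 st2) (DP f2) =
  P (SP (e1 || e2) (st1 ++ st2))
    (DP (\(a, input) -> case input of
           (c:_) | c `elem` st1 -> f1 (a, input)
                 | c `elem` st2 -> f2 (a, input)
           _     | e1           -> f1 (a, input)
                 | otherwise    -> f2 (a, input)))
A choice built this way never reports an error: on unexpected input it simply commits to one of the branches anyway, which is consistent with the 'c' output for the "ad" example below.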
With the arrow version of the parsers described by Hughes you can write examples like this:
main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ab"
  print $ invokeDet p "ac"
Which will print the expected:
'b'
'c'
However, if you write a "failing" parse:
main = do
  let p = symbol 'a' >>> (symbol 'b' <+> symbol 'c')
  print $ invokeDet p "ad"
It will still print:
'c'
To make this behavior a bit more sensible, Swierstra and Duponcheel also introduce error-correction. The output 'c' is expected if we assume the erroneous character d has been corrected to be a c in the input. This requires an extra mechanism which presumably was too complicated to include in Hughes' paper.
I have uploaded the implementation I used to get these results here: https://gist.github.com/noughtmare/eced4441332784cc8212e9c0adb68b35
For more information about a more practical parser in the same style (but no longer deterministic and no longer limited to LL(1)) I really like the "Combinator Parsing: A Short Tutorial" by Swierstra. An interesting excerpt from section 9.3:
A subtle point here is the question how to deal with monadic parsers. As we described in [13] the static analysis does not go well with monadic computations, since in that case we dynamically build new parses based on the input produced thus far: the whole idea of a static analysis is that it is static. This observation has lead John Hughes to propose arrows for dealing with such situations [7]. It is only recently that we realised that, although our arguments still hold in general, they do not apply to the case of the LL(1) analysis. If we want to compute the symbols which can be recognised as the first symbol by a parser of the form p >>= q then we are only interested in the starting symbols of the right hand side if the left hand side can recognise the empty string; the good news is that in that case we statically know what value will be returned as a witness, and can pass this value on to q, and analyse the result of this call statically too. Unfortunately we will have to take special precautions in case the left hand side operator contains a call to pErrors in one of the empty derivations, since then it is no longer true that the witness of this alternative can be determined statically.
The full parser implementation by Swierstra can be found in the uu-parsinglib package, although I do not know how many of the extensions are implemented there.

Parse String to Datatype in Haskell

I'm taking a Haskell course at school, and I have to define a Logical Proposition datatype in Haskell. Everything so far works fine (definition and functions), and I've declared it as an instance of Ord, Eq and Show. The problem comes when I'm required to define a program which interacts with the user: I have to parse the input from the user into my datatype:
type Var = String
data FProp = V Var
           | No FProp
           | Y FProp FProp
           | O FProp FProp
           | Si FProp FProp
           | Sii FProp FProp
where the formula: ¬q ^ p would be: (Y (No (V "q")) (V "p"))
I've been researching, and found that I can declare my datatype as an instance of Read.
Is this advisable? If it is, can I get some help in order to define the parsing method?
Not a complete answer, since this is a homework problem, but here are some hints.
The other answer suggested getLine followed by splitting at words. It sounds like you instead want something more like a conventional tokenizer, which would let you write things like:
(Y
  (No (V q))
  (V p))
Here’s one implementation that turns a string into tokens that are either a string of alphanumeric characters or a single, non-alphanumeric printable character. You would need to extend it to support quoted strings:
import Data.Char
type Token = String
tokenize :: String -> [Token]
{- Here, a token is either a string of alphanumeric characters, or else one
- non-spacing printable character, such as "(" or ")".
-}
tokenize [] = []
tokenize (x:xs) | isSpace x = tokenize xs
                | not (isPrint x) = error $
                    "Invalid character " ++ show x ++ " in input."
                | not (isAlphaNum x) = [x]:(tokenize xs)
                | otherwise = let (token, rest) = span isAlphaNum (x:xs)
                              in token:(tokenize rest)
It turns the example into ["(","Y","(","No","(","V","q",")",")","(","V","p",")",")"]. Note that you have access to the entire repertoire of Unicode.
The main function that evaluates this interactively might look like:
main = interact ( unlines . map show . map evaluate . parse . tokenize )
Where parse turns a list of tokens into a list of ASTs and evaluate turns an AST into a printable expression.
As for implementing the parser, your language appears to have similar syntax to LISP, which is one of the simplest languages to parse; you don’t even need precedence rules. A recursive-descent parser could do it, and is probably the easiest to implement by hand. You can pattern-match on parse ("(":xs) =, but pattern-matching syntax can also implement lookahead very easily, for example parse ("(":x1:xs) = to look ahead one token.
If you’re calling the parser recursively, you would define a helper function that consumes only a single expression, and that has a type signature like :: [Token] -> (AST, [Token]). This lets you parse the inner expression, check that the next token is ")", and proceed with the parse. However, externally, you’ll want to consume all the tokens and return an AST or a list of them.
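For example, a rough sketch along those lines, reusing the Token synonym and the FProp type from above, might look like this. The names parseOne and expect are made up for this sketch, it only handles the unquoted form produced by the tokenizer above, and the error handling is deliberately crude:
-- Consume exactly one parenthesised expression and return the leftover tokens.
parseOne :: [Token] -> (FProp, [Token])
parseOne ("(" : "V" : v : ts) = (V v, expect ")" ts)
parseOne ("(" : "No" : ts) =
  let (p, rest) = parseOne ts
  in  (No p, expect ")" rest)
parseOne ("(" : op : ts)
  | op `elem` ["Y", "O", "Si", "Sii"] =
      let (p, rest1) = parseOne ts
          (q, rest2) = parseOne rest1
          con = case op of
                  "Y"  -> Y
                  "O"  -> O
                  "Si" -> Si
                  _    -> Sii
      in  (con p q, expect ")" rest2)
parseOne ts = error ("parse error near " ++ show (take 3 ts))

-- Check that the next token is the expected one and drop it.
expect :: Token -> [Token] -> [Token]
expect t (t' : ts) | t == t' = ts
expect t ts = error ("expected " ++ t ++ " near " ++ show (take 3 ts))

-- The externally visible parser: keep consuming expressions until no tokens remain.
parse :: [Token] -> [FProp]
parse [] = []
parse ts = let (p, rest) = parseOne ts in p : parse rest
With the tokenize function above, parse (tokenize "(Y (No (V q)) (V p))") should yield [Y (No (V "q")) (V "p")].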
The stylish way to write a parser is with monadic parser combinators. (And maybe someone will post an example of one.) The industrial-strength solution would be a library like Parsec, but that’s probably overkill here. Still, parsing is (mostly!) a solved problem, and if you just want to get the assignment done on time, using a library off the shelf is a good idea.
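For what it's worth, a combinator-style sketch using Parsec might look like the following. It reuses the FProp type from the question, assumes the unquoted form (V q) rather than (V "q"), and the names pFProp, keyword and lexeme are invented for this sketch:
import Text.Parsec
import Text.Parsec.String (Parser)

-- Parse one fully parenthesised, prefix-form proposition, e.g. (Y (No (V q)) (V p)).
pFProp :: Parser FProp
pFProp = do
    _ <- lexeme (char '(')
    p <-     try (No  <$  keyword "No"  <*> pFProp)
         <|> try (Y   <$  keyword "Y"   <*> pFProp <*> pFProp)
         <|> try (O   <$  keyword "O"   <*> pFProp <*> pFProp)
         <|> try (Sii <$  keyword "Sii" <*> pFProp <*> pFProp)
         <|> try (Si  <$  keyword "Si"  <*> pFProp <*> pFProp)
         <|>     (V   <$> (keyword "V" *> lexeme (many1 alphaNum)))
    _ <- lexeme (char ')')
    return p
  where
    keyword s = lexeme (string s)   -- a keyword followed by optional whitespace
    lexeme p  = p <* spaces         -- skip trailing whitespace after a token

-- Example: parse pFProp "" "(Y (No (V q)) (V p))"
-- should give Right (Y (No (V "q")) (V "p")).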
The read part of a REPL interpreter typically looks like this:
repl :: ForthState -> IO ()  -- parser definition
repl state
  = do putStr "> "           -- puts a > character to indicate it's waiting for input
       input <- getLine      -- this is what you're looking for, to read a line
       if input == "quit"    -- allows user to quit the interpreter
         then do putStrLn "Bye!"
                 return ()
         else let (is, cs, d, output) = eval (words input) state  -- your grammar definition is somewhere down the chain when eval is called on input
              in do mapM_ putStrLn output
                    repl (is, cs, d, [])

main = do putStrLn "Welcome to your very own interpreter!"
          repl initialForthState  -- runs the parser, starting with read
Your eval method will have various loops, stack manipulations, conditionals, etc. to actually figure out what the user inputted. Hope this helps you with at least the reading-input part.

Creating a recursive LPeg pattern

In a normal PEG (parsing expression grammar) this is a valid grammar:
values <- number (comma values)*
number <- [0-9]+
comma <- ','
However, if I try to write this using LPeg the recursive nature of that rule fails:
local lpeg = require'lpeg'
local comma = lpeg.P(',')
local number = lpeg.R('09')^1
local values = number * (comma * values)^-1
--> bad argument #2 to '?' (lpeg-pattern expected, got nil)
Although in this simple example I could rewrite the rule to not use recursion, I have some existing grammars that I'd prefer not to rewrite.
How can I write a self-referencing rule in LPeg?
Use a grammar.
With the use of Lua variables, it is possible to define patterns incrementally, with each new pattern using previously defined ones. However, this technique does not allow the definition of recursive patterns. For recursive patterns, we need real grammars.
LPeg represents grammars with tables, where each entry is a rule.
The call lpeg.V(v) creates a pattern that represents the nonterminal (or variable) with index v in a grammar. Because the grammar still does not exist when this function is evaluated, the result is an open reference to the respective rule.
A table is fixed when it is converted to a pattern (either by calling lpeg.P or by using it wherein a pattern is expected). Then every open reference created by lpeg.V(v) is corrected to refer to the rule indexed by v in the table.
When a table is fixed, the result is a pattern that matches its initial rule. The entry with index 1 in the table defines its initial rule. If that entry is a string, it is assumed to be the name of the initial rule. Otherwise, LPeg assumes that the entry 1 itself is the initial rule.
As an example, the following grammar matches strings of a's and b's that have the same number of a's and b's:
equalcount = lpeg.P{
  "S";  -- initial rule name
  S = "a" * lpeg.V"B" + "b" * lpeg.V"A" + "",
  A = "a" * lpeg.V"S" + "b" * lpeg.V"A" * lpeg.V"A",
  B = "b" * lpeg.V"S" + "a" * lpeg.V"B" * lpeg.V"B",
} * -1
It is equivalent to the following grammar in standard PEG notation:
S <- 'a' B / 'b' A / ''
A <- 'a' S / 'b' A A
B <- 'b' S / 'a' B B
I know this is a late answer, but here is an idea of how to back-reference a rule:
local comma = lpeg.P(',')
local number = lpeg.R('09')^1
local values = lpeg.P{ lpeg.C(number) * (comma * lpeg.V(1))^-1 }
local t = { values:match('1,10,20,301') }
Basically, a primitive grammar is passed to lpeg.P (a grammar is just a glorified table) that references its first rule by number instead of by name, i.e. lpeg.V(1).
The sample just adds a simple lpeg.C capture on the number terminal and collects all of these results in the local table t for further use. (Notice that no lpeg.Ct is used, which is not a big deal, but still... part of the sample, I guess.)

<|> in Parsec - why do these examples behave differently?

I think I'm misunderstanding <|> in Parsec - I have an input stream that contains either a bunch of a's in one representation or a bunch of a's in another representation. I would expect the following two functions to be equivalent (given that the input is of the form I said, and I have verified that it is):
foo = do
  ...
  a1s <- many $ try $ a1
  a2s <- many $ try $ a2
  return $ a1s ++ a2s
versus
foo = do
  ...
  as <- (many $ try $ a1) <|> (many $ try $ a2)
  return as
What could be going wrong? The first function works on my input; the second fails, saying unexpected a2, expecting a1.
When you give a sequence of a2 to the latter parser, the first many matches and returns an empty list, so it doesn't try to match against the second many.
You can use many1 instead.
foo = do
  ...
  as <- many1 a1 <|> many a2
  return as
In this case, the many1 fails when you give a sequence of a2, and the many matches against the input.
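To make this concrete, here is a small self-contained sketch, where a1 and a2 are hypothetical stand-ins recognising 'a' and 'A' respectively, that reproduces both behaviours:
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical stand-ins for a1 and a2: the same thing in two representations.
a1, a2 :: Parser Char
a1 = char 'a'
a2 = char 'A'

-- 'many' never fails, so the right-hand branch of <|> is never tried.
broken :: Parser [Char]
broken = many (try a1) <|> many (try a2)

-- 'many1' does fail on input it cannot match at all, so <|> falls through.
fixed :: Parser [Char]
fixed = many1 a1 <|> many a2

main :: IO ()
main = do
  print (parse (broken <* eof) "" "AAA")  -- Left ... (many (try a1) returns [], then eof rejects the 'A')
  print (parse (fixed  <* eof) "" "AAA")  -- Right "AAA"
  print (parse (fixed  <* eof) "" "aaa")  -- Right "aaa"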

IndentParser example

Could someone please post a small example of IndentParser usage? I am looking to parse YAML-like input like the following:
fruits:
  apples: yummy
  watermelons: not so yummy
vegetables:
  carrots: are orange
  celery raw: good for the jaw
I know there is a YAML package. I would like to learn the usage of IndentParser.
I've sketched out a parser below; for your problem you probably only need the block parser from IndentParser. Note that I haven't tried to run it, so it might have elementary errors.
The biggest problem for your parser is not really indenting, but that you only have strings and the colon as tokens. You might find the code below takes quite a bit of debugging, as it will have to be very sensitive about not consuming too much input, though I have tried to be careful about left-factoring. Because you only have two kinds of tokens, there isn't much benefit to be had from Parsec's Token module.
Note that there is a strange truth to parsing: simple-looking formats are often not simple to parse. For learning, writing a parser for simple expressions will teach you much more than a more-or-less arbitrary text format (which might only cause you frustration).
data DefinitionTree = Nested String [DefinitionTree]
                    | Def String String
  deriving (Show)

-- Note - this might need some testing.
--
-- This is a tricky one: the parser has to parse trailing
-- spaces and tabs but not a newline.
--
category :: IndentCharParser st String
category = do
    { a <- body
    ; rest
    ; return a
    }
  where
    -- manyTill1 is assumed to be an "at least one" variant of manyTill.
    body = manyTill1 (letter <|> space) (char ':')
    rest = many (oneOf [' ', '\t'])

-- Because the DefinitionTree data type has two quite
-- different constructors, both sharing the same prefix
-- 'category', this combinator is a bit more complicated
-- than usual, and has to use an Either type to discriminate
-- between the options.
--
definition :: IndentCharParser st DefinitionTree
definition = do
    { a <- category
    ; b <- (textL <|> definitionsR)
    ; case b of
        Left ss  -> return (Def a ss)
        Right ds -> return (Nested a ds)
    }

-- Note this should parse a string *provided* it is on
-- the same line as the category.
--
-- However you might find this assumption needs verifying...
--
textL :: IndentCharParser st (Either String a)
textL = do
    { ss <- manyTill1 anyChar (char '\n')
    ; return (Left ss)
    }

-- Finally, this one uses an indent parser.
--
definitionsR :: IndentCharParser st (Either a [DefinitionTree])
definitionsR = block body
  where
    body = do { a <- many1 definition; return (Right a) }
