Scala combinator parser, what does >> mean? - parsing

I am a little bit confused about ">>" in Scala. Daniel said in Scala parser combinators parsing xml? that it could be used to parameterize a parser based on the result of a previous parser. Could someone give me an example or a hint? I have already read the Scaladoc but still don't understand it.
Thanks.

As I said, it serves to parameterize a parser, but let's walk through an example to make it clear.
Let's start with a simple parser, one that parses a number followed by a word:
def numberAndWord = number ~ word
def number = "\\d+".r
def word = "\\w+".r
Under RegexParsers, this will parse stuff like "3 fruits".
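If you want to run these snippets yourself, here is a minimal sketch of a harness for them; the wrapping object and the parseAll call are mine, not part of the original answer:
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object so the snippets above can actually be run
object Things extends RegexParsers {
  def numberAndWord = number ~ word
  def number = "\\d+".r
  def word = "\\w+".r
}

// Things.parseAll(Things.numberAndWord, "3 fruits")
// succeeds with a result of the form (3~fruits)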
Now, let's say you also want a list of what these "n things" are. For example, "3 fruits: banana, apple, orange". Let's try to parse that to see how it goes.
First, how do I parse "N" things? As it happens, there's a repN method:
def threeThings = repN(3, word)
That will parse "banana apple orange", but not "banana, apple, orange". I need a separator. There's repsep, which provides that, but it won't let me specify how many repetitions I want. So, let's provide the separator ourselves:
def threeThings = word ~ repN(2, "," ~> word)
Ok, that works. We can write the whole example now, for three things, like this:
def listOfThings = "3" ~ word ~ ":" ~ threeThings
def word = "\\w+".r
def threeThings = word ~ repN(2, "," ~> word)
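Wrapped in the same kind of RegexParsers object as before (again, the object name and the sample call are just illustrative), this fixed-at-three version can be exercised like so:
import scala.util.parsing.combinator.RegexParsers

object FixedThree extends RegexParsers {
  def listOfThings = "3" ~ word ~ ":" ~ threeThings
  def word = "\\w+".r
  def threeThings = word ~ repN(2, "," ~> word)
}

// FixedThree.parseAll(FixedThree.listOfThings, "3 fruits: banana, apple, orange")
// succeeds; the value nests the pieces with ~, roughly
// (((3~fruits)~:)~(banana~List(apple, orange)))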
That kind of works, except that I'm fixing "N" at 3. I want to let the user specify how many. And that's where >>, also known as into (and, yes, it is flatMap for Parser), comes in. First, let's change threeThings:
def things(n: Int) = n match {
  case 1 => word ^^ (List(_))
  case x if x > 1 => word ~ repN(x - 1, "," ~> word) ^^ { case w ~ l => w :: l }
  case x => err("Invalid repetitions: "+x)
}
This is slightly more complicated than you might have expected, because I'm forcing it to return Parser[List[String]]. But how do I pass a parameter to things? I mean, this won't work:
def listOfThings = number ~ word ~ ":" ~ things(/* what do I put here?*/)
But we can rewrite that like this:
def listOfThings = (number ~ word <~ ":") >> {
  case n ~ what => things(n.toInt)
}
That is almost good enough, except that I have now lost n and what: it only returns "List(banana, apple, orange)", not how many there ought to be, or what they are. I can keep them like this:
def listOfThings = (number ~ word <~ ":") >> {
  case n ~ what => things(n.toInt) ^^ { list => new ~(n.toInt, new ~(what, list)) }
}
def number = "\\d+".r
def word = "\\w+".r
def things(n: Int) = n match {
  case 1 => word ^^ (List(_))
  case x if x > 1 => word ~ repN(x - 1, "," ~> word) ^^ { case w ~ l => w :: l }
  case x => err("Invalid repetitions: "+x)
}
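Putting that together in a runnable form (the surrounding object and the sample call are my scaffolding, not part of the original answer), the >> version now handles a user-chosen count:
import scala.util.parsing.combinator.RegexParsers

object ListOfThings extends RegexParsers {
  def listOfThings = (number ~ word <~ ":") >> {
    case n ~ what => things(n.toInt) ^^ { list => new ~(n.toInt, new ~(what, list)) }
  }
  def number = "\\d+".r
  def word = "\\w+".r
  def things(n: Int): Parser[List[String]] = n match {
    case 1 => word ^^ (List(_))
    case x if x > 1 => word ~ repN(x - 1, "," ~> word) ^^ { case w ~ l => w :: l }
    case x => err("Invalid repetitions: " + x)
  }
}

// ListOfThings.parseAll(ListOfThings.listOfThings, "3 fruits: banana, apple, orange")
// succeeds with a value shaped like (3~(fruits~List(banana, apple, orange))), while
// "3 fruits: banana, apple" fails because the count and the list disagree.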
Just a final comment. You might have asked yourself "what do you mean, flatMap? Isn't that a monad/for-comprehension thingy?" Why, yes, and yes! :-) Here's another way of writing listOfThings:
def listOfThings = for {
  nOfWhat <- number ~ word <~ ":"
  n ~ what = nOfWhat
  list <- things(n.toInt)
} yield new ~(n.toInt, new ~(what, list))
I'm not doing n ~ what <- number ~ word <~ ":" because that would use filter or withFilter in Scala, which is not implemented by Parsers. But here's yet another way of writing it, which doesn't have exactly the same semantics, but produces the same results:
def listOfThings = for {
  n <- number
  what <- word
  _ <- ":" : Parser[String]
  list <- things(n.toInt)
} yield new ~(n.toInt, new ~(what, list))
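For the curious, here is roughly what the compiler desugars that for comprehension into. This is just a sketch; it assumes it sits next to the same number, word and things definitions inside an object extending RegexParsers, and the method name is mine:
def listOfThingsDesugared =
  number.flatMap { n =>
    word.flatMap { what =>
      (":" : Parser[String]).flatMap { _ =>
        things(n.toInt).map { list => new ~(n.toInt, new ~(what, list)) }
      }
    }
  }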
This might even lead one to think that the claim that "monads are everywhere" might have something to it. :-)

The method >> takes a function that is given the result of the parser and uses it to construct a new parser. As stated, this can be used to parameterize a parser on the result of a previous parser.
Example
The following parser parses a line with n + 1 integer values. The first value n states the number of values to follow. This first integer is parsed and then the result of this parse is used to construct a parser that parses n further integers.
Parser definition
The following line assumes that you can parse an integer with parseInt: Parser[Int]. It first parses an integer value n and then uses >> to parse n additional integers, which form the result of the parser. So the initial n is not returned by the parser (though it is the size of the returned list).
def intLine: Parser[Seq[Int]] = parseInt >> (n => repN(n,parseInt))
Valid inputs
1 42
3 1 2 3
0
Invalid inputs
0 1
1
3 42 42
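The answer only assumes that some parseInt: Parser[Int] exists. Here is one hedged way to provide it under RegexParsers so the example can be run; the object name and the parseInt definition are mine:
import scala.util.parsing.combinator.RegexParsers

object IntLine extends RegexParsers {
  // one possible parseInt: a run of digits converted to Int
  def parseInt: Parser[Int] = "\\d+".r ^^ (_.toInt)
  def intLine: Parser[Seq[Int]] = parseInt >> (n => repN(n, parseInt))
}

// IntLine.parseAll(IntLine.intLine, "3 1 2 3")  succeeds with List(1, 2, 3)
// IntLine.parseAll(IntLine.intLine, "0 1")      fails: zero values were promised,
//                                               so the trailing 1 is never consumed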

Related

Parse String to Datatype in Haskell

I'm taking a Haskell course at school, and I have to define a Logical Proposition datatype in Haskell. Everything so far works fine (definition and functions), and I've declared it as an instance of Ord, Eq and Show. The problem comes when I'm required to define a program which interacts with the user: I have to parse the input from the user into my datatype:
type Var = String
data FProp = V Var
           | No FProp
           | Y FProp FProp
           | O FProp FProp
           | Si FProp FProp
           | Sii FProp FProp
where the formula: ¬q ^ p would be: (Y (No (V "q")) (V "p"))
I've been researching, and found that I can declare my datatype as an instance of Read.
Is this advisable? If it is, can I get some help in order to define the parsing method?
Not a complete answer, since this is a homework problem, but here are some hints.
The other answer suggested getLine followed by splitting at words. It sounds like you instead want something more like a conventional tokenizer, which would let you write things like:
(Y
(No (V q))
(V p))
Here’s one implementation that turns a string into tokens that are either a string of alphanumeric characters or a single, non-alphanumeric printable character. You would need to extend it to support quoted strings:
import Data.Char

type Token = String

tokenize :: String -> [Token]
{- Here, a token is either a string of alphanumeric characters, or else one
 - non-spacing printable character, such as "(" or ")".
 -}
tokenize [] = []
tokenize (x:xs) | isSpace x = tokenize xs
                | not (isPrint x) = error $
                    "Invalid character " ++ show x ++ " in input."
                | not (isAlphaNum x) = [x]:(tokenize xs)
                | otherwise = let (token, rest) = span isAlphaNum (x:xs)
                              in token:(tokenize rest)
It turns the example into ["(","Y","(","No","(","V","q",")",")","(","V","p",")",")"]. Note that you have access to the entire repertoire of Unicode.
The main function that evaluates this interactively might look like:
main = interact ( unlines . map show . map evaluate . parse . tokenize )
Where parse turns a list of tokens into a list of ASTs and evaluate turns an AST into a printable expression.
As for implementing the parser, your language appears to have similar syntax to LISP, which is one of the simplest languages to parse; you don’t even need precedence rules. A recursive-descent parser could do it, and is probably the easiest to implement by hand. You can pattern-match on parse ("(":xs) =, but pattern-matching syntax can also implement lookahead very easily, for example parse ("(":x1:xs) = to look ahead one token.
If you’re calling the parser recursively, you would define a helper function that consumes only a single expression, and that has a type signature like :: [Token] -> (AST, [Token]). This lets you parse the inner expression, check that the next token is ")", and proceed with the parse. However, externally, you’ll want to consume all the tokens and return an AST or a list of them.
The stylish way to write a parser is with monadic parser combinators. (And maybe someone will post an example of one.) The industrial-strength solution would be a library like Parsec, but that’s probably overkill here. Still, parsing is (mostly!) a solved problem, and if you just want to get the assignment done on time, using a library off the shelf is a good idea.
The read part of a REPL interpreter typically looks like this:
repl :: ForthState -> IO ()    -- parser definition
repl state = do
  putStr "> "                  -- print a > character to indicate we're waiting for input
  input <- getLine             -- this is what you're looking for, to read a line
  if input == "quit"           -- allows the user to quit the interpreter
    then do putStrLn "Bye!"
            return ()
    else let (is, cs, d, output) = eval (words input) state
             -- your grammar definition is somewhere down the chain when eval is called on input
         in do mapM_ putStrLn output
               repl (is, cs, d, [])

main = do putStrLn "Welcome to your very own interpreter!"
          repl initialForthState    -- runs the parser, starting with read
Your eval method will have various loops, stack manipulations, conditionals, etc. to actually figure out what the user inputted. Hope this helps you with at least the reading-input part.

Writing Custom Expression parser or using ANTLR library?

I have expressions like follows:
eg 1: (f1 AND f2)
eg 2: ((f1 OR f2) AND f3)
eg 3: ((f1 OR f2) AND (f3 OR (f4 AND f5)))
Each of f(n) is used to generate a fragment of SQL and each of these fragments will be joined using OR / AND described in the expression.
Now I want to :
1) Parse this expression
2) Validate it
3) Generate "Expression Tree" for the expression and use this tree to generate the final SQL.
I found this series of articles on writing tokenizers, parsers, etc., for example:
http://cogitolearning.co.uk/2013/05/writing-a-parser-in-java-the-expression-tree/
I also came across the library ANTLR, and I am wondering whether I can use it for my case.
Any tips?
I'm guessing you might only be interested in Java (it would be good to say so in future), but if you have a choice of languages, then I would recommend using Python and parsy for a task like this. It is much more lightweight than things like ANTLR.
Here is some example code I knocked together that parses your samples into appropriate data structures:
import attr
from parsy import string, regex, generate

@attr.s
class Variable():
    name = attr.ib()

@attr.s
class Compound():
    left_value = attr.ib()
    right_value = attr.ib()
    operator = attr.ib()

@attr.s
class Expression():
    value = attr.ib()
    # You could put an `evaluate` method here,
    # or `generate_sql` etc.

whitespace = regex(r'\s*')
lexeme = lambda p: whitespace >> p << whitespace

AND = lexeme(string('AND'))
OR = lexeme(string('OR'))
OPERATOR = AND | OR
LPAREN = lexeme(string('('))
RPAREN = lexeme(string(')'))

variable = lexeme((AND | OR | LPAREN | RPAREN).should_fail("not AND OR ( )") >> regex(r"\w+")).map(Variable)

@generate
def compound():
    yield LPAREN
    left = yield variable | compound
    op = yield OPERATOR
    right = yield variable | compound
    yield RPAREN
    return Compound(left_value=left,
                    right_value=right,
                    operator=op)

expression = (variable | compound).map(Expression)
I'm also using attrs for the simple data structures.
The result of parsing is a hierarchy of expressions:
>>> expression.parse("((f1 OR f2) AND (f3 OR (f4 AND f5)))")
Expression(value=Compound(left_value=Compound(left_value=Variable(name='f1'), right_value=Variable(name='f2'), operator='OR'), right_value=Compound(left_value=Variable(name='f3'), right_value=Compound(left_value=Variable(name='f4'), right_value=Variable(name='f5'), operator='AND'), operator='OR'), operator='AND'))
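Since the rest of this thread is about Scala's parser combinators, here is a hedged sketch of the same grammar in that style; the class, object and method names are mine, and it only covers the variable/parenthesized-compound shapes shown in the question. It produces an expression tree you could walk to emit SQL:
import scala.util.parsing.combinator.RegexParsers

sealed trait Expr
case class Variable(name: String) extends Expr
case class Compound(left: Expr, operator: String, right: Expr) extends Expr

object BoolExpr extends RegexParsers {
  def variable: Parser[Expr] = """\w+""".r ^^ Variable
  def compound: Parser[Expr] =
    "(" ~> expr ~ ("AND" | "OR") ~ expr <~ ")" ^^ {
      case l ~ op ~ r => Compound(l, op, r)
    }
  def expr: Parser[Expr] = compound | variable
}

// BoolExpr.parseAll(BoolExpr.expr, "((f1 OR f2) AND f3)")
// succeeds with Compound(Compound(Variable(f1),OR,Variable(f2)),AND,Variable(f3))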

Scala Parser Combinator compilation issue: failing to match multiple variables in case

I am learning how to use the Scala Parser Combinators, which by the way are lovely to work with.
Unfortunately I am getting a compilation error. I have read, and recreated the worked examples from: http://www.artima.com/pins1ed/combinator-parsing.html <-- from Chapter 31 of Programming in Scala, First Edition, and a few other blogs.
I've reduced my code to a much simpler version to demonstrate my problem. I am working on a parser that would parse the following samples:
if a then x else y
if a if b then x else y else z
with the little extra that the conditions can have an optional "/1,2,3" syntax:
if a/1 then x else y
if a/2,3 if b/3,4 then x else y else z
So I have ended up with the following code:
def ifThenElse: Parser[Any] =
  "if" ~> condition ~ inputList ~ yes ~> "else" ~ no
def condition: Parser[Any] = ident
def inputList: Parser[Any] = opt("/" ~> repsep(input, ","))
def input: Parser[Any] = ident
def yes: Parser[Any] = "then" ~> result | ifThenElse
def no: Parser[Any] = result | ifThenElse
def result: Parser[Any] = ident
Now I want to add some transformations. I now get a compilation error on the second ~ in the case:
def ifThenElse: Parser[Any] =
  "if" ~> condition ~ inputList ~ yes ~> "else" ~ no ^^ {
    case c ~ i ~ y ~ n => null
^ constructor cannot be instantiated to expected type; found : SmallestFailure.this.~[a,b] required: String
When I change the code to
"if" ~> condition ~ inputList ~ yes ~> "else" ~ no ^^ {
case c ~ i => println("c: " + c + ", i: " + i)
I expected it not to compile, but it did. I thought I would need a variable for each clause. When executed (using parseAll) parsing "if a then b else c" produces "c: else, i: c". So it seems like c and i are the tail of the string.
I don't know if it is significant, but none of the example tutorials seem to have an example with more than two variables being matched, and this one is matching four.
You do not have to match the "else":
def ifThenElse: Parser[Any] =
  "if" ~> condition ~ inputList ~ (yes <~ "else") ~ no ^^ {
    case c ~ i ~ y ~ n => null
  }
Works as expected.
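The reason the original version compiled with only two pattern variables is operator grouping: ~, ~> and <~ share the same precedence and associate to the left, so p ~> q throws away everything accumulated to the left of the ~>. In the question's chain, that left only "else" ~ no to match against. A tiny hedged illustration (the object and parser names are mine):
import scala.util.parsing.combinator.RegexParsers

object Assoc extends RegexParsers {
  // groups as ((("a" ~ "b") ~ "c") ~> "d") ~ "e": the results of a, b and c
  // are discarded and only (d~e) survives, just as the question's chain
  // kept only "else" paired with the result of no.
  def chain = "a" ~ "b" ~ "c" ~> "d" ~ "e"
}

// Assoc.parseAll(Assoc.chain, "a b c d e") succeeds with (d~e)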

Haskell: Traverse through a String/Text File

I am trying to read a script file, then process it and output it to an HTML file. In my script file, whenever there is a #title(this is a title), I will add the tag [header]this is a title[/header] to my HTML output. So my approach is to first read the script file, write the content to a string, process the string, then write the string to the HTML file.
In order to recognize the #title, I will need to read the string character by character. When I read '#', I will need to check the next characters to see if they are t i t l e.
QUESTION: How do I traverse through a string (which is a list of char) in Haskell?
You could use a simple recursion trick, for example:
findTag [] = -- end of list code.
findTag ('#':xs)
  | take 5 xs == "title" = -- your code for #title
  | otherwise = findTag xs
findTag (_:xs) = findTag xs
So basically you just pattern match on whether the next char (the head of the list) is '#', and then you check whether the next 5 characters form "title". If so, you can continue with your parsing code. If the next character isn't '#', you just continue recursing. Once the list is empty you reach the first pattern match.
Someone else might have a better solution.
I hope this answers your question.
edit:
For a bit more flexibility, if you want to find a specific tag you could do this:
findTag [] _ = -- end of list code.
findTag ('#':xs) tagName
  | take (length tagName) xs == tagName = -- your code for #title
  | otherwise = findTag xs tagName
findTag (_:xs) tagName = findTag xs tagName
This way if you do
findTag text "title"
You'll specifically look for the title, and you can always change the tagname to whatever you want.
Another edit:
findTag [] _ = -- end of list code.
findTag ('#':xs) tagName
  | take tLength xs == tagName = getTagContents tLength xs
  | otherwise = findTag xs tagName
  where tLength = length tagName
findTag (_:xs) tagName = findTag xs tagName
getTagContents :: Int -> String -> String
getTagContents len = takeWhile (/=')') . drop (len + 1)
To be honest, it's getting a bit messy, but here's what's happening:
You first drop the length of the tagName, then one more for the open bracket, and then you finish off by using takeWhile to take the characters until the closing bracket.
Evidently your problem falls into the parsing category. As wisely stated by Daniel Wagner, for maintainability reasons you're much better off approaching it generally with a parser.
Another thing is if you want to work with textual data efficiently, you're better off using Text instead of String.
Here's how you could solve your problem using the Attoparsec parser library:
-- For autocasting of hardcoded strings to `Text` type
{-# LANGUAGE OverloadedStrings #-}

-- Import a way more convenient prelude, excluding symbols conflicting
-- with the parser library. See
-- http://hackage.haskell.org/package/classy-prelude
import ClassyPrelude hiding (takeWhile, try)
-- Exclude the standard Prelude
import Prelude ()
import Data.Attoparsec.Text

-- A parser and an in-place converter for title
title = do
  string "#title("
  r <- takeWhile $ notInClass ")"
  string ")"
  return $ "[header]" ++ r ++ "[/header]"

-- A parser which parses the whole document into parts which are either
-- single-character `Text`s or modified titles
parts =
  (try endOfInput >> return []) ++
  ((:) <$> (try title ++ (singleton <$> anyChar)) <*> parts)

-- The topmost parser which concats all parts into a single text
top = concat <$> parts

-- A sample input
input = "aldsfj#title(this is a title)sdlfkj#title(this is a title2)"

-- Run the parser and output the result
main = print $ parseOnly top input
This outputs
Right "aldsfj[header]this is a title[/header]sdlfkj[header]this is a title2[/header]"
P.S. ClassyPrelude reimplements ++ as an alias for Monoid's mappend, so you can replace it with mappend, <> or Alternative's <|> if you want.
For pattern search-and-replace, you can use streamEdit.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char

title :: Parsec Void String String
title = do
  void $ string "#title("
  someTill anySingle $ string ")"

editor t = "[header]" ++ t ++ "[/header]"

streamEdit title editor " #title(this is a title) "
" [header]this is a title[/header] "

IndentParser example

Could someone please post a small example of IndentParser usage? I am looking to parse YAML-like input like the following:
fruits:
  apples: yummy
  watermelons: not so yummy
vegetables:
  carrots: are orange
  celery raw: good for the jaw
I know there is a YAML package. I would like to learn the usage of IndentParser.
I've sketched out a parser below; for your problem you probably only need the block parser from IndentParser. Note I haven't tried to run it, so it might have elementary errors.
The biggest problem for your parser is not really indenting, but that you only have strings and colon as tokens. You might find the code below takes quite a bit of debugging as it will have to be very sensitive about not consuming too much input, though I have tried to be careful about left-factoring. Because you only have two tokens there isn't much benefit you can get from Parsec's Token module.
Note that there is a strange truth to parsing: simple-looking formats are often not simple to parse. For learning, writing a parser for simple expressions will teach you much more than a more-or-less arbitrary text format (which might only cause you frustration).
data DefinitionTree = Nested String [DefinitionTree]
                    | Def String String
  deriving (Show)
-- Note - this might need some testing.
--
-- This is a tricky one, the parser has to parse trailing
-- spaces and tabs but not a new line.
--
category :: IndentCharParser st String
category = do
    { a <- body
    ; rest
    ; return a
    }
  where
    body = manyTill1 (letter <|> space) (char ':')
    rest = many (oneOf [' ', '\t'])
-- Because the DefinitionTree data type has two quite
-- different constructors, both sharing the same prefix
-- 'category' this combinator is a bit more complicated
-- than usual, and has to use an Either type to descriminate
-- between the options.
--
definition :: IndentCharParser st DefinitionTree
definition = do
    { a <- category
    ; b <- (textL <|> definitionsR)
    ; case b of
        Left ss -> return (Def a ss)
        Right ds -> return (Nested a ds)
    }
-- Note this should parse a string *provided* it is on
-- the same line as the category.
--
-- However you might find this assumption needs verifying...
--
textL :: IndentCharParser st (Either String a)
textL = do
    { ss <- manyTill1 anyChar "\n"
    ; return (Left ss)
    }
-- Finally this one uses an indent parser.
--
definitionsR :: IndentCharParser st (Either a [DefinitionTree])
definitionsR = block body
  where
    body = do { a <- many1 definition; return (Right a) }
