In my work I come across a lot of gnarly sql, and I had the bright idea of writing a program to parse the sql and print it out neatly. I made most of it pretty quickly, but I ran into a problem that I don't know how to solve.
So let's pretend the sql is "select foo from bar where 1". My thought was that there is always a keyword followed by data for it, so all I have to do is parse a keyword, and then capture all gibberish before the next keyword and store that for later cleanup, if it is worthwhile. Here's the code:
import Text.Parsec
import Text.Parsec.Combinator
import Text.Parsec.Char
import Data.Text (strip)
newtype Statement = Statement [Atom]
data Atom = Branch String [Atom] | Leaf String deriving Show
trim str = reverse $ trim' (reverse $ trim' str)
where
trim' (' ':xs) = trim' xs
trim' str = str
printStatement atoms = mapM_ printAtom atoms
printAtom atom = loop 0 atom
where
loop depth (Leaf str) = putStrLn $ (replicate depth ' ') ++ str
loop depth (Branch str atoms) = do
putStrLn $ (replicate depth ' ') ++ str
mapM_ (loop (depth + 2)) atoms
keywords :: [String]
keywords = [
"select",
"update",
"delete",
"from",
"where"]
keywordparser :: Parsec String u String
keywordparser = try ((choice $ map string keywords) <?> "keywordparser")
stuffparser :: Parsec String u String
stuffparser = manyTill anyChar (eof <|> (lookAhead keywordparser >> return ()))
statementparser = do
key <- keywordparser
stuff <- stuffparser
return $ Branch key [Leaf (trim stuff)]
<?> "statementparser"
tp = parse (many statementparser) ""
The key here is the stuffparser. That is the stuff in between the keywords that could be anything from column lists to where criteria. This function catches all characters leading up to a keyword. But it needs something else before it is finished. What if there is a subselect? "select id,(select product from products) from bar". Well in that case if it hits that keyword, it screws everything up, parses it wrong and screws up my indenting. Also where clauses can have parenthesis as well.
So I need to change that anyChar into another combinator that slurps up characters one at a time but also tries to look for parenthesis, and if it finds them, traverse and capture all that, but also if there are more parenthesis, do that until we have fully closed the parenthesis, then concatenate it all and return it. Here's what I've tried, but I can't quite get it to work.
stuffparser :: Parsec String u String
stuffparser = fmap concat $ manyTill somechars (eof <|> (lookAhead keywordparser >> return ()))
where
somechars = parens <|> fmap (\c -> [c]) anyChar
parens= between (char '(') (char ')') somechars
This will error like so:
> tp "select asdf(qwerty) from foo where 1"
Left (line 1, column 14):
unexpected "w"
expecting ")"
But I can't think of any way to rewrite this so that it works. I've tried to use manyTill on the parenthesis part, but I end up having trouble getting it to typecheck when I have both string producing parens and single chars as alternatives. Does anyone have any suggestions on how to go about this?
Yeah, between might not work for what you're looking for. Of course, for your use case, I'd follow hammar's suggestion and grab an off-the-shelf SQL parser. (personal opinion: or, try not to use SQL unless you really have to; the idea to use strings for database queries was imho a historical mistake).
Note: I add an operator called <++> which will concatenate the results of two parsers, whether they are strings or characters. (code at bottom.)
First, for the task of parsing parenthesis: the top level will parse some stuff between the relevant characters, which is exactly what the code says,
parseParen = char '(' <++> inner <++> char ')'
Then, the inner function should parse anything else: non-parens, possibly including another set of parenthesis, and non-paren junk that follows.
parseParen = char '(' <++> inner <++> char ')' where
inner = many (noneOf "()") <++> option "" (parseParen <++> inner)
I'll make the assumption that for the rest of the solution, what you want to do is analgous to splitting things up by top-level SQL keywords. (i.e. ignoring those in parenthesis). Namely, we'll have a parser that will behave like so,
Main> parseTest parseSqlToplevel "select asdf(select m( 2) fr(o)m w where n) from b where delete 4"
[(Select," asdf(select m( 2) fr(o)m w where n) "),(From," b "),(Where," "),(Delete," 4")]
Suppose we have a parseKw parser that will get the likes of select, etc. After we consume a keyword, we need to read until the next [top-level] keyword. The last trick to my solution is using the lookAhead combinator to determine whether the next word is a keyword, and put it back if so. If it's not, then we consume a parenthesis or other character, and then recurse on the rest.
-- consume spaces, then eat a word or parenthesis
parseOther = many space <++>
(("" <$ lookAhead (try parseKw)) <|> -- if there's a keyword, put it back!
option "" ((parseParen <|> many1 (noneOf "() \t")) <++> parseOther))
My entire solution is as follows
-- overloaded operator to concatenate string results from parsers
class CharOrStr a where toStr :: a -> String
instance CharOrStr Char where toStr x = [x]
instance CharOrStr String where toStr = id
infixl 4 <++>
f <++> g = (\x y -> toStr x ++ toStr y) <$> f <*> g
data Keyword = Select | Update | Delete | From | Where deriving (Eq, Show)
parseKw =
(Select <$ string "select") <|>
(Update <$ string "update") <|>
(Delete <$ string "delete") <|>
(From <$ string "from") <|>
(Where <$ string "where") <?>
"keyword (select, update, delete, from, where)"
-- consume spaces, then eat a word or parenthesis
parseOther = many space <++>
(("" <$ lookAhead (try parseKw)) <|> -- if there's a keyword, put it back!
option "" ((parseParen <|> many1 (noneOf "() \t")) <++> parseOther))
parseSqlToplevel = many ((,) <$> parseKw <*> (space <++> parseOther)) <* eof
parseParen = char '(' <++> inner <++> char ')' where
inner = many (noneOf "()") <++> option "" (parseParen <++> inner)
edit - version with quote support
you can do the same thing as with the parens to support quotes,
import Control.Applicative hiding (many, (<|>))
import Text.Parsec
import Text.Parsec.Combinator
-- overloaded operator to concatenate string results from parsers
class CharOrStr a where toStr :: a -> String
instance CharOrStr Char where toStr x = [x]
instance CharOrStr String where toStr = id
infixl 4 <++>
f <++> g = (\x y -> toStr x ++ toStr y) <$> f <*> g
data Keyword = Select | Update | Delete | From | Where deriving (Eq, Show)
parseKw =
(Select <$ string "select") <|>
(Update <$ string "update") <|>
(Delete <$ string "delete") <|>
(From <$ string "from") <|>
(Where <$ string "where") <?>
"keyword (select, update, delete, from, where)"
-- consume spaces, then eat a word or parenthesis
parseOther = many space <++>
(("" <$ lookAhead (try parseKw)) <|> -- if there's a keyword, put it back!
option "" ((parseParen <|> parseQuote <|> many1 (noneOf "'() \t")) <++> parseOther))
parseSqlToplevel = many ((,) <$> parseKw <*> (space <++> parseOther)) <* eof
parseQuote = char '\'' <++> inner <++> char '\'' where
inner = many (noneOf "'\\") <++>
option "" (char '\\' <++> anyChar <++> inner)
parseParen = char '(' <++> inner <++> char ')' where
inner = many (noneOf "'()") <++>
(parseQuote <++> inner <|> option "" (parseParen <++> inner))
I tried it with parseTest parseSqlToplevel "select ('a(sdf'())b". cheers
Related
Inspired by the following project, I am working with linear expressions and I have defined the following structure and parser.
data AExp
= Lit Rational
| Var String
| AExp :+: AExp
| Rational :*: AExp
deriving (Eq)
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.Expr
import Text.Parsec.Language (javaStyle)
import Text.Parsec.String
import Control.Monad (void, ap)
import qualified Text.Parsec.Token as Token
Token.TokenParser {..} = Token.makeTokenParser javaStyle
binary name fun = Infix (fun <$ reservedOp name) AssocLeft
whitespace :: Parser ()
whitespace = void $ many $ oneOf " \n\t"
regularParse :: Parser a -> String -> Either ParseError a
regularParse p = parse p ""
rational :: Parser Rational
rational = do
whitespace
num <- many1 digit
void $ char '/'
den <- many1 digit
whitespace
return $ toRational $ (read num)/ (read den)
aexp :: Parser AExp
aexp = buildExpressionParser table term
where term = Lit <$> rational
<|> Var <$> identifier
<|> try ((:*:) <$> (rational <* reservedOp "*") <*> aexp)
<|> try (parens aexp)
table = [ [ binary "+" (:+:)]]
My problem is multiplication (:*:) and in general binary operations between two different types (Rational - AExp). The following example shows my result.
Main>regularParse aexp "10/1 * x"
Right (Lit 10 % 1)
That is, it does not match with multiplication (*), it matches with the literal variable Lit.
I have looked for some examples of parser of expressions, but in which I always found the multiplication is a binary operation between AExp, that is, the structure and parser are of this style:
data AExp
= Lit Rational
| Var String
| AExp:+: AExp
| AExp:*: AExp
deriving (Eq)
aexp :: Parser AExp
aexp = buildExpressionParser table term
where term = Lit <$> rational
<|> Var <$> identifier
<|> try ((:*:) <$> (rational <* reservedOp "*") <*> aexp)
<|> try (parens aexp)
table = [[binary "+" (:+:)],
[binary "*" (:*:)]]
The project I am following is defined in this way.
I could try to take another focus for the parser, but for me it is easier to follow the project guide, since many of my structures are very similar.
How could I define the multiplication parser (:*:) or is there any example with to guide me?
Thanks in advance
The problem is here:
term = Lit <$> rational
<|> ...
<|> try ((:*:) <$> (rational <* reservedOp "*") <*> aexp)
<|> ...
This will try parsing a lone rational first; if that succeeds, the remaining branches will not be attempted. This policy of committing to the first successful parse was chosen for efficiency. Just changing the order will get you over this hump and to whatever your next problem will be (there's always one, isn't there??).
term = try ((:*:) <$> (rational <* reservedOp "*") <*> aexp)
<|> Lit <$> rational
<|> ...
I thought a bit more about my original problem. I want to make a parser for linear expressions.
A limited, but sufficient solution is:
varAExp :: Parser AExp
varAExp = do
x <- identifier
return $ Var x
aexp :: Parser AExp
aexp = buildExpressionParser table term
where term = try ((:*:) <$> (rational <* reservedOp "*") <*> varAExp)
<|> Lit <$> rational
<|> Var <$> identifier
<|> try (parens aexp)
table = [[binary "+" (:+:) ]]
With this parser I can work with expressions of this type:
arit_1 = regularParse aexp "10/1*x + 3/2*y + 1/1 + 3/1"
I cannot express the multiplication of numbers, but as I said before it is enough for me.
thanks for your help
I was working on "Write Yourself a Scheme in 48 hours" to learn Haskell and I've run into a problem I don't really understand. It's for question 2 from the exercises at the bottom of this section.
The task is to rewrite
import Text.ParserCombinators.Parsec
parseString :: Parser LispVal
parseString = do
char '"'
x <- many (noneOf "\"")
char '"'
return $ String x
such that quotation marks which are properly escaped (e.g. in "This sentence \" is nonsense") get accepted by the parser.
In an imperative language I might write something like this (roughly pythonic pseudocode):
def parseString(input):
if input[0] != "\"" or input[len(input)-1] != "\"":
return error
input = input[1:len(input) - 1] # slice off quotation marks
output = "" # This is the 'zero' that accumulates over the following loop
# If there is a '"' in our string we want to make sure the previous char
# was '\'
for n in range(len(input)):
if input[n] == "\"":
try:
if input[n - 1] != "\\":
return error
catch IndexOutOfBoundsError:
return error
output += input[n]
return output
I've been looking at the docs for Parsec and I just can't figure out how to work this as a monadic expression.
I got to this:
parseString :: Parser LispVal
parseString = do
char '"'
regular <- try $ many (noneOf "\"\\")
quote <- string "\\\""
char '"'
return $ String $ regular ++ quote
But this only works for one quotation mark and it has to be at the very end of the string--I can't think of a functional expression that does the work that my loops and if-statements do in the imperative pseudocode.
I appreciate you taking your time to read this and give me advice.
Try something like this:
dq :: Char
dq = '"'
parseString :: Parser Val
parseString = do
_ <- char dq
x <- many ((char '\\' >> escapes) <|> noneOf [dq])
_ <- char dq
return $ String x
where
escapes = dq <$ char dq
<|> '\n' <$ char 'n'
<|> '\r' <$ char 'r'
<|> '\t' <$ char 't'
<|> '\\' <$ char '\\'
The solution is to define a string literal as a starting quote + many valid characters + an ending quote where a "valid character" is either a an escape sequence or non-quote.
So there is a one line change to parseString:
parseString = do char '"'
x <- many validChar
char '"'
return $ String x
and we add the definitions:
validChar = try escapeSequence <|> satisfy ( /= '"' )
escapeSequence = do { char '\\'; anyChar }
escapeSequence may be refined to allow a limited set of escape sequences.
I'm trying to parse a small language with Haskell-like syntax, using parsec-layout. The two key features that don't seem to interact too well with each other are:
Function application syntax is juxtaposition, i.e. if F and E are terms, F E is the syntax for F applied to E.
Indentation can be used to denote nesting, i.e. the following two are equivalent:
X = case Y of
A -> V
B -> W
X = case Y of A -> V; B -> W
I haven't managed to figure out a combination of skipping and keeping whitespace that would allow me to parse a list of such definitions. Here's my simplified code:
import Text.Parsec hiding (space, runP)
import Text.Parsec.Layout
import Control.Monad (void)
type Parser = Parsec String LayoutEnv
data Term = Var String
| App Term Term
| Case Term [(String, Term)]
deriving Show
name :: Parser String
name = spaced $ (:) <$> upper <*> many alphaNum
kw :: String -> Parser ()
kw = void . spaced . string
reserved :: String -> Parser ()
reserved s = try $ spaced $ string s >> notFollowedBy alphaNum
term :: Parser Term
term = foldl1 App <$> part `sepBy1` space
where
part = choice [ caseBlock
, Var <$> name
]
caseBlock = Case <$> (reserved "case" *> term <* reserved "of") <*> laidout alt
alt = (,) <$> (name <* kw "->") <*> term
binding :: Parser (String, Term)
binding = (,) <$> (name <* kw "=") <*> term
-- https://github.com/luqui/parsec-layout/issues/1
trim :: String -> String
trim = reverse . dropWhile (== '\n') . reverse
runP :: Parser a -> String -> Either ParseError a
runP p = runParser (p <* eof) defaultLayoutEnv "" . trim
If I try to run it on input like
s = unlines [ "A = case B of"
, " X -> Y Z"
, "C = D"
]
via runP (laidout binding) s, it fails on the application Y Z:
(line 2, column 10):
expecting space or semi-colon
However, if I change the definition of term to
term = foldl1 App <$> many1 part
then it doesn't stop parsing the term at the start of the (unindented!) third line, leading to
(line 3, column 4):
expecting semi-colon
I think the problem has to do with that name already eliminates the following space, so the sepBy1 in the definition of term doesn't see it.
Consider these simplified versions of term:
term0 = foldl1 App <$> (Var <$> name) `sepBy1` space
term1 = foldl1 App <$> (Var <$> name') `sepBy1` space
name' = (:) <$> upper <*> many alphaNum
term3 = foldl1 App <$> many (Var <$> name)
Then:
runP term0 "A B C" -- fails
runP term1 "A B C" -- succeeds
runP term3 "A B C" -- succeeds
I think part of the solution is to define
part = [ caseBlock, Var <$> name' ]
where name' is as above. However, there are still some issues.
I'm trying to learn Parsec by implementing a small regular expression parser. In BNF, my grammar looks something like:
EXP : EXP *
| LIT EXP
| LIT
I've tried to implement this in Haskell as:
expr = try star
<|> try litE
<|> lit
litE = do c <- noneOf "*"
rest <- expr
return (c : rest)
lit = do c <- noneOf "*"
return [c]
star = do content <- expr
char '*'
return (content ++ "*")
There are some infinite loops here though (e.g. expr -> star -> expr without consuming any tokens) which makes the parser loop forever. I'm not really sure how to fix it though, because the very nature of star is that it consumes its mandatory token at the end.
Any thoughts?
You should use Parsec.Expr.buildExprParser; it is ideal for this purpose. You simply describe your operators, their precedence and associativity, and how to parse an atom, and the combinator builds the parser for you!
You probably also want to add the ability to group terms with parens so that you can apply * to more than just a single literal.
Here's my attempt (I threw in |, +, and ? for good measure):
import Control.Applicative
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr
data Term = Literal Char
| Sequence [Term]
| Repeat (Int, Maybe Int) Term
| Choice [Term]
deriving ( Show )
term :: Parser Term
term = buildExpressionParser ops atom where
ops = [ [ Postfix (Repeat (0, Nothing) <$ char '*')
, Postfix (Repeat (1, Nothing) <$ char '+')
, Postfix (Repeat (0, Just 1) <$ char '?')
]
, [ Infix (return sequence) AssocRight
]
, [ Infix (choice <$ char '|') AssocRight
]
]
atom = msum [ Literal <$> lit
, parens term
]
lit = noneOf "*+?|()"
sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
parens = between (char '(') (char ')')
seqTerms (Sequence ts) = ts
seqTerms t = [t]
choiceTerms (Choice ts) = ts
choiceTerms t = [t]
main = parseTest term "he(llo)*|wor+ld?"
Your grammar is left-recursive, which doesn’t play nice with try, as Parsec will repeatedly backtrack. There are a few ways around this. Probably the simplest is just making the * optional in another rule:
lit :: Parser (Char, Maybe Char)
lit = do
c <- noneOf "*"
s <- optionMaybe $ char '*'
return (c, s)
Of course, you’ll probably end up wrapping things in a data type anyway, and there are a lot of ways to go about it. Here’s one, off the top of my head:
import Control.Applicative ((<$>))
data Term = Literal Char
| Sequence [Term]
| Star Term
expr :: Parser Term
expr = Sequence <$> many term
term :: Parser Term
term = do
c <- lit
s <- optionMaybe $ char '*' -- Easily extended for +, ?, etc.
return $ if isNothing s
then Literal c
else Star $ Literal c
Maybe a more experienced Haskeller will come along with a better solution.
I try to do this:
Parse a Text in the form:
Some Text #{0,0,0} some Text #{0,0,0}#{0,0,0} more Text #{0,0,0}
into a list of some data structure:
[Inside "Some Text ",Outside (0,0,0),Inside " some Text ",Outside (0,0,0),Outside (0,0,0),Inside " more Text ",Outside (0,0,0)]
So these #{a,b,c}-bits should turn into different things as the rest of the text.
I have this code:
module ParsecTest where
import Text.ParserCombinators.Parsec
import Monad
type Reference = (Int, Int, Int)
data Transc = Inside String | Outside Reference
deriving (Show)
text :: Parser Transc
text = do
x <- manyTill anyChar ((lookAhead reference) <|> (eof >> return (Inside "")));
return (Inside x)
transc = reference <|> text
alot :: Parser [Transc]
alot = do
manyTill transc eof
reference :: Parser Transc
reference = try (do{ char '#';
char '{';
a <- number;
char ',';
b <- number;
char ',';
c <- number;
char '}';
return (Outside (a,b,c)) })
number :: Parser Int
number = do{ x <- many1 digit;
return (read x) }
This works as expected. You can test this in ghci by typing
parseTest alot "Some Text #{0,0,0} some Text #{0,0,0}#{0,0,0} more Text #{0,0,0}"
But I think it's not nice.
1) Is the use of lookAhead really necessary for my problem?
2) Is the return (Inside "") an ugly hack?
3) Is there generally a more concise/smarter way to archieve the same?
1) I think you do need lookAhead as you need the result of that parse. It would be nice to avoid running that parser twice by having a Parser (Transc,Maybe Transc) to indicate an Inside with an optional following Outside. If performance is an issue, then this is worth doing.
2) Yes.
3) Applicatives
number2 :: Parser Int
number2 = read <$> many1 digit
text2 :: Parser Transc
text2 = (Inside .) . (:)
<$> anyChar
<*> manyTill anyChar (try (lookAhead reference2) *> pure () <|> eof)
reference2 :: Parser Transc
reference2 = ((Outside .) .) . (,,)
<$> (string "#{" *> number2 <* char ',')
<*> number2
<*> (char ',' *> number2 <* char '}')
transc2 = reference2 <|> text2
alot2 = many transc2
You may want to rewrite the beginning of reference2 using a helper like aux x y z = Outside (x,y,z).
EDIT: Changed text to deal with inputs that don't end with an Outside.