Lazy parsing of large files with millions of data points - parsing

I have a parser I'm trying to write and I've gone over multiple versions of it and I can't seem to keep the memory usage down. I'm trying to parse the wikipedia sql dumps and in this example take the page entries file and throw them all in one giant vector (13 million pages). I got this working before, but I sharded the file into smaller pieces and aggregated them from the filesystem. Anyway, below is my attempt to parse these files in one pass. I'm pulling out the lines "INSERT INTO ... (data),(data)..(data);". I'm letting the other lines just fail in the parser and not yield values. The insert lines average 1MB in size and 8k entries.
I've tried what feels like a thousand versions of this with deepseqs and outputting lists of page entries per line instead of the vector I have now and using lineC with a partial parser instead of the linesUnboundedC. I assume there must be some high level concept I'm missing that is keeping the memory usage exponential. The file is 5GB and I'm easily blowing through 16GB of memory, but I can't seem to nail down a lazy thunk memory leak anywhere. I would assume that this code would be able to store each line independently and then move on without the extra memory to the next line, but I can't figure out what is wrong. I had no issue counting the amount of entries or using a more complex monoid to get data from the file using the same parser.
Any help is much appreciated!
{-# LANGUAGE OverloadedStrings #-}
module Parser.Helper where
import Conduit
import Control.DeepSeq
import Control.Monad (void)
import Control.Monad.Primitive
import Data.Attoparsec.Text
import Data.Maybe (catMaybes)
import Data.Text (Text)
import qualified Data.Vector as V
import Data.Vector.Generic (Vector)
import System.IO (openFile,IOMode(..))
data Page = Page !Int !Text !Bool
deriving Show
instance NFData Page where
rnf (Page pid title isRedirect) = pid `seq` title `seq` isRedirect `seq` ()
parseFile
:: (Foldable t, Vector v e)
=> FilePath -> Parser (t e) -> IO [v e]
parseFile fp parser =
do
handle <- openFile fp ReadMode
runConduit $ sourceHandle handle
.| decodeUtf8LenientC
.| peekForeverE linesUnboundedC
.| vectors parser
.| sinkList
vectors
:: (Foldable t, Vector v e, MonadBase base m, Control.Monad.Primitive.PrimMonad base)
=> Parser (t e) -> ConduitM Text (v e) m ()
vectors parser =
do
vectorBuilderC (1024*1024)
(\f ->
peekForeverE $ do
ml <- await
case ml of
Nothing -> return ()
Just l ->
case parseOnly parser l of
Left _ -> return ()
Right v -> mapM_ f v
)
parsePageLine :: Parser (V.Vector Page)
parsePageLine =
do
string "INSERT INTO" *> skipWhile (/= '(')
V.fromList . catMaybes <$> sepBy' parsePageField (char ',') <* char ';'
parsePageField :: Parser (Maybe Page)
parsePageField =
do
void $ char '('
pid <- parseInt <* char ','
namespace <- parseInt <* char ','
title <- parseTextField <* char ','
_ <- skipField <* char ','
_ <- skipField <* char ','
redirect <- parseInt <* char ','
void $ sepBy' skipField (char ',')
void $ char ')'
let ret = case namespace == 0 of
True -> Just $ Page pid title (redirect == 1)
False -> Nothing
return $ force ret
parseTextField :: Parser Text
parseTextField = char '\'' *> scan False f <* char '\''
where
f :: Bool -> Char -> Maybe Bool
f False '\'' = Nothing
f False '\\' = Just True
f _ _ = Just False
skipField :: Parser ()
skipField = void $ scan False f
where
f :: Bool-> Char -> Maybe Bool
f False '\\' = Just True
f False ',' = Nothing
f False ')' = Nothing
f _ _ = Just False
parseInt :: Parser Int
parseInt = signed $ decimal

Related

Haskell - Parsec with state

I have a file in which the state of a game is saved in String format. This string consists of a list of moves, seperated by ,. From this list of moves I have to reconstruct the game state. Thus, conceptually, for each move I parse I want to modify the gamestate appropriatly and pass this gamestate to the parsing of the next move. Conceptually, this could be equivalent with having an empty list at the start and for each move consing the parsed move to that list. At the end you should have a list with all parsed moves.
I made the code example below as a simplified version to parse alfabetic letters and push these on the list. The core concept I want to learn is how to have an initial state, pass this on for each parsing cycle and return the final state using parsec. someState is the empty list initially.
parseExample :: State -> Parser [Char]
parseExample someState = do spaces
c <- char
c : someState
return someState
The simplest way to incorporate a "state" into a parser is not to do it at all. Let's say we have a tic-tac-toe board:
data Piece = X | O | N deriving (Show)
type Board = [[Piece]]
To parse a list of moves:
X11,O00,X01
into a board [[O,X,N],[N,X,N],[N,N,N]] representing the game state:
O | X |
---+---+---
| X |
---+---+---
| |
we can separate the parser, which just generates a list of moves:
data Move = Move Piece Int Int
moves :: Parser [Move]
moves = sepBy move (char ',')
where move = Move <$> piece <*> num <*> num
piece = X <$ char 'X' <|> O <$ char 'O'
num = read . (:[]) <$> digit
from the functions that regenerate the game state:
board0 :: Board
board0 = [[N,N,N],[N,N,N],[N,N,N]]
game :: [Move] -> Board
game = foldl' turn board0
turn :: Board -> Move -> Board
turn brd (Move p r c) = brd & ix r . ix c .~ p
and then connect them together in a loadGame function:
loadGame :: String -> Board
loadGame str =
case parse moves "" str of
Left err -> error $ "parse error: " ++ show err
Right mvs -> game mvs
This should be your go-to solution for this kind of problem: parse first into a simple stateless intermediate form, and then process this intermediate form in a "stateful" computation.
If you really want to build up the state during the parse, there are several ways to do it. In this particular case, given the definition of turn above, we can parse directly into a Board by incorporating the fold from the game function into the parser:
moves1 :: Parser Board
moves1 = foldl' turn board0 <$> sepBy move (char ',')
where move = Move <$> piece <*> num <*> num
piece = X <$ char 'X' <|> O <$ char 'O'
num = read . (:[]) <$> digit
but this won't generalize too well if you have multiple parsers that need to operate on a single underlying state.
To actually thread a state through a set of parsers, you can use the "user state" feature of Parsec. Define a parser with a Board user state:
type Parser' = Parsec String Board
and then a parser for a single move that modifies the user state:
move' :: Parser' ()
move' = do
m <- Move <$> piece <*> num <*> num
modifyState (flip turn m)
where piece = X <$ char 'X' <|> O <$ char 'O'
num = read . (:[]) <$> digit
Note that move''s return type is () because its action is implemented as a side-effect on the user state.
Now, the act of simply parsing a list of moves:
moves' :: Parser' ()
moves' = sepBy move' (char ',')
will generate the final game state:
loadGame' :: String -> Board
loadGame' str =
case runParser (moves' >> getState) [[N,N,N],[N,N,N],[N,N,N]] "" str of
Left err -> error $ "parse error: " ++ show err
Right brd -> brd
Here, loadGame' runs the parser on the user state with moves' and then uses a getState call to fetch the final board.
A nearly equivalent solution, since ParsecT is a monad transformer, is to create a ParsecT ... (State Board) monad transformer stack with a standard State layer. For example:
type Parser'' = ParsecT String () (Control.Monad.State.State Board)
move'' :: Parser'' ()
move'' = do
m <- Move <$> piece <*> num <*> num
modify (flip turn m)
where piece = X <$ char 'X' <|> O <$ char 'O'
num = read . (:[]) <$> digit
moves'' :: Parser'' ()
moves'' = void $ sepBy move'' (char ',')
loadGame'' :: String -> Board
loadGame'' str =
case runState (runParserT moves'' () "" str) board0 of
(Left err, _) -> error $ "parse error: " ++ show err
(Right (), brd) -> brd
However, both these approaches of building up a state while parsing are weird and non-standard. A parser written in this form will be harder to understand and modify than the standard approach. Also, the intended usage for a user state is to maintain state that's necessary for the parser to decide how to perform the actual parse. For example, if you were parsing a language with dynamic operator precedence, you might want to maintain the current set of operator precedences as states, so when you parse an infixr 8 ** line, you can modify the state to correctly parse subsequent expressions. Using the user state to actually build up the result of the parse is not the intended usage.
Anyway, here's the code I used:
import Control.Lens
import Control.Monad
import Control.Monad.State
import Data.Foldable
import Text.Parsec
import Text.Parsec.Char
import Text.Parsec.String
data Piece = X | O | N deriving (Show)
type Board = [[Piece]]
data Move = Move Piece Int Int
-- *Standard parsing approach
moves :: Parser [Move]
moves = sepBy move (char ',')
where move = Move <$> piece <*> num <*> num
piece = X <$ char 'X' <|> O <$ char 'O'
num = read . (:[]) <$> digit
board0 :: Board
board0 = [[N,N,N],[N,N,N],[N,N,N]]
game :: [Move] -> Board
game = foldl' turn board0
turn :: Board -> Move -> Board
turn brd (Move p r c) = brd & ix r . ix c .~ p
loadGame :: String -> Board
loadGame str =
case parse moves "" str of
Left err -> error $ "parse error: " ++ show err
Right mvs -> game mvs
-- *Incoporate fold into parser
moves1 :: Parser Board
moves1 = foldl' turn board0 <$> sepBy move (char ',')
where move = Move <$> piece <*> num <*> num
piece = X <$ char 'X' <|> O <$ char 'O'
num = read . (:[]) <$> digit
-- *Non-standard effectful parser
type Parser' = Parsec String Board
move' :: Parser' ()
move' = do
m <- Move <$> piece <*> num <*> num
modifyState (flip turn m)
where piece = X <$ char 'X' <|> O <$ char 'O'
num = read . (:[]) <$> digit
moves' :: Parser' ()
moves' = void $ sepBy move' (char ',')
loadGame' :: String -> Board
loadGame' str =
case runParser (moves' >> getState) board0 "" str of
Left err -> error $ "parse error: " ++ show err
Right brd -> brd
-- *Monad transformer stack
type Parser'' = ParsecT String () (Control.Monad.State.State Board)
move'' :: Parser'' ()
move'' = do
m <- Move <$> piece <*> num <*> num
modify (flip turn m)
where piece = X <$ char 'X' <|> O <$ char 'O'
num = read . (:[]) <$> digit
moves'' :: Parser'' ()
moves'' = void $ sepBy move'' (char ',')
loadGame'' :: String -> Board
loadGame'' str =
case runState (runParserT moves'' () "" str) board0 of
(Left err, _) -> error $ "parse error: " ++ show err
(Right (), brd) -> brd
-- *Tests
main = do
print $ loadGame "X11,O00,X01"
print $ loadGame' "X11,O00,X01"
print $ loadGame'' "X11,O00,X01"
You probably want to use foldl (if I am understanding your question correctly). So you'll end up with a function like:
module Main where
import Data.Text
import Data.String
main :: IO ()
main =
putStrLn (show $ parseGameState "a, b, c")
data State = State deriving (Show)
parseGameState :: String -> [State]
parseGameState stateString = parsedState where
parsedState = Prelude.foldl mkNewStateFromPreviousAndMove [] moves where
moves = splitOn (fromString ",") (fromString stateString)
mkNewStateFromPreviousAndMove oldStates move = oldStates ++ [newState previousState move] where
previousState = Prelude.last oldStates
newState previousState move = State
What this is doing is:
Taking the CSV move string as input.
It is then splitting that string into a list of move strings.
Then, we start with an empty list, and fold the move strings into this list by applying mkNewStateFromPreviousAndMove to each element in the moves list and the last element of the list which is built up by the fold.
Note, you will need to add the following deps to your package.yaml file (if using stack):
text
This dep is used to split strings.

How can I parse a float with a comma in place of the decimal point?

I want to parse Float values from a file where they are stored using comma as the decimal separator. Thus i need a function myParse :: String -> Float such that, for instance, myParse "23,46" == 23.46.
I have some ideas about how to do this, but they all seem overcomplicated, for example:
replace , with a . in the string and use read; or
follow this FP Complete blogpost (entitled Parsing Floats With Parsec), and challenge the curse of the monomorphism restriction.
Is there a simpler way, or do I really need to use a parsing library? In the second case, could you please paste some suggestions in order to get me started? The monomorphism restriction scares me, and I believe that there has to be a way to do this without using language extensions.
Replacing , by . and then call read is straightforward enough; you just need to remember to use your own specialized function instead of plain old read:
readFloatWithComma :: String -> Float
readFloatWithComma = read . sanitize
where
sanitize = map (\c -> if c == ',' then '.' else c)
In GHCi:
λ> readFloatWithComma "23,46"
23.46
Regarding the parsec approach, despite what the article you link to suggest, the monomorphism restriction needs not be a worry, as long as you have type signatures for all your top-level bindings. In particular, the following code doesn't need any language extensions to compile properly (at least, in GHC 7.10.1):
import Text.Parsec
import Text.Parsec.String ( Parser )
import Control.Applicative hiding ( (<|>) )
infixr 5 <++>
(<++>) :: Applicative f => f [a] -> f [a] -> f [a]
a <++> b = (++) <$> a <*> b
infixr 5 <:>
(<:>) :: Applicative f => f a -> f [a] -> f [a]
a <:> b = (:) <$> a <*> b
number :: Parser String
number = many1 digit
plus :: Parser String
plus = char '+' *> number
minus :: Parser String
minus = char '-' <:> number
integer :: Parser String
integer = plus <|> minus <|> number
float :: Parser Float
float = fmap rd $ integer <++> decimal <++> exponent
where rd = read :: String -> Float
decimal = option "" $ ('.' <$ char ',') <:> number
exponent = option "" $ oneOf "eE" <:> integer
In GHCi:
λ> parseTest float "23,46"
23.46

Is there any trick about translating BNF to Parsec program?

The BNF that match function call chain (like x(y)(z)...):
expr = term T
T = (expr) T
| EMPTY
term = (expr)
| VAR
Translate it to Parsec program that looks so tricky.
term :: Parser Term
term = parens expr <|> var
expr :: Parser Term
expr = do whiteSpace
e <- term
maybeAddSuffix e
where addSuffix e0 = do e1 <- parens expr
maybeAddSuffix $ TermApp e0 e1
maybeAddSuffix e = addSuffix e
<|> return e
Could you list all the design patterns about translating BNF to Parsec program?
The simplest think you could do if your grammar is sizeable is to just use the Alex/Happy combo. It is fairly straightforward to use, accepts the BNF format directly - no human translation needed - and perhaps most importantly, produces blazingly fast parsers/lexers.
If you are dead set on doing it with parsec (or you are doing this as a learning exercise), I find it easier in general to do it in two stages; first lexing, then parsing. Parsec will do both!
First write the appropriate types:
{-# LANGUAGE LambdaCase #-}
import Text.Parsec
import Text.Parsec.Combinator
import Text.Parsec.Prim
import Text.Parsec.Pos
import Text.ParserCombinators.Parsec.Char
import Control.Applicative hiding ((<|>))
import Control.Monad
data Term = App Term Term | Var String deriving (Show, Eq)
data Token = LParen | RParen | Str String deriving (Show, Eq)
type Lexer = Parsec [Char] () -- A lexer accepts a stream of Char
type Parser = Parsec [Token] () -- A parser accepts a stream of Token
Parsing a single token is simple. For simplicity, a variable is 1 or more letters. You can of course change this however you like.
oneToken :: Lexer Token
oneToken = (char '(' >> return LParen) <|>
(char ')' >> return RParen) <|>
(Str <$> many1 letter)
Parsing the entire token stream is just parsing a single token many times, possible separated by whitespace:
lexer :: Lexer [Token]
lexer = spaces >> many1 (oneToken <* spaces)
Note the placement of spaces: this way, white space is accepted at the beginning and end of the string.
Since Parser uses a custom token type, you have to use a custom satisfy function. Fortunately, this is almost identical to the existing satisfy.
satisfy' :: (Token -> Bool) -> Parser Token
satisfy' f = tokenPrim show
(\src _ _ -> incSourceColumn src 1)
(\x -> if f x then Just x else Nothing)
Then we can write parsers for each of the primitive tokens.
lparen = satisfy' $ \case { LParen -> True ; _ -> False }
rparen = satisfy' $ \case { RParen -> True ; _ -> False }
strTok = (\(Str s) -> s) <$> (satisfy' $ \case { Str {} -> True ; _ -> False })
As you may imagine, parens would be useful for our purposes. It is very straightforward to write.
parens :: Parser a -> Parser a
parens = between lparen rparen
Now the interesting parts.
term, expr, var :: Parser Term
term = parens expr <|> var
var = Var <$> strTok
These two should be fairly obvious to you.
Parec contains combinators option and optionMaybe which are useful when you you need to "maybe do something".
expr = do
e0 <- term
option e0 (parens expr >>= \e1 -> return (App e0 e1))
The last line means - try to apply the parser given to option - if it fails, instead return e0.
For testing you can do:
tokAndParse = runParser (lexer <* eof) () "" >=> runParser (expr <* eof) () ""
The eof attached to each parser is to make sure that the entire input is consumed; the string cannot be a member of the grammar if there are extra trailing characters. Note - your example x(y)(z) is not actually in your grammar!
>tokAndParse "x(y)(z)"
Left (line 1, column 5):
unexpected LParen
expecting end of input
But the following is
>tokAndParse "(x(y))(z)"
Right (App (App (Var "x") (Var "y")) (Var "z"))

Haskell Parsec - error messages are less helpful while using custom tokens

I'm working on seperating lexing and parsing stages of a parser. After some tests, I realized error messages are less helpful when I'm using some tokens other than Parsec's Char tokens.
Here are some examples of Parsec's error messages while using Char tokens:
ghci> P.parseTest (string "asdf" >> spaces >> string "ok") "asdf wrong"
parse error at (line 1, column 7):
unexpected "w"
expecting space or "ok"
ghci> P.parseTest (choice [string "ok", string "nop"]) "wrong"
parse error at (line 1, column 1):
unexpected "w"
expecting "ok" or "nop"
So, string parser shows what string is expected when found an unexpected string, and choice parser shows what are alternatives.
But when I use same combinators with my tokens:
ghci> Parser.parseTest ((tok $ Ide "asdf") >> (tok $ Ide "ok")) "asdf "
parse error at "test" (line 1, column 1):
unexpected end of input
In this case, it doesn't print what was expected.
ghci> Parser.parseTest (choice [tok $ Ide "ok", tok $ Ide "nop"]) "asdf "
parse error at (line 1, column 1):
unexpected (Ide "asdf","test" (line 1, column 1))
And when I use choice, it doesn't print alternatives.
I expect this behavior to be related with combinator functions, and not with tokens, but seems like I'm wrong. How can I fix this?
Here's the full lexer + parser code:
Lexer:
module Lexer
( Token(..)
, TokenPos(..)
, tokenize
) where
import Text.ParserCombinators.Parsec hiding (token, tokens)
import Control.Applicative ((<*), (*>), (<$>), (<*>))
data Token = Ide String
| Number String
| Bool String
| LBrack
| RBrack
| LBrace
| RBrace
| Keyword String
deriving (Show, Eq)
type TokenPos = (Token, SourcePos)
ide :: Parser TokenPos
ide = do
pos <- getPosition
fc <- oneOf firstChar
r <- optionMaybe (many $ oneOf rest)
spaces
return $ flip (,) pos $ case r of
Nothing -> Ide [fc]
Just s -> Ide $ [fc] ++ s
where firstChar = ['A'..'Z'] ++ ['a'..'z'] ++ "_"
rest = firstChar ++ ['0'..'9']
parsePos p = (,) <$> p <*> getPosition
lbrack = parsePos $ char '[' >> return LBrack
rbrack = parsePos $ char ']' >> return RBrack
lbrace = parsePos $ char '{' >> return LBrace
rbrace = parsePos $ char '}' >> return RBrace
token = choice
[ ide
, lbrack
, rbrack
, lbrace
, rbrace
]
tokens = spaces *> many (token <* spaces)
tokenize :: SourceName -> String -> Either ParseError [TokenPos]
tokenize = runParser tokens ()
Parser:
module Parser where
import Text.Parsec as P
import Control.Monad.Identity
import Lexer
parseTest :: Show a => Parsec [TokenPos] () a -> String -> IO ()
parseTest p s =
case tokenize "test" s of
Left e -> putStrLn $ show e
Right ts' -> P.parseTest p ts'
tok :: Token -> ParsecT [TokenPos] () Identity Token
tok t = token show snd test
where test (t', _) = case t == t' of
False -> Nothing
True -> Just t
SOLUTION:
Ok, after fp4me's answer and reading Parsec's Char source more carefully, I ended up with this:
{-# LANGUAGE FlexibleContexts #-}
module Parser where
import Text.Parsec as P
import Control.Monad.Identity
import Lexer
parseTest :: Show a => Parsec [TokenPos] () a -> String -> IO ()
parseTest p s =
case tokenize "test" s of
Left e -> putStrLn $ show e
Right ts' -> P.parseTest p ts'
type Parser a = Parsec [TokenPos] () a
advance :: SourcePos -> t -> [TokenPos] -> SourcePos
advance _ _ ((_, pos) : _) = pos
advance pos _ [] = pos
satisfy :: (TokenPos -> Bool) -> Parser Token
satisfy f = tokenPrim show
advance
(\c -> if f c then Just (fst c) else Nothing)
tok :: Token -> ParsecT [TokenPos] () Identity Token
tok t = (Parser.satisfy $ (== t) . fst) <?> show t
Now I'm getting same error messages:
ghci> Parser.parseTest (choice [tok $ Ide "ok", tok $ Ide "nop"]) " asdf"
parse error at (line 1, column 1):
unexpected (Ide "asdf","test" (line 1, column 3))
expecting Ide "ok" or Ide "nop"
A beginning of solution can be to define your choice function in the Parser,
use a specific unexpected function to override unexpected error and finally
use the <?> operator to override the expecting message:
mychoice [] = mzero
mychoice (x:[]) = (tok x <|> myUnexpected) <?> show x
mychoice (x:xs) = ((tok x <|> mychoice xs) <|> myUnexpected) <?> show (x:xs)
myUnexpected = do
input <- getInput
unexpected $ (id $ first input )
where
first [] = "eof"
first (x:xs) = show $ fst x
and call your parser like that :
ghci> Parser.parseTest (mychoice [Ide "ok", Ide "nop"]) "asdf "
parse error at (line 1, column 1):
unexpected Ide "asdf"
expecting [Ide "ok",Ide "nop"]

Using Parsec to parse regular expressions

I'm trying to learn Parsec by implementing a small regular expression parser. In BNF, my grammar looks something like:
EXP : EXP *
| LIT EXP
| LIT
I've tried to implement this in Haskell as:
expr = try star
<|> try litE
<|> lit
litE = do c <- noneOf "*"
rest <- expr
return (c : rest)
lit = do c <- noneOf "*"
return [c]
star = do content <- expr
char '*'
return (content ++ "*")
There are some infinite loops here though (e.g. expr -> star -> expr without consuming any tokens) which makes the parser loop forever. I'm not really sure how to fix it though, because the very nature of star is that it consumes its mandatory token at the end.
Any thoughts?
You should use Parsec.Expr.buildExprParser; it is ideal for this purpose. You simply describe your operators, their precedence and associativity, and how to parse an atom, and the combinator builds the parser for you!
You probably also want to add the ability to group terms with parens so that you can apply * to more than just a single literal.
Here's my attempt (I threw in |, +, and ? for good measure):
import Control.Applicative
import Control.Monad
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr
data Term = Literal Char
| Sequence [Term]
| Repeat (Int, Maybe Int) Term
| Choice [Term]
deriving ( Show )
term :: Parser Term
term = buildExpressionParser ops atom where
ops = [ [ Postfix (Repeat (0, Nothing) <$ char '*')
, Postfix (Repeat (1, Nothing) <$ char '+')
, Postfix (Repeat (0, Just 1) <$ char '?')
]
, [ Infix (return sequence) AssocRight
]
, [ Infix (choice <$ char '|') AssocRight
]
]
atom = msum [ Literal <$> lit
, parens term
]
lit = noneOf "*+?|()"
sequence a b = Sequence $ (seqTerms a) ++ (seqTerms b)
choice a b = Choice $ (choiceTerms a) ++ (choiceTerms b)
parens = between (char '(') (char ')')
seqTerms (Sequence ts) = ts
seqTerms t = [t]
choiceTerms (Choice ts) = ts
choiceTerms t = [t]
main = parseTest term "he(llo)*|wor+ld?"
Your grammar is left-recursive, which doesn’t play nice with try, as Parsec will repeatedly backtrack. There are a few ways around this. Probably the simplest is just making the * optional in another rule:
lit :: Parser (Char, Maybe Char)
lit = do
c <- noneOf "*"
s <- optionMaybe $ char '*'
return (c, s)
Of course, you’ll probably end up wrapping things in a data type anyway, and there are a lot of ways to go about it. Here’s one, off the top of my head:
import Control.Applicative ((<$>))
data Term = Literal Char
| Sequence [Term]
| Star Term
expr :: Parser Term
expr = Sequence <$> many term
term :: Parser Term
term = do
c <- lit
s <- optionMaybe $ char '*' -- Easily extended for +, ?, etc.
return $ if isNothing s
then Literal c
else Star $ Literal c
Maybe a more experienced Haskeller will come along with a better solution.

Resources