Parsing large matrices in Haskell

I have recently started learning Haskell, and I'm testing what I've learnt using the Google Kick Start code challenges.
The problem I am looking at essentially asks to read square matrices from standard input and sum the elements on the diagonals parallel to the main diagonal.
I have written a simple program to solve the problem; it passes the first test but not the second.
The problem seems to be that the input in the second test is much larger than the first and the parsers I have written are either too slow or use too much memory.
The input is formatted as a sequence of matrices
n
a11 ... a1n
...
an1 ... ann
My first attempt uses built-in functions:
type Matrix = [[Int]]

parseInput :: String -> [Matrix]
parseInput = parseInput' . lines
  where
    parseInput' :: [String] -> [Matrix]
    parseInput' [] = []
    parseInput' (i:is) =
      let (m, rest) = splitAt (read i) is
      in map (map read . words) m : parseInput' rest

main = do
  input <- getContents
  let result = show $ parseInput input
  putStr result
The second attempt uses the parsec library:
import Text.Parsec
import Text.Parsec.String (Parser)

type Row = [Int]
type Matrix = [[Int]]

matrixSize :: (Integral a, Read a) => Parser a
matrixSize = do
  n <- many1 digit <* newline
  return $ read n

matrixRow :: Parser Row
matrixRow = do
  row <- sepBy1 (many1 digit) (char ' ') <* (newline <|> (eof >> return ' '))
  return $ map read row

parseMatrix :: Parser Matrix
parseMatrix = do
  n <- matrixSize
  mat <- count n matrixRow
  return mat

parseInput :: Parser [Matrix]
parseInput = many1 parseMatrix

prepareInput :: String -> [Matrix]
prepareInput input = case parse parseInput "" input of
  Right l -> l
  Left _  -> []

main = do
  input <- getContents
  let result = show $ prepareInput input
  putStr result
According to Google's website, the second test instance contains at most 100 matrices, of which 10 have size at most 1000 with elements up to 10^7; the remaining matrices are smaller, of size at most 100.
I have generated a random test file with 10 large matrices of size at most 1000 and it turns out to be roughly 30MB in size.
Running either parser on the test file takes roughly 13s on my machine and the second uses more than 1GB of memory.
Both fail the requirements for running on the server: the first exceeds 20s execution time and the second fails the memory constraint of 1GB.
Is there a better way for parsing large files?
Is there any inefficiency that can be solved in my code?
Any advice will be greatly appreciated!
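A minimal sketch of the kind of fix usually suggested for this (an assumption on my part, untested against the judge): read the whole input as one strict ByteString and pull the numbers out with B.readInt, which parses a machine Int straight from the buffer and avoids both lazy Strings and read.
import qualified Data.ByteString.Char8 as B
import Data.Char (isSpace)

type Matrix = [[Int]]

-- One pass over the input: skip whitespace, read an Int, repeat.
readInts :: B.ByteString -> [Int]
readInts s = case B.readInt (B.dropWhile isSpace s) of
  Nothing        -> []
  Just (n, rest) -> n : readInts rest

-- The first number is n; the next n*n numbers are the matrix entries.
parseInput :: [Int] -> [Matrix]
parseInput []     = []
parseInput (n:xs) = rows n front : parseInput rest
  where
    (front, rest) = splitAt (n * n) xs
    rows k ys = case splitAt k ys of
      (row, [])   -> [row]
      (row, more) -> row : rows k more

main :: IO ()
main = do
  input <- B.getContents
  print (parseInput (readInts input))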

Related

Operating on parsed data with attoparsec

Background
I've written a logfile parser using attoparsec. All my smaller parsers succeed, as does the composed final parser. I've confirmed this with tests. But I'm stumbling over performing operations with the parsed stream.
What I've tried
I started by trying to pass the successfully parsed input to a function. But all it seems to get is Done (), which I presume means the logfile has been consumed by this point.
prepareStats :: Result Log -> IO ()
prepareStats r =
  case r of
    Fail _ _ _       -> putStrLn "Parsing failed"
    Done _ parsedLog -> putStrLn "Success" -- This now has a [LogEntry] array. Do something with it.

main :: IO ()
main = do
  [f] <- getArgs
  logFile <- B.readFile (f :: FilePath)
  let results = parseOnly parseLog logFile
  putStrLn "TBC"
What I'm trying to do
I want to accumulate some stats from the logfile as I consume the input. For example, I'm parsing response codes and I'd like to count how many 2** responses there were and how many 4/5** ones. I'm parsing the number of bytes each response returned as Ints, and I'd like to efficiently sum these (sounds like a foldl'?). I've defined a data type like this:
data Stats = Stats
  { successfulRequestsPerMinute :: Int
  , failingRequestsPerMinute :: Int
  , meanResponseTime :: Int
  , megabytesPerMinute :: Int
  } deriving Show
And I'd like to constantly update that as I parse the input. But the part of performing operations as I consume is where I got stuck. So far print is the only function I've successfully passed output to and it showed the parsing is succeeding by returning Done before printing the output.
My main parser(s) look like this:
parseLogEntry :: Parser LogEntry
parseLogEntry = do
  ip <- logItem
  _ <- char ' '
  logName <- logItem
  _ <- char ' '
  user <- logItem
  _ <- char ' '
  time <- datetimeLogItem
  _ <- char ' '
  firstLogLine <- quotedLogItem
  _ <- char ' '
  finalRequestStatus <- intLogItem
  _ <- char ' '
  responseSizeB <- intLogItem
  _ <- char ' '
  timeToResponse <- intLogItem
  return $ LogEntry ip logName user time firstLogLine finalRequestStatus responseSizeB timeToResponse

type Log = [LogEntry]

parseLog :: Parser Log
parseLog = many $ parseLogEntry <* endOfLine
Desired outcome
I want to pass each parsed line to a function that will update the above data type. Ideally I want this to be very memory efficient because it'll be operating on large files.
You have to make your unit of parsing a single log entry rather than a list of log entries.
It's not pretty, but here is an example of how to interleave parsing and processing:
(Depends on bytestring, attoparsec and mtl)
{-# LANGUAGE NoMonomorphismRestriction, FlexibleContexts #-}
import qualified Data.ByteString.Char8 as BS
import qualified Data.Attoparsec.ByteString.Char8 as A
import Data.Attoparsec.ByteString.Char8 hiding (takeWhile)
import Data.Char
import Control.Monad.State.Strict

aWord :: Parser BS.ByteString
aWord = skipSpace >> A.takeWhile isAlphaNum

getNext :: MonadState [a] m => m (Maybe a)
getNext = do
  xs <- get
  case xs of
    []     -> return Nothing
    (y:ys) -> put ys >> return (Just y)

loop iresult =
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do lift $ process aword; loop (parse aWord x')
    Partial _ -> do
      mx <- getNext
      case mx of
        Just y  -> loop (feed iresult y)
        Nothing -> case feed iresult BS.empty of
          Fail _ _ msg  -> error $ "parse failed: " ++ msg
          Done x' aword -> do lift $ process aword; return ()
          Partial _     -> error "partial returned" -- probably can't happen

process :: Show a => a -> IO ()
process w = putStrLn $ "got a word: " ++ show w

theWords = map BS.pack ["this is a te", "st of the emergency ", "broadcasting sys", "tem"]

main = runStateT (loop (Partial (parse aWord))) theWords
Notes:
We parse one aWord at a time and call process after each word is recognized.
Use feed to feed the parser more input when it returns a Partial.
Feed the parser an empty string when there is no more input left.
When Done is returned, process the recognized word and continue with parse aWord.
getNext is just an example of a monadic function which gets the next unit of input. Replace it with your own version - i.e. something that reads the next line from a file.
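A hypothetical handle-based replacement might look like this (sketch only; BS.hGet returns the empty string at end of file, which is exactly the "no more input" signal the loop expects):
import System.IO (Handle)

getNextIO :: Handle -> IO (Maybe BS.ByteString)
getNextIO h = do
  chunk <- BS.hGet h 4096  -- empty at EOF
  return $ if BS.null chunk then Nothing else Just chunk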
Update
Here is a solution using parseWith as @dfeuer suggested:
-- (also needs: import Data.Maybe (fromMaybe))
noMoreInput = fmap null get

loop2 x = do
  iresult <- parseWith (fmap (fromMaybe BS.empty) getNext) aWord x
  case iresult of
    Fail _ _ msg  -> error $ "parse failed: " ++ msg
    Done x' aword -> do
      lift $ process aword
      if BS.null x'
        then do b <- noMoreInput
                if b then return () else loop2 x'
        else loop2 x'
    Partial _ -> error "huh???" -- this really can't happen

main2 = runStateT (loop2 BS.empty) theWords
If each log entry is exactly one line, here's a simpler solution:
do loglines <- fmap BS.lines $ BS.readFile "input-file.log"
   return $ foldl' go initialStats loglines
 where
   go stats logline =
     case parseOnly yourParser logline of
       Left e  -> error $ "oops: " ++ e
       Right r -> let stats' = ... combine r with stats ...
                  in stats'
Basically you are just reading the file line-by-line and calling parseOnly on each line and accumulating the results.
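Spelled out a little more (a hedged sketch: it assumes LogEntry is a record exposing the field names used in the question's parser, and that a status below 400 counts as a success):
import qualified Data.ByteString.Char8 as BS
import Data.Attoparsec.ByteString.Char8 (parseOnly)
import Data.List (foldl')

summarize :: FilePath -> IO Stats
summarize path = do
  loglines <- fmap BS.lines (BS.readFile path)
  return $! foldl' step (Stats 0 0 0 0) loglines
  where
    step stats line = case parseOnly parseLogEntry line of
      Left _ -> stats  -- silently skip malformed lines
      Right e
        | finalRequestStatus e < 400 -> stats
            { successfulRequestsPerMinute = successfulRequestsPerMinute stats + 1 }
        | otherwise -> stats
            { failingRequestsPerMinute = failingRequestsPerMinute stats + 1 }
Note that for genuinely constant memory you would also want strict (!) fields in Stats; foldl' alone still leaves thunks inside the record.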
This is properly done with a streaming library
main = do
  f:_ <- getArgs
  withFile f ReadMode $ \h -> do
    result <- foldStream $ streamProcess $ streamHandle h
    print result
  where
    streamHandle = undefined
    streamProcess = undefined
    foldStream = undefined
where the blanks can be filled by any streaming library, e.g.
import qualified Pipes.Prelude as P
import Pipes
import qualified Pipes.ByteString as PB
import Pipes.Group (folds)
import qualified Control.Foldl as L
import qualified Data.ByteString as B -- for B.concat below
import Control.Lens (view) -- or import Lens.Simple (view), or whatever

streamHandle = PB.fromHandle :: Handle -> Producer ByteString IO ()
in that case we might then divide the labor further thus:
streamProcess :: Producer ByteString m r -> Producer LogEntry m r
streamProcess p = streamLines p >-> lineParser

streamLines :: Producer ByteString m r -> Producer ByteString m r
streamLines p = L.purely folds L.list (view (PB.lines p)) >-> P.map B.concat

lineParser :: Pipe ByteString LogEntry m r
lineParser = P.map (parseOnly line_parser) >-> P.concat -- concat removes lefts
(This is slightly laborious because pipes is sensibly persnickety about accumulating lines, and about memory generally: we are just trying to get a producer of individual strict bytestring lines, then to convert that into a producer of parsed lines, and then to throw out bad parses, if there are any. With io-streams or conduit, things will be basically the same, and that particular step will be easier.)
We are now in a position to fold over our Producer LogEntry IO (). This can be done explicitly using Pipes.Prelude.fold, which performs a strict left fold. Here we will just copy the structure from user5402:
foldStream str = P.fold go initial_stats id str
  where
    go stats_till_now new_entry = undefined
If you get used to the use of the foldl library and the application of a fold to a Producer with L.purely fold some_fold, then you can build Control.Foldl.Folds for your LogEntries out of components and slot in different requests as you please.
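For instance (a hedged sketch, reusing the field names from the question's parser):
import qualified Control.Foldl as L

-- Each fold is a self-contained accumulator; Applicative composition
-- runs them all in a single pass over the stream.
bytesAndCount :: L.Fold LogEntry (Int, Int)
bytesAndCount = (,) <$> L.premap responseSizeB L.sum <*> L.length

-- applied to a producer: L.purely P.fold bytesAndCount logEntryProducer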
If you use pipes-attoparsec and include the newline bit in your parser, then you can just write
handleToLogEntries :: Handle -> Producer LogEntry IO ()
handleToLogEntries h = void $ parsed my_line_parser (fromHandle h) >-> P.concat
and get the Producer LogEntry IO () more directly. (This ultra-simple way of writing it will, however, stop at a bad parse; dividing on lines first will be faster than using attoparsec to recognize newlines.) This is very simple with io-streams too, you would write something like
import qualified System.IO.Streams as Streams

io :: Handle -> IO ()
io h = do
  bytes <- Streams.handleToInputStream h
  log_entries <- Streams.parserToInputStream my_line_parser bytes
  fold_result <- Streams.fold go initial_stats log_entries
  print fold_result
or to keep with the structure above:
where
  streamHandle = Streams.handleToInputStream
  streamProcess io_bytes =
    io_bytes >>= Streams.parserToInputStream my_line_parser
  foldStream io_logentries =
    io_logentries >>= Streams.fold go initial_stats
Either way, my_line_parser should return a Maybe LogEntry and should recognize the newline.

Parsing PPM images in Haskell

I'm starting to learn Haskell and wish to parse a PPM image as an exercise. The structure of the PPM format is rather simple, but it is tricky. It's described here. First of all, I defined a type for a PPM image:
data Pixel = Pixel { red :: Int, green :: Int, blue :: Int } deriving (Show)

data BitmapFormat = TextualBitmap | BinaryBitmap deriving (Show)

data Header = Header { format :: BitmapFormat
                     , width :: Int
                     , height :: Int
                     , colorDepth :: Int } deriving (Show)

data PPM = PPM { header :: Header
               , bitmap :: [Pixel]
               }
bitmap should contain the entire image. This is where the first challenge comes in - the part that contains the actual image data in PPM can be either textual or binary (described in the header).
For textual bitmaps I wrote the following function:
parseTextualBitmap :: String -> [Pixel]
parseTextualBitmap = map textualPixel . chunksOf 3 . wordsBy isSpace
  where textualPixel (r:g:b:[]) = Pixel (read r) (read g) (read b)
I'm not sure what to do with binary bitmaps, though. Using read converts a string representation of numbers to numbers. I want to convert "\x01" to 1 of type Int.
The second challange is parsing the header. I wrote the following function:
parseHeader :: String -> Header
parseHeader = constructHeader . wordsBy isSpace . filterComments
  where
    filterComments = unlines . map (takeWhile (/= '#')) . lines
    formatFromText s
      | s == "P6" = BinaryBitmap
      | s == "P3" = TextualBitmap
    constructHeader (format:width:height:colorDepth:_) =
      Header (formatFromText format) (read width) (read height) (read colorDepth)
Which works pretty well. Now I should write the module's exported function (let's call it parsePPM), which gets the entire file content (String) and returns a PPM. The function should call parseHeader, determine the bitmap format, call the appropriate parse(Textual|Binary)Bitmap, and then construct a PPM with the result. Once parseHeader returns, I should start decoding the bitmap from the point where parseHeader stopped. However, I cannot know at which point in the string parseHeader stopped. The only solution I could think of is that instead of Header, parseHeader will return (Header, String), where the second element of the tuple is the remainder retrieved by constructHeader (which is currently named _). But I'm not really sure it's the "Haskell way" of doing things.
To sum up my questions:
1. How do I decode the binary format into a list of Pixel?
2. How can I know at which point the header ends?
Since I'm learning Haskell by myself, I have no one to actually review my code, so in addition to answering my questions I would appreciate any comments about the way I code (coding style, bugs, alternative ways to do things, etc.).
Let's start with question 2 because it is easier to answer. Your approach is correct: as you parse things, you remove those characters from the input string and return a tuple containing the result of the parse and the remaining string. However, there is no reason to write all this from scratch (except perhaps as an academic exercise) - there are plenty of parsers which will take care of this issue for you. The one I will use is Parsec. If you are new to monadic parsing you should first read the section on Parsec in RWH.
As for question 1, if you use ByteString instead of String, then parsing single bytes is easy since single bytes are the atomic elements of ByteStrings!
There is also the issue of the Char/ByteString interface. With Parsec, this is a non-issue since you can treat a ByteString as a sequence of Byte or Char - we will see this later.
I decided to just write the full parser - this is a very simple language so with all the primitives defined for you in the Parsec library, it is very easy and very concise.
The file header:
import Text.Parsec.Combinator
import Text.Parsec.Char
import Text.Parsec.ByteString
import Text.Parsec
import Text.Parsec.Pos
import Data.ByteString (ByteString, pack)
import qualified Data.ByteString.Char8 as C8
import Control.Monad (replicateM)
import Data.List.Split (chunksOf) -- needed by the pixel parsers below
import Data.Monoid
First, we write the 'primitive' parsers - that is, parsing bytes, parsing textual numbers, and parsing whitespace (which the PPM format uses as a separator):
parseIntegral :: (Read a, Integral a) => Parser a
parseIntegral = fmap read (many1 digit)
digit parses a single digit - you'll notice that many function names explain what the parser does - and many1 will apply the given parser 1 or more times. Then we read the resulting string to return an actual number (as opposed to a string). In this case, the input ByteString is being treated as text.
parseByte :: Integral a => Parser a
parseByte = fmap (fromIntegral . fromEnum) $ tokenPrim show (\pos tok _ -> updatePosChar pos tok) Just
For this parser, we parse a single Char - which is really just a byte. It is just returned as a Char. We could safely make the return type Parser Word8 because the universe of values that can be returned is [0..255]
whitespace1 :: Parser ()
whitespace1 = many1 (oneOf "\n ") >> return ()
oneOf takes a list of Char and parses any one of the characters in the order given - again, the ByteString is being treated as Text.
Now we can write the parser for the header.
parseHeader :: Parser Header
parseHeader = do
  f <- choice $ map try $
         [ string "P3" >> return TextualBitmap
         , string "P6" >> return BinaryBitmap ]
  w <- whitespace1 >> parseIntegral
  h <- whitespace1 >> parseIntegral
  d <- whitespace1 >> parseIntegral
  return $ Header f w h d
A few notes. choice takes a list of parsers and tries them in order. try p takes the parser p, and 'remembers' the state before p starts parsing. If p succeeds, then try p == p. If p fails, then the state before p started is restored and you pretend you never tried p. This is necessary due to how choice behaves.
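To see why try matters here, consider the input "P6" (a hedged illustration; the error message is paraphrased):
-- string "P3" consumes the 'P' before failing on '6', so without try,
-- choice cannot fall back to string "P6":
--   parseTest (choice [string "P3", string "P6"]) (C8.pack "P6")
--     => parse error: unexpected "6", expecting "P3"
--   parseTest (choice (map try [string "P3", string "P6"])) (C8.pack "P6")
--     => "P6"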
For the pixels, we have two choices as of now:
parseTextual :: Header -> Parser [Pixel]
parseTextual h = do
  xs <- replicateM (3 * width h * height h) (whitespace1 >> parseIntegral)
  return $ map (\[a,b,c] -> Pixel a b c) $ chunksOf 3 xs
We could use many1 (whitespace1 >> parseIntegral) - but this wouldn't enforce the fact that we know what the length should be. Then, converting the list of numbers to a list of pixels is trivial.
For binary data:
parseBinary :: Header -> Parser [Pixel]
parseBinary h = do
  whitespace1
  xs <- replicateM (3 * width h * height h) parseByte
  return $ map (\[a,b,c] -> Pixel a b c) $ chunksOf 3 xs
Note how the two are almost identical. You could probably generalize this function (it would be especially useful if you decided to parse the other types of pixel data - monochrome and greyscale).
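Such a generalization might look like this (a sketch; parseSamples is a name I'm inventing here):
-- Factor out the per-sample parser; the caller decides textual vs binary.
parseSamples :: Parser Int -> Header -> Parser [Pixel]
parseSamples sample h = do
  xs <- replicateM (3 * width h * height h) sample
  return $ map (\[a,b,c] -> Pixel a b c) $ chunksOf 3 xs

-- The two original parsers then become one-liners:
-- parseTextual h = parseSamples (whitespace1 >> parseIntegral) h
-- parseBinary  h = whitespace1 >> parseSamples parseByte h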
Now to bring it all together:
parsePPM :: Parser PPM
parsePPM = do
  h <- parseHeader
  fmap (PPM h) $
    case format h of
      TextualBitmap -> parseTextual h
      BinaryBitmap  -> parseBinary h
This should be self-explanatory. Parse the header, then parse the body based on the format. Here are some examples to try it on. They are the ones from the specification page.
example0 :: ByteString
example0 = C8.pack $ unlines
  [ "P3"
  , "4 4"
  , "15"
  , " 0 0 0 0 0 0 0 0 0 15 0 15"
  , " 0 0 0 0 15 7 0 0 0 0 0 0"
  , " 0 0 0 0 0 0 0 15 7 0 0 0"
  , "15 0 15 0 0 0 0 0 0 0 0 0" ]

example1 :: ByteString
example1 = C8.pack "P6 4 4 15 " <>
  pack [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 0, 15, 0, 0, 0, 0, 15, 7,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 7, 0, 0, 0, 15,
         0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Several notes: this doesn't handle comments, which are part of the spec (see the sketch below). The error messages are not very useful; you can use the <?> combinator to create your own error messages. The spec also says 'The lines should not be longer than 70 characters.' - this is not enforced either.
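A hedged sketch of how comments could be folded into the separator parser (in PPM, a comment runs from '#' to the end of the line):
comment :: Parser ()
comment = char '#' >> manyTill anyChar newline >> return ()

-- a drop-in variant of whitespace1 that also swallows comments
whitespace1' :: Parser ()
whitespace1' = skipMany1 ((oneOf "\n " >> return ()) <|> comment)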
edit:
Just because you see do-notation doesn't necessarily mean that you are working with impure code. Some monads (like this parser) are still pure - they are just used for convenience. For example, you can write your parser with the type parser :: String -> (a, String); or, as we have done here, you can use a new type, data Parser a = Parser (String -> (a, String)), and have parser :: Parser a; we then write a Monad instance for Parser to get the useful do-notation. To be clear, Parsec supports monadic parsing, but our parser is not monadic - or rather, it uses the Identity monad, which is just newtype Identity a = Identity { runIdentity :: a }, and is only necessary because if we used type Identity a = a we would have 'overlapping instances' errors everywhere, which is not good.
>:i Parser
type Parser = Parsec ByteString ()
-- Defined in `Text.Parsec.ByteString'
>:i Parsec
type Parsec s u = ParsecT s u Data.Functor.Identity.Identity
-- Defined in `Text.Parsec.Prim'
So then, the type of Parser is really ParsecT ByteString () Identity. That is, the parser input is ByteString, the user state is () - which just means we aren't using the user state, and the monad in which we are parsing is Identity. ParsecT is itself just a newtype of:
forall b.
     State s u
  -> (a -> State s u -> ParseError -> m b)
  -> (ParseError -> m b)
  -> (a -> State s u -> ParseError -> m b)
  -> (ParseError -> m b)
  -> m b
All those functions in the middle are just used to pretty-print errors. If you are parsing tens of thousands of characters and an error occurs, you won't be able to just look at the input and see where it happened - but Parsec will tell you the line and column. If we specialize all the types to our Parser and pretend that Identity is just type Identity a = a, then all the monads disappear and you can see that the parser is not impure. As you can see, Parsec is a lot more powerful than is required for this problem - I just used it due to familiarity, but if you were willing to write your own primitive functions like many and digit, then you could get away with using newtype Parser a = Parser (ByteString -> (a, ByteString)).
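To make that last point concrete, here is a minimal sketch of the hand-rolled parser type just described (no failure handling or error positions, which is most of what Parsec adds on top):
newtype P a = P { runP :: String -> (a, String) }

instance Functor P where
  fmap f (P g) = P $ \s -> let (a, s') = g s in (f a, s')

instance Applicative P where
  pure a = P $ \s -> (a, s)
  P pf <*> P pa = P $ \s ->
    let (f, s')  = pf s
        (a, s'') = pa s'
    in (f a, s'')

instance Monad P where
  P pa >>= f = P $ \s -> let (a, s') = pa s in runP (f a) s'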

parsec using between to parse parens

If I wanted to parse a string with multiple parenthesized groups into a list of strings holding each group, for example
"((a b c) a b c)"
into
["((a b c) a b c)","( a b c)"]
How would I do that using parsec? The use of between looks nice but it does not seem possible to separate with a beginning and end value.
I'd use a recursive parser:
data Expr = List [Expr] | Term String
expr :: Parsec String () Expr
expr = recurse <|> terminal
where terminal is your primitives, in this case these seem to be strings of characters so
where terminal = Term <$> many1 letter
and recurse is
recurse = List <$>
  (between `on` char) '(' ')' (expr `sepBy1` char ' ')
Now we have a nice tree of Exprs which we can gather with
collect r@(List ts) = r : concatMap collect ts
collect _ = []
While jozefg's solution is almost identical to what I came up with (and I completely agree with all his suggestions), there are some small differences that made me think I should post a second answer:
Due to the expected result of the initial example, it is not necessary to treat space-separated parts as individual subtrees.
Further it might be interesting to see the part that actually computes the expected results (i.e., a list of strings).
So here is my version. As already suggested by jozefg, split the task into several sub-tasks. Those are:
Parse a string into an algebraic data type representing some kind of tree.
Collect the (desired) subtrees of this tree.
Turn trees into strings.
Concerning 1, we first need a tree data type
import Text.Parsec
import Text.Parsec.String
import Control.Applicative ((<$>))
data Tree = Leaf String | Node [Tree]
and then a function that can parse strings into values of this type.
parseTree :: Parser Tree
parseTree = node <|> leaf
  where
    node = Node <$> between (char '(') (char ')') (many parseTree)
    leaf = Leaf <$> many1 (noneOf "()")
In my version I consider the whole string between parentheses as a Leaf node (i.e., I do not split at whitespace).
Now we need to collect the subtrees of a tree we are interested in:
nodes :: Tree -> [Tree]
nodes (Leaf _) = []
nodes t@(Node ts) = t : concatMap nodes ts
Finally, a Show-instance for Trees allows us to turn them into strings.
instance Show Tree where
  showsPrec d (Leaf x) = showString x
  showsPrec d (Node xs) = showString "(" . showList xs . showString ")"
    where
      showList [] = id
      showList (x:xs) = shows x . showList xs
Then the original task can be solved, e.g., by:
parseGroups :: Parser [String]
parseGroups = map show . nodes <$> parseTree
> parseTest parseGroups "((a b c) a b c)"
["((a b c) a b c)","(a b c)"]

Checking that a price value in a string is in the correct format

I use n <- getLine to get a price from the user. How can I check whether the value is correct? (A price can contain digits and '.', and must be greater than 0.)
It doesn't work:
isFloat = do
  n <- getLine
  let val = case reads n of
        ((v,_):_) -> True
        _         -> False
If The Input Is Always Valid Or Exceptions Are OK
If you have users entering decimal numbers in the form of "123.456" then this can simply be converted to a Float or Double using read:
n <- getLine
let val = read n
Or in one line (having imported Control.Monad):
n <- liftM read getLine
To Catch Erroneous Input
The above code fails with an exception if the users enter invalid entries. If that's a problem then use reads and listToMaybe (from Data.Maybe):
n <- liftM (fmap fst . listToMaybe . reads) getLine
If that code looks complex then don't sweat it - the below is the same operation but doing all the work with explicit case statements:
n <- getLine
let val = case reads n of
      ((v,_):_) -> Just v
      _         -> Nothing
Notice we pattern match to get the first element of the tuple in the head of the list. The head of the list is (v,_), and its first element is v. The underscore (_) just means "ignore the value in this spot".
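For example, at a GHCi prompt:
>reads "12.5 tax" :: [(Double, String)]
[(12.5," tax")]
>reads "abc" :: [(Double, String)]
[]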
If Floating Point Isn't Acceptable
Floating-point values are well known to be approximate and not suitable for real-world financial computations (though perhaps fine for homework, depending on your professor). In this case you'd want to read the values into a Rational (from Data.Ratio).
n <- liftM maybeRational getLine
...
where
  -- interpret "a.b" as a + b/10^(length b), e.g. "12.50" ==> 25 % 2
  maybeRational :: String -> Maybe Rational
  maybeRational str =
    let (a, b) = break (== '.') str
        frac   = drop 1 b
    in do whole <- readMaybe a
          if null frac
            then return (whole % 1)
            else do f <- readMaybe frac
                    return (whole % 1 + f % (10 ^ length frac))
  readMaybe = fmap fst . listToMaybe . reads
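For instance, at a GHCi prompt:
>maybeRational "12.50"
Just (25 % 2)
>maybeRational "free"
Nothing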
In addition to the parsing advice provided by TomMD, consider using the appropriate monad for error reporting. It allows you to conveniently chain computations which can fail, avoiding explicit error checking on every step.
{-# LANGUAGE FlexibleContexts #-}
import Control.Monad.Error

parsePrice :: MonadError String m => String -> m Double
parsePrice s = do
  x <- case reads s of
         [(x, "")] -> return x
         _         -> throwError "Not a valid real number."
  when (x <= 0) $ throwError "Price must be positive."
  return x

main = do
  n <- getLine
  case parsePrice n of
    Left err -> putStrLn err
    Right x  -> putStrLn $ "Price is " ++ show x

F# How to tokenise user input: separating numbers, units, words?

I am fairly new to F#, but have spent the last few weeks reading reference materials. I wish to process a user-supplied input string, identifying and separating the constituent elements. For example, for this input:
XYZ Hotel: 6 nights at 220EUR / night
plus 17.5% tax
the output should resemble something like a list of tuples:
[ ("XYZ", Word); ("Hotel:", Word);
("6", Number); ("nights", Word);
("at", Operator); ("220", Number);
("EUR", CurrencyCode); ("/",
Operator); ("night", Word);
("plus", Operator); ("17.5",
Number); ("%", PerCent); ("tax",
Word) ]
Since I'm dealing with user input, it could be anything. Thus, expecting users to comply with a grammar is out of the question. I want to identify the numbers (could be integers, floats, negative...), the units of measure (optional, but could include SI or Imperial physical units, currency codes, counts such as "night/s" in my example), mathematical operators (as math symbols or as words, including "at", "per", "of", "discount", etc.), and all other words.
I have the impression that I should use active pattern matching -- is that correct? -- but I'm not exactly sure how to start. Any pointers to appropriate reference material or similar examples would be great.
I put together an example using the FParsec library. The example is not robust at all but it gives a pretty good picture of how to use FParsec.
type Element =
    | Word of string
    | Number of string
    | Operator of string
    | CurrencyCode of string
    | PerCent of string

let parsePerCent state =
    (parse {
        let! r = pstring "%"
        return PerCent r
    }) state

let currencyCodes = [|
    pstring "EUR"
|]

let parseCurrencyCode state =
    (parse {
        let! r = choice currencyCodes
        return CurrencyCode r
    }) state

let operators = [|
    pstring "at"
    pstring "/"
|]

let parseOperator state =
    (parse {
        let! r = choice operators
        return Operator r
    }) state

let parseNumber state =
    (parse {
        let! e1 = many1Chars digit
        let! r = opt (pchar '.')
        let! e2 = manyChars digit
        return Number (e1 + (if r.IsSome then "." else "") + e2)
    }) state

let parseWord state =
    (parse {
        let! r = many1Chars (letter <|> pchar ':')
        return Word r
    }) state

let elements = [|
    parseOperator
    parseCurrencyCode
    parseWord
    parseNumber
    parsePerCent
|]

let parseElement state =
    (parse {
        do! spaces
        let! r = choice elements
        do! spaces
        return r
    }) state

let parseElements state =
    manyTill parseElement eof state

let parse (input:string) =
    let result = run parseElements input
    match result with
    | Success (v, _, _) -> v
    | Failure (m, _, _) -> failwith m
It sounds like what you really want is just a lexer. A good alternative to FParsec would be FSLex. (A good intro tutorial, albeit somewhat dated, can be found on my old blog here.) Using FSLex you can take your input text:
XYZ Hotel: 6 nights at 220EUR / night plus 17.5% tax
And get it properly tokenized into something like:
[ Word("XYZ"); Hotel; Int(6); Word("nights"); Word("at"); Int(220); EUR; ... ]
The next step, once you have a list of tokens, is to do some form of pattern matching / analysis to extract semantic information (which I assume is what you are really after). With the normalized token stream, it should be as simple as:
let rec processTokenList tokens =
    match tokens with
    | Float(x) :: Keyword("EUR") :: rest -> // Dollar amount x
    | Word(x) :: Keyword("Hotel") :: rest -> // Hotel x
    | hd :: rest ->
        // Couldn't find anything interesting...
        processTokenList rest
That should at least get you started. But note that the more 'formal' your input gets, the more useful your lexing will be. (And if you only accept a very specific input, then you can use a proper parser and be done with it!)
