Haskell - Parse binary stream - parsing

I'm new to Haskell, so I'm probably missing something very simple.
I'm having a supprisingly hard time trying to parse a structured binary stream.
The binary stream has varying and conditional parts (like a field which determines how many items follow it (ncount), or what type of message follows it (type)).
To get a simple working example, I'm trying to parse this hypothetical binary structure:
+----------------+---------------+--------------
| Magic (8 bits) | type (3 bits) | type message...
+----------------+---------------+--------------
Type 1:
+----------------+-------------+-------------+-----------------+
|ncount (3 bits) | n1 (3 bits) | n1 (3 bits) | nN (3 bits)... |
+----------------+-------------+-------------+-----------------+
Type 2:
+----------------+---------------+
| num1 (7 bits) | num2 (7 bits) |
+----------------+---------------+
...
My code so far:
{-# LANGUAGE RecordWildCards #-}
module Main where
import Data.Bits
import Data.Binary as B
import qualified Data.Binary.Bits.Get as BG
import qualified Data.ByteString as BS
data Header = Header {
magic :: Word8
,mtype :: Word8
,num1 :: Word8
,num2 :: Word8
} deriving (Show)
--instance Show (Get Header) where
-- show (Header {..}) = show . magic
parse_header :: B.Get Header
parse_header = BG.runBitGet parse_header'
-- Example, assume type 2 for now
parse_header' :: BG.BitGet Header
parse_header' = do
magic <- BG.getWord8 8
mtype <- BG.getWord8 3
num1 <- BG.getWord8 7
num2 <- BG.getWord8 7
return $ Header magic mtype num1 num2
main :: IO ()
main = do
putStrLn "Start"
-- File containing binary stream
fstr <- BS.readFile "data/hbin.bin"
let header = parse_header
in
-- How do I print out Header?
print header
-- * No instance for (Show (Get Header))
-- arising from a use of `print'
-- * In the expression: print header
putStrLn "\nEnd"
In which I get the error:
* No instance for (Show (Get Header)) arising from a use of `print'
* In the expression: print header
Obviously, I plan to parse this recursively, but for now I can't even see a value I've read.
I've followed https://wiki.haskell.org/Dealing_with_binary_data but this uses Data.Binary.Strict (binary-strict) which doesn't compile on Windows (atleast on mine).
I've also followed https://hackage.haskell.org/package/binary-bits-0.5/docs/Data-Binary-Bits-Get.html but it doesn't show how to use values you have gotten with getWord8 (Do I need to put them into an Int to read them as decimal?)
Again, I'm new to Haskell and not familiar with Monads (which I believe Get is).

header = parse_header is only giving a new name to the parser. You need a function to run the parser, there's one here (here choosing runGet for simplicity, but you should prefer the other one, to handle the error case more easily):
runGet :: Get a -> ByteString -> a
Note that it takes a lazy ByteString (Data.ByteString.Lazy) instead of a strict one (Data.ByteString).
...
import Data.ByteString.Lazy (toLazy)
...
main = do
fstr <- BS.readFile "data/hbin.bin"
let header = runGet parse_header (fromStrict fstr)
print header
putStrLn "End"

Related

parsing Float with 'get' from ByteString not behaving as expected

In the context of creating a simple STL parser, I found some, at least to me, unexpected behavior in get from Data.Binary. It appears to me that it does not stop reading the ByteString after the 32 bits that I would assume it should.
{-# LANGUAGE FlexibleContexts #-}
module STLTransform where
import qualified Data.ByteString.Lazy as BL
import Data.Binary
import Data.Binary.Get
import Data.Binary.Put
import Data.Word
import Numeric.LinearAlgebra
import GHC.Float
import System.Endian
getTriangle = do
normal <- getR3
vertices <- sequence $ take 3 $ trace (show normal) $ repeat getR3
return (normal, vertices)
where getR3 = fmap (vector.fmap float2Double)
$ sequence $ take 3 $ repeat get
getSTL :: Get ([Word8], Word32, [(Vector R, [Vector R])], Word16)
getSTL = do
header <- sequence $ take 80 $ repeat get
number <- fmap (toBE32) $ get
triangles <- sequence $ take (fromIntegral number) $ repeat getTriangle
attribute <- get
return (header, number, triangles, attribute)
readSTL fn = BL.readFile fn >>= return.runGet getSTL
Attempting to read an STL-file using this code, I would get an error like
*Main STLTransform> readSTL "test.stl"
*** Exception: Data.Binary.Get.runGet at position 1268784: not enough bytes
CallStack (from HasCallStack):
error, called at libraries/binary/src/Data/Binary/Get.hs:351:5 in binary-0.8.6.0:Data.Binary.Get
After some debugging, I isolated the problem to getTriangle, and more specifically the parsing of Float values. By replacing float2Double with (\a -> fromIntegral (a :: Word32))
getTriangle = do
normal <- getR3
vertices <- sequence $ take 3 $ trace (show normal) $ repeat getR3
return (normal, vertices)
where getR3 = fmap (vector.fmap (\a -> fromIntegral (a :: Word32)))
$ sequence $ take 3 $ repeat get
I get code that runs as I intended (although obviously with incorrect numbers). So the the question is, why are Float and Word32 treated differently? Is not Float also a 32 bit value?
The version of the package used is binary-0.8.6.0
EDIT:
The corrected version of the function is
getTriangle :: Get (Vector R, [Vector R], Word16)
getTriangle = do
normal <- getR3
vertices <- sequence $ take 3 $ repeat getR3
attribute <- get
return (normal, vertices, attribute)
where getR3 = fmap (vector.fmap float2Double)
$ sequence $ take 3 $ repeat getFloatle
Unrelated to the stated issue, but corrected from the original post, the attribute is on the triangle level, not on the file level.
The Binary instance of Float parses a pair
(Integer, Int) and then uses encodeFloat :: Integer -> Int -> Float, which is why it takes more than 32 bits.
This is a known issue left around for backwards compatibility: https://github.com/kolmodin/binary/issues/69
Don't use get blindly, there is more than one way to encode things. In this case, there are other parsers for 32 bit formats of Float: https://hackage.haskell.org/package/binary-0.8.7.0/docs/Data-Binary-Get.html#v:getFloatbe

Generating a parser given a list of tokens

Background
I'm trying to implement a date printing and parsing system using Parsec.
I have successfully implemented a printing function of type
showDate :: String -> Date -> Parser String
It takes parses a formatting string and creates a new string based on the tokens that the formatted string presented.
For example
showDate "%d-%m-%Y" $ Date 2015 3 17
has the output Right "17-3-2015"
I already wrote a tokenizer to use in the showDate function, so I thought that I could just use the output of that to somehow generate a parser using the function readDate :: [Token] -> Parser Date. My idea quickly came to a halt as I realised I had no idea how to implement this.
What I want to accomplish
Assume we have the following functions and types (the implementation doesn't matter):
data Token = DayNumber | Year | MonthNumber | DayOrdinal | Literal String
-- Parses four digits and returns an integer
pYear :: Parser Integer
-- Parses two digits and returns an integer
pMonthNum :: Parser Int
-- Parses two digits and returns an integer
pDayNum :: Parser Int
-- Parses two digits and an ordinal suffix and returns an integer
pDayOrd :: Parser Int
-- Parses a string literal
pLiteral :: String -> Parser String
The parser readDate [DayNumber,Literal "-",MonthNumber,Literal "-",Year] should be equivalent to
do
d <- pDayNum
pLiteral "-"
m <- pMonthNum
pLiteral "-"
y <- pYear
return $ Date y m d
Similarly, the parser readDate [Literal "~~", MonthNumber,Literal "hello",DayNumber,Literal " ",Year] should be equivalent to
do
pLiteral "~~"
m <- pMonthNum
pLiteral "hello"
d <- pDayNum
pLiteral " "
y <- pYear
return $ Date y m d
My intuition suggests there's some kind of concat/map/fold using monad bindings that I can use for this, but I have no idea.
Questions
Is parsec the right tool for this?
Is my approach convoluted or ineffective?
If not, how do I achieve this functionality?
If so, what should I try to do instead?
Your Tokens are instructions in a small little language for date formats [Token].
import Data.Functor
import Text.Parsec
import Text.Parsec.String
data Date = Date Int Int Int deriving (Show)
data Token = DayNumber | Year | MonthNumber | Literal String
In order to interpret this language, we need a type that represents the state of the interpreter. We start off not knowing any of the components of the Date and then discover them as we encounter DayNumber, Year, or MonthNumber. The following DateState represents the state of knowing or not knowing each of the components of the Date.
data DateState = DateState {dayState :: (Maybe Int), monthState :: (Maybe Int), yearState :: (Maybe Int)}
We will start interpreting a [Token] with DateState Nothing Nothing Nothing.
Each Token will be converted into a function that reads the DateState and produces a parser that computes the new DateState.
readDateToken :: Token -> DateState -> Parser DateState
readDateToken (DayNumber) ds =
do
day <- pNatural
return ds {dayState = Just day}
readDateToken (MonthNumber) ds =
do
month <- pNatural
return ds {monthState = Just month}
readDateToken (Year) ds =
do
year <- pNatural
return ds {yearState = Just year}
readDateToken (Literal l) ds = string l >> return ds
pNatural :: Num a => Parser a
pNatural = fromInteger . read <$> many1 digit
To read a date interpreting a [Token] we will first convert it into a list of functions that decide how to parse a new state based on the current state with map readDateToken :: [Token] -> [DateState -> Parser DateState]. Then, starting with a parser that succeeds with the initial state return (DateState Nothing Nothing Nothing) we will bind all of these functions together with >>=. If the resulting DateState doesn't completely define the Date we will complain that the [Token]s was invalid. We also could have checked this ahead of time. If you want to include invalid date errors as parsing errors this would also be the place to check that the Date is valid and doesn't represent a non-existent date like April 31st.
readDate :: [Token] -> Parser Date
readDate tokens =
do
dateState <- foldl (>>=) (return (DateState Nothing Nothing Nothing)) . map readDateToken $ tokens
case dateState of
DateState (Just day) (Just month) (Just year) -> return (Date day month year)
_ -> fail "Date format is incomplete"
We will run a few examples.
runp p s = runParser p () "runp" s
main = do
print . runp (readDate [DayNumber,Literal "-",MonthNumber,Literal "-",Year]) $ "12-3-456"
print . runp (readDate [Literal "~~", MonthNumber,Literal "hello",DayNumber,Literal " ",Year]) $ "~~3hello12 456"
print . runp (readDate [DayNumber,Literal "-",MonthNumber,Literal "-",Year,Literal "-",Year]) $ "12-3-456-789"
print . runp (readDate [DayNumber,Literal "-",MonthNumber]) $ "12-3"
This results in the following outputs. Notice that when we asked to read the Year twice, the second of the two years was used in the Date. You can choose a different behavior by modifying the definitions for readDateToken and possibly modifying the DateState type. When the [Token] didn't specify how to read one of the date fields we get the error Date format is incomplete with a slightly incorrect description; this could be improved upon.
Right (Date 12 3 456)
Right (Date 12 3 456)
Right (Date 12 3 789)
Left "runp" (line 1, column 5):
unexpected end of input
expecting digit
Date format is incomplete

Converting normal attoparsec parser code to conduit/pipe based

I have written a following parsing code using attoparsec:
data Test = Test {
a :: Int,
b :: Int
} deriving (Show)
testParser :: Parser Test
testParser = do
a <- decimal
tab
b <- decimal
return $ Test a b
tParser :: Parser [Test]
tParser = many' $ testParser <* endOfLine
This works fine for small sized files, I execute it like this:
main :: IO ()
main = do
text <- TL.readFile "./testFile"
let (Right a) = parseOnly (manyTill anyChar endOfLine *> tParser) text
print a
But when the size of the file is greater than 70MB, it consumes tons of memory. As a solution, I thought I would use attoparsec-conduit. After going through their API, I'm not sure how to make them work together. My parser has the type Parser Test but it's sinkParser actually accepts parser of type Parser a b. I'm interested in how to execute this parser in constant memory ? (A pipes based solution is also acceptable, but I'm not used to the Pipes API.)
The first type parameter to Parser is just the data type of the input (either Text or ByteString). You can provide your testParser function as the argument to sinkParser and it will work fine. Here's a short example:
{-# LANGUAGE OverloadedStrings #-}
import Conduit (liftIO, mapM_C, runResourceT,
sourceFile, ($$), (=$))
import Data.Attoparsec.Text (Parser, decimal, endOfLine, space)
import Data.Conduit.Attoparsec (conduitParser)
data Test = Test {
a :: Int,
b :: Int
} deriving (Show)
testParser :: Parser Test
testParser = do
a <- decimal
space
b <- decimal
endOfLine
return $ Test a b
main :: IO ()
main = runResourceT
$ sourceFile "foo.txt"
$$ conduitParser testParser
=$ mapM_C (liftIO . print)
Here is the pipes solution (assuming that you are using a Text-based parser):
import Pipes
import Pipes.Text.IO (fromHandle)
import Pipes.Attoparsec (parsed)
import qualified System.IO as IO
main = IO.withFile "./testfile" IO.ReadMode $ \handle -> runEffect $
for (parsed (testParser <* endOfLine) (fromHandle handle)) (lift . print)

Insert a character into parser combinator character stream in Haskell

This question is related to both Parsec and uu-parsinglib. When we write parser combinators, they process characters streams from compiler. Is it somehow possible to parse a character and put it back (or return another character back) to the input stream?
I want for example to parse input "test + 5", parse the t, e, s, t and after recognition of test pattern, put for example v character back into the character stream, so while continuating the parsing process we are matching against v + 5
I do not want to use this in any particular case for now - I want to deeply learn the possibilities.
I'm not sure if it's possible with these parsers directly, but in general you can accomplish it by combining parsers with some streaming that allows injecting leftovers.
For example, using attoparsec-conduit you can turn a parser into a conduit using
sinkParser :: (AttoparsecInput a, MonadThrow m)
=> Parser a b -> Consumer a m b
where Consumer is a special kind of conduit that doesn't produce any output, only receives input and returns a final value.
Since conduits support leftovers, you can create a helper method that converts a parser that optionally returns a value to be pushed into the stream into a conduit:
import Data.Attoparsec.Types
import Data.Conduit
import Data.Conduit.Attoparsec
import Data.Functor
reinject :: (AttoparsecInput a, MonadThrow m)
=> Parser a (Maybe a, b) -> Consumer a m b
reinject p = do
(lo, r) <- sinkParser p
maybe (return ()) leftover lo
return r
Then you convert standard parsers to conduits using sinkParser and these special parsers using reinject, and then combine conduits instead of parsers.
I think the simplest way to archive this is to build a multi-layered parser. Think of a lexer + parser combination. This is a clean approach to this problem.
You have to separate the two kind of parsing. The search-and-replace parsing goes to the first parser and the build-the-AST parsing to the second. Or you can create an intermediate token representation.
import Text.Parsec
import Text.Parsec.String
parserLvl1 :: Parser String
parserLvl1 = many (try (string "test" >> return 'v') <|> anyChar)
parserLvl2 :: Parser Plus
parserLvl2 = do text1 <- many (noneOf "+")
char '+'
text2 <- many (noneOf "+")
return $ Plus text1 text2
data Plus = Plus String String
deriving Show
wholeParse :: String -> Either ParseError Plus
wholeParse source = do res1 <- parse parserLvl1 "lvl1" source
res2 <- parse parserLvl2 "lvl2" res1
return res2
Now you can parse your example. wholeParse "test+5" results in Right (Plus "v" "5").
Possible variations:
Create a class and an instance for combining wrapped parser stages. (Possibly carrying parser state.)
Create an intermediate representation, a stream of tokens
This is easily done in uu-parsinglib using the pSwitch function. But the question is why you want to do so? Because the v is missing from the input? In that case uu-parsinglib will perform error correction automatically so you do not need something like this. Otherwise you can write
pSwitch :: (st1 -> (st2, st2 -> st1)) -> P st2 a -> P st1 a
pInsert_v = pSwitch (\st1 -> (prepend v st2, id) (pSucceed ())
It depends on your actual state type how the v is actually added, so you will have to define the function prepend yourself. I do not know e.g. how such an insertion would influence the current position in the file etc.
Doaitse Swierstra

Haskell: Traverse through a String/Text File

I am trying to read a script file then process and output it to a html file. In my script file, whenever there is a #title(this is a title), I will add tag [header] this is a title [/header] in my html output. So my approach is to first read the script file, write the content to a string, process the string, then write the string to html file.
In other to recognize the #title, I will need to read character by character in the string. When I read '#', I will need to detect the next character to see if they are t i t l e.
QUESTION: How do I traverse through a string (which is a list of char) in Haskell?
You could use a simple recursion trick, for example
findTag [] = -- end of list code.
findTag ('#':xs)
| take 5 xs == "title" = -- your code for #title
| otherwise = findTag xs
findTag (_:xs) = findTag xs
so basically you just pattern match if the next char (head of list) is '#' and then you check if the next 5 characters form "title". if so you can then continue your parsing code. if next character isnt '#' you just continue the recursing. Once the list is empty you reach the first pattern match.
Someone else might have a better solution.
I hope this answers your question.
edit:
For a bit more flexibility, if you want to find a specific tag you could do this:
findTag [] _ = -- end of list code.
findTag ('#':xs) tagName
| take (length tagName) xs == tagName = -- your code for #title
| otherwise = findTag xs
findTag (_:xs) _ = findTag xs
This way if you do
findTag text "title"
You'll specifically look for the title, and you can always change the tagname to whatever you want.
Another edit:
findTag [] _ = -- end of list code.
findTag ('#':xs) tagName
| take tLength xs == tagName = getTagContents tLength xs
| otherwise = findTag xs
where tLength = length tagName
findTag (_:xs) _ = findTag xs
getTagContents :: Int -> String -> String
getTagContents len = takeWhile (/=')') . drop (len + 1)
to be honest, it's getting a bit messy but here's what's happening:
You first drop the length of the tagName, then one more for the open bracket, and then you finish off by using takeWhile to take the characters until the closing bracket.
Evidently your problem falls into parsing category. As wisely stated by Daniel Wagner, for maintainability reasons you're much better off approaching it generally with a parser.
Another thing is if you want to work with textual data efficiently, you're better off using Text instead of String.
Here's how you could solve your problem using the Attoparsec parser library:
-- For autocasting of hardcoded strings to `Text` type
{-# LANGUAGE OverloadedStrings #-}
-- Import a way more convenient prelude, excluding symbols conflicting
-- with the parser library. See
-- http://hackage.haskell.org/package/classy-prelude
import ClassyPrelude hiding (takeWhile, try)
-- Exclude the standard Prelude
import Prelude ()
import Data.Attoparsec.Text
-- A parser and an inplace converter for title
title = do
string "#title("
r <- takeWhile $ notInClass ")"
string ")"
return $ "[header]" ++ r ++ "[/header]"
-- A parser which parses the whole document to parts which are either
-- single-character `Text`s or modified titles
parts =
(try endOfInput >> return []) ++
((:) <$> (try title ++ (singleton <$> anyChar)) <*> parts)
-- The topmost parser which concats all parts into a single text
top = concat <$> parts
-- A sample input
input = "aldsfj#title(this is a title)sdlfkj#title(this is a title2)"
-- Run the parser and output result
main = print $ parseOnly top input
This outputs
Right "aldsfj[header]this is a title[/header]sdlfkj[header]this is a title2[/header]"
P.S. ClassyPrelude reimplements ++ as an alias for Monoid's mappend, so you can replace it with mappend, <> or Alternative's <|> if you want.
For pattern search-and-replace, you can use
streamEdit.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
title :: Parsec Void String String
title = do
void $ string "#title("
someTill anySingle $ string ")"
editor t = "[header]" ++ t ++ "[/header]"
streamEdit title editor " #title(this is a title) "
" [header]this is a title[/header] "

Resources