I have recently started learning Haskell, I'm testing what I've learnt using the Google KickStart code challanges.
The problem I am looking at essentially asks to read square matrices from standard input and sum the elements on the diagonals parallel to the main diagonal.
I have written a simple program to solve the problem, this passes the first test but not the second.
The problem seems to be that the input in the second test is much larger than the first and the parsers I have written are either too slow or use too much memory.
The input is formatted as a sequence of matrices
n
a11 ... a1n
...
an1 ... ann
My first attempt using builtin function
type Matrix = [[Int]]
parseInput:: String -> [Matrix]
parseInput = parseInput' . lines
where
parseInput':: [String] -> [Matrix]
parseInput' [] = []
parseInput' (i:is) = let (m,rest) = splitAt (read i) is in
(map ((map read).words) m) : (parseInput' rest)
main = do
input <- getContents
let result = show $ parseInput input
putStr result
The second attempt using the parsec library
import Text.Parsec
import Text.Parsec.String (Parser)
type Row = [Int]
type Matrix = [[Int]]
matrixSize::(Integral a, Read a) => Parser a
matrixSize = do
n <- ( many1 digit ) <* newline
return $ read n
matrixRow:: Parser Row
matrixRow = do
row <- (sepBy1 (many1 digit) (char ' ')) <* (newline <|> (eof >> return ' '))
return $ map read row
parseMatrix:: Parser Matrix
parseMatrix = do
n <- matrixSize
mat <- count n matrixRow
return mat
parseInput:: Parser [Matrix]
parseInput = (many1 parseMatrix)
prepareInput:: String -> [Matrix]
prepareInput input = case parse parseInput "" input of
Right l -> l
Left _ -> []
main = do
input <- getContents
let result = show $ prepareInput input
putStr result
According to Google website the second test instance contains at most 100 matrices of which 10 are of size at most 1000 with elements of size at most 10^7, the remaining matrices are smaller of size at most 100.
I have generated a random test file with 10 large matrices of size at most 1000 and it turns out to be roughly 30MB in size.
Running either parser on the test file takes roughly 13s on my machine and the second uses more than 1GB of memory.
Both fail the requirements for running on the server: the first exceeds 20s execution time and the second fails the memory constraint of 1GB.
Is there a better way for parsing large files?
Is there any inefficiency that can be solved in my code?
Any advice will be greatly appreciated!
so I can't get my head around this problem:
I have the following code:
data Number = NumberInt Integer
| NumberFloat Double
deriving(Show, Eq)
intParser :: Parser Integer
--code of the parser
doubleParser :: Parser Double
--code of the parser
intOrFloat :: Parser Number
intOrFloat = -- to do
one of my approaches was to implement the intOrFloat the following way:
intOrFloat :: Parser Number
intOrFloat =
(do
e<- doubleParser
let result = (e :: Number)
pure result)
<|>
(do
f<- intParser
let result = (f :: Number)
pure result
)
But I always end up with the error:
Couldn't match expected type ‘Number’ with actual type ‘Integer’
Could somebody please explain me how to combine the two parsers to a new parser with another type ? I don't understand what the problem is.
I am using parsec.
I am new to Haskell so please be gentle.
Thank you.
You can't simply cast the value to call it a Number; you have to construct a new value of type Number.
do
e <- DoubleParser -- e :: Double, assuming success
let result = NumberFloat e -- result :: Number
pure result
The above can simply be written as
NumberFloat <$> DoubleParser
In full,
intOrFloat :: Parser Number
intOrFloat = NumberFloat <$> doubleParser <|> NumberInt <$> intParser
I'm learning some techniques to make a very simple Haskell parser that serves to calculation consistence (addition, subtraction and other trivial operations). Library I use is Parsec. Although I've got some comprehension on binary calculation, it seems to be tough to me if I try to make a unary operator function, for example that of negation (~). There is a code snippet I use to implement parsing for multiplication:
import Text.Parsec hiding(digit)
import Data.Functor
type Parser a = Parsec String () a
digit :: Parser Char
digit = oneOf ['0'..'9']
number :: Parser Integer
number = read <$> many1 digit
applyMany :: a -> [a -> a] -> a
applyMany x [] = x
applyMany x (h:t) = applyMany (h x) t
multiplication :: Parser Integer
multiplication = do
lhv <- number
spaces
char '*'
spaces
rhv <- number
return $ lhv * rhv
Switching to an unary operation, my code for factorial as follows:
fact :: Parser Integer
fact = do
spaces
char '!'
rhv <- number
spaces
return $ factorial rhv
factorial :: Parser Integer -> Parser Integer
factorial n
| n == 0 || n == 1 = 1
| otherwise = n * factorial (n-1)
And once module is getting loaded, an error message appears just like that:
Couldn't match type `Integer'
with `ParsecT String () Data.Functor.Identity.Identity Integer'
Expected type: Parser Integer
Actual type: Integer
Confusingly, it's a hard case for me to realize what's wrong with my comprehension about unary ops comparing them to binary ones. Hoping any help to fix that.
factorial doesn't define a parser; it computes a factorial. The type should just be Integer -> Integer, not Parser Integer -> Parser Integer.
As a learning project I am currently trying to build a parser for the ELF file format in Haskell. Elf binary files have a section called a program header, and this header contains a word dedicated to flags. The program header contains two fields for flags: one for 64 bit flags, one for 32 bit flags, in non-sequential order.
Like so:
data ProgramHeader =
ProgramHeader {
getSegmentType :: SegmentType
, get64SegmentFlags :: SegmentFlag
, getOffset :: Offset
, getVirtualAddress :: Address
, getPhysicalAddress :: Address
, getFileSize :: Size
, getMemorySize :: Size
, get32SegmentFlags :: SegmentFlag
, getAlignment :: Either Word32 Word64
}
And parsing looks like this:
parseProgramHeader :: WordSize -> Endianness -> Get Program
parseProgramHeader wordsize en = do
st <- parseSegmentType
flags64 <- parseSegmentFlag
offset <- parseVariant wordsize en
virtualAddress <- parseVariant wordsize en
physicalAddress <- parseVariant wordsize en
fileSize <- parseVariant wordsize en
memorySize <- parseVariant wordsize en
flags32 <- parseSegmentFlag
a <- parseVariableWord wordsize
return $ ProgramHeader st flags64 offset virtualAddress physicalAddress fileSize memorySize flags32 a
Now I'd like to be able to merge those two SegmentFlag fields into a single one as such:
data ProgramHeader =
ProgramHeader {
getSegmentType :: SegmentType
, getSegmentFlags :: SegmentFlag
, getOffset :: Offset
, getVirtualAddress :: Address
, getPhysicalAddress :: Address
, getFileSize :: Size
, getMemorySize :: Size
, getAlignment :: Either Word32 Word64
}
But, I would still need to parse the flags word twice (in each corresponding position), keeping only the relevant one.
My intuition is that it means I need to bind the parseSegmentFlag value only for the corresponding wordsize; while still calling parseSegmentFlag if it's not the case, but discarding the value.
In pseudocode:
parseProgramHeader :: WordSize -> Endianness -> Get Program
parseProgramHeader wordsize en = do
st <- parseSegmentType
if wordsize == Bit64
then
do flags <- parseSegmentFlag
else
discard $ parseSegmentFlag
[...]
if wordsize == Bit32
then do flags <- parseSegmentFlag
else do discard $ parseSegmentFlag
Now I have absolutely no idea how I could perform that elegantly in Haskell. Any ideas? Pointers?
Thanks!
Another option, just leave the code as you have it, then at the end pick out the version that is relevant, i.e.
... -- as before
let flags = case ws of
Bit32 -> flags32
Bit64 -> flags64
return $ ProgramHeader st flags ...
(I use case instead of if because if you happen to add another case, you'll get a warning here instead of falling through to whichever branch you picked as the default)
I'm starting to learn Haskell and wish to parse a PPM image for execrsice. The structure of the PPM format is rather simple, but it is tricky. It's described here. First of all, I defined a type for a PPM Image:
data Pixel = Pixel { red :: Int, green :: Int, blue :: Int} deriving(Show)
data BitmapFormat = TextualBitmap | BinaryBitmap deriving(Show)
data Header = Header { format :: BitmapFormat
, width :: Int
, height :: Int
, colorDepth :: Int} deriving(Show)
data PPM = PPM { header :: Header
, bitmap :: [Pixel]
}
bitmap should contain the entire image. This is where the first challange comes - the part that contains the actual image data in PPM can be either textual or binary (described in the header).
For textual bitmaps I wrote the following function:
parseTextualBitmap :: String -> [Pixel]
parseTextualBitmap = map textualPixel . chunksOf 3 . wordsBy isSpace
where textualPixel (r:g:b:[]) = Pixel (read r) (read g) (read b)
I'm not sure what to do with binary bitmaps, though. Using read converts a string representation of numbers to numbers. I want to convert "\x01" to 1 of type Int.
The second challange is parsing the header. I wrote the following function:
parseHeader :: String -> Header
parseHeader = constructHeader . wordsBy isSpace . filterComments
where
filterComments = unlines . map (takeWhile (/= '#')) . lines
formatFromText s
| s == "P6" = BinaryBitmap
| s == "P3" = TextualBitmap
constructHeader (format:width:height:colorDepth:_) =
Header (formatFromText format) (read width) (read height) (read colorDepth)
Which works pretty well. Now I should write the module exported function (let's call it parsePPM) which gets the entire file content (String) and then return PPM. The function should call parseHeader, deterime the bitmap format, call the apropriate parse(Textual|Binary)Bitmap and then construct a PPM with the result. Once parseHeader returns I should start decoding the bitmap from the point that parseHeader stopped in. However, I cannot know in which point of the string parseHeader stopped. The only solution I could think of is that instead of Header, parseHeader will return (Header,String), when the second element of the tuple is the remainder retrieved by constructHeader (which currently named as _). But I'm not really sure it's the "Haskell Way" of doing things.
To sum up my questions:
1. How do I decode the binary format into a list of Pixel
2. How can I know in which point the header ends
Since I'm learning Haskell by myself I have no one to actually review my code, so in addition to answering my questions I will appriciate any comment about the way I code (coding style, bugs, alternative way to do things, etc...).
Lets start with question 2 because it is easier to answer. Your approach is correct: as you parse things, you remove those characters from the input string, and return a tuple containing the result of the parse, and the remaining string. However, thereis no reason to write all this from scratch (except perhaps as an academic exercise) - there are plenty of parsers which will take care of this issue for you. The one I will use is Parsec. If you are new to monadic parsing you should first read the section on Parsec in RWH.
As for question 1, if you use ByteString instead of String, then parsing single bytes is easy since single bytes are the atomic elements of ByteStrings!
There is also the issue of the Char/ByteString interface. With Parsec, this is a non-issue since you can treat a ByteString as a sequence of Byte or Char - we will see this later.
I decided to just write the full parser - this is a very simple language so with all the primitives defined for you in the Parsec library, it is very easy and very concise.
The file header:
import Text.Parsec.Combinator
import Text.Parsec.Char
import Text.Parsec.ByteString
import Text.Parsec
import Text.Parsec.Pos
import Data.ByteString (ByteString, pack)
import qualified Data.ByteString.Char8 as C8
import Control.Monad (replicateM)
import Data.Monoid
First, we write the 'primitive' parsers - that is, parsing bytes, parsing textual numbers, and parsing whitespace (which the PPM format uses as a seperator):
parseIntegral :: (Read a, Integral a) => Parser a
parseIntegral = fmap read (many1 digit)
digit parses a single digit - you'll notice that many function names explain what the parser does - and many1 will apply the given parser 1 or more times. Then we read the resulting string to return an actual number (as opposed to a string). In this case, the input ByteString is being treated as text.
parseByte :: Integral a => Parser a
parseByte = fmap (fromIntegral . fromEnum) $ tokenPrim show (\pos tok _ -> updatePosChar pos tok) Just
For this parser, we parse a single Char - which is really just a byte. It is just returned as a Char. We could safely make the return type Parser Word8 because the universe of values that can be returned is [0..255]
whitespace1 :: Parser ()
whitespace1 = many1 (oneOf "\n ") >> return ()
oneOf takes a list of Char and parses any one of the characters in the order given - again, the ByteString is being treated as Text.
Now we can write the parser for the header.
parseHeader :: Parser Header
parseHeader = do
f <- choice $ map try $
[string "P3" >> return TextualBitmap
,string "P6" >> return BinaryBitmap]
w <- whitespace1 >> parseIntegral
h <- whitespace1 >> parseIntegral
d <- whitespace1 >> parseIntegral
return $ Header f w h d
A few notes. choice takes a list of parsers and tries them in order. try p takes the parser p, and 'remembers' the state before p starts parsing. If p succeeds, then try p == p. If p fails, then the state before p started is restored and you pretend you never tried p. This is necessary due to how choice behaves.
For the pixels, we have two choices as of now:
parseTextual :: Header -> Parser [Pixel]
parseTextual h = do
xs <- replicateM (3 * width h * height h) (whitespace1 >> parseIntegral)
return $ map (\[a,b,c] -> Pixel a b c) $ chunksOf 3 xs
We could use many1 (whitespace 1 >> parseIntegral) - but this wouldn't enforce the fact that we know what the length should be. Then, converting the list of numbers to a list of pixels is trivial.
For binary data:
parseBinary :: Header -> Parser [Pixel]
parseBinary h = do
whitespace1
xs <- replicateM (3 * width h * height h) parseByte
return $ map (\[a,b,c] -> Pixel a b c) $ chunksOf 3 xs
Note how the two are almost identical. You could probably generalize this function (it would be especially useful if you decided to parse the other types of pixel data - monochrome and greyscale).
Now to bring it all together:
parsePPM :: Parser PPM
parsePPM = do
h <- parseHeader
fmap (PPM h) $
case format h of
TextualBitmap -> parseTextual h
BinaryBitmap -> parseBinary h
This should be self-explanatory. Parse the header, then parse the body based on the format. Here are some examples to try it on. They are the ones from the specification page.
example0 :: ByteString
example0 = C8.pack $ unlines
["P3"
, "4 4"
, "15"
, " 0 0 0 0 0 0 0 0 0 15 0 15"
, " 0 0 0 0 15 7 0 0 0 0 0 0"
, " 0 0 0 0 0 0 0 15 7 0 0 0"
, "15 0 15 0 0 0 0 0 0 0 0 0" ]
example1 :: ByteString
example1 = C8.pack ("P6 4 4 15 ") <>
pack [0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 0, 15, 0, 0, 0, 0, 15, 7,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 7, 0, 0, 0, 15,
0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
Several notes: this doesn't handle comments, which are part of the spec. The error messages are not very useful; you can use the <?> function to create your own error messages. The spec also indicates 'The lines should not be longer than 70 characters.' - this is also not enforced.
edit:
Just because you see do-notation, doesn't necessarily mean that you are working with impure code. Some monads (like this parser) are still pure - they are just used for convenience. For example, you can write your parser with the type parser :: String -> (a, String), or, what we have done here, is we use a new type: data Parser a = Parser (String -> (a, String)) and have parser :: Parser a; we then write a monad instance for Parser to get the useful do-notation. To be clear, Parsec supports monadic parsing, but our parser is not monadic - or rather, uses the Identity monad, which is just newtype Identity a = Identity { runIdentity :: a }, and is only necessary because if we used type Identity a = a we would have 'overlapping instances' errors everywhere, which is not good.
>:i Parser
type Parser = Parsec ByteString ()
-- Defined in `Text.Parsec.ByteString'
>:i Parsec
type Parsec s u = ParsecT s u Data.Functor.Identity.Identity
-- Defined in `Text.Parsec.Prim'
So then, the type of Parser is really ParsecT ByteString () Identity. That is, the parser input is ByteString, the user state is () - which just means we aren't using the user state, and the monad in which we are parsing is Identity. ParsecT is itself just a newtype of:
forall b.
State s u
-> (a -> State s u -> ParseError -> m b)
-> (ParseError -> m b)
-> (a -> State s u -> ParseError -> m b)
-> (ParseError -> m b)
-> m b
All those functions in the middle are just used to pretty-print errors. If you are parsing 10's of thousands of characters and an error occurs, you won't be able to just look at it and see where that happened - but Parsec will tell you the line and column. If we specialize all the types to our Parser, and pretend that Identity is just type Identity a = a, then all the monads disappear and you can see that the parser is not impure. As you can see, Parsec is a lot more powerful than is required for this problem - I just used it due to familiarity, but if you were willing to write your own primitive functions like many and digit, then you could get away with using newtype Parser a = Parser (ByteString -> (a, ByteString)).