Problems decoding a file strictly with Binary in Haskell - parsing

I'm trying to read and decode a binary file strictly, which seems to work most of the time. But unfortunately in a few cases my program fails with
"too few bytes. Failed reading at byte position 1"
I guess Binary's decode function thinks there is no data available,
but I know there is, and just rerunning the program works fine.
I've tried several solutions, but none of them solved my problem :(
using withBinaryFile:
decodeFile' path = withBinaryFile path ReadMode doDecode
  where
    doDecode h = do c <- LBS.hGetContents h
                    return $! decode c
reading the whole file with strict ByteString and decoding from it:
decodeFile' path = decode . LBS.fromChunks . return <$> BS.readFile path
adding some more strictness
decodeFile' path = fmap (decode . LBS.fromChunks . return) $! BS.readFile path
Any ideas what is going on here and how to solve the issue?
Thanks!
EDIT: I think I've figured out my problem. It is not about reading the file strictly. I have a number of processes mostly reading from the file, but from time to time one of them needs to write to it, which truncates the file first and then adds the new content. So for writing I need to set a file lock first, which does not seem to be done when "Binary.encodeFile" is used (when I say process I don't mean threads, but real instances of the same program being run).
EDIT: I finally had some time to solve my problem using POSIX IO and file locks. I've had no more problems since.
In case someone is interested in my current solution, or is able to point out errors/problems, I'll post it here.
Safe encoding to File:
safeEncodeFile path value = do
    fd <- openFd path WriteOnly (Just 0o600) (defaultFileFlags {trunc = True})
    waitToSetLock fd (WriteLock, AbsoluteSeek, 0, 0)
    let cs = encode value
    let outFn = LBS.foldrChunks (\c rest -> writeChunk fd c >> rest) (return ()) cs
    outFn
    closeFd fd
  where
    writeChunk fd bs = unsafeUseAsCString bs $ \ptr ->
        fdWriteBuf fd (castPtr ptr) (fromIntegral $ BS.length bs)
and decoding a File:
safeDecodeFile def path = do
    e <- doesFileExist path
    if e
      then do fd <- openFd path ReadOnly Nothing
                           (defaultFileFlags{nonBlock=True})
              waitToSetLock fd (ReadLock, AbsoluteSeek, 0, 0)
              c <- fdGetContents fd
              let !v = decode $! c
              return v
      else return def
fdGetContents fd = lazyRead
  where
    lazyRead = unsafeInterleaveIO loop
    loop = do blk <- readBlock fd
              case blk of
                Nothing -> return LBS.Empty
                Just c  -> do cs <- lazyRead
                              return (LBS.Chunk c cs)
readBlock fd = do buf <- mallocBytes 4096
                  readSize <- fdReadBuf fd buf 4096
                  if readSize == 0
                    then do free buf
                            closeFd fd
                            return Nothing
                    else do bs <- unsafePackCStringFinalizer buf
                                      (fromIntegral readSize)
                                      (free buf)
                            return $ Just bs
With qualified imports for strict and lazy ByteStrings as:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import qualified Data.ByteString.Lazy.Internal as LBS
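
The race described in the EDIT (a reader opening the file just after a writer truncated it) can also be avoided without locks by never truncating the live file. This is a minimal sketch of that different technique, not a description of the lock-based solution above. It assumes a POSIX filesystem, a temporary file in the same directory as the target, and only one writer at a time: the encoding is written to the temporary file and then renamed over the target, so a reader should see either the old or the new contents in full. It also sidesteps the window in which openFd with trunc = True has already emptied the file but the write lock has not yet been taken. The names atomicEncodeFile and decodeFileEither and the ".tmp" suffix are made up for illustration; encodeFile and decodeFileOrFail come from Data.Binary, renameFile from System.Directory.

```haskell
import Data.Binary (Binary, decodeFileOrFail, encodeFile)
import System.Directory (renameFile)

-- Write the encoded value to a sibling temporary file, then rename it
-- over the target. The ".tmp" suffix is an arbitrary, hypothetical choice
-- and would collide if two writers ran at the same time.
atomicEncodeFile :: Binary a => FilePath -> a -> IO ()
atomicEncodeFile path value = do
  let tmp = path ++ ".tmp"
  encodeFile tmp value
  renameFile tmp path

-- Decode strictly, reporting a decoding failure as Left instead of
-- raising an error the way decode does.
decodeFileEither :: Binary a => FilePath -> IO (Either String a)
decodeFileEither path = do
  r <- decodeFileOrFail path
  return $ case r of
    Left (_offset, msg) -> Left msg
    Right v             -> Right v
```

Because decodeFileEither returns the failure as a value, a reader that still hits "too few bytes" can log it or retry instead of crashing.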

It would be helpful if you could produce some minimum code snippet that runs and demonstrates the problem. Right now I am not convinced this isn't a problem with your program tracking which handles are opened/closed and the reads/writes getting in the way of each other. Here is example test code I made that works fine.
import Data.Trie as T
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.Binary
import System.IO
tmp = "blah"
main = do
  let trie = T.fromList [(B.pack [p], p) | p <- [0..]]
  (file,hdl) <- openTempFile "/tmp" tmp
  B.hPutStr hdl (B.concat $ L.toChunks $ encode trie)
  hClose hdl
  putStrLn file
  t <- B.readFile file
  let trie' = decode (L.fromChunks [t])
  print (trie' == trie)

Related

Haskell: Lazily read binary file with binary

I'm trying to read in a binary file and parse it lazily using the 'binary' package. The package documentation gives an example of how to do this without forcing all the input for a scenario very similar to mine:
example2 :: BL.ByteString -> [Trade]
example2 input
  | BL.null input = []
  | otherwise =
      let (trade, rest, _) = runGetState getTrade input 0
      in trade : example2 rest
However, this uses the deprecated runGetState function, which itself points you towards the runGetIncremental function.
The problem is that the 'runGetIncremental' function seems to force the remaining input to be a strict bytestring, thus forcing it to load the whole file into memory. Indeed, I'm seeing memory usage of around 6GB when I try to run this. Even the implementation of runGetState now seems to be based on runGetIncremental and then reconverts the strict bytestring back to a lazy one using chunk.
Can I get the behaviour as described in the tutorial, or is this now unsupported by binary? If the latter, what's the best way to do this? I have a little experience using conduit, but it's not clear to me how I could use it here.
You can do this using pipes-binary and pipes-bytestring. Here's a helper function for your benefit:
import Control.Monad (void)
import Data.Binary
import Pipes
import Pipes.Binary (decodeMany)
import Pipes.ByteString (fromHandle)
import qualified Pipes.Prelude as P
import System.IO
decodeHandle :: (Binary a) => Handle -> Producer a IO ()
decodeHandle handle = void $ decodeMany (fromHandle handle) >-> P.map snd
The void and map snd are there because decodeMany actually returns more information (like byte offsets and parsing errors). If you actually want that information, then just remove them.
Here's an example of how you might use decodeHandle, using a quick skeleton for Trade I threw together:
data Trade = Trade
instance Binary Trade where
    get = return Trade
    put _ = return ()
instance Show Trade where show _ = "Trade"
main = withFile "inFile.txt" ReadMode $ \handle -> runEffect $
    for (decodeHandle handle) $ \trade -> do
        lift $ print (trade :: Trade)
        -- do more with the parsed trade
You can use for to loop over the decoded trades and handle them, or if you prefer you can use pipe composition:
main = withFile "inFile.txt" ReadMode $ \handle -> runEffect $
    decodeHandle handle >-> P.print
This will be lazy and only decode as many trades as you actually need. So if you insert a take in between the decoder and the printer, it will only read as much input as necessary to process the requested number of trades:
main = withFile "inFile.txt" ReadMode $ \handle -> runEffect $
    for (decodeHandle handle >-> P.take 4) $ \trade -> do
        ... -- This will only process the first 4 trades
-- or using purely pipe composition:
main = withFile "inFile.txt" ReadMode $ \handle -> runEffect $
    decodeHandle handle >-> P.take 4 >-> P.print
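
If you would rather stay within binary itself, runGetIncremental can be driven one chunk at a time. As far as I can tell, the leftover carried by Done is only the unconsumed tail of the chunk most recently fed, not the rest of the file, so feeding it the chunks of a lazy ByteString should keep memory bounded by roughly one chunk plus one record. A rough sketch under that assumption; decodeAll and word16s are hypothetical names, and getWord16be stands in for the question's getTrade:

```haskell
import Data.Binary.Get (Decoder (..), Get, getWord16be, runGetIncremental)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word16)

-- Lazily decode a sequence of records, feeding the decoder one strict
-- chunk at a time. Assumes the getter consumes at least one byte per
-- record; a getter that consumes nothing would loop forever on
-- non-empty input.
decodeAll :: Get a -> BL.ByteString -> [a]
decodeAll getter = start . BL.toChunks
  where
    -- Begin a new record only if some input is left.
    start chunks = case dropWhile BS.null chunks of
      []      -> []
      chunks' -> go (runGetIncremental getter) chunks'

    -- A record is complete: emit it and push the leftover back in front.
    go (Done leftover _ x) chunks = x : start (leftover : chunks)
    -- The decoder wants more input: feed the next chunk, or signal EOF.
    go (Partial k) (c : cs)       = go (k (Just c)) cs
    go (Partial k) []             = go (k Nothing) []
    go (Fail _ _ msg) _           = error ("decodeAll: " ++ msg)

-- Example instantiation with a trivial record type.
word16s :: BL.ByteString -> [Word16]
word16s = decodeAll getWord16be
```

Used as `decodeAll getTrade <$> BL.readFile path`, the result list is produced on demand, much like example2 from the tutorial.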

F# records, usage, code clarity

Background:
I find myself harnessing F# Records a lot. Currently I am working on a project for packet dissection & replay of a proprietary binary protocol (a protocol that is very strangely designed ...).
We define the skeleton record for the packet.
type bytes = byte array
type packetSkeleton = {
    part1 : bytes
    part2 : bytes
    ... }
Now, it is easy to use this to 'dissect' our packet, (really just giving names to the byte fields).
let dissect (raw : bytes) =
    let slice a b = raw.[a..b]
    { part1 = slice 0 4
      part2 = slice 4 5
      ... }
This works perfectly even for longish packets, and we can even use some neat recursive functions if there is a predictable pattern to the slicing.
So I dissect the packet, pull out the fields that I need and create a packet based off the packetSkeleton using the fields I took from the dissection, which by now is starting to look a bit awkward:
let createAuthStub a b c d e f g h ... =
    { part1 = a; part2 = b
      part3 = d; ...
    }
Then, after creating the populated stub, I need to deserialise it to a form that can be put on the wire:
(* packetSkeleton -> byte array *)
let deserialise (packet : packetSkeleton) =
    [| packet.part1; packet.part2; ... |]
let xab = dissect buf
let authStub = createAuthStub xab.part1 1 2 xab.part9 ...
deserialise authStub |> send
So it ends up that I have 3 areas, the record type, the creation of the record for a given packet, and the deserialised byte array. Something tells me that this is a poor design choice on my part in terms of code clarity, and I can already feel it starting to shoot me in the foot even at this early stage.
Questions:
a) Am I using the correct datatype for such a project? Is my approach correct?
b) Should I just give up on trying to make this code feel clean?
As I am kinda coding this by touch and go, I would appreciate some insights!
P.S I realise that this problem is quite suited for C, but F# is more fun (additionally verification of the dissector later on sounds appealing)!
If a packet could be rather large, packetSkeleton might grow unwieldy. Another option is to work with the raw bytes and define a module that reads/writes each part.
module Packet
let Length = 42
let GetPart1 (src: byte[]) = src.[0..3]
let SetPart1 src dst = Array.blit src 0 dst 0 4
let GetPart2 (src: byte[]) = src.[4..5]
let SetPart2 src dst = Array.blit src 0 dst 4 2
...
open Packet
let createAuthStub bytes b c =
    let resp = Array.zeroCreate Packet.Length
    SetPart1 (GetPart1 bytes) resp
    SetPart2 b resp
    SetPart3 c resp
    SetPart4 (GetPart9 bytes) resp
    resp
This removes the need for de/serialization functions (and probably helps performance a bit).
EDIT
Creating a wrapper type is another option
type Packet(bytes: byte[]) =
    new() = Packet(Array.zeroCreate Packet.Length)
    static member Length = 42
    member x.Part1
        with get() = bytes.[0..3]
        and set value = Array.blit value 0 bytes 0 4
    ...
which might reduce code a bit:
let createAuthStub (req: Packet) b c =
    let resp = Packet()
    resp.Part1 <- req.Part1
    resp.Part2 <- b
    resp.Part3 <- c
    resp.Part4 <- req.Part9
    resp
I think your approach is essentially sound - but of course, it is difficult to tell without knowing more details.
I think one key idea that shows in your code, and that is central to functional architecture, is the separation between the types (used to model the problem domain) and the functions that create values of the domain model, process them and format them.
In your case:
The types bytes and packetSkeleton model the problem domain
The function createAuthStub processes your domain (and I agree with Daniel that it might be more readable if it took the whole packetSkeleton as an argument)
The function deserialise turns your domain back to bytes
I think this way of structuring code is quite good, because it separates different concerns of the program. I even wrote an article that tries to describe this as a more general programming approach.

How to parse a 7GB file, with Data.ByteString?

I have to parse a file, which means I have to read it first. Here is my program:
import qualified Data.ByteString.Char8 as B
import System.Environment
main = do
    args <- getArgs
    let path = args !! 0
    content <- B.readFile path
    let lines = B.lines content
    foobar lines
foobar :: [B.ByteString] -> IO()
foobar _ = return ()
but, after the compilation
> ghc --make -O2 tmp.hs
running it fails with the following error when called with a 7 gigabyte file:
> ./tmp big_big_file.dat
> tmp: {handle: big_big_file.dat}: hGet: illegal ByteString size (-1501792951): illegal operation
thanks for any reply!
The length of a ByteString is an Int. If Int is 32 bits, the size of a 7GB file exceeds the range of Int, so the buffer request will be for the wrong size and can easily come out negative.
The code for readFile converts the file size to Int for the buffer request
readFile :: FilePath -> IO ByteString
readFile f = bracket (openBinaryFile f ReadMode) hClose
    (\h -> hFileSize h >>= hGet h . fromIntegral)
and if that overflows, an "illegal ByteString size" error or a segmentation fault are the most likely outcomes.
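
To make the overflow concrete: converting a size to a 32-bit Int keeps only the low 32 bits. A small worked example, assuming a file of 7088141641 bytes (a made-up size of about 7.09 GB, picked because it wraps to exactly the number in the error message above) and using Int32 to stand in for a 32-bit Int:

```haskell
import Data.Int (Int32)

-- Hypothetical file size in bytes; an assumption, not taken from the question.
fileSize :: Integer
fileSize = 7088141641

-- What fromIntegral yields when the target type is only 32 bits wide:
-- 7088141641 - 2 * 2^32 = -1501792951
wrapped :: Int32
wrapped = fromIntegral fileSize
```

The negative value is what hGet would then be asked to allocate, which matches the "illegal ByteString size (-1501792951)" in the report.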
If at all possible, use lazy ByteStrings to handle files that big. In your case, you pretty much have to make it possible, since with 32 bit Ints, a 7GB ByteString is impossible to create.
If you need the lines to be strict ByteStrings for the processing, and no line is exceedingly long, you can go through lazy ByteStrings to achieve that
import qualified Data.ByteString.Lazy.Char8 as LC
import qualified Data.ByteString.Char8 as C
main = do
    ...
    content <- LC.readFile path
    let llns = LC.lines content
        slns = map (C.concat . LC.toChunks) llns
    foobar slns
but if you can modify your processing to deal with lazy ByteStrings, that will probably be better overall.
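
As an illustration of processing that works directly on lazy ByteStrings, here is a minimal sketch that finds the longest line in one pass. longestLine and longestLineInFile are hypothetical names, and the task itself is only a stand-in for whatever foobar does; the point is that a strict left fold lets already-consumed chunks be garbage collected.

```haskell
import Data.List (foldl')
import qualified Data.ByteString.Lazy.Char8 as LC

-- Length of the longest line, computed in a single pass over the input.
longestLine :: LC.ByteString -> Int
longestLine = foldl' max 0 . map (fromIntegral . LC.length) . LC.lines

-- Hypothetical driver: the file is read lazily, chunk by chunk.
longestLineInFile :: FilePath -> IO Int
longestLineInFile path = longestLine <$> LC.readFile path
```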
Where Int is 32 bits, strict ByteStrings only support up to 2 GiB of data. You need to use lazy ByteStrings for this to work.

How to pass a string to a function, which can create a binary later?

This is my function. When I call my_conv("2312144", 10, 10), it gives me a "bad argument" error.
my_conv(S, Start, End) ->
    Res = <<Start:8, End:8, S:1024>>.
A string cannot be used inside a binary expression without conversion. You need to convert the string to a binary by using list_to_binary(S).
I would recommend the following expression:
my_conv(S, Start, End) ->
    list_to_binary([<<Start:8, End:8>>, S]).
(Note here that list_to_binary/1 actually accepts a deep IO list and not just a pure string).
If you intend to pad your binary to 1024 bytes (or 1040 including your newlines) you can do so afterwards:
my_conv(S, Start, End) ->
    pad(1040, list_to_binary([<<Start:8, End:8>>, S])).
pad(Width, Binary) ->
    case Width - byte_size(Binary) of
        N when N =< 0 -> Binary;
        N -> <<Binary/binary, 0:(N*8)>>
    end.

How to get a pointer value in Haskell?

I wish to manipulate data on a very low level.
Therefore I have a function that receives a virtual memory address as an integer and "does stuff" with this memory address. I interfaced this function from C, so it has the type (CUInt -> a).
The memory I want to link is a Word8 in a file. Sadly, I have no idea how to access the pointer value to that Word8.
To be clear, I do not need the value of the Word8; I need the value of the virtual memory address, which is the value of the pointer to it.
For the sake of a simple example, say you want to add an offset to the pointer.
Front matter:
module Main where
import Control.Monad (forM_)
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peek)
import System.IO.MMap (Mode(ReadOnly), mmapFileForeignPtr)
Yes, you wrote that you don't want the value of the Word8, but I've retrieved it with peek to demonstrate that the pointer is valid. You might be tempted to return the Ptr from inside withForeignPtr, but the documentation warns against that:
Note that it is not safe to return the pointer from the action and use it after the action completes. All uses of the pointer should be inside the withForeignPtr bracket. The reason for this unsafeness is the same as for unsafeForeignPtrToPtr below: the finalizer may run earlier than expected, because the compiler can only track usage of the ForeignPtr object, not a Ptr object made from it.
The code is straightforward:
doStuff :: ForeignPtr Word8 -> Int -> IO ()
doStuff fp i =
    withForeignPtr fp $ \p -> do
        let addr = p `plusPtr` i
        val <- peek addr :: IO Word8
        print (addr, val, chr $ fromIntegral val)
To approximate “a Word8 in a File” from your question, the main program memory-maps a file and uses that buffer to do stuff with memory addresses.
main :: IO ()
main = do
    (p,offset,size) <- mmapFileForeignPtr path mode range
    forM_ [0 .. size-1] $ \i -> do
        doStuff p (offset + i)
  where
    path = "/tmp/input.dat"
    mode = ReadOnly
    range = Nothing
    -- range = Just (4,3)
Output:
(0x00007f1b40edd000,71,'G')
(0x00007f1b40edd001,117,'u')
(0x00007f1b40edd002,116,'t')
(0x00007f1b40edd003,101,'e')
(0x00007f1b40edd004,110,'n')
(0x00007f1b40edd005,32,' ')
(0x00007f1b40edd006,77,'M')
(0x00007f1b40edd007,111,'o')
(0x00007f1b40edd008,114,'r')
(0x00007f1b40edd009,103,'g')
(0x00007f1b40edd00a,101,'e')
(0x00007f1b40edd00b,110,'n')
(0x00007f1b40edd00c,33,'!')
(0x00007f1b40edd00d,10,'\n')
You are probably looking for ptrToIntPtr and probably fromIntegral to make it a CUInt.
Note that a CUInt cannot represent a pointer on all platforms, though.
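
A minimal sketch of that conversion, with the range check made explicit so that an address which does not fit in a CUInt is reported rather than silently truncated. ptrToCUInt and examplePtr are hypothetical names; ptrToIntPtr, nullPtr and plusPtr are from Foreign.Ptr.

```haskell
import Foreign.C.Types (CUInt)
import Foreign.Ptr (Ptr, nullPtr, plusPtr, ptrToIntPtr)

-- Convert a pointer to a CUInt, or Nothing if the address is out of
-- range for CUInt on this platform (likely for most addresses in a
-- 64-bit process, where CUInt is typically 32 bits).
ptrToCUInt :: Ptr a -> Maybe CUInt
ptrToCUInt p
  | addr >= 0 && addr <= limit = Just (fromInteger addr)
  | otherwise                  = Nothing
  where
    addr  = toInteger (ptrToIntPtr p)
    limit = toInteger (maxBound :: CUInt)

-- An artificial pointer value, used only to exercise the conversion.
examplePtr :: Ptr ()
examplePtr = nullPtr `plusPtr` 42
```

As with the withForeignPtr example above, the resulting number is only meaningful while the underlying memory is kept alive.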
