How to parse a 7GB file, with Data.ByteString? - parsing

I have to parse a file, and of course I have to read it first. Here is my program:
import qualified Data.ByteString.Char8 as B
import System.Environment
main = do
  args <- getArgs
  let path = args !! 0
  content <- B.readFile path
  let lines = B.lines content
  foobar lines
foobar :: [B.ByteString] -> IO()
foobar _ = return ()
But after compiling with
> ghc --make -O2 tmp.hs
running it on a 7 GB file fails with the following error:
> ./tmp big_big_file.dat
> tmp: {handle: big_big_file.dat}: hGet: illegal ByteString size (-1501792951): illegal operation
Thanks for any reply!

The length of a ByteString is an Int. If Int is 32 bits, the size of a 7GB file exceeds the range of Int, so the buffer request is made with a wrong size, which can easily be negative.
The code for readFile converts the file size to Int for the buffer request:
readFile :: FilePath -> IO ByteString
readFile f = bracket (openBinaryFile f ReadMode) hClose
    (\h -> hFileSize h >>= hGet h . fromIntegral)
and if that conversion overflows, an "illegal ByteString size" error or a segmentation fault is the most likely outcome.
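The wrap-around can be reproduced in isolation. The sketch below narrows a file size to Int32 the same way fromIntegral would with a 32-bit Int. The size 7088141641 is an assumption: it is simply one size of roughly 7 GB that is consistent with the number in the error message, not a value taken from the question.

```haskell
import Data.Int (Int32)

-- Hypothetical file size, chosen to be consistent with the error message.
fileSize :: Integer
fileSize = 7088141641

-- fromIntegral wraps modulo 2^32 when narrowing to a 32-bit signed integer.
wrapped :: Int32
wrapped = fromIntegral fileSize

main :: IO ()
main = do
  print wrapped                                -- -1501792951
  -- The same result by hand: reduce modulo 2^32, then shift into the signed range.
  print ((fileSize `mod` (2 ^ 32)) - (2 ^ 32)) -- -1501792951
```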
If at all possible, use lazy ByteStrings to handle files that big. In your case, you pretty much have to make it possible, since with 32 bit Ints, a 7GB ByteString is impossible to create.
If you need the lines to be strict ByteStrings for the processing, and no line is exceedingly long, you can go through lazy ByteStrings to achieve that:
import qualified Data.ByteString.Lazy.Char8 as LC
import qualified Data.ByteString.Char8 as C
main = do
    ...
    content <- LC.readFile path
    let llns = LC.lines content
        slns = map (C.concat . LC.toChunks) llns
    foobar slns
but if you can modify your processing to deal with lazy ByteStrings, that will probably be better overall.
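For illustration, here is a minimal sketch of processing that stays on lazy ByteStrings throughout. Summing the line lengths is only a stand-in for the real per-line work, which the question does not show. The strict left fold is there so that chunks already consumed do not have to be retained.

```haskell
import Data.List (foldl')
import qualified Data.ByteString.Lazy.Char8 as LC
import System.Environment (getArgs)

-- Stand-in for the real per-line work: add up the lengths of all lines.
-- The file is read chunk by chunk; foldl' forces the running total at each step.
main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      content <- LC.readFile path
      print (foldl' (\acc l -> acc + LC.length l) 0 (LC.lines content))
    _ -> putStrLn "usage: prog FILE"
```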

Strict ByteStrings only support up to 2 GiB of memory when Int is 32 bits. You need to use lazy ByteStrings for it to work.


How much memory used in Erlang VM?

I want to know how much memory this data structure will use in the Erlang VM:
[{"3GPP-UTRAN-FDD", [{"utran-cell-id-3gpp","CID1000"}, "1996-12-19t16%3a39%3a57-08%3a00", "1996-12-19t15%3a39%3a27%2e20-08%3a00"]}]
In my application, every process will store this data in its own loop data, and there will be 120000 such processes.
My test results:
Without storing this data, the memory is:
memory[kB]: proc 1922806, atom 2138, bin 24890, code 72757, ets 459321
When storing this data, the memory is:
memory[kB]: proc 1684032, atom 2138, bin 24102, code 72757, ets 459080
So the big difference is the memory used by proc: (1922806 - 1684032) / 1024 = 233M.
After some research, I found an interesting thing:
L = [{"3GPP-UTRAN-FDD", [{"utran-cell-id-3gpp","CID1000"}, "1996-12-19t16%3a39%3a57-08%3a00", "1996-12-19t15%3a39%3a27%2e20-08%3a00"]}].
B = list_to_binary(io_lib:format("~p", L)).
erts_debug:size(B). % The output is 6
Does the data really use just 6 words once it is a binary? How can this be explained?
There are two useful functions for measuring the size of an Erlang term: erts_debug:size/1 and erts_debug:flat_size/1. Both of these functions return the size of the term in words.
erts_debug:flat_size/1 gives you the total size of a message without term-sharing optimization. This is guaranteed to be the size of the term if it is copied to a new heap, as with message passing and ets tables.
erts_debug:size/1 gives you the size of the term as it is in the current process' heap, allowing for memory usage optimization by sharing repeated terms.
Here is a demonstration of the differences:
1> MyTerm = {atom, <<"binary">>, 1}.
{atom,<<"binary">>,1}
2> MyList = [ MyTerm || _ <- lists:seq(1, 100) ].
[{atom,<<"binary">>,1}|...]
3> erts_debug:size(MyList).
210
4> erts_debug:flat_size(MyList).
1200
As you can see, there is a significant difference in the sizes due to term sharing.
As for your specific term, I used the Erlang shell (R16B03) and measured the term with flat_size. According to this, the memory usage of your term is: 226 words (1808B, 1.77KB).
This is a lot of memory to use for what appears to be a simple term, but that is outside of the scope of this question.
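As a rough cross-check, the measured term size can be scaled up to the process count given in the question. The sketch assumes 8-byte words (a 64-bit VM) and one private copy of the term per process. It is written in Haskell only so that the arithmetic can be run.

```haskell
-- Figures from the text: 226 words per term, 120000 processes.
-- Assumption: 8-byte words, i.e. a 64-bit VM.
wordsPerTerm, bytesPerWord, processes :: Integer
wordsPerTerm = 226
bytesPerWord = 8
processes    = 120000

main :: IO ()
main = do
  let perTerm = wordsPerTerm * bytesPerWord
      total   = perTerm * processes
  print perTerm                       -- 1808
  print total                         -- 216960000
  print (total `div` (1024 * 1024))   -- 206 (MiB, rounded down)
```

That is the same order of magnitude as the 233M difference reported in the question.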
The whole binary is 135 bytes when you build it with list_to_binary(io_lib:format("~p", L)). On a 64-bit system that is about 17 words, so the 6 words reported cannot include the binary data itself; note also that you have lost direct access to the internal structure.
Strange, but it can be understood:
19> erts_debug:flat_size(list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,1000)])).
6
20> erts_debug:flat_size(list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,10000)])).
6
21> size(list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,10000)])).
10000
22> (list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,10000)])).
<<"myeyrltgyfnytajecrgtonkdcxlnaoqcsswdnepnmdxfrwnnlbzdaxknqarfyiwewlugrtgjgklblpdkvgpecglxmfrertdfanzukfolpphqvkkwrpmb"...>>
23> erts_debug:display({list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,10000)])}).
{<<10000 bytes>>}
"{<<10000 bytes>>}\n"
24>
This means that erts_debug:flat_size returns the size of the variable (which is roughly type information, a pointer to the data, and its size), but not the size of the binary data itself. The binary data is stored elsewhere and can be shared by different variables.

F# records, usage, code clarity

Background:
I find myself harnessing F# Records a lot. Currently I am working on a project for packet dissection & replay of a proprietary binary protocol (a protocol that is very strangely designed ...).
We define the skeleton record for the packet.
type bytes = byte array
type packetSkeleton = {
    part1 : bytes
    part2 : bytes
    ... }
Now, it is easy to use this to 'dissect' our packet, (really just giving names to the byte fields).
let dissect (raw : bytes) =
    let slice a b = raw.[a..b]
    { part1 = slice 0 4
      part2 = slice 4 5
      ... }
This works perfectly even for longish packets; we can even use some neat recursive functions if there is a predictable pattern to the slicing.
So I dissect the packet, pull out the fields that I need and create a packet based off the packetSkeleton using the fields I took from the dissection, which by now is starting to look a bit awkward:
let createAuthStub a b c d e f g h ... =
    { part1 = a; part2 = b
      part3 = d; ...
    }
Then, after creating the populated stub, I need to deserialise it to a form that can be put on the wire:
(* packetSkeleton -> byte array *)
let deserialise (packet : packetSkeleton) =
    Array.concat [| packet.part1; packet.part2; ... |]
let xab = dissect buf
let authStub = createAuthStub xab.part1 1 2 xab.part9 ...
deserialise authStub |> send
So it ends up that I have 3 areas, the record type, the creation of the record for a given packet, and the deserialised byte array. Something tells me that this is a poor design choice on my part in terms of code clarity, and I can already feel it starting to shoot me in the foot even at this early stage.
Questions:
a) Am I using the correct datatype for such a project? Is my approach correct?
b) Should I just give up on trying to make this code feel clean?
As I am kinda coding this by touch and go, I would appreciate some insights!
P.S I realise that this problem is quite suited for C, but F# is more fun (additionally verification of the dissector later on sounds appealing)!
If a packet could be rather large, packetSkeleton might grow unwieldy. Another option is to work with the raw bytes and define a module that reads/writes each part.
module Packet
let Length = 42
// The type annotations let the compiler resolve the slicing syntax.
let GetPart1 (src: byte[]) = src.[0..3]
let SetPart1 src dst = Array.blit src 0 dst 0 4
let GetPart2 (src: byte[]) = src.[4..5]
let SetPart2 src dst = Array.blit src 0 dst 4 2
...
open Packet
let createAuthStub bytes b c =
    let resp = Array.zeroCreate Packet.Length
    SetPart1 (GetPart1 bytes) resp
    SetPart2 b resp
    SetPart3 c resp
    SetPart4 (GetPart9 bytes) resp
    resp
This removes the need for de/serialization functions (and probably helps performance a bit).
EDIT
Creating a wrapper type is another option
type Packet(bytes: byte[]) =
    new() = Packet(Array.zeroCreate Packet.Length)
    static member Length = 42
    member x.Part1
        with get() = bytes.[0..3]
        and set value = Array.blit value 0 bytes 0 4
    ...
which might reduce code a bit:
let createAuthStub (req: Packet) b c =
    let resp = Packet()
    resp.Part1 <- req.Part1
    resp.Part2 <- b
    resp.Part3 <- c
    resp.Part4 <- req.Part9
    resp
I think your approach is essentially sound - but of course, it is difficult to tell without knowing more details.
I think one key idea that shows in your code, and that is key to functional architecture, is the separation between the types (used to model the problem domain) and the processing functionality that creates values of the domain model, processes them, and formats them.
In your case:
The types bytes and packetSkeleton model the problem domain
The function createAuthStub processes your domain (and I agree with Daniel that it might be more readable if it took the whole packetSkeleton as an argument)
The function deserialize turns your domain back to bytes
I think this way of structuring code is quite good, because it separates different concerns of the program. I even wrote an article that tries to describe this as a more general programming approach.

Problems decoding a file strictly with Binary in Haskell

I'm trying to read and decode a binary file strictly, which seems to work most of the time. But unfortunately, in a few cases my program fails with
"too few bytes. Failed reading at byte position 1"
I guess Binary's decode function thinks there is no data available,
but I know there is, and just rerunning the program works fine.
I've tried several solutions, but none of them was able to solve my problem :(
using withBinaryFile:
decodeFile' path = withBinaryFile path ReadMode doDecode
  where
    doDecode h = do c <- LBS.hGetContents h
                    return $! decode c
reading the whole file with strict ByteString and decoding from it:
decodeFile' path = decode . LBS.fromChunks . return <$> BS.readFile path
adding some more strictness
decodeFile' path = fmap (decode . LBS.fromChunks . return) $! BS.readFile path
Any ideas what is going on here and how to solve the issue?
Thanks!
EDIT: I think I've figured out my problem. It is not about strictly reading the file. I have a number of processes mostly reading from the file, but from time to time one needs to write to it which will truncate the file first and add the new content then. So for writing I need to set a file lock first, which seems not to be done when "Binary.encodeFile" is used (when I say process I don't mean threads, but real instances of the same program being run).
EDIT Finally had some time to solve my problem using POSIX IO and file locks. I've had no more problems since.
Just in case someone is interested in my current solution or maybe someone is able to point out errors/problems I'll post my solution here.
Safe encoding to File:
safeEncodeFile path value = do
    fd <- openFd path WriteOnly (Just 0o600) (defaultFileFlags {trunc = True})
    waitToSetLock fd (WriteLock, AbsoluteSeek, 0, 0)
    let cs = encode value
    let outFn = LBS.foldrChunks (\c rest -> writeChunk fd c >> rest) (return ()) cs
    outFn
    closeFd fd
  where
    writeChunk fd bs = unsafeUseAsCString bs $ \ptr ->
        fdWriteBuf fd (castPtr ptr) (fromIntegral $ BS.length bs)
and decoding a File:
safeDecodeFile def path = do
    e <- doesFileExist path
    if e
      then do fd <- openFd path ReadOnly Nothing
                        (defaultFileFlags{nonBlock=True})
              waitToSetLock fd (ReadLock, AbsoluteSeek, 0, 0)
              c <- fdGetContents fd
              let !v = decode $! c
              return v
      else return def
fdGetContents fd = lazyRead
  where
    lazyRead = unsafeInterleaveIO loop
    loop = do blk <- readBlock fd
              case blk of
                Nothing -> return LBS.Empty
                Just c -> do cs <- lazyRead
                             return (LBS.Chunk c cs)
    readBlock fd = do buf <- mallocBytes 4096
                      readSize <- fdReadBuf fd buf 4096
                      if readSize == 0
                        then do free buf
                                closeFd fd
                                return Nothing
                        else do bs <- unsafePackCStringFinalizer buf
                                        (fromIntegral readSize)
                                        (free buf)
                                return $ Just bs
With qualified imports for strict and lazy Bytestrings as:
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS
import qualified Data.ByteString.Lazy.Internal as LBS
It would be helpful if you could produce some minimum code snippet that runs and demonstrates the problem. Right now I am not convinced this isn't a problem with your program tracking which handles are opened/closed and the reads/writes getting in the way of each other. Here is example test code I made that works fine.
import Data.Trie as T
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import Data.Binary
import System.IO
tmp = "blah"
main = do
    let trie = T.fromList [(B.pack [p], p) | p <- [0..]]
    (file,hdl) <- openTempFile "/tmp" tmp
    B.hPutStr hdl (B.concat $ L.toChunks $ encode trie)
    hClose hdl
    putStrLn file
    t <- B.readFile file
    let trie' = decode (L.fromChunks [t])
    print (trie' == trie)
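Independently of the locking issue, a reader that may see a truncated or empty file can be made to report the failure instead of throwing. A minimal sketch, assuming a version of binary that provides decodeOrFail; the helper name decodeFileEither and the path /tmp/decode-demo.bin are made up for this example.

```haskell
import Data.Binary (Binary, decodeOrFail, encode)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS

-- Read the whole file strictly, then decode without throwing.
-- decodeFileEither is a hypothetical helper name, not part of binary.
decodeFileEither :: Binary a => FilePath -> IO (Either String a)
decodeFileEither path = do
  strict <- BS.readFile path
  return $ case decodeOrFail (LBS.fromStrict strict) of
    Left (_, offset, msg) -> Left (msg ++ " (at byte " ++ show offset ++ ")")
    Right (_, _, value)   -> Right value

main :: IO ()
main = do
  let path = "/tmp/decode-demo.bin"   -- hypothetical scratch file
  LBS.writeFile path (encode (42 :: Int, "hello"))
  good <- decodeFileEither path :: IO (Either String (Int, String))
  print good
  -- Simulate a reader that sees the file just after a writer truncated it.
  BS.writeFile path BS.empty
  bad <- decodeFileEither path :: IO (Either String (Int, String))
  print bad
```

The second read comes back as a Left value, which the caller can use to retry or fall back to a default rather than crash.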

F# ref-mutable vars vs object fields

I'm writing a parser in F#, and it needs to be as fast as possible (I'm hoping to parse a 100 MB file in less than a minute). As normal, it uses mutable variables to store the next available character and the next available token (i.e. both the lexer and the parser proper use one unit of lookahead).
My current partial implementation uses local variables for these. Since closure variables can't be mutable (anyone know the reason for this?) I've declared them as ref:
let rec read file includepath =
    let c = ref ' '
    let k = ref NONE
    let sb = new StringBuilder()
    use stream = File.OpenText file
    let readc() =
        c := stream.Read() |> char
    // etc
I assume this has some overhead (not much, I know, but I'm trying for maximum speed here), and it's a little inelegant. The most obvious alternative would be to create a parser class object and have the mutable variables be fields in it. Does anyone know which is likely to be faster? Is there any consensus on which is considered better/more idiomatic style? Is there another option I'm missing?
You mentioned that local mutable values cannot be captured by a closure, so you need to use ref instead. The reason for this is that mutable values captured in the closure need to be allocated on the heap (because closure is allocated on the heap).
F# forces you to write this explicitly (using ref). In C# you can "capture a mutable variable", but the compiler translates it to a field in a heap-allocated object behind the scenes, so it will be on the heap anyway.
In summary: if you want to use closures, mutable variables need to be allocated on the heap.
Now, regarding your code - your implementation uses ref, which creates a small object for every mutable variable that you're using. An alternative would be to create a single object with multiple mutable fields. Using records, you could write:
type ReadClosure = {
    mutable c : char
    mutable k : SomeType } // whatever type you use here
let rec read file includepath =
    let state = { c = ' '; k = NONE }
    // ...
    let readc() =
        state.c <- stream.Read() |> char
    // etc...
This may be a bit more efficient, because you're allocating a single object instead of a few objects, but I don't expect the difference will be noticeable.
There is also one confusing thing about your code - the stream value will be disposed after the function read returns, so the call to stream.Read may be invalid (if you call readc after read completes).
let rec read file includepath =
    let c = ref ' '
    use stream = File.OpenText file
    let readc() =
        c := stream.Read() |> char
    readc
let f = read a1 a2
f() // This would fail!
I'm not quite sure how you're actually using readc, but this may be a problem to think about. Also, if you're declaring it only as a helper closure, you could probably rewrite the code without closure (or write it explicitly using tail-recursion, which is translated to imperative loop with mutable variables) to avoid any allocations.
I did the following profiling:
let test() =
    tic()
    let mutable a = 0.0
    for i=1 to 10 do
        for j=1 to 10000000 do
            a <- a + float j
    toc("mutable")
let test2() =
    tic()
    let a = ref 0.0
    for i=1 to 10 do
        for j=1 to 10000000 do
            a := !a + float j
    toc("ref")
The average for mutable is 50 ms, while ref takes 600 ms. The difference arises because mutable variables live on the stack, while ref cells live on the managed heap.
The relative difference is big. However, 10^8 accesses is a large number, and the total time is acceptable. So don't worry too much about the performance of ref variables. And remember:
Premature optimization is the root of all evil.
My advice is to finish your parser first, then consider optimizing it. You won't know where the bottleneck is until you actually run the program. One good thing about F# is that its terse syntax and functional style support refactoring well, so once the code is done, optimizing it is convenient. Here's a profiling example.
Just another example: we use .NET arrays every day, and they also live on the managed heap:
let test3() =
    tic()
    let a = Array.create 1 0.0
    for i=1 to 10 do
        for j=1 to 10000000 do
            a.[0] <- a.[0] + float j
    toc("array")
test3() runs about as fast as the ref version. If you worried that much about variables on the managed heap, you wouldn't use arrays anymore either.

How to get a pointer value in Haskell?

I wish to manipulate data on a very low level.
Therefore I have a function that receives a virtual memory address as an integer and "does stuff" with this memory address. I interfaced this function from C, so it has the type (CUInt -> a).
The memory I want to link is a Word8 in a file. Sadly, I have no idea how to access the pointer value to that Word8.
To be clear, I do not need the value of the Word8; I need the virtual memory address, which is the value of the pointer to it.
For the sake of a simple example, say you want to add an offset to the pointer.
Front matter:
module Main where
import Control.Monad (forM_)
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (peek)
import System.IO.MMap (Mode(ReadOnly), mmapFileForeignPtr)
Yes, you wrote that you don't want the value of the Word8, but I've retrieved it with peek to demonstrate that the pointer is valid. You might be tempted to return the Ptr from inside withForeignPtr, but the documentation warns against that:
Note that it is not safe to return the pointer from the action and use it after the action completes. All uses of the pointer should be inside the withForeignPtr bracket. The reason for this unsafeness is the same as for unsafeForeignPtrToPtr below: the finalizer may run earlier than expected, because the compiler can only track usage of the ForeignPtr object, not a Ptr object made from it.
The code is straightforward:
doStuff :: ForeignPtr Word8 -> Int -> IO ()
doStuff fp i =
  withForeignPtr fp $ \p -> do
    let addr = p `plusPtr` i
    val <- peek addr :: IO Word8
    print (addr, val, chr $ fromIntegral val)
To approximate “a Word8 in a File” from your question, the main program memory-maps a file and uses that buffer to do stuff with memory addresses.
main :: IO ()
main = do
  (p,offset,size) <- mmapFileForeignPtr path mode range
  forM_ [0 .. size-1] $ \i -> do
    doStuff p (offset + i)
  where
    path = "/tmp/input.dat"
    mode = ReadOnly
    range = Nothing
    -- range = Just (4,3)
Output:
(0x00007f1b40edd000,71,'G')
(0x00007f1b40edd001,117,'u')
(0x00007f1b40edd002,116,'t')
(0x00007f1b40edd003,101,'e')
(0x00007f1b40edd004,110,'n')
(0x00007f1b40edd005,32,' ')
(0x00007f1b40edd006,77,'M')
(0x00007f1b40edd007,111,'o')
(0x00007f1b40edd008,114,'r')
(0x00007f1b40edd009,103,'g')
(0x00007f1b40edd00a,101,'e')
(0x00007f1b40edd00b,110,'n')
(0x00007f1b40edd00c,33,'!')
(0x00007f1b40edd00d,10,'\n')
You are probably looking for ptrToIntPtr and probably fromIntegral to make it a CUInt.
Note that a CUInt cannot represent a pointer on all platforms, though.
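A minimal sketch of that suggestion follows. It uses a temporary Word8 from alloca in place of the file-backed byte, purely to keep the example self-contained. The helper name addressAsCUInt is made up here. Where pointers are 64 bits wide and CUInt is 32 bits, the conversion keeps only the low bits of the address, which is exactly the caveat above.

```haskell
import Data.Word (Word8)
import Foreign.C.Types (CUInt)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (IntPtr, Ptr, ptrToIntPtr)
import Foreign.Storable (poke)

-- Convert a pointer to the integer type the C function expects.
-- Assumption: truncation to CUInt is acceptable on the target platform.
-- addressAsCUInt is a hypothetical helper name.
addressAsCUInt :: Ptr a -> CUInt
addressAsCUInt = fromIntegral . ptrToIntPtr

main :: IO ()
main = alloca $ \p -> do
  poke p (0 :: Word8)               -- p :: Ptr Word8, valid only inside alloca
  let full = ptrToIntPtr p :: IntPtr
  print full                        -- the full address as an integer
  print (addressAsCUInt p)          -- the possibly truncated value
```

As with withForeignPtr, the address is only meaningful while the memory it refers to is still alive, so the C call should happen inside the bracket.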
