Attoparsec allocates a ton of memory on large 'take' call

So I am writing a packet sniffing app. Basically, I want it to sniff for TCP sessions, parse them to see if they are HTTP, and, if they are and have the right content type etc., save them as a file on my hard drive.
To that end, I wanted it to be efficient. Since the current HTTP library is String-based, I will be dealing with large files, and I only really need to parse HTTP responses, I decided to roll my own in attoparsec.
When I finished my program, I found that parsing a 9 MB HTTP response with a wav file in it allocated a gigabyte of memory while parsing out the body of the response. In HTTP.prof I see these lines (individual %time/%alloc, then inherited):
COST CENTRE   MODULE                     no.  entries  %time %alloc  %time %alloc
httpBody      Main                       362     1       0.0   0.0    93.8  99.3
take          Data.Attoparsec.Internal   366  1201       0.0   0.0    93.8  99.3
takeWith      Data.Attoparsec.Internal   367  3603       0.0   0.0    93.8  99.3
demandInput   Data.Attoparsec.Internal   375   293       0.0   0.0    93.8  99.2
prompt        Data.Attoparsec.Internal   378   293       0.0   0.0    93.8  99.2
+++           Data.Attoparsec.Internal   380   586      93.8  99.2    93.8  99.2
So as you can see, somewhere within httpBody, take is called 1201 times, causing 586 (+++) concatenations of bytestrings, which cause an absurd amount of memory allocation.
Here's the code. n is just the content length of the HTTP response, if there is one; if there isn't one, it just tries to take everything.
I wanted it to return a lazy bytestring made of chunks of 1000 or so characters, but even if I change it to just take n and return a strict bytestring, it still has those allocations in it (and it uses 14 gigs of memory).
httpBody n = do
  x <- if n > 0
         then AC.take n
         else AC.takeWhile (\_ -> True)
  if B.length x == 0
    then return Nothing
    else return (Just x)
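The chunked variant I tried looked roughly like this (a sketch from memory, using the same AC/B aliases as above; it allocated just as badly):

-- sketch: pull the body out of the parser ~1000 bytes at a time
httpBodyChunks :: Int -> Parser [B.ByteString]
httpBodyChunks n
  | n <= 0    = return []
  | otherwise = do
      chunk <- AC.take (min 1000 n)
      rest  <- httpBodyChunks (n - B.length chunk)
      return (chunk : rest)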
I was reading a blog by the guy who did combinatorrent and he was having the same issue, but I never heard of a resolution. Has anyone ever run across this problem before or found a solution?
Edit: Okay, well, I left this up the entire day and got nothing. After researching the problem, I don't think there is a way to do it without adding a lazy bytestring accessor to attoparsec. I also looked at all the other libraries, and they either lacked bytestring support or had other problems.
So I found a workaround. If you think about an HTTP response, it goes: headers, newline, newline, body. Since the body is last, and parsing returns a tuple with both what you parsed and what remains of the bytestring, I can skip parsing the body inside attoparsec and instead pluck the body straight off the leftover bytestring.
parseHTTPs bs = if P.length results == 0
                  then Nothing
                  else Just results
  where results = foldParse (bs, [])

foldParse (bs, rs) = case ACL.parse httpResponse bs of
  ACL.Done rest r -> addBody (rest, rs) r
  _               -> rs

addBody (rest, rs) http = foldParse (rest', rs')
  where
    contentlength = (read . BU.toString) (maybe "0" id (hdrContentLength (rspHeaders http)))
    rest' = BL.drop contentlength rest
    rs'   = rs ++ [http { rspBody = body' }]
    body'
      | contentlength == 0  = Just rest
      | BL.length rest == 0 = Nothing
      | otherwise           = Just (BL.take contentlength rest)

httpResponse = do
  (code, desc) <- statusLine
  hdrs <- many header
  endOfLine
  -- body <- httpBody ((read . BU.toString) (maybe "0" id (hdrContentLength parsedHeaders)))
  return Response { rspCode = code, rspReason = desc, rspHeaders = parseHeaders hdrs, rspBody = undefined }
It is a little messy, but ultimately it works fast and allocates nothing more than I wanted. So basically you fold over the bytestring collecting HTTP data structures; between parses, I check the content length of the structure I just got, pull the appropriate amount off the remaining bytestring, and then continue if there is any bytestring left.
Edit: I actually finished up this project. Works like a charm. It isn't cabalized properly, but if someone wants to view the entire source, you can find it at https://github.com/onmach/Audio-Sniffer.

combinatorrent guy here :)
If memory serves, the problem with attoparsec is that it demands input a little bit at a time, building up a lazy bytestring which is finally concatenated. My "solution" was to roll the input function myself. That is, I get the input stream for attoparsec from a network socket, and I know how many bytes to expect in a message. Basically, I split into two cases:
The message is small: read up to 4k from the socket and eat that bytestring a little bit at a time (slices of bytestrings are fast, and we throw away the 4k chunk after it has been exhausted).
The message is "large" (large here means around 16 kilobytes in bittorrent speak): we calculate how much the 4k chunk we have can fulfill, and then we simply ask the underlying network socket to fill in the rest. We now have two bytestrings, the remaining part of the 4k chunk and the large chunk. Together they hold all the data, so we concatenate them and parse.
You may be able to optimize the concatenation step away.
The TL;DR version: I handle it outside attoparsec and handroll the loop to avoid the problem.
The relevant combinatorrent commit is fc131fe24, see
https://github.com/jlouis/combinatorrent/commit/fc131fe24207909dd980c674aae6aaba27b966d4
for the details.
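In outline, the hand-rolled read loop looks something like the following sketch (not the actual combinatorrent code; it assumes the network package's recv and strict bytestrings):

import qualified Data.ByteString as B
import Network.Socket (Socket)
import Network.Socket.ByteString (recv)

-- Read exactly n bytes from the socket, accumulating chunks as they arrive.
recvExactly :: Socket -> Int -> IO B.ByteString
recvExactly sock n = go [] n
  where
    go acc 0 = return (B.concat (reverse acc))
    go acc k = do
      chunk <- recv sock k
      if B.null chunk
        then return (B.concat (reverse acc))   -- peer closed early
        else go (chunk : acc) (k - B.length chunk)

-- Given the current leftover chunk and the expected message length, return
-- the complete message plus whatever is left over for the next message.
readMessage :: Socket -> B.ByteString -> Int -> IO (B.ByteString, B.ByteString)
readMessage sock leftover n
  | B.length leftover >= n =
      return (B.splitAt n leftover)             -- small: a cheap slice
  | otherwise = do
      rest <- recvExactly sock (n - B.length leftover)
      -- large: exactly one append of two pieces, not many small (+++)s
      return (leftover `B.append` rest, B.empty)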

Related

F# seq behavior

I'm a little baffled by the inner workings of sequence expressions in F#.
Normally, if we make a sequential file reader with seq, with no intentional caching of data:
seq {
    let mutable current = file.Read()
    while current <> -1 do
        yield current
        current <- file.Read()   // advance, otherwise the loop never ends
}
We will end up with some weird behavior if we try to re-iterate or backtrack. My idea was that since Read() advances some mutable state, we can't expect the output to be correct if we re-iterate. But then the following behaves nicely, even on buffer-boundary reads:
let Read path =
    seq {
        use fp = System.IO.File.OpenRead path
        let buf = [| for _ in 0 .. 1023 -> 0uy |]   // 1024-byte buffer
        let mutable pos = 1
        let mutable current = 0
        while pos <> 0 do
            if current = 0 then
                pos <- fp.Read(buf, 0, 1024)
            if pos > 0 && current < pos then
                yield buf.[current]
            current <- (current + 1) % 1024
    }
let content = Read "some path"
We clearly reuse the same buffer to enhance performance, but suppose we read the 1025th byte: that triggers a refill of the buffer. If we then try to read any byte at a position < 1025 afterwards, we still get the correct output. How can that be, and what is the difference?
Your question is a bit unclear, so I'll try to guess.
When you create a seq { }, you're essentially creating a state machine which will run only as far as it needs to. When you request the very first element from it, it'll start at the top and run until your first yield instruction. Then, when you request another value, it'll run from that point until the next yield, and so on.
Keep in mind that a seq { } produces an IEnumerable<'T>, which is like a "plan of execution". Each time you start to iterate the sequence (for example by calling Seq.head), a call to GetEnumerator is made behind the scenes, which causes a new IEnumerator<'T> to be created. It is the IEnumerator which does the actual providing of values. You can think of it in more classical terms as having an array over which you can iterate (an iterable or enumerable) and many pointers over that array, each of which are at different points in the array (many iterators or enumerators).
In your first code, file is most likely external to the seq block. This means that the file you are reading from is baked into the plan of execution; no matter how many times you start to iterate the sequence, you'll always be reading from the same file. This is obviously going to cause unpredictable behaviour.
However, in your second code, the file is opened as part of the seq block's definition. This means that you'll get a new file handle each time you iterate the sequence or, essentially, a new file handle per enumerator. The reason this code works is that you can't reverse an enumerator or iterate over it multiple times, not with a single thread at least.
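A tiny demonstration of the fresh-enumerator-per-iteration behavior (a sketch, with printfn standing in for the file open):

let xs = seq {
    printfn "starting a fresh enumerator"   // runs once per GetEnumerator call
    yield 1
    yield 2
}

// Each Seq.head starts a brand-new enumerator, so the side effect runs twice,
// just as your second snippet opens a new file handle per iteration.
let a = Seq.head xs   // prints, then returns 1
let b = Seq.head xs   // prints again, returns 1 again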
(Now, if you were to manually get an enumerator and advance it over multiple threads, you'd probably run into problems very quickly. But that is a different topic.)

Julia - nested loops are consuming a lot of memory

I have a nested loop iteration scheme that's taking up a lot of memory. I do understand that I should not have globals, but even after I wrapped everything in a function, the memory situation didn't improve. It just accumulates after each iteration as if there were no garbage collection.
Here is a workable example that's similar to my code.
I have two files. First, functions.jl:
## functions.jl
module functions

function getMatrix(A)
    L = rand(A, A)
    return L
end

function loopOne(A, B)
    res = 0
    for i = 1:B
        res = inv(getMatrix(A))
    end
    return res
end

end
Second file main.jl:
## main.jl
include("functions.jl")

function main(C)
    res = 0
    A = 50
    B = 30
    for i = 1:C
        mat = functions.loopOne(A, B)
        res = mat .+ 1
    end
    return res
end

main(100)
When I execute julia main.jl, the reported allocations and memory grow as I raise C in main(C) (sometimes to millions of allocations and over 10 GiB when I increase C to 1000000).
I know that the example looks useless but it resembles the structure that I have. Can someone please help? Thank you.
UPDATE:
Michael K. Borregaard gave an answer which is very helpful:
module Functions                                  #1
using Random: rand!   # rand! lives in the Random stdlib on modern Julia

function loopOne!(res, mymatrix, B)               #2
    for i = 1:B
        res .= inv(rand!(mymatrix))               #3
    end
    return res                                    #4
end

end

function some_descriptive_name(C)                 #5
    A, B = 50, 30                                 #6
    res, mymat = zeros(A, A), zeros(A, A)
    for i = 1:C
        res .= Functions.loopOne!(res, mymat, B) .+ 1
    end
    return res
end
However, when I time it, allocations and memory still increase as I dial up C.
@time some_descriptive_name(30)
  0.057177 seconds (11.77 k allocations: 58.278 MiB, 9.58% gc time)
@time some_descriptive_name(60)
  0.113808 seconds (23.53 k allocations: 116.518 MiB, 9.63% gc time)
I believe that the problem comes from the inv function. If I change the code to:
function some_descriptive_name(C)                 #5
    A, B = 50, 30                                 #6
    res, mymat = zeros(A, A), zeros(A, A)
    for i = 1:C
        res .= res .+ 1
    end
    return res
end
The memory and allocations will then stay constant:
@time some_descriptive_name(3)
  0.000007 seconds (8 allocations: 39.438 KiB)
@time some_descriptive_name(60)
  0.000037 seconds (8 allocations: 39.438 KiB)
Is there a way to "clear" the memory after using inv? Since I'm not creating anything new or storing anything new, the memory usage should stay constant.
A few pointers at least:
The getMatrix function allocates a new AxA matrix every time. That will certainly consume memory. It is better to avoid the allocations if you can, e.g. by using rand! to fill an existing array with random values.
The res = 0 line defines res as an Int, but you subsequently assign a Matrix{Float64} to it (the result of inv(getMatrix(A))). Changing the type of a variable makes it hard for the compiler to figure out what the type is, which makes for slow code.
It seems you have a module called functions, but you never import it.
The res = inv(...) line overwrites res on every iteration, so the loop accomplishes nothing beyond its final pass!
The structure and code look like C++. Try looking at the Julia style guide.
Here's how the code would look in a more idiomatic way that avoids the allocations:
module Functions                                  #1
using Random: rand!   # rand! lives in the Random stdlib on modern Julia

function loopOne!(res, mymatrix, B)               #2
    for i = 1:B
        res .= inv(rand!(mymatrix))               #3
    end
    return res                                    #4
end

end

function some_descriptive_name(C)                 #5
    A, B = 50, 30                                 #6
    res, mymat = zeros(A, A), zeros(A, A)
    for i = 1:C
        res .= Functions.loopOne!(res, mymat, B) .+ 1
    end
    return res
end
Comments:
Use a module if you like - it's up to you whether to put things in different files. Module name in caps.
If you can, it's an advantage to use functions that overwrite the values of an existing container. Such functions end with ! to signal that they will modify their arguments (like passing a variable by reference without making it const in C++).
Use the .= operator to indicate that you're not creating a new container, you're overwriting the elements of the existing one. The rand! function overwrites mymatrix.
The return keyword is not strictly needed, but DNF suggested in a comment that it is better style.
The main convention isn't used in Julia, as most code gets called by the user, not by execution of a program.
Compact assignment format for multiple variables.
Note that in this case, none of these optimisations matter much, as 99% of the calculation time is spent in the expensive inv function.
RESPONSE TO THE UPDATE:
There's nothing wrong with the inv function; it is just a costly operation. But again, I think you may be misunderstanding what the memory counting does. It is not that the memory use is increasing, as you would look for in C++ if you had a pointer to an object that was never released (a memory leak). The memory use is constant, but the total sum of allocations increases, because the inv function has to make some internal allocations.
Consider this example:
for i in 1:n
    b = [1, 2, 3, 4]   # here a length-4 Array{Int64} is initialized in memory, cost is 32 bytes
end                    # here, that memory is released
For each run through the for loop, 32 bytes are allocated and 32 bytes are released. When the loop ends, regardless of n, 0 bytes remain allocated from this operation. But Julia's memory tracking only adds up the allocations - so after running the code you will see a reported allocation of 32*n bytes.
The reason Julia does this is that allocating space in RAM is one of the costliest operations in computing - so reducing allocations is a good way to speed up code. But you cannot always avoid them.
There is thus nothing wrong with your code (in the new format) - the allocations and time taken you see are just the result of doing a big (expensive) operation.
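You can see the difference between the allocation count and actual memory use with a toy function like this (sum_of_random is just an illustrative name; the numbers @time reports will vary):

function sum_of_random(n)
    s = 0.0
    for i in 1:n
        v = rand(100)   # allocates a fresh 100-element vector on every pass
        s += sum(v)
    end                 # each vector becomes garbage at the end of its pass
    return s
end

# The reported allocation count grows with n, but peak memory use stays flat,
# because the garbage collector reclaims each vector as the loop runs.
@time sum_of_random(10)
@time sum_of_random(10_000)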

How to read a file in blocks in Rails without reading it again from the beginning

I have a growing file (a log) that I need to read in blocks.
I make an Ajax call to get a specified number of lines.
I used File.foreach to read the lines I want, but it always reads from the beginning, and I need to return only the lines I want, directly.
Example Pseudocode:
#First call:
File.open and return 0 to 10 lines
#Second call:
File.open and return 11 to 20 lines
#Third call:
File.open and return 21 to 30 lines
#And so on...
Is there any way to do this?
Solution 1: Reading the whole file
The proposed solution here:
https://stackoverflow.com/a/5052929/1433751
..is not an efficient solution in your case, because it requires you to read all the lines from the file for each AJAX request, even if you just need the last 10 lines of the logfile.
That's an enormous waste of time: the total work grows quadratically, since every request re-reads the whole logfile just to return one block of size N.
Solution 2: Seeking
Since your AJAX calls request sequential lines we can implement a much more efficient approach by seeking to the correct position before reading, using IO.seek and IO.pos.
This requires you to return some extra data (the last file position) back to the AJAX client at the same time you return the requested lines.
The AJAX request then becomes a function call of this form request_lines(position, line_count), which enables the server to IO.seek(position) before reading the requested count of lines.
Here's the pseudocode for the solution:
Client code:
LINE_COUNT = 10
pos = 0
loop {
  data = server.request_lines(pos, LINE_COUNT)
  display_lines(data.lines)
  pos = data.pos
  break if pos == -1  # Reached end of file
}
Server code:
def request_lines(pos, line_count)
  file = File.open('logfile')
  # Seek to requested position
  file.seek(pos)
  # Read the requested count of lines while checking for EOF
  lines = line_count.times.map { file.readline unless file.eof? }.compact
  # Mark pos with -1 if we reached EOF during reading
  pos = file.eof? ? -1 : file.pos
  file.close
  # Return data
  { lines: lines, pos: pos }
end
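Wired into Rails, the server side might look roughly like this (a sketch; the controller name and route are hypothetical):

# app/controllers/logs_controller.rb (hypothetical)
class LogsController < ApplicationController
  # GET /logs/lines?pos=0&count=10
  def lines
    pos   = params.fetch(:pos, 0).to_i
    count = params.fetch(:count, 10).to_i
    # request_lines is the method defined above
    render json: request_lines(pos, count)
  end
end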

F# records, usage, code clarity

Background:
I find myself harnessing F# Records a lot. Currently I am working on a project for packet dissection & replay of a proprietary binary protocol (a protocol that is very strangely designed ...).
We define the skeleton record for the packet.
type bytes = byte array

type packetSkeleton = {
    part1 : bytes
    part2 : bytes
    ... }
Now, it is easy to use this to 'dissect' our packet (really just giving names to the byte fields).
let dissect (raw : bytes) =
    let slice a b = raw.[a..b]
    { part1 = slice 0 4
      part2 = slice 4 5
      ... }
This works perfectly even for longish packets; we can even use some neat recursive functions if there is a predictable pattern to the slicing.
So I dissect the packet, pull out the fields that I need and create a packet based off the packetSkeleton using the fields I took from the dissection, which by now is starting to look a bit awkward:
let createAuthStub a b c d e f g h ... =
    { part1 = a; part2 = b
      part3 = d; ... }
Then, after creating the populated stub, I need to deserialise it to a form that can be put on the wire:
(* packetSkeleton -> byte array *)
let deserialise (packet : packetSkeleton) =
    // Array.concat flattens the parts into a single byte array
    Array.concat [ packet.part1; packet.part2; ... ]

let xab = dissect buf
let authStub = createAuthStub xab.part1 1 2 xab.part9 ...
deserialise authStub |> send
So it ends up that I have 3 areas, the record type, the creation of the record for a given packet, and the deserialised byte array. Something tells me that this is a poor design choice on my part in terms of code clarity, and I can already feel it starting to shoot me in the foot even at this early stage.
Questions:
a) Am I using the correct datatype for such a project? Is my approach correct?
b) Should I just give up on trying to make this code feel clean?
As I am kinda coding this by touch and go, I would appreciate some insights!
P.S. I realise that this problem is quite suited to C, but F# is more fun (additionally, verification of the dissector later on sounds appealing)!
If a packet can be rather large, packetSkeleton might grow unwieldy. Another option is to work with the raw bytes and define a module that reads/writes each part.
module Packet

let Length = 42

let GetPart1 (src: byte[]) = src.[0..3]
let SetPart1 (src: byte[]) (dst: byte[]) = Array.blit src 0 dst 0 4

let GetPart2 (src: byte[]) = src.[4..5]
let SetPart2 (src: byte[]) (dst: byte[]) = Array.blit src 0 dst 4 2
...

open Packet

let createAuthStub bytes b c =
    let resp = Array.zeroCreate Packet.Length
    SetPart1 (GetPart1 bytes) resp   // pass resp: blit the slice into the response
    SetPart2 b resp
    SetPart3 c resp
    SetPart4 (GetPart9 bytes) resp
    resp
This removes the need for de/serialization functions (and probably helps performance a bit).
EDIT
Creating a wrapper type is another option
type Packet(bytes: byte[]) =
    new() = Packet(Array.zeroCreate Packet.Length)
    static member Length = 42
    member x.Part1
        with get() = bytes.[0..3]
        and set value = Array.blit value 0 bytes 0 4
    ...
which might reduce code a bit:
let createAuthStub (req: Packet) b c =
    let resp = Packet()
    resp.Part1 <- req.Part1
    resp.Part2 <- b
    resp.Part3 <- c
    resp.Part4 <- req.Part9
    resp
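One thing the wrapper still needs for the original deserialise authStub |> send step is access to the raw array; a member like this would do (my addition, not part of the suggestion above):

type Packet(bytes: byte[]) =
    new() = Packet(Array.zeroCreate 42)
    member x.Bytes = bytes   // the raw array, ready to put on the wire
    // ... Part1, Part2, ... members as above ...

// usage: no separate deserialise step is needed
// send (createAuthStub req b c).Bytes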
I think your approach is essentially sound - but of course, it is difficult to tell without knowing more details.
One key idea that shows up in your code, and that is central to functional architecture, is the separation between types (which model the problem domain) and the processing functions that create, transform, and format values of that domain model.
In your case:
The types bytes and packetSkeleton model the problem domain
The function createAuthStub processes your domain (and I agree with Daniel that it might be more readable if it took the whole packetSkeleton as an argument)
The function deserialize turns your domain back to bytes
I think this way of structuring code is quite good, because it separates different concerns of the program. I even wrote an article that tries to describe this as a more general programming approach.

Erlang: erl shell hangs after building a large data structure

As suggested in answers to a previous question, I tried using Erlang proplists to implement a prefix trie.
The code seems to work decently well... But, for some reason, it doesn't play well with the interactive shell. When I try to run it, the shell hangs:
> Trie = trie:from_dict(). % Creates a trie from a dictionary
% ... the trie is printed ...
% Then nothing happens
I see the new trie printed to the screen (i.e., the call to trie:from_dict() has returned), then the shell just hangs. No new > prompt comes up, and ^G doesn't do anything (but ^C will eventually kill it off).
With a very small dictionary (the first 50 lines of /usr/share/dict/words), the hang only lasts a second or two (and the trie is built almost instantly)... But it seems to grow exponentially with the size of the dictionary (100 words takes 5 or 10 seconds; I haven't had the patience to try larger wordlists). Also, while the shell is hanging, I notice that the beam.smp process starts eating up a lot of memory (somewhere between 1 and 2 gigs).
So, is there anything obvious that could be causing this shell hang and incredible memory usage?
Some assorted comments:
I have a hunch that the garbage collector is at fault, but I don't know how to profile or create an experiment to test that.
I've tried profiling with eprof and nothing obvious showed up.
Here is my "add string to trie" function:
add([], Trie) ->
    [stop | Trie];
add([Ch|Rest], Trie) ->
    SubTrie = proplists:get_value(Ch, Trie, []),
    NewSubTrie = add(Rest, SubTrie),
    NewTrie = [{Ch, NewSubTrie} | Trie],
    % Arbitrarily decide to compress the key/value list once it gets
    % more than 60 pairs.
    if
        length(NewTrie) > 60 ->
            proplists:compact(NewTrie);
        true ->
            NewTrie
    end.
The problem is (amongst others, perhaps - see my comment) that you are always prepending a new {Ch, NewSubTrie} tuple to your proplist Trie, whether or not Ch already existed.
Instead of
NewTrie = [ { Ch, NewSubTrie } | Trie ]
you need something like:
NewTrie = lists:keystore(Ch, 1, Trie, {Ch, NewSubTrie})
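Applied to the add function above, the fix looks like this (a sketch; note the compaction step becomes unnecessary once there are no duplicate keys):

add([], Trie) ->
    [stop | Trie];
add([Ch|Rest], Trie) ->
    SubTrie = proplists:get_value(Ch, Trie, []),
    NewSubTrie = add(Rest, SubTrie),
    % Replace any existing {Ch, _} pair instead of prepending a duplicate,
    % so the proplist no longer grows without bound.
    lists:keystore(Ch, 1, Trie, {Ch, NewSubTrie}).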
You're not really building a trie here. Your end result is effectively a randomly ordered proplist of proplists that requires full scans at each level when walking the list. Tries typically imply ordering based on position in the array (or list).
Here's an implementation that uses tuples as the storage mechanism. Calling set only rebuilds the root and direct path tuples.
(Note: you would probably have to make the pair a triple (adding a size) to make delete work with any efficiency.)
I believe Erlang tuples are really just arrays (I thought I read that somewhere), so lookup should be super fast, and modification is probably straightforward. Maybe this is faster with the array module, but I haven't really played with it much to know.
This version also stores an arbitrary value, so you can do things like:
1> c(trie).
{ok,trie}
2> trie:get("ab",trie:set("aa",bar,trie:new("ab",foo))).
foo
3> trie:get("abc",trie:set("aa",bar,trie:new("ab",foo))).
undefined
4>
Code (entire module). Note 2: it assumes lowercase, non-empty string keys.
-module(trie).
-compile(export_all).

%% 26 pairs, to avoid the cost of calculating a new level at runtime.
-define(NEW, {
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth},{undefined,nodepth},{undefined,nodepth},
    {undefined,nodepth},{undefined,nodepth}
}).

-define(POS(Ch), Ch - $a + 1).

new(Key, V) -> set(Key, V, ?NEW).

set([H], V, Trie) ->
    Pos = ?POS(H),
    {_, SubTrie} = element(Pos, Trie),
    setelement(Pos, Trie, {V, SubTrie});
set([H|T], V, Trie) ->
    Pos = ?POS(H),
    {SubKey, SubTrie} = element(Pos, Trie),
    case SubTrie of
        nodepth -> setelement(Pos, Trie, {SubKey, set(T, V, ?NEW)});
        SubTrie -> setelement(Pos, Trie, {SubKey, set(T, V, SubTrie)})
    end.

get([H], Trie) ->
    {Val, _} = element(?POS(H), Trie),
    Val;
get([H|T], Trie) ->
    case element(?POS(H), Trie) of
        {_, nodepth} -> undefined;
        {_, SubTrie} -> get(T, SubTrie)
    end.
