When using immutable dictionaries in F#, how much overhead is there when adding/removing entries?
Will it treat entire buckets as immutable, clone those, and only recreate the bucket whose item has changed?
Even if that is the case, it seems like there is a lot of copying that needs to be done in order to create the new dictionary(?)
I looked at the implementation of the F# Map<K,V> type and I think it is implemented as a functional AVL tree. It stores the values in the inner nodes of the tree as well as in the leaves, and for each node it makes sure that |height(left) - height(right)| <= 1.
    A
   / \
  B   C
 / \
D   E
I think that both the average and worst-case complexities are O(log(n)):
Insert - we need to clone all nodes on the path from the root to the newly inserted element, and the height of the tree is at most O(log(n)). On the "way back", the tree may need to rebalance each node on the path, but that's also only O(log(n)).
Remove is similar - we find the element and then clone all nodes from the root to that element (rebalancing nodes on the way back to the root).
Note that other data structures that don't need to rebalance all nodes from the root to the current one on insertion/deletion wouldn't really help in the immutable scenario, because you need to create new nodes for the entire path anyway.
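To make the path-copying idea concrete, here is a minimal sketch (not the actual Map source, and with rebalancing omitted) of insertion into an immutable binary search tree; only the nodes on the search path are recreated, everything else is shared:

// Sketch of path copying in an immutable BST (no rebalancing).
// Each insert allocates one new node per level on the search path
// and reuses every subtree that the path does not touch.
type Tree<'K, 'V when 'K : comparison> =
    | Empty
    | Node of left: Tree<'K, 'V> * key: 'K * value: 'V * right: Tree<'K, 'V>

let rec insert k v tree =
    match tree with
    | Empty -> Node(Empty, k, v, Empty)
    | Node(l, k', v', r) ->
        if k < k' then Node(insert k v l, k', v', r)      // new node; right subtree shared
        elif k > k' then Node(l, k', v', insert k v r)    // new node; left subtree shared
        else Node(l, k, v, r)                             // key exists: replace value, share both subtrees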
A lot of the tree structure can be reused. I don't know the algorithmic complexity offhand, I would guess on average there's only like amortized logN 'waste'...
Why not try to write a program to measure? (We'll see if I can get motivated tonight to try it myself.)
EDIT
Ok, here is something I hacked. I haven't decided if there's any useful data here or not.
open System
let rng = new Random()

let shuffle (array : _[]) =
    let n = array.Length
    for x in 1..n do
        let i = n-x
        let j = rng.Next(i+1)
        let tmp = array.[i]
        array.[i] <- array.[j]
        array.[j] <- tmp

let TryTwoToThe k =
    let N = pown 2 k
    GC.Collect()
    let a = Array.init N id
    let makeRandomTreeAndDiscard() =
        shuffle a
        let mutable m = Map.empty
        for i in 0..N-1 do
            m <- m.Add(i,i)
    for i in 1..20 do
        makeRandomTreeAndDiscard()
    for i in 1..20 do
        makeRandomTreeAndDiscard()
    for i in 1..20 do
        makeRandomTreeAndDiscard()
#time
// run these as separate interactions
printfn "16"
TryTwoToThe 16
printfn "17"
TryTwoToThe 17
printfn "18"
TryTwoToThe 18
When I run this in FSI on my box, I get
--> Timing now on
>
16
Real: 00:00:08.079, CPU: 00:00:08.062, GC gen0: 677, gen1: 30, gen2: 1
>
17
Real: 00:00:17.144, CPU: 00:00:17.218, GC gen0: 1482, gen1: 47, gen2: 4
>
18
Real: 00:00:37.790, CPU: 00:00:38.421, GC gen0: 3400, gen1: 1059, gen2: 17
which suggests the memory may be scaling super-linearly but not too badly. I am presuming that the gen0 collections are roughly a good proxy for the 'waste' of rebalancing the tree. But it is late so I am not sure if I have thought this through well enough. :)
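If you want a rougher but more direct number than the gen0 counts, one possible sketch (assuming that forcing a full collection before and after gives a usable baseline, and ignoring FSI's own allocations) is to diff GC.GetTotalMemory around building a single map; the transient rebalancing garbage will still only show up in the collection counts:

// Rough sketch: approximate the retained size of one map of n entries.
// GetTotalMemory(true) forces a collection first, so the difference is
// roughly the live tree rather than the transient garbage.
let approxRetainedBytes n =
    System.GC.Collect()
    let before = System.GC.GetTotalMemory(true)
    let m = List.fold (fun acc i -> Map.add i i acc) Map.empty [ 1 .. n ]
    let after = System.GC.GetTotalMemory(true)
    (after - before), Map.count m   // return the count so 'm' stays live during the second measurement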
Related
I've been using F# for nearly six months and was so sure that F# Interactive should have the same performance as compiled code that, when I finally bothered to benchmark it, I was convinced it was some kind of compiler bug. Though now it occurs to me that I should have checked here first before opening an issue.
For me it is roughly 3x slower and the optimization switch does not seem to be doing anything at all.
Is this supposed to be standard behavior? If so, I really got trolled by the #time directive. I have the timings for how long it takes to sum 100M elements on this Reddit thread.
Update:
Thanks to FuleSnabel, I uncovered some things.
I tried running the example script from both fsianycpu.exe (which is the default F# Interactive) and fsi.exe, and I am getting different timings for the two runs: 134ms for the former and 78ms for the latter. Those two timings also correspond to the timings from the unoptimized and optimized binaries respectively.
What makes the matter even more confusing is that the first project I used to compile the thing is part of the game library (in script form) I am making, and it refuses to compile the optimized binary, instead switching to the unoptimized one without informing me. I had to start a fresh project to get it to compile properly. It is a wonder the other test compiled properly.
So basically, something funky is going on here and I should look into switching fsianycpu.exe to fsi.exe as the default interpreter.
I tried the example code in the pastebin and I don't see the behavior you describe. This is the result from my performance run:
.\bin\Release\ConsoleApplication3.exe
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2836 ms
reduce array, result 450015000, time 594 ms
for loop array, result 450015000, time 180 ms
reduce list, result 450015000, time 593 ms
fsi -O --exec .\Interactive.fsx
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2617 ms
reduce array, result 450015000, time 589 ms
for loop array, result 450015000, time 168 ms
reduce list, result 450015000, time 603 ms
It's expected that Seq.reduce would be the slowest, the for loop the fastest and that the reduce on list/array is roughly similar (this assumes locality of list elements which isn't guaranteed).
I rewrote your code to allow for longer runs without running out of memory and to improve the cache locality of the data. With short runs, the uncertainty of the measurements makes it hard to compare the data.
Program.fs:
module fs

let stopWatch =
    let sw = new System.Diagnostics.Stopwatch()
    sw.Start ()
    sw

let total = 300000000
let outer = 10000
let inner = total / outer

let timeIt (name : string) (a : unit -> 'T) : unit =
    let t = stopWatch.ElapsedMilliseconds
    let v = a ()
    for i = 2 to outer do
        a () |> ignore
    let d = stopWatch.ElapsedMilliseconds - t
    printfn "%s, result %A, time %d ms" name v d

[<EntryPoint>]
let sumTest(args) =
    let numsList = [1..inner]
    let numsArray = [|1..inner|]

    printfn "Total iterations: %d, Outer: %d, Inner: %d" total outer inner

    let sumsSeqReduce () = Seq.reduce (+) numsList
    timeIt "reduce sequence of list" sumsSeqReduce

    let sumsArray () = Array.reduce (+) numsArray
    timeIt "reduce array" sumsArray

    let sumsLoop () =
        let mutable total = 0
        for i in 0 .. inner - 1 do
            total <- total + numsArray.[i]
        total
    timeIt "for loop array" sumsLoop

    let sumsListReduce () = List.reduce (+) numsList
    timeIt "reduce list" sumsListReduce

    0
Interactive.fsx:
#load "Program.fs"
fs.sumTest [||]
PS. I am running on Windows with Visual Studio 2015. 32-bit or 64-bit seemed to make only a marginal difference.
I want to know how much memory this data structure will use in the Erlang VM:
[{"3GPP-UTRAN-FDD", [{"utran-cell-id-3gpp","CID1000"}, "1996-12-19t16%3a39%3a57-08%3a00", "1996-12-19t15%3a39%3a27%2e20-08%3a00"]}]
In my application, every process will store this data in its loop data, and the number of these processes will be 120000.
The results from my test:
Without storing this data, the memory is:
memory[kB]: proc 1922806, atom 2138, bin 24890, code 72757, ets 459321
With this data stored, the memory is:
memory[kB]: proc 1684032, atom 2138, bin 24102, code 72757, ets 459080
So the big difference is the memory used by proc: (1922806 - 1684032) / 1024 = 233M.
After some research, I found an interesting thing:
L = [{"3GPP-UTRAN-FDD", [{"utran-cell-id-3gpp","CID1000"}, "1996-12-19t16%3a39%3a57-08%3a00", "1996-12-19t15%3a39%3a27%2e20-08%3a00"]}].
B = list_to_binary(io_lib:format("~p", L)).
erts_debug:size(B). % The output is 6
The memory is just 6 words after converting to a binary? How can this be explained?
There are two useful functions for measuring the size of an Erlang term: erts_debug:size/1 and erts_debug:flat_size/1. Both of these functions return the size of the term in words.
erts_debug:flat_size/1 gives you the total size of a message without term-sharing optimization. This is guaranteed to be the size of the term if it is copied to a new heap, as with message passing and ets tables.
erts_debug:size/1 gives you the size of the term as it is in the current process' heap, allowing for memory usage optimization by sharing repeated terms.
Here is a demonstration of the differences:
1> MyTerm = {atom, <<"binary">>, 1}.
{atom,<<"binary">>,1}
2> MyList = [ MyTerm || _ <- lists:seq(1, 100) ].
[{atom,<<"binary">>,1}|...]
3> erts_debug:size(MyList).
210
4> erts_debug:flat_size(MyList).
1200
As you can see, there is a significant difference in the sizes due to term sharing.
As for your specific term, I used the Erlang shell (R16B03) and measured it with flat_size. According to this, the memory usage of your term is 226 words (1808 B, about 1.77 KB).
This is a lot of memory to use for what appears to be a simple term, but that is outside of the scope of this question.
The size of the whole binary is 135 bytes when you do list_to_binary(io_lib:format("~p", L)). If you are on a 64-bit system it represents 4.375 words, so 6 words should be the correct size, but you have lost direct access to the internal structure.
Strange but can be understood:
19> erts_debug:flat_size(list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,1000)])).
6
20> erts_debug:flat_size(list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,10000)])).
6
21> size(list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,10000)])).
10000
22> (list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,10000)])).
<<"myeyrltgyfnytajecrgtonkdcxlnaoqcsswdnepnmdxfrwnnlbzdaxknqarfyiwewlugrtgjgklblpdkvgpecglxmfrertdfanzukfolpphqvkkwrpmb"...>>
23> erts_debug:display({list_to_binary([random:uniform(26) + $a - 1 || _ <- lists:seq(1,10000)])}).
{<<10000 bytes>>}
"{<<10000 bytes>>}\n"
24>
This means that erts_debug:flat_size returns the size of the variable (which is roughly the type information, a pointer to the data, and its size), but not the size of the binary data itself. The binary data is stored elsewhere and can be shared by different variables.
I have been using the CSV provider to load files of about 300k to 1M rows (50-120 MB). It works very well and is very fast. It can load most files in under a second.
Here is the output from 64-bit FSI on windows loading a file of about 400k rows and 25 fields.
#time
let Csv2 = CsvFile.Parse(testfile)
let parsedRows = Csv2.Rows |> Seq.toArray
#time
--> Timing now on
Real: 00:00:00.056, CPU: 00:00:00.093, GC gen0: 0, gen1: 0, gen2: 0
But when I load the same file into Deedle
#time
let dCsv = Frame.ReadCsv(testfile)
#time;;
--> Timing now on
Real: 00:01:39.197, CPU: 00:01:41.119, GC gen0: 6324, gen1: 417, gen2: 13
It takes over 1m 40s. I know some extra time is necessary, as Deedle is doing much more than the static CSV parser above, but over 1m 40s seems high. Can I somehow shorten it?
By default, the Frame.ReadCsv function attempts to infer the type of the columns by looking at the contents. I think this might be adding most of the overhead here. You can try specifying inferTypes=false to disable this completely (then it'll load the data as strings) or you can use inferRows=10 to infer the types from the first few rows. This should work well enough and be faster:
let df = Frame.ReadCsv(testfile, inferRows=10)
Maybe we should make something like this the default option. If this does not fix the problem, please submit a GitHub issue and we'll look into it!
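For reference, a quick sketch of the two options side by side (assuming the same testfile as above and the optional-parameter names from the Deedle documentation):

let dfStrings = Frame.ReadCsv(testfile, inferTypes=false) // no inference; every column loads as string
let dfSampled = Frame.ReadCsv(testfile, inferRows=10)     // infer column types from the first 10 rows only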
I need to use Some/None options in heavy numerical simulations. The following micro benchmark gives me Fast = 485 and Slow = 5890.
I do not like nulls, and even if I liked them I cannot use null, because "The type 'float' does not have 'null' as a proper value."
Ideally there would be a compiler option that would compile Some/None into value/null so there would be no runtime penalty. Is that possible? Or how shall I make Some/None efficient?
let s = System.Diagnostics.Stopwatch()
s.Start()
for h in 0 .. 1000 do
    Array.init 100000 (fun i -> (float i + 1.)) |> ignore
printfn "Fast = %d" s.ElapsedMilliseconds
s.Restart()
for h in 0 .. 1000 do
    Array.init 100000 (fun i -> Some (float i + 1.)) |> ignore
printfn "Slow = %d" s.ElapsedMilliseconds
None is actually already represented as null. But since option<_> is a reference type (which is necessary for null to be a valid value in the .NET type system), creating Some instances will necessarily require heap allocations. One alternative is to use the .NET System.Nullable<_> type, which is similar to option<_>, except that:
it's a value type, so no heap allocation is needed
it only supports value types as elements, so you can create an option<string>, but not a Nullable<string>. For your use case this seems like an unimportant factor.
it has runtime support so that boxing a nullable without a value results in a null reference, which would be impossible otherwise
Keep in mind that your benchmark does very little work, so the results are probably not typical of what you'd see with your real workload. Try to use a more meaningful benchmark based on your actual scenario if at all possible.
As a side note, you get more meaningful diagnostics (including garbage collection statistics) if you use the #time directive in F# rather than bothering with the Stopwatch.
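As a rough sketch only (reusing the Stopwatch pattern from the question rather than a proper benchmark harness), the Nullable variant of the slow loop would look like this; timings will vary:

// Same shape as the "Slow" loop above, but System.Nullable<float> is a struct,
// so no per-element heap allocation is required.
let s2 = System.Diagnostics.Stopwatch()
s2.Start()
for h in 0 .. 1000 do
    Array.init 100000 (fun i -> System.Nullable(float i + 1.)) |> ignore
printfn "Nullable = %d" s2.ElapsedMilliseconds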
I have pieces of code like this in a project and I realize it's not
written in a functional way:
let data = Array.zeroCreate(3 + (int)firmwareVersions.Count * 27)
data.[0] <- 0x09uy //drcode
data.[1..2] <- firmwareVersionBytes //Number of firmware versions
let mutable index = 0
let loops = firmwareVersions.Count - 1
for i = 0 to loops do
    let nameBytes = ASCIIEncoding.ASCII.GetBytes(firmwareVersions.[i].Name)
    let timestampBytes = this.getTimeStampBytes firmwareVersions.[i].Timestamp
    let sizeBytes = BitConverter.GetBytes(firmwareVersions.[i].Size) |> Array.rev
    data.[index + 3 .. index + 10] <- nameBytes
    data.[index + 11 .. index + 24] <- timestampBytes
    data.[index + 25 .. index + 28] <- sizeBytes
    data.[index + 29] <- firmwareVersions.[i].Status
    index <- index + 27
firmwareVersions is a List which is part of a C# library.
It does not have (and should not have) any knowledge of how it will be converted into
an array of bytes. I realize the code above is very non-functional, so I tried
changing it like this:
let headerData = Array.zeroCreate(3)
headerData.[0] <- 0x09uy
headerData.[1..2] <- firmwareVersionBytes
let getFirmwareVersionBytes (firmware : FirmwareVersion) =
    let nameBytes = ASCIIEncoding.ASCII.GetBytes(firmware.Name)
    let timestampBytes = this.getTimeStampBytes firmware.Timestamp
    let sizeBytes = BitConverter.GetBytes(firmware.Size) |> Array.rev
    Array.concat [nameBytes; timestampBytes; sizeBytes]

let data =
    firmwareVersions.ToArray()
    |> Array.map (fun f -> getFirmwareVersionBytes f)
    |> Array.reduce (fun acc b -> Array.concat [acc; b])

let fullData = Array.concat [headerData; data]
So now I'm wondering if this is a better (more functional) way
to write the code. If so, why, and what improvements should I make?
If not, why not, and what should I do instead?
Suggestions, feedback, remarks?
Thank you
Update
Just wanted to add some more information.
This is part of a library that handles the data for a binary communication
protocol. The only upside I see in the first version of the code is that
people implementing the protocol in a different language (which is the case
in our situation as well) might get a better idea of how many bytes each
part takes up and where exactly they are located in the byte stream... just a remark.
(As not everybody understands English, but all our partners can read code.)
I'd be inclined to inline everything because the whole program becomes so much shorter:
let fullData =
    [| yield 0x09uy                      // drcode
       yield! firmwareVersionBytes       // number of firmware versions
       for firmware in firmwareVersions do
           yield! ASCIIEncoding.ASCII.GetBytes(firmware.Name)
           yield! this.getTimeStampBytes firmware.Timestamp
           yield! (BitConverter.GetBytes(firmware.Size) |> Array.rev) |]
If you want to convey the positions of the bytes, I'd put them in comments at the end of each line.
I like your first version better because the indexing gives a better picture of the offsets, which are an important piece of the problem (I assume). The imperative code features the byte offsets prominently, which might be important if your partners can't/don't read the documentation. The functional code emphasises sticking together structures, which would be OK if the byte offsets are not important enough to be mentioned in the documentation either.
Indexing is normally accidental complexity, in which case it should be avoided. For example, your first version's loop could be for firmwareVersion in firmwareVersions instead of for i = 0 to loops.
Also, like Brian says, using constants for the offsets would make the imperative version even more readable.
How often does the code run?
The advantage of 'array concatenation' is that it does make it easier to 'see' the logical portions. The disadvantage is that it creates a lot of garbage (allocating temporary arrays) and may also be slower if used in a tight loop.
Also, I think perhaps your "Array.reduce(...)" can just be "Array.concat".
Overall I prefer the first way (just create one huge array), though I would factor it differently to make the logic more apparent (e.g. have a named constant HEADER_SIZE, etc.).
While we're at it, I'd probably add some asserts to ensure that e.g. nameBytes has the expected length.
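To make that concrete, here is one possible refactoring sketch along those lines (it reuses the identifiers from the question - firmwareVersions, firmwareVersionBytes, this.getTimeStampBytes - and the constant names and assert placement are illustrative, not part of the original code):

// Sketch only: the imperative layout from the first version, with named
// constants for the field widths and asserts that the generated byte
// arrays actually have those widths.
let HEADER_SIZE = 3
let NAME_SIZE = 8
let TIMESTAMP_SIZE = 14
let SIZE_FIELD_SIZE = 4
let RECORD_SIZE = NAME_SIZE + TIMESTAMP_SIZE + SIZE_FIELD_SIZE + 1   // +1 for the status byte

let data = Array.zeroCreate (HEADER_SIZE + firmwareVersions.Count * RECORD_SIZE)
data.[0] <- 0x09uy                      // drcode
data.[1..2] <- firmwareVersionBytes     // number of firmware versions

for i = 0 to firmwareVersions.Count - 1 do
    let fw = firmwareVersions.[i]
    let nameBytes = ASCIIEncoding.ASCII.GetBytes(fw.Name)
    let timestampBytes = this.getTimeStampBytes fw.Timestamp
    let sizeBytes = BitConverter.GetBytes(fw.Size) |> Array.rev
    assert (nameBytes.Length = NAME_SIZE)
    assert (timestampBytes.Length = TIMESTAMP_SIZE)
    assert (sizeBytes.Length = SIZE_FIELD_SIZE)
    let offset = HEADER_SIZE + i * RECORD_SIZE
    data.[offset .. offset + NAME_SIZE - 1] <- nameBytes
    data.[offset + NAME_SIZE .. offset + NAME_SIZE + TIMESTAMP_SIZE - 1] <- timestampBytes
    data.[offset + NAME_SIZE + TIMESTAMP_SIZE .. offset + RECORD_SIZE - 2] <- sizeBytes
    data.[offset + RECORD_SIZE - 1] <- fw.Status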