F# Deedle's CSV file load time

I have been using the CSV provider to load files of about 300k to 1M rows (50 to 120 MB). It works very well and is very fast: it can load most files in under a second.
Here is the output from 64-bit FSI on windows loading a file of about 400k rows and 25 fields.
let Csv2 = CsvFile.Parse(testfile)
let parsedRows = Csv2.Rows |> Seq.toArray
--> Timing now on
Real: 00:00:00.056, CPU: 00:00:00.093, GC gen0: 0, gen1: 0, gen2: 0
But when I load the same file into Deedle
let dCsv = Frame.ReadCsv(testfile)
--> Timing now on
Real: 00:01:39.197, CPU: 00:01:41.119, GC gen0: 6324, gen1: 417, gen2: 13
It takes over 1m 40s. I know some extra time is necessary as Deedle is doing much more than the static CSV parser above, but over 1m 40s seems high. Can I somehow shorten it?

By default, the Frame.ReadCsv function attempts to infer the type of the columns by looking at the contents. I think this might be adding most of the overhead here. You can try specifying inferTypes=false to disable this completely (then it'll load the data as strings) or you can use inferRows=10 to infer the types from the first few rows. This should work well enough and be faster:
let df = Frame.ReadCsv(testfile, inferRows=10)
Maybe we should make something like this the default option. If this does not fix the problem, please submit a GitHub issue and we'll look into that!
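To illustrate why the number of inspected rows matters, here is a minimal Python sketch of sampling-based type inference. It is not Deedle's implementation: `infer_column_types`, its `infer_rows` parameter (named after the Deedle option above) and the sample data are made up for illustration.

```python
import csv
import io
from itertools import islice

# Widening order: a column starts as int and is widened when a cell does not fit.
_RANK = {int: 0, float: 1, str: 2}


def _narrowest_type(cell):
    """Return the narrowest of int, float, str that can represent the cell."""
    for candidate in (int, float):
        try:
            candidate(cell)
            return candidate
        except ValueError:
            pass
    return str


def infer_column_types(csv_text, infer_rows=10):
    """Infer one type per column by looking only at the first infer_rows rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    inferred = {name: int for name in header}
    for row in islice(reader, infer_rows):
        for name, cell in zip(header, row):
            seen = _narrowest_type(cell)
            if _RANK[seen] > _RANK[inferred[name]]:
                inferred[name] = seen
    return inferred


sample = "id,price,name\n1,2.5,a\n2,3,b\n3,oops,c\n"
types_from_two_rows = infer_column_types(sample, infer_rows=2)
types_from_all_rows = infer_column_types(sample, infer_rows=3)
```

Inspecting fewer rows does less work, but a value that only appears later (here `oops` in the third row) is not seen, so the inferred type can be too narrow for the whole file.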


How can I find the amount of available memory on the server in Elixir?

I call third-party scripts from my Elixir app. How can I find out how much memory is available on the machine my app runs on? I don't need the memory available to the Erlang VM, but the memory of the whole computer.
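A platform-agnostic way is :memsup.get_system_memory_data/0 from OTP's os_mon application (os_mon must be started for this call to work). It returns a keyword list such as:
[
  system_total_memory: 16754499584,
  free_swap: 4194299904,
  total_swap: 4194299904,
  cached_memory: 931536896,
  buffered_memory: 113426432,
  free_memory: 13018746880,
  total_memory: 16754499584
]
So to get the total memory in MB:
mbyte = :math.pow(1024, 2) |> Kernel.trunc
:memsup.get_system_memory_data()
|> Keyword.get(:system_total_memory)
|> Kernel.div(mbyte)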
The most obvious (but slightly cumbersome) way that I found is to call vmstat from the command line and parse its output:
System.cmd("vmstat", ["-s", "-SM"])
|> elem(0)
|> String.trim()
|> String.split()
|> List.first()
|> String.to_integer()
|> Kernel.*(1024 * 1024) # -SM reports units of 1048576 bytes; convert to bytes
vmstat works on Ubuntu and returns output like this:
3986 M total memory
3736 M used memory
3048 M active memory
525 M inactive memory
249 M free memory
117 M buffer memory
930 M swap cache
0 M total swap
0 M used swap
0 M free swap
1431707 non-nice user cpu ticks
56301 nice user cpu ticks
232979 system cpu ticks
3267984 idle cpu ticks
84908 IO-wait cpu ticks
0 IRQ cpu ticks
15766 softirq cpu ticks
0 stolen cpu ticks
4179948 pages paged in
6422812 pages paged out
0 pages swapped in
0 pages swapped out
35819291 interrupts
145676723 CPU context switches
1490259647 boot time
67936 forks
This works on Ubuntu and should work on any Linux system that has vmstat installed.
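For comparison, here is a minimal Python sketch that parses the whole `vmstat -s -SM` output shown above instead of only its first token. The function name `parse_vmstat_s` is made up, and the unit table follows the vmstat man page (k = 1000, K = 1024, m = 1000000, M = 1048576 bytes); treat it as an assumption to check against your vmstat version.

```python
# Bytes per unit letter, as documented for vmstat's -S option (assumption).
UNIT_BYTES = {"k": 1000, "K": 1024, "m": 1000000, "M": 1024 * 1024}


def parse_vmstat_s(output):
    """Map each label of `vmstat -s` output to an integer.

    Memory lines such as "3986 M total memory" are converted to bytes;
    counter lines such as "67936 forks" are kept as plain counts.
    """
    stats = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        value = int(parts[0])
        if parts[1] in UNIT_BYTES:
            stats[" ".join(parts[2:])] = value * UNIT_BYTES[parts[1]]
        else:
            stats[" ".join(parts[1:])] = value
    return stats


# A few lines copied from the sample output above.
sample_output = """
 3986 M total memory
 3736 M used memory
  249 M free memory
    0 M total swap
67936 forks
"""
stats = parse_vmstat_s(sample_output)
total_bytes = stats["total memory"]
```

Keeping every label makes it easy to read free memory or swap from the same call without running vmstat again.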

Is F# Interactive supposed to be much slower than compiled?

I've been using F# for nearly six months and have been so sure that F# Interactive should have the same performance as compiled, that when I bothered to benchmark it, I was convinced it was some kind of compiler bug. Though now it occurs to me that I should have checked here first before opening an issue.
For me it is roughly 3x slower and the optimization switch does not seem to be doing anything at all.
Is this supposed to be standard behavior? If so, I really got trolled by the #time directive. I have the timings for how long it takes to sum 100M elements on this Reddit thread.
Thanks to FuleSnabel, I uncovered some things.
I tried running the example script from both fsianycpu.exe (which is the default F# Interactive) and fsi.exe, and I am getting different timings for the two runs: 134ms for the former and 78ms for the latter. Those two timings also correspond to the timings from the unoptimized and optimized binaries respectively.
What makes the matter even more confusing is that the first project I used to compile the thing is a part of the game library (in script form) I am making and it refuses to compile the optimized binary, instead switching to the unoptimized one without informing me. I had to start a fresh project to get it to compile properly. It is a wonder the other test compiled properly.
So basically, something funky is going on here and I should look into switching fsianycpu.exe to fsi.exe as the default interpreter.
I tried the example code in the pastebin and I don't see the behavior you describe. This is the result from my performance run:
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2836 ms
reduce array, result 450015000, time 594 ms
for loop array, result 450015000, time 180 ms
reduce list, result 450015000, time 593 ms
fsi -O --exec .\Interactive.fsx
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2617 ms
reduce array, result 450015000, time 589 ms
for loop array, result 450015000, time 168 ms
reduce list, result 450015000, time 603 ms
It's expected that Seq.reduce would be the slowest, the for loop the fastest and that the reduce on list/array is roughly similar (this assumes locality of list elements which isn't guaranteed).
I rewrote your code to allow for longer runs without running out of memory and to improve the cache locality of the data. With short runs, the uncertainty of the measurements makes it hard to compare the data.
// Program.fs
module fs
let stopWatch =
    let sw = new System.Diagnostics.Stopwatch()
    sw.Start ()
    sw
let total = 300000000
let outer = 10000
let inner = total / outer
// Runs the action `outer` times; prints the first result and the total elapsed time
let timeIt (name : string) (a : unit -> 'T) : unit =
    let t = stopWatch.ElapsedMilliseconds
    let v = a ()
    for i = 2 to outer do
        a () |> ignore
    let d = stopWatch.ElapsedMilliseconds - t
    printfn "%s, result %A, time %d ms" name v d
let sumTest(args) =
    let numsList = [1..inner]
    let numsArray = [|1..inner|]
    printfn "Total iterations: %d, Outer: %d, Inner: %d" total outer inner
    let sumsSeqReduce () = Seq.reduce (+) numsList
    timeIt "reduce sequence of list" sumsSeqReduce
    let sumsArray () = Array.reduce (+) numsArray
    timeIt "reduce array" sumsArray
    let sumsLoop () =
        let mutable total = 0
        for i in 0 .. inner - 1 do
            total <- total + numsArray.[i]
        total
    timeIt "for loop array" sumsLoop
    let sumsListReduce () = List.reduce (+) numsList
    timeIt "reduce list" sumsListReduce
// Interactive.fsx
#load "Program.fs"
fs.sumTest [||]
PS. I am running on Windows with Visual Studio 2015. 32bit or 64bit seemed to make only marginal difference

Erlang and Redis: read performance

I ran into performance problems when trying to read 1M records from a Redis sorted set. I used ZSCAN with a cursor and a batch size of 5K.
The code was executed using Erlang R14 on the same machine that hosts Redis. Receiving one batch of 5K elements takes nearly 1 second. Unfortunately, I failed to compile Erlang R16 on this machine, but I think it does not matter.
For comparison, Node.js code with node_redis (hiredis parser) reads 1M records in 2 seconds. Python and PHP give the same results.
Am I doing something wrong?
Thanks in advance.
Here is my Erlang code:
-define(COUNT, 5000).
run() ->
    {_,Conn} = connect_to_redis(),
    read_from_redis(Conn).
connect_to_redis() ->
    eredis:start_link("host", 6379, 0, "pass").
%% eredis returns the cursor as a binary; <<"0">> means the scan is complete
read_from_redis(_Conn, <<"0">>) ->
    ok;
read_from_redis(Conn, Cursor) ->
    {ok, [Cursor1|_]} = eredis:q(Conn, ["ZSCAN", "if:push:sset:test", Cursor, "COUNT", ?COUNT]),
    read_from_redis(Conn, Cursor1).
read_from_redis(Conn) ->
    {ok, [Cursor|_]} = eredis:q(Conn, ["ZSCAN", "if:push:sset:test", 0, "COUNT", ?COUNT]),
    read_from_redis(Conn, Cursor).
9 out of 10 times, slowness like this is the result of a badly written driver rather than of the system. In this case, the ability to pipeline requests to Redis is going to be important. A client like redo can do pipelining and may be faster.
Also, beware measuring one process/thread only. If you want fast concurrent access, it is often balanced out against fast sequential access.
Switching to redis-erl decreased read time of 1M keys to 16 seconds. Not fast, but acceptable.
Here is the new code:
-define(COUNT, 200000).
run() ->
    redis:connect([{ip, "host"}, {port, 6379}, {db, 0}, {pass, "pass"}]),
    read_from_redis().
%% a returned cursor of <<"0">> means the scan is complete
read_from_redis(<<"0">>) ->
    ok;
read_from_redis(Cursor) ->
    [{ok, Cursor1}|_] = redis:q(["ZSCAN", "if:push:sset:test", Cursor, "COUNT", ?COUNT]),
    read_from_redis(Cursor1).
read_from_redis() ->
    [{ok, Cursor}|_] = redis:q(["ZSCAN", "if:push:sset:test", 0, "COUNT", ?COUNT]),
    read_from_redis(Cursor).
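The cursor loop that both Erlang versions implement can be sketched in Python as follows. This is only an illustration against an in-memory stand-in: `scan_all` and `make_fake_scan` are made-up names, and the fake does not reproduce real ZSCAN behavior (real cursors are opaque and COUNT is only a hint). It shows how the loop ends on cursor 0 and how the batch size determines the number of round trips.

```python
def scan_all(scan_page, count):
    """Follow cursors until the server hands back 0; return all items."""
    items = []
    cursor = 0
    round_trips = 0
    while True:
        cursor, page = scan_page(cursor, count)
        round_trips += 1
        items.extend(page)
        # Clients often return the cursor as bytes such as b"0".
        if int(cursor) == 0:
            return items, round_trips


def make_fake_scan(data):
    """Hypothetical stand-in for a SCAN-style command over a list."""
    def scan_page(cursor, count):
        start = int(cursor)
        end = start + count
        next_cursor = end if end < len(data) else 0
        return str(next_cursor).encode(), data[start:end]
    return scan_page


data = list(range(23))
items_small, trips_small = scan_all(make_fake_scan(data), 5)
items_large, trips_large = scan_all(make_fake_scan(data), 20)
```

With a per-request cost near 1 second, as measured in the question, the number of round trips dominates the total time, which is why a larger COUNT or pipelining helps.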

Retrieval of data from an ETS table

I know that lookup time is constant for ETS tables. But I also heard that the table is kept outside of the process and when retrieving data, it needs to be moved to the process heap. So, this is expensive. But then, how to explain this:
18> {Time, [[{ok, Binary}]]} = timer:tc(ets, match, [utilo, {a, '$1'}]).
19> size(Binary).
1.7 MB binary takes 0 time to be retrieved from the table!?
EDIT: After I saw Odobenus Rosmarus's answer, I decided to convert the binary to list. Here is the result:
1> {ok, B} = file:read_file("IMG_2171.JPG").
2> size(B).
3> L = binary_to_list(B).
4> length(L).
5> ets:insert(utilo, {a, L}).
6> timer:tc(ets, match, [utilo, {a, '$1'}]).
Now it takes 106000 microseconds to retrieve a list of 1986392 elements from the table, which is pretty fast, isn't it? Lists take 2 words per element, so the list occupies about 4 million words (roughly 16 MB on a 32-bit VM, or 32 MB on a 64-bit one).
EDIT 2: I started a thread on erlang-questions (http://groups.google.com/group/erlang-programming/browse_thread/thread/5581a8b5b27d4fe1) and it turns out that 0.1 second is pretty much the time it takes to do a memcpy() (moving the data to the process's heap). On the other hand, Odobenus Rosmarus's answer explains why retrieving the binary takes 0 time.
Binaries themselves (those larger than 64 bytes) are stored in a special heap, outside of the process heap.
So retrieving a binary from an ETS table moves only the 'ProcBin' part of the binary to the process heap (roughly, a pointer to the start of the binary in the shared binary memory, plus its size).

Immutable Dictionary overhead?

When using immutable dictionaries in F#, how much overhead is there when adding or removing entries?
Will it treat entire buckets as immutable, clone those, and only recreate the bucket whose item has changed?
Even if that is the case, it seems like there is a lot of copying that needs to be done in order to create the new dictionary(?)
I looked at the implementation of the F# Map<K,V> type and I think it is implemented as a functional AVL tree. It stores the values in the inner nodes of the tree as well as in the leaves, and for each node it makes sure that |height(left) - height(right)| <= 1.
I think that both the average and worst-case complexities are O(log(n)):
Insert: we need to clone all nodes on the path from the root to the newly inserted element, and the height of the tree is at most O(log(n)). On the "way back", the tree may need to rebalance each node, but that is also only O(log(n)).
Remove: this is similar. We find the element and then clone all nodes from the root to that element (rebalancing nodes on the way back to the root).
Note that other data-structures that don't need to rebalance all nodes from the root to the current one on insertion/deletion won't be really useful in the immutable scenario, because you need to create new nodes for the entire path anyway.
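The path-copying idea described above can be sketched in Python. This is a plain, unbalanced binary search tree with no AVL rebalancing, so it is not the F# Map implementation. The names `Node`, `insert` and `lookup` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, key, value):
    """Return a new tree that contains key; the input tree is never mutated.

    Only the nodes on the path from the root to the insertion point are
    recreated. Every subtree off that path is reused as-is.
    """
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)  # replace existing value


def lookup(node, key):
    """Return the value stored under key, or None if it is absent."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


old_tree = None
for k in [4, 2, 6, 1, 3, 5, 7]:
    old_tree = insert(old_tree, k, str(k))

new_tree = insert(old_tree, 8, "8")
```

After inserting key 8, only the nodes for 4, 6 and 7 are copied and one node is created. The whole left subtree and the node for 5 are the same objects in both trees, and the old tree still does not contain 8.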
A lot of the tree structure can be reused. I don't know the algorithmic complexity offhand, I would guess on average there's only like amortized logN 'waste'...
Why not try to write a program to measure? (We'll see if I can get motivated tonight to try it myself.)
Ok, here is something I hacked. I haven't decided if there's any useful data here or not.
open System
let rng = new Random()
// Fisher-Yates shuffle, in place
let shuffle (array : _[]) =
    let n = array.Length
    for x in 1..n do
        let i = n-x
        let j = rng.Next(i+1)
        let tmp = array.[i]
        array.[i] <- array.[j]
        array.[j] <- tmp
let TryTwoToThe k =
    let N = pown 2 k
    let a = Array.init N id
    // Builds a map of N entries, then drops it so that it becomes garbage
    let makeRandomTreeAndDiscard() =
        shuffle a
        let mutable m = Map.empty
        for i in 0..N-1 do
            m <- m.Add(i,i)
    for i in 1..20 do
        makeRandomTreeAndDiscard()
    for i in 1..20 do
        makeRandomTreeAndDiscard()
    for i in 1..20 do
        makeRandomTreeAndDiscard()
// run these as separate interactions
printfn "16"
TryTwoToThe 16
printfn "17"
TryTwoToThe 17
printfn "18"
TryTwoToThe 18
When I run this in FSI on my box, I get
--> Timing now on
Real: 00:00:08.079, CPU: 00:00:08.062, GC gen0: 677, gen1: 30, gen2: 1
Real: 00:00:17.144, CPU: 00:00:17.218, GC gen0: 1482, gen1: 47, gen2: 4
Real: 00:00:37.790, CPU: 00:00:38.421, GC gen0: 3400, gen1: 1059, gen2: 17
which suggests the memory may be scaling super-linearly but not too badly. I am presuming that the gen0 collections are roughly a good proxy for the 'waste' of rebalancing the tree. But it is late so I am not sure if I have thought this through well enough. :)
