Single-threaded program profiles 15% of runtime in semaphore_wait_trap (F#)

On Mac OS, using Mono, if I compile and profile the program below, I get the following results:
% fsharpc --nologo -g foo.fs -o foo.exe
% mono --profile=default:stat foo.exe
...
Statistical samples summary
Sample type: cycles
Unmanaged hits: 336 (49.1%)
Managed hits: 349 (50.9%)
Unresolved hits: 1 ( 0.1%)
Hits % Method name
154 22.48 Microsoft.FSharp.Collections.SetTreeModule:height ...
105 15.33 semaphore_wait_trap
74 10.80 Microsoft.FSharp.Collections.SetTreeModule:add ...
...
Note the second entry, semaphore_wait_trap.
Here is the program:
[<EntryPoint>]
let main args =
    let s = seq { 1..1000000 } |> Set.ofSeq
    s |> Seq.iter (fun _ -> ())
    0
I looked in the source for the Set module, but I didn't find any (obvious) locking.
Is my single-threaded program really spending 15% of its execution time messing with semaphores? If it is, can I make it not do that and get a performance boost?

According to Instruments, it's sgen/gc calling semaphore_wait_trap:
Sgen is documented as stopping all other threads while it collects:
Before doing a collection (minor or major), the collector must stop all running threads so that it can have a stable view of the current state of the heap, without the other threads changing it
In other words, when your code tries to allocate memory and a GC is required, the time the collection takes shows up under semaphore_wait_trap, because that's where your application thread is waiting. I suspect the Mono profiler doesn't profile the GC thread itself, so you don't see the time in the collection code.
The germane output then is really the GC summary:
GC summary
GC resizes: 0
Max heap size: 0
Object moves: 1002691
Gen0 collections: 123, max time: 14187us, total time: 354803us, average: 2884us
Gen1 collections: 3, max time: 41336us, total time: 60281us, average: 20093us
If you want your code to run faster, don't collect as often.
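One lever for that on Mono's sgen is the nursery size: a larger nursery means fewer minor collections. A minimal sketch, using the standard MONO_GC_PARAMS environment variable (the right size depends on your allocation pattern):

MONO_GC_PARAMS=nursery-size=64m mono --profile=default:stat foo.exe

Cutting the allocation rate itself, where possible, has the same effect.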
Understanding the actual cost of collection can be done through dtrace since sgen has dtrace probes.
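A minimal sketch of such a measurement, assuming the gc-begin/gc-end probes from Mono's DTrace provider (run with root privileges):

sudo dtrace -n '
mono$target:::gc-begin { self->ts = timestamp; }
mono$target:::gc-end   { @total_gc_ns = sum(timestamp - self->ts); self->ts = 0; }
' -c 'mono foo.exe'

This sums the wall-clock nanoseconds the traced process spends inside collections.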

Related

All allocated Tarantool 2.3 memtx space is occupied

Today the space allocated for Tarantool memtx ran out (memtx_memory = 5GB): RAM usage really was at 5GB, and after restarting Tarantool more than 4GB was freed.
What could be filling up the RAM? Which settings could this be related to?
box.slab.info()
---
- items_size: 1308568936
  items_used_ratio: 91.21%
  quota_size: 5737418240
  quota_used_ratio: 13.44%
  arena_used_ratio: 89.2%
  items_used: 1193572600
  quota_used: 1442840576
  arena_size: 1442840576
  arena_used: 1287551224
box.info()
---
- version: 2.3.2-26-g38e825b
  id: 1
  ro: false
  uuid: d9cb7d78-1277-4f83-91dd-9372a763aafa
  package: Tarantool
  cluster:
    uuid: b6c32d07-b448-47df-8967-40461a858c6d
  replication:
    1:
      id: 1
      uuid: d9cb7d78-1277-4f83-91dd-9372a763aafa
      lsn: 89759968433
    2:
      id: 2
      uuid: 77557306-8e7e-4bab-adb1-9737186bd3fa
      lsn: 9
    3:
      id: 3
      uuid: 28bae7dd-26a8-47a7-8587-5c1479c62311
      lsn: 0
    4:
      id: 4
      uuid: 6a09c191-c987-43a4-8e69-51da10cc3ff2
      lsn: 0
  signature: 89759968442
  status: running
  vinyl: []
  uptime: 606297
  lsn: 89759968433
  sql: []
  gc: []
  pid: 32274
  memory: []
  vclock: {2: 9, 1: 89759968433}
cat /etc/tarantool/instances.available/my_app.lua
...
memtx_memory = 5 * 1024 * 1024 * 1024,
...
Tarantool version 2.3.2, OS CentOS 7
(screenshot: https://i.stack.imgur.com/onV44.png)
It's the result of a process called fragmentation.
The simple reason for it is the following situation:
you have some area allocated for tuples
you put in one tuple, and then another one
when the first tuple needs to grow, the database has to relocate it to another place with enough capacity; after that, the first tuple's old slot is free, but a new, larger slot has been taken for the extended tuple
You can decrease the fragmentation factor by increasing the tuple size for your case; a sketch of the relevant settings follows below.
Choose the size by estimating your typical data, or find the optimal size from metrics of your workload over time.
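A minimal sketch of where such tuning lives, assuming the standard Tarantool 2.x box.cfg options (the values shown are only illustrative; derive real ones from your data):

-- my_app.lua
box.cfg{
    memtx_memory = 5 * 1024 * 1024 * 1024,
    -- illustrative values: both options change how tuples map onto slab size classes
    memtx_min_tuple_size = 64,  -- smallest allocation unit for tuples
    slab_alloc_factor = 1.05,   -- growth factor between slab size classes
}

Watch box.slab.info() over time to see how a given setting affects items_used_ratio and arena_used_ratio.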

How can I find out the amount of available memory on the server in Elixir?

I use third-party scripts from my Elixir app. How can I find out how much memory is available on the machine my app runs on? I don't need the memory available to the Erlang VM, but the whole computer's memory.
A platform-agnostic way (:memsup is part of OTP's os_mon application):
:memsup.start_link
:memsup.get_system_memory_data
[
system_total_memory: 16754499584,
free_swap: 4194299904,
total_swap: 4194299904,
cached_memory: 931536896,
buffered_memory: 113426432,
free_memory: 13018746880,
total_memory: 16754499584
]
So to get the total memory in MB:
mbyte = :math.pow(1024, 2) |> Kernel.trunc
:memsup.get_system_memory_data
|> Keyword.get(:system_total_memory)
|> Kernel.div(mbyte)
The most obvious (but a little bit cumbersome) way I found is to call vmstat from the command line and parse its results:
System.cmd("vmstat", ["-s", "-SM"])
|> elem(0)
|> String.trim()
|> String.split()
|> List.first()
|> String.to_integer()
|> Kernel.*(1_048_576) # convert mebibytes (vmstat -S M) to bytes
vmstat is a command that works on Ubuntu and returns output like this:
3986 M total memory
3736 M used memory
3048 M active memory
525 M inactive memory
249 M free memory
117 M buffer memory
930 M swap cache
0 M total swap
0 M used swap
0 M free swap
1431707 non-nice user cpu ticks
56301 nice user cpu ticks
232979 system cpu ticks
3267984 idle cpu ticks
84908 IO-wait cpu ticks
0 IRQ cpu ticks
15766 softirq cpu ticks
0 stolen cpu ticks
4179948 pages paged in
6422812 pages paged out
0 pages swapped in
0 pages swapped out
35819291 interrupts
145676723 CPU context switches
1490259647 boot time
67936 forks
This works on Ubuntu and should work on every Linux.

Is F# Interactive supposed to be much slower than compiled?

I've been using F# for nearly six months and was so sure that F# Interactive should have the same performance as compiled code that, when I finally bothered to benchmark it, I was convinced I had found some kind of compiler bug. Though now it occurs to me that I should have checked here first before opening an issue.
For me it is roughly 3x slower, and the optimization switch does not seem to be doing anything at all.
Is this supposed to be standard behavior? If so, I really got trolled by the #time directive. I have the timings for how long it takes to sum 100M elements in this Reddit thread.
Update:
Thanks to FuleSnabel, I uncovered some things.
I tried running the example script from both fsianycpu.exe (which is the default F# Interactive) and fsi.exe, and I am getting different timings for the two runs: 134 ms for the former and 78 ms for the latter. Those two timings also correspond to the timings from the unoptimized and optimized binaries, respectively.
What makes the matter even more confusing is that the first project I used to compile the thing is part of the game library (in script form) I am making, and it refuses to compile an optimized binary, instead silently falling back to the unoptimized one. I had to start a fresh project to get it to compile properly. It is a wonder the other test compiled correctly.
So basically, something funky is going on here, and I should look into switching from fsianycpu.exe to fsi.exe as the default interpreter.
I tried the example code in the pastebin and I don't see the behavior you describe. This is the result from my performance run:
.\bin\Release\ConsoleApplication3.exe
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2836 ms
reduce array, result 450015000, time 594 ms
for loop array, result 450015000, time 180 ms
reduce list, result 450015000, time 593 ms
fsi -O --exec .\Interactive.fsx
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2617 ms
reduce array, result 450015000, time 589 ms
for loop array, result 450015000, time 168 ms
reduce list, result 450015000, time 603 ms
It's expected that Seq.reduce would be the slowest and the for loop the fastest, and that reduce on list/array is roughly similar (this assumes locality of list elements, which isn't guaranteed).
I rewrote your code to allow for longer runs without running out of memory and to improve cache locality of the data. With short runs, the uncertainty of the measurements makes it hard to compare the data.
Program.fs:
module fs

let stopWatch =
    let sw = new System.Diagnostics.Stopwatch()
    sw.Start ()
    sw

let total = 300000000
let outer = 10000
let inner = total / outer

let timeIt (name : string) (a : unit -> 'T) : unit =
    let t = stopWatch.ElapsedMilliseconds
    let v = a ()
    for i = 2 to outer do
        a () |> ignore
    let d = stopWatch.ElapsedMilliseconds - t
    printfn "%s, result %A, time %d ms" name v d

[<EntryPoint>]
let sumTest(args) =
    let numsList = [1..inner]
    let numsArray = [|1..inner|]
    printfn "Total iterations: %d, Outer: %d, Inner: %d" total outer inner
    let sumsSeqReduce () = Seq.reduce (+) numsList
    timeIt "reduce sequence of list" sumsSeqReduce
    let sumsArray () = Array.reduce (+) numsArray
    timeIt "reduce array" sumsArray
    let sumsLoop () =
        let mutable total = 0
        for i in 0 .. inner - 1 do
            total <- total + numsArray.[i]
        total
    timeIt "for loop array" sumsLoop
    let sumsListReduce () = List.reduce (+) numsList
    timeIt "reduce list" sumsListReduce
    0
Interactive.fsx:
#load "Program.fs"
fs.sumTest [||]
PS. I am running on Windows with Visual Studio 2015. 32-bit vs. 64-bit seemed to make only a marginal difference.

Delphi Shared Module for Apache Runs Out of Memory

I hope someone can shed some light on an issue I am having.
I am writing an Apache shared object module that acts as a server for my app. Client or multiple Clients make SOAP request to this module, module talks to database and returns SOAP responses to client(s).
Task Manager shows the httpd processes (handling incoming requests from clients) using about 35000K each, but every once in a while one of the httpd processes will grow in memory/CPU usage, over time reach the cap of 2GB, and then crash. The server reports an "Internal Server Error 500" in this case. (screenshot added)
I use FastMM to check for memory leaks, and it does produce a log, but no memory leaks are reported. To make sure I was using FastMM properly, I deliberately introduced memory leaks, and FastMM logged them. So my assumption is that I do not have a memory leak, but that somehow memory gets consumed until the 2GB threshold is reached and is not released until I manually restart Apache.
Then I started taking snapshots using FastMM's LogMemoryManagerStateToFile call, which captures the memory manager's state at the time of the call. LogMemoryManagerStateToFile creates a file with the following information, but I don't understand how it is helpful to me, other than telling me that, for example, 8320 UnicodeString allocations are holding 676512 bytes.
Which of these are not released properly?
Note that this information is a shortened version of the created file, and it covers just one single method call. There could be many calls to different methods:
FastMM State Capture:
---------------------
1793K Allocated
7165K Overhead
20% Efficiency
Usage Detail:
676512 bytes: UnicodeString x 8320
152576 bytes: TTMQuery x 256
144312 bytes: TParam x 1718
134100 bytes: TDBQuery x 225
107444 bytes: Unknown x 1439
88368 bytes: TSDStringList x 1052
82320 bytes: TStringList x 980
80000 bytes: TList x 4000
53640 bytes: TTMStoredProc x 90
47964 bytes: TFieldList x 571
47964 bytes: TFieldDefList x 571
41112 bytes: TFields x 1142
38828 bytes: TFieldDefs x 571
20664 bytes: TParams x 574
...
...
...
4 bytes: TChunktIME x 1
4 bytes: TChunkpHYs x 1
4 bytes: TChunkgAMA x 1
Additional Information:
-----------------------
Finalizing MyMethodName
How is this data helpful for figuring out where so much memory is being consumed, in terms of locating it in the code and fixing it?
Much appreciated,

Erlang crash dump: no more index entries

I have a problem with Erlang.
One of my Erlang nodes crashes and generates erl_crash.dump with the reason no more index entries in atom_tab (max=1048576).
I checked the dump file and found a lot of atoms of the form 'B\2209\000... (about 1000000 entries):
=proc:<0.11744.7038>
State: Waiting
Name: 'B\2209\000d\022D.
Spawned as: proc_lib:init_p/5
Spawned by: <0.5032.0>
Started: Sun Feb 23 05:23:27 2014
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Reductions: 1992
Stack+heap: 1597
OldHeap: 1597
Heap unused: 918
OldHeap unused: 376
Program counter: 0x0000000001eb7700 (gen_fsm:loop/7 + 140)
CP: 0x0000000000000000 (invalid)
arity = 0
Do you have any experience with what these might be?
Atoms
By default, the maximum number of atoms is 1048576. This limit can be raised or lowered using the +t option.
Note: an atom refers into an atom table which also consumes memory. The atom text is stored once for each unique atom in this table. The atom table is not garbage-collected.
I think that your program produces a lot of atoms, and the number of atoms has reached the limit.
You can use the +t option to raise the atom limit of the Erlang VM when you start your Erlang node.
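For example (a sketch; 5000000 is only an illustrative value):

erl +t 5000000

You can also watch how close you are to the limit at runtime:

erlang:system_info(atom_count). %% atoms currently in the table (OTP 20+)
erlang:system_info(atom_limit). %% the configured maximum (OTP 20+)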
So it tells you that you are generating atoms. Somewhere there is a list_to_atom/1 call with a variable argument, and because you have processes with this sort of name, you are probably also calling register/2 with these names. It may be your own code or some third-party module you use. It's bad behavior: don't do it, and don't use modules that do.
To be honest, I can imagine a design where I would do this intentionally, but it is a very special case and obviously not the case here.
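When the names come from external input, a common defensive pattern is list_to_existing_atom/1, which fails instead of growing the atom table. A sketch (UserInput is a placeholder):

%% Dangerous: mints a fresh atom for every distinct input,
%% and the atom table is never garbage-collected.
Name = list_to_atom(UserInput),

%% Safer: raises badarg unless the atom already exists,
%% so the table cannot grow without bound.
SafeName = list_to_existing_atom(UserInput),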
