Making OCaml's Garbage Collector more predictable - memory

I have developed a certain tool using OCaml (which was extracted from Coq proofs). I want to do some experiments to empirically observe the memory consumption by this OCaml program.
However, OCaml is a garbage-collected language, which makes its memory consumption somewhat unpredictable and hard to explain. I would like to somehow tune the garbage collector so that it behaves in a more predictable, preferably eager, fashion.
Is there some way to best tune the Garbage Collector to my needs? For example, is it possible to plug a different garbage collector into OCaml? Perhaps a more eager one, or something which uses a different mechanism, like reference counting.
One of the solutions I have found so far is to manually trigger the Garbage Collector by using Gc.minor () and Gc.major () from the Gc module. I'd appreciate other suggestions.

Related

Memory Consistency on iOS

I was reading this article on general memory consistency and it raised some questions for me about memory consistency on iOS. It's a really interesting article, but it doesn't go into too many specifics about general platforms. The article mentions that languages such as C++ (and I'm guessing Objective-C/Cocoa Touch APIs) use sequential consistency for data-race-free programs to remove many of the weird behaviors that may occur when trying to write to the same memory.
So, say if I were to use Grand Central Dispatch to create a bunch of different threads, would it even be possible to declare and use global variables that are stored on the same location in memory? If so, how would the writing and reading process work? Is there a write buffer? I would test out some of this stuff if I could, but at the moment I cannot. Anything to help me understand this concept would be appreciated.

Are there any tools that assist in porting F# to OCaml?

Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic. I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation), and
b) OCaml gets a good complete port to the LLVM backend -
I will port it from F# to OCaml. I have avoided as many .NET-specific libraries as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
Anyone know of such things, either finished or in progress?
Thanks deeply!
To answer your question in an answer - as far as I know, there are no such tools and I do not think it is likely somebody will create them.
Although F# is inspired by OCaml, it has evolved a lot and is different in a number of ways (see this SO discussion), so automatic conversion is not trivial. Even if somebody did that, it would be more like compilation to hard-to-read OCaml than conversion to idiomatic code that you can later continue working on.
To add a few general comments, when you speak about "real-time" I imagine controlling some robot in a factory dealing with dangerous stuff or an airplane control. In these areas, concerns about GC are certainly valid. However, I do not think games are necessarily "real-time". You need good performance, that's for sure, but people have been writing games with .NET and F# quite happily. For some F# examples, see:
... a nice blog with a couple of game samples (that you can actually try & buy)
a 3D airplane shooter game that also looks fairly realistic
and there is also a book that uses games to explain F#
These are probably simpler than what you're aiming for, but it may be good enough to show that writing games using GC is doable.
Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic.
A few points here:
Incremental GCs are not the only way to get low pause times. Concurrent GCs like VCGC do the work in bulk but do it concurrently with mutators running, e.g. the VCGC implementation I described in the non-free article here was running with sub-millisecond pause times.
Incremental GC does not necessarily mean low pause times. For example, OCaml's GC typically incurs 10ms pauses and can incur arbitrarily-long pauses when it encounters a deep thread stack or long array in the heap.
I have measured typical pause times of 10ms with OCaml and 30ms with F# on .NET 3. With a simple implementation I was able to build a fault tolerant server in F# from scratch that handled 20k msgs/s with 50% of latencies under 114us and 95% under 500us.
I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation), and
I wouldn't give up on the platform if your first working version has unacceptable latency. There are lots of things you can do to bring the max latency down.
b) OCaml gets a good complete port to the LLVM backend -
I seriously doubt OCaml will ever get what I'd consider to be a "good complete port to the LLVM backend". They'll just retarget LLVM with the current typeless IR and it won't do much better than the current ocamlopt compiler because LLVM isn't designed to optimize that kind of workload.
I will port it from F# to OCaml. I have avoided as many .NET-specific libraries as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
No automated tools but I've ported hundreds of thousands of lines of code between OCaml and F# now and it is generally very easy because most code is written in the core ML subset of both languages.

What makes Erlang unsuitable for computationally expensive work?

At the beginning of Programming Erlang, there is the following:
What makes Erlang the best choice for your project? It depends on what you are looking
to build. If you are looking into writing a number-crunching application, a graphics-
intensive system, or client software running on a mobile handset, then sorry, you
bought the wrong book.
The implied message is that Erlang isn't suitable for computationally expensive work. What makes Erlang so unsuitable, or have I misinterpreted?
Erlang shines for I/O-bound applications, that is, problems whose limiting factor is the latency and throughput of I/O operations rather than the rate at which instructions can be pushed through a CPU pipeline. Web servers and databases are good examples of I/O-bound applications: the limiting factors are likely to be the disk and network rather than the CPU. Traditional "compute-heavy" applications, by contrast, include cryptographic tools and scientific simulations.
As to why Erlang fails to match languages like C and Fortran when it comes to computationally intensive problems, we must consider things like code generation and cache-friendliness... I'll give it a try:
Code generation: Normally when you start an Erlang program, it will be run in BEAM, a virtual machine based on threaded code. While BEAM performs well enough for most purposes, it has much greater overhead per logical "instruction" than does the kind of code generated by a modern optimizing C compiler. The HiPE project provides a native code compiler for Erlang that was integrated into the main OTP source tree a couple of years ago*. While it certainly improves Erlang's number crunching capacity, it will still have a hard time matching a well-written C or Fortran program.
Cache-friendliness: The memory system is a major bottleneck in modern computers: a read from main memory can take hundreds of processor cycles! To solve this problem, CPU designers introduce several levels of cache to hide the memory latency. Caches exploit two key properties of computer programs: temporal and spatial locality -- that is, regions of memory that were recently referenced (and nearby regions) are likely to be referenced again. Languages like C and Fortran offer a great deal of control over where and how memory is allocated, enabling the programmer to tune algorithms to play nicely with the caches. The same doesn't generally hold for dynamic languages like Erlang, where memory allocation is hidden from the programmer and handled automatically by the virtual machine.
Code size: The argument about spatial locality holds for code as well; Erlang code, whether in native or bytecode form, will generally be larger than the corresponding compiled C code. This leads to more frequent misses in the instruction cache.
Bear in mind that this is just the tip of the iceberg, and that I am by no means an expert in Erlang or language implementation. Don't let the fact that Erlang will probably never run scientific simulations scare you, though; for many applications, it's an absolutely fantastic language.
*HiPE is available through the erlang-base-hipe package in Debian, or ./configure --enable-hipe from a source tarball.
It's just that C code might be considerably faster most of the time. Erlang is great at fault tolerance, distributed computing, and concurrency. Programmers tend to be equally proficient in writing Erlang or other languages, but if you want speed, use C or C++, maybe from an Erlang port, so this code is usable from your own Erlang application.
Erlang is a concurrent functional programming language designed for programming large industrial real-time systems. Nothing specifically prevents you from developing "a number-crunching application or a graphics-intensive system", but the language shines in real-time event processing.

Is it possible that F# will be optimized more than other .Net languages in the future?

Is it possible that Microsoft will be able to make F# programs, either at VM execution time, or more likely at compile time, detect that a program was built with a functional language and automatically parallelize it better?
Right now I believe there is no such effort to try and execute a program that was built as single threaded program as a multi threaded program automatically.
That is to say, the developer would code a single threaded program. And the compiler would spit out a compiled program that is multi-threaded complete with mutexes and synchronization where needed.
Would these optimizations be visible in task manager in the process thread count, or would it be lower level than that?
I think this is unlikely in the near future. And if it does happen, I think it would be more likely at the IL level (assembly rewriting) rather than language level (e.g. something specific to F#/compiler). It's an interesting question, and I expect that some fine minds have been looking at this and will continue to look at this for a while, but in the near-term, I think the focus will be on making it easier for humans to direct the threading/parallelization of programs, rather than just having it all happen as if by magic.
(Language features like F# async workflows, and libraries like the task-parallel library and others, are good examples of near-term progress here; they can do most of the heavy lifting for you, especially when your program is more declarative than imperative, but they still require the programmer to opt-in, do analysis for correctness/meaningfulness, and probably make slight alterations to the structure of the code to make it all work.)
Anyway, that's all speculation; who can say what the future will bring? I look forward to finding out (and hopefully making some of it happen). :)
Since F# is derived from OCaml and OCaml compilers can optimize your programs far better than other compilers, it probably could be done.
I don't believe it is possible to autovectorize code in a generally-useful way and the functional programming facet of F# is essentially irrelevant in this context.
The hardest problem is not detecting when you can perform subcomputations in parallel, it is determining when that will not degrade performance, i.e. when the subtasks will take sufficiently long to compute that it is worth taking the performance hit of a parallel spawn.
We have researched this in detail in the context of scientific computing and we have adopted a hybrid approach in our F# for Numerics library. Our parallel algorithms, built upon Microsoft's Task Parallel Library, require an additional parameter that is a function giving the estimated computational complexity of a subtask. This allows our implementation to avoid excessive subdivision and ensure optimal performance. Moreover, this solution is ideal for the F# programming language because the function parameter describing the complexity is typically an anonymous first-class function.
Cheers,
Jon Harrop.
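The idea in the answer above, a parallel algorithm that takes an extra function estimating the cost of a subtask, can be sketched roughly as follows. This is Python, not F#, and it is not the F# for Numerics API: `split_by_cost`, `parallel_sum`, `cost` and `threshold` are hypothetical names chosen for illustration. The sketch assumes the cost estimate is cheap to compute and only shows the shape of the interface; Python threads will not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_by_cost(items, cost, threshold):
    """Halve `items` until each chunk's estimated cost is at most `threshold`."""
    # `cost` is the caller-supplied estimate (hypothetical parameter).
    if len(items) <= 1 or cost(items) <= threshold:
        return [items]
    mid = len(items) // 2
    return (split_by_cost(items[:mid], cost, threshold)
            + split_by_cost(items[mid:], cost, threshold))


def parallel_sum(items, work, cost, threshold, max_workers=4):
    """Sum work(x) over items, spawning one task per cost-bounded chunk."""
    chunks = split_by_cost(items, cost, threshold)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each chunk is summed sequentially; only whole chunks run as tasks.
        partials = pool.map(lambda chunk: sum(work(x) for x in chunk), chunks)
        return sum(partials)


if __name__ == "__main__":
    data = list(range(1, 101))
    # Here the cost estimate is simply the chunk length, passed as a function.
    print(parallel_sum(data, lambda x: x * x, len, threshold=25))
```

Raising `threshold` yields fewer, larger chunks, which is the knob that avoids excessive subdivision when subtasks are too small to be worth a spawn.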
I think the question misses the point of the .NET architecture-- F#, C# and VB (etc.) all get compiled to IL, which then gets compiled to machine code via the JIT compiler. The fact that a program was written in a functional language isn't relevant-- if there are optimizations (like tail recursion, etc.) available to the JIT compiler from the IL, the compiler should take advantage of it.
Naturally, this doesn't mean that writing functional code is irrelevant-- obviously, there are ways to write IL which will parallelize better-- but many of these techniques could be used in any .NET language.
So, there's no need to flag the IL as coming from F# in order to examine it for potential parallelism, nor would such a thing be desirable.
There's active research on autoparallelization and autovectorization for a variety of languages. And one could hope (since I really like F#) that they would conceive a way to determine whether a "pure" side-effect-free subset was used and then parallelize that.
Also, since Simon Peyton-Jones, the father of Haskell, is working at Microsoft, I have a hard time not believing there's some fantastic stuff coming.
It's possible but unlikely. Microsoft spends most of its time supporting and implementing features requested by their biggest clients. That usually means C#, VB.Net, and C++ (not necessarily in that order). F# doesn't seem like it's high on the list of priorities.
Microsoft is currently developing 2 avenues for parallelisation of code: PLINQ (Parallel LINQ, which owes much to functional languages) and the Task Parallel Library (TPL) which was originally part of Robotics Studio. A beta of PLINQ is available here.
I would put my money on PLINQ becoming the norm for auto-parallelisation of .NET code.

Functional programming and multicore architecture

I've read somewhere that functional programming is suitable to take advantage of multi-core trend in computing. I didn't really get the idea. Is it related to the lambda calculus and von neumann architecture?
Functional programming minimizes or eliminates side effects and thus is better suited to distributed programming, i.e. multicore processing.
In other words, lots of pieces of the puzzle can be solved independently on separate cores simultaneously without having to worry about one operation affecting another nearly as much as you would in other programming styles.
One of the hardest things about dealing with parallel processing is locking data structures to prevent corruption. If two threads were to mutate a data structure at once without having it locked perfectly, anything from invalid data to a deadlock could result.
In contrast, functional programming languages tend to emphasize immutable data. Any state is kept separate from the logic, and once a data structure is created it cannot be modified. The need for locking is greatly reduced.
Another benefit is that some processes that parallelize very easily, like iteration, are abstracted to functions. In C++, you might have a for loop that runs some data processing over each item in a list. But the compiler has no way of knowing if those operations may be safely run in parallel -- maybe the result of one depends on the one before it. When a function like map() or reduce() is used, the compiler can know that there is no dependency between calls. Multiple items can thus be processed at the same time.
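As a rough illustration of the contrast drawn above, here is a minimal Python sketch. `square`, `running_total` and `parallel_map` are hypothetical names. Note one assumption that differs from the answer: Python does not verify that the mapped function is free of side effects, so here the independence of the calls is a promise made by the programmer rather than something a compiler knows.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    # Pure: the result depends only on the argument, so calls are independent.
    return x * x


def running_total(values):
    # Each iteration needs the total from the previous one, so the loop
    # carries a dependency and cannot simply be split across workers.
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


def parallel_map(fn, values, max_workers=4):
    # Calls to fn may run in any order on different workers;
    # Executor.map still returns the results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, values))


if __name__ == "__main__":
    print(parallel_map(square, [1, 2, 3, 4]))
    print(running_total([1, 2, 3, 4]))
```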
I've read somewhere that functional programming is suitable to take advantage of multi-core trend in computing... I didn't really get the idea. Is it related to the lambda calculus and von neumann architecture?
The argument behind the belief you quoted is that purely functional programming controls side effects which makes it much easier and safer to introduce parallelism and, therefore, that purely functional programming languages should be advantageous in the context of multicore computers.
Unfortunately, this belief was long since disproven for several reasons:
The absolute performance of purely functional data structures is poor. So purely functional programming is a big initial step in the wrong direction in the context of performance (which is the sole purpose of parallel programming).
Purely functional data structures scale badly because they stress shared resources including the allocator/GC and main memory bandwidth. So parallelized purely functional programs often obtain poor speedups as the number of cores increases.
Purely functional programming renders performance unpredictable. So real purely functional programs often see performance degradation when parallelized because granularity is effectively random.
For example, the bastardized two-line quicksort often cited by the Haskell community typically runs thousands of times slower than a real in-place quicksort written in a more conventional language like F#. Moreover, although you can easily parallelize the elegant Haskell program, you are unlikely to see any performance improvement whatsoever because all of the unnecessary copying makes a single core saturate the entire main memory bandwidth of a multicore machine, rendering parallelism worthless. In fact, nobody has ever managed to write any kind of generic parallel sort in Haskell that is competitively performant. The state-of-the-art sorts provided by Haskell's standard library are typically hundreds of times slower than conventional alternatives.
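To make the distinction in the paragraph above concrete, here is a sketch of the two styles of quicksort in Python rather than Haskell or F#. The function names are hypothetical, and the sketch makes no claim about relative speed in Python; it only shows that the first version allocates new lists at every level of recursion while the second rearranges a single list by swapping.

```python
def quicksort_copying(xs):
    """Functional style: returns a new sorted list and never mutates xs."""
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]    # fresh list
    larger = [x for x in rest if x >= pivot]    # another fresh list
    return quicksort_copying(smaller) + [pivot] + quicksort_copying(larger)


def quicksort_in_place(xs, lo=0, hi=None):
    """Imperative style: sorts xs itself using swaps (Lomuto partition)."""
    if hi is None:
        hi = len(xs) - 1
    if lo >= hi:
        return
    pivot = xs[hi]
    i = lo
    for j in range(lo, hi):
        if xs[j] < pivot:
            xs[i], xs[j] = xs[j], xs[i]
            i += 1
    xs[i], xs[hi] = xs[hi], xs[i]   # put the pivot in its final position
    quicksort_in_place(xs, lo, i - 1)
    quicksort_in_place(xs, i + 1, hi)


if __name__ == "__main__":
    data = [5, 2, 9, 1, 5, 6]
    print(quicksort_copying(data))   # data itself is left untouched
    quicksort_in_place(data)         # data is now sorted in place
    print(data)
```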
However, the more common definition of functional programming as a style that emphasizes the use of first-class functions does actually turn out to be very useful in the context of multicore programming because this paradigm is ideal for factoring parallel programs. For example, see the new higher-order Parallel.For function from the System.Threading.Tasks namespace in .NET 4.
When there are no side effects the order of evaluation does not matter. It is then possible to evaluate expressions in parallel.
The basic argument is that it is difficult to automatically parallelize languages like C/C++/etc because functions can set global variables. Consider two function calls:
a = foo(b, c);
d = bar(e, f);
Though foo and bar have no arguments in common and one does not depend on the return code of the other, they nonetheless might have dependencies because foo might set a global variable (or other side effect) which bar depends upon.
Functional languages guarantee that foo and bar are independent: there are no globals, and no side effects. Therefore foo and bar could be safely run on different cores, automatically, without programmer intervention.
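The hidden dependency described above can be reproduced in a few lines. This is a Python sketch, not C, and every name in it (`counter`, `foo_impure`, `bar_impure`, `foo_pure`, `bar_pure`, `run_impure`) is hypothetical. It shows that swapping the order of the two impure calls changes the result, while the pure versions give the same answer in any order.

```python
counter = 0  # hypothetical global state shared by the impure functions


def foo_impure(b, c):
    global counter
    counter += 1              # hidden side effect
    return b + c


def bar_impure(e, f):
    return e * f + counter    # silently depends on what foo_impure did


def foo_pure(b, c):
    return b + c              # depends only on its arguments


def bar_pure(e, f):
    return e * f              # depends only on its arguments


def run_impure(foo_first):
    """Call both impure functions in the given order, starting from counter = 0."""
    global counter
    counter = 0
    if foo_first:
        a = foo_impure(1, 2)
        d = bar_impure(3, 4)
    else:
        d = bar_impure(3, 4)
        a = foo_impure(1, 2)
    return a, d


if __name__ == "__main__":
    print(run_impure(True), run_impure(False))   # the two orders disagree on d
    print(foo_pure(1, 2), bar_pure(3, 4))        # same values in any order
```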
All the answers above go to the key idea that "no shared mutable storage" is a key enabler to execute pieces of a program in parallel. It does not really solve the equally hard problem of finding things to execute in parallel. But the typical clearer expressions of functionality in functional languages do make it theoretically easier to extract parallelism from a sequential expression.
In practice, I think the "no shared mutable storage" property of languages based on garbage collection and copy-on-change semantics makes them easier to add threads to. The best example is probably Erlang, which combines near-functional semantics with explicit threads.
This is a little bit of a vague question. One perk of multi-core CPUs is that you can run a functional program and let it plug away serially without worrying about affecting any computing going on that has to do with other functions the machine is carrying out.
The difference between a multi-U server and a multi-core CPU in a server or PC is the speed savings you get by having it on the same bus, allowing better and faster communication between the cores.
edit: I should probably qualify this post by saying that in most of the scripting I do, with or without multiple cores, I rarely see a problem in getting my data through hackish parallelizing, such as running multiple small scripts at once in my script so I'm not slowed down by things like waiting for URLs to load and what not.
double edit: Furthermore, a lot of functional programming languages have had forked parallel variants for decades. These better utilize parallel computation with some speed improvement, but they never really caught on.
Omitting any technical/scientific terms, the reason is that functional programs don't share data. Data is copied and transferred among functions, thus there is no shared data in the application.
And shared data is what causes half the headaches with multithreading.
The book Programming Erlang: Software for a Concurrent World by Joe Armstrong (the creator of Erlang) talks quite a bit about using Erlang for multicore(/multiprocessor) systems. As the Wikipedia article states:
Creating and managing processes is trivial in Erlang, whereas threads are considered a complicated and error-prone topic in most languages. Though all concurrency is explicit in Erlang, processes communicate using message passing instead of shared variables, which removes the need for locks.
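The message-passing style described in the quote can be imitated, loosely, in Python. This is only an analogy: it uses an ordinary thread and `queue.Queue` mailboxes instead of Erlang processes, and `worker` and `sum_via_messages` are hypothetical names. The point it illustrates is that the running total is owned by one worker and changed only in response to messages, so the program's own code takes no explicit locks.

```python
import queue
import threading


def worker(inbox, outbox):
    # The total is local to this worker; no other thread can read or write it.
    total = 0
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel message: report the result and stop
            outbox.put(total)
            return
        total += msg


def sum_via_messages(values):
    """Send each value to the worker as a message and wait for its reply."""
    inbox, outbox = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(inbox, outbox))
    t.start()
    for v in values:
        inbox.put(v)
    inbox.put(None)
    result = outbox.get()
    t.join()
    return result


if __name__ == "__main__":
    print(sum_via_messages([1, 2, 3, 4]))
```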
