Why do the CLR and JVM use a stack-based architecture? - clr

I am always curious as to why the JVM and CLR have a stack-based architecture?
Why don't they use a register-based approach?
What benefits does it have over the register-based approach?

I used to ponder the differences between register and stack machines and compare instruction sequences, and run benchmarks...
Then I spent a couple of years implementing both types of machines while working on the Parrot VM, which was a register machine. We started, naively, with a fixed register set, in combination with data and register stacks, but eventually concluded that it was an artificial limitation, so we changed to an infinite register set and an allocator. At some point, the Parrot fast core (GCC computed goto) outperformed Mono and JVM interpreter cores (non-JIT), but the difference came down to the JIT. Parrot's JIT never matched the quality of the others. It is the quality of the JITter that makes the eventual machine, and that is generally what people care about. If all VMs played by the same rules (ie. they had a constraint to run in interpreted mode with no JIT), then my evidence shows a register machine has the performance edge on an equivalent stack machine. Larger instructions, but fewer of them == higher throughput (IPC), and better cache locality of reference. The Dalvik JVM actually supports my findings, Dalvik had no JIT for a couple of years, and competed with its interpreter core.
Very few mainstream VMs run in interpretation mode exclusively (AFAIK), they JIT compile, and thats what we benchmark. The point of the interpreter core is to establish a presence on the platform, do bytecode verification, and provide a failsafe execution core in absence of the JIT. This isn't a rule, of course; there are billions of devices running ARM accelerated JVM without JIT, but in the absence of memory or CPU constraints, this applies.
I worked and worked at tweaking the core, testing and tuning, only to find that in the end we really wanted a fast JIT. I arrived at the conclusion that if you are going to eventually JIT, it doesn't matter much whether you implement a stack or register machine to start, do what you like; but you will get "to market" faster with a stack machine. Doing a lot of pseudo-register-machine virtual optimizations for bytecode interpretation by a virtual machine core is partially a wasted effort, because it isn't real native optimization. The soft-core doesn't do branch prediction, register renaming, instruction reordering, parallel execution or prefetch like a real processor. My feeling is that once we have a high quality JIT to native binary, we arrive at the same destination.
For those reasons, I technically favor a stack based machine for:
Simplicity - Much less code to maintain = less bugs
Time to implement
But visually, and emotionally I favor a register machine for:
Visual-Conceptual models more closely match the machine, and my
brain
Flexibility - Compilers can evaluate their expression trees
in different orders using SSA.
Note I didn't say compilers could more "easily" generate code. That seems to be what people who have worked mostly with stack machines like to argue. I don't believe that and didn't find that to be true. I saw many hobby compilers written in a short time on both Parrot and the CLR, though I would admit the ones on the CLR are of higher quality, but that is mainly one of ecosystem and quality of available tools. I wrote compilers on both platforms myself, and found there are tradeoffs, but not enough to lose sleep over.
This is an educated guess, because my real-world experience does not include writing a full JITter so I don't have first-hand experience comparing the pros and cons of JITting various forms of opcodes, but my opinion is, if you plan to include a JIT, then creating an extremely sophisticated virtual machine opcode core amounts to premature optimization. Your time is better spent elsewhere.

It is usually not appropriate to just link out to an article but this time I'll make an exception: This article by Eric Lippert answers just this question.

Related

Care to expound on this statement made on Erlang performance?

There's something else to keep in mind: while Erlang does some things very well, it's technically still possible to get the same results from other languages. The opposite is also true; evaluate each problem as it needs to be, and choose the right tool according to the problem being addressed. Erlang is no silver bullet and will be particularly bad at things like image and signal processing, operating system device drivers, etc. and will shine at things like large software for server use (i.e.: queues, map-reduce), doing some lifting coupled with other languages, higher-level protocol implementation
I'm learning Erlang and this link (http://learnyousomeerlang.com/introduction#kool-aid) got me curious of the reasoning of good vs bad applications for Erlang. Can anyone expound on this statement?
Why do Erlang excel at some of the aformentioned fields and not in the others?
while Erlang does some things very well, it's technically still possible to get the same results from other languages
Lets face it, really all programming languages can do more or less everything, and have ways to interface to C libraries to access anything they don't as such have a native library for.
The most obvious thing to point out is that all of Erlang boils down to C at the end of the day, and a little bit of assembler, but that's not really relevant to the point.
Thus it should be clear enough that anything you can write in Erlang could be written in C, and because you are eliminating a layer of abstraction and interpretation, if you do a reasonable job of it, it should be faster. Sometimes a little faster. Sometimes a lot faster.
Erlang is no silver bullet and will be particularly bad at things like image and signal processing, operating system device drivers, etc.
This is the arena of nitty gritty byte and bit shifting magic, and if you introduce an abstraction layer for every bit you shift... you can easily end up degrading the best possible achievable performance by multiple orders of magnitude.
and will shine at things like large software for server use (i.e.: queues, map-reduce), doing some lifting coupled with other languages, higher-level protocol implementation
This is the interesting bit. We've already established that if you write it in C, unless you do a sufficiently poor job of it, the result can only be better in terms of performance.
BUT performance isn't everything. In today’s world CPU and memory is cheap, but time to market is hugely important. A company might spend thousands on some extra hardware required to run your application because it's written in Erlang instead of C, but save (or make) millions because the product is first to market.
The fact is, if you match a given software problem to a high level language with the right paradigm, the average software engineer can often produce a given product many MANY times faster than if they had to write it in C.
Also, writing C is error prone, and provides vastly more scope for making mistakes and poor choices. That means a software engineer might write something in C badly enough that the equivalent Erlang, based on some very finely tuned mature clever C, if the Erlang itself is well through out, it might perform better!
evaluate each problem as it needs to be, and choose the right tool according to the problem being addressed
Erlang is a really great tool, generally, but it does suit some problem domains more than others. There are some problems which might just be better solved with perl for example, or C, python, etc. When it fits the problem domain, Erlang can be unbeatable, but if it's a bad fit, it's definitely best to consider something else.
Both Erlang and C are Turing complete (except for the lack of infinite memory) and thus both can be used to compute anything if you don't care about absolute performance or the amount of memory or other system resources used.
In systems with constrained memory (tinyDuino, et.al.), the language runtime footprint (and OS resources required to support that runtime) may be a differentiator. For problems where every multiply-accumulate per second counts (affects total cost in MegaWatt-days of power or microseconds of latency), any extra type or value checks, copies, or conversions, which might be implicit in the formal language definition, might incur an added performance cost in processor cycles, cache misses, or run-time memory management. A C program might be specified without much of the above overhead for certain types of applications. However, in applications which require such overhead for a robust solution, that performance advantage disappears as compared against the expected human cost of coding an equivalent (or more) robust solution.
Erlang is a good solution when you want to create:
Realtime Systems: They need predictable response time and Erlang preemptive scheduling and per process garbage collection features shine in it.
Distributed Systems: Erlang has out of box mechanisms for distribution and a standard protocol which is called Erlang Distributed Protocol.
Fault Tolerant Systems: The light-weight processes of Erlang which lets a process to crash without making other processes crash, and its mechanisms for processes to supervise and monitor each other is suitable for fault tolerant systems.
Concurrent Systems: Although writing a concurrent system in languages like C and Java is possible, it can be hard and error prone. But Erlang has internal primitives that makes it so easy to write a concurrent program.
Erlang is not a good choice when you need to write a program that has to do number crunching, image processing and such things because your Erlang codes runs above some layers of abstraction. However there are official mechanisms in Erlang for taking the advantage of C performance. Also Hipe (High Performance Erlang) project is worth considering.

Are there any tools that assist in porting F# to OCaml?

Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic. I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation, and
b) OCaml gets a good complete port to the LLVM backend -
I will port it from F# to OCaml. I have avoided as much .NET-specific libraries as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
Anyone know of such things, either finished or in progress?
Thanks deeply!
To answer your question in an answer - as far as I know, there are no such tools and I do not think it is likely somebody will create them.
Although F# is inspired by OCaml, it has evolved a lot and is different in a number of ways (see this SO discussion), so automatic conversion is not trivial. Even if somebody did that, it would be more like compilation to hard to read OCaml than conversion to idiomatic code that you can later continue working on.
To add a few general comments, when you speak about "real-time" I imagine controlling some robot in a factory dealing with dangerous stuff or an airplane control. In these areas, concerns about GC are certainly valid. However, I do not think games are necessarily "real-time". You need good performance, that's for sure, but people have been writing games with .NET and F# quite happily. For some F# examples, see:
... a nice blog with a couple of game samples (that you can actually try & buy)
a 3D airplane shooter game that also looks fairly realistic
and there is also a book that uses games to explain F#
These are probably simpler than what you're aiming for, but it may be good enough to show that writing games using GC is doable.
Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic.
A few points here:
Incremental GCs are not the only way to get low pause times. Concurrent GCs like VCGC do the work in bulk but do it concurrently with mutators running, e.g. the VCGC implementation I described in the non-free article here was running with sub-millisecond pause times.
Incremental GC does not necessarily mean low pause times. For example, OCaml's GC typically incurs 10ms pauses and can incur arbitrarily-long pauses when it encounters a deep thread stack or long array in the heap.
I have measured typical pause times of 10ms with OCaml and 30ms with F# on .NET 3. With a simple implementation I was able to build a fault tolerant server in F# from scratch that handled 20k msgs/s with 50% of latencies under 114us and 95% under 500us.
I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation, and
I wouldn't give up on the platform is your first working version has unacceptable latency. There are lots of things you can do to bring the max latency down.
b) OCaml gets a good complete port to the LLVM backend -
I seriously doubt OCaml will ever get what I'd consider to be a "good complete port to the LLVM backend". They'll just retarget LLVM with the current typeless IR and it won't do much better than the current ocamlopt compiler because LLVM isn't designed to optimize that kind of workload.
I will port it from F# to OCaml. I have avoided as much .NET-specific libraries as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
No automated tools but I've ported hundreds of thousands of lines of code between OCaml and F# now and it is generally very easy because most code is written in the core ML subset of both languages.

What makes Erlang unsuitable for computationally expensive work?

At the beginning of Programming Erlang, there is the following:
What makes Erlang the best choice for your project? It depends on what you are looking
to build. If you are looking into writing a number-crunching application, a graphics-
intensive system, or client software running on a mobile handset, then sorry, you
bought the wrong book.
The implied message is that Erlang isn't suitable for computationally expensive work. What makes Erlang so unsuitable, or have I misinterpreted?
Erlang shines for I/O-bound applications, that is, problems whose limiting factor is the latency and throughput of I/O operations rather than the rate at which instructions can be pushed through a CPU pipeline. Web servers and databases are good examples of I/O-bound applications: the liming factors are likely to be the disk and network rather than the CPU. Traditionally "compute-heavy" applications include cryptographic tools and scientific simulations.
As to why Erlang fails to match languages like C and Fortran when it comes to computationally intensive problems, we must consider things like code generation and cache-friendliness... I'll give it a try:
Code generation: Normally when you start an Erlang program, it will be run in BEAM, a virtual machine based on threaded code. While BEAM performs well enough for most purposes, it has much greater overhead per logical "instruction" than does the kind of code generated by a modern optimizing C compiler. The HiPE project provides a native code compiler for Erlang that was integrated into main OTP source tree a couple of years ago*. While it certainly improves Erlang's number crunching capacity, it will still have a hard time matching a well-written C or Fortran program.
Cache-friendliness: The memory system is a major bottleneck in modern computers: a read from main memory can take hundreds of processor cycles! To solve this problem, CPU designers introduce several levels of cache to hide the memory latency. Caches exploit two key properties of computer programs: temporal and spatial locality -- that is, regions of memory that were recently referenced (and nearby regions) are likely to be referenced again. Languages like C and Fortran offers a great deal of control over where and how memory is allocated, enabling the programmer to tune algorithms to play nicely with the caches. The same doesn't generally hold for dynamic languages like Erlang, where memory allocation is hidden from the programmer and handled automatically by the virtual machine.
Code size: The argument about spatial locality holds for code as well; Erlang code, whether in native or bytecode form, will generally be larger than the corresponding compiled C code. This leads to more frequent misses in the instruction cache.
Bear in mind that this is just the tip of the iceberg, and that I am by no means an expert in Erlang or language implementation. Don't let the fact that Erlang will probably never run scientific simulations scare you, though; for many applications, it's an absolutely fantastic language.
*HiPE is available through the erlang-base-hipe package in Debian, or ./configure --enable-hipe from a source tarball.
It's just that C code might be considerable faster most of the time. Erlang is great at fault tolerance, distributed computing, and concurrency. Programmers tend to be equally proficient in writing erlang or other languages, but if you want speed, use C or C++, maybe from an erlang port, so this code is usable from your own erlang application.
Erlang is a concurrent functional programming language designed for programming large industrial real-time systems. Nothing specifically prevents you from developing "a number-crunching application or a graphics-intensive system", but the language shines in real-time event processing.

How pthreads does cross-threading and scheduling

I was wondering, how does pthreads-win32 (windows implementation of pthreads) implement cross-threading? Is it written exclusively with windows API? I checked some of the sources and it seems that most is indeed written with windows API, tho i was wondering if it uses windows scheduler to switch between threads (and cores) as well or does it implement its own? Specifically, most processors these days implement their own scheduler (i've read about itanium arch for example, the hardwired logic supports two threads per core and it even automatically switches between them with hw logic, so evidently OS support for multiple cores is not necessarily needed), so if i have an obsolete OS like windows 32-bit or something, which doesn't support multi-core processors, would a program written with pthreads-win32 still run on more than one processor core or would only one core be used?
How about pthreads implementations (untainted posix threads)? Do they support multi-core processors even if the OS on which they are running doesn't?
I am guessing the answer is no, for both windows and posix versions, only one core is in use if the OS doesn't support for multiple cores. Tho this is just an educated guess and i would like to confirm it, so pls leave a comment.
On a side request, can you pls recommend a lib that DOES support for muli-core thread execution, even if the OS on which the program is running DOESN'T. If any exist ofc.
Also, is there a way to ensure two threads written with pthreads are being executed on different cores, or does the OS (or the processor, or pthreads lib) do the assignment automatically? Does pthreads guarantee execution on different cores if they are available?
Cheers, Val
EDIT:
I know most of these questions are implementation specific, so i was referring to this implementation of pthreads for windows http://sourceware.org/pthreads-win32/. I didn't specifically mention it before, because as far as i know, this is the most popular and widely used implementation of pthreads for windows.
So from what i'm getting, the most important thing to note in all of this is that threading has very little to do with parallelism (like UMA with multi-core processors). So while threading might be a technique to implement concurrency it is not a way of ensuring ACTUAL parallel execution, which is what i was looking for in the first place, since i am studying parallel and distributed systems and algorithms.
So to answer one question at a time. Yes, pthreads, and probably most (if not all) other threading APIs out there are based on the underlying OS API. Which ofc gives them the same limits that the OS has. Meaning, yes, if the OS (concretely in this case, some windows running for example pthreads-win32) doesn't support multiple cores, only one core is in use at all times. As is pointed out on the wiki page nob provided, to cite: "Hyper-threading requires not only that the operating system support multiple processors, but also that it be specifically optimised for HTT, and Intel recommends disabling HTT when using operating systems that have not been so optimized." http://en.wikipedia.org/wiki/Hyper-threading Meaning in most cases, just hardwired processors (basic) scheduler is not enough to take advantage of multiple cores, it has to be supported/used by SW (OS support).
While this might not be a definitive proof, i believe enough evidence points in the same direction to confirm this to be the case.
I did not sift through pthreads (for posix compliant OSes) sources, i am guessing the same goes for this API, since it is more than likely to use the underlying OS API. You will have to confirm this on your own. :)
Also, any potential libs out there that might support execution on multiple cores even if the OS on which they're running on doesn't support multiple cores, you will have to find them on your own (if they exist), please leave a comment.
To ensure parallelism (execution on different cores) manually, linux does provide a way to pin a thread to a specific virtual processor (under certain conditions). To pin an entire process to a specific (virtual) processor/core, sched_setaffinity() (from sched.h) can be used. As nos pointed out, pthreads provides pthread_setaffinity_np() to pin a particular thread to a specific core. Windows supports a similar functionality with SetThreadAffinityMask(), so clearly, assigning threads manually to run in parallel on different cores is possible (if the OS supports multi-cores).
From my experience coding with pthreads, if you write for code that uses multiple threads (more than 2), they SHOULD be executed on more than one physical core, if available (which is probably an OS feature used by pthreads).
My questions were quite general to begin with, since most of these things are implementation specific, it's hard to give one answer. I hope this answer is detailed enough to help you clarify a few things.
Cheers, Val
Generally each modern OS supports Threads by itself and schedules them to the different (virtual) Cores of a System. The OS provides some general synchronization techniques (like Mutexes or Semaphores or Barriers) which are used by pthread to implement the pthreads API.
With two threads per Core (I think you mean Hyper Threading) on some Intel Processors (like Itanium) the OS sees two "virtual" Cores. The processor indeed schedules the two threads onto one physical core. (See Wikipedia)
However, there are examples where Runtime-Plattforms implement their own Thread-Conceptepts and do the scheduling: I think of (at least older) implementations of Java having their own scheduling routines.

Datamining models in FORTRAN or C (or managed code)?

We are planning to develop a datamining package for windows. The program core / calculation engine will be developed in F# with GUI stuff / DB bindings etc done in C# and F#.
However, we have not yet decided on the model implementations. Since we need high performance, we probably can't use managed code here (any objections here?). The question is, is it reasonable to develop the models in FORTRAN or should we stick to C (or maybe C++). We are looking into using OpenCL at some point for suitable models - it feels funny having to go from managed code -> FORTRAN -> C -> OpenCL invocation for these situations.
Any recommendations?
F# compiles to the CLR, which has a just-in-time compiler. It's a dialect of ML, which is strongly typed, allowing all of the nice optimisations that go with that type of architecture; this means you will probably get reasonable performance from F#. For comparison, you could also try porting your code to OCaml (IIRC this compiles to native code) and see if that makes a material difference.
If it really is too slow then see how far that scaling hardware will get you. With the performance available through a modern PC or server it seems unlikely that you would need to go to anything exotic unless you are working with truly brobdinagian data sets. Users with smaller data sets may well be OK on an ordinary PC.
Workstations give you perhaps an order of magnitude more capacity than a standard dekstop PC. A high-end workstation like a HP Z800 or XW9400 (similar kit is available from several other manufacturers) can take two 4 or 6 core CPU chips, tens of gigabytes of RAM (up to 192GB in some cases) and has various options for high-speed I/O like SAS disks, external disk arrays or SSDs. This type of hardware is expensive but may be cheaper than a large body of programmer time. Your existing desktop support infrastructure shouldn be able to this sort of kit. The most likely problem is compatibility issues running 32 bit software on a 64-bit O/S. In this case you have various options like VMs or KVM switches to work around the compatibility issues.
The next step up is a 4 or 8 socket server. Fairly ordinary wintel servers go up to 8 sockets (32-48 cores) and perhaps 512GB of RAM - without having to move off the Wintel platform. This gives you fairly wide range of options within your platform of choice before you have to go to anything exotic1.
Finally, if you can't make it run quickly in F#, validate the F# prototype and build a C implementation using the F# prototype as a control. If that's still not fast enough you've got problems.
If your application can be structured in a way that suits the platform then you could look at a more exotic platform. Depending on what will work with your application, you might be able to host it on a cluster, cloud provider or build the core engine on a GPU, Cell processor or FPGA. However, in doing this you're getting into (quite substantial) additional costs and exotic dependencies that might cause support issues. You will probably also have to bring a third-party consultant who knows how to program the platform.
After all that, the best advice is: suck it and see. If you're comfortable with F# you should be able to prototype your application fairly quickly. See how fast it runs and don't worry too much about performance until you have some clear indication that it really will be an issue. Remember, Knuth said that premature optimisation is the root of all evil about 97% of the time. Keep a weather eye out for issues and re-evaluate your strategy if you think performance really will cause trouble.
Edit: If you want to make a packaged application then you will probably be more performance-sensitive than otherwise. In this case performance will probably become an issue sooner than it would with a bespoke system. However, this doesn't affect the basic 'suck it and see' principle.
For example, at the risk of starting a game of buzzword bingo, if your application can be parallelized and made to work on a shared-nothing architecture you might see if one of the cloud server providers [ducks] could be induced to host it. An appropriate front-end could be built to run locally or through a browser. However, on this type of architecture the internet connection to the data source becomes a bottleneck. If you have large data sets then uploading these to the service provider becomes a problem. It may be quicker to process a large dataset locally than to upload it through an internet connection.
I would advise not to bother with optimizations yet. First try to get a working prototype, then find out where computation time is spent. You can probably move the biggest bottlenecks out into C or Fortran when and if needed -- then see how much difference it makes.
As they say, often 90% of the computation is spent in 10% of the code.

Resources