Run octave faster - memory

I am very new to GNU Octave, but I need to run a script, foo.m, that was written in Octave. In Java, I know it's possible to allocate more memory to a process, and thereby speed it up, using, e.g., -Xmx4g. Is there a similar option in Octave?

In Java, I know it's possible to allocate more memory to a process, and thereby speed it up, using, e.g., -Xmx4g. Is there a similar option in Octave?
The answer to your question is no.
The only flag you can specify during startup of the Octave interpreter that may improve performance is --jit-compiler. For more info, read octave --help.
The JIT compiler was a GSoC project in 2012 and remained a proof of concept with fairly limited capabilities. Since it never did much, it was removed in Octave 7.
Modern Octave has no JIT compiler.

Related

Are there any implementations/prototypes of an Erlang-like VM that could run not only on the CPU but also on the GPU?

I was creating distributed systems in OOP languages using message-passing libraries like MPI, ZeroMQ, RabbitMQ and so on. Now I found myself watching some Erlang promotional material and understood that lots of things we emulate in OOP languages like C++ and C# using libraries (1,000,000 socket connections per process, distributed messaging, and distributed process-monitoring visualization) have been there in Erlang for many years. It seemed reasonable to get to know the language better, and I found myself asking one last question: are there any implementations/prototypes of an Erlang-like VM that could run/spawn some processes not only on the CPU but also on the GPU?
Because that would definitely make Erlang (and dialects like Elixir, which are more readable to my OOP background) the language of choice for most future projects.
A GPU is fast only with sequential memory access, and I can hardly imagine garbage collection in GPU RAM. A GPU is NOT a cool, parallel CPU; it requires much more effort to write for. So most probably there is no Erlang compiler for the GPU.
I doubt there's any implementation that can run Erlang processes on the GPU, but you can use two techniques to run computations on the GPU under Erlang:
use a C library through NIFs (Native Implemented Functions) - see http://www.erlang.org/doc/man/erl_nif.html and an example of such an implementation: msantos/procket on GitHub (I'm sorry, I can't post the link due to low reputation :) - a minimal C sketch of the NIF shape follows below
use a native OS process and communicate with it through an Erlang "port" - see http://www.erlang.org/doc/reference_manual/ports.html
The first one is faster and the latter is safer (NIFs can crash the whole VM).
This is not specific to GPU computations. Erlang is not well suited for high-performance number crunching - it's better to do that in C and manipulate the results in Erlang anyway. The communication between C and Erlang should be implemented in one of the two ways described above.
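As a rough illustration of the first option, here is a minimal NIF sketch in C. The module name fastmath and the function sum_squares are hypothetical, the loop is only a stand-in for where a real GPU kernel launch would go, and the erl_nif.h header it includes ships with Erlang/OTP:

```c
/* Hypothetical NIF "fastmath:sum_squares/1": receives an integer N from
 * Erlang, does the heavy numeric work in C, and returns a float term.
 * A real GPU offload (CUDA/OpenCL) would replace the plain loop below. */
#include "erl_nif.h"

static ERL_NIF_TERM sum_squares(ErlNifEnv* env, int argc,
                                const ERL_NIF_TERM argv[])
{
    unsigned int n, i;
    double x, acc = 0.0;

    if (argc != 1 || !enif_get_uint(env, argv[0], &n))
        return enif_make_badarg(env);

    for (i = 0; i < n; i++) {      /* stand-in for the GPU computation */
        x = (double)i;
        acc += x * x;
    }
    return enif_make_double(env, acc);
}

static ErlNifFunc nif_funcs[] = {
    {"sum_squares", 1, sum_squares}
};

ERL_NIF_INIT(fastmath, nif_funcs, NULL, NULL, NULL, NULL)
```

On the Erlang side, the hypothetical fastmath module would load this library with erlang:load_nif/2 and declare a sum_squares/1 stub. Keep the warning above in mind: a crash anywhere in this C code takes down the whole VM, which is exactly why the port approach is the safer of the two.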

Why do the CLR and JVM use a stack-based architecture?

I have always been curious: why do the JVM and the CLR have a stack-based architecture?
Why don't they use a register-based approach?
What benefits does it have over the register-based approach?
I used to ponder the differences between register and stack machines, compare instruction sequences, and run benchmarks...
Then I spent a couple of years implementing both types of machines while working on the Parrot VM, which was a register machine. We started, naively, with a fixed register set, in combination with data and register stacks, but eventually concluded that it was an artificial limitation, so we changed to an infinite register set and an allocator. At some point, the Parrot fast core (GCC computed goto) outperformed the Mono and JVM interpreter cores (non-JIT), but the difference ultimately came down to the JIT: Parrot's JIT never matched the quality of the others. It is the quality of the JITter that makes the eventual machine, and that is generally what people care about.
If all VMs played by the same rules (i.e. they had a constraint to run in interpreted mode with no JIT), then my evidence shows a register machine has the performance edge over an equivalent stack machine. Larger instructions, but fewer of them == higher throughput (IPC) and better cache locality of reference. The Dalvik VM actually supports my findings: Dalvik had no JIT for a couple of years and competed with its interpreter core alone.
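For anyone who hasn't seen the dispatch technique that "fast core (GCC computed goto)" refers to, here is a toy sketch: a hypothetical three-opcode stack machine, not Parrot's real instruction set, relying on the GCC/Clang labels-as-values extension. A register-machine core would look similar, but each instruction would carry explicit operand indices, giving the fewer-but-larger instructions mentioned above:

```c
/* Toy stack-machine core using computed-goto dispatch (GCC/Clang only).
 * Hypothetical three-opcode instruction set, for illustration. */
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_HALT };

static long run(const long *code)
{
    static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
    long stack[64], *sp = stack;

    #define NEXT goto *dispatch[*code++]
    NEXT;

op_push: *sp++ = *code++;       NEXT;  /* push an immediate operand      */
op_add:  sp--; sp[-1] += sp[0]; NEXT;  /* pop two values, push their sum */
op_halt: return sp[-1];                /* result is the top of the stack */
}

int main(void)
{
    const long prog[] = { OP_PUSH, 2, OP_PUSH, 40, OP_ADD, OP_HALT };
    printf("%ld\n", run(prog));        /* prints 42 */
    return 0;
}
```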
Very few mainstream VMs run in interpretation mode exclusively (AFAIK); they JIT compile, and that's what we benchmark. The point of the interpreter core is to establish a presence on the platform, do bytecode verification, and provide a failsafe execution core in the absence of the JIT. This isn't a rule, of course; there are billions of devices running an ARM-accelerated JVM without a JIT, but in the absence of memory or CPU constraints, this applies.
I worked and worked at tweaking the core, testing and tuning, only to find that in the end we really wanted a fast JIT. I arrived at the conclusion that if you are going to JIT eventually, it doesn't matter much whether you implement a stack or register machine to start; do what you like, but you will get "to market" faster with a stack machine. Doing a lot of pseudo-register-machine virtual optimizations for bytecode interpretation by a virtual machine core is partially a wasted effort, because it isn't real native optimization. The soft core doesn't do branch prediction, register renaming, instruction reordering, parallel execution or prefetch like a real processor. My feeling is that once we have a high-quality JIT to native binary, we arrive at the same destination.
For those reasons, I technically favor a stack-based machine for:
Simplicity - Much less code to maintain = fewer bugs
Time to implement
But visually and emotionally, I favor a register machine for:
Visual-conceptual models more closely match the machine, and my brain
Flexibility - Compilers can evaluate their expression trees in different orders using SSA.
Note I didn't say compilers could more "easily" generate code. That seems to be what people who have worked mostly with stack machines like to argue. I don't believe that, and didn't find it to be true. I saw many hobby compilers written in a short time on both Parrot and the CLR, though I would admit the ones on the CLR are of higher quality, but that is mainly a matter of ecosystem and the quality of available tools. I wrote compilers on both platforms myself, and found there are tradeoffs, but not enough to lose sleep over.
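For context, here is a hedged sketch of the argument being referred to, using a hypothetical expression-tree type rather than either VM's real IR: emitting stack-machine code is a plain postorder walk of the tree, whereas a register machine additionally makes the compiler pick a destination register for every intermediate result. Whether that extra step is genuinely harder is, as said above, debatable:

```c
/* Emitting stack code for "b + c * d" is just a postorder traversal;
 * operands are implicit, so no register assignment is needed.
 * Hypothetical Expr type and opcode names, for illustration only. */
#include <stdio.h>

typedef struct Expr {
    char op;                 /* '+', '*', or 0 for a leaf constant */
    int  value;              /* used when op == 0 */
    struct Expr *left, *right;
} Expr;

static void emit_stack_code(const Expr *e)
{
    if (e->op == 0) {                    /* leaf: push the constant */
        printf("PUSH %d\n", e->value);
        return;
    }
    emit_stack_code(e->left);            /* operands land on the stack */
    emit_stack_code(e->right);
    printf("%s\n", e->op == '+' ? "ADD" : "MUL");  /* operands implicit */
}

int main(void)
{
    /* b + c * d  with b = 2, c = 3, d = 4 */
    Expr b = {0, 2, NULL, NULL}, c = {0, 3, NULL, NULL}, d = {0, 4, NULL, NULL};
    Expr mul = {'*', 0, &c, &d};
    Expr add = {'+', 0, &b, &mul};
    emit_stack_code(&add);   /* PUSH 2, PUSH 3, PUSH 4, MUL, ADD */
    return 0;
}
```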
This is an educated guess: my real-world experience does not include writing a full JITter, so I don't have first-hand experience comparing the pros and cons of JITting various forms of opcodes. But my opinion is that if you plan to include a JIT, then creating an extremely sophisticated virtual-machine opcode core amounts to premature optimization. Your time is better spent elsewhere.
It is usually not appropriate to just link out to an article, but this time I'll make an exception: this article by Eric Lippert answers exactly this question.

Are there any tools that assist in porting F# to OCaml?

Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic. I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation), and
b) OCaml gets a good complete port to the LLVM backend -
I will port it from F# to OCaml. I have avoided as many .NET-specific libraries as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
Anyone know of such things, either finished or in progress?
Thanks deeply!
To answer your question in an answer - as far as I know, there are no such tools and I do not think it is likely somebody will create them.
Although F# is inspired by OCaml, it has evolved a lot and is different in a number of ways (see this SO discussion), so automatic conversion is not trivial. Even if somebody did it, the result would be more like compilation to hard-to-read OCaml than conversion to idiomatic code that you can later continue working on.
To add a few general comments: when you speak about "real-time", I imagine controlling some robot in a factory dealing with dangerous stuff, or airplane control. In these areas, concerns about GC are certainly valid. However, I do not think games are necessarily "real-time". You need good performance, that's for sure, but people have been writing games with .NET and F# quite happily. For some F# examples, see:
... a nice blog with a couple of game samples (that you can actually try & buy)
a 3D airplane shooter game that also looks fairly realistic
and there is also a book that uses games to explain F#
These are probably simpler than what you're aiming for, but it may be good enough to show that writing games using GC is doable.
Unfortunately, due to .NET's lack of an incremental GC (either in the MS or Mono implementation), building soft real-time software such as games with F# is problematic.
A few points here:
Incremental GCs are not the only way to get low pause times. Concurrent GCs like VCGC do the work in bulk but do it concurrently with mutators running, e.g. the VCGC implementation I described in the non-free article here was running with sub-millisecond pause times.
Incremental GC does not necessarily mean low pause times. For example, OCaml's GC typically incurs 10ms pauses and can incur arbitrarily-long pauses when it encounters a deep thread stack or long array in the heap.
I have measured typical pause times of 10ms with OCaml and 30ms with F# on .NET 3. With a simple implementation I was able to build a fault tolerant server in F# from scratch that handled 20k msgs/s with 50% of latencies under 114us and 95% under 500us.
I've written a language in F# that, if -
a) it doesn't perform adequately in the face of the generational GC (arbitrary pauses during the interactive simulation), and
I wouldn't give up on the platform if your first working version has unacceptable latency. There are lots of things you can do to bring the max latency down.
b) OCaml gets a good complete port to the LLVM backend -
I seriously doubt OCaml will ever get what I'd consider to be a "good complete port to the LLVM backend". They'll just retarget LLVM with the current typeless IR and it won't do much better than the current ocamlopt compiler because LLVM isn't designed to optimize that kind of workload.
I will port it from F# to OCaml. I have avoided as many .NET-specific libraries as I could, and since F#'s syntax is based on OCaml's, I'm assuming there should be some automated tools to assist in converting the code.
No automated tools, but I've ported hundreds of thousands of lines of code between OCaml and F# by now, and it is generally very easy because most code is written in the core ML subset of both languages.

What makes Erlang unsuitable for computationally expensive work?

At the beginning of Programming Erlang, there is the following:
What makes Erlang the best choice for your project? It depends on what you are looking to build. If you are looking into writing a number-crunching application, a graphics-intensive system, or client software running on a mobile handset, then sorry, you bought the wrong book.
The implied message is that Erlang isn't suitable for computationally expensive work. What makes Erlang so unsuitable, or have I misinterpreted?
Erlang shines for I/O-bound applications, that is, problems whose limiting factor is the latency and throughput of I/O operations rather than the rate at which instructions can be pushed through a CPU pipeline. Web servers and databases are good examples of I/O-bound applications: the limiting factors are likely to be the disk and network rather than the CPU. Traditional "compute-heavy" applications include cryptographic tools and scientific simulations.
As to why Erlang fails to match languages like C and Fortran when it comes to computationally intensive problems, we must consider things like code generation and cache-friendliness... I'll give it a try:
Code generation: Normally when you start an Erlang program, it will be run in BEAM, a virtual machine based on threaded code. While BEAM performs well enough for most purposes, it has much greater overhead per logical "instruction" than the kind of code generated by a modern optimizing C compiler. The HiPE project provides a native-code compiler for Erlang that was integrated into the main OTP source tree a couple of years ago*. While it certainly improves Erlang's number-crunching capacity, it will still have a hard time matching a well-written C or Fortran program.
Cache-friendliness: The memory system is a major bottleneck in modern computers: a read from main memory can take hundreds of processor cycles! To solve this problem, CPU designers introduce several levels of cache to hide the memory latency. Caches exploit two key properties of computer programs: temporal and spatial locality -- that is, regions of memory that were recently referenced (and nearby regions) are likely to be referenced again. Languages like C and Fortran offer a great deal of control over where and how memory is allocated, enabling the programmer to tune algorithms to play nicely with the caches. The same doesn't generally hold for dynamic languages like Erlang, where memory allocation is hidden from the programmer and handled automatically by the virtual machine.
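A tiny C example makes the locality point concrete. Both loops below sum the same matrix; the first walks it in row-major order (adjacent addresses, so every fetched cache line is fully used), the second in column-major order (large strides, so most of each cache line is wasted). On typical hardware the second loop runs several times slower, even though it does exactly the same arithmetic:

```c
/* Illustration of cache-friendly vs cache-hostile traversal in C. */
#include <stdio.h>

#define N 1024

static double a[N][N];   /* zero-initialized; ~8 MB, far larger than L1/L2 */

int main(void)
{
    double sum_rows = 0.0, sum_cols = 0.0;
    int i, j;

    /* Row-major walk: consecutive iterations touch adjacent addresses. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum_rows += a[i][j];

    /* Column-major walk: each access jumps N * sizeof(double) bytes. */
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum_cols += a[i][j];

    printf("%f %f\n", sum_rows, sum_cols);
    return 0;
}
```

This kind of control over layout and traversal order is exactly what Erlang's automatic memory management hides from you.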
Code size: The argument about spatial locality holds for code as well; Erlang code, whether in native or bytecode form, will generally be larger than the corresponding compiled C code. This leads to more frequent misses in the instruction cache.
Bear in mind that this is just the tip of the iceberg, and that I am by no means an expert in Erlang or language implementation. Don't let the fact that Erlang will probably never run scientific simulations scare you, though; for many applications, it's an absolutely fantastic language.
*HiPE is available through the erlang-base-hipe package in Debian, or ./configure --enable-hipe from a source tarball.
It's just that C code might be considerably faster most of the time. Erlang is great at fault tolerance, distributed computing, and concurrency. Programmers tend to be equally proficient writing Erlang or other languages, but if you want speed, use C or C++, perhaps behind an Erlang port, so that the code is usable from your own Erlang application.
Erlang is a concurrent functional programming language designed for programming large industrial real-time systems. Nothing specifically prevents you from developing "a number-crunching application or a graphics-intensive system", but the language shines in real-time event processing.

Why is LuaJIT so good?

EDIT: Unfortunately, LuaJIT was taken out of the comparison in the link below.
This comparison of programming languages shows that LuaJIT has an over tenfold improvement over the normal Lua implementation.
Why is the change so big? Is there something specific about Lua that makes it benefit a lot from JIT compilation?
Python is dynamically typed and compiled to bytecode as well, so why doesn't PyPy (which has a JIT now, I believe) show such a large jump in performance?
Mike Pall has talked about this in a few places:
http://article.gmane.org/gmane.comp.lang.lua.general/58908
http://lambda-the-ultimate.org/node/3851
http://www.reddit.com/user/mikemike
As with every performant system, the answer in the end comes down to two things: algorithms and engineering. LuaJIT uses advanced compilation techniques, and it also has a very finely engineered implementation. For example, when the fancy compilation techniques can't handle a piece of code, LuaJIT falls back to a very fast interpreter written in x86 assembly.
LuaJIT gets double points on the engineering aspect: not only is LuaJIT itself well engineered, but the Lua language has a simpler and more coherent design than Python and JavaScript. This makes it (marginally) easier for an implementation to provide consistently good performance.

Resources