Javonet performance 10x slower compared to native .net code? May be due to object array? - javonet

In another post, I talk about the need for support of primitive arrays in Javonet. Could this explain why pulling ~2GB worth of double arrays is about 10x slower than comparable code in .NET? I've attached a screenshot of JProfiler in case it helps. (Also, though not shown, JProfiler showed about 1GB of Double objects, which I think should not exist if we just had primitives. But is this the reason for the slowness, or is it that the ~40,000 calls to a .NET method, and all the "stuff" in between with Javonet etc., end up taking a few hundred milliseconds or so?)
UPDATE 5/3/2018:
If you read the comments to the first response, you'll eventually see a build (hf16) which resolves the slowness problem. Javonet appears quite fast. I imagine that this build will eventually make it into the core product.

Jonathan, after analyzing your case in depth, the answer to your performance issues comes down to a variety of factors. Let me explain them one by one:
1. Boxing/Unboxing: indeed this had an impact on your results. As answered in the thread "How to avoid autoboxing of primitives in arrays in javonet", there is a beta release that includes the ability to force Javonet to return primitive arrays. So this issue can be resolved easily.
2. Unnecessary Strings Passed: as mentioned in "Performance of Javonet", the current release of Javonet for Java developers still suffers from an issue where, even for optimized subsequent method calls, the method name is passed to the .NET side and converted to a .NET string. Moreover, for each result the type name is returned and converted to a Java string. This has already been resolved in Javonet for .NET developers. We addressed this issue for you in a temporary build that merges those optimizations into Javonet for Java developers (link below).
3. Data Type Conversion: analyzing your results, we found an issue in "double" processing that might have affected your performance. This is also covered in the temporary build linked below.
4. Type of Operation: for Javonet the most costly operation is the on-the-fly conversion of value types. Depending on the type it is either superfast (e.g. boolean) or quite costly (e.g. UInt64). So your case is special: you make few cross-boundary calls but do a lot of value-type conversion (2GB of array). You should observe completely different results if you compare calling a method many times (e.g. 250k), such as one generating a prime number for a growing "x" argument. (If you compare that to calling the same method via web services, it will be 1000x faster.)
5. Way of Comparing Results: last but very important, Javonet performance varies depending on the operation you perform and the way you compare the results. Clearly, if you invoke a method that does nothing purely in .NET, it will be optimized by the compiler and execute in almost "no time". When you call it through Javonet it will take some "tiny" time (e.g. 0.0000009s) to pass the call to .NET. As a result, when you divide "tiny" by "no time" it is like dividing 10 by 0, so you could assume it is infinitely slower (does that mean Javonet is slow? Not exactly). However, if you call a method that does some processing or retrieves data from a DB etc., the Javonet overhead will be almost unnoticeable.
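The "tiny divided by no time" effect can be put into rough numbers. The following is only a sketch: it assumes a fixed per-call overhead of 0.9 microseconds (the figure quoted above), and the native call durations are hypothetical values chosen purely for illustration, not measurements of Javonet.

```python
def slowdown(native_seconds: float, overhead_seconds: float) -> float:
    """Ratio of (native time + bridge overhead) to native time alone."""
    return (native_seconds + overhead_seconds) / native_seconds


OVERHEAD = 0.9e-6  # per-call overhead quoted in the answer (0.0000009 s)

# Hypothetical native durations: a near-empty method, a small computation,
# and a database round trip.
for label, native in [("near-empty method", 1e-9),
                      ("small computation", 1e-4),
                      ("database query", 1e-2)]:
    print(f"{label:18s} {slowdown(native, OVERHEAD):10.4f}x")
```

The same absolute overhead looks enormous next to a method that does nothing and disappears next to a method that does real work, which is why the ratio alone says little.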
Unstable beta release with fix for faulty string exchange and double data type conversion:
...link removed due to newer release included in update below...
Please use this only for measuring purposes. We would be happy to know your results. Soon those changes will be merged in a stable state into the official build, and we will inform you afterwards.
Summarizing: there are different reasons for the performance results you obtained. Some of them are addressed by the beta patch mentioned above; some are related to the way you measure and the operation you perform. Javonet is in many cases the fastest native integration technology between .NET and Java, as recognized in tests by many of our customers and trusted in solutions like high-frequency trading, real-time data processing, controlling manufacturing and medical devices, and others. Of course there are still situations where performance varies. Achieving the highest results is one of our key priorities, following one of our principles: "be faster with each release". We always accept performance challenges from our customers, implementing on-demand improvements if needed. We accept yours and will strive to optimize retrieval of big primitive arrays as well.
Please test the patch above, which should show a significant improvement but will still be affected by the environmental factors described in points 4 and 5.
Update 2018-04-30: We have started implementing a modernized optimization module to address scenarios like yours while preserving the highest possible performance, almost equal to native. Under the link below you will find an alpha release which works in "usePrimitiveArraysMode" for non-generic methods returning "double[]" without ref/out arguments, e.g. double[] CreateArray() or double[] CreateArray(int size) etc.


What's the overhead for gen_server calls in Erlang/OTP?

This post on Erlang scalability says there's an overhead for every call, cast or message to a gen_server. How much overhead is it, and what is it for?
The cost that is being referenced is the cost of a (relatively blind) function call to an external module. This happens because everything in the gen_* abstractions is a callback to an externally defined function (the functions you write in your callback module), and not a function call that can be optimized by the compiler within a single module. Part of that cost is the resolution of the call (finding the right code to execute -- the same reason each "dot" in.a.long.function.or.method.call in Python or Java raises the cost of resolution) and another part is the actual call itself.
This is not something you can calculate as a simple quantity and then multiply by to get a meaningful answer regarding the cost of operations across your system.
There are too many variables, points of constraint, and unexpectedly cheap elements in a massively concurrent system like Erlang where the hardest parts of concurrency are abstracted away (scheduling related issues) and the most expensive elements of parallel processing are made extremely cheap (context switching, process spawn/kill and process:memory ratio).
The only way to really know anything about a massively concurrent system, which by its very nature will exhibit emergent behavior, is to write one and measure it in actual operation. You could write exactly the same program in pure Erlang once and then again as an OTP application using gen_* abstractions and measure the difference in performance that way -- but the benchmark numbers would only mean anything to that particular program and probably only mean anything under that particular load profile.
With all this in mind... the numbers that really matter when we start splitting hairs in the Erlang world are the reduction budget costs the scheduler takes into account. Lukas Larsson at Erlang Solutions put out a video a while back about the scheduler that details the way these costs impact the system, what they are, and how to tweak the values under certain circumstances (Understanding the Erlang Scheduler). Aside from external resources (iops delay, network problems, NIF madness, etc.) that have nothing to do with Erlang/OTP, the overwhelming factor is the behavior of the scheduler, not the "cost of a function call".
In all cases, though, the only way to really know is to write a prototype that represents the basic behavior you expect in your actual system and test it.

Memory efficiency vs Processor efficiency

In general use, should I bet on memory efficiency or processor efficiency?
In the end, I know it depends on the software/hardware specs, but I think there is a general rule for when there are no constraints.
Example 01 (memory efficiency):
int n=0;
if(n < getRndNumber())
    n = getRndNumber();
Example 02 (processor efficiency):
int n=0, aux=0;
aux = getRndNumber();
if(n < aux)
    n = aux;
They're just simple examples; I wrote them to show what I mean. Better examples will be well received.
Thanks in advance.
I'm going to wheel out the universal performance question trump card and say "neither, bet on correctness".
Write your code in the clearest possible way, set specific measurable performance goals, measure the performance of your software, profile it to find the bottlenecks, and then if necessary optimise knowing whether processor or memory is your problem.
(As if to make a case in point, your 'simple examples' have different behaviour assuming getRndNumber() does not return a constant value. If you'd written it in the simplest way, something like n = max(0, getRndNumber()) then it may be less efficient but it would be more readable and more likely to be correct.)
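To make the difference between the two versions concrete, here is a small sketch. The names make_rnd, example_01 and example_02 are hypothetical, and make_rnd is only a stand-in for getRndNumber() that replays a fixed sequence so the outcome is reproducible.

```python
from itertools import cycle


def make_rnd(values):
    """Stand-in for getRndNumber(): replays a fixed sequence of values."""
    it = cycle(values)
    return lambda: next(it)


def example_01(get_rnd_number):
    n = 0
    if n < get_rnd_number():   # first call decides the branch
        n = get_rnd_number()   # second call may return something else
    return n


def example_02(get_rnd_number):
    n = 0
    aux = get_rnd_number()     # single call, value reused
    if n < aux:
        n = aux
    return n


print(example_01(make_rnd([5, -3])))  # -3
print(example_02(make_rnd([5, -3])))  # 5
```

With the sequence 5, -3 the first version tests against 5 but stores -3, so n ends up negative; the second version behaves like max(0, getRndNumber()).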
To answer Dervin's criticism below, I should probably state why I believe there is no general answer to this question.
A good example is taking a random sample from a sequence. For sequences small enough to be copied into another contiguous memory block, a partial Fisher-Yates shuffle which favours computational efficiency is the fastest approach. However, for very large sequences where insufficient memory is available to allocate, something like reservoir sampling that favours memory efficiency must be used; this will be an order of magnitude slower.
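As a rough sketch of the two approaches named above (function names are my own, and the reservoir variant shown is the basic "Algorithm R"): the first copies the whole sequence and shuffles only its first k slots, the second keeps just k items in memory while streaming through the input. Both assume k is no larger than the number of items.

```python
import random


def partial_fisher_yates(seq, k, rng=random):
    """Sample k items by shuffling only the first k slots of a full copy."""
    pool = list(seq)                      # O(n) extra memory
    for i in range(k):
        j = rng.randrange(i, len(pool))   # pick from the not-yet-fixed tail
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]


def reservoir_sample(iterable, k, rng=random):
    """Sample k items from a stream while holding only k of them in memory."""
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(index + 1)  # uniform in [0, index]
            if j < k:
                reservoir[j] = item       # replace with probability k/(index+1)
    return reservoir
```

The shuffle draws k random numbers but needs the whole sequence in memory; the reservoir draws one random number per remaining element but never holds more than k items.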
So what is the general case here? For sampling a sequence should you favour CPU or memory efficiency? You simply cannot tell without knowing things like the average and maximum sizes of the sequences, the amount of physical and virtual memory in the machine, the likely number of concurrent samples being taken, the CPU and memory requirements of the other code running on the machine, and even things like whether the application itself needs to favour speed or reliability. And even if you do know all that, then you're still only guessing, you don't really know which one to favour.
Therefore the only reasonable thing to do is implement the code in a manner favouring clarity and maintainability (taking factors you know into account, and assuming that clarity is not at the expense of gross inefficiency), measure it in a real-life situation to see whether it is causing a problem and what the problem is, and then if so alter it. Most of the time you will not have to change the code as it will not be a bottleneck. The net result of this approach is that you will have a clear and maintainable codebase overall, with the small parts that particularly need to be CPU and/or memory efficient optimised to be so.
You think one is unrelated to the other? Why do you think that? Here are two examples where you'll find often unconsidered bottlenecks.
Example 1
You design a DB related software system and find that I/O is slowing you down as you read in one of the tables. Instead of allowing multiple queries resulting in multiple I/O operations you ingest the entire table first. Now all rows of the table are in memory and the only limitation should be the CPU. Patting yourself on the back you wonder why your program becomes hideously slow on memory poor computers. Oh dear, you've forgotten about virtual memory, swapping, and such.
Example 2
You write a program where your methods create many small objects but possess O(1), O(log n) or at worst O(n) speed. You've optimized for speed but see that your application takes a long time to run. Curious, you profile to discover what the culprit could be. To your chagrin you discover that all those small objects add up fast. Your code is being held back by the GC.
You have to decide based on the particular application, usage etc. In your above example, both memory and processor usage are trivial, so it is not a good example.
A better example might be the use of history tables in chess search. This method caches previously searched positions in the game tree in case they are re-searched in other branches of the game tree or on the next move.
However, it does cost space to store them, and space also requires time. If you use up too much memory you might end up using virtual memory which will be slow.
Another example might be caching in a database server. Clearly it is faster to access a cached result from main memory, but then again it would not be a good idea to keep loading and freeing from memory data that is unlikely to be re-used.
In other words, you can't generalize. You can't even make a decision based on the code - sometimes the decision has to be made in the context of likely data and usage patterns.
In the past 10 years, main memory has hardly increased in speed at all, while processors have continued to race ahead. There is no reason to believe this is going to change.
Edit: Incidentally, in your example, aux will most likely end up in a register and never make it to memory at all.
Without context I think optimising for anything other than readability and flexibility
So, the only general rule I could agree with is "Optimise for readability, while bearing in mind the possibility that at some point in the future you may have to optimise for either memory or processor efficiency".
Sorry it isn't quite as catchy as you would like...
In your example, version 2 is clearly better, even though version 1 is prettier to me, since as others have pointed out, calling getRndNumber() multiple times requires more knowledge of getRndNumber() to follow.
It's also worth considering the scope of the operation you are looking to optimize; if the operation is time sensitive, say part of a web request or GUI update, it might be better to err on the side of completing it faster than saving memory.
Processor efficiency. Memory is egregiously slow compared to your processor. See this link for more details.
Although, in your example, the two would likely be optimized to be equivalent by the compiler.

How to hunt a Heisenbug

Recently, we received a bug report from one of our users: something on the screen was displayed incorrectly in our software. Somehow, we could not reproduce this in our development environment (Delphi 2007).
After some further study, it appears that this bug only manifests itself when "Code optimization" is turned on.
Are there any people here with experience in hunting down such a Heisenbug? Any specific constructs or coding bugs that commonly cause such an issue in Delphi software? Any places you would start looking?
I'll also just start debugging the whole thing in the usual way, but any tips specific to Optimization-related bugs (*) would be more than welcome!
(*) Note: I don't mean to say that the bug is caused by the optimizer; I think it's much more likely some wonky construct in the code is somehow pushed "over the edge" by the optimizer.
It seems the bug boils down to a record being fully initialized with zeros when there's no code optimization, and the same record containing some random data when there is optimization. In this case, the random data seems to cause an enum type to contain invalid data (to my great surprise!).
The solution turned out to involve an uninitialized local record variable somewhere deep in the code. Apparently, without optimization the record was reset (heap?), and with optimization turned on, the record was filled with the usual garbage. Thanks to you all for your contributions --- I learned a lot along the way!
Typically bugs of this form are caused by invalid memory access (reading uninitialised data, reading off the end of a buffer...) or thread race conditions.
The former will be affected by optimisations causing data layout to be rearranged in memory, and/or possibly by debug code that initialises newly allocated memory to some value; causing the incorrect code to "accidentally work".
The latter will be affected due to timings changing between optimisation levels. The former is generally much more likely.
If you have some automated way of making freshly allocated memory be filled with some constant value before it is passed to the program, and this makes the crash go away or become reproducible in the debug build, that'll provide a good point to start chasing things.
Could very well be a memory vs register issue: your program runs fine only because it relies on memory persisting after a free.
I would recommend running your application with FastMM4 in full debug mode to be sure of your memory management.
Another (not free) tool which can be very useful in a case like this is Eurekalog.
Another thing that I've seen: a crash with the FPU registers being botched when calling some outside code (DLL, COM...) while with the debugger everything was OK.
A record that contains different data according to different compiler settings tells me one thing: That the record is not being explicitly initialised.
You may find that the setting of the compiler optimization flag is only one factor that might affect the content of that record - with any uninitialised data structures the one thing that you can rely on is that you can't rely on the initial content of the structure.
In simple terms:
class member data is initialised (to zeros) for new instances of the class
local variables (in functions and procedures) and unit variables are NOT initialised except in a few specific cases: interface references, dynamic arrays and strings, and I think (but would need to check) records if they contain one or more fields of the types that would be initialised (strings, interface references etc.).
The question as stated is now a little misleading because it seems you found your "Heisenbug" easily enough. Now the issue is how to deal with it, and the answer is simply to explicitly initialise your record so that you aren't reliant on whatever behaviour or side-effect of the compiler is sometimes taking care of that for you and sometimes not.
Especially in purely native languages, like Delphi, you should be more than careful not to abuse the freedom to be able to cast anything to anything.
IOW: One thing I have seen is that someone copies the definition of a class (e.g. from the implementation section in the RTL or VCL) into his own code and then casts instances of the original class to his copy.
Now, after upgrading the library where the original class came from, you might experience all kinds of weird stuff, like jumping into the wrong methods or buffer overflows.
There's also the habit of using signed integers as pointers and vice versa (instead of Cardinal).
This works perfectly fine as long as your process has only 2GB of address space. But boot with the /3GB switch and you will see a lot of apps that start acting crazy. Those made the assumption of "pointer=signed integer" at least somewhere.
Your customer uses a 64Bit Windows? Chances are, he might have a larger address space for 32Bit apps. Pretty tough to debug w/o having such a test system available.
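The "pointer=signed integer" assumption can be illustrated outside Delphi. The sketch below uses Python's ctypes only to reinterpret 32-bit values; the two addresses are made-up examples, one below and one above the 2GB line.

```python
import ctypes


def as_signed_32(address: int) -> int:
    """Reinterpret a 32-bit address as a signed 32-bit integer."""
    return ctypes.c_int32(address).value


low = 0x7FFF0000    # below the 2GB line
high = 0x80010000   # above the 2GB line (only reachable with a larger address space)

print(as_signed_32(low))                        # 2147418112
print(as_signed_32(high))                       # -2147418112
print(as_signed_32(high) > as_signed_32(low))   # False, although high > low as addresses
```

Any code that compares, subtracts, or range-checks such values as signed integers works as long as every address stays below 2GB and silently misbehaves once one does not.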
Then, there's race conditions.
Like having 2 threads, where one is very, very slow. So that you instinctively assume it will always be the last one and so there's no code that handles the scenario where "Captn slow" finishes first.
Changes in the underlying technologies can make these assumptions very wrong, very fast indeed.
Take a look at the upcoming breed of Flash-based super-mega-fast server storage.
Systems that can read and write Gigabytes per second. Applications that assume the IO stuff to be significantly slower than some calculations on in-memory values will easily fail on this kind of fast storage.
I could go on and on, but I gotta run right now...
Code optimization does not mean necessarily that debug symbols have to be left out. Do a debug build with code optimization, then you can still debug the program and maybe the error occurs now.
One easy thing to do is turn on compiler warnings and hints, rebuild the project, and then fix all warnings/hints.
If it is Delphi business code, with data-aware components etc., the following might not apply.
I'm however writing machine vision code which is a bit computational. Most of the unittests are console based. I also am involved with FPC, and over the years have tested a lot with FPC. Partially out of hobby, partially in desperate situations where I wanted any hunch.
Some standard tricks that I tried (in order of decreasing usefulness):
1. Use -gv and valgrind the code (in practice this means the application must run on Linux/FreeBSD, but for computational code and unit tests that can be doable).
2. Compile using the fpc parameter -gt (trash local variables: randomize local variables on procedure entry).
3. Modify the heap manager to randomize the data of the blocks it hands out (also applicable to Delphi code).
4. Try FPC's range/overflow checking and compiler hints.
5. Run on a Mac Mini (PowerPC) or Win64. Due to totally different rules and memory layouts it can catch pretty funky things.
Items 2 and 3 together allow you to find most, if not all, initialization problems.
Try to find any clues, and then go back to Delphi and search more focussed, debug etc.
I do realize this is not easy. I have a lot of FPC experience, and didn't have to find everything out from scratch for these cases. Still, it might be worth a try, and might be a motivation to start making non-visual systems and unit tests FPC compatible and platform independent. Most of this work will be needed anyway, seeing the Delphi roadmap.
For such problems I always advise using logfiles.
Question: Can you somehow detect the incorrect display in the source code?
If not, my answer won't help you.
If yes, check for the incorrectness, and as soon as you find it, dump the stack to a logfile (see post mortem debugging for details about dumping and resymbolizing the stack).
If you see that some data has been corrupted, but you don't know how and when this happened, extract a function that tests for validity (with logging if the test fails), and call this function from more and more places during program execution (e.g. after each menu call). If you repeat this approach a few times you have a good chance of finding the problem.
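The idea of a reusable validity check that logs the call stack is not Delphi specific. A minimal sketch in Python follows; VALID_STATES and check_widget_state are hypothetical names invented for this illustration, and in a Delphi program the stack dump would come from whatever post mortem debugging facility you use.

```python
import logging
import traceback

log = logging.getLogger("display-check")

# Hypothetical set of values the displayed state is allowed to take.
VALID_STATES = {"hidden", "visible", "disabled"}


def check_widget_state(state, where):
    """Return True if state is valid; otherwise log the location and the stack."""
    if state in VALID_STATES:
        return True
    stack = "".join(traceback.format_stack())
    log.error("invalid state %r detected at %s\n%s", state, where, stack)
    return False
```

Calling check_widget_state(current_state, "after File menu") from more and more places narrows down the window in which the data goes bad.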
Is this a local variable inside a procedure or function?
If so, then it lives on the stack, and will contain garbage. Depending on the execution path and compiler settings the garbage will change, potentially pushing your logic 'over the edge'.
Given your description of the problem I think you had uninitialized data that you got away with without the optimizer but which blew up with the optimization on.

What's more expensive, comparison or assignment?

I've started reading Algorithms and I keep wondering, when dealing with primitives of the same type, which is the more expensive operation, assignment or comparison? Does this vary a great deal between languages?
What do you think?
At the lowest level one does two reads, the other does a read and a write.
But why should you really care? You shouldn't care about performance at this level. Optimize for Big-O.
Micro-optimization is almost always the wrong thing to do. Don't even start on it unless the program runs too slowly, and you use a profiler to determine exactly where the slow parts are.
Once you've done that, my advice is to see about improving code and data locality, because cache misses are almost certainly worse than suboptimal instructions.
That being done, in the rather odd case that you can use either an assignment-based or comparison-based approach, try both and time them. Micro-optimization is a numbers game. If the numbers aren't good enough, find out why, then verify that what you're doing actually works.
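"Try both and time them" might look like the sketch below, which uses Python's standard timeit module. Note the assumption: in CPython these timings are dominated by interpreter overhead, so they say nothing about the cost of the underlying machine instructions; the point is only the shape of the measurement (same harness, best of several runs).

```python
import timeit


def time_both(repeats=5, number=200_000):
    """Return best-of-N timings in seconds for an assignment and a comparison."""
    assign = min(timeit.repeat("x = a", setup="a = 1",
                               repeat=repeats, number=number))
    compare = min(timeit.repeat("a < b", setup="a = 1; b = 2",
                                repeat=repeats, number=number))
    return assign, compare


assign_s, compare_s = time_both()
print(f"assignment: {assign_s:.6f}s  comparison: {compare_s:.6f}s")
```

Taking the minimum of several repeats reduces noise from other activity on the machine; the numbers still only apply to this interpreter, this CPU, and this workload.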
So, what do you mean by a comparison? Conditional jumps cause problems to any vaguely modern processor, but different processors do different things, and there's no guarantee that any given one will slow things down. Also, if either causes a cache miss, that's probably the slower one no matter what.
Finally, languages are normally compiled to machine code, and simple things like comparisons and assignments will normally be compiled the same. The big difference will be the type of CPU.

Is it worth caching objects created by Delphi's Memory Manager?

I have an application that creates, and destroys thousands of objects. Is it worth caching and reusing objects, or is Delphi's memory manager fast enough that creating and destroying objects multiple times is not that great an overhead (as opposed to keeping track of a cache) When I say worth it, of course I'm looking for a performance boost.
From recent testing - if object creation is not expensive (i.e. doesn't depend on external resources - accessing files, registry, database ...) then you'll have a hard time beating Delphi's memory manager. It is that fast.
That of course holds if you're using a recent Delphi - if not, get FastMM4 from SourceForge and use it instead of Delphi's internal MM.
Memory allocation is only a small part of why you would want to cache. You need to know the full cost of constructing a semantically valid object, and compare it with the cost of retrieving items from the cache, and not just for a micro-benchmark: cache effects (CPU cache, that is) may change the runtime dynamics in a real live running application.
Or to put it another way, measure it and find out. If you're not measuring, you're not engineering, just guessing.
Only a profiler will tell you. Try both approaches in a tight loop and see what comes out on top :-)
You absolutely have to measure with real-world loads to answer questions like this. Depending on what resources are held in those objects, any resource contention, construction cost, size, etc., the answer may surprise you, and may even change depending on the nature of the load.
It is usually very difficult to determine where your performance issues will be without measuring.
I think this depends on the code your objects execute during create and destroy. The impact of TObject.Create and TObject.Destroy is normally negligible and may easily be outweighed by the caching overhead.
You should also consider that the state of an object may differ when reused from that after just being created.
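The point about object state on reuse is easy to overlook. Here is a minimal sketch of a pool, with hypothetical Particle and Pool classes made up for this illustration, that resets an object before handing it out again:

```python
class Particle:
    """Hypothetical object type used only for this illustration."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.x = 0.0
        self.y = 0.0
        self.alive = True


class Pool:
    """Hands out recycled objects when available, new ones otherwise."""

    def __init__(self, factory):
        self._factory = factory
        self._free = []

    def acquire(self):
        if self._free:
            obj = self._free.pop()
            obj.reset()   # without this, the previous user's state leaks through
            return obj
        return self._factory()

    def release(self, obj):
        self._free.append(obj)
```

The reset step is part of the caching cost: if it does as much work as the constructor, the pool only saves the memory allocation itself.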
Often the only way to tell - is to try it.
If current performance is adequate then you don't have much call to try and increase it. However, if you have performance issues, then some caching (or indeed some other strategies) may help.
You will also need some stats on how often a specific object (instance) is being used. If you're referencing the same set of data regularly, then caching may really improve performance, but if the accesses are distributed across all the possible objects, then your cache miss rate might be too high for it to be worthwhile.
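To see how much the access pattern matters, one can replay a sequence of object keys through a small cache and count hits. The sketch below assumes an LRU eviction policy and uses synthetic access patterns (the sizes and the 90% "hot set" ratio are arbitrary illustrative choices, not data from any real application).

```python
import random
from collections import OrderedDict


def hit_rate(keys, capacity):
    """Replay a non-empty key sequence through an LRU cache; return the hit fraction."""
    cache = OrderedDict()
    hits = 0
    for key in keys:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(keys)


rng = random.Random(0)
# Uniform: every one of 10,000 objects is equally likely to be requested.
uniform = [rng.randrange(10_000) for _ in range(50_000)]
# Skewed: 90% of requests go to 100 "hot" objects.
skewed = [rng.randrange(100) if rng.random() < 0.9 else rng.randrange(10_000)
          for _ in range(50_000)]

print(f"uniform: {hit_rate(uniform, 200):.2%}")
print(f"skewed:  {hit_rate(skewed, 200):.2%}")
```

With the same cache size, the skewed pattern hits far more often than the uniform one, which is the difference between a cache that pays for itself and one that only adds bookkeeping.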
