Datamining models in FORTRAN or C (or managed code)?

We are planning to develop a datamining package for windows. The program core / calculation engine will be developed in F# with GUI stuff / DB bindings etc done in C# and F#.
However, we have not yet decided on the model implementations. Since we need high performance, we probably can't use managed code here (any objections here?). The question is, is it reasonable to develop the models in FORTRAN or should we stick to C (or maybe C++). We are looking into using OpenCL at some point for suitable models - it feels funny having to go from managed code -> FORTRAN -> C -> OpenCL invocation for these situations.
Any recommendations?

F# compiles to the CLR, which has a just-in-time compiler. It's a dialect of ML, which is strongly typed, allowing all of the nice optimisations that go with that type of architecture; this means you will probably get reasonable performance from F#. For comparison, you could also try porting your code to OCaml (IIRC this compiles to native code) and see if that makes a material difference.
If it really is too slow then see how far scaling the hardware will get you. With the performance available through a modern PC or server it seems unlikely that you would need to go to anything exotic unless you are working with truly Brobdingnagian data sets. Users with smaller data sets may well be OK on an ordinary PC.
Workstations give you perhaps an order of magnitude more capacity than a standard desktop PC. A high-end workstation like a HP Z800 or XW9400 (similar kit is available from several other manufacturers) can take two 4- or 6-core CPUs, tens of gigabytes of RAM (up to 192GB in some cases) and has various options for high-speed I/O like SAS disks, external disk arrays or SSDs. This type of hardware is expensive but may be cheaper than a large body of programmer time. Your existing desktop support infrastructure should be able to support this sort of kit. The most likely problem is compatibility issues running 32-bit software on a 64-bit O/S. In this case you have various options like VMs or KVM switches to work around the compatibility issues.
The next step up is a 4 or 8 socket server. Fairly ordinary Wintel servers go up to 8 sockets (32-48 cores) and perhaps 512GB of RAM, without having to move off the Wintel platform. This gives you a fairly wide range of options within your platform of choice before you have to go to anything exotic.
Finally, if you can't make it run quickly in F#, validate the F# prototype and build a C implementation using the F# prototype as a control. If that's still not fast enough you've got problems.
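As a rough illustration of using a prototype as a control, here is a minimal sketch in Python (standing in for the F#/C pair, which is an assumption made purely for brevity). The names `variance_reference`, `variance_fast` and `agrees_with_control` are hypothetical: the first plays the role of the validated prototype, the second the rewritten implementation, and the third feeds both the same random inputs and compares the answers within a tolerance.

```python
import math
import random


def variance_reference(xs):
    """Control: straightforward two-pass population variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def variance_fast(xs):
    """Candidate: single-pass (Welford) variance, standing in for the port."""
    mean = 0.0
    m2 = 0.0
    for n, x in enumerate(xs, 1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / len(xs)


def agrees_with_control(candidate, control, trials=200, seed=0):
    """Run both implementations on the same random inputs and compare."""
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    for _ in range(trials):
        xs = [rng.uniform(-1000, 1000) for _ in range(rng.randint(1, 50))]
        if not math.isclose(candidate(xs), control(xs),
                            rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True


if __name__ == "__main__":
    print(agrees_with_control(variance_fast, variance_reference))
```

The tolerance is there because two correct implementations will rarely agree to the last bit once the order of floating-point operations changes.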
If your application can be structured in a way that suits the platform then you could look at a more exotic platform. Depending on what will work with your application, you might be able to host it on a cluster, cloud provider or build the core engine on a GPU, Cell processor or FPGA. However, in doing this you're getting into (quite substantial) additional costs and exotic dependencies that might cause support issues. You will probably also have to bring in a third-party consultant who knows how to program the platform.
After all that, the best advice is: suck it and see. If you're comfortable with F# you should be able to prototype your application fairly quickly. See how fast it runs and don't worry too much about performance until you have some clear indication that it really will be an issue. Remember, Knuth said that premature optimisation is the root of all evil about 97% of the time. Keep a weather eye out for issues and re-evaluate your strategy if you think performance really will cause trouble.
Edit: If you want to make a packaged application then you will probably be more performance-sensitive than otherwise. In this case performance will probably become an issue sooner than it would with a bespoke system. However, this doesn't affect the basic 'suck it and see' principle.
For example, at the risk of starting a game of buzzword bingo, if your application can be parallelized and made to work on a shared-nothing architecture you might see if one of the cloud server providers [ducks] could be induced to host it. An appropriate front-end could be built to run locally or through a browser. However, on this type of architecture the internet connection to the data source becomes a bottleneck. If you have large data sets then uploading these to the service provider becomes a problem. It may be quicker to process a large dataset locally than to upload it through an internet connection.

I would advise not to bother with optimizations yet. First try to get a working prototype, then find out where computation time is spent. You can probably move the biggest bottlenecks out into C or Fortran when and if needed -- then see how much difference it makes.
As they say, often 90% of the computation is spent in 10% of the code.
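To make "find out where computation time is spent" concrete, here is a small sketch using Python's standard `cProfile` and `pstats` modules (Python is chosen only for illustration; the functions `cheap_setup`, `expensive_model` and `run_pipeline` are made-up stand-ins for a real prototype). It profiles one run and reports the function with the largest own time, which is the first candidate for moving into C or Fortran.

```python
import cProfile
import pstats


def cheap_setup(n):
    """Hypothetical cheap step: build the input data."""
    return list(range(n))


def expensive_model(values):
    """Hypothetical hot spot: a nested loop doing most of the work."""
    total = 0.0
    for x in values:
        for y in range(200):
            total += (x * y) % 7
    return total


def run_pipeline():
    return expensive_model(cheap_setup(2000))


def hottest_function(func):
    """Profile func() and return the name of the function with most own time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    (_, _, name), _ = max(stats.stats.items(), key=lambda item: item[1][2])
    return name


if __name__ == "__main__":
    print(hottest_function(run_pipeline))
```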


Care to expound on this statement made on Erlang performance?

There's something else to keep in mind: while Erlang does some things very well, it's technically still possible to get the same results from other languages. The opposite is also true; evaluate each problem as it needs to be, and choose the right tool according to the problem being addressed. Erlang is no silver bullet and will be particularly bad at things like image and signal processing, operating system device drivers, etc. and will shine at things like large software for server use (i.e.: queues, map-reduce), doing some lifting coupled with other languages, higher-level protocol implementation
I'm learning Erlang and this link (http://learnyousomeerlang.com/introduction#kool-aid) got me curious of the reasoning of good vs bad applications for Erlang. Can anyone expound on this statement?
Why does Erlang excel in some of the aforementioned fields and not in the others?
while Erlang does some things very well, it's technically still possible to get the same results from other languages
Let's face it, really all programming languages can do more or less everything, and have ways to interface to C libraries to access anything they don't have a native library for.
The most obvious thing to point out is that all of Erlang boils down to C at the end of the day, and a little bit of assembler, but that's not really relevant to the point.
Thus it should be clear enough that anything you can write in Erlang could be written in C, and because you are eliminating a layer of abstraction and interpretation, if you do a reasonable job of it, it should be faster. Sometimes a little faster. Sometimes a lot faster.
Erlang is no silver bullet and will be particularly bad at things like image and signal processing, operating system device drivers, etc.
This is the arena of nitty gritty byte and bit shifting magic, and if you introduce an abstraction layer for every bit you shift... you can easily end up degrading the best possible achievable performance by multiple orders of magnitude.
and will shine at things like large software for server use (i.e.: queues, map-reduce), doing some lifting coupled with other languages, higher-level protocol implementation
This is the interesting bit. We've already established that if you write it in C, unless you do a sufficiently poor job of it, the result can only be better in terms of performance.
BUT performance isn't everything. In today's world CPU and memory are cheap, but time to market is hugely important. A company might spend thousands on some extra hardware required to run your application because it's written in Erlang instead of C, but save (or make) millions because the product is first to market.
The fact is, if you match a given software problem to a high level language with the right paradigm, the average software engineer can often produce a given product many MANY times faster than if they had to write it in C.
Also, writing C is error prone, and provides vastly more scope for making mistakes and poor choices. That means a software engineer might write something in C badly enough that the equivalent Erlang, which runs on top of some very finely tuned, mature, clever C, might actually perform better, provided the Erlang itself is well thought out!
evaluate each problem as it needs to be, and choose the right tool according to the problem being addressed
Erlang is a really great tool, generally, but it does suit some problem domains more than others. There are some problems which might just be better solved with perl for example, or C, python, etc. When it fits the problem domain, Erlang can be unbeatable, but if it's a bad fit, it's definitely best to consider something else.
Both Erlang and C are Turing complete (except for the lack of infinite memory) and thus both can be used to compute anything if you don't care about absolute performance or the amount of memory or other system resources used.
In systems with constrained memory (tinyDuino, et al.), the language runtime footprint (and OS resources required to support that runtime) may be a differentiator. For problems where every multiply-accumulate per second counts (affects total cost in MegaWatt-days of power or microseconds of latency), any extra type or value checks, copies, or conversions, which might be implicit in the formal language definition, might incur an added performance cost in processor cycles, cache misses, or run-time memory management. A C program might be specified without much of the above overhead for certain types of applications. However, in applications which require such overhead for a robust solution, that performance advantage disappears when weighed against the expected human cost of coding an equally (or more) robust solution.
Erlang is a good solution when you want to create:
Realtime Systems: They need predictable response times, and Erlang's preemptive scheduling and per-process garbage collection shine here.
Distributed Systems: Erlang has out-of-the-box mechanisms for distribution and a standard protocol called the Erlang Distribution Protocol.
Fault Tolerant Systems: Erlang's light-weight processes, which let one process crash without making other processes crash, and its mechanisms for processes to supervise and monitor each other are well suited to fault tolerant systems.
Concurrent Systems: Although writing a concurrent system in languages like C and Java is possible, it can be hard and error prone. Erlang has built-in primitives that make it easy to write a concurrent program.
Erlang is not a good choice when you need to write a program that does number crunching, image processing and the like, because your Erlang code runs on top of several layers of abstraction. However, there are official mechanisms in Erlang for taking advantage of C performance. The HiPE (High Performance Erlang) project is also worth considering.

Is the "4GB patch" of any use in real life?

And if so, how? I'm talking about this 4GB Patch.
On the face of it, it seems like a pretty nifty idea: on Windows, each 32-bit application normally only has access to 2GB of address space, but if you have 64-bit Windows, you can enable a little flag to allow a 32-bit application to access the full 4GB. The page gives some examples of applications that might benefit from it.
HOWEVER, most applications seem to assume that memory allocation is always successful. Some applications do check if allocations are successful, but even then can at best quit gracefully on failure. I've never in my (short) life come across an application that could fail a memory allocation and still keep going with no loss of functionality or impact on correctness, and I have a feeling that such applications are from extremely rare to essentially non-existent in the realm of desktop computers. With this in mind, it would seem reasonable to assume that any such application would be programmed to not exceed 2GB memory usage under normal conditions, and those few that do would have been built with this magic flag already enabled for the benefit of 64-bit users.
So, have I made some incorrect assumptions? If not, how does this tool help in practice? I don't see how it could, yet I see quite a few people around the internet claiming it works (for some definition of works).
Your troublesome assumptions are these ones:
Some applications do check if allocations are successful, but even then can at best quit gracefully on failure. I've never in my (short) life come across an application that could fail a memory allocation and still keep going with no loss of functionality or impact on correctness, and I have a feeling that such applications are from extremely rare to essentially non-existent in the realm of desktop computers.
There do exist applications that do better than "quit gracefully" on failure. Yes, functionality will be impacted (after all, there wasn't enough memory to continue with the requested operation), but many apps will at least be able to stay running - so, for example, you may not be able to add any more text to your enormous document, but you can at least save the document in its current state (or make it smaller, etc.)
With this in mind, it would seem reasonable to assume that any such application would be programmed to not exceed 2GB memory usage under normal conditions, and those few that do would have been built with this magic flag already enabled for the benefit of 64-bit users.
The trouble with this assumption is that, in general, an application's memory usage is determined by what you do with it. So, as over the past years storage sizes have grown, and memory sizes have grown, the sizes of files that people want to operate on have also grown - so an application that worked fine when 1GB files were unheard of may struggle now that (for example) high definition video can be taken by many consumer cameras.
Putting that another way: applications that used to fit comfortably within 2GB of memory no longer do, because people want to do more with them now.
I do think the following extract from your link of 4 GB Patch pretty much explains the reason of how and why it works.
Why things are this way on x64 is easy to explain. On x86 applications have 2GB of virtual memory out of 4GB (the other 2GB are reserved for the system). On x64 these two other GB can now be accessed by 32bit applications. In order to achieve this, a flag has to be set in the file's internal format. This is, of course, very easy for insiders who do it every day with the CFF Explorer. This tool was written because not everybody is an insider, and most probably a lot of people don't even know that this can be achieved. Even I wouldn't have written this tool if someone didn't explicitly ask me to.
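The flag the extract refers to is, as far as I know, the IMAGE_FILE_LARGE_ADDRESS_AWARE bit (0x0020) in the Characteristics field of the PE file header. Here is a minimal Python sketch of reading and setting that bit on an in-memory image. The helper names are made up for illustration, and `fake_header` builds a synthetic header just large enough for the demo rather than a real executable. This is not a reimplementation of the 4GB Patch tool.

```python
import struct

IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020  # bit in the COFF Characteristics field


def characteristics_offset(image: bytes) -> int:
    """Locate the 2-byte Characteristics field of a PE image."""
    if image[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)  # e_lfanew
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # 4-byte signature, then a 20-byte COFF header whose last 2 bytes
    # are the Characteristics field.
    return pe_offset + 4 + 18


def is_large_address_aware(image: bytes) -> bool:
    (flags,) = struct.unpack_from("<H", image, characteristics_offset(image))
    return bool(flags & IMAGE_FILE_LARGE_ADDRESS_AWARE)


def set_large_address_aware(image: bytes) -> bytes:
    """Return a copy of the image with the flag switched on."""
    patched = bytearray(image)
    offset = characteristics_offset(image)
    (flags,) = struct.unpack_from("<H", patched, offset)
    struct.pack_into("<H", patched, offset,
                     flags | IMAGE_FILE_LARGE_ADDRESS_AWARE)
    return bytes(patched)


def fake_header(flags: int) -> bytes:
    """Synthetic, minimal header for demonstration only (not a real exe)."""
    header = bytearray(0x80 + 24)
    header[:2] = b"MZ"
    struct.pack_into("<I", header, 0x3C, 0x80)
    header[0x80:0x84] = b"PE\0\0"
    struct.pack_into("<H", header, 0x80 + 4, 0x014C)  # machine: i386
    struct.pack_into("<H", header, 0x80 + 22, flags)
    return bytes(header)


if __name__ == "__main__":
    image = fake_header(0x0102)
    print(is_large_address_aware(image))
    print(is_large_address_aware(set_large_address_aware(image)))
```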
And to expand on CFF,
The CFF Explorer was designed to make PE editing as easy as possible,
but without losing sight on the portable executable's internal
structure. This application includes a series of tools which might
help not only reverse engineers but also programmers. It offers a
multi-file environment and a switchable interface.
And to quote a Microsoft insider, Larry Miller of Microsoft MCSA on a blog post about patching games using the tool,
Under 32 bit windows an application has access to 2GB of VIRTUAL
memory space. 64 bit Windows makes 4GB available to applications.
Without the change mentioned an application will only be able to
access 2GB.
This was not an arbitrary restriction. Most 32 bit applications simply
can not cope with a larger than 2GB address space. The switch
mentioned indicates to the system that it is able to cope. If this
switch is manually set most 32 bit applications will crash in 64 bit
environment.
In some cases the switch may be useful. But don't be surprised if it
crashes.
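One common way a 32-bit program fails to cope with addresses above 2GB (my illustration, not something the quoted post spells out) is by handling pointers as signed 32-bit integers. Any address at or above 0x80000000 has the top bit set, so it turns negative and ordering comparisons go wrong. A small Python sketch, with the addresses chosen arbitrarily:

```python
import struct


def as_signed_32(address: int) -> int:
    """Reinterpret an unsigned 32-bit address as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", address))[0]


def end_is_after_start_signed(start: int, end: int) -> bool:
    """Buggy bounds check: compares addresses as signed integers."""
    return as_signed_32(end) > as_signed_32(start)


if __name__ == "__main__":
    below_2gb = 0x7FFF0000
    above_2gb = 0x80001000
    print(as_signed_32(below_2gb))  # still positive
    print(as_signed_32(above_2gb))  # negative once the top bit is set
    # A buffer that straddles the 2GB line appears to end before it starts.
    print(end_is_after_start_signed(below_2gb, above_2gb))
```

Code like this never misbehaves while the process is confined to the lower 2GB, which is why the problem only shows up after the flag is set.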
And finally to add from MSDN - Migrating 32-bit Managed Code to 64-bit,
There is also information in the PE that tells the Windows loader if
the assembly is targeted for a specific architecture. This additional
information ensures that assemblies targeted for a particular
architecture are not loaded in a different one. The C#, Visual Basic
.NET, and C++ Whidbey compilers let you set the appropriate flags in
the PE header. For example, C# and Visual Basic .NET have a /platform:{anycpu,
x86, Itanium, x64} compiler option.
Note: While it is technically possible to modify the flags in the PE header of an assembly after it has been compiled, Microsoft does not recommend doing this.
Finally to answer your question - how does this tool help in practice?
Since you have malloc in your tags, I believe you are working with unmanaged memory. The patch does not change the size of pointers or of any other primitive datatype (the process is still 32-bit); what changes is that allocations can now return addresses above 2GB, which have the top bit set. Code that treats pointers as signed integers, or uses the top bit for its own purposes, will then end up with invalid pointers.
But for managed code, since all of this is handled by the CLR in .NET, the patch can be really helpful and should not cause many problems unless you are dealing with any of the following:
Invoking platform APIs via p/invoke
Invoking COM objects
Making use of unsafe code
Using marshaling as a mechanism for sharing information
Using serialization as a way of persisting state
To summarize, as a programmer I would not use the tool to convert my own application; I would rather migrate it myself by changing build targets. That being said, if I have an exe that could do well with more RAM, such as a game, then this is worth a try.

Why do the CLR and JVM use a stack-based architecture?

I am always curious as to why the JVM and CLR have a stack-based architecture?
Why don't they use a register-based approach?
What benefits does it have over the register-based approach?
I used to ponder the differences between register and stack machines and compare instruction sequences, and run benchmarks...
Then I spent a couple of years implementing both types of machines while working on the Parrot VM, which was a register machine. We started, naively, with a fixed register set, in combination with data and register stacks, but eventually concluded that it was an artificial limitation, so we changed to an infinite register set and an allocator. At some point, the Parrot fast core (GCC computed goto) outperformed Mono and JVM interpreter cores (non-JIT), but the difference came down to the JIT. Parrot's JIT never matched the quality of the others. It is the quality of the JITter that makes the eventual machine, and that is generally what people care about. If all VMs played by the same rules (ie. they had a constraint to run in interpreted mode with no JIT), then my evidence shows a register machine has the performance edge on an equivalent stack machine. Larger instructions, but fewer of them == higher throughput (IPC), and better cache locality of reference. The Dalvik JVM actually supports my findings, Dalvik had no JIT for a couple of years, and competed with its interpreter core.
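To illustrate the "larger instructions, but fewer of them" point, here is a toy sketch in Python of two interpreters evaluating (a + b) * c. The opcode names and the tuple encoding are made up for the example and have nothing to do with Parrot, the CLR or the JVM. Each interpreter counts how many instructions it dispatches.

```python
def run_stack_machine(program, variables):
    """Interpret a stack-based program; return (result, dispatch count)."""
    stack = []
    dispatched = 0
    for instruction in program:
        dispatched += 1
        op = instruction[0]
        if op == "load":
            stack.append(variables[instruction[1]])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop(), dispatched


def run_register_machine(program, variables):
    """Interpret a three-address program; return (result, dispatch count)."""
    registers = dict(variables)  # named registers, unbounded in number
    dispatched = 0
    result = None
    for op, dest, left, right in program:
        dispatched += 1
        if op == "add":
            registers[dest] = registers[left] + registers[right]
        elif op == "mul":
            registers[dest] = registers[left] * registers[right]
        else:
            raise ValueError(f"unknown opcode: {op}")
        result = registers[dest]
    return result, dispatched


# (a + b) * c encoded for each machine
STACK_PROGRAM = [("load", "a"), ("load", "b"), ("add",),
                 ("load", "c"), ("mul",)]
REGISTER_PROGRAM = [("add", "t0", "a", "b"), ("mul", "t1", "t0", "c")]

if __name__ == "__main__":
    values = {"a": 2, "b": 3, "c": 4}
    print(run_stack_machine(STACK_PROGRAM, values))
    print(run_register_machine(REGISTER_PROGRAM, values))
```

The stack version dispatches five small instructions where the register version dispatches two wider ones. In an interpreter without a JIT, dispatch is typically the dominant per-instruction cost, which is the effect described above. This sketch only counts dispatches; it is not a benchmark.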
Very few mainstream VMs run in interpretation mode exclusively (AFAIK); they JIT compile, and that's what we benchmark. The point of the interpreter core is to establish a presence on the platform, do bytecode verification, and provide a failsafe execution core in the absence of the JIT. This isn't a rule, of course; there are billions of devices running ARM accelerated JVM without JIT, but in the absence of memory or CPU constraints, this applies.
I worked and worked at tweaking the core, testing and tuning, only to find that in the end we really wanted a fast JIT. I arrived at the conclusion that if you are going to eventually JIT, it doesn't matter much whether you implement a stack or register machine to start, do what you like; but you will get "to market" faster with a stack machine. Doing a lot of pseudo-register-machine virtual optimizations for bytecode interpretation by a virtual machine core is partially a wasted effort, because it isn't real native optimization. The soft-core doesn't do branch prediction, register renaming, instruction reordering, parallel execution or prefetch like a real processor. My feeling is that once we have a high quality JIT to native binary, we arrive at the same destination.
For those reasons, I technically favor a stack based machine for:
Simplicity - Much less code to maintain = fewer bugs
Time to implement
But visually, and emotionally I favor a register machine for:
Visual-Conceptual models more closely match the machine, and my brain
Flexibility - Compilers can evaluate their expression trees in different orders using SSA.
Note I didn't say compilers could more "easily" generate code. That seems to be what people who have worked mostly with stack machines like to argue. I don't believe that and didn't find that to be true. I saw many hobby compilers written in a short time on both Parrot and the CLR, though I would admit the ones on the CLR are of higher quality, but that is mainly one of ecosystem and quality of available tools. I wrote compilers on both platforms myself, and found there are tradeoffs, but not enough to lose sleep over.
This is an educated guess, because my real-world experience does not include writing a full JITter so I don't have first-hand experience comparing the pros and cons of JITting various forms of opcodes, but my opinion is, if you plan to include a JIT, then creating an extremely sophisticated virtual machine opcode core amounts to premature optimization. Your time is better spent elsewhere.
It is usually not appropriate to just link out to an article but this time I'll make an exception: This article by Eric Lippert answers just this question.

Should I run my regression testing programs on both AMD and Intel chips?

Right now I plan to test on 32-bit, 64-bit, Windows XP Home, Windows XP Pro, Windows Vista Home Basic, Windows Vista Ultimate, Windows 7 Home Basic, and Windows 7 Ultimate ... all with the latest service pack.
However, now I'm wondering if it's worthwhile to test on both AMD and Intel for all the listed scenarios above or would it be a waste of time?
Note: this is a security application for everyday average users.
My feeling is that this would only be worthwhile if you had lots of on-the-edge hand-coded assembly language or some kind of incredibly tight timings (which you're not going to meet with that selection of OS anyway).
If you're using off-the-shelf commercial compilers, then you can be reasonably sure they're going to generate code which runs on all the normal processors.
Of course, nobody could ever prove they didn't need to test on a particular platform, but I would think there are bigger causes of platform difference to worry about than CPU brand (all the various multi-core/hyperthreading permutations, for example, which might expose all your multithreaded code bugs in different ways)
Only if you're programming in assembly and use extended, vendor-specific instruction sets. But since AMD and Intel have cross-licensing agreements in place, this is more of a historic issue than a current one.
In every other case (e.g. using a high level language) it's the job of the compiler writers to ensure the code is x86 compliant and runs on every CPU.
Oh, and apart from the FDIV bug, processor vendors usually don't make mistakes.
I think you're looking in the wrong direction for testing scenarios.
Yes, it's possible that your code will work on Intel but not on AMD, or in Windows Vista Home but not in Windows Vista Professional. But unless you're doing something very closely tied to low-level programming in the first case, or to details of OS implementation in the second, the odds are small. You could say that it never hurts to test every conceivable scenario. But in real life there must be some limit on the resources available to you for testing. Testing on different processors or different OS's is, in most cases, not testing YOUR program, it's testing the compiler, the OS, or the processor. How much time do you have to spare to test other people's work? I think your time would be better spent testing more scenarios within your own code. You don't give much detail on just what your app does, but just to take one of my own examples, it would be much more productive to spend a day testing selling products our own company makes versus products we resell from other manufacturers, or testing sales tax rules for different states, or whatever.
In practice, I rarely even test deploying on Windows versus deploying on Linux, never mind different versions of Windows, and I rarely get burned on that.
If I was writing low-level device drivers or some such, that would be a different story. But normal apps? Don't waste your time.
Certainly sounds like it would be a waste of time to me - which language(s) are your programs written in?
I'd say no. Unless you are writing your application in assembler, you should be far enough removed from the processor to not need to worry about differences. The processors will support the Windows OS, whose APIs are what you are interfacing with (depending on the language). If you are using .NET, the ONLY foreseeable issue you will have is if you are using a version of the framework that those platforms don't support. Given that they are all XP or later you should be fine. If you want to worry about something, make sure your application will play nicely with the Vista and later security model.
The question is probably "what are you testing". It is unlikely that any of the tests is testing something that would be potentially different between AMD and Intel hardware platforms. Differences could be expected at driver level, but you do not seem to plan on testing your software against every existing bit of PC hardware available. Most probably there would be many more differences between different Windows service pack levels than between AMD and Intel processors.
I suppose it's possible there is some functionality in your code that (whether you know it or not) takes advantage of some processing/optimization in one or the other that could have a serious effect on the outcome. Keyword possible.
I would say in general you're unlikely to have to worry about it. If you're going to do it on multiple machines anyway, mix it up on them. But I wouldn't stress out about it.
I would never run all of my regression tests on both AMD and Intel unless I had specifically fixed an issue unique to either one. That is what regression testing is.
Unit testing on the other hand... I wouldn't anticipate any difference. So again, I wouldn't bother running unit tests on both until I had actually seen an issue specific to either AMD or Intel.
If you rely on accurate / consistent floating point results, then yes, definitely.
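A hedged illustration of why floating point deserves this caution (the answer itself gives no reason, so this is my assumption about what it has in mind): floating-point addition is not associative, so anything that changes the order of operations, such as a different compiler, optimisation level or instruction set, can change the last bits of a result. The sketch below shows the effect in Python, along with the tolerance-based comparison a regression test would normally use instead of exact equality.

```python
import math


def sum_in_order(values):
    """Plain left-to-right summation, sensitive to the order of its inputs."""
    total = 0.0
    for value in values:
        total += value
    return total


def results_match(expected, actual, rel_tol=1e-12):
    """Comparison suitable for a regression test on floating-point output."""
    return math.isclose(expected, actual, rel_tol=rel_tol)


VALUES = [0.1, 0.2, 0.3]
FORWARD = sum_in_order(VALUES)             # (0.1 + 0.2) + 0.3
BACKWARD = sum_in_order(reversed(VALUES))  # (0.3 + 0.2) + 0.1

if __name__ == "__main__":
    print(FORWARD == BACKWARD)               # bitwise comparison
    print(results_match(FORWARD, BACKWARD))  # comparison within tolerance
```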

What's the quickest way to parallelize code?

I have an image processing routine that I believe could be made very parallel very quickly. Each pixel needs to have roughly 2k operations done on it in a way that doesn't depend on the operations done on neighbors, so splitting the work up into different units is fairly straightforward.
My question is, what's the best way to approach this change such that I get the quickest speedup bang-for-the-buck?
Ideally, the library/approach I'm looking for should meet these criteria:
Still be around in 5 years. Something like CUDA or ATI's variant may get replaced with a less hardware-specific solution in the not-too-distant future, so I'd like something a bit more robust to time. If my impression of CUDA is wrong, I welcome the correction.
Be fast to implement. I've already written this code and it works in a serial mode, albeit very slowly. Ideally, I'd just take my code and recompile it to be parallel, but I think that that might be a fantasy. If I just rewrite it using a different paradigm (ie, as shaders or something), then that would be fine too.
Not require too much knowledge of the hardware. I'd like to be able to not have to specify the number of threads or operational units, but rather to have something automatically figure all of that out for me based on the machine being used.
Be runnable on cheap hardware. That may mean a $150 graphics card, or whatever.
Be runnable on Windows. Something like GCD might be the right call, but the customer base I'm targeting won't switch to Mac or Linux any time soon. Note that this does make the response to the question a bit different than to this other question.
What libraries/approaches/languages should I be looking at? I've looked at things like OpenMP, CUDA, GCD, and so forth, but I'm wondering if there are other things I'm missing.
I'm leaning right now to something like shaders and opengl 2.0, but that may not be the right call, since I'm not sure how many memory accesses I can get that way-- those 2k operations require accessing all the neighboring pixels in a lot of ways.
Easiest way is probably to divide your picture into the number of parts that you can process in parallel (4, 8, 16, depending on cores). Then just run a different process for each part.
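A minimal sketch of that idea in Python, assuming the image is a list of rows and using a made-up per-pixel operation (`255 - pixel`) in place of the real 2k operations. `split_rows`, `process_strip` and `process_image` are hypothetical names. The number of strips defaults to the machine's core count, so nothing about the hardware is hard-coded.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_rows(height, parts):
    """Return (start, stop) row ranges covering the image in near-equal strips."""
    parts = max(1, min(parts, height))
    base, extra = divmod(height, parts)
    ranges, start = [], 0
    for index in range(parts):
        stop = start + base + (1 if index < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def process_strip(rows):
    """Stand-in for the real per-pixel work: invert an 8-bit greyscale strip."""
    return [[255 - pixel for pixel in row] for row in rows]


def process_image(image, executor, parts=None):
    """Split the image into strips and hand each one to the executor."""
    parts = parts or os.cpu_count() or 1
    strips = [image[a:b] for a, b in split_rows(len(image), parts)]
    result = []
    for finished in executor.map(process_strip, strips):  # keeps strip order
        result.extend(finished)
    return result


if __name__ == "__main__":
    image = [[(x + y) % 256 for x in range(8)] for y in range(10)]
    with ThreadPoolExecutor() as pool:
        print(process_image(image, pool)[0])
```

For CPU-bound work in Python you would pass a `concurrent.futures.ProcessPoolExecutor` instead, to get one process per strip as suggested above; the thread pool here just keeps the demo simple. If the real operation reads neighbouring pixels, each strip also needs a border of overlapping rows, which this sketch leaves out.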
In terms of doing this specifically, take a look at OpenCL. It will hopefully be around for longer since it's not vendor specific and both NVidia and ATI want to support it.
In general, since you don't need to share too much data, the process is really pretty straightforward.
I would also recommend Threading Building Blocks. We use this with the Intel® Integrated Performance Primitives for the image analysis at the company I work for.
Threading Building Blocks (TBB) is similar to both OpenMP and Cilk. It does the multithreading with its own task scheduler, wrapped in a simpler interface. With it you don't have to worry about how many threads to make; you just define tasks. It will split the tasks, if it can, to keep everything busy, and it does the load balancing for you.
Intel Integrated Performance Primitives (IPP) has optimized libraries for vision, most of which are multithreaded. For the functions we need that aren't in the IPP, we thread them using TBB.
Using these, we obtain the best result when we use the IPP method for creating the images. What it does is it pads each row so that any given cache line is entirely contained in one row. Then we don't divvy up a row in the image across threads. That way we don't have false sharing from two threads trying to write to the same cache line.
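The row padding described here comes down to rounding each row's length up to a multiple of the cache line size. Below is a small Python sketch of the arithmetic, assuming a 64-byte cache line and a buffer whose first byte is itself line-aligned (both are assumptions; the answer does not give numbers). The function names are made up and this is not the IPP API.

```python
CACHE_LINE_BYTES = 64  # assumed line size; check the target CPU


def padded_stride(row_bytes, line_bytes=CACHE_LINE_BYTES):
    """Round a row length up to the next multiple of the cache line size."""
    return -(-row_bytes // line_bytes) * line_bytes


def lines_touched_by_row(row, row_bytes, stride, line_bytes=CACHE_LINE_BYTES):
    """Cache line indices covered by one row's pixel data."""
    first = row * stride
    last = first + row_bytes - 1
    return set(range(first // line_bytes, last // line_bytes + 1))


def rows_share_a_line(row_bytes, stride, rows=8):
    """True if any two of the first `rows` rows touch the same cache line."""
    seen = set()
    for row in range(rows):
        lines = lines_touched_by_row(row, row_bytes, stride)
        if seen & lines:
            return True
        seen |= lines
    return False


if __name__ == "__main__":
    row_bytes = 100
    print(padded_stride(row_bytes))
    print(rows_share_a_line(row_bytes, stride=row_bytes))
    print(rows_share_a_line(row_bytes, stride=padded_stride(row_bytes)))
```

With an unpadded stride, neighbouring rows overlap on a cache line, so two threads writing adjacent rows would contend for it. With the padded stride no line is shared between rows, so giving each thread whole rows avoids the false sharing.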
Have you seen Intel's (Open Source) Threading Building Blocks?
I haven't used it, but take a look at Cilk. One of the bigwigs on their team is Charles E. Leiserson; he is the "L" in CLRS, the most widely used and respected algorithms book on the planet.
I think it caters well to your requirements.
From my brief readings, all you have to do is "tag" your existing code and then run it through their compiler, which will automatically and seamlessly parallelize the code. This is their big selling point, so you don't need to start from scratch with parallelism in mind, unlike other options (like OpenMP).
If you already have a working serial code in one of C, C++ or Fortran, you should give serious consideration to OpenMP. One of its big advantages over a lot of other parallelisation libraries / languages / systems / whatever, is that you can parallelise a loop at a time which means that you can get useful speed-up without having to re-write or, worse, re-design, your program.
In terms of your requirements:
OpenMP is much used in high-performance computing, there's a lot of 'weight' behind it and an active development community -- www.openmp.org.
Fast enough to implement if you're lucky enough to have chosen C, C++ or Fortran.
OpenMP implements a shared-memory approach to parallel computing, so a big plus in the 'don't need to understand hardware' argument. You can leave the program to figure out how many processors it has at run time, then distribute the computation across whatever is available, another plus.
Runs on the hardware you already have, no need for expensive, or cheap, additional graphics cards.
Yep, there are implementations for Windows systems.
Of course, if you were unwise enough not to have chosen C, C++ or Fortran in the beginning, a lot of this advice will only apply after you have rewritten it in one of those languages!
Regards
Mark
