What's the overhead for gen_server calls in Erlang/OTP?

This post on Erlang scalability says there's an overhead for every call, cast or message to a gen_server. How much overhead is it, and what is it for?

The cost that is being referenced is the cost of a (relatively blind) function call to an external module. This happens because everything in the gen_* abstractions is a callback to externally defined functions (the functions you write in your callback module), not function calls that can be optimized by the compiler within a single module. Part of that cost is the resolution of the call (finding the right code to execute, the same reason each "dot" in a.long.function.or.method.call in Python or Java raises the cost of resolution), and another part is the actual call itself.
BUT
This is not something you can reduce to a single number and then multiply by a call count to get a meaningful answer regarding the cost of operations across your system.
There are too many variables, points of constraint, and unexpectedly cheap elements in a massively concurrent system like Erlang where the hardest parts of concurrency are abstracted away (scheduling related issues) and the most expensive elements of parallel processing are made extremely cheap (context switching, process spawn/kill and process:memory ratio).
The only way to really know anything about a massively concurrent system, which by its very nature will exhibit emergent behavior, is to write one and measure it in actual operation. You could write exactly the same program in pure Erlang once and then again as an OTP application using gen_* abstractions and measure the difference in performance that way -- but the benchmark numbers would only mean anything to that particular program and probably only mean anything under that particular load profile.
All this taken in mind... the numbers that really matter when we start splitting hairs in the Erlang world are the reduction budget costs the scheduler takes into account. Lukas Larsson at Erlang Solutions put out a video a while back about the scheduler and details the way these costs impact the system, what they are, and how to tweak the values under certain circumstances (Understanding the Erlang Scheduler). Aside from external resources (iops delay, network problems, NIF madness, etc.) that have nothing to do with Erlang/OTP the overwhelming factor is the behavior of the scheduler, not the "cost of a function call".
In all cases, though, the only way to really know is to write a prototype that represents the basic behavior you expect in your actual system and test it.

Related

Is there a sandboxable compiled programming language similar to Lua

I'm working on a crowd simulator. The idea is people walking around a city in 2D. Think gray rectangles for the buildings and colored dots for the people. Now I want these people to be programmable by other people, without giving them access to the core back end.
I also don't want them to be able to use anything other than the methods I provide for them. Meaning no file access, internet access, RNG, nothing.
They will receive events like "You have just been instructed to go to X" or "You have arrived at P" and such.
The script should then allow them to do things like move_forward or how_many_people_are_in_front_of me and such.
Now I have found out that Lua and Python are both thousands of times slower than compiled languages (I figured they would be on the order of 10x slower), which is way too slow for my simulation.
So here's my question: is there a programming language that is FOSS, lets me sandbox the entire language (restricting system access so a script only sees the functions I provide), is reasonably fast (something like <10x slower than Java), lets me send events to objects inside that language, and lets me load in new classes/objects on the fly?
Don't you think that if there were a scripting language faster than Lua and Python, it'd be talked about at least as much as they are?
The speed of a scripting language is a rather vague term. Scripting languages are essentially converted into a series of calls to functions written in fast compiled languages, but those functions are usually written to be general, with lots of checks and fail-safes, rather than to be fast. For some problems, not much redundant work stacks up and the script translation results in essentially the same machine code a compiled program would have produced. For other problems, a person knowledgeable about the language might coerce it into translating to essentially the same machine code. And for yet other problems, the price of convenience stays with the script forever.
If you look at the timings of benchmark tasks, you'll find that there's no consistent winner across them. For one task the language is fastest, for the other it is way behind.
It would make sense to gauge a language's speed at your task by looking at similar tasks in benchmarks. So, which of those problems maps closest to yours? My guess would be: none.
Now, onto the question of user programs inside your program.
That's how scripting languages came into existence in the first place. You can read up on why such a language may be slow, for example, in SICP.
If you evaluate what you expect people to write in their programs, you might decide that you don't need to give them a whole programming language. Instead you can give them a simple set of instructions they can use to describe a few branching decisions and value lookups, and your own very performant program constructs an object that encompasses the described logic, as in the sketch below. This trick is described here and there.
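A minimal sketch of that idea in Python (all names here are invented for illustration): the user's "program" is plain data, and your engine walks it, so only the handful of actions you whitelist can ever run.
class Agent:
    """Stand-in for your real simulation engine."""
    def people_in_front(self): return 2
    def move_forward(self):    print("forward")
    def move_to_side(self):    print("side")

ALLOWED_ACTIONS = {"move_forward", "move_to_side"}   # explicit whitelist

def run(script, agent):
    """Walk the user's instruction list; only whitelisted actions can fire."""
    for op, arg, action in script:
        if action not in ALLOWED_ACTIONS:
            raise ValueError("unknown action: " + action)
        if op == "always" or (op == "if_crowd_ahead_over" and agent.people_in_front() > arg):
            return getattr(agent, action)()

# The user's "program": plain data (e.g. loaded from JSON), not executable code.
user_script = [
    ("if_crowd_ahead_over", 3, "move_to_side"),
    ("always", None, "move_forward"),
]
run(user_script, Agent())   # prints "forward", since only 2 people are ahead
Because the script is just data, there is nothing in it that can open files, touch the network, or call anything you didn't expose.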
However, if you keep adding more and more complex commands for users to invoke, you'll just end up inventing your own language. At that point you'll likely wish you'd gone with Lua from the very beginning.
That being said, I don't think the snippet below will run significantly differently in compiled code, your own interpreter object, or any embedded scripting language:
if event == "You have just been instructed to go to X":
    set_front_of_me(X)                          # call to your function
    n = how_many_people_are_in_front_of_me()    # call to your function
    if n > 3:
        move_to_side()                          # call to function provided by you
    else:
        move_forward()                          # call to function provided by you
Now, if the users would need to do complex computer-sciency stuff, solve NP-hard problems, do machine learning or other heavy matrix multiplication, then yes, that would be slow, provided someone actually troubled themselves with implementing it.
If you get to that point, it seems there are at least some possibilities for sandboxing compiled DLLs (at least in some languages). Or you could compile users' code yourself to control the functionality they invoke and then plug it in as a library.

Erlang OTP I/O - A few questions

I have read that one of Erlang's biggest adopters is the telecom industry. I'm assuming that they use it to send binary data between their nodes and to provide for easy redundancy, efficiency, and parallelism.
Does Erlang actually send just the binary to a central node?
Is it directly responsible for parsing the binary data into actual voice? Or is it fed to another language/program via ports?
Is it responsible for the speed of a telephone call, speed as in the delay between me saying something and you hearing it?
Is it possible that Erlang is used solely for the ease of parallel behaviour, and C++ or similar for processing speed in sequential functions?
I can only guess at how things are implemented in actual telecom switches, but I can recommend an approach to take:
First, you implement everything in Erlang, including much of the low-level stuff. This probably won't scale that much since signal processing is very costly. As a prototype however, it works and you can make calls and whatnot.
Second, you decide on what to do with the performance bottlenecks. You can push them to C(++) and get a factor of roughly 10 or you can push them to an FPGA and get a factor of roughly 100. Finally you can do CMOS work and get a factor of 1000. The price of the latter approach is also much steeper, so you decide what you need and go buy that.
Erlang remains in control of the control backplane, in the sense of what happens when you push buttons, the call setup, and so on. But once a call has been allocated, we hand the channel over to the lower layer. ATM switching is easier here because once the connection is set up you don't need to change it (ATM is connection-oriented, IP is packet-oriented).
Erlang's distribution features are there primarily to provide redundancy in the control backplane. That is, we synchronize tables of call setups and so on between multiple nodes to facilitate node takeover in case of hardware failure.
The trick is to use ports and NIFs post prototype to speed up the slower parts of the program.

Thoughts on the "minimize code, maximize data" philosophy

I have heard of the concept of minimizing code and maximizing data, and was wondering what advice other people can give me on how/why I should do this when building my own systems?
Typically data-driven code is easier to read and maintain. I know I've seen cases where the data-driven approach has been taken to the extreme and winds up very unusable (I'm thinking of some SAP deployments I've used), but coding your own "Domain Specific Languages" to help you build your software is typically a huge time saver.
The Pragmatic Programmers remain, in my mind, the most vivid advocates of writing little languages that I have read. Little state machines that run little input languages can get a lot accomplished in very little space, and make it easy to make modifications.
A specific example: consider a progressive income tax system, with tax brackets at $1,000, $10,000, and $100,000 USD. Income below $1,000 is untaxed. Income between $1,000 and $9,999 is taxed at 10%. Income between $10,000 and $99,999 is taxed at 20%. And income above $100,000 is taxed at 30%. If you were to write this all out in code, it'd look about as you'd suspect:
total_tax_burden(income) {
    if (income < 1000)
        return 0
    if (income < 10000)
        return 0.1 * (income - 1000)
    if (income < 100000)
        return 900 + 0.2 * (income - 10000)
    return 18900 + 0.3 * (income - 100000)
}
Adding new tax brackets, changing the existing brackets, or changing the tax burden in the brackets, would all require modifying the code and recompiling.
But if it were data-driven, you could store this table in a configuration file:
1000:0
10000:10
100000:20
inf:30
Write a little tool to parse this table and do the lookups (not very difficult, right?) and now anyone can easily maintain the tax rate tables. If Congress decides that 1,000 brackets would be better, anyone could make the tables line up with the IRS tables and be done with it, no code recompiling necessary. The same generic code could be used for one bracket or hundreds of brackets.
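For illustration, a minimal sketch of such a tool in Python (the file name and the "threshold:rate" format are assumed to match the four-line table above):
def load_brackets(path):
    """Parse 'threshold:rate' lines into (upper_bound, rate_fraction) pairs."""
    brackets = []
    for line in open(path):
        bound, rate = line.split(":")
        upper = float("inf") if bound.strip() == "inf" else float(bound)
        brackets.append((upper, float(rate) / 100.0))
    return brackets

def total_tax_burden(income, brackets):
    """Tax each slice of income at its bracket's rate."""
    tax, lower = 0.0, 0.0
    for upper, rate in brackets:
        if income <= lower:
            break
        tax += rate * (min(income, upper) - lower)
        lower = upper
    return tax

brackets = load_brackets("tax_table.txt")    # the config file shown above
print(total_tax_burden(50000, brackets))     # 900 + 0.2 * 40000 = 8900.0
Adding or changing a bracket is now a one-line edit to the data file; the code never changes.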
And now for something that is a little less obvious: testing. The AppArmor project has hundreds of tests for what system calls should do when various profiles are loaded. One sample test looks like this:
#! /bin/bash
# $Id$
# Copyright (C) 2002-2007 Novell/SUSE
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation, version 2 of the
# License.
#=NAME open
#=DESCRIPTION
# Verify that the open syscall is correctly managed for confined profiles.
#=END
pwd=`dirname $0`
pwd=`cd $pwd ; /bin/pwd`
bin=$pwd
. $bin/prologue.inc
file=$tmpdir/file
okperm=rw
badperm1=r
badperm2=w
# PASS UNCONFINED
runchecktest "OPEN unconfined RW (create) " pass $file
# PASS TEST (the file shouldn't exist, so open should create it)
rm -f ${file}
genprofile $file:$okperm
runchecktest "OPEN RW (create) " pass $file
# PASS TEST
genprofile $file:$okperm
runchecktest "OPEN RW" pass $file
# FAILURE TEST (1)
genprofile $file:$badperm1
runchecktest "OPEN R" fail $file
# FAILURE TEST (2)
genprofile $file:$badperm2
runchecktest "OPEN W" fail $file
# FAILURE TEST (3)
genprofile $file:$badperm1 cap:dac_override
runchecktest "OPEN R+dac_override" fail $file
# FAILURE TEST (4)
# This is testing for bug: https://bugs.wirex.com/show_bug.cgi?id=2885
# When we open O_CREAT|O_RDWR, we are (were?) allowing only write access
# to be required.
rm -f ${file}
genprofile $file:$badperm2
runchecktest "OPEN W (create)" fail $file
It relies on some helper functions to generate and load profiles, test the results of the functions, and report back to users. It is far easier to extend these little test scripts than it is to write this sort of functionality without a little language. Yes, these are shell scripts, but they are so far removed from actual shell scripts ;) that they are practically data.
I hope this helps motivate data-driven programming; I'm afraid I'm not as eloquent as others who have written about it, and I certainly haven't gotten good at it, but I try.
In modern software the line between code and data can become awfully thin and blurry, and it is not always easy to tell the two apart. After all, as far as the computer is concerned, everything is data, unless it is determined by existing code - normally the OS - to be otherwise. Even programs have to be loaded into memory as data, before the CPU can execute them.
For example, imagine an algorithm that computes the cost of an order, where larger orders get lower prices per item. It is part of a larger software system in a store, written in C.
The algorithm reads a file containing an input table provided by the management, with the various per-item prices and the corresponding order-size thresholds. Most people would argue that a file with a simple input table is, of course, data.
Now, imagine that the store changes its policy to some sort of asymptotic function, rather than pre-selected thresholds, so that it can accommodate insanely large orders. They might also want to factor in exchange rates and inflation - or whatever else the management people come up with.
The store hires a competent programmer and she embeds a nice mathematical expression parser in the original C code. The input file now contains an expression with global variables, functions such as log() and tan(), as well as some simple stuff like the Planck constant and the rate of carbon-14 degradation.
cost = (base * ordered * exchange * ... + ... / ...)^13
Most people would still argue that the expression, even if not as simple as a table, is in fact data. After all it is probably provided as-is by the management.
The store receives a large number of complaints from clients who went brain-dead trying to estimate their expenses, and from the accounting people about the large amount of loose change. The store decides to go back to the table for small orders and use a Fibonacci sequence for larger orders.
The programmer gets tired of modifying and recompiling the C code, so she embeds a Python interpreter instead. The input file now contains a Python function that polls a roomful of Fib(n) monkeys for the cost of large orders.
Question: Is this input file data?
From a strictly technical point of view, there is nothing different. Both the table and the expression needed to be parsed before usage. The mathematical expression parser probably supported branching and functions - it might not have been Turing-complete, but it still used a language of its own (e.g. MathML).
Yet now many people would argue that the input file just became code.
So what is the distinguishing feature that turns the input format from data into code?
Modifiability: Having to recompile the whole system to effect a change is a very good indication of a code-centric system. Yet I can easily imagine (well, more like I have actually seen) software that has been designed incompetently enough to have e.g. an input table built in at compile time. And let's not forget that many applications still have icons - which most people would deem data - built into their executables.
Input format: This is, in my opinion naively, the most common factor that people consider: "If it is in a programming language then it is code". Fine, C is code - you have to compile it after all. I would agree that Python is also code - it is a full-blown language. So why isn't XML/XSL code? XSL is a quite complex language in its own right - hence the L in its name.
In my opinion, neither of these two criteria is the actual distinguishing feature. I think that people should consider something else:
Maintainability: In short, if the user of the system has to hire a third party to obtain the expertise needed to modify the behaviour of the system, then the system should be considered code-centric to a degree.
This, of course, means that whether a system is data-driven or not should be considered at least in relation to the target audience - if not in relation to the client on a case-by-case basis.
It also means that the distinction can be affected by the available toolset. The UML specification is a nightmare to go through, but these days we have all those graphical UML editors to help us. If there were some kind of third-party high-level AI tool that parses natural language and produces XML/Python/whatever, then the system becomes data-driven even for far more complex input.
A small store probably does not have the expertise or the resources to hire a third party. So, something that allows the workers to modify its behaviour with the knowledge that one would get in an average management course - mathematics, charts etc - could be considered sufficiently data-driven for this audience.
On the other hand, a multi-billion international corporation usually has in its payroll a bunch of IT specialists and Web designers. Therefore, XML/XSL, Javascript, or even Python and PHP are probably easy enough for it to handle. It also has complex enough requirements that something simpler might just not cut it.
I believe that when designing a software system, one should strive to achieve that fine balance in the used input formats where the target audience can do what they need to, without having to frequently call on third parties.
It should be noted that outsourcing blurs the lines even more. There are quite a few problems for which current technology simply does not allow the solution to be approachable by the layman. In that case, the target audience of the solution should probably be considered to be the third party to which the operation would be outsourced.
That third party can be expected to employ a fair number of experts.
One of five maxims under the Unix Philosophy, as presented by Rob Pike, is this:
Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
It is often shortened to, "write stupid code that uses smart data."
Other answers have already dug into how you can often code complex behavior with simple code that just reacts to the pattern of its particular input. You can think of the data as a domain-specific language, and of your code as an interpreter (maybe a trivial one).
Given lots of data you can go further: the statistics can power decisions. Peter Norvig wrote a great chapter illustrating this theme in Beautiful Data, with text, code, and data all available online. (Disclosure: I'm thanked in the acknowledgements.) On pp. 238-239:
How does the data-driven approach compare to a more traditional software development process wherein the programmer codes explicit rules? ... Clearly, the handwritten rules are difficult to develop and maintain. The big advantage of the data-driven method is that so much knowledge is encoded in the data, and new knowledge can be added just by collecting more data. But another advantage is that, while the data can be massive, the code is succinct—about 50 lines for correct, compared to over 1,500 for ht://Dig's spelling code. ...
Another issue is portability. If we wanted a Latvian spelling-corrector, the English metaphone rules would be of little use. To port the data-driven correct algorithm to another language, all we need is a large corpus of Latvian; the code remains unchanged.
He shows this concretely with code in Python using a dataset collected at Google. Besides spelling correction, there's code to segment words and to decipher cryptograms -- in just a couple pages, again, where Grady Booch's book spent dozens without even finishing it.
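To make that concrete, here is a heavily condensed sketch in the same spirit (not Norvig's exact code; "corpus.txt" is a stand-in for whatever large word list you have): the statistics in the corpus do nearly all the work, and the code barely changes between languages.
import re
from collections import Counter

# All the "knowledge" lives in this data: word frequencies from a big corpus.
WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))

def edits1(word):
    """Every string one delete, transpose, replace, or insert away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:]               for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:]           for a, b in splits if b for c in letters]
    inserts    = [a + c + b               for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the most frequent known candidate; the corpus decides."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)
Swap in a Latvian corpus and, as the quote says, the code stays the same.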
"The Unreasonable Effectiveness of Data" develops the same theme more broadly, without all the nuts and bolts.
I've taken this approach in my work for another search company and I think it's still underexploited compared to table-driven/DSL programming, because most of us weren't swimming in data so much until the last decade or two.
In languages in which code can be treated as data it is a non-issue. You use what's clear, brief, and maintainable, leaning towards data, code, functional, OO, or procedural, as the solution requires.
In procedural code, the distinction is marked, and we tend to think about data as something stored in a specific way, but even in procedural code it is best to hide the data behind an API, or behind an object in OO.
A lookup(avalue) can be reimplemented in many different ways during its lifetime, as long as it starts as a function.
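A tiny illustration of that point (all names made up): callers only ever see lookup(), so the backing data is free to change.
_RATES = {"US": 0.30, "DE": 0.19, "JP": 0.10}    # today: a literal in-memory table

def lookup(country_code):
    """Return the rate for a code; callers never know how it is stored."""
    return _RATES[country_code]

# Tomorrow the body could read a config file, query a database, or compute the
# value on the fly, and every caller of lookup() keeps working unchanged.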
...All the time I design programs for nonexisting machines and add: 'if we now had a machine comprising the primitives here assumed, then the job is done.'
... In actual practice, of course, this ideal machine will turn out not to exist, so our next task --structurally similar to the original one-- is to program the simulation of the "upper" machine... But this bunch of programs is written for a machine that in all probability will not exist, so our next job will be to simulate it in terms of programs for a next lower level machine, etc., until finally we have a program that can be executed by our hardware...
E. W. Dijkstra in Notes on Structured Programming, 1969, as quoted by John Allen, in Anatomy of Lisp, 1978.
When I think of this philosophy, which I agree with quite a bit, the first thing that comes to mind is code efficiency.
When I'm writing code, I know for sure it isn't always anything close to perfect, or even fully informed. Knowing enough to get close to maximum efficiency out of a machine when it is needed, and good efficiency the rest of the time (perhaps trading it off for better workflow), has allowed me to produce high-quality finished products.
Coding in a data-driven way, you end up using code for what code is for. Going and 'outsourcing' every variable to files would be foolishly extreme; the functionality of a program needs to be in the program, while the content, settings, and other factors can be managed by the program.
This also allows for much more dynamic applications and new features.
If you have even a simple form of database, you are able to apply the same functionality to many states. You may also do all manner of creative things like changing the context of what your program is doing based on file header data or perhaps directory, file name or extension, though not all data is necessarily stored on a filesystem.
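A small sketch of that kind of data-driven dispatch (handler names invented for illustration): a table, rather than a chain of if/else branches, decides what the program does with each file.
import os

def handle_csv(path):  print("parsing CSV:", path)
def handle_json(path): print("parsing JSON:", path)

HANDLERS = {".csv": handle_csv, ".json": handle_json}   # data drives the dispatch

def process(path):
    ext = os.path.splitext(path)[1].lower()
    handler = HANDLERS.get(ext)
    if handler is None:
        raise ValueError("no handler registered for " + ext)
    return handler(path)

process("orders.csv")   # supporting a new format is one new entry in HANDLERS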
Finally, keeping your code in a state where it is simply handling data puts you in a state of mind where you are closer to envisioning what is actually going on. This also keeps the bulk out of your code, greatly reducing bloat.
I believe it makes code more maintainable, more flexible and more efficient aaaand I like it.
Thank you to the others for your input on this as well! I found it very encouraging.

Memory efficiency vs Processor efficiency

In general use, should I bet on memory efficiency or processor efficiency?
In the end, I know it must depend on the software/hardware specs, but I think there's a general rule for when there are no constraints.
Example 01 (memory efficiency):
int n = 0;
if (n < getRndNumber())
    n = getRndNumber();
Example 02 (processor efficiency):
int n = 0, aux = 0;
aux = getRndNumber();
if (n < aux)
    n = aux;
They're just simple examples, and I wrote them to show what I mean. Better examples will be well received.
Thanks in advance.
I'm going to wheel out the universal performance question trump card and say "neither, bet on correctness".
Write your code in the clearest possible way, set specific measurable performance goals, measure the performance of your software, profile it to find the bottlenecks, and then if necessary optimise knowing whether processor or memory is your problem.
(As if to make a case in point, your 'simple examples' have different behaviour assuming getRndNumber() does not return a constant value. If you'd written it in the simplest way, something like n = max(0, getRndNumber()) then it may be less efficient but it would be more readable and more likely to be correct.)
Edit:
To answer Dervin's criticism below, I should probably state why I believe there is no general answer to this question.
A good example is taking a random sample from a sequence. For sequences small enough to be copied into another contiguous memory block, a partial Fisher-Yates shuffle which favours computational efficiency is the fastest approach. However, for very large sequences where insufficient memory is available to allocate, something like reservoir sampling that favours memory efficiency must be used; this will be an order of magnitude slower.
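For reference, a minimal sketch of reservoir sampling (Algorithm R) in Python, which keeps only k items in memory no matter how long the stream is:
import random

def reservoir_sample(stream, k):
    """Uniformly sample k items from a stream of unknown length in O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1), keeping the sample uniform.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000_000), 5))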
So what is the general case here? For sampling a sequence should you favour CPU or memory efficiency? You simply cannot tell without knowing things like the average and maximum sizes of the sequences, the amount of physical and virtual memory in the machine, the likely number of concurrent samples being taken, the CPU and memory requirements of the other code running on the machine, and even things like whether the application itself needs to favour speed or reliability. And even if you do know all that, then you're still only guessing, you don't really know which one to favour.
Therefore the only reasonable thing to do is implement the code in a manner favouring clarity and maintainability (taking factors you know into account, and assuming that clarity is not at the expense of gross inefficiency), measure it in a real-life situation to see whether it is causing a problem and what the problem is, and then if so alter it. Most of the time you will not have to change the code as it will not be a bottleneck. The net result of this approach is that you will have a clear and maintainable codebase overall, with the small parts that particularly need to be CPU and/or memory efficient optimised to be so.
You think one is unrelated to the other? Why do you think that? Here are two examples of bottlenecks that often go unconsidered.
Example 1
You design a DB-related software system and find that I/O is slowing you down as you read in one of the tables. Instead of allowing multiple queries resulting in multiple I/O operations, you ingest the entire table first. Now all rows of the table are in memory and the only limitation should be the CPU. Patting yourself on the back, you wonder why your program becomes hideously slow on memory-poor computers. Oh dear, you've forgotten about virtual memory, swapping, and such.
Example 2
You write a program where your methods create many small objects but possess O(1), O(log n), or at worst O(n) running time. You've optimized for speed, but you see that your application takes a long time to run. Curious, you profile to discover what the culprit could be. To your chagrin, you discover that all those small objects add up fast. Your code is being held back by the GC.
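A quick way to see the effect (Python used just for illustration; the same point applies to any garbage-collected runtime): each tiny object carries bookkeeping overhead far beyond its payload, and millions of them add up.
import sys

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

points = [Point(i, i) for i in range(1_000_000)]
# The per-instance overhead (object header plus attribute dict) dwarfs the two numbers it holds.
print(sys.getsizeof(points[0]), sys.getsizeof(points[0].__dict__))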
You have to decide based on the particular application, usage, etc. In your above example, both memory and processor usage are trivial, so it's not a good example.
A better example might be the use of history tables in chess search. This method caches previously searched positions in the game tree in case they are re-searched in other branches of the game tree or on the next move.
However, it does cost space to store them, and space also requires time. If you use up too much memory you might end up using virtual memory which will be slow.
Another example might be caching in a database server. Clearly it is faster to access a cached result from main memory, but then again it would not be a good idea to keep loading and freeing from memory data that is unlikely to be re-used.
In other words, you can't generalize. You can't even make a decision based on the code - sometimes the decision has to be made in the context of likely data and usage patterns.
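As a small illustration of that trade-off (Python's standard-library cache used as a stand-in for a history table or database cache): the cache buys speed with memory, and the right bound depends entirely on how often entries are actually reused.
from functools import lru_cache

@lru_cache(maxsize=4096)          # bound the memory the cache may consume
def evaluate(position):
    """Stand-in for an expensive search or query keyed by a position string."""
    return sum(ord(c) for c in position)

evaluate("rnbqkbnr/pppppppp")     # computed once...
evaluate("rnbqkbnr/pppppppp")     # ...then served straight from the cache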
In the past 10 years, main memory has increased in speed hardly at all, while processors have continued to race ahead. There is no reason to believe this is going to change.
Edit: Incidentally, in your example, aux will most likely end up in a register and never make it to memory at all.
Without context, I don't think optimising for anything other than readability and flexibility makes sense.
So, the only general rule I could agree with is: "Optimise for readability, while bearing in mind the possibility that at some point in the future you may have to optimise for either memory or processor efficiency."
Sorry it isn't quite as catchy as you would like...
In your example, version 2 is clearly better, even though version 1 is prettier to me, since as others have pointed out, calling getRndNumber() multiple times requires more knowledge of getRndNumber() to follow.
It's also worth considering the scope of the operation you are looking to optimize; if the operation is time sensitive, say part of a web request or GUI update, it might be better to err on the side of completing it faster than saving memory.
Processor efficiency. Memory is egregiously slow compared to your processor. See this link for more details.
Although, in your example, the two would likely be optimized to be equivalent by the compiler.

Designing system architecture for real time acquisition and 'control'

Brief description of requirements
(Lots of good answers here, thanks to all, I'll update if I ever get this flying).
A detector runs along a track, measuring several different physical parameters in real time (deterministic), as a function of curvilinear distance. The user can click a button to 'mark' waypoints during this process, then uses the GUI to enter the details for each waypoint (in human time, but while the data acquisition continues).
Following this, the system performs a series of calculations/filters/modifications on the acquired data, taking into account the constraints entered for each waypoint. The output of this process is a series of corrections, also as a function of curvilinear distance.
The third part of the process involves running along the track again, but this time writing the corrections to a physical system which corrects the track (still as a function of curvilinear distance).
My current idea for your input/comments/warnings
What I want to determine is whether I can do this with a PC + FPGA. The FPGA would do the 'data acquisition', and I would use C# on the PC to read the data from a buffer. The waypoint information could be entered via a WPF/WinForms application and stored in a database/flat file/anything pending 'processing'.
For the processing, I would use F#.
The FPGA would then be used for 'writing' the information back to the physical machine.
The one problem that I can foresee currently is if processing algorithms require a sampling frequency which makes the quantity of data to buffer too big. This would imply offloading some of the processing to the FPGA - at least the bits that don't require user input. Unfortunately, the only pre-processing algorithm is a Kalman filter, which is difficult to implement with an FPGA, from what I have googled.
I'd be very grateful for any feedback you care to give.
UPDATES (extra info added here as and when)
At the entrance to the Kalman filter we're looking at one sample every 1 ms. But on the other side of the Kalman filter, we would be sampling every 1 m, which at the speeds we're talking about is about 2 samples a second.
So I guess more precise questions would be:
implementing a Kalman filter on an FPGA - seems that it's possible, but I don't understand enough about either subject to be able to work out just HOW possible it is.
I'm also not sure whether an FPGA implementation of a Kalman will be able to cycle every 1ms - though I imagine that it should be no problem.
If I've understood correctly, FPGAs don't have huge amounts of memory. For the third part of the process, where I would be sending an (approximately) 4 x 400 array of doubles to use as a lookup table, is this feasible?
Also, would swapping between the two processes (reading/writing data) imply re-programming the FPGA each time, or could it be instructed to switch between the two? (Maybe possible just to run both in parallel and ignore one or the other).
Another option I've seen is compiling F# to VHDL using Avalda FPGA Developer, I'll be trying that soon, I think.
You don't mention your goals, customers, budget, reliability or deadlines, so this is hard to answer, but...
Forget the FPGA. Simplify your design, development environment and interfaces unless you know you are going to blow your real-time requirements with another solution.
If you have the budget, I'd first take look at LabView.
http://www.ni.com/labview/
http://www.ni.com/dataacquisition/
LabView would give you the data acquisition system and user GUI all on a single PC. In my experience, developers don't choose LabView because it doesn't feel like a 'real' programming environment, but I'd definitely recommend it for the problem you described.
If you are determined to use compiled languages, then I'd isolate the real-time data acquisition component to an embedded target with an RTOS, preferably one that takes advantage of the MMU for scheduling and thread isolation and lets you write in C. If you get a real RTOS, you should be able to reliably schedule the processes that need to run, and also be able to debug them if need be! Keep this off-target system as simple as possible with defined interfaces. Make it do just enough to get the data you need.
I'd then implement the interfaces back to the PC GUI using a common interface file for maintenance. Use standard interfaces for data transfer to the PC, something like USB2 or Ethernet. The FTDI chips are great for this stuff.
Since you are moving along a track, I have to assume the sampling frequency isn't more than 10 kHz. You can offload the data to PC at that rate easily, even 12 Mb USB (full-speed).
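As a rough sanity check: assuming, say, 8 channels of 16-bit samples at 10 kHz, that is 160 KB/s, or roughly 1.3 Mbit/s, comfortably within even full-speed USB.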
For serious processing of math data, Matlab is the way to go. But since I haven't heard of F#, I can't comment.
4 x 400 doubles is no problem. Even low-end FPGAs have 100's of kb of memory.
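(For scale: 4 x 400 doubles is 1,600 x 8 bytes, about 12.8 KB, i.e. on the order of 100 kilobits.)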
You don't have to change images to swap between reading and writing. That is done all the time in FPGAs.
Here is a suggestion.
Dump the FPGA concept.
Get a DSP evaluation board from TI.
Pick one with enough gigaflops to make you happy, and enough RAM to store your working set.
Program it in C. TI supply a small RT kernel.
It talks to the PC over, say, a serial port or Ethernet, whatever.
It sends the PC cooked data with a handshake so the data doesn't get lost.
There is enough RAM in the DSP to store your data while the PC has senior moments.
No performance problems with the DSP.
The real-time part does the real-time work, with MBs of RAM.
Processing is fast, and the GUI is not time-critical.
What is your connection to the PC? .NET will be a good fit if it is a network-based connection, as you can use streams to deal with the data input.
My only warning to you regarding F#, or any functional programming language, involving large data sets is memory usage. They are wonderful and mathematically provable, but when you get a stack overflow exception from too many recursions, it means that your program won't work and you lose time and effort.
C# will be great if you need to develop a GUI; WinForms and GDI+ should get you to something usable without a monumental effort.
Give us some more information regarding data rates and connection and maybe we can offer some more help?
There might be something useful in Microsoft Robotics Studio, especially for the real-time aspect. The CCR (Concurrency and Coordination Runtime) has a lot of this thought out already, and the simulation tools might help you build a model that would aid your analysis.
Sounds to me like you can do all the processing offline. If that is the case, then offline is the way to go. In other words, divide the process into 3 steps:
Data acquisition
Data analysis
Physical system corrections based on the data analysis.
Data Acquisition
If you can't collect the data using a standard interface, then you probably have to go with a custom interface. It's hard to say whether you should be using an FPGA without knowing more about your interface. Building custom interfaces is expensive, so you should do a trade-off study to select the approach. Anyway, if this is FPGA-based, then keep the FPGA simple and use it for raw data acquisition. With current hard-drive technology you can easily store hundreds of gigabytes of data for post-processing, so store the raw data on a disk drive. There's no way you want to be implementing even a one-dimensional Kalman filter in an FPGA if you don't have to.
Data Analysis
Once you've got the data on a hard drive, then you have lots of options for data analysis. If you already know F#, then go with F#. Python and Matlab both have lots of data analysis libraries available.
This approach also makes it much easier to test your data analysis software than a solution where you have to do all the processing in real time. If the results don't seem right, you can easily rerun the analysis without having to go and collect the data again.
Physical System Corrections
Take the results of the data analysis and run the detector along the track again feeding it the appropriate inputs through the interface card.
I've done a lot of embedded engineering including hybrid systems such as the one you've described. At the data rates and sizes you need to process, I doubt that you need an FPGA ... simply find an off the shelf data acquisition system to plug into your PC.
I think the biggest issue you're going to run into is more related to language bindings for your hardware APIs. In the past, I've had to develop a lot of my software in C and assembly (and even some Forth) simply because that was the easiest way to get the data from the hardware.
