Memory Consistency on iOS

I was reading this article on general memory consistency and it raised some questions for me about memory consistency on iOS. It's a really interesting article, but it doesn't go into many specifics about particular platforms. The article mentions that languages such as C++ (and, I'm guessing, Objective-C and the Cocoa Touch APIs) use sequential consistency for data-race-free programs to remove many of the weird behaviors that can occur when multiple threads write to the same memory.
So, say I were to use Grand Central Dispatch to create a bunch of different threads: would it even be possible to declare and use global variables that are stored at the same location in memory? If so, how would the writing and reading process work? Is there a write buffer? I would test some of this out if I could, but at the moment I cannot. Anything to help me understand this concept would be appreciated.
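To make the "sequential consistency for data-race-free programs" idea concrete, here is a minimal C++ sketch. It isn't GCD-specific, but C, C++, and Objective-C on iOS are all compiled by Clang against the same memory model, so the same rules apply to threads created with GCD (the variable names are made up for illustration):

#include <atomic>
#include <iostream>
#include <thread>

// Both globals are atomic, so the program is data-race-free, and the
// default memory order (seq_cst) gives sequentially consistent behavior.
std::atomic<int> data{0};
std::atomic<bool> ready{false};

int main() {
    std::thread writer([] {
        data.store(42);     // happens before the store to ready below
        ready.store(true);  // publish: no thread can see ready==true but data==0
    });
    std::thread reader([] {
        while (!ready.load()) {}           // spin until the writer publishes
        std::cout << data.load() << '\n';  // guaranteed to print 42
    });
    writer.join();
    reader.join();
}

If data and ready were plain non-atomic globals, the same program would contain a data race, and store buffering or reordering by the compiler or CPU could produce surprising results; that is exactly the case the "data-race-free" qualifier rules out.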

Related

Does Dart feature Tail Call Optimization (TCO)?

I wanted to try out in Dart some algorithms and patterns from functional programming, but a lot of them rely heavily on recursion, which can cause serious stack growth, and eventually a stack overflow, without Tail Call Optimization (TCO), which isn't mandatory when implementing a language.
Is there an official statement on this topic from the Dart team, or something about it in the documentation? I could probably figure out whether it is currently present in the language by using Dart's dev tools and profiling; however, that way I would never know the Dart team's intentions with respect to the topic, hence the raison d'être of this question.
Dart does not support tail-call optimization. There are no current plans to add it.
The primary reason is that it's a feature you need to be able to rely on in order to use: otherwise you get hugely inefficient code that might overflow the stack. And since JavaScript currently does not support tail call optimization, the feature cannot be efficiently compiled to JavaScript.
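To see why a guarantee matters, here is a sketch in C++ (a stand-in, since C++ compilers also perform TCO only as an optional optimization; sumTo is a hypothetical example):

#include <cstdint>

// Tail-recursive: the recursive call is the last thing the function does.
// With guaranteed TCO this runs in constant stack space; without it,
// a large enough n will overflow the stack.
int64_t sumTo(int64_t n, int64_t acc = 0) {
    if (n == 0) return acc;
    return sumTo(n - 1, acc + n);  // tail call
}

// Portable workaround when TCO is not guaranteed: rewrite the tail
// recursion as an explicit loop.
int64_t sumToLoop(int64_t n) {
    int64_t acc = 0;
    for (; n > 0; --n) acc += n;
    return acc;
}

This is the trade-off the answer describes: code written to rely on TCO is only safe on implementations that promise it, which Dart (and the JavaScript it compiles to) does not.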

llvm based code mutation for genetic programming?

For a study on genetic programming, I would like to implement an evolutionary system on the basis of LLVM and apply code mutations (possibly at the IR level).
I found llvm-mutate, which is quite useful for executing point mutations.
As far as I have understood, the instructions get counted/numbered, and one can then, e.g., delete a numbered instruction.
However, introducing a new instruction seems to be possible only by reusing one of the statements already available in the code.
Real mutation, however, would allow inserting any of the allowed IR instructions, irrespective of whether it is already used in the code being mutated.
In addition, it should be possible to insert calls to functions from linked libraries (not used in the current code, but possibly available because the library has been linked in by clang).
Did I overlook this in llvm-mutate, or is it really not possible so far?
Are there any projects trying to implement, or that have already implemented, such mutations for LLVM?
LLVM has lots of code analysis tools which should allow the implementation of the aforementioned approach. LLVM is huge, so I'm a bit disoriented. Any hints on which tools could be helpful (e.g., for getting a list of available library functions)?
Thanks
Alex
Very interesting question. I have been intrigued by the possibility of doing binary-level genetic programming for a while. With respect to what you ask:
It is apparent from its documentation that LLVM-mutate can't do what you are asking. However, I think it is wise for it not to. My reasoning is that any machine-language genetic program would inevitably face the halting problem: it would be impossible to know whether a randomly generated instruction would completely crash the whole computer (for example, by assigning a value to an OS-reserved pointer), or whether it might run forever and take all of your CPU cycles. Turing's theorem tells us that it is impossible to know in advance whether a given program will do that. Mind you, LLVM-mutate can still cause a perfectly harmless program to crash or run forever, but I think its approach makes that less likely by only reusing existing instructions.
However, such a thing as "impossibility" only deters scientists, not engineers :-)...
What I have been thinking is this: in nature, real mutations work a lot more like LLVM-mutate than like what we do in normal genetic programming. In other words, they simply swap letters out of a very limited set (A, T, C, G), and every possible variation comes out of this. We could have a program, or set of programs, with an initial set of instructions plus a set of "possible functions", either linked or defined in the program. Most of these functions would not actually be used, but they would be there to provide "raw DNA" for mutations, just like in our own DNA. This set of functions would contain the complete (or semi-complete) set of possible functions for a problem space. Then we simply use basic operations like the ones in LLVM-mutate.
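As a rough sketch of how that "raw DNA" pool could be gathered with the LLVM C++ API (the module iteration and predicates are real LLVM API; libraryCandidates is a hypothetical helper name):

#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include <vector>

// Collect functions that are declared but not defined in this module,
// i.e. typically symbols pulled in from linked libraries. An insertion
// mutation could then pick one at random and build a call to it.
std::vector<llvm::Function *> libraryCandidates(llvm::Module &M) {
    std::vector<llvm::Function *> candidates;
    for (llvm::Function &F : M.functions())
        if (F.isDeclaration() && !F.isIntrinsic())
            candidates.push_back(&F);
    return candidates;
}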
Some possible problems though:
Given the amount of possible variability, the only way to have acceptable execution times would be to have massive amounts of computing power. Possibly achievable in the cloud or with GPUs.
You would still have to contend with Mr. Turing's halting problem. However, I think this could be resolved by running the solutions in a "sandbox" that doesn't take you down if a solution blows up: something like a single-use virtual machine or a Docker-like container, with a time limit (to get out of infinite loops). A solution that crashes or times out would get the worst possible fitness, so that the programs would tend to diverge away from those paths.
As to why do this at all, I can see a number of interesting applications: self-healing programs, programs that self-optimize for a specific environment, program "vaccination" against vulnerabilities, mutating viruses, quality assurance, etc.
I think there's a potential open source project here. It would be insane, dangerous, and a time-sucking vortex: just my kind of project. Count me in if someone is doing it.

For distributed applications, which to use, ASIO vs. MPI?

I am a bit confused about this. If you're building a distributed application, which in some cases may perform parallel operations (although not necessarily mathematical ones), should you use ASIO or something like MPI? I take it MPI operates at a higher level than ASIO, but it's not clear where in the stack one would begin.
I know nothing about ASIO, but from a quick Google search it looks to be a lot lower-level than MPI. For me, the whole point of MPI is that I can program against a higher level of messaging abstraction than, it seems, ASIO provides. Where you begin depends on your needs. For mine, parallelising scientific codes for high performance, the obvious answer is MPI. I'm not sure I'd use it, or at least not sure it would be my default choice, if I were writing more general-purpose distributed, as opposed to parallel, applications. Well, actually, it probably would be my default choice, to avoid learning another approach (most of which are less portable and less long-lived than MPI), but I'll admit it might not be the best choice when starting from an equal footing.
As far as I know, MPI is currently incapable of handling the situation where new distributed nodes want to join an already-started group. Problems may also occur if one of the nodes goes offline.
MPI does not reveal any of the network-related machinery underneath. Thus, if you ever need something at a lower level, you're in trouble. If, on the other hand, you do not anticipate such a need, then you'll save yourself a lot of time by using MPI.
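To illustrate the level of abstraction being discussed, here is a minimal MPI sketch in C++: no sockets, no framing, no connection management, just ranks exchanging a typed message (the payload is made up):

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id in the group

    if (rank == 0) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  // send to rank 1
    } else if (rank == 1) {
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);  // receive from rank 0
        std::printf("rank 1 received %d\n", payload);
    }
    MPI_Finalize();
    return 0;
}

With ASIO you would be working a level below this, opening sockets and framing bytes yourself; run the sketch with, e.g., mpirun -np 2 ./a.out.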

What's the quickest way to parallelize code?

I have an image processing routine that I believe could be made very parallel very quickly. Each pixel needs to have roughly 2k operations done on it in a way that doesn't depend on the operations done on neighbors, so splitting the work up into different units is fairly straightforward.
My question is, what's the best way to approach this change such that I get the quickest speedup bang-for-the-buck?
Ideally, the library/approach I'm looking for should meet these criteria:
Still be around in 5 years. Something like CUDA or ATI's variant may get replaced with a less hardware-specific solution in the not-too-distant future, so I'd like something a bit more robust to time. If my impression of CUDA is wrong, I welcome the correction.
Be fast to implement. I've already written this code and it works in serial mode, albeit very slowly. Ideally, I'd just take my code and recompile it to be parallel, but I think that might be a fantasy. If I just rewrite it using a different paradigm (i.e., as shaders or something), then that would be fine too.
Not require too much knowledge of the hardware. I'd like to be able to not have to specify the number of threads or operational units, but rather to have something automatically figure all of that out for me based on the machine being used.
Be runnable on cheap hardware. That may mean a $150 graphics card, or whatever.
Be runnable on Windows. Something like GCD might be the right call, but the customer base I'm targeting won't switch to Mac or Linux any time soon. Note that this does make the response to the question a bit different than to this other question.
What libraries/approaches/languages should I be looking at? I've looked at things like OpenMP, CUDA, GCD, and so forth, but I'm wondering if there are other things I'm missing.
I'm leaning right now toward something like shaders and OpenGL 2.0, but that may not be the right call, since I'm not sure how many memory accesses I can get that way; those 2k operations require accessing all the neighboring pixels in a lot of ways.
The easiest way is probably to divide your picture into the number of parts that you can process in parallel (4, 8, 16, depending on cores), then just run a different process for each part.
In terms of doing this specifically, take a look at OpenCL. It will hopefully be around for longer since it's not vendor-specific and both NVidia and ATI want to support it.
In general, since you don't need to share too much data, the process is really pretty straightforward.
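A minimal sketch of that chunking approach with plain C++ threads (threads rather than separate processes, but the division of work is the same; the per-pixel arithmetic is a made-up stand-in for the real 2k operations):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-pixel work.
static void processPixel(std::vector<float> &image, std::size_t i) {
    image[i] = image[i] * 0.5f + 1.0f;
}

// Split the image into one contiguous chunk per hardware thread.
// Safe because each pixel's result is independent of its neighbors.
void processImage(std::vector<float> &image) {
    unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (image.size() + nThreads - 1) / nThreads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(begin + chunk, image.size());
        workers.emplace_back([&image, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                processPixel(image, i);
        });
    }
    for (auto &w : workers) w.join();
}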
I would also recommend Threading Building Blocks. We use this with the Intel® Integrated Performance Primitives for the image analysis at the company I work for.
Threading Building Blocks (TBB) is similar to both OpenMP and Cilk. It does the multithreading with its own task scheduler, wrapped in a simple interface. With it you don't have to worry about how many threads to make; you just define tasks. It will split the tasks, if it can, to keep everything busy, and it does the load balancing for you.
Intel Integrated Performance Primitives (IPP) has optimized libraries for vision, most of which are multithreaded. For the functions we need that aren't in IPP, we thread them ourselves using TBB.
Using these, we obtain the best results when we use the IPP method for creating the images: it pads each row so that any given cache line is entirely contained in one row. Then we don't divvy up a row of the image across threads, and we avoid the false sharing that results from two threads trying to write to the same cache line.
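A rough sketch of the TBB pattern described above: you describe the work over a range, and TBB picks the thread count, splits the range, and balances the load (the per-pixel arithmetic is again a made-up stand-in):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cstddef>
#include <vector>

void processImageTBB(std::vector<float> &image) {
    // We only define the task over an index range; TBB decides how
    // many threads to use and how to subdivide the range among them.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, image.size()),
        [&](const tbb::blocked_range<std::size_t> &r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                image[i] = image[i] * 0.5f + 1.0f;  // per-pixel work
        });
}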
Have you seen Intel's (Open Source) Threading Building Blocks?
I haven't used it, but take a look at Cilk. One of the big wigs on their team is Charles E. Leiserson; he is the "L" in CLRS, the most widely used and respected algorithms book on the planet.
I think it caters well to your requirements.
From my brief reading, all you have to do is "tag" your existing code and then run it through their compiler, which will automatically/seamlessly parallelize the code. This is their big selling point: you don't need to start from scratch with parallelism in mind, unlike other options (such as OpenMP).
If you already have a working serial code in one of C, C++ or Fortran, you should give serious consideration to OpenMP. One of its big advantages over a lot of other parallelisation libraries / languages / systems / whatever, is that you can parallelise a loop at a time which means that you can get useful speed-up without having to re-write or, worse, re-design, your program.
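For example, a single pragma parallelises one loop of otherwise untouched serial code (a sketch; brighten is a hypothetical stand-in for one of your per-pixel loops, compiled with, e.g., -fopenmp):

#include <vector>

void brighten(std::vector<float> &image) {
    // Without OpenMP support the pragma is simply ignored and the
    // loop runs serially, so the original program is unchanged.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(image.size()); ++i)
        image[i] += 1.0f;  // independent per-pixel work
}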
In terms of your requirements:
OpenMP is much used in high-performance computing, there's a lot of 'weight' behind it and an active development community -- www.openmp.org.
Fast enough to implement if you're lucky enough to have chosen C, C++ or Fortran.
OpenMP implements a shared-memory approach to parallel computing, so a big plus in the 'don't need to understand hardware' argument. You can leave the program to figure out how many processors it has at run time, then distribute the computation across whatever is available, another plus.
Runs on the hardware you already have, no need for expensive, or cheap, additional graphics cards.
Yep, there are implementations for Windows systems.
Of course, if you were unwise enough not to have chosen C, C++ or Fortran in the beginning, a lot of this advice will only apply after you have re-written your code in one of those languages!
Regards
Mark

Functional programming and multicore architecture

I've read somewhere that functional programming is suitable for taking advantage of the multi-core trend in computing, but I didn't really get the idea. Is it related to the lambda calculus and the von Neumann architecture?
Functional programming minimizes or eliminates side effects and thus is better suited to distributed programming, i.e., multicore processing.
In other words, lots of pieces of the puzzle can be solved independently on separate cores simultaneously without having to worry about one operation affecting another nearly as much as you would in other programming styles.
One of the hardest things about dealing with parallel processing is locking data structures to prevent corruption. If two threads were to mutate a data structure at once without having it locked perfectly, anything from invalid data to a deadlock could result.
In contrast, functional programming languages tend to emphasize immutable data. Any state is kept separate from the logic, and once a data structure is created it cannot be modified. The need for locking is greatly reduced.
Another benefit is that some processes that parallelize very easily, like iteration, are abstracted into functions. In C++, you might have a for loop that runs some data processing over each item in a list, but the compiler has no way of knowing whether those operations may safely run in parallel; maybe the result of one depends on the one before it. When a function like map() or reduce() is used, the compiler can know that there is no dependency between calls, so multiple items can be processed at the same time.
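C++ has since grown a direct analogue: with a parallel execution policy, std::transform carries exactly that "no dependency between elements" contract (a C++17 sketch; doubleAll is a made-up example):

#include <algorithm>
#include <execution>
#include <vector>

void doubleAll(std::vector<int> &xs) {
    // The lambda is applied to each element independently, so the
    // library is free to run the calls in parallel (C++17).
    std::transform(std::execution::par, xs.begin(), xs.end(), xs.begin(),
                   [](int x) { return x * 2; });
}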
I've read somewhere that functional programming is suitable to take advantage of multi-core trend in computing... I didn't really get the idea. Is it related to the lambda calculus and von neumann architecture?
The argument behind the belief you quoted is that purely functional programming controls side effects which makes it much easier and safer to introduce parallelism and, therefore, that purely functional programming languages should be advantageous in the context of multicore computers.
Unfortunately, this belief was long since disproven for several reasons:
The absolute performance of purely functional data structures is poor. So purely functional programming is a big initial step in the wrong direction in the context of performance (which is the sole purpose of parallel programming).
Purely functional data structures scale badly because they stress shared resources including the allocator/GC and main memory bandwidth. So parallelized purely functional programs often obtain poor speedups as the number of cores increases.
Purely functional programming renders performance unpredictable. So real purely functional programs often see performance degradation when parallelized because granularity is effectively random.
For example, the bastardized two-line quicksort often cited by the Haskell community typically runs thousands of times slower than a real in-place quicksort written in a more conventional language like F#. Moreover, although you can easily parallelize the elegant Haskell program, you are unlikely to see any performance improvement whatsoever because all of the unnecessary copying makes a single core saturate the entire main memory bandwidth of a multicore machine, rendering parallelism worthless. In fact, nobody has ever managed to write any kind of generic parallel sort in Haskell that is competitively performant. The state-of-the-art sorts provided by Haskell's standard library are typically hundreds of times slower than conventional alternatives.
However, the more common definition of functional programming as a style that emphasizes the use of first-class functions does actually turn out to be very useful in the context of multicore programming because this paradigm is ideal for factoring parallel programs. For example, see the new higher-order Parallel.For function from the System.Threading.Tasks namespace in .NET 4.
When there are no side effects the order of evaluation does not matter. It is then possible to evaluate expressions in parallel.
The basic argument is that it is difficult to automatically parallelize languages like C/C++/etc because functions can set global variables. Consider two function calls:
a = foo(b, c);
d = bar(e, f);
Though foo and bar have no arguments in common and one does not depend on the return value of the other, they nonetheless might have dependencies, because foo might set a global variable (or have some other side effect) that bar depends upon.
Functional languages guarantee that foo and bar are independent: there are no globals and no side effects. Therefore foo and bar can safely be run on different cores, automatically, without programmer intervention.
All the answers above get at the key idea that "no shared mutable storage" is a key enabler for executing pieces of a program in parallel. It does not really solve the equally hard problem of finding things to execute in parallel. But the typically clearer expression of functionality in functional languages does make it theoretically easier to extract parallelism from a sequential expression.
In practice, I think the "no shared mutable storage" property of languages based on garbage collection and copy-on-change semantics make them easier to add threads to. The best example is probably Erlang, that combines near-functional semantics with explicit threads.
This is a little bit of a vague question. One perk of multi-core CPUs is that you can run a functional program and let it plug away serially without worrying about it affecting any other computing the machine is carrying out.
The difference between a multi-U server and a multi-core CPU in a server or PC is the speed savings you get by having everything on the same bus, allowing better and faster communication between the cores.
edit: I should probably qualify this post by saying that in most of the scripting I do, with or without multiple cores, I rarely see a problem getting my data through hackish parallelizing, such as running multiple small scripts at once from my main script so I'm not slowed down by things like waiting for URLs to load and whatnot.
double edit: Furthermore, a lot of functional programming languages have had forked parallel variants for decades. These better utilize parallel computation, with some speed improvement, but they never really caught on.
Omitting any technical/scientific terms, the reason is that functional programs don't share data. Data is copied and transferred among functions, so there is no shared data in the application.
And shared data is what causes half the headaches with multithreading.
The book Programming Erlang: Software for a Concurrent World by Joe Armstrong (the creator of Erlang) talks quite a bit about using Erlang for multicore (and multiprocessor) systems. As the Wikipedia article states:
Creating and managing processes is trivial in Erlang, whereas threads are considered a complicated and error-prone topic in most languages. Though all concurrency is explicit in Erlang, processes communicate using message passing instead of shared variables, which removes the need for locks.
