How to mitigate host + device memory tranfer bottlenecks in OpenCL/CUDA - memory

If my algorithm is bottlenecked by host to device and device to host memory transfers, is the only solution a different or revised algorithm?

There are a couple things you can try to mitigate the PCIe bottleneck:
Asynchronous transfers - permits overlapping computation and bulk transfer
Mapped memory - allows a kernel to stream data to/from the GPU during execution
Note that neither of these techniques makes the transfer go faster, they just reduce the time the GPU is waiting on the data to arrive.
With the cudaMemcpyAsync API function you can initiate a transfer, launch one or more kernels that do not depend on the result of the transfer, synchronize the host and device, and then launch kernels that were waiting on the transfer to complete. If you can structure your algorithm such that you're doing productive work while the transfer is taking place, then asynchronous copies are a good solution.
With the cudaHostAlloc API function you can allocate host memory that can read and written directly from the GPU. The reason this is faster is that a block that needs host data only needs to wait for a small portion of the data to be transferred. In contrast, the usual approach makes all blocks wait until the entire transfer is complete. Mapped memory essentially breaks a big monolithic transfer into a bunch or smaller copy operations, so the latency is reduced.
You can read more about these topics in Section 3.2.6-3.2.7 of the CUDA Programming Guide and Section 3.1 of the CUDA Best Practices Guide. Chapter 3 of the OpenCL Best Practices Guide explains how to use these features in OpenCL.

You really need to do the math to be certain that you're going to be doing enough processing on the GPU to make it worthwhile transferring data between host and GPU. Ideally you do this at the design stage, before doing any coding, since it can be a deal-breaker.

Related

ESP32: Best way to store data frequently?

I'm developing a C++ application in the ESP32-DevKitC board where I sense acceleration from an accelerometer. The application goal is to store the accelerometer data until storage is full and then send all the data through WiFi and start all again. The micro also goes to deep-sleep mode when is possible.
I'm currently using the ESP32 NVS library which is very well documented and pretty easy to use. The negative side of this is that the library uses Flash memory, therefore a lot of writings will end up degrading the drive.
I know that Espressif also offers some other storage libraries (FAT, SPIFFS, etc.) but, as far as I know (correct me if I'm wrong), they all use Flash drive.
Is there any other possibility of doing what I want to but without using the Flash storage?
Aclarations
Using Flash memory is not the problem itself, but degrading it.
Storage has to be non volatile or at least not being erased when the micro goes to deep-sleep mode.
I'm not using any Arduino library.
That's a great question that I wish more people would ask.
ESP32s use NOR flash storage, which is usually rated for between 10,000 to 100,000 write cycles (100,000 seems to be the standard these days). Flash can't write single bytes; instead of writes a "page" of bytes, which I believe is 256 bytes. So each 256 byte page is rated for at least 100,000 cycles. When a device is rated for 100,000 cycles it's likely to be usable for at least 10 times that, but the manufacturer is not going to make any promises beyond the 100,000.
SPIFFS (and LittleFS, now used on the ESP8266 Arduino Core) perform "wear leveling", to minimize the number of times a particular page is written. So if you modify the same section of a file repeatedly, it will automatically be written to different pages of flash. FAT is not designed to work well with flash storage; I would avoid it.
Whether SPIFFS with wear leveling will be adequate for your needs depends on your needed lifetime of the device versus how much data you'll be writing and how frequently.
NVS may perform some level of wear levelling, to an extent I'm unsure about. Here, in a forum post with 2 ESP employees, they both confirm that NVS does do some form of wear levelling. NVS is best used to persist things like configuration information that doesn't change frequently. It's not a great choice for storing information that's updated often.
You mentioned that the data just needs to survive deep sleep. If that's the case, your best option (if it's large enough) is to use the ESP32's RTC static RAM. This chunk of memory will survive restarts and deep sleep mode, but will lose its state if power is interrupted. It's real RAM so you won't wear it out by writing to it frequently, and it doesn't cost a lot of energy to write to. The catch is there's only 8KB of it.
If the 8KB of RTC RAM isn't enough and you're writing too much data too frequently to trust that SPIFFS will be okay, your best bet would be an SD card. The ESP32 can talk to an SD card adapter. SD cards use NAND flash, which has a much greater lifespan than NOR and can be safely overwritten many more times (which is why these kinds of cards are usable for filesystems in devices like Raspberry Pis).
Writing to flash also takes much more energy than writing to regular RAM. If your device is going to be battery powered, the RTC RAM is also a better choice than SPIFFS or an SD card from a power savings perspective.
Finally, if you use the RTC RAM I'd recommend starting to write it over wifi before it's full, as bringing up wifi and transmitting the data could easily take long enough that you might run out of space for some samples. Using it as a ring buffer and starting the transmit process when you hit a high water mark rather than when the buffer is full would probably be your best bet.
I know i'm late with this answer but you can buy ESP32 modules with external RAM even with 4-8mb. External ram is really fast ( at least much faster than the flash, it uses SPI interface to communicate ) and you can fit a lot of sensor readings in there.
I'm using an ESP32_WROVER_E module with 8mb external ram ( 4mb is usable with normal function calls ) and 16mb flash.
Here is a link of the module that i'm using at TME's site.

Is memory outside each core always conceptually flat/uniform/synchronous in a multiprocessor system?

Multi processor systems perform "real" memory operations (those that influence definitive executions, not just speculative execution) out of order and asynchronously as waiting for global synchronization of global state would needlessly stall all executions nearly all the time. On the other hand, immediately outside each individual core, it seems that the memory system, starting with L1 cache, is purely synchronous, consistent, flat from the allowed behavior point of view (allowed semantics); obviously timing depends on the cache size and behavior.
So on a CPU there on one extreme are named "registers" which are private by definition, and on the other extreme there is memory which is shared; it seems a shame that outside the minuscule space of registers, which have peculiar naming or addressing mode, the memory is always global, shared and globally synchronous, and effectively entirely subject to all fences, even if it's memory used as unnamed registers, for the purpose of storing more data than would fit in the few registers, without a possibility of being examined by other threads (except by debugging with ptrace which obviously stalls, halts, serializes and stores the complete observable state of an execution).
Is that always the case on modern computers (modern = those that can reasonably support C++ and Java)?
Why doesn't the dedicated L1 cache provide register-like semantics for those memory units that are only used by a particular core? The cache must track which memory is shared, no matter what. Memory operations on such local data doesn't have to be stalled when strict global ordering of memory operations are needed, as no other core is observing it, and the cache has the power to stall such external accesses if needed. The cache would just have to know which memory units are private (non globally readable) until a stall of out of order operations, which makes then consistent (the cache would probably need a way to ask the core to serialize operations and publish a consistent state in memory).
Do all CPU stall and synchronize all memory accesses on a fence or synchronizing operation?
Can the memory be used as an almost infinite register resource not subject to fencing?
In practice, a single core operating on memory that no other threads are accessing doesn't slow down much in order to maintain global memory semantics, vs. how a uniprocessor system could be designed.
But on a big multi-socket system, especially x86, cache-coherency (snooping the other socket) is part of what makes memory latency worse for cache misses than on a single-socket system, though. (For accesses that miss in private caches).
Yes, all multi-core systems that you can run a single multi-threaded program on have coherent shared memory between all cores, using some variant of the MESI cache-coherency protocol. (Any exceptions to this rule are considered exotic and have to be programmed specially.)
Huge systems with multiple separate coherency domains that require explicit flushing are more like a tightly-coupled cluster for efficient message passing, not an SMP system. (Normal NUMA multi-socket systems are cache-coherent: Is mov + mfence safe on NUMA? goes into detail for x86 specifically.)
While a core has a cache line in MESI Modified or Exclusive state, it can modify it without notifying other cores about changes. M and E states in one cache mean that no other caches in the system have any valid copy of the line. But loads and stores still have to respect the memory model, e.g. an x86 core still has to commit stores to L1d cache in program order.
L1d and L2 are part of a modern CPU core, but you're right that L1d is not actually modified speculatively. It can be read speculatively.
Most of what you're asking about is handled by a store buffer with store forwarding, allowing store/reload to execute without waiting for the store to become globally visible.
what is a store buffer? and Size of store buffers on Intel hardware? What exactly is a store buffer?
A store buffer is essential for decoupling speculative out-of-order execution (writing data+address into the store buffer) from in-order commit to globally-visible L1d cache.
It's very important even for an in-order core, otherwise cache-miss stores would stall execution. And generally you want a store buffer to coalesce consecutive narrow stores into a single wider cache write, especially for weakly-ordered uarches that can do so aggressively; many non-x86 microarchitectures only have fully efficient commit to cache for aligned 4-byte or wider chunks.
On a strongly-ordered memory model, speculative out-of-order loads and checking later to see if any other core invalidated the line before we're "allowed" to read it is also essential for high performance, allowing hit-under-miss for out-of-order exec to continue instead of one cache miss load stalling all other loads.
There are some limitations to this model:
limited store-buffer size means we don't have much private store/reload space
a strongly-ordered memory model stops private stores from committing to L1d out of order, so a store to a shared variable that has to wait for the line from another core could result in the store buffer filling up with private stores.
memory barrier instructions like x86 mfence or lock add, or ARM dsb ish have to drain the store buffer, so stores to (and reloads from) thread-private memory that's not in practice shared still has to wait for stores you care about to become globally visible.
conversely, waiting for shared store you care about to become visible (with a barrier or a release-store) has to also wait for private memory operations even if they're independent.
the memory is always global, shared and globally synchronous, and
effectively entirely subject to all fences, even if it's memory used
as unnamed registers,
I'm not sure what you mean here. If a thread is accessing private data (i.e., not shared with any other thread), then there is almost no need for memory fence instructions1. Fences are used to control the order in which memory accesses from one core are seen by other cores.
Why doesn't the dedicated L1 cache provide register-like semantics for
those memory units that are only used by a particular execution unit?
I think (if I understand you correctly) what you're describing is called a scratchpad memory (SPM), which is a hardware memory structure that is mapped to the architectural physical address space or has its own physical address space. The software can directly access any location in an SPM, similar to main memory. However, unlike main memory, SPM has a higher bandwidth and/or lower latency than main memory, but is typically much smaller in size.
SPM is much simpler than a cache because it doesn't need tags, MSHRs, a replacement policy, or hardware prefetchers. In addition, the coherence of SPM works like main memory, i.e., it comes into play only when there are multiple processors.
SPM has been used in many commercial hardware accelerators such as GPUs, DSPs, and manycore processor. One example I am familiar with is the MCDRAM of the Knights Landing (KNL) manycore processor, which can be configured to work as near memory (i.e., an SPM), a last-level cache for main memory, or as a hybrid. The portion of the MCDRAM that is configured to work as SPM is mapped to the same physical address space as DRAM and the L2 cache (which is private to each tile) becomes the last-level cache for that portion of MCDRAM. If there is a portion of MCDRAM that is configured as a cache for DRAM, then it would be the last-level cache of DRAM only and not the SPM portion. MCDRAM has a much higher bandwdith than DRAM, but the latency is about the same.
In general, SPM can be placed anywhere in the memory hierarchy. For example, it could placed at the same level as the L1 cache. SPM improves performance and reduces energy consumption when there is no or little need to move data between SPM and DRAM.
SPM is very suitable for systems with real-time requirements because it provides guarantees regarding the maximum latency and/or lowest bandwdith, which is necessary to determine with certainty whether real-time constraints can be met.
SPM is not very suitable for general-purpose desktop or server systems where they can be multiple applications running concurrently. Such systems don't have real-time requirements and, currently, the average bandwdith demand doesn't justify the cost of including something like MCDRAM. Moreover, using an SPM at the L1 or L2 level imposes size constraints on the SPM and the caches and makes difficult for the OS and applications to exploit such a memory hierarchy.
Intel Optance DC memory can be mapped to the physical address space, but it is at the same level as main memory, so it's not considered as an SPM.
Footnotes:
(1) Memory fences may still be needed in single-thread (or uniprocessor) scenarios. For example, if you want to measure the execution time of a specific region of code on an out-of-order processor, it may be necessary to wrap the region between two suitable fence instructions. Fences are also required when communicating with an I/O device through write-combining memory-mapped I/O pages to ensure that all earlier stores have reached the device.

CPUs in multi-core architectures and memory access

I wondered how memory access is handled "in general" if ,for example, 2 cores of CPU try to access memory at the same time (over the memory controller)? Actually the same applies when a core and an DMA-enabled IO device try to access in the same way.
I think, memory controller is smart enough to utilise the address bus and handle those requests concurrently, however I'm not sure what happens when they try to access to same location or when the IO operation monopolises the address bus and there's no room for CPU to move on.
Thx
The short answer is "it's complex, but access can certainly potentially occur in parallel in certain situations".
I think your question is a bit too black and white: you may be looking for an answer like "yes, multiple devices can access memory at the same time" or "no they can't", but the reality is that first you'd need to describe some specific hardware configuration, including some of the low-level implementation details and optimization features to get an exact answer. Finally you'd need to define exactly what you mean by "the same time".
In general, a good first-order approximation is that hardware will make it appear that all hardware can access memory approximately simultaneously, possibly with an increase in latency and a decrease in bandwidth due to contention. At the very fine-grained timing level access one device may indeed postpone access by another device, or it may not, depending on many factors. It is extremely unlikely you would need this information to implement software correctly, and quite unlikely you need to know the details even to maximize performance.
That said, if you really need to know the details, read on and I can give some general observations on some kind of idealized latpop/desktop/server scale hardware.
As Matthias mentioned, you first have to consider caching. Caching means that any read or write operation subject to caching (which includes nearly all CPU requests and many other types of requests as well) may not touch memory at all, so in that sense many cores can "access" memory (at least the cache image of it) simultaneous.
If you then consider requests that miss in all cache levels, you need to know about the configuration of the memory subsystem. In general a RAM chips can only do "one thing" at a time (i.e., commands1 such a read and write apply to the entire module) and that usually extends to DRAM modules comprised of several chips and also to a series of DRAMs connected via a bus to a single memory controller.
So you can say that electrically speaking, the combination of one memory controller and its attached RAM is likely to be doing only on thing at once. Now that thing is usually something like reading bytes out of a physically contiguous span of bytes, but that operation could actually help handle several requests from different devices at once: even though each devices sends separate requests to the controller, good implementations will coalesce requests to the same or nearby2 area of memory.
Furthermore, even the CPU may have such abilities: when a new request occurs it can/must notice that an existing request is in progress for an overlapping region and tie the new request to an old one.
Still, you can say that for a single memory controller you'll usually be serving the request of one device at a time, absent unusual opportunities to combine requests. Now the requests themselves are typically on the order of nanoseconds, so many separate requests can be served in a small unit of time, so this "exclusiveness" fine-grained and not generally noticeable3.
Now above I was careful to limit the discussion to a single memory-controller - when you have multiple memory controllers4 you can definitely have multiple devices accessing memory simultaneously even at the RAM level. Here each controller is essentially independent, so if the requests from two devices map to different controllers (different NUMA regions) they can proceed in parallel.
That's the long answer.
1 In fact, the command stream is lower level and more complex than things like "read" or "write" and involves concepts such as opening a memory page, streaming bytes from it, etc. What every programmer should know about memory serves as an excellent intro to the topic.
2 For example, imagine two requests for adjacent bytes in memory: it is possible the controller can combine them into a single request if they fit within the bus width.
3 Of course if you are competing for memory across several devices, the overall impact may be very noticeable: a reduction in per-device bandwidth and an increase in latency, but what I mean is that the sharing is fine-grained enough that you can't generally tell the difference between finely-sliced exclusive access and some hypothetical device which makes simultaneous progress on each request in each period.
4 The most common configuration on modern hardware is one memory controller per socket, so on a 2P system you'd usually have two controllers, also other rations (both higher and lower) are certainly possible.
There are dozens of things that come into play. E.g. on the lowest level there are bus arbitration mechanisms which allow that multiple participants can access a shared address and data bus.
On a higher level there are also things like CPU caches that need to be considered: If a CPU reads from memory it might only read from it's local cache, which might not reflect that state that exists in another CPU cores local cache. To synchronize memory between cache instances in multicore systems there exist cache coherence protocols which are are implemented in the CPUs. These have to guarantee that if one CPU writes to shared memory the caches of all other CPUs (which might also contain a copy of the memory locations content) get updated.

Maximum data a GPU can take?

I have a large dataset, say, 5 GB and I am doing stream-wise processing on the data, now, I need to figure out on how much data I can send to GPU at a time for processing, so that I can make utilization of GPU memory to the fullest.
Also, if my RAM is not sufficient to do processing/hold on 5 GB of data, what is the work-around for this?
A pipelined application might use 3 buffers on the GPU. One buffer is used to hold the data currently being transferred to the GPU (from the host), one buffer to hold the data currently being processed by the GPU, and one buffer to hold the data(results) currently being transferred from the GPU (to the host).
This implies that your application processing can be broken into "chunks". This is true for many applications that work on large data sets.
CUDA streams enable the developer to write code that allows these 3 operations (transfer to, process, transfer from) to run simultaneously.
There is no specific number that defines the size of the buffers in the above scenario. Certainly, a straightforward implementation would create 3 buffers, each of which is smaller than 1/3 of the total memory on the GPU, leaving some memory left over for overhead and other data that may need to live in GPU memory. So if your GPU has 5GB, you might be able to run with three 1GB buffers. But there is no tool like deviceQuery that will tell you this; it is not a property of the device.
You may want to read carefully the above linked programming guide section, as well as review the CUDA simple streams sample code.

How does shared memory vs message passing handle large data structures?

In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.
This approach obviously alleviates the need for complex locks because there is no shared state.
However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.
My questions:
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes, it states that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, then an implementation may very well not copy the data but just send a reference to it. Or may use a combination of both methods. As always, there is no best solution and there are trade-offs to be made when choosing how to do it.
The BEAM uses copying, except for large binaries where it sends a reference.
Yes, shared state could be faster in this case. But only if you can forgo the locks, and this is only doable if it's absolutely read-only. if it's 'mostly read-only' then you need a lock (unless you manage to write lock-free structures, be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.
Yes, you could write a 'server process' to share it. With really lightweight processes, it's no more heavy than writing a small API to access the data. Think like an object (in OOP sense) that 'owns' the data. Splitting the data in chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).
Even if NUMA is getting mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared-state/locks is getting more and more unfeasible.
in short.... get used to message passing and server processes, it's all the rage.
Edit: revisiting this answer, I want to add about a phrase found on Go's documentation:
share memory by communicating, don't communicate by sharing memory.
the idea is: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message with the reference, a thread only accesses the memory when receiving the message. It relies on some measure of programmer discipline; but results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.
the advantage is that you don't have to copy big blocks of data on every message, and don't have to effectively flush down caches as on some lock implementations. It's still somewhat early to say if the style leads to higher performance designs or not. (specially since current Go runtime is somewhat naive on thread scheduling)
In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.
In Go, message passing is by convention - there's nothing to prevent you sending someone a pointer over a channel, then modifying the data pointed to, only convention, so once again there's no need to copy the message.
Most modern processors use variants of the MESI protocol. Because of the shared state, Passing read-only data between different threads is very cheap. Modified shared data is very expensive though, because all other caches that store this cache line must invalidate it.
So if you have read-only data, it is very cheap to share it between threads instead of copying with messages. If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache friendly behavior of the shared data.
Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the things changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, but you can still update to a new version efficiently.
What is a large data structure?
One persons large is another persons small.
Last week I talked to two people - one person was making embedded devices he used the word
"large" - I asked him what it meant - he say over 256 KBytes - later in the same week a
guy was talking about media distribution - he used the word "large" I asked him what he
meant - he thought for a bit and said "won't fit on one machine" say 20-100 TBytes
In Erlang terms "large" could mean "won't fit into RAM" - so with 4 GBytes of RAM
data structures > 100 MBytes might be considered large - copying a 500 MBytes data structure
might be a problem. Copying small data structures (say < 10 MBytes) is never a problem in Erlang.
Really large data structures (i.e. ones that won't fit on one machine) have to be
copied and "striped" over several machines.
So I guess you have the following:
Small data structures are no problem - since they are small data processing times are
fast, copying is fast and so on (just because they are small)
Big data structures are a problem - because they don't fit on one machine - so copying is essential.
Note that your questions are technically non-sensical because message passing can use shared state so I shall assume that you mean message passing with deep copying to avoid shared state (as Erlang currently does).
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
Using shared state will be a lot faster.
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Either approach can be used.
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.
Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.
One solution that has not been presented here is master-slave replication. If you have a large data-structure, you can replicate changes to it out to all slaves that perform the update on their copy.
This is especially interesting if one wants to scale to several machines that don't even have the possibility to share memory without very artificial setups (mmap of a block device that read/write from a remote computer's memory?)
A variant of it is to have a transaction manager that one ask nicely to update the replicated data structure, and it will make sure that it serves one and only update-request concurrently. This is more of the mnesia model for master-master replication of mnesia table-data, which qualify as "large data structure".
The problem at the moment is indeed that the locking and cache-line coherency might be as expensive as copying a simpler data structure (e.g. a few hundred bytes).
Most of the time a clever written new multi-threaded algorithm that tries to eliminate most of the locking will always be faster - and a lot faster with modern lock-free data structures. Especially when you have well designed cache systems like Sun's Niagara chip level multi-threading.
If your system/problem is not easily broken down into a few and simple data accesses then you have a problem. And not all problems can be solved by message passing. This is why there are still some Itanium based super computers sold because they have terabyte of shared RAM and up to 128 CPU's working on the same shared memory. They are an order of magnitude more expensive then a mainstream x86 cluster with the same CPU power but you don't need to break down your data.
Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading. Message passing and the shared nothing approach makes it even more maintainable.
As an example, Erlang was never designed to make things faster but instead use a large number of threads to structure complex data and event flows.
I guess this was one of the main points in the design. In the web world of google you usually don't care about performance - as long as it can run in parallel in the cloud. And with message passing you ideally can just add more computers without changing the source code.
Usually message passing languages (this is especially easy in erlang, since it has immutable variables) optimise away the actual data copying between the processes (of course local processes only: you'll want to think your network distribution pattern wisely), so this isn't much an issue.
The other concurrent paradigm is STM, software transactional memory. Clojure's ref's are getting a lot of attention. Tim Bray has a good series exploring erlang and clojure's concurrent mechanisms
http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next
http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses

Resources