Constant memory dumps

Constant memory dumps - memory

Is it possible to constantly dump the memory of a process to record every change that is happening? For example if I have a program that modifies the contents of an array I'd like to know the contents of that array before some modification. I imagine a program could save the initial memory and then all changes in a file and I'd just search the file by the modified contents of the array which I know. Then I'd look for changes in that specific memory location before that moment and find the initial contents.
Does a program like that exist? If so, what program would you recommend?
EDIT: I wrote a program in C++ that captures packets of another process using pcap and I would like to know how these packets are constructed inside that program. I'm using Windows.

Notice that memory content is (or may be) changing a lot faster than what a disk is capable of writing.
Also, your question is OS specific. I guess that you are using Linux.
In all cases, design your application very early with your goals.
Perhaps you are looking for application checkpointing. If on Linux, consider BLCR.
Perhaps you are looking for some persistence mechanism. A possible way might be to explicitly persist the state of your application at some points in your program, which are executed frequently. Persistence of the call stack or of continuations is a difficult issue
You may want to use textual formats (like JSON) for serialization. You could be interested in database technology, either relational-SQL (e.g. Sqlite or PostGreSQL) or noSQL mongodb
Persistence and checkpointing may be related to garbage collection algorithms (notably copying GC).
Some language implementations are able to persist their entire heap. For example, in Common Lisp, the SBCL implementation offers save-lisp-and-die
For debugging, you might want watchpoints, or the gcore(1) command.
Notice that if you fork(2) your process and sleep or idle immediately the child process you are keeping in that child process a snapshot of your address space.
Read also about transactional memory & ACID properties

Related

Mapping and allocating

I am little confused with term mapping, for example, when we say mapping memory for database, it means that we assigning specific amount of memory at some memory location to that database?
Also is allocating memory synonym for reserving memory?
Very often I encounter these two terms, and they aren't so clear to me.
If someone can clarify these two terms, I will be very thankful.

This might be a question better asked to the software community at stackoverflow. However, I am a CS.
I would say that terms aren't always used accurately and precisely.
In general allocating memory is making memory available to a program for an active purpose, such as allocating memory for buffers to hold a file or in in-memory structure now.
Reserving memory is often used to mean the same thing. However, it is sometimes more passive. For example reserving memory in case their is a future requirement, or protecting against too much memory allocation for a different purpose.
Often when the term 'mapping' is used, it is for a file. It may mean exactly the same as allocating. Or it means more; mapping may be using an underlying mechanism provided by virtual memory management systems, where part of virtual memory is 'mapped' to the file, without actually reading the file into physical memory. The trick is, as the memory-mapped file is accessed, the block/page being accessed is read in 'invisibly' to the process when necessary. This uses a mechanism called demand paging. It's benefit is a program can access the file as if it is all read into memory, but only the parts actually accessed are retrieved from the persistent storage system (disk, flash, whatever), which can be a huge win if only small parts of the file are needed.
Further, it simplifies the program, which can be written as if the whole file is in memory. Instead of the application developer trying to keep track of which parts of the file have been loaded into memory, the operating system does that instead.
Even better, the Operating system can be asked to track which blocks/pages have their contents changed, and it can be asked to periodically write that back out to persistent storage. This can even further simplify the application program.
This is popular with some databases.

Mapping basically means assigning. Except we often want a 1 to 1 mapping in the case of functions. If you define the function of an object, physical or just logical, and define it's relationships and how it changes under transformation then you have mapped it.

Does Erlang always copy messages between processes on the same node?

A faithful implementation of the actor message-passing semantics means that message contents are deep-copied from a logical point-of-view, even for immutable types. Deep-copying of message contents remains a bottleneck for implementations the actor model, so for performance some implementations support zero-copy message passing (although it's still deep-copy from the programmer's point-of-view).
Is zero-copy message-passing implemented at all in Erlang? Between nodes it obviously can't be implemented as such, but what about between processes on the same node? This question is related.

I don't think your assertion is correct at all - deep copying of inter-process messages isn't a bottleneck in Erlang, and with the default VM build/settings, this is exactly what all Erlang systems are doing.
Erlang process heaps are completely separate from each other, and the message queue is located in the process heap, so messages must be copied. This is also true for transferring data into and out of ETS tables as their data is stored in a separate allocation area from process heaps.
There are a number of shared datastructures however. Large binaries (>64 bytes long) are generally allocated in a node-wide area and are reference counted. Erlang processes just store references to these binaries. This means that if you create a large binary and send it to another process, you're only sending the reference.
Sending data between processes is actually worse in terms of allocation size than you might imagine - sharing inside a term isn't preserved during the copy. This means that if you carefully construct a term with sharing to reduce memory consumption, it will expand to its unshared size in the other process. You can see a practical example in the OTP Efficiency Guide.
As Nikolaus Gradwohl pointed out, there was an experimental hybrid heap mode for the VM which did allow term sharing between processes and enabled zero-copy message passing. It hasn't been a particularly promising experiment as I understand it - it requires extra locking and complicates the existing ability of processes to independently garbage collect. So not only is copying inter-process messages not the usual bottleneck in Erlang systems, allowing it actually reduced performance.

AFAIK there was/is experimental support for zero-copy message-passing in erlang using the -shared or -hybrid modell. I read a blog post in 2009 claiming that it's broken on smp machines, but I have no idea about the current status

As has been mentioned here and in other questions current versions of Erlang basically copy everything except for larger binaries. In older pre-SMP times it was feasible to not copy but pass references. While this resulted in very fast message passing it created other problems in the implementation, primarily it made garbage collection more difficult and complicated implementation. I think that today passing references and having shared data could result in excessive locking and synchronisation which is, of course, not a Good Thing.

I wrote the accepted answer to that other question you're referencing, and in it I give you a direct pointer to this line of code:
message = copy_struct(message, msize, &hp, &bp->off_heap);
This is in a function called when the Erlang run-time system needs to send a message, and it's not inside any kind of "if" that could cause it to be skipped. So, as far as I can tell, the answer is "yes, it's always copied." (That's not strictly true -- there is an "if", but it seems to be dealing with exceptional cases, not the normal code-flow path.)
(I'm ignoring the hybrid heap option brought up by Nikolaus. It looks like he's right, but since this isn't the way Erlang is normally built and it has its own penalties, I don't see that it's worth considering as a way to answer your concern.)
I don't know why you're considering 10 GByte/sec a bottleneck, though. Nothing short of registers or CPU cache goes faster in the computer, and such memories are small, thus constituting a kind of bottleneck themselves. Besides which, the zero-copy idea you're proposing would require locking in the case of cross-CPU message passing in a multi-core system, which is also a bottleneck. We're already paying the locking penalty once in this function to copy the message into the other process's message queue; why pay it again later when that process gets around to reading the message?
Bottom line, I don't think your ideas of ways to make it go faster would actually help much.

How to get the root cause of a memory corruption in a embedded environment?

I have detected a memory corruption in my embedded environment (my program is running on a set top box with a proprietary OS ). but I couldn't get the root cause of it.
the memory corruption , itself, is detected after a stress test of launching and exiting an application multiple times. giving that I couldn't set a memory break point because the corruptued variable is changing it's address every time that the application is launched, is there any idea to catch the root cause of this corruption?
(A memory break point is break point launched when the environment change the value of a giving memory address)
note also that all my software is developed using C language.
Thanks for your help.

These are always difficult problems on embedded systems and there is no easy answer. Some tips:
Look at the value the memory gets corrupted with. This can give a clear hint.
Look at datastructures next to your memory corruption.
See if there is a pattern in the memory corruption. Is it always at a similar address?
See if you can set up the memory breakpoint at run-time.
Does the embedded system allow memory areas to be sandboxed? Set-up sandboxes to safeguard your data memory.
Good luck!

Where is the data stored and how is it accessed by the two processes involved?
If the structure was allocated off the heap, try allocating a much larger block and putting large guard areas before and after the structure. This should give you an idea of whether it is one of the surrounding heap allocations which has overrun into the same allocation as your structure. If you find that the memory surrounding your structure is untouched, and only the structure itself is corrupted then this indicates that the corruption is being caused by something which has some knowledge of your structure's location rather than a random memory stomp.
If the structure is in a data section, check your linker map output to determine what other data exists in the vicinity of your structure. Check whether those have also been corrupted, introduce guard areas, and check whether the problem follows the structure if you force it to move to a different location. Again this indicates whether the corruption is caused by something with knowledge of your structure's location.
You can also test this by switching data from the heap into a data section or visa versa.
If you find that the structure is no longer corrupted after moving it elsewhere or introducing guard areas, you should check the linker map or track the heap to determine what other data is in the vicinity, and check accesses to those areas for buffer overflows.
You may find, though, that the problem does follow the structure wherever it is located. If this is the case then audit all of the code surrounding references to the structure. Check the contents before and after every access.
To check whether the corruption is being caused by another process or interrupt handler, add hooks to each task switch and before and after each ISR is called. The hook should check whether the contents have been corrupted. If they have, you will be able to identify which process or ISR was responsible.
If the structure is ever read onto a local process stack, try increasing the process stack and check that no array overruns etc have occurred. Even if not read onto the stack, it's likely that you will have a pointer to it on the stack at some point. Check all sub-functions called in the vicinity for stack issues or similar that could result in the pointer being used erroneously by unrelated blocks of code.
Also consider whether the compiler or RTOS may be at fault. Try turning off compiler optimisation, and failing that inspect the code generated. Similarly consider whether it could be due to a faulty context switch in your proprietary RTOS.
Finally, if you are sharing the memory with another hardware device or CPU and you have data cache enabled, make sure you take care of this through using uncached accesses or similar strategies.

Yes these problems can be tough to track down with a debugger.
A few ideas:
Do regular code reviews (not fast at tracking down a specific bug, but valuable for catching such problems in general)
Comment-out or #if 0 out sections of code, then run the cut-down application. Try commenting-out different sections to try to narrow down in which section of the code the bug occurs.
If your architecture allows you to easily disable certain processes/tasks from running, by the process of elimination perhaps you can narrow down which process is causing the bug.
If your OS is a cooperative multitasking e.g. round robin (this would be too hard I think for preemptive multitasking): Add code to the end of the task that "owns" the structure, to save a "check" of the structure. That check could be a memcpy (if you have the time and space), or a CRC. Then after every other task runs, add some code to verify the structure compared to the saved check. This will detect any changes.

I'm assuming by your question you mean that you suspect some part of the proprietary code is causing the problem.
I have dealt with a similar issue in the past using what a colleague so tastefully calls a "suicide note". I would allocate a buffer capable of storing a number of copies of the structure that is being corrupted. I would use this buffer like a circular list, storing a copy of the current state of the structure at regular intervals. If corruption was detected, the "suicide note" would be dumped to a file or to serial output. This would give me a good picture of what was changed and how, and by increasing the logging frequency I was able to narrow down the corrupting action.
Depending on your OS, you may be able to react to detected corruption by looking at all running processes and seeing which ones are currently holding a semaphore (you are using some kind of access control mechanism with shared memory, right?). By taking snapshots of this data too, you perhaps can log the culprit grabbing the lock before corrupting your data. Along the same lines, try holding the lock to the shared memory region for an absurd length of time and see if the offending program complains. Sometimes they will give an error message that has important information that can help your investigation (for example, line numbers, function names, or code offsets for the offending program).
If you feel up to doing a little linker kung fu, you can most likely specify the address of any statically-allocated data with respect to the program's starting address. This might give you a consistent-enough memory address to set a memory breakpoint.
Unfortunately, this sort of problem is not easy to debug, especially if you don't have the source for one or more of the programs involved. If you can get enough information to understand just how your data is being corrupted, you may be able to adjust your structure to anticipate and expect the corruption (sometimes needed when working with code that doesn't fully comply with a specification or a standard).

You detect memory corruption. Could you be more specific how? Is it a crash with a core dump, for example?
Normally the OS will completely free all resources and handles your program has when the program exits, gracefully or otherwise. Even proprietary OSes manage to get this right, although its not a given.
So an intermittent problem could seem to be triggered after stress but just be chance, or could be in the initialisation of drivers or other processes the program communicates with, or could be bad error handling around say memory allocations that fail when the OS itself is under stress e.g. lazy tidying up of the closed programs.
Printfs in custom malloc/realloc/free proxy functions, or even an Electric Fence -style custom allocator might help if its as simple as a buffer overflow.

Use memory-allocation debugging tools like ElectricFence, dmalloc, etc - at minimum they can catch simple errors and most moderately-complex ones (overruns, underruns, even in some cases write (or read) after free), etc. My personal favorite is dmalloc.

A proprietary OS might limit your options a bit. One thing you might be able to do is run the problem code on a desktop machine (assuming you can stub out the hardware-specific code), and use the more-sophisticated tools available there (i.e. guardmalloc, electric fence).
The C library that you're using may include some routines for detecting heap corruption (glibc does, for instance). Turn those on, along with whatever tracing facilities you have, so you can see what was happening when the heap was corrupted.

First I am assuming you are on a baremetal chip that isn't running Linux or some other POSIX-capable OS (if you are there are much better techniques such as Valgrind and ASan).
Here's a couple tips for tracking down embedded memory corruption:
Use JTAG or similar to set a memory watchpoint on the area of memory that is being corrupted, you might be able to catch the moment when memory being is accidentally being written there vs a correct write, many JTAG debuggers include plugins for IDEs that allow you to get stack traces as well
In your hard fault handler try to generate a call stack that you can print so you can get a rough idea of where the code is crashing, note that since memory corruption can occur some time before the crash actually occurs the stack traces you get are unlikely to be helpful now but with better techniques mentioned below the stack traces will help, generating a backtrace on baremetal can be a very difficult task though, if you so happen to be using a Cortex-M line processor check this out https://github.com/armink/CmBacktrace or try searching the web for advice on generating a back/stack trace for your particular chip
If your compiler supports it use stack canaries to detect and immediately crash if something writes over the stack, for details search the web for "Stack Protector" for GCC or Clang
If you are running on a chip that has an MPU such as an ARM Cortex-M3 then you can use the MPU to write-protect the region of memory that is being corrupted or a small region of memory right before the region being corrupted, this will cause the chip to crash at the moment of the corruption rather than much later

Data Storage with AWE Memory via collections / lists / other containers

Does anyone have any suggestions (product, toolsets, methods or other) for the storage and processing of custom data (delphi collections, binary trees, DIContainers etc) that DOES NOT restrict itself to a standard win32 memory address space? To put that in the extreme, is there anything off the shelf that can do the equivalent of holding a 10GB TList, thereby blowing the /3GB switch barrier and the 4GB 'windows on windows' limit?
What we ideally need is something that is pretty transparent to the Delphi application programmer, but allows very fast access to the data held in its structures, preferably via key lookup. The equivalent of a delphi colletion container would be fine, but its memory usage needs to be via AWE. It would also need to take care of mapping and unmapping the physical space it uses into the win32 process making use of it i.e. that would be the transaprent bit...
Moving the data into a database is not the answer - the information needs to remain memory resident for very fast access. The in-memory databases/tables that we've tried do not make use of AWE and also are slow at accessing. Our current Delphi data structures are fine, but straining the limits of win32 address space.

I'm going to be a complete dork, and tell you that I've made something even more advanced than what you're describing.... at work. So it's all closed source I'm afraid. Never saw anything like this anywhere. We combine VM, AWE, MMF and (soon) 32<>64 bit IPC into one big, mean data-processing machine, addressing up to 64 GB of memory, while processing hundreds of datasets, tens of GBs each.
But I can give you a few tips : AWE view-swapping is rather slow, because it forcibly pauses all running threads during the swap. Therefor, choose your window-sizes wisely (the smaller, the faster the swap - but call-overhead is lower with larger sizes ofcourse). We've settled with AWE view-sizes equal to the Windows default page-size (4 KB), but only because random-access performs best that way. Lineair data-access could run faster with bigger view-sizes.
Each view can map to any part of the allocated AWE memory, so one thing that can help is mapping only those pages into a view that need to be accessed - and try to save on unnessecary view-swaps (a priority-queue comes to mind).
Also, there should be a registration-mechanism somewhere in your design that handles the linkage between a view and the AWE memory behind this. And this better be thread-safe!
As for general usage : No, this doesn't fit in with regular Delphi classes. You should switch over to another concept altogether - and base your data-structures on that.
Anyway, good luck mate! You're going to need it... ;-)

There are system calls that can do this but it is not supported on all versions of Windows (in particular, Windows XP does not support AWE).
Transparency would be something of an issue as the API could not return pointers to objects. Mapping more than 4GB of RAM into a 4GB address space means that a 32 bit pointer could be ambiguous - you could potentially map different objects into the same location.
This ambiguity means that you would have to generate proxies for the objects which hold a handle that could be used to access the 'record'. Some SQL server versions use this technique to store disk buffers in AWE memory. An approach like this would probably work for something like rows in a matrix where the operations are done on the whole row. Finer grained access would be more fiddly.
In order to provide direct access to the mapped object you would have to implement a protocol where a temporary pointer to the mapped memory was made available. This would also require the object to be locked in memory while in use - again, bang goes your transparency.
Assuming you can get a 64 bit version of Delphi now you might be better off going to a 64 bit version of Windows for customers that need more RAM.

You state that you do not want to move to a database, but what about a database that specifically uses AWE?
I've not tried it personally, but would consider using products from this company for my own projects.
[Edit]: NexusDB is Delphi-friendly: it originated from the old Turbopower FlashFiler development (but has moved on a long way since then).

The issue with AWE it works very much alike the old, DOS-based EMS and XMS - if you ever used them. Basically, a range of addressable memory is reserved, and the memory outside the addressable range is then mapped to the addressable range when needed, and unmapped when no longer need, allowing other memory to be mapped at the same addresses. Thereby most non-AWE aware data structures or containers wouldn't work in such a scenario - probably a TMemoryStream descendant is easier to build. It should be easy enough to build a TList or the like that store data in AWE memory, it should keep track where the data are really stored and recall them when needed, adjusting addresses as well when data are mapped to addressable memory. I am not aware of any Delphi containers library using AWE, and there is another issue: desktop 32 bit operating systems can't use more than 4GB of physical RAM, a server version would be required, and the supported physical RAM depends on what version is used, see here for a complete list.

Assuming the data is loaded once in bulk and fits available memory, NexusDB AWE will be very very fast. The database can be created as an in-memory only DB and will then not need any further harddrive access while manipulating.

Sounds to me like you guys might consider dropping the current database SQL backend and going to a 100% NexusDB + AWE solution.
(Or rather, dropping the day to day access to the SQL backend, and having an export/sync function that can write out any required NexusDB reporting data to an MSSQL reporting db.)
W

Your situation sounds similar to ours, our application uses a huge datafile that we store in a memory-mapped file. The files are around 750MB, and we allocate data structures from them that use up to 1.5GB of RAM.
We have found no solution to the 4GB limit other than moving some of it off to FPC/Lazarus until Delphi is 64-bit, unfortunately. AWE does not work with Vista Home versions, also we couldn't get it to work with MMFs.
You could try memory-mapped files with a sliding window, meaning you dynamically create views of different chunks of the file depending on what part of it the application is using. Sounds like that won't work though because you need the entire file in memory at once.

How does shared memory vs message passing handle large data structures?

In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.
This approach obviously alleviates the need for complex locks because there is no shared state.
However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.
My questions:
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?

One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes, it states that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, then an implementation may very well not copy the data but just send a reference to it. Or may use a combination of both methods. As always, there is no best solution and there are trade-offs to be made when choosing how to do it.
The BEAM uses copying, except for large binaries where it sends a reference.

Yes, shared state could be faster in this case. But only if you can forgo the locks, and this is only doable if it's absolutely read-only. if it's 'mostly read-only' then you need a lock (unless you manage to write lock-free structures, be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.
Yes, you could write a 'server process' to share it. With really lightweight processes, it's no more heavy than writing a small API to access the data. Think like an object (in OOP sense) that 'owns' the data. Splitting the data in chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).
Even if NUMA is getting mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared-state/locks is getting more and more unfeasible.
in short.... get used to message passing and server processes, it's all the rage.
Edit: revisiting this answer, I want to add about a phrase found on Go's documentation:
share memory by communicating, don't communicate by sharing memory.
the idea is: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message with the reference, a thread only accesses the memory when receiving the message. It relies on some measure of programmer discipline; but results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.
the advantage is that you don't have to copy big blocks of data on every message, and don't have to effectively flush down caches as on some lock implementations. It's still somewhat early to say if the style leads to higher performance designs or not. (specially since current Go runtime is somewhat naive on thread scheduling)

In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.
In Go, message passing is by convention - there's nothing to prevent you sending someone a pointer over a channel, then modifying the data pointed to, only convention, so once again there's no need to copy the message.

Most modern processors use variants of the MESI protocol. Because of the shared state, Passing read-only data between different threads is very cheap. Modified shared data is very expensive though, because all other caches that store this cache line must invalidate it.
So if you have read-only data, it is very cheap to share it between threads instead of copying with messages. If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache friendly behavior of the shared data.
Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the things changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, but you can still update to a new version efficiently.

What is a large data structure?
One persons large is another persons small.
Last week I talked to two people - one person was making embedded devices he used the word
"large" - I asked him what it meant - he say over 256 KBytes - later in the same week a
guy was talking about media distribution - he used the word "large" I asked him what he
meant - he thought for a bit and said "won't fit on one machine" say 20-100 TBytes
In Erlang terms "large" could mean "won't fit into RAM" - so with 4 GBytes of RAM
data structures > 100 MBytes might be considered large - copying a 500 MBytes data structure
might be a problem. Copying small data structures (say < 10 MBytes) is never a problem in Erlang.
Really large data structures (i.e. ones that won't fit on one machine) have to be
copied and "striped" over several machines.
So I guess you have the following:
Small data structures are no problem - since they are small data processing times are
fast, copying is fast and so on (just because they are small)
Big data structures are a problem - because they don't fit on one machine - so copying is essential.

Note that your questions are technically non-sensical because message passing can use shared state so I shall assume that you mean message passing with deep copying to avoid shared state (as Erlang currently does).
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
Using shared state will be a lot faster.
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Either approach can be used.
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.
Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.

One solution that has not been presented here is master-slave replication. If you have a large data-structure, you can replicate changes to it out to all slaves that perform the update on their copy.
This is especially interesting if one wants to scale to several machines that don't even have the possibility to share memory without very artificial setups (mmap of a block device that read/write from a remote computer's memory?)
A variant of it is to have a transaction manager that one ask nicely to update the replicated data structure, and it will make sure that it serves one and only update-request concurrently. This is more of the mnesia model for master-master replication of mnesia table-data, which qualify as "large data structure".

The problem at the moment is indeed that the locking and cache-line coherency might be as expensive as copying a simpler data structure (e.g. a few hundred bytes).
Most of the time a clever written new multi-threaded algorithm that tries to eliminate most of the locking will always be faster - and a lot faster with modern lock-free data structures. Especially when you have well designed cache systems like Sun's Niagara chip level multi-threading.
If your system/problem is not easily broken down into a few and simple data accesses then you have a problem. And not all problems can be solved by message passing. This is why there are still some Itanium based super computers sold because they have terabyte of shared RAM and up to 128 CPU's working on the same shared memory. They are an order of magnitude more expensive then a mainstream x86 cluster with the same CPU power but you don't need to break down your data.
Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading. Message passing and the shared nothing approach makes it even more maintainable.
As an example, Erlang was never designed to make things faster but instead use a large number of threads to structure complex data and event flows.
I guess this was one of the main points in the design. In the web world of google you usually don't care about performance - as long as it can run in parallel in the cloud. And with message passing you ideally can just add more computers without changing the source code.

Usually message passing languages (this is especially easy in erlang, since it has immutable variables) optimise away the actual data copying between the processes (of course local processes only: you'll want to think your network distribution pattern wisely), so this isn't much an issue.

The other concurrent paradigm is STM, software transactional memory. Clojure's ref's are getting a lot of attention. Tim Bray has a good series exploring erlang and clojure's concurrent mechanisms
http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next
http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart