If two threads try to write to the same address at the same time, is the value after the concurrent write guaranteed to be one of the values that the threads tried to write? Or is it possible to get a combination of the bits?
Also, is it possible for another thread to read the memory address while the bits are in an unstable state?
I guess what the question boils down to is if a read or write to a single memory address is atomic at the hardware level.
I think this all depends on the "memory model" for your particular programming language or system.
These questions go to the fundamentals of a system's and/or programming language's memory model. Pick your OS and programming language, read the specs, and you'll see.
In some cases the results may be just as unpredictable when the two threads are writing to different memory addresses - in particular think about C bitfield structures, as well as compiler optimisations when writing to adjacent addresses.
If you fancy a read, Boehm's paper "Threads cannot be implemented as a library" covers this and other quirks of concurrency.
One thing is for sure: a properly aligned datatype no larger than a CPU register can never have its bits in an unstable state; it will hold one or the other of the two values.
On a multi-processor computer, there may not be a single "value" that is read. The two threads, and the third one, may see inconsistent values. You'd need a memory barrier to ensure every thread sees the same value at this address.
Apart from that, writes are generally atomic, so it would be either one or the other of the values that have been written (or were there in the first place) that are read. You're not talking about an Alpha processor, are you?
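To make this concrete, here is a minimal Go sketch (illustrative only, not tied to any particular system above): instead of relying on plain word-sized writes happening to be atomic, explicit atomic operations give both the either-one-value-or-the-other guarantee and the cross-core visibility that otherwise needs a memory barrier.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var shared atomic.Uint32 // a hypothetical word that two goroutines race to write

	var wg sync.WaitGroup
	for _, v := range []uint32{0xAAAAAAAA, 0x55555555} {
		wg.Add(1)
		go func(val uint32) {
			defer wg.Done()
			shared.Store(val) // atomic: readers see one value or the other, never a blend of bits
		}(v)
	}
	wg.Wait()

	// The final value is guaranteed to be exactly one of the two written values.
	fmt.Printf("final value: %#x\n", shared.Load())
}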
Does AArch64 support unaligned access natively? I am asking because currently ocamlopt assumes "no".
Provided the hardware bit for strict alignment checking is not turned on (which, as on x86, no general-purpose OS is realistically going to do), AArch64 does permit unaligned data accesses to Normal (not Device) memory with the regular load/store instructions.
However, there are several reasons why a compiler would still want to maintain aligned data:
Atomicity of reads and writes: naturally-aligned loads and stores are guaranteed to be atomic, i.e. if one thread reads an aligned memory location simultaneously with another thread writing the same location, the read will only ever return the old value or the new value. That guarantee does not apply if the location is not aligned to the access size - in that case the read could return some unknown mixture of the two values. If the language has a concurrency model which relies on that not happening, it's probably not going to allow unaligned data.
Atomic read-modify-write operations: If the language has a concurrency model in which some or all data types can be updated (not just read or written) atomically, then for those operations the code generation will involve using the load-exclusive/store-exclusive instructions to build up atomic read-modify-write sequences, rather than plain loads/stores. The exclusive instructions will always fault if the address is not aligned to the access size.
Efficiency: On most cores, an unaligned access at best still takes at least 1 cycle longer than a properly-aligned one. In the worst case, a single unaligned access can cross a cache line boundary (which has additional overhead in itself), and generate two cache misses or even two consecutive page faults. Unless you're in an incredibly memory-constrained environment, or have no control over the data layout (e.g. pulling packets out of a network receive buffer), unaligned data is still best avoided.
Necessity: If the language has a suitable data model, i.e. no pointers, and any data from external sources is already marshalled into appropriate datatypes at a lower level, then there's really no need for unaligned accesses anyway, and it makes the compiler's life that much easier to simply ignore the idea altogether.
I have no idea what concerns OCaml in particular, but I certainly wouldn't be surprised if it were "all of the above".
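To illustrate the alignment point in code (a Go sketch; the helper name is made up): Go's own sync/atomic documentation requires 64-bit alignment for 64-bit atomic accesses on 32-bit platforms, and a misaligned access would lose the atomicity guarantee described above.

package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// alignedLoad64 is an illustrative helper: it refuses to perform a 64-bit
// atomic load through a pointer that is not 8-byte aligned, since the
// atomicity guarantee only holds for naturally-aligned accesses.
func alignedLoad64(p *uint64) uint64 {
	if uintptr(unsafe.Pointer(p))%8 != 0 {
		panic("pointer not 8-byte aligned: load would not be guaranteed atomic")
	}
	return atomic.LoadUint64(p)
}

func main() {
	x := new(uint64) // heap allocations of this size are 8-byte aligned
	*x = 42
	fmt.Println(alignedLoad64(x))
}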
A faithful implementation of the actor message-passing semantics means that message contents are deep-copied from a logical point of view, even for immutable types. Deep copying of message contents remains a bottleneck for implementations of the actor model, so for performance some implementations support zero-copy message passing (although it's still a deep copy from the programmer's point of view).
Is zero-copy message-passing implemented at all in Erlang? Between nodes it obviously can't be implemented as such, but what about between processes on the same node? This question is related.
I don't think your assertion is correct at all - deep copying of inter-process messages isn't a bottleneck in Erlang, and with the default VM build/settings, this is exactly what all Erlang systems are doing.
Erlang process heaps are completely separate from each other, and the message queue is located in the process heap, so messages must be copied. This is also true for transferring data into and out of ETS tables as their data is stored in a separate allocation area from process heaps.
There are a number of shared datastructures however. Large binaries (>64 bytes long) are generally allocated in a node-wide area and are reference counted. Erlang processes just store references to these binaries. This means that if you create a large binary and send it to another process, you're only sending the reference.
Sending data between processes is actually worse in terms of allocation size than you might imagine - sharing inside a term isn't preserved during the copy. This means that if you carefully construct a term with sharing to reduce memory consumption, it will expand to its unshared size in the other process. You can see a practical example in the OTP Efficiency Guide.
As Nikolaus Gradwohl pointed out, there was an experimental hybrid heap mode for the VM which did allow term sharing between processes and enabled zero-copy message passing. It hasn't been a particularly promising experiment as I understand it - it requires extra locking and complicates the existing ability of processes to garbage collect independently. So not only is copying inter-process messages not the usual bottleneck in Erlang systems; avoiding the copy actually reduced performance.
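A rough analogy in Go terms (this is not how the BEAM works internally, just an illustration of reference passing versus deep copying): sending a slice over a channel copies only the small slice header, much like Erlang sending a reference to a refcounted large binary, while a deep copy costs time and memory proportional to the payload.

package main

import "fmt"

func main() {
	payload := make([]byte, 1<<20) // a 1 MiB "binary"

	byRef := make(chan []byte, 1)
	byRef <- payload // only the slice header is copied; the 1 MiB backing array is shared

	byCopy := make(chan []byte, 1)
	copied := append([]byte(nil), payload...) // explicit deep copy, proportional to payload size
	byCopy <- copied

	fmt.Println(len(<-byRef), len(<-byCopy))
}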
AFAIK there was/is experimental support for zero-copy message passing in Erlang using the -shared or -hybrid model. I read a blog post in 2009 claiming that it's broken on SMP machines, but I have no idea about the current status.
As has been mentioned here and in other questions, current versions of Erlang basically copy everything except for larger binaries. In older pre-SMP times it was feasible not to copy but to pass references. While this resulted in very fast message passing, it created other problems in the implementation, primarily that it made garbage collection more difficult and complicated the implementation. I think that today passing references and having shared data could result in excessive locking and synchronisation, which is, of course, not a Good Thing.
I wrote the accepted answer to that other question you're referencing, and in it I give you a direct pointer to this line of code:
message = copy_struct(message, msize, &hp, &bp->off_heap);
This is in a function called when the Erlang run-time system needs to send a message, and it's not inside any kind of "if" that could cause it to be skipped. So, as far as I can tell, the answer is "yes, it's always copied." (That's not strictly true -- there is an "if", but it seems to be dealing with exceptional cases, not the normal code-flow path.)
(I'm ignoring the hybrid heap option brought up by Nikolaus. It looks like he's right, but since this isn't the way Erlang is normally built and it has its own penalties, I don't see that it's worth considering as a way to answer your concern.)
I don't know why you're considering 10 GByte/sec a bottleneck, though. Nothing short of registers or CPU cache goes faster in the computer, and such memories are small, thus constituting a kind of bottleneck themselves. Besides which, the zero-copy idea you're proposing would require locking in the case of cross-CPU message passing in a multi-core system, which is also a bottleneck. We're already paying the locking penalty once in this function to copy the message into the other process's message queue; why pay it again later when that process gets around to reading the message?
Bottom line, I don't think your ideas of ways to make it go faster would actually help much.
In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.
This approach obviously alleviates the need for complex locks because there is no shared state.
However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.
My questions:
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes; it states that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, an implementation may very well not copy the data but just send a reference to it, or may use a combination of both methods. As always, there is no best solution, and there are trade-offs to be made when choosing how to do it.
The BEAM uses copying, except for large binaries where it sends a reference.
Yes, shared state could be faster in this case. But only if you can forgo the locks, and this is only doable if it's absolutely read-only. If it's 'mostly read-only' then you need a lock (unless you manage to write lock-free structures; be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.
Yes, you could write a 'server process' to share it. With really lightweight processes, it's no heavier than writing a small API to access the data. Think of it as an object (in the OOP sense) that 'owns' the data. Splitting the data into chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).
Even though NUMA is going mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from the cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared state with locks is becoming less and less feasible.
In short: get used to message passing and server processes; it's all the rage.
Edit: revisiting this answer, I want to add a note about a phrase found in Go's documentation:
share memory by communicating, don't communicate by sharing memory.
The idea is: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message with the reference; a thread only accesses the memory when receiving the message. It relies on some measure of programmer discipline, but results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.
The advantage is that you don't have to copy big blocks of data on every message, and don't have to effectively flush caches as with some lock implementations. It's still somewhat early to say whether the style leads to higher-performance designs or not (especially since the current Go runtime is somewhat naive about thread scheduling).
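Here is a minimal Go sketch of the 'server process that owns the data' idea (names are illustrative): one goroutine holds the read-only structure and answers queries over a channel, so no other goroutine ever touches it directly.

package main

import "fmt"

type query struct {
	key   string
	reply chan int
}

func server(data map[string]int, queries <-chan query) {
	for q := range queries {
		q.reply <- data[q.key] // only this goroutine ever reads `data`
	}
}

func main() {
	data := map[string]int{"foo": 1, "bar": 2}
	queries := make(chan query)
	go server(data, queries)

	reply := make(chan int)
	queries <- query{key: "bar", reply: reply}
	fmt.Println(<-reply) // 2
}

Chunking ('sharding') would just mean running several such server goroutines, each owning one slice of the data.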
In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.
In Go, message passing is by convention - there's nothing but convention to prevent you from sending someone a pointer over a channel and then modifying the data it points to - so once again there's no need to copy the message.
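A tiny sketch of that 'by convention' point (illustrative): nothing stops you from sending a pointer over a Go channel; the convention is simply that the sender stops touching the data once it has been sent, so neither a copy nor a lock is needed.

package main

import "fmt"

type job struct{ n int }

func main() {
	ch := make(chan *job, 1)

	j := &job{n: 41}
	ch <- j // hand over ownership of *j; by convention the sender no longer mutates it

	worker := <-ch
	worker.n++ // safe: the receiver is now the sole owner
	fmt.Println(worker.n)
}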
Most modern processors use variants of the MESI protocol. Because of its Shared state, passing read-only data between different threads is very cheap. Modified shared data is very expensive, though, because all other caches that store this cache line must invalidate it.
So if you have read-only data, it is very cheap to share it between threads instead of copying with messages. If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache friendly behavior of the shared data.
Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the things changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, but you can still update to a new version efficiently.
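A sketch of that idea in Go (illustrative; real persistent structures are usually trees rather than lists): an immutable singly linked list where an 'update' builds a new head node that shares the entire old tail, so new versions are cheap and existing readers keep seeing their version unchanged.

package main

import "fmt"

type node struct {
	value int
	next  *node // shared, never mutated after construction
}

func push(list *node, v int) *node {
	return &node{value: v, next: list} // new version; the old list is untouched
}

func main() {
	v1 := push(push(nil, 1), 2) // [2 1]
	v2 := push(v1, 3)           // [3 2 1], shares [2 1] with v1

	fmt.Println(v1.value, v2.value) // 2 3
}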
What is a large data structure?
One person's large is another person's small.
Last week I talked to two people. One person was making embedded devices and used the word "large" - I asked him what it meant - he said over 256 KBytes. Later in the same week a guy was talking about media distribution - he used the word "large" - I asked him what he meant - he thought for a bit and said "won't fit on one machine", say 20-100 TBytes.
In Erlang terms "large" could mean "won't fit into RAM" - so with 4 GBytes of RAM, data structures larger than 100 MBytes might be considered large, and copying a 500 MBytes data structure might be a problem. Copying small data structures (say < 10 MBytes) is never a problem in Erlang. Really large data structures (i.e. ones that won't fit on one machine) have to be copied and "striped" over several machines.
So I guess you have the following:
Small data structures are no problem - since they are small, data processing times are fast, copying is fast and so on (just because they are small).
Big data structures are a problem - because they don't fit on one machine - so copying is essential.
Note that your questions are technically nonsensical, because message passing can use shared state, so I shall assume you mean message passing with deep copying to avoid shared state (as Erlang currently does).
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
Using shared state will be a lot faster.
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Either approach can be used.
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.
Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.
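If you want to check the claim for your own workload, a rough Go benchmark along these lines (names and sizes are arbitrary; put it in a _test.go file and run go test -bench .) compares reading a shared read-only slice against deep-copying it first, as copy-based message passing would:

package sharedvscopy

import "testing"

var data = make([]int, 1<<20) // the shared, read-only "large structure"

func BenchmarkSharedRead(b *testing.B) {
	var sum int
	for i := 0; i < b.N; i++ {
		sum += data[i%len(data)] // read directly from the shared structure
	}
	_ = sum
}

func BenchmarkCopyThenRead(b *testing.B) {
	var sum int
	for i := 0; i < b.N; i++ {
		local := append([]int(nil), data...) // per-"message" deep copy
		sum += local[i%len(local)]
	}
	_ = sum
}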
One solution that has not been presented here is master-slave replication. If you have a large data structure, you can replicate changes to it out to all slaves, which perform the update on their own copy.
This is especially interesting if one wants to scale to several machines that don't even have the possibility to share memory without very artificial setups (mmap of a block device that read/write from a remote computer's memory?)
A variant of this is to have a transaction manager that you ask nicely to update the replicated data structure, and it makes sure that it serves one and only one update request at a time. This is more like the mnesia model for master-master replication of mnesia table data, which qualifies as a "large data structure".
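A minimal sketch of the master/replica idea in Go terms (all names are illustrative, and this says nothing about how mnesia implements it): the master is the single writer, serialises updates, and fans them out to replicas, each of which applies them to its own private copy.

package main

import (
	"fmt"
	"sync"
)

type update struct {
	key   string
	value int
}

func replica(id int, updates <-chan update, done *sync.WaitGroup) {
	defer done.Done()
	local := make(map[string]int) // this replica's private copy of the structure
	for u := range updates {
		local[u.key] = u.value
	}
	fmt.Printf("replica %d has %d entries\n", id, len(local))
}

func main() {
	const nReplicas = 3
	var done sync.WaitGroup
	channels := make([]chan update, nReplicas)
	for i := range channels {
		channels[i] = make(chan update, 16)
		done.Add(1)
		go replica(i, channels[i], &done)
	}

	// The master is the single writer: it orders updates and fans them out.
	for i := 0; i < 10; i++ {
		u := update{key: fmt.Sprintf("k%d", i), value: i}
		for _, ch := range channels {
			ch <- u
		}
	}
	for _, ch := range channels {
		close(ch)
	}
	done.Wait()
}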
The problem at the moment is indeed that the locking and cache-line coherency might be as expensive as copying a simpler data structure (e.g. a few hundred bytes).
Most of the time, a cleverly written multi-threaded algorithm that tries to eliminate most of the locking will be faster - and a lot faster with modern lock-free data structures. This is especially true when you have well-designed cache systems like Sun's Niagara chip-level multi-threading.
If your system/problem cannot easily be broken down into a few simple data accesses, then you have a problem. And not all problems can be solved by message passing. This is why there are still some Itanium-based supercomputers sold: they have terabytes of shared RAM and up to 128 CPUs working on the same shared memory. They are an order of magnitude more expensive than a mainstream x86 cluster with the same CPU power, but you don't need to break down your data.
Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading; message passing and the shared-nothing approach make them even more maintainable.
As an example, Erlang was never designed to make things faster, but instead to use a large number of threads to structure complex data and event flows.
I guess this was one of the main points in the design. In the web world of Google you usually don't care about performance - as long as it can run in parallel in the cloud. And with message passing you can ideally just add more computers without changing the source code.
Usually, message-passing languages (this is especially easy in Erlang, since it has immutable variables) optimise away the actual data copying between processes (for local processes only, of course: you'll want to think through your network distribution pattern wisely), so this isn't much of an issue.
The other concurrency paradigm is STM, software transactional memory. Clojure's refs are getting a lot of attention. Tim Bray has a good series exploring Erlang's and Clojure's concurrency mechanisms:
http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next
http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses
Are 'boolean' variables thread-safe for reading and writing from any thread? I've seen some newsgroup references to say that they are. Are any other data types available? (Enumerated types, short ints perhaps?)
It would be nice to have a list of all data types that can be safely read from any thread and another list that can also be safely written to in any thread without having to resort to various synchronization methods.
Please note that you can make essentially everything in Delphi thread-unsafe. While others mention alignment problems with booleans, that in a way hides the real problem.
Yes, you can read a boolean in any thread and write to a boolean in any thread if it's correctly aligned. But reading from a boolean you change is not necessarily "thread safe" anyway. Say you have a boolean that you set to true when you've updated a number, so that another thread knows to read the number:
if NumberUpdated then
begin
  LocalNumber := TheNumber;
end;
Due to optimizations the processor makes, TheNumber may be read before NumberUpdated is read; thus you may get the old value of TheNumber even though you updated NumberUpdated last.
That is, your code may effectively become:
temp := TheNumber;
if NumberUpdated then
begin
  LocalNumber := temp;
end;
Imho, a basic rule of thumb:
"Reads are thread safe. Writes are not thread safe."
So if you're going to do a write, protect the data with synchronization everywhere you read the value while a write could potentially occur.
On the other hand, if you only read and write a value in one thread, then it's thread safe. So you can do a large chunk of writing in a temporary location, then synchronize an update of application-wide data.
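For comparison, here is the same publish/consume pattern sketched in Go with explicit atomics (illustrative only, not Delphi): because the atomic store of the flag is ordered after the store of the data, a reader that observes the flag set is also guaranteed to see the updated number, avoiding the reordering problem shown above.

package main

import (
	"fmt"
	"sync/atomic"
)

var (
	theNumber     int64
	numberUpdated atomic.Bool
)

func writer() {
	atomic.StoreInt64(&theNumber, 42)
	numberUpdated.Store(true) // published strictly after the data is written
}

func reader() {
	if numberUpdated.Load() { // if this observes true...
		fmt.Println(atomic.LoadInt64(&theNumber)) // ...this is guaranteed to see 42
	}
}

func main() {
	go writer()
	reader()
}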
Bonus blurb:
The VCL is not thread-safe. Keep all modification of UI stuff in the main thread. Keep the creation of all UI stuff in the main thread too.
Many functions are not thread-safe either, while others are; it often depends on the underlying WinAPI calls.
I don't think a "list" would be helpful as "thread safe" can mean a lot of stuff.
This is not a question of data types being thread-safe, but of what you do with them. Without locking, no operation is thread-safe that involves loading a value, then changing it, then writing it back: incrementing or decrementing a number, clearing or setting an element in a set - none of these are thread-safe.
There are a number of functions that allow for atomic operations: interlocked increment, interlocked decrement, and interlocked exchange. This is a common concept, nothing specific to Windows, x86 or Delphi. In Delphi you can use the InterlockedFoo() functions of the Windows API; there are several wrappers around those too. Or write your own. The functions operate on integers, so you can have atomic increment, decrement and exchange of (32-bit) integers with them.
You can also use assembler and prefix ops with the lock prefix.
For more information see also this StackOverflow question.
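For comparison, the same interlocked-style primitives exist in Go's sync/atomic (illustrative; in Delphi you would use the InterlockedXxx wrappers mentioned above). The point is that the read-modify-write happens as one atomic step, so there is no load/modify/store race.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var counter int32

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt32(&counter, 1) // interlocked increment
		}()
	}
	wg.Wait()

	old := atomic.SwapInt32(&counter, 0)                   // interlocked exchange
	swapped := atomic.CompareAndSwapInt32(&counter, 0, -1) // compare-and-swap
	fmt.Println(old, swapped, atomic.LoadInt32(&counter))  // 100 true -1
}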
On a 32-bit architecture, only properly aligned data types of 32 bits or less should be considered atomic. 32-bit values must be 4-aligned (the address of the data must be evenly divisible by four). You probably wouldn't run into interleaving at such a tight level, but theoretically you could get a non-atomic write of a Double, Int64 or Extended.
With multi-core RISC processing and separate per-core cache memory in the mix of a modern processor, it is no longer the case that any 'trivial' high-level language read or write construct (or, for that matter, many once-upon-a-time 'atomic' 8086 assembly instructions) can be considered atomic. Indeed, unless an assembler instruction is specifically designed to be atomic, it probably is not - and that includes most mechanisms for memory reads. Even a long-integer read at assembler level can be corrupted by a simultaneous write from another processor core that is sharing the same memory and using asynchronous cache-update actions at the RISC-processor level. Remember that on a processor comprising multiple RISC cores, even assembly language instructions are effectively just "higher-level" code instructions! You never really know just how they are being implemented at the bit level, and it might not be quite what you expected if you were reading an old 8086 (single-core) assembler manual. Windows does provide native-system-compatible atomic operators, and you would be well advised to use these rather than make any base assumptions about atomic operations.
Why use the Windows operators? Because one of the first things Windows does is establish what machine it is running on. One of the key aspects it makes sure to get right is what atomic operations there are and how they will work. If you want your code to work well into the future on any future processor, you can either duplicate (and constantly update) all this effort in your own code, or you can make use of the fact that Windows did it all already at startup. It then incorporated the necessary code into its API at runtime.
Read the MSDN pages on atomic operations. The Windows API surfaces these for you. They may sometimes seem clunky or clumsy - but they are future proof and they will always work exactly as it says on the tin.
How do I know this? Well, because if they didn't - then you wouldn't be able to run Windows. Full-stop. Never mind running your own code.
Whenever you write code, it is always a good idea to understand parsimony and consider Occam's razor. In other words, if Windows already does it, and your code needs Windows to run, then use what Windows already does rather than trying out many alternative and increasingly complex hypothetical solutions that may or may not work. Doing anything else is just a waste of your time (unless of course that is what you are into).
The Indy code contains some atomic / thread safe data types in IdThreadSafe.pas:
TIdThreadSafeInteger
TIdThreadSafeBoolean
TIdThreadSafeString
TIdThreadSafeStringList
and some more ...
I have a VPS with not very much memory (256 MB) which I am trying to use for Common Lisp development with SBCL+Hunchentoot to write some simple web apps. A large amount of memory appears to be getting used without doing anything particularly complex, and after a while of serving pages it runs out of memory and either goes crazy using all swap or (if there is no swap) just dies.
So I need help to:
Find out what is using all the memory (if it's libraries or me, especially)
Limit the amount of memory which SBCL is allowed to use, to avoid massive quantities of swapping
Handle things cleanly when memory runs out, rather than crashing (since it's a web-app I want it to carry on and try to clean up).
I assume the first two are reasonably straightforward, but is the third even possible?
How do people handle out-of-memory or constrained memory conditions in Lisp?
(Also, I note that a 64-bit SBCL appears to use literally twice as much memory as 32-bit. Is this expected? I can run a 32-bit version if it will save a lot of memory)
To limit the memory usage of SBCL, use the --dynamic-space-size option (e.g., sbcl --dynamic-space-size 128 will limit memory usage to 128 MB).
To find out who is using memory, you may call (room) (the function that tells how much memory is being used) at different times: at startup, after all libraries are loaded, and then during work (of course, call (sb-ext:gc :full t) before calling (room) so as not to measure garbage that has not yet been collected).
Also, it is possible to use the SBCL profiler to measure memory allocation.
Find out what is using all the memory (if it's libraries or me, especially)
Attila Lendvai has some SBCL-specific code to find out where an allocated object comes from. Refer to http://article.gmane.org/gmane.lisp.steel-bank.devel/12903 and write him a private mail if needed.
Be sure to try another implementation, preferably with a precise GC (like Clozure CL) to ensure it's not an implementation-specific leak.
Limit the amount of memory which SBCL is allowed to use, to avoid massive quantities of swapping
Already answered by others.
Handle things cleanly when memory runs out, rather than crashing (since it's a web-app I want it to carry on and try to clean up).
256 MB is tight, but anyway: schedule a recurring (maybe once a second) timed thread that checks the remaining free space. If the free space is less than X, then use exec() to replace the current SBCL process image with a new one.
If you don't have any type declarations, I would expect 64-bit Lisp to take twice the space of a 32-bit one. Even a plain (small) int will use a 64-bit chunk of memory. I don't think it'll use less than a machine word, unless you declare it.
I can't help with #2 and #3, but if you figure out #1, I suspect it won't be a problem. I've seen SBCL/Hunchentoot instances running for ages. If I'm using an outrageous amount of memory, it's usually my own fault. :-)
I would not be surprised by a 64-bit SBCL using twice the memory, as it will probably use a 64-bit cell rather than a 32-bit one, but I couldn't say for sure without actually checking.
Typical things that keep memory hanging around for longer than expected are no-longer-useful references that still have a path to the root allocation set (hash tables are, I find, a good way of letting these things linger). You could try interspersing explicit calls to GC in your code and make sure to (as far as possible) not store things in global variables.