Practical limit to number of pthreads rwlocks? - pthreads

I am writing an application that requires that certain activities for a given user are not stomped on by possibly competing threads. My entire user database is in memory and I was thinking of adding a pthread_rwlock_t to the user data structure. I do not anticipate more than about 10 to 20 thousand users. At 56 bytes for the lock structure that's not a lot of RAM at all. My question is, is there a practical limitation to the number of actual rwlocks you can have in a process? Please note I am NOT talking about the number of threads that can get a lock, or the number of times a given thread can increment the lock counter. Rather, I am wondering if there is some underlying kernel or other resource that backs each individual lock that I may end up exhausting.

This is a quality-of-implementation issue: POSIX allows the initialisation of a rwlock to fail due to resource exhaustion. However, common Linux implementations, for example, don't require any per-lock resources other than the memory for the pthread_rwlock_t itself.

Related

Can two processes share same page, in case that the other page isn't "full"?

If I have a process which needs 6KB of RAM and the page size is 4KB, I need to allocate two pages. Can another process use the remaining 2KB for itself, so that the two processes share the same page table?
In theory it is possible, but there are significant issues:
As @Peter says, allowing both processes to share the same page would bypass traditional process memory protections, because on most processors the granularity of access protection is no smaller than a whole page.
The two processes would have to coordinate in some way about who gets what part of that shared page. This could range from
simple coordination that says process 1 gets the first half and process 2 gets the second half (but this becomes silly when process 1 or process 2 needs more memory, since at that point they would probably have been better off simply having their own pages), to
the processes communicating with each other to formalize a split of who gets what part of that page. Such communication would typically involve some kind of synchronization, which is a bottleneck in many situations.
Consider multiple threads in the same process: some modern runtime systems, e.g. for Java and C#, provide a separate per-thread allocation area so that simple memory allocations do not require synchronization with other threads.
Having one page that is not full represents less than a page of waste per process, which is not very high overhead, so not really a problem that needs solving, given the issues of security and coordination.
Effectively, the operating systems already share the whole of physical memory between processes albeit at a page granularity, so there is sharing (just not intra-page sharing) and the amount of waste is bounded.
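The bounded waste mentioned above can be checked with a little arithmetic. The following is a minimal Python sketch using the 6KB request and 4KB page size from the question; the function name `pages_and_waste` is made up for illustration.

```python
def pages_and_waste(request_bytes, page_size=4096):
    """Return (pages needed, bytes left unused in the last page)."""
    pages = -(-request_bytes // page_size)  # ceiling division
    waste = pages * page_size - request_bytes
    return pages, waste


pages, waste = pages_and_waste(6 * 1024)
print(pages, waste)  # 2 pages, with 2048 bytes of the second page unused
```

Whatever the request size, the unused part is always smaller than one page, which is why the overhead per allocation stays bounded.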

What's the relationship between CPU Out-of-order execution and memory order?

In my understanding, the CPU changes the order of the operations written in the machine code for optimization, and this is called out-of-order execution.
The term "memory order" defines the order in which memory is accessed. For example, a relaxed order defines very weak ordering rules, so reordering of execution happens easily.
There are memory ordering models such as TSO on x86. Such a model defines the semantics of the order of memory accesses made by the processor.
What I don't understand is the relationship of them.
Is memory order a kind of out of order execution and are there any other ways for OoOe?
Or, is memory order the implementation of out of order execution and all the reorders by processors are based on the semantics?
The general issue is that on a modern multiprocessor system, load and store instructions may become visible to other cores in a different order than program order. Out-of-order execution is one way in which this can happen, but there are others.
For instance, you could have a CPU which executes and retires all instructions in strict program order, but when it does a store instruction, instead of committing it to L1 cache immediately, it puts it in a store buffer to be written to cache later. The store buffer could be designed to write out stores in a different order than they came in; for instance, if a first store misses L1 cache but a second one would hit, you could save time by writing out the second one while waiting for the first one's cache line to load.
Or, even if the store buffer doesn't reorder, you could have a situation where, while a store is still waiting in the store buffer, the CPU executes a load instruction that came later in program order. Other cores will thus see the load happening before the store. This is the situation with x86, for instance.
The memory ordering model defines, in an abstract way, what the programmer is entitled to expect about the order in which loads and stores become visible to other cores (or hardware, etc). It also usually specifies how the programmer can gain stronger guarantees when needed (e.g. by executing barrier instructions). The CPU then has to be designed to provide the defined behavior, which may place constraints on the features it can include. For instance, if the architecture promises TSO, the CPU probably can't include a store buffer that's capable of reordering, unless they manage to do it in such a clever way that the reordering can never be noticed by other cores.
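The store-buffer case described above can be illustrated with a toy simulation. This is only a sketch of an idealized in-order core with a FIFO store buffer, not a model of any real CPU; the `Core` class and the `run_litmus` function are hypothetical names. Each core stores to one variable and then loads the other one, which is the classic "store buffering" litmus test.

```python
from collections import deque


class Core:
    """Toy in-order core with a FIFO store buffer (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = deque()

    def store(self, addr, value):
        # The store retires into the buffer; other cores cannot see it yet.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # A core sees its own buffered stores first (store-to-load forwarding).
        for buffered_addr, value in reversed(self.store_buffer):
            if buffered_addr == addr:
                return value
        return self.memory[addr]

    def drain(self):
        # Commit buffered stores to shared memory in FIFO order.
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value


def run_litmus(drain_before_load):
    memory = {"x": 0, "y": 0}
    c0, c1 = Core(memory), Core(memory)
    c0.store("x", 1)
    c1.store("y", 1)
    if drain_before_load:
        # Acts like a full barrier placed after each store.
        c0.drain()
        c1.drain()
    r0 = c0.load("y")
    r1 = c1.load("x")
    c0.drain()
    c1.drain()
    return r0, r1


print(run_litmus(drain_before_load=False))  # (0, 0)
print(run_litmus(drain_before_load=True))   # (1, 1)
```

Both cores execute strictly in program order, yet without the drain each load reads 0, a result that no interleaving of the four operations could produce if stores became visible immediately. Draining the buffer before the load stands in for the barrier instruction that a memory model would specify for this purpose.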
Related questions:
Are memory barriers needed because of cpu out of order execution or because of cache consistency problem?
Out of Order Execution and Memory Fences
How does memory reordering help processors and compilers?
How do modern Intel x86 CPUs implement the total order over stores

CPUs in multi-core architectures and memory access

I wondered how memory access is handled "in general" if, for example, 2 cores of a CPU try to access memory at the same time (over the memory controller). Actually, the same applies when a core and a DMA-enabled IO device try to access it in the same way.
I think the memory controller is smart enough to utilise the address bus and handle those requests concurrently, but I'm not sure what happens when they try to access the same location, or when the IO operation monopolises the address bus and there's no room for the CPU to move on.
Thx
The short answer is "it's complex, but access can certainly potentially occur in parallel in certain situations".
I think your question is a bit too black and white: you may be looking for an answer like "yes, multiple devices can access memory at the same time" or "no they can't", but the reality is that first you'd need to describe some specific hardware configuration, including some of the low-level implementation details and optimization features to get an exact answer. Finally you'd need to define exactly what you mean by "the same time".
In general, a good first-order approximation is that hardware will make it appear that all devices can access memory approximately simultaneously, possibly with an increase in latency and a decrease in bandwidth due to contention. At the very fine-grained timing level, access by one device may indeed postpone access by another device, or it may not, depending on many factors. It is extremely unlikely you would need this information to implement software correctly, and quite unlikely you need to know the details even to maximize performance.
That said, if you really need to know the details, read on and I can give some general observations on some kind of idealized laptop/desktop/server scale hardware.
As Matthias mentioned, you first have to consider caching. Caching means that any read or write operation subject to caching (which includes nearly all CPU requests and many other types of requests as well) may not touch memory at all, so in that sense many cores can "access" memory (at least the cache image of it) simultaneously.
If you then consider requests that miss in all cache levels, you need to know about the configuration of the memory subsystem. In general a RAM chip can only do "one thing" at a time (i.e., commands [1] such as read and write apply to the entire module), and that usually extends to DRAM modules composed of several chips and also to a series of DRAMs connected via a bus to a single memory controller.
So you can say that, electrically speaking, the combination of one memory controller and its attached RAM is likely to be doing only one thing at once. Now that thing is usually something like reading bytes out of a physically contiguous span of bytes, but that operation could actually help handle several requests from different devices at once: even though each device sends separate requests to the controller, good implementations will coalesce requests to the same or nearby [2] area of memory.
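The coalescing described above can be pictured with a deliberately simplified sketch. Real controllers are far more complex; the 64-byte block width, the addresses, and the `coalesce` function here are assumptions chosen only for illustration.

```python
def coalesce(addresses, width=64):
    """Group byte addresses by the aligned `width`-byte block containing them."""
    blocks = {}
    for addr in addresses:
        base = addr - addr % width  # start of the aligned block
        blocks.setdefault(base, []).append(addr)
    return blocks


requests = [0x1000, 0x1008, 0x1040, 0x103F]
merged = coalesce(requests)
print(len(requests), "requests ->", len(merged), "memory accesses")
```

In this toy model, requests that fall into the same aligned block are served by a single access, so four requests from different devices need only two trips to RAM.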
Furthermore, even the CPU may have such abilities: when a new request occurs it can/must notice that an existing request is in progress for an overlapping region and tie the new request to an old one.
Still, you can say that for a single memory controller you'll usually be serving the request of one device at a time, absent unusual opportunities to combine requests. Now the requests themselves typically take on the order of nanoseconds, so many separate requests can be served in a small unit of time, so this "exclusiveness" is fine-grained and not generally noticeable [3].
Now above I was careful to limit the discussion to a single memory controller: when you have multiple memory controllers [4] you can definitely have multiple devices accessing memory simultaneously even at the RAM level. Here each controller is essentially independent, so if the requests from two devices map to different controllers (different NUMA regions) they can proceed in parallel.
That's the long answer.
[1] In fact, the command stream is lower level and more complex than things like "read" or "write" and involves concepts such as opening a memory page, streaming bytes from it, etc. "What Every Programmer Should Know About Memory" serves as an excellent intro to the topic.
[2] For example, imagine two requests for adjacent bytes in memory: it is possible the controller can combine them into a single request if they fit within the bus width.
[3] Of course if you are competing for memory across several devices, the overall impact may be very noticeable: a reduction in per-device bandwidth and an increase in latency. What I mean is that the sharing is fine-grained enough that you can't generally tell the difference between finely-sliced exclusive access and some hypothetical device which makes simultaneous progress on each request in each period.
[4] The most common configuration on modern hardware is one memory controller per socket, so on a 2P system you'd usually have two controllers, although other ratios (both higher and lower) are certainly possible.
There are dozens of things that come into play. E.g. on the lowest level there are bus arbitration mechanisms which allow that multiple participants can access a shared address and data bus.
On a higher level there are also things like CPU caches that need to be considered: if a CPU reads from memory it might only read from its local cache, which might not reflect the state that exists in another CPU core's local cache. To synchronize memory between cache instances in multicore systems there are cache coherence protocols, which are implemented in the CPUs. These have to guarantee that if one CPU writes to shared memory, the caches of all other CPUs (which might also contain a copy of the memory location's content) get updated or invalidated.

Data Retrieval Throughput - ETS lookup vs inter-process Messaging

Suppose we have an Erlang application which involves thousands of processes. Suppose there is a single resource X, which may be a tuple, a list, or any Erlang term, from which all these processes may need to read or pick something out at any moment in time.
An example of such an occurrence is, say, an API system in which client processes may need to read and write on a remote machine. And it happens that you do not want a new connection to be created for each read/write request. So what you do is create a pool of connections; consider them a pool of open pipes/sockets/channels.
Now, this pool of resources is to be shared by thousands of processes, such that for each read or write demand you want that process to retrieve any available open channel/resource.
The question is, what if I have a process (a single process) hold this information, whether in its process dictionary or in its receive loop? It would mean that all the processes would have to send a message to this process whenever they need a free resource. This single process would have a huge mailbox at any time because of the high demand for this single resource. Or I could use an ETS table and have only one row, say, #resources{key=pool,value= List_of_openSockets_or_channels}. But this would mean that all our processes would attempt to read the same row from the ETS table at (with high probability) the same instant.
How would the ETS table cope if 10,000 processes attempt to read the same row/record from it at the same time, or almost the same time? And likewise, if I use a process, how would its mailbox cope if 10,000 processes send a message to it at the same time for the same resource (and it would need to reply to each requester)? Remember that this action may occur very frequently. Which option (disregarding availability issues such as the process going down) would provide higher throughput, so that processes get what they need faster? Is there any better way of handling high-demand data structures in the Erlang VM that will provide very fast access to millions of processes, even if they all needed that resource at the same time?
Short answer: profile. Try different approaches and verify how your system behaves.
Firstly, I would look at ETS' {read_concurrency, true} option. From the documentation:
{read_concurrency,boolean()} Performance tuning. Default is false.
When set to true, the table is optimized for concurrent read
operations. When this option is enabled on a runtime system with SMP
support, read operations become much cheaper; especially on systems
with multiple physical processors. However, switching between read and
write operations becomes more expensive. You typically want to enable
this option when concurrent read operations are much more frequent
than write operations, or when concurrent reads and writes comes in
large read and write bursts (i.e., lots of reads not interrupted by
writes, and lots of writes not interrupted by reads). You typically do
not want to enable this option when the common access pattern is a few
read operations interleaved with a few write operations repeatedly. In
this case you will get a performance degradation by enabling this
option. The read_concurrency option can be combined with the
write_concurrency option. You typically want to combine these when
large concurrent read bursts and large concurrent write bursts are
common.
Secondly, I would look at caching possibilities. Are the processes reading that information only once or multiple times? If they're accessing it multiple times, you could read it once and store it in your process state.
Thirdly, you could try to replicate and distribute that piece of information across your system. Divide et impera.
If you use the process approach, in order to avoid having all the read requests serialized on the message queue of the 'server' process you must replicate.
Using an ETS table with read_concurrency feels more natural and it is something that I used when developing the parallel version of Dialyzer. However, ETS access was never a bottleneck in that case.

cooperative memory usage across threads?

I have an application that has multiple threads processing work from a todo queue. I have no influence over what gets into the queue and in what order (it is fed externally by the user). A single work item from the queue may take anywhere from a couple of seconds to several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this.
1. Plan conservatively and run 4 threads only. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
2. Make each thread check available memory (or rather the total memory allocated by all threads) before starting a new item. Only start when more than 2GB of memory is left. Recheck periodically, hoping that other threads will finish their memory hogs so that we may start eventually.
3. Try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
4. More ideas?
I'm currently tending towards number 2 because it seems simple to implement and solve most cases. However, I'm still wondering what standard ways of handling situations like this exist? The operating system must do something very similar on a process level after all...
regards,
Sören
So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption, or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput: definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
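One way to sidestep querying the OS is to keep an application-level budget instead: each worker reserves an amount before starting an item and gives it back afterwards. This is a variation on option 2 rather than anything from the original post. The `MemoryBudget` class, the 6GB figure (assumed headroom after the OS takes its share), and the per-item estimate are all hypothetical. It tracks reservations, not actual allocations, so it is only as good as the estimates fed into it.

```python
import threading


class MemoryBudget:
    """Application-level reservation counter (hypothetical helper)."""

    def __init__(self, total_bytes):
        self._total = total_bytes
        self._available = total_bytes
        self._cond = threading.Condition()

    def acquire(self, nbytes):
        if nbytes > self._total:
            raise ValueError("request exceeds total budget")
        with self._cond:
            # Block until other workers have released enough budget.
            while self._available < nbytes:
                self._cond.wait()
            self._available -= nbytes

    def release(self, nbytes):
        with self._cond:
            self._available += nbytes
            self._cond.notify_all()

    @property
    def available(self):
        with self._cond:
            return self._available


GB = 1024 ** 3
budget = MemoryBudget(6 * GB)  # assumed headroom, not a measured value


def worker(estimate):
    budget.acquire(estimate)
    try:
        pass  # process the work item here
    finally:
        budget.release(estimate)


# Eight workers each reserving the 2GB worst case: at most three run at once.
threads = [threading.Thread(target=worker, args=(2 * GB,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(budget.available == 6 * GB)  # True
```

If every item reserves the full worst case, this degenerates into running a fixed, small number of threads, so the approach only pays off when a usable per-item estimate exists.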
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (e.g. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
I have continued the discussion on Herb Sutter's blog and provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!
