pthreads mutex number and performance

How many pthreads mutexes are usually available on a typical system?
Does having many pthreads mutexes degrade performance?

POSIX allows mutexes to be implemented as a system-level resource, but such an implementation would be considered extremely poor quality and I can't imagine anyone using it. In reality, on modern implementations (think Linux), the number of mutexes you can have is limited only by the virtual address space. (Actually you can have even more, up to the total size of physical memory/swap plus filesystem size, if you use mmap and munmap to map them in and out as needed.)
As for performance, on 32-bit glibc systems, unlocking robust mutexes is an O(n) operation, where n is the number of robust mutexes currently locked. This is due to using a singly-linked list where a doubly-linked one is needed; they ran out of space in their pthread_mutex_t structure to fit both pointers. This issue only applies for robust mutexes, which are rarely used in practice, and only on 32-bit Linux/glibc. For all other mutex types, the number of mutexes has no impact on performance. The number of currently-contended-for mutexes, however, does have some impact on performance, but it's a complicated subject beyond the scope of a simple answer.
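To make the "limited only by address space" point concrete, here is a minimal C sketch (my own example, not taken from the answer above) that heap-allocates a large array of ordinary process-private mutexes; each one is just a small piece of RAM (roughly 40 bytes on 64-bit glibc) until it is actually contended. Compile with -pthread.

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 1000000;                       /* a million mutexes */
        pthread_mutex_t *m = malloc(n * sizeof *m);
        if (!m) { perror("malloc"); return 1; }

        for (size_t i = 0; i < n; i++)
            pthread_mutex_init(&m[i], NULL);      /* default, non-robust, process-private */

        pthread_mutex_lock(&m[42]);               /* use them like any other object */
        pthread_mutex_unlock(&m[42]);

        for (size_t i = 0; i < n; i++)
            pthread_mutex_destroy(&m[i]);
        free(m);
        printf("allocated and destroyed %zu mutexes\n", n);
        return 0;
    }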

Several answers:
You are asking the wrong question. If you need more than a thousand or so mutexes, you are likely doing something wrong.
As many as you need. A non-process-shared mutex does not usually consume any resources except RAM.
Having many unused mutexes degrades performance in exactly the same way as having many integers; that is, not at all (assuming sufficient RAM).

Related

Are there any problems for which SIMD outperforms Cray-style vectors?

CPUs intended to provide high-performance number crunching end up with some kind of vector instruction set. There are basically two kinds:
SIMD. This is conceptually straightforward, e.g. instead of just having a set of 64-bit registers and operations thereon, you have a second set of 128-bit registers and you can operate on a short vector of two 64-bit values at the same time. It becomes complicated in the implementation because you also want to have the option of operating on four 32-bit values, and then a new CPU generation provides 256-bit vectors which requires a whole new set of instructions etc.
The older Cray-style vector instructions, where the vectors start off large e.g. 4096 bits, but the number of elements operated on simultaneously is transparent, and the number of elements you want to use in a given operation is an instruction parameter. The idea is that you bite off a little more complexity upfront, in order to avoid creeping complexity later.
It has been argued that option 2 is better, and the arguments seem to make sense, e.g. https://www.sigarch.org/simd-instructions-considered-harmful/
At least at first glance, it looks like option 2 can do everything option 1 can, more easily and generally better.
Are there any workloads where the reverse is true? Where SIMD instructions can do things Cray-style vectors cannot, or can do something faster or with less code?
The "traditional" vector approaches (Cray, CDC/ETA, NEC, etc) arose in an era (~1976 to ~1992) with limited transistor budgets and commercially available low-latency SRAM main memories. In this technology regime, processors did not have the transistor budget to implement the full scoreboarding and interlocking for out-of-order operations that is currently available to allow pipelining of multi-cycle floating-point operations. Instead, a vector instruction set was created. Vector arithmetic instructions guaranteed that successive operations within the vector were independent and could be pipelined. It was relatively easy to extend the hardware to allow multiple vector operations in parallel, since the dependency checking only needed to be done "per vector" instead of "per element".
The Cray ISA was RISC-like in that data was loaded from memory into vector registers, arithmetic was performed register-to-register, then results were stored from vector registers back to memory. The maximum vector length was initially 64 elements, later 128 elements.
The CDC/ETA systems used a "memory-to-memory" architecture, with arithmetic instructions specifying memory locations for all inputs and outputs, along with a vector length of 1 to 65535 elements.
None of the "traditional" vector machines used data caches for vector operations, so performance was limited by the rate at which data could be loaded from memory. The SRAM main memories were a major fraction of the cost of the systems. In the early 1990's SRAM cost/bit was only about 2x that of DRAM, but DRAM prices dropped so rapidly that by 2002 SRAM price/MiB was 75x that of DRAM -- no longer even remotely acceptable.
The SRAM memories of the traditional machines were word-addressable (64-bit words) and were very heavily banked to allow nearly full speed for linear, strided (as long as powers of two were avoided), and random accesses. This led to a programming style that made extensive use of non-unit-stride memory access patterns. These access patterns cause performance problems on cached machines, and over time developers using cached systems quit using them -- so codes were less able to exploit this capability of the vector systems.
As codes were being re-written to use cached systems, it slowly became clear that caches work quite well for the majority of the applications that had been running on the vector machines. Re-use of cached data decreased the amount of memory bandwidth required, so applications ran much better on the microprocessor-based systems than expected from the main memory bandwidth ratios.
By the late 1990's, the market for traditional vector machines was nearly gone, with workloads transitioned primarily to shared-memory machines using RISC processors and multi-level cache hierarchies. A few government-subsidized vector systems were developed (especially in Japan), but these had little impact on high performance computing, and none on computing in general.
The story is not over -- after many not-very-successful tries (by several vendors) at getting vectors and caches to work well together, NEC has developed a very interesting system (NEC SX-Aurora Tsubasa) that combines a multicore vector register processor design with DRAM (HBM) main memory, and an effective shared cache. I especially like the ability to generate over 300 GB/s of memory bandwidth using a single thread of execution -- this is 10x-25x the bandwidth available with a single thread with AMD or Intel processors.
So the answer is that the low cost of microprocessors with cached memory drove vector machines out of the marketplace even before SIMD was included. SIMD had clear advantages for certain specialized operations, and has become more general over time -- albeit with diminishing benefits as the SIMD width is increased. The vector approach is not dead in an architectural sense (e.g., the NEC Vector Engine), but its advantages are generally considered to be overwhelmed by the disadvantages of software incompatibility with the dominant architectural model.
Cray-style vectors are great for pure-vertical problems, the kind of problem that some people think SIMD is limited to. They make your code forward compatible with future CPUs with wider vectors.
I've never worked with Cray-style vectors, so I don't know how much scope there might be for getting them to do horizontal shuffles.
If you don't limit things to Cray specifically, modern instruction sets like ARM SVE and the RISC-V V extension also give you forward-compatible code with a variable vector width, and are clearly designed to avoid that problem of short fixed-width SIMD ISAs like AVX2, AVX-512, and ARM NEON.
I think they have some shuffling capability. Definitely masking, but I'm not familiar enough with them to know if they can do stuff like left-pack (AVX2 what is the most efficient way to pack left based on a mask?) or prefix-sum (parallel prefix (cumulative) sum with SSE).
And then there are problems where you're working with a small fixed amount of data at a time, but more than fits in an integer register. For example How to convert a binary integer number to a hex string? although that's still basically doing the same stuff to every element after some initial broadcasting.
But other stuff like Most insanely fastest way to convert 9 char digits into an int or unsigned int where a one-off custom shuffle and horizontal pairwise multiply can get just the right work done with a few single-uop instructions is something that requires tight integration between SIMD and integer parts of the core (as on x86 CPUs) for maximum performance. Using the SIMD part for what it's good at, then getting the low two 32-bit elements of a vector into an integer register for the rest of the work. Part of the Cray model is (I think) a looser coupling to the CPU pipeline; that would defeat use-cases like that. Although some 32-bit ARM CPUs with NEON have the same loose coupling where mov from vector to integer is slow.
Parsing text in general, and atoi, is one use-case where short vectors with shuffle capabilities are effective. e.g. https://www.phoronix.com/scan.php?page=article&item=simdjson-avx-512&num=1 - 25% to 40% speedup from AVX-512 with simdjson 2.0 for parsing JSON, over the already-fast performance of AVX2 SIMD. (See How to implement atoi using SIMD? for a Q&A about using SIMD for JSON back in 2016).
Many of those tricks depend on x86-specific pmovmskb eax, xmm0 for getting an integer bitmap of a vector compare result. You can test whether it's all-zero or all-ones (cmp eax, 0xffff) to stay in the main loop of a memcmp or memchr, for example. And if not, then bsf eax,eax finds the position of the first difference, possibly after a not.
Having vector width limited to a number of elements that can fit in an integer register is key to this, although you could imagine an instruction-set with compare-into-mask with scalable width mask registers. (Perhaps ARM SVE is already like that? I'm not sure.)
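As a concrete illustration of that pmovmskb pattern, here is a hedged memchr-style sketch in C with SSE2 intrinsics (my own example, not taken from any particular library; __builtin_ctz is the GCC/Clang builtin standing in for bsf):

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stddef.h>

    /* Find the first occurrence of byte c in buf[0..len), or return NULL. */
    const char *find_byte(const char *buf, size_t len, char c)
    {
        __m128i needle = _mm_set1_epi8(c);
        size_t i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
            __m128i eq    = _mm_cmpeq_epi8(chunk, needle); /* 0xFF where bytes match */
            int mask = _mm_movemask_epi8(eq);              /* pmovmskb -> 16-bit bitmap */
            if (mask != 0)
                return buf + i + __builtin_ctz(mask);      /* bsf: index of first match */
        }
        for (; i < len; i++)                               /* scalar tail */
            if (buf[i] == c)
                return buf + i;
        return NULL;
    }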

What is "expensive" vs "inexpensive" memory?

I've been reading that accessing data from lower components in the memory hierarchy is slower but less expensive. For example, fetching data from registers is fast but expensive. Could someone explain what 'expensive' means here? Is it literally the dollar cost of components? If so, I don't understand why faster components would be more expensive. I read this answer (Memory Hierarchy - Why are registers expensive?) and it talks about how accessing data in registers requires additional data paths that aren't required by lower memory components, but I didn't understand from any of the examples why those data paths would be required when fetching from registers, but not from something like main memory.
So to summarize, my two questions are:
1) What does 'expensive' mean in this context?
2) Why are faster areas of memory like registers more expensive?
Thanks!
1) What does 'expensive' mean in this context?
Expensive has the usual meaning ($). For an integrated circuit, price depends on die size and is directly related to the number and size of the transistors. It happens that an "expensive" memory requires more area per bit on an integrated circuit.
Actually, there are several technologies used to implement a storage element.
Registers are used in the processor. They are realized with logic devices called latches, and their main quality is speed, in order to allow two reads and one write per cycle. To that end, transistors are sized to improve drive strength. It depends on the actual design, but a bit of storage in a register typically requires ~10 transistors.
Static memory (SRAM) is designed as a matrix of simplified flip-flops, with two cross-coupled inverters per cell, and requires only 6 transistors per stored bit. Moreover, to improve the number of bits per unit area, the transistors are made smaller than those used in registers. SRAM is used for cache memory.
Dynamic memory (DRAM) uses a single transistor and a capacitor per bit. The capacitor is either charged or discharged to represent a 1 or a 0. Though extremely economical, this technique cannot be very fast, especially when a large number of cells is involved, as in present DRAM chips. To improve capacity (number of bits in a given area), transistors are made as small as possible, and complex analog circuitry is used to detect small voltage variations and speed up reading of cell contents. Moreover, reads destroy the cell contents and require a write-back. Finally, the capacitor leaks, so data must be periodically refreshed to ensure integrity. Put together, this makes DRAM a slow device, with an access time of 100-200 processor cycles, but it provides extremely cheap physical memory.
2) Why are faster areas of memory like registers more expensive?
Processors rely on a memory hierarchy, and different levels of the hierarchy have specific constraints. To make a cheap memory, you need small transistors, to reduce the area required to store a bit. But for electrical reasons, a small transistor is a weak driver that cannot provide enough current to switch its output quickly. So, while the underlying technology is similar, the design choices differ in different parts of the memory hierarchy.
What does 'expensive' mean in this context?
More and larger transistors (more silicon per bit), and more power to run those transistors.
Why are faster areas of memory like registers more expensive?
All memory is made as cheap as it can be for the needed speed -- there's no reason to make it more expensive if it doesn't need to be, or slower than it needs to be. So it becomes a tradeoff, finding "sweet spots" in the design space -- make a particular type of circuit as fast and cheap as possible, while a different circuit is also tuned to be as fast and cheap as possible. If one design is both slower and more expensive, then there's no reason to ever use it. It's only when one design is faster while the other is cheaper that it makes sense to use both in different parts of the system.

Good memory use habits when accessing integer values in NVIDIA CUDA

I am completely new to CUDA programming and I want to make sure I understand some basic memory related principles, as I was a little confused with some of my thoughts.
I am working on a simulation that uses billions of single-use random numbers in the range 0 to 50.
After cuRAND fills a huge array with random floats between 0.0 and 1.0, I run a kernel that converts all this float data to the desired integer range. From what I learned, I had a feeling that storing 5 of these values in one unsigned int, using just 6 bits each, is better because of the very low bandwidth of global memory. So I did it.
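For reference, this is roughly the packing I mean (a minimal plain-C sketch with made-up helper names; the same shift-and-mask code works inside a kernel, since 0..50 fits in 6 bits and five 6-bit fields fit in one 32-bit word):

    #include <stdint.h>

    /* Pack five values in the range 0..50 into one 32-bit word, 6 bits each. */
    static inline uint32_t pack5(const uint8_t v[5])
    {
        uint32_t w = 0;
        for (int i = 0; i < 5; i++)
            w |= (uint32_t)(v[i] & 0x3Fu) << (6 * i);
        return w;
    }

    /* Extract the i-th (0..4) packed value. */
    static inline uint8_t unpack1(uint32_t w, int i)
    {
        return (w >> (6 * i)) & 0x3Fu;
    }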
Now I have to store somewhere around 20,000 read-only yes/no values, which will be accessed randomly, with (let's say) equal probability, based on the random values feeding the simulation.
First I thought about shared memory. Using one bit per value looked great until I realized that the more information there is in one bank, the more collisions there will be. So the solution seems to be to use an unsigned short (2 bytes, 40 KB total) per yes/no value, using as much of the available memory as possible and thus minimizing the probability of different threads reading the same bank.
Another thought was to use constant memory and the L1 cache. Here, from what I learned, the approach would be exactly the opposite of shared memory. Having more threads read the same location is now desirable, so putting 32 values in one 4-byte word is now optimal.
So based on the overall probabilities I should decide between shared memory and the constant cache, with shared memory probably being better: with so many yes/no values and just tens or hundreds of threads per block, there will not be many bank collisions.
But am I right in my general understanding of the problem? Are the processors really so fast compared to memory that optimizing memory accesses is crucial, and worrying about the extra instructions needed to extract data with operations like <<, >>, |=, &= is not important?
Sorry for the wall of text; there are a million ways I can make the simulation work, but only a few ways of making it work right. I just don't want to repeat some stupid mistake again and again only because I understand something badly.
The first two optimization priorities for any CUDA programmer are:
Launch enough threads (i.e. expose enough parallelism) that the machine has sufficient work to be able to hide latency
Make efficient use of memory.
The second item above will have various guidelines based on the types of GPU memory you are using, such as:
Strive for coalesced access to global memory
Take advantage of caches and shared memory to optimize the reuse of data
For shared memory, strive for non-bank-conflicted access
For constant memory, strive for uniform access (all threads within a warp read from the same location, in a given access cycle)
Consider use of texture caching and other special GPU features.
Your question is focused on memory:
Are the processors really so fast compared to memory that optimizing memory accesses is crucial, and worrying about the extra instructions needed to extract data with operations like <<, >>, |=, &= is not important?
Optimizing memory access is usually quite important for having code that runs fast on a GPU. Most of the recommendations/guidelines I gave above are not focused on the idea that you should pack multiple data items per int (or some other word quantity); however, it's not out of the question to do so for a memory-bound code. Many programs on the GPU end up being memory-bound (i.e. their usage of memory is ultimately the performance limiter, not their usage of the compute resources on the GPU).
An example of a domain where developers are looking at packing multiple data items per word quantity is in the Deep Learning space, specifically with convolutional neural networks on GPUs. Developers have found, in some cases, that they can make their training codes run faster by:
store the data set as packed half (16-bit float) quantities in GPU memory
load the data set in packed form
unpack (and convert) each packed set of 2 half quantities to 2 float quantities
perform computations on the float quantities
convert the float results to half and pack them
store the packed quantities
repeat the process for the next training cycle
(See this answer for some additional background on the half datatype.)
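As a rough host-side illustration of that load/unpack/compute/repack cycle (a sketch of my own, not actual training code; on the GPU you would normally use the half2 type and the corresponding conversion intrinsics instead), here it is in C using the x86 F16C intrinsics, with two halves packed per 32-bit word (compile with -mf16c):

    #include <immintrin.h>   /* F16C: _mm_cvtph_ps / _mm_cvtps_ph */
    #include <stdint.h>
    #include <stddef.h>

    /* y = a*x + y, with x and y stored as two packed 16-bit halves per 32-bit word. */
    void axpy_packed_half(uint32_t *y, const uint32_t *x, float a, size_t n_words)
    {
        for (size_t i = 0; i < n_words; i++) {
            /* unpack: two halves -> two floats (in the low lanes) */
            __m128 fx = _mm_cvtph_ps(_mm_cvtsi32_si128((int)x[i]));
            __m128 fy = _mm_cvtph_ps(_mm_cvtsi32_si128((int)y[i]));
            /* compute in float */
            __m128 fr = _mm_add_ps(_mm_mul_ps(_mm_set1_ps(a), fx), fy);
            /* repack: floats -> halves, keep the low 32 bits (the two halves) */
            __m128i hr = _mm_cvtps_ph(fr, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
            y[i] = (uint32_t)_mm_cvtsi128_si32(hr);
        }
    }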
The point is, that efficient use of memory is a high priority for making codes run fast on GPUs, and this may include using packed quantities in some cases.
Whether or not it's sensible for your case is probably something you would have to benchmark and compare to determine the effect on your code.
Some other comments reading your question:
Here, from what I learned, the approach would be exactly opposite from shared memory.
In some ways, the use of shared memory and constant memory are "opposite" from an access-pattern standpoint. Efficient use of constant memory involves uniform access, as already stated. In the case of uniform access, shared memory should also perform pretty well, because it has a broadcast feature. You are right to be concerned about bank conflicts in shared memory, but shared memory on newer GPUs has a broadcast feature that negates the bank-conflict issue when the accesses are to the same actual location, as opposed to just the same bank. These nuances are covered in many other questions about shared memory on SO, so I won't go into it further here.
Since you are packing 5 integer quantities (0..50) into a single int, you might also consider (maybe you are doing this already) having each thread perform the computations for 5 quantities. This would eliminate the need to share that data across multiple threads, or have multiple threads read the same location. If you're not doing this already, there may be some efficiencies in having each thread perform the computation for multiple data points.
If you wanted to consider switching to storing 4 packed quantities per int instead of 5, there might be some opportunities to use the SIMD intrinsics, depending on what exactly you are doing with those packed quantities.

CUDA: 'memory bound' vs 'latency bound' vs 'bandwidth bound' vs 'compute bound'

In the many resources online it is possible to find different usages of 'memory', 'bandwidth', and 'latency' bound kernels. It seems to me that authors sometimes use their own definitions of these terms, and I think it would be very beneficial for someone to make a clear distinction.
To my understanding:
Bandwidth bound kernels approach the physical limits of the device in terms of access to global memory. E.g. an application uses 170GB/s out of 177GB/s on an M2090 device.
A latency bound kernel is one whose predominant stall reason is due to memory fetches. So we are not saturating the global memory bus, but still have to wait to get the data into the kernel.
A compute bound kernel is one in which computation dominates the kernel time, under the assumption that there is no problem feeding the kernel with memory, and there is good overlap of arithmetic and latency.
If I got these correct, what would a 'memory bound' kernel be? Is there ambiguity, and if yes, should we limit the conversation to the three above terms?
Thanks!
what would a 'memory bound' kernel be?
Memory bound refers to the general case where a code is limited by memory access, i.e. it includes codes that are latency bound and codes that are bandwidth bound. You've defined pretty much all the other terms correctly.
Is there ambiguity, and if yes, should we limit the conversation to the three above terms?
I don't think there's much ambiguity (you've clearly demarcated 3 of the 4 terms, anyway), and you're not going to impose order on the world in a SO question/answer.

Why is MPI considered harder than shared memory and Erlang considered easier, when they are both message-passing?

There's a lot of interest these days in Erlang as a language for writing parallel programs on multicore. I've heard people argue that Erlang's message-passing model is easier to program than the dominant shared-memory models such as threads.
Conversely, in the high-performance computing community the dominant parallel programming model has been MPI, which also implements a message-passing model. But in the HPC world, this message-passing model is generally considered very difficult to program in, and people argue that shared memory models such as OpenMP or UPC are easier to program in.
Does anybody know why there is such a difference in the perception of message-passing vs. shared memory in the IT and HPC worlds? Is it due to some fundamental difference in how Erlang and MPI implement message passing that makes Erlang-style message-passing much easier than MPI? Or is there some other reason?
I agree with all previous answers, but I think a key point that is not made totally clear is that one reason that MPI might be considered hard and Erlang easy is the match of model to the domain.
Erlang is based on a concept of local memory, asynchronous message passing, and shared state solved by using some form of global database that all threads can get to. It is designed for applications that do not move a whole lot of data around, and that are not supposed to explode out to 100k separate nodes that need coordination.
MPI is based on local memory and message passing, and is intended for problems where moving data around is a key part of the domain. High-performance computing is very much about taking the dataset for a problem, and splitting it up among a host of compute resources. And that is pretty hard work in a message-passing system as data has to be explicitly distributed with balancing in mind. Essentially, MPI can be viewed as a grudging admittance that shared memory does not scale. And it is targeting high-performance computation spread across 100k processors or more.
Erlang is not trying to achieve the highest possible performance, but rather to decompose a naturally parallel problem into its natural threads. It was designed with a totally different type of programming task in mind compared to MPI.
So Erlang is best compared to pthreads and other rather local heterogeneous thread solutions, rather than MPI which is really aimed at a very different (and to some extent inherently harder) problem set.
Parallelism in Erlang is still pretty hard to implement. By that I mean that you still have to figure out how to split up your problem, but there are a few minor things that ease this difficulty when compared to some MPI library in C or C++.
First, since Erlang's message-passing is a first-class language feature, the syntactic sugar makes it feel easier.
Also, Erlang libraries are all built around Erlang's message passing. This support structure helps give you a boost into parallel-processing land. Take a look at the components of OTP like gen_server, gen_fsm, and gen_event. These are very easy-to-use structures that can help your program become parallel.
I think it's more the robustness of the available standard library that differentiates Erlang's message passing from MPI, not really any specific feature of the language itself.
Usually concurrency in HPC means working on large amounts of data. This kind of parallelism is called data parallelism and is indeed easier to implement using a shared memory approach like OpenMP, because the operating system takes care of things like scheduling and placement of tasks, which one would have to implement oneself if using a message passing paradigm.
In contrast, Erlang was designed to cope with task parallelism encountered in telephone systems, where different pieces of code have to be executed concurrently with only a limited amount of communication and strong requirements for fault tolerance and recovery.
This model is similar to what most people use PThreads for. It fits applications like web servers, where each request can be handled by a different thread, while HPC applications do pretty much the same thing on huge amounts of data which also have to be exchanged between workers.
I think it has something to do with the mind-set when you're programming with MPI and when you're programming with Erlang. For instance, MPI is not built-into the language whereas Erlang has built-in support for message passing. Another possible reason is the disconnect between merely sending/receiving messages and partitioning solutions into concurrent units of execution.
With Erlang you are forced to think in a functional programming frame where data actually zips by from function call to function call -- and receiving is an active act which looks like a normal construct in the language. This gives you a closer connection between the computation you're actually performing and the act of sending/receiving messages.
With MPI on the other hand you are forced to think merely about the actual message passing but not really the decomposition of work. This frame of thinking requires somewhat of a context switch between writing the solution and the messaging infrastructure in your code.
The discussion can go on but the common view is that if the construct for message passing is actually built into the programming language and paradigm that you're using, usually that's a better means of expressing the solution compared to something else that is "tacked on" or exists as an add-on to a language (in the form of a library or extension).
Does anybody know why there is such a difference in the perception of message-passing vs. shared memory in the IT and HPC worlds? Is it due to some fundamental difference in how Erlang and MPI implement message passing that makes Erlang-style message-passing much easier than MPI? Or is there some other reason?
The reason is simply parallelism vs concurrency. Erlang is bred for concurrent programming. HPC is all about parallel programming. These are related but different objectives.
Concurrent programming is greatly complicated by heavily non-deterministic control flow and latency is often an important objective. Erlang's use of immutable data structures greatly simplifies concurrent programming.
Parallel programming has much simpler control flow and the objective is all about maximal total throughput and not latency. Efficient cache usage is much more important here, which renders both Erlang and immutable data structures largely unsuitable. Mutating shared memory is both tractable and substantially better in this context. In effect, cache coherence is providing hardware-accelerated message passing for you.
Finally, in addition to these technical differences there is also a political issue. The Erlang guys are trying to ride the multicore hype by pretending that Erlang is relevant to multicore when it isn't. In particular, they are touting great scalability so it is essential to consider absolute performance as well. Erlang scales effortlessly from poor absolute performance on one core to poor absolute performance on any number of cores. As you can imagine, that does not impress the HPC community (but it is adequate for a lot of heavily concurrent code).
Regarding MPI vs OpenMP/UPC: MPI forces you to slice the problem into small pieces and take responsibility for moving data around. With OpenMP/UPC, "all the data is there", you just have to dereference a pointer. The MPI advantage is that 32-512 CPU clusters are much cheaper than 32-512 CPU single machines. Also, with MPI the expense is paid upfront, when you design the algorithm, whereas with OpenMP/UPC the cost of data movement is hidden and only shows up at runtime: if your system uses NUMA (and all big systems do), your program won't scale and it will take a while to figure out why.
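To make that contrast concrete, here is a deliberately trivial sketch (my own example): with OpenMP, a data-parallel loop over shared arrays needs little more than a pragma, whereas the MPI version of the same loop would first have to scatter slices of the arrays to each rank and gather the results back.

    #include <omp.h>

    /* Shared-memory version: the runtime handles scheduling and thread placement. */
    void scale(double *dst, const double *src, double s, int n)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            dst[i] = s * src[i];
    }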
This article actually explains it well: Erlang is best when we are sending small pieces of data around, and MPI does much better on more complex things. Also, the Erlang model is easy to understand :-)
Erlang Versus MPI - Final Results and Source Code
