Pthread policy in high contention applications - pthreads

Does anybody know how pthreads behave in high-contention and low-contention situations? As far as I know, there is a Unix-based policy (SCHED_OTHER) for mutex handling which optimizes the execution time in terms of contention. I want to know when pthreads change their policy from SCHED_FIFO to SCHED_RR and conversely. How do pthreads detect the contention (what is the threshold for changing the policy)?
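For context, SCHED_OTHER, SCHED_FIFO, and SCHED_RR are per-thread scheduling policies that a program selects explicitly through the thread attributes. A minimal Linux sketch of doing so follows; the real-time policies normally require CAP_SYS_NICE or root, and the worker function and priority value are only illustrative:

    /* Sketch: selecting a real-time scheduling policy for a new thread on
     * Linux. SCHED_FIFO/SCHED_RR normally require CAP_SYS_NICE or root. */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    static void *worker(void *arg) {
        (void)arg;
        int policy;
        struct sched_param param;
        pthread_getschedparam(pthread_self(), &policy, &param);
        printf("policy %s, priority %d\n",
               policy == SCHED_FIFO ? "SCHED_FIFO" :
               policy == SCHED_RR   ? "SCHED_RR"   : "SCHED_OTHER",
               param.sched_priority);
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        struct sched_param param = { .sched_priority = 10 };

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_RR);   /* or SCHED_FIFO */
        pthread_attr_setschedparam(&attr, &param);

        pthread_t t;
        int rc = pthread_create(&t, &attr, worker, NULL);
        if (rc != 0)
            fprintf(stderr, "pthread_create: %s\n", strerror(rc)); /* EPERM without privileges */
        else
            pthread_join(t, NULL);

        pthread_attr_destroy(&attr);
        return 0;
    }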

Related

Does pthread mutex guarantee starvation freedom?

Background
I often stumble upon cases where the order of lock acquisitions must be the same as the real-time order of lock attempts. Those cases are usually about semaphore-like locks.
Theory
From what I read in "The Art of Multiprocessor Programming", deadlock freedom and a first-come-first-served guarantee are sufficient to make a lock starvation free. Deadlock freedom seems to be on the users, since they have to remember to unlock properly. I have looked at the possible mutex types on the pthreads manual page, but it doesn't seem to mention any ordering of lock acquisitions.
Question
Does a pthread mutex guarantee starvation freedom? Are there implementations that do (I'm mainly concerned about the Linux family and macOS)? Are semaphores guaranteed the same properties as mutexes?
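As an aside: since, as noted above, the manual page does not document any acquisition ordering, first-come-first-served behaviour can be layered on top of a pthread mutex and condition variable with a ticket scheme. A minimal sketch follows; the fair_lock_t type and functions are illustrative, not a pthreads API:

    /* Sketch of a ticket (FCFS) lock built from a pthread mutex and condition
     * variable: threads are served strictly in the order they ask for the
     * lock, which together with deadlock freedom gives starvation freedom. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t mtx;
        pthread_cond_t  cond;
        unsigned long   next_ticket;   /* next ticket to hand out             */
        unsigned long   now_serving;   /* ticket currently allowed to proceed */
    } fair_lock_t;

    #define FAIR_LOCK_INITIALIZER \
        { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

    void fair_lock(fair_lock_t *l) {
        pthread_mutex_lock(&l->mtx);
        unsigned long my_ticket = l->next_ticket++;
        while (my_ticket != l->now_serving)
            pthread_cond_wait(&l->cond, &l->mtx);
        pthread_mutex_unlock(&l->mtx);
    }

    void fair_unlock(fair_lock_t *l) {
        pthread_mutex_lock(&l->mtx);
        l->now_serving++;
        pthread_cond_broadcast(&l->cond);   /* wake all; only the next ticket proceeds */
        pthread_mutex_unlock(&l->mtx);
    }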

Which cores are the performance cores in Instruments?

I'm running a profile in Instruments on an iPhone X with the A11 CPU. This CPU has two performance cores and four efficiency cores.
May I ask if there is a way to tell which ones are the performance cores? And as for the main thread, will GCD put main-thread tasks more on the performance cores than on the efficiency ones?
I'm very interested to understand how this actually works.
GCD doesn't know anything about different kinds of cores, and GCD also doesn't decide which code runs on which core.
GCD decides which queue gets a thread from which thread pool and which code is scheduled to run next on the thread of the queue.
Deciding when a thread will run, and on which core it will run, is done by the thread scheduler of the kernel. The kernel also decides how many threads are available in which GCD thread pool.
The main thread is just a thread like any other thread. How much CPU time a thread gets depends on its own priority level, the number of other threads, their priority levels, and the amount of work scheduled for each of them.
As the A11 allows all six cores to be active at the same time, the kernel decides which thread gets a high-performance core and which one only a low-performance one. High-priority threads and threads with a heavy computational workload (those that want to run very often and usually use up their full runtime quantum when running) are preferred for high-performance cores. Low-priority threads and threads with little computational workload (those that want to run infrequently and very often yield or block before their runtime quantum has been used up) are preferred for low-performance cores. In theory every thread can run on any core, as it would be wasteful to leave cores unused while threads are waiting to run, yet low-power cores are generally preferred, since that reduces power consumption and increases battery runtime.
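For what it's worth, the main lever an application has here is indirect: the QoS class of its queues and work items. On Apple's asymmetric CPUs the scheduler tends to favour performance cores for high-QoS, compute-heavy work and efficiency cores for background work. A minimal C sketch against libdispatch (compiled with clang on an Apple platform; core placement remains entirely up to the kernel):

    /* Sketch: the program only expresses QoS; the kernel scheduler decides
     * which core each piece of work actually runs on. */
    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void) {
        dispatch_group_t group = dispatch_group_create();

        /* High-QoS work: the scheduler tends to favour performance cores. */
        dispatch_group_async(group,
            dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^{
                puts("compute-heavy task");
            });

        /* Background-QoS work: typically kept on efficiency cores. */
        dispatch_group_async(group,
            dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0), ^{
                puts("housekeeping task");
            });

        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
        return 0;
    }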

Why POSIX standardize semaphore as system call but leave mutex and condition variable to Pthread (user level)

I came up with this strange question, which has haunted me: why does POSIX standardize support for semaphores as system calls but leave condition variables and mutexes to the pthread library?
What's the division of responsibility here? Why are semaphores not standardized in the pthread package? Why is the synchronization syscall that POSIX standardizes the semaphore, and not the mutex or condition variable?
I don't know. My guess is that performance is the concern for not implementing the mutex as a syscall. (Atomic hardware instructions are unprivileged, so implementing them at user level is possible. Even though Linux provides futex, that is really an optimization of the spin lock into a two-phase lock that falls back to a sleep lock.) And the reason for the semaphore is that a semaphore can be manipulated by different processes, whereas a mutex can only be unlocked by the process that holds it? A semaphore's V operation unblocks a process waiting on it. So the semaphore is kept by the kernel, and a semaphore's id is like a file descriptor, a capability given out by the kernel, which makes it a syscall and not a purely user-level package.
But what about condition variables? Is there any reason to specify them in pthreads rather than at the syscall level? Is it because a condition variable is stateless and originates from the monitor, a purely stateless programming construct, so it can be implemented using a mutex?
Thanks!
Short answer: semaphores and pthreads have separate histories.
Yes: semaphores can be used between processes, whereas pthreads stuff is (generally) all within the current process, or between processes which share memory.
From a performance perspective: a quick poke around (on my x86_64) tells me that sem_wait() and sem_post() use straightforward lock cmpxchg instructions, making a syscall only to suspend or wake up a thread. That is essentially the same as a pthread_mutex_t -- when the semaphore is used as a mutex.
Obviously a semaphore can do things that a mutex and a condition variable do not do, and you can use unnamed semaphores within a process -- sem_init() with pshared=0.
I guess the pthread developers decided that specifying a pthread_sema_t would be unnecessary duplication. Sadly, it does leave room for doubt that the (more general) semaphore might have performance issues even when used only within a process :-( Or, indeed, some doubt about whether semaphores and pthread stuff always play nicely together :-(
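To illustrate the point about using a semaphore as a mutex, here is a minimal sketch, assuming Linux (where unnamed semaphores are supported): a process-private semaphore created with pshared=0 and an initial value of 1, used side by side with a pthread_mutex_t. Both take the kernel path only under contention.

    /* Sketch (Linux): an unnamed, process-private binary semaphore used
     * exactly like a mutex, next to a real pthread_mutex_t. */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t sem_lock;                      /* binary semaphore as a lock */
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    static long counter_sem, counter_mtx;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            sem_wait(&sem_lock);                /* "lock"   */
            counter_sem++;
            sem_post(&sem_lock);                /* "unlock" */

            pthread_mutex_lock(&mtx);
            counter_mtx++;
            pthread_mutex_unlock(&mtx);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        sem_init(&sem_lock, /*pshared=*/0, /*value=*/1);
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("sem: %ld, mutex: %ld\n", counter_sem, counter_mtx);
        sem_destroy(&sem_lock);
        return 0;
    }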

Measuring memory performance in Erlang

Is there a way to measure the total memory usage when running a program in Erlang? My benchmarks are such that I spawn a process, which in turn spawns more processes, and so on. Towards the end, they all collapse back until only the initial process remains and receives some result.
I am interested in the highest momentary memory usage overall. Assuming memory usage is 0 before I spawn my process, what is the peak momentary memory usage?
I looked at this thread: GC performance in Erlang, which describes process_info/2. It seems, however, that if I spawn a process the memory reported by process_info(self(), memory) does not increase.
Percept seems to mainly gather statistics of processes and their lifetimes, rather than their resource consumption.
Any help is appreciated.

Memory Profiler tool to get an estimation of improvement enabling NUMA

I work on a low-latency application that, in my opinion, would greatly benefit from enabling NUMA (or from improving memory locality in general).
Is there a profiling tool that would give me an estimate of the possible improvement, perhaps as a percentage or a factor by which execution time would be reduced?
I was considering using cachegrind. I would expect a lot of LL cache misses, but I still wouldn't have an idea of the expected improvement.
Thanks a lot.
Edit:
The goal here is to reduce latency. Currently, there is a thread that runs on startup and does all the allocations. A better implementation, I believe, would be to pin the threads to CPU cores and make every thread perform the allocations it needs. Before doing that, I'd like to have, somehow, an estimate of the benefit in terms of latency.
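For illustration, a minimal Linux sketch of the per-thread scheme described in the edit: each worker is pinned to a core and then allocates and first-touches its own buffer, so the pages are faulted in on that core's local NUMA node. The BUFFER_SIZE, core numbers, and worker() function are assumptions for the example, not from the original application.

    /* Sketch (Linux): pin each worker to a core, then allocate and
     * first-touch its buffer from that thread so the kernel places the
     * pages on the local NUMA node. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <string.h>

    #define BUFFER_SIZE (64 * 1024 * 1024)

    static void *worker(void *arg) {
        int core = (int)(long)arg;

        /* Pin this thread to one core before allocating anything. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Allocate and first-touch the buffer from the pinned thread. */
        char *buf = malloc(BUFFER_SIZE);
        memset(buf, 0, BUFFER_SIZE);

        /* ... latency-critical work on buf ... */

        free(buf);
        return NULL;
    }

    int main(void) {
        pthread_t threads[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (int i = 0; i < 4; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }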
