How can threads ID be equal? - pthreads

I'm currently learning about operating systems and concurrency and I'm tasked with using pthreads in c.
After trying to understand and use pthreads, there's a couple of things that I don't quite understand.
If I create two seperate pthreads (two processes, to my understanding), how can the two threads be equal (pthread_equal)? Or what does it mean by this?
Thanks!

If I create two seperate pthreads (two processes, to my understanding), ..
Nope. Threads are not processes. (Threads may be implemented using processes under the hood - but it's still a thread as far as user programs are concerned and should treat them as such).
how can the two threads be equal (pthread_equal)? Or what does it mean by this?
It means it's same thread and is reported by pthread_equal() which compares whether given thread IDs (pthread_t) are equal.
A direct comparsion using == isn't possible because pthread_t is an opaque type and the only way to compare thread IDs is to use pthread_equal() API.
By the way, two threads in different processes may have the same ID (pthread_t).

Related

What does it mean that "two store are seen in a consistent order by other processors"?

In intel's manual:
section of : "8.2.2 Memory Ordering in P6 and More Recent Processor Families"
Any two stores are seen in a consistent order by processors other than those performing the stores
what's meaning of this statement ?
It means no IRIW reordering (Independent Readers, Independent Writers; at least 4 separate cores, at least 2 each writers and readers). 2 readers will always agree on the order of any 2 stores performed other cores.
Weaker memory models don't guarantee this, for example ISO C++11 only guarantees it for seq_cst operations, not for acq_rel or any weaker orders.
A few hardware memory models allow it on paper, including ARM before ARMv8. But in practice it's very rare POWER hardware can actually violate this in practice: See my answer Will two atomic writes to different locations in different threads always be seen in the same order by other threads? for an explanation of a hardware mechanism that can make it happen (store-forwarding between SMT "hyperthreads" on the same physical core making a store visible to some cores before it's globally visible).
x86 forbids this so communication between hyperthreads has to wait for commit to L1d cache, i.e. waiting for the store to be globally visible (thanks to MESI) before any other core can see it. What will be used for data exchange between threads are executing on one Core with HT?

Is there a way to limit the number of threads spawned by GCD in my application?

I know that the max number of threads spawned cannot exceed 66 through the response to this question. But is there a way to limit the thread count to a value which an user has defined?
From my experience and work with GCD under various circumstances, I believe this is not possible.
Said that, it is very important to understand, that by using GCD, you spawn queues, not threads. Whenever a call to create a queue is made from your code, GCD subsystem in its turn checks OS condition and seeks for available resources. New threads are then created under the hood based on these conditions – in the order and with the resources allocated, not controlled by you. This is clearly explained in official documentation:
When it comes to adding concurrency to an application, dispatch queues
provide several advantages over threads. The most direct advantage is
the simplicity of the work-queue programming model. With threads, you
have to write code both for the work you want to perform and for the
creation and management of the threads themselves. Dispatch queues let
you focus on the work you actually want to perform without having to
worry about the thread creation and management. Instead, the system
handles all of the thread creation and management for you. The
advantage is that the system is able to manage threads much more
efficiently than any single application ever could. The system can
scale the number of threads dynamically based on the available
resources and current system conditions. In addition, the system is
usually able to start running your task more quickly than you could if
you created the thread yourself.
Source: Dispatch Queues
There is no way you can control resources consumption with GCD, like by setting some kind of threshold. GCD is a high-level abstraction over low-level things, such as threads, and it manages it for you.
The only way you can possibly influence how many resources particular task within your application should take, is by setting its QoS (Quality of Service) class (formerly known simply as priority, extended to a more complex concept). To be brief, you can classify tasks within your application based on their importance, this way helping GCD and your application be more resource- and battery- efficient. Its employment is highly encouraged in complex applications with vast concurrency usage.
Even still, however, this kind of regulation from developer end has its limits and ultimately does not address the goal to control threads creation:
Apps and operations compete to use finite resources—CPU, memory,
network interfaces, and so on. In order to remain responsive and
efficient, the system needs to prioritize tasks and make intelligent
decisions about when to execute them.
Work that directly impacts the user, such as UI updates, is extremely
important and takes precedence over other work that may be occurring
in the background. This higher priority work often uses more energy,
as it may require substantial and immediate access to system
resources.
As a developer, you can help the system prioritize more effectively by
categorizing your app’s work, based on importance. Even if you’ve
implemented other efficiency measures, such as deferring work until an
optimal time, the system still needs to perform some level of
prioritization. Therefore, it is still important to categorize the
work your app performs.
Source: Prioritize Work with Quality of Service Classes
To conclude, if you are deliberate in your intent to control threads, don't use GCD. Use low-level programming techniques and manage them yourself. If you use GCD, then you agree to leave this kind of responsibility to GCD.

How to convert synchronous blocking shared memory model code to asynchronous coroutines running on thread pool?

While there are lots of solutions matching my question partially, I'd like to know if a complete match exists. It's hard to find a complete solution because of these partial ones occupying search results. This should be a runtime framework and (optionally) a transformation required to source language code when the language doesn't support coroutines.
There are libraries like lthread having lthread_cond_wait() API, but every lthread is bounded by a single pthread. I'd like lightweight threads to be able to run in several pthreads. They should be arbitrary picked by thread pool. Either single-threaded schedulers or global lock schedulers don't match. I think we can do better.
lthreads is also not an option because it neither involves source code transformation nor avoids it like protothreads.
Several green-threading runtimes (Erlang, Limbo) don't match because they are limited to CSP (communicating sequential processes) model only, but I'd like to have shared memory model synchronization primitives as well: mutexes, condition variables, rwlocks.
Transformation involves:
Transforming stack contexts into objects in heap
Transforming mutex calls into manipulating disabling and activating jobs on thread pool and publish-subscribe
Condition variables should also be transformed into publish-subscribe realtionships
It would be nice to have Ada-style rendezvous
I failed to do straightforward runtime implementation due to potential deadlocks in publish-subscribe mechanism without using global lock or single scheduler thread, but I still think this is possible.
Disclaimer: lthread author.
You can launch several pthreads and run an lthread scheduler in each one (this is done automagically by calling lthread_run() in the pthread function). This way each pthread will run a bunch of lthreads.

Reasons of sub-linear speedup in parallel programs

What are the reasons a parallelized program doesn't achieve the ideal speedup?
For example, I have thought about data dependencies, the cost of data transfer between threads (or actors), synchronisation for access to the same data structures, any other ideas (or subcategories of the reasons i mentioned)?
I'm particularly interested for problems occurring in the erlang actor model but any other issues are welcomed.
A few in no particular order:
Cache line sharing - multiple variables on the same cache-line can incur overhead between processors, even if the theoretical model says they should be independent.
Context switch overhead - if you have more threads than cores, there will be overhead in context switching.
Kernel scalability issues: kernels may be fine at say 4 cores, but less efficient at 8.
Lock conveying
Amdahl's law - The limit of the parallel speed up of a program is the proportion of the program that can parallelized.
One reason is that parallelizing a program is often more difficult than one imagines and there are many subtle problems which can occur. For a very good discussion on this see Amdahl's Law.
The main problem in the Erlang Actor model is that each process has its own heap of memory and messages passed are copied around. Contrast with the usual way of using shared memory where you can pass a pointer to a structure between processes.
In a shared memory environment, it is up to the programmer to ensure that only a single process/thread operates on a piece of memory at a time. That is, some process is designated as it and has responsibility for doing the right thing on that memory area. Not so much in Erlang: One process can't by design rummage in other processes memory areas and you must copy values to other processes. This is tremendously powerful when we consider robustness of programs, but not so much if we consider the speed by which the program executes. On the other hand, if we want a distributed environment of multiple computers, copying reigns king and is the only way to transfer data between machines.
Amdahl's law comes into play because parts of your program may be impossible to spread out over multiple cores. There are some problems which are inherently serial in nature: You have no hope of ever speeding them up. Usually they are iterative where each new iteration is dependent on the former and you can't make a guess at the new one.

concurrent write to same memory address

If two threads try to write to the same address at the same time, is the value after the concurrent write guaranteed to be one of the values that the threads tried to write? or is it possible to get a combination of the bits?
Also, is it possible for another thread to read the memory address while the bits are in an unstable state?
I guess what the question boils down to is if a read or write to a single memory address is atomic at the hardware level.
I think this all depends on the "memory model" for your particular programming language or system.
These questions are fundamentals of a system or/and programming language memory model. Therefore, pick your own OS and a programming language, read the specs and you'll see.
In some cases the results may be just as unpredictable when the two threads are writing to different memory addresses - in particular think about C bitfield structures, as well as compiler optimisations when writing to adjacent addresses.
If you fancy a read, Boehm's paper "Threads cannot be implemented as a library" covers this and other quirks of concurrency.
One thing for sure, for a datatype of size equal to CPU Registers can never have bits in unstable state, it will be either of the two values
On a multi-processor computer, there may not be a single "value" that is read. The two threads, and the third one, may see inconsistent values. You'd need a memory barrier to ensure every thread sees the same value at this address.
Apart from that, writes are generally atomic, so it would be either one or the other of the values that have been written (or were there in the first place) that are read. You're not talking about an Alpha processor, are you?

Resources