How to convert synchronous, blocking, shared-memory code to asynchronous coroutines running on a thread pool? - pthreads

While there are lots of solutions that partially match my question, I'd like to know whether a complete match exists. It's hard to find one because the partial solutions crowd the search results. What I'm after is a runtime framework and, optionally, a source-to-source transformation for languages that don't support coroutines.
There are libraries like lthread that offer an lthread_cond_wait() API, but every lthread is bound to a single pthread. I'd like lightweight threads to be able to run on several pthreads, picked arbitrarily by the thread pool. Neither single-threaded schedulers nor global-lock schedulers qualify. I think we can do better.
lthread is also not an option because it neither involves source code transformation nor avoids it the way protothreads do.
Several green-threading runtimes (Erlang, Limbo) don't match because they are limited to CSP (communicating sequential processes) model only, but I'd like to have shared memory model synchronization primitives as well: mutexes, condition variables, rwlocks.
Transformation involves:
Transforming stack contexts into objects in heap
Transforming mutex calls into disabling and activating jobs on the thread pool via publish-subscribe
Transforming condition variables into publish-subscribe relationships as well
It would be nice to have Ada-style rendezvous
I failed to write a straightforward runtime implementation because of potential deadlocks in the publish-subscribe mechanism when neither a global lock nor a single scheduler thread is used, but I still think it is possible.
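The first transformation step, moving a stack context into a heap object, can be sketched in C as a hand-written state machine. This is only an illustration under stated assumptions: the names counter_task, counter_create and counter_resume are hypothetical, and the "blocking" original is imagined as a loop that adds i to sum and then yields on each iteration. Because every live local is stored in the heap object, any thread of a pool could call the resume function next.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical coroutine frame: the locals that a blocking function would
 * keep on its stack live here instead, so any pool thread can resume it. */
typedef struct counter_task {
    int state;   /* resume point: 0 = start, 1 = inside loop, 2 = finished */
    int i;       /* former stack local: loop counter */
    int limit;   /* former stack argument */
    int sum;     /* former stack local: accumulator */
} counter_task;

counter_task *counter_create(int limit) {
    counter_task *t = calloc(1, sizeof *t);
    if (t) t->limit = limit;
    return t;
}

/* Runs until the next suspension point. Returns 1 if the task yielded
 * and should be rescheduled, 0 if it has finished. */
int counter_resume(counter_task *t) {
    assert(t != NULL);
    switch (t->state) {
    case 0:
        t->i = 1;
        t->sum = 0;
        t->state = 1;
        /* fall through */
    case 1:
        if (t->i <= t->limit) {
            t->sum += t->i;
            t->i++;
            return 1;   /* yield: the frame survives on the heap */
        }
        t->state = 2;
        return 0;
    default:
        return 0;       /* already finished */
    }
}
```

A scheduler would keep calling counter_resume(t) while it returns 1 and read t->sum once it returns 0. The sketch deliberately leaves out the hard part named in the question, namely parking such frames on mutexes and condition variables without a global lock.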

Disclaimer: lthread author.
You can launch several pthreads and run an lthread scheduler in each one (this is done automagically by calling lthread_run() in the pthread function). This way each pthread will run a bunch of lthreads.

Related

Why is there only limited usage of thread pools in TensorFlow-Federated?

TFF's threading libraries start a new thread from ThreadRun by default, and the only usage (as of TFF 0.42.0) of the optional ThreadPool parameter is in the implementation of a single executor. Why is this the case?
After conferring with some people who were close to the implementation, the understanding we came to was:
The issue with totally general usage of thread pools in TFF is generally that if used incorrectly, we may be courting deadlock. We need FIFO scheduling in the thread pool itself, and FIFO-compatible usage in the runtime (if you need the result of a computation, you need to know it will be started before you start).
When implementing the first usages of thread pools in the TF executor, we reasoned our way to believing that the following statement is true: at the leaf executors (that is, as long as an executor doesn't have any children), this FIFO-compatible programming is guaranteed by the stateful executor interface. That is, if you need a value, you know it has already been created (otherwise the executor wouldn't be able to resolve it), so as long as the thread pool is FIFO, it will be ready before you execute. Either the creating function already pushed a function onto this FIFO queue or it created the value directly, so you can push yourself onto the FIFO queue, no sweat.
Because it is difficult, we haven't tried very hard to reason about how, or whether, we might be able to make similar statements about executors that have children (these children may be pushing work onto the queue; AFAIK we don't currently make any guarantees about how we do this, but I could imagine reasoning about a similar invariant step by step 'up the stack'). Thus we have so far only considered it safe to inject thread pool usage at leaf executors. The fact that we don't have this in the XLAExecutor yet is simply due to lack of use.

Synchronizing forked processes with pthread_mutex in C

Is it possible to use a mutex from pthread.h to synchronize processes created with fork() from unistd.h? AFAIK, both ultimately use the clone() system call.
I am asking in the context of a shared memory segment (from ipc.h, shm.h) holding critical data that should be protected against concurrent writes from different processes. Semaphores can be defined in that memory and later used from different processes. Why couldn't mutexes be used instead of semaphores?
Why am I asking?
First of all I was told that it won't work, without receiving any explanation for that. On the Internet I was not able to find any answer so I decided to ask here.
Second, a forked process is safer than a thread created with pthread_create: if a forked process crashes, the rest of the program continues to work, whereas if a thread crashes the whole program exits.
Third, mutexes seem more human-friendly to manage than semaphores.
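POSIX does define a process-shared attribute for mutexes, so one way to try the idea out is the sketch below. It is a minimal experiment, not a definitive answer: it assumes the platform supports PTHREAD_PROCESS_SHARED, it uses an anonymous shared mmap() mapping instead of the shmget() segment mentioned in the question (a substitution made only to keep the example short), and the names shared_area and run_shared_counter are hypothetical.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc in strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Data placed in memory that the parent and all children map. */
typedef struct shared_area {
    pthread_mutex_t lock;
    long counter;
} shared_area;

/* Forks nprocs children; each adds 1 to the shared counter iters times
 * under the mutex. Returns the final counter, or -1 on setup failure. */
long run_shared_counter(int nprocs, int iters) {
    assert(nprocs >= 0 && iters >= 0);
    shared_area *area = mmap(NULL, sizeof *area, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) return -1;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* Without this attribute the mutex is only valid inside one process. */
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutex_init(&area->lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        munmap(area, sizeof *area);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);
    area->counter = 0;

    for (int p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid < 0) break;             /* could not fork: stop spawning */
        if (pid == 0) {                 /* child process */
            for (int i = 0; i < iters; i++) {
                pthread_mutex_lock(&area->lock);
                area->counter++;
                pthread_mutex_unlock(&area->lock);
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0) { }          /* reap every child */

    long result = area->counter;
    pthread_mutex_destroy(&area->lock);
    munmap(area, sizeof *area);
    return result;
}
```

If every fork() succeeds, run_shared_counter(4, 1000) should return 4000. One caveat related to the crash-safety argument above: if a process dies while holding the lock, the others block forever unless the mutex is also made robust with pthread_mutexattr_setrobust(), which this sketch does not do.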

Is Javonet a thread-safe library and, more importantly, does it allow use of all threads?

Is Javonet thread-safe? I couldn't find any documentation one way or the other. Even if it is thread-safe, is there some sort of "mutex" that prevents full use of all threads?
When I tried to run Javonet in parallel, it did work, but the CPU usage did not rise significantly above the sequential load (i.e. on a 10-CPU system, CPU usage hovered around 20% for the parallel load, which was merely double the sequential CPU load of 10%). However, when I ran 10 instances of the exact same sequential code (which used Javonet), I achieved 100% CPU usage. So it "feels" like Javonet must have some built-in mutexes that prevent full parallel usage.
Javonet is thread safe. You just need to follow standard practices for writing multi-threaded applications and Javonet will take care of executing your code properly.
Javonet creates a new corresponding .NET thread for each calling Java thread. The same holds in the other direction: for callbacks, events and delegates invoked from another thread, Javonet creates the corresponding thread on the Java side. Once the calling thread completes, Javonet closes the thread on the other side.
If the corresponding thread already exists, Javonet rejoins that existing thread.
Javonet does use internal mutexes / read-write locks when accessing object instances, some caching collections and types, which, depending on your Java code, might affect how well the work parallelizes.

Is there a way to limit the number of threads spawned by GCD in my application?

I know from the response to this question that the maximum number of threads spawned cannot exceed 66. But is there a way to limit the thread count to a user-defined value?
From my experience and work with GCD under various circumstances, I believe this is not possible.
That said, it is important to understand that with GCD you create queues, not threads. Whenever your code creates a queue, the GCD subsystem checks the state of the OS and looks for available resources. New threads are then created under the hood based on these conditions, in an order and with resources that you do not control. This is clearly explained in the official documentation:
When it comes to adding concurrency to an application, dispatch queues
provide several advantages over threads. The most direct advantage is
the simplicity of the work-queue programming model. With threads, you
have to write code both for the work you want to perform and for the
creation and management of the threads themselves. Dispatch queues let
you focus on the work you actually want to perform without having to
worry about the thread creation and management. Instead, the system
handles all of the thread creation and management for you. The
advantage is that the system is able to manage threads much more
efficiently than any single application ever could. The system can
scale the number of threads dynamically based on the available
resources and current system conditions. In addition, the system is
usually able to start running your task more quickly than you could if
you created the thread yourself.
Source: Dispatch Queues
There is no way to control resource consumption with GCD, for example by setting some kind of threshold. GCD is a high-level abstraction over low-level things such as threads, and it manages them for you.
The only way you can influence how many resources a particular task in your application takes is by setting its QoS (Quality of Service) class (formerly known simply as priority, since extended to a more complex concept). In brief, you classify the tasks in your application by importance, which helps GCD and your application be more resource- and battery-efficient. Using it is highly encouraged in complex applications that make heavy use of concurrency.
Even so, this kind of regulation from the developer's end has its limits and ultimately does not address the goal of controlling thread creation:
Apps and operations compete to use finite resources—CPU, memory,
network interfaces, and so on. In order to remain responsive and
efficient, the system needs to prioritize tasks and make intelligent
decisions about when to execute them.
Work that directly impacts the user, such as UI updates, is extremely
important and takes precedence over other work that may be occurring
in the background. This higher priority work often uses more energy,
as it may require substantial and immediate access to system
resources.
As a developer, you can help the system prioritize more effectively by
categorizing your app’s work, based on importance. Even if you’ve
implemented other efficiency measures, such as deferring work until an
optimal time, the system still needs to perform some level of
prioritization. Therefore, it is still important to categorize the
work your app performs.
Source: Prioritize Work with Quality of Service Classes
To conclude, if you are deliberate in your intent to control threads, don't use GCD. Use low-level programming techniques and manage them yourself. If you use GCD, then you agree to leave this kind of responsibility to GCD.
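As a rough illustration of what "manage them yourself" can mean, here is a minimal fixed-size worker pool built on plain pthreads rather than GCD. It is a sketch under stated assumptions, not a replacement for dispatch queues: every name in it (pool, pool_create, pool_submit, pool_destroy, pool_demo) is hypothetical, error handling is minimal, and the thread count is simply whatever the caller passes in.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef void (*job_fn)(void *);
typedef struct job { job_fn fn; void *arg; struct job *next; } job;

typedef struct pool {
    pthread_mutex_t lock;
    pthread_cond_t has_work;
    job *head, *tail;   /* FIFO queue of pending jobs */
    int stopping;       /* set once by pool_destroy */
    int nthreads;       /* number of workers actually started */
    pthread_t *threads;
} pool;

/* Each worker pops jobs in FIFO order until the pool stops and drains. */
static void *worker(void *p_) {
    pool *p = p_;
    for (;;) {
        pthread_mutex_lock(&p->lock);
        while (!p->head && !p->stopping)
            pthread_cond_wait(&p->has_work, &p->lock);
        if (!p->head) {                 /* stopping and nothing left */
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        job *j = p->head;
        p->head = j->next;
        if (!p->head) p->tail = NULL;
        pthread_mutex_unlock(&p->lock);
        j->fn(j->arg);                  /* run the job outside the lock */
        free(j);
    }
}

/* Starts at most nthreads workers; this is the user-defined limit. */
pool *pool_create(int nthreads) {
    assert(nthreads > 0);
    pool *p = calloc(1, sizeof *p);
    if (!p) return NULL;
    p->threads = calloc((size_t)nthreads, sizeof *p->threads);
    if (!p->threads) { free(p); return NULL; }
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->has_work, NULL);
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&p->threads[i], NULL, worker, p) != 0) break;
        p->nthreads++;
    }
    return p;
}

int pool_submit(pool *p, job_fn fn, void *arg) {
    job *j = malloc(sizeof *j);
    if (!j) return -1;
    j->fn = fn; j->arg = arg; j->next = NULL;
    pthread_mutex_lock(&p->lock);
    if (p->tail) p->tail->next = j; else p->head = j;
    p->tail = j;
    pthread_cond_signal(&p->has_work);
    pthread_mutex_unlock(&p->lock);
    return 0;
}

/* Lets queued jobs finish, joins the workers, then frees the pool. */
void pool_destroy(pool *p) {
    pthread_mutex_lock(&p->lock);
    p->stopping = 1;
    pthread_cond_broadcast(&p->has_work);
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < p->nthreads; i++) pthread_join(p->threads[i], NULL);
    pthread_mutex_destroy(&p->lock);
    pthread_cond_destroy(&p->has_work);
    free(p->threads);
    free(p);
}

static void mark_done(void *arg) { *(int *)arg = 1; }

/* Demo: runs njobs jobs on at most nthreads threads; returns how many ran. */
int pool_demo(int nthreads, int njobs) {
    int *done = calloc((size_t)njobs, sizeof *done);
    if (!done) return -1;
    pool *p = pool_create(nthreads);
    if (!p) { free(done); return -1; }
    for (int i = 0; i < njobs; i++) pool_submit(p, mark_done, &done[i]);
    pool_destroy(p);
    int count = 0;
    for (int i = 0; i < njobs; i++) count += done[i];
    free(done);
    return count;
}
```

Calling pool_demo(4, 100) runs 100 small jobs on no more than four threads. The trade-off is the one the quoted documentation describes: the thread count is now under your control, but so is all of the creation, queuing and shutdown logic that GCD would otherwise handle.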

Does Erlang always copy messages between processes on the same node?

A faithful implementation of actor message-passing semantics means that message contents are deep-copied from a logical point of view, even for immutable types. Deep-copying of message contents remains a bottleneck for implementations of the actor model, so for performance some implementations support zero-copy message passing (although it's still a deep copy from the programmer's point of view).
Is zero-copy message-passing implemented at all in Erlang? Between nodes it obviously can't be implemented as such, but what about between processes on the same node? This question is related.
I don't think your assertion is correct at all - deep copying of inter-process messages isn't a bottleneck in Erlang, and with the default VM build/settings, this is exactly what all Erlang systems are doing.
Erlang process heaps are completely separate from each other, and the message queue is located in the process heap, so messages must be copied. This is also true for transferring data into and out of ETS tables as their data is stored in a separate allocation area from process heaps.
There are a number of shared data structures, however. Large binaries (>64 bytes long) are generally allocated in a node-wide area and are reference counted. Erlang processes just store references to these binaries. This means that if you create a large binary and send it to another process, you're only sending the reference.
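The difference between copying a message and sharing a reference-counted binary can be modelled in a few lines of C. This is a conceptual sketch only and does not reflect how the Erlang VM is actually implemented: the types and functions (binary, binary_new, binary_send, binary_release) are hypothetical, the 64-byte threshold is taken from the paragraph above, and the reference count is not atomic, so the model is single-threaded.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SHARE_THRESHOLD 64   /* bytes; the figure quoted in the answer */

/* Hypothetical model of a binary with a reference count. */
typedef struct binary {
    int refcount;            /* not atomic: single-threaded model only */
    size_t len;
    unsigned char data[];    /* payload stored inline after the header */
} binary;

binary *binary_new(const void *src, size_t len) {
    assert(src != NULL);
    binary *b = malloc(sizeof *b + len);
    if (!b) return NULL;
    b->refcount = 1;
    b->len = len;
    memcpy(b->data, src, len);
    return b;
}

/* Models "sending" a binary to another process: a large binary is shared
 * by bumping its count, a small one is deep-copied for the receiver. */
binary *binary_send(binary *b) {
    if (b->len > SHARE_THRESHOLD) {
        b->refcount++;
        return b;                         /* receiver gets a reference */
    }
    return binary_new(b->data, b->len);   /* receiver gets its own copy */
}

/* Drops one reference and frees the storage when none remain. */
void binary_release(binary *b) {
    if (--b->refcount == 0) free(b);
}
```

In this model a 128-byte binary that is "sent" ends up with one allocation and a count of 2, while a 16-byte one ends up as two independent allocations, which is the distinction the answer draws.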
Sending data between processes is actually worse in terms of allocation size than you might imagine - sharing inside a term isn't preserved during the copy. This means that if you carefully construct a term with sharing to reduce memory consumption, it will expand to its unshared size in the other process. You can see a practical example in the OTP Efficiency Guide.
As Nikolaus Gradwohl pointed out, there was an experimental hybrid heap mode for the VM which did allow term sharing between processes and enabled zero-copy message passing. It hasn't been a particularly promising experiment as I understand it: it requires extra locking and complicates the existing ability of processes to garbage collect independently. So not only is copying inter-process messages not the usual bottleneck in Erlang systems, allowing zero-copy actually reduced performance.
AFAIK there was/is experimental support for zero-copy message passing in Erlang using the -shared or -hybrid model. I read a blog post in 2009 claiming that it's broken on SMP machines, but I have no idea about the current status.
As has been mentioned here and in other questions, current versions of Erlang basically copy everything except larger binaries. In older, pre-SMP times it was feasible to pass references instead of copying. While this resulted in very fast message passing, it created other problems in the implementation: primarily, it made garbage collection more difficult and complicated the implementation. I think that today passing references and having shared data could result in excessive locking and synchronisation which is, of course, not a Good Thing.
I wrote the accepted answer to that other question you're referencing, and in it I give you a direct pointer to this line of code:
message = copy_struct(message, msize, &hp, &bp->off_heap);
This is in a function called when the Erlang run-time system needs to send a message, and it's not inside any kind of "if" that could cause it to be skipped. So, as far as I can tell, the answer is "yes, it's always copied." (That's not strictly true -- there is an "if", but it seems to be dealing with exceptional cases, not the normal code-flow path.)
(I'm ignoring the hybrid heap option brought up by Nikolaus. It looks like he's right, but since this isn't the way Erlang is normally built and it has its own penalties, I don't see that it's worth considering as a way to answer your concern.)
I don't know why you're considering 10 GByte/sec a bottleneck, though. Nothing short of registers or CPU cache goes faster in the computer, and such memories are small, thus constituting a kind of bottleneck themselves. Besides which, the zero-copy idea you're proposing would require locking in the case of cross-CPU message passing in a multi-core system, which is also a bottleneck. We're already paying the locking penalty once in this function to copy the message into the other process's message queue; why pay it again later when that process gets around to reading the message?
Bottom line, I don't think your ideas of ways to make it go faster would actually help much.
