Is there any specific way to get a list of all threads from pthread_create? - pthreads

On Linux systems, using pthread_create, we can create a thread. But how do we know the count of threads created?
Is there any way to get that from pthread_create? Or any other way which will give us a list of the threads created for a process? I am looking for a code sample in C or C++.

On Linux systems, using pthread_create, we can create a thread. But how do we know the count of threads created?
If you need that information, you keep track manually.
Is there any way to get that from pthread_create?
No.
Or any other way which will give us a list of threads created for a process?
Pthreads provides no API for extracting a list of thread IDs for live threads belonging to the current process. A pthreads program is responsible for tracking its own threads.
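For example, here is one way to "keep track manually": record each pthread_t as you create it. This is a minimal sketch under my own assumptions; the names (tracked_create, MAX_THREADS) are illustrative, not from any library.

#include <pthread.h>

#define MAX_THREADS 64

static pthread_t threads[MAX_THREADS];
static int thread_count = 0;
static pthread_mutex_t threads_lock = PTHREAD_MUTEX_INITIALIZER;

int tracked_create(void *(*start)(void *), void *arg)
{
    pthread_mutex_lock(&threads_lock);
    if (thread_count == MAX_THREADS) {
        pthread_mutex_unlock(&threads_lock);
        return -1;                        /* table full */
    }
    int rc = pthread_create(&threads[thread_count], NULL, start, arg);
    if (rc == 0)
        thread_count++;                   /* remember the new thread */
    pthread_mutex_unlock(&threads_lock);
    return rc;
}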
There are various ways to get information about a process's threads, such as ps H and the system calls on which it relies, but there's not much actionable information to be had there. You can get a count, at least:
#include <stdio.h>
#include <unistd.h>

char command[50];
int thread_count = 0;
/* ask ps for one line per thread of this process, then count the lines */
snprintf(command, sizeof(command), "ps H --no-headers %d | wc -l", (int) getpid());
FILE *thread_counter = popen(command, "r");
if (thread_counter != NULL) {
    fscanf(thread_counter, "%d", &thread_count);
    pclose(thread_counter);
}
printf("%d\n", thread_count);

Related

Where are NVMe commands located inside the PCIe BAR?

According to the NVMe specification, the BAR has tail and head fields for each queue. For example:
Submission Queue y Tail Doorbell (SQyTDBL):
Start: 1000h + (2y * (4 << CAP.DSTRD))
End: 1003h + (2y * (4 << CAP.DSTRD))
Submission Queue y Head Doorbell (SQyHDBL):
Start: 1000h + ((2y + 1) * (4 << CAP.DSTRD))
End: 1003h + ((2y + 1) * (4 << CAP.DSTRD))
Are these the queues themselves or just mere pointers? Is this correct? If it is the queue, I would assume the DSTRD indicates the maximum length of all queues.
Moreover, the specification talks about two optional regions: Host Memory Buffer (HMB) and Controller Memory Buffer (CMB).
HMB: a region within the host's DRAM (PCIe root)
CMB: a region within the NVMe controller's DRAM (inside the SSD)
If both are optional, where are the queues located then? Since an endpoint PCIe device only works with BARs and PCI headers, I don't see any other place they might be located, other than a BAR.
Sorry, but I am doing this from memory. I have implemented an FPGA NVMe host, so hopefully my memory will be enough to answer your questions and more; if I get something wrong, at least you know why. I'll be providing reference sections from the specification, which you can find here: https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf. Also, as a note before I really answer your question, I want to clarify some confusion: understanding the spec takes some time. I honestly recommend reading it bottom to top; the last few sections help give context for the first few, as strange as that sounds.
These registers are the submission and completion queue doorbells, specifically the submission queue tail and the completion queue head respectively (Section 3.1). More on this later; I just wanted to correct the misconception that you, as the host, access the submission queue head. You do not; only the controller (traditionally the drive) does. A simple reminder: submission is you asking the drive to do something, completion is the drive telling you how it went. Read Section 7.2 for more info.
Before you can send anything to these queues you must first set up said queues. At baseline these queues do not exist in the system; you must use the admin queue to set them up.
28h-2Fh ASQ Admin Submission Queue Base Address
30h-37h ACQ Admin Completion Queue Base Address
Your statement about DSTRD is a big misunderstanding. This field is from the capabilities register (offset 0x0, Figure 3.1.1). It is the controller (drive) telling you the "doorbell stride", i.e. how many bytes sit between consecutive doorbell registers. I've never seen a drive report anything but 0 for this value, since why would you want to leave dead space between doorbell registers?
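To make the stride concrete, here is a small sketch (my illustration, not from the spec) that just evaluates the formulas quoted in the question. Note the (2y+1) register is the completion queue head doorbell (CQyHDBL), per the correction above.

#include <stdint.h>

/* Doorbell offsets per the formulas quoted in the question.
 * y is the queue ID; dstrd is CAP.DSTRD (usually 0). */
static inline uint64_t sq_tail_doorbell(unsigned y, unsigned dstrd)
{
    return 0x1000u + (2u * y) * (4u << dstrd);       /* SQyTDBL */
}

static inline uint64_t cq_head_doorbell(unsigned y, unsigned dstrd)
{
    return 0x1000u + (2u * y + 1u) * (4u << dstrd);  /* CQyHDBL */
}

/* With dstrd = 0: queue 0 -> 0x1000/0x1004, queue 1 -> 0x1008/0x100C. */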
Please be careful with the size of your writes: in my experience most NVMe drives require you to send writes of at least 2 dwords (8 bytes) even if you only intend to send 1 dword of data. Just a note.
On to actually helping you use this thing as a host: please reference Section 7.6.1 to find the initialization sequence. Notice how you must set up multiple registers, read certain parameters, and other such things.
Assuming you or someone else has done initialization, let me now answer the core of your question: how to use these queues. The thing is, this answer spans MANY sections of the spec and is the core of it. So I am going to break it down as best I can for a simple write command. Please note you CANNOT write until you have first created the queues using the admin queues, which use different opcodes from a different section of the spec; sorry, I cannot write all of this out.
STEPS TO WRITING DATA TO AN NVMe DRIVE.
In the creation of the submission queue you will specify the size of this specific queue. This is the number of commands that can be placed in the queue at one time for processing. Along with this you will specify the queue base address. So for this example let's assume you set the base address to 0x1000_0000 and the size to 16 (0x10). Figure 105 lets us know that every submission queue entry has a size of 64 bytes (0x40), so queue entry 0 is at 0x1000_0000, entry 1 at 0x1000_0040, entry 2 at 0x1000_0080, and so on for our 16 entries; then it wraps around.
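As a sanity check, that entry addressing can be written down as a trivial sketch (mine, assuming the 64-byte entries of Figure 105):

#include <stdint.h>

/* Address of submission queue entry i for a queue of qsize entries
 * based at sq_base; entries are 64 bytes each and the index wraps. */
static inline uint64_t sqe_addr(uint64_t sq_base, unsigned i, unsigned qsize)
{
    return sq_base + 64ull * (i % qsize);
}

/* sqe_addr(0x10000000, 2, 16) == 0x10000080, matching the example. */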
You will first store the data to be written; let's say you were given 512 bytes (0x200) of data to write. So for simplicity you place that data at 0x2000_0000 - 0x2000_0200.
You create the submission queue command. This is not a simple process. I'm not going to document all of this for you, but understand that you should be referencing Figure 104, Figure 346, and Section 6.15. That is not enough, however. You will also need to understand PRP vs SGL and which you are using (PRP is easier to start with), and NLB (number of logical blocks), which determines your write size. With NVMe you do not specify writes in bytes but in terms of logical blocks, whose size is specified by the controller (drive); it may implement multiple block sizes, but this is up to the drive, not you as the host. You just get to pick from what it supports (Section 5.15.2.1, Figure 245). You want to look at Identify Namespace to tell you the LBA (logical block address) size; this will lead you down a rabbit hole to determine the actual size, but that's OK, the info is there.
OK, so you finished this mess and have created the submission command. Let's assume 2 commands have already been submitted and completed on this queue (at start this will be 0; I'm picking 2 just to make the example clearer). What you now need to do is place this command at 0x1000_0080 (entry 2).
Now let's assume this is queue 1 (in the equations you posted, the queue number is the y value; note that queue 0 is the admin queue). What you need to do is poke the controller's submission queue tail doorbell with the new tail value, i.e. the index of the next free entry (thus you can queue multiple commands up at once and only tell the drive when you are ready). In this case the new tail is 3, since entries 0 through 2 are now occupied. So you need to write the value 3 to register 0x1008.
At this point the drive will go: aha, the host has told me there are new commands to fetch. So the controller will go to the queue base address + command size * 2 and fetch 64 bytes of data, aka 1 command (address 0x1000_0080). The controller will decode this command as a write, which means the controller (drive) must read data from some host address and store it where it was told to. This means your write command should tell the drive to go to address 0x2000_0000 and read 512 bytes of data, and it will, if you scope the PCIe bus. At this point the drive will fill out a completion queue entry (16 bytes, specified in Section 4.6) and place it in the completion queue address you specified at queue creation (plus 0x20, since this lands in completion entry index 2). Then the controller will generate an MSI-X interrupt.
At this point you must go to wherever the completion queue was placed and read the response to check status; also, if you queued multiple submissions, check the SQID and CID fields to see what finished, since jobs can finish out of order. You then must write to the completion queue head doorbell (0x100C) to indicate that you have retrieved the completion entry (success or failure). Notice that you never interact with the submission queue head (that's up to the controller, since only it knows when a submission queue entry was processed), and only the controller moves the completion queue tail, since only it can create new entries.
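Tying the example's numbers together, a hypothetical MMIO sequence might look like the sketch below. This is my illustration, not from any real driver; it assumes DSTRD = 0, queue 1, and the command placed in entry 2, as above.

#include <stdint.h>

/* bar0 points at the memory-mapped NVMe register BAR. */
void example_write_flow(volatile uint32_t *bar0)
{
    /* 1. The 64-byte command has been placed at entry 2 (0x1000_0080). */

    /* 2. Ring the SQ1 tail doorbell at 0x1008: the new tail is 3. */
    bar0[0x1008 / 4] = 3;

    /* 3. Wait for the MSI-X interrupt, then read the 16-byte completion
       entry at cq_base + 0x20 and check its status field. */

    /* 4. Ring the CQ1 head doorbell at 0x100C: we consumed up to
       completion entry index 2, so the new head is 3. */
    bar0[0x100C / 4] = 3;
}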
I'm sorry this is so long and not well formatted, but hopefully you now have a slightly better understanding of NVMe; it's a bit of a mess at first, but once you get it, it all makes sense. Just remember my example assumed you had created a queue, which at baseline doesn't exist. First you need to set up the admin submission and completion queues (0x28 and 0x30), which have queue ID 0, so their tail and head doorbells are at 0x1000 and 0x1004 respectively. You then must reference Section 5 to find the opcodes to make stuff happen, but I have faith you can figure it out from what I've given you. If you have any more questions, put a comment down and I'll see what I can do.

Why can't a load bypass a value written by another thread on the same core from a write buffer?

If a CPU core uses a write buffer, then a load can bypass the most recent store to the referenced location from the write buffer, without waiting until that store appears in the cache. But, as it's written in A Primer on Memory Consistency and Coherence, if the CPU honors the TSO memory model, then
... multithreading introduces a subtle write buffer issue for TSO. TSO
write buffers are logically private to each thread context (virtual
core). Thus, on a multithreaded core, one thread context should never
bypass from the write buffer of another thread context. This logical
separation can be implemented with per-thread-context write buffers
or, more commonly, by using a shared write buffer with entries tagged
by thread-context identifiers that permit bypassing only when tags
match.
I can't grasp the necessity of this limitation. Could you please give me an example when allowing some thread to bypass a write buffer entry written by another thread on the same core leads to the violation of the TSO memory model?
The classic example of how TSO differs from sequential consistency (SC) is:
(This is example 2.4 here - http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf)
thread 0 | thread 1
---------------------------------
write 1-->[x] | write 1-->[y]
a = read [x] | b = read [y]
c = read [y] | d = read [x]
Both addresses hold 0 initially. The question is: would c=d=0 be a valid outcome? We know a and b must forward the stores before them, since they match the addresses of the local stores, and will probably be forwarded from the local thread's store buffer. However, c and d may not be forwarded across contexts, so they may still show the old value.
The interesting gotcha here is that each thread observes both stores and forwards the local one, so an outcome of a=1,c=0 would mean that t0 saw the store to [x] occurring first, while an outcome of b=1,d=0 would mean that t1 saw the store to [y] occurring first. The fact that both are possible at once, due to store buffer forwarding, would break sequential consistency, as SC requires that all contexts agree on the same global order of stores. Instead, x86 settled for a weaker TSO model that allows this case.
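Not part of the original answer, but here is a minimal C11 harness for that litmus test. On an x86 machine, relaxed atomics compile down to plain loads and stores, so after enough trials you can observe the a=b=1, c=d=0 outcome produced by store-buffer forwarding.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;          /* the two shared locations */
int a, b, c, d;           /* per-trial observations */

void *t0(void *arg)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed); /* write 1->[x] */
    a = atomic_load_explicit(&x, memory_order_relaxed); /* a = read [x] */
    c = atomic_load_explicit(&y, memory_order_relaxed); /* c = read [y] */
    return NULL;
}

void *t1(void *arg)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed); /* write 1->[y] */
    b = atomic_load_explicit(&y, memory_order_relaxed); /* b = read [y] */
    d = atomic_load_explicit(&x, memory_order_relaxed); /* d = read [x] */
    return NULL;
}

int main(void)
{
    for (int i = 0; i < 1000000; ++i) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t p0, p1;
        pthread_create(&p0, NULL, t0, NULL);
        pthread_create(&p1, NULL, t1, NULL);
        pthread_join(p0, NULL);
        pthread_join(p1, NULL);
        if (a == 1 && b == 1 && c == 0 && d == 0)
            printf("observed c=d=0 on trial %d\n", i);
    }
    return 0;
}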
Forwarding stores globally is practically impossible, since buffered stores are not necessarily committed, which means they may even be on the wrong path of a branch misprediction. Forwarding locally is fine, since a pipeline flush would also eliminate all the loads that forwarded from those stores, but across contexts you don't have that.
I've also seen work that tries to buffer stores globally outside of the core, but this is not very practical due to latency and bandwidth. For further reading, here's a recent paper that may be relevant - http://ieeexplore.ieee.org/abstract/document/7783736/

pthread use condition variable to start a few threads "at once"

I've just started playing around with POSIX pthreads (in C++).
I'm trying to use a condition variable to start many threads at once.
Does someone know a better way to do this, or can give an example of how one would?
If you have ruled out pthread_cond_broadcast and are trying to do this, you probably have already created the threads and might be looking for a way to release them all at once. If that is the case, you may want to use a barrier.
You can initialize a barrier with pthread_barrier_init, which takes a parameter for the number of threads you want to wait on. When the specified number of threads have hit a pthread_barrier_wait statement, all the waiting threads are released at once (i.e. marked ready to run), though of course they remain subject to the whims of the scheduler as to which may or may not immediately get processor time.
A very simple sketch
pthread_barrier_t bar;
pthread_t tid[4];

void* tfunc(void *arg)
{
    pthread_barrier_wait(&bar);  // block until all 4 threads arrive
    // do stuff
    return NULL;
}

// in the launching code:
pthread_barrier_init(&bar, NULL, 4);
for (int i = 0; i < 4; ++i)
    pthread_create(&tid[i], NULL, tfunc, NULL);
When the 4th thread hits the wait, all the waiting threads will continue.

What happens to memory after I call MPI_Init()?

Suppose I have some code that looks like this:
#include "mpi.h"
int main( int argc, char** argv )
{
int my_array[10];
//fill the array with some data
MPI_Init(&argc, &argv);
// Some code here
MPI_Finalize();
return 0;
}
Will each MPI instance get its own copy of my_array? Only rank 0? None of them? Is it bad practice to have any code before MPI_Init at all?
The short answer to "what happens to memory when I call MPI_Init" is: nothing.
MPI_Init initializes the MPI library in the calling process. Nothing more, nothing less. At the time of the MPI_Init call, all the MPI processes already exist, they just don't know about each other yet and can't communicate.
Each MPI process is a separately executing program. The processes do not share memory, and communicate by passing messages.
Indeed, the processes calling MPI_Init can even be different programs entirely, as long as the messages they pass around match. This is the MPMD model.
When you run MPI code, you are running the same code in different processes (they cannot share memory), so each process will have its own array.
The arrays should be equal, unless your data depends on time (the processes are not necessarily synchronized), on the process rank (I think the rank is only available after the init call), or on any random number generator (some may generate random seeds as well).
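To illustrate, here is a minimal sketch (mine, not from the answers above) showing that each rank keeps its own copy: every process fills my_array before MPI_Init and can modify it independently afterwards.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int my_array[10];
    for (int i = 0; i < 10; ++i)
        my_array[i] = i;                 /* runs once in every process */

    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    my_array[0] += rank;                 /* only this rank's copy changes */
    printf("rank %d: my_array[0] = %d\n", rank, my_array[0]);

    MPI_Finalize();
    return 0;
}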

Is multiple-producer, single-consumer possible in a lockfree setting?

I have a bunch of threads that are doing lots of communication with each other.
I would prefer this be lock free.
For each thread, I want to have a mailbox, where other threads can send it messages (but only the owner can remove messages). This is a multiple-producer single-consumer situation. Is it possible for me to do this in a lockfree / high-performance manner? (This is in the inner loop of a gigantic simulation.)
Lock-free Multiple Producer Single Consumer (MPSC) Queue is one of the easiest lock-free algorithms to implement.
The most basic implementation requires a simple lock-free singly-linked list (SList) with only push() and flush(). The functions are available in the Windows API as InterlockedFlushSList() and InterlockedPushEntrySList() but these are very easy to roll on your own.
Multiple producers push() items onto the SList using a CAS (interlocked compare-and-swap).
The single consumer does a flush(), which swaps the head of the SList with a NULL using an XCHG (interlocked exchange). The consumer then has a list of items in reverse order.
To process the items in order, simply reverse the list returned from flush() before processing it. If you do not care about order, you can walk the list immediately to process it.
Two notes if you roll your own functions:
1) If you are on a system with weak memory ordering (e.g. PowerPC), you need to put a "release" memory barrier at the beginning of the push() function and an "acquire" memory barrier at the end of the flush() function.
2) You can make the functions considerably simpler and more optimized, because the ABA issue with SLists occurs during the pop() function. You cannot have ABA issues with an SList if you use only push() and flush(). This means you can implement it as a single pointer, very similar to the non-lockfree code, and there is no need for an ABA-prevention sequence counter.
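For reference, here is a minimal C11 sketch of that push()/flush() scheme (my illustration, not the Windows API), including the reversal step and the acquire/release pairing from note 1:

#include <stdatomic.h>
#include <stddef.h>

struct node {
    struct node *next;
    void *payload;
};

static _Atomic(struct node *) head;       /* the SList head, initially NULL */

/* Any producer thread: CAS the new node onto the head. */
void push(struct node *n)
{
    struct node *old = atomic_load_explicit(&head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &head, &old, n,
                 memory_order_release,    /* publish n and its payload */
                 memory_order_relaxed));
}

/* The single consumer: swap the head with NULL, then reverse the
 * returned chain to recover push order. */
struct node *flush(void)
{
    struct node *list =
        atomic_exchange_explicit(&head, NULL, memory_order_acquire);
    struct node *rev = NULL;
    while (list) {                        /* reverse in place */
        struct node *next = list->next;
        list->next = rev;
        rev = list;
        list = next;
    }
    return rev;                           /* oldest item first */
}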
Sure, if you have an atomic CompareAndSwap instruction:
// Assumes CompareAndSwap(ptr, new_value, expected) atomically writes
// new_value if *ptr == expected and returns the previous value of *ptr.
for (i = 0; ; i = (i + 1) % MAILBOX_SIZE)
{
    if ((mailbox[i].owned == false) &&
        (CompareAndSwap(&mailbox[i].owned, true, false) == false))
        break;  // this thread has atomically claimed slot i
}
mailbox[i].message = message;
mailbox[i].ready = true;
After reading a message, the consuming thread just sets mailbox[i].ready = false; mailbox[i].owned = false; (in that order).
Here's a paper from the University of Rochester describing a non-blocking concurrent queue; the algorithm in the paper illustrates one technique for building a lockless queue.
You may want to look at Intel Threading Building Blocks; I recall attending a lecture by an Intel developer that mentioned something along those lines.
