POSIX / Threads with join - pthreads

I'm reading a book, which gives the following code:
#include <pthread.h>
#include <stdio.h>

void *printme(void *id) {
    int *i;
    i = (int *)id;
    printf("Hi. I'm thread %d\n", *i);
    return NULL;
}

int main(void) {
    int i, vals[4];
    pthread_t tids[4];
    void *retval;

    for (i = 0; i < 4; i++) {
        vals[i] = i;
        pthread_create(tids + i, NULL, printme, vals + i);
    }
    for (i = 0; i < 4; i++) {
        printf("Trying to join with tid%d\n", i);
        pthread_join(tids[i], &retval);
        printf("Joined with tid%d\n", i);
    }
    return 0;
}
and the following possible output:
Trying to join with tid0
Hi. I'm thread 0
Hi. I'm thread 1
Hi. I'm thread 2
Hi. I'm thread 3
Joined with tid0
Trying to join with tid1
Joined with tid1
Trying to join with tid2
Joined with tid2
Trying to join with tid3
Joined with tid3
And I don't understand how this is possible. We start with the main thread and create 4 threads: tids[0]... tids[3]. Then we suspend execution (with the join calls): the main thread would wait for tids[0] to stop executing, tids[0] would wait for tids[1], and so on.
So the output should be:
Hi. I'm thread 0
Hi. I'm thread 1
Hi. I'm thread 2
Hi. I'm thread 3
Trying to join with tid0
Trying to join with tid1
Joined with tid0
Trying to join with tid2
Joined with tid1
Trying to join with tid3
Joined with tid2
Joined with tid3
I feel that I don't understand something really basic. Thanks.

I think what you're missing is that pthread_create is very different from fork. The created thread starts at the supplied function (printme, in this case) and exits as soon as that function returns. Hence, none of the newly created threads ever reaches the second for loop.
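To make that concrete, here is a minimal sketch (hypothetical code, not from the book): the created thread lives entirely inside its start function, and only the main thread ever executes the code after pthread_create.

#include <pthread.h>
#include <stdio.h>

void *child_fn(void *arg)
{
    printf("child: I start and end inside child_fn\n");
    return NULL; /* the thread exits here; it never returns into main */
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, child_fn, NULL);
    pthread_join(tid, NULL);
    /* with fork(), the child would also execute this line;
       with pthread_create, only the main thread reaches it */
    printf("main: only I run the code after pthread_create\n");
    return 0;
}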

When you create a new thread with pthread_create, both thread #1 and main run in parallel. Main goes on to the next instruction, which is pthread_join, and hangs there until thread #1 finishes. This is why you see Trying to join with tid0 first, and only then Hi. I'm thread 0.
Please also notice that the main thread will join the child threads in the specified order. This means that if you have thread #1, thread #2, and thread #3, where thread 1 takes 10 seconds to execute, thread 2 takes 6 seconds, and thread 3 takes 7 seconds, then the first join completes after 10 seconds, and the next joins should follow within a few milliseconds, since all the other threads have already finished their jobs.
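A minimal sketch of that timing argument (the 10/6/7-second durations are the hypothetical ones from above, simulated with sleep):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* each worker just sleeps for the number of seconds passed in */
void *work(void *arg)
{
    int secs = *(int *)arg;
    sleep(secs);
    printf("worker slept %d s\n", secs);
    return NULL;
}

int main(void)
{
    int secs[3] = {10, 6, 7};
    pthread_t tids[3];
    for (int i = 0; i < 3; i++)
        pthread_create(&tids[i], NULL, work, &secs[i]);
    /* joins happen in index order: the first join blocks about 10 s;
       by then the other two workers are already done, so the
       remaining joins return almost immediately */
    for (int i = 0; i < 3; i++) {
        pthread_join(tids[i], NULL);
        printf("joined thread %d\n", i);
    }
    return 0;
}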


Rx SerialDispatchQueueScheduler doesn't seem to make the code run in serial sequence

I have a problem with an Observable<Data?> function that is called so many times, and so fast, that one call doesn't complete before the next one runs. This makes sense and is fine in most cases. But in this case it becomes very problematic, because the function in question uses a counter.
func sendMessage(input: MessageToSend) -> Observable<Data?> {
    input.counter = self.counter
    print("-- 1", input.counter)
    let transformMessage = transform(message: input)
    self.counter += 1
    print("-- 2", input.counter)
    return transformMessage
}
Obviously I need input.counter to increase by 1 every time the function is called. But unfortunately that isn't what happens. Because of the async nature of Rx, sendMessage() runs and assigns 0 to input.counter, but before it has had the chance to increment self.counter by 1, sendMessage() runs again and again assigns 0 to input.counter. Then the first call reaches self.counter += 1, and immediately after, the second call reaches self.counter += 1 as well. So now the counter has reached 2, and when the third call is made, input.counter gets 2 as its value.
The prints will look like this:
-- 1 0
-- 1 0
-- 2 0
-- 2 0
-- 1 2
-- 2 2
My initial idea on how to fix this was to force the call to a serial task. So instead of calling it like this:
disposeble = tsiHandler?.sendMessage(message: message).subscribe()
I called it like this:
disposeble = tsiHandler?.sendMessage(message: message)
    .subscribeOn(SerialDispatchQueueScheduler(internalSerialQueueName: "serial"))
    .subscribe()
In my world this would force it to become serial, which would make all the calls to sendMessage() wait for other calls to complete. But for some reason it doesn't work; I get the exact same result in the prints. Am I misunderstanding how SerialDispatchQueueScheduler works?
As it is written, each call to sendMessage is being placed on a different scheduler, so even if the schedulers are serial, that wouldn't accomplish what you are trying to do. Instead of creating a new scheduler for each call, have them all use the same scheduler.
Even so, this code looks fishy, and your comments about it imply a fundamental misunderstanding of Rx. For example, Rx is not inherently async; it merely sets up callback chains...
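A minimal sketch of that suggestion, reusing the question's names (disposeble, tsiHandler, message): create the scheduler once, for example as a property, and hand the same instance to every subscription.

// One scheduler instance, created once and shared by all calls,
// instead of a fresh SerialDispatchQueueScheduler per call:
let sharedScheduler = SerialDispatchQueueScheduler(internalSerialQueueName: "serial")

disposeble = tsiHandler?.sendMessage(message: message)
    .subscribeOn(sharedScheduler) // same serial scheduler every time
    .subscribe()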

Thread index as a memory location index in CUDA

By definition, a thread is a path of execution within a process.
But during the implementation of a kernel, a thread_id or global_index is generated to access an allocated memory location. For instance, in the matrix multiplication code below, ROW and COL are generated to access matrices A and B sequentially.
My doubt here is: the generated index isn't pointing to a thread (by definition); instead, it is used to access the location of the data in memory. So why do we refer to it as a thread index or global thread index, and not a memory index or something else?
__global__ void matrixMultiplicationKernel(float* A, float* B, float* C, int N) {
    int ROW = blockIdx.y * blockDim.y + threadIdx.y;
    int COL = blockIdx.x * blockDim.x + threadIdx.x;
    float tmpSum = 0;
    if (ROW < N && COL < N) {
        // each thread computes one element of the block sub-matrix
        for (int i = 0; i < N; i++) {
            tmpSum += A[ROW * N + i] * B[i * N + COL];
        }
        C[ROW * N + COL] = tmpSum; // write only when inside the matrix bounds
    }
}
This question seems to be mostly about semantics, so let's start at Wikipedia:
.... a thread of execution is the smallest sequence of programmed
instructions that can be managed independently by a scheduler ....
That pretty much describes exactly what a thread in CUDA is: the kernel is the sequence of instructions, and the scheduler is the warp/thread scheduler in each streaming multiprocessor on the GPU.
The code in your question is calculating the unique ID of the thread within the kernel launch, as it is abstracted in the CUDA programming/execution model. It has no intrinsic relationship to memory layouts, only to the unique ID in the kernel launch. The fact that it is being used to ensure that each parallel operation is performed on a different memory location is a programming technique and nothing more.
Thread ID seems like a logical moniker to me, but to paraphrase Miles Davis when he was asked what the name of the jam his band just played at the Isle of Wight festival in 1970: "call it whatever you want".
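A hypothetical kernel (not from the question) that makes the distinction visible: the very same ID arithmetic used purely as a thread identifier, with no memory access at all.

#include <cstdio>

/* the computed index identifies the thread, not a buffer:
   here it is only printed, never used to address memory */
__global__ void whoAmI()
{
    int id = blockIdx.x * blockDim.x + threadIdx.x; // unique thread ID
    printf("I am thread %d of this launch\n", id);
}

int main()
{
    whoAmI<<<2, 4>>>(); // 8 threads, no data, no memory indexing
    cudaDeviceSynchronize();
    return 0;
}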

GCD concurrent queue not starting tasks in FIFO order [duplicate]

This question already has answers here: iOS GCD custom concurrent queue execution sequence (2 answers). Closed 5 years ago.
I have a class which contains two methods as per the example in Mastering Swift by Jon Hoffman. The class is as below:
class DoCalculation {
    func doCalc() {
        let x = 100
        let y = x * x
        _ = y / x
    }

    func performCalculation(_ iterations: Int, tag: String) {
        let start = CFAbsoluteTimeGetCurrent()
        for _ in 0..<iterations {
            self.doCalc()
        }
        let end = CFAbsoluteTimeGetCurrent()
        print("time for \(tag): \(end - start)")
    }
}
Now in the viewDidLoad() of the ViewController from the single view template, I create an instance of the above class and then create a concurrent queue. I then add blocks executing the performCalculation(_:tag:) method to the queue.
cqueue.async {
    print("Starting async1")
    calculation.performCalculation(10000000, tag: "async1")
}
cqueue.async {
    print("Starting async2")
    calculation.performCalculation(1000, tag: "async2")
}
cqueue.async {
    print("Starting async3")
    calculation.performCalculation(100000, tag: "async3")
}
Every time I run the application on the simulator, I get random output for the start statements. Example outputs are below:
Example 1:
Starting async1
Starting async3
Starting async2
time for async2: 4.1961669921875e-05
time for async3: 0.00238299369812012
time for async1: 0.117094993591309
Example 2:
Starting async3
Starting async2
Starting async1
time for async2: 2.80141830444336e-05
time for async3: 0.00216799974441528
time for async1: 0.114436984062195
Example 3:
Starting async1
Starting async3
Starting async2
time for async2: 1.60336494445801e-05
time for async3: 0.00220298767089844
time for async1: 0.129496037960052
I don't understand why the blocks don't start in FIFO order. Can somebody please explain what I am missing here?
I know they will be executed concurrently, but it's stated that a concurrent queue will respect FIFO for starting the execution of tasks, while not guaranteeing which one completes first. So at least the start statements should have been
Starting async1
Starting async2
Starting async3
with the completion statements in random order:
time for async2: 4.1961669921875e-05
time for async3: 0.00238299369812012
time for async1: 0.117094993591309
A concurrent queue runs the jobs you submit to it concurrently. That's what it's for.
If you want a queue that runs jobs in FIFO order, you want a serial queue.
I see what you're saying about the docs claiming that the jobs will be submitted in FIFO order, but your test doesn't really establish the order in which they're run. If the concurrent queue has 2 threads available but only one processor to run those threads on, it might swap out one of the threads before it gets a chance to print, run the other job for a while, and then go back to running the first job. There's no guarantee that a job runs to the end before getting swapped out.
I don't think a print statement gives you reliable information about the order in which the jobs are started.
cqueue is a concurrent queue, which dispatches your blocks of work to three different threads (it actually depends on thread availability) at almost the same time, but you cannot control the time at which each thread completes its work.
If you want to perform tasks serially on a background queue, you are much better off using a serial queue.
let serialQueue = DispatchQueue(label: "serialQueue")
A serial queue will start the next task in the queue only when the previous task has completed.
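A minimal sketch of that behavior, using the serialQueue declared above (the output comment is what GCD guarantees for a serial queue):

serialQueue.async { print("Starting async1") } // runs to completion first
serialQueue.async { print("Starting async2") } // starts only after async1 finishes
serialQueue.async { print("Starting async3") } // starts only after async2 finishes
// Output is always: Starting async1, Starting async2, Starting async3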
"I don't understand why the blocks don't start in FIFO order" How do you know they don't? They do start in FIFO order!
The problem is that you have no way to test that. The notion of testing it is, in fact, incoherent. The soonest you can test anything is the first line of each block — and by that time, it is perfectly legal for another line of code from another block to execute, because these blocks are asynchronous. That is what asynchronous means.
So, they start in FIFO order, but there is no guarantee about the order in which, given multiple asynchronous blocks, their first lines will be executed.
With a concurrent queue, you are effectively specifying that they can run at the same time. So while they're added in FIFO manner, you have a race condition between these various worker threads, and thus you have no assurance which will hit its respective print statement first.
So, this raises the question: why do you care which order they hit their respective print statements? If order is really important, you shouldn't be using a concurrent queue. Or, to put it the other way: if you want to use a concurrent queue, write code that isn't dependent upon the order in which the blocks run.
You asked:
Would you suggest some way to get the info when a Task is dequeued from the queue so that I can log it to get the FIFO order.
If you're asking how to enjoy FIFO starting of the tasks on concurrent queue in real-world app, the answer is "you don't", because of the aforementioned race condition. When using concurrent queues, never write code that is strictly dependent upon the FIFO behavior.
If you're asking how to verify this empirically for purely theoretical purposes, just do something that ties up the CPUs and frees them up one by one:
import Foundation
import QuartzCore
import os.log

// utility function to spin for a certain amount of time
func spin(for seconds: TimeInterval, message: String) {
    let start = CACurrentMediaTime()
    while CACurrentMediaTime() - start < seconds { }
    os_log("%@", message)
}

// my concurrent queue
let queue = DispatchQueue(label: "queue", attributes: .concurrent)

// just something to occupy the CPUs, with varying
// lengths of time; don't worry about these re FIFO behavior
for i in 0 ..< 20 {
    queue.async {
        spin(for: 2 + Double(i) / 2, message: "\(i)")
    }
}

// Now, add three tasks to the concurrent queue, demonstrating FIFO
queue.async {
    os_log(" 1 start")
    spin(for: 2, message: " 1 stop")
}
queue.async {
    os_log(" 2 start")
    spin(for: 2, message: " 2 stop")
}
queue.async {
    os_log(" 3 start")
    spin(for: 2, message: " 3 stop")
}
You'll be able to see those last three tasks are run in FIFO order.
The other approach, if you want to confirm precisely what GCD is doing, is to refer to the libdispatch source code. It's admittedly pretty dense code, so it's not exactly obvious, but it's something you can dig into if you're feeling ambitious.

How can I use foreach and fork together to do something in parallel?

This question is not UVM specific but the example that I am working on is UVM related.
I have an array of agents in my UVM environment and I would like to launch a sequence on all of them in parallel.
If I do the below:
foreach (env.agt[i])
begin
    seq.start(env.agt[i].sqr);
end
then the sequence seq first executes on env.agt[0].sqr. Once that finishes, it executes on env.agt[1].sqr, and so on.
I want to implement a foreach-fork construct so that seq executes in parallel on all agt[i] sequencers.
No matter how I order the fork-join and the foreach, I am not able to achieve that. Can you please help me get that parallel sequence-launching behavior?
Thanks.
Update to clarify the problem I am trying to solve:
The outcome of the code constructs below is exactly the same as above without the fork-join.
foreach (env.agt[i])
    fork
        seq.start(env.agt[i].sqr);
    join

fork
    foreach (env.agt[i])
        seq.start(env.agt[i].sqr);
join

// As per the example in § 9.3.2 of the IEEE SystemVerilog 2012 standard
for (int i = 0; i < `CONST; ++i)
begin
    fork
        automatic int var_i = i;
        seq.start(env.agt[var_i].sqr);
    join
end
The issue is that each thread of the fork points to the same static variable i. Each thread needs its own unique copy, and this can be achieved with the automatic keyword.
foreach (env.agt[i])
begin
    automatic int var_i = i;
    fork
        seq.start(env.agt[var_i].sqr);
    join_none // non-blocking, allow the next iteration to start
end
wait fork; // wait for all forked threads in current scope to end
IEEE Std 1800-2012 § 6.21 "Scope and lifetime" gives examples of the uses of static and automatic. Also check out § 9.3.2 "Parallel blocks"; the last example demonstrates parallel threads in a for-loop.
Use join_none to create new threads; see § 9.3.2 "Parallel blocks", Table 9-1 "fork-join control options".
Use the wait fork statement to wait for all threads in the current scope to complete; see § 9.6.1 "Wait fork statement".
Example:
byte a[4];
initial begin
    foreach (a[i]) begin
        automatic int j = i;
        fork
            begin
                a[j] = j;
                #($urandom_range(3,1));
                $display("%t :: a[i:%0d]:%h a[j:%0d]:%h",
                         $time, i, a[i], j, a[j]);
            end
        join_none // non-blocking thread
    end
    wait fork; // wait for all forked threads in current scope to end
    $finish;
end
Outputs:
2 :: a[i:4]:00 a[j:3]:03
2 :: a[i:4]:00 a[j:0]:00
3 :: a[i:4]:00 a[j:2]:02
3 :: a[i:4]:00 a[j:1]:01
I think that the more "UVM" way to approach this is with a virtual sequence. Assuming you already have a virtual sequencer that instantiates an array of agent sequencers, the body of your virtual sequence would look something like this:
fork
    begin : isolation_thread
        foreach (p_sequencer.agent_sqr[i]) begin
            automatic int j = i;
            fork
                begin
                    `uvm_do_on(seq, p_sequencer.agent_sqr[j]);
                end
            join_none
        end
        wait fork;
    end : isolation_thread
join
This has worked for me in the past.
Greg's solution helped me derive the solution to my UVM-based problem. Here is my solution:
The fork-join block below resides in the main_phase task of a test case class. The wait fork; statement waits for all the fork statements in its scope (the foreach_fork begin-end block) to finish before proceeding further. An important thing to note is that the wrapping fork-join around the begin-end block was required to set the scope of wait fork; to the foreach_fork block.
fork
    begin : foreach_fork
        seq_class seq [`CONST];
        foreach (env.agt[i])
        begin
            int j = i;
            seq[j] = seq_class::type_id::create(
                .name($sformatf("seq_%0d", j)), .contxt(get_full_name()));
            fork
                begin
                    seq[j].start(env.agt[j].sqr);
                end
            join_none // non-blocking thread
        end
        wait fork;
    end : foreach_fork
join
An alternative solution that uses the in-sequence objection to delay the end of simulation:
begin
    seq_class seq [`CONST];
    foreach (env.agt[i])
    begin
        int j = i;
        seq[j] = seq_class::type_id::create(
            .name($sformatf("seq_%0d", j)), .contxt(get_full_name()));
        fork
            begin
                seq[j].starting_phase = phase;
                seq[j].start(env.agt[j].sqr);
            end
        join_none // non-blocking thread
    end
end
I realized that I also needed to create a new sequence object for each seq I wanted to run in parallel.
Thanks to Dave for making the point that properties in SystemVerilog classes are automatic by default.
Note for the alternative solution:
As I didn't use wait fork;, I use the UVM objections raised in the sequence itself to hold off the simulation's $finish call. To enable raising the objections in the sequences, I use the seq[j].starting_phase = phase; construct.
try with
int i = 0;
foreach (env.agt)
begin
    seq.start(env.agt[i].sqr);
    i++;
end

pthread: one printf statement gets printed twice in child thread

this is my first pthread program, and I have no idea why the printf statement gets printed twice in the child thread:
#include <pthread.h>
#include <stdio.h>

int x = 1;

void *func(void *p)
{
    x = x + 1;
    printf("tid %ld: x is %d\n", pthread_self(), x);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, func, NULL);
    printf("main thread: %ld\n", pthread_self());
    func(NULL);
}
Observed output on my platform (Linux 3.2.0-32-generic #51-Ubuntu SMP x86_64 GNU/Linux):
1.
main thread: 140144423188224
tid 140144423188224: x is 2
2.
main thread: 140144423188224
tid 140144423188224: x is 3
3.
main thread: 139716926285568
tid 139716926285568: x is 2
tid 139716918028032: x is 3
tid 139716918028032: x is 3
4.
main thread: 139923881056000
tid 139923881056000: x is 3
tid 139923872798464tid 139923872798464: x is 2
For 3, there are two output lines from the child thread.
For 4, the same as 3, and the outputs are even interleaved.
Threading generally occurs by time-division multiplexing. It is generally inefficient for the processor to switch evenly between two threads, as this requires more effort and more context switching. Typically what you'll find is that a thread executes several times before switching, as is the case with examples 3 and 4: the child thread executes more than once before it is finally terminated (because the main thread exited).
Example 2: I don't know why x is increased by the child thread while there is no output.
Consider this: the main thread executes. It calls pthread_create and a new thread is created. The new child thread increments x. Before the child thread is able to complete its printf statement, the main thread kicks in and all of a sudden also increments x. The main thread, however, is also able to run its printf statement, so suddenly x is now equal to 3.
The main thread now terminates (also causing the child thread to exit).
This is likely what happened in your case for example 2.
Example 3 clearly shows that the variable x has been corrupted due to the lack of locking!!
For more info on what a thread is:
Link 1 - Additional info about threading
Link 2 - Additional info about threading
Also, what you'll find is that because you are using the global variable x, access to this variable is shared amongst the threads. This is bad... VERY VERY bad, as threads accessing the same variable create race conditions and data corruption due to multiple reads and writes occurring on the same memory location for the variable x.
It is for this reason that mutexes are used. They essentially create a lock while a variable is being updated, to prevent multiple threads from attempting to modify the same variable at the same time.
Mutex locks will ensure that x is updated sequentially, and not sporadically as in your case.
See this link for more about pthreads in general and mutex locking examples:
Pthreads and Mutex variables
Cheers,
Peter
Hmm, your example uses the same "resources" from different threads. One resource is the variable x; the other one is the stdout file. So you should use mutexes, as shown down here. Also, a pthread_join at the end waits for the other thread to finish its job. (Usually it would also be a good idea to check the return codes of all these pthread... calls.)
#include <pthread.h>
#include <stdio.h>

int x = 1;
pthread_mutex_t mutex;

void *func(void *p)
{
    pthread_mutex_lock(&mutex);
    x = x + 1;
    printf("tid %ld: x is %d\n", pthread_self(), x);
    pthread_mutex_unlock(&mutex);
    return NULL;
}

int main(void)
{
    pthread_mutex_init(&mutex, 0);
    pthread_t tid;
    pthread_create(&tid, NULL, func, NULL);
    pthread_mutex_lock(&mutex);
    printf("main thread: %ld\n", pthread_self());
    pthread_mutex_unlock(&mutex);
    func(NULL);
    pthread_join(tid, 0);
}
It looks like the real answer is Michael Burr's comment which references this glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=14697
In summary, glibc does not handle the stdio buffers correctly during program exit.
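In this particular program, a sketch of the usual workaround (assuming the bug report's diagnosis applies here) is simply to make main wait for the child before the process exits, so exit-time flushing never races the child's buffered printf:

#include <pthread.h>
#include <stdio.h>

/* x and func() exactly as in the question */
extern void *func(void *p);

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, func, NULL);
    printf("main thread: %ld\n", pthread_self());
    func(NULL);
    pthread_join(tid, NULL); /* the child finishes its printf before exit()
                                flushes and tears down the stdio buffers */
}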
