Hi, I have written an MPI quicksort program which works like this:
In my cluster 'Master' will divide the integer data and send these to 'Slave nodes'. Upon receiving at the Slave nodes, each slave will perform individual sorting operations and send the sorted data back to Master.
Now I would like to introduce multithreading on the slaves (to make use of hyper-threading).
Each slave receives the following data from the master:
sub (the array)
count (the size of the array)
I have initialized Pthreads as follows, where num_threads = 12:
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (i = 0; i < num_threads; i++) {
if (pthread_create(&thread[i], &attr, new_thread, (void *) &sub[i]))
{
printf("error creating a new thread\n");
exit(1);
}
else
{
printf("threading is successful %d at node %d\n", i, rank);
}
}
and in the new thread function:
void *new_thread(void *arg)
{
int *sub = (int *) arg;
quick_sort(sub, 0, count - 1);
return NULL;
}
I don't understand whether my way is correct or not. Can anyone help me with this problem?
Your basic idea is correct, except you also need to determine how you're going to get results back from the threads.
Normally you would pass all the information a thread needs to complete its task through the void *arg argument of pthread_create. In your new_thread() function, the count variable is not passed in to the function, so it is a global shared by all threads. A better approach is to pass a pointer to a struct through pthread_create.
typedef struct {
int *sub; /* Pointer to first element in array to sort, different for each thread. */
int count; /* Number of elements in sub. */
int done; /* Flag to indicate that sort is complete. */
} thread_params_t;
/* pthread_create requires the signature void *(*)(void *), so cast inside. */
void *new_thread(void *arg)
{
thread_params_t *params = (thread_params_t *) arg;
quick_sort(params->sub, 0, params->count - 1);
params->done = 1;
return NULL;
}
You would fill in a separate thread_params_t for each thread that is spawned, with sub pointing at that thread's own part of the array and count set to the length of that part.
The other item that has to be managed is the sort results. Normally the main thread does a pthread_join on each thread, which ensures that it has completed before continuing. Depending on your requirements, you could either have each thread send its results back to the master directly, or have the main function collect the results from each thread and send them back outside of the worker threads.
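As a minimal sketch of the struct-plus-join approach described above: the names slice_t, sort_slice and sort_slices are hypothetical, and the standard library's qsort stands in for your quick_sort. It assumes the array is split into contiguous slices of roughly equal length. Note that this leaves each slice sorted on its own; you would still have to merge the slices before sending the result back to the master.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread parameters: one contiguous slice of the array. */
typedef struct {
    int *sub;   /* first element of this thread's slice */
    int count;  /* number of elements in the slice */
} slice_t;

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Thread entry point: sorts one slice (qsort stands in for quick_sort). */
static void *sort_slice(void *arg)
{
    slice_t *s = arg;
    qsort(s->sub, (size_t)s->count, sizeof(int), cmp_int);
    return NULL;
}

/* Splits data[0..n) into num_threads slices and sorts each one in its own
 * thread. Returns 0 on success, -1 if memory or a thread could not be
 * obtained. The last slice takes the remainder when n is not divisible. */
int sort_slices(int *data, int n, int num_threads)
{
    assert(n >= 0 && num_threads > 0);
    pthread_t *tid = malloc((size_t)num_threads * sizeof *tid);
    slice_t *par = malloc((size_t)num_threads * sizeof *par);
    if (!tid || !par) { free(tid); free(par); return -1; }

    int chunk = n / num_threads, started = 0, rc = 0;
    for (int i = 0; i < num_threads; i++) {
        par[i].sub = data + i * chunk;
        par[i].count = (i == num_threads - 1) ? n - i * chunk : chunk;
        if (pthread_create(&tid[i], NULL, sort_slice, &par[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    /* Joining guarantees every started thread has finished its slice. */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    free(tid);
    free(par);
    return rc;
}
```

Because pthread_join already tells the main thread that a worker has finished, the sketch does without the done flag.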
You can use OpenMP instead of pthreads (just for the record, combining MPI with threading is called hybrid programming). It is a lightweight set of compiler pragmas that turn sequential code into parallel code. Support for OpenMP is available in virtually all modern compilers. With OpenMP you introduce so-called parallel regions. A team of threads is created at the beginning of the parallel region, the code then executes concurrently until the end of the region, where all threads are joined and only the main thread continues execution. Thread creation and joining is logical here, i.e. it doesn't have to be implemented like this, and most implementations actually use thread pools to avoid the cost of repeatedly creating threads. A parallel region looks like this:
#pragma omp parallel
{
// This code gets executed in parallel
...
}
You can use omp_get_thread_num() inside the parallel block to get the ID of the thread and make it compute something different. Or you can use one of the worksharing constructs of OpenMP like for, sections, etc. to make it divide the work automatically.
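To illustrate the for worksharing construct mentioned above, here is a rough sketch that sorts contiguous chunks of an array in parallel. The function name sort_chunks_omp and the chunking scheme are my own assumptions, and qsort again stands in for your quick_sort. Each loop iteration is handed to one thread of the team; as with the pthreads version, the chunks still need to be merged afterwards.

```c
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts num_chunks contiguous chunks of data[0..n) independently.
 * The last chunk takes the remainder when n is not evenly divisible. */
void sort_chunks_omp(int *data, int n, int num_chunks)
{
    if (num_chunks <= 0)
        return;
    int chunk = n / num_chunks;

    /* Iterations are independent, so OpenMP may run them concurrently.
     * Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < num_chunks; i++) {
        int start = i * chunk;
        int len = (i == num_chunks - 1) ? n - start : chunk;
        qsort(data + start, (size_t)len, sizeof(int), cmp_int);
    }
}
```

There is no explicit thread creation or joining here; the implicit barrier at the end of the parallel for plays the role of pthread_join.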
The biggest advantage of OpenMP is that it doesn't introduce deep changes to the source code and it abstracts thread creation and joining away, so you don't have to do it manually. Most of the time you can get by with just a few pragmas. Then you have to enable OpenMP during compilation (with -fopenmp for GCC, -openmp for Intel compilers (-qopenmp in newer versions), -xopenmp for Sun/Oracle compilers, etc.). If you do not enable OpenMP or the particular compiler doesn't support it, you'll get a serial program.
You can find a quick but comprehensive OpenMP tutorial at LLNL.
I would like to find the number of tasks. How can I get the number of tasks created by an OpenMP program?
void quicksort(int* A,int left,int right)
{
int i,last;
if(left>=right)
return;
swap(A,left,(left+right)/2);
last=left;
for(i=left+1;i<=right;i++)
if(A[i] < A[left])
swap(A,++last,i);
swap(A,left,last);
#pragma omp task
quicksort(A,left,last-1);
quicksort(A,last+1,right);
#pragma omp taskwait
}
If you want to gain insight into what your OpenMP program is doing, you should use an OpenMP-task-aware performance analysis tool. For example, Score-P can record all task operations in either a trace with full timing information or a summary profile. There are then several other tools to analyse and visualize the recorded information.
Take a look at this paper for more information on performance analysis of task-based OpenMP applications.
There's no good way of counting the number of OpenMP tasks, as OpenMP does not offer any way to query how many tasks have been created so far. The OpenMP runtime system may or may not keep track of this number, so it would be unfair (and would have performance implications) to require a runtime to maintain a count that it is otherwise not interested in.
The following is a terrible hack! Be sure you absolutely want to do this!
Having said the above, you can do the count manually. Assuming that your code is deterministically creating the same number of tasks for each execution, you can do this:
int tasks_created;
void quicksort(int* A,int left,int right)
{
int i,last;
if(left>=right)
return;
swap(A,left,(left+right)/2);
last=left;
for(i=left+1;i<=right;i++)
if(A[i] < A[left])
swap(A,++last,i);
swap(A,left,last);
#pragma omp task
{
#pragma omp atomic
tasks_created++;
quicksort(A,left,last-1);
}
quicksort(A,last+1,right);
#pragma omp taskwait
}
I'm saying that this is a terrible hack, because
it requires you to find all the task-creating statements to modify them with the atomic construct and the increment
it does not work well for some task-generating directives, e.g., taskloop
it may horribly slow down execution, so you cannot leave the modification in for production (that's the part about determinism: you need to run once with the count and then remove the counting for production)
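One way to make the last point less painful is to compile the counting in only when you want it. The following is just a sketch of that idea: the COUNT_TASKS macro and the sort_with_tasks driver are my own additions, not part of the original code. It also shows the parallel/single pair that the tasks need around the initial call in order to actually run in parallel.

```c
/* Only incremented when the file is compiled with -DCOUNT_TASKS. */
long tasks_created = 0;

static void swap(int *A, int i, int j)
{
    int tmp = A[i];
    A[i] = A[j];
    A[j] = tmp;
}

static void quicksort(int *A, int left, int right)
{
    int i, last;
    if (left >= right)
        return;
    swap(A, left, (left + right) / 2);
    last = left;
    for (i = left + 1; i <= right; i++)
        if (A[i] < A[left])
            swap(A, ++last, i);
    swap(A, left, last);
    #pragma omp task
    {
#ifdef COUNT_TASKS
        /* Counting build only: one atomic increment per created task. */
        #pragma omp atomic
        tasks_created++;
#endif
        quicksort(A, left, last - 1);
    }
    quicksort(A, last + 1, right);
    #pragma omp taskwait
}

/* Hypothetical driver: one thread creates the root call, and the whole
 * team then picks up the tasks it generates. */
void sort_with_tasks(int *A, int n)
{
    #pragma omp parallel
    #pragma omp single
    quicksort(A, 0, n - 1);
}
```

Build once with -DCOUNT_TASKS to read the count, then build without it for production. You still have to find every task-creating statement yourself, so the other objections in the list remain.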
Another way...
If you are using a reasonably new implementation of OpenMP that already supports the tools interface introduced in OpenMP 5.0, you can write a small tool that hooks into the OpenMP events for task creation. Then you can do the count in the tool and attach it to your execution through the OpenMP tools mechanism.
I've just started playing around with POSIX pthreads (in C++).
I'm trying to use a condition variable to start many threads at once.
Does someone know a better way to do this, or can someone give an example of how one would?
If you have ruled out pthread_cond_broadcast and are trying to do this, you have probably already created the threads and are looking for a way to gather them and release them all at once. If that is the case, you may want to use a barrier.
You can initialize a barrier with pthread_barrier_init, which takes a parameter for the number of threads you want to wait on. When the specified number of threads have hit a pthread_barrier_wait statement, all the waiting threads are released at once (i.e. marked ready to run), though of course they remain subject to the whims of the scheduler as to which may or may not immediately get processor time.
A very simple sketch
void* tfunc(void *)
{
pthread_barrier_wait(&bar);
//do stuff
return NULL;
}
pthread_barrier_init(&bar, NULL, 4);
for (int i = 0; i < 4; ++i)
pthread_create(&tid[i], NULL, tfunc, NULL);
When the 4th thread hits the wait all the waiting threads will continue.
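Here is a fuller version of that sketch in plain C, in case it helps. The names run_together and released are made up for the example, and the atomic counter merely stands in for the "do stuff" part so that there is something to observe. Error handling is kept deliberately crude: if a thread cannot be created, the threads already waiting on the barrier would never be released, so the sketch simply aborts.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

static pthread_barrier_t bar;
static atomic_int released;   /* how many threads got past the barrier */

static void *tfunc(void *arg)
{
    (void)arg;
    pthread_barrier_wait(&bar);        /* blocks until all n threads arrive */
    atomic_fetch_add(&released, 1);    /* placeholder for the real work */
    return NULL;
}

/* Creates n threads that only start working once all of them have reached
 * the barrier. Returns the number of threads that passed it, or -1. */
int run_together(int n)
{
    if (n <= 0)
        return -1;
    pthread_t *tid = malloc((size_t)n * sizeof *tid);
    if (!tid)
        return -1;
    atomic_store(&released, 0);
    if (pthread_barrier_init(&bar, NULL, (unsigned)n) != 0) {
        free(tid);
        return -1;
    }
    for (int i = 0; i < n; i++)
        if (pthread_create(&tid[i], NULL, tfunc, NULL) != 0)
            abort();   /* assumption: giving up beats deadlocking the rest */
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&bar);
    free(tid);
    return atomic_load(&released);
}
```

If the main thread should decide when the workers start, initialize the barrier with n + 1 and have main call pthread_barrier_wait as well once it is ready.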
Can somebody tell me whether the following compute shader is possible with DirectX 11?
I want the first thread in a Dispatch that accesses an element in a buffer (g_positionsGrid) to set (compare exchange) that element with a temporary value to signify that it is taking some action.
In this case the temp value is 0xffffffff, and the first thread then goes on to allocate a value from a structured append buffer (g_positions) and assign it to that element.
So all is fine so far, but the other threads in the dispatch can come in between the compare exchange and the allocation by the first thread, and so need to wait until the allocation index is available. I do this with a busy wait, i.e. the while loop.
However, sadly this just locks up the GPU, as I'm assuming that the value written by the first thread is not propagated through to the other threads stuck in the while loop.
Is there any way to get those threads to see that value?
Thanks for any help!
RWStructuredBuffer<float3> g_positions : register(u1);
RWBuffer<uint> g_positionsGrid : register(u2);
void AddPosition( uint address, float3 pos )
{
uint token = 0;
// Assign a temp value to signify first thread has accessed this particular element
InterlockedCompareExchange(g_positionsGrid[address], 0, 0xffffffff, token);
if(token == 0)
{
//If first thread in here allocate index and assign value which
//hopefully the other threads will pick up
uint index = g_positions.IncrementCounter();
g_positionsGrid[address] = index;
g_positions[index].m_position = pos;
}
else
{
if(token == 0xffffffff)
{
uint index = g_positionsGrid[address];
//This never meets its condition
[allow_uav_condition]
while(index == 0xffffffff)
{
//For some reason this thread never gets the assignment
//from the first thread assigned above
index = g_positionsGrid[address];
}
g_positions[index].m_position = pos;
}
else
{
//Just assign value as the first thread has already allocated a valid slot
g_positions[token].m_position = pos;
}
}
}
Thread synchronization in DirectCompute is simple, but compared to CPU threading it is very inflexible. AFAIK, the only way to synchronize data between threads in a compute shader is to use groupshared memory and GroupMemoryBarrierWithGroupSync(). That means that you can:
create small temporary buffer in groupshared memory
calculate value
write to groupshared buffer
synchronize threads with GroupMemoryBarrierWithGroupSync()
read from groupshared from another thread and use it somehow
To implement all of this you need proper array indices. But where do you get them from? In DirectCompute, the values passed to Dispatch and the system values available in the shader (SV_GroupIndex, SV_DispatchThreadID, SV_GroupThreadID, SV_GroupID) are related. Using those values you can calculate the indices to access your buffers.
Compute shaders are not well documented and there is no easy way in, but to find out more you can at least:
read MSDN: Compute shader overview
watch DirectCompute Lecture Series videos on channel9
examine the compute shader samples from the DirectX SDK
look at the very nice samples from NVIDIA's SDK (10 and 11)
read this advanced NVIDIA paper where they implemented thread reduction and then optimize their code to run 10 times faster ;)
As for your code: you can probably redesign it a little.
It is always good to have all threads do the same task (symmetric loading). In fact, you cannot assign different tasks to your threads the way you do in CPU code.
If your data needs some preprocessing first and further processing afterwards, you may want to split the work into different Dispatch() calls (different shaders) that you call in sequence:
preprocessShader reads from buffer inputData and writes to preprocessedData
calculateShader reads from preprocessedData and writes to finalData
In this case you can drop the slow thread synchronization and the slow groupshared memory altogether.
Look at "Thread reduction" trick mentioned above.
Hope it helps! And happy coding!
Suppose I have some code that looks like this:
#include "mpi.h"
int main( int argc, char** argv )
{
int my_array[10];
//fill the array with some data
MPI_Init(&argc, &argv);
// Some code here
MPI_Finalize();
return 0;
}
Will each MPI instance get its own copy of my_array? Only rank 0? None of them? Is it bad practice to have any code before MPI_Init at all?
The short answer to "what happens to memory when I call MPI_Init" is: nothing.
MPI_Init initializes the MPI library in the calling process. Nothing more, nothing less. At the time of the MPI_Init call, all the MPI processes already exist, they just don't know about each other yet and can't communicate.
Each MPI process is a separately executing program. The processes do not share memory, and communicate by passing messages.
Indeed, the processes calling MPI_Init can even be different programs entirely, as long as the messages they pass around match. This is the MPMD model.
When you run MPI code, you are running the same code in different processes (they cannot share memory), so each process will have its own array.
The arrays should be equal unless your data depends on time (the processes are not necessarily synchronized), on the process rank (I think the rank is only available after the init call), or on a random number generator (some may generate random seeds as well).
I have a bunch of threads that are doing lots of communication with each other.
I would prefer this be lock free.
For each thread, I want to have a mailbox where other threads can send it messages (but only the owner can remove messages). This is a multiple-producer single-consumer situation. Is it possible for me to do this in a lock-free / high-performance manner? (This is in the inner loop of a gigantic simulation.)
Lock-free Multiple Producer Single Consumer (MPSC) Queue is one of the easiest lock-free algorithms to implement.
The most basic implementation requires a simple lock-free singly-linked list (SList) with only push() and flush(). The functions are available in the Windows API as InterlockedFlushSList() and InterlockedPushEntrySList() but these are very easy to roll on your own.
Multiple Producer push() items onto the SList using a CAS (interlocked compare-and-swap).
The Single Consumer does a flush() which swaps the head of the SList with a NULL using an XCHG (interlocked exchange). The Consumer then has a list of items in the reverse-order.
To process the items in order, you must simply reverse the list returned from flush() before processing it. If you do not care about order, you can simply walk the list immediately to process it.
Two notes if you roll your own functions:
1) If you are on a system with weak memory ordering (e.g. PowerPC), you need to put a "release memory barrier" at the beginning of the push() function and an "acquire memory barrier" at the end of the flush() function.
2) You can simplify and optimize the functions considerably, because the ABA issue with SLists occurs in the pop() function. You cannot have ABA issues with an SList if you use only push() and flush(). This means you can implement it as a single pointer, very similar to the non-lock-free code, and there is no need for an ABA-prevention sequence counter.
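If you do roll your own, a push()/flush() pair could look roughly like the following C11 sketch. The type and function names (node_t, mailbox_t, mailbox_push, mailbox_flush, reverse_list) are made up for illustration, and the int payload is a placeholder for your message type. The release ordering on the CAS and the acquire ordering on the exchange play the role of the two barriers from note 1.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical message node; the caller owns its allocation. */
typedef struct node {
    struct node *next;
    int value;
} node_t;

/* The whole mailbox is a single atomic head pointer. */
typedef struct {
    _Atomic(node_t *) head;
} mailbox_t;

/* Producer side: any number of threads may call this concurrently. */
void mailbox_push(mailbox_t *mb, node_t *n)
{
    node_t *old = atomic_load_explicit(&mb->head, memory_order_relaxed);
    do {
        n->next = old;   /* link in front of the current head */
    } while (!atomic_compare_exchange_weak_explicit(
                 &mb->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Consumer side: detaches the entire list in one exchange.
 * The returned list is in reverse order (newest message first). */
node_t *mailbox_flush(mailbox_t *mb)
{
    return atomic_exchange_explicit(&mb->head, NULL, memory_order_acquire);
}

/* Reverses a flushed list so messages can be processed oldest first. */
node_t *reverse_list(node_t *n)
{
    node_t *prev = NULL;
    while (n) {
        node_t *next = n->next;
        n->next = prev;
        prev = n;
        n = next;
    }
    return prev;
}
```

The consumer would call reverse_list(mailbox_flush(&mb)) in its loop and walk the result. A failed CAS reloads old with the current head, so the loop simply relinks and retries.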
Sure, if you have an atomic CompareAndSwap instruction:
for (i = 0; ; i = (i + 1) % MAILBOX_SIZE)
{
if ((mailbox[i].owned == false) &&
(CompareAndSwap(&mailbox[i].owned, true, false) == false))
break;
}
mailbox[i].message = message;
mailbox[i].ready = true;
After reading a message, the consuming thread just sets mailbox[i].ready = false; mailbox[i].owned = false; (in that order).
Here's a paper from the University of Rochester illustrating a non-blocking concurrent queue. The algorithm described in the paper shows one technique for making a lockless queue.
You may want to look at Intel Threading Building Blocks; I recall attending a lecture by an Intel developer that mentioned something along those lines.