I am wondering if there is a difference between submitting the command buffer in OpenCL vs. DirectX.
As far as I know, submitting the command buffer in OpenCL is performed when clFlush or clFinish is called.
Submitting the command buffer in DirectX is explained in http://msdn.microsoft.com/en-us/library/windows/hardware/ff569747(v=vs.85).aspx.
My question is: are OpenCL and DirectX command buffer submissions conceptually the same?
I am not familiar with DX, but the OpenCL submission model does NOT need clFlush and clFinish in order to run the commands in the queue.
Each time a kernel or memory operation is enqueued to an OpenCL queue, it is processed as soon as possible, asynchronously with respect to CPU execution.
clFlush() merely forces all queued commands to be issued to the device. clFinish() additionally ensures that all the jobs in the queue have finished before returning control to the CPU (it is a blocking call).
For example, this will probably work:
clEnqueueWriteBuffer()
clEnqueueNDRangeKernel()
clEnqueueReadBuffer()
sleep(10) // just hope everything has finished by now
//continue the processing
But the proper way is to check explicitly, either by calling clFinish() (which ensures the queue has drained) or through the event subsystem (which checks whether specific enqueued tasks have finished):
clEnqueueWriteBuffer()
clEnqueueNDRangeKernel()
clEnqueueReadBuffer()
clFinish()
//continue the processing
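For reference, a minimal sketch of the event-based variant in host C code; the names queue, kernel, in_buf, out_buf, host_in, host_out, size and global are assumptions standing in for objects created elsewhere:
/* Assumes queue, kernel, buffers, host pointers and sizes already exist;
   error checking omitted for brevity. */
cl_event read_done;
clEnqueueWriteBuffer(queue, in_buf, CL_FALSE, 0, size, host_in, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, out_buf, CL_FALSE, 0, size, host_out, 0, NULL, &read_done);
/* Block until just this read has completed, instead of draining the whole queue. */
clWaitForEvents(1, &read_done);
clReleaseEvent(read_done);
//continue the processing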
I have a loop that sends jobs off to the GPU using the managed memory model. The code is:
var commandBufferArray: [MTLCommandBuffer] = []
var blitCommandArray: [MTLBlitCommandEncoder] = []
var outputDeviateBufferArray: [MTLBuffer] = []
for i_cycle in 0..<n
{
    commandBufferArray.append(mc.metalCommandQueue.makeCommandBuffer())
    let outputDeviate = [float4](repeating: float4(0.0), count: 1024)
    outputDeviateBufferArray.append(mc.createFloat4MetalBufferManaged(outputDeviate))
    populateBuffersMetalJob(.....)
    blitCommandArray.append(commandBufferArray[i_cycle].makeBlitCommandEncoder())
    blitCommandArray[i_cycle].synchronize(resource: outputDeviateBufferArray[i_cycle])
    blitCommandArray[i_cycle].endEncoding()
    commandBufferArray[i_cycle].addCompletedHandler({ _ in
        // do stuff with result
    })
    commandBufferArray[i_cycle].commit()
}
for i_cycle in 0..<n
{
    commandBufferArray[i_cycle].waitUntilCompleted()
}
I am using the AMD GPU on a 2015 MBP. If n = 1, this works fine. Once n > 1, it seems to hang on the synchronization call and never completes.
Any thoughts on what is going wrong here?
What is in the // do stuff with result code? I suspect you're doing something in there that's deadlocking. Perhaps it's trying to run something on the main thread, where the code you've shown is blocked. Or it's trying to access a resource that you have locked. That prevents the completed handler(s) from finishing, which prevents the command buffer from moving on and letting the next command buffer run or complete.
If you take a sample of the process, it can provide hints about where it's stuck and what it's waiting for. You can do that using the sample command-line tool or Activity Monitor > View > Sample Process.
Also, why are you using multiple command buffers? And why multiple blit command encoders? You do realize you could do all of this using a single command buffer and a single blit command encoder, right?
I am trying to write some code that creates a pthread with SCHED_RR:
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setinheritsched (&attr, PTHREAD_EXPLICIT_SCHED);
pthread_attr_setschedpolicy(&attr, SCHED_RR);
struct sched_param params;
params.sched_priority = 10;
pthread_attr_setschedparam(&attr, &params);
pthread_create(&m_thread, &attr, &startThread, NULL);
pthread_attr_destroy(&attr);
But the thread doesn't run. Do I need to set more parameters?
Without the CAP_SYS_NICE capability, a thread can only set the SCHED_OTHER scheduling policy. From sched(7):
In Linux kernels before 2.6.12, only privileged (CAP_SYS_NICE)
threads can set a nonzero static priority (i.e., set a real-time
scheduling policy). The only change that an unprivileged thread can
make is to set the SCHED_OTHER policy, and this can be done only if
the effective user ID of the caller matches the real or effective
user ID of the target thread (i.e., the thread specified by pid)
whose policy is being changed.
That means that when you request round-robin scheduling (SCHED_RR) via pthread_attr_setschedpolicy(), the subsequent thread creation in all likelihood failed with EPERM (unless you have enabled this capability for the user you are running as, or you are running the program as root, which can override CAP_SYS_NICE).
You can set the capability using the setcap program:
$ sudo setcap cap_sys_nice=+ep ./a.out
(assuming a.out is your program name).
You'd have figured this out if you had done error checking. You should check the return value of all the pthread functions (and generally of all library functions) for failure.
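As a minimal sketch of what that might look like, reusing m_thread and startThread from the question (the thread body is a placeholder and the error-handling style is just illustrative):
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

pthread_t m_thread;

void *startThread(void *arg) { (void)arg; return NULL; } /* placeholder body */

int main(void)
{
    pthread_attr_t attr;
    struct sched_param params;
    int rc;

    /* Every pthread_* call returns an error number (not errno) on failure. */
    rc = pthread_attr_init(&attr);
    if (rc != 0) fprintf(stderr, "attr_init: %s\n", strerror(rc));
    rc = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    if (rc != 0) fprintf(stderr, "setinheritsched: %s\n", strerror(rc));
    rc = pthread_attr_setschedpolicy(&attr, SCHED_RR);
    if (rc != 0) fprintf(stderr, "setschedpolicy: %s\n", strerror(rc));
    params.sched_priority = 10;
    rc = pthread_attr_setschedparam(&attr, &params);
    if (rc != 0) fprintf(stderr, "setschedparam: %s\n", strerror(rc));

    /* Without CAP_SYS_NICE, this is the call that fails, with EPERM. */
    rc = pthread_create(&m_thread, &attr, &startThread, NULL);
    if (rc != 0) fprintf(stderr, "pthread_create: %s\n", strerror(rc));

    pthread_attr_destroy(&attr);
    if (rc == 0)
        pthread_join(m_thread, NULL);
    return 0;
}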
Since you haven't posted the full code, there might also be an issue if you haven't joined with the thread you create (the main thread could exit before m_thread gets to run, and thus exit the whole process). So you might want to join:
pthread_join(m_thread, NULL);
or, if the main thread is no longer needed, you could exit it without joining by calling pthread_exit(NULL); in main().
My application works as follows:
the worker-threads initialize and begin waiting in pthread_cond_wait()
the main thread connects to DB and starts handing over one row at a time to the proper worker
Because of the DB-driver internals, the next row can not be read until the current one is extracted, so the main thread has to wait for the worker to "accept" the row.
I achieve this by calling pthread_cond_wait() inside the main thread -- waiting for a pthread_cond_signal() from the worker. This works cleanly -- on both Linux and FreeBSD -- but usually takes much longer on Linux. Whereas I consistently process the entire 1.6M rows in about 27 seconds on FreeBSD, on Linux it usually takes over 2 minutes. Except sometimes the Linux box shows the same time...
The code is compiled from the same source and the program talks to the same DB-server. If anything, the Linux box is located on the same LAN as the DB, whereas the FreeBSD machine connects via VPN (so it should be a bit slower). But it is the wide inconsistency of the Linux results that bothers me, and I suspect the thread-coordination...
Here is what I have now:
MAIN THREAD                               WORKER
--------------------------------------------------------------------------
get new row
figure out which worker it belongs to     lock my mutex
lock the worker's mutex                   go into pthread_cond_wait
signal the worker                         extract the row's data
unlock the worker's mutex                 signal the main thread
go into pthread_cond_wait                 unlock the mutex
go back to getting the next row           go on to process the row's data
Is there a better way? Thanks!
If reading the next row must be serial anyway, why are you delegating this to the worker? As the main thread has to wait anyway, have the main thread do the extraction and have the hand-off occur as soon as the row has been sufficiently extracted that the master can proceed to the next row.
Other than that, you will need to provide code, as your description is incomplete, as would be any question of this nature submitted without code.
It looks like your problem is that you are calling pthread_cond_wait() without the mutex locked in the main thread. This means that there's a race-condition: if the worker thread wakes up, extracts the data and signals the condition before the parent executes pthread_cond_wait(), the wakeup will be lost.
What you should have is some shared state paired with the condition variable, like this:
Main Thread:
get_new_row();
worker = decide_worker();

pthread_mutex_lock(&mutex);
/* Signal worker that data is available */
flag[worker] = 1;
pthread_cond_signal(&cond);
/* Wait for worker to extract it */
while (flag[worker] == 1)
    pthread_cond_wait(&cond, &mutex);
pthread_mutex_unlock(&mutex);
Worker Thread:
pthread_mutex_lock(&mutex);
/* Wait for data to be available */
while (flag[worker] == 0)
    pthread_cond_wait(&cond, &mutex);
extract_row_data();
/* Signal main thread that extraction is complete */
flag[worker] = 0;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&mutex);
I got an EAGAIN when trying to spawn a thread using pthread_create. However, from what I've checked, the threads seem to have been terminated properly.
What determines whether the OS gives EAGAIN when trying to create a thread using pthread_create? Could unclosed sockets/file handles play a part in causing this EAGAIN (i.e., do they share the same resource space)?
And lastly, is there any tool to check resource usage, or any functions that can be used to see how many pthread objects are active at the time?
Okay, found the answer. Even if pthread_exit or pthread_cancel is called, the parent process still needs to call pthread_join to release the pthread ID, which then becomes recyclable.
Putting a pthread_join(tid, NULL) in the end did the trick.
Edit: it was not waitpid, but rather pthread_join.
As a practical matter, EAGAIN is almost always related to running out of memory for the process. Often this has to do with the stack size allocated for the thread, which you can adjust with pthread_attr_setstacksize(). But there are also process limits on how many threads you can run. You can query the hard and soft limits with getrlimit(), using RLIMIT_NPROC as the first parameter.
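A minimal sketch of that query (on Linux, RLIMIT_NPROC covers threads too, since they are kernel tasks):
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) == 0)
        printf("soft limit: %llu, hard limit: %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
    return 0;
}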
There are quite a few questions here dedicated to keeping track of threads, their number, whether they are dead or alive, etc. Simply put, the easiest way to keep track of them is to do it yourself through some mechanism you code, which can be as simple as incrementing and decrementing a global counter (protected by a mutex) or something more elaborate.
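The simplest version of that mechanism might look like this (the names are illustrative):
#include <pthread.h>

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int live_threads = 0;

static void thread_started(void)  /* call at the top of each thread function */
{
    pthread_mutex_lock(&count_lock);
    live_threads++;
    pthread_mutex_unlock(&count_lock);
}

static void thread_finished(void) /* call just before a thread returns/exits */
{
    pthread_mutex_lock(&count_lock);
    live_threads--;
    pthread_mutex_unlock(&count_lock);
}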
Open sockets or other file descriptors shouldn't cause pthread_create() to fail. If you had reached the descriptor limit, you would have failed before creating the new thread; and for the new thread to open more descriptors, it would already have to have been created successfully, so it could not have failed with EAGAIN.
As per my observation, if the parent thread calls pthread_join() while child threads are trying to terminate via pthread_exit() or pthread_cancel(), the system is sometimes not able to release those threads properly. In that case, calling pthread_detach() immediately after a successful pthread_create() solved the problem. A snapshot is here -
err = pthread_create(&(receiveThread), NULL, &receiver, temp);
if (err != 0)
{
    MyPrintf("\nCan't create thread. Reason: %s\n",
             (err == EAGAIN) ? "EAGAIN" :
             (err == EINVAL) ? "EINVAL" :
             (err == EPERM)  ? "EPERM"  : "UNKNOWN");
    free(temp);
}
else
{
    threadnumber++;
    MyPrintf("Count: %d Thread ID: %u\n", threadnumber, receiveThread);
    pthread_detach(receiveThread);
}
Another potential cause: I was getting this problem (EAGAIN on pthread_create) because I had forgotten to call pthread_attr_init on the pthread_attr_t I was trying to initialize my thread with.
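In other words, the attribute object must be initialized before any use (a minimal sketch; tid, thread_fn and arg are placeholders):
pthread_attr_t attr;
pthread_attr_init(&attr);   /* required before any pthread_attr_set*() call */
/* ... set attributes here ... */
pthread_create(&tid, &attr, thread_fn, arg);
pthread_attr_destroy(&attr);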
OpenCL doesn't have a global barrier that will stop all threads, so I'm trying to create a workaround with the following code:
void barrier(__global uint* scratch) {
uint nThreads = get_global_size(0);
atom_inc(scratch);
/* this loop never terminates */
while(scratch[0] < nThreads) {
continue;
}
}
The idea is that each thread loops until all of them increment that one piece of memory.
However, the value read from scratch[0] never changes for the threads once it's been read, and it loops forever. I know it's being incremented because it's the correct value when I read it back to the host.
Is the global memory being locally cached? What's going on here?
Found the problem: the order in which work-groups are executed is implementation-defined. This means that some threads might start only after others have finished.
In the code I gave, the work-groups that are started first will loop forever waiting on the others to hit the 'barrier'. And the work-groups that would be started later won't ever start, because they're waiting for the first ones to finish.
If the implementation (I'm on a Radeon 5750, using Stream SDK 2.2) executes all work groups concurrently, then it probably wouldn't be an issue. But that's not the case for my setup.
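For what it's worth, the usual workaround (a hedged sketch, not from the answer above) is to split the computation into multiple kernel launches: with an in-order command queue, the boundary between two launches acts as a global barrier, since all work-groups of one launch complete before any work-group of the next one starts.
/* Host-side sketch; queue, kernel, data_buf, global and num_passes are
   assumed to exist. Error checking omitted for brevity. */
for (int pass = 0; pass < num_passes; ++pass) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &data_buf);
    clSetKernelArg(kernel, 1, sizeof(int), &pass);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);
}
clFinish(queue);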