Apple Metal blitCommandEncoder in multi-thread situation - metal

I have a loop that sends jobs off to the GPU in the managed memory model. The code is:
var commandBufferArray : [MTLCommandBuffer] = []
var blitCommandArray : [MTLBlitCommandEncoder] = []
for i_cycle in 0..<n
{
    commandBufferArray.append(mc.metalCommandQueue.makeCommandBuffer())
    let outputDeviate = [float4](repeating: float4(0.0),count: 1024)
    outputDeviateBufferArray.append(mc.createFloat4MetalBufferManaged(outputDeviate))
    populateBuffersMetalJob(.....)
    // Copy the managed buffer's GPU-side contents back to the CPU.
    blitCommandArray.append(commandBufferArray[i_cycle].makeBlitCommandEncoder())
    blitCommandArray[i_cycle].synchronize(resource: outputDeviateBufferArray[i_cycle])
    blitCommandArray[i_cycle].endEncoding()
    commandBufferArray[i_cycle].addCompletedHandler({ _ in
        // do stuff with result
    })
    commandBufferArray[i_cycle].commit()
}
// Wait for every committed command buffer (same count n as the loop above).
for i_cycle in 0..<n
{
    commandBufferArray[i_cycle].waitUntilCompleted()
}
I am using the AMD processor on a 2015 MBP. If n = 1, this works fine. Once n > 1, it seems to hang on the synchronization call and never completes.
Any thoughts on what is going wrong here?

What is in the // do stuff with result code? I suspect you're doing something in there that deadlocks. Perhaps it's trying to run something on the main thread, where the code you've shown is blocked. Or it's trying to access a resource that you have locked. That prevents the completed handler(s) from finishing, which prevents the command buffer from moving on and letting the next command buffer run or complete.
If you take a sample of the process, it can provide hints about where it's stuck and what it's waiting for. You can do that using the sample command-line tool or Activity Monitor > View > Sample Process.
Also, why are you using multiple command buffers? And why multiple blit command encoders? You do realize you could do all of this using a single command buffer and a single blit command encoder, right?

How to pass native void pointers to a Dart Isolate - without copying?

I am working on exposing an audio library (a C library) to Dart. To trigger the audio engine, it requires a few initialization steps (non-blocking for the UI), then audio processing is triggered with a perform function, which is blocking (audio processing is a heavy task). That is why I started reading about Dart isolates.
My first thought was that I only needed to call the perform function in the isolate, but that doesn't seem possible, since the perform function takes the engine state as its first argument, and this engine state is an opaque pointer (Pointer<Void> in dart:ffi). When I try to pass the engine state to a new isolate with the compute function, the Dart VM returns an error: it cannot pass C pointers to an isolate.
I could not find a way to pass this data to the isolate. I assume this is due to the separate memory of the main isolate and the one I'm creating.
So I should probably manage the entire engine state inside the isolate, which means:
Create the engine state
Initialize it with some options (strings)
Trigger the perform function
Control audio at runtime
I couldn't find any example of how to perform these actions in the isolate while triggering them from the main thread/isolate, nor of how to manage isolate memory (keep the engine state and use it). Of course I could do
Here is a non-isolated example of what I want to do:
Pointer<Void> engineState = createEngineState();
initEngine(engineState, parametersString);
startEngine(engineState);
perform(engineState);
And at runtime, triggered by UI actions (like a slider value change or a button click):
setEngineControl(engineState, valueToSet);
double controleValue = getEngineControl(engineState);
The engine state could be encapsulated in a class; I don't think it really matters here.
Whether it is a class or an opaque datatype, I can't find out how to manage and keep this state in the isolate and trigger operations on it from the main thread. Any ideas?
Thanks in advance.
PS: I notice, while writing, that my question/explanation may not be precise. I have to say I'm a bit lost here, since I have never used Dart isolates. Please tell me if some information is missing.
EDIT April 24th:
Creating and managing the object state inside the isolate seems to work, but the main problem isn't solved. Because the perform method blocks until it completes, the isolate has no way to receive messages in the meantime.
An option I thought first was to use the performBlock method, which only performs a block of audio samples. Like this :
while(performBlock(engineState)) {
// listen messages, and do something
}
But this doesn't seem to work: the process is still blocked until the audio performance finishes. Even if this loop is called in an async method in the isolate, it blocks, and no messages are read.
I am now thinking about passing the Pointer<Void> managed in the main isolate to another isolate, which would then be the worker (for the perform method only), so that I could still trigger some control methods from the main isolate.
The isolate Dart package provides a registry sub-library to manage some shared memory. But it is still impossible to pass a void pointer between isolates.
[ERROR:flutter/lib/ui/ui_dart_state.cc(157)] Unhandled Exception: Invalid argument(s): Native objects (from dart:ffi) such as Pointers and Structs cannot be passed between isolates.
Has anyone already run into this kind of situation?
It is possible to get the address a Pointer points to as a number and to construct a new Pointer from that address (see Pointer.address and Pointer.fromAddress()). Since numbers can be passed freely between isolates, this can be used to pass native pointers between them.
In your case that could be done, for example, like this (I used Flutter's compute to make the example a bit simpler, but it should work with explicit Send/ReceivePorts as well):
import 'dart:ffi';
import 'package:flutter/foundation.dart'; // for compute
// createEngineState, initEngine, startEngine and process are the native
// (dart:ffi) bindings from the question.
// Callback to be used in a background isolate.
// Returns the address of the new engine.
// (Named differently from the native initEngine binding to avoid a clash.)
int createAndStartEngine(String parameters) {
  Pointer<Void> engineState = createEngineState();
  initEngine(engineState, parameters);
  startEngine(engineState);
  return engineState.address;
}
// Callback to be used in a background isolate.
// Does whichever processing is needed using the given engine.
void processWithEngine(int engineStateAddress) {
  final engineState = Pointer<Void>.fromAddress(engineStateAddress);
  process(engineState);
}
Future<void> main() async {
  // Initialize the engine in a background isolate.
  // compute returns a Future, so the address has to be awaited.
  final address = await compute(createAndStartEngine, "parameters");
  final engineState = Pointer<Void>.fromAddress(address);
  // Do some heavy computation in a background isolate using the engine.
  await compute(processWithEngine, engineState.address);
}
I ended up doing the processing of callbacks inside the audio loop itself.
while (performAudio()) {
  tasks.forEach((String key, List<int> value) {
    double val = getCallback(key);
    value.forEach((int element) {
      callbackPort.send([element, val]);
    });
  });
}
Here 'val' is the value you want to send to the callback, and the list of ints 'value' is a list of callback indices.
Say your audio loop runs with a vector size of 512 samples: you will be able to send your callbacks after every 512 processed audio samples, which means 48000 / 512 times per second (assuming your sample rate is 48000). This method is not the best one, but it works; I still have to see whether it holds up in a very intensive processing context. It was designed for realtime audio, but it could work the same way for audio rendering.
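As a quick check of the figures above, taking the sample rate f_s = 48000 Hz and the vector size N = 512 from the text, the callback rate r and the interval T between two callback opportunities work out as:

```latex
\[
r = \frac{f_s}{N} = \frac{48000}{512} = 93.75\ \text{s}^{-1},
\qquad
T = \frac{N}{f_s} = \frac{512}{48000}\ \text{s} \approx 10.67\ \text{ms}
\]
```

So with these settings a callback value can be up to roughly 10.7 ms old by the time it is sent.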
You can see the full code here : https://framagit.org/johannphilippe/csounddart/-/blob/master/lib/csoundnative.dart

How to properly wait for completion of NtCreateFile/etc?

I am using the native NT API in my application to access files (NtCreateFile/etc). In order to avoid dealing with STATUS_PENDING, I am using the FILE_SYNCHRONOUS_IO_NONALERT flag when opening the file. So, opening a file looks like this:
UNICODE_STRING fname = toNtUnicode(ntpath);
OBJECT_ATTRIBUTES oa;
InitializeObjectAttributes(&oa, &fname, 0, at.handle(), NULL);
HANDLE h;
IO_STATUS_BLOCK io_status;
NTSTATUS r = NtOpenFile(&h, GENERIC_READ|SYNCHRONIZE, &oa, &io_status,
                        FILE_SHARE_READ, FILE_SYNCHRONOUS_IO_NONALERT|FILE_DIRECTORY_FILE);
if (r != STATUS_SUCCESS)
    ...; // error handling
Unfortunately, this causes the kernel to serialize all operations on the given handle. That is, if I try to execute multiple reads in parallel (using multiple threads), only one request is processed at any point in time.
I could get rid of the serialization:
HANDLE h;
IO_STATUS_BLOCK io_status;
NTSTATUS r = NtOpenFile(&h, GENERIC_READ|SYNCHRONIZE, &oa, &io_status,
                        FILE_SHARE_READ, FILE_DIRECTORY_FILE);
if (r == STATUS_PENDING)
    ...; // what to do here???
but how exactly should I wait for completion? With WaitForSingleObject() on the file handle? As far as I know, the handle can become signaled for many reasons. Is there any way to tell that my open file (or directory) operation has completed?
Similarly, if I submit multiple reads (from multiple threads), how can I tell which one (if any) has finished?
NtOpenFile is a synchronous API; it never returns STATUS_PENDING to the caller. Even if the driver returns STATUS_PENDING for IRP_MJ_CREATE, the I/O subsystem waits for the IRP to complete:
https://github.com/Zer0Mem0ry/ntoskrnl/blob/master/Io/iomgr/parse.c#L1404
So you never need to check for STATUS_PENDING after NtOpenFile, and you never need to wait. In principle you could not wait here anyway: you do not have a file handle yet, so you cannot wait on it or bind it to, say, an IOCP, and NtOpenFile takes no event or other callback mechanism.

Restart a task in FreeRTOS

I have a task routine which performs some operations in a specific order, and these operations work on a few volatile variables. There is an interrupt which updates these volatile variables asynchronously. Hence, the task routine should restart if such an interrupt occurs. Normally FreeRTOS will resume the task where it left off, but this results in wrong derived values, hence the requirement to restart the routine. I also cannot put the task routine in a critical section, because I must not miss any interrupts.
Is there a way in FreeRTOS to achieve this? Something like a vTaskRestart API. I could delete the task and re-create it, but this adds a lot of memory management complications, which I would like to avoid. Currently my only option is to add checks on a flag in the routine to see whether a context switch has occurred and, if so, restart; otherwise continue.
Googling did not turn up any clue on this. It seems that people have never faced such a problem, or maybe this design is poor. In the FreeRTOS forum, the few who asked for a task restart didn't seem to have this same problem. Stack Overflow had no results for freertos + task + restart. So this could be the first post with this tag combination ;)
Can someone please tell me whether this is directly possible in FreeRTOS?
You can use a semaphore for this purpose. If you decide to use one, follow the steps below.
First, create a binary semaphore.
Give the semaphore in the interrupt routine with
xSemaphoreGiveFromISR( Example_xSemaphore, &xHigherPriorityTaskWoken );
Then check for the semaphore in the task:
void vExample_Task( void * pvParameters )
{
    for( ;; )
    {
        /* Wait up to Example_PROCESS_TIME ticks for the ISR to give the semaphore. */
        if (xSemaphoreTake( Example_xSemaphore, Example_PROCESS_TIME ) == pdTRUE)
        {
            /* The interrupt has fired: (re)start the processing sequence here. */
        }
    }
}
For this purpose you should use a queue and use the queue peek function to read your volatile data.
I use this because I have a real-time timer, and this way I make the time available to all my tasks without any blocking.
Here is how it goes:
Declare the queue:
xQueueHandle RTC_Time_Queue;
Create the queue of 1 element:
RTC_Time_Queue = xQueueCreate( 1, sizeof(your volatile struct) );
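Overwrite the queue every time your interrupt occurs (xHigherPriorityTaskWoken is a BaseType_t local to the ISR; the FromISR call takes it as its third argument):
xQueueOverwriteFromISR(RTC_Time_Queue, (void*) &time, &xHigherPriorityTaskWoken);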
And from other task peek the queue:
xQueuePeek(RTC_GetReadQueue(), (void*) &TheTime, 0);
The 0 at the end of xQueuePeek means you don't want to wait if the queue is empty. Peeking does not remove the value from the queue, so it will be present every time you peek and the code will never block.
You should also avoid having a variable that is accessed from both the ISR and the RTOS code, as you may get unexpected corruption.
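The flag-check approach mentioned in the question can be sketched in portable C as a generation counter: the interrupt bumps the counter on every update, and the task restarts its sequence whenever the counter has moved since the pass began. This is only a minimal sketch under stated assumptions, not FreeRTOS API: every name below (generation, sensor_a, sensor_b, fake_isr, derive_value, demo_hook) is hypothetical, C11 atomics stand in for the volatile variables, the interrupt is assumed to run to completion before the task resumes (single core), and the hook parameter exists only so the demo can inject an "interrupt" in the middle of a pass.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared state: written by the "ISR", read by the task. */
static atomic_uint generation; /* bumped once per interrupt */
static atomic_int sensor_a;    /* stand-ins for the volatile variables */
static atomic_int sensor_b;

/* Stand-in for the interrupt handler: update the data, then bump the counter. */
void fake_isr(int a, int b) {
    atomic_store(&sensor_a, a);
    atomic_store(&sensor_b, b);
    atomic_fetch_add(&generation, 1u);
}

/* Demo-only hook type, called between steps so a test can inject an interrupt. */
typedef void (*step_hook_t)(int step);

/* Runs the multi-step routine and restarts it from the top whenever the
 * generation counter changed during the pass. Returns a value derived from
 * one consistent (a, b) pair; *restarts counts the abandoned passes. */
int derive_value(step_hook_t hook, unsigned *restarts) {
    assert(restarts != NULL);
    for (;;) {
        unsigned seen = atomic_load(&generation);

        int a = atomic_load(&sensor_a);                 /* step 1 */
        if (hook) hook(1);
        if (atomic_load(&generation) != seen) { ++*restarts; continue; }

        int b = atomic_load(&sensor_b);                 /* step 2 */
        if (hook) hook(2);
        if (atomic_load(&generation) != seen) { ++*restarts; continue; }

        return a * 10 + b; /* derived from a consistent snapshot */
    }
}

/* Demo hook: fires one interrupt in the middle of the first pass only. */
static int demo_fired;
void demo_hook(int step) {
    if (step == 1 && !demo_fired) {
        demo_fired = 1;
        fake_isr(3, 4);
    }
}
```

Calling fake_isr(1, 2) and then derive_value(demo_hook, &restarts) abandons the first pass, because the injected interrupt changes the data after a was read, and the second pass computes its result from the updated pair. In a real task the checks would sit between the actual processing steps and no hook would be needed.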

How to terminate a long running isolate #2

I am trying to understand how I should port my Java chess engine to Dart.
I have understood that I should use an isolate to run my engine in parallel with the GUI, but how can I force the engine to terminate the search?
In Java I just set a boolean that was shared between the engine thread and the GUI thread.
Answer I got:
You should send a message to the isolate, telling it to stop. You can simply do something like:
port.send('STOP');
My request
Thanks for the clarification. What I don't understand is this: if the chess engine isolate is busy because of a port.send('THINK') command, how can it respond to a port.send('STOP') command?
Each isolate is single-threaded. As long as your program is running nobody else will have the means to interfere with your execution.
If you want to be able to react to outside events (including messages from other isolates) you need to split your long running execution into smaller parts. A chess-engine probably has already some state to know where to look for the next move (assuming it's built with something like A*). In this case you could just periodically interrupt your execution and resume after a minimal timeout.
Example:
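import 'dart:async'; // for Timer
var state;
var stopwatch = new Stopwatch()..start();
void longRunning() {
  while (true) {
    doSomeWorkThatUpdatesTheState();
    if (stopwatch.elapsedMilliseconds > 200) {
      // Yield to the event loop so pending messages can be handled,
      // then resume the work in a later event.
      stopwatch.reset();
      Timer.run(longRunning);
      return;
    }
  }
}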
The new API will contain a
isolate.kill(loopForever ? Isolate.IMMEDIATE : Isolate.AS_EVENT);
See https://code.google.com/p/dart/issues/detail?id=21189#c4 for a full example.

trouble reading from __global memory after atom_inc in OpenCL

OpenCL doesn't have a global barrier that will stop all threads, so I'm trying to create a workaround with the following code:
void barrier(__global uint* scratch) {
    uint nThreads = get_global_size(0);
    atom_inc(scratch);
    /* this loop never terminates */
    while(scratch[0] < nThreads) {
        continue;
    }
}
The idea is that each thread loops until all of them have incremented that one piece of memory.
However, the value read from scratch[0] never changes for a thread once it has been read, and the thread loops forever. I know the value is being incremented because it is correct when I read it back on the host.
Is the global memory being locally cached? What's going on here?
Found the problem: the order in which work groups are executed is implementation-defined. This means that some threads might start only after others have finished.
In the code I gave, the work groups that are started first will loop forever waiting for the others to hit the 'barrier'. And the work groups that would be started later never start, because they're waiting for the first ones to finish.
If the implementation (I'm on a Radeon 5750, using Stream SDK 2.2) executed all work groups concurrently, it probably wouldn't be an issue. But that's not the case for my setup.
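The scheduling deadlock described above can be illustrated with a small host-side simulation in plain C. This is a toy model under stated assumptions, not OpenCL code: the function name spin_barrier_completes and the idea of a fixed number of "resident slots" (work groups the device can run at once) are hypothetical simplifications, and arrivals are counted per work group rather than per thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the spin barrier from the question.
 * total_groups:   work groups in the launch.
 * resident_slots: how many groups the device can run at the same time
 *                 (an assumed, simplified scheduling model).
 * Returns true if every group eventually gets past the barrier. */
bool spin_barrier_completes(unsigned total_groups, unsigned resident_slots) {
    assert(resident_slots > 0);
    unsigned arrived = 0;            /* models the atom_inc counter */
    unsigned pending = total_groups; /* groups that have not started yet */
    unsigned resident = 0;           /* groups currently holding a slot */
    unsigned finished = 0;

    for (;;) {
        bool progress = false;

        /* Scheduler: start pending groups while slots are free.
         * A started group increments the counter, then spins. */
        while (pending > 0 && resident < resident_slots) {
            --pending;
            ++resident;
            ++arrived;
            progress = true;
        }

        /* Spinning groups leave the barrier only once everyone has arrived. */
        if (arrived == total_groups && resident > 0) {
            finished += resident;
            resident = 0;
            progress = true;
        }

        if (finished == total_groups) return true;
        if (!progress) return false; /* spinners hold every slot: deadlock */
    }
}
```

In this model a launch of 4 groups on 4 slots completes, while 8 groups on 4 slots does not: the first 4 groups spin forever and the remaining 4 never get a slot, which matches the behaviour described in the answer.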
