epoll_wait, select and poll functions all provide a timeout. However with epoll, it's at a large resolution of 1ms. Select & ppoll are the only one providing sub-millisecond timeout.
That would mean doing other things at 1ms intervals at best. I could do a lot of other things within 1ms on a modern CPU.
So to do other things more often than 1ms I actually have to provide a timeout of zero (essentially disabling it). And I'd probably add my own usleep somewhere in the main loop to stop it chewing up too much CPU.
So the question is, why is the timeout in milli's when I would think clearly there is a case for a higher resolution timeout.
Since you are on Linux, instead of providing a zero timeout value and manually usleeeping in the loop body, you could simply use the timerfd API. This essentially lets you create a timer (with a resolution finer than 1ms) associated with a file descriptor, which you can add to the set of monitored descriptors.
The epoll_wait interface just inherited a timeout measured in milliseconds from poll. While it doesn't make sense to poll for less than a millisecond, because of the overhead of adding the calling thread to all the wait sets, it does make sense for epoll_wait. A call to epoll_wait doesn't require ever putting the calling thread onto more than one wait set, the calling overhead is very low, and it could make sense, on very rare occasions, to block for less than a millisecond.
I'd recommend just using a timing thread. Most of what you would want to do can just be done in that timing thread, so you won't need to break out of epoll_wait. If you do need to make a thread return from epoll_wait, just send a byte to a pipe that thread is polling and the wait will terminate.
In Linux 5.11, an epoll_pwait2 API has been added, which uses a struct timespec as timeout. This means you can now wait using nanoseconds precision.
Related
How should I implement a lock/unlock sequence with Compare and Swap using a Metal compute shader.
I’ve tested this sample code but it does not seem to work. For some reason, the threads are not detecting that the lock was released.
Here is a brief explanation of the code below:
The depthFlag is an array of atomic_bools. In this simple example, I simply try to do a lock by comparing the content of depthFlag[1]. I then go ahead and do my operation and once the operation is done, I do an unlock.
As stated above, only one thread is able to do the locking/work/unlocking but the rest of the threads get stuck in the while loop. They NEVER leave. I expect another thread to detect the unlock and go through the sequence.
What am I doing wrong? My knowledge on CAS is limited, so I appreciate any tips.
kernel void testFunction(device float *depthBuffer[[buffer(4)]], device atomic_bool *depthFlag [[buffer(5)]], uint index[[thread_position_in_grid]]){
//lock
bool expected=false;
while(!atomic_compare_exchange_weak_explicit(&depthFlag[1],&expected,true,memory_order_relaxed,memory_order_relaxed)){
//wait
expected=false;
}
//Do my operation here
//unlock
atomic_store_explicit(&depthFlag[1], false, memory_order_relaxed);
//barrier
}
You essentially can't use the locking programming model for GPU concurrency. For one, the relaxed memory order model (the only one available) is not suitable for this; for another, you can't guarantee that other threads will make progress between your atomic operations. Your code must always be able to make progress, regardless of what the other threads are doing.
My recommendation is that you use something like the following model instead:
Read atomic value to check if another thread has already completed the operation in question.
If no other thread has done it yet, perform the operation. (But don't cause any side effects, i.e. don't write to device memory.)
Perform an atomic operation to indicate your thread has completed the operation while checking whether another thread got there first. (e.g. compare-and-swap a boolean, but increasing a counter also works)
If another thread got there first, don't perform side effects.
If your thread "won" and no other thread registered completion, perform your operation's side effects, e.g. do whatever you need to do to write out the result etc.
This works well if there's not much competition, and if the result does not vary depending on which thread performs the operation.
The occasional discarded work should not matter. If there is significant competition, use thread groups; within a thread group, the threads can coordinate which thread will perform which operation. You may still end up with wasted computation from competition between groups. If this is a problem, you may need to change your approach more fundamentally.
If the results of the operation are not deterministic, and the threads all need to proceed using the same result, you will need to change your approach. For example, split your kernels up so any computation which depends on the result of the operation in question runs in a sequentially queued kernel.
I have been using Reactor pretty extensively for a while now.
The biggest caveat I have had coming up multiple times is default request sizes / prefetch.
Take this simple code for example:
Mono.fromCallable(System::currentTimeMillis)
.repeat()
.delayElements(Duration.ofSeconds(1))
.take(5)
.doOnNext(n -> log.info(n.toString()))
.blockLast();
To the eye of someone who might have worked with other reactive libraries before, this piece of code
should log the current timestamp every second for five times.
What really happens is that the same timestamp is returned five times, because delayElements doesn't send one request upstream for every elapsed duration, it sends 32 requests upstream by default, replenishing the number of requested elements as they are consumed.
This wouldn't be a problem if the environment variable for overriding the default prefetch wasn't capped to minimum 8.
This means that if I want to write real reactive code like above, I have to set the prefetch to one in every transformation. That sucks.
Is there a better way?
Changing the Title in the hopes of being more accurate.
We have some code which runs several programs in succession by calling drawArrays() . The output textures from each stage are fed into the next and so on.
After the final call to draw, a call to readPixels() is made.
This call takes an enormous amount of time (for an output of < 1000 floats). I have measured a readPixels of that size in isolation which takes 1 or 2 ms. However in our case we see a delay of about 1500ms.
So we conjectured that the actual computation must have not started until we called readPixels(). To test this theory and to force the computation, we placed a call to gl.flush() after each gl.drawxx(). This made no difference.
So we replaced that with a call to gl.finish(). Again no difference. We finally replaced it with a call to getError(). Still no difference.
Can we conclude that gpu actually does not draw anything unless the framebuffer is read from? Can we force it to do so?
I'm currently using tf-slim to create and read tfrecord files into my models, and through this method there is an automatic tensorboard visualization available showing:
The tf.train.batch batch/fraction_of_32_full visualization, which is consistently near 0 value. I believe this should be dependent on how fast the dequeue operation gives the tf.train.batch FIFO queue its tensors.
The parallel reader parallel_read/filenames/fraction_of_32_full and paralell_read/fraction_of_5394_full visualizations, which are always at 1.0 value. I believe this op is what extracts the tensors from the tfrecords and put them into a queue ready for dequeuing.
My question is this: Is my dequeuing operation too slow and causing a bottleneck in my model evaluation?
Why is it that "fraction_of_32" appears although I'm using a batch size of 256? Also, is a queue fraction value of 1.0 the ideal case? Since it would mean the data is always ready for the GPU to work on.
If my dequeueing operation is too slow, how do I actually improve the dequeueing speed? I've checked the source code for tf-slim and it seems that the decoder is embedded within the function I'm using, and I'm not sure if there's an external way to work around it.
I had a similar problem. If batch/fraction_of_32_full gets close to zero, it means that you are consuming data faster than you are producing it.
32 is the default size of the queue, regardless of your batch size. It is wise to set it at least as large as the batch size.
This is the relevant doc: https://www.tensorflow.org/api_docs/python/tf/train/batch
Setting num_threads = multiprocessing.cpu_count(), and capacity = batch_size can help to keep the queue full.
Update: The while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S. even with -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_gloabl_variable_x ++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one and by being nice by giving the turn to the other thread. At the evaluation of the while(), if the other thread did not set its flag, the requesting thread can then enter the critical section safely. Now a subtle problem with this approach is that if the compiler re-orders the writes so that the write to turn executes before the write to flag. If this happens both threads will end up in the C.S. at the same time. This fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: since you have only two threads and 1-threadIdx.x works, then you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general this is a great example of why code like this is unsafe on GPUs and should not be used. I realize though that this is just an investigative experiment. In general CUDA GPUs do not—as you mention most processors do not—implement sequential consistency.
Original Answer
the variables turn and flag need to be volatile, otherwise the load of flag will not be repeated and the condition turn == 1-threadIdx.X will not be re-evaluated but instead will be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.