I have a question about a custom DispatchQueue.
I created a queue and use it as the delegate queue for the captureOutput: method. Here's a code snippet:
//At the file header
private let videoQueue = DispatchQueue(label: "videoQueue")
//setting queue to AVCaptureVideoDataOutput
dataOutput.setSampleBufferDelegate(self, queue: videoQueue)
After I get a frame, I run some expensive processing on it, and I do that for every frame.
When I launch the app, this processing takes about 17 ms per frame, which gives me around 46-47 fps. So far so good.
But after some time (around 10-15 seconds), the processing starts taking more and more time, and within 1-2 minutes I end up at 35-36 fps, with each frame taking around 20-25 ms instead of 17 ms.
Unfortunately, I can't share the processing code because there's a lot of it, but at least Xcode tells me I don't have any memory leaks.
I know that a manually created DispatchQueue doesn't really run on its own: all the tasks I put on it eventually end up on iOS's shared thread pool (the background, utility, user-interactive, etc. QoS classes). To me it looks like videoQueue loses priority after some period of time.
If my guess is right, is there any way to influence that? The performance of this queue is crucial, and I want to give it the highest priority all the time.
If I'm not right, I would very much appreciate it if someone could point me toward what to investigate. Thanks in advance!
First, I would suspect other parts of your code, and probably some kind of memory accumulation. I would expect you're either reprocessing some portion of the same data over and over again, or you're doing a lot of memory allocation/deallocation (which can lead to memory fragmentation). Lots of memory issues don't show up as "leaks" (because they're not leaks). The way to explore this is with Instruments.
That said, you probably don't want to run this at the default QoS. You shouldn't think of QoS as "priority." It's more complicated than that (which is why it's called "quality-of-service," not "priority"). You should assign work to a queue whose QoS matches how it impacts the user. In this case, it looks like you are updating the UI in real time. That matches the .userInteractive QoS:
private let videoQueue = DispatchQueue(label: "videoQueue", qos: .userInteractive)
This may improve things, but I suspect other problems in your code that Instruments will help you reveal.
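To narrow down where the time is going, it can help to wrap the per-frame work in signposts so Instruments can plot its duration over time. A minimal sketch, assuming the delegate is a ViewController and that processFrame(_:) and the log subsystem name stand in for your own code (none of these names are from the question):

import AVFoundation
import os.signpost

// Hypothetical log identifiers; intervals show up in Instruments' os_signpost track.
private let signpostLog = OSLog(subsystem: "com.example.camera", category: "VideoProcessing")

extension ViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        let signpostID = OSSignpostID(log: signpostLog)
        os_signpost(.begin, log: signpostLog, name: "processFrame", signpostID: signpostID)
        processFrame(sampleBuffer)   // stand-in for the expensive per-frame work
        os_signpost(.end, log: signpostLog, name: "processFrame", signpostID: signpostID)
    }
}

If the signpost intervals themselves get longer, the slowdown is inside the per-frame work; if they stay flat while the frame rate drops, frames are backing up before they reach it.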
Related
How should I implement a lock/unlock sequence with compare-and-swap in a Metal compute shader?
I’ve tested this sample code but it does not seem to work. For some reason, the threads are not detecting that the lock was released.
Here is a brief explanation of the code below:
The depthFlag is an array of atomic_bools. In this simple example, I try to take a lock by compare-and-swapping the contents of depthFlag[1]. I then go ahead and do my operation, and once the operation is done, I unlock.
As it stands, only one thread manages to do the locking, the work, and the unlocking; the rest of the threads get stuck in the while loop and never leave it. I expected another thread to detect the unlock and go through the same sequence.
What am I doing wrong? My knowledge on CAS is limited, so I appreciate any tips.
kernel void testFunction(device float *depthBuffer [[buffer(4)]],
                         device atomic_bool *depthFlag [[buffer(5)]],
                         uint index [[thread_position_in_grid]]) {
    // lock
    bool expected = false;
    while (!atomic_compare_exchange_weak_explicit(&depthFlag[1], &expected, true,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        // wait
        expected = false;
    }

    // Do my operation here

    // unlock
    atomic_store_explicit(&depthFlag[1], false, memory_order_relaxed);

    // barrier
}
You essentially can't use the locking programming model for GPU concurrency. For one, the relaxed memory order model (the only one available) is not suitable for this; for another, you can't guarantee that other threads will make progress between your atomic operations. Your code must always be able to make progress, regardless of what the other threads are doing.
My recommendation is that you use something like the following model instead:
1. Read an atomic value to check whether another thread has already completed the operation in question.
2. If no other thread has done it yet, perform the operation. (But don't cause any side effects, i.e. don't write to device memory.)
3. Perform an atomic operation to indicate your thread has completed the operation, while checking whether another thread got there first (e.g. compare-and-swap a boolean, but incrementing a counter also works).
4. If another thread got there first, don't perform the side effects.
5. If your thread "won" and no other thread registered completion, perform your operation's side effects, e.g. do whatever you need to do to write out the result.
This works well if there's not much competition, and if the result does not vary depending on which thread performs the operation.
The occasional discarded work should not matter. If there is significant competition, use thread groups; within a thread group, the threads can coordinate which thread will perform which operation. You may still end up with wasted computation from competition between groups. If this is a problem, you may need to change your approach more fundamentally.
If the results of the operation are not deterministic, and the threads all need to proceed using the same result, you will need to change your approach. For example, split your kernels up so any computation which depends on the result of the operation in question runs in a sequentially queued kernel.
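As a rough illustration of that model, here is a minimal Metal sketch (not taken from the answer above): the buffer indices, the done flag, and the placeholder computation are all invented for the example, and it reuses the question's atomic_bool flag type.

kernel void writeOnce(device float *result     [[buffer(0)]],
                      device atomic_bool *done [[buffer(1)]],
                      uint index [[thread_position_in_grid]]) {
    // 1. Cheap early-out: has another thread already published a result?
    if (atomic_load_explicit(done, memory_order_relaxed)) {
        return;
    }

    // 2. Do the work privately; no device-memory side effects yet.
    float value = float(index) * 0.5f;   // stand-in for the real operation

    // 3. Try to claim completion: exactly one thread flips false -> true.
    bool expected = false;
    while (!atomic_compare_exchange_weak_explicit(done, &expected, true,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        if (expected) {
            return;            // another thread got there first; discard our work
        }
        expected = false;      // spurious failure; try the exchange again
    }

    // 4. This thread "won", so it alone performs the side effect.
    *result = value;
}

Unlike the spin lock in the question, this loop never waits on another thread's progress: each thread either claims completion or observes that some other thread already has, and the losers simply discard their work.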
I have a question about memory usage with GCD. Here's a very simple demo:
class ViewController: UIViewController {
    override func viewDidLoad() {
        super.viewDidLoad()
        for i in 0..<500000 {
            DispatchQueue.global().async {
                print(i)
            }
        }
    }
}
When I run this demo in the simulator, the memory usage goes up to ~17 MB and then drops to ~15 MB in the end. However, if I comment out the dispatch code and keep only the print() line, the memory usage is only ~10 MB. The size of the increase varies whenever I change the loop count.
Is there a memory leak? I tried Leaks and didn't find anything.
When looking at memory usage, one has to run the cycle several times before concluding that there is a “leak”. It could just be caching.
In this particular case, you might see memory growth after the first iteration, but it will not continue to grow on subsequent iterations. If it were a true leak, the post-peak baseline would continue to creep up, but it does not: the baseline after the second peak is basically the same as after the first peak.
As an aside, the memory characteristics here are a result of the thread explosion (which you should always avoid). Consider:
for i in 0 ..< 500_000 {
    DispatchQueue.global().async {
        print(i)
    }
}
That dispatches half a million work items to a queue that can only support 64 worker threads at a time. That is “thread-explosion” exceeding the worker thread pool.
You should instead do the following, which constrains the degree of concurrency with concurrentPerform:
DispatchQueue.global().async {
    DispatchQueue.concurrentPerform(iterations: 500_000) { i in
        print(i)
    }
}
That achieves the same thing, but limits the degree of concurrency to the number of available CPU cores. That avoids many problems (specifically, it avoids exhausting the limited worker thread pool, which could deadlock other systems), and it avoids the spike in memory, too. (See the above graph, where there is no spike after either of the latter two concurrentPerform runs.)
So, while the thread-explosion scenario does not actually leak, it should be avoided at all costs because of both the memory spike and the potential deadlock risks.
Memory used is not memory leaked.
There is a certain amount of overhead associated with certain OS services. I remember answering a similar question posed by someone using a WebView. There are global caches. There is simply code that has to be paged in from disk to memory (which is big with WebKit), and once the code is paged in, it's incredibly unlikely to ever be paged out.
I've not looked at the libdispatch source code lately, but GCD maintains one or more pools of threads that it uses to execute the blocks you enqueue. Every one of those threads has a stack. The default thread stack size on macOS is 8MB. I don't know about iOS's default thread stack size, but it for sure has one (and I'd bet one stiff drink that it's 8MB). Once GCD creates those threads, why would it shut them down? Especially when you've shown the OS that you're going to rapidly queue 500K operations?
The OS is optimizing for performance/speed at the expense of memory use. You don't get to control that. It's not a "leak" which is memory that's been allocated but has no live references to it. This memory surely has live references to it, they're just not under your control. If you want more (albeit different) visibility into memory usage, look into the vmmap command. It can (and will) show you things that might surprise you.
I'm currently using tf-slim to create and read TFRecord files into my models, and this method comes with automatic TensorBoard visualizations showing:
The tf.train.batch batch/fraction_of_32_full visualization, which is consistently near 0. I believe this depends on how fast the upstream dequeue operation feeds the tf.train.batch FIFO queue its tensors.
The parallel reader parallel_read/filenames/fraction_of_32_full and parallel_read/fraction_of_5394_full visualizations, which are always at 1.0. I believe this op is what extracts the tensors from the TFRecords and puts them into a queue ready for dequeuing.
My question is this: Is my dequeuing operation too slow and causing a bottleneck in my model evaluation?
Why is it that "fraction_of_32" appears although I'm using a batch size of 256? Also, is a queue fraction value of 1.0 the ideal case, since it would mean the data is always ready for the GPU to work on?
If my dequeueing operation is too slow, how do I actually improve the dequeueing speed? I've checked the source code for tf-slim and it seems that the decoder is embedded within the function I'm using, and I'm not sure if there's an external way to work around it.
I had a similar problem. If batch/fraction_of_32_full gets close to zero, it means that you are consuming data faster than you are producing it.
32 is the default capacity of the queue, regardless of your batch size. It is wise to set it at least as large as the batch size.
This is the relevant doc: https://www.tensorflow.org/api_docs/python/tf/train/batch
Setting num_threads = multiprocessing.cpu_count() and capacity = batch_size can help keep the queue full.
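As a rough sketch of that suggestion for the TF 1.x queue-based pipeline (the placeholder tensors and shapes stand in for the tf-slim reader's output and are not from the question's code):

import multiprocessing
import tensorflow as tf

batch_size = 256

# Placeholder per-example tensors; in the real pipeline these come from the
# tf-slim parallel reader and decoder.
image = tf.random_uniform([224, 224, 3])
label = tf.constant(0, dtype=tf.int64)

# More filler threads and a capacity tied to the batch size help keep the
# batching queue from running near empty.
images, labels = tf.train.batch(
    [image, label],
    batch_size=batch_size,
    num_threads=multiprocessing.cpu_count(),
    capacity=batch_size)

print(images.shape, labels.shape)   # (256, 224, 224, 3) and (256,)

If the capacity changes, the queue-fullness summary name should change with it (e.g. fraction_of_256_full instead of fraction_of_32_full).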
Please consider the following statement:
DispatchQueue.global(qos: .userInitiated).asyncAfter(deadline: .now() + .milliseconds(500), qos: .utility, flags: .noQoS) {
    print("What is my QOS?")
}
Notice how many of the parameters refer to the quality of service. How can a mere mortal possibly sort out the permutations?
Generally you shouldn't try to sort out all those permutations. In most cases, messing around too much with QoS is a recipe for trouble. But there are fairly simple rules.
Queues have a QoS, and they can assign that QoS to blocks that request to inherit it.
This particular block is explicitly requesting a lower priority, but then says "ignore my QoS request." As a rule, don't do that. The only reason I know of for doing that is if you're interacting with some legacy API that doesn't understand QoS. (I've never encountered this myself, and it's hard to imagine it coming up in user-level code.)
A more interesting question IMO (and one that comes up much more often in real code) is this one:
DispatchQueue.global(qos: .utility).async(qos: .userInitiated) {}
What is the priority of this block? The answer is .userInitiated, and the block will "loan" its priority to the queue until it finishes executing. So for some period of time, this entire queue will become .userInitiated. This is to prevent priority inversion (where a high-priority task blocks waiting for a low-priority task).
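A small, illustrative experiment (not from the answer above) for observing this: submit a .userInitiated work item to a .utility global queue and inspect the QoS class of the thread that actually runs it.

import Dispatch
import Foundation

let queue = DispatchQueue.global(qos: .utility)

queue.async(qos: .userInitiated) {
    // qos_class_self() reports the QoS class of the thread running this block;
    // compare it with the queue's nominal QoS to see how the system resolved the request.
    print("queue QoS:", queue.qos)
    print("running at userInitiated?", qos_class_self() == QOS_CLASS_USER_INITIATED)
}

Thread.sleep(forTimeInterval: 1)   // keep this demo process alive long enough to print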
This is all discussed in depth in Concurrent Programming With GCD in Swift 3, which is a must-watch for anyone interested in non-trivial GCD.
The epoll_wait, select, and poll functions all provide a timeout. However, with epoll it has a coarse resolution of 1 ms; select and ppoll are the only ones providing a sub-millisecond timeout.
That would mean doing other things at 1ms intervals at best. I could do a lot of other things within 1ms on a modern CPU.
So to do other things more often than every millisecond, I actually have to provide a timeout of zero (essentially disabling it), and I'd probably add my own usleep somewhere in the main loop to stop it from chewing up too much CPU.
So the question is: why is the timeout in milliseconds, when I would think there is clearly a case for a higher-resolution timeout?
Since you are on Linux, instead of providing a zero timeout value and manually usleeping in the loop body, you could simply use the timerfd API. This essentially lets you create a timer (with a resolution finer than 1 ms) associated with a file descriptor, which you can add to the set of monitored descriptors.
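A minimal sketch of that approach (error handling omitted; the 500 µs period and the 8-event buffer are arbitrary choices for the example):

#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    int epfd = epoll_create1(0);
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);

    /* Fire every 500 microseconds. */
    struct itimerspec its = {
        .it_interval = { .tv_sec = 0, .tv_nsec = 500 * 1000 },
        .it_value    = { .tv_sec = 0, .tv_nsec = 500 * 1000 },
    };
    timerfd_settime(tfd, 0, &its, NULL);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = tfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

    for (;;) {
        struct epoll_event events[8];
        int n = epoll_wait(epfd, events, 8, -1);   /* block until something is ready */
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == tfd) {
                uint64_t expirations;
                read(tfd, &expirations, sizeof expirations);  /* acknowledge the timer */
                /* ... do the sub-millisecond periodic work here ... */
            }
        }
    }
}

Because the timer is just another file descriptor, it coexists with whatever sockets or pipes the loop is already monitoring.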
The epoll_wait interface simply inherited its millisecond timeout from poll. While it doesn't make sense for poll to wait for less than a millisecond (because of the overhead of adding the calling thread to all the wait sets), it does make sense for epoll_wait: a call to epoll_wait never requires putting the calling thread onto more than one wait set, the calling overhead is very low, and on rare occasions it could make sense to block for less than a millisecond.
I'd recommend just using a timing thread. Most of what you would want to do can just be done in that timing thread, so you won't need to break out of epoll_wait. If you do need to make a thread return from epoll_wait, just send a byte to a pipe that thread is polling and the wait will terminate.
In Linux 5.11, an epoll_pwait2 API was added, which takes a struct timespec as its timeout. This means you can now wait with nanosecond precision.
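A tiny sketch of that call, assuming a Linux 5.11+ kernel and a glibc new enough to provide the wrapper (otherwise it has to go through syscall(2)); the empty interest set and the 250 µs timeout are just for illustration:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/epoll.h>
#include <time.h>

int main(void) {
    int epfd = epoll_create1(0);
    struct epoll_event events[8];

    /* Wait at most 250 microseconds for activity on the (empty) interest set. */
    struct timespec timeout = { .tv_sec = 0, .tv_nsec = 250 * 1000 };
    int n = epoll_pwait2(epfd, events, 8, &timeout, NULL);
    printf("ready descriptors: %d\n", n);
    return 0;
}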