Here's a very simple demo:
import UIKit

class ViewController: UIViewController {
    override func viewDidLoad() {
        super.viewDidLoad()

        for i in 0..<500000 {
            DispatchQueue.global().async {
                print(i)
            }
        }
    }
}
When I run this demo in the simulator, the memory usage goes up to ~17 MB and then drops to ~15 MB at the end. However, if I comment out the dispatch code and keep only the print() line, the memory usage is only ~10 MB. The amount of the increase varies whenever I change the loop count.
Is there a memory leak? I tried Leaks and didn't find anything.
When looking at memory usage, one has to run the cycle several times before concluding that there is a “leak”. It could just be caching.
In this particular case, you might see memory growth after the first iteration, but it will not continue to grow on subsequent iterations. If it was a true leak, the post-peak baseline would continue to creep up. But it does not. Note that the baseline after the second peak is basically the same as after the first peak.
As an aside, the memory characteristics here are a result of the thread explosion (which you should always avoid). Consider:
for i in 0 ..< 500_000 {
    DispatchQueue.global().async {
        print(i)
    }
}
That dispatches half a million work items to a global queue whose pool can only support 64 worker threads at a time. That is “thread explosion”, exceeding the worker thread pool.
You should instead do the following, which constrains the degree of concurrency with concurrentPerform:
DispatchQueue.global().async {
    DispatchQueue.concurrentPerform(iterations: 500_000) { i in
        print(i)
    }
}
That achieves the same thing, but limits the degree of concurrency to the number of available CPU cores. That avoids many problems (specifically, it avoids exhausting the limited worker thread pool, which could deadlock other systems), and it avoids the spike in memory, too. (See the above graph, where there is no spike after either of the latter two concurrentPerform runs.)
So, while the thread-explosion scenario does not actually leak, it should be avoided at all costs because of both the memory spike and the potential deadlock risks.
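If, for some reason, you need to dispatch individual asynchronous work items rather than use concurrentPerform, one common way to constrain the degree of concurrency is a counting semaphore. This is only a sketch of that alternative (not part of the answer above); the concurrency limit here is simply the active processor count:

import Foundation

// Sketch: bound the number of in-flight work items with a DispatchSemaphore,
// so at most `maxConcurrent` blocks run at any one time.
let maxConcurrent = ProcessInfo.processInfo.activeProcessorCount
let semaphore = DispatchSemaphore(value: maxConcurrent)

DispatchQueue.global().async {
    for i in 0 ..< 500_000 {
        semaphore.wait()                  // wait for a free slot (blocks only this one dispatching thread)
        DispatchQueue.global().async {
            print(i)
            semaphore.signal()            // release the slot
        }
    }
}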
Memory used is not memory leaked.
There is a certain amount of overhead associated with certain OS services. I remember answering a similar question posed by someone using a WebView. There are global caches. There is simply code that has to be paged in from disk to memory (which is big with WebKit), and once the code is paged in, it's incredibly unlikely to ever be paged out.
I've not looked at the libdispatch source code lately, but GCD maintains one or more pools of threads that it uses to execute the blocks you enqueue. Every one of those threads has a stack. The default thread stack size on macOS is 8MB. I don't know about iOS's default thread stack size, but it for sure has one (and I'd bet one stiff drink that it's 8MB). Once GCD creates those threads, why would it shut them down? Especially when you've shown the OS that you're going to rapidly queue 500K operations?
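If you want to see those stack sizes for yourself, here is a minimal sketch using Foundation's Thread API (the exact numbers vary by platform and by how the thread was created):

import Foundation

// Print the stack size, in bytes, of the main thread and of a GCD worker thread.
print("Main thread stack size:", Thread.main.stackSize)

DispatchQueue.global().async {
    print("Worker thread stack size:", Thread.current.stackSize)
}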
The OS is optimizing for performance/speed at the expense of memory use. You don't get to control that. It's not a "leak", which is memory that's been allocated but has no live references to it. This memory surely has live references to it; they're just not under your control. If you want more (albeit different) visibility into memory usage, look into the vmmap command. It can (and will) show you things that might surprise you.
I have two flavors of the same ALGOL code; it's a one-to-one replacement:
One version uses RESIZE (to return the memory to the library pool).
The other uses DEALLOCATE (to return it to the system).
The one that uses DEALLOCATE consumes more CPU time and, in turn, more processor usage.
Why does DEALLOCATE consume more CPU, and how can I mitigate this?
Burroughs/Unisys/A-Series, I presume?
It's been a few years since I used one of those systems, but I presume this hasn't changed too much.
RESIZE changes the size of your object (let's say it's an array to simplify life). When the original array was created, it was pulled from an ASD pool. There are several pools of various sizes. The actual memory assigned to your program may not have been the exact size you requested, although your descriptor will be "doctored" so that it appears to be exactly that size (so various intrinsic calls work properly). If you RESIZE within the actual size of that memory item, then only that doctoring has to be updated. Fast, easy.
Otherwise, it actually calls the MCP procedure EXPANDAROW instead of RESIZEANDDEALLOCATE. Usually, but not always, this can find additional memory without having to return the original memory to the ASD pools.
In the second case, DEALLOCATE, the MCP procedure RESIZEANDDEALLOCATE is indeed called: the memory is returned to the ASD pools, dope vectors are deallocated, memory is cleared, memory link words are updated, and the ASD pools are updated. Your program pays for all of that (just as it paid for the original allocation).
Your question doesn't have enough background to answer the mitigation part. Maybe you don't need to do this at all: why are you calling RESIZE/DEALLOCATE anyway? This generally happens on BLOCKEXIT already.
I have a question about custom DispatchQueue.
I created a queue and I use it as a queue for captureOutput: method. Here's a code snippet:
// At the top of the file
private let videoQueue = DispatchQueue(label: "videoQueue")

// Setting the queue on the AVCaptureVideoDataOutput
dataOutput.setSampleBufferDelegate(self, queue: videoQueue)
After I get a frame, I run an expensive computation on it, and I do that for every frame.
When I launch the app, this expensive computation takes about 17 ms, and thanks to that I get around 46-47 fps. So far so good.
But after some time (around 10-15 seconds), the computation starts taking more and more time, and within 1-2 minutes I end up with 35-36 fps, with around 20-25 ms per frame instead of 17 ms.
Unfortunately, I can't provide the code of the expensive computation because there's a lot of it, but at least Xcode tells me that I don't have any memory leaks.
I know that a manually created DispatchQueue doesn't really run on its own, because all the tasks I put on it eventually end up in the iOS default thread pool (I'm talking about BACKGROUND, UTILITY, USER_INTERACTIVE, etc.). And to me it looks like videoQueue loses priority after some period of time.
If my guess is right, is there any way to influence that? The performance of my DispatchQueue is crucial, and I want to give it the highest priority all the time.
If I'm not right, I would very much appreciate it if someone could give me a direction to investigate. Thanks in advance!
First, I would suspect other parts of your code, and probably some kind of memory accumulation. I would expect you're either reprocessing some portion of the same data over and over again, or you're doing a lot of memory allocation/deallocation (which can lead to memory fragmentation). Lots of memory issues don't show up as "leaks" (because they're not leaks). The way to explore this is with Instruments.
That said, you probably don't want to run this at the default QoS. You shouldn't think of QoS as "priority." It's more complicated than that (which is why it's called "quality of service," not "priority"). You should assign work to a queue whose QoS matches how it impacts the user. In this case, it looks like you are updating the UI in real time. That matches the .userInteractive QoS:
private let videoQueue = DispatchQueue(label: "videoQueue", qos: .userInteractive)
This may improve things, but I suspect other problems in your code that Instruments will help you reveal.
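On the memory-accumulation point, one pattern worth checking is whether autoreleased temporaries are piling up between frames. A minimal sketch (the class name is hypothetical; the delegate method is the standard AVCaptureVideoDataOutputSampleBufferDelegate callback) that wraps the per-frame work in an autoreleasepool:

import AVFoundation

// Sketch: drain temporary autoreleased objects every frame rather than letting
// them accumulate while the capture session runs.
class CameraController: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        autoreleasepool {
            // ... expensive per-frame processing here ...
        }
    }
}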
There are several questions asking the exact opposite of this, and I don't understand how/why running my app in release mode works, but crashes with an EXC_BAD_ACCESS error in debug mode.
The method that crashes is recursive, and extremely (!) substantial; as long as there aren't too many recursive calls, it works fine in both debug mode (fewer than ~1000 calls on an iPhone XS, unlimited in the simulator) and release mode (unlimited?).
I'm at a loss as to where to begin figuring out how to debug this, and I'm wondering if there is some kind of soft limit on recursion due to the stack trace or some other unknown. Could it even be down to the cable, since I'm able to run in the simulator without problems?
I should note that Xcode reports the crashes at seemingly random spots, such as property getters that I know are instantiated and valid, in case that helps.
I'm going to go refactor it down into smaller chunks but thought I would post here in case anybody had any ideas about what might be causing this.
See:
https://gist.github.com/ThomasHaz/3aa89cc9b7bda6d98618449c9d6ea1e1
You’re running out of stack memory.
Consider this very simple recursive function to add up integers between 1 and n:
func sum(to n: Int) -> Int {
    guard n > 0 else { return 0 }
    return n + sum(to: n - 1)
}
You’ll find that if you try, for example, summing the numbers between 1 and 100,000, the app will crash in both release and debug builds, but it will simply crash sooner in debug builds. I suspect there is just more diagnostic information pushed onto the stack in debug builds, causing it to run out of stack space sooner. In release builds of the above, the stack pointer advanced by 0x20 bytes on each recursive call, whereas debug builds advanced by 0x80 bytes each time. And if you’re doing anything material in your recursive function, these increments may be larger and the crash may occur with even fewer recursive calls.

But the stack size on my device (iPhone Xs Max) and on my simulator (Thread.current.stackSize) is 524,288 bytes, and that corresponds to the amount by which the stack pointer advances and the maximum number of recursive calls I was able to achieve. If your device is crashing sooner than the simulator, perhaps your device has less RAM and has therefore been allotted a smaller stackSize.
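(To make the arithmetic explicit: 524,288 bytes ÷ 0x80 (128) bytes per call allows roughly 4,096 recursive calls of this trivial function in a debug build, whereas 524,288 ÷ 0x20 (32) allows roughly 16,384 in a release build.)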
Bottom line, you might want to refactor your algorithm to a non-recursive one if you want to enjoy fast performance but don’t want to incur the memory overhead of a huge call stack. As an aside, the non-recursive rendition of the above was an order of magnitude faster than the recursive rendition.
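For reference, a minimal sketch of a non-recursive rendition of the sum above (the function name is mine; the original answer doesn't show its exact version):

// Iterative rendition: constant stack usage regardless of n.
func iterativeSum(to n: Int) -> Int {
    var total = 0
    for value in stride(from: n, through: 1, by: -1) {
        total += value
    }
    return total
}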
Alternatively, you can dispatch your recursive calls asynchronously, which eliminates the stack size issues, but introduces GCD overhead. An asynchronous rendition of the above was two to three orders of magnitude slower than the simple recursive rendition and, obviously, yet another order of magnitude slower than the iterative rendition.
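A rough sketch of what such an asynchronous rendition might look like (the completion-handler shape and names are mine, not from the original answer):

import Foundation

// Each recursive step is dispatched as a separate work item, so the call stack
// unwinds between steps and stack depth is no longer an issue; the trade-off is
// GCD overhead on every iteration.
func asyncSum(to n: Int,
              runningTotal: Int = 0,
              queue: DispatchQueue = .global(),
              completion: @escaping (Int) -> Void) {
    guard n > 0 else {
        completion(runningTotal)
        return
    }
    queue.async {
        asyncSum(to: n - 1, runningTotal: runningTotal + n, queue: queue, completion: completion)
    }
}

// Example use:
// asyncSum(to: 100_000) { total in print(total) }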
Admittedly, my simple sum method is so trivial that the overhead of the recursive calls starts to represent a significant portion of the overall computation time, and given that your routine would appear to be more complicated, I suspect the difference will be less stark. Nonetheless, if you want to avoid running out of stack space, I’d simply suggest pursuing a non-recursive rendition.
I’d refer you to the following WWDC videos:
WWDC 2012 iOS App Performance: Memory acknowledges the different types of memory, including stack memory (but doesn’t go into the latter in any great detail);
WWDC 2018 iOS Memory Deep Dive is a slightly more contemporary version of the above video; and
WWDC 2015 Profiling in Depth touches upon tail-recursion optimization.
It’s worth noting that deeply recursive routines don’t always have to consume a large stack. Notably, sometimes we can employ tail recursion, where our recursive call is the very last call that is made. For example, my snippet above does not employ a tail call because it’s adding n to the value returned by the recursive call. But we can refactor it to pass the running total, thereby ensuring that the recursive call is a true “tail call”:
func sum(to n: Int, previousTotal: Int = 0) -> Int {
    guard n > 0 else { return previousTotal }
    return sum(to: n - 1, previousTotal: previousTotal + n)
}
Release builds are smart enough to optimize this tail recursion (through a process called “tail call optimization”, TCO, also known as “tail call elimination”), mitigating the stack growth for the recursive calls. WWDC 2015 Profiling in Depth, while on a different topic (the time profiler), shows exactly what’s happening when tail calls are optimized.
The net effect is that if your recursive routine is employing tail calls, release builds can use tail call elimination to mitigate stack memory issues, but debug (non-optimized) builds will not do this.
EXC_BAD_ACCESS usually means that you are trying to access an object that is no longer in memory or probably was not properly initialized.
Check in your code whether you are accessing your Dictionary variable after it has somehow been removed. Is your variable properly initialized? You might have declared the variable but never initialized it before accessing it.
There could be a ton of reasons, and I can't say much without seeing any code.
Try turning on zombie objects (NSZombie); this might provide you more debug information. Refer to How to enable NSZombie in Xcode?
If you would like to know exactly where and when the error is occurring, you could check for memory leaks using Instruments. This might be helpful: http://www.raywenderlich.com/2696/instruments-tutorial-for-ios-how-to-debug-memory-leaks
What does the allocation under "VM: Dispatch continuations" signify?
(http://i.stack.imgur.com/4kuqz.png)
@InkGolem is on the right lines. This is a cache for dispatch blocks inside GCD.
@AbhiBeckert is off by a factor of 1,000: 16 MB is 2 million 64-bit pointers, not 2 billion.
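(For the arithmetic: 16 MB is 16,777,216 bytes; at 8 bytes per 64-bit pointer, that's 2,097,152, i.e. roughly 2 million pointers.)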
This cache is allocated on a per-thread basis, and you're just seeing the allocation size of this cache, not what's actually in use. 16 MB is well within range, if you're doing lots of dispatching onto background threads (and since you're using RAC, I'm guessing that you are).
Don't worry about it, basically.
From what I understand, continuations are a style of function-pointer passing so that the process knows what to execute next; in your case, I'm assuming those would be dispatch blocks from GCD. I'm assuming that the VM has a bunch of these that it uses over time, and that's what you're seeing in Instruments. Then again, I'm not an expert on threading, and I could be totally off in left field.
I am implementing a spiking neural network using the CUDA library and am really unsure of how to proceed with regard to the following things:
Allocating memory (cudaMalloc) to many different arrays. Up until now, simply using cudaMalloc 'by hand' has sufficed, as I have not had to make more than 10 or so arrays. However, I now need to make pointers to, and allocate memory for, thousands of arrays.
How to decide how much memory to allocate to each of those arrays. The arrays have a height of 3 (1 row for the postsynaptic neuron ids, 1 row for the number of the synapse on the postsynaptic neuron, and 1 row for the efficacy of that synapse), but they have an undetermined length which changes over time with the number of outgoing synapses.
I have heard that dynamic memory allocation in CUDA is very slow, so I toyed with the idea of allocating the maximum memory required for each array. However, the number of outgoing synapses per neuron varies from 100 to 10,000, so I thought this was infeasible, since I have on the order of 1,000 neurons.
If anyone could advise me on how to allocate memory for many arrays on the GPU, and/or how to code a fast dynamic memory allocation for the above tasks, I would be greatly appreciative.
Thanks in advance!
If you really want to do this, you can call cudaMalloc as many times as you want; however, it's probably not a good idea. Instead, try to figure out how to lay out the memory so that neighboring threads in a block will access neighboring elements of RAM whenever possible.
The reason this is likely to be problematic is that threads execute in groups of 32 at a time (a warp). NVidia's memory controller is quite smart, so if neighboring threads ask for neighboring bytes of RAM, it coalesces those loads into a single request that can be efficiently executed. In contrast, if each thread in a warp is accessing a random memory location, the entire warp must wait till 32 memory requests are completed. Furthermore, reads and writes to the card's memory happen a whole cache line at a time, so if the threads don't use all the RAM that was read before it gets evicted from the cache, memory bandwidth is wasted. If you don't optimize for coherent memory access within thread blocks, expect a 10x to 100x slowdown.
(side note: The above discussion is still applicable with post-G80 cards; the first generation of CUDA hardware (G80) was even pickier. It also required aligned memory requests if the programmer wanted the coalescing behavior.)