What does the allocation under "VM: Dispatch continuations" signify?
(http://i.stack.imgur.com/4kuqz.png)
#InkGolem is on the right lines. This is a cache for dispatch blocks inside GCD.
#AbhiBeckert is off by a factor of 1000. 16MB is 2 million 64-bit pointers, not 2 billion.
This cache is allocated on a per-thread basis, and you're just seeing the allocation size of this cache, not what's actually in use. 16 MB is well within range, if you're doing lots of dispatching onto background threads (and since you're using RAC, I'm guessing that you are).
Don't worry about it, basically.
From what I understand, Continuations are a style of function pointer passing so that the process knows what to execute next, in your case I'm assuming those would be dispatch blocks from GCD. I'm assuming that the VM has a bunch of these that it uses over time, and that's what you're seeing in instruments. Then again, I'm not an expert on threading and I could be totally off in left field.
Related
Here's a very simple demo:
class ViewController: UIViewController {
override func viewDidLoad() {
super.viewDidLoad()
for i in 0..<500000 {
DispatchQueue.global().async {
print(i)
}
}
}
}
When I run this demo from my simulator, the memory usage goes up to ~17MB and then drops to ~15MB in the end. However, if I comment out the dispatch code and only keeps the print() line, the memory usage is only ~10MB. The increase amount varies whenever I change the loop count.
Is there a memory leak? I tried Leaks and didn't find anything.
When looking at memory usage, one has to run the cycle several times before concluding that there is a “leak”. It could just be caching.
In this particular case, you might see memory growth after the first iteration, but it will not continue to grow on subsequent iterations. If it was a true leak, the post-peak baseline would continue to creep up. But it does not. Note that the baseline after the second peak is basically the same as after the first peak.
As an aside, the memory characteristics here are a result of the thread explosion (which you should always avoid). Consider:
for i in 0 ..< 500_000 {
DispatchQueue.global().async {
print(i)
}
}
That dispatches half a million work items to a queue that can only support 64 worker threads at a time. That is “thread-explosion” exceeding the worker thread pool.
You should instead do the following, which constraints the degree of concurrency with concurrentPerform:
DispatchQueue.global().async {
DispatchQueue.concurrentPerform(iterations: 500_000) { i in
print(i)
}
}
That achieves the same thing, but limits the degree of concurrency to the number of available CPU cores. That avoids many problems (specifically it avoids exhausting the limited worker pool thread which could deadlock other systems), but it also avoids the spike in memory, too. (See the above graph, where there is no spike after either of the latter two concurrentPerform processes.)
So, while the thread-explosion scenario does not actually leak, it should be avoided at all costs because of both the memory spike and the potential deadlock risks.
Memory used is not memory leaked.
There is a certain amount of overhead associated with certain OS services. I remember answering a similar question posed by someone using a WebView. There are global caches. There is simply code that has to be paged in from disk to memory (which is big with WebKit), and once the code is paged in, it's incredibly unlikely to ever be paged out.
I've not looked at the libdispatch source code lately, but GCD maintains one or more pools of threads that it uses to execute the blocks you enqueue. Every one of those threads has a stack. The default thread stack size on macOS is 8MB. I don't know about iOS's default thread stack size, but it for sure has one (and I'd bet one stiff drink that it's 8MB). Once GCD creates those threads, why would it shut them down? Especially when you've shown the OS that you're going to rapidly queue 500K operations?
The OS is optimizing for performance/speed at the expense of memory use. You don't get to control that. It's not a "leak" which is memory that's been allocated but has no live references to it. This memory surely has live references to it, they're just not under your control. If you want more (albeit different) visibility into memory usage, look into the vmmap command. It can (and will) show you things that might surprise you.
I am writing an algorithm which all blocks are reading a same address. Such as we have a list=[1, 2, 3, 4], and all blocks are reading it and store it to their own shared memory...My test shows the more blocks reading it, the slower it will be...I guess no broadcast happen here? Any idea I can make it faster? Thank you!!!
I learnt from previous post that this can be broadcast in one wrap, seems can not happen in different wrap....(Actually in my case, the threads in one wrap are not reading a same location...)
Once list element is accessed by first warp of a SM unit, the second warp in same SM unit gets it from cache and broadcasts to all simt lanes. But another SM unit's warp may not have it in L1 cache so it fetches from L2 to L1 first.
It is similar in __constant__ memory but it requires same address to be accessed by all threads. Its latency is closer to register access. __constant__ memory is like instruction cache, you get more performance when all threads do same thing.
For example, if you have a Gaussian-filter that iterates over same coefficient-list of filter on all threads, it is better to use constant memory. Using shared memory does not have much advantage as the filter array is not scanned randomly. Shared memory is better when the filter array content is different per block or if it needs random access.
You can also combine constant memory and shared memory. Get half of list from constant memory, then the other half from shared memory. This should let 1024 threads hide latency of one memory type hidden behind the other.
If list is small enough, you can use registers directly (has to be compile-time known indices). But it increases register pressure and may decrease occupancy so be careful about this.
Some old cuda architectures (in case of fma operation) required one operand fetched from constant memory and the other operand from a register to achieve better performance in compute-bottlenecked algorithms.
In a test with 12000 floats as filter to be applied on all threads inputs, shared memory version with 128 threads-per-block completed work in 330 milliseconds while constant-memory version completed in 260 milliseconds and the L1 access performance was the real bottleneck in both versions so the real constant-memory performance is even better, as long as it is similar-index for all threads.
I am quite new to CUDA programming and there are some stuff about the memory model that are quite unclear to me. Like, how does it work? For example if I have a simple kernel
__global__ void kernel(const int* a, int* b){
some computation where different threads in different blocks might
write at the same index of b
}
So I imagine a will be in the so-called constant memory. But what about b? Since different threads in different blocks will write in it, how will it work? I read somewhere that it was guaranteed that in the case of concurrent writes in global memory by different threads in the same block at least one would be written, but there's no guarantee about the others. Do I need to worry about that, ie for example have every thread in a block write in shared memory and once they are all done, have one write it all to the global memory? Or is CUDA taking care of it for me?
So I imagine a will be in the so-called constant memory.
Yes, a the pointer will be in constant memory, but not because it is marked const (this is completely orthogonal). b the pointer is also in constant memory. All kernel arguments are passed in constant memory (except in CC 1.x). The memory pointed-to by a and b could, in theory, be anything (device global memory, host pinned memory, anything addressable by UVA, I believe). Where it resides is chosen by you, the user.
I read somewhere that it was guaranteed that in the case of concurrent writes in global memory by different threads in the same block at least one would be written, but there's no guarantee about the others.
Assuming your code looks like this:
b[0] = 10; // Executed by all threads
Then yes, that's a (benign) race condition, because all threads write the same value to the same location. The result of the write is defined, however the number of writes is unspecified and so is the thread that does the "final" write. The only guarantee is that at least one write happens. In practice, I believe one write per warp is issued, which is a waste of bandwidth if your blocks contain more than one warp (which they should).
On the other hand, if your code looks like this:
b[0] = threadIdx.x;
This is plain undefined behavior.
Do I need to worry about that, ie for example have every thread in a block write in shared memory and once they are all done, have one write it all to the global memory?
Yes, that's how it's usually done.
I am implementing a spiking neural network using the CUDA library and am really unsure of how to proceed with regard to the following things:
Allocating memory (cudaMalloc) to many different arrays. Up until now, simply using cudaMalloc 'by hand' has sufficed, as I have not had to make more than 10 or so arrays. However, I now need to make pointers to, and allocate memory for thousands of arrays.
How to decide how much memory to allocate to each of those arrays. The arrays have a height of 3 (1 row for the postsynaptic neuron ids, 1 row for the number of the synapse on the postsynaptic neuron, and 1 row for the efficacy of that synapse), but they have an undetermined length which changes over time with the number of outgoing synapses.
I have heard that dynamic memory allocation in CUDA is very slow and so toyed with the idea of allocating the maximum memory required for each array, however the number of outgoing synapses per neuron varies from 100-10,000 and so I thought this was infeasible, since I have on the order of 1000 neurons.
If anyone could advise me on how to allocate memory to many arrays on the GPU, and/or how to code a fast dynamic memory allocation for the above tasks I would have more than greatly appreciative.
Thanks in advance!
If you really want to do this, you can call cudaMalloc as many times as you want; however, it's probably not a good idea. Instead, try to figure out how to lay out the memory so that neighboring threads in a block will access neighboring elements of RAM whenever possible.
The reason this is likely to be problematic is that threads execute in groups of 32 at a time (a warp). NVidia's memory controller is quite smart, so if neighboring threads ask for neighboring bytes of RAM, it coalesces those loads into a single request that can be efficiently executed. In contrast, if each thread in a warp is accessing a random memory location, the entire warp must wait till 32 memory requests are completed. Furthermore, reads and writes to the card's memory happen a whole cache line at a time, so if the threads don't use all the RAM that was read before it gets evicted from the cache, memory bandwidth is wasted. If you don't optimize for coherent memory access within thread blocks, expect a 10x to 100x slowdown.
(side note: The above discussion is still applicable with post-G80 cards; the first generation of CUDA hardware (G80) was even pickier. It also required aligned memory requests if the programmer wanted the coalescing behavior.)
In CUDA programming, if we want to use shared memory, we need to bring the data from global memory to shared memory. Threads are used for transferring such data.
I read somewhere (in online resources) that it is better not to involve all the threads in the block for copying data from global memory to shared memory. Such idea makes sense that all the threads are not executed together. Threads in a warp execute together. But my concern is all the warps are not executed sequentially. Say, a block with threads is divided into 3 warps: war p0 (0-31 threads), warp 1 (32-63 threads), warp 2 (64-95 threads). It is not guaranteed that warp 0 will be executed first (am I right?).
So which threads should I use to copy the data from global to shared memory?
To use a single warp to load a shared memory array, just do something like this:
__global__
void kernel(float *in_data)
{
__shared__ float buffer[1024];
if (threadIdx.x < warpSize) {
for(int i = threadIdx; i <1024; i += warpSize) {
buffer[i] = in_data[i];
}
}
__syncthreads();
// rest of kernel follows
}
[disclaimer: written in browser, never tested, use at own risk]
The key point here is the use of __syncthreads() to ensure that all threads in the block wait until the warp performing the load to shared memory have finished the load. The code I posted used the first warp, but you can calculate a warp number by dividing the thread index within the block by the warpSize. I also assumed a one-dimensional block, it is trivial to compute the thread index in a 2D or 3D block, so I leave that as an exercise to the reader.
As block is assigned to multiprocessor it resides there until all threads within that block are finished and during this time warp scheduler is mixing among warps that have ready operands. So if there is one block on multiprocessor with three warps and just one warp is fetching data from global to shared memory and other two warps are staying idle and probably waiting on __syncthreads() barrier, you loose nothing and you are limited just by latency of global memory what you would have been anyway. As soon as fetching is finished warps can go ahead in their work.
Therefore, no guarantee that warp0 is executed first is needed and you can use any threads. The only two things to keep in mind are to ensure as much coalesced access to global memory as possible and avoidance of bank conflicts.