How should I allocate memory to many (1000+) arrays which I don't know the size of? - memory

I am implementing a spiking neural network using the CUDA library and am really unsure of how to proceed with regard to the following things:
Allocating memory (cudaMalloc) to many different arrays. Up until now, simply using cudaMalloc 'by hand' has sufficed, as I have not had to make more than 10 or so arrays. However, I now need to make pointers to, and allocate memory for thousands of arrays.
How to decide how much memory to allocate to each of those arrays. The arrays have a height of 3 (1 row for the postsynaptic neuron ids, 1 row for the number of the synapse on the postsynaptic neuron, and 1 row for the efficacy of that synapse), but they have an undetermined length which changes over time with the number of outgoing synapses.
I have heard that dynamic memory allocation in CUDA is very slow and so toyed with the idea of allocating the maximum memory required for each array, however the number of outgoing synapses per neuron varies from 100-10,000 and so I thought this was infeasible, since I have on the order of 1000 neurons.
If anyone could advise me on how to allocate memory to many arrays on the GPU, and/or how to code a fast dynamic memory allocation for the above tasks I would have more than greatly appreciative.
Thanks in advance!

If you really want to do this, you can call cudaMalloc as many times as you want; however, it's probably not a good idea. Instead, try to figure out how to lay out the memory so that neighboring threads in a block will access neighboring elements of RAM whenever possible.
The reason this is likely to be problematic is that threads execute in groups of 32 at a time (a warp). NVidia's memory controller is quite smart, so if neighboring threads ask for neighboring bytes of RAM, it coalesces those loads into a single request that can be efficiently executed. In contrast, if each thread in a warp is accessing a random memory location, the entire warp must wait till 32 memory requests are completed. Furthermore, reads and writes to the card's memory happen a whole cache line at a time, so if the threads don't use all the RAM that was read before it gets evicted from the cache, memory bandwidth is wasted. If you don't optimize for coherent memory access within thread blocks, expect a 10x to 100x slowdown.
(side note: The above discussion is still applicable with post-G80 cards; the first generation of CUDA hardware (G80) was even pickier. It also required aligned memory requests if the programmer wanted the coalescing behavior.)

Related

All blocks read same global memory location section. Fastest method is?

I am writing an algorithm which all blocks are reading a same address. Such as we have a list=[1, 2, 3, 4], and all blocks are reading it and store it to their own shared memory...My test shows the more blocks reading it, the slower it will be...I guess no broadcast happen here? Any idea I can make it faster? Thank you!!!
I learnt from previous post that this can be broadcast in one wrap, seems can not happen in different wrap....(Actually in my case, the threads in one wrap are not reading a same location...)
Once list element is accessed by first warp of a SM unit, the second warp in same SM unit gets it from cache and broadcasts to all simt lanes. But another SM unit's warp may not have it in L1 cache so it fetches from L2 to L1 first.
It is similar in __constant__ memory but it requires same address to be accessed by all threads. Its latency is closer to register access. __constant__ memory is like instruction cache, you get more performance when all threads do same thing.
For example, if you have a Gaussian-filter that iterates over same coefficient-list of filter on all threads, it is better to use constant memory. Using shared memory does not have much advantage as the filter array is not scanned randomly. Shared memory is better when the filter array content is different per block or if it needs random access.
You can also combine constant memory and shared memory. Get half of list from constant memory, then the other half from shared memory. This should let 1024 threads hide latency of one memory type hidden behind the other.
If list is small enough, you can use registers directly (has to be compile-time known indices). But it increases register pressure and may decrease occupancy so be careful about this.
Some old cuda architectures (in case of fma operation) required one operand fetched from constant memory and the other operand from a register to achieve better performance in compute-bottlenecked algorithms.
In a test with 12000 floats as filter to be applied on all threads inputs, shared memory version with 128 threads-per-block completed work in 330 milliseconds while constant-memory version completed in 260 milliseconds and the L1 access performance was the real bottleneck in both versions so the real constant-memory performance is even better, as long as it is similar-index for all threads.

Why is primary and cache memory divided into blocks?

Why is primary and cache memory divided into blocks?
Hi just got posed with this question, I haven't been able to find a detailed explanation corresponding to both primary memory and cache memory, if you have a solution It would be greatly appreciated :)
Thank you
Cache blocks exploit locality of reference based on two types of locality. Temporal locality, after you reference location x you are likely to access location x again shorty. Spatial locality, after you reference location x you are likely to access nearby locations, location x+1, ... shortly.
If you use a value at some distant data center x, you are likely to reuse that value and so it is copied geographically closer, 150ms. If you use a value on disk block x, you are likely to reuse disk block x and so it is kept in memory, 20 ms. If you use a value on memory page x, you are like to reuse memory page x and so the translation of its virtual address to its physical address is kept in the TLB cache. If you use a particular memory location x you are likely to reuse it and its neighbors and so it is kept in cache.
Cache memory is very small, L1D on an M1 is 192kB, and DRAM is very big, 8GB on an M1 Air. L1D cache is much faster than DRAM, maybe 5 cycles vs maybe 200 cycles. I wish this table was in cycles and included registers but it gives a useful overview of latencies:
https://gist.github.com/jboner/2841832
The moral of this is to pack data into aligned structures which fit. If you randomly access memory instead, you will miss in the cache, the TLB, the virtual page cache, ... and everything will be excruciatingly slow.
Most things in computer systems are divided into chunks of fixed sizes: bytes, words, cache blocks, pages.
One reason for this is that while hardware can do many things at once, it is hardware and thus, generally can only do what it was designed for.  Making bytes of blocks of 8 bits, making words blocks of 4 bytes (32-bit systems) or 8 bytes (64-bit systems) is something that we can design hardware to do, and mostly in parallel.
Not using fixed-sized chunks or blocks, on the other hand, can make things much more difficult for the hardware, so, data structures like strings — an example of data that is highly variable in length — are usually handled with software loops.
Usually these fixed sizes are in powers of 2 (32, 64, etc..) — because division and modulus, which are very useful operations are easy to do in binary for powers of 2.
In summary, we must subdivide data into blocks because, we cannot treat all the data as one lump sum (hardware wise at least), and, treating all data as individual bits is also too cumbersome.  So, we chunk or group data into blocks as appropriate for various levels of hardware to deal with in parallel.

Does cacheline size affect memory access latency?

Intel architecture has had 64 byte caches for a long time. I am curious, if instead of 64-byte cache lines a processor had 32-byte or 16-byte cachelines, would this improve the RAM-to-register data transfer latency? if so, how much? if not, why?
Thank you.
Transferring a larger amount of data of course increases the communication time. But the increase is very small due the way memory are organized and it does it does not impact memory to register latency.
Memory access operations are done in three steps:
bitline precharge: row address is sent and the internal busses of memory are precharged (duration tRP)
row access: an internal row of a memory is read and written to internal latches. During that time, column address is sent (duration tRCD)
column access: the selected columns are read in the row latches and start to be sent to the processor (duration tCL)
Row access is a long operation.
A memory is a matrix of cell elements. To increase the capacity of memory, cells must be rendered as small as possible. And when reading a row of cells, one has to drive a very capacitive and large bus that goes along a memory column. The voltage swing is very low and there are sense amplifiers amplifiers to detect small voltage variations.
Once this operation is done, a complete row is memorized in latches and reading them can be fast and are generally sent in burst mode.
Considering a typical DDR4 memory, with a 1GHz IO cycle time, we generally have tRP/tRCD/tCL=12-15cy/12-15cy/10-12cy and the complete time is around 40 memory cycles (if processor frequency is 4GHz, this is ~160 processor cycles). Then data is sent in burst mode twice per cycle, and 2x64 bits are sent every cycle. So, data transfer adds 4 cycles for 64 bytes and it would add only 2 cycles for 32 bytes.
So reducing cache line from 64B to 32B would reduce the transfer time by ~2/40=5%
If row address do not change, precharging and reading memory row is not required and the access time is ~15 memory cycles. In that case, the relative increase of time for transferring 64B vs 32B is larger but still limited: ~2/15~15%.
Both evaluations do not take into account the extra time required to process a miss in the memory hierachy and the actual percentage will be even smaller.
Data can be sent "critical word first" by the memory. If processor requires a given word, the address of this word is sent to memory. Once the row is read, memory sends first this word, then the other words in the cache line. So, caches can serve processor request as soon as the first word is received, whatever cache line is, and decreasing line width would have no impact on cache latency. So if using this feature, memory-to-register time would not change.
In recent processors, exchanges between different caches levels are based on the cache line width and sending the critical word first does not bring any gain.
Besides that, large line sizes reduce mandatory misses thanks to spatial locality and reducing line size would have a negative impact on cache miss rate.
Last, using larger cache lines increases data transfer rate between cache and memory.
The only negative aspect of large cache lines (besides the small transfer time increase) are that the number of lines in the cache is reduced and conflict misses may increase. But with the large associativity of modern caches, this effect is limited.

Golang. Zero Garbage propagation or efficient use of memory

From time to time I face with the concepts like zero garbage or efficient use of memory etc. As an example in the section Features of well-known package httprouter you can see the following:
Zero Garbage: The matching and dispatching process generates zero bytes of garbage. In fact, the only heap allocations that are made, is by building the slice of the key-value pairs for path parameters. If the request path contains no parameters, not a single heap allocation is necessary.
Also this package shows very good benchmark results compared to standard library's http.ServeMux:
BenchmarkHttpServeMux 5000 706222 ns/op 96 B/op 6 allocs/op
BenchmarkHttpRouter 100000 15010 ns/op 0 B/op 0 allocs/op
As far as I understand the second one has (from the table) no heap memory allocation and zero average number of allocations made per repetition.
The question: I want to learn a basic understanding of memory management. When garbage collector allocates/deallocates memory. What does the benchmark numbers means (the last two columns of the table) and how people know when heap is allocating?
I'm absolutely new in memory management, so it's really difficult to understand what's going on "under the hood". The articles I've read:
https://golang.org/ref/mem
https://golang.org/doc/effective_go.html
http://gribblelab.org/CBootcamp/7_Memory_Stack_vs_Heap.html
http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)
The garbage collector doesn't allocate memory :-), it just deallocates. Go's garbage collector is evolving, for the details have a look at the design document https://docs.google.com/document/d/16Y4IsnNRCN43Mx0NZc5YXZLovrHvvLhK_h0KN8woTO4/preview?sle=true and follow the discussion on the golang mailing lists.
The last two columns in the benchmark output are dead simple: How many bytes have been allocated in total and how many allocations have happened during one iteration of the benchmark code. (This allocation is done by your code, not by the garbage collector). As any allocation is a potential creation of garbage reducing these numbers might be a design goal.
When are things allocated on the heap? Whenever the Go compiler decides to! The compiler tries to allocate on the stack, but sometimes it must use the heap, especially if a value escapes from the local stack-bases scopes. This escape analysis is currently undergoing rework, so it is not easy to tell which value will be heap- or stack-allocated, especially as this is changing from compiler version to version.
I wouldn't be too obsessed with avoiding allocations until your benchmarking show too much GC overhead.

How is RAM able to acess any place in memory at O(1) speed

We are taught that the abstraction of the RAM memory is a long array of bytes. And that for the CPU it takes the same amount of time to access any part of it. What is the device that has the ability to access any byte out of the 4 gigabytes (on my computer) in the same time? As this does not seem as a trivial task for me.
I have asked colleagues and my professors, but nobody can pinpoint to the how this task can be achieved with simple logic gates, and if it isn't just a tricky combination of logic gates, then what is it?
My personal guess is that you could achieve the access of any memory in O(log(n)) speed, where n would be the size of memory. Because each gate would split the memory in two and send you memory access instruction to the next split the memory in two gate. But that requires ALOT of gates. I can't come up with any other educated guess, and I don't even know the name of the device that I should look up in Google.
Please help my anguished curiosity, and thanks in advance.
edit<
This is what I learned!
quote from yours "the RAM can send the value from cell addressed X to some output pins", here is where everyone skip (again) the thing that is not trivial for me. The way that I see it, In order to build a gate that from 64 pins decides which byte out of 2^64 to get, each pin needs to split the overall possible range of memory into two. If bit at index 0 is 0 -> then the address is at memory 0-2^64/2, else address is at memory 2^64/2-2^64. And so on, However the amout of gates (lets call them) that the memory fetch will go through will be 64, (a constant). However the amount of gates needed is N, where N is the number of memory bytes there are.
Just because there is 64 pins, it doesn't mean that you can still decode it into a single fetch from a range of 2^64. Does 4 gigabytes memory come with a 4 gigabytes gates in the memory control???
now this can be improved, because as I read furiously more and more about how this memory is architectured, if you place the memory into a matrix with sqrt(N) rows and sqrt(N) columns, the amount of gates that a fetch memory will need to go through is O(log(sqrt(N)*2) and the amount of gates that will be required will be 2*sqrt(N), which is much better, and I think that its probably a trade secret.
/edit<
What the heck, I might as well make this an answer.
Yes, in the physical world, memory access cannot be constant time.
But it cannot even be logarithmic time. The O(log n) circuit you have in mind ultimately involves some sort of binary (or whatever) tree, and you cannot make a binary tree with constant-length wires in a 3D universe.
Whatever the "bits per unit volume" capacity of your technology is, storing n bits requires a sphere with radius O(n^(1/3)). Since information can only travel at the speed of light, accessing a bit at the other end of the sphere requires time O(n^(1/3)).
But even this is wrong. If you want to talk about actual limitations of our universe, our physics friends say the absolute maximum number of bits you can store in any sphere is proportional to the sphere's surface area, not its volume. So the actual radius of a minimal sphere containing n bits of information is O(sqrt(n)).
As I mentioned in my comment, all of this is pretty much moot. The models of computation we generally use to analyze algorithms assume constant-access-time RAM, which is close enough to the truth in practice and allows a fair comparison of competing algorithms. (Although good engineers working on high-performance code are very concerned about locality and the memory hierarchy...)
Let's say your RAM has 2^64 cells (places where it is possible to store a single value, let's say 8-bit). Then it needs 64 pins to address every cell with a different number. When at the input pins of your RAM there 'appears' a binary number X the RAM can send the value from cell addressed X to some output pins, and your CPU can get the value from there. In hardware the addressing can be done quite easily, for example by using multiple NAND gates (such 'addressing device' from some logic gates is called a decoder).
So it is all happening at the hardware-level, this is just direct addressing. If the CPU is able to provide 64 bits to 64 pins of your RAM it can address every single memory cell (as 64 bit is enough to represent any number up to 2^64 -1). The only reason why you do not get the value immediately is a kind of 'propagation time', so time it takes for the signal to go through all the logic gates in the circuit.
The component responsible for dealing with memory accesses is the memory controller. It is used by the CPU to read from and write to memory.
The access time is constant because memory words are truly layed out in a matrix form (thus, the "byte array" abstraction is very realistic), where you have rows and columns. To fetch a given memory position, the desired memory address is passed on to the controller, which then activates the right column.
From http://computer.howstuffworks.com/ram1.htm:
Memory cells are etched onto a silicon wafer in an array of columns
(bitlines) and rows (wordlines). The intersection of a bitline and
wordline constitutes the address of the memory cell.
So, basically, the answer to your question is: the memory controller figures it out. Of course that, given a memory address, the mapping to column and row must be easy to calculate in a constant time.
To fully understand this topic, I recommend you to read this guide on how memory works: http://computer.howstuffworks.com/ram.htm
There are so many concepts to master that it is difficult to explain it all in one answer.
I've been reading your comments and questions until I answered. I think you are on the right track, but there is some confusion here. The random access in which you are implying doesn't exist in the same way you think it does.
Reading, writing, and refreshing are done in a continuous cycle. A particular cell in memory is only read or written in a certain interval if a signal is detected to do so in that cycle. There is going to be support circuitry that includes "sense amplifiers to amplify the signal or charge detected on a memory cell."
Unless I am misunderstanding what you are implying, your confusion is in how easy it is to read/write to a cell. It's different dependent on chip design but there IS a minimum number of cycles it takes to read or write data to a cell.
These are my sources:
http://www.doc.ic.ac.uk/~dfg/hardware/HardwareLecture16.pdf
http://www.electronics.dit.ie/staff/tscarff/memory/dram_cycles.htm
http://www.ece.cmu.edu/~ece548/localcpy/dramop.pdf
To avoid a humungous answer, I left most of the detail out but all three of these will describe the process you are looking for.

Resources