Why isn't it more widespread to implement the function call stack in hardware, or at least keep it closer to the CPU in the L1/L2 cache?
Couldn't it save pushing function parameters and reading them back into CPU registers, by not having to travel to main memory each time?
Well, it's a matter of a price/performance trade-off. For many reasons, modern CPUs use SRAM as cache. Unfortunately, SRAM is roughly a thousand times pricier than the cheap DDR3 you can buy at your local PC store. So, since you probably won't buy a $15,000 i7 CPU just because it has a gigabyte of SRAM L1 cache, no one will produce such a chip. It's price/performance.
Actually, CPUs implement sophisticated methods to compensate for the limited amount of fast memory. Look up "cache management".
I sometimes read that the code segment is placed into ROM/FLASH. Others state that it is also loaded into RAM.
Is my understanding correct that it is common to place it in flash in the case of an embedded system? And what are the advantages? I assume program startup will be faster, but since flash is much slower, wouldn't it be better to additionally copy it from flash to RAM during the startup phase, when RAM usage does not matter?
Sometimes you don't have enough memory for your program, so you just leave it in ROM or flash. On a system flush with memory, you just load everything into RAM; it's much faster.
Some embedded CPUs have 2 KB of memory but 2 MB of flash. As an example, the RP2040 has 264 KB of SRAM but is typically paired with 2 MB of flash for your programs. That's a lot bigger than the RAM footprint.
Flash is slow compared to modern DRAM, but in an embedded environment the CPU isn't always that fast either. The RP2040 only runs at 133 MHz, so it won't notice the difference between flash latency and SRAM latency the way a chip running in the 2 GHz range might; it's clocked 15x slower.
If you want to explore this more, embedded CPUs like the RP2040 are really cheap, some less than $1, so you can experiment on them and see how it plays out in real life without having to spend much money at all.
Generally, RAM is much faster than flash. Where to run the code from however, depends on the system. On most traditional embedded systems, you don't execute from RAM.
On low-end embedded systems (8- and 16-bitters) you always keep all code in flash, and there won't be a performance difference between executing from RAM or flash. Such systems typically don't have an MMU or protection against writing to the code area, so running code from RAM is highly dangerous, since bugs can write straight into physical memory. Also, these systems tend to have very limited RAM.
On mid-range embedded systems (Cortex M etc) where you start to clock the core faster than the flash can keep up with, you need to introduce wait states, where the CPU waits for the flash to read. Typically you need wait states when you go beyond somewhere around 40-50MHz system clock on modern systems. The higher the clock, the more wait states you need.
Such systems do not typically execute code from RAM either, since they usually don't need extreme performance, and they typically don't have a lot of RAM either. In some cases, like mid-range PowerPC, you'll have an instruction cache, which helps a lot in compensating for the slower flash, since instructions can be pre-loaded from flash into cache by branch prediction.
On high-end systems (Cortex A, x86 etc) there will be lots of RAM available for the purpose of executing the code from there and then you are expected to do so. On these systems, cache rather serves the purpose of speeding up access to RAM.
Historically, RAM was also much more prone to electromagnetic interference and could also lose charge over time unless you kept writing to the cells, so you didn't want to keep code in RAM for those reasons alone. That's not much of an issue today though.
I'm dumbfounded by this matter. As far as my knowledge goes, there is volatile and non-volatile memory. The question I've been given is to rate the volatility of each of these types of memory on a scale of 1 to 4.
The types of memory outlined here are DRAM, CPU Cache, CPU Registers and Secondary Storage. I'm aware that DRAM, Cache and Registers are very much volatile, with some exceptions in the case of Registers. So far my answer goes as follows:
1. DRAM
2. Cache
3. Registers
4. Secondary Storage
Would this be considered a correct solution? I've researched far and wide, and there is not much data on how volatile these types of memory are.
The closer the memory is to the core, the more pressure there is on the manufacturers to optimize for speed. Volatility can easily be dealt with by refresh cycles, at the cost of some extra energy consumption (nobody cares, everybody wants speed).
Since there's a trade-off between speed and permanence, memory units closer to the core are more volatile.
So the order is: registers, cache, DRAM, secondary storage.
I understand that cudaMallocManaged simplifies memory access by eliminating the need for explicit memory allocations on host and device. Consider a scenario where the host memory is significantly larger than the device memory, say 16 GB host and 2 GB device, which is fairly common these days, and where I am dealing with input data of a large size, say 4-5 GB, read from an external source. Am I forced to resort to explicit host and device memory allocation (as device memory is insufficient to accommodate it all at once), or does the CUDA unified memory model have a way to get around this (something like automatic allocation/deallocation on a need basis)?
Am I forced to resort to explicit host and device memory allocation?
You are not forced to resort to explicit host and device memory allocation, but you will be forced to handle the amount of allocated memory manually. This is because, on current hardware at least, the CUDA unified virtual memory doesn't allow you to oversubscribe GPU memory. In other words, cudaMallocManaged will fail once you allocate more memory than what is available on the device. But that doesn't mean you can't use cudaMallocManaged, it merely means you have to keep track of the amount of memory allocated and never exceed what the device could support, by "streaming" your data instead of allocating everything at once.
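For illustration, here is a minimal sketch of that streaming approach, assuming a made-up element-wise kernel and arbitrary sizes; the only point is that the single managed buffer is sized to always fit on the device while the full data set does not:

    #include <algorithm>
    #include <cuda_runtime.h>

    // Hypothetical element-wise kernel, used only for illustration.
    __global__ void process(float *data, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    int main()
    {
        const size_t totalElems = 1ull << 30;   // ~4 GB of floats: the whole input
        const size_t chunkElems = 1ull << 27;   // ~512 MB chunks: always fits in 2 GB of device memory

        float *chunk = NULL;
        // One managed buffer, reused for every chunk, never exceeding device memory.
        cudaMallocManaged(&chunk, chunkElems * sizeof(float));

        for (size_t offset = 0; offset < totalElems; offset += chunkElems) {
            size_t n = std::min(chunkElems, totalElems - offset);

            for (size_t i = 0; i < n; ++i)
                chunk[i] = 1.0f;                // placeholder: read element offset + i from the external source

            process<<<(unsigned)((n + 255) / 256), 256>>>(chunk, n);
            cudaDeviceSynchronize();            // pages migrate back before the host refills the buffer

            // ... consume the processed chunk on the host here ...
        }

        cudaFree(chunk);
        return 0;
    }

Overlapping the host refill with the kernel via streams and a second buffer would hide more of the transfer cost, but the basic idea stays the same.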
Pure speculation as I can't speak for NVIDIA, but I believe this could be one of the future improvements on upcoming hardware.
And indeed, a year and a half after the above prediction, as of CUDA 8, Pascal GPUs are enhanced with a page-faulting capability that allows memory pages to migrate between the host and the device without explicit intervention from the programmer.
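As a rough sketch of what that enables (assuming a Pascal or later GPU with CUDA 8+, and the hypothetical 2 GB device from the question), the oversubscribed allocation below succeeds and pages are migrated on demand as the kernel touches them:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(char *p, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            p[i] = 1;                          // first touch from the GPU triggers page migration
    }

    int main()
    {
        const size_t bytes = 4ull << 30;       // 4 GB: more than the device has
        char *p = NULL;

        cudaError_t err = cudaMallocManaged(&p, bytes);
        if (err != cudaSuccess) {              // pre-Pascal devices would fail here
            printf("cudaMallocManaged: %s\n", cudaGetErrorString(err));
            return 1;
        }

        touch<<<(unsigned)((bytes + 255) / 256), 256>>>(p, bytes);
        cudaDeviceSynchronize();

        cudaFree(p);
        return 0;
    }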
I have an evaluation kit which has an implementation of the ARM Cortex-A8 core. The processor data sheet states that it has an
ARM Cortex A8™ core, which operates at speeds as high as 800 MHz, and up to 200 MHz DDR2 RAM.
What can I expect from this system? Am I right to assume that the memory accesses will be a bottleneck because it operates at only 200MHz?
I need more info on how to interpret this.
The processor works with an internal cache (actually, several) which it can access at "full speed". The cache is small (typically 8 to 32 kilobytes) and is filled by chunks ("cache lines") from the external RAM (a cache line will be a few dozen consecutive bytes). When the code needs some data which is not presently in the cache, the processor will have to fetch the line from main RAM; this is called a cache miss.
How fast a cache line can be obtained from main RAM is described by two parameters, called latency and bandwidth. Latency is the amount of time between the moment the processor issues the request and the moment the first byte of the cache line is received. Typical latencies are about 30 ns. At 800 MHz, 30 ns means 24 clock cycles. Bandwidth describes how many bytes per nanosecond can be sent over the bus. "200 MHz DDR2" means that the bus clock runs at 200 MHz; DDR2 RAM can send two data elements per cycle (hence 400 million elements per second). Bandwidth then depends on how many wires there are between the CPU and the RAM: with a 64-bit bus and 200 MHz DDR2 RAM, you could hope for 3.2 GB/s in ideal conditions. So while the first byte takes quite some time to be obtained (the latency is high with regard to what the CPU could do in the meantime), the rest of the cache line is read quite quickly.
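Worked out explicitly (the same figures as in the paragraph above, nothing new):

    $$ 30\ \mathrm{ns} \times 800\ \mathrm{MHz} = 24\ \text{clock cycles of latency} $$
    $$ 8\ \mathrm{bytes/transfer} \times 2\ \mathrm{transfers/cycle} \times 200\ \mathrm{MHz} = 3.2\ \mathrm{GB/s}\ \text{of peak bandwidth} $$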
In the other direction: the CPU writes some data to its cache, and some circuitry will propagate the modification to main RAM at its leisure.
The description above is overly simplistic; caches and cache management are a complex area. The bottom line is the following: if your code uses big data tables in memory and accesses them in a seemingly random way, then the application will be slow, because most of the time the processor will just be waiting for data from main memory. On the other hand, if your code can operate on a small working set, less than a few dozen kilobytes, then chances are that it will run most of the time out of the innermost cache, and external RAM speed will be unimportant. The ability to make memory accesses in a way that plays well with the caches is called locality of reference.
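To make locality of reference concrete, here is a small sketch (the array size and stride are arbitrary, and the exact timings depend on the machine) where the same number of additions is performed twice over the same table, once sequentially and once with a large stride. The strided version touches a new cache line on almost every access, so it spends most of its time waiting on main RAM:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (8 * 1024 * 1024)   /* 32 MB of ints: far larger than any cache */
    #define STRIDE 1024           /* jump ~4 KB between consecutive accesses */

    int main(void)
    {
        int *table = (int *)calloc(N, sizeof *table);
        long sum = 0;

        /* Good locality: consecutive accesses share cache lines,
           so most reads are served from cache. */
        clock_t t0 = clock();
        for (int i = 0; i < N; i++)
            sum += table[i];
        clock_t t1 = clock();

        /* Poor locality: same number of additions, but each access lands on a
           different cache line, so most reads go all the way out to main RAM. */
        for (int j = 0; j < STRIDE; j++)
            for (int i = j; i < N; i += STRIDE)
                sum += table[i];
        clock_t t2 = clock();

        printf("sequential: %ld ticks, strided: %ld ticks (sum=%ld)\n",
               (long)(t1 - t0), (long)(t2 - t1), sum);
        free(table);
        return 0;
    }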
See the Wikipedia page on caches for an introduction and pointers on the matter of caches.
(Big precomputed tables were a common optimization trick during the '80s, because at that time processors were not faster than RAM and one-cycle memory access was the rule, which is why an 8 MHz Motorola 68000 CPU had no cache. But those days are long gone.)
Yes, the memory may well be a bottleneck, but you are very unlikely to be running an application that does nothing but read and write memory.
For work that stays inside the CPU, in registers and cache, the memory bottleneck will have no effect.
I looked through the programming guide and the best practices guide, and they mention that global memory access takes 400-600 cycles. I did not see much on the other memory types like texture cache, constant cache, and shared memory. Registers have zero memory latency.
I think the constant cache is the same as registers if all threads use the same address in constant memory. About the worst case I am not so sure.
Is shared memory the same as registers as long as there are no bank conflicts? If there are, how does the latency unfold?
What about texture cache?
For (Kepler) Tesla K20 the latencies are as follows:
Global memory: 440 clocks
Constant memory L1: 48 clocks
Constant memory L2: 120 clocks
Shared memory: 48 clocks
Texture memory L1: 108 clocks
Texture memory L2: 240 clocks
How do I know? I ran the microbenchmarks described by the authors of "Demystifying GPU Microarchitecture through Microbenchmarking". They provide similar results for the older GTX 280.
This was measured on a Linux cluster; the compute node where I was running the benchmarks was not used by any other users and was not running any other processes. It is BULLX Linux with a pair of 8-core Xeons and 64 GB of RAM, with nvcc 6.5.12. I changed sm_20 to sm_35 for compiling.
There is also an operand cost chapter in the PTX ISA document, although it is not very helpful: it just reiterates what you would already expect, without giving precise figures.
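For a feel of how such figures are obtained, here is a minimal sketch in the same spirit (not the paper's actual harness; the array length and stride are chosen arbitrarily so that every visited element lies on a fresh cache line): a single thread chases a chain of dependent loads through global memory and divides the elapsed clock64() cycles by the number of loads:

    #include <cstdio>
    #include <cuda_runtime.h>

    #define ITER   256
    #define STRIDE 64                          // elements (256 bytes): each visited element is on its own cache line
    #define LEN    (ITER * STRIDE)             // no element is visited twice

    // A single thread walks the chain; every load depends on the previous one,
    // so the loads cannot overlap and their latency is fully exposed.
    __global__ void latency(const unsigned int *chain, long long *cycles, unsigned int *sink)
    {
        unsigned int j = 0;
        long long start = clock64();
        for (int i = 0; i < ITER; i++)
            j = chain[j];                      // dependent global load
        long long stop = clock64();

        *sink = j;                             // keep the loop from being optimized away
        *cycles = (stop - start) / ITER;       // rough cycles per load (includes a little loop overhead)
    }

    int main()
    {
        unsigned int h_chain[LEN];
        for (int i = 0; i < LEN; i++)
            h_chain[i] = (i + STRIDE) % LEN;   // index of the next element to visit

        unsigned int *d_chain, *d_sink;
        long long *d_cycles, h_cycles;
        cudaMalloc(&d_chain, sizeof(h_chain));
        cudaMalloc(&d_sink, sizeof(unsigned int));
        cudaMalloc(&d_cycles, sizeof(long long));
        cudaMemcpy(d_chain, h_chain, sizeof(h_chain), cudaMemcpyHostToDevice);

        latency<<<1, 1>>>(d_chain, d_cycles, d_sink);
        cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);

        printf("~%lld cycles per dependent global load\n", h_cycles);
        cudaFree(d_chain); cudaFree(d_sink); cudaFree(d_cycles);
        return 0;
    }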
The latency to the shared/constant/texture memories is small and depends on which device you have. In general, though, GPUs are designed as a throughput architecture, which means that by creating enough threads the latency to the memories, including the global memory, is hidden.
The reason the guides talk about the latency to global memory is that the latency is orders of magnitude higher than that of other memories, meaning that it is the dominant latency to be considered for optimization.
You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. group of 32 threads) access the same address then there is no penalty, i.e. the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must serialize since the cache can only provide one value at a time. If you're using the CUDA Profiler, then this will show up under the serialization counter.
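Here is a tiny sketch of that contrast (the table contents and launch configuration are made up for the example): the first kernel's warp reads a single constant address and gets the broadcast, while the second kernel's warp reads 32 different addresses and the profiler will report serialization for it:

    #include <cuda_runtime.h>

    __constant__ float table[64];

    // All 32 threads of the warp read the same entry: one constant-cache access,
    // broadcast to the whole warp.
    __global__ void uniform_read(float *out, int k)
    {
        out[threadIdx.x] = table[k];
    }

    // Each thread reads a different entry: the constant cache can only provide
    // one value at a time, so the 32 accesses are serialized.
    __global__ void divergent_read(float *out)
    {
        out[threadIdx.x] = table[threadIdx.x];
    }

    int main()
    {
        float h_table[64];
        for (int i = 0; i < 64; i++)
            h_table[i] = (float)i;
        cudaMemcpyToSymbol(table, h_table, sizeof(h_table));

        float *d_out;
        cudaMalloc(&d_out, 32 * sizeof(float));

        uniform_read<<<1, 32>>>(d_out, 5);     // broadcast, no penalty
        divergent_read<<<1, 32>>>(d_out);      // 32 distinct addresses, serialized
        cudaDeviceSynchronize();

        cudaFree(d_out);
        return 0;
    }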
Shared memory, unlike constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.
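To give one concrete picture of a bank conflict and the usual workaround, here is a sketch of a column-wise read of a 32x32 shared-memory tile; the padding trick is the standard fix, and the kernels are only illustrative:

    #include <cuda_runtime.h>

    #define TILE 32

    // Reading tile[threadIdx.x][threadIdx.y] makes the 32 threads of a warp access
    // words that are 32 apart, i.e. all in the same bank: a 32-way bank conflict.
    __global__ void conflicted(float *out, const float *in)
    {
        __shared__ float tile[TILE][TILE];
        tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * TILE + threadIdx.x];
        __syncthreads();
        out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
    }

    // Padding each row by one element makes the stride 33 words, so the warp's
    // accesses land in 32 different banks: no conflict, same results.
    __global__ void padded(float *out, const float *in)
    {
        __shared__ float tile[TILE][TILE + 1];
        tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * TILE + threadIdx.x];
        __syncthreads();
        out[threadIdx.y * TILE + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
    }

    int main()
    {
        float *d_in, *d_out;
        cudaMalloc(&d_in, TILE * TILE * sizeof(float));
        cudaMalloc(&d_out, TILE * TILE * sizeof(float));

        dim3 block(TILE, TILE);                // one 32x32 tile per block
        conflicted<<<1, block>>>(d_out, d_in); // shows up as shared-memory replays in the profiler
        padded<<<1, block>>>(d_out, d_in);     // same work, no replays
        cudaDeviceSynchronize();

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }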