Is there any way to calculate DRAM access latency (cycles) from data size? - memory

I need to calculate DRAM access latency from a given data size to be transferred between DRAM and SRAM.
The data is separated into a "load size" and a "store size", and the number of load and store iterations is given.
I think there are many features I need to consider, such as first DRAM access latency, the latency to transfer one word, address load latency, etc.
Is there a popular equation for computing this from the given information?
Thank you in advance.

Your question has many parts; I think I could help better if I knew the ultimate goal. If it's simply to measure access latency:
If you are using an x86 processor, maybe the Intel Memory Latency Checker will help:
Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and b/w, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation where b/w and latencies from a specific set of cores to caches or memory can be measured as well.
If not x86, I think the Gem5 Simulator has what you are looking for, here is the main page but more specifically, for your needs, I think this config for Gem5 will be the most helpful.
Now, regarding a popular equation, the best I could find is this Carnegie Mellon paper that goes over my head: https://users.ece.cmu.edu/~omutlu/pub/chargecache_low-latency-dram_hpca16.pdf However, it looks like your main "features", as you put it, revolve around cores and memory channels. The equation from the paper:
$$\text{Storage}_{bits} = C \times MC \times \text{Entries} \times (\text{EntrySize}_{bits} + \text{LRU}_{bits})$$
is used to create a cache that will ultimately (the goal of ChargeCache) reduce access latency in DRAM. I'm sure this isn't the equation you are looking for, but just a piece of the puzzle. The $\text{LRU}_{bits}$ relate to the cache this mechanism (in the memory controller, no DRAM modification necessary) creates.
$\text{EntrySize}_{bits}$ is determined by this equation:
$$\text{EntrySize}_{bits} = \log_2(R) + \log_2(B) + \log_2(Ro) + 1$$
where R, B, and Ro are the number of ranks, banks, and rows in DRAM, respectively.
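To make the two formulas above concrete, here is a minimal Python sketch that just evaluates them. I'm assuming C is the number of cores and MC the number of memory channels, and that R, B, and Ro are powers of two; the function names and the example numbers are made up for illustration and are not taken from the paper.

```python
import math


def entry_size_bits(ranks, banks, rows):
    """EntrySize_bits = log2(R) + log2(B) + log2(Ro) + 1, as quoted above."""
    # ceil() keeps the result an integer even if an input is not a power of two.
    return (math.ceil(math.log2(ranks))
            + math.ceil(math.log2(banks))
            + math.ceil(math.log2(rows))
            + 1)


def storage_bits(cores, channels, entries, ranks, banks, rows, lru_bits):
    """Storage_bits = C * MC * Entries * (EntrySize_bits + LRU_bits)."""
    return cores * channels * entries * (entry_size_bits(ranks, banks, rows) + lru_bits)


# Hypothetical configuration, chosen only to exercise the formula.
example_bits = storage_bits(cores=8, channels=2, entries=128,
                            ranks=1, banks=8, rows=65536, lru_bits=1)
example_kib = example_bits / 8 / 1024
```

Note this only sizes the extra cache in the memory controller; it does not give you an access latency in cycles.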
I was surprised to learn highly charged rows (recently accessed) will have a significantly lower access latency.
If this goes over your head as well, maybe this 2007 paper by Ulrich Drepper titled What Every Programmer Should Know About Memory will help you find the elements you need for your equation. I'm still working through this paper myself, and there are some dated references, but those depend on what CPU you're working with. Hope this helps. I look forward to being corrected on any of this, as I'm new to the topic.

Related

What is "expensive" vs "inexpensive" memory?

I've been reading that accessing data from lower components in the memory hierarchy is slower but less expensive. For example, fetching data from registers is fast but expensive. Could someone explain what 'expensive' means here? Is it literally the dollar cost of components? If so, I don't understand why faster components would be more expensive. I read this answer (Memory Hierarchy - Why are registers expensive?) and it talks about how accessing data in registers requires additional data paths that aren't required by lower memory components, but I didn't understand from any of the examples why those data paths would be required when fetching from registers, but not from something like main memory.
So to summarize, my two questions are:
1) What does 'expensive' mean in this context?
2) Why are faster areas of memory like registers more expensive?
Thanks!
1) What does 'expensive' mean in this context?
Expensive has the usual meaning ($). For an integrated circuit, price depends on circuit size and is directly related to the number and size of transistors. It happens that an "expensive" memory requires more area on an integrated circuit.
Actually, there are several technologies used to implement a memorizing device.
Registers are used in the processor. They are built from logic devices called latches, and their main quality is speed, in order to allow two reads and one write per cycle. To that end, transistors are sized to improve drive strength. It depends on the actual design, but typically a bit of memory requires ~10 transistors in a register.
Static memory (SRAM) is designed as a matrix of simplified flip-flops, with 2 inverters per cell, and only requires 6 transistors per memorized bit. Moreover, because it is a dedicated memory, its transistors are designed to be smaller than those in registers to increase the number of bits per unit area. SRAM is used in cache memory.
Dynamic memory (DRAM) uses only a single transistor per bit, paired with a capacitance that holds the value. The capacitance is either charged or discharged to represent a 1 or a 0. Though extremely economical, this technique cannot be very fast, especially when a large number of cells is involved, as in present DRAM chips. To improve capacity (number of bits in a given area), transistors are made as small as possible, and complex analog circuitry is used to detect small voltage variations and speed up reading the cell content. Moreover, a read destroys the cell content and requires a rewrite. Lastly, the capacitance leaks, and data must be periodically rewritten to ensure data integrity. Taken together, this makes DRAM a slow device, with an access time of 100-200 processor cycles, but it can provide extremely cheap physical memory.
2) Why are faster areas of memory like registers more expensive?
Processors rely on a memory hierarchy, and different levels of the hierarchy have specific constraints. To make a cheap memory, you need small transistors to reduce the area required to memorize a bit. But for electrical reasons, a small transistor is a poor generator that cannot provide enough current to drive its output rapidly. So, while the underlying technology is similar, design choices are different in different parts of the memory hierarchy.
What does 'expensive' mean in this context?
More and larger transistors (more silicon per bit), and more power to run those transistors.
Why are faster areas of memory like registers more expensive?
All memory is made as cheap as it can be for the needed speed -- there's no reason to make it more expensive if it doesn't need to be, or slower if it doesn't need to be. So it becomes a tradeoff and a matter of finding "sweet spots" in the design space -- make a particular type of circuit as fast and cheap as possible, while a different circuit is also tuned to be as fast and cheap as possible. If one design is both slower and more expensive, then there's no reason to ever use it. It's only when one design is faster while the other is cheaper that it makes sense to use both in different parts of the system.

Good memory use habits when accessing integer values in NVIDIA CUDA

I am completely new to CUDA programming and I want to make sure I understand some basic memory related principles, as I was a little confused with some of my thoughts.
I am working on simulation, using billions of one time use random numbers in range of 0 to 50.
After cuRand fills a huge array with random 0.0 to 1.0 floats, I run a kernel that converts all this float data to the desired integer range. From what I learned, I had a feeling that storing 5 of these values in one unsigned int, using just 6 bits each, is better because of the very low bandwidth of global memory. So I did it.
Now I have to store somewhere around 20000 read-only yes/no values, that will be accessed randomly with let's say the same probability based on random values going to the simulation.
First I thought about shared memory. Using one bit looked great until I realized that the more information in one bank, the more collisions there will be. So the solution seems to be use of unsigned short (2Byte->40KB total) to represent one yes/no information, using maximum of available memory and so minimizing probabilities of reading the same bank by different threads.
Another thought came from using constant memory and L1 cache. Here, from what I learned, the approach would be exactly opposite from shared memory. Reading the same location by more threads is now desirable, so putting 32 values on one 4B bank is now optimal.
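To illustrate the "32 values in one 4-byte word" layout described above, here is a minimal sketch in plain Python (not CUDA; it only shows the indexing and bit arithmetic, not the memory behaviour). The names `pack_flags` and `read_flag` are hypothetical helpers, and the flag pattern is made up.

```python
WORD_BITS = 32  # one 4-byte word holds 32 yes/no values


def pack_flags(flags):
    """Pack a list of booleans into a list of 32-bit words, one bit per flag."""
    words = [0] * ((len(flags) + WORD_BITS - 1) // WORD_BITS)
    for i, flag in enumerate(flags):
        if flag:
            words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words


def read_flag(words, i):
    """Return flag i: select the word, shift the wanted bit down, mask it."""
    return ((words[i // WORD_BITS] >> (i % WORD_BITS)) & 1) == 1


# 20000 flags, as in the question; the pattern itself is arbitrary.
flags = [i % 3 == 0 for i in range(20000)]
words = pack_flags(flags)
packed_bytes = len(words) * 4       # bit-packed footprint
unpacked_bytes = len(flags) * 2     # one unsigned short per flag
```

With 20000 flags, the bit-packed table is 625 words (2500 bytes) versus 40000 bytes for one unsigned short per flag; the price is one shift and one mask per lookup.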
So based on overall probabilities I should decide between shared memory and cache, with shared location probably being better as with so many yes/no values and just tens or hundreds of threads per block there will not be many bank collisions.
But am I right with my general understanding of the problem? Are the processors really that fast compared to memory that optimizing memory accesses is crucial and thoughts about extra instructions when extracting data by operations like <<, >>, |=, &= is not important?
Sorry for the wall of text. There are a million ways I can make the simulation work, but just a few ways of doing it right. I just don't want to repeat some stupid mistake again and again only because I misunderstand something.
The first two optimization priorities for any CUDA programmer are:
Launch enough threads (i.e. expose enough parallelism) that the machine has sufficient work to be able to hide latency
Make efficient use of memory.
The second item above will have various guidelines based on the types of GPU memory you are using, such as:
Strive for coalesced access to global memory
Take advantage of caches and shared memory to optimize the reuse of data
For shared memory, strive for non-bank-conflicted access
For constant memory, strive for uniform access (all threads within a warp read from the same location, in a given access cycle)
Consider use of texture caching and other special GPU features.
Your question is focused on memory:
Are the processors really that fast compared to memory that optimizing memory accesses is crucial and thoughts about extra instructions when extracting data by operations like <<, >>, |=, &= is not important?
Optimizing memory access is usually quite important to have a code that runs fast on a GPU. Most of the above recommendations/guidelines I gave are not focused on the idea that you should pack multiple data items per int (or some other word quantity), however it's not out of the question to do so for a memory-bound code. Many programs on the GPU end up being memory-bound (i.e. their usage of memory is ultimately the performance limiter, not their usage of the compute resources on the GPU).
An example of a domain where developers are looking at packing multiple data items per word quantity is in the Deep Learning space, specifically with convolutional neural networks on GPUs. Developers have found, in some cases, that they can make their training codes run faster by:
store the data set as packed half (16-bit float) quantities in GPU memory
load the data set in packed form
unpack (and convert) each packed set of 2 half quantities to 2 float quantities
perform computations on the float quantities
convert the float results to half and pack them
store the packed quantities
repeat the process for the next training cycle
(See this answer for some additional background on the half datatype.)
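The load / unpack / compute / repack steps above can be sketched on the CPU with Python's `struct` module, whose `e` format code is IEEE half precision. This is only an illustration of the data layout (two 16-bit halves in one 32-bit word), not GPU code; `pack_half2`, `unpack_half2`, and `scale_packed` are hypothetical names.

```python
import struct


def pack_half2(a, b):
    """Convert two floats to half precision and pack them into one 32-bit word."""
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]


def unpack_half2(word):
    """Unpack a 32-bit word into two Python floats (converted from half)."""
    return struct.unpack("<2e", struct.pack("<I", word))


def scale_packed(word, factor):
    """Load packed data, compute at full precision, then repack the results."""
    a, b = unpack_half2(word)                   # unpack and convert to float
    return pack_half2(a * factor, b * factor)   # convert back to half and pack


word = pack_half2(1.5, -2.0)
result = unpack_half2(scale_packed(word, 2.0))
```

The memory traffic is halved compared with storing two 32-bit floats, at the cost of the conversions and of half-precision rounding on every store.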
The point is, that efficient use of memory is a high priority for making codes run fast on GPUs, and this may include using packed quantities in some cases.
Whether or not it's sensible for your case is probably something you would have to benchmark and compare to determine the effect on your code.
Some other comments reading your question:
Here, from what I learned, the approach would be exactly opposite from shared memory.
In some ways, the use of shared memory and constant memory are "opposite" from an access-pattern standpoint. Efficient use of constant memory involves uniform access as already stated. In the case of uniform access, shared memory should also perform pretty well, because it has a broadcast feature. You are right to be concerned about bank conflicts in shared memory, but shared memory on newer GPUs has a broadcast feature that negates the bank-conflict issue when the accesses are to the same actual location as opposed to just the same bank. These nuances are covered in many other questions about shared memory on SO, so I'll not go into it further here.
Since you are packing 5 integer quantities (0..50) into a single int, you might also consider (maybe you are doing this already) having each thread perform the computations for 5 quantities. This would eliminate the need to share that data across multiple threads, or have multiple threads read the same location. If you're not doing this already, there may be some efficiencies in having each thread perform the computation for multiple data points.
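As a rough illustration of the packing described in the question (5 values in 0..50, 6 bits each, in one 32-bit unsigned int), here is a minimal Python sketch of the bit arithmetic. It is not CUDA, and `to_range`, `pack5`, and `unpack5` are hypothetical helpers; the clamp in `to_range` is my assumption so that an input of exactly 1.0 still maps to 50.

```python
BITS = 6                    # 6 bits hold 0..63, enough for 0..50
PER_WORD = 5                # 5 * 6 = 30 bits, fits in a 32-bit word
MASK = (1 << BITS) - 1


def to_range(u, n=51):
    """Map a float in [0, 1] to an integer in 0..n-1 (clamped at the top)."""
    return min(int(u * n), n - 1)


def pack5(values):
    """Pack five 6-bit values into one word, value i at bit offset 6*i."""
    if len(values) != PER_WORD or any(not 0 <= v <= MASK for v in values):
        raise ValueError("need exactly five values in 0..63")
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * BITS)
    return word


def unpack5(word):
    """Recover the five values with a shift and a mask each."""
    return [(word >> (i * BITS)) & MASK for i in range(PER_WORD)]


word = pack5([to_range(u) for u in (0.0, 0.25, 0.5, 0.75, 1.0)])
values = unpack5(word)
```

A thread that consumes all five values does one load followed by five shift-and-mask operations, which is the per-thread arrangement suggested above.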
If you wanted to consider switching to storing 4 packed quantities (per int) instead of 5, there might be some opportunities to use the SIMD intrinsics, depending on what exactly you are doing with those packed quantities.

What is the formal definition of scalability?

When I read about the definition of scalability on different websites, I learned that, in the context of CPUs and software, it means that the performance of the software improves as CPUs are added.
However, the description of scalability in the book "An Introduction to Parallel Programming" by Peter Pacheco is different:
"Suppose we run a parallel program with a fixed number of processes/threads and a fixed input size, and we obtain an efficiency E. Suppose we now increase the number of processes/threads that are used by the program. If we can find a corresponding rate of increase in the problem size so that the program always has efficiency E, then the program is scalable."
My question is: what is the proper definition of scalability? And if I am testing the scalability of parallel software, which of the two definitions should I be looking at?
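For testing against the book's definition, a minimal sketch might look like the following. It assumes the usual definition of efficiency, E = T_serial / (p × T_parallel), and uses made-up timings; `efficiency` and `keeps_efficiency` are hypothetical helper names.

```python
def efficiency(t_serial, t_parallel, p):
    """Efficiency E = speedup / p = T_serial / (p * T_parallel)."""
    return t_serial / (p * t_parallel)


def keeps_efficiency(runs, tolerance=0.05):
    """True if every run's efficiency stays within tolerance of the first run's.

    Each run is a tuple (t_serial, t_parallel, p).
    """
    effs = [efficiency(*run) for run in runs]
    return all(abs(e - effs[0]) <= tolerance for e in effs)


# Hypothetical timings: problem size grows together with the thread count.
growing_problem = [(10.0, 6.25, 2), (20.0, 6.25, 4), (40.0, 6.25, 8)]
# Hypothetical timings: same problem size, more threads.
fixed_problem = [(10.0, 6.25, 2), (10.0, 4.0, 4), (10.0, 3.125, 8)]

scalable_by_book_definition = keeps_efficiency(growing_problem)
efficiency_holds_at_fixed_size = keeps_efficiency(fixed_problem)
```

In these invented numbers the program gets faster in both experiments, but efficiency is only held constant when the problem size grows with the thread count, which is the distinction between the two definitions.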
Scalability is an application's ability to function correctly and maintain an acceptable user experience when used by a large number of clients.
Preferably, this ability should be achieved through elegant solutions in code, but where this isn't possible, the application's design must allow for horizontal growth using hardware (adding more computers, rather than increasing the performance of one computer).
Scalability is a concern which grows with the size of a business. Excellent examples are Facebook (video) and Dropbox (video). Also, here's a great explanation of various approaches to scalability from a session at Harvard.
Scalability also refers to the ability of a user interface to adapt to various screen sizes while maintaining the user experience.

Is it practical to include an adaptive or optimizing memory strategy into a library?

I have a library that does I/O. There are a couple of external knobs for tuning the sizes of the memory buffers used internally. When I ran some tests I found that the sizes of the buffers can affect performance significantly.
But the optimum size seems to depend on a bunch of things - the available memory on the PC, the size of the files being processed (varies from very small to huge), the number of files, the speed of the output stream relative to the input stream, and I'm not sure what else.
Does it make sense to build an adaptive memory strategy into the library? or is it better to just punt on that, and let the users of the library figure out what to use?
Has anyone done something like this - and how hard is it? Did it work?
Given different buffer sizes, I suppose the library could track the time it takes for various operations, and then it could make some decisions about which size was optimal. I could imagine having the library rotate through various buffer sizes in the initial I/O rounds... and then it eventually would do the calculations and adjust the buffer size in future rounds depending on the outcomes. But then, how often to re-check? How often to adjust?
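The rotate-measure-adjust idea above could be sketched roughly like this in Python. Everything here is hypothetical (the `BufferTuner` class, the candidate sizes, the fixed `recheck_every` window); it is one simple answer to "how often to re-check?", not a recommendation.

```python
import io
import time


def copy_stream(src, dst, buffer_size):
    """Copy src to dst in chunks of buffer_size; return the bytes moved."""
    total = 0
    while True:
        chunk = src.read(buffer_size)
        if not chunk:
            return total
        dst.write(chunk)
        total += len(chunk)


class BufferTuner:
    """Try each candidate size once per window, otherwise use the best so far."""

    def __init__(self, candidates, recheck_every=50):
        self.candidates = list(candidates)
        self.recheck_every = recheck_every
        self.samples = {size: [] for size in self.candidates}
        self.rounds = 0

    def next_size(self):
        slot = self.rounds % self.recheck_every
        self.rounds += 1
        if slot < len(self.candidates):
            return self.candidates[slot]    # exploration round
        return self.best()                  # exploitation round

    def record(self, size, bytes_moved, seconds):
        if seconds > 0:
            self.samples[size].append(bytes_moved / seconds)

    def best(self):
        def average(size):
            s = self.samples[size]
            return sum(s) / len(s) if s else 0.0
        return max(self.candidates, key=average)


def tuned_copy(tuner, src, dst):
    """One I/O round: pick a size, time the copy, feed the result back."""
    size = tuner.next_size()
    start = time.perf_counter()
    moved = copy_stream(src, dst, size)
    tuner.record(size, moved, time.perf_counter() - start)
    return moved
```

A caller would create one `BufferTuner([4096, 65536, 1 << 20])` and run every transfer through `tuned_copy`. A real version would probably also age out old samples and key the statistics on file size, since the question notes the optimum depends on both.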
The adaptive approach is sometimes referred to as "autonomic", using the analogy of a human's autonomic nervous system: you don't consciously control your heart rate and respiration, your autonomic nervous system does that.
You can read about some of this here, and here (apologies for the plugs, but I wanted to show that the concept is being taken seriously, and is manifesting in real products.)
My experience of using products that try to do this is that they do actually work, but can make me unhappy: that's because there is a tendency for them to take a "Father knows best" approach. You make some (you believe) small change to your app or the environment, and something unexpected happens. You don't know why, and you don't know if it's good. So my rule for autonomy is:
Tell me what you are doing and why
Now sometimes the underlying math is quite complex - consider that some autonomic systems are trending and hence making predictive changes (number of requests of this type growing, let's provision more of resource X) so the mathematical models are non-trivial. Hence simple explanations are not always available. However some level of feedback to the watching humans can be reassuring.

Reasons for NOT scaling-up vs. -out?

As a programmer I make revolutionary findings every few years. I'm either ahead of the curve, or behind it by about π in the phase. One hard lesson I learned was that scaling OUT is not always better, quite often the biggest performance gains are when we regrouped and scaled up.
What reasons do you have for scaling out vs. up? Price, performance, vision, projected usage? If so, how did this work for you?
We once scaled out to several hundred nodes that would serialize and cache necessary data out to each node and run maths processes on the records. Many, many billions of records needed to be (cross-)analyzed. It was the perfect business and technical case to employ scale-out. We kept optimizing until we processed about 24 hours of data in 26 hours wallclock. Really long story short, we leased a gigantic (for the time) IBM pSeries, put Oracle Enterprise on it, indexed our data and ended up processing the same 24 hours of data in about 6 hours. Revolution for me.
So many enterprise systems are OLTP and the data are not shard'd, but the desire by many is to cluster or scale-out. Is this a reaction to new techniques or perceived performance?
Do applications in general today, or our programming mantras, lend themselves better to scale-out? Do we/should we always take this trend into account in the future?
Because scaling up
Is limited ultimately by the size of box you can actually buy
Can become extremely cost-ineffective, e.g. a machine with 128 cores and 128G ram is vastly more expensive than 16 with 8 cores and 8G ram each.
Some things don't scale up well - such as IO read operations.
By scaling out, if your architecture is right, you can also achieve high availability. A 128-core, 128G ram machine is very expensive, but to have a 2nd redundant one is extortionate.
And also to some extent, because that's what Google do.
Scaling out is best for embarrassingly parallel problems. It takes some work, but a number of web services fit that category (thus the current popularity). Otherwise you run into Amdahl's law, which then means to gain speed you have to scale up not out. I suspect you ran into that problem. Also IO bound operations also tend to do well with scaling out largely because waiting for IO increases the % that is parallelizable.
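For reference, Amdahl's law says that if a fraction p of the work can be parallelized across n nodes, the speedup is 1 / ((1 - p) + p / n). A minimal sketch with made-up fractions (the function name is mine):

```python
def amdahl_speedup(parallel_fraction, n):
    """Speedup on n nodes when parallel_fraction of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)


# Hypothetical workloads: the serial part caps the speedup at 1 / (1 - p),
# no matter how many nodes are added.
speedups_95 = [amdahl_speedup(0.95, n) for n in (10, 100, 1000)]
speedups_50 = [amdahl_speedup(0.50, n) for n in (10, 100, 1000)]
```

With 95% of the work parallel the speedup can never reach 20x, and with 50% it can never reach 2x, which is why making the serial part faster (scaling up) can beat adding nodes.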
The blog post Scaling Up vs. Scaling Out: Hidden Costs by Jeff Atwood has some interesting points to consider, such as software licensing and power costs.
Not surprisingly, it all depends on your problem. If you can easily partition it into subproblems that don't communicate much, scaling out gives trivial speedups. For instance, searching for a word in 1B web pages can be done by one machine searching 1B pages, or by 1M machines doing 1000 pages each without a significant loss in efficiency (so with a 1,000,000x speedup). This is called "embarrassingly parallel".
Other algorithms, however, do require much more intensive communication between the subparts. Your example requiring cross-analysis is the perfect example of where communication can often drown out the performance gains of adding more boxes. In these cases, you'll want to keep communication inside a (bigger) box, going over high-speed interconnects, rather than something as 'common' as (10-)Gig-E.
Of course, this is a fairly theoretical point of view. Other factors, such as I/O, reliability, and ease of programming (one big shared-memory machine usually gives far fewer headaches than a cluster), can also have a big influence.
Finally, due to the (often extreme) cost benefits of scaling out using cheap commodity hardware, the cluster/grid approach has recently attracted much more (algorithmic) research. As a result, new ways of parallelization have been developed that minimize communication, and thus do much better on a cluster -- whereas common knowledge used to dictate that these types of algorithms could only run effectively on big iron machines...
