How do memory access patterns affect memory bandwidth?

I am involved in a computation that allocates about 1TByte of data in main memory.
I need to understand how much memory access patterns can affect the bandwidth to the processor.
For example I read that an Intel Xeon processor can support a memory bandwidth of 65GBytes/sec. Does this mean that this bandwidth will be achieved under all memory access patterns or only under optimal access patterns?
I know that on each request for data from main memory a whole cache line (64 bytes) is pulled in. I consider the memory access pattern optimal if, on each request, the entire cache line is used before the next request is issued. The worst case would be if on each request only one double (8 bytes) is used before the next request is issued.
Suppose I know that, on average, out of the 8 doubles that come in on each request I will be using only f = 1, 2, ..., 8 of them before the next request is issued. Is there an easy way to compute the actual bandwidth I will get as a function of f?
Thanks in advance for all replies
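For a rough first-order estimate (ignoring prefetching, DRAM page hits and other effects), the useful bandwidth simply scales with the fraction of each cache line that is actually consumed, i.e. roughly peak × f/8. A minimal sketch of that model in C, assuming the 65 GBytes/sec peak figure quoted above is actually sustainable:

```c
#include <stdio.h>

int main(void) {
    double peak_gb_s = 65.0;  /* peak figure quoted in the question (assumed sustainable) */
    for (int f = 1; f <= 8; ++f) {
        /* only f of the 8 doubles in each 64-byte line are consumed */
        double useful_gb_s = peak_gb_s * (double)f / 8.0;
        printf("f = %d : ~%.1f GB/s of useful data\n", f, useful_gb_s);
    }
    return 0;
}
```

This is only a back-of-envelope model; the real effective bandwidth also depends on whether the unused parts of the line cause extra DRAM page activations and on how well the hardware prefetchers can hide latency.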

Related

In RAM memory: is CL the total number of RAM cycles to access memory?

Well, my doubt is: when you buy new RAM for your computer, you can see something like CL17 in its specifications. I know that CL is the same as CAS, but I have a question here: I've read in some posts that CAS is the number of RAM clock cycles it takes for the RAM to output data requested by the CPU, but I've also read that we have to add the RAS-to-CAS delay to that CAS value to count the total RAM clock cycles it would take the RAM to output the requested data.
So, is it correct to say that, in my example, the CPU will wait 17 RAM clock cycles from when it requests the data until the first data bytes arrive? Or do we have to add the RAS-to-CAS delay?
And, if we have to add the RAS-to-CAS delay, how can I know how many cycles RAS-to-CAS is if the RAM vendor only tells me that it is "CL17"?
Edit: Suppose that when I talk about the 17 cycles I'm referring to "17 RAM cycles between an L3 miss and the reception of the first bytes of the data requested".
No. This delay is only a small part of the total delay from when a core requests some memory and the line returns to the core.
In particular, the request must make its way all the way from the core, checking the L1, L2 and L3 caches, and to the memory controller, before the DRAM (and timings like CAS) even become involved. After the read occurs, it has to go all the way back. This trip usually accounts for much more of the total latency of RAM access than the RAM access itself.
John D. McCalpin has an excellent blog post about the memory latency components on an x86 system. On that system the CAS delay of ~11 ns makes up only a bit more than 20% of the total latency of ~50 ns.
John also points out in a comment that on some multi-socket systems, the memory latencies may not even matter, because snooping the other cores in the system takes longer than the response from memory.
Regarding RAS-to-CAS vs CAS alone, it depends on the access pattern. The RAS-to-CAS delay is only needed if the row wasn't already open: in that case the row must be opened first and the RAS-to-CAS delay is incurred. Otherwise, if the row is already open, only the CAS delay is required. Which case applies depends on your physical-address access pattern, the RAM configuration, and how the memory controller maps physical addresses to RAM addresses.
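As a worked example of how those timings translate into nanoseconds, here is a small C sketch. The module parameters are assumptions for illustration (a DDR4-2400 part with a 1200 MHz memory clock, CL17, and tRCD of 17 cycles; the actual tRCD has to come from the module datasheet):

```c
#include <stdio.h>

int main(void) {
    double clock_mhz = 1200.0;  /* DDR4-2400: 1200 MHz memory clock (assumed module) */
    int cl   = 17;              /* CAS latency in memory-clock cycles ("CL17") */
    int trcd = 17;              /* RAS-to-CAS delay in cycles (assumed; check the datasheet) */

    double ns_per_cycle = 1000.0 / clock_mhz;
    printf("row already open : CL only   = %.1f ns\n", cl * ns_per_cycle);
    printf("row not open     : tRCD + CL = %.1f ns\n", (trcd + cl) * ns_per_cycle);
    return 0;
}
```

Either figure is still only the DRAM part of the trip; the path through the caches and the memory controller comes on top, as described above.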

CPUs in multi-core architectures and memory access

I wondered how memory access is handled "in general" if, for example, 2 CPU cores try to access memory at the same time (through the memory controller)? Actually the same applies when a core and a DMA-enabled IO device try to access memory in the same way.
I think the memory controller is smart enough to utilise the address bus and handle those requests concurrently, but I'm not sure what happens when they try to access the same location, or when the IO operation monopolises the address bus and there's no room for the CPU to move on.
Thx
The short answer is "it's complex, but access can certainly potentially occur in parallel in certain situations".
I think your question is a bit too black and white: you may be looking for an answer like "yes, multiple devices can access memory at the same time" or "no they can't", but the reality is that first you'd need to describe some specific hardware configuration, including some of the low-level implementation details and optimization features to get an exact answer. Finally you'd need to define exactly what you mean by "the same time".
In general, a good first-order approximation is that hardware will make it appear that all hardware can access memory approximately simultaneously, possibly with an increase in latency and a decrease in bandwidth due to contention. At the very fine-grained timing level, access by one device may indeed postpone access by another device, or it may not, depending on many factors. It is extremely unlikely you would need this information to implement software correctly, and quite unlikely you need to know the details even to maximize performance.
That said, if you really need to know the details, read on and I can give some general observations on some kind of idealized laptop/desktop/server scale hardware.
As Matthias mentioned, you first have to consider caching. Caching means that any read or write operation subject to caching (which includes nearly all CPU requests and many other types of requests as well) may not touch memory at all, so in that sense many cores can "access" memory (at least the cache image of it) simultaneously.
If you then consider requests that miss in all cache levels, you need to know about the configuration of the memory subsystem. In general a RAM chip can only do "one thing" at a time (i.e., commands1 such as read and write apply to the entire module), and that usually extends to DRAM modules comprised of several chips, and also to a series of DRAMs connected via a bus to a single memory controller.
So you can say that, electrically speaking, the combination of one memory controller and its attached RAM is likely to be doing only one thing at once. Now that thing is usually something like reading bytes out of a physically contiguous span of addresses, but that operation could actually help handle several requests from different devices at once: even though each device sends separate requests to the controller, good implementations will coalesce requests to the same or nearby2 areas of memory.
Furthermore, even the CPU may have such abilities: when a new request occurs it can/must notice that an existing request is in progress for an overlapping region and tie the new request to an old one.
Still, you can say that for a single memory controller you'll usually be serving the request of one device at a time, absent unusual opportunities to combine requests. Now the requests themselves are typically on the order of nanoseconds, so many separate requests can be served in a small unit of time, so this "exclusiveness" is fine-grained and not generally noticeable3.
Now above I was careful to limit the discussion to a single memory-controller - when you have multiple memory controllers4 you can definitely have multiple devices accessing memory simultaneously even at the RAM level. Here each controller is essentially independent, so if the requests from two devices map to different controllers (different NUMA regions) they can proceed in parallel.
That's the long answer.
1 In fact, the command stream is lower level and more complex than things like "read" or "write" and involves concepts such as opening a memory page, streaming bytes from it, etc. What every programmer should know about memory serves as an excellent intro to the topic.
2 For example, imagine two requests for adjacent bytes in memory: it is possible the controller can combine them into a single request if they fit within the bus width.
3 Of course if you are competing for memory across several devices, the overall impact may be very noticeable: a reduction in per-device bandwidth and an increase in latency, but what I mean is that the sharing is fine-grained enough that you can't generally tell the difference between finely-sliced exclusive access and some hypothetical device which makes simultaneous progress on each request in each period.
4 The most common configuration on modern hardware is one memory controller per socket, so on a 2P system you'd usually have two controllers, although other ratios (both higher and lower) are certainly possible.
There are dozens of things that come into play. E.g. at the lowest level there are bus arbitration mechanisms which allow multiple participants to access a shared address and data bus.
On a higher level there are also things like CPU caches that need to be considered: if a CPU reads from memory it might only read from its local cache, which might not reflect the state that exists in another CPU core's local cache. To synchronize memory between cache instances in multicore systems there exist cache coherence protocols which are implemented in the CPUs. These have to guarantee that if one CPU writes to shared memory, the caches of all other CPUs (which might also contain a copy of that memory location's content) get updated.
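This coherence traffic is something you can actually observe from software. The sketch below (plain C with POSIX threads; the 64-byte line size and the iteration count are assumptions for illustration) has two threads incrementing counters that share one cache line, so the line bounces between the cores' caches ("false sharing"); padding each counter onto its own line removes the effect and the loops speed up noticeably.

```c
#include <pthread.h>
#include <stdio.h>

/* Both counters sit in the same 64-byte cache line, so every write by one
 * thread invalidates the line in the other core's cache. Padding each
 * counter out to its own 64-byte line would remove that coherence traffic. */
static struct { long a; long b; } counters;

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; ++i) counters.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; ++i) counters.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a = %ld, b = %ld\n", counters.a, counters.b);
    return 0;
}
```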

shared memory vs texture memory in opencl

I am writing deinterlacing code in OpenCL. I am reading the pixels into local memory using the read_imageui() API.
Just like the code at:
https://opencl-book-samples.googlecode.com/svn-history/r29/trunk/src/Chapter_19/oclFlow/lkflow.cl
As per my understanding, when we read pixels using this API we are reading from texture memory. I am doubtful that staging the pixels in shared (local) memory first will gain me any speed, since texture memory already acts as a cache and provides fast access to the data.
Can anyone clarify my doubt ?
If you can fit all your data in private memory after reading it with read_imageui, you should definitely do that. Keep in mind that you only have 256 bytes of private memory per work item if your kernel compiles SIMD16 and 512 bytes if it compiles SIMD8.
Whether you should use local memory or not really depends on the access pattern. Indeed, samplers have their own L1 and L2 caches, so if your data accesses always hit those caches, you should be fine. Remember that local memory is banked: you have 16 banks from which you can fetch 4 bytes at a time, which means that you get full bandwidth if you hit all 16 banks from all work items in one hardware thread (typically 16 or 8 of them). So you might have a situation where you are better off reading image data into local memory first and then accessing local memory in an orderly fashion. Good examples of this are algorithms like SIFT or SURF, where you access the image in such a way that the sampler cache really does not help much (you still get sampler interpolation benefits), but then you place all that data in local memory and access it repeatedly in a fairly regular pattern.
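A small plain-C illustration of the banking rule described above (the 16 banks of 4 bytes each are taken from the answer; the strides are just examples, not anything from the original question): consecutive 4-byte words map to consecutive banks, so unit-stride access from 16 work items spreads across all 16 banks, while a stride of 16 words makes every access collide on the same bank.

```c
#include <stdio.h>

int main(void) {
    const int num_banks  = 16;  /* figures taken from the answer above */
    const int bank_width = 4;   /* bytes per bank access */

    for (int item = 0; item < 16; ++item) {
        int unit_stride_addr = item * bank_width;        /* consecutive 4-byte words */
        int big_stride_addr  = item * 16 * bank_width;   /* stride of 16 words */
        printf("work item %2d -> bank %2d (unit stride), bank %2d (stride 16)\n",
               item,
               (unit_stride_addr / bank_width) % num_banks,
               (big_stride_addr  / bank_width) % num_banks);
    }
    return 0;
}
```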
In general, that's true. However, even a cached read from a texture might be slower than a read from shared local memory, so an algorithm that makes many overlapped reads from adjacent locations could still benefit somewhat from using shared local memory. However, it will make the kernel more complicated, so in many cases (and certainly during algorithm development) it is fine to just rely on the cached texture reads.

Apache Thrift maximum message size

We are using Apache Thrift to exchange messages between two systems. In one of the messages we are exchanging a list (C++) which can become huge in size. Can you please let me know what the maximum message size is that we can exchange using Apache Thrift?
There is no defined limit per se (at least none that I am aware of). It mostly depends on how the data are held in memory, what load is on the server and how many resources are available. For the most part, contiguous blocks of memory (RAM) will very likely become the scarcest resource, so we should focus on that point.
The "how the data are held in memory" refers to the fact that, for the sake of better throughput, some transports (buffered, framed) tend to allocate more memory and larger blocks than others. Depending on the language's implementation, this process may be more or less efficient in terms of memory cost.
If you really plan to transfer large blocks of data, you should also look at other options like
chunking the data into blocks
sending/returning only a URL or LAN share through the service, instead of the whole data

Clarify: Processor operates at 800 MHz and 200 MHz DDR RAM

I have an evaluation kit which has an implementation of ARM Cortex-A8 core. The processor data sheet states that it has a
ARM Cortex A8™ core, which operates at speeds as high as 800MHz and Up to 200MHz DDR2 RAM.
What can I expect from this system? Am I right to assume that the memory accesses will be a bottleneck because it operates at only 200MHz?
Need more info on how to interpret this.
The processor works with an internal cache (actually, several) which it can access at "full speed". The cache is small (typically 8 to 32 kilobytes) and is filled by chunks ("cache lines") from the external RAM (a cache line will be a few dozen consecutive bytes). When the code needs some data which is not presently in the cache, the processor will have to fetch the line from main RAM; this is called a cache miss.
How fast the cache line can be obtained from main RAM is described by two parameters, called latency and bandwidth. Latency is the amount of time between the moment the processor issues the request and the moment the first cache line byte is received. Typical latencies are about 30 ns. At 800 MHz, 30 ns means 24 clock cycles. Bandwidth describes how many bytes per nanosecond can be sent on the bus. "200 MHz DDR2" means that the bus clock runs at 200 MHz. DDR2 RAM can transfer two data elements per cycle (hence 400 million transfers per second). Bandwidth then depends on how many wires there are between the CPU and the RAM: with a 64-bit bus and 200 MHz DDR2 RAM, you could hope for 3.2 GBytes/s in ideal conditions. So while the first byte takes quite some time to be obtained (latency is high compared with what the CPU can do), the rest of the cache line is read quite quickly.
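The 3.2 GBytes/s figure follows directly from that arithmetic; here is a tiny C sketch that reproduces it (the 64-bit bus width is the assumption made in the paragraph above, not something stated on the evaluation kit):

```c
#include <stdio.h>

int main(void) {
    double bus_clock_hz   = 200e6;              /* 200 MHz DDR2 bus clock */
    double transfers_s    = 2.0 * bus_clock_hz; /* DDR: two transfers per clock */
    double bus_width_byte = 8.0;                /* 64-bit bus assumed above */

    double peak_bytes_s = transfers_s * bus_width_byte;
    printf("ideal peak bandwidth: %.1f GB/s\n", peak_bytes_s / 1e9);  /* prints 3.2 */
    return 0;
}
```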
In the other direction: the CPU writes some data to its cache, and some circuitry will propagate the modification to main RAM at its leisure.
The description above is overly simplistic; caches and cache management are a complex area. The bottom line is the following: if your code uses big data tables in memory and accesses them in a seemingly random way, then the application will be slow, because most of the time the processor will just be waiting for data from main memory. On the other hand, if your code can operate with little RAM, less than a few dozen kilobytes, then chances are that it will run most of the time within the innermost cache, and external RAM speed will be unimportant. The ability to make memory accesses in a way which works well with the caches is called locality of reference.
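A minimal sketch of what good and bad locality look like in C (the 1024×1024 table size is arbitrary, chosen only so the table is much larger than the caches on such a chip): the row-by-row traversal uses every byte of each fetched cache line, the column-by-column traversal uses only one double per line fetched.

```c
#include <stdio.h>

#define N 1024

static double table[N][N];  /* 8 MB: far larger than the caches on such a chip */

int main(void) {
    double sum = 0.0;

    /* Good locality: the inner loop walks consecutive addresses,
     * so every byte of each fetched cache line gets used. */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            sum += table[i][j];

    /* Poor locality: the inner loop strides by N*sizeof(double) bytes,
     * so each fetched cache line contributes only one double. */
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            sum += table[i][j];

    printf("%f\n", sum);
    return 0;
}
```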
See the Wikipedia page on caches for an introduction and pointers on the matter of caches.
(Big precomputed tables were a common optimization trick during the '80s, because at that time processors were not faster than RAM, and one-cycle memory access was the rule. Which is why an 8 MHz Motorola 68000 CPU had no cache. But those days are long gone.)
Yes, the memory may well be a bottleneck but you will be very unlikely to be running an application that does nothing but read and write to memory.
For work that stays inside the CPU (registers and caches), the memory bottleneck will not have an effect.
