High performance computing - memory

"a smaller main memory per node may reduces the amount of computation a node can
execute without internode communication, increases the frequency of communication, and reduces the sizes of the respective messages communicated."
I don't understand the sentence above. Could you explain it or restate it?

This clearly refers to distributed computing.
a smaller main memory per node may reduce the amount of computation a node can execute without internode communication,
The less memory a node has, the less data it holds, and generally the less work it can do on that data. The more memory it has, the more work it can do before it has to synchronize with the other nodes to request more work or send back results toward some greater common result.
increases the frequency of communication,
When less memory is available, the node has to request smaller work chunks (because it can't hold more data) and thus make more requests.
and reduces the sizes of the respective messages communicated
When less memory is available, the node's work chunks are smaller and thus the size of the messages carrying this work will be smaller.
For example
Let's assume you want to do a distributed "grep" on a file with 1,000 lines, on a cluster with 2 machines.
First Scenario
If each machine has a capacity of 1 line, then each machine will have to exchange 500 rounds of messages (a send and a return for each line).
Second Scenario
If each machine has a capacity of 500 lines, then each machine will receive one message of 500 lines and return a single message with all the lines that contain the requested word.
The first scenario has less computation per communication and a higher frequency of message passing, but smaller messages to send/receive.
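To make the trade-off concrete, here is a small back-of-the-envelope sketch in Python; the function name and the counting convention (one send plus one return per chunk) are illustrative assumptions matching the two scenarios above, not part of any real system.

def message_stats(total_lines, nodes, capacity_per_node):
    # Each node works through its share of the file one chunk at a time;
    # every chunk costs one "send work" message and one "return result" message.
    chunks = -(-total_lines // capacity_per_node)   # ceiling division
    chunks_per_node = -(-chunks // nodes)
    return {"round_trips_per_node": chunks_per_node,
            "lines_per_message": capacity_per_node}

print(message_stats(1000, 2, 1))    # scenario 1: 500 round trips of 1 line each
print(message_stats(1000, 2, 500))  # scenario 2: 1 round trip of 500 lines

Less memory per node shifts the result toward many small round trips; more memory per node shifts it toward a few large ones.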

Related

Can hardware threads access main memory at the same time?

I am trying to understand microarchitecture.
When an operating system schedules code to run on a CPU hardware thread (as in Intel HyperThreading), can each execution context issue memory reads in parallel or is the pipeline shared?
I am trying to do some rough calculations and complexity analysis, and I want to know whether memory bandwidth is shared: should I divide my figure by the number of physical cores (if the pipeline and bandwidth are shared) or by the number of hardware threads (if each hardware thread can access memory in parallel)?
Yes, the pipeline is shared, so it's possible for each of the two load execution units in a physical core to be running a uop from a different logical core, accessing L1d in parallel. (e.g. https://www.realworldtech.com/haswell-cpu/5/ or https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram)
Off-core (L2 miss) bandwidth doesn't scale with the number of logical cores, and one thread per core can fairly easily saturate it, especially with SIMD, if your code has high throughput (not bottlenecking on latency or branch misses) and low computational intensity (ALU work per load of data into registers, or into L1d or L2 cache, whichever you're cache-blocking for), e.g. a dot product.
Well-tuned high-throughput (instructions per cycle) code like linear algebra stuff (especially matmul) often doesn't benefit from more than 1 thread per physical core, instead suffering more cache misses when two threads are competing for the same L1d / L2 cache.
Cache-blocking aka loop tiling can help a lot, if you can loop again over a smaller chunk of data while it's still hot in cache. See How much of ‘What Every Programmer Should Know About Memory’ is still valid? (most of it).
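As a rough sketch of what cache-blocking looks like, here is a blocked matrix multiply in Python/NumPy; the block size of 64 is an arbitrary assumption, and in an interpreted language the cache effect is mostly hidden inside NumPy's own kernels, but the loop structure is the same one you would tile in C or Fortran.

import numpy as np

def blocked_matmul(A, B, block=64):
    # Multiply A (n x k) by B (k x m) one tile at a time, so each tile of A and B
    # is reused while it is still hot in L1d/L2 instead of being re-fetched from
    # memory for every output row.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=np.result_type(A, B))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(blocked_matmul(A, B), A @ B)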

How do I calculate memory bandwidth on a given (Linux) system, from the shell?

I want to write a shell script/command which uses commonly-available binaries, the /sys filesystem, or other facilities to calculate the theoretical maximum bandwidth for the RAM available on a given machine.
Notes:
I don't care about latency, just bandwidth.
I'm not interested in the effects of caching (e.g. the CPU's last-level cache), but in the bandwidth of reading from RAM proper.
If it helps, you may assume a "vanilla" Intel platform, and that all memory DIMMs are identical; but I would rather you not make this assumption.
If it helps, you may rely on root privileges (e.g. using sudo)
@einpoklum, you should have a look at Performance Counter Monitor, available at https://github.com/opcm/pcm. It will give you the measurements that you need. I do not know if it supports kernel 2.6.32.
Alternatively you should also check Intel's EMON tool which promises support for kernels as far back as 2.6.32. The user guide is listed at https://software.intel.com/en-us/download/emon-user-guide, which implies that it is available for download somewhere on Intel's software forums.
I'm not aware of any standalone tool that does it, but for Intel chips only, if you know the "ARK URL" for the chip, you could get the maximum bandwidth using a combination of a tool to query ARK, like curl, and something to parse the returned HTML, like xmllint --html --xpath.
For example, for my i7-6700HQ, the following works:
curl -s 'https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz' | \
xmllint --html --xpath '//li[@class="MaxMemoryBandwidth"]/span[@class="value"]/span/text()' - 2>/dev/null
This returns 34.1 GB/s, which is the maximum theoretical bandwidth of my chip.
The primary difficulty is determining the ARK URL, which doesn't correspond in an obvious way to the CPU brand string. One solution would be to find the CPU model on an index page like this one, and follow the link.
This gives you the maximum theoretical bandwidth, which can be calculated as (number of memory channels) x (transfer width) x (data rate). The data rate is the number of transfers per unit time, and is usually the figure given in the name of the memory type, e.g., DDR-2133 has a data rate of 2133 million transfers per second. Alternatively, you can calculate it as the product of the bus speed (1067 MHz in this case) and the data rate multiplier (2 for DDR technologies).
For my CPU, this calculation gives 2 memory channels * 8 bytes/transfer * 2133 million transfers/second = 34.128 GB/s, consistent with the ARK figure.
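For reference, the same arithmetic as a tiny Python snippet; the channel count and transfer rate are the i7-6700HQ/DDR4-2133 numbers used above and should be replaced with your own system's values.

channels = 2                     # populated memory channels
bytes_per_transfer = 8           # 64-bit wide bus per channel
transfers_per_second = 2133e6    # DDR4-2133 = 2133 MT/s

bandwidth = channels * bytes_per_transfer * transfers_per_second
print(f"{bandwidth / 1e9:.3f} GB/s")   # prints 34.128 GB/s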
Note that theoretical maximum as reported by ARK might be lower or higher than the theoretical maximum on your particular system for various reasons, including:
Fewer memory channels populated than the maximum number of channels. For example, if I only populated one channel on my dual channel system, theoretical bandwidth would be cut in half.
Not using the maximum speed supported RAM. My CPU supports several RAM types (DDR4-2133, LPDDR3-1866, DDR3L-1600) with varying speeds. The ARK figure assumes you use the fastest possible supported RAM, which is true in my case, but may not be true on other systems.
Over- or under-clocking of the memory bus, relative to its nominal speed.
Once you get the correct theoretical figure, you won't actually reach this figure in practice, due to various factors including the following:
Inability to saturate the memory interface from one or more cores due to limited concurrency for outstanding requests, as described in the section "Latency Bound Platforms" in this answer.
Hidden doubling of the bandwidth cost of writes that need to read the line before writing it.
Various low-level factors relating to the DRAM interface that prevent 100% utilization, such as the cost of opening pages, the read/write turnaround time, refresh cycles, and so on.
Still, using enough cores and non-temporal stores, you can get very close to the theoretical bandwidth, often 90% or more.

CPUs in multi-core architectures and memory access

I wondered how memory access is handled "in general" if, for example, two CPU cores try to access memory at the same time (through the memory controller). Actually, the same applies when a core and a DMA-enabled IO device try to access memory at the same time.
I think the memory controller is smart enough to utilise the address bus and handle those requests concurrently; however, I'm not sure what happens when they try to access the same location, or when the IO operation monopolises the address bus and there's no room for the CPU to proceed.
Thx
The short answer is "it's complex, but access can certainly potentially occur in parallel in certain situations".
I think your question is a bit too black and white: you may be looking for an answer like "yes, multiple devices can access memory at the same time" or "no they can't", but the reality is that first you'd need to describe some specific hardware configuration, including some of the low-level implementation details and optimization features to get an exact answer. Finally you'd need to define exactly what you mean by "the same time".
In general, a good first-order approximation is that hardware will make it appear that all hardware can access memory approximately simultaneously, possibly with an increase in latency and a decrease in bandwidth due to contention. At the very fine-grained timing level, access by one device may indeed postpone access by another device, or it may not, depending on many factors. It is extremely unlikely you would need this information to implement software correctly, and quite unlikely you need to know the details even to maximize performance.
That said, if you really need to know the details, read on and I can give some general observations on some kind of idealized laptop/desktop/server scale hardware.
As Matthias mentioned, you first have to consider caching. Caching means that any read or write operation subject to caching (which includes nearly all CPU requests and many other types of requests as well) may not touch memory at all, so in that sense many cores can "access" memory (at least the cache image of it) simultaneously.
If you then consider requests that miss in all cache levels, you need to know about the configuration of the memory subsystem. In general a RAM chip can only do "one thing" at a time (i.e., commands1 such as read and write apply to the entire module), and that usually extends to DRAM modules comprised of several chips and also to a series of DRAMs connected via a bus to a single memory controller.
So you can say that, electrically speaking, the combination of one memory controller and its attached RAM is likely to be doing only one thing at once. Now that thing is usually something like reading bytes out of a physically contiguous span of bytes, but that operation could actually help handle several requests from different devices at once: even though each device sends separate requests to the controller, good implementations will coalesce requests to the same or nearby2 areas of memory.
Furthermore, even the CPU may have such abilities: when a new request occurs it can/must notice that an existing request is in progress for an overlapping region and tie the new request to an old one.
Still, you can say that for a single memory controller you'll usually be serving the request of one device at a time, absent unusual opportunities to combine requests. Now the requests themselves are typically on the order of nanoseconds, so many separate requests can be served in a small unit of time, so this "exclusiveness" is fine-grained and not generally noticeable3.
Now above I was careful to limit the discussion to a single memory controller: when you have multiple memory controllers4 you can definitely have multiple devices accessing memory simultaneously even at the RAM level. Here each controller is essentially independent, so if the requests from two devices map to different controllers (different NUMA regions) they can proceed in parallel.
That's the long answer.
1 In fact, the command stream is lower level and more complex than things like "read" or "write" and involves concepts such as opening a memory page, streaming bytes from it, etc. What every programmer should know about memory serves as an excellent intro to the topic.
2 For example, imagine two requests for adjacent bytes in memory: it is possible the controller can combine them into a single request if they fit within the bus width.
3 Of course if you are competing for memory across several devices, the overall impact may be very noticeable: a reduction in per-device bandwidth and an increase in latency, but what I mean is that the sharing is fine-grained enough that you can't generally tell the difference between finely-sliced exclusive access and some hypothetical device which makes simultaneous progress on each request in each period.
The most common configuration on modern hardware is one memory controller per socket, so on a 2P system you'd usually have two controllers, but other ratios (both higher and lower) are certainly possible.
There are dozens of things that come into play. E.g. on the lowest level there are bus arbitration mechanisms which allow multiple participants to access a shared address and data bus.
On a higher level there are also things like CPU caches that need to be considered: if a CPU reads from memory it might only read from its local cache, which might not reflect the state that exists in another CPU core's local cache. To synchronize memory between cache instances in multicore systems there exist cache coherence protocols which are implemented in the CPUs. These have to guarantee that if one CPU writes to shared memory, the caches of all other CPUs (which might also contain a copy of the memory location's content) get updated.

In OpenCL, how can __local memory be faster when work-group sizes aren't part of the architecture?

Apologies for my naiveté if this question is silly, I'm new to GPGPU programming.
My question is, since the architecture of the device can't change, how is it that __local memory can be optimized for access by items only in the local work-group, when it's the user that chooses the work-group size (subject to divisibility)?
Local memory is usually attached to a certain cluster of execution units in GPU hardware. Work group size is indeed chosen by the client application, but the OpenCL implementation will impose a limit. Your application needs to query this via clGetKernelWorkGroupInfo() using the CL_KERNEL_WORK_GROUP_SIZE parameter name.
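Here is a minimal sketch of that query using pyopencl (assuming pyopencl is installed; the kernel below is just a placeholder). The same information is available from the C API via clGetKernelWorkGroupInfo and clGetDeviceInfo.

import pyopencl as cl

ctx = cl.create_some_context()
device = ctx.devices[0]

src = """
__kernel void scale(__global float *buf) {
    int i = get_global_id(0);
    buf[i] *= 2.0f;
}
"""
kernel = cl.Program(ctx, src).build().scale

# Per-kernel work-group size limit imposed by the implementation.
max_wg = kernel.get_work_group_info(
    cl.kernel_work_group_info.WORK_GROUP_SIZE, device)
# Local memory available to the cluster of execution units serving a work-group.
print("max work-group size for this kernel:", max_wg)
print("local memory size:", device.local_mem_size, "bytes")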
There's some flexibility in work group size because most GPUs are designed so multiple threads of execution can be scheduled to run on a single execution unit. (A form of SMT.) Note also that the scheduled threads don't even need to be in the same work group, so if for example a GPU has 64 processors in a cluster, and supports 4-way SMT on each processor, those 256 threads could be from 1, 2, or 4, or possibly even 8 or 16 work groups, depending on hardware and compiler capabilities.
Some GPUs' processors also use vector registers and instructions internally, so threads don't map 1:1 to OpenCL work items - one processor might handle 4 work items at once, for example.
Ultimately though, a work-group must fit onto the cluster of processors that is attached to one chunk of local memory; so you've got local memory size and maximum number of threads that can be scheduled on one cluster influencing the maximum work group size.
In general, try to minimise the amount of local memory your work group uses so that the OpenCL implementation has the maximum flexibility for scheduling work groups. (But definitely do use local memory when it helps performance! Just use as little of it as possible.)

What is the largest message Aeron can process?

What is the size of the largest message real-logic's Aeron can process?
I realize Aeron shines for small messages; however, I would like to have a single protocol throughout our stack, and some of our messages easily reach a size of 100 MB.
The documentation is not clear on what settings affect the answer to this question. My worry is that the default buffer settings don't allow messages of this size. Or do the buffer settings have no impact on maximum application message size?
We discussed this issue at a recent conference with Martin Thompson, the main contributor to Aeron.
Not only is it not possible to exceed the size of the page; if you take a look at the Aeron presentation, slides 52 to 54, these are the main reasons not to push large files:
page faults are almost guaranteed if the message size is close to the page size, and Aeron will have to repeat the sequence with the next page, i.e. a major slowdown
cache churn
VM pressure
Long story short, split your 100 MB message into smaller chunks: at most 1/4 of the page size minus the header if you use IPC, or the MTU (maximum transmission unit) minus the header if you go via UDP. This will improve throughput, remove pipe clogging, and fix other issues. The logic behind the MTU is that occasionally UDP packets get lost and everything is re-sent by Aeron starting with the lost packet, so you want your chunks to fit in a single packet for performance.
For maximum performance, in your application-level protocol I would create a first packet carrying the file metadata and length; the receiving side then creates a memory-mapped file of the given length on disk and fills it with the incoming chunks, where each chunk includes an offset and a file ID (in case you want to push multiple files simultaneously). This way you copy from the Aeron buffer into the mmap and completely avoid garbage collection. I expect the performance of such a file transfer to be faster than most other options.
To answer the original question:
With default settings, the maximum message size is 16 MB minus the 32-byte header.
To change it, on your media driver you can try changing the
aeron.term.buffer.length and aeron.term.buffer.max.length system properties (see Aeron Configuration Options). Here, the term buffer corresponds to a "page" in other parts of the documentation and, in fact, works similarly to OS memory pages in some respects, while term.buffer.max.length configures the number of buffers/pages in rotation.
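As a back-of-the-envelope illustration of the chunking advice, here is a small Python sketch; the 16 MB term buffer and 32-byte header come from this answer, while the 1408-byte MTU is an assumed value that should be checked against your aeron.mtu.length setting.

import math

message_size = 100 * 1024 * 1024    # the ~100 MB message from the question
term_buffer = 16 * 1024 * 1024      # default aeron.term.buffer.length (per this answer)
header = 32                         # frame header size (per this answer)
mtu = 1408                          # assumed MTU; check aeron.mtu.length

ipc_chunk = term_buffer // 4 - header   # "1/4 of the page size minus header"
udp_chunk = mtu - header                # "MTU minus header"

print("IPC:", math.ceil(message_size / ipc_chunk), "chunks of", ipc_chunk, "bytes")
print("UDP:", math.ceil(message_size / udp_chunk), "chunks of", udp_chunk, "bytes")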

Resources