How do I calculate memory bandwidth on a given (Linux) system, from the shell?

I want to write a shell script/command which uses commonly-available binaries, the /sys filesystem or other facilities to calculate the theoretical maximum bandwidth for the RAM available on a given machine.
Notes:
I don't care about latency, just bandwidth.
I'm not interested in the effects of caching (e.g. the CPU's last-level cache), but in the bandwidth of reading from RAM proper.
If it helps, you may assume a "vanilla" Intel platform, and that all memory DIMMs are identical; but I would rather you not make this assumption.
If it helps, you may rely on root privileges (e.g. using sudo)

@einpoklum: you should have a look at Performance Counter Monitor, available at https://github.com/opcm/pcm. It will give you the measurements that you need. I do not know if it supports kernel 2.6.32.
Alternatively, you could also check Intel's EMON tool, which promises support for kernels as far back as 2.6.32. The user guide is listed at https://software.intel.com/en-us/download/emon-user-guide, which implies that it is available for download somewhere on Intel's software forums.

I'm not aware of any standalone tool that does it, but for Intel chips only, if you know the "ARK URL" for the chip, you could get the maximum bandwidth using a combination of a tool to query ARK, like curl, and something to parse the returned HTML, like xmllint --html --xpath.
For example, for my i7-6700HQ, the following works:
curl -s 'https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz' | \
xmllint --html --xpath '//li[@class="MaxMemoryBandwidth"]/span[@class="value"]/span/text()' - 2>/dev/null
This returns 34.1 GB/s which is the maximum theoretical bandwidth of my chip.
The primary difficulty is determining the ARK URL, which doesn't correspond in an obvious way to the CPU brand string. One solution would be to find the CPU model on one of ARK's index pages and follow the link.
This gives you the maximum theoretical bandwidth, which can be calculated as (number of memory channels) x (transfer width) x (data rate). The data rate is the number of transfers per unit time, and is usually the figure given in the name of the memory type, e.g., DDR-2133 has a data rate of 2133 million transfers per second. Alternately, you can calculate it as the product of the bus speed (1067 MHz in this case) and the data rate multiplier (2 for DDR technologies).
For my CPU, this calculation gives 2 memory channels * 8 bytes/transfer * 2133 million transfers/second = 34.128 GB/s, consistent with the ARK figure.
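If you want to script this calculation rather than do it by hand, here is a minimal C sketch of the same formula. The channel count, bus width and transfer rate are hard-coded placeholders for the i7-6700HQ example above; on your own machine you would substitute values gleaned from, e.g., sudo dmidecode -t memory or the ARK page.

#include <stdio.h>

/* Placeholder inputs for the i7-6700HQ / DDR4-2133 example above.
 * Substitute the values for your own machine (e.g. taken from
 * "sudo dmidecode -t memory" or the CPU's ARK page). */
int main(void) {
    int channels = 2;                 /* populated memory channels       */
    int bytes_per_transfer = 8;       /* 64-bit wide DDR bus = 8 bytes   */
    double mtransfers_per_s = 2133.0; /* DDR4-2133 -> 2133 MT/s          */

    double gb_per_s = channels * bytes_per_transfer * mtransfers_per_s / 1000.0;
    printf("theoretical peak bandwidth: %.3f GB/s\n", gb_per_s); /* 34.128 */
    return 0;
}

Compiled with gcc and run, this prints 34.128 GB/s, matching the hand calculation above.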
Note that the theoretical maximum as reported by ARK might be lower or higher than the theoretical maximum on your particular system, for various reasons, including:
Fewer memory channels populated than the maximum number of channels. For example, if I only populated one channel on my dual channel system, theoretical bandwidth would be cut in half.
Not using the maximum speed supported RAM. My CPU supports several RAM types (DDR4-2133, LPDDR3-1866, DDR3L-1600) with varying speeds. The ARK figure assumes you use the fastest possible supported RAM, which is true in my case, but may not be true on other systems.
Over or under-clocking of the memory bus, relative to the nominal speed.
Once you get the correct theoretical figure, you won't actually reach this figure in practice, due to various factors including the following:
Inability to saturate the memory interface from one or more cores due to limited concurrency for outstanding requests, as described in the section "Latency Bound Platforms" in this answer.
The hidden doubling of traffic caused by writes, which generally need to read the cache line from memory before writing it back.
Various low-level factors relating the DRAM interface that prevents 100% utilization such as the cost to open pages, the read/write turnaround time, refresh cycles, and so on.
Still, using enough cores and non-temporal stores, you can often get very close to the theoretical bandwidth, often 90% or more.

Related

In OpenCL, how can __local memory be faster when work-group sizes aren't part of the architecture?

Apologies for my naiveté if this question is silly, I'm new to GPGPU programming.
My question is, since the architecture of the device can't change, how is it that __local memory can be optimized for access by items only in the local work-group, when it's the user that chooses the work-group size (subject to divisibility)?
Local memory is usually attached to a certain cluster of execution units in GPU hardware. Work group size is indeed chosen by the client application, but the OpenCL implementation will impose a limit. Your application needs to query this via clGetKernelWorkGroupInfo() using the CL_KERNEL_WORK_GROUP_SIZE parameter name.
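As a rough illustration (my own sketch, not from the question), here is a minimal C program that builds a trivial kernel and performs that query; the kernel source and the "first platform, first GPU" selection are arbitrary placeholders, and error checking is omitted for brevity.

#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    /* Pick the first platform and the first GPU device (placeholder choice). */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* Trivial kernel, just so there is something to query. */
    const char *src =
        "__kernel void scale(__global float *p) {"
        "    p[get_global_id(0)] *= 2.0f;"
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", &err);

    size_t max_wg = 0;          /* largest work-group size for this kernel */
    cl_ulong local_bytes = 0;   /* local memory this kernel consumes       */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_bytes), &local_bytes, NULL);

    printf("CL_KERNEL_WORK_GROUP_SIZE: %zu\n", max_wg);
    printf("CL_KERNEL_LOCAL_MEM_SIZE:  %llu bytes\n",
           (unsigned long long)local_bytes);

    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}

Compile with something like gcc query_wg.c -lOpenCL, assuming the OpenCL headers and ICD loader are installed.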
There's some flexibility in work group size because most GPUs are designed so that multiple threads of execution can be scheduled to run on a single execution unit (a form of SMT). Note also that the scheduled threads don't even need to be in the same work group, so if, for example, a GPU has 64 processors in a cluster and supports 4-way SMT on each processor, those 256 threads could come from 1, 2, or 4 work groups, or possibly even 8 or 16, depending on hardware and compiler capabilities.
Some GPUs' processors also use vector registers and instructions internally, so threads don't map 1:1 to OpenCL work items - one processor might handle 4 work items at once, for example.
Ultimately though, a work-group must fit onto the cluster of processors that is attached to one chunk of local memory; so you've got local memory size and maximum number of threads that can be scheduled on one cluster influencing the maximum work group size.
In general, try to minimise the amount of local memory your work group uses so that the OpenCL implementation has the maximum flexibility for scheduling work groups. (But definitely do use local memory when it helps performance! Just use as little of it as possible.)

What applications require 1GB pages?

x86 and x86-64 processors allow for 1GB pages when the pdpe1gb CPUID flag is set. In what application would this be practical or required, and for what reason?
Hugepages help in cases where you have a large memory footprint and the memory access pattern spans large distances (across many 4K pages).
They not only reduce TLB misses but also shrink the page tables that the OS memory-management subsystem has to maintain.
A very good example is packet processing. In high-throughput network applications (1 Gbps or more), packets are normally stored in a packet buffer pool (i.e. the pooling technique). For example, every packet buffer is 2KB in size and the pool contains 512 buffers. The access pattern over this packet buffer pool might not be sequential (buffers indexed at 1, 2, 3, 4, 5...) but rather random over time (1, 104, 407, 45, 905...). Since the normal page size is 4K, the normal TLB won't help here: each packet access would incur a TLB miss, and there are many different buffers sitting on different pages.
In contrast, if you put the pool in a 1GB hugepage, then all packet buffers share the same hugepage TLB entry, thus avoiding misses.
This is used in DPDK (Data Plane Development Kit), where the packet rate is so high that cycles wasted on TLB misses are not negligible.
Hugepage support is required for the large memory pool allocation used for packet buffers (the HUGETLBFS option must be enabled in the running kernel as indicated the previous section). By using hugepage allocations, performance is increased since fewer pages are needed, and therefore less Translation Lookaside Buffers (TLBs, high speed translation caches), which reduce the time it takes to translate a virtual page address to a physical page address. Without hugepages, high TLB miss rates would occur with the standard 4k page size, slowing performance.
http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html#bios-setting-prerequisite-on-x86
Another example from Oracle:
...almost 6.8 GB of memory used for page tables when hugepages were not configured...
...after hugepages were allocated and used by the Oracle database. The page table overhead was reduced to slightly less than 23 MB...
http://www.databasejournal.com/features/oracle/understanding-hugepages-in-oracle-database.html
Related links:
https://en.wikipedia.org/wiki/Object_pool_pattern
--Edit--
However, hugepages should be used carefully. Above I mentioned that a memory pool would benefit from a 1GB hugepage; however, if your access pattern spans even across the 1GB page boundary, then it might not help. There is an excellent blog post on this:
http://www.pvk.ca/Blog/2014/02/18/how-bad-can-1gb-pages-be/
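For completeness, here is a minimal sketch (my own illustration, not DPDK code) of backing a buffer pool with an explicit 1 GB huge page on Linux. It assumes the kernel has 1 GB pages reserved, e.g. booted with something like default_hugepagesz=1G hugepagesz=1G hugepages=1.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB                 /* older glibc headers may lack this   */
#define MAP_HUGE_1GB (30 << 26)      /* 30 = log2(1 GB), 26 = MAP_HUGE_SHIFT */
#endif

#define POOL_SIZE (1UL << 30)        /* one 1 GiB huge page                 */

int main(void) {
    /* Ask the kernel for an anonymous mapping backed by a single 1 GB page. */
    void *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                      -1, 0);
    if (pool == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)"); /* no 1 GB pages reserved? */
        return 1;
    }
    memset(pool, 0, POOL_SIZE);      /* touch the region so it is really backed */
    printf("1 GB pool mapped at %p\n", pool);
    munmap(pool, POOL_SIZE);
    return 0;
}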
Imagine an application that uses huge amounts of memory: molecular modeling, or weather prediction, especially if it has no user interaction.
Large pages:
(1) reduce the amount of memory spent on page table overhead;
(2) increase the amount of memory that can be covered by the MMU's translation cache (the same number of cache entries references more memory).
I have LabVIEW installed on my Dell workstation with 8 cores and 16 GB of DDR RAM, driving four 24" monitors. If I create a video processor or compositor of most any type, with a 1024px x 1024px 'drawing' display, LabVIEW reserves 1.5 GB before I even begin to composite. It was built from C and C++. I often store image details in 3D arrays of 256 x 256 x 256 U32 integers that hold each RGB pixel color, plus the alpha channel for opacity or masking. That's 64 MB per layer of buffered video. If I need to remember 128 layers, that's 8 GB right there. LabVIEW is a programming language structured much like a CAD program. If I need 8 GB for a series of video (HDTV) buffers, that is what it will give me, with a few seconds' wait for malloc to do its work. If I created an 8 GB 3D array for a database, it would be no different, even if I did it in MySQL (not as an array). To me, having many gigabytes of RAM to play with is the norm, not an exception.

Software memory bit-flip detection for platforms without ECC

Most available desktop (cheap) x86 platforms still have no ECC memory support (Error Checking & Correction). But the rate of memory bit-flip errors is still growing (see the large-scale CERN 2007 study "Data integrity": "Bit Error Rate of 10^-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; and Google's 2009 "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware with a data-intensive load (8 GB/s of reading), this means that a single bit flip may occur every minute (10^-12 vendor BER from CERN 2007) or once in two days (10^-16 BER from CERN 2007). Google 2009 says that there can be up to 25,000-75,000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1-5 bit errors per hour for 8 GB of RAM ("mean correctable error rates of 2000-6000 per GB per year").
So, I want to know: is it possible to add some kind of software error detection in a system-wide manner (checking both user and kernel memory)? For example, create a patch for the Linux kernel and/or for the system compiler to add some checksumming of every memory page, and try to detect silent memory corruption (bit flips) by regularly recomputing checksums?
For example, can we see all writes to memory (both from user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.
I also understand that a better way to protect data from memory bit flips is to switch to ECC hardware, but most PCs out there are still non-ECC.
The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect whether a machine has ECC modules and complain (or print a warning) when it doesn't.
http://www.cyberciti.biz/faq/ecc-memory-modules/
For example, can we see all writes to memory (both from user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?
Er, you will never "see" the bit flips on the bus. They are literally caused by a particle hitting RAM and flipping a bit. Only much later can you notice that you read out something different than you wrote in. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e. create a shadow copy of what is in your real RAM, so you can verify that every read returns what was written to that location).
try to detect silent memory corruption (bit flips) by regularly recomputing checksums?
The Redis guy has a nice write-up on an algorithm for testing RAM for problems. http://antirez.com/news/43 But this is really looking for RAM errors, not random bit-flips.
"Recompute checksums" only works when you are NOT writing to the memory. That might be "good enough", but you'll need to figure out which pages are not being written to.
To catch 100% of the errors, every write must be preceded by computing the checksum of that block of memory and comparing it to the recorded checksum (to make sure that block hasn't degraded in RAM). Only then is it safe to do the write, and then the checksum must be updated. As you can imagine, the performance of this will be horrible, at least 100x slower.
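To make that cost concrete, here is a minimal sketch of the verify-then-write scheme; the 4 KB block size, the FNV-1a hash and the guarded_write() helper are illustrative choices of mine, not an existing API.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Simple 64-bit FNV-1a hash used as the per-block checksum. */
static uint64_t fnv1a(const uint8_t *p, size_t n) {
    uint64_t h = 14695981039346656037ULL;      /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;                 /* FNV-1a 64-bit prime        */
    }
    return h;
}

struct guarded_block {
    uint8_t  data[BLOCK_SIZE];
    uint64_t checksum;
};

/* Returns 0 on success, -1 if the block no longer matches its recorded
 * checksum, i.e. a bit flipped since the last legitimate write. */
static int guarded_write(struct guarded_block *b, size_t off,
                         const void *src, size_t len) {
    if (fnv1a(b->data, BLOCK_SIZE) != b->checksum)
        return -1;                             /* silent corruption detected */
    memcpy(b->data + off, src, len);           /* the intended modification  */
    b->checksum = fnv1a(b->data, BLOCK_SIZE);  /* record the new state       */
    return 0;
}

int main(void) {
    static struct guarded_block b;             /* zero-initialized           */
    b.checksum = fnv1a(b.data, BLOCK_SIZE);

    int x = 42;
    if (guarded_write(&b, 0, &x, sizeof x) != 0)
        fprintf(stderr, "bit flip detected before write\n");
    return 0;
}

Every write now hashes a full 4 KB block twice, which is where the roughly 100x slowdown comes from.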
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.
Well, there is a simple method to detect 100% of the errors, at a cost of 50% performance: Just run the computation on 2 boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid.) If the results differ, you have detected an error.
See also:
https://www.linuxquestions.org/questions/linux-hardware-18/how-to-detect-ecc-memory-errors-under-linux-886011/
The answer to the question is yes, and proof of that is the SoftECC software posted in the comments!
Just a note that SoftECC is a kernel-level solution. If a user-land app were used, it would add a third level of redundancy, which does not seem necessary.

How many memory latency cycles per memory access type in OpenCL/CUDA?

I looked through the programming guide and best practices guide and it mentioned that Global Memory access takes 400-600 cycles. I did not see much on the other memory types like texture cache, constant cache, shared memory. Registers have 0 memory latency.
I think the constant cache is as fast as registers if all threads use the same address in the constant cache; I am not so sure about the worst case.
Is shared memory as fast as registers as long as there are no bank conflicts? If there are conflicts, how does the latency unfold?
What about texture cache?
For (Kepler) Tesla K20 the latencies are as follows:
Global memory: 440 clocks
Constant memory
L1: 48 clocks
L2: 120 clocks
Shared memory: 48 clocks
Texture memory
L1: 108 clocks
L2: 240 clocks
How do I know? I ran the microbenchmarks described by the authors of Demystifying GPU Microarchitecture through Microbenchmarking. They provide similar results for the older GTX 280.
This was measured on a Linux cluster; the computing node where I was running the benchmarks was not used by any other users and was not running any other processes. It is BULLX Linux with a pair of 8-core Xeons and 64 GB RAM, nvcc 6.5.12. I changed sm_20 to sm_35 for compiling.
There is also an operand cost chapter in the PTX ISA documentation, although it is not very helpful: it just reiterates what you would already expect, without giving precise figures.
The latency to the shared/constant/texture memories is small and depends on which device you have. In general, though, GPUs are designed as throughput architectures, which means that by creating enough threads the latency to the memories, including global memory, is hidden.
The reason the guides talk about the latency to global memory is that the latency is orders of magnitude higher than that of other memories, meaning that it is the dominant latency to be considered for optimization.
You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. group of 32 threads) access the same address then there is no penalty, i.e. the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must serialize since the cache can only provide one value at a time. If you're using the CUDA Profiler, then this will show up under the serialization counter.
Shared memory, unlike constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.

How to find number of memory accesses

Can anybody tell me a Unix command that can be used to find the number of memory accesses that took place in a given interval? vmstat, top and sar only give the amount of physical memory occupied/available, but do not give the number of memory accesses in a given interval.
If I understand what you're asking, such a feature would almost certainly require hardware support at a very low level (e.g. a counter of some sort that monitors memory bus activity). I don't think such support is available for the common architectures supported by Unix or Linux, so I'm going to go out on a limb and say that no such Unix command exists.
The situation is somewhat different when considering memory in units of pages, because most architectures that support virtual memory have dedicated MMU hardware which operates at that level of granularity and can be accessed by the operating system. But as far as I know, the sorts of counter data you'd get from the MMU would represent events like page faults, allocations, and releases, rather than individual reads or writes.
