I'm looking for an up-to-date reference on what typical (register)/cache/memory latencies are on modern x86_64 hardware. If something similar exists for GPUs, that would be great as well.
I found the following link from 2016, https://www.intel.com/content/www/us/en/developer/articles/technical/memory-performance-in-a-nutshell.html but ideally I'd like something from the 2020s.
Often, memory latency is specified in cycles. I assume this means that if the latency is 10 cycles and I have a 1GHz CPU, then the latency would be 10 cycles / 10^9 cycles per second = 10^-8 seconds?
I'm really not looking to do a deep dive into the architecture. Rough numbers are completely fine.
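To sanity-check the cycles-to-seconds conversion above, here is a minimal sketch. The function name cycles_to_seconds is made up for illustration, and the 10-cycle / 1 GHz figures are just the example numbers from the question, not measurements.

```python
def cycles_to_seconds(cycles: float, clock_hz: float) -> float:
    """Convert a latency given in clock cycles to seconds.

    clock_hz is the clock frequency in cycles per second (1 GHz = 1e9).
    """
    return cycles / clock_hz


# Example from the question: 10 cycles on a 1 GHz CPU.
latency_s = cycles_to_seconds(10, 1e9)
print(f"{latency_s * 1e9:.1f} ns")  # 10.0 ns, i.e. 10^-8 seconds
```

Note that the same cycle count means a shorter time on a faster clock, so cycle figures are only comparable between CPUs once converted.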
Related
I want to write a shell script/command which uses commonly-available binaries, the /sys filesystem or other facilities to calculate the theoretical maximum bandwidth for the RAM available on a given machine.
Notes:
I don't care about latency, just bandwidth.
I'm not interested in the effects of caching (e.g. the CPU's last-level cache), but in the bandwidth of reading from RAM proper.
If it helps, you may assume a "vanilla" Intel platform, and that all memory DIMMs are identical; but I would rather you not make this assumption.
If it helps, you may rely on root privileges (e.g. using sudo)
@einpoklum, you should have a look at Performance Counter Monitor, available at https://github.com/opcm/pcm. It will give you the measurements that you need. I do not know if it supports kernel 2.6.32.
Alternatively you should also check Intel's EMON tool which promises support for kernels as far back as 2.6.32. The user guide is listed at https://software.intel.com/en-us/download/emon-user-guide, which implies that it is available for download somewhere on Intel's software forums.
I'm not aware of any standalone tool that does it, but for Intel chips only, if you know the "ARK URL" for the chip, you could get the maximum bandwidth using a combination of a tool to query ARK, like curl, and something to parse the returned HTML, like xmllint --html --xpath.
For example, for my i7-6700HQ, the following works:
curl -s 'https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz' | \
xmllint --html --xpath '//li[@class="MaxMemoryBandwidth"]/span[@class="value"]/span/text()' - 2>/dev/null
This returns 34.1 GB/s which is the maximum theoretical bandwidth of my chip.
The primary difficulty is determining the ARK URL, which doesn't correspond in an obvious way to the CPU brand string. One solution would be to find the CPU model on an index page like this one, and follow the link.
This gives you the maximum theoretical bandwidth, which can be calculated as (number of memory channels) x (transfer width) x (data rate). The data rate is the number of transfers per unit time, and is usually the figure given in the name of the memory type, e.g., DDR4-2133 has a data rate of 2133 million transfers per second. Alternatively, you can calculate the data rate as the product of the bus speed (1067 MHz in this case) and the data rate multiplier (2 for DDR technologies).
For my CPU, this calculation gives 2 memory channels * 8 bytes/transfer * 2133 million transfers/second = 34.128 GB/s, consistent with the ARK figure.
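The calculation above can be written out as a small sketch. The function and parameter names are my own for illustration; the channel count, transfer width and data rate have to be supplied by you (this does not query the hardware).

```python
def peak_bandwidth_gb_s(channels: int, bytes_per_transfer: int,
                        mega_transfers_per_s: float) -> float:
    """Theoretical peak RAM bandwidth in GB/s (1 GB = 10^9 bytes).

    channels * bytes per transfer * transfers per second.
    """
    bytes_per_s = channels * bytes_per_transfer * mega_transfers_per_s * 1e6
    return bytes_per_s / 1e9


# Dual-channel DDR4-2133 with a 64-bit (8-byte) channel width.
print(f"{peak_bandwidth_gb_s(2, 8, 2133):.3f} GB/s")  # 34.128 GB/s
# The same memory with only one channel populated gives half of that.
print(f"{peak_bandwidth_gb_s(1, 8, 2133):.3f} GB/s")  # 17.064 GB/s
```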
Note that the theoretical maximum as reported by ARK might be lower or higher than the theoretical maximum on your particular system for various reasons, including:
Fewer memory channels populated than the maximum number of channels. For example, if I only populated one channel on my dual channel system, theoretical bandwidth would be cut in half.
Not using the maximum speed supported RAM. My CPU supports several RAM types (DDR4-2133, LPDDR3-1866, DDR3L-1600) with varying speeds. The ARK figure assumes you use the fastest possible supported RAM, which is true in my case, but may not be true on other systems.
Over or under-clocking of the memory bus, relative to the nominal speed.
Once you get the correct theoretical figure, you won't actually reach this figure in practice, due to various factors including the following:
Inability to saturate the memory interface from one or more cores due to limited concurrency for outstanding requests, as described in the section "Latency Bound Platforms" in this answer.
Hidden doubling of bandwidth implied by writes that need to read the line before writing it.
Various low-level factors relating to the DRAM interface that prevent 100% utilization, such as the cost to open pages, the read/write turnaround time, refresh cycles, and so on.
Still, using enough cores and non-temporal stores, you can often get very close to the theoretical bandwidth, often 90% or more.
I have an Erlang cluster where erlang:memory() 'total' is between 2 and 2.5 GB from idle to busy time, day in day out. ets memory usage is around 440M and stays around there no matter what. The data within ets is heavily transient and completely changes throughout the day. Tomorrow's data is guaranteed to have no commonality with today's.
Linux top says beam is using about 10 gigabytes. free -m 'used' agrees with that (the machine really only runs beam). The overall memory usage of the system grows regularly, about 1% per day on 16GB systems. There is some variance across nodes, but not by a lot, and OS 'used' memory is always several times more than erlang:memory() total.
erlang:system_info({allocator, ets_alloc}) shows 20 allocators. Most have data that looks something like this (full output of command is here):
{mbcs_pool,[{blocks,2054},
{blocks_size,742672},
{carriers,10},
{carriers_size,17825792}]},
1) Does this mean that 742K bytes (words?) of memory are actually taking 17M of OS memory?
2) As this post suggests, should we add '+MEas bf' to the VM args, in order to reduce overhead?
3) What else can I do to avoid actually running out of memory?
This is R17.5 but we will be migrating to R19.3 in next deployment (this week). We don't have recon in the current deployment but will be adding it in the next deployment. Also, can't imagine this matters, but beam is running inside an alpine container.
In case someone else runs into this later: this was not actually leaked memory.
The default memory allocator strategy of Erlang may not be optimal for your use, depending on what you do and on how Erlang is configured to allocate blocks. It turns out that, in some cases, memory that is "free" from Erlang's point of view won't necessarily be released to the OS immediately, due to allocator fragmentation.
It's somewhat explained here: http://erlang.org/doc/man/erts_alloc.html
The default allocator strategy for the version of Erlang we used at the time is aoffcbf (address order first fit carrier best fit). In our case, this resulted in very high memory fragmentation (10+ GB worth of overhead). When troubleshooting these things, erlang:system_info(allocator) and erlang:system_info({allocator, Alloc}) are your friends. Changing to aobf (address order best fit) resulted in much more efficient memory usage. In truth, as long as the machine didn't run out of physical memory, it wouldn't matter, but we were getting dangerously close to the physical limit, and you do not want to start paging. With aobf, we never passed 4GB, even after the node had been up for 18 months. With aoffcbf we would pass 10GB in a few weeks.
As always, YMMV, as it all depends what type, size, etc.. of blocks are allocated, and how long they live.
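To get a rough feel for the fragmentation from the erlang:system_info({allocator, ets_alloc}) output, you can compare blocks_size with carriers_size. Below is a minimal sketch using the mbcs_pool numbers from the question; the function name is made up, and I am assuming both values are reported in bytes.

```python
def carrier_utilization(blocks_size: int, carriers_size: int) -> float:
    """Fraction of carrier memory actually occupied by live blocks.

    Assumption: both sizes are in bytes, as reported by the allocator.
    """
    return blocks_size / carriers_size


# Numbers from the mbcs_pool entry quoted in the question.
used = carrier_utilization(742672, 17825792)
print(f"{used:.1%} of the carrier memory holds live blocks")  # 4.2%
```

A low ratio like this means most of the memory held in carriers is empty space between blocks, which is what shows up as OS-level usage far above erlang:memory() total.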
I increased the memory limit of DrRacket a week ago, and now I want to reduce it to the same amount as before. So I set the limit back to 128 MB. But that has no effect... It is always consuming much more than 128 MB...
It's really a problem because it causes my computer to overheat.
Does someone know how I can limit DrRacket so that it doesn't exceed 128 MB?
Here's a screenshot of the problem :
There is a difference between the memory used by a program and the memory used in total by DrRacket. When I start up DrRacket, before entering or running any program, I see that DrRacket uses 250MB. The interactions window states that I have limited memory to 128MB too, which means that that particular program cannot go beyond those bounds, but there are features of DrRacket that may use a lot more memory on your machine than on mine.
I went into the settings and removed some features I don't use (like Algol 60). When restarting after that, DrRacket used 50MB less memory, which indeed confirms that the memory is used by DrRacket and not by programs.
For a particularly complex program I guess background expansion might use a lot of memory. Perhaps you can turn that off as well to see whether the memory currently used goes down.
About heat
As Óscar mentioned, memory usage has little to do with heat as long as you don't hear the swap being used (heavy disk usage). Heat has to do with CPU usage. When doing calculations the OS will make resources available and perhaps increase the frequency of the CPU, which increases the heat.
If you are making a threaded application that has loops waiting for tasks, make sure you are not busy-waiting in an active loop. Sleeping might reduce the activity, and perhaps Racket has better approaches (I have never done threaded apps in Racket).
If you are calculating something the increase of CPU is natural. It's so that you get the answer earlier. Computer settings can be changed to favor battery time. Check both OS and BIOS. (That makes this not a Racket issue)
The memory shown in the Dr Racket status bar is N/A.
Experiment:
Choose Racket | Limit Memory and specify 8 MB (the minimum).
Choose File | New Tab.
In the Interactions pane allocate 8 MB of memory. For example enter (define x (make-bytes (* 8 1024 1024))). (I recommend assigning the result to a variable, like this, because I doubt you want Dr Racket to print 8 MB of bytes.)
The result I get:
Welcome to DrRacket, version 6.1.1.6--2014-12-21(aabe9d7/a) [3m].
Language: racket [custom]; memory limit: 8 MB.
> (define x (make-bytes (* 8 1024 1024)))
out of memory
>
Assuming you get the same result, there is some other reason your computer is running hotter.
I don't think that the extra memory being consumed is the cause for your computer overheating. More likely, it's because some function is consuming the CPU. Try to optimize the code, instead.
In fact, by limiting the available memory you might end up causing more disk paging, hence slowing things down and potentially consuming more CPU … and causing more overheating.
Most available (cheap) desktop x86 platforms still have no ECC (Error Checking & Correction) memory support. But the rate of memory bit-flip errors is still growing (not the best SO thread; large-scale CERN 2007 study "Data integrity": "Bit Error Rate of 10^-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; Google's 2009 "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware with a data-intensive load (8 GB/s of reading) this means that a single bit flip may occur every minute (10^-12 vendor BER from CERN07) or once in two days (10^-16 BER from CERN07). Google09 says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8 GB of RAM ("mean correctable error rates of 2000–6000 per GB per year").
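The FIT arithmetic above can be reproduced with a short sketch. The function name is made up for illustration, and I am assuming 1 GB = 1024 MB and 8 bits per byte when converting the RAM size to Mbit.

```python
def errors_per_hour(fit_per_mbit: float, ram_gb: float) -> float:
    """Expected bit errors per hour for a given FIT rate and RAM size.

    FIT = failures per 10^9 hours; here it is given per Mbit of memory.
    Assumption: 1 GB = 1024 MB, 1 MB = 8 Mbit.
    """
    mbits = ram_gb * 1024 * 8
    return fit_per_mbit * mbits / 1e9


low = errors_per_hour(25_000, 8)
high = errors_per_hour(75_000, 8)
print(f"{low:.1f} to {high:.1f} errors per hour")  # 1.6 to 4.9 errors per hour
```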
So, I want to know: is it possible to add some kind of software error detection in a system-wide manner (checking both user and kernel memory)? For example, could one create a patch for the Linux kernel and/or the system compiler to add checksumming of every memory page, and try to detect silent memory corruption (bit-flips) by regularly recomputing the checksums?
For example, can we see all writes to memory (both from user and kernel space), to distinguish between intended memory changes from in-memory bit flips? Or can we somehow instrument all codes with some helper?
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.
I also understand that the better way to protect data from memory bit-flips is to switch to ECC hardware, but most PCs out there are still non-ECC.
The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect if they have ECC modules and complain (or print a warning) when they don't.
http://www.cyberciti.biz/faq/ecc-memory-modules/
For example, can we see all writes to memory (both from user and kernel space), to distinguish between intended memory changes from in-memory bit flips? Or can we somehow instrument all codes with some helper?
Er, you will never "see" the bit-flips on the bus. They are literally caused by a particle hitting RAM, flipping a bit. Only much later can you notice that you read out something different from what you wrote in. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e. create a shadow copy of what is in your real RAM, so you can verify that every read returns what was written to that location).
try to detect silent memory corruptions (bit-flips) by regular recomputing of checksums?
The Redis guy has a nice write-up on an algorithm for testing RAM for problems. http://antirez.com/news/43 But this is really looking for RAM errors, not random bit-flips.
Recomputing checksums only works while you are NOT writing to the memory. That might be "good enough", but you'll need to figure out which pages are not being written to.
To catch 100% of the errors, every write must be preceded by computing the checksum of that block of memory and comparing it to the recorded checksum (to make sure that block hasn't degraded in RAM). Only then is it safe to do the write and then update the checksum. As you can imagine, the performance of this will be horrible (at least 100x slower).
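As a user-space illustration of that verify-then-write-then-update sequence, here is a minimal sketch. It is not a kernel patch and not how any existing tool works; the CheckedBlock class is invented for this example, and CRC-32 from zlib stands in for whatever checksum you would actually choose.

```python
import zlib


class CheckedBlock:
    """A memory block whose checksum is verified before every write."""

    def __init__(self, size: int):
        self.data = bytearray(size)
        self.checksum = zlib.crc32(self.data)

    def verify(self) -> bool:
        # Recompute the checksum over the whole block and compare.
        return zlib.crc32(self.data) == self.checksum

    def write(self, offset: int, payload: bytes) -> None:
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write outside of block")
        # Step 1: make sure the block has not changed behind our back.
        if not self.verify():
            raise RuntimeError("checksum mismatch: possible bit flip")
        # Step 2: do the write, then step 3: record the new checksum.
        self.data[offset:offset + len(payload)] = payload
        self.checksum = zlib.crc32(self.data)


block = CheckedBlock(4096)
block.write(0, b"hello")
block.data[100] ^= 0x01      # simulate a bit flip that bypasses write()
print(block.verify())        # False
```

Every write costs two full passes over the block, which gives a feel for why doing this for all of memory would be so slow.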
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.
Well, there is a simple method to detect 100% of the errors, at a cost of 50% performance: Just run the computation on 2 boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid.) If the results differ, you have detected an error.
See also:
https://www.linuxquestions.org/questions/linux-hardware-18/how-to-detect-ecc-memory-errors-under-linux-886011/
The answer to the question is yes, and a proof for that is the software SoftECC posted in the comments!
Just a note that SoftECC is a kernel-level solution. A user-land app on top of it would be a third stage of redundancy, which does not seem necessary.
I looked through the programming guide and best practices guide and it mentioned that Global Memory access takes 400-600 cycles. I did not see much on the other memory types like texture cache, constant cache, shared memory. Registers have 0 memory latency.
I think constant cache is the same as registers if all threads use the same address in constant cache. Worst case I am not so sure.
Shared memory is the same as registers so long as there are no bank conflicts? If there are then how does the latency unfold?
What about texture cache?
For (Kepler) Tesla K20 the latencies are as follows:
Global memory: 440 clocks
Constant memory
    L1: 48 clocks
    L2: 120 clocks
Shared memory: 48 clocks
Texture memory
    L1: 108 clocks
    L2: 240 clocks
How do I know? I ran the microbenchmarks described by the authors of Demystifying GPU Microarchitecture through Microbenchmarking. They provide similar results for the older GTX 280.
This was measured on a Linux cluster. The computing node where I was running the benchmarks was not used by any other users and was not running any other processes. It is BULLX Linux with a pair of 8-core Xeons and 64 GB RAM, nvcc 6.5.12. I changed sm_20 to sm_35 for compiling.
There is also an operand costs chapter in the PTX ISA, although it is not very helpful: it just reiterates what you already expect, without giving precise figures.
The latency to the shared/constant/texture memories is small and depends on which device you have. In general though GPUs are designed as a throughput architecture which means that by creating enough threads the latency to the memories, including the global memory, is hidden.
The reason the guides talk about the latency to global memory is that the latency is orders of magnitude higher than that of other memories, meaning that it is the dominant latency to be considered for optimization.
You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. group of 32 threads) access the same address then there is no penalty, i.e. the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must serialize since the cache can only provide one value at a time. If you're using the CUDA Profiler, then this will show up under the serialization counter.
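To make the serialization rule concrete, here is a toy model (not a hardware simulator, and not how the profiler counts anything): it assumes the constant cache serves one distinct address per pass, so the number of passes for a warp is simply the number of distinct addresses its threads request. The function name is made up for illustration.

```python
WARP_SIZE = 32


def constant_cache_passes(addresses):
    """Serialized constant-cache reads for one warp in this toy model.

    Assumption: one distinct address is served (and broadcast) per pass.
    """
    return len(set(addresses))


uniform = [0] * WARP_SIZE             # every thread reads the same address
divergent = list(range(WARP_SIZE))    # every thread reads a different one
print(constant_cache_passes(uniform))    # 1
print(constant_cache_passes(divergent))  # 32
```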
Shared memory, unlike constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.