Measuring TLB miss handling cost in x86-64 - profiler

I want to estimate the performance overhead due to TLB misses on an x86-64 (Intel Nehalem) machine running Linux. I wish to get this estimate by using some performance counters. Does anybody have some pointers on what is the best way to estimate this?
Thanks
Arka

If you can get access to a "Westmere"-based system, the performance characteristics of your code should be quite similar to what you have on the "Nehalem", but you will have access to a new hardware performance counter event that measures almost exactly what you want.
On Westmere, the best estimate of performance lost while waiting for TLB misses to be handled is probably from the hardware performance counter Event 08H, Mask 04H "DTLB_LOAD_MISSES.WALK_CYCLES", which is described as counting "Cycles Page Miss Handler is busy with a page walk due to a load miss in the Second Level TLB".
This is described in "Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 3B: System Programming Guide, Part 2" (document number: 253669), available online at
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html
The reason this event is necessary is that TLB miss processing time is dominated by the time required to read the cache line containing the page table entry. If that cache line is in the L2 cache, then the overhead of a TLB miss will be very small (on the order of 10 cycles). If the line is in the L3 cache, then maybe 25 cycles. If the line is in memory, then ~200 cycles.
If there is also a miss in the upper-level page translation caches, it will take multiple trips to memory to find and retrieve the desired page table entry (e.g., https://stackoverflow.com/a/9674980/1264917).
On some processors the L2 cache counters can tell you how many table walks hit and missed in the L2, but not on Nehalem. (It would not help a lot in this case since TLB walks that hit in the L3 are also fairly fast and what you really want are the TLB walks that have to go to memory.)
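On Linux, one way to read that counter programmatically is through the perf_event_open(2) raw-event interface, where the raw config encodes the umask in the high byte and the event select in the low byte, so Event 08H / Umask 04H becomes 0x0408. Below is a minimal C sketch along those lines; the workload() function is a placeholder for the code you actually want to measure, and the event encoding assumes a Westmere part as described above. (The perf tool can report the same event with a raw event specifier, e.g. perf stat -e r0408.)

    /* Minimal sketch: count DTLB_LOAD_MISSES.WALK_CYCLES (Event 08H, Umask 04H)
     * around a region of interest using the raw-event interface of
     * perf_event_open(2). Raw config = (umask << 8) | event = 0x0408.
     * Error handling is kept minimal; workload() is a placeholder. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    static void workload(void) { /* placeholder: the code you want to measure */ }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_RAW;
        attr.size = sizeof(attr);
        attr.config = 0x0408;          /* DTLB_LOAD_MISSES.WALK_CYCLES (assumed Westmere encoding) */
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        workload();
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t count = 0;
        if (read(fd, &count, sizeof(count)) != sizeof(count)) { perror("read"); return 1; }
        printf("DTLB load-miss page-walk cycles: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }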

Related

What makes a TLB faster than a Page Table if they both require two memory accesses?

Just going off wikipedia:
The page table, generally stored in main memory, keeps track of where the virtual pages are stored in the physical memory. This method uses two memory accesses (one for the page table entry, one for the byte) to access a byte. First, the page table is looked up for the frame number. Second, the frame number with the page offset gives the actual address. Thus any straightforward virtual memory scheme would have the effect of doubling the memory access time. Hence, the TLB is used to reduce the time taken to access the memory locations in the page table method.
So given that, what I'm curious about is why the TLB is actually faster because from what I know it's just a smaller, exact copy of the page table.
You still need to access the TLB to find the physical address, and then once you have that, you still need to actually access the data at the physical address, which is two lookups just like with the page table.
I can only think of two reasons why the TLB is faster:
1. Looking up an address in the TLB or page table is not O(n) (I assumed it's O(1), like a hash table). Thus, since the TLB is much smaller, it's faster to do a lookup. Also, in that case, why not just use a hash table instead of a TLB?
2. I incorrectly interpreted how the TLB works, and it's not actually doing two accesses.
I realize it has been three years since this question was asked, but since it is still just as relevant, and it still shows up in search engines, I'll try my best to produce a complete answer.
Accessing the main memory through the TLB rather than the page table is faster primarily for two reasons:
1. The TLB is faster than main memory (which is where the page table resides).
The typical access time is on the order of < 1 ns for the TLB and 100 ns for main memory.
A TLB access is part of an L1 cache hit, and modern CPUs can do 2 loads per clock if they both hit in L1d cache.
The reasons for this are twofold:
The TLB is located within the CPU, while main memory - and thus the page table - is not.
The TLB - like other caches - is made of fast and expensive SRAM, whereas main memory usually consists of slow and inexpensive DRAM (read more here).
Thus, if the supposition that both the TLB and the page table require only one memory access were correct, a TLB hit would still, roughly speaking, halve memory access time. However, as we shall see next, the supposition is not correct, and the benefit of having a TLB is even greater.
2. Accessing the page table usually requires multiple memory accesses.
This really is the crux of the issue.
Modern CPUs tend to use multilevel page tables in order to save memory. Most notably, x86-64 page tables currently consist of up to four levels (and a fifth may be coming). This means that accessing a single byte in memory through the page table requires up to five memory accesses: four for the page table and one for the data. Obviously the cost would be unbearably high if not for the TLB; it is easy to see why CPU and OS engineers put in a lot of effort to minimize the frequency of TLB misses.
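To make the "up to four levels" concrete, here is a small, purely illustrative C snippet showing how a 48-bit x86-64 virtual address (4 KiB pages, 4-level paging) decomposes into the four 9-bit table indices plus the 12-bit page offset; each index selects a page-table entry that the walker has to fetch from the cache hierarchy or memory:

    /* Illustration: how a 48-bit x86-64 virtual address (4 KiB pages, 4-level
     * paging) breaks down into the indices used at each level of the walk.
     * Purely didactic; it does not read any real page tables. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t vaddr = 0x00007f1234567abcULL;   /* arbitrary example address */

        unsigned pml4 = (vaddr >> 39) & 0x1ff;    /* level 4 index (bits 47..39) */
        unsigned pdpt = (vaddr >> 30) & 0x1ff;    /* level 3 index (bits 38..30) */
        unsigned pd   = (vaddr >> 21) & 0x1ff;    /* level 2 index (bits 29..21) */
        unsigned pt   = (vaddr >> 12) & 0x1ff;    /* level 1 index (bits 20..12) */
        unsigned off  = vaddr & 0xfff;            /* byte offset within the 4 KiB page */

        printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n", pml4, pdpt, pd, pt, off);
        return 0;
    }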
Finally, do note that even this explanation is somewhat of a simplification, as it ignores, among other things, data caching. The detailed mechanics of modern desktop CPUs are complex and, to a degree, undisclosed. For a more detailed discussion on the topic, refer to this thread, for instance.
Page-table accesses can be, and are, cached by the data caches on modern CPUs, but the next access in a page walk depends on the result of the previous one (a pointer to the next level of the page table), so a 4-level page walk would have about 4 x 4 cycles = 16 cycles of latency even if every access hit in the L1d cache. That is a lot more for the pipeline to hide than the ~3 to 4 cycle TLB latency that is part of an L1d cache hit load in a modern Intel CPU (which of course uses TLBs for both data and instruction accesses).
You are right in your assumption that the approach with a TLB still requires 2 accesses. But the approach with a TLB is faster because:
The TLB is made of faster memory called associative memory.
Usually we make 2 accesses to physical memory, but with a TLB there is 1 access to the TLB and only the other access goes to physical memory.
Associative memory is faster because it is content-addressable memory, but it is also expensive, because of the extra logic circuits required.
You can read about the content addressable memory here.
It depends upon the specific implementation. In general, the TLB is a cache that exists within the CPU.
You still need to access the TLB to find the physical address, and then once you have that, you still need to actually access the data at the physical address, which is two lookups just like with the page table.
The CPU can access the cache much faster than it can access data through the memory bus. It is making two accesses to two different places (one faster and one slower). Also, it is possible for the memory location to be cached within the CPU as well, in which case no accesses are required to go through the memory bus.
I think @ihohen pretty much said it, but as a student, to future students who may come here, in simple words an explanation is:
" Without a TLB in a single level paging you need 2 accesses to main memory:
1 for finding the translation of the logical adress in the page table (which is placed in main memory) and 1 another for actually accessing the memory block ".
Now with a TLB , you reduce the above only to one access (the second one) because the step of finding the translation (hopefully) will take place without needing to access main memory because you will find the translation in the TLB which placed in cpu ".
So when we say that a TLB reduces access time by a factor of 2, we mean that, approximately, if we ignore the case of a TLB miss and consider the simplest model of paging (the single-level one), then it is fair to say that a TLB speeds up the process by 2.
There will be many variations, because first and foremost today's computers use advanced paging techniques (multilevel, demand paging, etc.), but this sentence is an intuitive explanation as to why the idea of a TLB is much more helpful than a simple page table.
The book "Operating Systems " by Silberschatz states another (a little bit more detailed) math type to measure access time with a TLB:
Consider:
h : TLB hit ratio
τ : time to access main memory
e : time spent searching the TLB (the TLB lookup time)
t = h * (e + τ) + (1 - h) * (e + 2τ)
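Plugging in some illustrative numbers (assumed for the sake of example, not measurements): with e = 1 ns, τ = 100 ns and a hit ratio of h = 0.99, the formula gives an effective access time of about 102 ns, versus the 2τ = 200 ns you would pay with single-level paging and no TLB. A tiny C sketch of that calculation:

    /* Effective memory access time with a TLB (Silberschatz formula above).
     * The numbers below are illustrative assumptions, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        double h   = 0.99;    /* TLB hit ratio */
        double tau = 100.0;   /* main memory access time, ns */
        double e   = 1.0;     /* TLB lookup time, ns */

        double t = h * (e + tau) + (1.0 - h) * (e + 2.0 * tau);
        printf("effective access time: %.1f ns (vs. %.0f ns with no TLB)\n",
               t, 2.0 * tau);
        return 0;
    }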

calculate worst case time to find variable x in memory

I have a question from an exam, but I don't understand the solution. Can someone explain it to me?
Memory access time = 2.5*10^-7 sec
Secondary memory (disk) access time = 3*10^-6 sec
TLB access time = 10^-8 sec
Given a virtual address, a value x, and a 3-level page table, how much time does it take to read the value of x from memory in the worst case?
The solution is: 10^-8 + 2.5*10^-7 + 3*(3*10^-6 + 2*2.5*10^-7) + 10^-8 = 1076*10^-7
It's pretty obvious that the solution is performing 2 TLB lookups, 7 memory accesses, and 3 secondary memory accesses.
Here are the steps in the process:
1) The CPU accesses the TLB to find the memory location that the virtual address maps to.
2) The CPU accesses main memory to look for the virtual address. This step fails.
3) The CPU accesses the page file (1 memory access to get the page file, 1 more to access the page file entry).
4) The CPU reads from secondary memory to get the page referred to in the page file.
5) Repeat steps 3 & 4 for each level in the page table.
There is no formula as far as I know to calculate best and worst times of memory accesses. However, there are various factors that influence it:
The width of the access. On 32-bit x86, 8-bit and 32-bit accesses tend to be faster than 16-bit ones.
Whether the access is aligned or not. Unaligned accesses tend to be slower than aligned accesses.
Whether accessed memory is cached. Accesses to cached memory are faster than accesses to uncached memory.
The NUMA domain of the accessed memory. Accessing memory belonging to a close NUMA domain is faster than accessing memory belonging to a far NUMA domain.
Whether paging is enabled. Accessing memory when paging is enabled involves traversing paging structures and therefore is slower.
The type of memory. For example, writing to video memory is slower than writing to "normal" memory. Correspondingly, reading from video memory is much, much slower than reading from "normal" memory.
Other factors I forgot to mention. It's hard to memorise them all.
Furthermore, the influence of each of these factors depends on the underlying hardware, so it would be really hard to come up with even an approximate formula for the best- and worst-case memory access times.

Spreadsheet Gear -- Generating large report via copy and paste seems to use a lot of memory and processor

I am attempting to generate a large workbook-based report with 3 supporting worksheets of 100, 12,000 and 12,000 rows, and a final, formula-based output sheet that ends up representing about 120 entities at 100 rows apiece. I generate a template range and copy and paste it, replacing the entity ID cell after pasting each new range. It is working fine, but I noticed that memory usage in the IIS Express process is approximately 500 MB and it is taking 100% processor usage as well.
Are there any guidelines for generating workbooks in this manner?
At least in terms of memory utilization, it would help to have some comparison, maybe against Excel, in how much memory is utilized to simply have the resultant workbook opened. For instance, if you were to open the final report in both Excel and the "SpreadsheetGear 2012 for Windows" application (available in the SpreadsheetGear folder under the Start menu), what does the Task Manager measure for each of these applications in terms of memory consumption? This may provide some insight as to whether the memory utilization you are seeing in the actual report-building process is unusually high (is there a lot of extra overhead for your routine?), or just typical given the size of the workbook you are generating.
In terms of CPU utilization, this one is a bit more difficult to pinpoint and is certainly dependent on your hardware as well as implementation details in your code. Running a VS Profiler against your routine would certainly be interesting, if you have this tool available to you. Generally speaking, the CPU time could be broken up into a couple of broad categories: CPU cycles used to "build" your workbook and CPU cycles used to "calculate" it. It could be helpful to determine which of these dominates the CPU. One way to do this might be to ensure, if possible, that calculations don't occur until you are finished actually generating the workbook. In fact, avoiding any unnecessary calculations could potentially speed things up; it depends on the workbook, though. You can avoid calculations by setting IWorkbookSet.Calculation to Manual mode and not calling any of the IWorkbook's "Calculate" methods (Calculate/CalculateFull/CalculateFullRebuild) until you are finished with this process. If you don't have access to a profiler tool, maybe set some timers and Console.WriteLines and monitor the Task Manager to see how your CPU fluctuates during different parts of your routine. With any luck you might be able to better isolate what part of the routine is taking the most time.

Software memory bit-flip detection for platforms without ECC

Most available desktop (cheap) x86 platforms still have no ECC memory support (Error Checking and Correction). But the rate of memory bit-flip errors is still growing (not the best SO thread; the large-scale CERN 2007 study "Data integrity": "Bit Error Rate of 10^-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; Google's 2009 "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware with a data-intensive load (8 GB/s of reading), this means that a single bit flip may occur every minute (the 10^-12 vendor BER from CERN 2007) or once in two days (the 10^-16 BER from CERN 2007). The Google 2009 study says that there can be up to 25,000-75,000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1-5 bit errors per hour for 8 GB of RAM ("mean correctable error rates of 2000-6000 per GB per year").
So, I want to know: is it possible to add some kind of software error detection in a system-wide manner (checking both user and kernel memory)? For example, create a patch for the Linux kernel and/or for the system compiler to add some checksumming of every memory page, and try to detect silent memory corruption (bit flips) by regularly recomputing checksums?
For example, can we see all writes to memory (both from user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.
I also understand that the better way to protect data from memory bit flips is to switch to ECC hardware, but most PCs out there are still non-ECC.
The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect whether the machine has ECC modules and complain (or print a warning) when it doesn't.
http://www.cyberciti.biz/faq/ecc-memory-modules/
For example, can we see all writes to memory (both from user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?
Er, you will never "see" the bit flips on the bus. They are literally caused by a particle hitting the RAM, flipping a bit. Only much later can you notice that you read out something different than you wrote in. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e., create a shadow copy of what is in your real RAM, so you can verify that every read returns what was written to that location).
try to detect silent memory corruption (bit flips) by regularly recomputing checksums?
The Redis guy has a nice write-up on an algorithm for testing RAM for problems (http://antirez.com/news/43). But this is really looking for RAM errors, not random bit flips.
If "recompute checksums" only works when you are NOT writing to the memory. That might be "good enough" but you'll need to figure out which pages are not being written to.
To catch 100% of the errors, every write must be preceded by computing the checksum of that block of memory and comparing it to the recorded checksum (to make sure that block hasn't degraded in RAM). Only then is it safe to do the write and then update the checksum. As you can imagine, the performance of this will be horrible (at least 100x slower).
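For concreteness, here is a minimal sketch of that check-before-every-write idea (hypothetical helper names; a toy additive checksum is used just for brevity, a real scheme would use a proper CRC over whole pages). Note that every store pays for two full passes over the block:

    /* Sketch of the "verify checksum before every write" idea discussed above.
     * Hypothetical helper names; a trivial checksum is used for brevity.
     * Real schemes would use a proper CRC and cover whole pages. */
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_WORDS 512   /* 4 KiB block of 64-bit words */

    struct tracked_block {
        uint64_t data[BLOCK_WORDS];
        uint64_t checksum;    /* recorded checksum of data[] */
    };

    static uint64_t block_checksum(const struct tracked_block *b)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < BLOCK_WORDS; i++)
            sum += b->data[i];         /* toy checksum; use a CRC in practice */
        return sum;
    }

    /* Every write pays for two full-block checksum passes: one to detect a
     * bit flip that happened since the last write, one to re-record the sum. */
    static void guarded_write(struct tracked_block *b, size_t idx, uint64_t value)
    {
        assert(block_checksum(b) == b->checksum);  /* detect silent corruption */
        b->data[idx] = value;
        b->checksum = block_checksum(b);           /* update the recorded sum */
    }

    int main(void)
    {
        static struct tracked_block blk;           /* zero-initialized */
        blk.checksum = block_checksum(&blk);
        guarded_write(&blk, 42, 0xdeadbeef);
        return 0;
    }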
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.
Well, there is a simple method to detect 100% of the errors, at a cost of 50% performance: Just run the computation on 2 boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid.) If the results differ, you have detected an error.
See also:
https://www.linuxquestions.org/questions/linux-hardware-18/how-to-detect-ecc-memory-errors-under-linux-886011/
The answer to the question is yes, and a proof for that is the software SoftECC posted in the comments!
Just a note that SoftECC is a kernel-level solution. If a user-land app were used, it would be a third level of redundancy, which seems unnecessary.

cooperative memory usage across threads?

I have an application that has multiple threads processing work from a todo queue. I have no influence over what gets into the queue or in what order (it is fed externally by the user). A single work item from the queue may take anywhere between a couple of seconds and several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2 GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8 GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this.
1. Plan conservatively and run only 4 threads. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
2. Make each thread check available memory (or rather, the total memory allocated by all threads) before starting a new item. Only start when more than 2 GB of memory is left. Recheck periodically, hoping that other threads will finish their memory hogs and we may start eventually.
3. Try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
4. More ideas?
I'm currently tending towards number 2 because it seems simple to implement and solves most cases. However, I'm still wondering what standard ways of handling situations like this exist. The operating system must do something very similar on a process level, after all...
regards,
Sören
So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (eg. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
I have continued the discussion on Herb Sutter's blog and it provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!
