How to find number of memory accesses

Can anybody tell me a Unix command that can be used to find the number of memory accesses that took place in a given interval? vmstat, top and sar only give the amount of physical memory space occupied/available, but do not give the number of memory accesses in a given interval.

If I understand what you're asking, such a feature would almost certainly require hardware support at a very low level (e.g. a counter of some sort that monitors memory bus activity). I don't think such support is available for the common architectures supported by Unix or Linux, so I'm going to go out on a limb and say that no such Unix command exists.
The situation is somewhat different when considering memory in units of pages, because most architectures that support virtual memory have dedicated MMU hardware which operates at that level of granularity, and can be accessed by the operating system. But as far as I know, the sorts of counter data you'd get from the MMU would represent events like page faults, allocations, and releases, rather than individual reads or writes.
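As a rough illustration of the page-level counters mentioned above, here is a minimal Python sketch that reads the per-process page-fault counts the OS keeps. It assumes a Unix-like system where the standard `resource` module is available; the helper names `page_faults` and `faults_while_touching` are made up for this example, and the numbers you get depend on the kernel and the allocator.

```python
import resource

PAGE_SIZE = resource.getpagesize()  # commonly 4096 bytes


def page_faults():
    """Return (minor, major) page-fault counts for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def faults_while_touching(num_bytes):
    """Touch one byte per page of a fresh buffer and report the fault deltas."""
    minor_before, major_before = page_faults()
    buf = bytearray(num_bytes)
    for offset in range(0, num_bytes, PAGE_SIZE):
        buf[offset] = 1  # a write needs the page to be backed by a real frame
    minor_after, major_after = page_faults()
    return minor_after - minor_before, major_after - major_before


if __name__ == "__main__":
    minor, major = faults_while_touching(64 * 1024 * 1024)
    print(f"minor faults: {minor}, major faults: {major}")
```

Note that this counts page faults, not individual reads or writes, which is exactly the limitation described above.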

Related

Why is the memory address printed with {:p} much bigger than my RAM specs?

I want to print the memory location (address) of a variable with:
let x = 1;
println!("{:p}", &x);
This prints the hex value 0x7fff51ef6380 which in decimal is 140734568031104.
My computer has 16GB of RAM, so why this huge number? Does the x64 architecture use large intervals between addresses instead of simple increments of 1 when accessing memory locations?
In x86, usually the first location starts at 0, then 1, 2, etc. so the highest number you can have is around 4 billion, so the address was always less than or equal to about 4 billion.
Why is this not the case with x64?

What you see here is an effect of virtual memory. Memory management is hard and it becomes even harder when the operating system and tens or hundreds of processes have to share the memory. In order to handle this huge complexity, the concept of virtual memory was introduced. I'll just briefly explain the basics here; the topic is far more complex and you should read about it somewhere else, too.
On most modern computers, each process thinks that it owns (almost) the complete memory space. But processes never deal with physical addresses, only with virtual ones. These virtual addresses are mapped to physical ones each time the process actually accesses memory. This translation of addresses is done by the so-called MMU (memory management unit). The rules for how to map the addresses are set up by the operating system.
When you boot your PC, the operating system creates an initial mapping. Every time you start a process, the operating system adds a few slices of physical memory to the process and modifies the mapping appropriately. That way, the process has memory to play with.
On x86_64, addresses are 64 bits wide, so each process thinks it owns all of those 2^64 addresses. This is not true, of course:
There isn't a single PC in the world with that much memory. (In fact, most x86-64 CPUs today implement only 48 bits of virtual address, which covers 2^48 bytes, i.e. 256 TiB or roughly 281 TB, and the physical address width is also well below 64 bits. Even that is more than enough for now, apparently.)
Even if you had that much memory, there are other processes which use part of that memory, too.
So what happens when you try to read an address which isn't mapped (which, in 64-bit land, is the case for the vast majority of addresses)? The MMU triggers a page fault, which makes the CPU hand control to the operating system to deal with it.
"What I mean is that in x86, usually first location starts at 0, then 1, 2, etc. so the highest number you can have is around 4 billion."
That is true, but it is also true if your x86 system has less than 4GB of RAM. Virtual memory has existed for quite some time already.
So that's a short summary of why you see such big addresses. Again, please note that I glossed over many details here.
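To make the translation step a bit more concrete, here is a toy Python sketch of what the MMU does conceptually. It is only a model: a real page table is a multi-level structure walked by hardware, and the 4 KiB page size, the `translate` helper, the `PageFault` exception and the frame number 42 are all assumptions made up for the example.

```python
PAGE_SIZE = 4096  # assumed page size; 4 KiB is common on x86-64


class PageFault(Exception):
    """Toy stand-in for the trap the MMU raises on an unmapped address."""


def translate(page_table, virtual_address):
    """Map a virtual address to a physical one using a dict-based page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        raise PageFault(hex(virtual_address))
    return page_table[page_number] * PAGE_SIZE + offset


# Hypothetical mapping: the page holding 0x7fff51ef6380 lives in physical frame 42.
page_table = {0x7fff51ef6380 // PAGE_SIZE: 42}

print(hex(translate(page_table, 0x7fff51ef6380)))  # small physical address

try:
    translate(page_table, 0x1000)  # nothing mapped here
except PageFault as fault:
    print("page fault at", fault)
```

The huge virtual address ends up at a small physical address, and almost every other virtual address simply has no mapping at all.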

The pointers your program works with are in virtual address space. x86-64 uses 64-bit pointers. This was one of the major goals of AMD64, along with adding more integer and XMM registers. You are correct that i386 only has 32-bit pointers which only cover 4GB of address space in each process.
0x7fff51ef6380 looks like a stack pointer, which I guess makes sense for that code.
Linux on x86-64 (for example) puts the stack near the top of the lower canonical address range. Current x86-64 hardware implements only 48-bit virtual addresses, and the canonical-address rule (bits 48 to 63 must be copies of bit 47) is the mechanism that prevents software from depending on that. This allows the address space to be extended in the future without breaking software.
The amount of physical RAM in your system has nothing to do with this. You'd see (approximately) the same number on an x86-64 system with 128MB of RAM, +/- stack address space layout randomization (ASLR).
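A quick way to sanity-check the numbers from the question is to do the arithmetic directly. The sketch below assumes 48-bit virtual addresses (4-level paging); the helper names `is_canonical` and `in_lower_half` are invented for illustration.

```python
VIRTUAL_BITS = 48  # assumption: 4-level paging with 48-bit virtual addresses


def is_canonical(address, bits=VIRTUAL_BITS):
    """True if bits 63 down to (bits - 1) of a 64-bit address are all equal."""
    upper = address >> (bits - 1)  # bit 47 and everything above it
    all_ones = (1 << (64 - bits + 1)) - 1
    return upper == 0 or upper == all_ones


def in_lower_half(address, bits=VIRTUAL_BITS):
    """True if the address lies in the lower canonical range (user space on Linux)."""
    return address < (1 << (bits - 1))


addr = 0x7fff51ef6380
print(addr)                                   # 140734568031104, as in the question
print(is_canonical(addr), in_lower_half(addr))
print(hex((1 << (VIRTUAL_BITS - 1)) - addr))  # distance below the top of the lower half
```

So the address is a perfectly ordinary one: it sits close to the top of the lower half of a 48-bit address space, which is where a stack variable would be expected.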

CPUs in multi-core architectures and memory access

I wondered how memory access is handled "in general" if, for example, 2 cores of a CPU try to access memory at the same time (over the memory controller)? Actually the same applies when a core and a DMA-enabled IO device try to access memory in the same way.
I think the memory controller is smart enough to utilise the address bus and handle those requests concurrently; however, I'm not sure what happens when they try to access the same location, or when the IO operation monopolises the address bus and there's no room for the CPU to move on.
Thx

The short answer is "it's complex, but access can certainly potentially occur in parallel in certain situations".
I think your question is a bit too black and white: you may be looking for an answer like "yes, multiple devices can access memory at the same time" or "no they can't", but the reality is that first you'd need to describe some specific hardware configuration, including some of the low-level implementation details and optimization features to get an exact answer. Finally you'd need to define exactly what you mean by "the same time".
In general, a good first-order approximation is that hardware will make it appear that all hardware can access memory approximately simultaneously, possibly with an increase in latency and a decrease in bandwidth due to contention. At the very fine-grained timing level, access by one device may indeed postpone access by another device, or it may not, depending on many factors. It is extremely unlikely you would need this information to implement software correctly, and quite unlikely you need to know the details even to maximize performance.
That said, if you really need to know the details, read on and I can give some general observations on some kind of idealized laptop/desktop/server scale hardware.
As Matthias mentioned, you first have to consider caching. Caching means that any read or write operation subject to caching (which includes nearly all CPU requests and many other types of requests as well) may not touch memory at all, so in that sense many cores can "access" memory (at least the cache image of it) simultaneously.
If you then consider requests that miss in all cache levels, you need to know about the configuration of the memory subsystem. In general a RAM chip can only do "one thing" at a time (i.e., commands [1] such as read and write apply to the entire module), and that usually extends to DRAM modules comprised of several chips and also to a series of DRAMs connected via a bus to a single memory controller.
So you can say that electrically speaking, the combination of one memory controller and its attached RAM is likely to be doing only one thing at once. Now that thing is usually something like reading bytes out of a physically contiguous span of bytes, but that operation could actually help handle several requests from different devices at once: even though each device sends separate requests to the controller, good implementations will coalesce requests to the same or nearby [2] area of memory.
Furthermore, even the CPU may have such abilities: when a new request occurs it can/must notice that an existing request is in progress for an overlapping region and tie the new request to an old one.
Still, you can say that for a single memory controller you'll usually be serving the request of one device at a time, absent unusual opportunities to combine requests. Now the requests themselves typically take on the order of nanoseconds, so many separate requests can be served in a small unit of time, so this "exclusiveness" is fine-grained and not generally noticeable [3].
Now above I was careful to limit the discussion to a single memory controller. When you have multiple memory controllers [4] you can definitely have multiple devices accessing memory simultaneously even at the RAM level. Here each controller is essentially independent, so if the requests from two devices map to different controllers (different NUMA regions) they can proceed in parallel.
That's the long answer.
[1] In fact, the command stream is lower level and more complex than things like "read" or "write" and involves concepts such as opening a memory page, streaming bytes from it, etc. "What Every Programmer Should Know About Memory" serves as an excellent intro to the topic.
[2] For example, imagine two requests for adjacent bytes in memory: it is possible the controller can combine them into a single request if they fit within the bus width.
[3] Of course if you are competing for memory across several devices, the overall impact may be very noticeable: a reduction in per-device bandwidth and an increase in latency, but what I mean is that the sharing is fine-grained enough that you can't generally tell the difference between finely-sliced exclusive access and some hypothetical device which makes simultaneous progress on each request in each period.
[4] The most common configuration on modern hardware is one memory controller per socket, so on a 2P system you'd usually have two controllers, although other ratios (both higher and lower) are certainly possible.
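The coalescing idea can be sketched in a few lines of Python. This is a toy model, not how any particular controller works: the 64-byte transfer granularity, the device names and the `coalesce` helper are all assumptions for illustration.

```python
from collections import defaultdict

BURST_BYTES = 64  # assumed transfer granularity; real hardware varies


def coalesce(requests, burst_bytes=BURST_BYTES):
    """Group (device, address) requests by the aligned block they fall in.

    Returns a dict mapping block base address -> list of devices that a
    single access to that block would serve.
    """
    blocks = defaultdict(list)
    for device, address in requests:
        base = address - (address % burst_bytes)
        blocks[base].append(device)
    return dict(blocks)


requests = [("cpu0", 0x1000), ("cpu1", 0x1008), ("dma", 0x2000), ("cpu0", 0x103F)]
merged = coalesce(requests)
for base, devices in sorted(merged.items()):
    print(hex(base), devices)
```

Here four incoming requests collapse into two block accesses, because three of them land in the same 64-byte block.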

There are dozens of things that come into play. E.g. on the lowest level there are bus arbitration mechanisms which allow multiple participants to access a shared address and data bus.
On a higher level there are also things like CPU caches that need to be considered: if a CPU reads from memory it might only read from its local cache, which might not reflect the state that exists in another CPU core's local cache. To synchronize memory between cache instances in multicore systems there exist cache coherence protocols which are implemented in the CPUs. These have to guarantee that if one CPU writes to shared memory, the caches of all other CPUs (which might also contain a copy of the memory location's content) get updated or invalidated.
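As a very rough picture of what a coherence protocol has to achieve, here is a toy write-invalidate scheme in Python. It is deliberately simplistic (write-through, no states, no bus) and is not a model of MESI or of any real protocol; the class `ToyCoherentSystem` is invented for this sketch.

```python
class ToyCoherentSystem:
    """Toy write-invalidate scheme: one shared memory, one private cache per core."""

    def __init__(self, num_cores):
        self.memory = {}
        self.caches = [{} for _ in range(num_cores)]

    def read(self, core, address):
        cache = self.caches[core]
        if address not in cache:  # miss: fetch the value from memory
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, core, address, value):
        for other, cache in enumerate(self.caches):
            if other != core:
                cache.pop(address, None)  # invalidate stale copies elsewhere
        self.caches[core][address] = value
        self.memory[address] = value  # write-through keeps the toy simple


system = ToyCoherentSystem(num_cores=2)
system.read(1, 0x10)        # core 1 caches the old value (0)
system.write(0, 0x10, 7)    # core 0 writes; core 1's copy is invalidated
print(system.read(1, 0x10)) # core 1 misses and sees the new value
```

Without the invalidation loop in `write`, core 1 would keep returning its stale cached value, which is the problem coherence protocols exist to prevent.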

Why do we need virtual memory?

So my understanding is that every process has its own virtual memory space ranging from 0x0 to 0xFF....F. These virtual addresses correspond to addresses in physical memory (RAM). Why is this level of abstraction helpful? Why not just use the direct addresses?
I understand why paging is beneficial, but not virtual memory.

There are many reasons to do this:
If you have a compiled binary, each function has a fixed address in memory and the assembly instructions to call functions have that address hardcoded. If virtual memory didn't exist, two programs couldn't be loaded into memory and run at the same time, because they'd potentially need to have different functions at the same physical address.
If two or more programs are running at the same time (or are being context-switched between) and use direct addresses, a memory error in one program (for example, reading a bad pointer) could destroy memory being used by the other process, taking down multiple programs due to a single crash.
On a similar note, there's a security issue where a process could read sensitive data in another program by guessing what physical address it would be located at and just reading it directly.
If you try to combat the two above issues by paging out all the memory for one process when switching to a second process, you incur a massive performance hit because you might have to page out all of memory.
Depending on the hardware, some memory addresses might be reserved for physical devices (for example, video RAM, external devices, etc.). If programs are compiled without knowing that those addresses are significant, they might physically break plugged-in devices by reading and writing to their memory. Worse, if that memory is read-only or write-only, the program might write bits to an address expecting them to stay there and then read back different values.
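The first few points can be illustrated with a small Python model in which two processes use the same virtual address without interfering. Everything here is a toy: the `Process` class, the frame numbers 7 and 9, and the dict standing in for physical memory are assumptions for the example.

```python
PAGE_SIZE = 4096  # assumed page size

physical_memory = {}  # physical address -> stored value (toy model)


class Process:
    def __init__(self, name, page_table):
        self.name = name
        self.page_table = page_table  # virtual page number -> physical frame number

    def _physical(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        return self.page_table[page] * PAGE_SIZE + offset  # KeyError means unmapped

    def store(self, virtual_address, value):
        physical_memory[self._physical(virtual_address)] = value

    def load(self, virtual_address):
        return physical_memory.get(self._physical(virtual_address), 0)


# Both programs were "compiled" to use virtual address 0x400000,
# but the OS backs that page with a different frame for each process.
a = Process("a", {0x400: 7})
b = Process("b", {0x400: 9})
a.store(0x400000, 1)
b.store(0x400000, 2)
print(a.load(0x400000), b.load(0x400000))
```

Each process reads back its own value, and neither has any virtual address that reaches the other's frame.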
Hope this helps!

Short answer: Program code and data required for execution of a process must reside in main memory to be executed, but main memory may not be large enough to accommodate the needs of an entire process.
Two proposals
(1) Using a very large main memory to alleviate any need for storage allocation: it's not feasible due to very high cost.
(2) Virtual memory: It allows processes that may not be entirely in the memory to execute by means of automatic storage allocation upon request. The term virtual memory refers to the abstraction of separating LOGICAL memory--memory as seen by the process--from PHYSICAL memory--memory as seen by the processor. Because of this separation, the programmer needs to be aware of only the logical memory space while the operating system maintains two or more levels of physical memory space.
More:
Early computer programmers divided programs into sections that were transferred into main memory for a period of processing time. As higher level languages became popular, the efficiency of complex programs suffered from poor overlay systems. The problem of storage allocation became more complex.
Two theories for solving the problem of inefficient memory management emerged -- static and dynamic allocation. Static allocation assumes that the availability of memory resources and the memory reference string of a program can be predicted. Dynamic allocation relies on memory usage increasing and decreasing with actual program needs, not on predicting memory needs.
Program objectives and machine advancements in the '60s made the predictions required for static allocation difficult, if not impossible. Therefore, the dynamic allocation solution was generally accepted, but opinions about implementation were still divided.
One group believed the programmer should continue to be responsible for storage allocation, which would be accomplished by system calls to allocate or deallocate memory. The second group supported automatic storage allocation performed by the operating system, because of increasing complexity of storage allocation and emerging importance of multiprogramming.
In 1961, two groups proposed a one-level memory store. One proposal called for a very large main memory to alleviate any need for storage allocation. This solution was not possible due to very high cost. The second proposal is known as virtual memory.
cne/modules/vm/green/defn.html

To execute a process, its data is needed in the main memory (RAM). This might not be possible if the process is large.
Virtual memory provides an idealized abstraction of the physical memory which creates the illusion of a larger virtual memory than the physical memory.
Virtual memory combines active RAM and inactive memory on disk to form a large range of virtual contiguous addresses. Implementations usually require hardware support, typically in the form of a memory management unit built into the CPU.

The main purpose of virtual memory is multi-tasking and running large programmes. It would be great to use only physical memory, because it is a lot faster, but RAM is a lot more expensive than disk storage.
Good luck!

memory management and segmentation faults in modern day systems (Linux)

In modern-day operating systems, memory is available as an abstracted resource. A process is exposed to a virtual address space (which is independent from address space of all other processes) and a whole mechanism exists for mapping any virtual address to some actual physical address.
My doubt is:
If each process has its own address space, then it should be free to access any address in the same. So apart from permission-restricted sections like .data, .bss, .text etc., one should be free to change the value at any address. But this usually gives a segmentation fault. Why?
For acquiring dynamic memory, we need to do a malloc. If the whole virtual space is made available to a process, then why can't it directly access it?
Different runs of a program result in different addresses for variables (both on stack and heap). Why is it so, when the environment for each run is the same? Does it not affect the amount of addressable memory available for usage? (Does it have something to do with address space randomization?)
Some links on memory allocation (e.g. in heap).
The data available at different places is very confusing, as they talk about old and modern times, often not distinguishing between them. It would be helpful if someone could clarify the doubts while keeping modern systems in mind, say Linux.
Thanks.

Technically, the operating system is able to allocate any memory page on access, but there are important reasons why it shouldn't or can't:
different memory regions serve different purposes.
code. It can be read and executed, but shouldn't be written to.
literals (strings, const arrays). This memory is read-only and should be.
the heap. It can be read and written, but not executed.
the thread stack. There is no reason for two threads to access each other's stack, so the OS might as well forbid that. Moreover, the thread stack can be de-allocated when the thread ends.
memory-mapped files. Any changes to this region should affect a specific file. If the file is open for reading, the same memory page may be shared between processes because it's read-only.
the kernel space. Normally the application should not (or cannot) access that region; only kernel code can. It's basically a scratch space for the kernel and it's shared between processes. The network buffer may reside there, so that it's always available for writes, no matter when the packet arrives.
...
The OS might assume that all unrecognised memory access is an attempt to allocate more heap space, but:
if an application touches the kernel memory from user code, it must be killed. On 32-bit Windows, all memory above 1<<31 (top bit set) or above 3<<30 (top two bits set) is kernel memory. You should not assume any unallocated memory region is in the user space.
if an application thinks about using a memory region but doesn't tell the OS, the OS may allocate something else to that memory (OS: sure, your file is at 0x12341234; App: but I wanted to store my data there). You could tell the OS by touching the end of your array (which is unreliable anyway), but it's easier to just call an OS function. It's just a good idea that the function call is "give me 10MB of heap", not "give me 10MB of heap starting at 0x12345678".
If the application allocates memory by using it then it typically does not de-allocate at all. This can be problematic as the OS still has to hold the unused pages (but the Java Virtual Machine does not de-allocate either, so hey).
Different runs of a program results in different addresses for variables
This is called address space layout randomisation (ASLR) and is used, alongside proper permissions (stack space is not executable), to make buffer overflow attacks much more difficult. You can still kill the app, but not execute arbitrary code.
Some links on memory allocation (e.g. in heap).
Do you mean which algorithm the allocator uses? The simplest algorithm is to always allocate at the first available position, link each memory block to the next, and store a flag saying whether it's a free or a used block. More advanced algorithms always allocate blocks whose size is a power of two or a multiple of some fixed size to limit memory fragmentation (lots of small free blocks), or link the blocks in different structures to find a free block of sufficient size faster.
An even simpler approach is to never de-allocate: just point to the first (and only) free block and hold its size. If the remaining space is too small, throw it away and ask the OS for a new one.
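That "never de-allocate" scheme is often called a bump allocator, and a minimal Python sketch of it could look like the following. The class name `BumpAllocator`, the 1 MiB default region size and the `request_region` hook (standing in for asking the OS for memory) are assumptions made for this example.

```python
class BumpAllocator:
    """Never frees: hands out consecutive slices of the current region."""

    def __init__(self, region_size=1 << 20, request_region=None):
        self.region_size = region_size
        # request_region stands in for asking the OS for memory (hypothetical hook).
        self.request_region = request_region or (lambda size: bytearray(size))
        self.regions = []    # every region ever requested
        self.region = None   # region currently being carved up
        self.offset = 0      # start of the single free block
        self.remaining = 0   # size of the single free block

    def alloc(self, size):
        if size > self.remaining:
            # Throw the leftover space away and start a fresh region.
            new_size = max(size, self.region_size)
            self.region = self.request_region(new_size)
            self.regions.append(self.region)
            self.offset = 0
            self.remaining = new_size
        view = memoryview(self.region)[self.offset:self.offset + size]
        self.offset += size
        self.remaining -= size
        return view


bump = BumpAllocator(region_size=16)
first = bump.alloc(10)
second = bump.alloc(4)
third = bump.alloc(8)   # does not fit in the 2 bytes left, so a new region is used
print(len(bump.regions))
```

The whole allocator state is one pointer and one size, which is why it is so fast, and also why the 2 leftover bytes of the first region are simply wasted.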
There's nothing magical about memory allocators. All they do is to:
ask the OS for a large region and
partition it to smaller chunks
without
wasting too much space or
taking too long.
Anyway, the Wikipedia article about memory allocation is http://en.wikipedia.org/wiki/Memory_management .
One interesting algorithm is called "(binary) buddy blocks". It holds several pools of a power-of-two size and splits them recursively into smaller regions. Each region is then either fully allocated, fully free or split in two regions (buddies) that are not both fully free. If it's split, then one byte suffices to hold the size of the largest free block within this block.
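Here is one possible Python sketch of such a buddy scheme, using a single pool and an array-backed binary tree in which each node stores the size of the largest free block below it. The class name `BuddyAllocator` and the choice to return offsets rather than pointers are assumptions for illustration; real implementations differ in many details.

```python
class BuddyAllocator:
    """Binary-buddy allocator over `size` units (size must be a power of two).

    The tree is stored as an array; longest[i] is the size of the largest
    free block inside node i (0 means the node is fully allocated).
    """

    def __init__(self, size):
        if size < 1 or (size & (size - 1)) != 0:
            raise ValueError("size must be a power of two")
        self.size = size
        self.longest = [0] * (2 * size - 1)
        node_size = size * 2
        for i in range(2 * size - 1):
            if ((i + 1) & i) == 0:  # i + 1 is a power of two: a new tree level starts
                node_size //= 2
            self.longest[i] = node_size

    def alloc(self, request):
        """Return the offset of a free block of at least `request` units, or None."""
        if request < 1:
            raise ValueError("request must be positive")
        want = 1
        while want < request:  # round up to a power of two
            want *= 2
        if self.longest[0] < want:
            return None
        index, node_size = 0, self.size
        while node_size != want:  # descend to a block of exactly the wanted size
            left = 2 * index + 1
            index = left if self.longest[left] >= want else left + 1
            node_size //= 2
        self.longest[index] = 0  # mark the block as fully allocated
        offset = (index + 1) * node_size - self.size
        while index:  # refresh the ancestors
            index = (index - 1) // 2
            self.longest[index] = max(self.longest[2 * index + 1],
                                      self.longest[2 * index + 2])
        return offset

    def free(self, offset):
        """Release the block that was returned by alloc() at `offset`."""
        index, node_size = offset + self.size - 1, 1
        while self.longest[index] != 0:  # climb from the leaf to the allocated node
            if index == 0:
                raise ValueError("offset is not allocated")
            index = (index - 1) // 2
            node_size *= 2
        self.longest[index] = node_size
        while index:
            index = (index - 1) // 2
            node_size *= 2
            left = self.longest[2 * index + 1]
            right = self.longest[2 * index + 2]
            # Two fully free buddies merge back into one block of the parent's size.
            self.longest[index] = node_size if left + right == node_size else max(left, right)
```

For example, with `pool = BuddyAllocator(8)`, a request for 3 units is rounded up to 4, and once every block has been freed again the buddies merge until `pool.longest[0]` is back to 8.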

Checking the amount of available RAM within a running program

A friend of mine was asked, during a job interview, to write a program that measures the amount of available RAM. The expected answer was using malloc() in a binary-search manner: allocating larger and larger portions of memory until getting a failure message, reducing the portion size, and summing the amount of allocated memory.
I believe that this method will measure the amount of virtual, not physical, memory. But I got curious about the matter.
Is there a way to tell the amount of available RAM from within the program, without using exec(dmesg |grep -i memory) ?

You are correct: malloc() makes no distinction between physical or virtual memory. In fact, that's the whole point of virtual memory: to make such details irrelevant to programs.
You can find out but it is OS-specific. For example, Linux.

The only way to do this is to use some OS-specific functionality. Using malloc() is useless for a number of reasons:
it measures virtual memory
the OS may well have per-process cap on memory allocations
allocating much more memory than is physically available often degrades the platform's stability to the point where the "go back one" algorithm suggested in the question probably won't work
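For reference, the probing strategy from the question can be sketched in Python against a simulated allocator, which also shows what it really measures. The `FakeAllocator` class and its `limit` are hypothetical stand-ins for malloc() and whatever cap happens to apply; no real memory is allocated here.

```python
class FakeAllocator:
    """Simulated allocator: succeeds until a hidden limit is used up."""

    def __init__(self, limit):
        self.remaining = limit  # stands in for a quota or address-space cap

    def try_allocate(self, size):
        if size > self.remaining:
            return False
        self.remaining -= size
        return True


def probe_total(try_allocate, start=1):
    """Grow the chunk size until a request fails, then shrink, summing successes."""
    total = 0
    chunk = start
    while try_allocate(chunk):  # phase 1: larger and larger chunks
        total += chunk
        chunk *= 2
    while chunk > 1:            # phase 2: smaller and smaller chunks
        chunk //= 2
        if try_allocate(chunk):
            total += chunk
    return total


allocator = FakeAllocator(limit=1000)
print(probe_total(allocator.try_allocate))
```

The search recovers the allocator's limit exactly, but that limit is whatever cap the process runs into, which is the point made in the list above: it need not have anything to do with the amount of physical RAM.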

This is OS-specific and you should collect such information from the OS services unless you want to make your own memory management layer.

Using malloc() will only tell you how much memory can be allocated to a single process. There may be reasons why this is lower than the total amount of virtual memory. For instance, you might have OS quota or a per-process 32-bit-limited address space.
(And, of course, virtual memory >= RAM)

Very OS specific but for Linux the information about system memory is in /proc/meminfo. You can also probably use the sysctl interface (http://www.linuxjournal.com/article/2365) to get this data in a C program.
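As a small illustration of the /proc/meminfo route, here is a Python sketch that parses the file into a dictionary. It assumes the usual "Name: value kB" line format and a Linux system; the helper names `parse_meminfo` and `read_meminfo` are made up for the example, and which fields are present (for instance MemAvailable) depends on the kernel version.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value.

    Values followed by the unit "kB" are converted to bytes (1 kB = 1024 bytes
    here); unit-less values such as page counts are kept as they are.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        info[name.strip()] = value
    return info


def read_meminfo(path="/proc/meminfo"):
    with open(path) as f:
        return parse_meminfo(f.read())


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    info = read_meminfo()
    for field in ("MemTotal", "MemAvailable"):
        if field in info:
            print(field, info[field] // (1024 * 1024), "MiB")
```

Unlike the malloc() probing approach, this asks the kernel directly, so it reports physical memory rather than whatever a single process is allowed to allocate.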
