cache read system memory vs cpu read system memory - memory

On an arm based SoC running Android/Linux, I observed following:
Allocate a memory area as un-cached for device DMA input. After DMA finishes, the content of this memory area is copied to another system memory area.
Alloc a memory area as cached for device DMA input. After DMA finished, invalid the memory range, then copy the content of this memory area to anther system memory area.
The size of memory area allocated is about 2MB which is larger than the cache size (the L2 cache size is 256KB).
method 2 is x10 faster than method 1
That is: the memory copy operation of method 2 is x10 faster than method 1
I speculate that method 2 using cache read by cache line size from system memory when copying and the method 1 needs cpu read by bus transaction size from system memory bypassing the cache hardware.
However, I cannot find explicit explanation. I appreciate who can help providing detailed explaination.

There are so many hardware items involved that it is difficult to give specifics. The SOC determines a lot of this. However, what you observe is typical in performance terms for modern ARM systems.
The main factor is SDRAM. All DRAM is structured with 'rows' and 'columns'.DRAM history On the DRAM chip, an entire 'row' can be read at one time. Ie, there is a matrix of transistors and there is a physical point/wiring where an entire row can be read (in fact there maybe SRAM to store the ROW on the chip). When you read another 'column', you need to 'un-charge/pre-charge' the wiring to access the new 'row'. This takes some time. The main point is that DRAM can read sequential memory very fast in large chunks. Also, there is no command overhead as the memory streams out with each clock edge.
If you mark memory as un-cached, then a CPU/SOC may issue single beat reads. Often these will 'pre-charge' consuming extra cycles during a single read/write and many extra commands must be sent to the DRAM device.
SDRAM also has 'banks'. A bank has a separate 'ROW' buffer (static RAM/multi-transistor memory) which allows you to read from one bank to another without having to recharge/re-read. The banks are often very far apart. If your OS has physically allocated the 'un-cached' memory in a different bank from the 2nd 'cached' area, then this will also add an additional efficiency. It common in an OS to manage cached/un-cached memory separately (for MMU issues). The memory pools are often distant enough to be in separate banks.

Related

Does a DMA controller copy one word of memory at a time?

A DMA controller greatly speeds up memory copy operations because the data in memory doesn't have to be read into the CPU.
From what I've read, DMA controllers can "copy a block of memory from one location to another" in one operation, but thinking about this at a low level, I'm guessing the DMA ultimately has to iterate over memory one word at a time. Is that correct? Is that one word per clock cycle? One word per two clock cycles? (one for read memory into the DMA, one for write to memory) Or does the DMA have a circuit that can somehow (I can't imagine how) copy large chunks of memory in one or two clock cycles?
If the CPU tells the DMA to copy 1024 bytes of memory from one address to another, how many clock cycles will the CPU have free to perform other tasks while waiting for the DMA to finish?
Is it possible to have an architecture where the DMA is doing a memory copy using one bus, while the CPU can access memory at the same time in a different area? Say, in a different bank?
I'm sure it's architecture dependent, so for the answers just pick one or more 8 or 16 bit home micros.
Yes this is architecture dependent. Usually there is a main memory bus, and one or more cache. All buses in the system are not required to be the same width. Memory buses are usually larger than processor words, it might be 64bits wide, so it loads 64bits at a time.
Then the destination might be the same memory bus, or PCIE, or even another bus that is memory mapped, in which case the transfers might be constrained by the destination bus width.
How many clock cycles are available again depend of how things are done. Usually in a µC the DMA triggers an interrupt when it is done, and the CPU does nothing. An other option is polling.
There are dual port memory but usually there is only 1 main memory bus. IIRC banks are usually a trick to avoid large addresses, but use the same memory bus.
Cache and bus arbitration are used to mitigate bus contention, the user really shouldn't care about that. You can have a look at your µC datasheet if you want reliable information.

calculate worst case time to find variable x in memory

I have question from exam but I don't understand the solution, can someone explain the solution for me ?
Memory access time =2.5*10^-7 sec
second memory time = 3*10^-6
TLB time = 10^-8
Given virtual address,value x and 3 level page table, how much time it takes to read x value from memory in the worst case?
the solution is : 10^-8 + 2.5*10^-7 + 3*(3*10^-6 + 2*2.5*10^-7) + 10^-8 = 1076*10^-7
It's pretty obvious that the solution is performing 2 TLB lookups, 7 memory accesses, and 3 secondary memory accesses.
Here are the steps in the process:
1) The CPU accesses the TLB to find the memory location that the virtual address maps to.
2) The CPU accesses main memory to look for the virtual address. This step fails.
3) The CPU accesses the page file (1 memory access to get the page file, 1 more to access the page file entry).
4) The CPU reads from secondary memory to get the page referred to in the page file.
5) Repeat steps 3 & 4 for each level in the page table.
There is no formula as far as I know to calculate best and worst times of memory accesses. However, there are various factors that influence it:
The width of the access. On 32-bit x86, 8-bit and 32-bit accesses tend to be faster than 16-bit ones.
Whether the access is aligned or not. Unaligned accesses tend to be slower than aligned accesses.
Whether accessed memory is cached. Accesses to cached memory are faster than accesses to uncached memory.
The NUMA domain of the accessed memory. Accessing memory belonging to a close NUMA domain is faster than accessing memory belonging to a far NUMA domain.
Whether paging is enabled. Accessing memory when paging is enabled involves traversing paging structures and therefore is slower.
The type of memory. For example, writing to video memory is slower than writing to "normal" memory. Respectively, reading from video memory is much much much slower than reading from "normal" memory.
Other factors I forgot to mention. It's hard to memorise them all.
Furthermore, the influence of each of these factors depends on the underlaying hardware, therefore it would be really hard to invent even an approximation formula that calculates best and worst times of memory accesses.

How does browser GPU memory usage works?

By pressing F12 and then Esc on Chrome, you can see a few options to tick. One of them is show FPS meter, which allows us to see GPU memory usage in real time.
I have a few questions regarding this GPU memory usage:
This GPU memory means the memory the webpage needs to store its code: variables, methods, images, cached videos, etc. Is this right to affirm?
Is there a reason as to why it has an upper bound of 512 Mb? Is there a way to reduce or increase it?
How much GPU memory usage is enough to see considerable slowdown on browser navigation?
If I have an array with millions of elements (just hypothetically), and I splice all the elements in the array, will it free the memory that was in use? Or will it not "really" free the memory, requiring an additional step to actually wipe it out?
1. What is stored in GPU memory
Although there are no hard-set rules on the type of data that can be stored in GPU-memory, the bulk of GPU memory generally contains single-frame resources like textures, multi-frame resources like vertex buffers and index buffer data, and programmable-shader compiled code fragments. So while in theory it is possible to store video's in GPU memory, as well as all kinds of other bulk data, in practice, for every streamed video only a bunch of frames will ever be in GPU-ram.
The main reason for this soft-selection of texture-like data sets is that a GPU is a parallel hardware architecture, and it expects the data to be compatible with that philosophy, which means that there are no inter-dependencies between sets of data (i.e. pixels). Decoding images from a video stream is more or less the same as resolving interdependence between data-blocks.
2. Is 512MB enough for everyone?
No. It's probably based on your hardware.
3. When does GPU memory become slow?
You have to know that some parts of the GPU memory are so fast you can't even start to appreciate the speed. There is nothing wrong with the speed of a GPU card. What matters is the time it takes to get the data IN that memory in the first place. That is called bandwidth, and the operations usually need to be synchronized. In that case, the driver will lock the Northbridge bus so that data can flow from main memory into GPU memory, and this locking + transfer takes quite some time.
So to answer the question, once it is uploaded, the GUI will remain fast, no matter how much more memory is used on the GPU card. The only thing that can slow it down, are changes to the GUI, and other GPU processes taking time to complete that may interfere with rendering operations.
4. Splicing ram memory frees it up?
I'm not quite sure what you mean by splicing. GPU memory is freed by applications that release that memory by using the API calls to do that. If you want to render you GPU memory blank, you'd have to grab the GPU handles of the resources first, upload 'clear' data into them, and then release the handles again, but (for normal single-threaded GPU applications) you can only do that in your own process context.

Why do we need virtual memory?

So my understanding is that every process has its own virtual memory space ranging from 0x0 to 0xFF....F. These virtual addresses correspond to addresses in physical memory (RAM). Why is this level of abstraction helpful? Why not just use the direct addresses?
I understand why paging is beneficial, but not virtual memory.
There are many reasons to do this:
If you have a compiled binary, each function has a fixed address in memory and the assembly instructions to call functions have that address hardcoded. If virtual memory didn't exist, two programs couldn't be loaded into memory and run at the same time, because they'd potentially need to have different functions at the same physical address.
If two or more programs are running at the same time (or are being context-switched between) and use direct addresses, a memory error in one program (for example, reading a bad pointer) could destroy memory being used by the other process, taking down multiple programs due to a single crash.
On a similar note, there's a security issue where a process could read sensitive data in another program by guessing what physical address it would be located at and just reading it directly.
If you try to combat the two above issues by paging out all the memory for one process when switching to a second process, you incur a massive performance hit because you might have to page out all of memory.
Depending on the hardware, some memory addresses might be reserved for physical devices (for example, video RAM, external devices, etc.) If programs are compiled without knowing that those addresses are significant, they might physically break plugged-in devices by reading and writing to their memory. Worse, if that memory is read-only or write-only, the program might write bits to an address expecting them to stay there and then read back different values.
Hope this helps!
Short answer: Program code and data required for execution of a process must reside in main memory to be executed, but main memory may not be large enough to accommodate the needs of an entire process.
Two proposals
(1) Using a very large main memory to alleviate any need for storage allocation: it's not feasible due to very high cost.
(2) Virtual memory: It allows processes that may not be entirely in the memory to execute by means of automatic storage allocation upon request. The term virtual memory refers to the abstraction of separating LOGICAL memory--memory as seen by the process--from PHYSICAL memory--memory as seen by the processor. Because of this separation, the programmer needs to be aware of only the logical memory space while the operating system maintains two or more levels of physical memory space.
More:
Early computer programmers divided programs into sections that were transferred into main memory for a period of processing time. As higher level languages became popular, the efficiency of complex programs suffered from poor overlay systems. The problem of storage allocation became more complex.
Two theories for solving the problem of inefficient memory management emerged -- static and dynamic allocation. Static allocation assumes that the availability of memory resources and the memory reference string of a program can be predicted. Dynamic allocation relies on memory usage increasing and decreasing with actual program needs, not on predicting memory needs.
Program objectives and machine advancements in the '60s made the predictions required for static allocation difficult, if not impossible. Therefore, the dynamic allocation solution was generally accepted, but opinions about implementation were still divided.
One group believed the programmer should continue to be responsible for storage allocation, which would be accomplished by system calls to allocate or deallocate memory. The second group supported automatic storage allocation performed by the operating system, because of increasing complexity of storage allocation and emerging importance of multiprogramming.
In 1961, two groups proposed a one-level memory store. One proposal called for a very large main memory to alleviate any need for storage allocation. This solution was not possible due to very high cost. The second proposal is known as virtual memory.
cne/modules/vm/green/defn.html
To execute a process its data is needed in the main memory (RAM). This might not be possible if the process is large.
Virtual memory provides an idealized abstraction of the physical memory which creates the illusion of a larger virtual memory than the physical memory.
Virtual memory combines active RAM and inactive memory on disk to form
a large range of virtual contiguous addresses. implementations usually require hardware support, typically in the form of a memory management
unit built into the CPU.
The main purpose of virtual memory is multi-tasking and running large programmes. It would be great to use physical memory, because it would be a lot faster, but RAM memory is a lot more expensive than ROM.
Good luck!

memory management and segmentation faults in modern day systems (Linux)

In modern-day operating systems, memory is available as an abstracted resource. A process is exposed to a virtual address space (which is independent from address space of all other processes) and a whole mechanism exists for mapping any virtual address to some actual physical address.
My doubt is:
If each process has its own address space, then it should be free to access any address in the same. So apart from permission restricted sections like that of .data, .bss, .text etc, one should be free to change value at any address. But this usually gives segmentation fault, why?
For acquiring the dynamic memory, we need to do a malloc. If the whole virtual space is made available to a process, then why can't it directly access it?
Different runs of a program results in different addresses for variables (both on stack and heap). Why is it so, when the environments for each run is same? Does it not affect the amount of addressable memory available for usage? (Does it have something to do with address space randomization?)
Some links on memory allocation (e.g. in heap).
The data available at different places is very confusing, as they talk about old and modern times, often not distinguishing between them. It would be helpful if someone could clarify the doubts while keeping modern systems in mind, say Linux.
Thanks.
Technically, the operating system is able to allocate any memory page on access, but there are important reasons why it shouldn't or can't:
different memory regions serve different purposes.
code. It can be read and executed, but shouldn't be written to.
literals (strings, const arrays). This memory is read-only and should be.
the heap. It can be read and written, but not executed.
the thread stack. There is no reason for two threads to access each other's stack, so the OS might as well forbid that. Moreover, the tread stack can be de-allocated when the tread ends.
memory-mapped files. Any changes to this region should affect a specific file. If the file is open for reading, the same memory page may be shared between processes because it's read-only.
the kernel space. Normally the application should not (or can not) access that region - only kernel code can. It's basically a scratch space for the kernel and it's shared between processes. The network buffer may reside there, so that it's always available for writes, no matter when the packet arrives.
...
The OS might assume that all unrecognised memory access is an attempt to allocate more heap space, but:
if an application touches the kernel memory from user code, it must be killed. On 32-bit Windows, all memory above 1<<31 (top bit set) or above 3<<30 (top two bits set) is kernel memory. You should not assume any unallocated memory region is in the user space.
if an application thinks about using a memory region but doesn't tell the OS, the OS may allocate something else to that memory (OS: sure, your file is at 0x12341234; App: but I wanted to store my data there). You could tell the OS by touching the end of your array (which is unreliable anyways), but it's easier to just call an OS function. It's just a good idea that the function call is "give me 10MB of heap", not "give me 10MB of heap starting at 0x12345678"
If the application allocates memory by using it then it typically does not de-allocate at all. This can be problematic as the OS still has to hold the unused pages (but the Java Virtual Machine does not de-allocate either, so hey).
Different runs of a program results in different addresses for variables
This is called memory layout randomisation and is used, alongside of proper permissions (stack space is not executable), to make buffer overflow attacks much more difficult. You can still kill the app, but not execute arbitrary code.
Some links on memory allocation (e.g. in heap).
Do you mean, what algorithm the allocator uses? The easiest algorithm is to always allocate at the soonest available position and link from each memory block to the next and store the flag if it's a free block or used block. More advanced algorithms always allocate blocks at the size of a power of two or a multiple of some fixed size to prevent memory fragmentation (lots of small free blocks) or link the blocks in a different structures to find a free block of sufficient size faster.
An even simpler approach is to never de-allocate and just point to the first (and only) free block and holds its size. If the remaining space is too small, throw it away and ask the OS for a new one.
There's nothing magical about memory allocators. All they do is to:
ask the OS for a large region and
partition it to smaller chunks
without
wasting too much space or
taking too long.
Anyways, the Wikipedia article about memory allocation is http://en.wikipedia.org/wiki/Memory_management .
One interesting algorithm is called "(binary) buddy blocks". It holds several pools of a power-of-two size and splits them recursively into smaller regions. Each region is then either fully allocated, fully free or split in two regions (buddies) that are not both fully free. If it's split, then one byte suffices to hold the size of the largest free block within this block.

Resources