Why is memory alignment of 4 needed for efficient access?

I understand why data needs to be aligned (and all the effort made to accomplish it, such as padding) so that we can reduce the number of memory accesses, but this assumes that the processor can only fetch addresses that are multiples of 4 (assuming a 32-bit architecture).
And because of that assumption we need to align memory.
My question is: why can we only access addresses that are multiples of 4 (efficiency, a hardware restriction, something else)?
What is the advantage of doing this? Why can't we access all the available addresses?

Memory is constructed from hardware (RAM) that is attached to memory buses. The wider the bus, the fewer cycles are required to fetch data. If memory were one byte wide, you'd need four cycles to read one 32-bit value. Over time memory architectures have evolved, and depending on the class of processor (embedded, low power, high performance, etc.) and the cache design, memory may be quite wide (say, 256 bits).
Given a very wide internal bus between RAM (or cache) and the registers, say twice the width of a register, you could fetch a value in one cycle regardless of alignment if there is a barrel shifter in the data path. Barrel shifters are expensive, so not all processors have them; without one in the path, multiple cycles are needed to align the value.
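To make this concrete, here is a small C sketch (the buffer contents are arbitrary) of a 32-bit value that straddles a 4-byte boundary; a 32-bit-wide bus must perform two aligned fetches and merge the halves, which is exactly the shifting work described above:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* 8 bytes of raw data; a 32-bit value stored at offset 2
       straddles the boundary between the aligned words bytes 0-3 and 4-7. */
    uint8_t buf[8] = { 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88 };

    uint32_t aligned;
    uint32_t misaligned;

    memcpy(&aligned, &buf[0], sizeof aligned);       /* one aligned word fetch   */
    memcpy(&misaligned, &buf[2], sizeof misaligned); /* spans two aligned words  */

    /* A 32-bit wide bus can deliver buf[0..3] in one transaction, but
       buf[2..5] requires reading both buf[0..3] and buf[4..7] and then
       shifting/merging the pieces (the barrel shifter's job). */
    printf("aligned    = 0x%08x\n", aligned);
    printf("misaligned = 0x%08x\n", misaligned);
    return 0;
}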

Related

cache read system memory vs cpu read system memory

On an ARM-based SoC running Android/Linux, I observed the following:
Method 1: Allocate a memory area as uncached for device DMA input. After the DMA finishes, copy the contents of this memory area to another system memory area.
Method 2: Allocate a memory area as cached for device DMA input. After the DMA finishes, invalidate the memory range, then copy the contents of this memory area to another system memory area.
The size of the allocated memory area is about 2 MB, which is larger than the cache size (the L2 cache size is 256 KB).
Method 2 is 10x faster than method 1.
That is: the memory copy operation of method 2 is 10x faster than that of method 1.
I speculate that in method 2 the copy reads from system memory a cache line at a time, while in method 1 the CPU has to read from system memory one bus transaction at a time, bypassing the cache hardware.
However, I cannot find an explicit explanation for this, and I would appreciate a detailed one.
There are so many hardware items involved that it is difficult to give specifics. The SoC determines a lot of this. However, what you observe is typical in performance terms for modern ARM systems.
The main factor is the SDRAM. All DRAM is structured in 'rows' and 'columns' (see DRAM history). On the DRAM chip, an entire 'row' can be read at one time; i.e., there is a matrix of transistors, and there is a physical point/wiring where an entire row can be read out (in fact there may be SRAM to store the row on the chip). When you need a different 'row', you have to 'pre-charge' the wiring and open the new row, which takes some time. The main point is that DRAM can read sequential memory very fast in large chunks. Also, there is no command overhead, as the memory streams out with each clock edge.
If you mark memory as uncached, then a CPU/SoC may issue single-beat reads. Often these will 'pre-charge', consuming extra cycles during a single read/write, and many extra commands must be sent to the DRAM device.
SDRAM also has 'banks'. A bank has a separate 'row' buffer (static RAM/multi-transistor memory), which allows you to switch from one bank to another without having to re-charge/re-read. The banks are often very far apart. If your OS has physically allocated the 'uncached' memory in a different bank from the second, 'cached' area, then this also adds some efficiency. It is common for an OS to manage cached/uncached memory separately (for MMU reasons), and the memory pools are often distant enough to be in separate banks.
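For reference, here is a hedged sketch of the two methods in Linux kernel driver terms (the helper names, the struct device pointer, and the destination buffer are assumptions; the calls are the standard Linux DMA-mapping API, but the details depend on the kernel version and the SoC):

#include <linux/dma-mapping.h>
#include <linux/gfp.h>
#include <linux/string.h>

#define BUF_SIZE (2 * 1024 * 1024)

/* Method 1: uncached (coherent) buffer.
 * Every CPU read during the memcpy goes out to SDRAM as a small,
 * non-burst transaction, so the copy is slow. */
static void copy_uncached(struct device *dev, void *dst)
{
    dma_addr_t bus_addr;
    void *buf = dma_alloc_coherent(dev, BUF_SIZE, &bus_addr, GFP_KERNEL);

    /* ... start DMA into bus_addr and wait for completion ... */

    memcpy(dst, buf, BUF_SIZE);                 /* slow: uncached reads */
    dma_free_coherent(dev, BUF_SIZE, buf, bus_addr);
}

/* Method 2: normal cached memory, mapped for streaming DMA.
 * After the DMA completes the stale cache lines are invalidated,
 * and the memcpy then pulls data in whole cache-line bursts. */
static void copy_cached(struct device *dev, void *src, void *dst)
{
    dma_addr_t bus_addr = dma_map_single(dev, src, BUF_SIZE, DMA_FROM_DEVICE);

    /* ... start DMA into bus_addr and wait for completion ... */

    /* Invalidate/sync so the CPU sees what the device wrote. */
    dma_sync_single_for_cpu(dev, bus_addr, BUF_SIZE, DMA_FROM_DEVICE);
    memcpy(dst, src, BUF_SIZE);                 /* fast: cache-line bursts */
    dma_unmap_single(dev, bus_addr, BUF_SIZE, DMA_FROM_DEVICE);
}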

calculate worst case time to find variable x in memory

I have a question from an exam, but I don't understand the solution. Can someone explain it to me?
Memory access time = 2.5*10^-7 sec
Secondary memory access time = 3*10^-6 sec
TLB access time = 10^-8 sec
Given a virtual address, a value x, and a 3-level page table, how much time does it take to read the value of x from memory in the worst case?
The solution is: 10^-8 + 2.5*10^-7 + 3*(3*10^-6 + 2*2.5*10^-7) + 10^-8 = 107.7*10^-7
It's pretty obvious that the solution is performing 2 TLB lookups, 7 memory accesses, and 3 secondary memory accesses.
Here are the steps in the process:
1) The CPU accesses the TLB to find the memory location that the virtual address maps to.
2) The CPU accesses main memory to look for the virtual address. This step fails.
3) The CPU accesses the page file (1 memory access to get the page file, 1 more to access the page file entry).
4) The CPU reads from secondary memory to get the page referred to in the page file.
5) Repeat steps 3 & 4 for each level in the page table.
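As a sanity check of the arithmetic, the breakdown above (2 TLB lookups, 7 memory accesses, 3 secondary-memory accesses) can be plugged into a few lines of C (the variable names are just for illustration):

#include <stdio.h>

int main(void)
{
    const double tlb  = 1e-8;     /* TLB access time              */
    const double mem  = 2.5e-7;   /* main memory access time      */
    const double disk = 3e-6;     /* secondary memory access time */

    /* 2 TLB lookups + 7 memory accesses + 3 secondary-memory accesses */
    double worst = 2 * tlb + 7 * mem + 3 * disk;

    printf("worst case = %.3e s\n", worst);   /* ~1.077e-05 s = 107.7*10^-7 s */
    return 0;
}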
There is no formula as far as I know to calculate best and worst times of memory accesses. However, there are various factors that influence it:
The width of the access. On 32-bit x86, 8-bit and 32-bit accesses tend to be faster than 16-bit ones.
Whether the access is aligned or not. Unaligned accesses tend to be slower than aligned accesses.
Whether accessed memory is cached. Accesses to cached memory are faster than accesses to uncached memory.
The NUMA domain of the accessed memory. Accessing memory belonging to a close NUMA domain is faster than accessing memory belonging to a far NUMA domain.
Whether paging is enabled. Accessing memory when paging is enabled involves traversing paging structures and is therefore slower (a simplified model of such a walk is sketched below).
The type of memory. For example, writing to video memory is slower than writing to "normal" memory. Correspondingly, reading from video memory is much, much slower than reading from "normal" memory.
Other factors I forgot to mention. It's hard to memorise them all.
Furthermore, the influence of each of these factors depends on the underlying hardware, so it would be really hard to invent even an approximate formula for the best and worst memory access times.
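As a rough illustration of what "traversing paging structures" means, here is a toy software model of a 3-level walk (the 9/9/9-bit index split, the node layout, and all names are invented for illustration; real hardware walks differ per architecture):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy software model of a 3-level page-table walk with 4 KB pages. */
typedef struct node { struct node *slot[512]; } node_t;

static void *walk(node_t *root, uintptr_t va)
{
    node_t *l2   = root->slot[(va >> 30) & 0x1ff]; /* 1st memory access    */
    node_t *l3   = l2->slot[(va >> 21) & 0x1ff];   /* 2nd memory access    */
    void   *page = l3->slot[(va >> 12) & 0x1ff];   /* 3rd memory access    */
    return (char *)page + (va & 0xfff);            /* 4th: the data itself */
}

int main(void)
{
    node_t *root = calloc(1, sizeof *root);
    node_t *l2   = calloc(1, sizeof *l2);
    node_t *l3   = calloc(1, sizeof *l3);
    char   *page = calloc(1, 4096);

    /* Build one mapping and translate one address through it. */
    uintptr_t va = ((uintptr_t)1 << 30) | (2u << 21) | (3u << 12) | 42;
    root->slot[1] = l2;
    l2->slot[2]   = l3;
    l3->slot[3]   = (node_t *)page;
    page[42] = 'x';

    printf("*walk(va) = %c\n", *(char *)walk(root, va));
    return 0;
}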

What applications require 1GB pages?

x86 and x64 processors allow 1 GB pages when the PDPE1GB flag is set on the CPU. In what applications would this be practical or required, and for what reason?
Hugepages help in cases where you have a large memory footprint and the memory access pattern spans large distances (across many 4K pages).
They not only reduce TLB misses but also shrink the page tables that the OS memory-management system has to maintain.
A very good example is packet processing. In high-throughput network applications (1 Gbps or more), packets are normally stored in a packet buffer pool (i.e., a pooling technique). For example, every packet buffer is 2 KB in size and the pool contains 512 buffers. The access pattern of this packet buffer pool might not be sequential (buffers indexed 1, 2, 3, 4, 5...) but rather random over time (1, 104, 407, 45, 905...). Since the normal page size is 4K, the TLB won't help much here: the buffers sit on many different pages, so packet accesses keep incurring TLB misses.
In contrast, if you put the pool in a 1 GB hugepage, then all packet buffers share the same hugepage TLB entry, avoiding those misses.
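To make the access pattern concrete, here is a minimal sketch using the numbers from the example above (the loop and the random indices are purely illustrative):

#include <stdlib.h>

#define BUF_SIZE  2048          /* 2 KB per packet buffer     */
#define NUM_BUFS  512           /* pool of 512 buffers = 1 MB */

int main(void)
{
    /* With 4 KB pages the pool spans 256 pages, so random buffer
       indices keep touching different pages and different TLB entries.
       Backed by one hugepage, every buffer shares a single TLB entry. */
    char *pool = calloc(NUM_BUFS, BUF_SIZE);
    if (!pool)
        return 1;

    for (int i = 0; i < 1000000; i++) {
        int idx = rand() % NUM_BUFS;             /* 1, 104, 407, 45... style */
        char *pkt = pool + (size_t)idx * BUF_SIZE;
        pkt[0] ^= 1;                             /* touch the buffer */
    }
    free(pool);
    return 0;
}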
This is used in DPDK (Data Plane Development Kit), where the packet rate is so high that the cycles wasted on TLB misses are not negligible.
Hugepage support is required for the large memory pool allocation used
for packet buffers (the HUGETLBFS option must be enabled in the
running kernel as indicated the previous section). By using hugepage
allocations, performance is increased since fewer pages are needed,
and therefore less Translation Lookaside Buffers (TLBs, high speed
translation caches), which reduce the time it takes to translate a
virtual page address to a physical page address. Without hugepages,
high TLB miss rates would occur with the standard 4k page size,
slowing performance.
http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html#bios-setting-prerequisite-on-x86
Another example from Oracle:
...almost 6.8 GB of memory used for page tables when hugepages were not
configured...
...after hugepages were allocated and used by the Oracle database. The page table overhead was reduced to slightly less than 23 MB
http://www.databasejournal.com/features/oracle/understanding-hugepages-in-oracle-database.html
Related links:
https://en.wikipedia.org/wiki/Object_pool_pattern
--Edit--
However, hugepages should be used carefully. Above, I mentioned that a memory pool would benefit from a 1 GB hugepage. However, if your access pattern spans even the 1 GB page boundary, it might not help. There is an excellent blog post on this:
http://www.pvk.ca/Blog/2014/02/18/how-bad-can-1gb-pages-be/
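For reference, a minimal sketch of how an application can ask Linux for a 1 GB hugepage mapping via mmap (this assumes a kernel with 1 GB hugepage support and pages reserved in advance, e.g. hugepagesz=1G hugepages=1 on the kernel command line; the guarded macro definitions mirror the values from <linux/mman.h> for older headers):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26                       /* from <linux/mman.h> */
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)     /* log2(1 GB) = 30     */
#endif

#define ONE_GB (1UL << 30)

int main(void)
{
    /* Anonymous mapping backed by a single 1 GB hugepage; every access
       within it shares one TLB entry. */
    void *p = mmap(NULL, ONE_GB, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)");
        return 1;
    }

    ((char *)p)[0] = 1;            /* fault the page in */
    munmap(p, ONE_GB);
    return 0;
}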
Imagine an application that uses huge amounts of memory, such as molecular modeling or weather prediction, especially if it has no user interaction.
Large pages:
(1) reduce the amount of memory spent on page-table overhead, and
(2) increase the amount of memory that can be mapped by the MMU's translation cache (the same number of cache entries references more memory).
I have LabVIEW installed on my Dell workstation with 8 cores and 16 GB of DDR RAM, driving four 24" monitors. If I create a video processor or compositor of almost any type, with a 1024 x 1024 pixel 'drawing' display, LabVIEW reserves 1.5 GB before I even begin to composite. It was built from C and C++. I often store image details in 3D arrays of 256 x 256 x 256 U32 integers that hold each RGB pixel color, plus the alpha channel for opacity or masking. That's 64 MB per layer of buffered video. If I need to remember 128 layers, that's 8 GB right there. LabVIEW is a programming language structured much like a CAD program. If I need 8 GB for a series of video (HDTV) buffers, that is what it will give me, with a few seconds' wait for malloc to do its work. If I created an 8 GB 3D array for a database, it would be no different, even if I did it in MySQL (not as an array). To me, having many gigabytes of RAM to play with is the norm, not an exception.

Dynamic Array Memory Allocation Strategies

I've written a 32-bit program using a dynamic array to store a list of triangles with an unknown count. My current strategy is to estimate a very large number of triangles and then trim the list once all the triangles have been created. In some cases I'll only allocate memory once; in others I'll need to grow the allocation.
With a very large data set I'm running out of memory when my application's memory usage is about 1.2 GB, and since the allocation step is so large I feel like I may be fragmenting memory.
Looking at FastMM (the memory manager) I see these constants, which would suggest one of them as a good size to increment by.
ChunkSize = 64 * 1024;
MaximumSmallBlockSize = 32752;
LargeBlockGranularity = 64 * 1024;
Would one of these be an optimal size for increasing the size of an array?
Eventually this program will become 64-bit, but we're not quite ready for that step.
Your real problem here is not that you are running out of memory, but that the memory allocator cannot find a large enough block of contiguous address space. Some simple things you can do to help include:
Execute the code in a 64 bit process.
Add the LARGEADDRESSAWARE PE flag so that your process gets a 4GB address space rather than 2GB.
Beyond that the best you can do is allocate smaller blocks so that you avoid the requirement to store your large data structure in contiguous memory. Allocate memory in blocks. So, if you need 1GB of memory, allocate 64 blocks of size 16MB, for instance. The exact block size that you use can be tuned to your needs. Larger blocks result in better allocation performance, but smaller blocks allow you to use more address space.
Wrap this up in a container that presents an array-like interface to the consumer, but internally stores the memory in non-contiguous blocks.
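A minimal sketch of such a block-based container, written in C for illustration (the names are assumptions; the 16 MB block size follows the suggestion above, and in Delphi the same idea can be expressed as a dynamic array of dynamic arrays):

#include <stdlib.h>

#define BLOCK_SIZE (16u * 1024 * 1024)   /* bytes per block (tunable) */

typedef struct {
    unsigned char **blocks;   /* independently allocated blocks */
    size_t          nblocks;
} chunked_array;

/* Allocate 'total' bytes as many small blocks instead of one huge one. */
static int ca_init(chunked_array *ca, size_t total)
{
    ca->nblocks = (total + BLOCK_SIZE - 1) / BLOCK_SIZE;
    ca->blocks  = calloc(ca->nblocks, sizeof *ca->blocks);
    if (!ca->blocks) return -1;
    for (size_t i = 0; i < ca->nblocks; i++) {
        ca->blocks[i] = malloc(BLOCK_SIZE);
        if (!ca->blocks[i]) return -1;   /* caller frees on failure */
    }
    return 0;
}

/* Array-like access: map a flat index onto (block, offset). */
static unsigned char *ca_at(chunked_array *ca, size_t index)
{
    return &ca->blocks[index / BLOCK_SIZE][index % BLOCK_SIZE];
}

static void ca_free(chunked_array *ca)
{
    for (size_t i = 0; i < ca->nblocks; i++)
        free(ca->blocks[i]);
    free(ca->blocks);
}

int main(void)
{
    chunked_array ca;
    if (ca_init(&ca, 1u << 30) == 0) {   /* "1 GB" as 64 blocks of 16 MB */
        *ca_at(&ca, 123456789) = 42;     /* flat index, non-contiguous storage */
        ca_free(&ca);
    }
    return 0;
}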
As far as I know, dynamic arrays in Delphi use contiguous address space (at least in the virtual address space).
Since you are running out of memory at 1.2 GB, I guess that's the point where the memory manager can't find a contiguous block of memory large enough to fit a larger array.
One way to work around this limitation would be to implement your array as a collection of smaller arrays of (let's say) 200 MB each. That should give you some more headroom before you hit the memory cap.
From the 1.2 GB value, I would guess your program isn't compiled to be "large address aware". You can see here how to compile your application that way.
One last trick would be to save the array data in a file. I use this trick in one of my applications where I needed to load a few GB of images to be displayed in a grid. What I did was create a file with the attributes FILE_ATTRIBUTE_TEMPORARY and FILE_FLAG_DELETE_ON_CLOSE and save/load the images from the resulting file. From the CreateFile documentation:
A file is being used for temporary storage. File systems avoid writing
data back to mass storage if sufficient cache memory is available,
because an application deletes a temporary file after a handle is
closed. In that case, the system can entirely avoid writing the data.
Otherwise, the data is written after the handle is closed.
Since it makes use of cache memory, I believe it allows an application to use memory beyond the 32-bit limitation, since the cache is managed by the OS and (as far as I know) not mapped inside the process's virtual address space. After making this change, performance was still pretty good, but I can't say whether it would be good enough for your needs.
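For illustration, a hedged Win32 sketch of that temporary-file trick (the original application is Delphi; the file name and buffer size here are invented):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Temporary, delete-on-close file: the file system keeps the data in
       the OS file cache when possible, so it may never touch the disk. */
    HANDLE h = CreateFileA("scratch.tmp",
                           GENERIC_READ | GENERIC_WRITE,
                           0,                       /* no sharing */
                           NULL,
                           CREATE_ALWAYS,
                           FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                           NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    /* Spill a buffer to the temporary file, then read it back on demand. */
    static unsigned char buffer[64 * 1024 * 1024];   /* e.g. one 64 MB image layer */
    DWORD written = 0, read = 0;
    WriteFile(h, buffer, sizeof buffer, &written, NULL);
    SetFilePointer(h, 0, NULL, FILE_BEGIN);
    ReadFile(h, buffer, sizeof buffer, &read, NULL);

    CloseHandle(h);          /* the file is deleted here */
    return 0;
}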

memory segments and physical RAM

The memory map of a process appears to be fragmented into segments (stack, heap, bss, data, and text).
I was wondering: are these segments just an abstraction for the convenience of the process, with physical RAM being just a linear array of addresses, or is physical RAM also fragmented into these segments?
Also, if RAM is not fragmented and is just a linear array, how does the OS provide the process with the abstraction of these segments?
Finally, how would programming change if the memory map of a process appeared as just a linear array, not divided into segments (with the MMU translating virtual addresses into physical ones)?
In a modern OS supporting virtual memory, it is the address space of the process that is divided into these segments. In the general case, that address space is mapped onto physical RAM in an essentially arbitrary fashion (with some fixed granularity, typically 4K). Address-space pages located next to each other do not have to map to neighboring physical pages of RAM, and physical pages of RAM do not have to preserve the relative order of the process's address-space pages. This all means that there is no such separation into segments in RAM, and there can't possibly be.
In order to optimize memory access, an OS might (and typically will) try to map sequential pages of the process address space to sequential pages in RAM, but that's just an optimization. In the general case, the mapping is unpredictable. On top of that, RAM is shared by all processes in the system, with RAM pages belonging to different processes arbitrarily interleaved, which eliminates any possibility of having such "segments" in RAM. There is no process-specific ordering or segmentation in RAM; RAM is just a cache for the virtual memory mechanism.
Again, every process works with its own virtual address space. This is where these segments can exist. The process has no direct access to RAM. The process doesn't even need to know that RAM exists.
These segments are largely a convenience for the program loader and operating system (though they also provide a basis for coarse-grained protection; execution permission can be limited to text and writes prohibited from rodata).1
The physical memory address space might be fragmented but not for the sake of such application segments. For example, in a NUMA system it might be convenient for hardware to use specific bits to indicate which node owns a given physical address.
For a system using address translation, the OS can somewhat arbitrarily place the segments in physical memory. (With segmented translation, external fragmentation can be a problem; a contiguous range of physical memory addresses may not be available, requiring expensive moving of memory segments. With paged translation, external fragmentation is not possible. Segmented translation has the advantage of requiring less translation information: each segment requires only a base and a bound plus other metadata, whereas a memory section would typically have many more than two pages, each of which has a base address and metadata.)
Without address translation, placement of segments would necessarily be less arbitrary. Fortunately, most programs do not care about the specific address where segments are placed. (Single address space OSes
(Note that it can be convenient for sharable sections to be in fixed locations. For code this can be used to avoid indirection through a global offset table without requiring binary rewriting in the program loader/dynamic linker. This can also reduce address translation overhead.)
Application-level programming is generally sufficiently abstracted from such segmentation that its existence is not noticeable. However, pure abstractions are naturally unfriendly to intense optimization for physical resource use, including execution time.
In addition, a programming system may choose to use a more complex placement of data (without the application programmer needing to know the implementation details). For example, use of coroutines may encourage using a cactus/spaghetti stack where contiguity is not expected. Similarly, a garbage collecting runtime might provide additional divisions of the address space, not only for nurseries but also for separating leaf objects, which have no references to collectable memory, from non-leaf objects (reducing the overhead of mark/sweep). It is also not especially unusual to provide two stack segments, one for data whose address is not taken (or at least is fixed in size) and one for other data.
1 One traditional layout of these segments (with a downward-growing stack) in a flat virtual address space for Unix-like OSes places text at the lowest address, rodata immediately above that, initialized data immediately above that, zero-initialized data (bss) immediately above that, the heap growing upward from the top of bss, and the stack growing downward from the top of the application's portion of the virtual address space.
Having the heap and stack grow toward each other allows arbitrary growth of each (for a single thread using that address space!). This placement also allows a program loader to simply copy the program file into memory starting at the lowest address, groups memory by permission, and can sometimes allow a single global pointer to address the entire global/static data range (rodata, data, and bss).
The memory map to a process appears fragmented into segments (stack, heap, bss, data, and text)
That's the basic mapping used by Unix; other operating systems use different schemes. Generally, though, they split the process memory space into separate segments for executing code, stack, data, and heap data.
I was wondering: are these segments just an abstraction for the convenience of the process, with physical RAM being just a linear array of addresses, or is physical RAM also fragmented into these segments?
Depends.
Yes, these segments are created and managed by the OS for the benefit of the process. But physical memory can be arranged as linear addresses, or banked segments, or non-contiguous blocks of RAM. It's up to the OS to manage the total system memory space so that each process can access its own portion of it.
Virtual memory adds yet another layer of abstraction, so that what looks like linear memory locations are in fact mapped to separate pages of RAM, which could be anywhere in the physical address space.
Also, if RAM is not fragmented and is just a linear array, how does the OS provide the process with the abstraction of these segments?
The OS manages all of this by using virtual memory mapping hardware. Each process sees contiguous memory areas for its code, data, stack, and heap segments. But in reality, the OS maps the pages within each of these segments to physical pages of RAM. So two identical running processes will see the same virtual address space composed of contiguous memory segments, but the memory pages comprising these segments will be mapped to entirely different physical RAM pages.
But bear in mind that physical RAM may not actually be one contiguous block of memory, but may in fact be split across multiple non-adjacent blocks or memory banks. It is up to the OS to manage all of this in a way that is transparent to the processes.
Also, how would programming change if the memory map of a process appeared as just a linear array, not divided into segments, with the MMU simply translating virtual addresses into physical ones?
The MMU always operates that way, translating virtual memory addresses into physical memory addresses. The OS sets up and manages the mapping of each page of each segment for each process. Each time the process exceeds its stack allocation, for example, the OS traps the fault and adds another page to the process's stack segment, mapping the virtual page to a physical page selected from available memory.
Virtual memory also allows the OS to temporarily swap process pages out to disk, so that the total amount of virtual memory occupied by all running processes can easily exceed the system's physical RAM. Only the currently active, executing processes actually have access to real physical RAM pages.
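A small C sketch makes this per-process layout visible by printing an address from each region (the exact values vary by system and run, especially with ASLR, but the text, data, bss, heap, and stack addresses fall into clearly different ranges):

#include <stdio.h>
#include <stdlib.h>

int initialized_global = 42;      /* data segment              */
int uninitialized_global;         /* bss segment               */
const int read_only_global = 7;   /* rodata (often near text)  */

int main(void)                    /* code lives in the text segment */
{
    int   local_on_stack = 0;
    void *on_heap = malloc(16);

    printf("text   : %p\n", (void *)main);
    printf("rodata : %p\n", (void *)&read_only_global);
    printf("data   : %p\n", (void *)&initialized_global);
    printf("bss    : %p\n", (void *)&uninitialized_global);
    printf("heap   : %p\n", on_heap);
    printf("stack  : %p\n", (void *)&local_on_stack);

    free(on_heap);
    return 0;
}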
I was wondering: are these segments just an abstraction for the convenience of the process, with physical RAM being just a linear array of addresses, or is physical RAM also fragmented into these segments?
This in fact depends highly on the architecture. Some have hardware support (e.g., segment descriptor registers on x86) to split memory into segments. Others just keep this information in software (OS kernel data for the process). Also, some segment information is totally irrelevant at execution time; it is used merely for code/data loading (e.g., relocation segments).
Also, if RAM is not fragmented and is just a linear array, how does the OS provide the process with the abstraction of these segments?
Process code never references segments; it only knows about addresses, so the OS has nothing to abstract here.
Also, how would programming change if the memory map of a process appeared as just a linear array, not divided into segments, with the MMU simply translating virtual addresses into physical ones?
Programming would not be affected. When you program in C you don't define any of these segments, and your code doesn't reference them either. The segments exist to keep an ordered layout and don't even need to be the same across OSes.

Resources