cudaMallocManaged vs cudaMalloc - Device memory limitation scenario

cudaMallocManaged vs cudaMalloc - Device memory limitation scenario - memory

I understand that cudaMallocManaged simplifies memory access by eliminating the need for explicit memory allocations on host and device. Consider a scenario where the host memory is significantly larger than the device memory, say 16 GB host & 2 GB device which is fairly common these days. If I am dealing with input data of large size say 4-5 GB which is read from an external data source. Am I forced to resort to explicit host and device memory allocation (as device memory is insufficient to accommodate at once) or does the CUDA unified memory model has a way to get around this (something like, auto allocate/deallocate on need basis)?

Am I forced to resort to explicit host and device memory allocation?
You are not forced to resort to explicit host and device memory allocation, but you will be forced to handle the amount of allocated memory manually. This is because, on current hardware at least, the CUDA unified virtual memory doesn't allow you to oversubscribe GPU memory. In other words, cudaMallocManaged will fail once you allocate more memory than what is available on the device. But that doesn't mean you can't use cudaMallocManaged, it merely means you have to keep track of the amount of memory allocated and never exceed what the device could support, by "streaming" your data instead of allocating everything at once.
Pure speculation as I can't speak for NVIDIA, but I believe this could be one of the future improvements on upcoming hardware.
And indeed, one year and a half after the above prediction, as of CUDA 8, Pascal GPUs are now enhanced with a page-faulting capability that allows memory pages to migrate between the host and the device without explicit intervention from the programmer.

Related

use vmalloc memory to do DMA?

I'm writing a driver which need large memory. Let's say it will be 1GB at least.
The kernel version >= 3.10.
Driver needs run in X86_64 and Arm platform.
The hardware may have IOMMU.
This large memory will mmap to userspace.
Device use those memory to do DMA. Each DMA operation just write max 2KB size of data to those memory.
My Questions.
vmalloc can give me large non-physical-continus pages. Can I use vmalloc get large memory to do DMA? I'm thinking use vmalloc_to_page to get page pointer then use page_to_phys to get physical address.
I found a info about "vmalloc performance is lower than kmalloc." I'm not sure what it means.
if I do vaddr = vmalloc(2MB) and kaddr = kmalloc(2MB), the function call of vmalloc will be slower than kmalloc because of memory remap. But is memory access in range [vaddr, vaddr+2MB) will slower than [kaddr, kaddr+2MB) ? The large memory will be created during driver init, so does vmalloc memory cause performance issue?
DMA needs dma_map_single to get dma_addr_t. I'm thinking use dma_map_single to get all the page's dma address at driver init. I will just use those dma address when driver need to do DMA. Can I do this to get some performance improvment?

Why do we need virtual memory?

So my understanding is that every process has its own virtual memory space ranging from 0x0 to 0xFF....F. These virtual addresses correspond to addresses in physical memory (RAM). Why is this level of abstraction helpful? Why not just use the direct addresses?
I understand why paging is beneficial, but not virtual memory.

There are many reasons to do this:
If you have a compiled binary, each function has a fixed address in memory and the assembly instructions to call functions have that address hardcoded. If virtual memory didn't exist, two programs couldn't be loaded into memory and run at the same time, because they'd potentially need to have different functions at the same physical address.
If two or more programs are running at the same time (or are being context-switched between) and use direct addresses, a memory error in one program (for example, reading a bad pointer) could destroy memory being used by the other process, taking down multiple programs due to a single crash.
On a similar note, there's a security issue where a process could read sensitive data in another program by guessing what physical address it would be located at and just reading it directly.
If you try to combat the two above issues by paging out all the memory for one process when switching to a second process, you incur a massive performance hit because you might have to page out all of memory.
Depending on the hardware, some memory addresses might be reserved for physical devices (for example, video RAM, external devices, etc.) If programs are compiled without knowing that those addresses are significant, they might physically break plugged-in devices by reading and writing to their memory. Worse, if that memory is read-only or write-only, the program might write bits to an address expecting them to stay there and then read back different values.
Hope this helps!

Short answer: Program code and data required for execution of a process must reside in main memory to be executed, but main memory may not be large enough to accommodate the needs of an entire process.
Two proposals
(1) Using a very large main memory to alleviate any need for storage allocation: it's not feasible due to very high cost.
(2) Virtual memory: It allows processes that may not be entirely in the memory to execute by means of automatic storage allocation upon request. The term virtual memory refers to the abstraction of separating LOGICAL memory--memory as seen by the process--from PHYSICAL memory--memory as seen by the processor. Because of this separation, the programmer needs to be aware of only the logical memory space while the operating system maintains two or more levels of physical memory space.
More:
Early computer programmers divided programs into sections that were transferred into main memory for a period of processing time. As higher level languages became popular, the efficiency of complex programs suffered from poor overlay systems. The problem of storage allocation became more complex.
Two theories for solving the problem of inefficient memory management emerged -- static and dynamic allocation. Static allocation assumes that the availability of memory resources and the memory reference string of a program can be predicted. Dynamic allocation relies on memory usage increasing and decreasing with actual program needs, not on predicting memory needs.
Program objectives and machine advancements in the '60s made the predictions required for static allocation difficult, if not impossible. Therefore, the dynamic allocation solution was generally accepted, but opinions about implementation were still divided.
One group believed the programmer should continue to be responsible for storage allocation, which would be accomplished by system calls to allocate or deallocate memory. The second group supported automatic storage allocation performed by the operating system, because of increasing complexity of storage allocation and emerging importance of multiprogramming.
In 1961, two groups proposed a one-level memory store. One proposal called for a very large main memory to alleviate any need for storage allocation. This solution was not possible due to very high cost. The second proposal is known as virtual memory.
cne/modules/vm/green/defn.html

To execute a process its data is needed in the main memory (RAM). This might not be possible if the process is large.
Virtual memory provides an idealized abstraction of the physical memory which creates the illusion of a larger virtual memory than the physical memory.
Virtual memory combines active RAM and inactive memory on disk to form
a large range of virtual contiguous addresses. implementations usually require hardware support, typically in the form of a memory management
unit built into the CPU.

The main purpose of virtual memory is multi-tasking and running large programmes. It would be great to use physical memory, because it would be a lot faster, but RAM memory is a lot more expensive than ROM.
Good luck!

How to reserve memory for my application and leave a specified amount remaining?

I'm planning an application which will involve loading many pictures at one time and thus requires a large chunk of memory. For example, I might have 50 image objects created at once, taking a total of 1GB of RAM. But when the user goes to load 20 more pictures, I'd like to make sure that amount of memory is already reserved and ready.
Now this part might seem a little backwards from normal. Rather than specifying how much memory my application shall reserve, instead I need to specify how much memory to leave free for other applications, and adjust my application's memory periodically according to this specification. I must say I've never worked with reserving memory at all, and especially won't know how to leave this remaining available memory.
So for example, if the computer has 2048 MB of RAM, and the option is set to leave 50 MB free for other applications, and there is already 10MB of RAM being used by other apps, then it should reserve 2048-50-10 = 1988 MB for my app.
The trouble I foresee is suppose the user opens another application which requires 1GB. My app has to catch this and shrink its self.
Does this even sound like a feasible approach? Basically, I need to make sure there is as much memory reserved as possible at any given time, while leaving a decent amount available for other apps. Would it make a significant impact on performance if I do this, or not much at all? I might be loading and unloading images at rapid paces, and I don't want it to reserve/free this memory on demand, I want it to stay reserved.

+1 for Sertac's mentioning of how SQL Server rides the line of allocating memory it needs, but releasing memory when Windows complains.
Applications can receive Window's complaints by using the CreateMemoryResourceNotification:
hLowMemory := CreateMemoryResourceNotification(LowMemoryResourceNotification);
Applications can use memory resource notification events to scale the
memory usage as appropriate. If available memory is low, the
application can reduce its working set. If available memory is high,
the application can allocate more memory.
Any thread of the calling
process can specify the memory resource notification handle in a call
to the QueryMemoryResourceNotification function or one of the wait functions.
The state of the object is signaled when the specified
memory condition exists. This is a system-wide event, so all
applications receive notification when the object is signaled. Note
that there is a range of memory availability where neither the
LowMemoryResourceNotification or HighMemoryResourceNotification object
is signaled. In this case, applications should attempt to keep the
memory use constant.
But it's also worth mentioning that you might as well allocate memory that you need. Your operating system has a very sophisiticated set of algorithms to swap out the least used memory when memory pressure is high. You can take advantage of this by simply allocating all the memory that you need. When Windows starts to run low, it will find those pages of memory that you are using the least and swap them out to disk. (This is how a well-known reverse proxy works).
The only thing left is to decide if you want to free some images when Windows says it's running low on RAM. But if you're not using the memory, it is going to be swapped out to disk for you.

It's not realistic to account for other apps. Just ignore them. The system will page things in and out as needed. If you really wanted to do this you'd have to dynamically adapt to other processes as they start and finish. That's really not realistic. What's more it's not practical to inquire of other processes how much memory they need. Leave it all to the system.
Set a budget for your app and make sure you don't exceed it. Keep the most recently used images in memory and when you approach your memory budget throw away the least recently used images to make space.
If you are stressing the available resources then make sure you use FastMM and enable LARGE_ADDRESS_AWARE for your app so that you get 4GB address space when running on a 64 bit OS.

MIPS memory execution prevention

I'm doing some research with the MIPS architecture and was wondering how operating systems are implemented with the limited instructions and memory protection that mips offers. I'm specifically wondering about how an operating system would prevent certain addresses ranges from being executed. For example, how could an operating system limit PC to operate in a particular range? In other words, prevent something such as executing from dynamically allocated memory?
The first thing that came to mind is with TLBs, but TLBs only offer memory write protection (and not execute).
I don't quite see how it could be handled by the OS either, because that would imply that every instruction would result in an exception and then MANY cycles would be burned just checking to see if PC was in a sane address range.
If anyone knows, how is it typically done? Is it handled somehow by the hardware during initialization (e.g. It's given an address range and an exception is hit if its out of range?)

Most of protection checks are done in hardware, by the CPU itself, and do not need much involvement from the OS side.
The OS sets up some special tables (page tables or segment descriptors or some such) where memory ranges have associated read, write, execute and user/kernel permissions that the CPU then caches internally.
The CPU then on every instruction checks whether or not the memory accesses comply with the OS-established permissions and if everything's OK, carries on. If there's an attempt to violate those permissions the CPU raises an exception (a form of an interrupt similar to those from external to the CPU I/O devices) that the OS handles. In most cases the OS simply terminates the offending application when it gets such an exception.
In some other cases it tries to handle them and make the seemingly broken code work. One of these cases is support for virtual, on-disk memory. The OS marks a region as unpresent/inaccessible when it's not backed up by physical memory and it's data is somewhere on the disk. When the app tries to use that region, the OS catches an exception from the instruction that tries to access this memory region, backs the region with physical memory, fills it in with data from the disk, marks it as present/accessible and restarts the instruction that's caused the exception. Whenever the OS is low on memory, it can offload data from certain ranges to the disk, mark those ranges as unpresent/inaccessible again and reclaim the memory from those regions for other purposes.
There may also be specific hard-coded by the CPU memory ranges inaccessible to software running outside of the OS kernel and the CPU can easily make a check here as well.
This seems to be the case for MIPS (from "Application Note 235 - Migrating from MIPS to ARM"):
3.4.2 Memory protection
MIPS offers memory protection only to the extent described earlier i.e. addresses
in the upper 2GB of the address space are not permitted when in user mode.
No finer-grained protection regime is possible.
This document lists "MEM - page fault on data fetch; misaligned memory access; memory-protection violation" among the other MIPS exceptions.
If a particular version of the MIPS CPU doesn't have any more fine-grained protection checks, they can only be emulated by the OS and at a significant cost. The OS would need to execute code instruction by instruction or translate it into almost equivalent code with inserted address and access checks and execute that instead of the original code.

This is indeed done with TLBs. No Execute Bits (NX bits) became popular only a few years ago, so older MIPS processors do not support it. The latest version of the MIPS architecture (Release 3) and the SmartMIPS Application-Specific Extension support it as an optional feature under the name of XI (Execute Inhibit).
If you have a chip without this feature you are out of luck. Like Alex already said, there is no simple way to emulate this feature.

Checking the amount of available RAM within a running program

A friend of mine was asked, during a job interview, to write a program that measures the amount of available RAM. The expected answer was using malloc() in a binary-search manner: allocating larger and larger portions of memory until getting a failure message, reducing the portion size, and summing the amount of allocated memory.
I believe that this method will measure the amount of virtual, not physical, memory. But I got curious about the matter.
Is there a way to tell the amount of available RAM from within the program, without using exec(dmesg |grep -i memory) ?

You are correct: malloc() makes no distinction between physical or virtual memory. In fact, that's the whole point of virtual memory: to make such details irrelevant to programs.
You can find out but it is OS-specific. For example, Linux.

The only way to do this is to use some OS-specific functionality. Using malloc() is useless for a number of reasons:
it measures virtual memory
the OS may well have per-process cap on memory allocations
allocating much more memory than is physically available often degrades the platforms stability to the point where "go back one" algorithm suggested in the question probably won't work

this is OS specific and you should collect such information from the OS services unless you want to make your own memory management layer

Using malloc() will only tell you how much memory can be allocated to a single process. There may be reasons why this is lower than the total amount of virtual memory. For instance, you might have OS quota or a per-process 32-bit-limited address space.
(And, of course, virtual memory >= RAM)

Very OS specific but for Linux the information about system memory is in /proc/meminfo. You can also probably use the sysctl interface (http://www.linuxjournal.com/article/2365) to get this data in a C program.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart