How can I specify the memory that cpu will use? - memory

In gem5 simulator, how can i specify the memory that cpu will use like picture?
I want to make a situation that cpu0 only uses a DRAM, cpu1 only uses a NVM.
Is it possible?

Related

How to allocate GpuMat in shared (not dedicated) GPU memory? [duplicate]

So I installed the GPU version of TensorFlow on a Windows 10 machine with a GeForce GTX 980 graphics card on it.
Admittedly, I know very little about graphics cards, but according to dxdiag it does have:
4060MB of dedicated memory (VRAM) and;
8163MB of shared memory
for a total of about 12224MB.
What I noticed, though, is that this "shared" memory seems to be pretty much useless. When I start training a model, the VRAM will fill up and if the memory requirement exceeds these 4GB, TensorFlow will crash with a "resource exhausted" error message.
I CAN, of course, prevent reaching that point by choosing the batch size suitably low, but I do wonder if there's a way to make use of these "extra" 8GB of RAM, or if that's it and TensorFlow requires the memory to be dedicated.
Shared memory is an area of the main system RAM reserved for graphics. References:
https://en.wikipedia.org/wiki/Shared_graphics_memory
https://www.makeuseof.com/tag/can-shared-graphics-finally-compete-with-a-dedicated-graphics-card/
https://youtube.com/watch?v=E5WyJY1zwcQ
This type of memory is what integrated graphics eg Intel HD series typically use.
This is not on your NVIDIA GPU, and CUDA can't use it. Tensorflow can't use it when running on GPU because CUDA can't use it, and also when running on CPU because it's reserved for graphics.
Even if CUDA could use it somehow. It won't be useful because system RAM bandwidth is around 10x less than GPU memory bandwidth, and you have to somehow get the data to and from the GPU over the slow (and high latency) PCIE bus.
Bandwidth numbers for reference :
GeForce GTX 980: 224 GB/s
DDR4 on desktop motherboard: approx 25GB/s
PCIe 16x: 16GB/s
This doesn't take into account latency. In practice, running a GPU compute task on data which is too big to fit in GPU memory and has to be transferred over PCIe every time it is accessed is so slow for most types of compute that doing the same calculation on CPU would be much faster.
Why do you see that kind of memory being allocated when you have a NVIDIA card in your machine? Good question. I can think of a couple of possibilities:
(a) You have both NVIDIA and Intel graphics drivers active (eg as happens when running different displays on both). Uninstaller the Intel drivers and/or disable Intel HD graphics in the BIOS and shared memory will disappear.
(b) NVIDIA is using it. This may be eg extra texture memory, etc. It could also not be real memory but just a memory mapped area that corresponds to GPU memory. Look in the advanced settings of the NVIDIA driver for a setting that controls this.
In any case, no, there isn't anything that Tensorflow can use.
CUDA can make use of the RAM, as well. In CUDA shared memory between VRAM and RAM is called unified memory. However, TensorFlow does not allow it due to performance reasons.
I had the same problem. My vram is 6gb but only 4 gb was detected. I read a code about tensorflow limiting gpu memory then I try this code, but it works:
#Setting gpu for limit memory
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
#Restrict Tensorflow to only allocate 6gb of memory on the first GPU
try:
tf.config.experimental.set_virtual_device_configuration(gpus[0],
[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=6144)])
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
except RuntimeError as e:
#virtual devices must be set before GPUs have been initialized
print(e)
Note: if you have 10gb vram, then try to allocate a memory limit of 10240.
Well, that's not entirely true. You're right in terms of lowering the batch size but it will depend on what model type you are training. if you train Xseg, it won't use the shared memory but when you get into SAEHD training, you can set your model optimizers on CPU (instead of GPU) as well as your learning dropout rate which will then let you take advantage of that shared memory for those optimizations while saving the dedicated GPU memory for your model resolution and batch size. So it may seem like that shared memory is useless, but play with your settings and you'll see that for certain settings, it's just a matter of redistributing the right tasks. You'll have increased iteration times but you'll be utilizing that shared memory in one way or another. I had to experiment a lot to find what worked with my GPU and there was some surprising revelations. This is an old post but I bet you've figured it out by now, hopefully.

What is the correlation between the CPU usage and Memory being used?

1) why does working set image name list and peak working set list is different under processes tab in task manager? What is the relation between working set and peak working set?
2) What is the correlation between the CPU usage and Memory being used?
And also I would like to know about the CPU and memory usage on physical and kernel memory.
For example if I open Internet explorer.
How does the internet explorer application impact on CPU usage and Memory, Physical and kernel memory?
Please help me out I am fully confused?
Thanks in advance.
The working set is the pages that a process has in physical memory.
The peak working set is the largest size of the working set.
The working set can grow and shrink depending upon the competing memory demands of processes running on a system
There is little correlation between CPU usage an memory usage. It's possible for a process to have high CPU usage and low memory usage or high CPU usage and high memory usage.

cudaMallocManaged vs cudaMalloc - Device memory limitation scenario

I understand that cudaMallocManaged simplifies memory access by eliminating the need for explicit memory allocations on host and device. Consider a scenario where the host memory is significantly larger than the device memory, say 16 GB host & 2 GB device which is fairly common these days. If I am dealing with input data of large size say 4-5 GB which is read from an external data source. Am I forced to resort to explicit host and device memory allocation (as device memory is insufficient to accommodate at once) or does the CUDA unified memory model has a way to get around this (something like, auto allocate/deallocate on need basis)?
Am I forced to resort to explicit host and device memory allocation?
You are not forced to resort to explicit host and device memory allocation, but you will be forced to handle the amount of allocated memory manually. This is because, on current hardware at least, the CUDA unified virtual memory doesn't allow you to oversubscribe GPU memory. In other words, cudaMallocManaged will fail once you allocate more memory than what is available on the device. But that doesn't mean you can't use cudaMallocManaged, it merely means you have to keep track of the amount of memory allocated and never exceed what the device could support, by "streaming" your data instead of allocating everything at once.
Pure speculation as I can't speak for NVIDIA, but I believe this could be one of the future improvements on upcoming hardware.
And indeed, one year and a half after the above prediction, as of CUDA 8, Pascal GPUs are now enhanced with a page-faulting capability that allows memory pages to migrate between the host and the device without explicit intervention from the programmer.

Cachegrind under Xen

I have an application written in C++ that someone else has written in a way that's supposed to maximally take advantage of cpu caches. This application runs on a guest Ubuntu OS that is using paravirtualization. I ran cachegrind and received very low cache miss rates.
Since my OS is virtualized, can I be sure that these values are in fact correct in showing that the cpu cache is being well used for my application?
Cachegrind is a simulator. A real CPU may actually perform differently (e.g. your real CPU may have a different cache hierarchy to cachegrind, different sizes of cache, a different replacement policy and so forth). You would need to watch the real CPUs performance counters to know for sure how well your program was really performing on real hardware with respect to cache.

Checking the amount of available RAM within a running program

A friend of mine was asked, during a job interview, to write a program that measures the amount of available RAM. The expected answer was using malloc() in a binary-search manner: allocating larger and larger portions of memory until getting a failure message, reducing the portion size, and summing the amount of allocated memory.
I believe that this method will measure the amount of virtual, not physical, memory. But I got curious about the matter.
Is there a way to tell the amount of available RAM from within the program, without using exec(dmesg |grep -i memory) ?
You are correct: malloc() makes no distinction between physical or virtual memory. In fact, that's the whole point of virtual memory: to make such details irrelevant to programs.
You can find out but it is OS-specific. For example, Linux.
The only way to do this is to use some OS-specific functionality. Using malloc() is useless for a number of reasons:
it measures virtual memory
the OS may well have per-process cap on memory allocations
allocating much more memory than is physically available often degrades the platforms stability to the point where "go back one" algorithm suggested in the question probably won't work
this is OS specific and you should collect such information from the OS services unless you want to make your own memory management layer
Using malloc() will only tell you how much memory can be allocated to a single process. There may be reasons why this is lower than the total amount of virtual memory. For instance, you might have OS quota or a per-process 32-bit-limited address space.
(And, of course, virtual memory >= RAM)
Very OS specific but for Linux the information about system memory is in /proc/meminfo. You can also probably use the sysctl interface (http://www.linuxjournal.com/article/2365) to get this data in a C program.

Resources