What happens when VRAM is full? - memory

I want to know the current nvidia/AMD implementation of handling VRAM resource allocation.
We already know that operating systems use swap/virtual memory when system RAM is full. What is the equivalent of swap when it comes to VRAM? Do they fall back to system RAM or to the hard disk?
I thought that falling back to system RAM would be the rational choice, but in my experience video games lag horribly (1/20 of typical FPS) when they run out of video memory. That made me doubt that they are using system RAM, because I don't think system RAM is slow enough to make a game lag that much.
In short I would like to know what the current implementations are and what is the biggest bottleneck that causes the game to lag under out-of-memory situations.

The swapping is really done to RAM,
if there is enough RAM to swap to. Swapping to a file is unusable because it is too slow (see the next point).
The RAM itself is not that slow (though still slower than VRAM), but the buses connected to it are.
When the OS swaps system memory to a swap file, the swap happens only when needed (changing application focus, opening a new file/table, ...), which is not that frequent. If you are out of VRAM, however, you are in trouble, because usually most of the graphics data is used in every frame.
This leads to swapping every frame, so you need to copy very large data blocks very often. For example, swapping 256 MB at 20 fps leads to:
256M x 2 x 20 = 10 GB/s read
256M x 2 x 20 = 10 GB/s write
which is 20 GB/s of bandwidth needed. Of course, depending on the memory controller and architecture, you can read and write simultaneously up to a point, so theoretically you can get close to 10 GB/s in total, but that is still a huge number for only a 256 MB chunk of data. Look here:
Cache size estimation on your system?
My setup at that time had a memory write speed of only around 5 GB/s, which is nowhere near the transfer rate needed for such a task.
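The arithmetic above can be reproduced with a small sketch. The function name and the decimal unit convention (1 GB = 1000 MB) are assumptions made for illustration; the factor of 2 is taken from the answer's formula as-is.

```python
def swap_bandwidth_gb_per_s(chunk_mb: float, fps: float, passes: int = 2) -> float:
    """Bandwidth in GB/s (1 GB = 1000 MB) needed to move `chunk_mb`
    `passes` times per frame at `fps` frames per second."""
    return chunk_mb * passes * fps / 1000.0


# Figures from the answer: a 256 MB chunk swapped at 20 fps.
read_gb_s = swap_bandwidth_gb_per_s(256, 20)
write_gb_s = swap_bandwidth_gb_per_s(256, 20)
total_gb_s = read_gb_s + write_gb_s

print(f"read  ~{read_gb_s:.2f} GB/s")
print(f"write ~{write_gb_s:.2f} GB/s")
print(f"total ~{total_gb_s:.2f} GB/s, versus roughly 5 GB/s of measured write speed")
```

Even the read half alone is about twice the write speed quoted for that setup, which is consistent with the frame rate collapsing once per-frame swapping starts.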


DirectX RenderContext RAM/VRAM

I have 8 GB of VRAM (GPU) and 16 GB of normal RAM. When allocating (creating) many large textures, say 4096x4096, I eventually run out of VRAM. However, from what I can see, the resources are then created in RAM instead. Whenever you need to render with or to one of them, it seems to be transferred from RAM to VRAM in order to do so. When many render contexts are accessed over and over every frame (60 fps etc.), the PC lags as it tries to transfer very large amounts of data back and forth. However, as long as the number of not recently used render contexts (i.e. those still in RAM, not VRAM) referenced each second is small, there should not be a performance issue. The question is whether this information is correct.
DirectX will allocate DEFAULT pool resources from video RAM and/or the PCIe aperture RAM which can both be accessed by the GPU directly. Often render targets must be in video RAM, and generally video RAM is faster memory--although it greatly depends on the exact architecture of the graphics card.
What you are describing is the 'over-commit' scenario where you have allocated more resources than actually fit in the GPU-accessible resources. In this case, DirectX 11 makes a 'best-effort' which generally involves changing virtual memory mapping to get the scene to render, but the performance is obviously quite poor compared to the more normal situation.
DirectX 12 leaves dealing with 'over-commit' up to the application, much like everything else about DirectX 12 where generally "runtime magic behavior" has been removed. See docs for details on this behavior, as well as this sample

What applications require 1GB pages?

X86 and x64 processors allow for 1GB pages when the PDPE flag is set on the cpu. In what application would this be practical or required and for what reason?
Hugepages help in cases where you have a large memory footprint and a memory access pattern that spans large distances (across many 4K pages).
They not only reduce TLB misses but also shrink the page tables that the OS memory-management subsystem has to maintain.
A very good example is packet processing. In high throughput network applications (1Gbps or more), packets are normally stored in a packet buffer pool (i.e. a pooling technique). For example, every packet buffer is 2KB in size and the pool contains 512 buffers. The access pattern of this packet buffer pool might not be sequential (buffers indexed 1,2,3,4,5...) but rather random over time (1,104,407,45,905...). Since the normal page size is 4K, the normal TLB won't help much here: the buffers sit on many different pages, so a packet access can easily incur a TLB miss.
In contrast, if you put the pool in a 1GB hugepage, then all packet buffers share the same hugepage TLB entry, thus avoiding misses.
This is used in DPDK (Data Plane Development Kit), where the packet rate is so high that the cycles wasted on TLB misses are not negligible.
Hugepage support is required for the large memory pool allocation used
for packet buffers (the HUGETLBFS option must be enabled in the
running kernel as indicated the previous section). By using hugepage
allocations, performance is increased since fewer pages are needed,
and therefore less Translation Lookaside Buffers (TLBs, high speed
translation caches), which reduce the time it takes to translate a
virtual page address to a physical page address. Without hugepages,
high TLB miss rates would occur with the standard 4k page size,
slowing performance.
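As a rough illustration of the packet pool example, the sketch below counts how many distinct page translations the pool needs at each page size. It assumes the pool is contiguous and page-aligned; the helper name is hypothetical and not part of DPDK.

```python
KIB = 1024
GIB = 1024 ** 3


def pages_spanned(pool_bytes: int, page_bytes: int) -> int:
    """Pages needed to cover a contiguous, page-aligned pool (ceiling division)."""
    return -(-pool_bytes // page_bytes)


# Pool from the example: 512 buffers of 2 KB each (1 MiB in total).
pool_bytes = 512 * 2 * KIB

pages_4k = pages_spanned(pool_bytes, 4 * KIB)
pages_1g = pages_spanned(pool_bytes, 1 * GIB)

print(f"4 KB pages: {pages_4k} translations to keep in the TLB")
print(f"1 GB pages: {pages_1g} translation to keep in the TLB")
```

With 4 KB pages, random buffer accesses are spread over many translations that compete for TLB entries; with a single 1 GB page every buffer resolves through the same entry.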
Another example from Oracle:
...almost 6.8 GB of memory used for page tables when hugepages were not
...after hugepages were allocated and used by the Oracle database. The page table overhead was reduced to slightly less than 23 MB
However, hugepages should be used carefully. Above I mentioned that a memory pool would benefit from a 1GB hugepage. However, if your access pattern spans even 1GB page boundaries, then it might not help. There is an excellent blog on this:
Imagine an application that uses huge amounts of memory (molecular modeling, weather prediction), especially if it has no user interaction.
Large pages:
(1) reduce the amount of page table overhead memory
(2) increase the amount of memory that can be covered by the MMU cache. (The same number of cache entries references more memory.)
I have LabView installed on my Dell workstation with 8 cores and 16GB of DDR RAM, driving four 24" monitors. If I create a video processor or compositor of almost any type, with a 1024px x 1024px 'drawing' display, LabView reserves 1.5GB before I even begin to composite. It was built from C and C++. I often store image details in 3D arrays of 256 x 256 x 256 U32 integers that hold each RGB pixel color, plus the alpha channel for opacity or masking. That's 64MB for each layer of buffered video. If I need to remember 128 layers, that's 8GB right there. LabView is a programming language structured much like a CAD program. If I need 8GB for a series of video (HDTV) buffers, that is what it will give me, with a few seconds' wait for malloc to do its work. If I created an 8GB 3D array for a database, it would be no different, even if I did it in MySQL (not as an array). To me, having many gigabytes of RAM to play with is the norm, not an exception.
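The buffer sizes quoted above can be checked with a short calculation. This is only a sketch of the arithmetic; the names are made up for illustration and have nothing to do with LabView itself.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def layer_bytes(dim: int = 256, bytes_per_element: int = 4) -> int:
    """Size of one dim x dim x dim array of U32 (4-byte) elements."""
    return dim ** 3 * bytes_per_element


one_layer = layer_bytes()
all_layers = 128 * one_layer

print(f"one layer:  {one_layer // MIB} MiB")
print(f"128 layers: {all_layers // GIB} GiB")
```

At that size, the whole layer stack would fit into eight 1GB pages instead of roughly two million 4K pages.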

How does browser GPU memory usage work?

By pressing F12 and then Esc on Chrome, you can see a few options to tick. One of them is show FPS meter, which allows us to see GPU memory usage in real time.
I have a few questions regarding this GPU memory usage:
This GPU memory is the memory the webpage needs to store its code: variables, methods, images, cached videos, etc. Is this correct?
Is there a reason why it has an upper bound of 512 MB? Is there a way to reduce or increase it?
How much GPU memory usage is enough to see considerable slowdown on browser navigation?
If I have an array with millions of elements (just hypothetically), and I splice all the elements in the array, will it free the memory that was in use? Or will it not "really" free the memory, requiring an additional step to actually wipe it out?
1. What is stored in GPU memory
Although there are no hard-set rules on the type of data that can be stored in GPU memory, the bulk of GPU memory generally contains single-frame resources like textures, multi-frame resources like vertex buffers and index buffer data, and compiled programmable-shader code fragments. So while in theory it is possible to store videos in GPU memory, as well as all kinds of other bulk data, in practice only a handful of frames of each streamed video will ever be in GPU RAM.
The main reason for this soft-selection of texture-like data sets is that a GPU is a parallel hardware architecture, and it expects the data to be compatible with that philosophy, which means that there are no inter-dependencies between sets of data (i.e. pixels). Decoding images from a video stream is more or less the same as resolving interdependence between data-blocks.
2. Is 512MB enough for everyone?
No. It's probably based on your hardware.
3. When does GPU memory become slow?
You have to know that some parts of the GPU memory are so fast you can't even start to appreciate the speed. There is nothing wrong with the speed of a GPU card. What matters is the time it takes to get the data IN that memory in the first place. That is called bandwidth, and the operations usually need to be synchronized. In that case, the driver will lock the Northbridge bus so that data can flow from main memory into GPU memory, and this locking + transfer takes quite some time.
So to answer the question, once it is uploaded, the GUI will remain fast, no matter how much more memory is used on the GPU card. The only thing that can slow it down, are changes to the GUI, and other GPU processes taking time to complete that may interfere with rendering operations.
4. Splicing ram memory frees it up?
I'm not quite sure what you mean by splicing. GPU memory is freed when applications release it through the corresponding API calls. If you want to render your GPU memory blank, you'd have to grab the GPU handles of the resources first, upload 'clear' data into them, and then release the handles again, but (for normal single-threaded GPU applications) you can only do that in your own process context.

Limiting Dr. Racket's memory

I increased the memory limit of Dr. Racket a week ago, and now I want to reduce it to the same amount as before. So I limited it back to 128 MB, but that has no effect: it always consumes much more than 128 MB.
It's really a problem because it causes my computer to overheat.
Does anyone know how I can limit Dr. Racket so that it doesn't exceed 128 MB?
There is a difference between the memory used by a program and the memory used in total by DrRacket. When I start up DrRacket, before entering or running any program, I see that DrRacket uses 250MB. The interactions window states that I have limited memory to 128MB too, so that particular program cannot go beyond those bounds, but there may be features of DrRacket that use a lot more memory on your machine than on mine.
I went into the settings and removed some features I don't use (like Algol 60). When restarting after that, DrRacket used 50MB less memory, which confirms that the memory is used by DrRacket itself and not by programs.
For a particularly complex program, I guess background expansion might use a lot of memory. Perhaps you can turn that off as well to see whether the memory in use goes down.
About heat
As Óscar mentioned, memory usage has little to do with heat as long as swap is not being used (you would hear heavy disk usage). Heat has to do with CPU usage. When doing calculations, the OS will make resources available and perhaps increase the CPU frequency, which increases heat.
If you are making a threaded application that has loops waiting for tasks, make sure you are not making an active (busy-waiting) loop. Sleeping might reduce the activity, and perhaps Racket has better approaches (I have never done threaded apps in Racket).
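To illustrate the general idea of avoiding an active loop, here is a minimal sketch in Python rather than Racket (so it says nothing about Racket's own thread API). The worker blocks on a queue instead of polling in a tight loop, so it uses essentially no CPU while idle. The `worker` function and the `None` shutdown sentinel are choices made for this example.

```python
import queue
import threading


def worker(tasks: queue.Queue, results: list) -> None:
    """Process tasks until a None sentinel arrives."""
    while True:
        item = tasks.get()   # blocks until a task is available; no busy-waiting
        if item is None:     # sentinel: stop the worker
            break
        results.append(item * item)


tasks: queue.Queue = queue.Queue()
results: list = []

thread = threading.Thread(target=worker, args=(tasks, results))
thread.start()

for n in range(5):
    tasks.put(n)
tasks.put(None)              # ask the worker to shut down
thread.join()

print(results)
```

The busy-waiting alternative would repeatedly check whether the queue is empty, keeping one core fully loaded (and warm) even when there is nothing to do.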
If you are calculating something, the increase in CPU usage is natural; it's so that you get the answer earlier. Computer settings can be changed to favor battery life. Check both the OS and the BIOS. (That makes this not a Racket issue.)
The memory shown in the Dr Racket status bar is N/A.
Choose Racket | Limit Memory and specify 8 MB (the minimum).
Choose File | New Tab.
In the Interactions pane allocate 8 MB of memory. For example enter (define x (make-bytes (* 8 1024 1024))). (I recommend assigning the result to a variable, like this, because I doubt you want Dr Racket to print 8 MB of bytes.)
The result I get:
Welcome to DrRacket, version [3m].
Language: racket [custom]; memory limit: 8 MB.
> (define x (make-bytes (* 8 1024 1024)))
out of memory
Assuming you get the same result, there is some other reason your computer is running hotter.
I don't think that the extra memory being consumed is the cause for your computer overheating. More likely, it's because some function is consuming the CPU. Try to optimize the code, instead.
In fact, by limiting the available memory you might end up causing more disk paging, hence slowing things down and potentially consuming more CPU … and causing more overheating.

Clarify: Processor operates at 800 MHz and 200 MHz DDR RAM

I have an evaluation kit which has an implementation of ARM Cortex-A8 core. The processor data sheet states that it has a
ARM Cortex A8™ core, which operates at speeds as high as 800MHz and Up to 200MHz DDR2 RAM.
What can I expect from this system? Am I right to assume that the memory accesses will be a bottleneck because it operates at only 200MHz?
Need more info on how to interpret this.
The processor works with an internal cache (actually, several) which it can access at "full speed". The cache is small (typically 8 to 32 kilobytes) and is filled by chunks ("cache lines") from the external RAM (a cache line will be a few dozen consecutive bytes). When the code needs some data which is not presently in the cache, the processor will have to fetch the line from main RAM; this is called a cache miss.
How fast the cache line can be obtained from main RAM is described by two parameters, called latency and bandwidth. Latency is the amount of time between the moment the processor issues the request and the moment the first cache line byte is received. Typical latencies are about 30ns. At 800 MHz, 30ns means 24 clock cycles. Bandwidth describes how many bytes per nanosecond can be sent on the bus. "200 MHz DDR2" means that the bus clock will run at 200 MHz. DDR2 RAM can send two data elements per cycle (hence 400 million elements per second). Bandwidth then depends on how many wires there are between the CPU and the RAM: with a 64-bit bus and 200 MHz DDR2 RAM, you could hope for 3.2 GBytes/s in ideal conditions. So while the first byte takes quite some time to be obtained (latency is high with regard to what the CPU can do), the rest of the cache line is read quite quickly.
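The two figures in the paragraph above follow from simple arithmetic, sketched below. The 30 ns latency and the 64-bit bus width are the answer's illustrative assumptions, not values taken from the evaluation kit's data sheet; the function names are hypothetical.

```python
def latency_in_cycles(cpu_mhz: int, latency_ns: int) -> float:
    """CPU clock cycles that elapse while waiting `latency_ns` nanoseconds."""
    return cpu_mhz * latency_ns / 1000


def peak_bandwidth_bytes_per_s(bus_mhz: int, bus_width_bits: int,
                               transfers_per_cycle: int = 2) -> int:
    """Ideal bus bandwidth; DDR moves two data elements per bus clock cycle."""
    return bus_mhz * 1_000_000 * transfers_per_cycle * bus_width_bits // 8


cycles = latency_in_cycles(cpu_mhz=800, latency_ns=30)
bandwidth = peak_bandwidth_bytes_per_s(bus_mhz=200, bus_width_bits=64)

print(f"cache miss latency: {cycles:.0f} CPU cycles")
print(f"peak bandwidth:     {bandwidth / 1e9:.1f} GB/s")
```

A narrower bus scales the bandwidth down proportionally; for example, a 16-bit bus at the same clock would peak at a quarter of that figure.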
In the other direction: the CPU writes some data to its cache, and some circuitry will propagate the modification to main RAM at its leisure.
The description above is overly simplistic; caches and cache management are a complex area. Bottom-line is the following: if your code uses big data tables in memory and accesses them in a seemingly random way, then the application will be slow, because most of the time the processor will just wait for data from main memory. On the other hand, if your code can operate with little RAM, less than a few dozen kilobytes, then chances are that it will run most of the time with the innermost cache, and external RAM speed will be unimportant. Ability to make memory accesses in a way which operates well with the caches is called locality of reference.
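The effect of locality of reference can be seen in a toy model. The sketch below simulates a direct-mapped cache and compares random accesses into a table that fits in the cache with random accesses into a much larger table. The cache geometry (32 KB, 64-byte lines) and the table sizes are assumptions chosen for illustration and do not describe the Cortex-A8 specifically.

```python
import random

LINE_BYTES = 64
CACHE_BYTES = 32 * 1024
NUM_LINES = CACHE_BYTES // LINE_BYTES


def miss_rate(addresses: list) -> float:
    """Fraction of accesses that miss in a toy direct-mapped cache."""
    slots = [None] * NUM_LINES        # which memory line each cache slot holds
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        slot = line % NUM_LINES
        if slots[slot] != line:       # cache miss: fetch the line from "RAM"
            slots[slot] = line
            misses += 1
    return misses / len(addresses)


rng = random.Random(0)                # fixed seed so runs are repeatable
accesses = 100_000
small_table = 16 * 1024               # fits in the cache
big_table = 16 * 1024 * 1024          # far larger than the cache

small_rate = miss_rate([rng.randrange(small_table) for _ in range(accesses)])
big_rate = miss_rate([rng.randrange(big_table) for _ in range(accesses)])

print(f"small table miss rate: {small_rate:.4f}")
print(f"big table miss rate:   {big_rate:.4f}")
```

In this model the small table only incurs the initial misses needed to fill the cache, while nearly every access to the big table misses, which is the situation where the processor spends most of its time waiting for main memory.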
See the Wikipedia page on caches for an introduction and pointers on the matter of caches.
(Big precomputed tables were a common optimization trick during the '80s because at that time processors were not faster than RAM, and one-cycle memory access was the rule. Which is why an 8 MHz Motorola 68000 CPU had no cache. But those days are long gone.)
Yes, the memory may well be a bottleneck but you will be very unlikely to be running an application that does nothing but read and write to memory.
Inside the CPU, the memory bottleneck will not have an effect.
