DirectX RenderContext RAM/VRAM - memory

I have 8GB or Vram (Gpu) & 16GB of Normal Ram when allocating (creating) many lets say large 4096x4096 textures i eventual run out of Vram.. however from what i can see it then create it on ram instead.. When ever you need to render (with or to) it .. it seams to transfer the render-context from the ram to vram in order to do so. Running normal accessing many render-context over and over every frame (60fps etc) the pc lags out as it tries to transfer very high amounts back and forth. However so long the amount of new (not recently used render-contexts (etc still on ram not vram)) is references each second.. there should not be a issue (performance wise). The question is if this information is correct?

DirectX will allocate DEFAULT pool resources from video RAM and/or the PCIe aperture RAM which can both be accessed by the GPU directly. Often render targets must be in video RAM, and generally video RAM is faster memory--although it greatly depends on the exact architecture of the graphics card.
What you are describing is the 'over-commit' scenario where you have allocated more resources than actually fit in the GPU-accessible resources. In this case, DirectX 11 makes a 'best-effort' which generally involves changing virtual memory mapping to get the scene to render, but the performance is obviously quite poor compared to the more normal situation.
DirectX 12 leaves dealing with 'over-commit' up to the application, much like everything else about DirectX 12 where generally "runtime magic behavior" has been removed. See docs for details on this behavior, as well as this sample

Related

Memory allocation during startup of a C++ program

I sometimes read that the code segment is placed into ROM/FLASH. Others state that it is also loaded into RAM.
Is my understanding correct that it is common to place it into FLASH primary memory in case of an embedded system? And what are the advantages? I assume the startup of the program will be faster but since the FLASH memory is much slower it would be better to additionally load it from FLASH to RAM during the startup phase when RAM usage does not matter?
Sometimes you don't have enough memory for your program, so you just leave it in ROM or flash. On a system flush with memory you just load everything into RAM, it's much faster.
Some embedded CPUs have 2K of memory, but 2MB of flash. As an example the RP2040 has 264KB of SRAM (RAM) but 2MB of Flash memory for your programs. That's a lot bigger than the memory footprint.
Flash is slow compared to modern DRAM but in an embedded environment the CPU isn't always that fast either. The RP2040 only runs at 133MHz, so it won't notice the difference between flash latency and SRAM latency like a chip running in the 2GHz range might. It's clocked 15x slower.
If you want to explore this more, embedded CPUs like the RP2040 are really cheap, some less than $1, so you can experiment on them and see how it plays out in real life without having to spend much money at all.
Generally, RAM is much faster than flash. Where to run the code from however, depends on the system. On most traditional embedded systems, you don't execute from RAM.
On low-end embedded systems (8 and 16-bitters) you always keep all code in flash and there won't be a performance difference between executing from RAM or flash. Such systems typically don't have a MMU nor protection against writing to the code area, so running code from RAM is highly dangerous since bugs can write straight into physical memory. Also, these systems tend to have very limited RAM.
On mid-range embedded systems (Cortex M etc) where you start to clock the core faster than the flash can keep up with, you need to introduce wait states, where the CPU waits for the flash to read. Typically you need wait states when you go beyond somewhere around 40-50MHz system clock on modern systems. The higher the clock, the more wait states you need.
Such systems do not typically execute code from RAM either, since they usually don't need extreme performance. And they typically don't have a lot of RAM either. In some cases like mid-range Power PC, you'll have instruction cache, which helps a lot in compensating for the slower flash, since instructions can be pre-loaded from flash to cache by branch prediction.
On high-end systems (Cortex A, x86 etc) there will be lots of RAM available for the purpose of executing the code from there and then you are expected to do so. On these systems, cache rather serves the purpose of speeding up access to RAM.
Historically, RAM was also much more prone to electromagnetic interference and could also lose charge over time unless you kept writing to the cells, so you didn't want to keep code in RAM for those reasons alone. That's not much of an issue today though.

How does browser GPU memory usage works?

By pressing F12 and then Esc on Chrome, you can see a few options to tick. One of them is show FPS meter, which allows us to see GPU memory usage in real time.
I have a few questions regarding this GPU memory usage:
This GPU memory means the memory the webpage needs to store its code: variables, methods, images, cached videos, etc. Is this right to affirm?
Is there a reason as to why it has an upper bound of 512 Mb? Is there a way to reduce or increase it?
How much GPU memory usage is enough to see considerable slowdown on browser navigation?
If I have an array with millions of elements (just hypothetically), and I splice all the elements in the array, will it free the memory that was in use? Or will it not "really" free the memory, requiring an additional step to actually wipe it out?
1. What is stored in GPU memory
Although there are no hard-set rules on the type of data that can be stored in GPU-memory, the bulk of GPU memory generally contains single-frame resources like textures, multi-frame resources like vertex buffers and index buffer data, and programmable-shader compiled code fragments. So while in theory it is possible to store video's in GPU memory, as well as all kinds of other bulk data, in practice, for every streamed video only a bunch of frames will ever be in GPU-ram.
The main reason for this soft-selection of texture-like data sets is that a GPU is a parallel hardware architecture, and it expects the data to be compatible with that philosophy, which means that there are no inter-dependencies between sets of data (i.e. pixels). Decoding images from a video stream is more or less the same as resolving interdependence between data-blocks.
2. Is 512MB enough for everyone?
No. It's probably based on your hardware.
3. When does GPU memory become slow?
You have to know that some parts of the GPU memory are so fast you can't even start to appreciate the speed. There is nothing wrong with the speed of a GPU card. What matters is the time it takes to get the data IN that memory in the first place. That is called bandwidth, and the operations usually need to be synchronized. In that case, the driver will lock the Northbridge bus so that data can flow from main memory into GPU memory, and this locking + transfer takes quite some time.
So to answer the question, once it is uploaded, the GUI will remain fast, no matter how much more memory is used on the GPU card. The only thing that can slow it down, are changes to the GUI, and other GPU processes taking time to complete that may interfere with rendering operations.
4. Splicing ram memory frees it up?
I'm not quite sure what you mean by splicing. GPU memory is freed by applications that release that memory by using the API calls to do that. If you want to render you GPU memory blank, you'd have to grab the GPU handles of the resources first, upload 'clear' data into them, and then release the handles again, but (for normal single-threaded GPU applications) you can only do that in your own process context.

What happens when VRAM is full?

I want to know the current nvidia/AMD implementation of handling VRAM resource allocation.
We already know that operating systems use swap/virtual memory when system RAM is full, then what is the equivalent of swap when it comes to VRAM? Do they fall back to system RAM or hard disk?
I thought that falling back to system RAM is rational, but from my experience video games lag horribly(1/20 of typical FPS) when they are out of video memory space, that made me doubt that they are using system RAM because I think system RAM is not that slow to make the game lag so much.
In short I would like to know what the current implementations are and what is the biggest bottleneck that causes the game to lag under out-of-memory situations.
the swapping is really done to RAM
if there is enough RAM to swap to. Swapping to file is unusable due to slow speed see next bullet
The RAM it self is not that slow (still slower) but the buses connected to it are
while swapping system memory to swap file the memory swap occur when needed (change focus of application,open new file/table,...) this is not that frequent but if you are out of VRAM then you are in trouble because usually most of gfx data is used in each frame.
This leads to swapping per frame so you need to copy usually very large data blocks very often for example swapping 256MB 20fps leads to:
256M x 2 x 20 = 10 GB/s read
256M x 2 x 20 = 10 GB/s write
which is 20GB/s bandwidth needed of coarse depending on the memory controller and architecture You can do read/write simultaneously up to a point so you can get close to 10GB/s in total theoretically but still that is huge number for only 256MB chunk of data look here:
Cache size estimation on your system?
My setup at that time has memory write only around 5GB/s which is nowhere near the needed memory transfer rate needed for such task

Software memory bit-flip detection for platforms without ECC

Most available desktop (cheap) x86 platforms now still nave no ECC memory support (Error Checking & Correction). But the rate of memory bit-flip errors is still growing (not the best SO thread, Large scale CERN 2007 study "Data integrity": "Bit Error Rate of 10-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; 2009 Google's "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware with data-intensive load (8 GB/s of reading) this means that single bit flip may occur every minute (10-12 vendors BER from CERN07) or once in two days (10-16 BER from CERN07). Google09 says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8GB of RAM ("mean correctable error rates of 2000–6000 per GB per year").
So, I want to know, is it possible to add some kind of software error detection in system-wide manner (check both user and kernel memory). For example, create a patch for Linux kernel and/or to system compiler to add some checksumming of every memory page, and try to detect silent memory corruptions (bit-flips) by regular recomputing of checksums?
For example, can we see all writes to memory (both from user and kernel space), to distinguish between intended memory changes from in-memory bit flips? Or can we somehow instrument all codes with some helper?
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.
I also understand that better way of data protection from memory bitflips is to switch to ECC hardware, but most PC there are still non-ECC.
The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect if they have ECC modules and complain (or print a warning) when they don't.
http://www.cyberciti.biz/faq/ecc-memory-modules/
For example, can we see all writes to memory (both from user and kernel space), to distinguish between intended memory changes from in-memory bit flips? Or can we somehow instrument all codes with some helper?
Er, you you will never "see" the bit-flips on the bus. They are literally caused by a particle hitting RAM, flipping a bit. Only much later can you notice that you read out something different than your wrote in. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e. create a shadow copy of what is in your real RAM, so you can verify every read returns what was written to that location.)
try to detect silent memory corruptions (bit-flips) by regular recomputing of checksums?
The Redis guy has a nice write-up on an algorithm for testing RAM for problems. http://antirez.com/news/43 But this is really looking for RAM errors, not random bit-flips.
If "recompute checksums" only works when you are NOT writing to the memory. That might be "good enough" but you'll need to figure out which pages are not being written to.
To catch 100% of the errors, every write must be pre-ceeded by computing the checksum of that block of memory, then comparing it to the recorded checksum (to make sure that block hasn't degraded in RAM). Only then is it safe to do the write and then update the checksum. As you can imagine, the performance of this will be horrible (at least 100x slower) performance.
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.
Well, there is a simple method to detect 100% of the errors, at a cost of 50% performance: Just run the computation on 2 boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid.) If the results differ, you have detected an error.
See also:
https://www.linuxquestions.org/questions/linux-hardware-18/how-to-detect-ecc-memory-errors-under-linux-886011/
The answer to the question is yes, and a proof for that is the software SoftECC posted in the comments!
Just a note that SoftECC is a kernel level solution. If a user-land app is used, it will be a third stage of redundancy, that seems not necessary.

OpenGL DisplayList using video memory

Is it possible to store the display list data on the video card memory?
I want to use only video memory like Video Buffer Object(VBO) to store DisplayList.
But when I try it, it always uses main memory instead of video memory.
I tested on nVidia geForce 8600GTS, and GTX260.
Display lists are a very old feature, that dates back to OpenGL-1.0. They have been depreceated a long time ago. Anyhow you can still use them for compatibility reasons.
The way OpenGL works, prevents display lists from being held in GPU memory only. The graphics server (as OpenGL calls it) is a purely abstract thing, and the specification warrants, that what you put in a display lists is always available. However in modern GPUs there's only a limited amount of memory, so payload data may be swapped in and out as needed.
Effectively GPU memory is a cache for data in system RAM (the same way system RAM should be treaded as cache for storage).
Even moreso, modern GPUs may crash, and the drivers will perform a full reset giving the user the impression everything works normal. But after the reset all the data on GPU memory must be reinitialized.
So it is necessary for OpenGL to keep copies of every payload data in memory to support smooth operation.
Hence it is perfectly normal for your data to show up as consuming system RAM as well. It is though very likely, that the display lists are also cached in GPU memory.
Display Lists are deprecated. You can use VBO with vertex indices to use graphics memory, and draw it with glDrawElements.

Resources