OpenGL DisplayList using video memory - memory

Is it possible to store the display list data on the video card memory?
I want to use only video memory like Video Buffer Object(VBO) to store DisplayList.
But when I try it, it always uses main memory instead of video memory.
I tested on nVidia geForce 8600GTS, and GTX260.

Display lists are a very old feature, that dates back to OpenGL-1.0. They have been depreceated a long time ago. Anyhow you can still use them for compatibility reasons.
The way OpenGL works, prevents display lists from being held in GPU memory only. The graphics server (as OpenGL calls it) is a purely abstract thing, and the specification warrants, that what you put in a display lists is always available. However in modern GPUs there's only a limited amount of memory, so payload data may be swapped in and out as needed.
Effectively GPU memory is a cache for data in system RAM (the same way system RAM should be treaded as cache for storage).
Even moreso, modern GPUs may crash, and the drivers will perform a full reset giving the user the impression everything works normal. But after the reset all the data on GPU memory must be reinitialized.
So it is necessary for OpenGL to keep copies of every payload data in memory to support smooth operation.
Hence it is perfectly normal for your data to show up as consuming system RAM as well. It is though very likely, that the display lists are also cached in GPU memory.

Display Lists are deprecated. You can use VBO with vertex indices to use graphics memory, and draw it with glDrawElements.

Related

Need for CPU-GPU sync if write and read happens on different memory pages of MTLBuffer?

I am using an MTLBuffer in Metal that I created by allocating several memory pages (using vm_allocate) with
device.makeBuffer(bytesNoCopy:length:options:deallocator:).
I write the buffer with CPU and the GPU only reads it. I know that generally I need to synchronise between CPU and GPU.
However, I have more knowledge about where in the MTLBuffer write (by CPU) and read (by GPU) happens and in my case writing is into different memory pages than the read (in a given time interval).
My question: Do I need to sync between CPU and GPU even if the relevant data that is written and read are on different memory pages (but in the same MTLBuffer)? Intuitively, I would think not, but then MTLBuffer is a bit opaque and I don't really know what kind of processing/requirement the GPU actually does/has with the MTLBuffer.
Additional info: This is a question for iOS and MTLStorageMode is shared.
Thank you very much for help!
Assuming the buffer was created with MTLStorageModeManaged, you can use the function didModifyRange to sync CPU to GPU for only a portion (a page for example) of the buffer.

DirectX RenderContext RAM/VRAM

I have 8GB or Vram (Gpu) & 16GB of Normal Ram when allocating (creating) many lets say large 4096x4096 textures i eventual run out of Vram.. however from what i can see it then create it on ram instead.. When ever you need to render (with or to) it .. it seams to transfer the render-context from the ram to vram in order to do so. Running normal accessing many render-context over and over every frame (60fps etc) the pc lags out as it tries to transfer very high amounts back and forth. However so long the amount of new (not recently used render-contexts (etc still on ram not vram)) is references each second.. there should not be a issue (performance wise). The question is if this information is correct?
DirectX will allocate DEFAULT pool resources from video RAM and/or the PCIe aperture RAM which can both be accessed by the GPU directly. Often render targets must be in video RAM, and generally video RAM is faster memory--although it greatly depends on the exact architecture of the graphics card.
What you are describing is the 'over-commit' scenario where you have allocated more resources than actually fit in the GPU-accessible resources. In this case, DirectX 11 makes a 'best-effort' which generally involves changing virtual memory mapping to get the scene to render, but the performance is obviously quite poor compared to the more normal situation.
DirectX 12 leaves dealing with 'over-commit' up to the application, much like everything else about DirectX 12 where generally "runtime magic behavior" has been removed. See docs for details on this behavior, as well as this sample

iOS memory shared between CPU and GPU and what that means for reading

I have a MTLBuffer that is using memory that is allocated by the cpu and thus shared by both the cpu and the GPU.
Per Apple's suggestion I am using triple buffering to remove latency that might be caused by one processor waiting on the other to finish.
My vertex data changes every frame so every frame I am writing to one section of the array with the CPU and reading a different section with the GPU.
What I would like to do is read some of the values that the GPU is currently also reading as they save me some time doing calculations for the section of the buffer the CPU is writing to.
Essentially this is because the current frame's data is dependent on the previous frames data.
Is this valid? Can the CPU and the GPU be reading from the same portion of memory at once since memory is shared on iOS?
I think that's valid and safe, for two reasons. First, CPUs actually often have to read in order to write. Things like caches and memory buses don't allow for access to RAM at the granularity we usually think of (byte or even register size). In order to write, it usually has to read a larger chunk from memory, modify just the part written, and then (eventually) write the larger chunk back to memory. So, even the approach where you don't explicitly read from parts of the buffer that the GPU is reading and you only write to parts that the GPU isn't accessing can, in theory, still be implicitly reading from parts of the buffer that the GPU is reading. Since we're not given the info we'd need to reliably avoid that, I'd say it isn't considered a problem.
Second, no warning is given about what you describe in Apple's docs. There's the "Maintaining Coherency Between CPU and GPU Memory" section in the article about resource objects. That only discussed the case where either the CPU or GPU are modifying shared data, not where both are just reading.
Then there's the "Resource Storage Modes and Device Memory Models" section describing the new storage modes introduced with iOS 9 and macOS 10.11. And the docs for MTLResourceStorageModeShared itself. Again, there's mention of reading vs. writing, but none about reading vs. reading.
If there were a problem with simultaneous reading, I think Apple would have discussed it.

How does browser GPU memory usage works?

By pressing F12 and then Esc on Chrome, you can see a few options to tick. One of them is show FPS meter, which allows us to see GPU memory usage in real time.
I have a few questions regarding this GPU memory usage:
This GPU memory means the memory the webpage needs to store its code: variables, methods, images, cached videos, etc. Is this right to affirm?
Is there a reason as to why it has an upper bound of 512 Mb? Is there a way to reduce or increase it?
How much GPU memory usage is enough to see considerable slowdown on browser navigation?
If I have an array with millions of elements (just hypothetically), and I splice all the elements in the array, will it free the memory that was in use? Or will it not "really" free the memory, requiring an additional step to actually wipe it out?
1. What is stored in GPU memory
Although there are no hard-set rules on the type of data that can be stored in GPU-memory, the bulk of GPU memory generally contains single-frame resources like textures, multi-frame resources like vertex buffers and index buffer data, and programmable-shader compiled code fragments. So while in theory it is possible to store video's in GPU memory, as well as all kinds of other bulk data, in practice, for every streamed video only a bunch of frames will ever be in GPU-ram.
The main reason for this soft-selection of texture-like data sets is that a GPU is a parallel hardware architecture, and it expects the data to be compatible with that philosophy, which means that there are no inter-dependencies between sets of data (i.e. pixels). Decoding images from a video stream is more or less the same as resolving interdependence between data-blocks.
2. Is 512MB enough for everyone?
No. It's probably based on your hardware.
3. When does GPU memory become slow?
You have to know that some parts of the GPU memory are so fast you can't even start to appreciate the speed. There is nothing wrong with the speed of a GPU card. What matters is the time it takes to get the data IN that memory in the first place. That is called bandwidth, and the operations usually need to be synchronized. In that case, the driver will lock the Northbridge bus so that data can flow from main memory into GPU memory, and this locking + transfer takes quite some time.
So to answer the question, once it is uploaded, the GUI will remain fast, no matter how much more memory is used on the GPU card. The only thing that can slow it down, are changes to the GUI, and other GPU processes taking time to complete that may interfere with rendering operations.
4. Splicing ram memory frees it up?
I'm not quite sure what you mean by splicing. GPU memory is freed by applications that release that memory by using the API calls to do that. If you want to render you GPU memory blank, you'd have to grab the GPU handles of the resources first, upload 'clear' data into them, and then release the handles again, but (for normal single-threaded GPU applications) you can only do that in your own process context.

shared memory vs texture memory in opencl

I am writing deinterlacing code in Opencl. I am reading the pixels using read_imageui() API in the local memory.
Just like the code at:
https://opencl-book-samples.googlecode.com/svn-history/r29/trunk/src/Chapter_19/oclFlow/lkflow.cl
As per my understanding when we read pixels using this API we are reading from the Texture memory. I am doubtful that using the pixels first in shared memory will help me gaining any speed as Texture memory already acts as cache and provides fast access to data.
Can anyone clarify my doubt ?
If you can fit all your data in private memory after reading it with read_imageui, you should definitely do that. Keep in mind that you only have 256 bytes of private memory per work item if your kernel compiles SIMD16 and 512 bytes if it compiles SIMD8.
Whether you should use local memory or not really depends on the access pattern. Indeed, Samplers have their own L1 and L2 caches, so if your data accesses always hit the caches, you should be fine. Remember, that local memory is banked, so you have 16 banks from which you can fetch 4 bytes at a time, which means that you get full bandwidth if you hit all 16 banks from all work items in one hardware thread (typically 16 or 8 of them). So, you might have a situation where you are better off reading image data into local memory first and then accessing local memory in an orderly fashion. Good example of this are algorithms like SIFT or SURF, where you access image in such a way that sampler cache really does not help much (you still get sampler interpolation benefits), but then you place all that data in local memory and access it repeatedly in a fairly regular pattern.
In general, that's true. However, even a cached read from a texture might be slower than a read from shared local memory, so for an algorithm that makes many overlapped reads from adjacent locations could still benefit somewhat from using shared local memory. However, it will make the kernel more complicated, so for many cases (and certainly during algorithm development) just rely on the cached texture reads.

Resources