Determining available video memory

When developing an OpenGL program, is there a way to poll the system to find out how many megabytes are available for storing textures, etc.?
Or is the standard approach these days to just allocate memory and not worry about it?

Although the official stance remains "you don't need to know, you don't want to know, and it would not help you anyway", luckily at least two IHVs have shown a little more insight lately and offer extensions to query that information:
NVX_gpu_memory_info
ATI_meminfo
One nice thing about these extensions is that their least common denominator is just what most people need, and you don't have to query extension support or do anything special: they both work via glGetIntegerv.
In the easiest case, you initialize an array of 4 integers to zero (or to some minimum default value that you'll assume in case the extensions don't work), call glGetIntegerv twice (with GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX and TEXTURE_FREE_MEMORY_ATI, respectively), and finally call glGetError to clear the error state. glGetIntegerv does not modify the pointed-to memory if it fails, nor does it crash or do anything else bad; it merely sets the error state to GL_INVALID_ENUM.
Both extensions return a value in the first array position; the ATI one fills the other three as well.
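A rough sketch of the above (the enum values come from the published extension specs; the #ifndef guards and the helper's name are my own, not part of any official API):

    #include <GL/gl.h>

    #ifndef GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX
    #define GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049
    #endif
    #ifndef GL_TEXTURE_FREE_MEMORY_ATI
    #define GL_TEXTURE_FREE_MEMORY_ATI 0x87FC
    #endif

    /* Returns an estimate of free video memory in kB, or fallback_kb
       if neither extension is supported. */
    static GLint free_vidmem_kb(GLint fallback_kb)
    {
        GLint info[4] = { fallback_kb, 0, 0, 0 };

        /* An unsupported enum leaves info untouched and only raises
           GL_INVALID_ENUM. */
        glGetIntegerv(GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, info);
        glGetIntegerv(GL_TEXTURE_FREE_MEMORY_ATI, info);

        glGetError();   /* clear any error raised above */
        return info[0]; /* both extensions report kilobytes here */
    }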
n.b.: glAreTexturesResident has not actually been supported on mainstream hardware for almost a decade, and the same goes for texture priorities. The common mantra is that the driver writer knows much better than you anyway.

OpenGL doesn't give you this information. And frankly, there's little benefit, simply because today we have multitasking operating systems. The OpenGL driver is responsible for swapping texture data to/from system memory if there's demand for it.
What OpenGL can tell you is whether the textures you've uploaded are still resident in fast memory, via the function glAreTexturesResident. You can use this to gradually upload stuff to the GPU until you've filled up the GPU's memory. But keep in mind that you're not the only user of the GPU.
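For illustration, a minimal residency check along those lines (legacy GL; the helper name is made up, and error handling is omitted):

    #include <stdlib.h>
    #include <GL/gl.h>

    /* Returns GL_TRUE if every texture in tex_ids is resident.
       On GL_FALSE, the res array holds a per-texture flag. */
    GLboolean all_resident(const GLuint *tex_ids, GLsizei count)
    {
        GLboolean *res = malloc(count * sizeof *res);
        GLboolean all = glAreTexturesResident(count, tex_ids, res);
        free(res);
        return all;
    }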

Related

Vulkan push constants vs uniform buffer update

So I am reading the Vulkan book now and have a question about push constants vs UBO updates.
After setting up all the pipeline and descriptor stuff, I basically just need to copy the data into the UBO buffer (e.g. with memcpy) and I'm done.
I understand the issue that the pipeline may have to wait until this buffer is ready before its contents can change, so it can be slow.
With push constants, on the other hand, there is no such problem, although the space is small (say 256 bytes).
So far so good.
On second thought, however, I realized that when I update the UBO, I don't need to change or re-record the command buffer; I can submit the old one since it's still the same.
But if I want to update via push constants, I have to reset the command buffer, record it again, and then submit it.
So won't this be an issue? How can I tell which one is faster?
Thanks.
Lots of people get confused on this issue, because the Vulkan Tutorial pre-records commands while the Vulkan Guide re-records commands every frame.
When people say to use push constants for per-frame data like transform matrices and time values, there's the implicit assumption that you are recording a command buffer per frame. Push constants essentially hitch a ride with the rest of your commands when submitted, which is also how they avoid synchronization and cache flushing.
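To make the contrast concrete, a hedged sketch of the two update paths (the struct, handle names, and stage flags are illustrative, not from the question):

    #include <string.h>
    #include <vulkan/vulkan.h>

    typedef struct { float mvp[16]; float time; } PerFrame;

    /* Path 1: UBO update. Works with a pre-recorded command buffer, but
       you must ensure the GPU is no longer reading the buffer. */
    void update_ubo(void *ubo_mapped, const PerFrame *data)
    {
        memcpy(ubo_mapped, data, sizeof *data);
    }

    /* Path 2: push constants. The data is baked into the command stream
       itself, so the command buffer must be re-recorded when it changes. */
    void record_push(VkCommandBuffer cmd, VkPipelineLayout layout,
                     const PerFrame *data)
    {
        vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_VERTEX_BIT,
                           0, sizeof *data, data);
    }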
Now, in a lot of scenarios, re-recording command buffers can be easier and not significantly more costly than re-use. And indeed, re-using command buffers when things change can be a real pain to manage. Command buffers are meant to be fast to record. Still, the Vulkan tutorial went with pre-recording everything, which is also a valid approach though potentially harder to maintain at scale.
At the time it was created, the Vulkan Tutorial was essentially one of the only resources available for learning Vulkan in a structured manner. Even though command buffers are quick to record, pre-recording them eliminates even more CPU overhead and exemplifies Vulkan's "never be draw-call limited again" mantra of eliminating CPU overhead in graphics applications.
As for the speed comparison, you'll have to benchmark, but I would not necessarily choose one or the other for "speed" reasons. If you pre-record, you don't want to re-orient your entire rendering architecture just to take advantage of push constants. If you don't pre-record, there's no reason not to use push constants; they are just straight up easier to deal with.
It seems like you are currently pre-recording, so I would not bother with push constants at all for this kind of data. I would also not focus on these kinds of issues until you get more familiar with Vulkan; it is very easy to get caught in the weeds with optimization in Vulkan, and strategies for optimization are nowhere near as uniform as in the CPU space.

OpenCL - using an image object with local memory

I'm trying to program with OpenCL.
There are two types of memory object: one is the buffer and the other is the image.
Some blogs, web sites, and white papers say "image objects are a little bit faster than buffers because of the cache".
I'm trying to use image objects, and the reason for that is "clamp": it will make the kernel code simpler and (in my opinion) faster.
My question is: is it possible to use image objects together with local memory, and is that faster than using buffer objects with local memory?
Data -> image object -> copy to local memory -> operations -> write back to another image object.
As far as I understand, I cannot use the async_work_group_copy instruction with local memory in this case, so I have to copy and synchronize manually, which will add a lot of overhead.
The only real answer to that is "it depends". On most implementations there's little value in async_work_group_copy. Image reads may have slightly higher latency than buffer reads when there is a cache hit, but you may get better cache behaviour from them on some architectures. Clamping, address calculation, and filtering are effectively free operations performed by dedicated hardware that you'd have to shift into shader code when using buffers, so images reduce your read latency and may increase throughput.
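As an illustration of the "free clamping" point, a small OpenCL C kernel sketch (the kernel name and the 3x3 box blur are made up for the example):

    __constant sampler_t clamp_sampler =
        CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE |
        CLK_FILTER_NEAREST;

    __kernel void blur3(__read_only image2d_t src,
                        __write_only image2d_t dst)
    {
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        float4 sum = (float4)(0.0f);
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                /* Edge clamping is handled by the sampler hardware; a
                   buffer version would need explicit min()/max() here. */
                sum += read_imagef(src, clamp_sampler, pos + (int2)(dx, dy));
        write_imagef(dst, pos, sum / 9.0f);
    }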
If you are going to get big caching benefits from images, local memory may just get in the way. The extra cost of writing to it, synchronizing, reading from it, calculating addresses and so on may cost you.
Sadly this is just one of those things you'll have to experiment with on your target architectures.

[ARM Cortex-A] Difference between Strongly-ordered and Device memory types

I am new to Cortex-A, and I am aware that ARM uses a weakly-ordered memory model with three mutually exclusive memory types:
Strongly-ordered
Device
Normal
I roughly understand what Normal is for and what Strongly-ordered and Device mean. However, the difference between Strongly-ordered and Device is confusing to me.
According to the Cortex-A Series Programmer's Guide, the only difference is that:
A write to Strongly-ordered memory can complete only when it reaches the peripheral or memory component accessed by the write.
A write to Device memory is permitted to complete before it reaches the peripheral or memory component accessed by the write.
I am not quite sure what the real implication of this is. My guess is that accesses to memory marked Strongly-ordered or Device must happen in program order (no out-of-order access), but that the CPU can potentially execute the next instruction while a Device access is still in progress, whereas it must wait for a Strongly-ordered access to complete.
Correct me if I am wrong, and please tell me the point of this distinction.
Thanks in advance.
One important bit to understand is that memory types have no guaranteed effect on the instruction stream as a whole; they affect only the ordering of memory accesses. (They may have a specific effect on a specific processor integrated in a specific way with a specific interconnect, but software can never rely on that.)
Another important thing to understand is that even Strongly-ordered memory provides implicit ordering guarantees only with regard to accesses to the same peripheral. Any ordering requirements stricter than that require the use of explicit barrier instructions.
A third important point is that any implicit memory access ordering that takes place due to memory types does not affect the ordering of accesses to other memory types. Again, if your application has dependencies like this, explicit barrier instructions are required.
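For example, a sketch of that last point in C with inline assembly (ARMv7-A assumed; the MMIO address and all names are made up):

    #include <stdint.h>

    /* Hypothetical device register, mapped as Device memory. */
    #define DEVICE_REG (*(volatile uint32_t *)0x40001000u)

    static volatile uint32_t shared_flag; /* lives in Normal memory */

    void kick_device_then_signal(void)
    {
        DEVICE_REG = 1;                       /* Device-memory write */
        __asm__ volatile("dmb" ::: "memory"); /* barrier: order it against
                                                 the Normal-memory write */
        shared_flag = 1;                      /* observers now see the device
                                                 kick before the flag */
    }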
Now, against that background - a simpler way of describing the difference between Device and Strongly-ordered memory is that Device memory accesses can be buffered - in the processor itself or in the interconnect. The difference being that a buffered access can be signalled as complete to the processor before it has completed (or even initiated) at the end point.
This provides better performance at the cost of losing the synchronous reporting of any error condition.

Short question about CUDA memory access

hey there,
assuming I have a problem where each thread calculates something (reading some parameters out of constant memory and using them for the calculation) and then stores it to a global memory matrix, and this matrix is never read, only written... does it make any sense to use shared memory first to store all the calculated values and only later write them to global memory? I think not, because the total number of global memory writes stays the same, so the writes to shared memory just add to the writes I already had...
Thanks!
There can be, depending on the access patterns in the kernel code. Using a shared memory buffer to "stage" output can be a useful way of ensuring writes are coalesced when the naive write would not be. This was pretty crucial for performance in the first couple of generations of CUDA-compatible hardware (G80/G90). On newer hardware the case for it is a lot weaker: Fermi cards have a pretty effective L1 and L2 cache scheme which can (within reason) get close to what used to be achievable only with shared memory, without any extra code.
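As a sketch of what such staging looks like, here is the classic case, a matrix transpose, where the naive write would be uncoalesced (tile size and names are assumptions, not from the question):

    #define TILE 16

    __global__ void transpose_staged(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE][TILE + 1]; /* +1 avoids bank conflicts */

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x]; /* coalesced read */

        __syncthreads();

        /* Swap block indices so the global write is also coalesced. */
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y]; /* coalesced write */
    }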
There isn't really a general answer to this question, because it depends a lot on the specifics of what any given code does and what target hardware it is expected to run well on.

Coping with, and minimizing, memory usage in Common Lisp (SBCL)

I have a VPS without much memory (256MB) which I am trying to use for Common Lisp development with SBCL + Hunchentoot, writing some simple web apps. A large amount of memory appears to get used without my doing anything particularly complex, and after serving pages for a while it runs out of memory and either thrashes in swap or (if there is no swap) just dies.
So I need help to:
Find out what is using all the memory (if it's libraries or me, especially)
Limit the amount of memory which SBCL is allowed to use, to avoid massive quantities of swapping
Handle things cleanly when memory runs out, rather than crashing (since it's a web-app I want it to carry on and try to clean up).
I assume the first two are reasonably straightforward, but is the third even possible?
How do people handle out-of-memory or constrained memory conditions in Lisp?
(Also, I note that a 64-bit SBCL appears to use literally twice as much memory as 32-bit. Is this expected? I can run a 32-bit version if it will save a lot of memory)
To limit the memory usage of SBCL, use the --dynamic-space-size option (e.g., sbcl --dynamic-space-size 128 will limit memory usage to 128MB).
To find out who is using memory, you may call (room) (the function that reports how much memory is being used) at different times: at startup, after all libraries are loaded, and then during work (of course, call (sb-ext:gc :full t) before room so as not to measure garbage that has not yet been collected).
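A minimal sketch of that workflow (the helper name is made up):

    ;; Force a full GC so uncollected garbage isn't measured,
    ;; then print SBCL's memory breakdown.
    (defun report-memory (label)
      (format t "~&=== ~a ===~%" label)
      (sb-ext:gc :full t)
      (room))

    ;; Call it at interesting points, e.g.:
    ;; (report-memory "after startup")
    ;; (report-memory "after loading libraries")
    ;; (report-memory "after serving 1000 requests")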
Also, it is possible to use the SBCL profiler to measure memory allocation.
Find out what is using all the memory (if it's libraries or me, especially)
Attila Lendvai has some SBCL-specific code to find out where an allocated object comes from. Refer to http://article.gmane.org/gmane.lisp.steel-bank.devel/12903 and write him a private mail if needed.
Be sure to try another implementation, preferably with a precise GC (like Clozure CL) to ensure it's not an implementation-specific leak.
Limit the amount of memory which SBCL is allowed to use, to avoid massive quantities of swapping
Already answered by others.
Handle things cleanly when memory runs out, rather than crashing (since it's a web-app I want it to carry on and try to clean up).
256MB is tight, but anyway: schedule a recurring (maybe 1 s) timed thread that checks the remaining free space. If the free space is less than X, use exec() to replace the current SBCL process image with a fresh one.
If you don't have any type declarations, I would expect 64-bit Lisp to take twice the space of a 32-bit one. Even a plain (small) int will use a 64-bit chunk of memory. I don't think it'll use less than a machine word, unless you declare it.
I can't help with #2 and #3, but if you figure out #1, I suspect it won't be a problem. I've seen SBCL/Hunchentoot instances running for ages. If I'm using an outrageous amount of memory, it's usually my own fault. :-)
I would not be surprised by a 64-bit SBCL using twice the memory, as it will probably use a 64-bit cell rather than a 32-bit one, but I couldn't say for sure without actually checking.
Typical things that keep memory hanging around for longer than expected are no-longer-useful references that still have a path to the root allocation set (hash tables are, I find, a good way of letting these things linger). You could try interspersing explicit calls to GC in your code and make sure to (as far as possible) not store things in global variables.
