Profile detailed GPU Memory Usage

So I have a GPU memory leak in certain scenarios in my application. However, I am not aware of any detailed memory profiler for the GPU like those for the CPU. Is there anything out there that can achieve this? I am using D3D (since it's WPF, there are D3D9, D3D10 and D3D11 components...)
Thanks!

Are you using the debug setting in the DirectX control panel? It lets you dump the ID of the leaking allocation. You can then set a HKLM registry value and break on the leaking allocation, as explained here:
http://legalizeadulthood.wordpress.com/2009/06/28/direct3d-programming-tip-5-use-the-debug-runtime/
http://www.gamedev.net/topic/313718-tracking-down-a-directx-leak/
You can also try Nsight, which you can download for free from NVIDIA. For Maximus cards there is also a dedicated GPU Debugger; otherwise you can use the Graphics Debugger and try to isolate the memory bump there. The Performance Debugger can capture both OpenGL and DirectX events, though it is more performance-oriented.
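For the D3D11 part specifically, the debug layer can also be driven from code. This is a minimal sketch, assuming you create the D3D11 device yourself (in a WPF interop scenario the device may be created elsewhere, in which case only the control-panel route applies):

// Minimal sketch: enable the debug layer, then dump every object still alive
// after the leaky scenario has run. Leaked resources show up in the debugger
// output window with their allocation details.
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

void runLeakCheck()
{
    ID3D11Device* device = nullptr;
    ID3D11DeviceContext* context = nullptr;

    D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr,
                      D3D11_CREATE_DEVICE_DEBUG,   // enable the debug layer
                      nullptr, 0, D3D11_SDK_VERSION,
                      &device, nullptr, &context);

    // ... exercise the code path that leaks ...

    ID3D11Debug* debug = nullptr;
    if (SUCCEEDED(device->QueryInterface(__uuidof(ID3D11Debug),
                                         reinterpret_cast<void**>(&debug))))
    {
        debug->ReportLiveDeviceObjects(D3D11_RLDO_DETAIL);
        debug->Release();
    }

    context->Release();
    device->Release();
}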

Depending on your GPU's vendor (as you have not told us which one), here are the possible solutions:
Intel: Use the Intel Media SDK's GPU Utilization Utility. It comes packaged in the Intel INDE (Integrated Native Developer Experience).
AMD: CodeXL provides an on-the-fly debugger and an extensive memory profiling tool, and is now provided as part of their GPUOpen initiative.
NVIDIA: Use the NVIDIA Visual Profiler (NVVP) combined with traces from NVIDIA Nsight; both utilities are provided with the standard NVIDIA CUDA installer.
Notes:
With NVIDIA, you must also install the GPU driver bundled with the CUDA SDK to enable GPU-based driver profiling and debugging. Keep this limitation in mind if you use your development rig for other purposes such as gaming, as the bundled driver is often much, much older than the stock Game Ready drivers.
Thanks and regards,
Brainiarc7.

Related

Vulkan API: max MSAA samples supported is VK_SAMPLE_COUNT_8_BIT

I am writing a Vulkan API based renderer. Currently I am trying to add MSAA for the color attachment.
I was pretty sure I could use VK_SAMPLE_COUNT_16_BIT, but limits.framebufferColorSampleCounts returns bit flags that only allow MSAA levels up to VK_SAMPLE_COUNT_8_BIT (inclusive).
I run on a brand new NVIDIA Quadro RTX 3000 card with the latest NVIDIA driver: 441.28.
I checked the limits in OpenGL, and GPU Caps Viewer shows
GL_MAX_FRAMEBUFFER_SAMPLES = 32
How does that make sense? Is the limit dictated by the Vulkan API only? And if the hardware doesn't support more than 8x, does that mean the OpenGL driver simulates it on the CPU, e.g. via something like supersampling? That's what I was told by several rendering developers on khronosdev.slack. Does that make sense? Doesn't a vendor have to comply with the standard and either implement MSAA properly or not implement it at all?
Is it possible that OpenGL doesn't "really" support more than 8x MSAA, but the driver simulates it via something like supersampling?
UPDATE
This page explains the whole state of MSAA implementation in OpenGL, and it actually makes clear why Vulkan doesn't report more than 8x samples on my card. Here is the punch line:
"Some NVIDIA drivers support multisample modes which are internally implemented as a combination of multisampling and automatic supersampling in order to obtain a higher level of anti-aliasing than can be directly supported by hardware."
framebufferColorSampleCounts is a bitmask of flags, not a count. See this enum for the values: https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/VkSampleCountFlagBits.html
A value of 15 (binary 1111) means VK_SAMPLE_COUNT_1_BIT, VK_SAMPLE_COUNT_2_BIT, VK_SAMPLE_COUNT_4_BIT and VK_SAMPLE_COUNT_8_BIT are all supported.
This explains why you get 15 rather than a power of two, but it still leaves the question of why the NVIDIA Vulkan driver limits you more than the OpenGL driver does; that is perhaps a question for the NVIDIA forums. You should double-check that your driver is up to date and that you're actually picking your NVIDIA card and not an integrated one.
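As a minimal sketch of how to read that bitmask (assuming a physicalDevice handle from your existing setup code), you can walk the flags from highest to lowest; for combined color and depth you would additionally AND in framebufferDepthSampleCounts:

#include <vulkan/vulkan.h>

// Returns the highest color-attachment MSAA count the driver reports.
VkSampleCountFlagBits maxColorSamples(VkPhysicalDevice physicalDevice)
{
    VkPhysicalDeviceProperties props{};
    vkGetPhysicalDeviceProperties(physicalDevice, &props);

    // framebufferColorSampleCounts is a bitmask; 15 == 1|2|4|8.
    VkSampleCountFlags counts = props.limits.framebufferColorSampleCounts;

    if (counts & VK_SAMPLE_COUNT_64_BIT) return VK_SAMPLE_COUNT_64_BIT;
    if (counts & VK_SAMPLE_COUNT_32_BIT) return VK_SAMPLE_COUNT_32_BIT;
    if (counts & VK_SAMPLE_COUNT_16_BIT) return VK_SAMPLE_COUNT_16_BIT;
    if (counts & VK_SAMPLE_COUNT_8_BIT)  return VK_SAMPLE_COUNT_8_BIT;
    if (counts & VK_SAMPLE_COUNT_4_BIT)  return VK_SAMPLE_COUNT_4_BIT;
    if (counts & VK_SAMPLE_COUNT_2_BIT)  return VK_SAMPLE_COUNT_2_BIT;
    return VK_SAMPLE_COUNT_1_BIT;
}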
I've also come across a similar problem (not Vulkan, but OpenGL, also on NVIDIA): on my NVIDIA GeForce GTX 750 Ti, the Linux nvidia driver reports GL_MAX_SAMPLES=32, but anything higher than 8 samples results in ugly blurring of everything, including e.g. text, even with glDisable(GL_MULTISAMPLE) for all rendering.
I remember seeing the same blurring problems when I enabled FXAA globally (via nvidia-settings --assign=fxaa=1) and ran KWin (KDE's compositing window manager) with that setting on. So I suspect this behavior with samples >= 9 is because the driver enables FXAA in addition to (or instead of) MSAA.
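For reference, a small sketch of how to query those OpenGL limits yourself (assuming a current GL context and a loader such as glad or GLEW are already set up); values above the hardware's native MSAA limit are presumably the combined multisample + supersample modes the quoted page describes:

#include <cstdio>
// GL loader header (e.g. glad/GLEW) assumed to be included by the project.

void printSampleLimits()
{
    GLint maxSamples = 0;     // renderbuffer/texture sample limit
    GLint maxFbSamples = 0;   // no-attachment framebuffer limit (GL 4.3+)
    glGetIntegerv(GL_MAX_SAMPLES, &maxSamples);
    glGetIntegerv(GL_MAX_FRAMEBUFFER_SAMPLES, &maxFbSamples);
    std::printf("GL_MAX_SAMPLES = %d, GL_MAX_FRAMEBUFFER_SAMPLES = %d\n",
                maxSamples, maxFbSamples);
}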

OpenCL migration internals

I am interested in how the OpenCL memory-transfer functions operate underneath (migration, reading/writing buffers, mapping/unmapping). I could not find any open-source implementation of OpenCL (Intel's would be fine for me), and the explanations in the documentation don't give me any idea of what happens when, for example, I call clEnqueueMigrateMemObjects: what calls happen during the migration, which modules are active, how the migration is carried out, what mechanisms it uses underneath, and whether it uses any caching.
Is there a good source to read about this?
I am currently exploring how OpenCL passes data to FPGAs. Xilinx uses the native OpenCL implementation present on the machine, plus some extensions.
If you're looking for low-level information (how a particular implementation implements those calls), probably the only source is the implementation itself.
There are a few open-source OpenCL-on-GPU implementations:
Raspberry Pi 3 (beta): https://github.com/doe300/VC4CL
OpenCL on Vulkan (beta): https://github.com/kpet/clvk
Mesa Clover (supports only 1.1): https://cgit.freedesktop.org/mesa/mesa/log/?qt=grep&q=clover
AMD ROCm: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime
Intel NEO (their new OpenCL implementation): https://github.com/intel/compute-runtime
I'm not aware of Xilinx providing sources for their implementation, so if you want to know exactly what happens on Xilinx hardware, your best chance is probably to ask on the Xilinx forums or through official support.
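For context, here is a minimal sketch of the call the question is about (OpenCL 1.2+); ctx, queueA and queueB are assumed to already exist and to target different devices in the same context. The sketch only shows the API surface; what actually happens during the migration is implementation-defined, which is why the sources above are the only authoritative reference:

#include <CL/cl.h>

void migrateExample(cl_context ctx, cl_command_queue queueA, cl_command_queue queueB)
{
    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, 1 << 20, nullptr, &err);

    // Hint the runtime to move the buffer's backing storage to the device
    // behind queueA before any kernel there touches it.
    clEnqueueMigrateMemObjects(queueA, 1, &buf, 0, 0, nullptr, nullptr);

    // ... enqueue kernels on queueA that use buf ...

    // Later, migrate it to the device behind queueB (or back to the host
    // with CL_MIGRATE_MEM_OBJECT_HOST). How the copy/caching is performed
    // is up to the implementation.
    clEnqueueMigrateMemObjects(queueB, 1, &buf, 0, 0, nullptr, nullptr);

    clReleaseMemObject(buf);
}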

IDirect3D9::GetAdapterCount won't find my second video card

My laptop has two video cards, a high-powered NVIDIA one and an onboard Intel one. When I call IDirect3D9::GetAdapterCount, however, it only finds the onboard Intel one, probably because the high-powered one is being hidden.
I'm able to go into my laptop settings and tell it to 'force choose' the NVIDIA card, and then it works, but this is not an acceptable solution for my end users. I've also noticed that when I run Battlefield 3, it is able to find the NVIDIA card properly even without 'force choose' enabled. Maybe there's a special whitelist that has Battlefield listed? Or some other secret method?
Any ideas how to acquire that elusive card?
Are you sure the Intel chip is enumerable? Quite often it's not: when a discrete GPU is added, the Sandy Bridge (and older) integrated graphics is generally disabled. You probably want to check the NVIDIA Optimus test tool.
GetAdapterCount actually returns the number of monitors (adapter outputs) in the system, not video cards. And as far as I know there is no way to force-choose the card programmatically.
If you are talking about NVIDIA Optimus technology, it chooses the video chip based on driver settings.
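The "secret method" games rely on is partly the driver's per-application profile list and partly exported symbols that the Optimus/PowerXpress drivers look for in the executable. A hedged sketch of those documented exports (they must be exported from the .exe itself, not from a DLL):

#include <windows.h>

// Exporting these globals from the executable asks hybrid-graphics drivers
// to prefer the discrete GPU when the rendering device is created.
extern "C"
{
    __declspec(dllexport) DWORD NvOptimusEnablement = 0x00000001;           // NVIDIA Optimus
    __declspec(dllexport) int AmdPowerXpressRequestHighPerformance = 1;     // AMD PowerXpress/Enduro
}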

What language and compiler to write ATI GPU code for?

I know NVIDIA has CUDA, but what does ATI have? I don't want to use OpenCL because I want to stay as low-level to the hardware as possible.
Is it Brook, or Stream?
The available documentation is pretty pathetic! CUDA seems easy to get programming with, but I want to use ATI specifically because of their hardware.
OpenCL is AMD's currently preferred GPU/compute language.
Brook is deprecated.
However, you can write code at a very low level using AMD's Shader Analyzer and Kernel Analyzer:
http://developer.amd.com/tools/shader/Pages/default.aspx
http://developer.amd.com/tools/AMDAPPKernelAnalyzer/Pages/default.aspx
For example, http://developer.amd.com/tools/shader/PublishingImages/GSA.png shows OpenCL code and the Radeon 5870 assembly produced from it.
You can actually code directly in several forms of "assembly", or at least you could; the web pages no longer mention this. (I used to have this installed for tuning and testing, but do not at the moment.)
More usually, you can code in one of several forms of AMD IL (Intermediate Language), which is closer to the machine than OpenCL. The Kernel Analyzer web page says: "If your kernel is an IL kernel Stream, KernelAnalyzer will automatically compile the IL..."
I would recommend that you use OpenCL, then look at the disassembly and tweak the OpenCL code to be better tuned. But you can work in IL, and probably still at an even lower level.
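As a rough illustration of the "look at the disassembly" step, here is a hypothetical sketch that builds a trivial kernel and dumps the program binary the runtime produces (via CL_PROGRAM_BINARIES) so it can be inspected offline; the usual route is simply pasting the kernel into the Kernel Analyzer GUI, so treat this as an optional programmatic alternative. Error handling is omitted for brevity, and the single-platform/single-device selection is an assumption:

#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main()
{
    const char* src =
        "__kernel void scale(__global float* data, float factor) {\n"
        "    size_t i = get_global_id(0);\n"
        "    data[i] *= factor;\n"
        "}\n";

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);

    // Query the device-specific binary the runtime produced for this program.
    size_t binSize = 0;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, nullptr);
    std::vector<unsigned char> binary(binSize);
    unsigned char* ptr = binary.data();
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr, nullptr);

    std::FILE* f = std::fopen("kernel.bin", "wb");
    std::fwrite(binary.data(), 1, binSize, f);
    std::fclose(f);

    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}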

Can I use OpenCL in an application that I distribute to non-developer machines?

I recently started to learn how to use OpenCL to speed up parts of my code. So far the speed gain is impressive: in one case the code ran up to 50x faster than on the CPU. However, I wonder if I can start using this code in a production environment. The reason is that the first time I tried to run the example code, nothing worked. I was able to make it run by downloading the driver from the NVIDIA OpenCL SDK download page (I have a GeForce GTX 260). It gave me a blue screen during installation, but after that I was able to run the example program and create my own code.
Does the fact that it didn't work "out of the box" for me mean that the mainstream drivers do not yet support it, despite the driver download page specifically saying they do? What about ATI support? Will everyone have to download the special driver that gave me a blue screen on install?
In short, is OpenCL ready for production code?
If someone can give me some details, I'd like to know. Has anyone been able to run a simple program on a number of different devices without installing anything SDK-related?
You may find an accurate answer on the OpenCL forums on the Khronos Group message boards. The OpenCL work group hangs out there regularly.
"Has anyone been able to run a simple program on a number of different devices without installing anything SDK-related?"
Nope. For instance, on ATI GPUs end users need to install the ATI Stream SDK in order to run OpenCL code (just having an up-to-date graphics driver is not sufficient).
You may want to consider trying DirectCompute (Microsoft's version of GPU programming) or doing your OpenCL work on a Snow Leopard Mac. Those are the two ways (that I know of) that you can deliver a GPU programming solution to another user without any driver or other installation hassle.
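If you do ship OpenCL, a common defensive pattern is to probe for the runtime at startup instead of linking the import library directly, so the application can fall back to a CPU path on machines without a suitable driver. A minimal Windows-specific sketch (the functions used are real OpenCL/Win32 calls, but the overall wiring is just an assumption about how you might integrate it):

#include <windows.h>

// clGetPlatformIDs uses the stdcall convention on 32-bit Windows.
typedef int (__stdcall *clGetPlatformIDs_fn)(unsigned int, void**, unsigned int*);

bool openCLAvailable()
{
    HMODULE lib = LoadLibraryA("OpenCL.dll");   // installed by the vendor driver / ICD loader
    if (!lib) return false;

    auto getPlatforms = reinterpret_cast<clGetPlatformIDs_fn>(
        GetProcAddress(lib, "clGetPlatformIDs"));
    if (!getPlatforms) { FreeLibrary(lib); return false; }

    unsigned int numPlatforms = 0;
    bool ok = (getPlatforms(0, nullptr, &numPlatforms) == 0 /* CL_SUCCESS */)
              && numPlatforms > 0;
    FreeLibrary(lib);
    return ok;
}

At the time this question was asked, OpenCL.dll was often missing from stock drivers, which is exactly the "nothing worked out of the box" problem described above; a probe like this at least lets your application degrade gracefully instead of failing to launch.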
