DirectX 11 - 2nd GPU is slower?

I have two GPUs, both the same (Nvidia 680s), with SLI disabled. If I create the device and device context with the adapter enumerated at index 0 (one GPU), my simple clear-surface program runs at 0.05 ms/frame. However, if I run it with the adapter at index 1 instead (the other GPU), it takes over 1 ms/frame.
How can one of my GPUs be so much slower than the other? They are both installed in the correct PCIe 3.0 x16 slots according to the motherboard documentation.
Am I missing something? I've looked over the code 1000 times and have virtually ruled out a coding mistake - I simply swapped the adapter used when creating the device and swap chain.

You're not providing enough information about what your code does, specifically with respect to display.
My guess would be that one GPU controls the display you're using for your output, and the other does not. So displaying your render is immediate on the first GPU, but requires a full framebuffer copy between the two GPUs in the other case.
That copy has to go over PCIe. A x16 PCIe 3.0 link is still only about 15.6 GB/s, so copying a 1920x1200 32-bit framebuffer (roughly 9 MB) takes on the order of 0.6 ms per frame - about the extra time you're seeing once overhead is included.
Can you give more details about your resolution and display setup? Was this in full-screen?
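For reference, here is a minimal sketch (not from the original post) of how you can check which adapter each monitor is attached to, using only standard DXGI calls; output names and adapter ordering shown are illustrative:
#include <windows.h>
#include <dxgi.h>
#include <wrl/client.h>
#include <cwchar>
#pragma comment(lib, "dxgi.lib")
using Microsoft::WRL::ComPtr;
int main()
{
    ComPtr<IDXGIFactory1> factory;
    if (FAILED(CreateDXGIFactory1(IID_PPV_ARGS(&factory))))
        return 1;
    ComPtr<IDXGIAdapter1> adapter;
    for (UINT i = 0; factory->EnumAdapters1(i, &adapter) != DXGI_ERROR_NOT_FOUND; ++i)
    {
        DXGI_ADAPTER_DESC1 desc{};
        adapter->GetDesc1(&desc);
        wprintf(L"Adapter %u: %ls\n", i, desc.Description);
        // An adapter with no outputs is not driving a monitor; presenting from it
        // forces a cross-adapter copy to whichever GPU owns the display.
        ComPtr<IDXGIOutput> output;
        for (UINT j = 0; adapter->EnumOutputs(j, &output) != DXGI_ERROR_NOT_FOUND; ++j)
        {
            DXGI_OUTPUT_DESC odesc{};
            output->GetDesc(&odesc);
            wprintf(L"  Output %u: %ls (attached to desktop: %d)\n",
                    j, odesc.DeviceName, odesc.AttachedToDesktop);
        }
    }
    return 0;
}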

Related

Enforce use of independent flip mode with DXGI FLIP SwapChain

I currently face a problem with DXGI Swapchains (DirectX 11). My C++ application shows (live) video and my goal is to minimize latency. I have no user input to process.
In order to decrease latency I switched to a DXGI_SWAP_EFFECT_FLIP_DISCARD swap chain (I used BitBlt before - see "For best performance, use DXGI flip model" for further details). I use the following flags:
//Swapchain Init:
sc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;
sc.Flags = DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT | DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING;
//Present:
m_pDXGISwapChain->Present(0, DXGI_PRESENT_ALLOW_TEARING);
On one computer the swapchain (windowed) goes into the "Hardware: Independent Flip" mode and I have perfectly low latency, as long as I have no other windows in front. On all other computers I tried, I am stuck in the "Composed: Flip" mode with higher latency (~ 25-30 ms more). The software (binary) is exactly the same. I am checking this with PresentMon.
What I find interesting is that on the computer where independent flip works, it is also active without the ALLOW_TEARING flags, even though, as I understand it, they should be required for it. Incidentally, I also see tearing in that case, but that is a separate problem.
I have already tried comparing Windows 10 versions, graphics drivers and driver settings. The GPU is a Quadro RTX 4000 on all systems. I couldn't spot any difference between the systems.
I would really appreciate any hints on additional preconditions for the independent flip mode I might have missed in the docs. Thanks for your help!
Update 1: I moved the "working" system from Nvidia driver 511.09 to 473.47 (the latest stable release). After that I got the same behavior as on the other systems (no independent flip). After going back to 511.09 it worked again, so the driver seems to have an influence. The other systems also had 511.09 for my original tests, though.
Update 2: After working through all the DirectX debug output, it still does not behave as desired. I only get into independent flip mode in true full-screen mode, or in windowed mode when the window has no decorations and actually covers the whole screen. Unfortunately, when using the Graphics Tools for VS I never enter independent flip, so I cannot do further analysis there. Interestingly, while running under the Graphics Tools debugger, PresentMon shows Composed Flip, yet the Graphics Analyzer from the Graphics Tools shows only DISCARD as the SwapEffect of the swap chain. I would have expected FLIP_DISCARD, since I explicitly used DXGI_SWAP_EFFECT_FLIP_DISCARD.
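For context, this is roughly how the waitable-object path is usually wired up; a minimal sketch assuming hwnd, d3dDevice and dxgiFactory (an IDXGIFactory2, on a system that supports tearing) already exist - these names are illustrative, not from the post above:
// Flip-model swap chain with a frame-latency waitable object.
DXGI_SWAP_CHAIN_DESC1 sc = {};
sc.Width = 0;                                  // 0 = take the window client size
sc.Height = 0;
sc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
sc.SampleDesc.Count = 1;
sc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
sc.BufferCount = 2;                            // flip model needs at least 2 buffers
sc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;
sc.Flags = DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT |
           DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING;
Microsoft::WRL::ComPtr<IDXGISwapChain1> swapChain1;
dxgiFactory->CreateSwapChainForHwnd(d3dDevice.Get(), hwnd, &sc,
                                    nullptr, nullptr, &swapChain1);
Microsoft::WRL::ComPtr<IDXGISwapChain2> swapChain2;
swapChain1.As(&swapChain2);
swapChain2->SetMaximumFrameLatency(1);         // keep at most one frame queued
HANDLE frameLatencyWaitable = swapChain2->GetFrameLatencyWaitableObject();
// Per frame: block until DXGI is ready to accept a new frame, then render and present.
WaitForSingleObjectEx(frameLatencyWaitable, 1000, TRUE);
// ... render the video frame ...
swapChain2->Present(0, DXGI_PRESENT_ALLOW_TEARING);
Note that the ALLOW_TEARING flag should only be used after confirming support via IDXGIFactory5::CheckFeatureSupport with DXGI_FEATURE_PRESENT_ALLOW_TEARING.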

DirectX RenderContext RAM/VRAM

I have 8 GB of VRAM (GPU) and 16 GB of normal RAM. When I allocate (create) many large textures, say 4096x4096, I eventually run out of VRAM; from what I can see, the resources are then created in system RAM instead. Whenever I need to render with (or to) such a resource, it seems to be transferred from RAM to VRAM first. If many of these resources are accessed over and over every frame (60 fps etc.), the PC lags out as it shuffles very large amounts of data back and forth. However, as long as the number of resources touched each second that are not recently used (i.e. still in RAM rather than VRAM) stays small, there should not be a performance issue. My question is whether this understanding is correct.
DirectX will allocate DEFAULT pool resources from video RAM and/or the PCIe aperture RAM which can both be accessed by the GPU directly. Often render targets must be in video RAM, and generally video RAM is faster memory--although it greatly depends on the exact architecture of the graphics card.
What you are describing is the 'over-commit' scenario where you have allocated more resources than actually fit in the GPU-accessible resources. In this case, DirectX 11 makes a 'best-effort' which generally involves changing virtual memory mapping to get the scene to render, but the performance is obviously quite poor compared to the more normal situation.
DirectX 12 leaves dealing with 'over-commit' up to the application, much like everything else in DirectX 12, where the "runtime magic behavior" has generally been removed. See the docs for details on this behavior, as well as this sample.
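Not part of the original answer, but if you want to see how close you are to the over-commit point at runtime on Windows 10, one option is the DXGI 1.4 video memory query; a rough sketch, assuming you already hold the IDXGIAdapter your device was created on:
#include <dxgi1_4.h>
#include <wrl/client.h>
#include <cstdio>
// Query the local (video memory) budget and current usage for an adapter.
void PrintVideoMemoryBudget(IDXGIAdapter* adapter)
{
    Microsoft::WRL::ComPtr<IDXGIAdapter3> adapter3;
    if (FAILED(adapter->QueryInterface(IID_PPV_ARGS(&adapter3))))
        return;  // DXGI 1.4 not available
    DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
    adapter3->QueryVideoMemoryInfo(0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info);
    std::printf("VRAM budget: %llu MB, currently used: %llu MB\n",
                (unsigned long long)(info.Budget / (1024 * 1024)),
                (unsigned long long)(info.CurrentUsage / (1024 * 1024)));
    // Staying well under info.Budget avoids the 'over-commit' paging described above.
}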

ARM memory mapping: INT15 equivalent? Standard way to query memory map?

On PC architectures (where the presence and usage of the BIOS is pretty much standardized), you can discover the size of the RAM, as well as which regions are reserved or free for use, via the INT 15h BIOS interrupt, function 0xE820.
Since I'm passionate about low-level programming, and after programming Intel architectures for approximately 6 months, I decided I should try to learn how other architectures work, so I've started digging into ARM development. I've got two boards I'm currently working on: the Olimex A20 OlinuXino-MICRO and Samsung's Arndale Exynos 5250. What I'm trying to do is port a hypervisor I've developed for Intel architectures to these two boards. I am now at the stage of programmatically detecting the memory map of the system in a reliable and reasonably standardized way (I would prefer not to write entirely different code for different ARM boards). But so far I find the relevant documentation a little confusing.
On the Olimex A20 I've got a Cortex-A7 ARM CPU.
The PDF found here: http://infocenter.arm.com/help/topic/com.arm.doc.den0001c/DEN0001C_principles_of_arm_memory_maps.pdf , which applies to the Cortex-A7 and other CPUs, states on page 14 that the address space from 1 GB to 2 GB is reserved for memory-mapped I/O devices, whereas the Olimex A20 documentation found at this link https://github.com/OLIMEX/OLINUXINO/blob/master/HARDWARE/A20-PDFs/A20%20User%20Manual%202013-03-22.pdf?raw=true states on page 21 that the address space from 1 GB to 3 GB is DDR-II/DDR-III memory.
Am I simply confused or is there an inconsistency between these two documents?
The memory maps on ARM chips are highly chip-specific. There is also usually nothing like a BIOS, so your bootloader or hypervisor will have to figure out the memory layout on its own.
Generally you'd need to work with the SDRAM controller to query and initialize the installed SDRAM chips. This is a non-trivial and, once again, very chip-specific process. You should check the code of bootloaders (e.g. U-Boot) you have available for your chips and look for the memory init code.
However, in many cases the memory "map" (the start of RAM and its size) is simply hardcoded for each board the bootloader is ported to, since it's unlikely to change from boot to boot.
Historically, ARM boot-loaders pass information to the Linux kernel using ATAG structures, as described in Booting ARM Linux. At a minimum, the boot-loader is expected to initialize the RAM in the system and pass ATAG_MEM structures describing where the RAM lives in the address space. Interpreting these structures would give you some of the information you need, but it doesn't tell you anything about peripheral devices. In this booting method the machine type is used to trigger platform-specific code that initializes the rest of the hardware.
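To make the ATAG path concrete, here is a compressed sketch of walking the tag list as laid out in the Booting ARM Linux document; the tag values are the documented ones, but the starting pointer (handed to the kernel in r2 by the boot-loader) and the record_ram_bank hook are placeholders you would supply yourself:
#include <cstdint>
// Tag identifiers from the "Booting ARM Linux" document.
constexpr std::uint32_t ATAG_NONE = 0x00000000;
constexpr std::uint32_t ATAG_MEM  = 0x54410002;
struct atag_header {
    std::uint32_t size;   // tag length in 32-bit words, including this header
    std::uint32_t tag;    // tag identifier
};
struct atag_mem {
    atag_header   hdr;
    std::uint32_t size;   // size of the RAM bank in bytes
    std::uint32_t start;  // physical start address of the bank
};
// Placeholder: record one RAM bank in your hypervisor's memory map.
static void record_ram_bank(std::uint32_t start, std::uint32_t size) { (void)start; (void)size; }
// 'tags' is the address the boot-loader passed to the kernel in r2.
void scan_atags(const atag_header* tags)
{
    for (const atag_header* t = tags; t->tag != ATAG_NONE;
         t = reinterpret_cast<const atag_header*>(
                 reinterpret_cast<const std::uint32_t*>(t) + t->size))
    {
        if (t->tag == ATAG_MEM) {
            const atag_mem* mem = reinterpret_cast<const atag_mem*>(t);
            record_ram_bank(mem->start, mem->size);
        }
    }
}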
The new way of doing this is through the Flattened Device Tree. The device tree originated with OpenFirmware and, besides describing the RAM mapping, can also describe the rest of the hardware and peripherals.

Super slow Image processing on Android tablet

I am trying to implement the SLIC superpixel algorithm on an Android tablet (SLIC).
I ported the C++ code to the Android environment using the STL library and so on. What the application does is take an image from the camera and send the data to be processed in native code.
I got the app running, but the problem is that it takes 20-30 seconds to process a single frame (640 x 400), while on my notebook the Visual Studio build finishes almost instantly!
I checked for memory leaks and there aren't any... Is there anything that might make the computation so much more expensive than in VS2010 on the notebook?
I know this question might be very open-ended and not really specific, but I'm really in the dark too. Hope you guys can help.
Thanks
PS. I checked the running time of each step; it seems the execution time of every line of code just went up. I don't see any specific function that takes disproportionately longer than usual.
PSS. Do you think the following may cause the slowdown?
Memory size: investigated; the GC does not report much pause time during the native calls
STL library: not investigated yet; is it possible that STL facilities like vector, max and min cause a significant slowdown?
The Android environment itself?
The lower hardware specification of the Android tablet (Acer Iconia Tab - 1 GHz Nvidia Tegra 250 dual-core processor and 1 GB of RAM)?
Would it be better to run it in Java?
PSSS. If you have time, please check out the code.
I've taken a look at your code and can make the following recommendations:
First of all, you need to add the line APP_ABI := armeabi-v7a to your Application.mk file. Otherwise your code is compiled for the old armv5te architecture, where there is no FPU (all floating-point arithmetic is emulated), fewer registers are available, and so on.
Your SLIC implementation makes intensive use of double-precision floating-point values. You should replace them with float wherever possible, because double arithmetic is much slower than float on this class of ARM hardware (and NEON SIMD, where present, has no double-precision support).
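To illustrate the second point, here is the usual SLIC distance measure written with single precision; this is a sketch for illustration only, not taken from the linked code:
#include <cmath>
// One pixel / cluster centre in CIELAB plus image coordinates, stored as float.
struct Pixel { float l, a, b, x, y; };
// SLIC-style distance: colour distance combined with spatial distance,
// where S is the grid interval and m the compactness weight.
inline float slicDistance(const Pixel& p, const Pixel& c, float S, float m)
{
    const float dc2 = (p.l - c.l) * (p.l - c.l)
                    + (p.a - c.a) * (p.a - c.a)
                    + (p.b - c.b) * (p.b - c.b);
    const float ds2 = (p.x - c.x) * (p.x - c.x)
                    + (p.y - c.y) * (p.y - c.y);
    return std::sqrt(dc2 + (ds2 / (S * S)) * (m * m));
}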

Cachegrind under Xen

I have a C++ application that someone else wrote in a way that's supposed to take maximal advantage of CPU caches. This application runs on a guest Ubuntu OS under Xen paravirtualization. I ran cachegrind and got very low cache miss rates.
Since my OS is virtualized, can I be sure these values correctly show that the CPU cache is being used well by my application?
Cachegrind is a simulator. A real CPU may perform differently (e.g. your real CPU may have a different cache hierarchy from the one cachegrind models, different cache sizes, a different replacement policy and so forth). To know for sure how well your program really performs on real hardware with respect to the cache, you would need to watch the real CPU's performance counters.
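As an aside (not from the original answer): on Linux you can read the hardware counters directly with perf_event_open; the minimal sketch below counts cache misses around a region of interest. Whether this works inside a paravirtualized Xen guest depends on whether the hypervisor exposes the PMU to the guest at all - if it doesn't, the call fails or the count stays at zero, which itself answers the question.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
// Open a hardware cache-miss counter for the calling thread.
static int open_cache_miss_counter()
{
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return static_cast<int>(syscall(__NR_perf_event_open, &attr,
                                    0 /*this thread*/, -1 /*any CPU*/,
                                    -1 /*no group*/, 0));
}
int main()
{
    int fd = open_cache_miss_counter();
    if (fd < 0) { std::perror("perf_event_open"); return 1; }
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    // ... run the cache-sensitive part of the application here ...
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    std::uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    std::printf("hardware cache misses: %llu\n",
                static_cast<unsigned long long>(misses));
    close(fd);
    return 0;
}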
