Enforce use of independent flip mode with DXGI FLIP SwapChain - directx

I currently face a problem with DXGI Swapchains (DirectX 11). My C++ application shows (live) video and my goal is to minimize latency. I have no user input to process.
In order to decrease latency I switched to a DXGI_SWAP_EFFECT_FLIP_DISCARD swapchain (I used BitBlt before - see For best performance, use DXGI flip model for further details). I use the following flags:
//Swapchain Init:
sc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;
sc.Flags = DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT | DXGI_SWAP_CHAIN_FLAG_ALLOW_TEARING;
//Present:
m_pDXGISwapChain->Present(0, DXGI_PRESENT_ALLOW_TEARING);
On one computer the swapchain (windowed) goes into the "Hardware: Independent Flip" mode and I have perfectly low latency, as long as I have no other windows in front. On all other computers I tried, I am stuck in the "Composed: Flip" mode with higher latency (~ 25-30 ms more). The software (binary) is exactly the same. I am checking this with PresentMon.
What I find interesting is that on the computer where the independent flip mode works, it is also active without the ALLOW_TEARING flags - from what I understood they should be required for it. I btw also see tearing in this case, but that is a different problem.
I already tried to compare Windows 10 versions, graphic drivers and driver settings. GPU is a Quadro RTX 4000 for all systems. I couldn't spot any difference between the systems.
I would really appreciate any hints on additional preconditions for the independent flip mode I might have missed in the docs. Thanks for your help!
Update1: I updated the "working" system from Nvidia driver 511.09 to 473.47 (latest stable). After that I got the same behavior like on the other systems (no ind. flip). After going back to 511.09 it worked again. So the driver seems to have influence. The other systems also had 511.09 for my original tests though.
Update2: After dealing with all DirectX debug outputs, it still does not work as desired. I manage to get into independent flip mode only in real full screen mode or in windowed mode where the window has no decorations and actually covers the whole screen. Unfortunately, using the Graphics Tools for VS I never enter the independent flip and cannot do further analysis here. But it is interesting that when using the Graphics Tools debug, PresentMon shows Composed Flip, but the Graphics Analyzer from the Graphics Tools shows only DISCARD as SwapEffect for the SwapChain. I would have expected FLIP_DISCARD as I explicitly used DXGI_SWAP_EFFECT_FLIP_DISCARD.

Related

Multiple boards Windows Device Manager prompts resource conflict, error code is 12

I inserted multiple same boards into the system. Device PCIe is implemented with Xilinx IP core. After each FPGA program is programed, manually refresh the device manager to check whether the device and driver are working properly.
My confusion is that it seems that this approach can only work on two boards at the same time. After the third board is programmed , and then refresh the task manager, the system prompts insufficient resources, "This device cannot find enough free resources that it can use (Code 12)"
I tried to disable the other two boards, but the device still prompted conflicts. I don't know how to query conflicting resources.
My boards have 2 BARs(BAR0: 2KB, BAR1:16MB), and 1 IRQ.
I did a few experiments and felt that it was caused by the conflict of memory resources. The conflicting party was the AMD integrated graphics card that came with the motherboard.
1st, 2nd, 3rd all indicate my board number.
3rd is not recognized because of conflicts.
After I shut it down, I plugged in all three boards and then turned it on again. As a result, during the startup process, the system restarted again after a sudden power failure, and then all three boards were normal. At this time, it is found that the memory address of the graphics card has changed
I want to know how to resolve the conflict? modify my driver code or FPGA configuration?
PCIe enumeration should resolve memory allocations issues, however there are a couple implementation issues to be aware of. Case in point, I have used Xilinx XDMA's with a 64 bit BAR of 2GB in size and I have literally bricked a DELL XPS motherboard. But I have done the same with an IBM system and it just worked. The point here is that enumeration can be done with Firmware, Hardware or OS driven event. If you doing hardware manager, that sounds like OS driven, but when I toasted the XPS board, that was some kind of FW issue that was related to BAR size that resulted in a permanent failure. 16MB isn't big and it should not be a problem, but I would recommend going with the Xilinx defaults first and show reliability there. I think it's one BAR at 1M. I have run 3x 64 bit BARS at 1MB a piece without issue, but keep it simple and show reliability. Then move up. This will help isolate if it is a system flakiness or not.
I have seen some system use FW based enumeration that comes up really fast, before the FPGA has configured, in which case there is no PCIe target to ID. If you frequently find that your FPGA is not detected on power up, but detected on a rescan, this can be a symptom. How to resolve this is a bit of a pain. We ended up using partial reconfiguration. Start with the PCIe interface, then have a reconfig to load the remaining image. Let's hope it isn't this problem
The next thing is to be aware of is your reset mechanism within the FPGA. You probably hooked the PCIe IP reset to the bus reset which is great, however, I have in the past also hooked that reset to internal PLL locked signals that may not be up. For troubleshooting purposes, keep that reset simple and get rid of everything else to show that just the PCIe IP by itself is reliable first.
You also have to be careful here too. If you strip things down, make sure it is clean. If you ignore the PLL lock and try to use a Xilinx driver, such as the XDMA driver, it has a routine where it tries to identify the XDMA with data transactions. It is looking for the DMA BAR one BAR at a time. But when it does this, the transaction it attempts may go out on the AXI bus if the BAR isn't the XDMA control BAR. If the AXI bus isn't out of reset or clocked when this happens you will lock the AXI bus and I have locked up a Linux box this way on several occasions. AXI requires that transactions complete, otherwise it just sits there waiting.
BTW, on a Linux box, you can look at the enumeration output in the kernel log. I'm not sure if Windows shows you the same thing or not. But this can be helpful if you see that the device was initially probed but then something invalid was detected in the config register, versus not being seen at all.
So a couple things to look at.

How can I investigate failing calibration on Spartan 6 MIG DDR

I’m having problems with a Spartan 6 (XC6SLX16-2CSG225I) and DDR (IS43R86400D) memory interface on some custom hardware. I've tried on a SP601 dev board and all works as expected.
Using the example project, when I enable soft_calibration, it never completes and calib_done stays low.
If I disable calibration I can write to the memory perfectly as far as I can see. But when I try to read from it, I get a variable number of successful read commands before the Xilinx memory controller stops implementing the commands. Once this happens, the command fifo fills up and stays full. The number of successful commands varies from 8 to 300.
I'm fairly convinced it's a timing issue, probably related to DQS centering. But because I can't get calibration to complete when enabled, I don't have continuous DQS Tuning. So I'm assuming it works with calibration disabled until the timing drifts.
Is there any obvious places I should be looking for why calibration fails?
I know this isn't a typical stack overflow question, so if it's an inappropriate place then I'll withdraw.
Thanks
Unfortunately, the calibration process just tries to write and read content successively while adjusting taps internally. It finds one end of success then goes the other direction and identifies that successful tap and then final settles on some where in the middle.
This is probably more HW centric as well, so I post what I think and let someone else move the thread.
Is it just this board? Or is it all of them that are doing it? Have you checked? If it's one board, and the RAM is BGA style, it could be a bad solider job. Push you finger down slightly on the chip and see if you get different results... After this is gets more HW centric
Does the FPGA image you are running on your custom board, have the ability to work on your devkit? A lot of times, that isn't practical I know, but I thought I would ask as it rules out that the image you are using on the devkit has FPGA constraints you aren't getting in your custom image.
Check your length tolerances on the traces. There should have been a length constraint. Plus or minus 50 mils something like that. No one likes to hear they need a board re-spin, but if those are out, it explains a lot.
Signal integrity. Did you get your termination resistors in there and are they the right values? Don't supposed you have an active probe?
Did you get the right DDR memory. Sometimes they use a different speed grade and that can cause all sorts of issue.
Slowing down the interface will usually help items 4 and 5. so if you are just trying to work done, you might ask for a new FPGA image with a slower clock.

Creating synchronized stereo videos using webcams

I am using OpenCV to capture video streams from two USB webcams (Microsoft LifeCam Studio) in Ubuntu 14.04. I am using very simple VideoCapture code (source here) and am trying to at least view two videos that are synchronized against each other.
I used Android stopwatch apps (UltraChron Stopwatch Lite and Stopwatch Timer) on my Samsung Galaxy S3 mini to realize that my viewed images are out of sync (show different time on stopwatch).
The frames are synced maybe in 50% of the time. The frame time differences I get are from 0 to about 300ms with an average about 120ms. It seems that the amount of timeout used has very little effect on sync (same for 1000ms or 2000ms). I tried to minimize the timeout (waitKey(1) for the OpenCV loop to work at all) and read every Xth iteration of the loop - this gave worse results that waitKey(1000). I run in FullHD but lowering resolution to 640x480 had no effect.
An ideal result would be a 100% synchronized stereo video stream that has X FPS. As I said I so far use OpenCV to view video still images, but I do not mind using anything else to get the desired result (can be on Windows too).
Thanks for help in advance!
EDIT: In my search for low-cost hardware I fount that it is probably possible to do some commodity hardware hacking (link here) and inject a single clock signal into multiple camera modules simultaneously to get the desired sync. The guy who did that seems to have developed his GENLOCKed camera board (called NerdCam1) and even a synced stereo camera board that he now sells for about €200.
However, I have almost zero ability of hardware hacking. Also I am not sure if such clock injection is possible for resolutions above NTSC/PAL standard (as it seems to be an "analog" solution?). Also, I would prefer a variable baseline option where both cameras would not be soldered on a single board.
It is not possible to stereo sync two common webcams because webcams lack external trigger feature that lets one precisely sync multiple cams using a common trigger signal. Such trigger may be done both in SW or HW but the latter will give better precision. Webcams only support "free-running" mode and let you stream whatever FPS they support but you can not influence when exactly the frame integration/exposure is done.
There are USB cameras with a dedicated external trigger feature (usually scientific cameras like Point Grey) - they are more expensive (starting at about $300/piece) than webcams but can be synced. If you really are on low budget you can try to hack the PS3 Eye camera to get the ext. trigger feature.

DirectX 11 - 2nd GPU is slower?

I have two GPUs, both the same (nvidia 680s) with SLI disabled. If I create the device and device context with the enum 0 adapter (GPU) my simple clear surface program runs at 0.05ms/frame. However, if I run it with the enum 1 adapter instead (other GPU) it runs at over 1ms/frame.
How is it that one of my GPUs can be so much slower than the other? They are both installed in the correct PCI 3.0 16x slots as according to the motherboard.
Am I missing something? I've looked over the code 1000x and have virtually ruled out a mistake in coding - I simply swapped between the adapter used in creating the device and swap chain.
You're not providing enough information as to what your code does, specifically with respect to display.
But my guess would be that one GPU controls the display you're using for your output, and the other does not. So displaying your render is immediate on the first GPU, but requires a full framebuffer copy between the 2 GPUs on the other case.
This will have to go over PCIe. 16x PCIe 3 is still 15.6GB/s, so that would imply ~10MB of transfer, which is about what a 1920x1200 display is.
Can you give more details as to what your resolution and display look like? was this in full-screen?

Super slow Image processing on Android tablet

I am trying to implement SLIC superpixel algorithm in Android tablet (SLIC)
I port the code which in C++ to work with android environment using stl-lib and all. What application doing is taking an image from camera and send data to process in native code.
I got the app running but the problem is that it took 20-30 second to process a single frame (640 x 400) while in my notebook running with visual studio application would be almost instantly finish!
I check the memory leak, their isn't any... is their anything that might cause computation time to be way more expensive than VS2010 in notebook?
I know this question might be very open and not really specific but I'm really in the dark too. Hope you guys can help.
Thanks
PS. I check running time for each process, I think that every line of code execution time just went up. I don't see any specific function that take way longer than usual.
PSS. Do you think follow may cause the slow?
Memory size : investigated, during native not much of paused time show from GC
STL-library : not investigate yet, is it possible that function like vector, max and min running in STL may cause significant slow?
Android environment it self?
Lower hardware specification of Android Tablet (Acer Iconia tab - 1GHz Nvidia Tegra 250 dual-core processor and has 1GB of RAM)
Would be better to run in Java?
PSSS. If you have time please check out the code
I've taken a look to your code and can make the following recommendations:
First of all, you need to add the line APP_ABI := armeabi-v7a into your Application.mk file. Otherwise your code is compiled for old armv5te architecture where you have no any FPU (all floating point arithmetic is emulated), have less registers available and so on.
Your SLIC implementation intensively uses double floating-point values for computation. You should replace them with float wherever possible because ARM still misses hardware support for double type.

Resources