Is PIX replay using actual driver? - directx

If I run a 3D application (like a benchmark tool or game) under PIX and replay the capture later, does the replay actually call the same API (and thus invoke the actual driver and GPU, rather than falling back to 3D emulated on the CPU) the same way the original 3D application did? I'm focusing only on the Direct3D API part.
Is there any other way I can do the capture? PIX fails to capture some applications.
Is there a way for me to capture only a subset of the rendering, say only the middle 50 frames?

Related

Should CUDA stream be waited to be complete even if the output data are to be sent to OpenGL instead of CPU?

This is a general question, and although I use OpenCV as a framework, the question is broader than OpenCV's realm.
I am developing an image processing tool that gets an image from a webcam (yielding a cv::Mat located in host memory), uploads it to GPU device memory in CUDA (i.e. a cv::GpuMat), does some processing using CUDA to get a result finalCudaMat, and finally sends the result to OpenGL (i.e. cv::ogl::Buffer::mapDevice + finalCudaMat.copyTo(mappedOglBuffer)). Everything works as intended.
Since the whole process involves multiple steps, I use a CUDA stream object (cv::cuda::Stream) to be able to make CUDA calls asynchronous and not wait on every single operation to be finished on CPU side. Now if someone instead is to eventually copy the result to a CPU matrix (i.e. finalCudaMat.download(finalCpuMat)), as in a customary situation, typically a wait on the stream is required (cudaStream.waitForCompletion()) to ensure the result is ready before using the CPU side matrix.
In my case, the result never gets back to the CPU, as it goes on to be rendered on the screen (a bit of OpenGL operations and shaders are also involved).
On the one hand, it might be appropriate to wait for the CUDA work to finish before starting to copy the GpuMat to the OpenGL buffer. If I add the wait on the stream, everything works fine and the CUDA operations take ~2.5ms.
On the other hand, it feels like I don't need to wait for completion of the stream (all the results are consumed by the GPU anyway -- the CPU is never involved again). So I can remove the cudaStream.waitForCompletion() call before performing finalCudaMat.copyTo(mappedOglBuffer), and everything seems to work fine. The whole CUDA processing operation (basically any GPU task minus the OpenGL-related ones) then apparently takes ~1.8ms for me.
In the past I have had bad experiences with not properly synchronizing GPU work when two different APIs were involved (e.g. do something on Direct3D 9, do not wait for it to finish, and then copy the resulting texture to a Direct3D 10 texture, and clearly on some frames the image becomes empty or torn).
At this point, the difference is tiny and doesn't affect my 60 FPS throughput. But I wonder whether removing the wait-on-stream operation is technically correct. Any thoughts on this? Or maybe a document regarding OpenGL/CUDA interop that could help me?
The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability
In particular it says that
Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.
That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:
This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.
So you won't need to make the CPU wait on CUDA completion, but you must call cudaGraphicsUnmapResources which will put the appropriate barrier in the asynchronous instruction stream. Note that unlike your CPU transfer code, this call goes after CUDA copies data into the OpenGL buffer.
As Ben Voigt already pointed out, CUDA requires explicit synchronization with OpenGL (or any other graphics API that interoperates with it). Now this used to be kind of a chore, where one had to submit callbacks to the compute stream and use them to manually work with e.g. OpenGL fences.
However, due to the advent of Vulkan and with it the support for external resources (and OpenGL extensions for that), you can in fact synchronize between CUDA and OpenGL command streams by having both sides import platform-native semaphores (cudaImportExternalSemaphore, GL_EXT_semaphore) and use them for mutual synchronization. It usually still involves a whole round trip through the CPU-side driver, but since that part has to manage the command streams anyway, it's not really an issue of efficiency.

Access the whole video memory

I'm looking for a way to read the whole video memory that a video card outputs to a display. That includes also hardware accelerated output, video playback and output in fullscreen mode (that somehow I feel could be different from windowed mode).
In short: I want to be able to capture everything that is going to be represented on a display.
I suppose that IF that's possible it would be OS-dependent. The targets I'm interested in are Windows, OS X and Linux.
Do you have any hint?
For Windows I guess you could take CamStudio, strip it down, use it to record the screen and then do whatever you want with the output. Other than that, you could look into forensic kernel drivers for accessing RAM. It's not exactly as simple as a pointer pointing to the video memory anymore, haha.
Digital Rights Management, a requested feature of Windows, attempts to block your access to blocks of graphics-card frame buffer memory. Using an open-source driver under Linux would seem to be the only way to access this memory, or, as mentioned earlier, some 3rd party software that knows some back doors or hacks or ways to locate other programs' frame buffer space.
The exception, of course, is when you are trying to capture output from your own program (i.e. you are calling the video/graphics creation functions yourself): for that case there are APIs to manipulate display frames in DirectX and OpenGL.
I think I found some resources that can help to capture the display memory in Windows
Fastest method of screen capturing
How to save backbuffer to file in DirectX 10?
http://betterlogic.com/roger/2010/07/fast-screen-capture/

Creating synchronized stereo videos using webcams

I am using OpenCV to capture video streams from two USB webcams (Microsoft LifeCam Studio) in Ubuntu 14.04. I am using very simple VideoCapture code (source here) and am trying to at least view two videos that are synchronized against each other.
I used Android stopwatch apps (UltraChron Stopwatch Lite and Stopwatch Timer) on my Samsung Galaxy S3 mini to realize that my viewed images are out of sync (show different time on stopwatch).
The frames are synced maybe 50% of the time. The frame time differences I get range from 0 to about 300ms, with an average of about 120ms. It seems that the amount of timeout used has very little effect on sync (same for 1000ms or 2000ms). I tried to minimize the timeout (waitKey(1), which the OpenCV loop needs to work at all) and read every Xth iteration of the loop - this gave worse results than waitKey(1000). I run in FullHD, but lowering the resolution to 640x480 had no effect.
An ideal result would be a 100% synchronized stereo video stream that has X FPS. As I said I so far use OpenCV to view video still images, but I do not mind using anything else to get the desired result (can be on Windows too).
Thanks for help in advance!
EDIT: In my search for low-cost hardware I found that it is probably possible to do some commodity hardware hacking (link here) and inject a single clock signal into multiple camera modules simultaneously to get the desired sync. The guy who did that seems to have developed his GENLOCKed camera board (called NerdCam1) and even a synced stereo camera board that he now sells for about €200.
However, I have almost zero ability in hardware hacking. Also I am not sure if such clock injection is possible for resolutions above NTSC/PAL standard (as it seems to be an "analog" solution?). Also, I would prefer a variable baseline option where both cameras would not be soldered on a single board.
It is not possible to stereo sync two common webcams, because webcams lack an external trigger feature that lets one precisely sync multiple cams using a common trigger signal. Such a trigger may be done either in SW or in HW, but the latter will give better precision. Webcams only support "free-running" mode and let you stream at whatever FPS they support, but you cannot influence when exactly the frame integration/exposure is done.
There are USB cameras with a dedicated external trigger feature (usually scientific cameras like Point Grey) - they are more expensive (starting at about $300/piece) than webcams but can be synced. If you really are on a low budget you can try to hack the PS3 Eye camera to get the ext. trigger feature.
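Since free-running webcams cannot be triggered, about the most that software can do is measure the skew and drop badly mismatched pairs. The following is only a rough sketch of that idea in plain C++, not something from the question or the answer: it assumes you record a host-side arrival timestamp in milliseconds for every frame of each camera, and the names FramePair and pairByTimestamp are made up for illustration. It does not synchronize exposure in any way.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical record of one left/right match (indices into the input vectors).
struct FramePair {
    std::size_t left;
    std::size_t right;
    std::int64_t skewMs;  // right timestamp minus left timestamp
};

// For each left timestamp, finds the nearest right timestamp and keeps the
// pair only if the absolute skew is within maxSkewMs.
// Both inputs are assumed to be sorted in ascending order.
// Note: one right frame may be matched to more than one left frame.
std::vector<FramePair> pairByTimestamp(const std::vector<std::int64_t>& leftMs,
                                       const std::vector<std::int64_t>& rightMs,
                                       std::int64_t maxSkewMs) {
    std::vector<FramePair> pairs;
    std::size_t j = 0;
    for (std::size_t i = 0; i < leftMs.size() && !rightMs.empty(); ++i) {
        // Move forward while the next right frame is at least as close.
        while (j + 1 < rightMs.size() &&
               std::abs(rightMs[j + 1] - leftMs[i]) <=
                   std::abs(rightMs[j] - leftMs[i])) {
            ++j;
        }
        const std::int64_t skew = rightMs[j] - leftMs[i];
        if (std::abs(skew) <= maxSkewMs) {
            pairs.push_back({i, j, skew});
        }
    }
    return pairs;
}
```

With a tolerance of, say, 20 ms, only pairs whose arrival times are that close survive, which gives a lower frame rate but a known upper bound on the measured skew. Arrival timestamps include USB and driver latency, so the real exposure skew can still differ.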

Fastest way to get frames from webcam

I have a little wee of a problem developing one of my programs in C++ (Visual Studio). Right now I'm struggling with connecting multiple webcams (connected via USB cables), creating for each of them a separate thread to capture frames, and a separate frame for processing the image.
I use OpenCV to process frames, but the problem is that I don't get the most out of the webcam (it supports 25 fps, I get only 18). Is there some library that I could use to get frames, and then process them with OpenCV, that would make the frames be captured faster?
I was researching a bit and the most popular way is to use directshow to get frames and OpenCV to process them.
Do You agree? Or do You have another solution?
I wouldn't be offended by some links :)
DirectShow is only used if you open your capture using the CV_CAP_DSHOW flag, like:
VideoCapture capture( CV_CAP_DSHOW + 0 ); // 0,1,2, your cam id there
(Without it, it defaults to VfW.)
The capture already runs in a separate thread, so wrapping it with more threads won't give you any gain.
Another obstacle with multiple cams is the USB bandwidth, so if you have ports on the back and the front of your machine, don't plug all your cams into the same port/controller, or you will just saturate it.
OpenCV uses DirectShow. Using DirectShow (the primary video capture API in Windows) directly will obviously get you equal or better performance (even more likely so if OpenCV is set to use Video for Windows). USB cams typically hit the USB bandwidth limit, and hence a frame rate limit; using DirectShow to capture in compressed formats or in formats with fewer bits/pixel is the way to reach higher frame rates within the same USB bandwidth limit.
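To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch in plain C++. It makes assumptions that are not in the question: that the cameras sit on a USB 2.0 High Speed bus (480 Mbit/s signalling rate, with usable throughput noticeably lower) and that they deliver an uncompressed 16 bits/pixel format such as YUY2. The function and constant names are made up for illustration.

```cpp
#include <cstdint>

// Raw (uncompressed) video bit rate in bits per second.
std::uint64_t rawVideoBitsPerSecond(std::uint32_t width, std::uint32_t height,
                                    std::uint32_t bitsPerPixel,
                                    std::uint32_t fps) {
    return static_cast<std::uint64_t>(width) * height * bitsPerPixel * fps;
}

// Assumed bus: USB 2.0 High Speed signalling rate. Real payload throughput
// is lower, so this is an optimistic upper bound.
constexpr std::uint64_t kUsb2SignallingBitsPerSecond = 480000000ULL;

// True if the given total bit rate is below the optimistic USB 2.0 bound.
bool fitsUsb2(std::uint64_t totalBitsPerSecond) {
    return totalBitsPerSecond < kUsb2SignallingBitsPerSecond;
}
```

For example, rawVideoBitsPerSecond(1920, 1080, 16, 25) is 829,440,000 bits/s, which cannot fit on such a bus uncompressed, while rawVideoBitsPerSecond(640, 480, 16, 25) is 122,880,000 bits/s. Under these assumptions, four 640x480 streams on one controller (491,520,000 bits/s) already exceed even the optimistic bound, which is why a compressed format or separate controllers help.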
Another typical problem causing low frame rates is slow synchronous processing delaying the capture. You typically identify this by putting trivial processing into the same capture loop and seeing higher FPS compared to processing-enabled operation.
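The check described above can be sketched as a small timing helper in plain C++. This is only an illustration: measureLoopFps is a made-up name, and the capture and process callbacks are stand-ins for whatever your real loop calls (for instance a blocking frame read and your OpenCV processing).

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Runs `frames` iterations of capture followed by processing in the same
// loop and returns the achieved rate in frames per second.
// `capture` and `process` are hypothetical stand-ins for the real calls.
double measureLoopFps(const std::function<void()>& capture,
                      const std::function<void()>& process,
                      int frames) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < frames; ++i) {
        capture();  // e.g. a blocking read of the next frame
        process();  // the work suspected of delaying the capture
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return frames / elapsed.count();
}
```

Run it once with your real processing and once with an empty lambda as process. If the second run gets clearly closer to the camera's nominal rate, the processing is what holds the capture back, and moving it off the capture loop is worth trying.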

VMR9 Allocator and SwapEffect.Discard

I'm developing an application using a VMR9 Allocator.
The allocator allows me to draw the DirectShow filter graph output to a texture.
I noticed that if I don't use SwapEffect.Copy in exclusive mode, my video rate is below 25 (the nominal rate).
I need to use SwapEffect.Discard in order to activate multisampling.
Is there any workaround to use vmr9 allocator with SwapEffect.Discard?

Resources