Alternatives to D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS? - nvidia

This is a follow-on to this question about using the DX11VideoRenderer sample (a replacement for EVR that uses DirectX11 instead of EVR's DirectX9).
I've been trying to track down why it uses so much more CPU than the EVR. Task Manager shows me that most of that time is kernel mode.
Using profiling tools, I see that a LOT of time is being spent in numerous calls to NtDelayExecution (aka Sleep). How many calls? ~100,000 over the course of ~12 seconds. Ok, yeah, I'm sending a lot of frames in those 12 seconds, but that's still a lot of calls, every one of which requires a kernel mode transition.
The callstack shows the last call in "my" code is to IDXGISwapChain1::Present(0, 0). The actual call seems to be Sleep(0) and comes from nvwgf2umx.dll (which is why this question is tagged NVidia: hopefully someone there can call up the code and see what the logic is behind such frequent calls).
I couldn't quite figure out why it would need to do /any/ Sleeping during Present. It's not like we wait for vertical retrace anymore, is it? But the other reason to use Sleep has to do with yielding to other threads. Which led me to a serious clue:
If I use D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS, the CPU utilization drops. Along with some other fixes, the DX11 version is now faster and uses less CPU time than the DX9 version (which is what I would hope/expect). Profiling shows that Sleep has dropped from >30% to <1%.
Unfortunately, this page tells me:
This flag is not recommended for general use.
Oh.
So, any ideas on how to get decent performance without using debug flags?

iOS app slows down after a few seconds, speeds back up if paused and resumed in debugger

First of all, there's not a lot of detail I can offer, so I realize this question may seem incomplete. At this stage, I'm really looking for any ideas. Frankly, I'm just baffled by this one.
I'm building a graphics-heavy app that really maxes out the CPU. CPU utilization on the devices tends to be around 150% according to Xcode (I know that sounds weird; it seems to be out of a possible 200% because the device has two cores). I've instrumented the tasks that do the most processing so I can see how long they take in the debug output. Also note that I am compiling with -Ofast (aggressive optimizations), even for debug builds.
Here's the weird thing. About 5-10 seconds into running the CPU intensive mode of the app, everything slows down. It's very visible. Because of my instrumentation, I can see that suddenly everything takes about 3 times as long as it did before. It's pretty uniform across all tasks, and it doesn't speed back up. Here's the really weird thing. If I break execution in the debugger and resume, I get another 5-10 seconds of fast execution before it slows down again.
Looking at the CPU and memory usage reported by Xcode, everything stays about the same. The app uses no more than 90MB of memory at any point.
Is there a feature of iOS that slows down CPU intensive apps or underclocks the device to conserve battery life? I realize I'm sharing resources with the OS, but this is behavior I can reliably reproduce every time.
Again, I realize my question is vague, and there's no relevant code I can post. Any ideas about causes or even debugging methods are welcome.
First of all, thank you #thst for trying really hard to help me out. My question was kind of dumb since it really could have been anything.
I eventually solved the problem by rendering (via OpenGL, a detail I forgot to include, again showing how bad my question was) only when there is actually a change to the state of the objects and textures being rendered. It used to render at 60FPS all the time.
The app also uses CIDetector to detect faces. I think, but I am not sure, that CIDetector uses the GPU to perform its detection. If so, there might have been some contention for GPU resources. CIDetector blocking on a wait may have caused slowdown throughout the app.
Thanks everyone.

Is "Running Time", "CPU Usage" a useful metric under Instruments to draw any conclusions?

I have profiled an app on an iPhone 4 using "Time Profiler" and "CPU Monitor" and am trying to make sense of the results.
Given that execution time is 8 minutes, CPU "Running Time" is around 2 minutes.
About 67% of that is on the main thread, out of which 52% is coming from "own code".
Now, I can see the majority of time being spent in enumerating over arrays (and associated work), UIKit operations, etc.
The problem is, how do I draw any meaningful conclusions from this data, i.e. conclude that there is something wrong going on here that needs fixing?
I can see a lot of CPU load over that running time (median at 70%) that isn't "justifiable" given the nature of the app.
Having said that, there are some things that do stand out. Parsing HTTP responses on the main thread, creating objects eagerly (backed up by memory profiling as well).
However, what I am looking for here is the offending code, along with useful conclusions based solely on CPU running time, i.e. "too much time is being spent here".
Update
Let me try and elaborate in order to give a better picture.
Based on the functional requirements of this app, I can't see why it shouldn't be able to run on an iPhone 3G. A median CPU usage of around 70%, with a peak of 97%, can only look like a red flag on an iPhone 4.
The most obvious response to this is to investigate the code and draw conclusions from that.
What I am hoping for is a categorical answer of the following form:
- if you spend anywhere between 25% and 50% of your time in Core Animation, there is something wrong with your animations
- if you spend 1000 ms on anything related to UIKit, better check your processing
Then again, maybe there aren't any answers, only indications of things being off when it comes to running time and CPU usage.
The answer to the question "is there something wrong going on here that needs fixing" is simple: do you see a problem while using the application? If yes (you see glitches in the animation, or the app hangs for a while), you probably want to fix it. If not, you may be looking at premature optimization.
Nonetheless, parsing HTTP responses on the main thread may be a bad idea.
In developer presentations Apple has pointed out that, whilst CPU usage is not an accurate indicator in the simulator, it is something to take stock of when profiling on a device. Personally I would consider any thread that takes significant CPU time without good reason a problem that needs to be resolved.
Find the time sinks, prioritise by percentage, and start working through them. These may not be visible problems now, but they will begin to degrade the user's experience of the app, and potentially the device too, if they have not already.
Check out their documentation on how to effectively use CPU profiling for some handy hints.
If enumeration of arrays is taking a lot of time then I would suggest that dictionaries or other more effective caches could be appropriate, assuming you can spare some memory to ease CPU.
An effective approach may be to remove all business logic from the main thread (a given) and make a good boundary layer between the app and the parsing / business logic. From here you can better hook in some test suites that could better tell you if the code is at fault or if it's simply the significant requirements of the app UI itself...
Eight minutes?
Without beating around the bush, you want to make your application faster, right?
Forget looking at CPU load and wondering if it's the right amount.
Forget guessing if it's HTTP parsing. Maybe it is, but guessing won't tell you.
Forget rummaging around in the code timing things in hopes that you will find the problem(s).
You can find out directly why it is spending so much time.
Here's the method I use,
and here's an (amateurish) video of it.
Here's what will happen if you do that.
First you will find something you would never have guessed, and when you fix it you will lop a big chunk off that 8 minutes, like maybe down to 6 minutes.
Then you do it again, and lop off another big chunk.
You repeat until you can't find anything to fix, and then it will be much faster than your 8 minutes.
OK, now the ball is in your court.

How to improve accuracy of profiling

I want to improve the running time of some code.
In order to do that, I first time the running time of all relevant code, using code like this:
before := rdtsc;    // rdtsc: a helper that returns the CPU's time-stamp counter
myobject.run;
after := rdtsc;
Then I zoom in and time a relevant part, like so:
procedure myobject.part;
begin
  StartTime := rdtsc;
  ...
  EndTime := rdtsc;
  inc(TotalTime, EndTime - StartTime);
end;
I have some code to copy-paste the timings into Excel; a typical outcome would look like this:
(the 89.8% and 10.2% adding up to 100% is a coincidence and has nothing to do with the data or the question)
(when the data shows 1 it means 0 to avoid divide by zero errors)
Note the difference between run A and run B.
I have not changed anything yet so run A and B should give the same running time.
Further note that I know that on both runs procedure part was invoked exactly the same number of times (the data is the same and the algorithm is deterministic).
The running time of procedure part is very short (it is just called many times).
If there was some way to block out other processes during these short bursts of runtime (less than 700 CPU cycles) my timings would be much more accurate.
How do I get these timings to be more reliable?
Is there a way to monopolize the CPU to only run my task when timing and nothing else?
Note that I'm not looking for obvious answers like:
- Close other running programs
- Disable the virusscanner etc...
I've tagged the question Delphi because I'm using Delphi right now (and there may be some Delphi specific option to achieve this result).
I've also tagged it language-agnostic because there may be some more general way.
Update
Because I'm using the CPU instruction RDTSC I'm not affected by CPU throttling. If the CPU slows down, the number of cycles stays the same.
Update2
I have 2 answers, but neither answers the question...
The question is how do I prevent these changes in running time?
Do I have to run the code 20x and always compare the lowest running time out of the 20 runs?
Or do I set my program priority to realtime?
Or is there some other trick to use so my code sample does not get interrupted?
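To make it concrete, here is the kind of harness I have in mind, combining those ideas: pin the thread to one core, raise its priority, and keep the lowest cycle count out of a number of runs. This is only a minimal sketch for the 32-bit compiler (it needs the Windows unit); ReadTSC and MyObject.Part are stand-in names for illustration.

function ReadTSC: Int64;
asm
  rdtsc    // 32-bit compiler: RDTSC leaves the counter in EDX:EAX,
           // which is where an Int64 function result is returned
end;

procedure TimeCandidate;
const
  Runs = 20;
var
  i: Integer;
  Start, Cycles, Best: Int64;
begin
  // Pin to one core so the counter is always read on the same CPU,
  // and raise priority so the scheduler preempts this thread less often.
  SetThreadAffinityMask(GetCurrentThread, 1);
  SetThreadPriority(GetCurrentThread, THREAD_PRIORITY_TIME_CRITICAL);
  try
    Best := High(Int64);
    for i := 1 to Runs do
    begin
      Start := ReadTSC;
      MyObject.Part;              // the code under test (stand-in name)
      Cycles := ReadTSC - Start;
      if Cycles < Best then
        Best := Cycles;           // keep the run with the least interference
    end;
    // report Best somewhere, e.g. add it to the Excel export
  finally
    SetThreadPriority(GetCurrentThread, THREAD_PRIORITY_NORMAL);
  end;
end;

Even then, interrupts and other system activity will add some jitter, so the minimum over many runs is usually the most repeatable figure.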
I want to improve the running time of some code.
In order to do that I first time the running time of all relevant code, ...
OK, I'm a bit of a stuck record on this subject, but lots of people think that to improve running time requires first measuring it accurately.
Not So.
Improving running time requires finding out what's taking a large fraction of time (the exact fraction does not matter) and doing it differently or maybe not at all.
What it's doing is often not revealed by timing individual routines.
Here's the method I use,
and here's a very amateur video of it.
The problem with profiling your code like that, by sticking special statements into it, is that those special statements themselves take time to run. And since the things taking the most time are likely to be things happening in tight loops, the more they run, the more they distort your timings. What you need for good information is something that will observe your program from outside, without modifying the executing code.
In other words, you need a sampling profiler. And there just happens to be a very good one for Delphi available for free, by the rather descriptive name of Sampling Profiler. It runs your program and watches what it's doing, then correlates that against the map file (make sure to set up your project options to generate a Detailed map file) to give you an intelligible readout on what your program is spending its time on.
And if you want to narrow things down, you can use OutputDebugString to output profiling commands to make it only pay attention to specific parts of your code. It's got instructions in the help file.
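For example, a minimal sketch: I believe the command strings are 'SAMPLING ON' and 'SAMPLING OFF', but check the profiler's help file for the exact syntax, and the routine names here are placeholders.

OutputDebugString('SAMPLING OFF');      // ignore startup and other uninteresting code
RunStartupCode;
OutputDebugString('SAMPLING ON');       // sample only the region of interest
RunPartUnderInvestigation;
OutputDebugString('SAMPLING OFF');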
I've used a lot of different methods, and this is the most useful way I've found to figure out what Delphi programs are spending their time on. And it's free. Give it a try.

Avoid profiling pthread functions

I am profiling a multi threaded application that uses pthread library, to get information about its most executed Basic Blocks (BBLs).
The problem I have encountered is that there are a lot of blocks belonging to __pthread_mutex_lock function, which are not only in the TOP 100 most executed BBLs, but in the TOP 10. This is really annoying because I am not interested in BBLs belonging to pthread library functions at all.
My mentor and I were speculating whether busy waiting had something to do with this, because in MPI there are ways to change locks from busy-waiting mode to blocking mode, so that when profiling apps these kinds of pthread functions don't show up that much, not even in the top 10. I have never worked with MPI, so excuse me if what I say is not very accurate.
What I wanted to ask here is whether anyone knows a way to do something like that with pthreads. Changing the code of the application is not an option, but if there is a compile-time knob, apart from -lpthread, that can change its behaviour so that I don't get so many BBLs from its libraries, that would be great.
Thank you for your time.

Hunting down EOutOfResources

Question:
Is there an easy way to get a list of the types of resources that leak in a running application? In other words, by connecting to an application?
I know MemProof can do it, but it slows things down so much that the application won't even last a minute. Most Task Manager-like tools can show the number of resources, but not the type.
It is not a problem if the check itself is catastrophic (halts the app process), since I can check with a task manager whether I'm getting close (or at least I hope so).
Any other insights on resource leak hunting (so not memory) are also welcome.
Background:
I have a Delphi 7/2006/2009 app (it compiles with all three) and after a few weeks it starts acting funny. However, this happens at only one of the places where it runs; on several other systems it runs till the power goes out.
I've tried to put in some debug code to narrow the problem down, and found out that the exception is EOutOfResources on a file save (the file save can happen thousands of times a day).
I have tried to reason out memory leaks (with FastMM), but since the data flow is quite high (60 MByte/s from gigabit industrial cameras), I can only rule out "creeping" memory leaks with FastMM, not quick flashes of memory leaks that exhaust memory around the time it happens. If something goes wrong, the app fills memory in under half a minute.
Main suspects are file handles that are somehow left open on some error, and TMetafiles (which are streamed to these files). Minor suspects are VST, popup menus and TFrames.
Updates:
Another possible tip: it ran fine for two years with D7, and now the problems are with Turbo Explorer (which I use for stable projects not converted to D2009).
Paul-Jan: Since it only happens once a week (and that can happen at night), information acquisition is slow, which is why I ask this question; I need to combine stuff for when I'm there Thursday. In short: no, I don't know 100% for sure. I intend to bring the entire Systemtools collection to see if I can find something (because then it will be running for days). There is also a chance that I'll see open files (maybe I should try to find some MinGW lsof and schedule it).
But the app sees very little GUI action (it is a machine vision inspection app), except a screen refresh of +/- 15/s, which is a TBitmap StretchDraw plus a TMetafile; but I get this error when saving to disk (TFileStream), so handles are probably really exhausted. However, in the same stream a TMetafile is also SaveToStream'ed, something which later apps don't have anymore, and they can run for months.
------------------- UPDATE
I've searched and searched and searched, and managed to reproduce the problems in vitro two or three times. The problems happened when memory usage was +/- 256 MB (the systems have 2 GB), user objects at 200, GDI objects at 500, and not one file more open than expected.
This is not really exceptional. I do notice that I leak small numbers of handles, probably due to reparenting frames (something in the VCL seems to leak HPALETTEs), but I suspect the core cause is a different problem. I reuse TMetafile, and Clear it in between. I think clearing the metafile doesn't really (always?) shrink the resource, so eventually each metafile in the pool ends up at its maximum size, and with 20-40+ TMetafiles (which can be several hundred KB each) this will hit the desktop heap limit.
That's the theory, and I'll try to verify it by setting the desktop heap limit to 10 MB at the customer's site, but it will be several weeks before I have confirmation of whether this changes anything. This theory also explains why this machine is special (it's possible that this machine naturally has slightly larger metafiles on average). Occasionally freeing and recreating a TMetafile in the pool might also help.
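(For reference, the desktop heap limit in question is, if I have it right, the second SharedSection value, in KB, of the registry value HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems\Windows, so something like the line below would give the 10 MB I want to try; the default numbers vary per Windows version, so only the second field should be touched.)
SharedSection=1024,10240,512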
Luckily all these problems (both tmetafile and reparenting) have already been designed out in newer generations of the apps.
Due to the special circumstances (and the fact that I have very limited test windows), this is going to take a while, but I decided to accept the desktop heap answer for now (though the GDILeaks stuff was also somewhat useful).
Another thing the audit revealed: GDI-type usage in a thread (though only saving TMetafiles, which weren't used or connected otherwise, to streams).
------------- Update 2.
Increasing the desktop heap limit only seemed to slightly increase the time until the problem occurred.
Unfortunately, I won't be able to follow up on this further, since the machines were updated to a newer version of the framework that doesn't have the problem.
In summary, I can only state the three core modifications made in going from the old to the new framework:
- I no longer change screens by reparenting frames; I now work with forms that I hide and show. I changed this since I also had very rare crashes or exceptions (that could be clicked away) due to it. Those crashes all happened while operating the GUI though, not spontaneously like the main problem.
- The routine where the crash happened dealt with TMetafile. TMetafile has been designed out and replaced by a simpler home-made format (basically arrays of OpenGL vertices).
- Drawing no longer happens with a TBitmap with a TMetafile overlay stretch-drawn over it, but uses OpenGL.
Of course it could be something else too, that got changed in the rewrite of the above parts, fixing some very nasty detail bug. It would have to be an extremely bad one, since I analysed the above system as much as I could.
Updated Nov 2012 after some private mail discussion: in retrospect, the next step would have been to add a counter to the metafile objects and simply reinstantiate them every x * 1000 uses or so, and see if that changes anything. If you have similar problems, see if you can somewhat regularly destroy and reinitialize long-living resources that are dynamically allocated.
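For illustration, a minimal sketch of that counter-based recycling (hypothetical names and threshold; the point is just to put an upper bound on how long any one TMetafile instance lives; TMetafile is in the Graphics unit):

const
  MaxUses = 1000;               // hypothetical threshold: recreate after this many uses
var
  FMetafile: TMetafile;         // one slot of the pool
  FUseCount: Integer = 0;

function NextMetafile: TMetafile;
begin
  Inc(FUseCount);
  if (FMetafile = nil) or (FUseCount > MaxUses) then
  begin
    FMetafile.Free;                  // Free on a nil reference is safe, so this also covers first use
    FMetafile := TMetafile.Create;   // a fresh object releases whatever the old one had grown to
    FUseCount := 1;
  end
  else
    FMetafile.Clear;                 // normal reuse; Clear may keep the underlying allocation
  Result := FMetafile;
end;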
There is a slim chance that the error is misleading. The VCL naively reports EOutOfResources if it is unable to obtain a DC for a window (see TWinControl.GetDeviceContext in Controls.pas).
I say "naively" because there are other reasons why GetDC() might return a NULL handle and the VCL should report the OS error, not assume an out of resources condition (there is a Windows version check required for this to be reliably possible, but the VCL could and should take of that too).
I had a situation where I was getting the EOutOfResources error as the result of a window handle becoming invalid. Once I'd discovered the true problem, finding the cause and fixing it was simple, but I wasted many, many hours trying to find a non-existent resource leak.
If possible I would examine the stack trace leading to this exception - if it is coming from TWinControl.GetDeviceContext then the problem may not be what you think (it's impossible to say what it might be of course, but eliminating the impossible is always the first step toward discovering the solution, no matter how improbable).
If they are GDI handle leaks you can have a look at MSDN Magazine January 2003 which uses the tool GDILeaks. Other tools are GDIObj or GDIView. Also see here.
Another source of EOutOfResources could be that the Desktop Heap is full. I've had that issue on busy terminal servers with large screens.
If there are lots of file handles you are leaking you could check out Process Explorer and have a look at the open file handles of your process and see any out of the ordinary. Or use WinDbg with the !htrace command.
I've run into this problem before. From what I've been able to tell, Delphi may throw an EOutOfResources any time the Windows API returns ERROR_NOT_ENOUGH_MEMORY, and (as the other answers here discuss) Windows may return ERROR_NOT_ENOUGH_MEMORY for a variety of conditions.
In my case, EOutOfResources was being caused by a TBitmap - in particular, TBitmap's call to CreateCompatibleBitmap, which it uses with its default PixelFormat of pfDevice. Apparently Windows may enforce fairly strict system-wide limits on the memory available for device-dependent bitmaps (see, e.g., this discussion), even if your system otherwise has plenty of memory and plenty of GDI resources. (These system-wide limits are apparently because Windows may allocate device-dependent bitmaps in the video card's memory.)
The solution is simply to use device-independent bitmaps (DIBs) instead (although these may not perform quite as well). To do this in Delphi, set TBitmap.PixelFormat to anything other than pfDevice. This KB article describes how to pick the optimal DIB format for a device, although I generally just use pf32bit instead of trying to determine the optimal format for each of the monitors the application is displayed on.
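To illustrate, the change is a one-liner per bitmap; a minimal sketch (pf32bit chosen arbitrarily, and the dimensions are placeholders):

var
  Bmp: Graphics.TBitmap;
begin
  Bmp := Graphics.TBitmap.Create;
  try
    Bmp.PixelFormat := pf32bit;  // DIB: backed by ordinary memory rather than the
                                 // device-dependent bitmap budget
    Bmp.Width := 1024;           // placeholder dimensions
    Bmp.Height := 768;
    // ... draw on Bmp.Canvas as usual ...
  finally
    Bmp.Free;
  end;
end;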
Most of the times I saw EOutOfResources, it was some sort of handle leak.
Did you try something like MadExcept?
--jeroen
"I've tried to put in some debug code to narrow the problem down. and found out that the exception is EOutofResources on a save of a file. (the file save can happen thousands of times a day)."
I'm shooting in the dark here, but could it be that you're using the Windows API (GetTempFileName) to create a temp file and you're blowing out some file system indexes, or forgetting to close a file handle?
Either way, I do agree with your supposition about it being a file handle problem. That seems to be the most likely thing given your symptoms and diagnosis.
Also try to check handle count for the application with Process Explorer from SysInternals. Handle leaks can be very dangerous and they build slowly through time.
I am currently having this problem, in software that is clearly not leaking any handles in my own code, so if there are leaks they could be happening in a component's source code or the VCL sourcecode itself.
The handle count and GDI and user object counts are not increasing, nor is anything being created. Deltics' answer shows corner cases where the message is kind of a red herring, and Allen suggests that even a file write can cause this error.
So far, the best strategy I have found for hunting these down is to use either JCL (JclDebug) stack tracebacks, or the exception report save features in MadExcept, to generate the context information needed to find out what is actually failing.
Secondly, AQTime contains many tools to help you, including a resource profiler that can keep the links between the code that created the resources and how it was called, along with counts of the total numbers of handles. It can grab results mid-run, so it is not limited to detecting unfreed resources after you exit. So, run AQTime, do a results capture mid-run, wait several hours, and capture again, and you should have two points in time to compare handle counts. Just in case it is the obvious thing. But as Deltics wisely points out, this exception class is raised in cases where it probably shouldn't have been.
I spent all of today chasing this issue down. I found plenty of helpful resources pointing me in the direction of GDI, given that I'm using GDI+ to produce high-speed animations directly onto the main form via timer/invalidate/OnPaint (the animation is performed in a separate thread). I also have a panel on this form with some dynamically created controls for the user to make changes to the animation.
It was extremely random and spontaneous. It wouldn't break anywhere in my code, and when the error dialog appeared, the animation on the main form would continue to work. At one point, two of these errors popped up at the same time (as opposed to sequential).
I carefully observed my code and made sure I wasn't leaking any handles related to GDI. In fact, my entire application tends to keep less than 300 handles, according to Task Manager. Regardless, this error would randomly pop up. And it would always correspond with the simplest UI related action, such as just moving the mouse over a standard VCL control.
Solution
I believe I have solved it by changing the logic to perform the drawing within a custom control, rather than directly onto the main form as I had been doing before. I think that because I was rapidly drawing on the same form canvas that other controls shared, they somehow interfered. Now that the animation has its own dedicated canvas to draw on, it seems to be perfectly fixed.
That is with about 1 hour of vigorous testing at least.
[Fingers crossed]
