I’m working on some code that has a grid view (~20 child views on screen at once). Each child view draws its content in GL, and has its own drawing thread and EAGLContext.
The advantage of this is that each view is relatively insulated from other GL usage in the app, though with 20 such views on screen, we have to glFlush+setCurrentContext: 20 times per frame. My gut tells me this is not the most efficient use of GL.
My questions:
• What's the cost of switching contexts?
• Does having to glFlush for each context actually slow it down, or does glFlush only stall the current context?
• Does having to glFlush for each context actually slow it down, or does glFlush only stall the current context?
Contexts have their own individual command streams.
All of this stuff eventually has to be serialized for drawing on a single GPU, so flushing the command stream for 20 concurrent contexts is going to put some pressure on whatever part of the driver does that.
Luckily, GL does not guarantee any sort of synchronization between different contexts, so GL itself is not going to spend a whole lot of effort making sure commands from different contexts are executed in a particular order relative to one another. However, if one of the command streams were waiting on a fence sync object associated with another context, that would introduce some interesting GL-related overhead.
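For reference, cross-context waiting with a fence looks roughly like this. This is a minimal sketch, assuming OpenGL ES 3.0 (fence syncs are not available on ES 2.0 without an extension) and that both contexts belong to the same sharegroup; signalAfterWork/waitForProducer are just illustrative names:

    #include <OpenGLES/ES3/gl.h>

    // Producer thread/context: insert a fence after the work to share, then flush
    // so other contexts can eventually observe it as signaled.
    GLsync signalAfterWork() {
        GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glFlush();
        return fence;
    }

    // Consumer thread/context: block (with a timeout) until the fence signals
    // before touching whatever the producer wrote.
    bool waitForProducer(GLsync fence) {
        GLenum r = glClientWaitSync(fence, 0, 16ull * 1000 * 1000 /* ~16 ms, in ns */);
        glDeleteSync(fence);
        return r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED;
    }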
• What's the cost of switching contexts?
Why are you switching contexts?
You said that each view has its own thread and context, so I am having trouble understanding why you would ever need to change which context is current on a thread.
The cost of switching contexts is very hardware dependent. Newer hardware generations tend to have more efficient context switching support, but it is generally a pretty heavyweight operation in any case.
The cost of a glFlush is neither very small nor very large. Not something you want to do more often than needed, but not very harmful when used in moderation. I would be much more worried about the context switch than the flush. As Andon mentioned in his response, a glFlush will not be enough if you need synchronization between the contexts/threads. That will require either a glFinish, or some kind of fence.
With your setup, you will also pay for thread switches on the CPU.
I think your gut feeling is absolutely right. The way I understand your use case, it would probably be much more efficient to render your sub-views in sequence, with a single rendering thread and context. It might make the state management a little more cumbersome, but you should be able to handle that fairly cleanly.
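A minimal sketch of what that single-context frame loop could look like. SubView, subViews, and drawSubView are placeholders for your own per-view state and drawing code; ES 3.0 headers are assumed, but the calls are the same on ES 2.0:

    #include <OpenGLES/ES3/gl.h>
    #include <vector>

    struct SubView { GLint x, y; GLsizei width, height; };  // placeholder per-view rect

    void renderFrame(const std::vector<SubView>& subViews) {
        glEnable(GL_SCISSOR_TEST);
        for (const SubView& view : subViews) {
            // Confine rendering (and the clear) to this view's rectangle.
            glViewport(view.x, view.y, view.width, view.height);
            glScissor(view.x, view.y, view.width, view.height);
            glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
            // drawSubView(view);  // hypothetical: bind this view's state and draw it
        }
        glDisable(GL_SCISSOR_TEST);
        // Present/flush once for the whole frame instead of 20 separate contexts.
    }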
To make this answer more self-contained, with credit to Andon: you don't have to make calls to set the current context, since the current context is maintained per thread. But at the GPU level, there will still be context switches as soon as work from multiple contexts is submitted.
I have 2 tasks that need to be performed:
One is synchronized with refresh rate via the Present call; does fancy graphics.
The other does a bunch of computations on a virtually infinite workload; does not need to be synchronous with the first task; really does not like being interrupted (encourages coarser workload granularity).
Is there a way to optimally use the GPU in this situation with DirectX?
Perhaps the solution would:
issue Dispatch (or Draw) calls in a way that allows them to run/finish asynchronously.
signal the current shader to stop.
use hardware or driver scheduling.
Right now my solution is to try and predict how long it would take to run the shaders, which is unreliable, unless I add a bunch of downtime...
Trying to avoid the th**ad word as it means a different thing on GPUs
Create two separate D3D11 devices. Use one for the rendering, and another one (driven from another CPU thread with lower priority) for the computations.
Rework your low-priority computations so that each Dispatch() takes a couple of milliseconds of GPU time to complete. Don't submit many compute calls at once: use 2 queries or a single fence so that you never have more than 2 pending compute calls. Dispatch 2 calls initially; when the first one completes, dispatch the third, and so on.
While rendering 3D on your main thread, lock an std::mutex and release it once you have rendered the scene, before Present. On the background thread, lock that mutex when submitting more compute work, but keep it unlocked while waiting for a query or fence.
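A rough sketch of how the "never more than 2 pending compute calls" part could look on the background thread, using D3D11 event queries. The second device/context is assumed to have been created already, resource bindings are omitted, and gSubmitMutex is the same mutex the render thread locks while recording its frame:

    #include <d3d11.h>
    #include <chrono>
    #include <mutex>
    #include <thread>

    std::mutex gSubmitMutex;  // render thread holds this while recording its frame

    void computeLoop(ID3D11Device* device, ID3D11DeviceContext* ctx,
                     ID3D11ComputeShader* shader)
    {
        D3D11_QUERY_DESC qd = {};
        qd.Query = D3D11_QUERY_EVENT;

        ID3D11Query* pending[2] = { nullptr, nullptr };
        bool issued[2] = { false, false };
        device->CreateQuery(&qd, &pending[0]);
        device->CreateQuery(&qd, &pending[1]);

        for (int slot = 0; ; slot = 1 - slot) {
            // Wait (with the mutex NOT held) until the dispatch previously issued in
            // this slot has finished, so at most 2 compute calls are ever in flight.
            if (issued[slot]) {
                BOOL done = FALSE;
                while (ctx->GetData(pending[slot], &done, sizeof(done), 0) != S_OK)
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }

            // Only submit while the render thread is not recording its frame.
            std::lock_guard<std::mutex> lock(gSubmitMutex);
            ctx->CSSetShader(shader, nullptr, 0);
            ctx->Dispatch(64, 1, 1);   // workload sized to take ~a couple ms of GPU time
            ctx->End(pending[slot]);   // event query completes when the dispatch does
            ctx->Flush();
            issued[slot] = true;
        }
    }

The render thread side is then just a matching std::lock_guard around scene recording, released before Present.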
You are still going to have some interference between these two tasks, but it might be good enough for your use case.
Ideally, consider using timestamp queries to measure the GPU time spent computing your background tasks. Then adjust the size of a single task dynamically based on these numbers; this should let you achieve the right granularity regardless of GPU performance. Don't forget to apply a rolling average over the last 5-10 completed tasks before using the number for these adjustments.
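A sketch of the timestamp-query measurement that feeds that rolling average (again D3D11; the dispatch being measured is omitted, and in real code you would poll the queries a frame or two later instead of spinning):

    #include <d3d11.h>

    // Returns the GPU time in milliseconds for whatever was recorded between the
    // two timestamps, or a negative value if the sample must be discarded.
    double measureDispatchMs(ID3D11Device* device, ID3D11DeviceContext* ctx)
    {
        D3D11_QUERY_DESC tsDesc = { D3D11_QUERY_TIMESTAMP, 0 };
        D3D11_QUERY_DESC djDesc = { D3D11_QUERY_TIMESTAMP_DISJOINT, 0 };
        ID3D11Query *tsBegin = nullptr, *tsEnd = nullptr, *disjoint = nullptr;
        device->CreateQuery(&tsDesc, &tsBegin);
        device->CreateQuery(&tsDesc, &tsEnd);
        device->CreateQuery(&djDesc, &disjoint);

        ctx->Begin(disjoint);
        ctx->End(tsBegin);
        // ctx->Dispatch(...);   // the background compute work being measured
        ctx->End(tsEnd);
        ctx->End(disjoint);

        UINT64 t0 = 0, t1 = 0;
        D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj = {};
        while (ctx->GetData(disjoint, &dj, sizeof(dj), 0) != S_OK) {}
        while (ctx->GetData(tsBegin, &t0, sizeof(t0), 0) != S_OK) {}
        while (ctx->GetData(tsEnd, &t1, sizeof(t1), 0) != S_OK) {}

        tsBegin->Release(); tsEnd->Release(); disjoint->Release();
        if (dj.Disjoint) return -1.0;  // GPU clock changed mid-measurement; drop sample
        return double(t1 - t0) * 1000.0 / double(dj.Frequency);
    }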
So I am reading the Vulkan book now and have a question about push constants vs. UBO updates.
After I set up all the pipeline and descriptor stuff, I basically just need to copy the data into the UBO buffer, e.g. with memcpy, and then I am done.
I can understand the issue here: the whole pipeline has to wait for this buffer to be ready before its contents can be changed, so it will be slow.
On the other hand, when I use push constants, there is no such problem, although they are small (say, 256 bytes).
So far so good.
However, on second thought, I realize that if I am updating the UBO, I don't need to change or re-record the command buffer; I can submit the old CB since it's still the same.
Whereas if I want to update via push constants, I have to reset the CB, record it again, and then submit it.
So won't this be an issue? How can I tell which one is faster?
Thanks.
Lots of people get confused on this issue, because the Vulkan Tutorial pre-records commands while the Vulkan Guide re-records commands every frame.
When people say to use push constants for per-frame changing data like transform matrices and time data, there is the implicit assumption that you are recording command buffers every frame. Push constants essentially hitch a ride with the rest of your commands when submitted, which is also how they avoid the synchronization and cache flushing that buffer updates need.
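For concreteness, here is roughly what that "hitching a ride" looks like when you re-record each frame. This is a sketch with placeholder names (cmd, pipelineLayout, FrameData), assuming the pipeline layout declares a vertex-stage push-constant range of this size:

    #include <vulkan/vulkan.h>

    struct FrameData {        // must match the push_constant block in the shader
        float mvp[16];
        float time;
    };

    void recordFrame(VkCommandBuffer cmd, VkPipelineLayout pipelineLayout,
                     const FrameData& frame)
    {
        VkCommandBufferBeginInfo begin = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
        begin.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
        vkBeginCommandBuffer(cmd, &begin);

        // The constants travel inside the command stream itself, so no descriptor
        // update or buffer synchronization is needed for this data.
        vkCmdPushConstants(cmd, pipelineLayout, VK_SHADER_STAGE_VERTEX_BIT,
                           0, sizeof(FrameData), &frame);

        // ... bind pipeline, issue draws ...

        vkEndCommandBuffer(cmd);
    }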
Now, in a lot of scenarios, re-recording command buffers can be easier and not significantly more costly than re-use. And indeed, re-using command buffers when things change can be a real pain to manage. Command buffers are meant to be fast to record. Still, the Vulkan tutorial went with pre-recording everything, which is also a valid approach though potentially harder to maintain at scale.
At the time it was created, the Vulkan Tutorial was essentially one of the only resources available to learn Vulkan in a structured manner. Even though command buffers are quick to record, pre-recording command buffers eliminates even more CPU overhead and exemplifies Vulkan's "never be draw-call limited again" mantra of eliminating CPU overhead in graphics applications.
As for the speed comparison, you'll have to benchmark, but I would not necessarily choose one or the other for "speed" reasons. If you pre-record, you don't want to re-orient your entire rendering architecture just to take advantage of push constants. If you don't pre-record, there's no reason not to use push constants, they are just straight up easier to deal with.
It seems like currently you are pre-recording. I would not bother with push constants at all for this kind of data. I would also not focus on these kinds of issues until you get more familiar with Vulkan; it is very easy to get caught in the weeds with optimization in Vulkan, and strategies for optimization are nowhere near as uniform as in the CPU space.
I am making a real-time image processing app on iOS with my team. I am handling the custom computation kernel (mostly on the CPU rather than the GPU) and my teammates deal with the GUI. When I tested my kernel in a toy app, the core (ignoring any IO overhead) ran steadily at 100 ms per image. However, when put into the full-functioning app, it slows down to 500 ms per image.
I have checked that the data is pretty much the same, I am only measuring time consumed within the kernel, and it is the same iPhone 6. There is hardly any other computation in the full-functioning app, so I am not sure what is dragging it down. Though GPU processing is definitely an alternative and I am working on it, I would like to know if there are any tricks I can use for now.
Currently, there is no explicit multi-threading in the computation part, so my simple guess is: should I explicitly put the computation part on a separate thread so the second core can be utilized?
[Update]
It turns out that I made some mistakes in packaging my code as a library, as copying the source code over directly works fine. I have not figured out my problem yet and am going to post it as a separate question.
GPU Acceleration
This massively depends on the tasks you're performing; the GPU is good at a specific subset of tasks, and simply utilising it can sometimes even slow things down. Check this out.
A lot of image-based tasks that are part of the Quartz framework etc. are GPU accelerated (like blurring). Also, if you use a library like OpenCV, you get GPU acceleration on certain tasks out of the box.
Unless you're a real pro I would avoid using the GPU specifically and let the frameworks and libraries you use do that for you.
Concurrency
It will certainly help to put intensive tasks on a background thread. Just be aware of what that entails (i.e. you can't make any UIKit calls from a background thread).
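A minimal sketch of that split, using the plain-C GCD API from C++ (processImage and updateUI stand in for your kernel and the GUI code, and Job is a made-up container for one request):

    #include <dispatch/dispatch.h>

    struct Job { /* input image, output buffer, ... */ };

    static void runKernel(void* context) {
        Job* job = static_cast<Job*>(context);
        // processImage(job);                  // heavy CPU work, off the main thread

        // UIKit must only be touched on the main thread, so bounce back to update it.
        dispatch_async_f(dispatch_get_main_queue(), job, [](void* ctx) {
            Job* done = static_cast<Job*>(ctx);
            // updateUI(done);                 // display the result
            delete done;
        });
    }

    void submitJob() {
        dispatch_async_f(dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0),
                         new Job(), runKernel);
    }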
The answer heavily depends on how you do the processing. Some methods in the SDK perform their job in a background thread, while others require the caller to create and use one.
In general, in the case of drawing, most methods require you to create one explicitly. This is important especially for the ones that perform their work on the CPU (e.g. using CoreGraphics to draw within a drawRect method). If you're using methods that do the processing on the GPU, then creating threads won't be of much use, since the CPU won't be the bottleneck.
If you want to determine why your app slows down, use Instruments. (Time Profiler for CPU and Core Animation for drawing)
I am trying to understand under what circumstances and how iOS may throttle my application threads due to excessive CPU consumption. The results I'm getting are kind of strange.
I have an application with an OpenGL / GLKViewController rendering a view and a separate logic thread, started in the background using NSThread.detachNewThreadSelector, performing calculations. I find that if I (for purposes of discussion) let my computation thread run flat out as fast as it can, iOS quickly throttles it down. For example, I monitor the FPS of both the view and my thread, and I see that the view maintains e.g. 60 fps while my logic thread hums along but then suddenly drops after a few seconds.
So that makes sense to me that perhaps iOS tries to limit thread consumption. What is weird is that it doesn't just slow down gradually but it seems to "quantize" my logic thread's FPS at approximately some multiple of the GPU frame rate (i.e. 30 or 60fps)!
Now, keep in mind that there is no synchronization between these threads, and the logic loop is a self-contained hard loop equivalent to while(true), so I have no idea how it is even possible for iOS to accomplish this magic unless it is somehow aware of my top-level loop and interjecting itself into it.
In case you don't believe me that there is no synchronization point, I will tell you that I have created a test case that literally just has an empty GLKViewController loop and a dumb logic thread that churns some numbers, and it exhibits the behavior. Screenshots are below and I can post the code if anyone is interested.
The screenshots below are for two different "loads" of the logic thread, printed at intervals of a second, running on an iPad Air with iOS 8.
What's even stranger is that sometimes setting a lower preferred GLK frame rate (e.g. 30 fps) can actually make my logic thread run slower. I'd have expected that reducing the work done by the GPU would free up resources (and heat-dissipation headroom) and reduce the need for throttling, but that doesn't always seem to be the case.
Does anyone have an explanation for this behavior? Is it documented? Thanks.
EDIT: My only guess at this point is that if the GPU runs too hot, they shut down the second core and migrate threads back to the first... and then somehow thread prioritization accounts for the implicit synchronization, although I still can't envision exactly how this happens.
We're currently developing an iPad application using AIR for iOS and from time to time experience crashes (on iPad 1 with iOS 5 only) which seem to happen because the application is using too much memory.
How can I catch/handle such errors in the application? How can I be notified when memory is low? Trying to catch flash.errors.MemoryError doesn't seem to work. Any tips?
I've done some work in this area and here are some tips that I can give you.
Get Flash Builder 4.6 Premium.
Get it if only for the profiler alone. It has one of the best profilers available for diagnosing things like this. With that said, there are other Flash profilers around that have varying degrees of usefulness.
This alone will help you find and diagnose where most of your memory is going in terms of raw memory usage, and it will also help you see how many objects you are creating and destroying and how long they hang around before the garbage collector finally gets around to letting them go.
Pool smaller trivial objects
Rather than constantly creating and destroying smaller objects, create object pools. This will save you the cost of spinning up new objects all the time and keep you from having to wait for the garbage collector to run before the memory is released.
There are a lot of examples and patterns to look at for creating object pools in ActionScript. It would be easier if AS supported generics, but even without them it's still pretty straightforward.
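The pattern itself is language-agnostic; here is its shape sketched in C++ (which has generics, so the template parameter plays the role a stored class reference would in ActionScript):

    #include <vector>

    template <typename T>
    class ObjectPool {
    public:
        // Hand out a recycled instance if one is available, otherwise create one.
        T* acquire() {
            if (free_.empty()) return new T();
            T* obj = free_.back();
            free_.pop_back();
            return obj;
        }
        // Return the instance to the pool instead of letting it be reclaimed.
        void release(T* obj) { free_.push_back(obj); }

        ~ObjectPool() { for (T* obj : free_) delete obj; }
    private:
        std::vector<T*> free_;
    };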
Eagerly dispose of huge objects
This goes directly against the advice in the previous point, but for huge objects, you don't want them hanging around in memory forever. I'm referring to things like BitmapData: when you are done with them (for the foreseeable future), tear them down, null them out, and let the garbage collector clean them up.
When you need them again, rebuild them. Yes, you will take a slight performance hit, but memory on mobile devices is precious, so don't waste it by keeping around a 2 MB BitmapData object that is only ever shown on the loading screen. Throw it away.
Null out references you don't need anymore
Take some time to really understand what the garbage collector needs to do its work, and how it decides which objects can and cannot be thrown away. Try to avoid self-referential objects/circular references; while the GC can normally figure them out, sometimes it needs a little hand-holding.
Evaluate every time you use new [related to the pooling point above]
Again, using a memory profiler will help with this step, but make sure that every time you instantiate a new object, you actually need to instantiate a new object. It can be very easy to get lazy when developing for a PC, just throwing new objects onto the heap and letting the GC sort it out. See if there are good caching strategies (object pooling, or just reference caching) for the small stuff. And if it's a HUGE object that you are building up and tearing down often, it might be time to come up with a better architectural solution.
As far as I know, if you get to the point where iOS thinks memory is low, it's already too late. Last time I checked, the framework will try to run the GC when it thinks it's running out of memory, and if it can't free up enough memory to continue, it fails out. Do your best to avoid getting to the point where the operating system thinks the only safe option is to terminate your application.