Practically speaking, how much overhead does chaining shaders have compared to if a single shader is used to do the same work?
In other words, is it preferable to chain shaders versus developing one monster shader? Or does the overhead from chaining them dictate to use as few shaders as possible?
As an example, consider #warrenm's sample "Image Processing" project. There is an adjust_saturation shader chained to a gaussian_blur_2d shader. Would combining both shaders into a single shader significantly improve performance, or would it be practically the same?
I would expect a significant performance gain in your example of combining adjust_saturation with gaussian_blur_2d (assuming they do what their names suggest).
From the GPU's point of view, both operations are pretty trivial in terms of the maths that needs to be done; performance is going to be totally dominated by texture fetching and writing out results. I'd imagine that the Gaussian blur is doing a bit more work because it presumably does multiple texture samples per output fragment. By combining the two shaders, you can entirely eliminate the texture fetching and writing cost of adjusting saturation.
I think that by combining the two operations you could expect to make significant performance gains, somewhere around 10%-40% faster than chaining them. Bear in mind that you might not see a difference in framerate, because iOS is very active in managing the CPU/GPU clock speed, so it's really hard to measure things accurately.
It depends on the size of the texture and the size of the cache. If you absolutely have to optimize, it is probably worth combining them into a single shader. If you want to reuse your code, it makes sense to create a set of simpler shaders and combine them (just like my VideoShader project, https://github.com/snakajima/vs-metal).
By the way, when you chain multiple shaders, you'd better create a single command buffer and encode all your shaders into that command buffer (instead of creating a command buffer for each shader). That allows Metal to perform a certain set of optimizations.
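To illustrate (just a rough Swift sketch, not code from the sample project; the pipeline and texture names are placeholders):

```swift
import Metal

// Hypothetical helper: encodes two chained compute passes (e.g. saturation, then blur)
// into a single command buffer so Metal can schedule and optimize them together.
func encodeChainedPasses(queue: MTLCommandQueue,
                         saturationPipeline: MTLComputePipelineState,
                         blurPipeline: MTLComputePipelineState,
                         source: MTLTexture,
                         intermediate: MTLTexture,
                         destination: MTLTexture) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }

    let threadsPerGroup = MTLSize(width: 8, height: 8, depth: 1)
    let groups = MTLSize(width: (source.width + 7) / 8,
                         height: (source.height + 7) / 8,
                         depth: 1)

    // Pass 1: saturation adjustment, source -> intermediate
    if let encoder = commandBuffer.makeeComputeCommandEncoder() ?? commandBuffer.makeComputeCommandEncoder() {
        encoder.setComputePipelineState(saturationPipeline)
        encoder.setTexture(source, index: 0)
        encoder.setTexture(intermediate, index: 1)
        encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
        encoder.endEncoding()
    }

    // Pass 2: Gaussian blur, intermediate -> destination
    if let encoder = commandBuffer.makeComputeCommandEncoder() {
        encoder.setComputePipelineState(blurPipeline)
        encoder.setTexture(intermediate, index: 0)
        encoder.setTexture(destination, index: 1)
        encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
        encoder.endEncoding()
    }

    // One commit for the whole chain, rather than one command buffer per shader.
    commandBuffer.commit()
}
```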
I have been maintaining my own custom 2D library (written in Objective-C / OpenGL ES 2.0) for a while now, to use in my personal projects (not work). I have also tried cocos2d and SpriteKit now and then, but eventually settled for "reinventing the wheel" because:
It's fun,
Knowledge-wise, I'd rather be the guy who can code a graphics library than just a guy who can use one,
Unlimited possibilities for customization.
Now, I am transitioning my code base to Swift and (besides all the design differences that arise when moving to a language where class inheritance takes a back seat to protocols, etc.) I was thinking that while I'm at it, I should consider transitioning to Metal as well. If anything, for the sake of future-proofing (also, I'm all for learning new technologies, and to be honest OpenGL/OpenGL ES is a terribly cluttered bag of "legacy" and backwards compatibility).
My library is designed around all sorts of OpenGL (ES)-specific performance bottlenecks: Use of texture atlases and mesh consolidation to reduce draw calls, rendering opaque sprites first, and semitransparent ones last (ordered back to front), etc.
My question is: Which of these considerations still apply to Metal, and which ones should I not even bother implementing (because they're not a performance issue anymore)?
Metal is only available on the subset of iOS devices which support OpenGL ES 3, so to be fair you need to compare Metal to GLES3.
Texture atlases & mesh consolidation:
With Metal, CPU cost of draw calls is lower than with GLES3, and you can parallelize draw call setup on multiple threads.
So this could allow you to skip atlasing & consolidation ... but those are good practices, so it would be even better if you kept them with Metal and used the extra CPU time to do more things!
Note that with GLES3 by using instancing & texture arrays you should also be able to get rid of atlasing and keep a low draw call count.
Rendering opaque sprites first, and semitransparent ones last
Metal changes absolutely nothing here; this is a constraint of the PowerVR GPUs' tile-based deferred renderer, and whatever driver you use will not change the GPU hardware. In any case, rendering opaque geometry before semitransparent geometry is the recommended way to proceed in 3D, whether you use DirectX, OpenGL or Metal...
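Just to illustrate the submission order (a rough Swift sketch; the Sprite type and depth field are made up for the example):

```swift
// Hypothetical sprite type; `depth` is the distance from the camera.
struct Sprite {
    var depth: Float
    var isOpaque: Bool
}

// Draw opaque sprites first (front to back helps hidden-surface rejection),
// then semitransparent sprites last, ordered back to front so blending is correct.
func sortedForSubmission(_ sprites: [Sprite]) -> [Sprite] {
    let opaque = sprites.filter { $0.isOpaque }.sorted { $0.depth < $1.depth }
    let transparent = sprites.filter { !$0.isOpaque }.sorted { $0.depth > $1.depth }
    return opaque + transparent
}
```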
Metal will not help if you are fillrate bound!
In general, Metal will only give you improvements on the CPU side.
If your performance is limited by fillrate (fragment shaders too complex, too much transparent overdraw, resolution too high, etc.) then you will get the exact same result with Metal and GLES3 (assuming that you have carefully optimized your shaders for each platform).
Using the PVRUniScoEditor to profile our pixel shaders, I'm finding that our frag shaders come in at around 20 cycles for most polys and 6-8 for our particles. This seems to be our butter zone in terms of getting decent performance, but I am now wondering if I am masking other problems by making these shaders as simple as possible. It'd be nice to have a bit more functionality in these guys. We're rendering a scene with around 120k polys, and making the vertex shaders heavier has little to no effect on performance.
So I guess I'm wondering how much is too much for a very heavily used frag shader and how much is too much poly-wise for 30fps.
There is no one right answer to this. While the PVRUniScoEditor is a great tool for relative estimates of shader performance, you can't just say that a fragment shader which consumes X estimated cycles will lead to Y framerate on a given device.
How heavy a particular shader might be is just one piece in the puzzle. How many fragments will it cover onscreen? Is your performance bottleneck on the fragment processing side (Renderer Utilization in the OpenGL ES Driver instrument near 100%)? Is blending enabled? All of these factors will affect how long it takes to render a frame.
The tile-based deferred renderers on iOS also have some interesting performance characteristics, where adjusting the cycle count for a particular fragment shader does not lead to a linear change in rendering time, even for a fill-rate-limited application. You can see an example of this in this question of mine, where I encountered sudden performance changes with slight variations of a fragment shader. In that case, adjusting the shader wasn't the primary solution; preventing the blending of unnecessary fragments was.
In addition to the straight cycle counts reported by the profiler, there are limitations on texture bandwidth, and I've found that cache misses can have a severe effect in these shaders.
What I'm trying to say is that the only real way to know what the performance will be for your shaders in your application is to run them and see. There are general hints that can be used to tune something that isn't fast enough, but there are so many variables that every solution will be application-specific.
I am new to DirectX, but I have been surprised that in most examples I have seen, the world matrix and view matrix are multiplied as part of the vertex shader, rather than being multiplied by the CPU and the result being passed to the shader.
For rigid objects this means you multiply the same two matrices once for every single vertex of the object. I know that the GPU can do this in parallel over a number of vertices (don't really have an idea how many), but isn't this really inefficient, or am I just missing something? I am still new and clueless.
In general, you want to do it on the CPU. However, DirectX 9 has the concept of "preshaders", which means that this multiplication will be done on the CPU up-front. This has been removed in newer APIs, but it might very well be relevant for the examples you're looking at.
Moreover, modern GPUs are extremely fast when it comes to ALU operations compared to memory access. Having a modestly complex vertex shader (with a texture fetch, maybe) means that the math required to do the matrix multiplication comes for free, so the authors might not even have bothered.
Anyway, the best practice is to pre-multiply everything constant on the CPU. Same applies for moving work from the pixel shaders into the vertex shaders (if something is constant across a triangle, don't compute it per-pixel.)
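As a rough sketch of the idea using Swift's simd types (names are just for illustration; the same applies with whatever math library you use on the CPU side):

```swift
import simd

// Combine the per-object constant matrices once on the CPU, instead of multiplying
// world * view * projection for every single vertex in the vertex shader.
func makeWorldViewProjection(world: simd_float4x4,
                             view: simd_float4x4,
                             projection: simd_float4x4) -> simd_float4x4 {
    // Column-vector convention: a vertex v is transformed as P * V * W * v,
    // so the combined matrix is built in the same order.
    return projection * view * world
}

// The vertex shader then performs a single matrix * vector multiply per vertex,
// using the precombined matrix passed in as a uniform/constant.
```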
Well, that doesn't sound clueless to me at all, you are absolutely right!
I don't know exactly what examples you have been looking at, but in general you'd pass precalculated matrices as much as possible; that is what semantics like WORLDVIEW (and, even more appropriate for simple shaders, WORLDVIEWPROJECTION) are for.
Exceptions could be cases where the shader code needs access to the separate matrices as well (but even then I'd usually pass the combined matrices too)... or perhaps those examples were all about illustrating matrix multiplication. :-)
I've got a question regarding ComputeShader compared to PixelShader.
I want to do some processing on a buffer, and this is possible both with a pixel shader and with a compute shader, and now I wonder if there is any advantage of one over the other, specifically when it comes to speed. I've had issues with either one only getting to use 8-bit values, but I should be able to work around that.
Every data point in the output will be calculated using a total of 8 data points surrounding it (an MxN matrix), so I'd think this would be perfect for a pixel shader, since the different outputs don't influence each other at all.
But I was unable to find any benchmarks comparing the two, and now I wonder which one I should aim for. The only target is speed.
From what I understand, shaders are shaders in the sense that they are just programs run by a lot of threads on data. Therefore, in general there should not be any difference in terms of computing power/speed between doing calculations in the pixel shader and doing them in the compute shader. However...
To do calculations in the pixel shader you have to massage your data so that it looks like image data: this means you have to draw a quad first of all, but also that your output must have the 'shape' of a pixel (a float4, basically). This data must then be interpreted by your app into something useful.
If you're using the compute shader you can completely control the number of threads to use, whereas for pixel shaders they have to be valid resolutions. You can also input and output data in any format you like and take advantage of accelerated conversion using UAVs (I think).
I'd recommend using compute shaders, since they are meant for doing general-purpose computation and are a lot easier to work with. Your overall application will probably be faster too, even if the actual shader computation time is about the same, just because you can avoid some of the hoops you have to jump through to get pixel shaders to do what you want.
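The question is about DirectX, but purely to illustrate the point about controlling the thread layout and output format yourself, here is a rough sketch of the same idea using Metal's compute API in Swift (the pipeline and buffers are placeholders, not from any particular project):

```swift
import Metal

// Hypothetical compute dispatch writing results straight into a plain buffer
// (whatever layout you like), with explicit control of the thread counts;
// no full-screen quad and no float4 render target needed.
func dispatchFilter(queue: MTLCommandQueue,
                    pipeline: MTLComputePipelineState,
                    input: MTLBuffer,
                    output: MTLBuffer,
                    elementCount: Int) {
    guard let commandBuffer = queue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else { return }

    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(input, offset: 0, index: 0)
    encoder.setBuffer(output, offset: 0, index: 1)

    // Choose the thread layout yourself instead of being tied to a render resolution.
    let width = pipeline.threadExecutionWidth
    let groups = MTLSize(width: (elementCount + width - 1) / width, height: 1, depth: 1)
    encoder.dispatchThreadgroups(groups,
                                 threadsPerThreadgroup: MTLSize(width: width, height: 1, depth: 1))
    encoder.endEncoding()
    commandBuffer.commit()
}
```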
I'm implementing billboards for vegetation where a billboard is of course a single quad consisting of two triangles. The vertex data is stored in a vertex buffer, but should I bother with indices? I understand that the savings on things like terrain can be huge in terms of vertices sent to the graphics card when you use indices, but using indices on billboards means that I'll have 4 vertices per quad rather than 6, since each quad is completely separate from the others.
And is it possible that the use of indices actually reduces performance because there is an extra level of indirection? Or isn't that of any significance at all?
I'm asking this because using indices would slightly complicate matters and I'm curious to know if I'm not doing extra work that just makes things slower (whether just in theory or actually noticeable in practice).
This is using XNA, but should apply to DirectX.
Using indices not only saves on bandwidth, by sending less data to the card, but also reduces the amount of work the vertex shader has to do. The results of the vertex shader can be cached if there is an index to use as a key.
If you render lots of this billboarded vegetation and don't change your index buffer, I think you should see a small gain.
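As a rough sketch of what the indexed layout looks like (plain Swift arrays, independent of any particular API):

```swift
// Build index data for `quadCount` independent quads laid out as
// 4 vertices per quad in the vertex buffer. Each quad becomes two
// triangles via 6 indices: (0,1,2) and (2,1,3) relative to its base vertex.
func makeQuadIndices(quadCount: Int) -> [UInt16] {
    var indices: [UInt16] = []
    indices.reserveCapacity(quadCount * 6)
    for quad in 0..<quadCount {
        let base = UInt16(quad * 4)
        indices.append(contentsOf: [base, base + 1, base + 2,
                                    base + 2, base + 1, base + 3])
    }
    return indices
}

// With indices: 4 vertices + 6 small indices per quad, and the two shared
// vertices can be reused from the post-transform cache.
// Without indices: 6 vertices per quad, so the shared vertices are shaded twice.
```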
When it comes to very primitive geometry, it might not make any sense to use indices. I wouldn't even bother with performance in that case; even modest hardware will render millions of triangles a second.
Now, technically, you don't know how the hardware will handle the data internally; it might convert it to indices anyway, because that's the most popular representation of geometry.