How many pixel shader cycles is too heavy for iPad 2 - iOS

Using the PVRUniScoEditor to profile our pixel shaders, I'm finding that our frag shaders are coming in at around 20 cycles for most polys and 6-8 for our particles. This seems to be our butter zone in terms of getting decent performance, but I am now wondering if I am masking other problems by making these shaders as simple as possible. It would be nice to have a bit more functionality in these guys. We're rendering a scene with around 120k polys, and making the vertex shaders heavier has little to no effect on performance.
So I guess I'm wondering: how much is too much for a very heavily used frag shader, and how many polys are too many for 30 fps?

There is no one right answer to this. While the PVRUniScoEditor is a great tool for relative estimates of shader performance, you can't just say that a fragment shader which consumes X estimated cycles will lead to Y framerate on a given device.
How heavy a particular shader might be is just one piece in the puzzle. How many fragments will it cover onscreen? Is your performance bottleneck on the fragment processing side (Renderer Utilization in the OpenGL ES Driver instrument near 100%)? Is blending enabled? All of these factors will affect how long it takes to render a frame.
The tile-based deferred renderers on iOS also have some interesting performance characteristics, where adjusting the cycle count for a particular fragment shader does not lead to a linear change in rendering time, even for a fill-rate-limited application. You can see an example of this in this question of mine, where I encountered sudden performance changes with slight variations of a fragment shader. In that case, adjusting the shader wasn't the primary solution; preventing the blending of unnecessary fragments was.
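As a generic illustration of that last point (this is not the exact fix from the linked question; the pass structure here is assumed purely for illustration), the idea is simply to keep blending disabled for everything that doesn't need it:

    import OpenGLES

    // Generic sketch: keep blending disabled for opaque geometry so the tile-based GPU
    // can reject hidden fragments, and enable it only for the pass that actually needs it.
    func drawFrame() {
        glDisable(GLenum(GL_BLEND))
        // ... draw opaque geometry ...

        glEnable(GLenum(GL_BLEND))
        glBlendFunc(GLenum(GL_SRC_ALPHA), GLenum(GL_ONE_MINUS_SRC_ALPHA))
        // ... draw particles / semitransparent geometry, back to front ...
    }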
In addition to the straight cycle counts reported by the profiler, there are limits on texture bandwidth, and I've found that cache misses can have a severe effect in these shaders.
What I'm trying to say is that the only real way to know what the performance will be for your shaders in your application is to run them and see. There are general hints that can be used to tune something that isn't fast enough, but there are so many variables that every solution will be application-specific.

Related

Performance of chained Metal shaders versus a single shader?

Practically speaking, how much overhead does chaining shaders have compared to using a single shader to do the same work?
In other words, is it preferable to chain shaders, or to develop one monster shader? Or does the overhead from chaining dictate using as few shaders as possible?
As an example, consider @warrenm's sample "Image Processing" project. There, an adjust_saturation shader is chained to a gaussian_blur_2d shader. Would combining both shaders into a single shader significantly improve performance, or would it be practically the same?
I would expect a significant performance gain in your example of combining adjust_saturation with gaussian_blur_2d (assuming they do what their names suggest).
From the GPU's point of view, both operations are pretty trivial in terms of the maths that needs to be done; performance is going to be totally dominated by texture fetching and writing out results. I'd imagine the Gaussian blur is doing a bit more work because it presumably does multiple texture samples per output fragment. By combining the two shaders, you eliminate entirely the texture fetching and write-out cost of the saturation adjustment.
I think by combining the two operations you could expect significant performance gains, somewhere around 10%-40% faster than chaining them. Bear in mind you might not see a difference in framerate, because iOS is very active in managing the CPU/GPU clock speed, so it's really hard to measure things accurately.
It depends on the size of the texture and the size of the cache. If you absolutely have to optimize, it is probably worth combining them into a single shader. If you want to reuse your code, it makes sense to create a set of simpler shaders and combine them (just like my VideoShader project, https://github.com/snakajima/vs-metal).
By the way, when you combine multiple shaders, it's better to create a single command buffer and encode all your shaders into that command buffer (instead of creating a command buffer for each shader). That allows Metal to perform a certain set of optimizations.
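A rough Swift sketch of that "one command buffer, several encoders" pattern (the pipeline states and texture names here are placeholders, not taken from the sample project):

    import Metal

    // Two compute passes encoded back to back into the SAME command buffer, so Metal
    // can schedule and optimize them together and the CPU submits only once.
    func encodeFilterChain(queue: MTLCommandQueue,
                           saturationPipeline: MTLComputePipelineState,
                           blurPipeline: MTLComputePipelineState,
                           source: MTLTexture,
                           intermediate: MTLTexture,
                           destination: MTLTexture) {
        guard let commandBuffer = queue.makeCommandBuffer() else { return }

        let threadsPerGroup = MTLSize(width: 8, height: 8, depth: 1)
        let groups = MTLSize(width: (source.width + 7) / 8,
                             height: (source.height + 7) / 8,
                             depth: 1)

        // Pass 1: saturation adjustment, source -> intermediate.
        if let encoder = commandBuffer.makeComputeCommandEncoder() {
            encoder.setComputePipelineState(saturationPipeline)
            encoder.setTexture(source, index: 0)
            encoder.setTexture(intermediate, index: 1)
            encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
            encoder.endEncoding()
        }

        // Pass 2: blur, intermediate -> destination, in the same command buffer.
        if let encoder = commandBuffer.makeComputeCommandEncoder() {
            encoder.setComputePipelineState(blurPipeline)
            encoder.setTexture(intermediate, index: 0)
            encoder.setTexture(destination, index: 1)
            encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
            encoder.endEncoding()
        }

        commandBuffer.commit()
    }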

OpenGL ES to Metal - Performance Bottleneck Differences

I have been maintaining my own custom 2D library (written in Objective-C / OpenGL ES 2.0) for a while now, to use in my personal projects (not work). I have also tried cocos2d and SpriteKit now and then, but eventually settled on "reinventing the wheel" because:
It's fun,
Knowledge-wise, I'd rather be the guy who can code a graphics library than just a guy who can use one,
Unlimited possibilities for customization.
Now, I am transitioning my code base to Swift and (besides all the design differences that arise when moving to a language where class inheritance takes a back seat to protocols, etc.) I was thinking that while I'm at it, I should consider transitioning to Metal as well. If anything, for the sake of future-proofing (also, I'm all for learning new technologies, and to be honest OpenGL/OpenGL ES is a terribly cluttered bag of "legacy" and backwards compatibility).
My library is designed around all sorts of OpenGL (ES)-specific performance bottlenecks: Use of texture atlases and mesh consolidation to reduce draw calls, rendering opaque sprites first, and semitransparent ones last (ordered back to front), etc.
My question is: Which of these considerations still apply to Metal, and which ones should I not even bother implementing (because they're not a performance issue anymore)?
Metal is only available on the subset of iOS devices that support OpenGL ES 3, so to be fair you need to compare Metal to GLES3.
Texture atlases & mesh consolidation:
With Metal, the CPU cost of draw calls is lower than with GLES3, and you can parallelize draw call setup across multiple threads (see the sketch below).
So this could allow you to skip atlasing and consolidation... but those are still good practices, so it would be even better to keep them with Metal and use the extra CPU time to do more things!
Note that with GLES3, by using instancing and texture arrays, you should also be able to get rid of atlasing while keeping a low draw call count.
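To illustrate the multi-threaded encoding point above, here is a rough Swift sketch using MTLParallelRenderCommandEncoder; the batching scheme and the encodeBatch closure are assumptions for illustration, not code from the question:

    import Metal
    import Dispatch

    // Spread draw-call recording for one render pass across several threads.
    func encodeDrawsInParallel(commandBuffer: MTLCommandBuffer,
                               passDescriptor: MTLRenderPassDescriptor,
                               batchCount: Int,
                               encodeBatch: (Int, MTLRenderCommandEncoder) -> Void) {
        guard let parallel = commandBuffer.makeParallelRenderCommandEncoder(descriptor: passDescriptor)
        else { return }

        // One sub-encoder per batch; creation order fixes the final submission order,
        // so results are deterministic even though recording happens on several threads.
        let encoders = (0..<batchCount).compactMap { _ in parallel.makeRenderCommandEncoder() }

        // Record each batch on its own thread; concurrentPerform blocks until all finish.
        DispatchQueue.concurrentPerform(iterations: encoders.count) { i in
            encodeBatch(i, encoders[i])   // set pipeline, textures, buffers, issue draws ...
            encoders[i].endEncoding()
        }
        parallel.endEncoding()
    }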
Rendering opaque sprites first, and semitransparent ones last
Metal changes absolutely nothing here: this is a constraint of the PowerVR GPUs' tile-based deferred renderer, and no driver can change the GPU hardware. In any case, rendering opaque geometry before semitransparent geometry is the recommended way to proceed in 3D whether you use DirectX, OpenGL or Metal...
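As a minimal sketch of that ordering rule (the Sprite type and its fields are placeholders, not from your library): opaque sprites are submitted first, semitransparent ones last, sorted back to front so blending composites correctly.

    struct Sprite {
        var isOpaque: Bool
        var depthFromCamera: Float
    }

    // Opaque first; semitransparent last, ordered back to front for correct blending.
    func submissionOrder(for sprites: [Sprite]) -> [Sprite] {
        let opaque = sprites.filter { $0.isOpaque }
        let transparent = sprites.filter { !$0.isOpaque }
                                 .sorted { $0.depthFromCamera > $1.depthFromCamera }
        return opaque + transparent
    }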
Metal will not help if you are fillrate bound!
In general, Metal will only give you improvements on the CPU side.
If your performance is limited by fillrate (fragment shaders too complex, too much transparent overdraw, resolution too high etc.) then you will get the exact same result in Metal and GLES3 (assuming that you have carefully optimized shaders for each platform).

Why do we implement lighting in the Pixel Shader?

I am reading Introduction to 3D Game Programming with DirectX 11 by Frank D. Luna, and can't seem to understand why we implement lighting in the pixel shader. I would be grateful if you could send me some reference pages on the subject.
Thank you.
Lighting can be done many ways. There are hundreds of SIGGRAPH papers on the topic.
For games, there are a few common approaches (or more often, games will employ a mixture of these approaches):
Static lighting or lightmaps: Lighting is computed offline, usually with a global-illumination solver, and the results are baked into textures. These lightmaps are blended with the base diffuse textures at runtime to create the sense of sophisticated shadows and subtle lighting, but none of it actually changes. The great thing about lightmaps is that you can capture very interesting and sophisticated lighting techniques that are very expensive to compute and then 'replay' them very inexpensively. The limitation is that you can't move the lights, although there are techniques for layering a limited number of dynamic lights on top.
Deferred lighting: In this approach, the scene is rendered many times to encode information into offscreen textures, then additional passes are made to compute the final image. Here there is often one rendering pass per light in the scene. See deferred shading. The good thing about deferred shading is that it is very easy to make the renderer scale with art-driven content without as many hard limits: you can just do more passes for more lights, for example, since they are simply additive. The problem with deferred shading is that each pass tends to do little computation, and the many passes push hard on the memory bandwidth of modern GPUs, which have a lot more compute power than bandwidth.
Per-face Forward lighting: This is commonly known as flat shading. Here the lighting is performed once per triangle/polygon using a face-normal. On modern GPUs, this is usually done on the programmable vertex shader but could also use a geometry shader to compute the per-face normal rather than having to replicate it in vertices. The result is not very realistic, but very cheap to draw since the color is constant per face. This is really only used if you are going for a "Tron look" or some other non-photorealistic rendering technique.
Vertex Forward lighting: This is classic lighting where the light computation is performed per vertex with a per-vertex normal. The colors at each vertex are then interpolated across the face of the triangle/polygon (Gouraud shading). This lighting is cheap, and on modern GPUs would be done in the vertex shader, but the result can be too smooth for many complex materials, and any specular highlights tend to get blurred or missed.
Per-pixel Forward lighting: This is the heart of your question: here the lighting is computed once per pixel. This can be something like classic Phong or Blinn-Phong shading, where the normal is interpolated between the vertices, or normal mapping, where a second texture provides the normal information for the surface. On a modern GPU, this is done in the pixel shader and can provide much more surface information, better specular highlights, roughness, etc. at the expense of more pixel shader computation. Modern GPUs tend to have a lot of compute power relative to their memory bandwidth, so per-pixel lighting is very affordable compared to the old days. In fact, Physically Based Rendering techniques are quite popular in modern games, and these tend to have very long and complex pixel shaders combining data from 6 to 8 textures for every pixel on every surface in the scene.
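For concreteness, here is roughly the math a per-pixel Blinn-Phong evaluation performs, written as a small Swift/simd sketch rather than actual shader code (the function and parameter names are illustrative, not from the book; all vectors are assumed normalized and in the same space):

    import simd
    import Foundation

    func blinnPhong(normal n: SIMD3<Float>,
                    toLight l: SIMD3<Float>,
                    toEye v: SIMD3<Float>,
                    diffuseColor kd: SIMD3<Float>,
                    specularColor ks: SIMD3<Float>,
                    shininess: Float) -> SIMD3<Float> {
        let nDotL = max(dot(n, l), 0)        // Lambert diffuse term
        let h = normalize(l + v)             // half-vector between light and view directions
        let nDotH = max(dot(n, h), 0)
        let specular = powf(nDotH, shininess)
        return kd * nDotL + ks * specular
    }

With per-vertex (Gouraud) lighting this same math runs once per vertex and the result is interpolated; per-pixel lighting runs it per fragment with an interpolated or normal-mapped normal, which is what preserves sharp specular highlights.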
That's a really rough survey and as I said there's a ton of books, articles, and background on this topic.
The short answer to your question is: because we can!

iOS OpenGL ES 2.0 VBO vertex count limit: Once exceeded, CPU bound

I am testing the rendering of extremely large 3d meshes, and I am currently testing on an iPhone 5 (I also have an iPad 3).
I have here two screenshots of Instruments with a profiling run. The first one is rendering a 1.3M vertex mesh, and the second is rendering a 2.1M vertex mesh.
The blue histogram bar at the top shows CPU load, and it can be seen that for the first mesh it hovers at around 10%, so the GPU is doing most of the heavy lifting. The mesh is very detailed, and my point-light-with-specular shader makes it look quite impressive, if I say so myself, as it is able to render consistently above 20 frames per second. Oh, and 4x MSAA is enabled as well!
However, once I step up to a 2 million+ vertex mesh, everything goes to crap as we see here a massive CPU bound situation, and all instruments report 1 frame per second performance.
So, it's pretty clear that somewhere between these two assets (and I will admit that they are both tremendously large meshes to be loading in a single VBO), some limit is being surpassed by the 2-megavertex (462K tris) mesh, whether it is the vertex buffer size or the index buffer size that is over the line.
So, the question is, what is this limit, and how can I query it? It would really be very preferable if I can have some reasonable assurance that my app will function well without exhaustively testing every device.
I also see an alternative approach to this problem, which is to stick to a known good VBO size limit (I have read about 4MB being a good limit), and basically just have the CPU work a little bit harder if the mesh being rendered is monstrous. With a 100MB VBO, having it in 4MB chunks (segmenting the mesh into 25 draw calls) does not really sound that bad.
But, I'm still curious. How can I check the max size, in order to work around the CPU fallback? Could I be running into an out of memory condition, and Apple is simply applying a CPU based workaround (oh LORD have mercy, 2 million vertices in immediate mode...)?
In pure OpenGL, there are two implementation-defined attributes: GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES. When these are exceeded, performance can drop off a cliff on some implementations.
I spent a while looking through the OpenGL ES specification for the equivalent and could not find it. Chances are it's buried in one of the OES or vendor-specific extensions to OpenGL ES. Nevertheless, there is a very real hardware limit to the number of elements and vertices you can draw in one call; past a certain point, too many indices can exceed the capacity of the post-T&L cache. 2 million is a lot for a single draw call; in the absence of a way to query the OpenGL ES implementation for this information, I'd try successively lower powers of two until you dial it back to the sweet spot.
65,536 used to be a sweet spot on DX9 hardware. That was the limit for 16-bit indices and was always guaranteed to be below the maximum hardware vertex count. Chances are it'll work for OpenGL ES class hardware too...
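If you do go the route of splitting the mesh into smaller draw calls, the submission loop might look roughly like this (Swift with the OpenGLES framework; the chunk size and function name are illustrative). Note that this only caps the index count per call: to also stay under a 65,536-vertex range with 16-bit indices, the mesh data itself needs to be split so each chunk references a contiguous block of vertices.

    import OpenGLES

    // Walk the bound index buffer in fixed-size chunks instead of one huge call.
    // Assumes the VBO/IBO and vertex attributes are already bound and indices are GLushort.
    // Keep the chunk size a multiple of 3 so triangles are never split across calls.
    func drawMeshInChunks(totalIndexCount: Int, maxIndicesPerDraw: Int = 65_535) {
        let indexSize = MemoryLayout<GLushort>.size
        var first = 0
        while first < totalIndexCount {
            let count = min(maxIndicesPerDraw, totalIndexCount - first)
            glDrawElements(GLenum(GL_TRIANGLES),
                           GLsizei(count),
                           GLenum(GL_UNSIGNED_SHORT),
                           UnsafeRawPointer(bitPattern: first * indexSize)) // byte offset into the IBO
            first += count
        }
    }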

DirectX world/view matrix multiplication - on the GPU or the CPU?

I am new to DirectX, but I have been surprised that in most examples I have seen, the world matrix and view matrix are multiplied as part of the vertex shader, rather than being multiplied on the CPU and the result passed to the shader.
For rigid objects this means you multiply the same two matrices once for every single vertex of the object. I know that the GPU can do this in parallel over a number of vertices (don't really have an idea how many), but isn't this really inefficient, or am I just missing something? I am still new and clueless.
In general, you want to do it on the CPU. However, DirectX 9 has the concept of "preshaders", which means that this multiplication will be done on the CPU up front. This has been removed in newer APIs, but it might very well be relevant for the examples you're looking at.
Moreover, modern GPUs are extremely fast at ALU operations compared to memory access. With a modestly complex vertex shader (maybe with a texture fetch), the math required for the matrix multiplication comes essentially for free, so the authors might not have even bothered.
Anyway, the best practice is to pre-multiply everything constant on the CPU. The same applies to moving work from the pixel shader into the vertex shader (if something is constant across a triangle, don't compute it per pixel).
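As an API-agnostic sketch of that advice (written here with Swift's simd types purely for illustration, not DirectX code), the combination happens once per object per frame on the CPU and a single matrix is uploaded to the shader:

    import simd

    // Column-vector convention shown (position transformed as P * V * W * v); with
    // DirectX-style row-vector math the multiplication order is reversed.
    func worldViewProjection(world: simd_float4x4,
                             view: simd_float4x4,
                             projection: simd_float4x4) -> simd_float4x4 {
        return projection * view * world
    }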
Well, that doesn't sound clueless to me at all; you are absolutely right!
I don't know exactly what examples you have been looking at, but in general you'd pass precalculated matrices as much as possible; that is what semantics like WORLDVIEW (and, even more appropriate for simple shaders, WORLDVIEWPROJECTION) are for.
Exceptions could be cases where the shader code needs access to the separate matrices as well (but even then I'd usually pass the combined matrix too)... or perhaps those examples were all about illustrating matrix multiplication. :-)
