Why do we implement lighting in the Pixel Shader? - directx

I am reading Introduction to 3D Game Programming with DirectX 11 by Frank D. Luna, and can't seem to understand why we implement lighting in the pixel shader. I would be grateful if you could point me to some reference pages on the subject.
Thank you.

Lighting can be done many ways. There are hundreds of SIGGRAPH papers on the topic.
For games, there are a few common approaches (more often, games employ a mixture of them):
Static lighting or lightmaps: Lighting is computed offline, usually with a global-illumination solver, and the results are baked into textures. These lightmaps are blended with the base diffuse textures at runtime to create the sense of sophisticated shadows and subtle lighting, but none of it actually changes. The great thing about lightmaps is that you can capture very interesting and sophisticated lighting techniques that are very expensive to compute and then 'replay' them very inexpensively. The limitation is that you can't move the lights, although there are techniques for layering a limited number of dynamic lights on top.
Deferred lighting: In this approach, the scene is rendered many times to encode information into offscreen textures, then additional passes are made to compute the final image. Here often there is one rendering pass per light in the scene. See deferred shading. The good thing about deferred shading is that it is very easy to make the renderer scale with art-driven content without as many hard limits--you can just do more passes for more lights for example which are simply additive. The problem with deferred shading is that each pass tends to do little computation, and the many passes really push hard on the memory bandwidth of modern GPUs which have a lot more compute power than bandwidth.
Per-face Forward lighting: This is commonly known as flat shading. Here the lighting is performed once per triangle/polygon using a face-normal. On modern GPUs, this is usually done on the programmable vertex shader but could also use a geometry shader to compute the per-face normal rather than having to replicate it in vertices. The result is not very realistic, but very cheap to draw since the color is constant per face. This is really only used if you are going for a "Tron look" or some other non-photorealistic rendering technique.
Vertex Forward lighting: This is classic lighting where the light computation is performed per vertex with a per-vertex normal. The colors at each vertex are then interpolated across the face of the triangle/polygon (Gouraud shading). This lighting is cheap, and on modern GPUs would be done in the vertex shader, but the result can be too smooth for many complex materials, and any specular highlights tend to get blurred or missed.
Per-pixel Forward lighting: This is the heart of your question. Here the lighting is computed once per pixel. This can be something like classic Phong or Blinn/Phong shading, where the normal is interpolated between the vertices, or normal mapping, where a second texture provides the normal information for the surface. On a modern GPU, this is done in the pixel shader and can provide much more surface information, better specular highlights, roughness, etc. at the expense of more pixel shader computation. Modern GPUs tend to have a lot of compute power relative to their memory bandwidth, so per-pixel lighting is very affordable compared to the old days. In fact, Physically Based Rendering techniques are quite popular in modern games, and these tend to have very long and complex pixel shaders combining data from 6 to 8 textures for every pixel on every surface in the scene.
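To make the difference between the vertex and per-pixel approaches concrete, here is a minimal CPU-side sketch in Python (not shader code). It assumes a single edge between two vertices whose normals tilt away from each other, with the light and viewer directly above the middle of the edge. All names (`gouraud_specular`, `per_pixel_specular`, the constants) are hypothetical and chosen only for illustration.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lerp(a, b, t):
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def blinn_phong_specular(normal, light_dir, view_dir, shininess):
    # Half vector between the light and view directions (Blinn-Phong).
    half = normalize(tuple(l + v for l, v in zip(light_dir, view_dir)))
    return max(dot(normal, half), 0.0) ** shininess

# Assumed test scene: light and viewer straight above the surface.
LIGHT_DIR = (0.0, 0.0, 1.0)
VIEW_DIR = (0.0, 0.0, 1.0)
SHININESS = 32.0
N0 = normalize((-0.5, 0.0, 1.0))  # normal at the first vertex
N1 = normalize((0.5, 0.0, 1.0))   # normal at the second vertex

def gouraud_specular(t):
    # Per-vertex lighting: light each vertex, then interpolate the result.
    s0 = blinn_phong_specular(N0, LIGHT_DIR, VIEW_DIR, SHININESS)
    s1 = blinn_phong_specular(N1, LIGHT_DIR, VIEW_DIR, SHININESS)
    return s0 + (s1 - s0) * t

def per_pixel_specular(t):
    # Per-pixel lighting: interpolate the normal, renormalize, then light.
    n = normalize(lerp(N0, N1, t))
    return blinn_phong_specular(n, LIGHT_DIR, VIEW_DIR, SHININESS)

if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(t, gouraud_specular(t), per_pixel_specular(t))
```

In this setup the highlight peaks in the middle of the edge. The per-vertex version never sees it, because neither vertex normal points at the half vector, while the per-pixel version does.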
That's a really rough survey and as I said there's a ton of books, articles, and background on this topic.
The short answer to your question is: because we can!

Related

Compute Shader, Buffer or texture

I'm trying to implement fluid dynamics using compute shaders. In the article there are a series of passes done on a texture since this was written before compute shaders.
Would it be faster to do each pass on a texture or buffer? The final pass would have to be applied to a texture anyways.
I would recommend using whichever dimensionality of resource fits the simulation: a RWBuffer for a 1D simulation, a RWTexture2D for a 2D simulation, and a RWTexture3D for a 3D simulation.
There appear to be stages in the algorithm that you linked that make use of bilinear filtering. If you restrict yourself to using a Buffer you'll have to issue 4 or 8 memory fetches (depending on 2D or 3D) and then more instructions to calculate the weighted average. Take advantage of the hardware's ability to do this for you where possible.
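As a rough illustration of what the hardware filter saves you, here is a small Python sketch of doing a 2D bilinear lookup by hand on a linear buffer: four fetches plus the weighted average. It assumes clamp addressing and that integer coordinates land exactly on texel centres; real samplers use their own coordinate conventions. The function names are hypothetical.

```python
import math

def fetch(buffer, width, height, x, y):
    # One "memory fetch" from a row-major buffer, clamped at the edges.
    x = min(max(x, 0), width - 1)
    y = min(max(y, 0), height - 1)
    return buffer[y * width + x]

def bilinear_sample(buffer, width, height, u, v):
    # u, v are in texel units; integer values hit texel centres exactly.
    x0 = math.floor(u)
    y0 = math.floor(v)
    fx = u - x0
    fy = v - y0
    # Four fetches for the 2D case (a 3D version would need eight).
    c00 = fetch(buffer, width, height, x0, y0)
    c10 = fetch(buffer, width, height, x0 + 1, y0)
    c01 = fetch(buffer, width, height, x0, y0 + 1)
    c11 = fetch(buffer, width, height, x0 + 1, y0 + 1)
    # Weighted average: blend along x on both rows, then along y.
    top = c00 + (c10 - c00) * fx
    bottom = c01 + (c11 - c01) * fx
    return top + (bottom - top) * fy

if __name__ == "__main__":
    grid = [0.0, 1.0,
            2.0, 3.0]
    print(bilinear_sample(grid, 2, 2, 0.5, 0.5))
```

With a texture and a linear sampler, all of this collapses into a single sample instruction.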
Another thing to be aware of is that data in textures is not laid out row by row (linearly) as you might expect, instead it's laid in such a way that neighbouring texels are as close to one another in memory as possible; this can be called Tiling or Swizzling depending on whose documentation you read. For that reason, unless your simulation is one-dimensional, you may well get far better cache coherency on reads/writes from a resource whose layout most closely matches the dimensions of the simulation.
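The exact layouts are vendor-specific and generally not documented, so the following is only an illustration of the idea, using Morton (Z-order) indexing as a stand-in for whatever a given GPU really does. It compares how far apart the four texels of a 2x2 block end up in a row-major layout versus a Morton layout. The helper names are hypothetical.

```python
def part1by1(n):
    # Spread the low 16 bits of n so there is a zero bit between each.
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton_index(x, y):
    # Interleave the bits of x and y: x in even bits, y in odd bits.
    return part1by1(x) | (part1by1(y) << 1)

def linear_index(x, y, width):
    # Plain row-major layout, as in a buffer.
    return y * width + x

def block_span(indices):
    # Distance in elements between the nearest and farthest texel.
    return max(indices) - min(indices)

if __name__ == "__main__":
    width = 1024
    block = [(0, 0), (1, 0), (0, 1), (1, 1)]
    print("row-major span:", block_span([linear_index(x, y, width) for x, y in block]))
    print("morton span:   ", block_span([morton_index(x, y) for x, y in block]))
```

In the row-major case the vertical neighbours are a full row apart; in the Morton case the whole 2x2 block is contiguous, which is the kind of locality the tiled layouts are after.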

Phong Shading vs Tessellation

I ran across Phong Shading while looking at the Source Engine. The description sounds very much like Tessellation. But when I looked it up, I didn't really find anything directly comparing the two. Now in DirectX Tessellation isn't used like Phong Shading in HLSL. What's the difference? And which one should I use?
Phong shading is not directly related to DX11 tessellation, but because they both can smooth lighting details I can see how you could be confused.
Tessellation dynamically increases geometric detail based on some parameters (often camera distance). This can increase lighting quality (maybe this is the relationship to Phong?) as well as silhouette detail. The shading advantages (not silhouette detail) can actually be simulated entirely in pixel shaders without tessellation.
Phong shading is a pixel shading technique. It does not affect geometric detail. It is similar to standard OpenGL Gouraud shading, except that instead of interpolating a lighting value across the pixels of a surface, the normal is interpolated across the surface and renormalized at each pixel. This gives more accurate lighting results, often called "per pixel lighting" as opposed to "per vertex lighting".
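The "renormalized at each pixel" step matters because a linear blend of two unit normals is generally shorter than unit length. A small Python sketch, assuming two vertex normals 90 degrees apart and a light aligned with the true mid-surface normal (the names are hypothetical):

```python
import math

def length(v):
    return math.sqrt(sum(c * c for c in v))

def normalize(v):
    l = length(v)
    return tuple(c / l for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lerp(a, b, t):
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def lambert(normal, light_dir):
    # Simple diffuse term: clamped N dot L.
    return max(dot(normal, light_dir), 0.0)

N0 = (1.0, 0.0, 0.0)               # unit normal at the first vertex
N1 = (0.0, 1.0, 0.0)               # unit normal at the second vertex
LIGHT = normalize((1.0, 1.0, 0.0)) # light facing the surface midpoint

MID_RAW = lerp(N0, N1, 0.5)        # what the interpolator hands the pixel
MID_UNIT = normalize(MID_RAW)      # after renormalizing per pixel

if __name__ == "__main__":
    print("interpolated length:", length(MID_RAW))
    print("diffuse without renormalize:", lambert(MID_RAW, LIGHT))
    print("diffuse with renormalize:   ", lambert(MID_UNIT, LIGHT))
```

Without the renormalize, the middle of the surface comes out darker than it should, even though the light is pointing straight at it.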
You could reasonably (and probably commonly) use both effects at the same time at different parts of the pipeline.
As Justin mentioned, Phong Shading is a shading routine used for more accurate per-pixel lighting. Tessellation is used to alter the geometric detail in a mesh by dynamically generating more triangles to achieve higher surface detail and a smoother result. It can be used successfully for dynamic level of detail depending on distance to camera or size on screen.
To add to this topic, I thought I should mention that there is a tessellation algorithm called Phong Tessellation that takes inspiration from Phong Shading and applies the same idea to tessellation. The vertices are displaced using a similar normal interpolation, which achieves high-detail silhouettes as well as better surface detail. Phong Tessellation has a simpler shader than the other common local tessellation algorithm, PN-Triangles, and I used it to achieve higher-detail heads in one of the games that I worked on.
Phong Tessellation
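For anyone curious what that looks like, here is a CPU-side Python sketch of the displacement step as I understand the published formulation: take the flat barycentric point, project it onto the tangent plane at each corner, blend the three projections with the same barycentric weights, and mix that with the flat point using a shape factor. The function names and the default `alpha` are assumptions for this sketch, and a real implementation would live in the domain shader.

```python
import math

def normalize(v):
    l = math.sqrt(sum(c * c for c in v))
    return tuple(c / l for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(points, weights):
    return tuple(sum(w * p[i] for p, w in zip(points, weights)) for i in range(3))

def project_to_tangent_plane(q, p, n):
    # Orthogonal projection of q onto the plane through p with unit normal n.
    d = dot(tuple(qc - pc for qc, pc in zip(q, p)), n)
    return tuple(qc - d * nc for qc, nc in zip(q, n))

def phong_tessellate(corners, normals, bary, alpha=0.75):
    # alpha = 0 gives the flat triangle, alpha = 1 the fully curved patch.
    flat = weighted_sum(corners, bary)
    projections = [project_to_tangent_plane(flat, p, n)
                   for p, n in zip(corners, normals)]
    curved = weighted_sum(projections, bary)
    return tuple((1.0 - alpha) * f + alpha * c for f, c in zip(flat, curved))

# Assumed test patch: a flat triangle whose normals splay outward,
# as they would on a coarse sphere.
CORNERS = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
CENTROID = (1.0 / 3.0, 1.0 / 3.0, 0.0)
NORMALS = [normalize((p[0] - CENTROID[0], p[1] - CENTROID[1], 1.0))
           for p in CORNERS]

if __name__ == "__main__":
    third = 1.0 / 3.0
    print(phong_tessellate(CORNERS, NORMALS, (third, third, third)))
```

The corners stay where they are, while interior points bulge out along the normals, which is where the smoother silhouette comes from.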

How many pixel shader cycles is too heavy for iPad2

Using the PVRUniScoEditor to profile our pixel shaders, I'm finding that our frag shaders are coming in at around 20 cycles for most polys and 6-8 for our particles. This seems to be our butter zone in terms of getting decent performance, but I am now wondering if I am masking other problems by making these shaders as simple as possible. It'd be nice to have a bit more functionality in these guys. We're rendering a scene with around 120k polys and making the vertex shaders heavier has little to no effect on performance.
So I guess I'm wondering how much is too much for a very heavily used frag shader and how much is too much poly-wise for 30fps.
There is no one right answer to this. While the PVRUniScoEditor is a great tool for relative estimates of shader performance, you can't just say that a fragment shader which consumes X estimated cycles will lead to Y framerate on a given device.
How heavy a particular shader might be is just one piece in the puzzle. How many fragments will it cover onscreen? Is your performance bottleneck on the fragment processing side (Renderer Utilization in the OpenGL ES Driver instrument near 100%)? Is blending enabled? All of these factors will affect how long it takes to render a frame.
The tile-based deferred renderers on iOS also have some interesting performance characteristics, where adjusting the cycle count for a particular fragment shader does not lead to a linear change in rendering time, even for a fill-rate-limited application. You can see an example of this in this question of mine, where I encountered sudden performance changes with slight variations of a fragment shader. In that case, adjusting the shader wasn't the primary solution; preventing the blending of unnecessary fragments was.
Beyond the raw cycle counts reported by the profiler, there are also texture bandwidth limitations and the severe effect that I've found cache misses can have in these shaders.
What I'm trying to say is that the only real way to know what the performance will be for your shaders in your application is to run them and see. There are general hints that can be used to tune something that isn't fast enough, but there are so many variables that every solution will be application-specific.

Which is faster: creating a detailed mesh before execution or tessellating?

For simplicity of the problem let's consider spheres. Let's say I have a sphere, and before execution I know the radius, the position and the triangle count. Let's also say the triangle count is sufficiently large (e.g. ~50k triangles).
Would it generally be faster to create this sphere mesh beforehand and stream all 50k triangles to the graphics card, or would it be faster to send a single point (representing the centre of the sphere) and use tessellation and geometry shaders to build the sphere on the GPU?
Would it still be faster if I had 100 of these spheres in different positions? Can I use hull/geometry shaders to create something which I can then combine with instancing?
Tessellation is certainly valuable. Especially when combined with displacement from a heightmap. The isolated environment described in your question is bound not to fully answer your question.
Before using tessellation you would need to know that you will become CPU poly/triangle bound and therefore need to start utilizing the GPU to help you increase the overall triangle count of your game/scene. Calculations are very fast on the GPU, so yes, using tessellation with multiple subdivision levels is advisable if you are going to do it, though sometimes I've been happy with just subdividing 3-4 times from a 200 tri plane.
Mainly tessellation is used for environmental/static mesh scene objects so that you can spend your tri's on characters and other moving/animated models without becoming CPU bound.
Check out engines like Unity3D and CryEngine for tessellation examples to help with the learning curve.
I just so happen to be working with this at the same time.
In terms of FPS, the pre-computed method would be faster in this situation, since you can dump one giant 50K triangle sphere payload (like any other model) and draw it in multiple places from there.
The tessellation method would be slower, since all the triangles would be generated from a formula, multiple times per frame.

DirectX world view matrix multiplications - GPU or CPU the place

I am new to DirectX, but have been surprised that in most examples I have seen, the world matrix and view matrix are multiplied as part of the vertex shader, rather than being multiplied by the CPU and the result being passed to the shader.
For rigid objects this means you multiply the same two matrices once for every single vertex of the object. I know that the GPU can do this in parallel over a number of vertices (don't really have an idea how many), but isn't this really inefficient, or am I just missing something? I am still new and clueless.
In general, you want to do it on the CPU. However, DirectX 9 has the concept of "preshaders", which means that this multiplication will be done on the CPU up-front. This has been removed for newer APIs, but it might very well be relevant for the examples you're looking at.
Moreover, modern GPUs are extremely fast when it comes to ALU operations compared to memory access. Having a modestly complex vertex shader (with a texture fetch maybe) means that the math required to do the matrix multiplication comes for free, so the authors might not have even bothered.
Anyway, the best practice is to pre-multiply everything constant on the CPU. Same applies for moving work from the pixel shaders into the vertex shaders (if something is constant across a triangle, don't compute it per-pixel.)
Well, that doesn't sound clueless to me at all, you are absolutely right!
I don't know exactly what examples you have been looking at, but in general you'd pass precalculated matrices as much as possible, that is what semantics like WORLDVIEW (and even more appropriate for simple shaders, WORLDVIEWPROJECTION) are for.
Exceptions could be cases where the shader code needs access to the separate matrices as well (but even then I'd usually pass the combined matrices as well)... or perhaps those examples were all about illustrating matrix multiplication. :-)
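To show why premultiplying is safe, here is a small Python sketch that uses plain lists instead of a math library. It assumes the row-vector convention (vertex * World * View) and made-up world and view matrices; the helper names are hypothetical. Because matrix multiplication is associative, transforming by World and then View gives the same result as transforming once by the combined World*View matrix.

```python
def mat_mul(a, b):
    # 4x4 times 4x4: 64 multiplies.
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def vec_mul(v, m):
    # Row vector times 4x4: 16 multiplies.
    return [sum(v[k] * m[k][j] for k in range(4)) for j in range(4)]

def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [tx, ty, tz, 1.0]]

def scaling(s):
    return [[s, 0.0, 0.0, 0.0],
            [0.0, s, 0.0, 0.0],
            [0.0, 0.0, s, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

# Assumed example matrices: scale by 2 then move the object, and a
# view matrix that just pushes everything 10 units along -z.
WORLD = mat_mul(scaling(2.0), translation(1.0, 2.0, 3.0))
VIEW = translation(0.0, 0.0, -10.0)

def transform_separately(vertex):
    # What the shader does if it receives World and View separately.
    return vec_mul(vec_mul(vertex, WORLD), VIEW)

# Done once per object on the CPU, then uploaded as a single constant.
WORLD_VIEW = mat_mul(WORLD, VIEW)

def transform_premultiplied(vertex):
    return vec_mul(vertex, WORLD_VIEW)

if __name__ == "__main__":
    v = [1.0, 1.0, 1.0, 1.0]
    print(transform_separately(v))
    print(transform_premultiplied(v))
```

The per-vertex cost drops from two vector-matrix products (or worse, a full matrix-matrix product if the shader combines the matrices itself) to a single one, while the result stays the same.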
