Does flat shading require this much vertex duplication? - webgl

I'm very new to WebGL (1.0) / OpenGL and I'm having trouble understanding vertices for flat and smooth shading -- and whether data optimization is even possible for flat shading in this situation:
Say I want to use an icosphere (2-subdivision). It has 42 points that define its 80 faces. Those point coordinates lie on a unit sphere.
Both flat and smooth-shaded icospheres will appear on the same screen.
With smooth shading, the normals will be identical to position vectors, so I get them for free. So I could use 42 vec3 in one buffer for both a_position and v_normal and an index buffer of 240 unsigned_byte to access them for the object. Cheap!
But with flat shading, each face has its own normal, which I think means that in WebGL 1.0 each face needs three copies of that normal, one per vertex. 80 faces means 240 vec3 for a_position (with many duplicate vectors) and 240 vec3 for a_normal (two-thirds of which are duplicates). I can't see any other way to do this. On the other hand, I can interleave position and normal data in the same buffer and skip the index buffer entirely.
I've got this working and it seems fast, but am I correct? Does it matter?
Icosphere property   | Count | Floats needed
Faces                | 80    | -
Positions (smooth)   | 42    | 126 (+240 indices)
Normals (smooth)     | 42    | 0 (reuse positions)
Positions (flat)     | 240   | 720
Normals (flat)       | 240   | 720
I feel like either I missed something in my studies or this is just the reality of OpenGL and I should get used to it because it's inherently fast.

You are correct in concluding that for flat shading you'll have to duplicate positions. That is because a vertex is the whole tuple of position, normal, and all the other attributes.
However, this duplication has almost zero impact on rendering times. It adds some memory overhead, yes, but as far as the rendering process is concerned the same amount of data is transferred and consumed either way. In fact, duplicating the attributes as a whole makes caching more predictable, since there is no data indirection involved (i.e. looking up a different normal depending on which face is being rendered), so in theory it can even be a performance gain.
You're doing it exactly right.
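For what it's worth, here is a minimal sketch (plain WebGL-style JavaScript; the function and variable names are hypothetical) of how the indexed smooth-shaded data can be expanded into the duplicated, interleaved flat-shaded buffer described above:

// positions: Float32Array of 42 * 3 floats; indices: 240 face indices (hypothetical inputs)
function buildFlatShadedBuffer(positions, indices) {
  var out = new Float32Array(indices.length * 6); // 3 position + 3 normal floats per vertex
  for (var f = 0; f < indices.length; f += 3) {
    // Fetch the three corners of this face.
    var corners = [];
    for (var k = 0; k < 3; k++) {
      var i = indices[f + k] * 3;
      corners.push([positions[i], positions[i + 1], positions[i + 2]]);
    }
    // Face normal = normalized cross product of two edges.
    var e1 = [corners[1][0] - corners[0][0], corners[1][1] - corners[0][1], corners[1][2] - corners[0][2]];
    var e2 = [corners[2][0] - corners[0][0], corners[2][1] - corners[0][1], corners[2][2] - corners[0][2]];
    var n = [e1[1] * e2[2] - e1[2] * e2[1], e1[2] * e2[0] - e1[0] * e2[2], e1[0] * e2[1] - e1[1] * e2[0]];
    var len = Math.sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
    n = [n[0] / len, n[1] / len, n[2] / len];
    // Write position + duplicated face normal for each of the three vertices.
    for (var k2 = 0; k2 < 3; k2++) {
      out.set(corners[k2], (f + k2) * 6);
      out.set(n, (f + k2) * 6 + 3);
    }
  }
  return out; // upload with gl.bufferData(gl.ARRAY_BUFFER, out, gl.STATIC_DRAW)
}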

Indeed, the most portable implementation of flat shading requires duplicating vertices for each drawn triangle, which adds memory overhead (considerable for complex geometry). It may potentially affect rendering performance as well, but that depends on the hardware (it shouldn't be noticeable nowadays). That is all that basic WebGL 1.0 allows you to do.
However, WebGL 2.0, and WebGL 1.0 with the OES_standard_derivatives extension, give another option: computing the triangle normal directly in the fragment shader via screen-space derivatives:
#extension GL_OES_standard_derivatives : enable
...
varying vec4 Position; // interpolated vertex position
varying vec3 View;
...
void main()
{
  // The cross product of the screen-space derivatives of the position
  // yields the (unnormalized) face normal of the current triangle.
  vec3 Normal = normalize (cross (dFdx (Position.xyz / Position.w), dFdy (Position.xyz / Position.w)));
  if (!gl_FrontFacing) { Normal = -Normal; }
  ...
  gl_FragColor = computeLighting (normalize (Normal), normalize (View), Position);
}
This requires per-fragment lighting (e.g. Phong shading instead of Gouraud shading). The shading result will NOT be exactly the same as duplicating vertices and precomputing triangle normals on the CPU, but the visual effect will be the same: flat shading with clearly distinguishable triangles.
Practically speaking, GL_OES_standard_derivatives is widely adopted.
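In WebGL 1.0 the extension still has to be enabled explicitly before a shader like the one above will compile; a quick sketch of the usual check:

// WebGL 1.0: the derivative functions are only available after enabling the extension.
var ext = gl.getExtension('OES_standard_derivatives');
if (!ext) {
  // Fall back to the duplicated-vertex approach described above.
}
// In WebGL 2.0 (GLSL ES 3.00) dFdx/dFdy are built in, so no extension is needed.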
In fact, GLSL 1.10 in desktop OpenGL 2.0 supported derivatives from the very beginning (no extension required); it was only OpenGL ES 2.0 (and hence WebGL 1.0) that decided to exclude them.
There are, however, some complaints about derivative implementations on various GPUs. Precise derivatives are expensive to compute, so the GLSL specifications allow returning faster approximations instead, which was critical for older graphics hardware. In practice the method works mostly fine for flat shading, though at least one OpenGL ES implementation (Qualcomm) has exhibited odd behavior with the sign of the returned values flipped.
There is, for example, a survey of this done for Android devices a couple of years ago (it is unclear whether the same issues would show up in WebGL; web browsers might blacklist broken implementations or apply workarounds for known driver bugs).

Related

For batch rendering multiple similar objects which is more performant, drawArrays(TRIANGLE_STRIP) with "degenerate triangles" or drawArraysInstanced?

MDN states that:

Fewer, larger draw operations will generally improve performance. If you have 1000 sprites to paint, try to do it as a single drawArrays() or drawElements() call.

It's common to use "degenerate triangles" if you need to draw discontinuous objects as a single drawArrays(TRIANGLE_STRIP) call. Degenerate triangles are triangles with no area, therefore any triangle where more than one point is in the same exact location. These triangles are effectively skipped, which lets you start a new triangle strip unattached to your previous one, without having to split into multiple draw calls.
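To make the degenerate-triangle trick concrete, here is a small sketch (plain JavaScript; the stripTwoQuads helper is hypothetical) of the vertex ordering for two disconnected quads in one TRIANGLE_STRIP; the repeated a3 and b0 vertices produce the zero-area triangles that get skipped:

// Quad A has corners a0..a3, quad B has b0..b3 (each a [x, y] pair, in strip order).
// Strip order: a0 a1 a2 a3 | a3 b0 (degenerate bridge) | b0 b1 b2 b3
function stripTwoQuads(a, b) {
  var order = [a[0], a[1], a[2], a[3], a[3], b[0], b[0], b[1], b[2], b[3]];
  var verts = new Float32Array(order.length * 2);
  order.forEach(function (p, i) { verts.set(p, i * 2); });
  return verts; // draw with gl.drawArrays(gl.TRIANGLE_STRIP, 0, order.length)
}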
However, it is also commonly recommended that for multiple similar objects one should use instanced rendering: for WebGL2 something like drawArraysInstanced(), or for WebGL1 drawArrays with the ANGLE_instanced_arrays extension enabled.
For my purposes I need to render a large number of rectangles of the same width in a 2D plane but with varying heights (a WebGL-powered charting application), so any recommendation particular to my use case is valuable.
Degenerate triangles are generally faster than drawArraysInstanced but there's arguably no reason to use degenerate triangles when you can just make quads with no degenerate triangles.
While it's probably true that degenerate triangles are slightly faster than quads, you're unlikely to notice the difference. In fact I suspect it would be difficult to create an example in WebGL that would show the difference.
To be clear, I'm suggesting manually instanced quads: if you want to draw 1000 quads, put 1000 quads in a single vertex buffer and draw them all with one draw call using either drawElements or drawArrays.
On the other hand, instanced quads using drawArraysInstanced might be the most convenient way, depending on what you are trying to do.
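For reference, a minimal sketch of the instanced path (assuming a quadBuffer with 6 vertices of a unit quad and an offsetBuffer with one x,y offset per quad; buffer and location names are hypothetical):

// WebGL1: instancing lives in an extension.
var ext = gl.getExtension('ANGLE_instanced_arrays');
gl.bindBuffer(gl.ARRAY_BUFFER, quadBuffer);           // 6 vertices of one unit quad
gl.vertexAttribPointer(posLoc, 2, gl.FLOAT, false, 0, 0);
gl.enableVertexAttribArray(posLoc);
gl.bindBuffer(gl.ARRAY_BUFFER, offsetBuffer);         // one vec2 offset per quad instance
gl.vertexAttribPointer(offsetLoc, 2, gl.FLOAT, false, 0, 0);
gl.enableVertexAttribArray(offsetLoc);
ext.vertexAttribDivisorANGLE(offsetLoc, 1);           // advance the offset once per instance
ext.drawArraysInstancedANGLE(gl.TRIANGLES, 0, 6, numQuads);
// WebGL2 equivalents: gl.vertexAttribDivisor(...) and gl.drawArraysInstanced(...).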
If it were me though, I'd first test without optimization, drawing one quad per draw call, unless I already knew I was going to draw more than 1000 quads. Then I'd find some low-end hardware and see if it's too slow. Most GPU apps become fill-rate bound (drawing pixels) before they become vertex bound, so even on a slow machine drawing lots of quads might be slow in a way that optimizing the vertex work won't fix.
You might find this and/or this useful
You can take it as a given that the rendering itself has been optimized by the compiler and the OpenGL core.
Static buffers
If you have buffers that are static, there is generally an insignificant performance difference between the techniques mentioned. Different hardware (GPUs) will favor one technique over another, but there is no way to know what type of GPU your code will be running on.
Dynamic buffers
If, however, the buffers are dynamic, then you need to consider the transfer of data from CPU RAM to GPU RAM. This transfer is a slow point, and on most GPUs it will stall rendering while the data is moved (losing the advantages of concurrent rendering).
On average anything that can be done to reduce the size of the buffers moved will improve the performance.
2D Sprites Triangle V Triangle_Strip
At the most basic, 2 floats per vertex (x, y for 2D sprites), you need to modify and transfer a total of 6 vertices per quad for gl.TRIANGLES (6 * 2 * b = 48 bytes per quad, where b is bytes per float (4)). If you use gl.TRIANGLE_STRIP you need to move only 4 vertices for a single quad, but for more than one you need to create the degenerate triangles, each of which requires an additional 2 vertices in front and 2 vertices behind. So the size per quad is 8 * 2 * 4 = 64 bytes per quad (in practice you can drop the 2-vertex lead-in and 2-vertex lead-out at the start and end of the buffer).
Thus for 1000 sprites there are 12,000 doubles (64-bit) that are converted to floats (32-bit), and the transfer is 48,000 bytes for gl.TRIANGLES. For gl.TRIANGLE_STRIP there are 16,000 doubles, for a total of 64,000 bytes transferred.
There is a clear advantage to using triangles over a triangle strip in this case. This is compounded if you include additional per-vertex data (e.g. texture coords, color data, etc.).
Draw Array V Element
The situation changes when you use drawElements rather than drawArrays, as the vertices used when drawing elements are located via the index buffer (a static buffer). In this case you need only modify 4 vertices per quad (for 1000 quads that is 8,000 doubles modified and 32,000 bytes transferred).
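A sketch of the static index buffer this refers to (built once, uploaded once; indexBuffer and the helper name are hypothetical):

// 4 unique vertices per quad, 6 indices per quad (two triangles), reused every frame.
function buildQuadIndices(numQuads) {
  var indices = new Uint16Array(numQuads * 6);
  for (var q = 0; q < numQuads; q++) {
    var v = q * 4;                     // first of the quad's 4 vertices
    indices.set([v, v + 1, v + 2, v, v + 2, v + 3], q * 6);
  }
  return indices;
}
gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, indexBuffer);
gl.bufferData(gl.ELEMENT_ARRAY_BUFFER, buildQuadIndices(1000), gl.STATIC_DRAW);
// Per frame: update only the 4 vertices per quad, then
// gl.drawElements(gl.TRIANGLES, 1000 * 6, gl.UNSIGNED_SHORT, 0);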
Instanced V modify verts
Now, using elements we have 4 vertices per quad (8 doubles to modify and 32 bytes to transfer per quad).
Say each quad has a uniform scale, a rotation, and a position (x, y). Whether you use drawArrays or drawElements, instanced rendering lets each quad get by with only 4 doubles of per-instance data: the position (x, y), the scale, and the rotation (the transform is done by the vertex shader).
In this case we have reduced the workload (for 1000 quads) to 4,000 doubles modified and 16,000 bytes transferred.
Thus instanced quads are the clear winner in terms of alleviating the transfer and JavaScript bottlenecks.
Instanced elements can go even further: in the case where only a position is needed, and that position stays within the screen, you can position a quad using only 2 shorts (16-bit ints), reducing the workload to 2,000 ints modified (converting 32-bit JS Numbers to shorts is much quicker than converting doubles to floats) and only 4,000 bytes transferred.
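A sketch of that short-based per-instance layout (WebGL1 with the ANGLE_instanced_arrays extension object ext assumed; instanceBuffer, instPosLoc and the shader attribute names are hypothetical):

// Two 16-bit pixel coordinates per quad instance.
var instancePos = new Int16Array(numQuads * 2);
// ... fill instancePos[q * 2] = x, instancePos[q * 2 + 1] = y ...
gl.bindBuffer(gl.ARRAY_BUFFER, instanceBuffer);
gl.bufferData(gl.ARRAY_BUFFER, instancePos, gl.DYNAMIC_DRAW); // 4 bytes per quad
gl.vertexAttribPointer(instPosLoc, 2, gl.SHORT, false, 0, 0);
gl.enableVertexAttribArray(instPosLoc);
ext.vertexAttribDivisorANGLE(instPosLoc, 1);
// The vertex shader then converts the pixel position to clip space, e.g.
// gl_Position = vec4((a_instPos + a_corner) * u_pixelToClip - 1.0, 0.0, 1.0);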
Conclusion
It is clear that, in the best case, instanced elements offer up to 16 times less work setting up and transferring quads to the GPU.
This advantage does not always hold true. It is a balance between the minimal data required per quad and the minimum data set per vertex per quad (4 vertices per quad).
Adding additional capabilities per quad will alter the balance, as will how often you modify the buffers (e.g. with texture coords you may only need to set the coords once when not using instancing, but with instancing you need to transfer all the data for a quad each time anything for that quad changes; note that clever interleaving of instance data can help).
There is also the hardware to consider. Modern GPUs are much better at state changes (transfer speeds), and in those cases it's the JavaScript code where you can gain any significant performance increase. Low-end GPUs are notoriously bad at state changes, and while optimal JS code is always important, reducing the data per quad is where the significant performance gains are on low-end devices.

Why do we implement lighting in the Pixel Shader?

I am reading Introduction to 3D Game Programming with DirectX 11 by Frank D. Luna, and can't seem to understand why we implement lighting in the pixel shader. I would be grateful if you could point me to some reference pages on the subject.
Thank you.
Lighting can be done many ways. There are hundreds of SIGGRAPH papers on the topic.
For games, there are a few common approaches (or more often, games will employ a mixture of these approaches)
Static lighting or lightmaps: Lighting is computed offline, usually with a global-illumination solver, and the results are baked into textures. These lightmaps are blended with the base diffuse textures at runtime to create the sense of sophisticated shadows and subtle lighting, but none of it actually changes. The great thing about lightmaps is that you can capture very interesting and sophisticated lighting techniques that are very expensive to compute and then 'replay' them very inexpensively. The limitation is that you can't move the lights, although there are techniques for layering a limited number of dynamic lights on-top.
Deferred lighting: In this approach, the scene is rendered many times to encode information into offscreen textures, then additional passes are made to compute the final image. Here often there is one rendering pass per light in the scene. See deferred shading. The good thing about deferred shading is that it is very easy to make the renderer scale with art-driven content without as many hard limits--you can just do more passes for more lights for example which are simply additive. The problem with deferred shading is that each pass tends to do little computation, and the many passes really push hard on the memory bandwidth of modern GPUs which have a lot more compute power than bandwidth.
Per-face Forward lighting: This is commonly known as flat shading. Here the lighting is performed once per triangle/polygon using a face normal. On modern GPUs this is usually done in the programmable vertex shader, but you could also use a geometry shader to compute the per-face normal rather than having to replicate it in the vertices. The result is not very realistic, but very cheap to draw since the color is constant per face. This is really only used if you are going for a "Tron look" or some other non-photorealistic rendering technique.
Vertex Forward lighting: This is classic lighting where the light computation is performed per vertex with a per-vertex normal. The colors at each vertex are then interpolated across the face of the triangle/polygon (Gouraud shading). This lighting is cheap, and on modern GPUs would be done in the vertex shader, but the result can be too smooth for many complex materials, and any specular highlights tend to get blurred or missed.
Per-pixel Forward lighting: This is the heart of your question: here the lighting is computed once per pixel. This can be something like classic Phong or Blinn/Phong shading where the normal is interpolated between the vertices, or normal maps where a second texture provides the normal information for the surface. In a modern GPU, this is done in the pixel shader and can provide much more surface information, better specular highlights, roughness, etc. at the expense of more pixel shader computation. Modern GPUs tend to have a lot of compute power relative to their memory bandwidth, so per-pixel lighting is very affordable compared to the old days. In fact, Physically Based Rendering techniques are quite popular in modern games, and these tend to have very long and complex pixel shaders combining data from 6 to 8 textures for every pixel on every surface in the scene.
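The practical difference between the last two approaches is easiest to see in shader code. A hedged sketch in GLSL-flavored strings (the book uses HLSL, but the idea is identical; all names here are made up): Gouraud computes the diffuse term per vertex and interpolates the resulting color, while per-pixel lighting interpolates the normal and evaluates the same term per fragment.

// Per-vertex (Gouraud): the lighting equation runs once per vertex.
var gouraudVS =
  'attribute vec3 a_position, a_normal;\n' +
  'uniform mat4 u_mvp; uniform vec3 u_lightDir;\n' +
  'varying vec3 v_color;\n' +
  'void main() {\n' +
  '  v_color = vec3(max(dot(normalize(a_normal), u_lightDir), 0.0));\n' +
  '  gl_Position = u_mvp * vec4(a_position, 1.0);\n' +
  '}';
// Per-pixel: only the normal is interpolated; the same equation runs per fragment,
// so specular highlights and normal-map detail are not smeared by interpolation.
var perPixelFS =
  'precision mediump float;\n' +
  'uniform vec3 u_lightDir;\n' +
  'varying vec3 v_normal;\n' +
  'void main() {\n' +
  '  float diffuse = max(dot(normalize(v_normal), u_lightDir), 0.0);\n' +
  '  gl_FragColor = vec4(vec3(diffuse), 1.0);\n' +
  '}';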
That's a really rough survey and as I said there's a ton of books, articles, and background on this topic.
The short answer to your question is: because we can!

iOS OpenGL ES 2.0 VBO confusion

I'm attempting to render a large number of textured quads on the iPhone. To improve render speeds I've created a VBO that I leverage to render my objects in a single draw call. This seems to work well, but I'm new to OpenGL and have run into issues when it comes to providing a unique transform for each of my quads (ultimately I'm looking for each quad to have a custom scale, position and rotation).
After a decent amount of Googling, it appears that the standard means of handling this situation is to pass a uniform matrix to the vertex shader and to have each quad take care of rendering itself. But this approach seems to negate the purpose of the VBO, by ultimately requiring a draw call per object.
In my mind, it makes sense that each object should keep its own model-view matrix, using it to transform, scale and rotate the object as necessary. But applying separate matrices to objects in a VBO has me lost. I've considered two approaches:
Send the model view matrix to the vertex shader as a non-uniform attribute and apply it within the shader.
Or transform the vertex data before it's stored in the VBO and sent to the GPU
But the fact that I'm finding it difficult to find information on how best to handle this leads me to believe I'm confusing the issue. What's the best way of handling this?
This is the "evergreen" question (a good one) of how to optimize the rendering of many simple geometries (a quad is in fact 2 triangles, i.e. 6 vertices most of the time, unless you use a strip).
Anyway, the use of a VBO vs a VAO in this case should not make a significant difference, since the amount of data to be transferred to the memory buffer is rather low (32 bytes per vertex, 96 bytes per triangle, 192 per quad), which is not a big effort for today's memory bandwidth (although it depends on how many quads you mean; if you have 20,000 quads per frame then it would be a problem anyway).
A possible approach could be to batch the drawing of the quads by building a new VAO each frame with the different quads positioned in your own coordinate system, i.e. shifting the quad vertices to the correct positions relative to a "virtual" mesh origin. Then you just perform a single draw of the newly created mesh in your VAO.
In this way, you could batch the drawing of multiple objects in fewer calls.
The problem is that if your quads need to scale and rotate, not just translate, you could compute the actual vertex positions on the CPU, but that would be way too costly in terms of computing power.
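A minimal sketch of the per-frame batching idea from the paragraphs above (translation only, per the caveat; shown in WebGL-style JavaScript rather than iOS GL ES code, with hypothetical names like batchBuffer):

// quadTemplate: 6 vertices (x, y) of a unit quad; quads: array of {x, y} positions;
// batch: preallocated Float32Array of quads.length * 12 floats.
function rebuildBatch(quadTemplate, quads, batch) {
  for (var q = 0; q < quads.length; q++) {
    for (var v = 0; v < 6; v++) {
      var dst = (q * 6 + v) * 2;
      batch[dst]     = quadTemplate[v * 2]     + quads[q].x; // shift into place on the CPU
      batch[dst + 1] = quadTemplate[v * 2 + 1] + quads[q].y;
    }
  }
  gl.bindBuffer(gl.ARRAY_BUFFER, batchBuffer);
  gl.bufferData(gl.ARRAY_BUFFER, batch, gl.DYNAMIC_DRAW); // re-upload, then one draw call
  gl.drawArrays(gl.TRIANGLES, 0, quads.length * 6);
}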
A simple suggestion on top of the way you transfer the meshes is to use a texture atlas for all the textures of your quads; this way you will need far fewer (if any) texture bind operations, which can be costly during rendering.

Most Efficient way of Multi-Texturing - iOS, OpenGL ES2, optimization

I'm trying to find the most efficient way of handling multi-texturing in OpenGL ES2 on iOS. By 'efficient' I mean the fastest rendering even on older iOS devices (iPhone 4 and up) - but also balancing convenience.
I've considered (and tried) several different methods. But have run into a couple of problems and questions.
Method 1 - My base and normal values are rgb with NO ALPHA. For these objects I don't need transparency. My emission and specular information are each only one channel. To reduce texture2D() calls I figured I could store the emission as the alpha channel of the base, and the specular as the alpha of the normal. With each being in their own file it would look like this:
My problem so far has been finding a file format that will support a full non-premultiplied alpha channel. PNG just hasn't worked for me. Every way that I've tried to save this as a PNG premultiplies the .alpha with the .rgb on file save (via photoshop) basically destroying the .rgb. Any pixel with a 0.0 alpha has a black rgb when I reload the file. I posted that question here with no activity.
I know this method would yield faster renders if I could work out a way to save and load this independent 4th channel. But so far I haven't been able to and had to move on.
Method 2 - When that didn't work I moved on to a single 4-way texture where each quadrant has a different map. This doesn't reduce texture2D() calls but it reduces the number of textures that are being accessed within the shader.
The 4-way texture does require that I modify the texture coordinates within the shader. For model flexibility I leave the texcoords as is in the model's structure and modify them in the shader like so:
v_fragmentTexCoord0 = a_vertexTexCoord0 * 0.5;
v_fragmentTexCoord1 = v_fragmentTexCoord0 + vec2(0.0, 0.5); // illumination frag is up half
v_fragmentTexCoord2 = v_fragmentTexCoord0 + vec2(0.5, 0.5); // shininess frag is up and over
v_fragmentTexCoord3 = v_fragmentTexCoord0 + vec2(0.5, 0.0); // normal frag is over half
To avoid dynamic texture lookups (Thanks Brad Larson) I moved these offsets to the vertex shader and keep them out of the fragment shader.
But my question here is: Does reducing the number of texture samplers used in a shader matter? Or would I be better off using 4 different smaller textures here?
The one problem I did have with this was bleed-over between the different maps. A texcoord of 1.0 was averaging in some of the blue normal-map pixels due to linear texture filtering. This added a blue edge on the object near the seam. To avoid it I had to change my UV mapping to stay away from the edge, and that's a pain to do with very many objects.
Method 3 would be to combine methods 1 and 2. and have the base.rgb + emission.a on one side and normal.rgb + specular.a on the other. But again I still have this problem getting an independent alpha to save in a file.
Maybe I could save them as two files but combine them during loading before sending it over to openGL. I'll have to try that.
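That load-time merge doesn't require any special file format. A hedged sketch in WebGL/JavaScript terms (the question is iOS, but the merge-at-load-time idea is the same; the function name is made up): draw both images into a canvas and copy the second image's data into the alpha channel before uploading:

// rgbImage holds the color map, alphaImage holds the 4th channel as grayscale.
function mergeRGBPlusAlpha(gl, rgbImage, alphaImage) {
  var canvas = document.createElement('canvas');
  canvas.width = rgbImage.width;
  canvas.height = rgbImage.height;
  var ctx = canvas.getContext('2d');
  ctx.drawImage(rgbImage, 0, 0);
  var rgb = ctx.getImageData(0, 0, canvas.width, canvas.height);
  ctx.drawImage(alphaImage, 0, 0);
  var alpha = ctx.getImageData(0, 0, canvas.width, canvas.height);
  for (var i = 0; i < rgb.data.length; i += 4) {
    rgb.data[i + 3] = alpha.data[i]; // take the alpha from the second image's red channel
  }
  var tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, tex);
  gl.pixelStorei(gl.UNPACK_PREMULTIPLY_ALPHA_WEBGL, false); // keep the channels independent
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, canvas.width, canvas.height, 0,
                gl.RGBA, gl.UNSIGNED_BYTE, new Uint8Array(rgb.data.buffer));
  return tex;
}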
Method 4 - Finally, in a 3D world, if I have 20 different panel textures for walls, should these be individual files or all packed into a single texture atlas? I recently noticed that at some point Minecraft moved from an atlas to individual textures - albeit they are 16x16 each.
With a single model and by modifying the texture coordinates (which I'm already doing in method 2 and 3 above), you can easily send an offset to the shader to select a particular map in an atlas:
v_fragmentTexCoord0 = u_texOffset + a_vertexTexCoord0 * u_texScale;
This offers a lot of flexibility and reduces the number of texture bindings. It's basically how I'm doing it in my game now. But IS IT faster to access a small portion of a larger texture and have the above math in the vertex shader? Or is it faster to repeatedly bind smaller textures over and over? Especially if you're not sorting objects by texture.
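For what it's worth, the JavaScript side of that atlas selection is just two small uniforms set per object before drawing. A sketch, assuming u_texScale is a vec2 and using hypothetical uniform locations (the uniform names come from the snippet above):

// Select tile (col, row) out of an atlas of tilesPerSide x tilesPerSide maps.
function selectAtlasTile(gl, col, row, tilesPerSide) {
  var scale = 1.0 / tilesPerSide;
  gl.uniform2f(texScaleLoc, scale, scale);              // u_texScale
  gl.uniform2f(texOffsetLoc, col * scale, row * scale); // u_texOffset
}
// ... then draw the object; the vertex shader applies
// v_fragmentTexCoord0 = u_texOffset + a_vertexTexCoord0 * u_texScale;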
I know this is a lot. But the main question here is what's the most efficient method considering speed + convenience? Will method 4 be faster for multiple textures or would multiple rebinds be faster? Or is there some other way that I'm overlooking. I see all these 3d games with a lot of graphics and area coverage. How do they keep frame rates up, especially on older devices like the iphone4?
**** UPDATE ****
Since I've suddenly had 2 answers in the last few days I'll say this. Basically I did find the answer. Or AN answer. The question is which method is more efficient? Meaning which method will result in the best frame rates. I've tried the various methods above and on the iPhone 5 they're all just about as fast. The iPhone5/5S has an extremely fast gpu. Where it matters is on older devices like the iPhone4/4S, or on larger devices like a retina iPad. My tests were not scientific and I don't have ms speeds to report. But 4 texture2D() calls to 4 RGBA textures was actually just as fast or maybe even faster than 4 texture2d() calls to a single texture with offsets. And of course I do those offset calculations in the vertex shader and not the fragment shader (never in the fragment shader).
So maybe someday I'll do the tests and make a grid with some numbers to report. But I don't have time to do that right now and write a proper answer myself. And I can't really checkmark any other answer that isn't answering the question cause that's not how SO works.
But thanks to the people who have answered. And check out this other question of mine that also answered some of this one: Load an RGBA image from two jpegs on iOS - OpenGL ES 2.0
Have a post-process step in your content pipeline where you merge your RGB texture with the alpha texture and store the result in a .ktx file, either when you package the game or as a post-build event when you compile.
KTX is a fairly trivial format, and it would be simple to write a command-line tool that loads two PNGs and merges them into one KTX (RGB + alpha).
Some benefits of doing that:
- Less CPU overhead when loading the file at game start-up, so the game starts quicker.
- Some GPUs do not natively support a 24-bit RGB format, which would force the driver to internally convert it to 32-bit RGBA. This adds time to the loading stage and temporary memory usage.
Now, once you have the data in a texture object, you do want to minimize texture sampling, as it means a lot of GPU operations and memory accesses depending on the filtering mode.
I would recommend having 2 textures with 2 layers each: the issue with adding all of them to the same texture is potential artifacts when you sample with bilinear filtering or mipmapping, since samples may pull in neighbouring pixels near the edge where one texture layer ends and the next begins (and the same goes for generated mipmaps).
As an extra improvement I would recommend not storing raw 32-bit RGBA data in the KTX, but compressing it into a DXT or PVRTC format. This uses much less memory, which means faster loading times and fewer memory transfers for the GPU, as memory bandwidth is limited.
Of course, adding the compressor to the post-process tool is slightly more complex.
Do note that compressed textures do lose a bit of quality depending on the algorithm and implementation.
Silly question but are you sure you are sampler limited? It just seems to me that, with your "two 2-way textures" you are potentially pulling in a lot of texture data, and you might instead be bandwidth limited.
What if you were to use 3 textures [ BaseRGB, NormalRBG, and combined Emission+Specular] and use PVRTC compression? Depending on the detail, you might even be able to use 2bpp (rather than 4bpp) for the BaseRGB and/or Emission+Specular.
For the Normals I'd probably stick to 4bpp. Further, if you can afford the shader instructions, only store the R&G channels (putting 0 in the blue channel) and re-derive the blue channel with a bit of maths. This should give better quality.

Efficient pixel shader when only u-axis varies?

I'm writing a pixel shader that has the property that, for a given quad, the values returned vary only with the u-axis value. I.e. for a fixed u, the color output is constant as v varies.
The computation to calculate the color at a pixel is relatively expensive - i.e. does multiple samples per pixel / loops etc..
Is there a way to take advantage of this v-invariance property? If I were doing this on a CPU then I'd obviously just cache the values once calculated, but I guess that doesn't apply because of parallelism. It might be possible to move the texture generation to the CPU side and have the shader access a Texture1D, but I'm not sure how fast that would be.
Is there a paradigm that fits this situation on GPU cards?
cheers
Storing your data in a 1D texture and sampling it in your pixel shader looks like a good solution. Your GPU will be able to use its texture caching features, taking advantage of the fact that many of your pixels use the same value from the 1D texture. This should be really fast; texture fetching and caching is one of the main reasons your GPU is so efficient at rendering.
It is common practice to make a trade-off between calculating the value in the pixel shader and using a lookup-table texture. You are doing complex calculations by the sound of it, so using a lookup texture will certainly improve performance.
Note that you could still generate this texture by the GPU, there is no need to move it to the CPU. Just render to this 1D texture using your existing shader code as a prepass.
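In WebGL/GLSL terms (which lack true 1D textures, so an N x 1 2D texture stands in; all names here are hypothetical), the lookup then collapses to a single fetch addressed by u only:

// Fragment shader: every row of the quad reads the same texel, so the
// texture cache serves almost all fetches.
var lookupFS =
  'precision mediump float;\n' +
  'uniform sampler2D u_lookup;  // N x 1 texture, one precomputed color per u\n' +
  'varying vec2 v_uv;\n' +
  'void main() {\n' +
  '  gl_FragColor = texture2D(u_lookup, vec2(v_uv.x, 0.5));\n' +
  '}';
// The N x 1 texture itself can be filled by rendering the expensive shader
// once into an N x 1 framebuffer as a prepass.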
