In OpenGL, you generally get better performance by using vertex buffers, and even better performance by putting many objects into the same vertex buffer, so that lots of vertices can be drawn with a single glDrawArrays call.
But what is the practical upper limit of this? How many MB of vertex data in the same buffer is too much? At what point should you split a vertex array into two separate vertex arrays? And how do you know?
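For concreteness, the kind of batching I mean is roughly this: a minimal, position-only sketch in desktop OpenGL, where the data, counts, and attribute layout are just placeholders.

    /* Concatenate two position-only meshes into a single VBO and draw
     * everything with one glDrawArrays call. Assumes a current GL context,
     * a loader such as GLEW already included, and a shader whose position
     * attribute is at location 0. */
    void upload_and_draw(const float *verts_a, GLsizei count_a,
                         const float *verts_b, GLsizei count_b)
    {
        GLuint vbo;
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        /* Allocate room for both meshes, then copy them back to back. */
        GLsizeiptr bytes_a = count_a * 3 * sizeof(float);
        GLsizeiptr bytes_b = count_b * 3 * sizeof(float);
        glBufferData(GL_ARRAY_BUFFER, bytes_a + bytes_b, NULL, GL_STATIC_DRAW);
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes_a, verts_a);
        glBufferSubData(GL_ARRAY_BUFFER, bytes_a, bytes_b, verts_b);

        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void *)0);

        /* One draw call covers every vertex in the shared buffer. */
        glDrawArrays(GL_TRIANGLES, 0, count_a + count_b);
    }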
I know this answer may seem generic, but it really depends on the GPU you are using and the GPU memory available.
There is no hard limit on the number of vertices you can process as far as the OpenGL specification is concerned.
Here you can find some useful information on this topic:
How many maximum triangles can be drawn on ipad using opengl es in 1 frame?
Related
I'm implementing a 2D game with lots of independent rectangular game pieces of various dimensions. The dimensions of each piece do not change between frames. Most of the pieces will display an image and share the same fragment shader. I am new to WebGL, and it is not clear to me what the best strategy is for managing vertex buffers with regard to performance in this situation.
Is it better to use a single vertex buffer (quad) to represent all of the game's pieces and then rescale those vertices in the vertex shader for each piece? Or, should I define a separate static vertex buffer for each piece?
The GPU is a state machine, and switching states is expensive (even more so when done through WebGL, because of the additional layer of checks introduced by the WebGL implementation), so binding vertex buffers is expensive.
It's good practice to reduce API calls to a minimum.
Even when you have multiple distinct objects, you still want to use a single vertex buffer and rely on the offset parameters of the drawArrays or drawElements methods.
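A rough sketch of that pattern, written here against desktop OpenGL in C (the WebGL calls drawArrays and uniformMatrix4fv map over directly); the Piece struct and the u_transform uniform are purely illustrative:

    /* All pieces' quads live in one shared vertex buffer that is uploaded
     * once; each piece is drawn from its own sub-range via the "first"
     * parameter of drawArrays, so no buffer is rebound between pieces. */
    typedef struct {
        GLint   first;          /* index of the piece's first vertex       */
        GLsizei count;          /* 6 for a quad built from two triangles   */
        float   transform[16];  /* per-piece scale/translate matrix        */
    } Piece;

    void draw_pieces(GLuint shared_vbo, GLint u_transform,
                     const Piece *pieces, int n)
    {
        glBindBuffer(GL_ARRAY_BUFFER, shared_vbo);              /* bind once */
        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE,
                              2 * sizeof(float), (void *)0);

        for (int i = 0; i < n; ++i) {
            /* A small uniform update per piece instead of a buffer bind. */
            glUniformMatrix4fv(u_transform, 1, GL_FALSE, pieces[i].transform);
            glDrawArrays(GL_TRIANGLES, pieces[i].first, pieces[i].count);
        }
    }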
Here is a list of state changes ordered by decreasing cost (the most expensive at the top):
FrameBuffer
Program
Texture binds
Vertex format
Vertex bindings
Uniform updates
For more information, you can watch the excellent talk "Beyond Porting: How Modern OpenGL Can Radically Reduce Driver Overhead" by Cass Everitt and John McDonald, which is also where the list above comes from.
While those benchmarks were done on Nvidia hardware, they are a good guideline for AMD and Intel graphics hardware as well.
I'm attempting to render a large number of textured quads on the iPhone. To improve render speeds I've created a VBO that I leverage to render my objects in a single draw call. This seems to work well, but I'm new to OpenGL and have run into issues when it comes to providing a unique transform for each of my quads (ultimately I'm looking for each quad to have a custom scale, position and rotation).
After a decent amount of Googling, it appears that the standard means of handling this situation is to pass a uniform matrix to the vertex shader and to have each quad take care of rendering itself. But this approach seems to negate the purpose of the VBO, by ultimately requiring a draw call per object.
In my mind, it makes sense that each object should keep its own model-view matrix, using it to transform, scale, and rotate the object as necessary. But applying separate matrices to objects in a VBO has me lost. I've considered two approaches:
Send the model view matrix to the vertex shader as a non-uniform attribute and apply it within the shader.
Or transform the vertex data before it's stored in the VBO and sent to the GPU
But the fact that I'm finding it difficult to find information on how best to handle this leads me to believe I'm confusing the issue. What's the best way of handling this?
This is the "evergreen" question (a good one) on how to optimize the rendering of many simple geometries (a quad is in fact 2 triangles, 6 vertices most of the time unless we use a strip).
Anyway, using a VBO versus a VAO in this case should not give a significant advantage, since the amount of data to be transferred to the memory buffer is rather low (32 bytes per vertex, 96 bytes per triangle, 192 per quad), which is not a big effort for today's memory bandwidth. That said, it depends on how many quads you mean: if you have 20,000 quads per frame, it could become a problem anyway.
A possible approach could be to batch the drawing of the quads by building a new VAO each frame with the different quads positioned in your own coordinate system, something like shifting the quads' vertices to the correct position relative to a "virtual" mesh origin. Then you just perform a single draw of the newly created mesh in your VAO.
In this way, you could batch the drawing of multiple objects in fewer calls.
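A minimal sketch of that idea, assuming position-only 2D quads whose batch is rebuilt and re-uploaded every frame (all names here are illustrative):

    #include <stdlib.h>

    /* A unit quad as two triangles (6 vertices, 2 floats each). */
    static const float unit_quad[12] = {
        0, 0,  1, 0,  1, 1,
        0, 0,  1, 1,  0, 1
    };

    /* Shift the unit quad to each object's position, upload the whole batch
     * once, and draw it with a single call. Scaling/rotation could be folded
     * into the same loop, at the extra CPU cost mentioned below. */
    void draw_quads_batched(GLuint vbo, const float (*positions)[2], int n_quads)
    {
        float *batch = malloc((size_t)n_quads * 12 * sizeof(float));
        for (int q = 0; q < n_quads; ++q)
            for (int v = 0; v < 6; ++v) {
                batch[q * 12 + v * 2 + 0] = unit_quad[v * 2 + 0] + positions[q][0];
                batch[q * 12 + v * 2 + 1] = unit_quad[v * 2 + 1] + positions[q][1];
            }

        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)(n_quads * 12 * sizeof(float)),
                     batch, GL_STREAM_DRAW);        /* re-uploaded every frame */
        glEnableVertexAttribArray(0);
        glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 2 * sizeof(float), (void *)0);
        glDrawArrays(GL_TRIANGLES, 0, n_quads * 6); /* one call for all quads */
        free(batch);
    }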
The problem arises if your quads need to scale and rotate and not just translate: you can compute the actual vertex positions on the CPU, but that can become far too costly in terms of computing power.
A simple suggestion, on top of how you transfer the meshes, is to use a texture atlas for all of your quads' textures; this way you will need far fewer (if any) texture bind operations, which can be costly during rendering.
I am testing the rendering of extremely large 3d meshes, and I am currently testing on an iPhone 5 (I also have an iPad 3).
I have here two screenshots of Instruments with a profiling run. The first one is rendering a 1.3M vertex mesh, and the second is rendering a 2.1M vertex mesh.
The blue histogram bar at the top shows CPU load, and you can see that for the first mesh it hovers at around 10%, so the GPU is doing most of the heavy lifting. The mesh is very detailed, and my point-light-with-specular shader makes it look quite impressive, if I say so myself, as it renders consistently above 20 frames per second. Oh, and 4x MSAA is enabled as well!
However, once I step up to a 2 million+ vertex mesh, everything goes to crap: here we see a massively CPU-bound situation, and all instruments report 1 frame per second.
So, it's pretty clear that somewhere between these two assets (and I will admit that both are tremendously large meshes to be loading in a single VBO), some limit is being surpassed by the 2-megavertex (462K tris) mesh, whether it is the vertex buffer size or the index buffer size that goes over it.
So, the question is, what is this limit, and how can I query it? It would really be very preferable if I can have some reasonable assurance that my app will function well without exhaustively testing every device.
I also see an alternative approach to this problem, which is to stick to a known good VBO size limit (I have read about 4MB being a good limit), and basically just have the CPU work a little bit harder if the mesh being rendered is monstrous. With a 100MB VBO, having it in 4MB chunks (segmenting the mesh into 25 draw calls) does not really sound that bad.
But I'm still curious: how can I check the max size, in order to work around the CPU fallback? Could I be running into an out-of-memory condition, with Apple simply applying a CPU-based workaround (oh LORD have mercy, 2 million vertices in immediate mode...)?
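For reference, the chunked fallback I have in mind is roughly this (a sketch that assumes the vertex buffer and a 16-bit index buffer are already bound; CHUNK_INDICES is an arbitrary size, kept a multiple of 3 so each chunk ends on a triangle boundary):

    /* Draw a huge indexed mesh as a series of fixed-size glDrawElements
     * calls over successive ranges of the bound GL_ELEMENT_ARRAY_BUFFER. */
    #define CHUNK_INDICES (3 * 21845)   /* ~64K indices per call */

    void draw_in_chunks(GLsizei total_indices)
    {
        for (GLsizei start = 0; start < total_indices; start += CHUNK_INDICES) {
            GLsizei count = total_indices - start;
            if (count > CHUNK_INDICES)
                count = CHUNK_INDICES;
            /* The last argument is a byte offset into the index buffer. */
            glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT,
                           (const void *)(sizeof(GLushort) * (size_t)start));
        }
    }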
In pure OpenGL, there are two implementation-defined attributes: GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES. When these are exceeded, performance can drop off a cliff in some implementations.
I spent a while looking through the OpenGL ES specification for the equivalent and could not find it. Chances are it's buried in one of the OES or vendor-specific extensions to OpenGL ES. Nevertheless, there is a very real hardware limit to the number of elements you can draw and the number of vertices. Past a point, with too many indices you can exceed the capacity of the post-T&L cache. 2 million is a lot for a single draw call; lacking the ability to query the OpenGL ES implementation for this information, I'd try successively lower powers of two until you dial it back to the sweet spot.
65,536 used to be a sweet spot on DX9 hardware. That was the limit for 16-bit indices and was always guaranteed to be below the maximum hardware vertex count. Chances are it'll work for OpenGL ES class hardware too...
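On desktop OpenGL you can at least read the two hints mentioned above, along the lines of the following sketch (OpenGL ES 2.0 exposes no equivalent query, so there the values can only serve as guidance):

    #include <stdio.h>

    /* Query the soft limits intended for glDrawRangeElements. Exceeding
     * them is legal, but may fall off the driver's fast path. */
    void print_element_limits(void)
    {
        GLint max_vertices = 0, max_indices = 0;
        glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &max_vertices);
        glGetIntegerv(GL_MAX_ELEMENTS_INDICES,  &max_indices);
        printf("Recommended max vertices per draw: %d\n", max_vertices);
        printf("Recommended max indices per draw:  %d\n", max_indices);
    }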
I'm writing a pixel shader that has the property that, for a given quad, the values returned vary only with the u-axis value. That is, for a fixed u, the color output is constant as v varies.
The computation to calculate the color at a pixel is relatively expensive: it does multiple samples per pixel, loops, etc.
Is there a way to take advantage of the v-invariance property? If I were doing this on a CPU, I'd obviously just cache the values once calculated, but I guess that doesn't apply here because of the parallelism. It might be possible for me to move the texture generation to the CPU side and have the shader access a Texture1D, but I'm not sure how fast that would be.
Is there a paradigm that fits this situation on GPU cards?
Storing your data in a 1D texture and sampling it in your pixel shader sounds like a good solution. Your GPU will be able to use its texture caching features, taking advantage of the fact that many of your pixels use the same value from the 1D texture. This should be really fast; texture fetching and caching is one of the main reasons your GPU is so efficient at rendering.
It is common practice to make a trade-off between calculating the value in the pixel shader and using a lookup-table texture. You are doing complex calculations by the sound of it, so using a lookup texture will certainly improve performance.
Note that you could still generate this texture on the GPU; there is no need to move the work to the CPU. Just render to this 1D texture using your existing shader code as a prepass.
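In OpenGL terms (the same idea carries over to a Direct3D Texture1D), such a prepass could look roughly like this; the width, formats, and the assumption of an already-bound full-screen quad are all illustrative:

    /* Render the expensive per-u computation once into a width x 1 texture,
     * then sample that texture in the cheap main-pass shader, keyed by u. */
    GLuint make_lut(GLsizei width, GLuint expensive_program)
    {
        GLuint tex, fbo;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, 1, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, NULL);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                               GL_TEXTURE_2D, tex, 0);

        /* Prepass: one full-width strip runs the costly shader once per u. */
        glViewport(0, 0, width, 1);
        glUseProgram(expensive_program);
        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);   /* full-screen quad assumed bound */

        glBindFramebuffer(GL_FRAMEBUFFER, 0);
        return tex;
    }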
I'm implementing billboards for vegetation where a billboard is of course a single quad consisting of two triangles. The vertex data is stored in a vertex buffer, but should I bother with indices? I understand that the savings on things like terrain can be huge in terms of vertices sent to the graphics card when you use indices, but using indices on billboards means that I'll have 4 vertices per quad rather than 6, since each quad is completely separate from the others.
And is it possible that the use of indices actually reduces performance because there is an extra level of indirection? Or isn't that of any significance at all?
I'm asking this because using indices would slightly complicate matters and I'm curious to know if I'm not doing extra work that just makes things slower (whether just in theory or actually noticeable in practice).
This is using XNA, but should apply to DirectX.
Using indices not only saves on bandwidth, by sending less data to the card, but also reduces the amount of work the vertex shader has to do. The results of the vertex shader can be cached if there is an index to use as a key.
If you render lots of this billboarded vegetation and don't change your index buffer, I think you should see a small gain.
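For illustration, here is that index layout in C (in XNA you would fill a short[] the same way and hand it to an IndexBuffer):

    #include <stdlib.h>

    /* Indices for n independent billboard quads: 4 vertices and 6 indices
     * per quad. With 16-bit indices this caps out at 16,384 quads per
     * buffer (65,536 / 4). The buffer never changes between frames. */
    unsigned short *build_quad_indices(int n_quads)
    {
        unsigned short *idx = malloc((size_t)n_quads * 6 * sizeof *idx);
        for (int q = 0; q < n_quads; ++q) {
            unsigned short base = (unsigned short)(q * 4);
            idx[q * 6 + 0] = base + 0;
            idx[q * 6 + 1] = base + 1;
            idx[q * 6 + 2] = base + 2;
            idx[q * 6 + 3] = base + 0;
            idx[q * 6 + 4] = base + 2;
            idx[q * 6 + 5] = base + 3;
        }
        return idx;
    }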
When it comes to very primitive geometry, it might not make much sense to use indices; I wouldn't even bother with performance in that case, since even modest hardware will render millions of triangles a second.
Now, technically, you don't know how the hardware will handle the data internally; it might convert the data to indexed form anyway, because that's the most common way geometry is represented.