I've been working on improving my OpenGL ES 2.0 render performance by introducing batching; specifically one creates a RenderBatch, specifying a texture and a shader (for now) upon creation. This sets the state into a VAO to allow for inexpensive state switching. I started the implementation looking something like this:
batch = RenderBatch.new "SpriteSheet" "FlatShader"
batch.begin GL_TRIANGLE_STRIP
batch.addGeometry Geometry.newFromFile "Billboard"
batch.end
batch.render renderEngine
But then it hit me: my Billboard file has vertices that are meant to be scaled and translated for specific instance usage. So I added a transform argument to the addGeometry call.
batch.addGeometry(Geometry.newFromFile("Billboard"), myObject.transform)
This solves the problem of scaling, translating, and rotating the vertices, but it does so by first looking up the vertex information, transforming it by the transform matrix, and then inserts it into the batch data. While this works it seems inefficient; it is CPU intensive and doesn't take advantage of the GPU's transformation power. However, it works, so not that big of a deal. (Would be nice to have a better way to do this though)
However, I've run into a roadblock: texture coordinates may need to be different for each instance as well, and that means I would have to pass in a texture transformation matrix, and now this is feeling hacky.
Is there an easier way to handle this kind of transformation to existing data using shaders that does not limit the geometry/models given and is easily extensible to use normal maps, UV maps, and other fancy tricks? Thanks!
It seems to me that what you are talking about are shader uniforms. Normally you would set up the vertex data and attributes for each batch in a VBO and a VAO. Then, in your render method, you switch to the correct VAO and set up the shader uniforms. These normally include a model-view-projection matrix to transform vertices into clip space, which necessarily would change nearly every frame, the correct texture to use, etc.
This is efficient because the unchanging vertex data is held in GPU memory, the VAO takes care of cheap state switching, and only the uniforms, which generally change often, are sent to the GPU each render call.
If you are batching multiple objects that require separate model view projection matrices, then you have a few options:
you have to perform a separate draw call for each batch that requires a separate model view projection matrix
use an array of model view projection matrices as a uniform and have an attribute for each object that provides the correct projection matrix index to use
you have to transform the vertices using the CPU and refill the VBO with the updated data
The first method is the preferred solution, it will be efficient and simple. The slow part of rendering lots of draw calls is generally getting the data from the CPU to the GPU, if you already have the vertex data in VBOs then the overhead of a draw call per object is not going to be a big deal. This also solves the problem of how to provide different uniforms per object based on object properties. In each objects render method, the relevant properties are set up as uniforms before the draw call is made. If each object requires different data sent to the GPU, then how else could this work?
This is a trade-off situation. Costs of state changes due to insufficient batching compared to costs of transformation on the CPU. There is no single best solution, but it depends on how much of your scene is static, how much is dynamic and how it is laid out.
A common solution is to put static objects, whose transformation relative to each other never changes into a single VBO, or few VBOs (if they use different textures, vertex formats, etc), completely transformed. This is done once before rendering. Not each frame. Dynamic objects (players, monster, whatever) are then rendered individually, with transformation done in the vertex shader.
You can still optimize for state changes by roughly ordering the drawing of the individual objects by textures and programs.
Related
I have some vertex data. Positions, normals, texture coordinates. I probably loaded it from a .obj file or some other format. Maybe I'm drawing a cube. But each piece of vertex data has its own index. Can I render this mesh data using OpenGL/Direct3D?
In the most general sense, no. OpenGL and Direct3D only allow one index per vertex; the index fetches from each stream of vertex data. Therefore, every unique combination of components must have its own separate index.
So if you have a cube, where each face has its own normal, you will need to replicate the position and normal data a lot. You will need 24 positions and 24 normals, even though the cube will only have 8 unique positions and 6 unique normals.
Your best bet is to simply accept that your data will be larger. A great many model formats will use multiple indices; you will need to fixup this vertex data before you can render with it. Many mesh loading tools, such as Open Asset Importer, will perform this fixup for you.
It should also be noted that most meshes are not cubes. Most meshes are smooth across the vast majority of vertices, only occasionally having different normals/texture coordinates/etc. So while this often comes up for simple geometric shapes, real models rarely have substantial amounts of vertex duplication.
GL 3.x and D3D10
For D3D10/OpenGL 3.x-class hardware, it is possible to avoid performing fixup and use multiple indexed attributes directly. However, be advised that this will likely decrease rendering performance.
The following discussion will use the OpenGL terminology, but Direct3D v10 and above has equivalent functionality.
The idea is to manually access the different vertex attributes from the vertex shader. Instead of sending the vertex attributes directly, the attributes that are passed are actually the indices for that particular vertex. The vertex shader then uses the indices to access the actual attribute through one or more buffer textures.
Attributes can be stored in multiple buffer textures or all within one. If the latter is used, then the shader will need an offset to add to each index in order to find the corresponding attribute's start index in the buffer.
Regular vertex attributes can be compressed in many ways. Buffer textures have fewer means of compression, allowing only a relatively limited number of vertex formats (via the image formats they support).
Please note again that any of these techniques may decrease overall vertex processing performance. Therefore, it should only be used in the most memory-limited of circumstances, after all other options for compression or optimization have been exhausted.
OpenGL ES 3.0 provides buffer textures as well. Higher OpenGL versions allow you to read buffer objects more directly via SSBOs rather than buffer textures, which might have better performance characteristics.
I found a way that allows you to reduce this sort of repetition that runs a bit contrary to some of the statements made in the other answer (but doesn't specifically fit the question asked here). It does however address my question which was thought to be a repeat of this question.
I just learned about Interpolation qualifiers. Specifically "flat". It's my understanding that putting the flat qualifier on your vertex shader output causes only the provoking vertex to pass it's values to the fragment shader.
This means for the situation described in this quote:
So if you have a cube, where each face has its own normal, you will need to replicate the position and normal data a lot. You will need 24 positions and 24 normals, even though the cube will only have 8 unique positions and 6 unique normals.
You can have 8 vertexes, 6 of which contain the unique normals and 2 of normal values are disregarded, so long as you carefully order your primitives indices such that the "provoking vertex" contains the normal data you want to apply to the entire face.
EDIT: My understanding of how it works:
Topic is pretty much the question. I'm trying to understand how CPU and GPU cooperation works.
I'm developing my game via cocos2d. It is a game engine so it redraws the whole screen 60 times per second. Every node in cocos2d draws its own set of triangles. Usually you set vertexes for triangle after performing node transforms (from node to world) on CPU side. I've realized the way to do it on GPU side with vertex shaders by passing view model projection to uniforms.
I see CPU time decreasing by ~1ms and gpu time raised by ~0.5ms.
Can I consider this as a performance gain?
In other words: if something can be done on GPU side is there any reasons you shouldn't do it?
The only time you shouldn't do something on the GPU side is if you need the result (in easily accessible form) on the CPU side to further the simulation.
Taking your example. If we assume you have 4 250KB meshes which represent a hierarchy of body parts (as a skeleton). Lets assume you are using a 4x4 matrix of floats for the transformations (64bytes) for each mesh. You could either:
Each frame, perform the mesh transformation calculations on the application side (CPU) and then upload the four meshes to the GPU. This would result in about ~1000kb of data being sent to the GPU per frame.
When the application starts, upload the data for the 4 meshes to the GPU (this will be in a rest / identity pose). Then each frame when you make the render call, you calculate only the new matrices for each mesh (position/rotation/scale) and upload those matrices to the GPU and perform the transformation there. This results in ~256bytes being sent to the GPU per frame.
As you can see, even if the data in the example is fabricated, the main advantage is that you are minimizing the amount of data being transferred between CPU and GPU on a per frame basis.
The only time you would prefer the first option is if your application needs the results of the transformation to do some other work. The GPU is very efficient (especially at processing vertices in parallel), but it isn't too easy to get information back from the GPU (and then its usually in the form on a texture - i.e. a RenderTarget). One concrete example of this 'further work' might be performing collision checks on transformed mesh positions.
edit
You can tell based on how you are calling the openGL api where the data is stored to some extent*. Here is a quick run-down:
Vertex Arrays
glVertexPointer(...)
glDrawArray(...)
using this method passing an array of vertices from the CPU -> GPU each frame. The vertices are processed sequentially as they appear in the array. There is a variation of this method (glDrawElements) which lets you specify indices.
VBOs
glBindBuffer(...)
glBufferData(...)
glDrawElements(...)
VBOs allow you to store the mesh data on the GPU (see below for note). In this way, you don't need to send the mesh data to the GPU each frame, only the transformation data.
*Although we can indicate where our data is to be stored, it is not actually specified in the OpenGL specification how the vendors are to implement this. It means that, we can give hints that our vertex data should be stored in VRAM, but ultimately, it is down to the driver!
Good reference links for this stuff is:
OpenGL ref page: https://www.opengl.org/sdk/docs/man/html/start.html
OpenGL explanations: http://www.songho.ca/opengl
Java OpenGL concepts for rendering: http://www.java-gaming.org/topics/introduction-to-vertex-arrays-and-vertex-buffer-objects-opengl/24272/view.html
I'm attempting to render a large number of textured quads on the iPhone. To improve render speeds I've created a VBO that I leverage to render my objects in a single draw call. This seems to work well, but I'm new to OpenGL and have run into issues when it comes to providing a unique transform for each of my quads (ultimately I'm looking for each quad to have a custom scale, position and rotation).
After a decent amount of Googling, it appears that the standard means of handling this situation is to pass a uniform matrix to the vertex shader and to have each quad take care of rendering itself. But this approach seems to negate the purpose of the VBO, by ultimately requiring a draw call per object.
In my mind, it makes sense that each object should keep it's own model view matrix, using it to transform, scale and rotate the object as necessary. But applying separate matrices to objects in a VBO has me lost. I've considered two approaches:
Send the model view matrix to the vertex shader as a non-uniform attribute and apply it within the shader.
Or transform the vertex data before it's stored in the VBO and sent to the GPU
But the fact that I'm finding it difficult to find information on how best to handle this leads me to believe I'm confusing the issue. What's the best way of handling this?
This is the "evergreen" question (a good one) on how to optimize the rendering of many simple geometries (a quad is in fact 2 triangles, 6 vertices most of the time unless we use a strip).
Anyway, the use of VBO vs VAO in this case should not mean a significant advantage since the size of the data to be transferred on the memory buffer is rather low (32 bytes per vertex, 96 bytes per triangle, 192 per quad) which is not a big effort for nowadays memory bandwidth (yet it depends on How many quads you mean. If you have 20.000 quads per frame then it would be a problem anyway).
A possible approach could be to batch the drawing of the quads by building a new VAO at each frame with the different quads positioned in your own coordinate system. Something like shifting the quads vertices to the correct position in a "virtual" mesh origin. Then you just perform a single draw of the newly creates mesh in your VAO.
In this way, you could batch the drawing of multiple objects in fewer calls.
The problem would be if your quads need to "scale" and "rotate" and not just translate, you can compute it with CPU the actual vertices position but it would be way to costly in terms of computing power.
A simple suggestion on top of the way you transfer the meshes is to use a texture atlas for all the textures of your quads, in this way you will need a much lower (if not needed at all) texture bind operation which might be costly in rendering operations.
Using webgl I need to perform 3 passes to render my scene. Each pass runs the same geometry and shaders but has differing values for some uniforms and textures.
I seem to have two choices. Have a single "program" and set all of the uniforms and textures for each pass. Or have 3 "programs" each containing the same shaders, and set all the necessary uniforms/shaders once per program, and then just switch programs for each pass. This means that I will do one useProgram call per pass instead of man setUniform calls for each pass.
Is this second technique likely to be faster as it will avoid very many setuniform calls, or is changing the program very expensive? I've done some trials but with the very simple geometry I have at the moment I don't see any difference in performance because setup costs overwhelm any differences.
Is there any reason to prefer one technique over the other?
Just send different values via glUniform if the shader programs are the same.
Switching between programs is generally slower than change value of uniform.
Anyway Uber Shader Program (with list of uniforms like useLighting, useAlphaMap) in most cases aren't good.
#gman
We are talking about WebGL (GLES 2.0) where we don't have UBO. (uniform buffer object)
#top
Summing try to avoid rebinding shader programs (but it's not end of the world) and don't create one uber shader!
When you have large amouts of textures to rebind, texture atlasing should be the fastest solution, so you don't need to rebind textures, don't need to rebind programs. Textures can be switched by modifying uniforms representing texCoord offsets.
Modifying such uniforms can be optimized even further:
You should consider moving frequently modified uniforms to attributes. Usualy their data source are provided using attribPointers but you can also use constant values when they are disabled. Instead of unformXXX() use attribXXX() functions to specify their constant values.
I think best example is light position. Normaly you'd have to specify uniform values for it every time light position changes to ALL programs that make use of it. In contrast, when using 'attributed' uniforms you can specify attribute value once globaly when your light moves.
-pros:
This method is best suited when you have many programs which would like to share uniforms, as we know we can't use uniform buffers in WebGL, it seams to be the only reasonable solution.
-cons:
Of course available size of such 'attributed' uniforms will be much smaller than using regular uniforms, but it still can speed things up a lot if you do it to some part of your uniforms.
I'm implementing billboards for vegetation where a billboard is of course a single quad consisting of two triangles. The vertex data is stored in a vertex buffer, but should I bother with indices? I understand that the savings on things like terrain can be huge in terms of vertices sent to the graphics card when you use indices, but using indices on billboards means that I'll have 4 vertices per quad rather than 6, since each quad is completely separate from the others.
And is it possible that the use of indices actually reduces performance because there is an extra level of indirection? Or isn't that of any significance at all?
I'm asking this because using indices would slightly complicate matters and I'm curious to know if I'm not doing extra work that just makes things slower (whether just in theory or actually noticeable in practice).
This is using XNA, but should apply to DirectX.
Using indices not only saves on bandwidth, by sending less data to the card, but also reduces the amount of work the vertex shader has to do. The results of the vertex shader can be cached if there is an index to use as a key.
If you render lots of this billboarded vegetation and don't change your index buffer, I think you should see a small gain.
When it comes to very primitive gemotery then it might won't make any sense to use indices, I won't even bother with performance in that case, even the modest HW will render millions of triangles a seconds.
Now, technically, you don't know how the HW will handle the data internally, it might convert them to indices anyway because that's the most popular form of geometry presentation.