iOS OpenGL ES 2.0 VBO confusion

I'm attempting to render a large number of textured quads on the iPhone. To improve render speeds I've created a VBO that I leverage to render my objects in a single draw call. This seems to work well, but I'm new to OpenGL and have run into issues when it comes to providing a unique transform for each of my quads (ultimately I'm looking for each quad to have a custom scale, position and rotation).
After a decent amount of Googling, it appears that the standard means of handling this situation is to pass a uniform matrix to the vertex shader and to have each quad take care of rendering itself. But this approach seems to negate the purpose of the VBO, by ultimately requiring a draw call per object.
In my mind, it makes sense that each object should keep its own model view matrix, using it to translate, scale and rotate the object as necessary. But applying separate matrices to objects in a VBO has me lost. I've considered two approaches:
Send the model view matrix to the vertex shader as a non-uniform attribute and apply it within the shader.
Or transform the vertex data before it's stored in the VBO and sent to the GPU
But the fact that I'm finding it difficult to find information on how best to handle this leads me to believe I'm confusing the issue. What's the best way of handling this?

This is the "evergreen" question (a good one) on how to optimize the rendering of many simple geometries (a quad is in fact 2 triangles, 6 vertices most of the time unless we use a strip).
Anyway, the use of a VBO vs a VAO in this case should not bring a significant advantage, since the amount of data to be transferred to the memory buffer is rather low (32 bytes per vertex, 96 bytes per triangle, 192 per quad), which is not a big effort for today's memory bandwidth (though it depends on how many quads you mean; if you have 20,000 quads per frame it would be a problem anyway).
A possible approach could be to batch the drawing of the quads by building a new VAO at each frame with the different quads positioned in your own coordinate system, i.e. shifting the quads' vertices to the correct position relative to a "virtual" mesh origin. Then you just perform a single draw of the newly created mesh in your VAO.
In this way, you could batch the drawing of multiple objects in fewer calls.
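For illustration, a minimal sketch of that per-frame batching (iOS GLES 2.0 header and a hypothetical Quad struct assumed; the attribute setup against the VBO is omitted):

#include <OpenGLES/ES2/gl.h>

typedef struct { float x, y, halfSize; } Quad;   /* hypothetical per-quad data */

enum { kMaxQuads = 1024 };
static GLfloat sVerts[kMaxQuads * 6 * 2];        /* 6 verts * (x, y) per quad */

/* Shift each quad's vertices to its own position inside one shared buffer,
   then issue a single draw call. Per-quad scale/rotation could be folded in
   here too, at extra CPU cost (see the caveat below). Assumes the position
   attribute was already set up with glVertexAttribPointer against this VBO. */
static void DrawQuadBatch(GLuint vbo, const Quad *quads, int count)
{
    static const float corner[12] = { -1,-1,  1,-1,  1,1,   -1,-1,  1,1,  -1,1 };
    int v = 0;
    for (int i = 0; i < count && i < kMaxQuads; ++i) {
        for (int j = 0; j < 12; j += 2) {
            sVerts[v++] = corner[j]     * quads[i].halfSize + quads[i].x;
            sVerts[v++] = corner[j + 1] * quads[i].halfSize + quads[i].y;
        }
    }
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, v * sizeof(GLfloat), sVerts, GL_DYNAMIC_DRAW);
    glDrawArrays(GL_TRIANGLES, 0, v / 2);        /* one call for the whole batch */
}
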
The problem arises if your quads need to scale and rotate and not just translate: you can compute the actual vertex positions on the CPU, but it would be too costly in terms of computing power.
A simple suggestion, on top of the way you transfer the meshes, is to use a texture atlas for all the textures of your quads; this way you will need far fewer (if any) texture bind operations, which can be costly during rendering.
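As an aside, for a grid-based atlas the per-sprite texture coordinates reduce to a tiny helper like this (cell layout and names are made up for the example):

typedef struct { float u0, v0, u1, v1; } AtlasRect;

/* Texture coordinates of cell `index` in a cols x rows grid atlas. */
static AtlasRect AtlasCell(int index, int cols, int rows)
{
    float cw = 1.0f / (float)cols, ch = 1.0f / (float)rows;
    AtlasRect r;
    r.u0 = (float)(index % cols) * cw;
    r.v0 = (float)(index / cols) * ch;
    r.u1 = r.u0 + cw;
    r.v1 = r.v0 + ch;
    return r;
}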

Related

For batch rendering multiple similar objects which is more performant, drawArrays(TRIANGLE_STRIP) with "degenerate triangles" or drawArraysInstanced?

MDN states that:
Fewer, larger draw operations will generally improve performance. If you have 1000 sprites to paint, try to do it as a single drawArrays() or drawElements() call.
It's common to use "degenerate triangles" if you need to draw discontinuous objects as a single drawArrays(TRIANGLE_STRIP) call. Degenerate triangles are triangles with no area, therefore any triangle where more than one point is in the same exact location. These triangles are effectively skipped, which lets you start a new triangle strip unattached to your previous one, without having to split into multiple draw calls.
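To make that layout concrete, here is a rough sketch (my own C, not from MDN) of how quads are chained into a single strip by repeating vertices:

/* Append one quad (4 verts, x/y interleaved) to a TRIANGLE_STRIP vertex array.
   Between quads we repeat the previous quad's last vert and the new quad's
   first vert, producing zero-area (degenerate) triangles that the GPU skips.
   Returns the number of floats written so far; draw the whole thing later
   with glDrawArrays(GL_TRIANGLE_STRIP, 0, total / 2). */
static int AppendQuadToStrip(float *out, int written, const float quad[8])
{
    if (written > 0) {                       /* not the first quad: emit the bridge */
        out[written + 0] = out[written - 2]; /* repeat last vert of previous quad */
        out[written + 1] = out[written - 1];
        out[written + 2] = quad[0];          /* repeat first vert of this quad */
        out[written + 3] = quad[1];
        written += 4;
    }
    for (int i = 0; i < 8; ++i)
        out[written++] = quad[i];
    return written;
}
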
However, it is also commonly recommended that for multiple similar objects one should use instanced rendering. For WebGL2 that means something like drawArraysInstanced(), or for WebGL1 drawArrays() with the ANGLE_instanced_arrays extension enabled.
For my personal purposes I need to render a large number of rectangles of the same width in a 2D plane but with varying heights (a WebGL-powered charting application), so any recommendation particular to my use case is valuable.
Degenerate triangles are generally faster than drawArraysInstanced, but there's arguably no reason to use degenerate triangles when you can just make quads with no degenerate triangles.
While it's probably true that degenerate triangles are slightly faster than quads, you're unlikely to notice the difference. In fact, I suspect it would be difficult to create an example in WebGL that shows the difference.
To be clear, I'm suggesting manually instanced quads: if you want to draw 1000 quads, put 1000 quads in a single vertex buffer and draw them all with one draw call using either drawElements or drawArrays.
On the other hand, instanced quads using drawArraysInstanced might be the most convenient approach, depending on what you are trying to do.
If it were me, though, I'd first test without optimization, drawing one quad per draw call, unless I already knew I was going to draw more than 1000 quads. Then I'd find some low-end hardware and see if it's too slow. Most GPU apps get fill-rate bound (drawing pixels) before they get vertex bound, so even on a slow machine drawing lots of quads might be slow in a way that optimizing the vertex work won't fix.
You might find this and/or this useful
You can take as a given that the performance of rendering has been optimized by the compiler and the OpenGL core.
Static buffers
If you have buffers that are static, there is generally an insignificant performance difference between the techniques mentioned. Different hardware (GPUs) will favor one technique over another, but there is no way to know in advance what type of GPU you are running on.
Dynamic buffers
If, however, the buffers are dynamic, then you need to consider the transfer of data from CPU RAM to GPU RAM. This transfer is a slow point, and on most GPUs it will stall rendering while the data is moved (losing the advantages of concurrent rendering).
On average anything that can be done to reduce the size of the buffers moved will improve the performance.
2D Sprites Triangle V Triangle_Strip
At the most basic, 2 floats per vertex (x, y for 2D sprites), you need to modify and transfer a total of 6 verts per quad for gl.TRIANGLES (6 * 2 * b = 48 bytes per quad, where b is bytes per float, i.e. 4). If you use gl.TRIANGLE_STRIP you need to move only 4 verts for a single quad, but for more than one you need to create the degenerate triangles, each of which requires an additional 2 verts in front and 2 verts behind. So the size per quad is 8 * 2 * 4 = 64 bytes per quad (in practice you can drop the 2-vert lead-in and 2-vert lead-out at the start and end of the buffer).
Thus for 1000 sprites there are 12,000 doubles (64-bit) that are converted to floats (32-bit), and the transfer is 48,000 bytes for gl.TRIANGLES. For gl.TRIANGLE_STRIP there are 16,000 doubles, for a total of 64,000 bytes transferred.
There is a clear advantage to gl.TRIANGLES over gl.TRIANGLE_STRIP in this case, and it is compounded if you include additional per-vert data (e.g. texture coords, color data, etc.).
Draw Array V Element
The situation changes when you use drawElements rather than drawArrays, as the verts used when drawing elements are located via the index buffer (a static buffer). In this case you need only modify 4 verts per quad (for 1000 quads, modify 8,000 doubles and transfer 32,000 bytes).
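A sketch of that static index buffer, written against the GLES C API (the WebGL calls are analogous); with 16-bit indices it tops out at roughly 16,000 quads per buffer:

#include <OpenGLES/ES2/gl.h>
#include <stdlib.h>

/* Build a static index buffer shared by up to maxQuads quads, assuming each
   quad's 4 unique verts are stored in strip order (bl, br, tl, tr). Only the
   4 verts per quad ever need to be re-uploaded; the indices never change. */
static GLuint MakeQuadIndexBuffer(int maxQuads)
{
    GLushort *indices = malloc((size_t)maxQuads * 6 * sizeof(GLushort));
    for (int q = 0; q < maxQuads; ++q) {
        GLushort base = (GLushort)(q * 4);
        GLushort *i = &indices[q * 6];
        i[0] = base + 0; i[1] = base + 1; i[2] = base + 2;   /* first triangle  */
        i[3] = base + 2; i[4] = base + 1; i[5] = base + 3;   /* second triangle */
    }
    GLuint ibo;
    glGenBuffers(1, &ibo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, maxQuads * 6 * sizeof(GLushort),
                 indices, GL_STATIC_DRAW);
    free(indices);
    return ibo;   /* later: glDrawElements(GL_TRIANGLES, quads * 6, GL_UNSIGNED_SHORT, 0) */
}
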
Instanced V modify verts
Now, using elements, we have 4 verts per quad (modify 8 doubles, transfer 32 bytes per quad).
Whether you use drawArrays or drawElements, if each quad has a uniform scale, a rotation, and a position (x, y), then with instanced rendering each quad needs only 4 doubles per instance: the position, scale, and rotation (the transform is done by the vertex shader).
In this case we have reduced the workload down to (for 1000 quads) modifying 4,000 doubles and transferring 16,000 bytes.
Thus instanced quads are the clear winner in terms of alleviating the transfer and JavaScript bottlenecks.
Instanced elements can go even further: in the case where only a position is needed, and that position stays within the screen, you can position a quad using only 2 shorts (16-bit ints), reducing the workload to modifying 2,000 ints (the 32-bit JS Number to short conversion is much quicker than the double-to-float conversion) and transferring only 4,000 bytes.
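For completeness, a sketch of the instanced path written against the GLES C API (the ES 3.0 entry points are shown; on ES 2.0 / WebGL1 the EXT_instanced_arrays / ANGLE_instanced_arrays equivalents apply, so treat the exact calls as an assumption):

#include <OpenGLES/ES3/gl.h>

/* Per-instance data: x, y, scale, rotation packed into one vec4 attribute,
   i.e. 16 bytes per quad. The shared unit-quad vertex buffer and its index
   buffer are assumed to be bound and configured already. */
static void DrawQuadInstances(GLuint instanceVBO, const GLfloat *xysr,
                              GLsizei quadCount, GLuint instanceAttrib)
{
    glBindBuffer(GL_ARRAY_BUFFER, instanceVBO);
    glBufferData(GL_ARRAY_BUFFER, quadCount * 4 * sizeof(GLfloat),
                 xysr, GL_DYNAMIC_DRAW);
    glEnableVertexAttribArray(instanceAttrib);
    glVertexAttribPointer(instanceAttrib, 4, GL_FLOAT, GL_FALSE, 0, 0);
    glVertexAttribDivisor(instanceAttrib, 1);   /* advance once per instance */
    glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0, quadCount);
}
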
Conclusion
In the best case, it is clear that instanced elements offer up to 16 times less work setting up and transferring quads to the GPU.
This advantage does not always hold true. It is a balance between the minimal data required per quad and the minimum data set per vert per quad (4 verts per quad).
Adding additional capabilities per quad will alter the balance, as will how often you modify the buffers (e.g. with texture coords you may only need to set the coords once when not using instancing, but with instancing you need to transfer all the data for a quad each time anything for that quad changes; note that clever interleaving of instance data can help).
There is also the hardware to consider. Modern GPUs are much better at state changes (transfer speeds); in those cases it is in the JavaScript code that you can gain any significant performance increase. Low-end GPUs are notoriously bad at state changes, and though optimal JS code is always important, reducing the data per quad is where the significant performance gains are on low-end devices.

How to batch sprites in iOS/OpenGL ES 2.0

I have developed my own sprite library on top of OpenGL ES 2.0. Right now, I am not doing any batching of draw calls; instead, each sprite has its own VBO/VAO of four textured vertices, drawn as a triangle strip (The VAO/VBO itself is managed by the Texture atlas, so identical sprites reuse the same VAO/VBO, which is 'reference counted' and hence deleted when no sprite instances reference it).
Before drawing each sprite, I bind its texture, upload its uniforms/attributes to the shader (modelview matrix, opacity; the projection matrix stays constant throughout), bind its Vertex Array Object (four textured vertices + four indices), and call glDrawElements(). I do cull off-screen sprites (based on position and bounds), but it is still one draw call per sprite, even if all sprites share the same texture. The vertex positions and texture coordinates for each sprite never change.
I must say that, despite this inefficiency, I have never experienced performance issues, even when drawing many sprites on screen. I do split the sprites into opaque/non-opaque, draw the opaque ones first, and the non-opaque ones after, back to front. I have seen performance suffer only when I overdraw (tax the fill rate).
Nevertheless, the OpenGL instruments in Xcode will complain that I draw too many small meshes and that I should consolidate my geometry into fewer objects. And in the Unity world everyone talks about limiting the number of draw calls as if they were the plague.
So, how should I go around batching very many sprites, each with a different transform and opacity value (but the same texture), into one draw call? One thing that comes to mind is to modify the vertex data every frame, and stream it: applying the modelview matrix of each sprite to all its vertices, assembling the transformed vertices for all sprites into one mesh, and submitting it to the GPU. This approach does not solve the problem of varying opacity between sprites.
Another idea that comes to mind is to have all the textured vertices of all the sprites assembled into a single mesh (VBO), treated as 'static' (same vertex format I am using now), and a separate array with the stuff that changes per sprite every frame (transform matrix and opacity), and only stream that data each frame, pulling it/applying it on the vertex shader side. That is, have a separate array where the 'attribute' being represented is the modelview matrix/alpha for the corresponding vertices. I still have to figure out the exact implementation in terms of data format/strides etc. In any case, there is the additional complication that whenever a new sprite is created/destroyed, the whole mesh has to be modified...
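A rough sketch of that second idea as an ES 2.0 vertex shader (attribute names hypothetical; rather than a full mat4 per vertex, which would occupy four attribute slots, this packs translate/scale/rotation into one vec4 and the alpha into another attribute, duplicated across a sprite's four vertices):

static const char *kBatchVertexShader =
    "attribute vec2 a_position;              \n"  /* unit-quad corner       */
    "attribute vec2 a_texCoord;              \n"
    "attribute vec4 a_xysr;                  \n"  /* x, y, scale, rotation  */
    "attribute float a_alpha;                \n"  /* per-sprite opacity     */
    "uniform mat4 u_projection;              \n"
    "varying vec2 v_texCoord;                \n"
    "varying float v_alpha;                  \n"
    "void main() {                           \n"
    "    float c = cos(a_xysr.w) * a_xysr.z; \n"
    "    float s = sin(a_xysr.w) * a_xysr.z; \n"
    "    vec2 p = vec2(a_position.x * c - a_position.y * s,             \n"
    "                  a_position.x * s + a_position.y * c) + a_xysr.xy;\n"
    "    gl_Position = u_projection * vec4(p, 0.0, 1.0);                \n"
    "    v_texCoord = a_texCoord;            \n"
    "    v_alpha = a_alpha;                  \n"
    "}                                       \n";
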
Or perhaps there is an ideal, 'textbook' solution to this problem out there that I haven't figured out? What does cocos2d do?
When I initially started reading your post I thought that each quad used a different texture (since you stated "Before drawing each sprite, I'll bind its texture"), but then you said that each sprite has "the same texture".
A possible easy win is to control the way you bind your textures during the draw, since each call is a burden for the OpenGL driver. If (and I am not really sure about this from your post) you use different textures, I suggest going for a simple texture atlas where all the sprites are inside a single picture (preferably a power-of-2 texture with mipmapping) and then you take the piece of the texture you need in the fragment shader using texture coordinates (this is the reason they exist, in the end).
If the position of the sprites changes at each frame (of course it does), a possible approach would be to pack the new vertex coordinates of your sprites each frame and draw directly from memory (possibly via a VAO; a VBO could cost more since you would need to rebuild it each frame, to be tested in a real scenario). This would be a good pack-and-draw operation and I am pretty sure it will boost performance.
Consider that the VAO option should be feasible since we are talking about a very small amount of data, and memory bandwidth should not be a real bottleneck (each quad uses, I guess, 12 floats for vertex coordinates, 8 for texture coordinates and 12 for normals, around 128 bytes?); it shouldn't be a big problem with a VAO.
About opacity, can't you use a uniform in your fragment shader to control the alpha? Am I wrong about that? It should work.
I hope this helps.
Ciao,
Maurizio

Efficient pixel shader when only u-axis varies?

I'm writing a pixel shader that has the property that, for a given quad, the returned values vary only along the u axis. That is, for a fixed u, the color output is constant as v varies.
The computation to calculate the color at a pixel is relatively expensive, i.e. it does multiple samples per pixel, loops, etc.
Is there a way to take advantage of the v-invariance property? If I were doing this on a CPU I would obviously just cache the values once calculated, but I guess that doesn't apply because of the parallelism. It might be possible for me to move the texture generation to the CPU side and have the shader access a Texture1D, but I'm not sure how fast that will be.
Is there a paradigm that fits this situation on GPU cards?
cheers
Storing your data in a 1D texture and sampling it in your pixel shader looks like a good solution. Your GPU will be able to use its texture caching features, allowing it to exploit the fact that many of your pixels use the same value from the 1D texture. This should be really fast; texture fetching and caching is one of the main reasons your GPU is so efficient at rendering.
It is common practice to make a trade-off between calculating the value in the pixel shader and using a lookup table texture. You are doing complex calculations by the sound of it, so using a lookup texture will certainly improve performance.
Note that you could still generate this texture by the GPU, there is no need to move it to the CPU. Just render to this 1D texture using your existing shader code as a prepass.
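For example, the lookup pass could boil down to a fragment shader along these lines (GLES 2.0 flavor; since ES 2.0 has no true 1D target, a width x 1 2D texture stands in, and v_uv is assumed to come from the vertex shader):

static const char *kLookupFragmentShader =
    "precision mediump float;                                    \n"
    "uniform sampler2D u_lookup;   /* width x 1 lookup texture */\n"
    "varying vec2 v_uv;                                          \n"
    "void main() {                                               \n"
    "    /* v is ignored: the result depends on u only */        \n"
    "    gl_FragColor = texture2D(u_lookup, vec2(v_uv.x, 0.5));  \n"
    "}                                                           \n";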

OpenGL batching and instance uniqueness

I've been working on improving my OpenGL ES 2.0 render performance by introducing batching; specifically one creates a RenderBatch, specifying a texture and a shader (for now) upon creation. This sets the state into a VAO to allow for inexpensive state switching. I started the implementation looking something like this:
batch = RenderBatch.new("SpriteSheet", "FlatShader")
batch.begin(GL_TRIANGLE_STRIP)
batch.addGeometry(Geometry.newFromFile("Billboard"))
batch.end()
batch.render(renderEngine)
But then it hit me: my Billboard file has vertices that are meant to be scaled and translated for specific instance usage. So I added a transform argument to the addGeometry call.
batch.addGeometry(Geometry.newFromFile("Billboard"), myObject.transform)
This solves the problem of scaling, translating, and rotating the vertices, but it does so by first looking up the vertex information, transforming it by the transform matrix, and then inserting it into the batch data. While this works, it seems inefficient; it is CPU intensive and doesn't take advantage of the GPU's transformation power. However, it works, so it's not that big of a deal. (It would be nice to have a better way to do this, though.)
However, I've run into a roadblock: texture coordinates may need to be different for each instance as well, and that means I would have to pass in a texture transformation matrix, and now this is feeling hacky.
Is there an easier way to handle this kind of transformation to existing data using shaders that does not limit the geometry/models given and is easily extensible to use normal maps, UV maps, and other fancy tricks? Thanks!
It seems to me that what you are talking about are shader uniforms. Normally you would set up the vertex data and attributes for each batch in a VBO and a VAO. Then, in your render method, you switch to the correct VAO and set up the shader uniforms. These normally include a model-view-projection matrix to transform vertices into clip space, which necessarily would change nearly every frame, the correct texture to use, etc.
This is efficient because the unchanging vertex data is held in GPU memory, the VAO takes care of cheap state switching, and only the uniforms, which generally change often, are sent to the GPU each render call.
If you are batching multiple objects that require separate model view projection matrices, then you have a few options:
you have to perform a separate draw call for each batch that requires a separate model view projection matrix
use an array of model view projection matrices as a uniform and have an attribute for each object that provides the index of the correct matrix to use (see the sketch after this list)
you have to transform the vertices using the CPU and refill the VBO with the updated data
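A rough sketch of the second option as an ES 2.0 vertex shader (names hypothetical; the array size of 16 is just an example, bounded in practice by MAX_VERTEX_UNIFORM_VECTORS):

static const char *kIndexedMVPVertexShader =
    "attribute vec3 a_position;                                \n"
    "attribute float a_matrixIndex;  /* per-object index */    \n"
    "uniform mat4 u_mvp[16];         /* one matrix per object */\n"
    "void main() {                                             \n"
    "    gl_Position = u_mvp[int(a_matrixIndex)]               \n"
    "                  * vec4(a_position, 1.0);                \n"
    "}                                                         \n";
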
The first method is the preferred solution; it will be efficient and simple. The slow part of rendering lots of draw calls is generally getting the data from the CPU to the GPU, and if you already have the vertex data in VBOs then the overhead of a draw call per object is not going to be a big deal. This also solves the problem of how to provide different uniforms per object based on object properties. In each object's render method, the relevant properties are set up as uniforms before the draw call is made. If each object requires different data sent to the GPU, then how else could this work?
This is a trade-off situation. Costs of state changes due to insufficient batching compared to costs of transformation on the CPU. There is no single best solution, but it depends on how much of your scene is static, how much is dynamic and how it is laid out.
A common solution is to put static objects, whose transformation relative to each other never changes, into a single VBO, or a few VBOs (if they use different textures, vertex formats, etc.), completely transformed. This is done once before rendering, not each frame. Dynamic objects (players, monsters, whatever) are then rendered individually, with the transformation done in the vertex shader.
You can still optimize for state changes by roughly ordering the drawing of the individual objects by textures and programs.

Optimise OpenGL ES 2.0 2D drawing using dirty rectangles

Is it possible to optimise OpenGL ES 2.0 drawing by using dirty rectangles?
In my case, I have a 2D app that needs to draw a background texture (full screen on iPad), followed by the contents of several VBOs on each frame. The problem is that these VBOs can potentially contain millions of vertices, taking anywhere up to a couple of seconds to draw everything to the display. However, only a small fraction of the display would actually be updated each frame.
Is this optimisation possible, and how (or perhaps more appropriately, where) would this be implemented? Would some kind of clipping plane need to be passed into the vertex shader?
If you set an area with glViewport, clipping is adjusted accordingly. This, however, happens after the vertex shader stage, just before rasterization. Since GL cannot know the result of your own vertex program, it cannot discard any vertex before applying the vertex program; only afterwards can it. How efficiently it does so depends on the actual GPU.
Thus, for full performance, you have to sort and split your objects into smaller (e.g. rectangularly bounded) tiles and test them against the field of view yourself.
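If you do tile the scene that way, the per-tile redraw might look roughly like this (the scissor test is my addition here, a standard GLES 2.0 feature commonly paired with glViewport, since it guarantees that clears and draws cannot touch pixels outside the rectangle):

#include <OpenGLES/ES2/gl.h>

/* Redraw only the dirty rectangle (window coordinates). */
static void DrawDirtyRect(GLint x, GLint y, GLsizei w, GLsizei h)
{
    glEnable(GL_SCISSOR_TEST);
    glScissor(x, y, w, h);           /* clip all writes to the rectangle */
    glClear(GL_COLOR_BUFFER_BIT);    /* clear only that region           */
    /* ... draw the VBOs whose bounds intersect the rectangle ... */
    glDisable(GL_SCISSOR_TEST);
}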

Resources