Optimize OpenGL ES 2.0 drawing iOS - ios

I have this huge model(helix) created with 2 million vertices at once and some million more indices for which vertices to use.
I am pretty sure this is a very bad way to draw so many vertices.
I need some hints to where I should start to optimize this?
I thought about copying 1 round of my helix (vertices) and moving the z of that. But in the end, I would be drawing a lot of triangles at once again...

How naive are you currently being? As per rickster's comment, there's a serious case of potential premature optimisation here: the correct way to optimise is to find the actual bottlenecks and to widen those.
Knee-jerk thoughts:
Minimise memory bandwidth. Pack your vertices into the smallest space they can fit into (i.e. limit precision where it is acceptable to do so) and make sure all the attributes that describe a single vertex are contiguously stored (i.e. the individual arrays themselves will be interleaved).
Consider breaking your model up to achieve that aim. Instanced drawing as rickster suggests is a good idea if it's sufficiently repetitive. You might also consider what you can do with 65536-vertex segments, since that'll cut your index size.
Use triangle strips if it allows you to specify the geometry in substantially fewer indices, even if you have to add degenerate triangles.
Consider where the camera will be. Do you really need that level of detail all the way around? Will the whole thing even ever be on screen? If not then consider level-of-detail solutions and subdivision for culling (both outside the viewport and within via the occlusion query).

Related

Why using triangulation for flat terrain?

I've seen many terrains in wire mode and all of them used triangles. I get it if you use it for different heights BUT why do people use so many triangles for flat areas in their terrain? If there is a large flat area wouldn't it be wise to create one big square or at least one big triangle (as big as possible) instead of using so many small ones?
So my question is, is there a reason to do this (maybe for textures)? I know tesselation does something like this but still leaves too many triangles from my point of view.
Possible reasons:
They don't have terrain optimization routines.
They use vertex lighting. Unless terrain is densely triangulated, it'll look horrible.
Shader does not work well with huge triangles. Interpolating huge values (like light dir etc) across a triangle might cause precision problems.
Physics engine does not work with huge triangles.
Huge triangles cause artifacts (I think there's a hardware-dependent limit on number of texture repeats).
Multiple materials (more than 8) across terrain. That'll go above multitexturing limits on certain cards, so it'll be necessary to split terrain.
Multiple different materials or regions across terrain, streamed zones. Different materials might require different texture coordinates, etc, and if there's some kind of 2nd set of coordinates on top of the terrain (optimized unwrapped lightmap), you won't be able to use one big flat triangle.
Per-pixel lighting with multiple sources. If you have several light sources, and want to use them all at once, you might have to use additive alpha-blending. With triangulated terrain you can pick out a small region that is affected by this particular resource, and redraw it with added specular from that lightsource. If you simply cut big triangle with clip-planes, you'll see z-fighting. If you don't select small region of terrain light affects, you'll have to redraw entire terrain, which is very likely to cause performance drop (fillrate/shader performance because many pixels are redrawn).

Which is faster: creating a detailed mesh before execution or tessellating?

For simplicity of the problem let's consider spheres. Let's say I have a sphere, and before execution I know the radius, the position and the triangle count. Let's also say the triangle count is sufficiently large (e.g. ~50k triangles).
Would it be faster generally to create this sphere mesh before hand and stream all 50k triangles to the graphics card, or would it be faster to send a single point (representing the centre of the sphere) and use tessellation and geometry shaders to build the sphere on the GPU?
Would it still be faster if I had 100 of these spheres in different positions? Can I use hull/geometry shaders to create something which I can then combine with instancing?
Tessellation is certainly valuable. Especially when combined with displacement from a heightmap. The isolated environment described in your question is bound not to fully answer your question.
Before using tessellation you would need to know that you will become CPU poly/triangle bound and therefore need to start utilizing the GPU to help you increase the overall triangles of your game/scene. Calculations are very fast on the GPU so yes using tessellation multiple subdivision levels is advisable if you are going to do it...though sometimes I've been happy with just subdividing 3-4 times from a 200 tri plane.
Mainly tessellation is used for environmental/static mesh scene objects so that you can spend your tri's on characters and other moving/animated models without becoming CPU bound.
Checkout engines like Unity3D and CryEngine for tessellation examples to help the learning curve.
I just so happen to be working with this at the same time.
In terms of FPS, the pre-computed method would be faster in this situation since you can
dump one giant 50K triangle sphere payload (like any other model) and
draw it in multiple places from there.
The tessellation method would be slower since all the triangles would
be generated from a formula, multiple times per frame.

Handling blocks in Minecraft-style terrain (d3d/c++)

In 3d terrain that consists of thousands of cubes (i.e. Minecraft ), what is a way to handle each block in terms of location and rendering? More specifically, I know that drawing a primitive of a cube and world transforming it everywhere in directX 9 is probably a ridiculous way to accomplish this since there are so many performance issues, so I was wondering what a more reasonable method would be.
Should each cube be a mesh that's copied many times, or is their a way to create the appropriate meshes from the data in your vertex buffer?
I found this article that walks through some of the theory behind implementing what I want to implement, but I've never used octrees before so I wasn't able to take too much from the source code. If octrees are indeed the way to go, where is a good starting point to learn about them? Most of my google searches only turned up blog posts about theory with little or no implementation examples.
It seems like using voxels would be useful in doing this, but like with octrees, I'm coming from no experience here, so I don't really know what to study first.
Anyway, thanks for any advice\resources\book names you can spare. I'm sure it's obvious, but I'm still very new to 3d programming, so I appreciate your help.
First off if you're using Minecraft as your reference, think of their use of chunks and relate it to Oct-trees. Minecraft divides up their world into smaller chunks to handle the massive amount information that is needed to be stored so use Oct-trees to organize this data that will be stored. Goz has a very accurate description of how Oct-trees and Quad-trees work, so use his information as a reference.
Another thing to consider is that you don't actually want to draw every cube to the screen as this will eat up your framerate. Use Object Culling to only draw visible cubes to the screen. Again if you think Minecraft; have you ever encountered a glitch where you can see through the blocks and under the world? This is because Minecraft only draws the top layer of blocks. With this many objects on screen, it would be a worthwhile investment to look into Object Culling using both the camera frustum and occlusion query.
For information on using DirectX I would recommend any book by Frank Luna. I own this book myself and it never leaves my side when programming in DirectX. http://www.amazon.com/Introduction-Game-Programming-Direct-9-0c/dp/1598220160/ref=sr_1_3?ie=UTF8&qid=1332478780&sr=8-3
I highly recommend this book as I've learned almost everything I know about DirectX from it.
Upon a Google search I found this link that discusses Occlusion Culling, because Luna doesn't cover occlusion culling, only frustum culling. I hear the Programming Gems series mentioned a lot, but I can't attest to its name personally. http://http.developer.nvidia.com/GPUGems/gpugems_ch29.html
Hope this helps.
Oct-trees are fairly simple, especially axis aligned ones like those in mine craft.
It is basically just a 3D extension of the quad-tree. You may find it easier to learn about Quad-trees first.
To give you a quick overview of a quad-tree; basically you start off with a square. Now imagine placing a much smaller square in that square. If you wish to build a quad tree representing it you first divide the original square into 4 equal sized squares.
Next you check each quadrant and if the smaller square is in that quadrant you split that quadrant into 4 smaller sized squares. Then you check those 4 quadrants choose the quadrant and subdivide. Eventually your smaller square will be wholly contained in one or more quadrants inside quadrants inside quadrants (etc). You have now built your quad tree.
Now if you imagine you are searching for a specific square inside the larger square you can quickly see the bonus of a quad-tree. Instead of searching every possible square in the quad tree (equivalent to searching every pixel in a texture) you can now check the first 4 quadrants to see if they contain it. If one does you can check its 4 sub quadrants and so on until you find the smallest quadrant wholly containing your square (or pixel). This way you end up doing many fewer tests to find your object.
Now an oct-tree is basically the same thing but instead of encoding squares in squares you now encode cubes in cubes. Every cube can be split into 8 smaller octants (and hence the name oct-tree).
Oct-trees have the advantage that by knowing which octant you are starting in you can easily cast rays through the oct-tree to find collisions (as an octant is either full, partially full or it is empty). If an octant is empty then you pass right through it and then check the octant on the other side. If it is partially full you check its sub-octants and so on until you either find a full octant (ie you've hit a solid cube and you render it) or you pass through the octant entirely and hence there is no cube to render. This is how minecraft works (I'm guessing anyway ;)). This is also a good way of quickly rendering voxel data which more people are looking into these days as a possible future rendering mechanism.
Hope thats some help! :)
Oct-trees and quad-trees are useful for culling sections of your geometry to render. Minecraft uses 16x16x16 render blocks to break up the terrain into manageable pieces.
Another technique to consider is instancing. Instancing is where you tell the GPU to render an object multiple times in different locations. It's used for crowd rendering, trees, anything where the geometry is the same, but you have lots of them.
http://msdn.microsoft.com/en-us/library/windows/desktop/bb173349(v=vs.85).aspx
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter03.html
Here is an article where the writer duplicates the minecraft renderer in OpenGL 4. While the code won't apply to your case the techniques (culling cubes that are surrounded, etc) can be applied to a directx renderer.
http://codeflow.org/entries/2010/dec/09/minecraft-like-rendering-experiments-in-opengl-4/
Don't be fooled by the blocky graphics and the low quality textures. Minecraft is an extremely complex renderer and you'll need to come up with ways to handle the sheer number of items involved. For example even a "small" part of the world, say 100x100x100 blocks is 1 million blocks. To push each block to the GPU as a separate mesh would kill your GPU. The Minecraft renderer is far more complex than most first person shooters when you get down to the technology.

How should I optimize drawing a large, dynamic number of collections of vertices?

...or am I insane to even try?
As a novice to using bare vertices for 3d graphics, I haven't ever worked with vertex buffers and the like before. I am guessing that I should use a dynamic buffer because my game deals with manipulating, adding and deleting primitives. But how would I go about doing that?
So far I have stored my indices in a Triangle.cs class. Triangles are stored in Quads (which contain the vertices that correspond to their indices), quads are stored in blocks. In my draw method, I iterate through each block, each quad in each block, and finally each triangle, apply the appropriate texture to my effect, then call DrawUserIndexedPrimitives to draw the vertices stored in the triangle.
I'd like to use a vertex buffer because this method cannot support the scale I am going for. I am assuming it to be dynamic. Since my vertices and indices are stored in a collection of separate classes, though, can I still effectively use a buffer? Is using separate buffers for each quad silly (I'm guessing it is)? Is it feasible and effective for me to dump vertices into the buffer the first time a quad is drawn and then store where those vertices were so that I can apply that offset to that triangle's indices for successive draws? Is there a feasible way to handle removing vertices from the buffer in this scenario (perhaps event-based shifting of index offsets in triangles)?
I apologize that these questions may be either far too novicely or too confusing/vague. I'd be happy to provide clarification. But as I've said, I'm new to this and I may not even know what I'm talking about...
I can't exactly tell what you're trying to do, but using a seperate buffer for every quad is very silly.
The golden rule in graphics programming is batch, batch, batch. This means to pack as much stuff into a single DrawUserIndexedPrimitives call as possible, your graphics card will love you for it.
In your case, put all of your verticies and indicies into one vertex buffer and index buffer (you might need to use more, I have no idea how many verticies we're talking about). Whenever the user changes one of the primatives, regenerate the entire buffer. If you really have a lot of primatives, split them up into multiple buffers and on only regenerate the ones you need when the user changes something.
The most important thing is to minimize the amount of 'DrawUserIndexedPrimitives' calls, those things have a lot of overhead, you could easily make your game on the order of 20x faster.
Graphics cards are pipelines, they like being given a big chunk of data for them to eat away at. What you're doing by giving it one triangle at a time is like forcing a large-scale car factory to only make one car at a time. Where they can't start on building the next car before the last one is finished.
Anyway good luck, and feel free to ask any questions.

DirectX world view matrix multiplications - GPU or CPU the place

I am new to directx, but have been surprised that most examples I have seen the world matrix and view matrix are multiplied as part of the vertex shader, rather than being multiplied by the CPU and the result being passed to the shader.
For rigid objects this means you multiply the same two matrices once for every single vertex of the object. I know that the GPU can do this in parallel over a number of vertices (don't really have an idea how many), but isn't this really inefficient, or am I just missing something? I am still new and clueless.
In general, you want to do it on the CPU. However, DirectX 9 has the concept of "preshaders", which means that this multiplication will be done on the CPU up-front. This has been removed for newer APIs, but it might be very well relevant for the examples you're looking at.
Moreover, modern GPUs are extremely fast when it comes to ALU operations compared to memory access. Having a modestly complex vertex shader (with a texture fetch maybe) means that the math required to do the matrix multiplication comes for free, so the authors might have not even bothered.
Anyway, the best practice is to pre-multiply everything constant on the CPU. Same applies for moving work from the pixel shaders into the vertex shaders (if something is constant across a triangle, don't compute it per-pixel.)
Well, that doesn't sound clueless to me at all, you are absolutely right!
I don't know exactly what examples you have been looking at, but in general you'd pass precalculated matrices as much as possible, that is what semantics like WORLDVIEW (and even more appropriate for simple shaders, WORLDVIEWPROJECTION) are for.
Exceptions could be cases where the shader code needs access to the separate matrices as well (but even then I'd usually pass the combined matrices as well)... or perhaps those examples where all about illustrating matrix multiplication. :-)

Resources