I have read an article that guides you through writing your own SpriteBatch and I've noticed that all the vertices and indices are destroyed and recreated each frame. Isn't that wasteful? Wouldn't it be better if all of the data was permanently stored in the SpriteBatch and a way to manipulate them was added? Could someone please tell me the cons of doing it like that?
Thanks for any opinions.
If you have a game that literally always renders the same triangles (or the like) on every frame and just needs to tweak the coordinates, then it may be worthwhile to customize your own SpriteBatch to take advantage of that. If you sometimes show/hide stuff, it might get messy and complicated. The only way to find out the possible benefit is to profile your code and find where the meaningful bottlenecks are.
Since SpriteBatch deals in structs and pre-allocated lists and arrays, there shouldn't be any wastefulness of memory allocation, so it just becomes a question of whether you can avoid re-assigning all those values, and (if this applies to you) avoid performing the z-order sort.
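To make that trade-off concrete, here is a minimal Python sketch of a batch that keeps its sprite records between frames and redoes the depth sort only when a depth actually changes. It is not XNA's actual SpriteBatch; the `RetainedSpriteBatch` and `Sprite` names are hypothetical. Moving a sprite leaves the cached order alone, which is the saving being discussed. Hiding and removing sprites are deliberately left out, and that is exactly where such a design starts to get messy.

```python
from dataclasses import dataclass


@dataclass
class Sprite:
    x: float
    y: float
    depth: float  # layer depth used to order draws (lower depth drawn first here)


class RetainedSpriteBatch:
    """Hypothetical batch that keeps sprites across frames instead of rebuilding them."""

    def __init__(self):
        self._sprites = []
        self._order = []      # indices into _sprites, sorted by depth
        self._dirty = True    # True when the cached order is stale
        self.sort_count = 0   # how many times we actually sorted (handy when profiling)

    def add(self, sprite):
        self._sprites.append(sprite)
        self._dirty = True
        return len(self._sprites) - 1  # handle the caller keeps for later updates

    def move(self, handle, x, y):
        # Position changes do not affect draw order, so the cached sort stays valid.
        s = self._sprites[handle]
        s.x, s.y = x, y

    def set_depth(self, handle, depth):
        s = self._sprites[handle]
        if s.depth != depth:
            s.depth = depth
            self._dirty = True

    def draw_order(self):
        # Re-sort only when something that affects ordering has changed.
        if self._dirty:
            self._order = sorted(range(len(self._sprites)),
                                 key=lambda i: self._sprites[i].depth)
            self._dirty = False
            self.sort_count += 1
        return self._order
```

In a scene where most sprites only move, `sort_count` stays at 1 across frames; every `set_depth` that changes a value costs one more sort on the next frame.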
I myself use a customized SpriteBatch, and based on profiling, I can see that the z-order sort step is the primary bottleneck in my SpriteBatch drawing call stack. However, my frame rates are already good enough on my wimpiest target platforms, so at least for now I have no reason to attempt further optimization there. Furthermore, my app has a bigger need to optimize its physics and to group my drawables by their independent update characteristics so as to avoid unnecessary update work.
Your results may vary. Good luck!
I have this huge model (a helix) created with 2 million vertices at once, plus several million indices specifying which vertices to use.
I am pretty sure this is a very bad way to draw so many vertices.
Where should I start in order to optimize this?
I thought about copying one turn of my helix (its vertices) and moving it along z. But in the end, I would still be drawing a lot of triangles at once...
How naive are you currently being? As per rickster's comment, there's a serious case of potential premature optimisation here: the correct way to optimise is to find the actual bottlenecks and to widen those.
Knee-jerk thoughts:
Minimise memory bandwidth. Pack your vertices into the smallest space they can fit into (i.e. limit precision where it is acceptable to do so) and make sure all the attributes that describe a single vertex are contiguously stored (i.e. the individual arrays themselves will be interleaved).
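As a rough sketch of what packing means in bytes, the Python below (standard `struct` module) interleaves a hypothetical 20-byte vertex: full-precision float positions, normals quantized to signed bytes, and texture coordinates quantized to unsigned 16-bit values. Storing the same attributes as plain floats would take 32 bytes. The layout is an assumption; which normalized integer formats your graphics API accepts for vertex attributes is something to check in its documentation.

```python
import struct

# Hypothetical compact layout, 20 bytes per vertex, all attributes interleaved:
#   position: 3 x float32 (12 bytes)  full precision where it matters
#   normal:   3 x int8    (3 bytes)   unit vector scaled to [-127, 127]
#   padding:  1 byte                  keeps the uv field 2-byte aligned
#   uv:       2 x uint16  (4 bytes)   texture coords in [0, 1] scaled to [0, 65535]
VERTEX_FORMAT = struct.Struct("<3f3bx2H")


def pack_vertex(pos, normal, uv):
    # Quantize (and clamp) the low-precision attributes.
    nx, ny, nz = (max(-127, min(127, round(c * 127))) for c in normal)
    u, v = (max(0, min(65535, round(c * 65535))) for c in uv)
    return VERTEX_FORMAT.pack(*pos, nx, ny, nz, u, v)


def pack_vertices(vertices):
    """Interleave all vertices into one contiguous bytes object (one buffer upload)."""
    return b"".join(pack_vertex(p, n, t) for p, n, t in vertices)
```

For 2 million vertices, this layout would be 40 MB instead of 64 MB for the all-float version.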
Consider breaking your model up to achieve that aim. Instanced drawing, as rickster suggests, is a good idea if the model is sufficiently repetitive. You might also consider what you can do with segments of at most 65536 vertices, since those can be addressed with 16-bit indices, which halves your index size compared with 32-bit indices.
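Here is one way the 65536-vertex split could look, as a hedged Python sketch with a hypothetical `split_for_16bit` helper. It walks the triangle list, remaps global vertex indices to per-segment local ones, and starts a new segment whenever the next triangle would push the current segment past the limit. Vertices shared across a segment boundary get duplicated, which is the price of the smaller indices.

```python
from array import array

MAX_VERTS = 65536  # 16-bit indices can address vertices 0..65535


def split_for_16bit(positions, indices, max_verts=MAX_VERTS):
    """Split an indexed triangle list into segments whose local indices fit in uint16.

    Returns a list of (segment_positions, segment_indices) pairs, where
    segment_indices is an array('H') of unsigned 16-bit values.
    """
    segments = []
    remap = {}           # global vertex index -> local index in the current segment
    seg_pos = []
    seg_idx = array("H")

    for t in range(0, len(indices), 3):
        tri = indices[t:t + 3]
        new = sum(1 for i in set(tri) if i not in remap)
        if len(seg_pos) + new > max_verts:
            # Current segment is full: close it and start a fresh one.
            segments.append((seg_pos, seg_idx))
            remap, seg_pos, seg_idx = {}, [], array("H")
        for i in tri:
            if i not in remap:
                remap[i] = len(seg_pos)
                seg_pos.append(positions[i])
            seg_idx.append(remap[i])
    if seg_idx:
        segments.append((seg_pos, seg_idx))
    return segments
```

Each segment then becomes its own vertex buffer/index buffer pair and its own draw call.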
Use triangle strips if it allows you to specify the geometry in substantially fewer indices, even if you have to add degenerate triangles.
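Joining several strips into one with degenerate triangles can be sketched like this (Python, hypothetical `join_strips` helper). The only subtle part is parity: in a strip, every other triangle has its winding flipped, so each new strip must start at an even position to keep its original facing.

```python
def join_strips(strips):
    """Concatenate triangle strips into one strip using degenerate triangles.

    Between two strips we repeat the last index of the previous strip and the
    first index of the next one; the zero-area triangles this creates are
    discarded by the GPU. If the output so far has odd length, one extra
    duplicate keeps the next strip starting at an even position, so its
    winding order is preserved.
    """
    out = []
    for strip in strips:
        if not strip:
            continue
        if out:
            if len(out) % 2 == 1:
                out.append(out[-1])  # parity fix for winding
            out.append(out[-1])      # repeat end of previous strip
            out.append(strip[0])     # repeat start of next strip
        out.extend(strip)
    return out
```

For a helix, each turn (or each row of each turn) could be one strip, and the joined result is drawn with a single strip draw call.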
Consider where the camera will be. Do you really need that level of detail all the way around? Will the whole thing ever even be on screen? If not, consider level-of-detail solutions and subdivision for culling (both outside the viewport and, via occlusion queries, within it).
Given one texture sheet, is it better to have one or multiple CCSpriteBatchNodes? Or does this not affect at all the GPU cost of processing the non-visible CCSprite quads?
I am thinking about performance and referring to this question and the answer I got. Basically, it suggests that I should use more than one CCSpriteBatchNode even if I have only one file. I don't understand whether the sentence "Too many batched sprites still affects performance negatively even if they are not visible or outside the screen" also applies when having two CCSpriteBatchNodes instead of one. In other words, does that sentence refer to this: "The GPU is responsible for cancelling draws of quads that are not visible due to being entirely outside the screen. It still needs to process those quads."? And if so, it would mean that it doesn't really matter how many CCSpriteBatchNode instances I have using the same texture sheet, right?
How can I optimize this? I mean, how can I avoid the GPU having to process the non visible quads?
Would you be able to answer at least the questions in bold?
First case: too many nodes (or sprites) in the scene, many of them outside the screen/visible area. In this case, for each sprite the GPU has to check whether it is outside the visible area or not. Too many sprite nodes means too much load on the GPU.
Adding more CCSpriteBatchNodes should not affect performance, because the sprite-sheet bitmap is loaded into GPU memory and the application keeps an array of coordinates for drawing individual sprites. So whether you put 2 images in 2 different CCSpriteBatchNodes or 2 images in 1, it will be the same for both CPU and GPU.
How to optimize?
The best way would be to remove the invisible nodes/sprites from the parent. But it depends on your application.
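A minimal sketch of doing that culling yourself on the CPU, so off-screen quads never reach the batch at all. This is plain Python with a hypothetical `Rect` type, not cocos2d API; in cocos2d you would act on the result by hiding or removing the corresponding nodes, and whether that pays off depends on your scene.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float  # left edge
    y: float  # bottom edge
    w: float
    h: float

    def intersects(self, other):
        # Axis-aligned overlap test; touching edges do not count as overlap.
        return (self.x < other.x + other.w and other.x < self.x + self.w and
                self.y < other.y + other.h and other.y < self.y + self.h)


def visible_sprites(sprites, viewport):
    """Keep only sprites whose bounding box overlaps the viewport.

    Sprites filtered out here are never submitted, so the GPU never has to
    process their quads.
    """
    return [s for s in sprites if s.intersects(viewport)]
```

Sprites that are only partly on screen are kept; the GPU clips those as usual.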
FPS drops are mainly caused by two things:
fillrate - when a lot of sprites overlap each other (especially if a high-res texture is rendered into a small sprite)
redundant state changes - in this case the heaviest are shader and texture switches
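Since state changes are one of the two causes, a common mitigation (a general technique, not something cocos2d necessarily does for you) is to order draws so that identical shader/texture combinations are adjacent. The Python sketch below, with a hypothetical `DrawCmd` record, counts switches before and after such a sort. Reordering is only safe when draw order does not matter visually, e.g. for sprites that don't overlap.

```python
from collections import namedtuple

# Hypothetical draw command: which shader and texture it needs, plus a sprite id.
DrawCmd = namedtuple("DrawCmd", "shader texture sprite")


def count_state_changes(cmds):
    """Count how many times the (shader, texture) state has to be set."""
    changes, current = 0, None
    for c in cmds:
        state = (c.shader, c.texture)
        if state != current:
            changes += 1
            current = state
    return changes


def sort_by_state(cmds):
    # Group by shader first (usually the costlier switch), then by texture.
    # sorted() is stable, so draws sharing a state keep their relative order.
    return sorted(cmds, key=lambda c: (c.shader, c.texture))
```

With alternating commands such as sprite/atlasA, glow/atlasB, sprite/atlasA, glow/atlasB, the state is set 4 times; after sorting it is set only twice.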
You can render sprites that are outside of the screen in a single batch, and this doesn't drop performance significantly. Note that rendering a sprite with zero opacity (or a transparent texture) takes the same time as rendering a non-transparent sprite.
First of all, this really sounds like a case of premature optimization. Do a test with the number of sprites you expect to be on screen, and some added, others removed. Do you get 60 fps on the oldest supported device? If yes, good, no need to optimize. If no, tweak the code design to see what actually makes a difference.
I mean, how can I avoid the GPU having to process the non visible quads?
You can't, unless you're going to rewrite how cocos2d handles drawing of sprites/batched sprites.
it doesn't really matter how many CCSpriteBatchNode instances I have using the same texture sheet, right?
Each additional sprite batch node adds a draw call. Given how many sprites they can batch into a single draw call, the benefit far outweighs the drawbacks. Whether you have one, two or three sprite batch nodes makes absolutely no difference.
In 3D terrain that consists of thousands of cubes (e.g. Minecraft), what is a way to handle each block in terms of location and rendering? More specifically, I know that drawing a cube primitive and world-transforming it everywhere in DirectX 9 is probably a ridiculous way to accomplish this, given the performance issues, so I was wondering what a more reasonable method would be.
Should each cube be a mesh that's copied many times, or is there a way to create the appropriate meshes from the data in my vertex buffer?
I found this article that walks through some of the theory behind implementing what I want to implement, but I've never used octrees before so I wasn't able to take too much from the source code. If octrees are indeed the way to go, where is a good starting point to learn about them? Most of my google searches only turned up blog posts about theory with little or no implementation examples.
It seems like using voxels would be useful in doing this, but like with octrees, I'm coming from no experience here, so I don't really know what to study first.
Anyway, thanks for any advice/resources/book names you can spare. I'm sure it's obvious, but I'm still very new to 3D programming, so I appreciate your help.
First off, if you're using Minecraft as your reference, think of its use of chunks and relate that to oct-trees. Minecraft divides its world into smaller chunks to handle the massive amount of information that needs to be stored, so use oct-trees to organize that data. Goz has a very accurate description of how oct-trees and quad-trees work, so use his information as a reference.
Another thing to consider is that you don't actually want to draw every cube to the screen, as this will eat up your frame rate. Use object culling to draw only the visible cubes. Again, think of Minecraft: have you ever encountered a glitch where you can see through the blocks and under the world? This is because Minecraft only draws the top layer of blocks. With this many objects on screen, it would be a worthwhile investment to look into object culling using both the camera frustum and occlusion queries.
For information on using DirectX, I would recommend any book by Frank Luna. I own this one myself and it never leaves my side when programming in DirectX: http://www.amazon.com/Introduction-Game-Programming-Direct-9-0c/dp/1598220160/ref=sr_1_3?ie=UTF8&qid=1332478780&sr=8-3
I highly recommend this book as I've learned almost everything I know about DirectX from it.
A Google search turned up this link discussing occlusion culling, which Luna doesn't cover (he covers only frustum culling). I hear the Programming Gems series mentioned a lot, but I can't vouch for it personally. http://http.developer.nvidia.com/GPUGems/gpugems_ch29.html
Hope this helps.
Oct-trees are fairly simple, especially axis-aligned ones like those in Minecraft.
It is basically just a 3D extension of the quad-tree. You may find it easier to learn about Quad-trees first.
To give you a quick overview of a quad-tree: you start off with a square. Now imagine placing a much smaller square inside it. To build a quad-tree representing this, you first divide the original square into 4 equal-sized squares.
Next, you check each quadrant, and if the smaller square is in that quadrant, you split that quadrant into 4 smaller squares. Then you check those 4 quadrants, choose the one containing the small square, and subdivide again. Eventually your smaller square will be wholly contained in one or more quadrants inside quadrants inside quadrants (etc.). You have now built your quad-tree.
Now, if you imagine searching for a specific square inside the larger square, you can quickly see the benefit of a quad-tree. Instead of searching every possible square in the quad-tree (equivalent to searching every pixel in a texture), you can check the first 4 quadrants to see which one contains it. If one does, you check its 4 sub-quadrants, and so on, until you find the smallest quadrant wholly containing your square (or pixel). This way you end up doing far fewer tests to find your object.
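The search idea above maps onto a small data structure. This is a minimal point quad-tree sketch in Python (class and method names are illustrative, not from any library): each node covers a square, splits into 4 quadrants once it holds more than `capacity` points, and a rectangle query skips every quadrant that doesn't overlap the query, which is where the savings come from.

```python
class QuadTree:
    """Minimal point quad-tree over the square [x, x+size) x [y, y+size).

    Assumes no more than `capacity` points share exactly the same coordinates
    (otherwise splitting would never separate them).
    """

    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size = x, y, size  # lower-left corner and side length
        self.capacity = capacity
        self.points = []
        self.children = None                    # 4 sub-quadrants after a split

    def _contains(self, px, py):
        return self.x <= px < self.x + self.size and self.y <= py < self.y + self.size

    def insert(self, px, py):
        if not self._contains(px, py):
            return False
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((px, py))
                return True
            self._split()
        return any(child.insert(px, py) for child in self.children)

    def _split(self):
        half = self.size / 2
        self.children = [QuadTree(self.x + dx, self.y + dy, half, self.capacity)
                         for dx in (0, half) for dy in (0, half)]
        # Push existing points down into the quadrants that contain them.
        for p in self.points:
            any(child.insert(*p) for child in self.children)
        self.points = []

    def query(self, qx, qy, qw, qh, found=None):
        """Collect points inside the rectangle [qx, qx+qw) x [qy, qy+qh)."""
        if found is None:
            found = []
        # Skip this quadrant, and everything below it, if it misses the query.
        if (qx >= self.x + self.size or qx + qw <= self.x or
                qy >= self.y + self.size or qy + qh <= self.y):
            return found
        for px, py in self.points:
            if qx <= px < qx + qw and qy <= py < qy + qh:
                found.append((px, py))
        if self.children:
            for child in self.children:
                child.query(qx, qy, qw, qh, found)
        return found
```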
Now an oct-tree is basically the same thing but instead of encoding squares in squares you now encode cubes in cubes. Every cube can be split into 8 smaller octants (and hence the name oct-tree).
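To see the "8 octants" numbering concretely: one common convention (an assumption here, not the only one) gives each axis one bit, so a child index is 0..7. The Python sketch below computes that index and the chain of octants you descend through to reach one voxel in a cube of side `2**depth`.

```python
def octant_index(px, py, pz, cx, cy, cz):
    """Which of the 8 children of a cube centred at (cx, cy, cz) contains the point.

    Bit 0 is set for the +x half, bit 1 for +y, bit 2 for +z, giving 0..7.
    """
    return int(px >= cx) | (int(py >= cy) << 1) | (int(pz >= cz) << 2)


def octree_path(x, y, z, depth):
    """Octant indices from the root down to integer voxel (x, y, z) in a cube
    of side 2**depth whose corner is at the origin."""
    path = []
    half = 2 ** depth // 2
    ox = oy = oz = 0  # corner of the current node
    while half >= 1:
        i = octant_index(x, y, z, ox + half, oy + half, oz + half)
        path.append(i)
        # Move the corner into the chosen child.
        ox += half * (i & 1)
        oy += half * ((i >> 1) & 1)
        oz += half * ((i >> 2) & 1)
        half //= 2
    return path
```

For example, voxel (5, 2, 0) in an 8x8x8 cube is reached through octants 1, 2, 1: one level per halving of the cube, i.e. 3 tests instead of scanning 512 cells.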
Oct-trees have the advantage that, knowing which octant you start in, you can easily cast rays through the oct-tree to find collisions (as an octant is either full, partially full, or empty). If an octant is empty, you pass right through it and check the octant on the other side. If it is partially full, you check its sub-octants, and so on, until you either find a full octant (i.e. you've hit a solid cube and you render it) or pass through the octant entirely, in which case there is no cube to render. This is how Minecraft works (I'm guessing, anyway ;)). It is also a good way of quickly rendering voxel data, which more people are looking into these days as a possible future rendering mechanism.
Hope that's some help! :)
Oct-trees and quad-trees are useful for culling sections of your geometry to render. Minecraft uses 16x16x16 render blocks to break up the terrain into manageable pieces.
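For a chunked world like that, the first step is mapping a block's world coordinates to its chunk and its position inside the chunk. Here is a minimal Python sketch, assuming 16x16x16 chunks stored sparsely in a dict. The `World` class and the dirty-set idea are illustrative, not Minecraft's actual code: only chunks marked dirty would need their meshes rebuilt.

```python
CHUNK = 16  # chunk edge length in blocks


def to_chunk(x, y, z):
    """Split integer world block coordinates into (chunk coords, local coords).

    Python's // and % round toward negative infinity, so block -1 lands in
    chunk -1 at local index 15 rather than chunk 0 (a common bug in languages
    where integer division truncates toward zero).
    """
    chunk = (x // CHUNK, y // CHUNK, z // CHUNK)
    local = (x % CHUNK, y % CHUNK, z % CHUNK)
    return chunk, local


class World:
    def __init__(self):
        self.chunks = {}    # chunk coords -> {local coords -> block id}
        self.dirty = set()  # chunks whose render mesh must be rebuilt

    def set_block(self, x, y, z, block):
        c, local = to_chunk(x, y, z)
        self.chunks.setdefault(c, {})[local] = block
        self.dirty.add(c)
```

Each chunk then becomes one mesh (one draw call), and the oct-tree or quad-tree only has to cull whole chunks rather than individual blocks.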
Another technique to consider is instancing. Instancing is where you tell the GPU to render an object multiple times in different locations. It's used for crowd rendering, trees, anything where the geometry is the same, but you have lots of them.
http://msdn.microsoft.com/en-us/library/windows/desktop/bb173349(v=vs.85).aspx
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter03.html
Here is an article where the writer duplicates the Minecraft renderer in OpenGL 4. While the code won't apply to your case, the techniques (culling cubes that are surrounded, etc.) can be applied to a DirectX renderer.
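The "culling cubes that are surrounded" idea is usually applied per face: a face shared by two solid blocks can never be seen, so it never needs to be sent to the GPU. Here is a minimal Python sketch, assuming the world is given as a set of solid block positions (a simplification; a real renderer would work per chunk and emit vertices rather than face labels).

```python
# The six axis-aligned neighbour directions of a cube.
FACES = {
    "+x": (1, 0, 0), "-x": (-1, 0, 0),
    "+y": (0, 1, 0), "-y": (0, -1, 0),
    "+z": (0, 0, 1), "-z": (0, 0, -1),
}


def visible_faces(solid):
    """Return (block, face) pairs for faces that touch empty space.

    `solid` is a set of (x, y, z) positions of filled blocks. A face shared by
    two solid blocks is skipped; a block buried on all six sides contributes
    nothing at all.
    """
    out = []
    for (x, y, z) in solid:
        for name, (dx, dy, dz) in FACES.items():
            if (x + dx, y + dy, z + dz) not in solid:
                out.append(((x, y, z), name))
    return out
```

For a solid 3x3x3 cube, this emits 54 of the 162 possible faces, and the centre block emits none.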
http://codeflow.org/entries/2010/dec/09/minecraft-like-rendering-experiments-in-opengl-4/
Don't be fooled by the blocky graphics and the low quality textures. Minecraft is an extremely complex renderer and you'll need to come up with ways to handle the sheer number of items involved. For example even a "small" part of the world, say 100x100x100 blocks is 1 million blocks. To push each block to the GPU as a separate mesh would kill your GPU. The Minecraft renderer is far more complex than most first person shooters when you get down to the technology.
...or am I insane to even try?
As a novice at using bare vertices for 3D graphics, I have never worked with vertex buffers and the like. I am guessing that I should use a dynamic buffer because my game deals with manipulating, adding and deleting primitives. But how would I go about doing that?
So far I have stored my indices in a Triangle.cs class. Triangles are stored in Quads (which contain the vertices that correspond to their indices), quads are stored in blocks. In my draw method, I iterate through each block, each quad in each block, and finally each triangle, apply the appropriate texture to my effect, then call DrawUserIndexedPrimitives to draw the vertices stored in the triangle.
I'd like to use a vertex buffer because this method cannot support the scale I am going for. I am assuming it to be dynamic. Since my vertices and indices are stored in a collection of separate classes, though, can I still effectively use a buffer? Is using separate buffers for each quad silly (I'm guessing it is)? Is it feasible and effective for me to dump vertices into the buffer the first time a quad is drawn and then store where those vertices were so that I can apply that offset to that triangle's indices for successive draws? Is there a feasible way to handle removing vertices from the buffer in this scenario (perhaps event-based shifting of index offsets in triangles)?
I apologize if these questions are far too basic or too confusing/vague. I'd be happy to provide clarification. But as I've said, I'm new to this and I may not even know what I'm talking about...
I can't tell exactly what you're trying to do, but using a separate buffer for every quad is very silly.
The golden rule in graphics programming is batch, batch, batch. This means to pack as much stuff into a single DrawUserIndexedPrimitives call as possible, your graphics card will love you for it.
In your case, put all of your vertices and indices into one vertex buffer and one index buffer (you might need more; I have no idea how many vertices we're talking about). Whenever the user changes one of the primitives, regenerate the entire buffer. If you really have a lot of primitives, split them up into multiple buffers and only regenerate the ones affected when the user changes something.
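Here is a minimal Python sketch of that advice (all names hypothetical). `build_batch` packs many quads into one vertex list and one index list using a per-quad base offset, which also covers the question about applying offsets to indices. `ChunkedBatches` keeps one such batch per group so that only groups the user changed are regenerated. In XNA you would then copy the resulting arrays into a vertex buffer and an index buffer (e.g. with `SetData`) and draw each group with one call.

```python
def build_batch(quads):
    """Pack many quads into one vertex list and one index list.

    Each quad is 4 vertices given in order around the quad. Its two triangles
    reference those vertices through a base offset, so the whole list can be
    drawn with a single indexed draw call.
    """
    vertices, indices = [], []
    for quad in quads:
        base = len(vertices)  # where this quad's vertices start
        vertices.extend(quad)
        indices.extend((base, base + 1, base + 2,   # first triangle
                        base, base + 2, base + 3))  # second triangle
    return vertices, indices


class ChunkedBatches:
    """One batch per group of quads; only groups marked dirty are rebuilt."""

    def __init__(self):
        self.groups = {}    # group key -> list of quads
        self.batches = {}   # group key -> (vertices, indices)
        self.dirty = set()  # group keys changed since the last update()

    def set_quads(self, key, quads):
        self.groups[key] = list(quads)
        self.dirty.add(key)

    def remove(self, key):
        self.groups.pop(key, None)
        self.batches.pop(key, None)
        self.dirty.discard(key)

    def update(self):
        """Regenerate dirty groups; returns how many were rebuilt."""
        for key in self.dirty:
            self.batches[key] = build_batch(self.groups[key])
        rebuilt = len(self.dirty)
        self.dirty.clear()
        return rebuilt
```

Rebuilding a whole group is simpler than tracking holes left by deleted quads, and with reasonably sized groups it stays cheap.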
The most important thing is to minimize the number of DrawUserIndexedPrimitives calls; each one has a lot of overhead, and cutting them down could easily make your game something like 20x faster.
Graphics cards are pipelines: they like being given a big chunk of data to eat away at. Giving the card one triangle at a time is like forcing a large-scale car factory to make only one car at a time, unable to start on the next car until the last one is finished.
Anyway good luck, and feel free to ask any questions.
I have been experimenting with iOS drawing. As a practical exercise, I wrote a BarChart component. I would show the class diagram, but I wasn't allowed to upload images, so let me describe it in words. I have an NGBarChartView, which inherits from UIView and has 2 protocols, NGBarChartViewDataSource and NGBarChartViewDelegate. The code is at https://github.com/mraghuram/NELGPieChart/blob/master/NELGPieChart/NGBarChartView.m
To draw the bar chart, I created each bar as a separate CAShapeLayer. The reason is twofold: first, I can just create a UIBezierPath and attach it to a CAShapeLayer object; second, I can easily track whether a barItem is touched by using the [Layer hitTest] method. The component works pretty well. However, I am not comfortable with the approach I have taken to draw the bars, hence this note. I need expert opinion on the following:
By using CAShapeLayer and creating BarItems, I am not really using the UIGraphicsContext. Is this a good design?
My approach creates several CALayers inside a UIView. Is there a performance-based limit to the number of CALayers you can create in a UIView?
If a good alternative is to use the CGContext* methods, what's the right way to identify whether a particular path has been touched?
From an animation point of view, such as making a bar blink when you tap on it, is the layer design or the CGContext design better?
Help is very much appreciated. BTW, you are free to look at my code and comment. I will gladly accept any suggestions to improve.
Best,
Murali
IMO, drawing shapes of any kind generally needs heavy processing power, while compositing a cached bitmap on the GPU is much cheaper than drawing everything again. So in many cases we cache all drawings into a bitmap, and in iOS, CALayer is in charge of that.
However, if your bitmaps exceed the video memory limit, Quartz cannot composite all layers at once. Consequently, Quartz has to draw a single frame in multiple passes, which requires reloading some textures into the GPU. This can hurt performance. I am not sure how much this applies, because the iPhone's video memory is known to be shared with system RAM, but even in that case it still means more work. If even system memory becomes insufficient, the system can purge existing bitmaps and ask you to redraw them later.
CAShapeLayer does all of the CGContext work (I believe that's what you meant) for you. You can do it yourself if you feel the need for lower-level optimization.
Yes. Obviously, everything has a performance limit. If you're using hundreds of layers with large alpha-blended graphics, it'll cause performance problems. Generally, though, it doesn't happen, because layer composition is GPU-accelerated. If your graph lines are not too many and are basically opaque, you'll be fine.
All you have to know is that once graphics are composited, there is no way to decompose them back, because composition itself is a sort of lossy optimization. So you have only two options: (1) redraw all graphics whenever a change is required, or (2) make a cached bitmap of each display element (like a graph line) and just composite them as needed. The latter is exactly what CALayers do.
The layer-based approach is absolutely far better. Any kind of free-form shape drawing (even when done on the GPU) needs a lot more processing power than simple bitmap composition on the GPU (which becomes two textured triangles). Of course, that holds only as long as your layers don't exceed the video memory limit.
I hope this helps.