When we draw 300 sprites on the iPad using OpenGL ES 2.0 with glEnable(GL_BLEND) (which we need because the sprites need transparency and alpha blending) we get a framerate of around 40. But when we disable the blending we get a framerate of 60.
Now is (alpha) blending really that costly or are we doing something wrong?
Thanks for your time,
Richard.
Alpha blending really IS that costly. Without blending, you can solve a lot of overdraw issues by using Z-buffering (something the PowerVR is very good at). That can save a tonne of memory bandwidth, because hidden pixels are never written to the Z-buffer or the draw buffer.
The moment you start alpha blending, you need to read from the frame buffer, blend, and then write back (Read-Modify-Write or RMW). Effectively, if you have 10 overlapping sprites then they need to be drawn 10 times, whereas with a Z-buffered (again ish .. the PowerVR is odd on that front) un-blended system you only actually need to draw ONE pixel. So in that edge case you need only a tenth of the WRITE bandwidth simply by not alpha blending. And don't forget that alpha blending also needs the read from the frame buffer on top of that.
As I alluded to, it does get more complicated when Z-buffering is involved, mainly because a Z-buffer requires a read, a compare and potentially a write. But with Z-buffering you can discard pixels much earlier in the pipeline, meaning you can save even more processing time. Couple that with the tiling system the PowerVR still, I believe, uses, and your main concern is the RMW bandwidth lost to alpha blending.
It is really that costly. It requires a read-modify-write for each pixel. The modify operation is of the form
fragment * alpha + previous * (1 - alpha)
Count the number of pixels you do that on, and you'll soon realize you do a lot of math (and require more memory bandwidth). It all depends on how big your sprites are, but it is not surprising to slow down heavily when you have a lot of overdraw.
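To make the counting concrete, here is a minimal Python sketch of the blend formula above together with a very crude access count. The function names and the model are my own assumptions for illustration: it ignores caches, early depth rejection details and tile memory (on a tile-based GPU much of this traffic may stay on chip), so treat the numbers as relative, not as measurements.

```python
def blend(fragment, previous, alpha):
    """The 'modify' step for one colour channel, with values in 0..1."""
    return fragment * alpha + previous * (1.0 - alpha)


def buffer_accesses(pixels_per_sprite, layers, blended):
    """Return (reads, writes) of the colour buffer for a stack of sprites.

    Assumption: with blending every layer touches every covered pixel;
    without blending, ideal hidden-surface removal writes each pixel once.
    """
    if blended:
        reads = pixels_per_sprite * layers
        writes = pixels_per_sprite * layers
    else:
        reads = 0
        writes = pixels_per_sprite
    return reads, writes


# Ten fully overlapping 64x64 sprites, blended versus opaque.
blended_cost = buffer_accesses(64 * 64, 10, blended=True)
opaque_cost = buffer_accesses(64 * 64, 10, blended=False)
```

Under these assumptions the blended stack performs ten times the writes of the opaque one, plus the same number of reads again.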
I have a 2D game in 4 directions, and I'm having problems with FPS (or GPU) because I have to draw a lot of textures.
I've read a lot about techniques to optimize performance, but I don't know what I can do anymore.
The main problem is that on some occasions I have about 200 creatures, and for each one I have to draw its body (a single sprite) and also draw spells and other animations on top of it. I think that is where the trouble starts, because the loop that draws each creature must switch textures for each one, that is body > animation1 > animation2 > animation3, and it does this for about 200 creatures per frame at 60 fps. This lowers the fps to about 40-50.
Any suggestions?
This is how it looks:
The issue is probably - as you have already suggested - the constant switching between different textures. This is much slower than drawing the same number of sprites with the same texture.
To change this, consider putting all your textures into a single big texture. You then always draw that texture. Drawing the whole thing would obviously look quite wrong, so you also have to tell XNA which part of the texture you want to draw. For that, you can use the sourceRectangle parameter that can be passed to SpriteBatch.Draw(...). That way, you always render the same texture but can still have different images on screen.
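As a rough illustration of the arithmetic involved, the sketch below computes which part of the big texture to draw. It is written in Python rather than C#, and it assumes a hypothetical atlas laid out as a uniform grid of equally sized tiles; the function name and parameters are made up for this example. The returned tuple corresponds to the x, y, width and height you would put into the source rectangle.

```python
def source_rect(index, tile_width, tile_height, columns):
    """Return (x, y, width, height) of tile `index` in a grid atlas.

    Assumption: tiles are numbered left to right, top to bottom,
    starting at 0, and all tiles have the same size.
    """
    column = index % columns
    row = index // columns
    return (column * tile_width, row * tile_height, tile_width, tile_height)


# Tile 5 in an atlas that is 4 tiles wide, with 32x32 pixel tiles.
rect = source_rect(5, 32, 32, 4)
```

Real atlases produced by packing tools are usually not uniform grids; in that case the rectangles come from the tool's data file instead of being computed.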
See also this answer about texture atlases for more details.
I've tried to make an "overlay" effect in a 3D scene. After drawing stuff to the buffer, I tried to draw a full screen quad with blending enabled and the depth test disabled. On some Android devices this seems to have caused a slowdown.
I found this link:
The particularly slow point is the point where the drawing of a pixel needs to check what the color behind it was.
So instead of drawing a single full screen quad, I divided it up into tiles and rendered them with multiple draw calls, which seems to have given some gain.
What may be happening here, and how can this be profiled with WebGL, i.e. how does one come to the conclusion in the quote above?
I guess that to profile it, you simply have to test with several blending functions, with and without blending enabled, etc.
Blending is not a trivial operation, and we can indeed assume that blending functions which need to read pixels from the buffer can cause a performance loss, like all "read" operations in OpenGL, because this can stall the pipeline. I guess most modern desktop GPUs have some specific design to optimize this, but on mobile phones it may be more problematic.
Anyway, if you are about to draw a full screen quad, why don't you render your quad directly using two source textures, which you blend in the fragment shader using a custom equation? This way, you don't need to use blending and you avoid any back buffer reading problem.
I'm trying to find the most efficient way of handling multi-texturing in OpenGL ES2 on iOS. By 'efficient' I mean the fastest rendering even on older iOS devices (iPhone 4 and up) - but also balancing convenience.
I've considered (and tried) several different methods. But have run into a couple of problems and questions.
Method 1 - My base and normal values are rgb with NO ALPHA. For these objects I don't need transparency. My emission and specular information are each only one channel. To reduce texture2D() calls I figured I could store the emission as the alpha channel of the base, and the specular as the alpha of the normal. With each being in their own file it would look like this:
My problem so far has been finding a file format that will support a full non-premultiplied alpha channel. PNG just hasn't worked for me. Every way that I've tried to save this as a PNG premultiplies the .alpha with the .rgb on file save (via Photoshop), basically destroying the .rgb. Any pixel with a 0.0 alpha has a black rgb when I reload the file. I posted that question here with no activity.
I know this method would yield faster renders if I could work out a way to save and load this independent 4th channel. But so far I haven't been able to and had to move on.
Method 2 - When that didn't work I moved on to a single 4-way texture where each quadrant has a different map. This doesn't reduce texture2D() calls but it reduces the number of textures that are being accessed within the shader.
The 4-way texture does require that I modify the texture coordinates within the shader. For model flexibility I leave the texcoords as is in the model's structure and modify them in the shader like so:
v_fragmentTexCoord0 = a_vertexTexCoord0 * 0.5;
v_fragmentTexCoord1 = v_fragmentTexCoord0 + vec2(0.0, 0.5); // illumination frag is up half
v_fragmentTexCoord2 = v_fragmentTexCoord0 + vec2(0.5, 0.5); // shininess frag is up and over
v_fragmentTexCoord3 = v_fragmentTexCoord0 + vec2(0.5, 0.0); // normal frag is over half
To avoid dynamic texture lookups (Thanks Brad Larson) I moved these offsets to the vertex shader and keep them out of the fragment shader.
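For readers who want to check where each coordinate ends up, here is the same remapping as a small Python sketch. It only mirrors the four GLSL lines above; the function name is made up for this example.

```python
def quadrant_texcoords(u, v):
    """Mirror of the vertex shader lines above for one (u, v) pair."""
    base = (u * 0.5, v * 0.5)                      # v_fragmentTexCoord0
    illumination = (base[0], base[1] + 0.5)        # v_fragmentTexCoord1, up half
    shininess = (base[0] + 0.5, base[1] + 0.5)     # v_fragmentTexCoord2, up and over
    normal = (base[0] + 0.5, base[1])              # v_fragmentTexCoord3, over half
    return base, illumination, shininess, normal


# The far corner of the model's UV space lands exactly on the seams.
corner = quadrant_texcoords(1.0, 1.0)
```

Note that a model texcoord of 1.0 maps to 0.5 in the base quadrant, which is exactly the border with the neighbouring maps.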
But my question here is: Does reducing the number of texture samplers used in a shader matter? Or would I be better off using 4 different smaller textures here?
The one problem I did have with this was bleed over between the different maps. A texcoord of 1.0 was averaging in some of the blue normal pixels due to linear texture filtering. This added a blue edge on the object near the seam. To avoid it I had to change my UV mapping to not get too close to the edge. And that's a pain to do with very many objects.
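One common alternative to editing the UV maps, which is not what I did above, is to keep every sample at least half a texel away from the quadrant border. The Python sketch below shows that arithmetic for one axis. The function name, the 256-texel quadrant size and the 512-texel atlas are assumptions for illustration, and this only helps with plain bilinear filtering; mipmaps would need wider padding.

```python
def inset_quadrant_coord(t, quadrant_texels, origin=0.0, scale=0.5):
    """Map t in [0, 1] into a quadrant, staying half a texel inside it.

    quadrant_texels: size of one quadrant along this axis, in texels.
    origin, scale:   where the quadrant sits in the full texture (0..1).
    """
    half_texel = 0.5 / quadrant_texels            # in quadrant-local units
    t = min(max(t, half_texel), 1.0 - half_texel)
    return origin + t * scale


# With 256-texel quadrants in a 512-texel atlas, t = 1.0 now lands on the
# centre of the quadrant's last texel instead of on the seam.
edge = inset_quadrant_coord(1.0, 256) * 512
```

Sampling at a texel centre means the bilinear filter gives full weight to that texel, so the neighbouring map no longer contributes.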
Method 3 would be to combine methods 1 and 2 and have the base.rgb + emission.a on one side and normal.rgb + specular.a on the other. But again I still have this problem getting an independent alpha to save in a file.
Maybe I could save them as two files but combine them during loading before sending them over to OpenGL. I'll have to try that.
Method 4 - Finally, in a 3D world, if I have 20 different panel textures for walls, should these be individual files or all packed in a single texture atlas? I recently noticed that at some point Minecraft moved from an atlas to individual textures - albeit they are 16x16 each.
With a single model and by modifying the texture coordinates (which I'm already doing in method 2 and 3 above), you can easily send an offset to the shader to select a particular map in an atlas:
v_fragmentTexCoord0 = u_texOffset + a_vertexTexCoord0 * u_texScale;
This offers a lot of flexibility and reduces the number of texture bindings. It's basically how I'm doing it in my game now. But IS IT faster to access a small portion of a larger texture and have the above math in the vertex shader? Or is it faster to repeatedly bind smaller textures over and over? Especially if you're not sorting objects by texture.
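To show what values the two uniforms would take, here is a small Python sketch. It assumes the atlas is a uniform grid of maps, which is a simplification; the function names are made up, and only u_texOffset and u_texScale come from the shader line above.

```python
def atlas_transform(tile_index, columns, rows):
    """Return (offset, scale) for one tile of a columns x rows atlas.

    These are the values that would be uploaded as u_texOffset and
    u_texScale. Tiles are numbered row by row, starting at 0.
    """
    scale = (1.0 / columns, 1.0 / rows)
    column = tile_index % columns
    row = tile_index // columns
    offset = (column * scale[0], row * scale[1])
    return offset, scale


def transform_texcoord(uv, offset, scale):
    """Same arithmetic as the vertex shader line above."""
    return (offset[0] + uv[0] * scale[0], offset[1] + uv[1] * scale[1])


# Select tile 6 of a 4x4 atlas and map the model's (1, 1) corner into it.
offset, scale = atlas_transform(6, 4, 4)
corner = transform_texcoord((1.0, 1.0), offset, scale)
```

Because only two small uniforms change per object, the same model and the same bound texture can be reused for every panel.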
I know this is a lot. But the main question here is: what's the most efficient method considering speed + convenience? Will method 4 be faster for multiple textures, or would multiple rebinds be faster? Or is there some other way that I'm overlooking? I see all these 3D games with a lot of graphics and area coverage. How do they keep frame rates up, especially on older devices like the iPhone 4?
**** UPDATE ****
Since I've suddenly had 2 answers in the last few days I'll say this. Basically I did find the answer. Or AN answer. The question is which method is more efficient? Meaning which method will result in the best frame rates. I've tried the various methods above and on the iPhone 5 they're all just about as fast. The iPhone 5/5S has an extremely fast GPU. Where it matters is on older devices like the iPhone 4/4S, or on larger devices like a retina iPad. My tests were not scientific and I don't have ms timings to report. But 4 texture2D() calls to 4 RGBA textures were actually just as fast or maybe even faster than 4 texture2D() calls to a single texture with offsets. And of course I do those offset calculations in the vertex shader and not the fragment shader (never in the fragment shader).
So maybe someday I'll do the tests and make a grid with some numbers to report. But I don't have time to do that right now and write a proper answer myself. And I can't really checkmark any other answer that isn't answering the question, because that's not how SO works.
But thanks to the people who have answered. And check out this other question of mine that also answered some of this one: Load an RGBA image from two jpegs on iOS - OpenGL ES 2.0
Have a post-process step in your content pipeline where you merge your rgb texture with your alpha texture and store the result in a .ktx file, either when you package the game or as a post-build event when you compile.
KTX is a fairly trivial format, and it would be simple to write a command-line tool that loads 2 PNGs and merges them into one KTX file, rgb + alpha.
Some benefits of doing that are:
- Less CPU overhead when loading the file at game start-up, so the game starts quicker.
- Some GPUs do not natively support the 24-bit RGB format, which forces the driver to convert it internally to 32-bit RGBA. This adds time to the loading stage and increases temporary memory usage.
Once you have the data in a texture object, you want to minimize texture sampling, as it means a lot of GPU operations and memory accesses depending on the filtering mode.
I would recommend having 2 textures with 2 layers each. If you put all of them into the same texture, there are potential artifacts when you sample with bilinear filtering or mipmaps, because a sample close to the edge where one texture layer ends and the next begins may include neighbouring pixels from the other layer. The same applies if you decide to have mipmaps generated.
As an extra improvement I would recommend not storing raw 32-bit RGBA data in the KTX file, but compressing it into a DXT or PVRTC format. This uses much less memory, which means faster loading times and fewer memory transfers for the GPU, as memory bandwidth is limited.
Of course, adding the compressor to the post process tool is slightly more complex.
Do note that compressed textures lose a bit of quality, depending on the algorithm and implementation.
Silly question but are you sure you are sampler limited? It just seems to me that, with your "two 2-way textures" you are potentially pulling in a lot of texture data, and you might instead be bandwidth limited.
What if you were to use 3 textures [BaseRGB, NormalRGB, and combined Emission+Specular] and use PVRTC compression? Depending on the detail, you might even be able to use 2bpp (rather than 4bpp) for the BaseRGB and/or Emission+Specular.
For the Normals I'd probably stick to 4bpp. Further, if you can afford the shader instructions, only store the R&G channels (putting 0 in the blue channel) and re-derive the blue channel with a bit of maths. This should give better quality.
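The "bit of maths" is the unit-length constraint of the normal. Below is a minimal Python sketch of it, assuming the usual encoding where each channel stores a component remapped from -1..1 to 0..1 and that the normals are tangent-space normals with a non-negative third component. Both are assumptions on my part, and in practice this would be a couple of instructions in the fragment shader; the function name is made up.

```python
import math


def reconstruct_normal(r, g):
    """Rebuild a unit normal from the stored red and green channels (0..1)."""
    x = r * 2.0 - 1.0
    y = g * 2.0 - 1.0
    # Clamp at zero so compression error cannot produce a negative root.
    z = math.sqrt(max(0.0, 1.0 - x * x - y * y))
    return (x, y, z)


# A "flat" texel (0.5, 0.5) decodes to the straight-up normal.
flat = reconstruct_normal(0.5, 0.5)
```

The quality gain comes from the compressor only having to encode two channels; the cost is the extra shader arithmetic per fragment.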
I'm currently trying to reduce the memory size of my textures. I already use TexturePacker, as well as .pvr.ccz files with either RGB565 or RGBA5551. This, however, often leads to a huge, unacceptable reduction in texture quality.
Specifically, I have a spritesheet for the main character. In size it's roughly 4k*2.5k pixels. This is not really negotiable, as we have lots of different animations and we need the character in a size acceptable for the retina displays of iPads. So reducing the size of the character sprite would again result in huge reductions of quality when we use him in the scene.
So of course I'm trying to use 16 bit textures as often as possible. Using the above mentioned spritesheet as a 16 bit texture takes about 17 MB of memory. This is already a lot. As it's a spritesheet for a character, the texture needs transparency and therefore I need to use RGBA5551 as the colour depth. With only 1 bit for the alpha channel, the character just looks plain ugly. In fact, everything that needs alpha looks rather ugly with only 1 bit for the alpha channel.
However, if I use RGBA8888 instead, the spritesheet uses double the memory, around 34 MB. Now imagine several characters in a scene and you'll end up with 100 MB of memory for characters alone. Add general overhead, sound, background, foreground, objects and UI to it and you'll end up with far too much memory. In fact, 100 MB is "far too much memory" as far as I'm concerned.
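For reference, the arithmetic behind those figures can be sketched in a few lines of Python. The 4096x2048 size used here is an assumption: I only gave the size as "roughly 4k*2.5k", but an uncompressed texture of that size matches the 17 MB and 34 MB numbers. The function name is made up for this example.

```python
def texture_bytes(width, height, bits_per_pixel):
    """Memory used by an uncompressed texture, without mipmaps."""
    return width * height * bits_per_pixel // 8


# Assumed sheet size of 4096x2048 texels.
sheet_16bit = texture_bytes(4096, 2048, 16)   # RGBA5551, RGBA4444 or RGB565
sheet_32bit = texture_bytes(4096, 2048, 32)   # RGBA8888

print(sheet_16bit / 1e6, "MB at 16 bits per pixel")
print(sheet_32bit / 1e6, "MB at 32 bits per pixel")
```

The cost scales linearly with bits per pixel and with the number of texels, so halving either one halves the footprint.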
I feel like I'm overlooking something in the whole process. Like something obvious I didn't do or something. RGBA4444 is no solution either; it really looks unacceptably bad.
In short: How do I get acceptable texture quality including an alpha channel for less than 100 MB of memory? "Not at all"? Because that's kinda where it leads as far as I can see.
Split your main texture into 'per character/per animation/per resolution' files. Use .pvr.ccz because they load faster (much faster, I've measured 8x faster on some devices). If you are using TexturePacker, you should be able to eliminate most if not all artefacts from the 'pvr' conversion.
When running your scenes, preload only the 'next' posture/stance/combat that you know you will need. Experiment with asynchronous loading, with a completion block to signal when the texture is available for use. Dump your unused textures as fast as you can. This will tend to keep the memory requirement flattish, at a much lower level than if you load all animations at once.
Finally, do you really need 15 frames for all these animations? I get away with as few as 5 frames for some of the animations (idle, asleep, others too). TexturePacker takes care of animations that are symmetrical around a certain frame: just point frames midPoint+1 ... midPoint+N to frames midPoint-N ... midPoint-1.
Hi there, the effect I want to implement is burning out a user's signature. I've done the signature drawing with Quartz 2D. Can anyone point me in a direction for drawing the burning glow effect? Thanks!
The glow is caused by light streaming from a source through the strokes and illuminating particles in the air as it travels.
So a brute-force solution that works when viewed directly from the front is to draw the plane several times with additive transparency. You'll want to move and scale the plane for each draw so that you're tracing out the shape of a frustum.
You'll need to do so many draws that I can't imagine you'll end up with both real-time performance and an acceptable result. You should be fine if you can spend a second or a half-second or so on preparing the image once, though.
The most obvious alternative would be to work backwards, writing a shader that traces back through the frustum, sampling the 2D texture appropriately. That's likely to cost a similar amount because texture sampling will be the bottleneck due to memory bandwidth (make sure you upload as a one-channel texture in any event), but it could be done so as to work from any angle.