I have two different algorithms and want to know which one performs better in OpenGL ES.
There's the Time Profiler tool in Instruments, which tells me what percentage of the overall processing time each line of code consumes, but that percentage is always relative to the algorithm being profiled.
How can I get an absolute value so I can compare which algorithm performs better? Really, I just need the percentage of overall CPU occupation. I couldn't find that in Time Profiler; it only shows percentages of consumed time, not the overall CPU workload.
There was also a WWDC session showing a nifty CPU tracker that displayed each core separately. Which performance instrument do I need, and which values should I look at for this comparison?
The situation you're talking about, optimizing OpenGL ES performance, is something that Time Profiler isn't well suited to help you with. Time Profiler simply measures the CPU-side time spent in various functions and methods, not the actual load something places on the GPU when rendering. Also, the deferred nature of the iOS GPUs means that processing for draw calls can take place much later than you'd expect, causing certain functions to look like bottlenecks when they aren't; they just happen to be the point at which actions queued up by earlier calls are finally executed.
As a suggestion, don't measure in frames per second, but instead report the time in milliseconds it takes from the start of your frame rendering to just after a glFinish() or -presentRenderbuffer: call. When you're profiling, you want to work directly with the time it takes to render, because it's easier to understand the impact you're having on that number than on its inverse, frames per second. Also, as you've found, iOS caps its display framerate at 60 FPS, but you can measure rendering times well below 16.7 ms to tell the difference between your two fast approaches.
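As a rough illustration, here is a minimal C sketch of that measurement, assuming an OpenGL ES 2.0 context is current and that drawScene() is a placeholder for your own rendering code; CACurrentMediaTime() from QuartzCore is a convenient monotonic clock that is callable from plain C.

```c
#include <QuartzCore/CABase.h>    /* CACurrentMediaTime() */
#include <OpenGLES/ES2/gl.h>

/* Placeholder for your own rendering code. */
extern void drawScene(void);

/* Returns the wall-clock time, in milliseconds, spent rendering one frame.
   glFinish() drains the command queue so the deferred GPU work is included
   in the measurement; it also stalls the pipeline, so only do this while
   profiling. */
static double timedFrameMilliseconds(void)
{
    CFTimeInterval start = CACurrentMediaTime();
    drawScene();
    glFinish();
    return (CACurrentMediaTime() - start) * 1000.0;
}
```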
In addition to time-based measurements, look at the Tiler and Renderer Utilization statistics in the OpenGL ES Driver instrument to see the load you are placing on the vertex and fragment processing portions of the GPU. When combined with the overall CPU load of your application while rendering, this can give a reasonable representation of the efficiency of one approach vs. another.
To answer your last question: the Time Profiler instrument has a CPU strategy that lets you view each CPU core separately. Above the instrument list are three small buttons; the center one is selected initially.
Click the left button to switch to the CPU strategy.
In reading about the PowerVR alpha drivers for Vulkan, they note that multibuffering needs to be performed explicitly. Since Vulkan and Metal are so similar, can I actually turn off multibuffering altogether? I am willing to sacrifice throughput for low latency.
http://blog.imgtec.com/powervr/trying-out-the-new-vulkan-graphics-api-on-powervr-gpus
As a bonus, is it possible to avoid double buffering? I know racing the beam is coming back into style on the desktop but I don't know if mobile display tech supports simple single-buffering.
Double buffering is not about throughput; in fact, on most modern GPUs latency increases in single-buffered operation.
"I know racing the beam is coming back into style on the desktop"
Most certainly not, because racing the beam works only if you can build your image scanline by scanline. However, realtime graphics these days operate by sending triangles to the GPU to rasterize, and the GPU has some leeway in the order in which it touches the pixels. Also, the order of the triangles relative to the screen changes continuously.
Last but not least, all modern graphics systems these days are composited, which runs contrary to racing the beam.
I am testing the rendering of extremely large 3D meshes, and I am currently testing on an iPhone 5 (I also have an iPad 3).
I have here two screenshots of Instruments with a profiling run. The first one is rendering a 1.3M vertex mesh, and the second is rendering a 2.1M vertex mesh.
The blue histogram bar at the top shows CPU load, and for the first mesh it hovers at around 10%, so the GPU is doing most of the heavy lifting. The mesh is very detailed, and my point-light-with-specular shader makes it look quite impressive if I say so myself, as it renders consistently above 20 frames per second. Oh, and 4x MSAA is enabled as well!
However, once I step up to the 2-million-plus-vertex mesh, everything falls apart: we see a massively CPU-bound situation, and all instruments report 1 frame per second.
So it's pretty clear that somewhere between these two assets (and I will admit they are both tremendously large meshes to be loading into a single VBO), some limit is being exceeded by the 2-million-vertex (462K triangle) mesh, whether it is the vertex buffer size or the index buffer size.
So the question is: what is this limit, and how can I query it? It would really be preferable to have some reasonable assurance that my app will function well without exhaustively testing every device.
I also see an alternative approach to this problem: stick to a known-good VBO size limit (I have read that 4MB is a good limit) and simply have the CPU work a little harder if the mesh being rendered is monstrous. With a 100MB VBO, splitting it into 4MB chunks (segmenting the mesh into 25 draw calls) does not really sound that bad.
But I'm still curious: how can I check the maximum size, in order to work around the CPU fallback? Could I be running into an out-of-memory condition, with Apple simply applying a CPU-based workaround (oh LORD have mercy, 2 million vertices in immediate mode...)?
In pure OpenGL, there are two implementation-defined attributes: GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES. When they are exceeded, performance can drop off a cliff in some implementations.
I spent a while looking through the OpenGL ES specification for the equivalent and could not find it. Chances are it's buried in one of the OES or vendor-specific extensions to OpenGL ES. Nevertheless, there is a very real hardware limit to the number of elements and vertices you can draw; past a certain point, too many indices can exceed the capacity of the post-T&L cache. Two million is a lot for a single draw call. In the absence of a way to query the OpenGL ES implementation for this information, I'd try successively lower powers of two until you dial it back to the sweet spot.
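For reference, on desktop OpenGL the query would look something like the sketch below. These enums are not part of core OpenGL ES 2.0, so treat this as an illustration rather than something you can paste into an iOS project.

```c
#include <stdio.h>
#include <GL/gl.h>   /* desktop OpenGL; some platforms also need <GL/glext.h> */

/* Prints the implementation's recommended limits for indexed draw calls
   (the values glDrawRangeElements is tuned for). Requires a current
   desktop GL context. */
static void printRecommendedElementLimits(void)
{
    GLint maxVertices = 0, maxIndices = 0;
    glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &maxVertices);
    glGetIntegerv(GL_MAX_ELEMENTS_INDICES,  &maxIndices);
    printf("recommended vertices per draw: %d, indices per draw: %d\n",
           (int)maxVertices, (int)maxIndices);
}
```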
65,536 used to be a sweet spot on DX9 hardware. That was the limit for 16-bit indices and was always guaranteed to be below the maximum hardware vertex count. Chances are it'll work for OpenGL ES class hardware too...
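If you do go the route of splitting the draw calls, a hypothetical helper along these lines shows the general shape (OpenGL ES 2.0, with GLushort indices already bound in GL_ELEMENT_ARRAY_BUFFER). Note that truly staying under a 65,536-vertex limit also means partitioning the vertex data itself, which this sketch does not attempt.

```c
#include <stddef.h>
#include <OpenGLES/ES2/gl.h>

/* Hypothetical helper: issues one large indexed mesh as several smaller
   glDrawElements calls instead of a single huge one. indexCount is the
   total number of GLushort indices in the currently bound
   GL_ELEMENT_ARRAY_BUFFER; batchSize should be a multiple of 3 so each
   batch ends on a whole triangle. */
static void drawMeshInBatches(GLsizei indexCount, GLsizei batchSize)
{
    for (GLsizei first = 0; first < indexCount; first += batchSize) {
        GLsizei count = indexCount - first;
        if (count > batchSize)
            count = batchSize;
        /* The last argument is a byte offset into the bound index buffer. */
        glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT,
                       (const void *)(first * sizeof(GLushort)));
    }
}
```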
Is there any performance loss for using non-power-of-two textures under iOS? I have not noticed any in my quick benchmarks. I can save quite a bit of active memory by dumping them all together, since there is a lot of wasted padding (despite texture packing). I don't care about the older hardware that can't use them.
This can vary widely depending on the circumstances and your particular device. On iOS, the loss is smaller if you use NEAREST filtering rather than LINEAR, but it isn't huge to begin with (think 5-10%).
Does anyone know of an easy way to add a constant latency (about 30 ms) to the graphical output of an XNA 4 application?
I want to keep my graphical output in sync with a real-time buffered audio stream which inherently has a constant latency.
Thanks for any ideas on this!
Max
If you really need to delay your graphics, what you could do is render your game to a cycling series of render targets, so that on frame n you display the frame you rendered at frame n-2. This only works for small latencies, and it requires a large amount of additional graphics memory and a small amount of extra GPU time.
A far better method is not to delay the graphical output at all, but to delay the audio that is being used to generate the graphical output, either by buffering it or by keeping two read positions in your audio buffer, with the "audio" read X ms (the latency) ahead of the "game" read.
So if your computer's audio hardware has 100 ms of latency (not uncommon) and your graphics hardware has a latency of 16 ms: as you are feeding the sample at 100 ms into the audio system, you are feeding the sample at 16 ms into your graphics calculation. At the same time, the audio from 0 ms is hitting the speakers, and the matching graphic is hitting the screen.
Obviously this won't work if the thing generating the graphical output is also generating the audio. But the general principle of both these methods is that you have to buffer the input somewhere along your graphics chain in order to introduce a delay that matches the one you are experiencing for audio. Where along that chain it is easiest to insert a buffer is up to you.
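Here is a minimal C sketch of the two-read-position idea (the thread is about XNA, so treat the types and names as hypothetical): the audio read simply leads the game/graphics read by the measured latency, expressed in samples.

```c
#include <stddef.h>

/* Sketch of a buffer with two read positions: the audio read leads the
   game/graphics read by 'delay' samples (the measured output latency). */
typedef struct {
    const float *samples;   /* decoded mono sample data            */
    size_t       length;    /* total number of samples             */
    size_t       gameRead;  /* position used to drive the visuals  */
    size_t       delay;     /* output latency expressed in samples */
} DelayedStream;

/* Sample to feed to the audio hardware: 'delay' samples ahead of the visuals. */
static float nextAudioSample(const DelayedStream *s)
{
    size_t audioRead = s->gameRead + s->delay;
    return (audioRead < s->length) ? s->samples[audioRead] : 0.0f;
}

/* Sample used for this frame's graphics calculation; advances the stream. */
static float nextGameSample(DelayedStream *s)
{
    float v = (s->gameRead < s->length) ? s->samples[s->gameRead] : 0.0f;
    s->gameRead++;
    return v;
}
```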
For latencies of <100ms, I wouldn't worry about it for most games. You only really care about this kind of latency for audio programs and rhythm games.
I might not understand the question, but couldn't you keep track of how many times Update is called and use mod 2? 60 fps mod 2 is 30...
I am learning OpenGL and recently discovered about glGenTextures.
Although several sites explain what it does, I can't help wondering how it behaves in terms of speed and, in particular, memory.
Exactly what should I consider when calling glGenTextures? Should I consider unloading and reloading textures for better speed? How many textures should a typical game need? What workarounds are there for any limitations that memory and speed may impose?
According to the manual, glGenTextures only allocates texture "names" (i.e. IDs) with no "dimensionality". So you are not actually allocating texture memory as such, and the overhead here is negligible compared to actual texture memory allocation.
glTexImage is what actually controls the amount of texture memory used per texture. Your application's best use of texture memory will depend on many factors, including the maximum working set of textures used per frame, the available dedicated texture memory of the hardware, and the bandwidth of texture memory.
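To make the split concrete, here is a small sketch (written against OpenGL ES 2.0-style headers; adjust the include and formats for your platform, and treat createTexture and pixels as hypothetical names). The glGenTextures call is cheap; the memory is committed by glTexImage2D.

```c
#include <OpenGLES/ES2/gl.h>   /* adjust for your platform, e.g. <GL/gl.h> */

/* glGenTextures only reserves a name; no storage exists until glTexImage2D
   runs. 'pixels' points to width x height RGBA8 data you decoded yourself. */
static GLuint createTexture(GLsizei width, GLsizei height, const void *pixels)
{
    GLuint name = 0;
    glGenTextures(1, &name);                 /* cheap: just hands back an id */
    glBindTexture(GL_TEXTURE_2D, name);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA,  /* this is where the memory goes */
                 width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    return name;
}
```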
As for your question about a typical game: what sort of game are you creating? Console games are starting to fill Blu-ray disc capacity (I've worked on a PS3 title that was initially not projected to fit on Blu-ray), and a large portion of that space is textures. On the other hand, downloadable web games are much more constrained.
Essentially, you need to work with reasonable game design and come up with an estimate of:
1. The total textures used by your game.
2. The maximum textures used at any one time.
Then you need to look at your target hardware and decide how to make it all fit.
Here's a link to an old Game Developer article that should get you started:
http://number-none.com/blow/papers/implementing_a_texture_caching_system.pdf