glDrawElements massive CPU usage on iOS

Hardware: iPad2
Software: OpenGL ES 2.0, C++
glDrawElements seems to take up about 25% of the CPU time, putting the CPU at 18 ms and the GPU at 10 ms per frame.
When I don't use an index buffer and call glDrawArrays instead, it speeds up, and glDrawArrays barely shows up in the profiler. Everything else is the same; the glDrawArrays path has more vertices, because without the index buffer I have to duplicate vertices in the VBO.
What I've checked so far:
virtually the same number of state changes between the two methods
the vertex structure is two floats (8 bytes)
the index buffer is 16-bit (tried 32-bit as well)
GL_STATIC_DRAW for both buffers
the buffers don't change after load
the same VBO and index buffer are rendered multiple times per frame, with different offsets and counts
no OpenGL errors
So it looks like it's doing a software fallback of some sort, but I can't figure out what would cause OpenGL to fall back.
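For reference, this is roughly the shape of the two paths being compared (handle names, offsets, and counts below are placeholders, not my actual code):

    // Shared setup: one static VBO, two floats per vertex.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 2 * sizeof(GLfloat), 0);
    glEnableVertexAttribArray(0);

    // Slow path: indexed drawing from a static 16-bit index buffer.
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
                   (const GLvoid*)(indexOffset * sizeof(GLushort)));

    // Fast path: non-indexed drawing; vertices are duplicated in the VBO.
    glDrawArrays(GL_TRIANGLES, firstVertex, vertexCount);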

There are a few things that immediately jump to mind that might affect speed the way you describe.
For one, many commands are issued passively to reduce the number of bus transfers: they are queued up and wait for the next batch transfer. State changes, texture changes, and similar commands all accumulate. It is possible that the draw commands trigger a larger transfer in one case but not the other, or that you are triggering more frequent transfers in one case than the other.
For another, your specific models might be better organized for one draw call than the other. You need to look at how big they are, whether they reuse index values, and whether they are optimized or reordered for rendering. glDrawArrays may require more data to be transferred, but if your models are small the overhead may not be much of a concern.
Draw frequency also matters: you want to kick off calls often enough to keep the card busy and let your CPU do other work, rather than letting everything accumulate in the command buffer waiting to be sent, but it needs to be balanced, since each transfer has a cost.
And to top it off, indexed values can benefit from cache effects when they are frequently reused, while arrays can benefit from cache effects when they are accessed linearly. So you need to know your data, since different types of data benefit from different methods.
Even Apple seems to be unsure which method to use.
Up until iOS 7, the OpenGL ES Programming Guide for iOS for that version and earlier said:
For best performance, your models should be submitted as a single unindexed triangle strip using glDrawArrays with as few duplicated vertices as possible. If your models require many vertices to be duplicated (...), you may obtain better performance using a separate index buffer and calling glDrawElements instead. ... For best results, test your models using both indexed and unindexed triangle strips, and use the one that performs the fastest.
But their updated OpenGL ES Programming Guide for iOS, which applies to iOS 8, offers the opposite:
For best performance, your models should be submitted as a single indexed triangle strip. To avoid specifying data for the same vertex multiple times in the vertex buffer, use a separate index buffer and draw the triangle strip using the glDrawElements function.
It looks like in your case you have simply tried both and found that one method is better suited for your data. If you want to quantify the difference, the only reliable answer is to measure, as sketched below.
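Since the advice boils down to measuring, a crude way to compare the two paths is to time whole frames around a forced pipeline drain. This is only a sketch for offline comparison (glFinish serializes CPU and GPU and should never ship), and the names are made up:

    #include <chrono>

    // Times an average frame for one draw path. 'draw' issues either the
    // glDrawElements or the glDrawArrays version of the scene.
    template <typename DrawFunc>
    double averageFrameMs(DrawFunc draw, int frames)
    {
        glFinish(); // start from an idle pipeline
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < frames; ++i) {
            draw();
            glFinish(); // wait until the GPU has really finished the frame
        }
        auto end = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<double, std::milli>(end - start).count()
               / frames;
    }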

Related

Need for CPU-GPU sync if write and read happens on different memory pages of MTLBuffer?

I am using an MTLBuffer in Metal that I created by allocating several memory pages (using vm_allocate) with
device.makeBuffer(bytesNoCopy:length:options:deallocator:).
The CPU writes the buffer and the GPU only reads it. I know that in general I need to synchronize between the CPU and the GPU.
However, I know more about where in the MTLBuffer the writes (by the CPU) and reads (by the GPU) happen, and in my case the writes go to different memory pages than the reads (within a given time interval).
My question: do I need to sync between CPU and GPU even if the data being written and read lives on different memory pages (but in the same MTLBuffer)? Intuitively I would think not, but MTLBuffer is a bit opaque and I don't really know what kind of processing the GPU actually does with, or what requirements it has for, the MTLBuffer.
Additional info: This is a question for iOS and MTLStorageMode is shared.
Thank you very much for your help!
Assuming the buffer was created with MTLStorageModeManaged, you can use the didModifyRange function to sync CPU to GPU for only a portion (a page, for example) of the buffer. Note, though, that managed storage (and therefore didModifyRange) exists only on macOS; with MTLStorageModeShared on iOS, CPU writes are visible to the GPU without an explicit sync call, and the remaining concern is making sure the GPU is not executing a command buffer that reads the pages you are currently writing.
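As a sketch, through the metal-cpp binding this would look roughly as follows (the Swift equivalent is buffer.didModifyRange(_:); the buffer pointer, page size, and page index are illustrative):

    // Flush one page worth of CPU writes to the GPU. Only meaningful for
    // MTLStorageModeManaged buffers; shared buffers need no such call.
    const NS::UInteger pageSize = 16384; // vm_page_size on arm64
    buffer->didModifyRange(NS::Range::Make(pageIndex * pageSize, pageSize));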

vulkan pushConstant vs uniform buffer update

So I am reading the Vulkan book now and have a question about push constants and UBO updates.
After I set up all the pipeline and descriptor stuff, basically I just need to copy the data into the UBO buffer, e.g. with memcpy, and I am done.
I can understand the issue that the whole pipeline needs to wait for this buffer to be ready before its content can be changed, so it will be slow.
On the other hand, when I use push constants, there is no such problem, although they are small (say 256 bytes).
So far so good.
However, on second thought, I realize that if I am updating the UBO, I don't need to change or re-record the command buffer; I can submit the old command buffer since it is still the same.
Whereas if I want to update via push constants, I have to reset the command buffer, record it again, and then submit it.
So won't that be an issue? How can I tell which one is faster?
Thanks.
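For context, the UBO update being described is roughly this pattern (names like UniformData and computeFrameData are illustrative; it assumes the UBO lives in host-visible, host-coherent memory and stays persistently mapped):

    #include <cstring> // memcpy

    // Map once after allocation and keep the pointer around.
    void* mapped = nullptr;
    vkMapMemory(device, uboMemory, 0, sizeof(UniformData), 0, &mapped);

    // Per frame: copy the new values; the recorded command buffer is unchanged.
    UniformData data = computeFrameData();
    memcpy(mapped, &data, sizeof(data));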
Lots of people get confused on this issue, because the Vulkan Tutorial pre-records commands while the Vulkan Guide re-records commands every frame.
When people say to use push constants for per-frame changing data like transform matrices and time values, there is the implicit assumption that you are recording the command buffer every frame. Push constants essentially hitch a ride with the rest of your commands when submitted, which is also how they avoid the synchronization and cache flushing that buffer updates require.
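To make the "hitch a ride" point concrete, per-frame recording with push constants looks roughly like this (the pipeline layout, stage flags, and PushData struct are placeholders):

    // Recorded every frame, between vkBeginCommandBuffer and vkEndCommandBuffer.
    PushData push = computeFrameData(); // e.g. a transform matrix; the spec
                                        // guarantees at least 128 bytes
    vkCmdPushConstants(cmd, pipelineLayout, VK_SHADER_STAGE_VERTEX_BIT,
                       0, sizeof(push), &push);
    vkCmdDrawIndexed(cmd, indexCount, 1, 0, 0, 0);
    // The push-constant bytes travel inside the command buffer itself, which is
    // why no separate buffer synchronization or cache flush is needed.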
Now, in a lot of scenarios, re-recording command buffers can be easier and not significantly more costly than re-use. And indeed, re-using command buffers when things change can be a real pain to manage. Command buffers are meant to be fast to record. Still, the Vulkan tutorial went with pre-recording everything, which is also a valid approach though potentially harder to maintain at scale.
At the time it was created, the Vulkan Tutorial was essentially one of the only resources available for learning Vulkan in a structured manner. Even though command buffers are quick to record, pre-recording them eliminates even more CPU overhead and exemplifies Vulkan's "Never be draw call limited again" mantra of eliminating CPU overhead in graphics applications.
As for the speed comparison, you'll have to benchmark, but I would not necessarily choose one or the other for "speed" reasons. If you pre-record, you don't want to re-orient your entire rendering architecture just to take advantage of push constants. If you don't pre-record, there's no reason not to use push constants, they are just straight up easier to deal with.
It seems like you are currently pre-recording. I would not bother with push constants at all for this kind of data. I would also not focus on these kinds of issues until you are more familiar with Vulkan; it is very easy to get caught in the weeds with optimization in Vulkan, and strategies for optimization are nowhere near as uniform as in the CPU space.

EAGLContext_presentRenderBuffer taking the majority of time in an OpenGL ES stress test

I'm using Instruments to capture information on an OpenGL ES stress test for my engine.
After a long period, the top 3 Functions (using API Statistics from the OpenGL ES Analyzer Instrument) are :
EAGLContext_presentRenderBuffer (654,827,246)
glBufferData (16,128,155)
glDrawElements (11,555,768)
Why is EAGLContext_presentRenderBuffer so high? My guess is that, because CPU utilization is so low, this timing also includes the time the CPU spends stalled waiting for vsync.
Is that correct? If not, what else could explain the high cost of this function?
In my experience, a large part of this comes from the "deferred" part of the tile-based deferred renderers used in iOS devices. When setting up the rendering of your scene, the GPU puts off a lot of the draw calls until just before they are needed.
In many cases, that can mean the OpenGL ES draw calls appear to be very quick when timed on the CPU, while the last call that reads from or displays the scene seems to take a lot of time. That last call blocks until all rendering is completed, because the completed image has to be ready before it can be displayed onscreen.
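A rough way to see this attribution effect in a debug build (issueDrawCalls is a placeholder for your submission code; glFinish stalls the pipeline and is for measurement only):

    #include <chrono>
    using Clock = std::chrono::high_resolution_clock;

    auto t0 = Clock::now();
    issueDrawCalls();       // all the glBufferData / glDrawElements work
    auto t1 = Clock::now(); // usually tiny: commands were merely queued

    glFinish();             // force the deferred rendering to actually execute
    auto t2 = Clock::now(); // the "cheap" draw calls now show their real cost

    double queueMs  = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double renderMs = std::chrono::duration<double, std::milli>(t2 - t1).count();

    // Without the glFinish above, renderMs would be billed to the present call:
    // [context presentRenderbuffer:GL_RENDERBUFFER];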
Unfortunately, this can make it hard to profile your rendering because you can't get an accurate assessment of what stages in your OpenGL ES scene are the slowest. This is where I've relied on the OpenGL ES Driver instrument to tell me whether I'm geometry or fill rate limited and then put dummy elements in my pipeline to try to localize bottlenecks.
We don't really have a good counterpart to Time Profiler for OpenGL ES yet, and I recommend filing a feature request for one if you'd like to see that. I know I would.

OpenGL DisplayList using video memory

Is it possible to store the display list data on the video card memory?
I want the display list to be stored only in video memory, the way a vertex buffer object (VBO) can be.
But when I try it, it always uses main memory instead of video memory.
I tested on an NVIDIA GeForce 8600 GTS and a GTX 260.
Display lists are a very old feature that dates back to OpenGL 1.0. They were deprecated a long time ago, although you can still use them for compatibility reasons.
The way OpenGL works prevents display lists from being held in GPU memory only. The graphics server (as OpenGL calls it) is a purely abstract thing, and the specification guarantees that whatever you put in a display list is always available. However, modern GPUs have only a limited amount of memory, so payload data may be swapped in and out as needed.
Effectively, GPU memory is a cache for data in system RAM (the same way system RAM should be treated as a cache for storage).
Even more so, modern GPUs may crash, and the drivers will perform a full reset, giving the user the impression that everything works normally. But after the reset, all the data in GPU memory must be reinitialized.
So it is necessary for OpenGL to keep a copy of every piece of payload data in system memory to support smooth operation.
Hence it is perfectly normal for your data to show up as consuming system RAM as well. It is, though, very likely that the display lists are also cached in GPU memory.
Display lists are deprecated. You can use a VBO with vertex indices to put the data in graphics memory and draw it with glDrawElements.
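A minimal sketch of that replacement (the vertex/index arrays, counts, and attribute setup are placeholders):

    // Upload vertex and index data once; GL_STATIC_DRAW hints that the driver
    // should keep the data in video memory if it can.
    GLuint vbo, ibo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);

    glGenBuffers(1, &ibo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices,
                 GL_STATIC_DRAW);

    // Per frame: bind both buffers, set up vertex attributes, then draw.
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);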

Determining available video memory

When developing an OpenGL program, is there a way to poll from the system to find out just how many megabytes are available to store textures, etc?
Or is the standard approach these days just allocate memory and forget about everything?
Although the official stance remains "you don't need to know, you don't want to know, and it would not help you anyway", luckily at least two IHVs have shown a little more insight lately and offer extensions to query that information:
NVX_gpu_memory_info
ATI_meminfo
One nice thing about these extensions is that they have a least common denominator which is just what most people need, and you don't need to query extension support or do anything special, as they both work via glGetIntegerv.
In the easiest case, you can just initialize an array of 4 integers to zero (or some minimum default value that you'll assume in case the extensions don't work), then call glGetIntegerv twice (with GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX and TEXTURE_FREE_MEMORY_ATI, respectively), and finally call glGetError to clear the error state. glGetIntegerv does not modify the pointed-to memory if it fails, nor does it crash or do anything else bad; it merely sets the error state to GL_INVALID_ENUM.
Both extensions return a value in the first array position; the ATI one returns values in the other three as well.
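A sketch of that pattern (the enum values are the ones published in the two extension specs):

    // Query free video memory via either vendor extension. If an enum is
    // unsupported, glGetIntegerv leaves the array untouched and only sets
    // GL_INVALID_ENUM.
    #define GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049
    #define TEXTURE_FREE_MEMORY_ATI                      0x87FC

    GLint freeKiB[4] = { 0, 0, 0, 0 }; // defaults assumed on failure
    glGetIntegerv(GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, freeKiB);
    glGetIntegerv(TEXTURE_FREE_MEMORY_ATI, freeKiB);
    glGetError(); // clear any GL_INVALID_ENUM left by an unsupported query

    // freeKiB[0] now holds an estimate of available memory, in KiB, if either
    // extension is present.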
n.b.: glAreTexturesResident has not been supported for almost a decade on mainstream hardware, in the same manner as texture priorities. The common mantra is that the driver writer knows much better than you anyway.
OpenGL doesn't give you this information. And frankly, there's only little benefit, simply because today we have multitasking operating systems. The OpenGL driver is responsible for swapping texture data to and from system memory if there is demand for it.
What OpenGL can do for you is tell you whether the textures you've uploaded are still resident in fast memory. The function is called glAreTexturesResident. You can use this to gradually upload stuff to the GPU until you've filled up the GPU's memory. But keep in mind that you're not the only user of the GPU.
