In reading about the PowerVR alpha drivers for Vulkan, I noticed that multibuffering needs to be performed explicitly. Since Vulkan and Metal are so similar, can I actually turn off multibuffering altogether? I am willing to sacrifice throughput for lower latency.
http://blog.imgtec.com/powervr/trying-out-the-new-vulkan-graphics-api-on-powervr-gpus
As a bonus, is it possible to avoid double buffering? I know racing the beam is coming back into style on the desktop, but I don't know whether mobile display tech supports simple single buffering.
Double buffering is not about throughput; in fact, on most modern GPUs latency increases in single-buffered operation.
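For what it's worth, Metal does expose a related knob: you can shrink CAMetalLayer's drawable pool from the default of three down to two, but not to one, so true single buffering is not on offer (the property requires a reasonably recent OS version):

```swift
import QuartzCore

// Sketch: reduce triple buffering to double buffering in Metal.
// Permissible values are 2 and 3; 1 is not allowed.
let layer = CAMetalLayer()
layer.maximumDrawableCount = 2
```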
As for "I know racing the beam is coming back into style on the desktop": most certainly not, because racing the beam works only if you can build your image scanline by scanline. Real-time graphics these days, however, work by sending triangles to the GPU to rasterize, and the GPU has some leeway in the order in which it touches the pixels. The order of the triangles relative to the screen also changes continuously.
Last but not least, all modern graphics systems these days are composited, which runs contrary to racing the beam.
I am using the expensive R32G32B32A32Float format for quality reasons. In particular, I need to retrieve the result by undoing the pre-multiply on the buffer. Now I want to do some optimization, so I wonder whether there would be problems or hidden traps if I mixed textures of different formats, e.g. used another, lighter-weight texture format for non-transparent images.
Note: I am not making a game but doing some image processing. Speed is not much of a concern, since hardware acceleration is already faster than doing the work on the CPU, but GPU memory is quite limited. Hence the question.
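For a sense of the memory at stake, some back-of-the-envelope math (Metal format names are used purely for illustration; R32G32B32A32Float corresponds to rgba32Float):

```swift
// Cost of a 2048×2048 texture at 16 bytes/pixel vs 4 bytes/pixel.
let width = 2048, height = 2048
let fullFloat = width * height * 16  // .rgba32Float equivalent
let byteRGBA  = width * height * 4   // .rgba8Unorm equivalent
print(fullFloat / (1024 * 1024), byteRGBA / (1024 * 1024))  // 64 16 (MB)
```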
I am receiving YUV 420 CMSampleBuffers of the screen in my System Broadcast Extension. However, when I attempt to access the underlying bytes, I get inconsistent results: artefacts that appear to be a mixture of past and future frames. I am accessing the bytes in order to rotate portrait frames a quarter turn to landscape, but the problem reduces to not being able to correctly copy the texture.
The pattern of artefacts can change quite a lot. They can be all over the place, and they seem to have a fundamental "brush shape" of a square tile, sometimes small, sometimes large, which seems to depend on which failing workaround is at hand. They can occur in both the luminance and chroma channels, which results in interesting effects. The "grain" of the artefacts sometimes appears horizontal, which I guess is vertical in the original frame.
I do have two functioning workarounds:
rotate the buffers using Metal
rotate the buffers using CoreImage (even a "software" CIContext works)
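A sketch of the CoreImage variant, in case it helps (a software context can be forced via the useSoftwareRenderer option; error handling elided):

```swift
import CoreImage
import CoreVideo

// Rotate a quarter turn and render into a pre-allocated destination buffer.
let context = CIContext(options: [.useSoftwareRenderer: true])

func rotate(_ source: CVPixelBuffer, into destination: CVPixelBuffer) {
    let rotated = CIImage(cvPixelBuffer: source).oriented(.right)
    context.render(rotated, to: destination)
}
```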
The reason I can't yet ship these workarounds is that System Broadcast Extensions have a very low memory limit of 50MB, memory usage can spike with these two solutions, and there seem to be interactions with other parts of the system (e.g. the AVAssetWriter, or the daemon that dumps frames into my address space). I'm still working to understand memory usage here.
The artefacts seem like a synchronisation problem. However, I have a feeling that this is not so much a new frame being written into the buffer I'm looking at, but rather some sort of stale cache. CPU or GPU? Do GPUs have caches? The tiled nature of the artefacts reminds me of iOS GPUs, but take that with a grain of salt (not a hardware person).
This brings me around to the question title. If this is a caching problem, and Metal / CoreImage has a consistent view of the pixels, maybe I can get Metal to flush the data I want for me, because a BGRA screen capture being converted to a YUV IOSurface has Metal shaders written all over it.
So I took the incoming CMSampleBuffer's CVPixelBuffer's IOSurface and created an MTLTexture from it (with all sorts of cacheModes and storageModes; I haven't tried hazardTrackingModes yet), then copied the bytes out with MTLTexture.getBytes(_:bytesPerRow:from:mipmapLevel:).
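A minimal sketch of that approach for the luma plane (the pixel format, plane index and missing error handling are simplifications):

```swift
import Metal
import IOSurface

// Wrap plane 0 (luma) of the IOSurface in an MTLTexture, then copy out
// its bytes on the CPU.
func copyLumaPlane(from surface: IOSurfaceRef, device: MTLDevice) -> [UInt8]? {
    let width = IOSurfaceGetWidthOfPlane(surface, 0)
    let height = IOSurfaceGetHeightOfPlane(surface, 0)
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .r8Unorm, width: width, height: height, mipmapped: false)
    descriptor.storageMode = .shared  // CPU-accessible
    guard let texture = device.makeTexture(descriptor: descriptor,
                                           iosurface: surface,
                                           plane: 0) else { return nil }
    var bytes = [UInt8](repeating: 0, count: width * height)
    bytes.withUnsafeMutableBytes { buffer in
        texture.getBytes(buffer.baseAddress!,
                         bytesPerRow: width,
                         from: MTLRegionMake2D(0, 0, width, height),
                         mipmapLevel: 0)
    }
    return bytes
}
```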
Yet the problem persists. I would really like to make the CPU deep copy approach work, for memory reasons.
To head off some questions:
it's not a bytes-per-row issue, that would slant the images
in the CPU case I do lock the CVPixelBuffer's base address (a sketch of that copy follows this list)
I even lock the underlying IOSurface
I have tried discarding IOSurfaces whose lock seed changes under lock
I do discard frames when necessary
I have tried putting random memory fences and mutexes all over the place (not a hardware person)
I have not disassembled CoreImage yet
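For completeness, a stripped-down sketch of the locked CPU copy for the luma plane (the chroma plane is handled the same way; error handling elided):

```swift
import CoreVideo
import Foundation

// Deep-copy plane 0 row by row so that padding bytes
// (bytesPerRow > width) are skipped.
func deepCopyLuma(_ pixelBuffer: CVPixelBuffer) -> [UInt8]? {
    CVPixelBufferLockBaseAddress(pixelBuffer, .readOnly)
    defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, .readOnly) }

    guard let base = CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 0) else { return nil }
    let bytesPerRow = CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 0)
    let width = CVPixelBufferGetWidthOfPlane(pixelBuffer, 0)
    let height = CVPixelBufferGetHeightOfPlane(pixelBuffer, 0)

    var copy = [UInt8](repeating: 0, count: width * height)
    for row in 0..<height {
        memcpy(&copy[row * width], base + row * bytesPerRow, width)
    }
    return copy
}
```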
This question is the continuation of one I posted on the Apple Developer Forums.
I have been maintaining my own custom 2D library (written in Objective-C / OpenGL ES 2.0) for a while now, to use in my personal projects (not work). I have also tried cocos2d and SpriteKit now and then, but eventually settled on "reinventing the wheel" because:
It's fun,
Knowledge-wise, I'd rather be the guy who can code a graphics library than just a guy who can use one,
Unlimited possibilities for customization.
Now I am transitioning my code base to Swift, and (besides all the design differences that arise when moving to a language where class inheritance takes a back seat to protocols, etc.) I was thinking that while I'm at it, I should consider transitioning to Metal as well. If anything, for the sake of future-proofing (also, I'm all for learning new technologies, and to be honest OpenGL/OpenGL ES are a terribly cluttered bag of "legacy" and backwards compatibility).
My library is designed around well-known OpenGL (ES) performance bottlenecks: use of texture atlases and mesh consolidation to reduce draw calls, rendering opaque sprites first and semitransparent ones last (ordered back to front), etc.
My question is: Which of these considerations still apply to Metal, and which ones should I not even bother implementing (because they're not a performance issue anymore)?
Metal is only available on the subset of iOS devices that support OpenGL ES 3, so to be fair you need to compare Metal to GLES3.
Texture atlases & mesh consolidation:
With Metal, CPU cost of draw calls is lower than with GLES3, and you can parallelize draw call setup on multiple threads.
So this could allow you to skip atlasing and consolidation... but those are good practices, so it would be even better to keep them with Metal and use the extra CPU time to do more things!
Note that with GLES3, by using instancing and texture arrays, you should also be able to get rid of atlasing while keeping the draw call count low.
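As an illustration, multithreaded encoding in Metal can look roughly like this (SpriteBatch and the actual pipeline setup are placeholders):

```swift
import Metal
import Dispatch

struct SpriteBatch { /* vertex buffer, texture, count... */ }

// One sub-encoder per batch, each recorded from its own thread.
func encode(batches: [SpriteBatch],
            commandBuffer: MTLCommandBuffer,
            renderPass: MTLRenderPassDescriptor) {
    guard let parallel = commandBuffer.makeParallelRenderCommandEncoder(descriptor: renderPass)
    else { return }
    // Sub-encoders execute in creation order, so create them up front...
    let encoders = batches.map { _ in parallel.makeRenderCommandEncoder()! }
    // ...then fill them concurrently.
    DispatchQueue.concurrentPerform(iterations: batches.count) { i in
        let encoder = encoders[i]
        // encoder.setRenderPipelineState(...), encoder.drawPrimitives(...)
        encoder.endEncoding()
    }
    parallel.endEncoding()
}
```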
Rendering opaque sprites first, and semitransparent ones last
Metal will change absolutely nothing here: this is a constraint of the PowerVR GPUs' tile-based deferred renderer, and whatever driver you use, the GPU hardware stays the same. In any case, rendering opaques before semitransparents is the recommended way to proceed in 3D, whether you use DirectX, OpenGL or Metal...
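The usual ordering looks something like this (Sprite and draw are stand-ins for your own types):

```swift
struct Sprite { var depth: Float; var isOpaque: Bool }
func draw(_ sprite: Sprite) { /* issue the draw call */ }

// Opaques first (order among them matters less on a TBDR GPU),
// then transparents blended back to front.
func render(_ sprites: [Sprite]) {
    let opaque = sprites.filter { $0.isOpaque }
    let transparent = sprites.filter { !$0.isOpaque }
        .sorted { $0.depth > $1.depth }  // farthest first
    (opaque + transparent).forEach(draw)
}
```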
Metal will not help if you are fillrate bound!
In general, Metal will only give you improvements on the CPU side.
If your performance is limited by fillrate (fragment shaders too complex, too much transparent overdraw, resolution too high, etc.), then you will get the exact same result in Metal and GLES3 (assuming that you have carefully optimized your shaders for each API).
I am testing the rendering of extremely large 3d meshes, and I am currently testing on an iPhone 5 (I also have an iPad 3).
I have here two screenshots of Instruments with a profiling run. The first one is rendering a 1.3M vertex mesh, and the second is rendering a 2.1M vertex mesh.
The blue histogram bar at the top shows CPU load. For the first mesh it hovers at around ~10%, so the GPU is doing most of the heavy lifting. The mesh is very detailed, and my point-light-with-specular shader makes it look quite impressive if I say so myself, as it renders consistently above 20 frames per second. Oh, and 4x MSAA is enabled as well!
However, once I step up to the 2-million-plus-vertex mesh, everything goes to crap: we see a massively CPU-bound situation, and all instruments report 1 frame per second.
So it's pretty clear that somewhere between these two assets (and I will admit that both are tremendously large meshes to be loading into one single VBO), some limit is being surpassed by the 2-megavertex (462K tris) mesh, whether it is the vertex buffer size or the index buffer size that is over the line.
So the question is: what is this limit, and how can I query it? It would really be preferable to have some reasonable assurance that my app will function well without exhaustively testing every device.
I also see an alternative approach to this problem: stick to a known-good VBO size limit (I have read that 4MB is a good limit) and just have the CPU work a little harder if the mesh being rendered is monstrous. With a 100MB VBO, splitting it into 4MB chunks (segmenting the mesh into 25 draw calls) does not really sound that bad.
But I'm still curious: how can I check the maximum size, in order to work around the CPU fallback? Could I be running into an out-of-memory condition, with Apple simply applying a CPU-based workaround (oh LORD have mercy, 2 million vertices in immediate mode...)?
In pure OpenGL, there are two implementation-defined attributes: GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES. When they are exceeded, performance can drop off a cliff in some implementations.
I spent a while looking through the OpenGL ES specification for the equivalent and could not find it. Chances are it's buried in one of the OES or vendor-specific extensions to OpenGL ES. Nevertheless, there is a very real hardware limit to the number of elements and vertices you can draw; past a point, too many indices can exceed the capacity of the post-T&L cache. 2 million is a lot for a single draw call, so in lieu of being able to query the OpenGL ES implementation for this information, I'd try successively lower powers of two until you dial it back to the sweet spot.
65,536 used to be the sweet spot on DX9 hardware. That was the limit for 16-bit indices, and it was always guaranteed to be below the maximum hardware vertex count. Chances are it'll work for OpenGL ES-class hardware too...
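A sketch of that batching strategy (how shared vertices get duplicated across chunk boundaries is glossed over here):

```swift
// Cap each draw call at 65,535 indices so every batch can use
// 16-bit indices, then issue one draw per range.
func drawRanges(indexCount: Int, maxPerDraw: Int = 65_535) -> [Range<Int>] {
    stride(from: 0, to: indexCount, by: maxPerDraw).map {
        $0 ..< min($0 + maxPerDraw, indexCount)
    }
}

drawRanges(indexCount: 2_100_000).count  // 33 draw calls instead of 1
```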
I hear a lot that power-of-two textures are better for performance reasons, but I couldn't find enough solid information about whether this is a problem when using XNA. Most of my textures have arbitrary dimensions and I don't see much of a problem, but maybe the VS profiler just doesn't show it.
In general, power-of-two textures are better, but most graphics cards allow non-power-of-two textures with minimal loss of performance. However, if you use the XNA Reach profile, non-power-of-two textures come with restrictions (no wrap addressing, no mipmaps, no DXT compression), and some low-end graphics cards only support the Reach profile.
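If you do end up padding textures to power-of-two sizes, the helper is trivial (Swift used here for illustration; the logic is identical in C#):

```swift
// Round a texture dimension up to the next power of two.
func nextPowerOfTwo(_ n: Int) -> Int {
    var p = 1
    while p < n { p <<= 1 }
    return p
}

nextPowerOfTwo(300)  // 512
nextPowerOfTwo(512)  // 512
```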
XNA is really a layer built on top of DirectX, so any performance guideline that applies to DirectX will also apply to anything using XNA.
The VS profiler also won't really capture the graphics-specific work you are doing; that needs to be profiled separately with a tool that can check how the graphics card itself is doing. If the graphics card is struggling, it won't show up as high resource usage on your CPU, but rather as a slow rendering speed.