iOS audio acceleration

Is anybody using OpenGL ES 2.0 shaders (GLSL) successfully for audio synthesis?
I already use vDSP to accelerate audio in my iOS app; it provides a simple vector instruction set callable from C code. The main problem with vDSP is that you have to write what amounts to vector-oriented assembly language, because the main per-sample loop gets pushed down into each primitive operation (vector add, vector multiply). Compiling expressions into these sequences is exactly what shader languages automate for you. OpenCL is not public on iOS. It is also interesting that GLSL is compiled at runtime, which means that if most of the sound engine could live in GLSL, users could make non-trivial patch contributions.
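A minimal sketch of what that looks like with vDSP (the buffer names and sizes here are made up): to compute out = a * b + c over one audio buffer, each operator becomes its own vector primitive, and each primitive loops over the whole buffer internally.

import Accelerate

let n = vDSP_Length(512)                          // assumed buffer length
let a = [Float](repeating: 0.5, count: 512)
let b = [Float](repeating: 0.25, count: 512)
let c = [Float](repeating: 0.1, count: 512)
var tmp = [Float](repeating: 0, count: 512)
var out = [Float](repeating: 0, count: 512)

vDSP_vmul(a, 1, b, 1, &tmp, 1, n)                 // tmp = a * b, element-wise
vDSP_vadd(tmp, 1, c, 1, &out, 1, n)               // out = tmp + c, element-wise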

Although the iOS GPU shaders can be relatively "fast", the paths to load data into and read results back (textures, processed pixels, etc.) from the GPU are slow enough to more than offset any current shader computational efficiencies from using GLSL.
For real-time synthesis, the latency of the GPU pixel readback path is much larger than the best possible audio response latency you can get by feeding RemoteIO from CPU-only synthesis: display frame rates (to which the GPU pipeline is locked) are slower than optimal RemoteIO callback rates, and there's just not enough parallelism to exploit within such short audio buffers.
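To make the mismatch concrete, here is a back-of-the-envelope comparison; the 60 Hz refresh and the 256-frame buffer at 44.1 kHz are assumed typical values, not figures from the answer above.

let displayFrame  = 1.0 / 60.0          // ~16.7 ms per display frame (the cadence the GPU pipeline is locked to)
let audioCallback = 256.0 / 44_100.0    // ~5.8 ms per 256-frame RemoteIO buffer at 44.1 kHz
print(displayFrame / audioCallback)     // ~2.9: the GPU readback cadence is roughly 3x too slow for the audio callback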

Related

I am not able to understand Command Buffer and Frame Buffer in Metal. I don't understand how they work with each other

I was watching the Metal API overview video, and it mentioned that command buffers can have multiple command encoders, and that command encoders define render targets, which are frame-buffer specific. What I do not understand is how command buffers are executed with respect to frame buffers. What is the hierarchy?
To quote from Dan Hulme's response to "What is the difference in overlay and framebuffer?" on the Computer Graphics Stack Exchange (accessed on 3/7/21):
As you've [the asker] understood, the framebuffer [emphasis added] is an array in memory that holds all the pixels to display on the screen
The frame buffer acts as a storage for pixel data that the video hardware uses to display things to the screen.
Render targets refer to the output textures of the entire rendering process and are sometimes said to be attached to the frame buffer. They are where all of the work done in your fragment shaders (and possibly compute kernels) is stored. They are simply textures in system memory, or they can reside entirely in tile memory (on-chip memory in the GPU) if you set their storage mode to MTLStorageMode.memoryless in certain circumstances.
A command buffer is an abstraction that represents an encoded sequence of commands to be run by the GPU. Because GPU hardware can be quite different from vendor to vendor, it would be impractical to go down to driver-level implementations each time you wanted to support a new GPU. So Metal provides a way to manage the work a GPU performs without having to worry about which GPU we are working with. I applaud the wonderful software design Metal implements, and I think we should all study it. Since we are working with the GPU, buffers are used pretty much everywhere, so I suppose the name "command buffer" fits naturally within the lexicon.
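To make the hierarchy concrete, here is a minimal Swift sketch (the device, drawable, and pipeline state are assumed to already exist, and error handling is omitted): a command queue vends command buffers, a command buffer holds one or more command encoders, and each render command encoder is created from a render pass descriptor whose attachments are the render-target textures, i.e. the "framebuffer".

import Metal
import QuartzCore

func encodeOneFrame(device: MTLDevice,
                    drawable: CAMetalDrawable,
                    pipelineState: MTLRenderPipelineState) {
    // In real code the queue is created once and reused; it vends command buffers.
    let queue = device.makeCommandQueue()!
    let commandBuffer = queue.makeCommandBuffer()!     // one unit of work for the GPU

    // The render pass descriptor is where the "framebuffer" enters the picture:
    // its attachments are the render-target textures the fragment shader writes to.
    let pass = MTLRenderPassDescriptor()
    pass.colorAttachments[0].texture = drawable.texture
    pass.colorAttachments[0].loadAction = .clear
    pass.colorAttachments[0].storeAction = .store

    // A command buffer can hold several encoders; each render encoder writes to
    // the attachments of the pass descriptor it was created with.
    let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: pass)!
    encoder.setRenderPipelineState(pipelineState)
    encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 3)
    encoder.endEncoding()

    commandBuffer.present(drawable)                    // show the rendered target
    commandBuffer.commit()                             // submit the work to the GPU
}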
I am also still learning about graphics and all of the jargon that comes with it. Metal adds a whole other set of terms to learn, but I can assure you it will start to feel natural once you do more things in Metal (and more research, of course).

How to use OpenCV functions in Metal on iOS?

I have developed an Xcode project that uses OpenCV functions to process the iPhone camera's live stream.
It takes some time to process one frame and doesn't look like real time.
Is it possible to accelerate the calculation by integrating OpenCV and Metal?
For example, the OpenCV function "grabCut" takes more than 1 second to detect certain foreground objects.
How can I reduce the processing time down to 10 ms or less using Metal?
You can't call OpenCV functions from Metal.
If you want to speed up this algorithm, you could try porting it to Metal but that's only an option if the algorithm -- or major parts of it -- are highly parallel.
Now, it looks like grabCut has a CUDA implementation (which I found by googling for "grabcut cuda"), which means that implementing this in Metal might actually be worth doing. If you can find the CUDA source code, it's usually a relatively straightforward port.
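If you do attempt a port, the host-side Metal skeleton for dispatching an image-processing compute kernel looks roughly like this; it is only a sketch, and the kernel name "segmentStep" is a placeholder for whatever you port, not an existing OpenCV or Metal function.

import Metal

func runKernel(on device: MTLDevice, input: MTLTexture, output: MTLTexture) throws {
    let library = device.makeDefaultLibrary()!
    let function = library.makeFunction(name: "segmentStep")!     // your ported kernel
    let pipeline = try device.makeComputePipelineState(function: function)

    let queue = device.makeCommandQueue()!
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setTexture(input, index: 0)
    encoder.setTexture(output, index: 1)

    // One thread per pixel; the threadgroup size is chosen naively for illustration.
    let threadsPerGroup = MTLSize(width: 16, height: 16, depth: 1)
    let groups = MTLSize(width: (input.width + 15) / 16,
                         height: (input.height + 15) / 16,
                         depth: 1)
    encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}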

Metal Compute versus ARM Neon

I was considering migrating my current NEON (the vector-processing instruction set for ARM) code to Metal, but after running the HelloCompute sample code (which demonstrates how to perform data-parallel computations using the GPU), the GPU seems much slower than the CPU.
The HelloCompute project takes 13 ms on an iPhone 5S to run this very basic kernel on a 512 x 512 RGBA texture:
// kernel signature reconstructed for completeness; argument indices are assumed
kernel void passThrough(texture2d<half, access::read>  inTexture  [[texture(0)]],
                        texture2d<half, access::write> outTexture [[texture(1)]],
                        uint2 gid [[thread_position_in_grid]])
{
    half4 inColor = inTexture.read(gid);
    outTexture.write(inColor, gid);
}
In comparison, my NEON code takes less than 1 ms!
Shouldn't the GPU be at least as fast as the CPU?
GPGPU only makes sense when dealing with a huge amount of computation, because the data transfer and hardware initialization time spoils the fun, on top of horrible APIs such as OpenCL.
NEON, on the other hand, is tightly integrated into the main pipeline and is therefore much more responsive, while packing more than adequate punch.
AI and crypto coin mining have been pretty much the only areas I've seen so far where GPGPU makes sense. For anything lighter, SIMD is the way to go.
And since crypto coin mining is virtually dead and dedicated IP blocks for AI-related computing are around the corner, I'd say GPGPU is almost pointless.
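For the kind of light, per-element work in the question above, a rough sketch of the CPU SIMD approach looks like the following; it uses Swift's simd vector types, which the compiler lowers to NEON on ARM devices, and the image size and gain factor are made up.

import simd

// A trivial per-pixel operation over a 512 x 512 RGBA image, done on the CPU
// with 4-wide SIMD vectors and no GPU round trip.
var pixels = [SIMD4<Float>](repeating: SIMD4<Float>(0, 0, 0, 1), count: 512 * 512)
let gain = SIMD4<Float>(repeating: 1.1)

for i in pixels.indices {
    pixels[i] *= gain          // one 4-wide multiply per pixel
}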

iOS - GPU Accelerated Matrix Transpose, Multiplication and Eigen-Decomposition Dilemma

I'm working on a library that requires the use of vectors and matrices on the iOS platform. I decided to look into OpenGLES because the matrix and vector manipulations I plan on doing (mainly, transposing, matrix multiplication, and eigendecomposition) could definitely benefit from GPU acceleration.
The issue is that I'm not that familiar with OpenGL ES, and honestly it might not be the best option. If I were to use OpenGL ES, would I have to write the matrix transposition, multiplication, and eigendecomposition algorithms myself? Or is there another Apple or third-party framework that can help me with these tasks?
The main dividing issue however is that I want these operations to be GPU accelerated.
I'm going to implement my program using the Accelerate framework and vectorized arithmetic, then test to see whether it's fast enough for my purposes; if it isn't, I'll try the GPU implementation.
As combinatorial states, Accelerate uses SIMD to accelerate many of its functions, but it is CPU-based. For smaller data sets, it's definitely the way to go, but operating on the GPU can significantly outclass it for large enough data sets with easily parallelized operations.
To avoid having to write all of the OpenGL ES interaction code yourself, you could take a look at my GPUImage framework, which encapsulates fragment shader operations within Objective-C. In particular, you can use the GPUImageRawDataInput and GPUImageRawDataOutput classes to feed raw byte data into the GPU, then operate over that using a custom fragment shader.
A matrix transpose operation would be quick to implement, since all of the matrix elements are independent of one another. Matrix multiplication by a constant or a small matrix would also be reasonably easy to do, but I'm not sure how to scale the multiplication of two large matrices properly. Likewise, I don't have a good implementation of eigendecomposition that I could point to off the top of my head.
The downside to dealing with fragment shader processing is the fact that by default OpenGL ES takes in and outputs 4-byte RGBA values at each pixel. You can change that to half floats on newer devices, and I know that others have done this with this framework, but I haven't attempted that myself. You can pack individual float values into RGBA bytes and unpack at the end, as another approach to get this data in and out of the GPU.
The OpenGL ES 3.0 support on the very latest A7 devices provides some other opportunities for working with float data. You can use vertex data instead of texture input, which lets you supply four floats per vertex and extract those floats in the end. Bartosz Ciechanowski has a very detailed writeup of this on his blog. That might be a better general approach for GPGPU operations, but if you can get your operations to run against texture data in a fragment shader, you'll see huge speedups on the latest hardware (the iPhone 5S can be ~100-1000X faster than the iPhone 4 in this regard, where vertex processing and CPU speeds haven't advanced nearly as rapidly).
The Accelerate framework is not accelerated on the GPU, but it is very well optimized and uses NEON SIMD where appropriate.
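For reference, here is a rough sketch of what the Accelerate route can look like for two of the operations mentioned; the matrix sizes are made up, and eigendecomposition (which would go through Accelerate's LAPACK routines, e.g. ssyev) is not shown.

import Accelerate

let m = 4, n = 3, p = 2
let a = (0..<m*n).map { Float($0) }               // m x n, row-major
let b = [Float](repeating: 1, count: n * p)       // n x p

// Transpose: a (m x n) -> at (n x m)
var at = [Float](repeating: 0, count: m * n)
vDSP_mtrans(a, 1, &at, 1, vDSP_Length(n), vDSP_Length(m))

// Multiply: c (m x p) = a (m x n) * b (n x p)
var c = [Float](repeating: 0, count: m * p)
vDSP_mmul(a, 1, b, 1, &c, 1, vDSP_Length(m), vDSP_Length(p), vDSP_Length(n))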

How important is it to send interleaved vertex data on iOS?

I am using Assimp to import some 3d models.
Assimp is great, but it stores everything in a non-interleaved vertex format.
According to the Apple OpenGL ES Programming Guide, interleaved vertex data is preferred on ios: https://developer.apple.com/library/ios/#documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/TechniquesforWorkingwithVertexData/TechniquesforWorkingwithVertexData.html#//apple_ref/doc/uid/TP40008793-CH107-SW8
I am using VertexArrays to consolidate all the buffer related state changes - is it still worth the effort to interleave all the vertex data?
Because interleaved vertex data increases the locality of vertex data, it allows the GPU to cache much more efficiently and generally to be a lot lighter on memory bandwidth at that stage in the pipeline.
How much difference it makes obviously depends on a bunch of other factors — whether memory access is a bottleneck (though it usually is, since texturing is read intensive), how spaced out your vertex data is if not interleaved and the specifics of how that particular GPU does fetching and caching.
Uploading multiple vertex buffers and bundling them into a vertex array would, in theory, allow the driver to perform this optimisation behind your back (either by duplicating memory, or once it becomes reasonably confident that the buffers in the array aren't generally in use elsewhere), but I'm not confident that it will. The other way to look at it is that you should be able to make the optimisation yourself at the very end of your data pipeline, so you needn't plan in advance for it or change your toolset. It's an optimisation, so if it's significant work to implement, the general rule against premature optimisation applies: wait until you have hard performance data.
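As a concrete illustration of interleaving at the very end of the data pipeline, here is a sketch in Swift; the attribute names and types are made up, and the point is only that all attributes of one vertex end up adjacent in memory, so each glVertexAttribPointer call would use MemoryLayout<Vertex>.stride as its stride and the field's offset within the struct as its pointer offset.

import simd

// All attributes of one vertex live next to each other in memory.
struct Vertex {
    var position: SIMD3<Float>
    var normal:   SIMD3<Float>
    var texCoord: SIMD2<Float>
}

// Zip Assimp-style separate attribute arrays into one interleaved buffer
// just before uploading it to a vertex buffer object.
func interleave(positions: [SIMD3<Float>],
                normals: [SIMD3<Float>],
                texCoords: [SIMD2<Float>]) -> [Vertex] {
    var vertices: [Vertex] = []
    vertices.reserveCapacity(positions.count)
    for i in positions.indices {
        vertices.append(Vertex(position: positions[i],
                               normal: normals[i],
                               texCoord: texCoords[i]))
    }
    return vertices
}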
