iOS - GPU Accelerated Matrix Transpose, Multiplication and Eigen-Decomposition Dilemma

I'm working on a library that requires the use of vectors and matrices on the iOS platform. I decided to look into OpenGL ES because the matrix and vector manipulations I plan on doing (mainly transposition, matrix multiplication, and eigendecomposition) could definitely benefit from GPU acceleration.
The issue is that I'm not that familiar with OpenGL ES, and honestly it might not be the best option. If I were to use OpenGL ES, would I have to manually write the algorithms that do the matrix transposition, multiplication and eigendecomposition? Or is there another Apple or third-party framework that can help me with these tasks?
The deciding issue, however, is that I want these operations to be GPU accelerated.
I'm going to implement my program using the Accelerate framework and vectorized arithmetic first, then test to see if it's fast enough for my purposes and, if it isn't, try the GPU implementation.
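For reference, the transpose and multiplication parts map directly onto vDSP calls, and a symmetric eigendecomposition can go through the LAPACK interface that Accelerate also exposes. A minimal CPU-side sketch (these are real Accelerate/LAPACK entry points, but the matrix sizes and values are just placeholders):

#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void)
{
    /* Transpose: A is 2 rows x 3 columns, row-major; At becomes 3 x 2. */
    float A[6]  = { 1, 2, 3,
                    4, 5, 6 };
    float At[6];
    vDSP_mtrans(A, 1, At, 1, 3, 2);          /* last two args are rows/cols of the OUTPUT */

    /* Multiplication: C (2x2) = A (2x3) * B (3x2). */
    float B[6]  = { 1, 0,
                    0, 1,
                    1, 1 };
    float C[4];
    vDSP_mmul(A, 1, B, 1, C, 1, 2, 2, 3);    /* M = 2, N = 2, P = 3 */

    /* Symmetric eigendecomposition via LAPACK's ssyev (part of Accelerate). */
    float S[9]  = { 2, 1, 0,
                    1, 2, 1,
                    0, 1, 2 };
    float eigenvalues[3];
    float work[64];
    __CLPK_integer n = 3, lda = 3, lwork = 64, info = 0;
    ssyev_("V", "U", &n, S, &lda, eigenvalues, work, &lwork, &info);
    /* On success (info == 0), S holds the eigenvectors and
       eigenvalues[] the eigenvalues in ascending order. */

    printf("info = %d, smallest eigenvalue = %f\n", (int)info, eigenvalues[0]);
    return 0;
}

ssyev_ handles the symmetric case; sgeev_ is the corresponding LAPACK routine if a general (non-symmetric) decomposition turns out to be needed.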

As combinatorial states, Accelerate uses SIMD to accelerate many of its functions, but it is CPU-based. For smaller data sets, it's definitely the way to go, but operating on the GPU can significantly outclass it for large enough data sets with easily parallelized operations.
To avoid having to write all of the OpenGL ES interaction code yourself, you could take a look at my GPUImage framework, which encapsulates fragment shader operations within Objective-C. In particular, you can use the GPUImageRawDataInput and GPUImageRawDataOutput classes to feed raw byte data into the GPU, then operate over that using a custom fragment shader.
A matrix transpose operation would be quick to implement, since all of the matrix elements are independent of one another. Matrix multiplication by a constant or small matrix would also be reasonably easy to do, but I'm not sure how to scale the multiplication of two large matrices properly. Likewise, I don't have a good implementation of eigendecomposition that I could point to off of the top of my head.
The downside to dealing with fragment shader processing is the fact that by default OpenGL ES takes in and outputs 4-byte RGBA values at each pixel. You can change that to half floats on newer devices, and I know that others have done this with this framework, but I haven't attempted that myself. You can pack individual float values into RGBA bytes and unpack at the end, as another approach to get this data in and out of the GPU.
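The packing trick is just fixed-point encoding spread across the four channels. A rough sketch of one such scheme in C (the scale factor and byte layout are arbitrary choices, and the same arithmetic has to be mirrored on the shader side when writing results out):

#include <stdint.h>

/* Encode a float in [0, 1] as 32-bit fixed point split across R, G, B, A. */
static void pack_float_rgba(float value, uint8_t rgba[4])
{
    if (value < 0.0f) value = 0.0f;
    if (value > 1.0f) value = 1.0f;
    uint32_t fixed = (uint32_t)(value * 4294967040.0f);   /* 2^32 - 256, exactly representable as a float */
    rgba[0] = (uint8_t)(fixed >> 24);
    rgba[1] = (uint8_t)(fixed >> 16);
    rgba[2] = (uint8_t)(fixed >> 8);
    rgba[3] = (uint8_t)(fixed);
}

static float unpack_float_rgba(const uint8_t rgba[4])
{
    uint32_t fixed = ((uint32_t)rgba[0] << 24) | ((uint32_t)rgba[1] << 16) |
                     ((uint32_t)rgba[2] << 8)  |  (uint32_t)rgba[3];
    return (float)fixed / 4294967040.0f;
}

Precision is limited by the 32-bit fixed-point representation (and by the float's own 24-bit mantissa), so this works best for data already normalized into a known range.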
The OpenGL ES 3.0 support on the very latest A7 devices provides some other opportunities for working with float data. You can use vertex data instead of texture input, which lets you supply four floats per vertex and extract those floats in the end. Bartosz Ciechanowski has a very detailed writeup of this on his blog. That might be a better general approach for GPGPU operations, but if you can get your operations to run against texture data in a fragment shader, you'll see huge speedups on the latest hardware (the iPhone 5S can be ~100-1000X faster than the iPhone 4 in this regard, where vertex processing and CPU speeds haven't advanced nearly as rapidly).

The Accelerate framework is not accelerated on the GPU, but it is very well optimized and uses SIMD via NEON where appropriate.

Related

Metal Compute versus ARM Neon

I was considering migrating my current NEON (ARM's vector-processing instruction set) code to Metal, but after running the HelloCompute sample code (which demonstrates how to perform data-parallel computations using the GPU), the GPU seems much slower than the CPU.
The HelloCompute project takes 13 ms on an iPhone 5S to perform this very basic kernel on a 512 x 512 RGBA texture:
/* Kernel signature is assumed here (name and texture indices are placeholders). */
kernel void passthrough(texture2d<half, access::read>  inTexture  [[texture(0)]],
                        texture2d<half, access::write> outTexture [[texture(1)]],
                        uint2 gid [[thread_position_in_grid]])
{
    half4 inColor = inTexture.read(gid);
    outTexture.write(inColor, gid);
}
In comparison, my NEON code takes less than 1 ms!
Shouldn't the GPU be at least as fast as the CPU?
GPGPU only makes sense when dealing with a huge amount of computation, because the data-transfer and hardware-initialization time spoils the fun, in addition to horrible APIs such as OpenCL.
NEON on the other hand is tightly integrated into the main pipeline and thus, is way more responsive while packing more than adequate punch.
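To illustrate that integration: NEON code is just inline C with intrinsics, with no device setup, no command encoding and no copies across a bus, so even tiny workloads pay essentially no fixed cost. A trivial sketch (the function name is made up, and it assumes the length is a multiple of 4):

#include <arm_neon.h>

/* Multiply every element of src by gain, four lanes at a time. */
void scale_buffer_neon(const float *src, float *dst, int count, float gain)
{
    float32x4_t vgain = vdupq_n_f32(gain);
    for (int i = 0; i < count; i += 4) {
        float32x4_t v = vld1q_f32(src + i);        /* load 4 floats */
        vst1q_f32(dst + i, vmulq_f32(v, vgain));   /* multiply and store 4 floats */
    }
}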
AI and crypto coin mining have been pretty much the only areas I've seen so far where GPGPU makes sense. For anything lighter, SIMD is the way to go.
And since crypto coin mining is virtually dead, and IPs dedicated to AI related computing are around the corner, I'd say GPGPU is almost pointless.

iOS audio acceleration

Is anybody using OpenGLES2.0 shaders (GLSL) successfully for audio synthesis?
I already use vDSP to accelerate audio in my iOS app; it provides a set of simple vectorized operations callable from C code. The main problem with vDSP is that you have to write what amounts to vector-oriented assembly language, because the main per-sample loop gets pushed down into each primitive operation (vector add, vector multiply). Compiling expressions into these sequences is the essence of what shader languages automate for you. OpenCL is not public on iOS. It is also interesting that GLSL is compiled at runtime, which means that if most of the sound engine could be in GLSL, then users could make non-trivial patch contributions.
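To make the "vector-oriented assembly" point concrete: an expression like out = a * b + c stops being one loop and becomes a chain of whole-buffer primitives. A small sketch with real vDSP calls (buffer names and block length are placeholders):

#include <Accelerate/Accelerate.h>

#define FRAMES 512   /* placeholder block size */

/* Hypothetical per-block callback: out[i] = a[i] * b[i] + c[i]. */
void process_block(const float *a, const float *b, const float *c, float *out)
{
    float tmp[FRAMES];
    vDSP_vmul(a, 1, b, 1, tmp, 1, FRAMES);    /* tmp = a * b  (one full pass over the buffer) */
    vDSP_vadd(tmp, 1, c, 1, out, 1, FRAMES);  /* out = tmp + c (another full pass) */
    /* vDSP_vma would fuse these two, but every new expression still means
       hunting for the right primitive rather than just writing the loop. */
}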
Although the iOS GPU shaders can be relatively "fast", the paths to load and recover data (textures, processed pixels, etc.) from the GPU are slow enough to more than offset any current shader computational efficiencies from using GLSL.
For real-time synthesis, the latencies of the GPU pixel unload path are much larger than the best possible audio response latency using just CPU synthesis to feed RemoteIO. e.g. display frame rates (to which the GPU pipeline is locked) are slower than optimal RemoteIO callback rates. There's just not enough parallelism to exploit within these short audio buffers.

Real-Time CUDA Image Processing Advice

I am trying to implement an algorithm for a system in which the camera captures 1000 fps, and I need to read the value of each pixel in all the images and do different calculations on the evolution of pixel[i][j] across N images, for all the pixels in the images. I have the data as an (unsigned char *ptr); I want to transfer it to the GPU and start implementing the algorithm, but I am not sure what would be the best option for real-time processing.
My system:
CPU: Intel Xeon X5660, 2.8 GHz (2 processors)
GPU: NVIDIA Quadro 5000
I have the following questions:
Do I need to add any image processing library in addition to CUDA? If yes, what do you suggest?
Can I create a matrix for pixel[i,j] containing its values across images [1:n], for each pixel in the image? For example, for 1000 images of size 200x200 I would end up with 40,000 arrays, each containing 1000 values for one pixel. Does CUDA give me something like OpenCV's matrices or vectors?
1 - Do I need to add any image processing library in addition to CUDA?
Apples and oranges. Each has a different purpose. An image processing library like OpenCV offers a lot more than simple accelerated matrix computations. Maybe you don't need OpenCV for the processing in this project, since you seem to want to use CUDA directly. But you could still keep OpenCV around to make it easier to load and write different image formats to and from disk.
2 - Does CUDA give me something like OpenCV's matrices?
Absolutely. Some time ago I wrote a simple (educational) application that used OpenCV to load an image from disk and CUDA to convert it to its grayscale version. The project is named cuda-grayscale. I haven't tested it with CUDA 4.x, but the code shows the basics of combining OpenCV and CUDA.
It sounds like you will have 40000 independent calculations, where each calculation works only within one (temporal) pixel. If so, this should be a good task for the GPU. Your 352 core Fermi GPU should be able to beat your 12 hyperthreaded Xeon cores.
Is the algorithm you plan to run a common operation? It sounds like it might not be, in which case you will likely have to write your own kernels.
Yes, you can have arrays of elements of any type in CUDA.
This "streaming oriented" approach is good for a GPU implementation in that it maximizes the amount of computation relative to transfers over the PCIe bus. It may also introduce a difficulty: if you want to process the 1000 values for a given pixel in a specific order (oldest to newest, for instance), you will probably want to avoid continuously shifting all the frames in memory to make room for the newest frame. It will slightly complicate your addressing of the pixel values, but the best approach, to avoid shifting frames, may be to overwrite the oldest frame with the newest frame each time a new one arrives. That way you end up with a circular "stack of frames" that is fairly well ordered but has a discontinuity between old and new frames somewhere within it.
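A sketch of the addressing that implies, in plain C (the frame count and dimensions are placeholders; the same indexing carries over to a CUDA kernel operating on a device-side buffer):

#include <string.h>

#define N_FRAMES 1000   /* placeholder values */
#define WIDTH    200
#define HEIGHT   200

/* Circular buffer of grayscale frames: overwrite the oldest slot instead of shifting. */
static unsigned char frames[N_FRAMES][HEIGHT][WIDTH];
static int newest = N_FRAMES - 1;

void push_frame(const unsigned char *incoming)
{
    newest = (newest + 1) % N_FRAMES;                 /* the oldest slot becomes the newest */
    memcpy(frames[newest], incoming, WIDTH * HEIGHT);
}

/* Value of pixel (x, y) as it was 'age' frames ago; age 0 is the newest frame. */
unsigned char pixel_at_age(int x, int y, int age)
{
    int slot = (newest - age + N_FRAMES) % N_FRAMES;
    return frames[slot][y][x];
}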
Do I need to add any image processing library in addition to CUDA?
If yes, what do you suggest?
Disclosure: my company develops & markets CUVILib.
There are very few options when it comes to GPU-accelerated imaging libraries that also offer general-purpose functionality. CUVILib is one of those options, and it offers the following, which is well suited to your specific needs:
A CuviImage object which holds your image data as a 2D matrix.
You can write your own GPU function and use CuviImage as a 2D GPU matrix.
CUVILib already provides a rich set of imaging functionality (color operations, image statistics, feature detection, motion estimation, FFT, image transforms, etc.), so chances are you will find your desired functionality.
As for the question of whether GPUs are suited for your application: Yes! Imaging is one of those domains which are ideal for parallel computation.
Links:
CUVILib: http://www.cuvilib.com
TunaCode: http://www.tunacode.com

How important is it to send interleaved vertex data on iOS?

I am using Assimp to import some 3d models.
Assimp is great, but it stores everything in a non-interleaved vertex format.
According to the Apple OpenGL ES Programming Guide, interleaved vertex data is preferred on ios: https://developer.apple.com/library/ios/#documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/TechniquesforWorkingwithVertexData/TechniquesforWorkingwithVertexData.html#//apple_ref/doc/uid/TP40008793-CH107-SW8
I am using vertex array objects to consolidate all the buffer-related state changes - is it still worth the effort to interleave all the vertex data?
Because interleaved vertex data increases the locality of vertex data, it allows the GPU to cache much more efficiently and generally to be a lot lighter on memory bandwidth at that stage in the pipeline.
How much difference it makes obviously depends on a bunch of other factors — whether memory access is a bottleneck (though it usually is, since texturing is read intensive), how spaced out your vertex data is if not interleaved and the specifics of how that particular GPU does fetching and caching.
Uploading multiple vertex buffers and bundling them into a vertex array would in theory allow the driver to perform this optimisation behind your back (either by duplicating memory, or once it becomes reasonably confident that the buffers in the array aren't generally in use elsewhere), but I'm not confident that it will. The other way to look at it is that you should be able to make the optimisation yourself at the very end of your data pipeline, so you needn't plan in advance for it or change your toolset. It's an optimisation, so if it's significant work to implement, the general rule against premature optimisation applies: wait until you have hard performance data.
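For what it's worth, re-interleaving at the end of the pipeline mostly comes down to defining one per-vertex struct and using its size as the stride. A hedged ES 2.0 sketch (attribute locations are placeholders, and it assumes the interleaved VBO is already bound to GL_ARRAY_BUFFER):

#include <stddef.h>
#include <OpenGLES/ES2/gl.h>

/* One interleaved vertex: position, normal and texture coordinate sit next to
   each other in memory, so one fetch serves all three attributes. */
typedef struct {
    GLfloat position[3];
    GLfloat normal[3];
    GLfloat texCoord[2];
} Vertex;

/* Attribute locations are whatever glGetAttribLocation returned for your shader. */
void configureInterleavedAttributes(GLuint positionLoc, GLuint normalLoc, GLuint texCoordLoc)
{
    GLsizei stride = sizeof(Vertex);
    glVertexAttribPointer(positionLoc, 3, GL_FLOAT, GL_FALSE, stride,
                          (const GLvoid *)offsetof(Vertex, position));
    glVertexAttribPointer(normalLoc,   3, GL_FLOAT, GL_FALSE, stride,
                          (const GLvoid *)offsetof(Vertex, normal));
    glVertexAttribPointer(texCoordLoc, 2, GL_FLOAT, GL_FALSE, stride,
                          (const GLvoid *)offsetof(Vertex, texCoord));
    glEnableVertexAttribArray(positionLoc);
    glEnableVertexAttribArray(normalLoc);
    glEnableVertexAttribArray(texCoordLoc);
}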

Accelerate's vImage vs. vDSP

I'm trying to use the Accelerate framework on iOS to work around the fact that Core Image on iOS doesn't support custom filters/kernels. I'm developing an edge detection filter using two convolutions with a Sobel kernel, but I'm starting with a simple Gaussian blur to get the hang of it. I know vImage is geared towards image manipulation as matrices and vDSP focuses on processing digital signals using Fourier transforms. But although I started with the vImage functions (vImageConvolve_XXXX, etc.), I see a lot of people discussing the use of vDSP's functions (vDSP_conv, vDSP_imgfir, etc.) for things like convolutions. So that leads to the question at hand: when should I use one over the other? What are the differences between them with regard to convolution operations? I've looked everywhere but couldn't find a clear answer. Can someone shed some light on this, or point me in the right direction?
Thanks!
If vImage provides the operation you need, it is usually simplest to use that. vImage does cache blocking and threading for you, vDSP does not. vImage provides operations on interleaved and integer formats, which are often useful for image processing.
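For the Gaussian blur you're starting with, the vImage route is a single call once the image is wrapped in vImage_Buffer structs. A sketch for 8-bit planar data (the 3x3 kernel and divisor are just an example; the caller allocates both buffers):

#include <Accelerate/Accelerate.h>

/* Gaussian blur of an 8-bit single-channel image; src and dst are caller-allocated. */
vImage_Error blur_planar8(const vImage_Buffer *src, const vImage_Buffer *dst)
{
    static const int16_t kernel[9] = { 1, 2, 1,
                                       2, 4, 2,
                                       1, 2, 1 };
    return vImageConvolve_Planar8(src, dst,
                                  NULL,              /* let vImage manage the temp buffer */
                                  0, 0,              /* region-of-interest offset */
                                  kernel, 3, 3,      /* kernel and its height/width */
                                  16,                /* divisor = sum of the kernel */
                                  0,                 /* background color (unused with edge extend) */
                                  kvImageEdgeExtend);
}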
Last time I experimented, neither of these frameworks took advantage of kernel separability, which affords a huge performance boost when convolving in the spatial domain -- a far larger performance boost than vectorized instructions will ever buy you. The Sobel kernel in particular is separable, so if you're using vDSP or vImage (instead of say OpenCV), be sure to separate the kernel yourself.
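Concretely, the 3x3 Sobel kernel factors into a 3x1 column times a 1x3 row, so one 2D pass can be replaced by two cheaper 1D passes. A sketch using vDSP_imgfir (dimensions are placeholders; the intermediate buffer is caller-allocated, and the sign convention may differ from a correlation-based Sobel):

#include <Accelerate/Accelerate.h>

/* Horizontal Sobel (Gx) as two 1D passes; in, tmp and out are rows*cols floats. */
void sobel_gx_separated(const float *in, float *tmp, float *out,
                        vDSP_Length rows, vDSP_Length cols)
{
    /* Gx = [1 2 1]^T * [-1 0 1] */
    static const float column_kernel[3] = {  1.0f, 2.0f, 1.0f };   /* 3 rows x 1 col  */
    static const float row_kernel[3]    = { -1.0f, 0.0f, 1.0f };   /* 1 row  x 3 cols */

    vDSP_imgfir(in,  rows, cols, column_kernel, tmp, 3, 1);   /* vertical smoothing pass   */
    vDSP_imgfir(tmp, rows, cols, row_kernel,    out, 1, 3);   /* horizontal derivative pass */
}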
