I am using Assimp to import some 3D models.
Assimp is great, but it stores everything in a non-interleaved vertex format.
According to the Apple OpenGL ES Programming Guide, interleaved vertex data is preferred on iOS: https://developer.apple.com/library/ios/#documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/TechniquesforWorkingwithVertexData/TechniquesforWorkingwithVertexData.html#//apple_ref/doc/uid/TP40008793-CH107-SW8
I am using vertex array objects to consolidate all the buffer-related state changes. Is it still worth the effort to interleave all the vertex data?
Because interleaving improves the locality of vertex data, it allows the GPU to cache much more efficiently and generally to be a lot lighter on memory bandwidth at that stage in the pipeline.
How much difference it makes obviously depends on a bunch of other factors: whether memory access is a bottleneck (though it usually is, since texturing is read intensive), how spread out your vertex data is when not interleaved, and the specifics of how that particular GPU does fetching and caching.
Uploading multiple vertex buffers and bundling them into a vertex array would in theory allow the driver to perform this optimisation behind your back (either by duplicating memory, or once it becomes reasonably confident that the buffers in the array aren't generally in use elsewhere), but I'm not confident that it will. The other way to look at it is that you should be able to make the optimisation yourself at the very end of your data pipeline, so you needn't plan for it in advance or change your toolset. It's an optimisation, so if it's significant work to implement, the general rule against premature optimisation applies: wait until you have hard performance data.
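To make the interleaving itself concrete: at the API level it just means one struct of attributes per vertex and a non-zero stride. Here is a minimal sketch in C against OpenGL ES 2.0; the attribute indices and struct layout are purely illustrative.

```c
#include <OpenGLES/ES2/gl.h>
#include <stddef.h>

/* One interleaved vertex: position, normal and texture coordinate packed
   together, so fetching a vertex touches one contiguous region of memory. */
typedef struct {
    GLfloat position[3];
    GLfloat normal[3];
    GLfloat texCoord[2];
} Vertex;

/* Illustrative attribute locations; in real code these come from
   glGetAttribLocation or explicit binding during shader setup. */
enum { ATTRIB_POSITION = 0, ATTRIB_NORMAL = 1, ATTRIB_TEXCOORD = 2 };

static void setupInterleavedAttributes(GLuint vertexBuffer)
{
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);

    /* The stride is the size of the whole vertex; each attribute just
       starts at a different offset within the same buffer. */
    glVertexAttribPointer(ATTRIB_POSITION, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const GLvoid *)offsetof(Vertex, position));
    glVertexAttribPointer(ATTRIB_NORMAL, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const GLvoid *)offsetof(Vertex, normal));
    glVertexAttribPointer(ATTRIB_TEXCOORD, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                          (const GLvoid *)offsetof(Vertex, texCoord));

    glEnableVertexAttribArray(ATTRIB_POSITION);
    glEnableVertexAttribArray(ATTRIB_NORMAL);
    glEnableVertexAttribArray(ATTRIB_TEXCOORD);
}
```

Copying Assimp's separate mVertices / mNormals / mTextureCoords arrays into a layout like this at load time is exactly the "very end of the data pipeline" step described above.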
I was watching the Metal API overview video, and it was mentioned that command buffers can have multiple command encoders, and that command encoders define render targets, which are frame buffer specific. The thing that I do not understand is how command buffers are executed with respect to frame buffers. What is the hierarchy?
To quote from @Dan Hulme's answer to "What is the difference in overlay and framebuffer?" on the Computer Graphics Beta Stack Exchange community (accessed on 3/7/21):
"As you've [the asker] understood, the framebuffer [emphasis added] is an array in memory that holds all the pixels to display on the screen."
The frame buffer acts as storage for the pixel data that the video hardware uses to display things on the screen.
Render targets refer to the output textures of the entire rendering process and are sometimes said to be attached to the frame buffer. They are where all of the work done in your fragment shaders (and possibly compute kernels) is stored. They are simply textures in system memory, or they can reside entirely in tile memory (on-chip memory on the GPU) if you set their storage mode to MTLStorageMode.memoryless in certain circumstances.
A command buffer is an abstraction that represents an encoding of a sequence of commands to be run by the GPU. Because GPU hardware can be quite different, it would be impractical to have to go down to driver-level implementations each time you wanted to support a new GPU and do work with it. So Metal provides us with a way to manage the process a GPU follows without having to worry about which GPU we are working with. I applaud the wonderful software design Metal implements, and I think that we should all study its power. Since we are working with the GPU, buffers are used pretty much everywhere, so I suppose the name "command buffer" fits naturally within the lexicon.
I am also still learning about graphics and all of the jargon that comes with it. Metal adds a whole other set of terms to learn, but I can assure you that it will start to feel natural once you do more things in Metal (and more research, of course).
I'm working on a library that requires the use of vectors and matrices on the iOS platform. I decided to look into OpenGL ES because the matrix and vector manipulations I plan on doing (mainly transposition, matrix multiplication, and eigendecomposition) could definitely benefit from GPU acceleration.
The issue is that I'm not that familiar with OpenGL ES, and honestly it might not be the best option. If I were to utilize OpenGL ES, would I have to manually write the algorithms that do the matrix transposition, multiplication and eigendecomposition? Or is there another Apple or third-party framework that can help me with these tasks?
The deciding issue, however, is that I want these operations to be GPU accelerated.
I'm going to implement my program using the Accelerate framework and vectorized arithmetic, then test to see whether it's fast enough for my purposes. If it isn't, I'll try a GPU implementation.
As combinatorial states, Accelerate uses SIMD to accelerate many of its functions, but it is CPU-based. For smaller data sets, it's definitely the way to go, but operating on the GPU can significantly outclass it for large enough data sets with easily parallelized operations.
To avoid having to write all of the OpenGL ES interaction code yourself, you could take a look at my GPUImage framework, which encapsulates fragment shader operations within Objective-C. In particular, you can use the GPUImageRawDataInput and GPUImageRawDataOutput classes to feed raw byte data into the GPU, then operate over that using a custom fragment shader.
A matrix transpose operation would be quick to implement, since all of the matrix elements are independent of one another. Matrix multiplication by a constant or small matrix would also be reasonably easy to do, but I'm not sure how to scale the multiplication of two large matrices properly. Likewise, I don't have a good implementation of eigendecomposition that I could point to off the top of my head.
The downside to dealing with fragment shader processing is the fact that by default OpenGL ES takes in and outputs 4-byte RGBA values at each pixel. You can change that to half floats on newer devices, and I know that others have done this with this framework, but I haven't attempted that myself. You can pack individual float values into RGBA bytes and unpack at the end, as another approach to get this data in and out of the GPU.
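As a rough illustration of that packing idea (this is a sketch of the general technique rather than code from the framework; the matching encode/decode on the GPU side would live in your GLSL shaders), a value in [0, 1) can be spread across the four 8-bit channels base-256 style so that it survives the RGBA8 round trip:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Spread a value in [0, 1) across four 8-bit channels, base-256 style,
   so it can travel through an RGBA8 texture without losing much precision. */
static void packFloatToRGBA(float value, uint8_t rgba[4])
{
    float v = value;
    for (int i = 0; i < 4; i++) {
        v *= 256.0f;
        float digit = floorf(v);
        if (digit > 255.0f) digit = 255.0f;  /* guard against value == 1.0 */
        rgba[i] = (uint8_t)digit;
        v -= digit;
    }
}

static float unpackRGBAToFloat(const uint8_t rgba[4])
{
    return rgba[0] / 256.0f
         + rgba[1] / (256.0f * 256.0f)
         + rgba[2] / (256.0f * 256.0f * 256.0f)
         + rgba[3] / (256.0f * 256.0f * 256.0f * 256.0f);
}

int main(void)
{
    uint8_t rgba[4];
    packFloatToRGBA(0.7315f, rgba);
    printf("round trip: %f\n", unpackRGBAToFloat(rgba)); /* ~0.7315 */
    return 0;
}
```

Values outside [0, 1) need to be rescaled into that range first, which is the extra bookkeeping this approach forces on you.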
The OpenGL ES 3.0 support on the very latest A7 devices provides some other opportunities for working with float data. You can use vertex data instead of texture input, which lets you supply four floats per vertex and extract those floats in the end. Bartosz Ciechanowski has a very detailed writeup of this on his blog. That might be a better general approach for GPGPU operations, but if you can get your operations to run against texture data in a fragment shader, you'll see huge speedups on the latest hardware (the iPhone 5S can be ~100-1000X faster than the iPhone 4 in this regard, where vertex processing and CPU speeds haven't advanced nearly as rapidly).
The Accelerate framework is not accelerated on the GPU, but it is very well optimized and uses SIMD on NEON where appropriate.
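For the specific operations in the question, the vDSP part of Accelerate already covers multiplication and transposition directly; here is a minimal single-precision sketch (the sizes are just illustrative):

```c
#include <Accelerate/Accelerate.h>
#include <stdio.h>

int main(void)
{
    /* A is 2x3 and B is 3x2, so C = A * B is 2x2. */
    float A[6]  = { 1, 2, 3,
                    4, 5, 6 };
    float B[6]  = { 7,  8,
                    9, 10,
                   11, 12 };
    float C[4];
    float At[6];  /* transpose of A, 3x2 */

    /* vDSP_mmul(A, strideA, B, strideB, C, strideC, M, N, P):
       C (MxN) = A (MxP) * B (PxN). */
    vDSP_mmul(A, 1, B, 1, C, 1, 2, 2, 3);

    /* vDSP_mtrans(A, strideA, C, strideC, M, N):
       writes the MxN transpose of an NxM input. */
    vDSP_mtrans(A, 1, At, 1, 3, 2);

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]); /* 58 64 139 154 */
    return 0;
}
```

For eigendecomposition there is no vDSP routine, but the LAPACK interfaces that ship inside Accelerate (for example ssyev for symmetric problems) cover it on the CPU.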
Is anybody using OpenGL ES 2.0 shaders (GLSL) successfully for audio synthesis?
I already use vDSP to accelerate audio in my iOS app; it provides a simple set of vector operations callable from C code. The main problem with vDSP is that you have to write what amounts to vector-oriented assembly language, because the main per-sample loop gets pushed down into each primitive operation (vector add, vector multiply). Compiling expressions into these sequences is the essence of what shader languages automate for you. OpenCL is not public on iOS. It is also interesting that GLSL is compiled at runtime, which means that if most of the sound engine could be in GLSL, then users could make non-trivial patch contributions.
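To make the "vector-oriented assembly" point concrete, even a tiny per-voice mix step ends up as a sequence of whole-buffer primitives (the names and the particular expression are just an illustration):

```c
#include <Accelerate/Accelerate.h>

/* The per-sample expression  out[i] = (osc[i] * env[i]) + noise[i]
   has to be spelled out as whole-buffer primitives, with the per-sample
   loop hidden inside each vDSP call. */
static void mixVoice(const float *osc, const float *env, const float *noise,
                     float *scratch, float *out, vDSP_Length sampleCount)
{
    vDSP_vmul(osc, 1, env, 1, scratch, 1, sampleCount);   /* scratch = osc * env   */
    vDSP_vadd(scratch, 1, noise, 1, out, 1, sampleCount); /* out = scratch + noise */
}
```

A GLSL expression of the same thing would just be written per-sample and compiled into the vectorized form for you, which is what makes the idea tempting.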
Although iOS GPU shaders can be relatively "fast", the paths for loading data into and recovering data (textures, processed pixels, etc.) from the GPU are slow enough to more than offset any current computational efficiency gained from using GLSL shaders.
For real-time synthesis, the latencies of the GPU pixel unload path are much larger than the best possible audio response latency you can get from CPU synthesis feeding RemoteIO. For example, display frame rates (to which the GPU pipeline is locked) are slower than optimal RemoteIO callback rates. There's also just not enough parallelism to exploit within these short audio buffers.
When I bind an array to a texture in CUDA:
1. is that array copied to a texture space? or,
2. is that array referenced as a texture?
If the answer is 1, then I can bind a texture and safely fetch data from the texture memory space while I write the result to the array, which is allocated in global memory.
If the answer is 2, then is texture memory just a global memory space where the data is cached and spatially fetched?
I'd like to understand this, as I've seen some questions related to this topic and the answer still isn't clear to me.
Thanks in advance.
The answer is the second option, but from there things get a little more complex. There is no such thing as "texture memory", just global memory which is accessed via dedicated hardware that includes an on-GPU read cache (6-8 KB per multiprocessor depending on the card; see Table F-2 in Appendix F of the CUDA Programming Guide) and a number of hardware-accelerated filtering/interpolation operations. There are two ways the texture hardware can be used in CUDA:
1. Bind linear memory to a texture and read from it in a kernel using the 1D fetch API. In this case the texture hardware is really just acting as a read-through cache, and (IIRC) there are no filtering actions available.
2. Create a CUDA array, copy the contents of linear memory to that array, and bind it to a texture. The resulting CUDA array contains a spatially ordered version of the linear source, stored in global memory in some sort of (undocumented) space-filling curve. The texture hardware provides cached access to that array, including simultaneous memory reads with hardware-accelerated filtering.
You might find the overview of the GT200 architecture written by David Kanter worth reading to get a better idea of how the actual architecture implements the memory hierarchy the APIs expose.
I have to apply a convolution filter on each row of many images. The classic case is 360 images of 1024x1024 pixels. In my use case it is 720 images of 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I have implemented the naive convolution, and it takes 2m 30s. I then switched to FFT-based convolution using FFTW. I used complex-to-complex transforms, filtering two rows in each transform. I'm now at around 20s.
The thing is that articles advertise around 10s or even less for the classic case.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical Recipes suggests avoiding the reordering (bit reversal) done in the DFT and adapting the frequency-domain filter function accordingly, but there is no code example of how this could be done.
Maybe I lose time copying data. With a real-to-real transform I wouldn't have to copy the data into complex values, but I have to pad with zeros anyway.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete, non-periodic function (512 to 2048 values). Apparently the discrete Fourier transform is the way to go. However, I'd like to avoid copying the data, converting it to complex values, and the butterfly reordering.
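For reference, the direct (naive) row convolution mentioned above looks roughly like this (boundary handling is simplified and the kernel is assumed to be much shorter than the row; my real code differs in the details):

```c
#include <stddef.h>

/* Direct 1D convolution of one image row with a short filter kernel.
   Only the "valid" part is produced: rowLength - kernelLength + 1 samples. */
static void convolveRowDirect(const float *row, size_t rowLength,
                              const float *kernel, size_t kernelLength,
                              float *out)
{
    for (size_t i = 0; i + kernelLength <= rowLength; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < kernelLength; k++)
            acc += row[i + k] * kernel[kernelLength - 1 - k];
        out[i] = acc;
    }
}
```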
FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key for you to get maximum performance (outside of hardware; the GPU is a good suggestion) will be to pad your signals to a power of two. When using FFTW, use the FFTW_PATIENT setting when creating your plan to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also be sure to use the real (real-to-complex) version of the forward 1D FFT and not the complex version, and use single (float) precision if you can.
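To show what that looks like in practice, here is a minimal single-precision sketch along those lines: pad each row to a power of two, plan real-to-complex and complex-to-real transforms once with FFTW_PATIENT, and reuse them for every row. The structure and names are just an illustration, and error handling and cleanup are left out.

```c
#include <fftw3.h>
#include <stdlib.h>
#include <string.h>

/* Convolves rows with a fixed filter by pointwise multiplication in the
   frequency domain.  n is the padded length: a power of two, at least
   rowLength + kernelLength - 1 to avoid circular wrap-around. */
typedef struct {
    int n;
    float *time;            /* real workspace, length n           */
    fftwf_complex *freq;    /* spectrum workspace, length n/2 + 1 */
    fftwf_complex *filter;  /* precomputed filter spectrum        */
    fftwf_plan fwd, inv;
} RowConvolver;

static RowConvolver *rowConvolverCreate(int n, const float *kernel, int kernelLength)
{
    RowConvolver *rc = malloc(sizeof *rc);
    rc->n      = n;
    rc->time   = fftwf_malloc(sizeof(float) * n);
    rc->freq   = fftwf_malloc(sizeof(fftwf_complex) * (n / 2 + 1));
    rc->filter = fftwf_malloc(sizeof(fftwf_complex) * (n / 2 + 1));

    /* Expensive planning is done once, then amortized over all rows. */
    rc->fwd = fftwf_plan_dft_r2c_1d(n, rc->time, rc->freq, FFTW_PATIENT);
    rc->inv = fftwf_plan_dft_c2r_1d(n, rc->freq, rc->time, FFTW_PATIENT);

    /* Precompute the filter's spectrum (planning may scribble on the arrays,
       so fill them only after the plans exist). */
    memset(rc->time, 0, sizeof(float) * n);
    memcpy(rc->time, kernel, sizeof(float) * kernelLength);
    fftwf_execute(rc->fwd);
    memcpy(rc->filter, rc->freq, sizeof(fftwf_complex) * (n / 2 + 1));
    return rc;
}

static void rowConvolverRun(RowConvolver *rc, const float *row, int rowLength, float *out)
{
    memset(rc->time, 0, sizeof(float) * rc->n);      /* zero padding      */
    memcpy(rc->time, row, sizeof(float) * rowLength);
    fftwf_execute(rc->fwd);                          /* row -> spectrum   */

    for (int i = 0; i <= rc->n / 2; i++) {           /* complex multiply  */
        float re = rc->freq[i][0] * rc->filter[i][0] - rc->freq[i][1] * rc->filter[i][1];
        float im = rc->freq[i][0] * rc->filter[i][1] + rc->freq[i][1] * rc->filter[i][0];
        rc->freq[i][0] = re;
        rc->freq[i][1] = im;
    }

    fftwf_execute(rc->inv);                          /* spectrum -> row   */
    for (int i = 0; i < rc->n; i++)
        out[i] = rc->time[i] / rc->n;                /* FFTW is unnormalized */
}
```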
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. They have hand-tuned FFTs for Intel processors, optimized for images with various bit depths.
Paul
CenterSpace Software
You may want to add image processing as a tag.
But this article may be of interest, especially with its assumption that the image size is a power of 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.
http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:
http://www.springerlink.com/content/kd6qm361pq8mmlx2/
This answer is to collect progress report feedback on this issue.
Edit 11 oct.:
The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy with system time (up to 42%) for about 10s. If I wait until the CPU is back to 0% before restarting my program, I then get a 15.35s execution time, which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster than the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block where I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still relevant, because there is room to optimize the GPU processing, for instance by pipelining the transfer of the data to be processed.