Is there any container like std::vector in the Metal Shading Language?

MSL supports templates, but it does not include the C++ standard library. It seems there is no analogue of std::vector and no generic containers at all. Or am I missing something?
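Since MSL has no dynamic allocation inside shaders, the usual workaround is a fixed-capacity container built on the template support MSL does have. Below is a minimal sketch of such a container (the name `static_vector` and its API are my own, not anything Apple ships). It compiles as plain C++ here; in a `.metal` file you would additionally annotate pointers and references with address-space qualifiers such as `thread`.

```cpp
#include <cstddef>

// Sketch of a fixed-capacity, vector-like container usable where
// std::vector is unavailable (e.g. in an MSL shader). Capacity is a
// compile-time constant, so no heap allocation is ever needed.
template <typename T, size_t Capacity>
struct static_vector {
    T data[Capacity];
    size_t count = 0;

    // Returns false when full instead of growing (no realloc on the GPU).
    bool push_back(T value) {
        if (count == Capacity) return false;
        data[count++] = value;
        return true;
    }
    T&       operator[](size_t i)       { return data[i]; }
    const T& operator[](size_t i) const { return data[i]; }
    size_t size() const { return count; }
};
```

Usage is the same as a `std::vector` that refuses to grow: check the return value of `push_back` instead of relying on reallocation.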

Related

Sorting in a Metal compute shader

Does Metal have functionality similar to CUDA's thrust::sort, or is one expected to write one's own sort function?
Apple's Accelerate framework has useful vector operations in the vDSP package. Sorting, specifically, can be found here: https://developer.apple.com/documentation/accelerate/vdsp/reversing_and_sorting_functions
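For sorting on the GPU itself, a common choice is a bitonic sorting network, because every compare-exchange within a stage is independent and maps directly onto one thread of a compute kernel. The sketch below is a scalar C++ reference of that network for a power-of-two array, not Metal code: in an MSL port, the loop over `i` becomes the thread grid and `k`/`j` become per-dispatch constants.

```cpp
#include <cstddef>
#include <utility>

// Bitonic sorting network (ascending) for n a power of two. Every iteration
// of the innermost loop is independent of the others, which is exactly the
// property that lets each compare-exchange run as one GPU thread.
void bitonic_sort(float* a, size_t n) {
    for (size_t k = 2; k <= n; k <<= 1) {          // stage size
        for (size_t j = k >> 1; j > 0; j >>= 1) {  // sub-stage stride
            for (size_t i = 0; i < n; ++i) {       // one GPU thread per i
                size_t partner = i ^ j;
                if (partner > i) {
                    bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending)
                        std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```

The two outer loops run on the host as successive kernel dispatches, since each sub-stage needs all threads of the previous one to have finished.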

iOS - GPU Accelerated Matrix Transpose, Multiplication and Eigen-Decomposition Dilemma

I'm working on a library that requires the use of vectors and matrices on the iOS platform. I decided to look into OpenGL ES because the matrix and vector manipulations I plan on doing (mainly transposing, matrix multiplication, and eigendecomposition) could definitely benefit from GPU acceleration.
The issue is that I'm not that familiar with OpenGL ES, and honestly it might not be the best option. If I were to use OpenGL ES, would I have to manually write the algorithms that do the matrix transposition, multiplication, and eigendecomposition? Or is there another Apple or third-party framework that can help me with these tasks?
The main deciding issue, however, is that I want these operations to be GPU accelerated.
I'm going to implement my program using the Accelerate framework and vectorized arithmetic, then test to see if it's fast enough for my purposes and, if it isn't, try the GPU implementation.
As combinatorial states, Accelerate uses SIMD to accelerate many of its functions, but it is CPU-based. For smaller data sets, it's definitely the way to go, but operating on the GPU can significantly outclass it for large enough data sets with easily parallelized operations.
To avoid having to write all of the OpenGL ES interaction code yourself, you could take a look at my GPUImage framework, which encapsulates fragment shader operations within Objective-C. In particular, you can use the GPUImageRawDataInput and GPUImageRawDataOutput classes to feed raw byte data into the GPU, then operate over that using a custom fragment shader.
A matrix transpose operation would be quick to implement, since all of the matrix elements are independent of one another. Matrix multiplication by a constant or small matrix would also be reasonably easy to do, but I'm not sure how to scale the multiplication of two large matrices properly. Likewise, I don't have a good implementation of eigendecomposition that I could point to off of the top of my head.
The downside to dealing with fragment shader processing is the fact that by default OpenGL ES takes in and outputs 4-byte RGBA values at each pixel. You can change that to half floats on newer devices, and I know that others have done this with this framework, but I haven't attempted that myself. You can pack individual float values into RGBA bytes and unpack at the end, as another approach to get this data in and out of the GPU.
The OpenGL ES 3.0 support on the very latest A7 devices provides some other opportunities for working with float data. You can use vertex data instead of texture input, which lets you supply four floats per vertex and extract those floats in the end. Bartosz Ciechanowski has a very detailed writeup of this on his blog. That might be a better general approach for GPGPU operations, but if you can get your operations to run against texture data in a fragment shader, you'll see huge speedups on the latest hardware (the iPhone 5S can be ~100-1000X faster than the iPhone 4 in this regard, where vertex processing and CPU speeds haven't advanced nearly as rapidly).
The Accelerate framework is not accelerated on the GPU, but it is very well optimized and uses NEON SIMD where appropriate.
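The reason transpose and matrix multiplication map so naturally onto fragment shaders, as described above, is that every output element is computed independently. The plain C++ reference below makes that structure explicit; the body of each inner loop is what one fragment invocation would execute per output "pixel".

```cpp
#include <cstddef>
#include <vector>

// Row-major transpose: each output element depends on exactly one input
// element, so every "fragment" is trivially independent.
std::vector<float> transpose(const std::vector<float>& a, size_t rows, size_t cols) {
    std::vector<float> t(a.size());
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            t[c * rows + r] = a[r * cols + c];
    return t;
}

// Matrix multiply, a (n x k) times b (k x m): each output element (i, j)
// is an independent dot product -- one fragment per (i, j).
std::vector<float> matmul(const std::vector<float>& a, const std::vector<float>& b,
                          size_t n, size_t k, size_t m) {
    std::vector<float> c(n * m, 0.0f);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < m; ++j)
            for (size_t p = 0; p < k; ++p)
                c[i * m + j] += a[i * k + p] * b[p * m + j];
    return c;
}
```

Eigendecomposition is the odd one out: its iterations are sequential and data-dependent, which is why no easy shader mapping is offered above.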

Calling BLAS routines inside OpenCL kernels

Currently I am doing some image processing algorithms using OpenCL. Basically my algorithm requires to solve a linear system of equations for each pixel. Each system is independent of others, so going for a parallel implementation is natural.
I have looked at several BLAS packages such as ViennaCL and AMD APPML, but it seems all of them have the same use pattern (host calling BLAS subroutines to be executed on CL device).
What I need is a BLAS library that could be called inside an OpenCL kernel so that I can solve many linear systems in parallel.
I found this similar question on the AMD forums.
Calling APPML BLAS functions from the kernel
Thanks
It's not possible. clBLAS routines make a series of kernel launches; some 'solve' routines launch really complicated sequences of kernels. clBLAS routines take cl_mem and command queues as arguments, so if your buffer is already on the device, clBLAS will act on it directly. It doesn't accept host buffers or manage host-to-device transfers.
If you want to have a look at which kernels are generated and launched, uncomment this line https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/generic/common.c#L461 and build clBLAS. It will dump all kernels being called.
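Since a BLAS library can't be called from inside a kernel, the usual workaround for many small independent systems is to inline a tiny fixed-size solver into the kernel itself, one system per work-item. Here is a sketch of such a helper (Gaussian elimination with partial pivoting, my own illustration rather than anything from clBLAS); it is plain C++ here, and the same logic works on `__private` arrays inside an OpenCL kernel.

```cpp
#include <cmath>

// Solve A x = b for a fixed compile-time size N, in place, via Gaussian
// elimination with partial pivoting. Small and branch-light enough to be
// inlined per work-item, with one independent system per pixel.
template <int N>
bool solve_small(float a[N][N], float b[N], float x[N]) {
    for (int col = 0; col < N; ++col) {
        int pivot = col;                          // partial pivoting for stability
        for (int r = col + 1; r < N; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        if (std::fabs(a[pivot][col]) < 1e-12f) return false;   // singular
        for (int c = 0; c < N; ++c) { float t = a[col][c]; a[col][c] = a[pivot][c]; a[pivot][c] = t; }
        { float t = b[col]; b[col] = b[pivot]; b[pivot] = t; }
        for (int r = col + 1; r < N; ++r) {       // eliminate below the pivot
            float f = a[r][col] / a[col][col];
            for (int c = col; c < N; ++c) a[r][c] -= f * a[col][c];
            b[r] -= f * b[col];
        }
    }
    for (int r = N - 1; r >= 0; --r) {            // back substitution
        float s = b[r];
        for (int c = r + 1; c < N; ++c) s -= a[r][c] * x[c];
        x[r] = s / a[r][r];
    }
    return true;
}
```

For the per-pixel systems described in the question, N is typically small (3-6), so the O(N^3) cost per work-item is negligible next to memory traffic.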

SIMD math libraries for SSE and AVX

I am looking for SIMD math libraries (preferably open source) for SSE and AVX. I mean, for example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once.
AMD has a proprietary library, LibM (http://developer.amd.com/tools/cpu-development/libm/), which has some SIMD math functions, but LibM only uses AVX if it detects FMA4, which Intel CPUs don't have. Also, I'm not sure it fully uses AVX, as all the function names end in s4 (d2) and not s8 (d4). It gives better performance than the standard math libraries on Intel CPUs, but it's not much better.
Intel has SVML as part of its C++ compiler, but the compiler suite is very expensive on Windows. Additionally, Intel cripples the library on non-Intel CPUs.
I found the following AVX library, http://software-lisc.fbk.eu/avx_mathfun/, which supports a few math functions (exp, log, sin, cos, and sincos). It gives very fast results for me, faster than SVML, but I have not checked the accuracy. It only works on single-precision floats and does not work in Visual Studio (though that would be easy to fix). It's based on another SSE library.
Does anyone have any other suggestions?
Edit: I found a SO thread that has many answers on this subject
Vectorized Trig functions in C?
I have implemented Vecmathlib (https://bitbucket.org/eschnett/vecmathlib/) as a generic library for two other projects (the Einstein Toolkit, and pocl http://pocl.sourceforge.net/). Vecmathlib is open source and is written in C++.
Gromacs is a highly optimized molecular dynamics software package written in C++ that makes use of SIMD. As far as I know, the SIMD math functionality has not yet been split out into a separate library, but I guess the implementation might be useful for others nonetheless.
https://github.com/gromacs/gromacs/blob/master/src/gromacs/simd/simd_math.h
http://manual.gromacs.org/documentation/2016.4/doxygen/html-lib/simd__math_8h.xhtml

Implementing Seam Carving Method in C#

I would like to scale an image from 640x480 to 64x48. There are different approaches used by picture-manipulation programs like GIMP and Photoshop. For example, GIMP uses the Liquid Rescale algorithm and Photoshop uses the PatchMatch algorithm for better scaling; both use a seam-carving approach. But I am unable to compare and implement them in C#. I downloaded some example code, but it is in MATLAB. Is there an efficient algorithm to implement this in C# in a better way? Please provide any existing code for a seam-carving implementation in C#.
I've found following implementations of seam carving on C#:
magicarver
My library: Seam-Carving-Advanced (GPU processing is under development). This lib is a port of the C++ lib seam-carving-gui.
I would just use ImageMagick since you're doing C#; it will give you all kinds of conversion possibilities.
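The core of every seam-carving implementation linked above is the same dynamic program: compute a per-pixel energy, then the minimum cumulative cost of a connected vertical seam. Here is a language-neutral sketch in C++ (the C# ports follow the same structure); the energy computation itself, typically gradient magnitude, is assumed already done.

```cpp
#include <algorithm>
#include <vector>

// Minimum-cost connected vertical seam: cost[y][x] = energy[y][x] plus the
// cheapest of the three neighbours in the row above (x-1, x, x+1). The
// smallest entry in the bottom row is the tail of the seam to remove;
// repeating this (with backtracking to trace the seam) shrinks the image
// width by one pixel per pass.
int min_seam_cost(const std::vector<std::vector<int>>& energy) {
    size_t h = energy.size(), w = energy[0].size();
    std::vector<std::vector<int>> cost = energy;   // row 0 is just the energy
    for (size_t y = 1; y < h; ++y)
        for (size_t x = 0; x < w; ++x) {
            int best = cost[y - 1][x];
            if (x > 0)     best = std::min(best, cost[y - 1][x - 1]);
            if (x + 1 < w) best = std::min(best, cost[y - 1][x + 1]);
            cost[y][x] = energy[y][x] + best;
        }
    return *std::min_element(cost[h - 1].begin(), cost[h - 1].end());
}
```

Each row depends only on the row above, which is also why the GPU ports mentioned above parallelize across a row but process rows sequentially.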
