SIMD math libraries for SSE and AVX

I am looking for SIMD math libraries (preferably open source) for SSE and AVX. I mean, for example, if I have an AVX register v with 8 float values, I want sin(v) to return the sine of all eight values at once.
AMD has a proprietary library, LibM http://developer.amd.com/tools/cpu-development/libm/, which has some SIMD math functions, but LibM only uses AVX if it detects FMA4, which Intel CPUs don't have. Also, I'm not sure it fully uses AVX, as all the function names end in s4 (d2) and not s8 (d4). It gives better performance than the standard math libraries on Intel CPUs, but not by much.
Intel has SVML as part of its C++ compiler, but the compiler suite is very expensive on Windows. Additionally, Intel cripples the library on non-Intel CPUs.
I found the following AVX library, http://software-lisc.fbk.eu/avx_mathfun/, which supports a few math functions (exp, log, sin, cos, and sincos). It gives very fast results for me, faster than SVML, but I have not checked the accuracy. It only works on single-precision floats and does not compile in Visual Studio (though that would be easy to fix). It's based on an earlier SSE library.
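To make the interface I'm after concrete, here's a minimal sketch against avx_mathfun (sin256_ps is the single-precision sine as named in that library; worth verifying against the header you download):

    #include <immintrin.h>
    #include "avx_mathfun.h"  // provides exp256_ps, log256_ps, sin256_ps, cos256_ps, sincos256_ps

    int main() {
        __m256 v = _mm256_set_ps(7.f, 6.f, 5.f, 4.f, 3.f, 2.f, 1.f, 0.f);
        __m256 s = sin256_ps(v);   // sine of all eight lanes in one call
        float out[8];
        _mm256_storeu_ps(out, s);  // write the eight results back to memory
        return 0;
    }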
Does anyone have any other suggestions?
Edit: I found an SO thread with many answers on this subject:
Vectorized Trig functions in C?

I have implemented Vecmathlib https://bitbucket.org/eschnett/vecmathlib/ as a generic library for two other projects (the Einstein Toolkit, and pocl http://pocl.sourceforge.net/). Vecmathlib is open source and is written in C++.

Gromacs is a highly optimized molecular dynamics software package written in C++ that makes use of SIMD. As far as I know, the SIMD math functionality has not yet been split out into a separate library, but the implementation might be useful to others nonetheless.
https://github.com/gromacs/gromacs/blob/master/src/gromacs/simd/simd_math.h
http://manual.gromacs.org/documentation/2016.4/doxygen/html-lib/simd__math_8h.xhtml

Related

Do all CPUs which support AVX2 also support SSE4.2 and AVX?

I am planning to implement runtime detection of SIMD extensions. If I find out that the processor has AVX2 support, is it also guaranteed to have SSE4.2 and AVX support?
Support for a more-recent Intel SIMD ISA extension implies support for previous SIMD ones.
AVX2 definitely implies AVX1.
I think AVX1 implies that all of the SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2 feature bits must also be set in CPUID. If not formally guaranteed, much software makes this assumption, and a CPU that violated it would probably not be commercially viable for general use.
Note that popcnt has its own feature bit, so in theory you could have a CPU with AVX2 and SSE4.2 but not popcnt; however, many things treat SSE4.2 as implying popcnt. So in practice it's more like you can advertise support for popcnt without SSE4.2 than the other way around.
In theory you could make a CPU (or virtual machine) with AVX that didn't accept the non-VEX legacy-SSE encoding of SSE4.2 instructions like pcmpistri, but I think you'd be violating Intel's guarantees about what the AVX feature bit implies. I'm not sure whether that's formally written down in a manual, but most software will assume it.
But AVX1 does imply support for the VEX encoding of all SSE4.2-and-earlier SIMD instructions, e.g. vpcmpistri or vminss.
gcc -mavx2 definitely implies AVX1 and previous extensions, but it will only emit code that uses the VEX encoding. It does define the __SSE4_2__ macro and so on, though, so GCC treats AVX2 as implying earlier SSE extensions and popcnt, but not FMA, AES-NI, or PCLMUL; those are separate features even for GCC.
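A simple way to see which feature macros a given -m flag implies is to print them; a minimal sketch (compile with e.g. g++ -mavx2 and compare against other flag sets):

    #include <cstdio>

    int main() {
    #ifdef __AVX2__
        std::puts("__AVX2__ defined");
    #endif
    #ifdef __SSE4_2__
        std::puts("__SSE4_2__ defined");  // implied by -mavx2 on GCC
    #endif
    #ifdef __POPCNT__
        std::puts("__POPCNT__ defined");  // also implied by -mavx2 on GCC
    #endif
    #ifdef __FMA__
        std::puts("__FMA__ defined");     // NOT implied by -mavx2 alone
    #endif
        return 0;
    }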
(In practice you should use gcc -march=native or gcc -march=znver1 or whatever, to enable all the features your CPU has and set tuning options for it. Not just -mavx2 -mfma, which leaves tuning settings at bad defaults, like splitting every possibly-unaligned 256-bit load/store into 128-bit halves.)
(Note that MSVC doesn't have as many SIMD ISA detection macros; it has one for AVX but not for all of the earlier SSE* extensions. MSVC's model is designed around the assumption that programs will do runtime CPU detection instead of being compiled for the local machine. Although MSVC does now have AVX and AVX2 options to use those as baselines.)
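For the MSVC-style runtime detection mentioned above, here is a simplified sketch using __cpuidex from <intrin.h>. (A production check for AVX/AVX2 must also verify OSXSAVE and use _xgetbv to confirm the OS saves YMM state; that part is omitted here for brevity.)

    #include <intrin.h>
    #include <cstdio>

    int main() {
        int r[4];                      // EAX, EBX, ECX, EDX
        __cpuidex(r, 1, 0);
        int sse42 = (r[2] >> 20) & 1;  // CPUID.1:ECX bit 20 = SSE4.2
        int avx   = (r[2] >> 28) & 1;  // CPUID.1:ECX bit 28 = AVX
        __cpuidex(r, 7, 0);
        int avx2  = (r[1] >> 5) & 1;   // CPUID.(7,0):EBX bit 5 = AVX2
        std::printf("sse4.2=%d avx=%d avx2=%d\n", sse42, avx, avx2);
        return 0;
    }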
Note that AVX512 somewhat breaks with this tradition. AVX512F implies support for AVX2 and everything before it, but beyond that, AVX512DQ doesn't come "before" or "after" AVX512ER, for example; you can (in theory) have either, both, or neither. (In practice, Skylake-X/Cannon Lake/etc. have only a little overlap with Xeon Phi (Knights Landing / Knights Mill) beyond AVX512F: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512.)
If we pass the compiler option -mavx2, GCC doesn't give an error when we use AVX or SSE intrinsics. So GCC assumes that the presence of the AVX2 flag is enough to run AVX and SSE code. Of course, that doesn't guarantee that someone won't create a CPU with AVX2 but without SSE.
In principle, a CPU could support AVX2 without supporting any SSE4 instructions (which isn't as stupid an idea as it sounds!). In practice, though, if it supports AVX2, it also supports SSE4.
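On GCC/Clang, the runtime check the question asks about is one call per feature via __builtin_cpu_supports; a minimal sketch:

    #include <cstdio>

    int main() {
        if (__builtin_cpu_supports("avx2")) {
            // Real AVX2 hardware also reports avx and sse4.2, but it is cheap
            // to test every bit you rely on rather than assuming implications.
            std::printf("avx2: yes, avx: %d, sse4.2: %d\n",
                        __builtin_cpu_supports("avx") != 0,
                        __builtin_cpu_supports("sse4.2") != 0);
        }
        return 0;
    }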

Use of third-party library

I'm interested in using Alea GPU with a third-party library and am trying to get a sense of my options. Specifically, I'm interested in using this L-BFGS library. I'm fairly new to the F# ecosystem but do have experience with both CUDA and functional programming.
I've been using that L-BFGS library as part of a program that implements logistic regression. It would be neat if I could assume the library is correct and write the rest of my code (including the part that runs on the GPU) in type-safe F#.
It seems possible to link C++ with F#. Assuming I figure out how to integrate the L-BFGS library into an F# program, would the introduction of Alea GPU cause any issues?
What I am trying to avoid is re-writing L-BFGS in F# using Alea. However, maybe that's actually the easiest path to using F#. If Alea has any facilities for nonlinear optimization, I could probably use those instead.
Alea GPU does not have a nonlinear optimizer yet. The CUDA version has a slightly different implementation than the standard CPU L-BFGS, which sometimes causes accuracy issues. Apart from this, I did not face any issues with the code, except that the performance win also depends significantly on the objective function; the objective function for logistic regression is numerically relatively cheap.
We have an internal C# version of this code ported to Alea GPU, which could also be used from F#, and we plan to release it in a future version.

SIFT hardware accelerator for smartphones

I'm a fresh electronics-engineering graduate and I have experience in computer vision. I want to ask whether it's feasible to build a hardware accelerator for the SIFT algorithm - or any other OpenCV algorithm - to be used on smartphones instead of the current software implementation.
What are the advantages (much lower computation, lower power, enabling more complex applications, ...) and the disadvantages (it isn't better than the current software implementation, ...)?
Do you have any insight into this?
Thanks
You might be interested in checking out NEON optimizations. NEON is ARM's SIMD instruction set, supported by chips such as the Nvidia Tegra 3; some OpenCV functions are NEON-optimized.
Start by reading this nice article, Realtime Computer Vision with OpenCV; it has performance comparisons covering NEON, etc.
I also recommend you to start here and here, you will find great insights.
OpenCV supports both CUDA and (experimentally) OpenCL.
There are specific optimizations for Nvidia's Tegra chipset, which is used in a lot of phones/tablets. I don't know whether any phones use OpenCL.

OpenCL compliant DSP

On the Khronos website, OpenCL is said to be open to DSPs. But when I look at the websites of DSP-making companies, like Texas Instruments, Freescale, NXP, or Analog Devices, I can't find any mention of OpenCL.
So does anyone know whether an OpenCL-compliant DSP exists?
Edit: As this question seems surprising, I'll add the reason why I asked it. From the khronos.org page:
"OpenCL 1.0 at a glance
OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs"
So I think it would be interesting to know whether it's true, i.e. whether DSPs, which are particularly suited to certain complex calculations, can really be programmed using OpenCL.
The OpenCL spec seems to support using a chip that has one or more programmable GPU shader cores as an expensive DSP. It does not appear that the spec makes many allowances for DSP chips that were not designed to be used as programmable GPU shaders in a graphics pipeline.
I finally found one: the SNU-Samsung OpenCL Framework is able to target Texas Instruments C64x DSPs. More info here:
http://aces.snu.ac.kr/Center_for_Manycore_Programming/SNU-SAMSUNG_OpenCL_Framework.html

How do I Perform Integer SIMD operations on the iPad A4 Processor?

I feel the need for speed. Double for loops are killing my iPad app's performance. I need SIMD. How do I perform integer SIMD operations on the iPad's A4 processor?
Thanks,
Doug
The instruction set is NEON; see the intrinsics reference.
I've never been able to find good documentation on what they all actually do, but you pick it up pretty quickly if you've had any exposure to SSE.
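For simple cases you don't strictly need assembly: the NEON intrinsics in arm_neon.h expose integer SIMD directly from C/C++. A minimal sketch (the function here is illustrative, not from any particular library; compile with -mfpu=neon for ARMv7):

    #include <arm_neon.h>

    // Add two int16 arrays, 8 lanes per iteration.
    // Assumes n is a multiple of 8, for brevity.
    void add_i16(int16_t *dst, const int16_t *a, const int16_t *b, int n) {
        for (int i = 0; i < n; i += 8) {
            int16x8_t va = vld1q_s16(a + i);        // load 8 signed 16-bit lanes
            int16x8_t vb = vld1q_s16(b + i);
            vst1q_s16(dst + i, vaddq_s16(va, vb));  // add lanes and store
        }
    }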
To get the fastest speed, you will have to write ARM assembly language code that uses NEON SIMD operations, because C compilers generally don't generate very good SIMD code; hand-written assembly can make a big difference. I have a brief intro here: http://www.shervinemami.co.cc/iphoneAssembly.html
Note that the iPad's A4 uses an ARMv7-A CPU, so the reference manual for the NEON SIMD instructions is at: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html
(but it's 2000 pages long and requires an understanding of assembly code, and perhaps of SIMD in general!).
