Do all CPUs which support AVX2 also support SSE4.2 and AVX? - sse

I am planning to implement runtime detection of SIMD extensions. Is it such that if I find out that the processor has AVX2 support, it is also guaranteed to have SSE4.2 and AVX support?

Support for a more-recent Intel SIMD ISA extension implies support for previous SIMD ones.
AVX2 definitely implies AVX1.
I think AVX1 implies all of SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2 feature bits must also be set in CPUID. If not formally guaranteed, many things make this assumption and a CPU that violated it would probably not be commercially viable for general use.
Note that popcnt has its own feature bit, so in theory you could have a CPU with AVX2 and SSE4.2, but not popcnt, but many things treat SSE4.2 as implying popcnt. So it's more like you can advertize support for popcnt without SSE4.2.
In theory you could make a CPU (or virtual machine) with AVX but which didn't accept the non-VEX legacy-SSE encoding of SSE4.2 instructions like pcmpistri, but I think you'd be violating Intel's guarantees about what the AVX feature bit implies. Not sure if that's formally written down in a manual, but most software will assume that.
But AVX1 does imply support for the VEX encoding of all SSE4.2 and earlier SIMD instructions, e.g. vpcmpistri or vminss
gcc -mavx2 definitely implies AVX1 and previous extensions, but will only emit code that uses the VEX encoding. It will define the __SSE4_2__ macro and so on, though, so gcc does treat AVX2 as implying earlier SSE extensions and popcnt, but not FMA, AES-NI or PCLMUL. Those are separate features even for GCC.
(In practice you should use gcc -march=native or gcc -march=znver1 or whatever to enable all the features your CPU has, and set tuning options for it. Not just -mavx2 -mfma, that leaves tuning settings at bad defaults like splitting every possibly-unaligned 256-bit load/store into 128-bit halves.)
(Note that MSVC doesn't have as many SIMD ISA detection macros; it has one for AVX but not for all of the earlier SSE* extensions. MSVC's model is designed around the assumption that programs will do runtime CPU detection instead of being compiled for the local machine. Although MSVC does now have AVX and AVX2 options to use those as baselines.)
Note that AVX512 kind of breaks the traditions. AVX512F implies support for AVX2 and everything before it, but beyond that AVX512DQ doesn't come "before" or "after" AVX512ER, for example. You can (in theory) have either, both, or neither. (In practice, Skylake-X/Cannonlake/etc. has only a bit of overlap with Xeon Phi (Knight's Landing / Knight's Mill), beyond AVX512F. https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

If we set compiler option -mavx2 that GCC doesn't give an error when we use AVX or SSE intrinsics. So GCC supposes that existing of AVX2 flag is enough to run AVX and SSE code. Of course it does not garante that someone won't create CPU with AVX2 and without SSE.

In principle, a CPU could just support AVX2 without supporting any SSE4 instructions (Which isn't as stupid an idea as it sounds!). In practice though, if it supports AVX2, it also supports SSE4.

Related

Is OpenCL a shared, distributed or hybrid memory system

I'm having a hard time understanding if OpenCL and in particular OpenCL 2.0+ is a shared, distributed or a distributed shared memory architecture, in particular with a computer that has many OpenCL devices in the same PC.
In particular, I can see that It's a shared memory system in the fact that they can all access global memory but theirs a network-like aspect with the compute units that makes me question if it could classicly be classed as a distributed-shared memory architecture
Looking at it from a generic OpenCL coding perspective, your answer is "yes, maybe, unless it's not."
If you are talking about some specific hardware, there are (somewhere) clear and concise answers of how things work on the chip(s) and how OpenCL uses them.
By examining the OpenCL capacities and capabilities at runtime, you could modify some parameters of your OpenCL program or choose one of various kernels that is the best fit.

SIMD math libraries for SSE and AVX

I am looking for SIMD math libraries (preferably open source) for SSE and AVX. I mean for example if I have a AVX register v with 8 float values I want sin(v) to return the sin of all eight values at once.
AMD has a propreitery library, LibM http://developer.amd.com/tools/cpu-development/libm/ which has some SIMD math functions but LibM only uses AVX if it detects FMA4 which Intel CPUs don't have. Also I'm not sure it fully uses AVX as all the function names end in s4 (d2) and not s8 (d4). It give better performance than the standard math libraries on Intel CPUs but it's not much better.
Intel has the SVML as part of it's C++ compiler but the compiler suite is very expensive on Windows. Additionally, Intel cripples the library on non-Intel CPUs.
I found the following AVX library, http://software-lisc.fbk.eu/avx_mathfun/, which supports a few math functions (exp, log, sin, cos, and sincos). It gives very fast results for me, faster than SVML, but I have not checked the accuracy. It only works on single floating point and does not work in Visual Studio (though that would be easy to fix). It's based on another SSE library.
Does anyone have any other suggestions?
Edit: I found a SO thread that has many answers on this subject
Vectorized Trig functions in C?
I have implemented Vecmathlib https://bitbucket.org/eschnett/vecmathlib/ as a generic libraries for two other projects (The Einstein Toolkit, and pocl http://pocl.sourceforge.net/). Vecmathlib is open source, and is written in C++.
Gromacs is a highly optimized molecular dynamics software package written in C++ that makes use of SIMD. As far as I know the mathematics SIMD functionality has not yet been split out into a separate library but I guess the implementation might be useful for others nonetheless.
https://github.com/gromacs/gromacs/blob/master/src/gromacs/simd/simd_math.h
http://manual.gromacs.org/documentation/2016.4/doxygen/html-lib/simd__math_8h.xhtml

OpenCL compliant DSP

On the Khronos website, OpenCL is said to be open to DSP. But when I look on the website of DSP making companies, like Texas Instrument, Freescale, NXP or Analog Devices, I can't find any mention about OpenCL.
So does anyone knows if a OpenCL compliant DSP exists?
Edit: As this question seems surprising, I add the reason why I asked it. From the khronos.org page:
"OpenCL 1.0 at a glance
OpenCL (Open Computing Language) is the first open, royalty-free standard for general-purpose parallel programming of heterogeneous systems. OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance compute servers, desktop computer systems and handheld devices using a diverse mix of multi-core CPUs, GPUs, Cell-type architectures and other parallel processors such as DSPs"
So I think it would be interesting to know if it's true, if DSPs, which are particulary suited for some complex calculations, can really be programmed using OpenCL.
The OpenCL spec seems to support using a chip that has one or more programmable GPU shader cores as an expensive DSP. It does not appear that the spec makes many allowances for DSP chips that were not designed to support being used as a programmable GPU shader in a graphics pipeline.
I finally found one: The SNU-Samsung OpenCL Framework is able to use Texas Instrument C64x DSPs. More infos here:
http://aces.snu.ac.kr/Center_for_Manycore_Programming/SNU-SAMSUNG_OpenCL_Framework.html

How do I Perform Integer SIMD operations on the iPad A4 Processor?

I feel the need for speed. Double for loops are killing my iPad apps performance. I need SIMD. How do I perform integer SIMD operations on the iPad A4 processor?
Thanks,
Doug
The instruction set is NEON, intrinsics reference
I've never been able to find good documentation on what they all actually are. But you pick it up pretty quickly if you've had any exposure to SSE
To get the fastest speed, you will have to write ARM Assembly language code that uses NEON SIMD operations, because the C compilers generally don't make very good SIMD code, so hand-written Assembly will make a big difference. I have a brief intro here: http://www.shervinemami.co.cc/iphoneAssembly.html
Note that the iPad A4 uses the ARMv7-A CPU, so the reference manual for the NEON SIMD instructions is at: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html
(but its 2000 pages long and requires the understanding of Assembly code and perhaps SIMD in general!).

Is OpenCV 2.0 optimized for AMD processors?

I know that in the past OpenCV was based on IPP and was optimized only for Intel CPUs. Is this still the case with OpenCV 2.0?
History says that OpenCV was originally developed by Intel.
If you check OpenCV faq, they'll say:
OpenCV itself is open source and written in quite portable C/C++, it runs on other processors already and should be fairly easy to port (for example, there are already some CUDA optimizations on NVidia. On the other hand, OpenCV can sometimes run much faster on Intel processors (and sometimes AMD) because it can take advantage of SSE optimizations. OpenCV can be compiled statically with IPP libraries from Intel also which can speed up some function.
I have used it on other processors and different OS and I've always been very happy, including for video processing applications.

Resources