SIMDe provides fast, portable implementations of SIMD intrinsics on hardware which doesn't natively support them, such as calling SSE functions on ARM. There is no performance penalty if the hardware supports the native implementation (e.g., SSE/AVX runs at full speed on x86, NEON on ARM, etc.).
This makes porting code to other architectures much easier in a few key ways:
First, instead of forcing you to rewrite everything for each architecture, SIMDe lets you get a port up and running almost effortlessly. You can then start working on switching the most performance-critical sections to native intrinsics, improving performance gradually. SIMDe lets (for example) SSE/AVX and NEON code exist side-by-side, in the same implementation.
Second, SIMDe makes it easier to write code targeting ISA extensions you don't have convenient access to. You can run NEON code on your x86 machine without an emulator. Obviously you'll eventually want to test on the actual hardware you're targeting, but for most development SIMDe can provide a much easier path.
SIMDe takes a very different approach from most other SIMD abstraction layers in that it aims to expose the entire functionality of the underlying instruction set. Instead of limiting functionality to a lowest common denominator, SIMDe tries to minimize the amount of effort required to port while still allowing you the space to optimize as needed.
The current focus is on writing complete portable implementations, though a large number of functions already have accelerated implementations using one (or more) of the following:
- SIMD intrinsics from other ISA extensions (e.g., using NEON to implement SSE).
- Compiler-specific vector extensions and built-ins such as __builtin_shufflevector and __builtin_convertvector.
- Compiler auto-vectorization hints, using OpenMP 4 SIMD, Cilk Plus, GCC loop-specific pragmas, or clang pragma loop hint directives.
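To make the idea of a portable implementation with an optional accelerated path concrete, here is a minimal sketch. It is not SIMDe's actual code: the demo_u8x8 type and demo_add_u8x8 function are hypothetical names invented for this illustration, and the accelerated path assumes a compiler that supports the GCC-style vector_size attribute. Unsigned lanes are used so that wrap-around on overflow is well defined in both paths.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type for this sketch: eight unsigned 8-bit lanes (64 bits). */
typedef struct { uint8_t lane[8]; } demo_u8x8;

/* Lane-wise wrapping add of two 8 x 8-bit vectors. */
static demo_u8x8 demo_add_u8x8(demo_u8x8 a, demo_u8x8 b) {
  demo_u8x8 r;
#if defined(__GNUC__) || defined(__clang__)
  /* Accelerated path: GCC/clang vector extensions (an assumption of this
   * sketch; the compiler picks whatever instructions the target offers). */
  typedef uint8_t demo_vec __attribute__((vector_size(8)));
  demo_vec va, vb, vr;
  memcpy(&va, a.lane, sizeof(va));
  memcpy(&vb, b.lane, sizeof(vb));
  vr = va + vb; /* element-wise add, modulo 256 per lane */
  memcpy(r.lane, &vr, sizeof(vr));
#else
  /* Portable fallback: a plain C99 loop that any compiler accepts. */
  size_t i;
  for (i = 0; i < 8; i++) {
    r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);
  }
#endif
  return r;
}
```

Both paths produce the same result, so callers never need to know which one was compiled in.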
For an example of a project using SIMDe, see LZSSE-SIMDe.
There are currently complete implementations of the following instruction sets:
- MMX
- SSE
- SSE2
- SSE3
- SSSE3
- SSE4.1
There is also partial support for many others; see the instruction-set-support label in the issue tracker for details on progress. If you'd like to be notified when an instruction set is available, you may subscribe to the relevant issue.
If you have a project you're interested in using with SIMDe but we don't yet support all the functions you need, please file an issue with a list of what's missing so we know what to prioritize.
There are a lot of instructions to get through, so any help would be greatly appreciated! It's pretty straightforward work, and a great way to learn about the instructions.
There are three places you'll want to modify in order to implement a new function:
- ${arch}/${isax}.h — this is where the implementations live
- test/${isax}/${isax}.c — tests comparing the implementation with the expected result.
- test/${arch}/${isax}/compare.c — tests comparing the portable implementation with the "native" version, using random data for inputs.
The comparison test is optional, but very nice to have. The regular tests are required.
Hopefully it's clear what to do by using other functions in those files as a template, but if you have trouble please feel free to contact us; we're happy to help!
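As a rough illustration of the comparison-style test described above, the sketch below feeds the same random inputs to two implementations of one operation and counts disagreements. It does not use SIMDe's test harness; every name in it (demo_adds_i8_reference, demo_adds_i8_candidate, demo_compare_random) is hypothetical, and a scalar saturating add stands in for a real intrinsic.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reference: saturating signed 8-bit add, computed in a wider type. */
static int8_t demo_adds_i8_reference(int8_t a, int8_t b) {
  int sum = (int)a + (int)b;
  if (sum > INT8_MAX) return INT8_MAX;
  if (sum < INT8_MIN) return INT8_MIN;
  return (int8_t)sum;
}

/* Candidate: the same operation, written by checking the bounds first. */
static int8_t demo_adds_i8_candidate(int8_t a, int8_t b) {
  if (b > 0 && a > INT8_MAX - b) return INT8_MAX;
  if (b < 0 && a < INT8_MIN - b) return INT8_MIN;
  return (int8_t)(a + b);
}

/* Runs both versions on `iterations` random input pairs and returns the
 * number of mismatches; zero means the two implementations agreed. */
static size_t demo_compare_random(unsigned seed, size_t iterations) {
  size_t i, mismatches = 0;
  srand(seed);
  for (i = 0; i < iterations; i++) {
    int8_t a = (int8_t)(rand() % 256 - 128);
    int8_t b = (int8_t)(rand() % 256 - 128);
    if (demo_adds_i8_reference(a, b) != demo_adds_i8_candidate(a, b)) {
      mismatches++;
    }
  }
  return mismatches;
}
```

Using a fixed seed keeps a failing run reproducible, which helps when tracking down a mismatch.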
Each instruction set has a separate file: x86/mmx.h for MMX, x86/sse.h for SSE, x86/sse2.h for SSE2, and so on. Just include the header for whichever instruction set(s) you want, and SIMDe will provide the fastest implementation it can given which extensions you've enabled in your compiler (e.g., if you want to use NEON to implement SSE, you'll need to pass something like -mfpu=neon).
Symbols are prefixed with simde_. For example, the MMX _mm_add_pi8 intrinsic becomes simde_mm_add_pi8, and __m64 becomes simde__m64.
Since SIMDe is meant to be portable, many functions which assume types are of a specific size have been altered to use fixed-width types instead. For example, Intel's APIs assume int is 32 bits, so simde_mm_set_pi32's arguments are int32_t instead of int. On platforms where the native API's assumptions hold (i.e., if int really is 32 bits) SIMDe's types should be compatible, so existing code needn't be changed unless you're porting to a new platform.
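The effect of choosing fixed-width types can be seen in a small stand-alone sketch. The demo_i32x2 type and the two functions below are hypothetical and only mimic the shape of a two-lane "set" function; they are not SIMDe's definitions. Because the parameters are int32_t, each lane is 32 bits wide regardless of how wide int is on the target.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical type for this sketch: two signed 32-bit lanes (64 bits). */
typedef struct { int32_t lane[2]; } demo_i32x2;

/* Follows the argument order of an MMX-style "set": e1 is the high lane,
 * e0 is the low lane. Fixed-width parameters keep the lane size at
 * 32 bits on every platform. */
static demo_i32x2 demo_set_i32x2(int32_t e1, int32_t e0) {
  demo_i32x2 r;
  r.lane[0] = e0;
  r.lane[1] = e1;
  return r;
}

/* Returns nonzero when int happens to be exactly 32 bits wide, which is
 * the assumption the native API makes. */
static int demo_int_is_32_bits(void) {
  return sizeof(int) * CHAR_BIT == 32;
}
```

On a platform where demo_int_is_32_bits() returns nonzero, code that passes plain int values to such a function keeps working unchanged.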
For best performance, you should enable OpenMP 4 SIMD support by defining SIMDE_ENABLE_OPENMP before including any SIMDe headers, and enabling OpenMP support in your compiler. GCC and ICC both support a flag to enable only OpenMP SIMD support instead of full OpenMP (the SIMD support doesn't require the OpenMP run-time library); for GCC the flag is -fopenmp-simd, for ICC -openmp-simd. SIMDe also supports using Cilk Plus, GCC loop-specific pragmas, or clang pragma loop hint directives, though these are not as well tested.
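The following sketch shows one way an auto-vectorization hint could be attached to a portable loop. It is an illustration rather than SIMDe's internal code: DEMO_VECTORIZE and demo_add_f32 are hypothetical names, and the only identifier taken from the text above is SIMDE_ENABLE_OPENMP. When that macro is not defined, the hint expands to nothing and the loop is ordinary C99.

```c
#include <stddef.h>

/* DEMO_VECTORIZE is a hypothetical macro used only in this sketch. It
 * becomes an OpenMP 4 SIMD hint when SIMDE_ENABLE_OPENMP is defined and
 * disappears otherwise. */
#if defined(SIMDE_ENABLE_OPENMP)
#  define DEMO_VECTORIZE _Pragma("omp simd")
#else
#  define DEMO_VECTORIZE
#endif

/* Element-wise add of two float arrays of length n. The hint asks the
 * compiler to vectorize the loop; the result is the same either way. */
static void demo_add_f32(float *out, const float *a, const float *b, size_t n) {
  size_t i;
  DEMO_VECTORIZE
  for (i = 0; i < n; i++) {
    out[i] = a[i] + b[i];
  }
}
```

With GCC, building this with -DSIMDE_ENABLE_OPENMP -fopenmp-simd would enable the hint without pulling in the OpenMP run-time library.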
SIMDe requires C99.
Every commit is tested with several different versions of GCC, clang, and PGI via Travis CI on Linux. Microsoft Visual C++ is tested on Windows using AppVeyor. Intel C/C++ Compiler is also tested sporadically (mostly because their optimization reports are excellent).
I'm generally willing to accept patches to add support for other compilers, as long as they're not too disruptive, especially if we can get CI support going. Travis and AppVeyor are great, but feel free to use whatever works.
Currently only x86_64, x86, and ARMv7 receive any sort of regular testing. If you'd like to see more thorough testing of other architectures, please consider finding a way to integrate it into CI. One example might be running qemu on Travis CI (or some other hosted CI).
- The "builtins" module in portable-snippets does much the same thing, but for compiler-specific intrinsics (think __builtin_clz and _BitScanForward), not SIMD intrinsics.
- Intel offers an emulator, the Intel® Software Development Emulator, which can be used to develop software which uses Intel intrinsics without having to own hardware which supports them, though AFAIK it doesn't help for deployment.
- I'm not aware of anyone else trying to create portable implementations of an instruction set, but there are a few projects trying to implement one set with another:
  - ARM_NEON_2_x86_SSE — implementing NEON using SSE. Quite extensive, Apache 2.0 license.
  - sse2neon — implementing SSE using NEON. This code has already been merged into SIMDe.
  - veclib — implementing SSE2 using AltiVec/VMX, using a non-free IBM library called powerveclib.
  - SSE-to-NEON — implementing SSE with NEON. Non-free.
- arm-neon-tests contains tests to verify NEON implementations.
If you know of any other related projects, please let us know!
Sometimes features can't be emulated. If SIMDe is operating in native mode the functions will work as expected, but if there is no native support the following caveats apply:
- simde_MM_SET_ROUNDING_MODE() will use fesetround(), altering the global rounding mode.
- simde_mm_getcsr and simde_mm_setcsr only implement bits 13 and 14 (rounding mode).
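To show what "bits 13 and 14" refers to, here is a small sketch that reads and writes the rounding-control field of a control/status value laid out like the x86 MXCSR register. The DEMO_ names and both functions are hypothetical and are not part of SIMDe; the field values match the ones x86 uses for round-to-nearest, round-down, round-up, and round-toward-zero, which correspond to the C99 modes FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, and FE_TOWARDZERO.

```c
#include <stdint.h>

/* Rounding-control field of an MXCSR-style value: bits 13 and 14. */
#define DEMO_ROUND_NEAREST     UINT32_C(0x0000)
#define DEMO_ROUND_DOWN        UINT32_C(0x2000)
#define DEMO_ROUND_UP          UINT32_C(0x4000)
#define DEMO_ROUND_TOWARD_ZERO UINT32_C(0x6000)
#define DEMO_ROUND_MASK        UINT32_C(0x6000)

/* Extracts the rounding mode, ignoring every other bit. */
static uint32_t demo_csr_get_rounding(uint32_t csr) {
  return csr & DEMO_ROUND_MASK;
}

/* Returns csr with only the rounding-mode bits replaced by `mode`. */
static uint32_t demo_csr_set_rounding(uint32_t csr, uint32_t mode) {
  return (csr & ~DEMO_ROUND_MASK) | (mode & DEMO_ROUND_MASK);
}
```

An emulated implementation that only honors these two bits cannot report or change the other fields, such as the exception flags and masks, so code relying on those needs native support.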
SIMDe is distributed under an MIT-style license; see COPYING for details.