SIMD Everywhere

SIMDe provides fast, portable implementations of SIMD intrinsics on hardware which doesn't natively support them, such as calling SSE functions on ARM. There is no performance penalty if the hardware supports the native implementation (e.g., SSE/AVX runs at full speed on x86, NEON on ARM, etc.).

This makes porting code to other architectures much easier in a few key ways:

First, instead of forcing you to rewrite everything for each architecture, SIMDe lets you get a port up and running almost effortlessly. You can then start working on switching the most performance-critical sections to native intrinsics, improving performance gradually. SIMDe lets (for example) SSE/AVX and NEON code exist side-by-side, in the same implementation.

Second, SIMDe makes it easier to write code targeting ISA extensions you don't have convenient access to. You can run NEON code on your x86 machine without an emulator. Obviously you'll eventually want to test on the actual hardware you're targeting, but for most development SIMDe can provide a much easier path.

SIMDe takes a very different approach from most other SIMD abstraction layers in that it aims to expose the entire functionality of the underlying instruction set. Instead of limiting functionality to a lowest common denominator, SIMDe tries to minimize the amount of effort required to port while still allowing you the space to optimize as needed.

The current focus is on writing complete portable implementations, though a large number of functions already have accelerated implementations using one (or more) of the following:

SIMD intrinsics from other ISA extensions (e.g., using NEON to implement SSE).
Compiler-specific vector extensions and built-ins such as __builtin_shufflevector and __builtin_convertvector
Compiler auto-vectorization hints, using OpenMP 4 SIMD, Cilk Plus, GCC loop-specific pragmas, or clang pragma loop hint directives.
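As a rough illustration of how a single function might pick between an accelerated path and a portable one, here is a minimal sketch. It is not SIMDe's actual code: `demo_u8x8` and `demo_add_u8x8` are hypothetical names, and the sketch assumes the compiler accepts the GCC-style `vector_size` attribute whenever `__GNUC__` or `__clang__` is defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-bit vector of eight unsigned 8-bit lanes. */
typedef struct { uint8_t u8[8]; } demo_u8x8;

/* Lane-wise wrapping add, similar in spirit to an 8-bit MMX add. */
static demo_u8x8 demo_add_u8x8(demo_u8x8 a, demo_u8x8 b) {
  demo_u8x8 r;
#if defined(__GNUC__) || defined(__clang__)
  /* Accelerated path: let the compiler's vector extension do the add. */
  typedef uint8_t demo_vec __attribute__((vector_size(8)));
  demo_vec va, vb, vr;
  assert(sizeof(va) == sizeof(a.u8));
  memcpy(&va, a.u8, sizeof(va));
  memcpy(&vb, b.u8, sizeof(vb));
  vr = va + vb;
  memcpy(r.u8, &vr, sizeof(vr));
#else
  /* Portable path: a plain loop that any C99 compiler can handle. */
  for (size_t i = 0; i < 8; i++) {
    r.u8[i] = (uint8_t)(a.u8[i] + b.u8[i]);
  }
#endif
  return r;
}
```

Both paths produce the same result, so callers never need to know which one was compiled in.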

For an example of a project using SIMDe, see LZSSE-SIMDe.

Current Status

There are currently complete implementations of the following instruction sets:

MMX
SSE
SSE2
SSE3
SSSE3
SSE4.1

As well as partial support for many others; see the instruction-set-support label in the issue tracker for details on progress. If you'd like to be notified when an instruction set is available you may subscribe to the relevant issue.

If you have a project you're interested in using with SIMDe but we don't yet support all the functions you need, please file an issue with a list of what's missing so we know what to prioritize.

Want to help?

There are a lot of instructions to get through, so any help would be greatly appreciated! It's pretty straightforward work, and a great way to learn about the instructions.

There are three places you'll want to modify in order to implement a new function:

${arch}/${isax}.h — this is where the implementations live
test/${isax}/${isax}.c — tests comparing the implementation with the expected result.
test/${arch}/${isax}/compare.c — tests comparing the portable implementation with the "native" version, using random data for inputs.

The comparison test is optional, but very nice to have. The regular tests are required.
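If it helps to see the two kinds of test side by side, here is a minimal sketch. It does not use SIMDe's real test harness: every `demo_` name is hypothetical, and a second scalar formula stands in for the "native" version that a real comparison test would call.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical function under test: lane-wise wrapping add of 8 bytes. */
static void demo_add_u8x8(const uint8_t a[8], const uint8_t b[8], uint8_t r[8]) {
  for (size_t i = 0; i < 8; i++) {
    r[i] = (uint8_t)(a[i] + b[i]);
  }
}

/* Regular test: fixed inputs checked against a precomputed result.
 * Returns 1 on success, 0 on failure. */
static int demo_test_add_fixed(void) {
  const uint8_t a[8] = {0, 1, 2, 3, 127, 128, 255, 200};
  const uint8_t b[8] = {0, 1, 2, 3, 1, 128, 1, 100};
  const uint8_t expected[8] = {0, 2, 4, 6, 128, 0, 0, 44};
  uint8_t r[8];
  demo_add_u8x8(a, b, r);
  return memcmp(r, expected, sizeof(r)) == 0;
}

/* Comparison test: random inputs checked against a reference.
 * Returns 1 on success, 0 on the first mismatch. */
static int demo_test_add_random(unsigned seed, int rounds) {
  assert(rounds > 0);
  srand(seed);
  for (int n = 0; n < rounds; n++) {
    uint8_t a[8], b[8], r[8];
    for (size_t i = 0; i < 8; i++) {
      a[i] = (uint8_t)(rand() & 0xFF);
      b[i] = (uint8_t)(rand() & 0xFF);
    }
    demo_add_u8x8(a, b, r);
    for (size_t i = 0; i < 8; i++) {
      /* Stand-in for the native result. */
      uint8_t ref = (uint8_t)((a[i] + b[i]) & 0xFF);
      if (r[i] != ref) {
        return 0;
      }
    }
  }
  return 1;
}
```

The fixed test pins down edge cases such as wrap-around, while the random test covers a much wider range of inputs than anyone would write by hand.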

Hopefully it's clear what to do by using other functions in those files as a template, but if you have trouble please feel free to contact us; we're happy to help!

Usage

Each instruction set has a separate file; x86/mmx.h for MMX, x86/sse.h for SSE, x86/sse2.h for SSE2, and so on. Just include the header for whichever instruction set(s) you want, and SIMDe will provide the fastest implementation it can given which extensions you've enabled in your compiler (e.g., if you want to use NEON to implement SSE, you'll need to pass something like -mfpu=neon).

Symbols are prefixed with simde_. For example, the MMX _mm_add_pi8 intrinsic becomes simde_mm_add_pi8, and __m64 becomes simde__m64.

Since SIMDe is meant to be portable, many functions which assume types are of a specific size have been altered to use fixed-width types instead. For example, Intel's APIs assume int is 32 bits, so simde_mm_set_pi32's arguments are int32_t instead of int. On platforms where the native API's assumptions hold (i.e., if int really is 32 bits) SIMDe's types should be compatible, so existing code needn't be changed unless you're porting to a new platform.
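To illustrate the fixed-width idea without depending on the SIMDe headers, here is a small sketch. `demo_m64`, `demo_set_pi32`, and `demo_int_is_32_bits` are hypothetical stand-ins, not SIMDe's definitions. The argument order follows Intel's `_mm_set_pi32`, where the first argument is the high lane.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit vector of two 32-bit lanes. */
typedef struct { int32_t i32[2]; } demo_m64;

/* Takes int32_t rather than int, so each lane is exactly 32 bits wide
 * even on a platform where int is 16 or 64 bits. */
static demo_m64 demo_set_pi32(int32_t e1, int32_t e0) {
  demo_m64 r;
  assert(sizeof(demo_m64) * CHAR_BIT == 64);
  r.i32[0] = e0; /* low lane */
  r.i32[1] = e1; /* high lane */
  return r;
}

/* Reports whether plain int arguments would line up with the
 * fixed-width signature on this platform. */
static int demo_int_is_32_bits(void) {
  return sizeof(int) * CHAR_BIT == 32;
}
```

When `demo_int_is_32_bits()` returns 1, code written against an `int` signature keeps working unchanged, which is the compatibility property described above.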

For best performance, you should enable OpenMP 4 SIMD support by defining SIMDE_ENABLE_OPENMP before including any SIMDe headers, and enabling OpenMP support in your compiler. GCC and ICC both support a flag to enable only OpenMP SIMD support instead of full OpenMP (the SIMD support doesn't require the OpenMP run-time library); for GCC the flag is -fopenmp-simd, for ICC -openmp-simd. SIMDe also supports using Cilk Plus, GCC loop-specific pragmas, or clang pragma loop hint directives, though these are not as well tested.
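For a feel of what such a hint looks like on a portable loop, here is a minimal sketch. `DEMO_ENABLE_OPENMP` and `DEMO_VECTORIZE` are hypothetical names for this example (the first plays the role that SIMDE_ENABLE_OPENMP plays for SIMDe), and the way SIMDe wires this up internally may well differ.

```c
#include <assert.h>
#include <stddef.h>

/* Expand to an OpenMP SIMD hint only when the user opts in; otherwise
 * expand to nothing so the loop stays plain C99. */
#if defined(DEMO_ENABLE_OPENMP)
#  define DEMO_VECTORIZE _Pragma("omp simd")
#else
#  define DEMO_VECTORIZE
#endif

/* Element-wise float add; the hint asks the compiler to vectorize it. */
static void demo_add_f32(const float *a, const float *b, float *r, size_t n) {
  assert(a != NULL && b != NULL && r != NULL);
  DEMO_VECTORIZE
  for (size_t i = 0; i < n; i++) {
    r[i] = a[i] + b[i];
  }
}
```

With GCC, this sketch would be built with something like `-DDEMO_ENABLE_OPENMP -fopenmp-simd`; without those flags it compiles to the same scalar loop.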

Portability

Compilers

SIMDe requires C99.

Every commit is tested with several different versions of GCC, clang, and PGI via Travis CI on Linux. Microsoft Visual C++ is tested on Windows using AppVeyor. Intel C/C++ Compiler is also tested sporadically (mostly because their optimization reports are excellent).

I'm generally willing to accept patches to add support for other compilers, as long as they're not too disruptive, especially if we can get CI support going. Travis and AppVeyor are great, but feel free to use whatever works.

Hardware

Currently only x86_64, x86, and ARMv7 receive any sort of regular testing. If you'd like to see more thorough testing of other architectures, please consider finding a way to integrate it into CI. One example might be running qemu on Travis CI (or some other hosted CI).

Related Projects

The "builtins" module in portable-snippets does much the same thing, but for compiler-specific intrinsics (think __builtin_clz and _BitScanForward), not SIMD intrinsics.
Intel offers an emulator, the Intel® Software Development Emulator, which can be used to develop software that uses Intel intrinsics without having to own hardware that supports them, though as far as I know it doesn't help for deployment.
I'm not aware of anyone else trying to create portable implementations of an instruction set, but there are a few projects trying to implement one set with another:
- ARM_NEON_2_x86_SSE — implementing NEON using SSE. Quite extensive, Apache 2.0 license.
- sse2neon — implementing SSE using NEON. This code has already been merged into SIMDe.
- veclib — implementing SSE2 using AltiVec/VMX, using a non-free IBM library called powerveclib
- SSE-to-NEON — implementing SSE with NEON. Non-free.
arm-neon-tests contains tests to verify NEON implementations.

If you know of any other related projects, please let us know!

Caveats

Sometimes features can't be emulated. If SIMDe is operating in native mode the functions will work as expected, but if there is no native support the following caveats apply:

SSE

simde_MM_SET_ROUNDING_MODE() will use fesetround(), altering the global rounding mode.
simde_mm_getcsr and simde_mm_setcsr only implement bits 13 and 14 (rounding mode).
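To make the second caveat concrete, here is a minimal sketch of a control register that keeps only the rounding-mode bits. The `demo_` names are hypothetical and this is not how SIMDe stores its state; the bit values are the ones documented for the rounding-control field of the x86 MXCSR register.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding-control field of MXCSR: bits 13 and 14. */
#define DEMO_CSR_ROUND_MASK        UINT32_C(0x6000)
#define DEMO_CSR_ROUND_NEAREST     UINT32_C(0x0000)
#define DEMO_CSR_ROUND_DOWN        UINT32_C(0x2000)
#define DEMO_CSR_ROUND_UP          UINT32_C(0x4000)
#define DEMO_CSR_ROUND_TOWARD_ZERO UINT32_C(0x6000)

/* Hypothetical emulated register: only the rounding bits are stored. */
static uint32_t demo_csr_state = DEMO_CSR_ROUND_NEAREST;

/* Every bit outside the rounding field is silently dropped. */
static void demo_setcsr(uint32_t value) {
  demo_csr_state = value & DEMO_CSR_ROUND_MASK;
}

/* Reads back whatever rounding mode was last set. */
static uint32_t demo_getcsr(void) {
  assert((demo_csr_state & ~DEMO_CSR_ROUND_MASK) == 0);
  return demo_csr_state;
}
```

In practice this means that code which sets, say, exception-mask or flush-to-zero bits and reads them back will see them as clear when no native support is available.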

License

SIMDe is distributed under an MIT-style license; see COPYING for details.
