SIMD and SSE


  • Description: Single-Instruction Multiple-Data parallelism on x86 — the SSE/AVX/AVX-512 register sets and instruction families, when SIMD pays off, autovectorization vs intrinsics vs portable wrappers (std::experimental::simd, Highway, xsimd), and the gotchas (alignment, tail handling, downclocking on AVX-512).
  • My Notion Note ID: K2C-1-1
  • Created: 2020-06-01
  • Updated: 2026-05-22
  • License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io

Table of Contents


1. What SIMD Is

  • Single Instruction, Multiple Data — one CPU instruction operates on a vector of values in parallel.
  • One of Flynn's four taxonomy categories (SISD, SIMD, MISD, MIMD).
  • On modern x86, the vector width has grown over time:
    • MMX (1996) — 64-bit registers (mm0-mm7), shared with x87 FPU. Integer-only. Obsolete.
    • SSE family (1999+) — 128-bit xmm registers.
    • AVX (2011) — 256-bit ymm registers.
    • AVX-512 (2016) — 512-bit zmm registers, mask registers k0-k7.
  • ARM has its own SIMD: NEON (128-bit, ubiquitous on AArch64) and SVE / SVE2 (scalable, length-agnostic).
  • On a CPU running at 3 GHz with 256-bit AVX float ops: 8 floats per cycle per FMA unit × 2 FMA units = 16 floats/cycle ≈ 48 GFLOPS per core. The scalar baseline is ~6 GFLOPS. The gap is what makes SIMD worth the trouble.

2. SSE Family History

Extension Year Vector width Notable additions
SSE 1999 (Pentium III) 128-bit xmm Single-precision float SIMD, separate register file (no x87/MMX sharing).
SSE2 2000 (Pentium 4) 128-bit Double-precision float, 128-bit integer ops. Baseline for x86-64 — every x86-64 CPU has it.
SSE3 2004 128-bit Horizontal adds, complex-number ops.
SSSE3 2006 128-bit PSHUFB byte shuffle, PMADDUBSW.
SSE4.1 2007 128-bit PMULLD, PMOVZX, BLENDV, dot-product.
SSE4.2 2008 128-bit String/text processing, CRC32, PCMPESTRI.
AVX 2011 (Sandy Bridge) 256-bit ymm 3-operand VEX encoding, double-width floats.
AVX2 2013 (Haswell) 256-bit 256-bit integer ops, gather, FMA.
AVX-512 2016 (Knights Landing → Skylake-X) 512-bit zmm Mask registers, conflict detection, many sub-extensions (F, CD, BW, DQ, VL, ...).
AVX-512 on consumer mixed — present on Ice Lake, removed on Alder Lake/Raptor Lake (E-cores can't do it), partially back on Zen 4/Zen 5.
  • "SSE" today casually refers to the whole family. The original 1999 SSE is rarely the boundary you care about; SSE2 is the practical baseline for any 64-bit x86 code.
  • Detect support at runtime via cpuid (or __builtin_cpu_supports("avx2") in GCC/Clang).

3. Register and Instruction Layout

3.1 Registers

  • MMX: 8× 64-bit mm0-mm7 (aliased with x87 — obsolete).
  • SSE / SSE2-SSE4.2: 16× 128-bit xmm0-xmm15 (8 in 32-bit mode).
  • AVX / AVX2: 16× 256-bit ymm0-ymm15. Lower 128 bits = xmm.
  • AVX-512: 32× 512-bit zmm0-zmm31. Lower 256 = ymm, lower 128 = xmm. Plus 8 mask registers k0-k7 for predicated ops.

A 128-bit xmm register can hold 4 floats, 2 doubles, 16 bytes, 8 shorts, 4 ints, or 2 int64s — interpretation is per-instruction.

3.2 Operations

  • ArithmeticADDPS (add packed singles), MULPD (multiply packed doubles), VFMADD231PS (fused multiply-add).
  • ComparisonCMPPS, PCMPEQB produce a bit-mask result.
  • Permutation/shufflePSHUFD, PSHUFB, VPERMPS rearrange lanes. Often the bottleneck.
  • Load/storeMOVAPS (aligned), MOVUPS (unaligned), VMOVAPS (AVX). Pre-Nehalem (2008), unaligned loads were significantly slower; modern cores have near-identical aligned vs unaligned cost — except across cache-line boundaries.
  • Mask / predication (AVX-512) — k-register tells which lanes execute; eliminates a lot of branch-heavy code.

3.3 Calling-convention impact

  • AVX/AVX-512 write to wider registers; legacy code only saves the xmm portion → upper bits get zeroed (VZEROUPPER instruction). Mixing SSE and AVX without VZEROUPPER causes huge per-call stalls. Compilers handle this; hand-written assembly must not skip it.

4. Ways to Get SIMD Code

Ordered from most-portable / least-effort to most-control:

4.1 Autovectorization

  • Compiler turns scalar loops into SIMD automatically.
  • Conditions: countable loop, no aliasing, no data dependencies across iterations, simple body, predictable trip count, aligned access (or compiler handles peeling).
  • Inspect with -fopt-info-vec (GCC), -Rpass=loop-vectorize (Clang), or /Qvec-report (MSVC).
  • Help the compiler: __restrict__ to assert no aliasing, contiguous storage, simple control flow, mark loop with #pragma omp simd.

4.2 OpenMP / #pragma omp simd

  • Portable hint to vectorize a loop, even when autovec is too timid.
  • #pragma omp simd reduction(+:sum) is the typical incantation.

4.3 Portable SIMD wrappers

  • std::experimental::simd (Parallelism TS v2 — ISO/IEC TS 19570:2018; formalized as std::simd in C++26).
  • Google Highway — header-only, supports x86, ARM, RISC-V, WASM. Used by JPEG XL, Chromium.
  • xsimd — used by xtensor / mlpack.
  • Eve — modern C++20-style, expression templates.
  • Write once, the wrapper picks the widest vector type the target supports.

4.4 Intrinsics

  • C function names that map 1:1 to SIMD instructions. Define type prefix per width (__m128, __m256, __m512).
#include <immintrin.h>

// dot product of two arrays of 8 floats with AVX
float dot8(const float* a, const float* b) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vmul = _mm256_mul_ps(va, vb);
    // horizontal sum: a bit of shuffling
    __m128 lo = _mm256_castps256_ps128(vmul);
    __m128 hi = _mm256_extractf128_ps(vmul, 1);
    __m128 sum128 = _mm_add_ps(lo, hi);
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    return _mm_cvtss_f32(sum128);
}
  • Compile with -mavx (or -mavx2, -mavx512f). Without the flag, the intrinsic is undefined.
  • Per-architecture; no portability without #ifdef walls.

4.5 Inline / standalone assembly

  • Maximum control, minimum portability and readability. Avoid except for the hottest microkernels.

5. When SIMD Helps and When It Doesn't

Helps:

  • Tight loops over contiguous arrays of small numeric types.
  • Image processing (per-pixel ops), audio (per-sample), physics (per-particle).
  • Dense linear algebra — BLAS/LAPACK libs (OpenBLAS, MKL, BLIS, Eigen) are extensively hand-SIMD'd.
  • Compression/decompression (zstd, lz4 use SIMD heavily).
  • Cryptography hash functions (BLAKE3, SHA-256 with SHA extensions).
  • Parsing — JSON (simdjson is the reference example), CSV.

Doesn't help much:

  • Branchy code, indirect lookups, pointer chasing — these defeat vectorization.
  • Loops dominated by memory bandwidth — SIMD doesn't make DRAM faster.
  • Small N — fixed setup cost / function-call overhead eats the gain.
  • Data already on GPU — orders of magnitude more parallel there.

6. Pitfalls

  • Alignment. MOVAPS requires 16-byte alignment; misaligned → #GP fault. MOVUPS works on any address. Use alignas(32) or aligned_alloc when in doubt. On modern cores the unaligned penalty is tiny except across cache lines.
  • Tail handling. Vectorized loop processes floor(N/W) * W elements; the remaining N mod W elements need scalar cleanup or a mask. Forgetting this = silent UB or skipped data.
  • AVX-512 downclocking. On Skylake-SP and Cascade Lake, executing 512-bit AVX-512 instructions throttled the core (and sometimes neighbors) by ~10-25%. Workload had to overcome the frequency loss before showing a speedup. Improved on Ice Lake; gone on Sapphire Rapids. Still real on older Xeon-SP fleets.
  • VZEROUPPER stall. Calling an SSE routine from AVX code without VZEROUPPER causes a ~70-cycle stall on each transition. Compilers insert it automatically; hand-assembly must not omit.
  • Denormals. SIMD denormal handling is slow; enable FTZ (flush-to-zero) and DAZ (denormals-are-zero) via _MM_SET_FLUSH_ZERO_MODE / _MM_SET_DENORMALS_ZERO_MODE in numerical hot paths.
  • CPU dispatch. Binaries built with -mavx2 only run on CPUs with AVX2. Either ship multiple variants + a CPU-dispatch table (__attribute__((target("avx2"))) function multi-versioning), or set a baseline (e.g., x86-64-v3 ABI → AVX2+BMI2).
  • AVX-512 absent on consumer Intel chips (Alder Lake onward) — fragmented. Don't assume it's there; check at runtime.
  • Autovec silently giving up. A subtle dependency, aliasing, or function call inside the loop drops vectorization with no error. Check the report flags every time.

7. References