Description: Single-Instruction Multiple-Data parallelism on x86 — the SSE/AVX/AVX-512 register sets and instruction families, when SIMD pays off, autovectorization vs intrinsics vs portable wrappers (std::experimental::simd, Highway, xsimd), and the gotchas (alignment, tail handling, downclocking on AVX-512).
My Notion Note ID: K2C-1-1
Created: 2020-06-01
Updated: 2026-05-22
License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io

1. What SIMD Is

Single Instruction, Multiple Data — one CPU instruction operates on a vector of values in parallel.
One of the four categories in Flynn's taxonomy (SISD, SIMD, MISD, MIMD).
On modern x86, the vector width has grown over time:
- MMX (1996) — 64-bit registers (mm0-mm7), shared with x87 FPU. Integer-only. Obsolete.
- SSE family (1999+) — 128-bit xmm registers.
- AVX (2011) — 256-bit ymm registers.
- AVX-512 (2016) — 512-bit zmm registers, mask registers k0-k7.
ARM has its own SIMD: NEON (128-bit, ubiquitous on AArch64) and SVE / SVE2 (scalable, length-agnostic).
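On a CPU running at 3 GHz with two 256-bit FMA units: 8 float lanes per FMA unit × 2 units = 16 fused multiply-adds per cycle ≈ 48 billion FMAs per second per core, or 96 GFLOPS if each FMA is counted as two floating-point operations. The scalar baseline on the same units is ~6 billion FMAs per second (12 GFLOPS). The 8× gap is what makes SIMD worth the trouble.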

2. SSE Family History

Extension	Year	Vector width	Notable additions
SSE	1999 (Pentium III)	128-bit `xmm`	Single-precision float SIMD, separate register file (no x87/MMX sharing).
SSE2	2000 (Pentium 4)	128-bit	Double-precision float, 128-bit integer ops. Baseline for x86-64 — every x86-64 CPU has it.
SSE3	2004	128-bit	Horizontal adds, complex-number ops.
SSSE3	2006	128-bit	`PSHUFB` byte shuffle, `PMADDUBSW`.
SSE4.1	2007	128-bit	`PMULLD`, `PMOVZX`, `BLENDV`, dot-product.
SSE4.2	2008	128-bit	String/text processing, CRC32, `PCMPESTRI`.
AVX	2011 (Sandy Bridge)	256-bit `ymm`	3-operand VEX encoding, double-width floats.
AVX2	2013 (Haswell)	256-bit	256-bit integer ops, gather. FMA3 arrived alongside it on Haswell but is a separate CPUID feature.
AVX-512	2016 (Knights Landing → Skylake-X)	512-bit `zmm`	Mask registers, conflict detection, many sub-extensions (F, CD, BW, DQ, VL, ...).

AVX-512 on consumer chips is mixed: present on Intel Ice Lake, disabled on Alder Lake/Raptor Lake (the E-cores do not implement it), and available on AMD Zen 4 and Zen 5.

"SSE" today casually refers to the whole family. The original 1999 SSE is rarely the boundary you care about; SSE2 is the practical baseline for any 64-bit x86 code.
Detect support at runtime via cpuid (or __builtin_cpu_supports("avx2") in GCC/Clang).
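
A minimal sketch of the runtime check, assuming GCC or Clang on x86 (the builtin is compiler-specific; MSVC would need `__cpuid` instead). The function name `widest_simd_level` is hypothetical and only for illustration.

```c
/* Returns the name of the widest SIMD level the running CPU reports.
   Uses the GCC/Clang builtin named in the text; checks go widest-first. */
const char *widest_simd_level(void) {
    __builtin_cpu_init();  /* initializes the feature cache; safe to call repeatedly */
    if (__builtin_cpu_supports("avx512f")) return "avx512f";
    if (__builtin_cpu_supports("avx2"))    return "avx2";
    if (__builtin_cpu_supports("avx"))     return "avx";
    if (__builtin_cpu_supports("sse4.2"))  return "sse4.2";
    if (__builtin_cpu_supports("sse2"))    return "sse2";
    return "none";
}
```

On any x86-64 machine this should return at least "sse2", since SSE2 is part of the baseline.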

3. Register and Instruction Layout

3.1 Registers

MMX: 8× 64-bit mm0-mm7 (aliased with x87 — obsolete).
SSE / SSE2-SSE4.2: 16× 128-bit xmm0-xmm15 (8 in 32-bit mode).
AVX / AVX2: 16× 256-bit ymm0-ymm15. Lower 128 bits = xmm.
AVX-512: 32× 512-bit zmm0-zmm31. Lower 256 = ymm, lower 128 = xmm. Plus 8 mask registers k0-k7 for predicated ops.

A 128-bit xmm register can hold 4 floats, 2 doubles, 16 bytes, 8 shorts, 4 ints, or 2 int64s — interpretation is per-instruction.

3.2 Operations

Arithmetic — ADDPS (add packed singles), MULPD (multiply packed doubles), VFMADD231PS (fused multiply-add).
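Comparison — CMPPS and PCMPEQB produce a per-lane mask (all ones where the predicate holds, all zeros elsewhere); MOVMSKPS / PMOVMSKB pack that mask into the bits of a scalar register.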
Permutation/shuffle — PSHUFD, PSHUFB, VPERMPS rearrange lanes. Often the bottleneck.
Load/store — MOVAPS (aligned), MOVUPS (unaligned), VMOVAPS (AVX). Pre-Nehalem (2008), unaligned loads were significantly slower; modern cores have near-identical aligned vs unaligned cost — except across cache-line boundaries.
Mask / predication (AVX-512) — k-register tells which lanes execute; eliminates a lot of branch-heavy code.
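
As an illustration of the compare-then-mask pattern, here is a sketch of a byte search using only SSE2, so it should build on any x86-64 compiler without extra flags. The name `find_byte` is hypothetical, and `__builtin_ctz` assumes GCC or Clang.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Returns the index of the first byte equal to c in buf[0..n), or n if absent. */
size_t find_byte(const unsigned char *buf, size_t n, unsigned char c) {
    const __m128i needle = _mm_set1_epi8((char)c);  /* broadcast c to all 16 lanes */
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i eq = _mm_cmpeq_epi8(chunk, needle);  /* PCMPEQB: 0xFF in matching lanes */
        int mask = _mm_movemask_epi8(eq);            /* PMOVMSKB: one bit per byte lane */
        if (mask != 0)
            return i + (size_t)__builtin_ctz((unsigned)mask);  /* lowest set bit = first match */
    }
    for (; i < n; ++i)  /* scalar cleanup for the last n mod 16 bytes */
        if (buf[i] == c) return i;
    return n;
}
```

The comparison itself never branches per byte; the only branch is one test of the packed mask per 16-byte chunk.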

3.3 Calling-convention impact

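AVX and AVX-512 instructions write the full ymm/zmm registers, while legacy (non-VEX) SSE instructions touch only the low 128 bits and must preserve the rest. If the upper halves are left dirty, mixing the two encodings costs either a state-transition penalty or a false dependency on every SSE instruction, depending on the microarchitecture. VZEROUPPER zeroes the upper bits and marks them clean, so issue it before calling or returning to SSE code. Compilers handle this; hand-written assembly must not skip it.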

4. Ways to Get SIMD Code

Ordered from most-portable / least-effort to most-control:

4.1 Autovectorization

Compiler turns scalar loops into SIMD automatically.
Conditions: countable loop, no aliasing, no data dependencies across iterations, simple body, predictable trip count, aligned access (or compiler handles peeling).
Inspect with -fopt-info-vec and -fopt-info-vec-missed (GCC), -Rpass=loop-vectorize and -Rpass-missed=loop-vectorize (Clang), or /Qvec-report:2 (MSVC).
Help the compiler: __restrict__ to assert no aliasing, contiguous storage, simple control flow, mark loop with #pragma omp simd.
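
A small sketch of a loop shaped the way the conditions above ask for: countable, contiguous, no cross-iteration dependency, and no aliasing thanks to `restrict` (the C99 spelling; `__restrict__` is the GCC/Clang extension that also works in C++). The function name `saxpy` is just the conventional BLAS-style name, used here for illustration.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. The restrict qualifiers promise that x and y
   do not overlap, which removes the aliasing check the vectorizer would
   otherwise need (or give up on). */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Building with something like `gcc -O3 -fopt-info-vec` should report whether the loop was vectorized on your compiler version.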

4.2 OpenMP / `#pragma omp simd`

Portable hint to vectorize a loop, even when autovec is too timid.
#pragma omp simd reduction(+:sum) is the typical incantation.
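
A sketch of that incantation on a float sum. Without the pragma (or -ffast-math) the compiler generally may not vectorize this loop, because reordering floating-point additions can change the result; the reduction clause grants that permission explicitly. With GCC or Clang, -fopenmp-simd enables the pragma without linking the OpenMP runtime. The name `sum_array` is hypothetical.

```c
#include <stddef.h>

/* Sums n floats. The reduction clause lets the compiler keep several
   partial sums in vector lanes and combine them after the loop. */
float sum_array(const float *x, size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```

If the flag is missing the pragma is ignored and the function still returns the correct scalar result, so the code degrades gracefully.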

4.3 Portable SIMD wrappers

std::experimental::simd (Parallelism TS v2 — ISO/IEC TS 19570:2018; formalized as std::simd in C++26).
Google Highway — supports x86, ARM, RISC-V, WASM, with runtime dispatch built in. Used by JPEG XL, Chromium.
xsimd — used by xtensor / mlpack.
Eve — modern C++20-style, expression templates.
Write once, the wrapper picks the widest vector type the target supports.

4.4 Intrinsics

C functions that map (mostly) 1:1 to SIMD instructions. Vector types are named per width (__m128, __m256, __m512, with d and i suffixes for double and integer vectors).

#include <immintrin.h>

// dot product of two arrays of 8 floats with AVX
float dot8(const float* a, const float* b) {
    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vmul = _mm256_mul_ps(va, vb);
    // horizontal sum: a bit of shuffling
    __m128 lo = _mm256_castps256_ps128(vmul);
    __m128 hi = _mm256_extractf128_ps(vmul, 1);
    __m128 sum128 = _mm_add_ps(lo, hi);
    sum128 = _mm_hadd_ps(sum128, sum128);
    sum128 = _mm_hadd_ps(sum128, sum128);
    return _mm_cvtss_f32(sum128);
}

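Compile with -mavx (or -mavx2, -mavx512f). Without the flag (or a matching target attribute on the function), GCC and Clang reject the call at compile time; MSVC accepts it, but the binary will fault on a CPU that lacks the extension.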
Per-architecture; no portability without #ifdef walls.

4.5 Inline / standalone assembly

Maximum control, minimum portability and readability. Avoid except for the hottest microkernels.

5. When SIMD Helps and When It Doesn't

Helps:

Tight loops over contiguous arrays of small numeric types.
Image processing (per-pixel ops), audio (per-sample), physics (per-particle).
Dense linear algebra — BLAS/LAPACK libs (OpenBLAS, MKL, BLIS, Eigen) are extensively hand-SIMD'd.
Compression/decompression (zstd, lz4 use SIMD heavily).
Cryptography hash functions (BLAKE3, SHA-256 with SHA extensions).
Parsing — JSON (simdjson is the reference example), CSV.

Doesn't help much:

Branchy code, indirect lookups, pointer chasing — these defeat vectorization.
Loops dominated by memory bandwidth — SIMD doesn't make DRAM faster.
Small N — fixed setup cost / function-call overhead eats the gain.
Data already on GPU — orders of magnitude more parallel there.

6. Pitfalls

Alignment. MOVAPS requires 16-byte alignment; misaligned → #GP fault. MOVUPS works on any address. Use alignas(32) or aligned_alloc when in doubt. On modern cores the unaligned penalty is tiny except across cache lines.
Tail handling. A vectorized loop processes floor(N/W) * W elements; the remaining N mod W elements need scalar cleanup or a masked final iteration. Forgetting this either reads and writes past the end of the array (undefined behavior) or silently skips data.
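
The tail-handling pattern can be sketched as follows, using 128-bit SSE (W = 4 floats) so it builds without extra flags; the same shape applies to AVX with W = 8. The name `add_arrays` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* a[i] += b[i] for i in [0, n). */
void add_arrays(float *a, const float *b, size_t n) {
    const size_t W = 4;  /* floats per 128-bit vector */
    size_t i = 0;
    /* Main loop: "i + W <= n" avoids the unsigned underflow that
       "i <= n - W" would hit when n < W. */
    for (; i + W <= n; i += W) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
    }
    /* Scalar cleanup: the remaining n mod W elements. */
    for (; i < n; ++i)
        a[i] += b[i];
}
```

With n = 7, the vector loop handles elements 0 to 3 and the scalar loop handles 4 to 6.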
AVX-512 downclocking. On Skylake-SP and Cascade Lake, executing heavy 512-bit instructions throttles the core (and sometimes neighboring cores) by ~10-25%, so a workload has to overcome the frequency loss before it shows a speedup. The effect is smaller on Ice Lake and largely gone on Sapphire Rapids, but it is still real on older Xeon-SP fleets.
VZEROUPPER stall. Running legacy SSE code after AVX code without an intervening VZEROUPPER costs roughly 70 cycles per transition on Sandy Bridge through Broadwell; Skylake and later replace the stall with a false dependency that slows every SSE instruction. Compilers insert VZEROUPPER automatically; hand-written assembly must not omit it.
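Denormals. Arithmetic on denormal (subnormal) floats can fall onto a slow microcode-assisted path, in SIMD and scalar SSE code alike; enable FTZ (flush-to-zero) and DAZ (denormals-are-zero) via _MM_SET_FLUSH_ZERO_MODE / _MM_SET_DENORMALS_ZERO_MODE in numerical hot paths where the small departure from IEEE 754 behavior is acceptable.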
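
A minimal sketch of turning both modes on with the macros named above. The wrapper name `enable_ftz_daz` is hypothetical. The setting lives in the MXCSR register, which is per-thread state, so each worker thread would need to call it.

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE, _MM_GET_DENORMALS_ZERO_MODE */

/* Enables flush-to-zero (denormal results become 0) and
   denormals-are-zero (denormal inputs are read as 0) for the calling thread. */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

Calling this once at the start of a numerical worker thread is usually enough; code that relies on gradual underflow should save and restore the previous mode instead.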
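CPU dispatch. Binaries built with -mavx2 only run on CPUs with AVX2 and die with an illegal-instruction fault elsewhere. Either ship multiple variants behind a CPU-dispatch table (for example with __attribute__((target("avx2"))) per function, or function multi-versioning), or raise the baseline (e.g., -march=x86-64-v3, the psABI level that includes AVX2, FMA, and BMI2).
AVX-512 availability. It is absent on consumer Intel chips from Alder Lake onward, so the installed base is fragmented. Don't assume it's there; check at runtime.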
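
One way to sketch the dispatch-table idea by hand, assuming GCC or Clang: the target attribute lets a single function use AVX2-level intrinsics even though the file is compiled without -mavx2, and a function pointer is chosen once at first use. All names here (`scale`, `scale_avx2`, `scale_scalar`, `pick_scale`) are hypothetical.

```c
#include <immintrin.h>
#include <stddef.h>

/* Scalar fallback: runs on every x86-64 CPU. */
static void scale_scalar(float *x, size_t n, float k) {
    for (size_t i = 0; i < n; ++i) x[i] *= k;
}

/* 256-bit variant, compiled for AVX2 regardless of the file's flags. */
__attribute__((target("avx2")))
static void scale_avx2(float *x, size_t n, float k) {
    const __m256 vk = _mm256_set1_ps(k);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vk));
    for (; i < n; ++i) x[i] *= k;  /* scalar tail */
}

typedef void (*scale_fn)(float *, size_t, float);

/* Picks the best implementation the running CPU supports. */
static scale_fn pick_scale(void) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? scale_avx2 : scale_scalar;
}

/* Public entry point: multiplies x[0..n) by k in place. */
void scale(float *x, size_t n, float k) {
    static scale_fn impl;  /* resolved lazily; not synchronized across threads */
    if (!impl) impl = pick_scale();
    impl(x, n, k);
}
```

The lazy initialization is deliberately simple; a production version would resolve the pointer at startup or use the compiler's own multi-versioning support.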
Autovec silently giving up. A subtle dependency, aliasing, or function call inside the loop drops vectorization with no error. Check the report flags every time.

7. References

Intel® Intrinsics Guide — https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
Agner Fog's optimization manuals — https://www.agner.org/optimize/
Google Highway — https://github.com/google/highway
Geoff Langdale and Daniel Lemire, "Parsing gigabytes of JSON per second" (simdjson) — https://arxiv.org/abs/1902.08318
Wikipedia: SSE family — https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
Wikipedia: AVX — https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
Wikipedia: AVX-512 — https://en.wikipedia.org/wiki/AVX-512
Computer Architecture: A Quantitative Approach (Hennessy & Patterson) — SIMD / vector chapter.

SIMD and SSE

Table of Contents

1. What SIMD Is

2. SSE Family History

3. Register and Instruction Layout

3.1 Registers

3.2 Operations

3.3 Calling-convention impact

4. Ways to Get SIMD Code

4.1 Autovectorization

4.2 OpenMP / `#pragma omp simd`

4.3 Portable SIMD wrappers

4.4 Intrinsics

4.5 Inline / standalone assembly

5. When SIMD Helps and When It Doesn't

6. Pitfalls

7. References
