C++ OpenMP


  • Description: OpenMP for C++ — fork/join, parallel loops and schedules, data sharing, reductions and atomic, sections and tasks, SIMD, synchronization
  • My Notion Note ID: K2A-B2-2
  • Created: 2020-01-13
  • Updated: 2026-04-30
  • License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io

Table of Contents

  1. Model
  2. Controlling the Number of Threads
  3. Parallel Loops
  4. Data Sharing
  5. Sections and Tasks
  6. SIMD
  7. Synchronization

1. Model

#pragma omp directives in C, C++, or Fortran source instruct the compiler to emit threading code for shared-memory parallelism. A small runtime library (libgomp for GCC, libomp for Clang/LLVM) provides thread management and timing.

1.1 Fork/Join Model

  • The program starts with one thread (the master or initial thread).
  • At a #pragma omp parallel region, the master forks a team of threads.
  • Threads in the team execute the region concurrently.
  • At the end of the region there is an implicit barrier; all threads wait, then the team joins back to the master.
  • Execution between parallel regions is serial.
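
A minimal sketch of fork/join: the team forks at the parallel region, each thread reports its index, and serial execution resumes after the join.

#include <cstdio>
#include <omp.h>

int main() {
    std::printf("serial: one thread\n");              // before the fork

    #pragma omp parallel                              // fork a team
    {
        std::printf("hello from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }                                                 // implicit barrier, then join

    std::printf("serial again\n");                    // after the join
}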

1.2 Compiling and Checking Version

# GCC and Clang
g++   -fopenmp -O2 main.cpp -o main
clang++ -fopenmp -O2 main.cpp -o main

# MSVC
cl /openmp main.cpp

Check the OpenMP version a compiler supports — the value of _OPENMP is a date corresponding to the spec version (e.g. 201511 ≈ 4.5, 201811 ≈ 5.0):

echo | cpp -fopenmp -dM | grep -i openmp
# #define _OPENMP 201511

2. Controlling the Number of Threads

In priority order, lowest-precedence first:

  1. Implementation default (typically the number of online cores).
  2. Environment variable: OMP_NUM_THREADS=4 ./app.
  3. Runtime call: omp_set_num_threads(4);.
  4. Per-region clause: #pragma omp parallel num_threads(4).
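
A quick sketch of the precedence: the num_threads clause on the second region overrides the earlier runtime call.

#include <cstdio>
#include <omp.h>

int main() {
    omp_set_num_threads(4);               // runtime call: default for regions below

    #pragma omp parallel                  // team of 4 (from the call above)
    #pragma omp single
    std::printf("team of %d\n", omp_get_num_threads());

    #pragma omp parallel num_threads(2)   // per-region clause wins: team of 2
    #pragma omp single
    std::printf("team of %d\n", omp_get_num_threads());
}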

Useful runtime functions (from <omp.h>):

Function                Returns
omp_get_thread_num()    Index of the calling thread within the team (0..N-1)
omp_get_num_threads()   Size of the active team
omp_get_max_threads()   Upper bound the next parallel region can use
omp_get_num_procs()     Number of processors visible to the runtime
omp_in_parallel()       true if inside an active parallel region
omp_get_wtime()         Wall-clock time in seconds (for timing)
omp_get_wtick()         Timer resolution

Gotcha: omp_get_num_threads() called outside a parallel region returns 1, not the team size you'd see inside. For the upper bound the next region might use, call omp_get_max_threads() instead.
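
A quick illustration:

#include <cstdio>
#include <omp.h>

int main() {
    std::printf("%d\n", omp_get_num_threads());   // 1: outside any region
    std::printf("%d\n", omp_get_max_threads());   // e.g. 8: next region's bound

    #pragma omp parallel
    #pragma omp single
    std::printf("%d\n", omp_get_num_threads());   // actual team size, e.g. 8
}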

Nested parallelism is disabled by default. Enable it with omp_set_max_active_levels(N) (or OMP_MAX_ACTIVE_LEVELS=N); the older omp_set_nested / OMP_NESTED are deprecated since OpenMP 5.0. In practice nested parallelism oversubscribes cores — prefer tasks or larger outer parallelism.
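
If you do enable it, a minimal sketch with the non-deprecated API:

#include <cstdio>
#include <omp.h>

int main() {
    omp_set_max_active_levels(2);          // allow two active levels of parallelism

    #pragma omp parallel num_threads(2)    // outer team of 2
    #pragma omp parallel num_threads(2)    // each member forks an inner team of 2
    std::printf("outer %d / inner %d\n",
                omp_get_ancestor_thread_num(1), omp_get_thread_num());
}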

3. Parallel Loops

3.1 Syntax

Full form (parallel region + worksharing for):

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}

Combined form (the common case):

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c[i];
}

The two are equivalent when the parallel region contains exactly one loop. Use the separated form when you want multiple worksharing constructs to share the same team of threads.
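
For example, two loops can reuse one team, and nowait on the first removes its end barrier so threads flow straight into the second. This sketch is safe only because the loops touch disjoint arrays (a, b, f, g, and n stand in for your own code):

#pragma omp parallel            // one team for both loops
{
    #pragma omp for nowait      // skip this loop's end barrier
    for (int i = 0; i < n; ++i)
        a[i] = f(a[i]);

    #pragma omp for             // implicit barrier at the end of this one
    for (int i = 0; i < n; ++i)
        b[i] = g(b[i]);
}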

3.2 Loop-Form Restrictions

To worksharing-parallelize a loop, OpenMP requires it to be in canonical form:

  • The loop variable is integer (or random-access iterator since OpenMP 3.0).
  • Trip count is computable before the loop starts.
  • Comparison is one of <, <=, >, >=.
  • Increment is ++i, --i, i += k, i -= k, with k invariant in the loop.
  • No break, goto, return, or exception escaping the loop body. (continue is fine.)

C++ range-based for only became parallelizable in OpenMP 5.0, which extended the canonical loop form to cover it; on older compilers, convert to an index loop, as in the sketch below.
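
A sketch of the conversion, assuming a std::vector<double> v:

// Instead of: for (double& x : v) x *= 2.0;
#pragma omp parallel for
for (std::size_t i = 0; i < v.size(); ++i) {   // index loop is in canonical form
    v[i] *= 2.0;
}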

3.3 Schedules

The schedule clause picks how iterations are distributed among threads.

Schedule     What it does                                           When to use
static       Equal contiguous chunks, assigned at loop entry        Iterations of roughly equal cost
static, n    Cyclic chunks of size n                                Cache-friendly cyclic distribution
dynamic, n   Threads grab chunks of n from a queue as they finish   Iterations of variable cost
guided, n    Like dynamic, but chunk size shrinks over time         Variable-cost work, tail effects
auto         Runtime/compiler picks                                 Trust the implementation
runtime      Picked from the OMP_SCHEDULE env var                   Tune from outside the binary

#pragma omp parallel for schedule(dynamic, 64)
for (int i = 0; i < n; ++i) {
    process(i);   // each iteration may take very different time
}

Don't parallelize tiny loops blindly. Team-creation overhead can dominate for small n. Gate it with if(n > threshold):

#pragma omp parallel for if(n > 1024)
for (int i = 0; i < n; ++i) { ... }

Exceptions and OpenMP don't mix. An exception thrown inside a parallel region must not propagate out. Catch inside, communicate failure via a shared atomic flag, or wrap the whole region body in a try/catch.
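
A sketch of the catch-inside pattern, reporting failure through a shared flag (process and n stand in for your own code):

#include <atomic>

std::atomic<bool> failed{false};

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    try {
        process(i);              // may throw
    } catch (...) {
        failed.store(true);      // never let the exception escape the region
    }
}
if (failed.load()) { /* handle the error serially */ }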

4. Data Sharing

4.1 Default Rules

Inside a parallel region:

  • Variables declared outside the region are shared by default.
  • Variables declared inside the region are private.
  • Loop iteration variables on a work-sharing for (and parallel for, taskloop, distribute) are predetermined-private — private regardless of where they were declared.
  • Static and global variables are always shared.

Get into the habit of writing default(none) and listing every variable explicitly — it forces you to think about each one and catches accidental sharing.

int sum = 0;
#pragma omp parallel for default(none) shared(a, n) reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += a[i];
}

4.2 private, firstprivate, lastprivate, shared

Clause                   Meaning
shared(x)                All threads see and modify the same x; the programmer is responsible for race-free access.
private(x)               Each thread gets its own x, uninitialized at entry. See the gotcha below.
firstprivate(x)          Per-thread copy, initialized from the value before the region.
lastprivate(x)           Per-thread copy; after the region, the original x receives the value from the thread that ran the last iteration of the loop (or the last section).
default(shared | none)   Sets the default data-sharing attribute for variables not listed explicitly; default(none) forces every variable to be listed.

private(x) does NOT initialize. Each thread's copy starts uninitialized; the original value is invisible inside the region, and the original is unchanged on exit. Use firstprivate if you need the previous value.

Watch for false sharing. If multiple threads write to different bytes of the same cache line — common with arrays of per-thread accumulators — the line ping-pongs between cores' caches and parallel speedup collapses. Pad per-thread data to a cache line (typically 64 bytes), or restructure so each thread's working set is isolated.
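
A sketch of the padding fix: alignas(64) gives each thread's counter its own cache line (64 bytes is an assumption about the line size; a and n stand in for your data):

struct alignas(64) PaddedCount {      // one cache line per counter
    long value = 0;
};

std::vector<PaddedCount> counts(omp_get_max_threads());

#pragma omp parallel
{
    const int t = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < n; ++i)
        counts[t].value += a[i];      // each thread stays on its own line
}

For a plain sum like this, reduction is still the better tool; padding matters when per-thread state must outlive the loop.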

4.3 Reductions

For accumulating a value across iterations, use reduction instead of a critical section — the compiler gives each thread a private accumulator and combines them at the end with the operator you specify, which is far cheaper.

double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < n; ++i) {
    sum += a[i] * b[i];
}

Built-in reduction operators: +, *, &, |, ^, &&, ||, min, max. (- was also a built-in but is deprecated since OpenMP 5.2 — use + with negated values instead.) OpenMP 4.0 added user-defined reductions via #pragma omp declare reduction.
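
A sketch of a user-defined reduction, concatenating per-thread std::vector<int> results (vec_merge is a name chosen here; private copies of class types are default-constructed, and the final concatenation order is unspecified):

#include <cstdio>
#include <vector>

#pragma omp declare reduction(vec_merge : std::vector<int> : \
    omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

int main() {
    std::vector<int> hits;
    #pragma omp parallel for reduction(vec_merge : hits)
    for (int i = 0; i < 1000; ++i) {
        if (i % 7 == 0) hits.push_back(i);   // each thread fills its private copy
    }
    std::printf("%zu multiples of 7\n", hits.size());
}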

4.4 Critical Sections and atomic

If a real critical section is unavoidable:

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    int v = compute(i);
    #pragma omp critical
    {
        global_log.push_back(v);
    }
}

For a single update to a scalar, atomic is much cheaper than critical:

#pragma omp atomic
counter += 1;

#pragma omp atomic update
total += a[i];

#pragma omp atomic capture
{ old = counter; counter += 1; }

atomic typically compiles to a hardware atomic read-modify-write; critical acquires a runtime lock around the block. Use atomic whenever the operation fits its restricted forms.

A loop writing to a single shared variable without reduction, atomic, or critical is a data race. The result is undefined. It may even pass tests on some hardware and fail on others — there's no compile-time check. Always pick one of the three.

5. Sections and Tasks

Sections parallelize a fixed set of unrelated work blocks:

#pragma omp parallel sections
{
    #pragma omp section
    do_a();
    #pragma omp section
    do_b();
    #pragma omp section
    do_c();
}

Tasks (OpenMP 3.0+) parallelize irregular work — recursion, dynamic graphs, anywhere the iteration count isn't known upfront:

#include <cstdio>

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)   // child task; x is shared so its result is visible here
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait         // wait for both child tasks before combining
    return x + y;
}

int main() {
    int r;
    #pragma omp parallel         // create the team that executes tasks
    #pragma omp single           // one thread seeds the recursion; the rest run tasks
    r = fib(20);
    std::printf("fib(20) = %d\n", r);
}

taskloop (OpenMP 4.5) is a task-based alternative to parallel for, useful when iterations have very uneven cost or when you need composability with other task work.
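
A minimal taskloop sketch: one thread creates tasks over chunks of iterations and the whole team executes them (grainsize sets roughly how many iterations each task gets; process stands in for real work):

#pragma omp parallel
#pragma omp single
{
    #pragma omp taskloop grainsize(32)
    for (int i = 0; i < n; ++i) {
        process(i);
    }
}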

6. SIMD

#pragma omp simd asks the compiler to vectorize a loop using SIMD instructions (no threads involved). Combine with parallel for for both threading and vectorization.

#pragma omp simd
for (int i = 0; i < n; ++i) {
    a[i] = b[i] * c[i];
}

#pragma omp parallel for simd
for (int i = 0; i < n; ++i) {
    a[i] = std::sqrt(b[i] * b[i] + c[i] * c[i]);
}

#pragma omp declare simd on a function lets the compiler generate a vector version that can be called from inside a simd loop.
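
A sketch, with axpy1 as an illustrative name: the compiler emits an additional vector variant that the simd loop below can call lane-wise (a, b, c, and n stand in for your data):

#pragma omp declare simd
float axpy1(float x, float y) {    // scalar and SIMD versions are generated
    return 2.0f * x + y;
}

#pragma omp simd
for (int i = 0; i < n; ++i) {
    a[i] = axpy1(b[i], c[i]);      // callable from the vectorized loop
}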

7. Synchronization

Directive             What it does
#pragma omp barrier   All threads wait until every thread reaches the barrier
#pragma omp single    Only one (unspecified) thread executes the block; others wait at the implicit end barrier
#pragma omp master    Only the master thread executes; no barrier at end (deprecated since OpenMP 5.1 in favor of masked)
#pragma omp ordered   Inside a for with the ordered clause, forces this block to run in iteration order
nowait clause         On for/single/sections, skips the implicit end barrier
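
A sketch combining a few of these (read_input, write_output, transform, buf, out, and n stand in for your own code):

#pragma omp parallel
{
    #pragma omp single           // one thread loads the data; implicit barrier after
    read_input(buf);

    #pragma omp for nowait       // skip this loop's end barrier
    for (int i = 0; i < n; ++i)
        out[i] = transform(buf[i]);

    #pragma omp barrier          // but wait here before anyone uses all of out
    #pragma omp single
    write_output(out);
}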
