Taskflow is a header-only C++20 library for parallel and heterogeneous task programming. You express work as a directed acyclic graph (DAG) of tasks, and Taskflow's work-stealing executor runs them in parallel.
Detailed guides (read when you need them):
- `docs/taskflow-patterns.md` — Code conversion recipes, advanced patterns, large-scale DAGs
- `docs/taskflow-performance.md` — Profiling, per-task cost model, tuning, NUMA pinning
- `docs/taskflow-internals.md` — Contributing to Taskflow source code
Field notes (specialized; skip unless relevant):
- `field-notes/opentimer-granularity-experiments.md` — OpenTimer granularity case study using DigestObserver (NUMA + SMT findings)
- `field-notes/taskbench-porting.md` — Porting a runtime to the task-bench benchmark framework
Profiling: for diagnostic profiling of Taskflow apps, prefer tf::DigestObserver (~2 KB structured digest, O(1) memory, all thresholds calibrated per-run). It's a drop-in custom observer — see docs/taskflow-performance.md Section 7 and INTEGRATION.md for integration details.
```bash
g++ -std=c++20 -O2 -pthread my_program.cpp -I /path/to/taskflow
```

```cpp
#include <taskflow/taskflow.hpp>             // core (always needed)
#include <taskflow/algorithm/for_each.hpp>   // algorithm headers are separate
#include <taskflow/algorithm/reduce.hpp>     // include only what you use
```

CMake options: `find_package(Taskflow)` (installed) or FetchContent (vendored).
With FetchContent, the target is Taskflow (no namespace), not Taskflow::Taskflow.
| Compiler | Min Version | Flag |
|---|---|---|
| GCC | 11 | `-std=c++20` |
| Clang | 12 | `-std=c++20` |
| MSVC | 19.29 | `/std:c++20` |

Build requirements: CMake >= 3.18. CUDA: nvcc >= 12.0.
| Object | Role |
|---|---|
| `tf::Executor` | Thread pool that runs taskflows — heavyweight, spawns OS threads on construction |
| `tf::Taskflow` | DAG of tasks (the computation) — lightweight, can be rebuilt cheaply |
| `tf::Task` | Handle to a node in the DAG (just a `Node*`, 8 bytes) |
| `tf::TaskGroup` | Cooperative task group for fork-join within a running executor |
```cpp
tf::Executor executor;   // thread pool (default: hardware_concurrency)
tf::Taskflow taskflow;   // task graph

auto A = taskflow.emplace([](){ /* work */ }).name("A");
auto B = taskflow.emplace([](){ /* work */ }).name("B");
A.precede(B);                    // A runs before B

executor.run(taskflow).wait();   // run() is non-blocking — .wait() is required
```

Executor lifetime: `tf::Executor` construction spawns N OS threads and initializes
work-stealing queues — this is expensive (~100µs+ per thread). When a function will be
called repeatedly (benchmarks, servers, iterative solvers), construct the executor once
and reuse it:
```cpp
void compute(tf::Executor& executor) {   // receives pre-built executor
  tf::Taskflow taskflow;
  // ... build graph ...
  executor.run(taskflow).wait();
}

void measure(tf::Executor& executor) {
  compute(executor);
}
```

Choose the executor's worker count at the call site, construct it once for that configuration, and reuse it across calls.
Building a new tf::Taskflow each call is fine — it's lightweight. The executor is what's
expensive. If your workload only runs once, a local executor is perfectly acceptable.
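For example, a minimal call site for the functions above, constructing the executor once and reusing it across calls (the worker count and repetition count are illustrative):

```cpp
int main() {
  tf::Executor executor(8);            // built once; 8 workers is illustrative
  for (int rep = 0; rep < 100; ++rep) {
    measure(executor);                 // every call reuses the same worker threads
  }
}
```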
| Sequential Pattern | Taskflow Replacement | Header |
|---|---|---|
| `for` loop over range/indices | `for_each` / `for_each_index` | `algorithm/for_each.hpp` |
| Chunked parallel loop (large range, cheap body) | `for_each_by_index` | `algorithm/for_each.hpp` |
| Accumulate / fold (iterators) | `reduce` / `transform_reduce` | `algorithm/reduce.hpp` |
| Accumulate / fold (index range) | `reduce_by_index` | `algorithm/reduce.hpp` |
| `std::sort` | `sort` | `algorithm/sort.hpp` |
| Prefix sum | `inclusive_scan` / `exclusive_scan` | `algorithm/scan.hpp` |
| `std::find_if` / `std::transform` | `find_if` / `transform` | `algorithm/find.hpp` / `transform.hpp` |
| Multi-stage streaming (fixed stages) | `tf::Pipeline` | `algorithm/pipeline.hpp` |
| Multi-stage streaming (typed data flow) | `tf::DataPipeline` | `algorithm/data_pipeline.hpp` |
| Multi-stage streaming (dynamic stages) | `tf::ScalablePipeline` | `algorithm/pipeline.hpp` |
| Independent tasks with deps | `emplace()` + `precede()` | `taskflow.hpp` |
| Recursive fork-join | `tf::TaskGroup` + `corun()` | `taskflow.hpp` |
| Dynamic sub-DAG (need task handles) | `tf::Subflow` | `taskflow.hpp` |
| Conditional branching/loops | Condition task (returns `int`) | `taskflow.hpp` |
| Repeated parallel computation | Condition task cycle (graph-internal loop), or `run_n()` / `run_until()` for simpler repeated execution | `taskflow.hpp` |
| Fire-and-forget async (no future needed) | `executor.silent_async()` | `taskflow.hpp` |
| Async with `std::future<T>` return value | `executor.async()` | `taskflow.hpp` |
| Async with ordering | `dependent_async()` | `taskflow.hpp` |
| Reusable graph modules | `composed_of()` | `taskflow.hpp` |
| Limit concurrency | `tf::Semaphore` | `taskflow.hpp` |
| Sequential task chain | `taskflow.linearize(tasks)` | `taskflow.hpp` |
Data-parallel loop?
- Per-element work is substantial → `for_each_index` (Taskflow partitions the range internally; good default for non-trivial per-element work)
- Large range with cheap per-element work, or when you want explicit inner-loop control → `for_each_by_index` (lambda receives a subrange to iterate)
- Need to reduce/accumulate a result → `reduce_by_index` (thread-local partial sums, merged at the end — avoids shared-state contention)
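A sketch of the first and third bullets — a parallel fill via `for_each_index` followed by an accumulation; the accumulation here uses the iterator-based `reduce` rather than `reduce_by_index`, and the data size and loop bodies are illustrative:

```cpp
#include <taskflow/taskflow.hpp>
#include <taskflow/algorithm/for_each.hpp>
#include <taskflow/algorithm/reduce.hpp>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow;
  std::vector<double> data(1'000'000);

  // Parallel loop: one callback per index, Taskflow partitions the range internally
  auto fill = taskflow.for_each_index(std::size_t{0}, data.size(), std::size_t{1},
                                      [&](std::size_t i) { data[i] = std::sqrt(double(i)); });

  // Parallel reduction: per-worker partial sums, merged into `sum` at the end
  double sum = 0.0;
  auto total = taskflow.reduce(data.begin(), data.end(), sum, std::plus<double>{});

  fill.precede(total);                 // reduce only after the fill completes
  executor.run(taskflow).wait();
}
```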
Recursive / fork-join?
- Structure discovered during execution (fibonacci, merge sort, recursive tree traversal) → `tf::TaskGroup` + `corun()`. Workers cooperatively steal tasks while waiting.
- Structure known at construction time (enumerable in a loop — e.g., BFS levels, layer-by-layer graph) → `emplace()` + `precede()`. Build the full DAG statically, no runtime overhead.
- Need a dynamic sub-DAG with explicit task handles and edges → `tf::Subflow` (sketch below). More powerful but heavier — each subflow materializes a full task graph.
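A minimal `tf::Subflow` sketch for the last case, used only because the sub-DAG needs named handles and explicit edges; the task bodies are placeholders, and it assumes a `taskflow` as in the quick-start:

```cpp
taskflow.emplace([](tf::Subflow& sf) {
  auto load  = sf.emplace([](){ /* read input   */ }).name("load");
  auto left  = sf.emplace([](){ /* process half */ }).name("left");
  auto right = sf.emplace([](){ /* process half */ }).name("right");
  load.precede(left, right);   // explicit edges inside the sub-DAG — what Subflow buys you
  sf.join();                   // blocks this worker; avoid in deep recursion (see pitfalls)
}).name("dynamic-subgraph");
```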
TaskGroup+corun recipe (recursive fork-join):
```cpp
// Recursive function — runs inside a worker, so TaskGroup is available
size_t fib(tf::Executor& executor, size_t n) {
  if (n < 2) return n;
  size_t x, y;
  auto tg = executor.task_group();
  tg.silent_async([&]{ x = fib(executor, n - 1); });
  y = fib(executor, n - 2);   // run one child inline to save a task
  tg.corun();                 // worker steals other tasks while waiting — no thread starvation
  return x + y;
}

// Entry point — enter worker context first
size_t fib_entry(tf::Executor& executor, size_t n) {
  return executor.async([&executor, n]() {
    return fib(executor, n);
  }).get();
}
```

TaskGroup requires a worker context — call `executor.task_group()` only from code
running inside the executor (e.g., inside `executor.async(...)` or a task lambda). For
binary recursion, fork one child and run the other inline to reduce task creation overhead.
Pipeline / staged processing? → Use the Pipeline family rather than manually wiring stage-to-stage DAGs with barrier tasks. Pipeline provides optimized token-based scheduling without materializing the full DAG.
- Stages known at compile time → `tf::Pipeline` (simplest)
- Typed data flowing between stages → `tf::DataPipeline` (compile-time type safety)
- Number of stages determined at runtime → `tf::ScalablePipeline` (dynamic configuration)

See `docs/taskflow-patterns.md` Section 4e for details on each variant.
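A minimal `tf::Pipeline` sketch — three stages over four parallel lines; the buffer contents and token count are illustrative:

```cpp
#include <taskflow/taskflow.hpp>
#include <taskflow/algorithm/pipeline.hpp>
#include <array>
#include <cstdio>

int main() {
  tf::Executor executor;
  tf::Taskflow taskflow;

  constexpr std::size_t num_lines = 4;     // tokens in flight at once
  std::array<int, num_lines> buffer;       // one slot per line carries data between stages

  tf::Pipeline pipeline(num_lines,
    tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
      if (pf.token() == 100) { pf.stop(); return; }     // stage 1: produce tokens 0..99
      buffer[pf.line()] = static_cast<int>(pf.token());
    }},
    tf::Pipe{tf::PipeType::PARALLEL, [&](tf::Pipeflow& pf) {
      buffer[pf.line()] *= 2;                           // stage 2: transform, parallel across lines
    }},
    tf::Pipe{tf::PipeType::SERIAL, [&](tf::Pipeflow& pf) {
      std::printf("%d\n", buffer[pf.line()]);           // stage 3: consume, in token order
    }}
  );

  taskflow.composed_of(pipeline).name("pipeline");      // schedule the pipeline as a module task
  executor.run(taskflow).wait();
}
```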
Repeated computation in a loop? → Prefer a condition task cycle when loop state belongs
inside the DAG or you need graph-internal control flow. Use run_n() / run_until() when
you simply need to rerun the same graph with a simpler API.
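A sketch of the simpler rerun APIs; it assumes an already-built `taskflow`, and the repetition count and convergence flag are placeholders:

```cpp
const std::size_t N = 10;             // illustrative repetition count
executor.run_n(taskflow, N).wait();   // rerun the same graph exactly N times

bool converged = false;               // would be flipped by a task inside the graph
executor.run_until(taskflow, [&]() { return converged; }).wait();   // rerun until the predicate is true
```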
None of the above? → Manual graph with emplace() + precede().
Before writing manual implementations, check whether Taskflow already has a built-in algorithm
(e.g., `sort`, `reduce`, `inclusive_scan`, `for_each_index`). Built-in algorithms are pre-tuned
and almost always faster than hand-rolled recursive equivalents.
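For instance, two built-ins chained as ordinary tasks (data and sizes are illustrative; assumes `executor` and `taskflow` as in the quick-start):

```cpp
#include <taskflow/algorithm/sort.hpp>
#include <taskflow/algorithm/scan.hpp>

std::vector<int> v(1'000'000, 1);
std::vector<int> prefix(v.size());

auto sorted  = taskflow.sort(v.begin(), v.end());                          // parallel sort
auto scanned = taskflow.inclusive_scan(v.begin(), v.end(),
                                       prefix.begin(), std::plus<int>{});  // parallel prefix sum
sorted.precede(scanned);             // scan runs over the sorted data
executor.run(taskflow).wait();
```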
For large-scale explicit DAGs (100K+ tasks), see docs/taskflow-patterns.md Section 6f.
Algorithms like for_each_index, for_each_by_index, reduce_by_index, and sort accept
an optional partitioner that controls how work is distributed across threads.
| Partitioner | Best for | Chunk size |
|---|---|---|
| `tf::GuidedPartitioner{}` (default) | General-purpose — starts with large chunks, shrinks adaptively | Auto |
| `tf::DynamicPartitioner{C}` | Irregular per-element cost (image processing, sparse graphs) | C elements per steal |
| `tf::StaticPartitioner{}` | Uniform per-element cost (matrix ops, memset-like loops) | N/threads, no stealing |
Chunk size guidance:
- `DynamicPartitioner{1}` — finest load balance, highest steal overhead
- `DynamicPartitioner{64}` or larger — amortizes steal overhead for cheap per-element work
- `StaticPartitioner{}` — zero steal overhead, best when every element costs the same
When to switch from default: If profiling shows load imbalance (some threads finish much
earlier than others), try DynamicPartitioner{1}. If profiling shows steal overhead dominates
(very cheap per-element work, uniform cost), try StaticPartitioner{}.
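The same loop under each partitioner — only the trailing argument changes; `N`, `process`, and the chunk size 64 are placeholders, and the three calls are shown side by side only to contrast the argument:

```cpp
auto body = [&](int i) { process(i); };

taskflow.for_each_index(0, N, 1, body);                                // tf::GuidedPartitioner{} by default
taskflow.for_each_index(0, N, 1, body, tf::DynamicPartitioner{64});    // irregular per-element cost
taskflow.for_each_index(0, N, 1, body, tf::StaticPartitioner{});       // uniform per-element cost
```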
Each taskflow.emplace(lambda) allocates a heap Node (~160 bytes) and type-erases
the lambda via std::function. The SBO (small-buffer optimization) threshold for
std::function is only 16 bytes (libstdc++) — most realistic captures exceed this,
triggering a second heap allocation. Total cost: ~150-250ns per task.
Cost hierarchy — where to focus optimization effort:
- Macro-level strategy — monolithic vs batched DAG. Each `executor.run().wait()` is a full synchronization barrier. Monolithic (1 barrier) is generally faster than batched (N barriers). Consider batching only for memory constraints (~0.5 GB per 1M tasks).
- Lambda capture size — since most captures exceed the 16-byte SBO, the second heap allocation is unavoidable. But capture size still matters: copying 240 bytes per task is 5x slower than 48 bytes, and fat captures pollute L1/L2 cache during construction. Target ≤48 bytes. Use pointer+index into pre-allocated arrays rather than embedding data in the capture struct (see `docs/taskflow-patterns.md` Section 6f and the sketch below).
- Extra heap allocations per task — copying structs with vectors, `shared_ptr` captures, `std::vector` captured by value. Each adds ~50-150ns on top of the mandatory node alloc.
- Algorithmic complexity of lookups — O(log n) containers (`std::map`) for dependency tracking cause cache-miss pointer chasing during single-threaded construction. Prefer flat vectors for O(1) indexed access.
- Scratch buffer strategy — pre-allocate per-worker scratch arrays and index by `executor.this_worker_id()` or `thread_local`. Either mechanism works, but all init and allocation must happen outside the lambda. Never call `prepare_scratch()`, resize, or check init flags inside the task body — this adds per-task overhead at millions of tasks.
Implication: Simpler code with fewer allocations outperforms complex "optimized" code
that inadvertently adds overhead. See docs/taskflow-performance.md for the full cost model.
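A sketch of the pointer+index capture style from the capture-size bullet above (the `Grid` type, `work`, and `num_blocks` are placeholders):

```cpp
struct Grid { std::vector<double> cells; };   // large, allocated once outside the loop
Grid grid;

// BAD: fat capture — copies the struct (and its vector) into every task
// taskflow.emplace([grid, i]{ work(grid, i); });

// GOOD: capture a pointer plus an index — ~12-16 bytes, small enough to avoid
// the second heap allocation behind std::function
Grid* g = &grid;
for (int i = 0; i < num_blocks; ++i) {
  taskflow.emplace([g, i] { work(*g, i); });
}
```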
- Forgetting `.wait()` — `run()` is non-blocking; without `.wait()` results are undefined.
- Recreating the executor in hot paths — `tf::Executor` spawns OS threads on construction. Creating and destroying it on every call pays thread spawn/join overhead each time.

  Anti-pattern (executor created per call):

  ```cpp
  auto measure_time_taskflow(unsigned num_threads) {
    tf::Executor executor(num_threads);   // BAD: spawns threads every call
    tf::Taskflow taskflow;
    // ...
    executor.run(taskflow).wait();
  }
  ```

  Correct pattern — pass the executor in:

  ```cpp
  auto measure_time_taskflow(tf::Executor& executor) {
    tf::Taskflow taskflow;   // Taskflow is cheap to rebuild
    // ...
    executor.run(taskflow).wait();
  }
  ```

  If you cannot change the function signature, a static local executor is only safe when the worker count is fixed for the lifetime of the process:

  ```cpp
  auto measure_time_taskflow() {
    static tf::Executor executor(fixed_num_threads);   // only when worker count never changes
    tf::Taskflow taskflow;
    // ...
    executor.run(taskflow).wait();
  }
  ```

  If `num_threads` varies across calls (for example, benchmark sweeps), construct the executor at the caller and pass it in instead.
- Atomic contention in reductions — using `for_each_index` with `std::atomic::fetch_add` per element causes all threads to contend on one cache line. At high core counts, this effectively serializes the computation. Prefer `reduce_by_index`, which accumulates into thread-local partials and merges only at the end (see the reduction sketch after this list). See `docs/taskflow-patterns.md` Section 4b.
- `Subflow` + `join()` for recursive fork-join — `sf.join()` blocks the calling worker thread, removing it from the work-stealing pool. With deep recursion this cascades into thread starvation. `TaskGroup` + `corun()` lets the waiting worker steal other tasks cooperatively. Prefer `TaskGroup` + `corun()` for all recursive fork-join. Reserve `Subflow` for cases where you need explicit task handles and edges within the sub-DAG. See `docs/taskflow-patterns.md` Section 6a for guidance on when each pattern fits.
- Rebuilding the graph every iteration — when a parallel computation repeats N times, avoid clearing and rebuilding the graph each iteration.

  Anti-pattern (rebuild per iteration):

  ```cpp
  for (int i = 0; i < N; i++) {
    tf::Taskflow taskflow;
    // ... build graph ...
    executor.run(taskflow).wait();   // BAD: graph construction + barrier per iteration
  }
  ```

  Correct pattern (condition task cycle with StaticPartitioner):

  ```cpp
  tf::Taskflow taskflow;
  int iter = 0;
  auto work = taskflow.for_each_index(0, M, 1, [&](int j){ /* ... */ }, tf::StaticPartitioner{});
  auto cond = taskflow.emplace([&]() -> int {
    return (++iter < N) ? 0 : -1;   // 0 = loop back, -1 = stop
  });
  work.precede(cond);
  cond.precede(work);               // successor 0 = work (cycles back)
  executor.run(taskflow).wait();    // single barrier for all N iterations
  ```

  This avoids per-iteration graph construction and executor re-launch overhead. See `docs/taskflow-patterns.md` Section 6b.
Data races — tasks run concurrently; protect shared mutable state.
-
Dangling references — captured references must outlive
executor.run().wait(). -
std::functionwrappers — unnecessary;emplace()accepts callables directly. -
Missing algorithm headers —
for_each,reduce, etc. need their own#include. -
Too-fine granularity — tasks should be ≥10-100 microseconds of work. If per-element work is cheap, prefer chunked APIs (
for_each_by_index,reduce_by_index) that let each task process a range of elements. -
Concurrent runs of same Taskflow — undefined behavior; wait for completion first.
-
Modifying graph while running — undefined behavior; build before
run(). -
Condition task successor ordering — return value selects by
precede()call order. Returning -1 (or any out-of-bounds value) selects no successor, effectively stopping. -
Compiling below C++20 — Taskflow requires
-std=c++20. -
Missing WAR deps with reused storage — circular buffers need WAR edges, not just RAW/WAW. See
docs/taskflow-patterns.mdSection 6f. -
Heap allocs in task loops — capturing
vectorby value, copying large structs → per-taskmalloc. Use fixed-size captures, pointers, references instead. -
constcaptures breaking pointer APIs — mark lambdamutableif called function needs non-const pointers from your captured struct. -
Skipping parallel init — when porting from OMP/TBB, parallelize all loops that were parallel in the source, including initialization. Sequential init of large data structures can be a meaningful fraction of total runtime.
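The reduction sketch referenced in the atomic-contention pitfall above. The fix shown here uses the iterator-based `transform_reduce`; the squared-sum body and data are illustrative, and the two versions are alternatives rather than parts of one graph:

```cpp
#include <taskflow/algorithm/for_each.hpp>
#include <taskflow/algorithm/reduce.hpp>    // reduce / transform_reduce
#include <atomic>
#include <functional>
#include <vector>

std::vector<double> data(1'000'000, 1.0);

// Anti-pattern: every element update contends on the same cache line
std::atomic<double> bad_sum{0.0};           // fetch_add on atomic<double> requires C++20
taskflow.for_each_index(std::size_t{0}, data.size(), std::size_t{1},
                        [&](std::size_t i) { bad_sum.fetch_add(data[i] * data[i]); });

// Fix: the built-in reduction keeps per-worker partials and merges them once at the end
double sum = 0.0;
taskflow.transform_reduce(data.begin(), data.end(), sum,
                          std::plus<double>{},               // merge partial results
                          [](double x) { return x * x; });   // per-element transform
```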
Good candidates: Independent loop iterations (>10µs each), serialized independent stages, recursive divide-and-conquer, pipeline/streaming, CPU-bound work saturating one core.
Leave alone: Trivially cheap loops (nanoseconds), memory-bandwidth-bound code, tightly sequential deps (except reductions/scans), working OpenMP/TBB code unless the goal is to standardize on Taskflow or remove the existing framework dependency, I/O-bound code, N < 10K with moderate per-element work, SIMD-vectorizable inner loops.
Priority order: 1) Serialized independent work → task graph. 2) Large parallel loops →
for_each/for_each_by_index. 3) Reductions → reduce/reduce_by_index. 4) Pipelines →
Pipeline. 5) Recursive fork-join → TaskGroup + corun().
| Approach | Best when |
|---|---|
| Explicit DAG (`emplace` + `precede`) | Width < threads, or cross-timestep parallelism matters |
| Per-timestep `for_each_index` | Width >> threads (width ≥ workers × 8) |
| Hybrid | Mixed narrow/wide phases |
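A sketch of the per-timestep approach for the wide case (`width`, `num_timesteps`, and `advance` are placeholders); chaining the loops gives up cross-timestep parallelism, which is exactly the trade-off the table describes:

```cpp
tf::Taskflow taskflow;
tf::Task prev;                              // empty handle until the first timestep
for (int t = 0; t < num_timesteps; ++t) {
  auto step = taskflow.for_each_index(0, width, 1, [&, t](int i) {
    advance(t, i);                          // independent work within one timestep
  });
  if (t > 0) prev.precede(step);            // keep timesteps ordered
  prev = step;
}
executor.run(taskflow).wait();              // one wide parallel loop per timestep
```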