dag_exec: a std-only DAG executor for CPU-heavy pipelines (pruning + bounded parallelism)

3/3/2026 • 4 min • rust
Rust • Concurrency • Parallelism • Backpressure • Execution Engines • DAG

TL;DR — dag_exec is a small, std-only executor for CPU-heavy pipelines that are naturally graph-shaped.
You request the outputs you care about, it prunes the rest, and it can run independent nodes in parallel with explicit bounds (no async runtime, no Rayon).

Featured in This Week in Rust #641 (Project/Tooling Updates).

Links: crate + docs + repo

Why this crate exists

Many practical pipelines are not linear; they are dependency graphs:

  • hashing / commitments / decoding
  • multi-stage indexing
  • proof plumbing (witness/trace → commitments → aggregation)
  • computing multiple artifacts from the same input

In those cases, two issues come up repeatedly:

  1. You often don’t need all outputs. Computing everything and throwing most away is wasteful.
  2. When nodes are CPU-heavy, you want parallelism — but with predictable bounds (no unbounded dispatch / memory blowups).

dag_exec is a small executor for that niche: bounded parallel execution + partial evaluation in a tiny, dependency-free package.

When to use it (and when not to)

Use dag_exec if:

  • your pipeline is naturally a dependency graph (DAG)
  • nodes are CPU-ish (hashing, decoding, proof steps, indexing stages)
  • you want a std-only solution
  • you care about explicit bounds: max_in_flight, per-worker queue caps

Don’t use it if:

  • your workload is I/O bound and naturally async (Tokio/async is a better fit)
  • you just want “parallel map” over independent items (Rayon is simpler)
  • you need distributed execution / caching / retries / persistence (that’s a bigger system)

What you get

1) Partial evaluation (pruning)

You run the DAG requesting only the outputs you want; the executor evaluates only the ancestors needed to produce them.

Expected shape (merkle example demo):

  • request root → leaf_hash=16, internal_hash=15
  • request subtree → leaf_hash=4, internal_hash=3

The key property is simple: requesting fewer outputs should execute fewer nodes.

2) Bounded parallelism (std threads)

The parallel executor uses a worker pool and explicit caps:

  • max_in_flight: global bound on queued + running tasks
  • worker_queue_cap: per-worker bounded queue

This keeps dispatch bounded instead of allowing unbounded queue growth.

Results at a glance (from examples/rollup.rs):

  • Pruning: request one chunk proof ⇒ 16 → 4 leaf hashes
  • Parallel (heavy mode): ~3× wall-time improvement @4 workers
  • Backpressure: parallel scheduler bounded by max_in_flight + per-worker queue caps

Minimal API (60 seconds)

use dag_exec::{DagBuilder, Executor, ExecutorConfig};
use std::sync::Arc;
 
let mut b = DagBuilder::<String, i32, ()>::new();
b.add_source("a".into(), 1)?;
b.add_source("b".into(), 2)?;
b.add_task("c".into(), vec!["a".into(), "b".into()], |xs: &[Arc<i32>]| Ok(*xs[0] + *xs[1]))?;
let dag = b.build()?;
 
let exec = Executor::new(ExecutorConfig::default());
let out = exec.run_sequential(&dag, ["c".to_string()])?;
assert_eq!(*out["c"], 3);
 
# Ok::<(), dag_exec::BuildError<String>>(())

Examples (copy/paste runnable)

The examples are intentionally close to infrastructure-style workloads:

1) examples/pipeline.rs — branch pruning by output selection

Demonstrates output-driven pruning across branches (commitment vs. receipts, etc.).

cargo run --example pipeline
DAG_EXEC_HASH_ITERS=200000 cargo run --release --example pipeline

2) examples/merkle.rs — pruning-only demo (pure and fast)

Showcases pruning in a basic Merkle tree example:

cargo run --example merkle

3) examples/rollup.rs — rollup-shaped pipeline (prune + parallel + real numbers)

This is the most complete example in the crate: leaf commitments → chunk roots → batch root, plus an inclusion-proof artifact node. It also shows where parallelism loses on small workloads and where it wins on heavier ones.

cargo run --release --example rollup
DAG_EXEC_MAX_WORKERS=4 DAG_EXEC_HASH_ITERS=200000 cargo run --release --example rollup
DAG_EXEC_MAX_WORKERS=4 DAG_EXEC_HASH_ITERS=2000000 cargo run --release --example rollup

What to look for (in heavy mode):

  • requesting one chunk proof drops work from 16 leaves → 4 leaves
  • parallel batch root can drop wall-clock by ~3× at 4 workers (~24ms → ~8ms in heavy mode)

Benchmarks (criterion)

cargo bench

Notes:

  • chain_* has no inherent parallelism → measures scheduler overhead.
  • fanout_*_heavy shows the parallel win when nodes are CPU-heavy.

If you want the HTML report locally:

# linux
xdg-open target/criterion/report/index.html

Design notes & honest tradeoffs

  • For tiny per-node work, parallel can be slower (thread + channel overhead dominates).
  • This is not for I/O-bound async workloads — Tokio/async is a better fit.
  • This is not “Rayon replacement” — it’s a DAG scheduler with pruning + explicit backpressure.

Build log / posts from the implementation

The crate grew out of a few concrete engineering problems:


If you try dag_exec in a real pipeline (proof systems, indexing, commitments), I’d love to hear what worked and what didn’t.

You might also like

Designing Backpressure in a Parallel DAG Executor

How I introduced bounded backpressure into a parallel DAG scheduler using sync channels and an in-flight cap.

Rollup Commitments as a DAG: Pruning, Parallelism, and Proof Plumbing (Rust std-only)

A rollup-shaped batch pipeline modeled as a DAG: compute only the chunk/proof you need, and scale CPU-heavy hashing with bounded parallelism — all in std.

Testing Concurrency Invariants in a Parallel Executor

How to verify max_in_flight bounds using AtomicUsize, CAS loops, and deterministic gating without sleep().