dag_exec: a std-only DAG executor for CPU-heavy pipelines (pruning + bounded parallelism)

3/3/2026 • 4 min • rust
Rust • Concurrency • Parallelism • Backpressure • Execution Engines • DAG

TL;DR — dag_exec is a small, std-only executor for CPU-heavy pipelines that are naturally graph-shaped.
You request the outputs you care about, it prunes the rest, and it can run independent nodes in parallel with explicit bounds (no async runtime, no Rayon).

Links: crate + docs + repo

Why this crate exists

A lot of “real” pipelines aren’t linear — they’re dependency graphs:

  • hashing / commitments / decoding
  • multi-stage indexing
  • proof plumbing (witness/trace → commitments → aggregation)
  • “compute multiple artifacts from the same input” workflows

Two things show up over and over:

  1. You often don’t need all outputs. Computing everything and throwing most away is wasteful.
  2. When nodes are CPU-heavy, you want parallelism — but with predictable bounds (no unbounded dispatch / memory blowups).

dag_exec is a small executor for that niche: bounded parallel execution + partial evaluation in a tiny, dependency-free package.

When to use it (and when not to)

Use dag_exec if:

  • your pipeline is naturally a dependency graph (DAG)
  • nodes are CPU-ish (hashing, decoding, proof steps, indexing stages)
  • you want a std-only solution
  • you care about explicit bounds: max_in_flight, per-worker queue caps

Don’t use it if:

  • your workload is I/O bound and naturally async (Tokio/async is a better fit)
  • you just want “parallel map” over independent items (Rayon is simpler)
  • you need distributed execution / caching / retries / persistence (that’s a bigger system)

What you get

1) Partial evaluation (pruning)

You run the DAG requesting only the outputs you want; the executor evaluates only the ancestors needed to produce them.

Expected shape (from the merkle example):

  • request root → leaf_hash=16, internal_hash=15
  • request subtree → leaf_hash=4, internal_hash=3

This is the most important property: fewer requested outputs ⇒ fewer executed nodes.

2) Bounded parallelism (std threads)

The parallel executor uses a worker pool and explicit caps:

  • max_in_flight: global bound on queued + running tasks
  • worker_queue_cap: per-worker bounded queue

This keeps dispatch predictable instead of “spawn forever”.

Results at a glance (from examples/rollup.rs):

  • Pruning: request one chunk proof ⇒ 16 → 4 leaf hashes
  • Parallel (heavy mode): ~3× wall-time improvement @4 workers
  • Backpressure: parallel scheduler bounded by max_in_flight + per-worker queue caps

Minimal API (60 seconds)

use dag_exec::{DagBuilder, Executor, ExecutorConfig};
use std::sync::Arc;
 
// Two source nodes and one task node that depends on both.
let mut b = DagBuilder::<String, i32, ()>::new();
b.add_source("a".into(), 1)?;
b.add_source("b".into(), 2)?;
// Task inputs arrive as Arc-shared values from the parent nodes.
b.add_task("c".into(), vec!["a".into(), "b".into()], |xs: &[Arc<i32>]| Ok(*xs[0] + *xs[1]))?;
let dag = b.build()?;
 
let exec = Executor::new(ExecutorConfig::default());
// Request only "c"; the executor evaluates just its ancestors.
let out = exec.run_sequential(&dag, ["c".to_string()])?;
assert_eq!(*out["c"], 3);
 
# Ok::<(), dag_exec::BuildError<String>>(())

Examples (copy/paste runnable)

I kept the examples intentionally “infra-shaped”:

1) examples/pipeline.rs — branch pruning by output selection

Demonstrates output-driven pruning across branches (commitments vs. receipts, etc.).

cargo run --example pipeline
DAG_EXEC_HASH_ITERS=200000 cargo run --release --example pipeline

2) examples/merkle.rs — pruning-only demo (pure and fast)

Showcases pruning in a basic Merkle tree example:

cargo run --example merkle

3) examples/rollup.rs — rollup-shaped pipeline (prune + parallel + real numbers)

This is the “flagship”: leaf commitments → chunk roots → batch root, plus an inclusion-proof artifact node. Also includes honest perf numbers showing when parallelism loses (tiny work) and when it wins (heavy work).

cargo run --release --example rollup
DAG_EXEC_MAX_WORKERS=4 DAG_EXEC_HASH_ITERS=200000 cargo run --release --example rollup
DAG_EXEC_MAX_WORKERS=4 DAG_EXEC_HASH_ITERS=2000000 cargo run --release --example rollup

What to look for (in heavy mode):

  • requesting one chunk proof drops work from 16 leaves → 4 leaves
  • parallel batch root can drop wall-clock by ~3× at 4 workers (~24ms → ~8ms in heavy mode)

Benchmarks (criterion)

cargo bench

Notes:

  • chain_* has no inherent parallelism → measures scheduler overhead.
  • fanout_*_heavy shows the parallel win when nodes are CPU-heavy.

If you want the HTML report locally:

# linux
xdg-open target/criterion/report/index.html

Design notes & honest tradeoffs

  • For tiny per-node work, parallel can be slower (thread + channel overhead dominates).
  • This is not for I/O-bound async workloads — Tokio/async is a better fit.
  • This is not “Rayon replacement” — it’s a DAG scheduler with pruning + explicit backpressure.

Build log / posts from the implementation

This crate came out of a few focused engineering problems.


If you try dag_exec in a real pipeline (proof systems, indexing, commitments), I’d love to hear what worked and what didn’t.

