
Designing Backpressure in a Parallel DAG Executor
TL;DR — My parallel DAG scheduler could over-dispatch tasks.
The fix required bounded queues, a global in-flight cap, and careful avoidance of deadlocks.
When I first implemented parallel execution for my DAG executor, I used unbounded channels between the scheduler and worker threads.
It worked.
But it had a silent problem: nothing prevented the scheduler from enqueueing an arbitrary number of ready tasks.
Under load, that means unbounded memory growth.
The fix was not adding async or introducing a runtime.
It was backpressure.
Contents
- The Naive Parallel Design
- The Hidden Problem: Over-Dispatch
- Introducing Backpressure
- Physical Cap vs Logical Cap
- Why the Results Channel Remains Unbounded
- Testing the Concurrency Bound
- Final Thoughts
1. The Naive Parallel Design
My DAG scheduler executes tasks with dependencies, so execution follows a Kahn-style topological progression: whenever a node’s in-degree becomes zero, it becomes ready and can be dispatched.
The design consists of:
- A single scheduler thread.
- N worker threads.
- One task channel per worker (scheduler → worker).
- One shared results channel (workers → scheduler).
Each worker owns its own queue. The scheduler distributes tasks round-robin across workers. Workers execute tasks and push results back to the scheduler via a shared mpsc channel.
In the first implementation, all task channels were created with std::sync::mpsc::channel(), which is unbounded.
Architecturally, this is a classic producer–consumer pattern with a dispatcher:
- The scheduler produces tasks.
- Workers consume tasks.
- Workers produce results.
- The scheduler consumes results.
An alternative design would be a single shared task queue consumed by all workers. However, std::sync::mpsc does not allow cloning the receiver. Wrapping a single receiver inside Arc<Mutex<_>> would serialize recv() calls, introducing contention and eliminating true parallel dequeue. Independent per-worker queues avoid this bottleneck and reduce lock contention.
2. The Hidden Problem: Over-Dispatch
The initial implementation works functionally, but it has a structural flaw.
Because task channels are unbounded, the scheduler can enqueue tasks as fast as it discovers them ready.
In a wide DAG, where many nodes become ready simultaneously, this means:
- The scheduler can enqueue thousands of tasks.
- Worker queues grow without limit.
- Memory usage scales with graph width.
- Execution latency increases due to queue growth and cache pressure.
Nothing prevents the scheduler from outpacing the workers.
This is the classic unbounded producer problem:
If producers are faster than consumers and the buffer is unbounded, memory becomes the pressure valve.
In small test cases this is invisible. Under load, it becomes a stability risk.
The system has parallelism — but no flow control.
3. Introducing Backpressure
To prevent unbounded growth, I introduced bounded per-worker queues using std::sync::mpsc::sync_channel.
Unlike channel(), which is unbounded, sync_channel(cap) enforces a fixed buffer size. When full, send() blocks or try_send() returns TrySendError::Full.
The scheduler now dispatches tasks using try_send() instead of send().
This allows the scheduler to:
- Detect when a worker queue is full.
- Avoid blocking blindly.
- Retry later after receiving a completed result.
The dispatch result is mapped to:
```rust
enum DispatchOutcome<O, E> {
    Queued,
    AllFull(Task<O, E>),
}
```

If all worker queues are full, the scheduler stores the task in a pending slot and switches to receiving results.
This transforms the scheduling loop into a two-phase system:
- Attempt dispatch while below capacity.
- If saturated, wait for completion and retry.
In addition to bounded per-worker queues, I introduced a global max_in_flight counter.
This ensures that the total number of running + queued tasks never exceeds a configurable system-wide cap.
4. Physical Cap vs Logical Cap
Two independent limits now exist in the system:
- Per-worker queue capacity.
- Global in-flight capacity.
Each worker can hold:
- 1 running task.
- worker_queue_cap buffered tasks.
Therefore:
physical_cap = n_workers * (queue_cap + 1)

This is the structural maximum number of tasks the system can physically hold.
The scheduler also enforces:
effective_cap = min(max_in_flight, physical_cap)

The global max_in_flight defines a logical execution limit.
The physical cap defines a hard architectural limit imposed by channel capacity.
This distinction is important:
- Physical cap prevents memory explosion.
- Logical cap controls parallel pressure and resource usage.
Together, they provide explicit backpressure and predictable memory bounds.
5. Why the Results Channel Remains Unbounded
Task channels are bounded.
The results channel is not.
If the results channel were bounded, a circular wait could occur:
- The scheduler stops dispatching because worker queues are full.
- Workers complete tasks and attempt to send results.
- The results channel is full.
- Workers block on sending results.
- The scheduler is waiting to receive results, but cannot progress.
This creates a deadlock cycle.
By keeping the results channel unbounded, completed tasks can always report back to the scheduler, ensuring forward progress.
6. Testing the Concurrency Bound
To prevent regressions, I introduced an integration test that measures maximum concurrent execution using AtomicUsize.
Instead of using sleep(), which is nondeterministic, tasks block on a shared AtomicBool gate to force overlap.
The test tracks:
- Active tasks.
- Maximum observed concurrency.
It asserts:
max_active <= max_in_flight

This converts a scheduler invariant into a measurable runtime property.
Final Thoughts
Backpressure is often associated with async or reactive systems.
But the principle is more general:
If producers can outpace consumers and buffers are unbounded, memory becomes the failure mode.
Explicit bounds make parallel systems predictable.
In this scheduler, bounded channels and an in-flight cap turned parallel execution from “it works” into “it is controlled.”