Designing Backpressure in a Parallel DAG Executor

TL;DR — My parallel DAG scheduler could over-dispatch tasks.
The fix required bounded queues, a global in-flight cap, and careful avoidance of deadlocks.

When I first implemented parallel execution for my DAG executor, I used unbounded channels between the scheduler and worker threads.

It worked.

But it had a silent problem: nothing prevented the scheduler from enqueueing an arbitrary number of ready tasks.

Under load, that means unbounded memory growth.

The fix was not adding async or introducing a runtime.

It was backpressure.

The Naive Parallel Design
The Hidden Problem: Over-Dispatch
Introducing Backpressure
Physical Cap vs Logical Cap
Why the Result Channel Remains Unbounded
Testing the Concurrency Bound
Final Thoughts

1. The Naive Parallel Design

My DAG scheduler executes tasks with dependencies, so execution follows a Kahn-style topological progression: whenever a node’s in-degree becomes zero, it becomes ready and can be dispatched.

The design consists of:

A single scheduler thread.
N worker threads.
One task channel per worker (scheduler → worker).
One shared results channel (workers → scheduler).

Each worker owns its own queue. The scheduler distributes tasks round-robin across workers. Workers execute tasks and push results back to the scheduler via a shared mpsc channel.

In a first implementation, all task channels could be created using std::sync::mpsc::channel(), which is unbounded.

Architecturally, this is a classic producer–consumer pattern with a dispatcher:

The scheduler produces tasks.
Workers consume tasks.
Workers produce results.
The scheduler consumes results.

An alternative design would be a single shared task queue consumed by all workers. However, std::sync::mpsc does not allow cloning the receiver. Wrapping a single receiver inside Arc<Mutex<_>> would serialize recv() calls, introducing contention and eliminating true parallel dequeue. Independent per-worker queues avoid this bottleneck and reduce lock contention.

2. The Hidden Problem: Over-Dispatch

The initial implementation works functionally, but it has a structural flaw.

Because task channels are unbounded, the scheduler can enqueue tasks as fast as it discovers them ready.

In a wide DAG, where many nodes become ready simultaneously, this means:

The scheduler can enqueue thousands of tasks.
Worker queues grow without limit.
Memory usage scales with graph width.
Execution latency increases due to queue growth and cache pressure.

Nothing prevents the scheduler from outpacing the workers.

This is the classic unbounded producer problem:

If producers are faster than consumers and the buffer is unbounded, memory becomes the pressure valve.

In small test cases this is invisible. Under load, it becomes a stability risk.

The system has parallelism — but no flow control.

3. Introducing Backpressure

To prevent unbounded growth, I introduced bounded per-worker queues using std::sync::mpsc::sync_channel.

Unlike channel(), which is unbounded, sync_channel(cap) enforces a fixed buffer size. When full, send() blocks or try_send() returns TrySendError::Full.

The scheduler now dispatches tasks using try_send() instead of send().

This allows the scheduler to:

Detect when a worker queue is full.
Avoid blocking blindly.
Retry later after receiving a completed result.

The dispatch result is mapped to:

enum DispatchOutcome<O, E> {
    Queued,
    AllFull(Task<O, E>),
}

If all worker queues are full, the scheduler stores the task in a pending slot and switches to receiving results.

This transforms the scheduling loop into a two-phase system:

Attempt dispatch while below capacity.
If saturated, wait for completion and retry.

In addition to bounded per-worker queues, I introduced a global max_in_flight counter.

This ensures that the total number of running + queued tasks never exceeds a configurable system-wide cap.

4. Physical Cap vs Logical Cap

Two independent limits now exist in the system:

Per-worker queue capacity.
Global in-flight capacity.

Each worker can hold:

1 running task.
worker_queue_cap buffered tasks.

Therefore:

physical_cap = n_workers * (queue_cap + 1)

This is the structural maximum number of tasks the system can physically hold.

The scheduler also enforces:

effective_cap = min(max_in_flight, physical_cap)

The global max_in_flight defines a logical execution limit.

The physical cap defines a hard architectural limit imposed by channel capacity.

This distinction is important:

Physical cap prevents memory explosion.
Logical cap controls parallel pressure and resource usage.

Together, they provide explicit backpressure and predictable memory bounds.

5. Why the Results Channel Remains Unbounded

Task channels are bounded.

The results channel is not.

If the results channel were bounded, a circular wait could occur:

The scheduler stops dispatching because worker queues are full.
Workers complete tasks and attempt to send results.
The results channel is full.
Workers block on sending results.
The scheduler is waiting to receive results, but cannot progress.

This creates a deadlock cycle.

By keeping the results channel unbounded, completed tasks can always report back to the scheduler, ensuring forward progress.

6. Testing the Concurrency Bound

To prevent regressions, I introduced an integration test that measures maximum concurrent execution using AtomicUsize.

Instead of using sleep(), which is nondeterministic, tasks block on a shared AtomicBool gate to force overlap.

The test tracks:

Active tasks.
Maximum observed concurrency.

It asserts:

max_active <= max_in_flight

This converts a scheduler invariant into a measurable runtime property.

Final Thoughts

Backpressure is often associated with async or reactive systems.

But the principle is more general:

If producers can outpace consumers and buffers are unbounded, memory becomes the failure mode.

Explicit bounds make parallel systems predictable.

In this scheduler, bounded channels and an in-flight cap turned parallel execution from “it works” into “it is controlled.”

Designing Backpressure in a Parallel DAG Executor

Contents

1. The Naive Parallel Design

2. The Hidden Problem: Over-Dispatch

3. Introducing Backpressure

4. Physical Cap vs Logical Cap

5. Why the Results Channel Remains Unbounded

6. Testing the Concurrency Bound

Final Thoughts

Stay Updated

You might also like

Rust — Trait Objects, Sized, and Why My DAG Needed `Box<dyn Fn>`

Crescent Bench Lab: Measuring ZK Presentations for Real Credentials (JWT + mDL)

TEE Auction Coprocessor: Replay-Safe Attested Auction Receipt with Gramine SGX — Tutorial