
Chase-Lev Work-Stealing Deque

Each worker in Volt owns a WorkStealQueue — a 256-slot lock-free ring buffer based on the Chase-Lev work-stealing deque. The owner thread has fast, uncontested access for push and pop operations, while other threads can safely steal half the queue without locks.

The Chase-Lev deque (David Chase and Yossi Lev, 2005) is the standard data structure for work-stealing schedulers. It provides:

  • Push (owner only): O(1) amortized, wait-free in the common case
  • Pop (owner only): O(1), lock-free (CAS may retry on steal contention)
  • Steal (any thread): O(1), lock-free (single CAS to claim tasks)

The key insight is that the owner thread and stealers operate on opposite ends of the deque, so they rarely contend. The owner pushes and pops from the back (tail), while stealers take from the front (head).

   stealers                                owner
      |                                      |
      v                                      v
        +---+---+---+---+---+---+---+---+
head -> | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | <- tail
        +---+---+---+---+---+---+---+---+
        FIFO (steal from front)       LIFO (push/pop at back)

In Volt’s variant, the local queue is used in FIFO order (pop from head) for fairness, with LIFO behavior provided by a separate lifo_slot atomic pointer on the Worker struct. This separation simplifies the ring buffer logic.
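
As a rough sketch of how these operations fit together (push, pop, and stealInto are shown in full below; ownerStep and idleStep are illustrative names, not Volt functions):

// Owner thread: enqueue new work and take the next task to run.
fn ownerStep(queue: *WorkStealQueue, global: *GlobalTaskQueue, task: *Header) void {
    queue.push(task, global); // may overflow half the queue to `global`
    if (queue.pop()) |next| {
        _ = next; // execute the task here (execution machinery omitted)
    }
}

// Idle worker: try to take half of another worker's queue in one call.
fn idleStep(local: *WorkStealQueue, victim: *WorkStealQueue) void {
    // stealInto returns the first stolen task for immediate execution.
    if (victim.stealInto(local)) |first| {
        _ = first; // execute the stolen task here
    }
}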

The standard Chase-Lev deque uses separate atomic variables for head and tail. Volt packs two head values into a single 64-bit atomic to support a three-phase steal protocol:

head (u64) = [ steal_head (upper 32 bits) | real_head (lower 32 bits) ]
tail (u32) = separate atomic, only written by owner
  • real_head: The true consumer position. The owner pops from here.
  • steal_head: Tracks how far stealers have confirmed their steals. During a steal, steal_head lags behind real_head.
  • When steal_head == real_head: No steal in progress.
  • When steal_head != real_head: A steal is active. The stealer has claimed slots [steal_head, real_head) but has not yet finished copying them.

This design enables the three-phase steal protocol without requiring additional synchronization.

pub const WorkStealQueue = struct {
    pub const CAPACITY: u32 = 256;
    const MASK: u32 = CAPACITY - 1;

    // Packed [steal_head | real_head]; written by the owner and by stealers.
    head: std.atomic.Value(u64) align(CACHE_LINE_SIZE),
    // Written only by the owner.
    tail: std.atomic.Value(u32) align(CACHE_LINE_SIZE),
    buffer: [CAPACITY]std.atomic.Value(?*Header) align(CACHE_LINE_SIZE),

    pub inline fn pack(steal: u32, real: u32) u64 {
        return (@as(u64, steal) << 32) | @as(u64, real);
    }

    pub inline fn unpack(val: u64) struct { steal: u32, real: u32 } {
        return .{
            .steal = @truncate(val >> 32),
            .real = @truncate(val),
        };
    }
};
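
As a quick sanity check, pack and unpack are inverses; a minimal test sketch, assuming std is imported in the same file:

test "pack/unpack round-trip" {
    const h = WorkStealQueue.unpack(WorkStealQueue.pack(5, 9));
    try std.testing.expectEqual(@as(u32, 5), h.steal);
    try std.testing.expectEqual(@as(u32, 9), h.real);

    // steal_head == real_head encodes "no steal in progress".
    const idle = WorkStealQueue.unpack(WorkStealQueue.pack(7, 7));
    try std.testing.expect(idle.steal == idle.real);
}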

The owner pushes a new task to the tail of the ring buffer. This is wait-free in the common case (just a store and tail increment). If the buffer is full, the owner overflows half the queue to the global queue.

pub fn push(self: *Self, task: *Header, global_queue: *GlobalTaskQueue) void {
    var tail = self.tail.load(.monotonic);
    while (true) {
        const head_packed = self.head.load(.acquire);
        const head = unpack(head_packed);

        if (tail -% head.steal < CAPACITY) {
            // Room available -- write and advance tail
            self.buffer[tail & MASK].store(task, .release);
            self.tail.store(tail +% 1, .release);
            return;
        }

        if (head.steal != head.real) {
            // Stealer active -- can't safely overflow, push to global
            global_queue.push(task);
            return;
        }

        // No stealer -- overflow half to global queue, then retry
        const half = CAPACITY / 2;
        const new_head = pack(head.steal +% half, head.real +% half);
        if (self.head.cmpxchgWeak(head_packed, new_head, .release, .acquire)) |_| {
            // CAS failed (a stealer moved head) -- reload tail and retry
            tail = self.tail.load(.monotonic);
            continue;
        }

        // CAS succeeded -- push the claimed half to the global queue
        var i: u32 = 0;
        while (i < half) : (i += 1) {
            const idx = (head.real +% i) & MASK;
            const t = self.buffer[idx].load(.acquire) orelse continue;
            global_queue.push(t);
        }

        self.buffer[tail & MASK].store(task, .release);
        self.tail.store(tail +% 1, .release);
        return;
    }
}

Overflow handling: When the queue is full and no stealer is active, the owner atomically advances both heads by CAPACITY / 2 = 128, effectively discarding the oldest 128 entries from the ring buffer. Those 128 tasks are then pushed to the global queue in a batch. This amortizes the cost of global queue mutex acquisition over 128 tasks.

If a stealer is active (heads differ), the owner cannot safely overflow because the stealer is still reading from those slots. In this case, the single new task goes directly to the global queue.

The owner pops from the head for FIFO ordering. This ensures tasks are executed in roughly the order they were enqueued, providing fairness. LIFO behavior (executing the newest task first for cache locality) is handled by the separate lifo_slot on the Worker struct.

pub fn pop(self: *Self) ?*Header {
    var head_packed = self.head.load(.acquire);
    while (true) {
        const head = unpack(head_packed);
        const tail = self.tail.load(.monotonic);
        if (head.real == tail) return null; // Empty

        const idx = head.real & MASK;
        const next_real = head.real +% 1;
        const new_head = if (head.steal == head.real)
            pack(next_real, next_real) // No stealer: advance both
        else
            pack(head.steal, next_real); // Stealer active: only advance real

        if (self.head.cmpxchgWeak(head_packed, new_head, .release, .acquire)) |actual| {
            head_packed = actual;
            continue; // Lost race with stealer, retry
        }
        return self.buffer[idx].load(.acquire);
    }
}

Stealing is the most complex operation. It transfers half of the victim’s queue into the stealer’s queue. The protocol runs in three phases, bracketed by two CAS operations (claim and release), to ensure correctness without locks.

Phase 1 (Claim):   Advance victim's real_head, keep steal_head
Phase 2 (Copy):    Copy tasks from victim's buffer to stealer's buffer
Phase 3 (Release): Advance victim's steal_head to match real_head

Victim's head during steal:

  Before:   [steal=5 | real=5]   (no steal in progress)
  Phase 1:  [steal=5 | real=9]   (claimed 4 tasks: slots 5,6,7,8)
  Phase 2:  (copy slots 5,6,7,8 to stealer's buffer)
  Phase 3:  [steal=9 | real=9]   (steal complete)

The full implementation:

pub fn stealInto(src: *Self, dst: *Self) ?*Header {
    const dst_tail = dst.tail.load(.monotonic);
    const src_head_packed = src.head.load(.acquire);
    const src_head = unpack(src_head_packed);

    // A steal is already in progress
    if (src_head.steal != src_head.real) return null;

    const src_tail = src.tail.load(.acquire);
    const available = src_tail -% src_head.real;
    if (available == 0) return null;

    // Steal half (ceiling), capped to CAPACITY/2
    var to_steal = available -% (available / 2);
    if (to_steal > CAPACITY / 2) to_steal = CAPACITY / 2;

    // Phase 1: CAS to claim -- advance real_head, keep steal_head
    const new_src_head = pack(src_head.steal, src_head.real +% to_steal);
    if (src.head.cmpxchgWeak(src_head_packed, new_src_head, .acq_rel, .acquire)) |_| {
        return null; // Lost race
    }

    // Phase 2: Copy tasks -- the first is returned to the caller,
    // the rest go into the stealer's buffer
    const first_task = src.buffer[src_head.real & MASK].load(.acquire);
    var i: u32 = 1;
    while (i < to_steal) : (i += 1) {
        const src_idx = (src_head.real +% i) & MASK;
        const dst_idx = (dst_tail +% (i - 1)) & MASK;
        const task = src.buffer[src_idx].load(.acquire);
        dst.buffer[dst_idx].store(task, .release);
    }
    if (to_steal > 1) {
        dst.tail.store(dst_tail +% (to_steal - 1), .release);
    }

    // Phase 3: Release -- advance steal_head to match real_head
    var release_head_packed = src.head.load(.acquire);
    while (true) {
        const release_head = unpack(release_head_packed);
        const final_head = pack(release_head.real, release_head.real);
        if (src.head.cmpxchgWeak(
            release_head_packed,
            final_head,
            .acq_rel,
            .acquire,
        )) |actual| {
            release_head_packed = actual;
            continue;
        }
        break;
    }

    return first_task;
}

Why three phases? A single CAS cannot atomically claim, copy, and release. The two-head design allows the claim and release to be separate atomic operations, with the copy happening in between. During the copy phase, the victim’s push can still proceed as long as the tail does not wrap past steal_head. The victim’s pop CAS may fail and retry, but this is safe.
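
For context, an idle worker would typically drive stealInto by probing other workers' queues, roughly as follows (a sketch only; the Worker field names and iteration order are assumptions, not Volt's actual Scheduler code):

fn trySteal(self: *Worker, workers: []Worker) ?*Header {
    for (workers) |*victim| {
        if (victim == self) continue;
        // stealInto claims up to half of the victim's queue, copies it into
        // our ring buffer, and returns the first stolen task to run now.
        if (victim.run_queue.stealInto(&self.run_queue)) |task| return task;
    }
    return null;
}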

Each atomic operation uses the minimum ordering needed for correctness:

Operation                       Ordering    Rationale
tail.load (owner)               monotonic   Only the owner writes tail; reading own writes needs no fence
tail.store (owner)              release     Stealers must see the buffer write before the tail update
head.load (owner or stealer)    acquire     Must see all buffer writes that happened before the head was updated
head.cmpxchgWeak (claim)        acq_rel     Release: publish our claim. Acquire: see current state
buffer[i].store (push)          release     Pairs with acquire loads in pop/steal
buffer[i].load (pop/steal)      acquire     Sees the value written by the pusher

The head and tail fields are aligned to separate cache lines (CACHE_LINE_SIZE) to prevent false sharing between the owner thread (writing tail) and stealer threads (reading/writing head).
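
CACHE_LINE_SIZE itself is not shown in the excerpts above; a plausible definition (an assumption here, since Volt may simply hard-code 64) is the standard library's per-target constant:

const std = @import("std");

// Padding hot atomics onto separate cache lines prevents false sharing:
// the owner's tail stores never invalidate the line holding head.
const CACHE_LINE_SIZE = std.atomic.cache_line;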

The capacity of 256 is a compile-time constant chosen for several reasons:

  1. Power of two enables bitwise masking (idx & 0xFF) instead of modulo division (see the sketch after this list).
  2. Small enough to fit in L1/L2 cache (256 pointers = 2KB on 64-bit).
  3. Large enough to absorb bursts without immediate overflow.
  4. Matches Tokio, which also uses 256 slots, enabling direct performance comparison.
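
A minimal illustration of the masking trick from point 1 (plain Zig, nothing Volt-specific). Because CAPACITY divides 2^32, the mask also stays correct when head and tail wrap around u32:

const std = @import("std");

test "power-of-two mask equals modulo" {
    const CAPACITY: u32 = 256;
    const MASK: u32 = CAPACITY - 1; // 0xFF

    var idx: u32 = 0;
    while (idx < 1024) : (idx += 1) {
        // For a power-of-two capacity, masking picks the same slot as modulo,
        // but compiles to a single AND instruction.
        try std.testing.expectEqual(idx % CAPACITY, idx & MASK);
    }
}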

The buffer stores std.atomic.Value(?*Header) — nullable pointers to type-erased task headers. The nullable type allows detecting empty slots.

When the ring buffer is full, the overflow strategy depends on whether a steal is in progress:

Queue full, no stealer:
  1. Atomically advance head by CAPACITY/2 (claim 128 tasks)
  2. Push those 128 tasks to global queue (single lock acquisition)
  3. Write new task to buffer, advance tail
  Result: 128 tasks moved to global, buffer half empty

Queue full, stealer active:
  1. Push new task directly to global queue
  Result: Single task to global, no disruption to stealer

This design ensures that overflow is always possible and that the global queue absorbs excess work. The batch overflow of 128 tasks amortizes the mutex cost of the global queue, making it negligible per task.

Before overflow (full):

  +---+---+---+---+---+---+ ... +---+---+
  | 0 | 1 | 2 | 3 | 4 | 5 |     |254|255|
  +---+---+---+---+---+---+ ... +---+---+
    ^                                   ^
  head=0                            tail=256

After overflow:

  +---+---+---+---+---+---+ ... +---+---+
  |   |   |   |128|129|130|     |254|255| NEW
  +---+---+---+---+---+---+ ... +---+---+
              ^                         ^
          head=128                   tail=257

  Tasks 0-127 moved to global queue
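
The GlobalTaskQueue is referenced but not defined in this section; the following is a hypothetical sketch of the batching idea only (the Node type and pushBatch method are illustrative assumptions, not Volt's actual API):

const std = @import("std");

// Hypothetical mutex-protected global queue, to show why handing over 128
// overflowed tasks at once amortizes the lock cost to roughly 1/128 per task.
const GlobalQueueSketch = struct {
    mutex: std.Thread.Mutex = .{},
    head: ?*Node = null,
    tail: ?*Node = null,

    const Node = struct { next: ?*Node = null };

    fn pushBatch(self: *GlobalQueueSketch, first: *Node, last: *Node) void {
        self.mutex.lock(); // one lock acquisition covers the whole batch
        defer self.mutex.unlock();
        if (self.tail) |t| {
            t.next = first;
        } else {
            self.head = first;
        }
        self.tail = last;
    }
};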

The WorkStealQueue itself is a FIFO queue. LIFO behavior comes from the separate lifo_slot on each Worker. When the worker receives a new task (e.g., from a waker callback), the new task goes into the LIFO slot, and any previously held LIFO task is evicted to the ring buffer:

Worker receives task C:

  Before:  LIFO: [B]    Ring: [... A ...]
  After:   LIFO: [C]    Ring: [... A ... B]    (B evicted to ring)

This separation means the ring buffer is pure FIFO, stealers always get the oldest tasks (fairest), and the owner always executes the newest task first (best cache locality).
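
A sketch of that eviction step, assuming lifo_slot is a std.atomic.Value(?*Header) on the Worker and run_queue is the worker's WorkStealQueue (the field names are illustrative):

fn scheduleLocal(self: *Worker, task: *Header, global: *GlobalTaskQueue) void {
    // The new task takes the LIFO slot; whatever was there is evicted
    // into the FIFO ring buffer (possibly overflowing to the global queue).
    const evicted = self.lifo_slot.swap(task, .acq_rel);
    if (evicted) |old| {
        self.run_queue.push(old, global);
    }
}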

  • WorkStealQueue: src/internal/scheduler/Header.zig
  • Worker integration: src/internal/scheduler/Scheduler.zig