# Vyukov MPMC Bounded Queue
Volt’s Channel(T) implements a bounded MPMC (multi-producer, multi-consumer) queue using Dmitry Vyukov’s lock-free ring buffer algorithm. This is the same algorithm used by crossbeam-channel in the Rust ecosystem. The data path (send/receive) is fully lock-free; a separate mutex protects only the waiter lists for async operations.
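For orientation, here is a minimal usage sketch. The construction calls (`init`, `deinit`) and the import path are hypothetical shapes, not Volt's documented API; the `trySend`/`tryRecv` results match the implementation excerpts shown later on this page.

```zig
const std = @import("std");
const Channel = @import("volt").Channel; // hypothetical import path

test "bounded channel round-trip (sketch)" {
    var ch = try Channel(u32).init(std.testing.allocator, 2); // hypothetical constructor
    defer ch.deinit();

    try std.testing.expect(ch.trySend(1) == .ok);
    try std.testing.expect(ch.trySend(2) == .ok);
    try std.testing.expect(ch.trySend(3) == .full); // capacity 2: backpressure

    const r = ch.tryRecv();
    try std.testing.expect(r == .value and r.value == 1); // FIFO order
}
```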
## The Core Idea
Each slot in the ring buffer has a sequence number that encodes the slot's state. Producers and consumers use CAS (compare-and-swap) on shared `head` and `tail` positions to claim slots, then use the sequence number to confirm the slot is in the expected state. No locks are needed on the data path.
Ring buffer with capacity 4:
```
 slot[0]       slot[1]       slot[2]       slot[3]
+----------+  +----------+  +----------+  +----------+
| seq: 0   |  | seq: 1   |  | seq: 2   |  | seq: 3   |
| value: - |  | value: - |  | value: - |  | value: - |
+----------+  +----------+  +----------+  +----------+
     ^
 head=0, tail=0
```

Initially, each slot's sequence equals its index. This means all slots are "writable" from the producer's perspective.
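The confirmation step boils down to a wrapped signed difference between the slot's sequence and the producer's position. A small standalone sketch of that check, mirroring the `trySend` logic shown below, with illustrative numbers for capacity 4:

```zig
const std = @import("std");

const SlotState = enum { writable, full, lost_race };

// Classify a slot for a producer at position `tail` (mirrors trySend below).
fn slotStateForSend(seq: u64, tail: u64) SlotState {
    const diff: i64 = @bitCast(seq -% tail);
    if (diff == 0) return .writable; // seq == tail: free to claim
    if (diff < 0) return .full; // older data still unconsumed
    return .lost_race; // another producer already claimed it
}

test "sequence classifies slot state (capacity 4)" {
    try std.testing.expect(slotStateForSend(5, 5) == .writable);
    try std.testing.expect(slotStateForSend(6, 9) == .full); // slot still holds tail=5's value
    try std.testing.expect(slotStateForSend(6, 5) == .lost_race); // rival stored seq = 5 + 1
}
```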
## The Algorithm
### Send (Producer)
To send a value at position `tail`:
1. Load the current `tail` position.
2. Compute `idx = tail & buf_mask` (bitmask; the buffer size is always a power of 2).
3. Load `slot[idx].sequence`.
4. Compare `sequence` with `tail`:
   - If `sequence == tail`: the slot is writable. CAS `tail` to `tail + 1` to claim it.
   - If `sequence < tail` (as a signed difference): the slot still has unconsumed data. The queue is full.
   - If `sequence > tail`: another producer claimed this slot. Reload `tail` and retry.
5. After claiming: write the value, then store `sequence = tail + 1` (marking "has data").
```zig
pub fn trySend(self: *Self, value: T) SendResult {
    var tail = self.tail.load(.monotonic);
    while (true) {
        // Closed state encoded in tail bit 63 — free check on already-loaded value
        if (isClosedBit(tail)) return .closed;

        // Bitmask index: 1 cycle vs ~20 for modulo on ARM64
        const idx: usize = @intCast(tail & self.buf_mask);
        const slot = &self.slots[idx];
        const seq = slot.sequence.load(.acquire);

        const diff: i64 = @bitCast(seq -% tail);

        if (diff == 0) {
            // Slot writable -- claim by advancing tail
            if (self.tail.cmpxchgWeak(tail, tail +% 1, .acq_rel, .monotonic)) |new_tail| {
                tail = new_tail;
                continue;
            }
            // Claimed! Write value and publish
            slot.value = value;
            slot.sequence.store(tail +% 1, .release);

            // seq_cst: Dekker protocol — pairs with seq_cst store
            // in recvWait(). See design/channel-wakeup-protocol.
            if (self.has_recv_waiters.load(.seq_cst)) {
                self.wakeOneRecvWaiter();
            }
            return .ok;
        } else if (diff < 0) {
            return .full;
        } else {
            tail = self.tail.load(.monotonic);
        }
    }
}
```

### Receive (Consumer)
To receive a value at position `head`:
1. Load the current `head` position.
2. Compute `idx = head & buf_mask` (bitmask; the buffer size is always a power of 2).
3. Load `slot[idx].sequence`.
4. Compare `sequence` with `head + 1`:
   - If `sequence == head + 1`: the slot has data. CAS `head` to `head + 1` to claim it.
   - If `sequence < head + 1` (as a signed difference): the slot has no data yet. The queue is empty.
   - If `sequence > head + 1`: another consumer claimed this slot. Reload `head` and retry.
5. After claiming: read the value, then store `sequence = head + capacity` (marking "writable again").
```zig
pub fn tryRecv(self: *Self) RecvResult {
    var head = self.head.load(.monotonic);
    while (true) {
        // Bitmask index: 1 cycle vs ~20 for modulo on ARM64
        const idx: usize = @intCast(head & self.buf_mask);
        const slot = &self.slots[idx];
        const seq = slot.sequence.load(.acquire);

        const diff: i64 = @bitCast(seq -% (head +% 1));

        if (diff == 0) {
            // Slot has data -- claim by advancing head
            if (self.head.cmpxchgWeak(head, head +% 1, .acq_rel, .monotonic)) |new_head| {
                head = new_head;
                continue;
            }
            // Claimed! Read value and release slot
            const value = slot.value;
            slot.sequence.store(head +% self.buf_cap, .release);

            // seq_cst: Dekker protocol — pairs with seq_cst store
            // in sendWait(). See design/channel-wakeup-protocol.
            if (self.has_send_waiters.load(.seq_cst)) {
                self.wakeOneSendWaiter();
            }
            return .{ .value = value };
        } else if (diff < 0) {
            if (isClosedBit(self.tail.load(.acquire))) return .closed;
            return .empty;
        } else {
            head = self.head.load(.monotonic);
        }
    }
}
```

## Sequence Number State Machine
The sequence number cycles through states as the slot is used. For a slot at index `i` with channel capacity `C`:
```
                     Producer claims
  Writable: seq = i -----------------> Has data: seq = tail + 1
        ^                                        |
        |                                        | Consumer claims
        |                                        v
        +-------- seq = head + C <-------- Read complete
```

Each full cycle advances the sequence by `C`. After N complete send/receive cycles, the sequence for slot `i` is `i + N * C`. The wrapping arithmetic ensures this works correctly even when positions overflow `u64`.
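A tiny standalone test can sanity-check this invariant by replaying the two stores from the snippets above (`seq = tail + 1` on send, `seq = head + C` on receive) for one slot:

```zig
const std = @import("std");

test "slot sequence advances by C per send/receive cycle" {
    const C: u64 = 4; // capacity
    const i: u64 = 1; // slot index
    var seq: u64 = i; // initial state: writable (seq == index)
    var pos: u64 = i; // head/tail values that map to this slot: i, i+C, i+2C, ...

    var n: u64 = 0;
    while (n < 3) : (n += 1) {
        std.debug.assert(seq == pos); // producer sees "writable"
        seq = pos +% 1; // send: mark "has data"
        std.debug.assert(seq == pos + 1); // consumer sees "has data"
        seq = pos +% C; // receive: mark "writable again"
        pos +%= C; // next visit to this slot
    }
    try std.testing.expectEqual(i + 3 * C, seq);
}
```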
## Visual Walkthrough
Starting state (capacity 4, all slots writable):
```
slot[0].seq=0  slot[1].seq=1  slot[2].seq=2  slot[3].seq=3
head=0, tail=0
```

After send(A) at tail=0:

```
slot[0].seq=1  slot[1].seq=1  slot[2].seq=2  slot[3].seq=3
    ^--- has data (seq == old_tail + 1)
head=0, tail=1
```

After send(B) at tail=1:

```
slot[0].seq=1  slot[1].seq=2  slot[2].seq=2  slot[3].seq=3
head=0, tail=2
```

After recv() at head=0, returns A:

```
slot[0].seq=4  slot[1].seq=2  slot[2].seq=2  slot[3].seq=3
    ^--- writable again (seq = head + capacity = 0 + 4)
head=1, tail=2
```

## ABA Prevention Through Sequence Numbers
The classic ABA problem in lock-free data structures occurs when a CAS succeeds even though the value was changed and then changed back. The Vyukov queue is immune to ABA because sequence numbers monotonically increase.
Consider the scenario:
1. Thread A reads `head=5`, `slot[5 & mask].seq = 6` (data ready, since `6 == 5+1`).
2. Thread A is preempted.
3. Thread B receives from head=5 and advances head to 6. Slot 5's seq becomes `5 + C`.
4. Thread B sends again; the ring wraps back to this slot, and its seq becomes `5 + C + 1`.
5. Thread A resumes and tries to CAS head from 5 to 6.
At step 5, head is now 6, not 5. Thread A’s CAS fails. Even if by some contortion head wrapped back to 5, the sequence number would be 5 + C (writable), not 6 (has data), so Thread A would see diff != 0 and retry. The sequence counter makes each slot visit unique.
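The same wrapped-difference check from `tryRecv` makes this concrete. With illustrative numbers (C = 8), a stale consumer's claim test on the revisited slot fails:

```zig
const std = @import("std");

test "stale consumer view is rejected by the sequence check" {
    const C: u64 = 8; // illustrative capacity
    const stale_head: u64 = 5; // Thread A's stale position; it expects seq == 6
    const seq: u64 = 5 + C + 1; // slot's sequence after B's recv + later send

    // Same check as tryRecv: diff == 0 would mean "has data for head=5".
    const diff: i64 = @bitCast(seq -% (stale_head +% 1));
    try std.testing.expect(diff != 0); // A cannot claim; it reloads head and retries
}
```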
## Cache Line Padding
The `head` and `tail` fields are on separate cache lines to prevent false sharing between producers and consumers:
```zig
head: std.atomic.Value(u64) align(CACHE_LINE_SIZE),
tail: std.atomic.Value(u64) align(CACHE_LINE_SIZE),
```

Without this padding, every CAS on `head` would invalidate the cache line containing `tail` on other cores, and vice versa. Since producers write `tail` and consumers write `head`, false sharing would cause severe performance degradation under contention.
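The excerpt does not show how CACHE_LINE_SIZE is defined. One plausible definition is `std.atomic.cache_line`, a target-dependent constant in Zig's standard library; a sketch under that assumption verifies the two counters really land on separate lines:

```zig
const std = @import("std");

// Assumption: Volt may hard-code CACHE_LINE_SIZE rather than use this constant.
const CACHE_LINE_SIZE = std.atomic.cache_line;

const Counters = struct {
    head: std.atomic.Value(u64) align(CACHE_LINE_SIZE),
    tail: std.atomic.Value(u64) align(CACHE_LINE_SIZE),
};

test "head and tail occupy separate cache lines" {
    const a = @offsetOf(Counters, "head");
    const b = @offsetOf(Counters, "tail");
    const gap = if (a > b) a - b else b - a;
    try std.testing.expect(gap >= CACHE_LINE_SIZE);
}
```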
Each slot interleaves its sequence counter and value in a single struct:
```zig
const Slot = struct {
    sequence: std.atomic.Value(u64),
    value: T = undefined,
};

slots: []Slot, // Interleaved sequence + value (crossbeam layout)
```

This interleaved layout gives spatial locality: the sequence check and the value read/write hit the same cache line instead of bouncing between two separate arrays. This matches crossbeam-channel's slot layout.
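A quick size check shows why the interleaving pays off for small payloads (the generic `Slot` wrapper here is illustrative, not Volt's exact declaration):

```zig
const std = @import("std");

fn Slot(comptime T: type) type {
    return struct {
        sequence: std.atomic.Value(u64),
        value: T = undefined,
    };
}

test "sequence and value share a cache line for small T" {
    // 8 bytes of sequence + 8 bytes of payload: one 16-byte slot, so the
    // sequence check and the value copy touch the same cache line.
    try std.testing.expectEqual(@as(usize, 16), @sizeOf(Slot(u64)));
}
```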
## Power-of-2 Buffer Size
The internal buffer size is always rounded up to the next power of 2 (minimum 2):
```zig
const buf_cap: u64 = nextPow2(@max(cap, 2));
```

This serves two purposes:
- **Bitmask indexing:** `tail & buf_mask` (1 cycle on ARM64) replaces `tail % buf_cap` (~20 cycles on ARM64). The mask is simply `buf_cap - 1`.
- **Minimum 2 slots for correctness:** with a single slot (`buf_cap = 1`), the sequence number after a send would be `tail + 1 = 1`, and after a receive it would be `head + capacity = 0 + 1 = 1`. The "has data" state and the "writable" state would have the same sequence number, making them indistinguishable.
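`nextPow2` itself is not shown in the excerpt; a minimal stand-in built on the standard library might look like this:

```zig
const std = @import("std");

// Hypothetical stand-in for Volt's nextPow2 (not shown in the source excerpt).
fn nextPow2(x: u64) u64 {
    // ceilPowerOfTwo errors only if the result would overflow u64,
    // which a realistic channel capacity never approaches.
    return std.math.ceilPowerOfTwo(u64, x) catch unreachable;
}

test "capacity rounds up to a power of two, minimum 2" {
    try std.testing.expectEqual(@as(u64, 2), nextPow2(@max(@as(u64, 1), 2)));
    try std.testing.expectEqual(@as(u64, 128), nextPow2(100));
    try std.testing.expectEqual(@as(u64, 128), nextPow2(128));
}
```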
For non-power-of-2 user capacities (e.g., capacity=100 with buf_cap=128), an explicit capacity check ensures the channel does not exceed the user-requested limit:
```zig
if (self.buf_cap != self.capacity) {
    const head = self.head.load(.acquire);
    if (tail -% head >= self.capacity) return .full;
}
```

This check is skipped when `buf_cap == capacity` (already a power of 2), so the common case of power-of-2 capacities pays no extra cost.
## Waiter Integration for Async Operations
The lock-free ring buffer handles the fast path. When the channel is full (for senders) or empty (for receivers), tasks must wait. This is handled by a separate waiter system that does not touch the lock-free data path:
```
Fast path (lock-free):        Slow path (mutex-protected):
  trySend() / tryRecv()         waiter_mutex
  CAS on head/tail              send_waiters: intrusive list
  sequence number check         recv_waiters: intrusive list
```

The waiter system uses a "flag-before-check" protocol to prevent lost wakeups (sketched in code after the steps below):
1. Lock `waiter_mutex`.
2. Set the `has_recv_waiters` (or `has_send_waiters`) flag with seq_cst.
3. Re-check the ring buffer under the lock (`trySend`/`tryRecv`).
4. If the re-check succeeds, clear the flag and return immediately.
5. If still blocked, add the waiter to the list and release the lock.
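A hedged sketch of the receiver side of this protocol. Everything beyond the fields and functions already shown (`waiter_mutex`, `has_recv_waiters`, `recv_waiters`, `tryRecv`) is an assumed shape, not Volt's actual API:

```zig
// Sketch only: recvWaitSketch, isEmpty, and pushBack are assumed names,
// not Volt's API. The seq_cst flag store and the under-lock re-check are
// the essential points. Closed-channel handling is omitted for brevity.
fn recvWaitSketch(self: *Self, waiter: *RecvWaiter) ?T {
    self.waiter_mutex.lock();
    defer self.waiter_mutex.unlock();

    // Steps 1-2: raise the flag first (Dekker pairing with the sender's
    // seq_cst load in trySend).
    self.has_recv_waiters.store(true, .seq_cst);

    // Steps 3-4: re-check under the lock; a send may have completed between
    // the failed fast path and acquiring the mutex.
    switch (self.tryRecv()) {
        .value => |v| {
            if (self.recv_waiters.isEmpty()) // assumed helper
                self.has_recv_waiters.store(false, .seq_cst);
            return v;
        },
        else => {},
    }

    // Step 5: still empty. Register the waiter; a sender will wake it.
    self.recv_waiters.pushBack(waiter); // assumed helper
    return null;
}
```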
After a successful trySend, the sender checks has_recv_waiters (seq_cst load) and wakes one receiver if set. After a successful tryRecv, the receiver checks has_send_waiters and wakes one sender. This ensures no waiter is stranded.
## Why Sequential Consistency?
The waiter flags implement a Dekker pattern — the fast path writes the buffer then reads the flag, while the slow path writes the flag then re-reads the buffer. These are two separate variables on different cache lines. With acquire/release, there is no cross-variable ordering guarantee: on ARM64, both sides can see stale values simultaneously, causing a lost wakeup that orphans a task forever.
Sequential consistency (`.seq_cst`) provides a total order that prevents this: at least one side is guaranteed to see the other's write. This is the same approach used by crossbeam-channel's `SyncWaker::is_empty` flag. On x86 (TSO), the cost is low: seq_cst loads compile to plain loads, so only the seq_cst stores pay a fence. On ARM64, it adds a DMB barrier that is essential for correctness.
For a detailed walkthrough with step-by-step diagrams, see Channel Wakeup Protocol.
The wakeup avoids use-after-free by copying the waker function pointer and context before setting the status flag. A lock-free single-waiter fast slot (fast_recv_waiter) avoids the mutex entirely for the common case of one blocked consumer:
```zig
fn wakeOneRecvWaiter(self: *Self) void {
    // Fast path: atomic swap on single-waiter slot (avoids mutex)
    const fast_ptr = self.fast_recv_waiter.swap(0, .acq_rel);
    if (fast_ptr != 0) {
        const w: *RecvWaiter = @ptrFromInt(fast_ptr);
        const waker_fn = w.waker;
        const waker_ctx = w.waker_ctx;
        w.status.store(WAITER_COMPLETE, .release); // Owner may destroy after this
        // ... clear flag under mutex if no list waiters remain ...
        if (waker_fn) |wf| if (waker_ctx) |ctx| wf(ctx);
        return;
    }

    // Slow path: pop from linked list under mutex
    self.waiter_mutex.lock();
    const waiter = self.recv_waiters.popFront();
    // ... clear has_recv_waiters flag if no waiters remain ...
    self.waiter_mutex.unlock();

    if (waiter) |w| {
        const waker_fn = w.waker;
        const waker_ctx = w.waker_ctx;
        w.status.store(WAITER_COMPLETE, .release); // Owner may destroy after this
        if (waker_fn) |wf| if (waker_ctx) |ctx| wf(ctx);
    }
}
```

## Comparison with Other Approaches
| Approach | Send | Recv | Bounded | Lock-free |
|---|---|---|---|---|
| Vyukov bounded (Volt) | O(1) CAS | O(1) CAS | Yes | Yes (data path) |
| Mutex + VecDeque | O(1) + lock | O(1) + lock | Optional | No |
| Michael-Scott queue | O(1) CAS + alloc | O(1) CAS + free | No | Yes |
| LCRQ (linked CRQs) | O(1) CAS | O(1) CAS | No | Yes |
The Vyukov bounded queue is ideal for Volt’s use case:
- Bounded: provides backpressure (senders wait when full), preventing unbounded memory growth.
- Lock-free data path: no mutex on send/receive, only on waiter management.
- No allocation per operation: the ring buffer is pre-allocated.
- Efficient under both low and high contention: low contention means CAS succeeds on first try; high contention means retry loops converge quickly because successful CAS operations advance the position.
## Source Files
- Channel implementation: `src/channel/Channel.zig`
## References
- Dmitry Vyukov, "Bounded MPMC Queue," 1024cores.net, 2010. http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
- crossbeam-channel (Rust), which uses the same Vyukov bounded queue: https://github.com/crossbeam-rs/crossbeam/tree/master/crossbeam-channel
- Maurice Herlihy and Nir Shavit, The Art of Multiprocessor Programming, Morgan Kaufmann, 2012 — Chapter 10 (concurrent queues).