Counting Semaphore Algorithm
Volt’s Semaphore is a counting semaphore that limits concurrent access to a resource. Tasks can acquire N permits (decrement) and release N permits (increment). When permits are unavailable, tasks yield to the scheduler rather than blocking the OS thread. The Mutex and RwLock primitives are built on top of the Semaphore.
The algorithm was carefully designed to prevent permit starvation — a subtle bug where permits “slip through” between a failed CAS and waiter queue insertion, causing waiters to starve indefinitely.
The Starvation Problem
Consider a naive semaphore implementation:
```
        Thread A (acquirer):                  Thread B (releaser):
     1. CAS: try subtract N permits
        -> fails, not enough
     2.
     3.                                       release(N): add N permits to atomic
     4. lock mutex, add to waiter queue
     5. unlock mutex, sleep
```

Between steps 1 and 4, Thread B releases permits (step 3). Those permits go to the atomic counter. Thread A does not see them because it already decided to wait. The permits are now “floating” in the atomic while a waiter sits in the queue. If all subsequent `release()` calls follow the same pattern (check waiters, find none yet queued, add to atomic), the waiter starves.
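A minimal sketch of this naive design makes the window concrete. This is illustrative code, not Volt's; a bare counter stands in for the real waiter queue:

```zig
const std = @import("std");

/// Hypothetical naive semaphore, for illustration only (NOT Volt's code).
const NaiveSemaphore = struct {
    permits: std.atomic.Value(usize),
    mutex: std.Thread.Mutex = .{},
    waiter_count: usize = 0, // stand-in for a real waiter queue

    fn acquireWait(self: *NaiveSemaphore, num: usize) bool {
        // Step 1: lock-free CAS attempt.
        var curr = self.permits.load(.acquire);
        while (curr >= num) {
            curr = self.permits.cmpxchgWeak(curr, curr - num, .acq_rel, .acquire) orelse
                return true; // got the permits
        }
        // RACE WINDOW: a concurrent release() landing here (step 3) adds its
        // permits to the atomic, but this thread has already decided to wait.
        self.mutex.lock(); // step 4: queue ourselves (too late)
        defer self.mutex.unlock();
        self.waiter_count += 1;
        return false; // step 5: caller sleeps, possibly forever
    }

    fn release(self: *NaiveSemaphore, num: usize) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        if (self.waiter_count == 0) {
            // No waiter visible yet: permits float in the atomic counter.
            _ = self.permits.fetchAdd(num, .release);
        }
        // else: serve waiters (elided)
    }
};
```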
The Volt Solution: Two-Tier Release
Volt eliminates the race with two mechanisms: (1) lock-before-CAS on the acquire slow path, and (2) a two-tier release with a lock-free fast waiter slot.
```
            acquireWait()                          release()
      +-----------------------+        +----------------------------+
Fast  | CAS for full amount   |        | swap fast_waiter → 0       |
path: | (lock-free, no mutex) |        | If non-zero:               |
      +-----------+-----------+        |   serve waiter directly    |
                  |                    |   (no mutex needed)        |
          (insufficient)               +-------------+--------------+
                  |                                  |
      +-----------v-----------+              (slot was empty)
      | acquireWaitDirect()   |                      |
Slow  | Lock mutex            |        +-------------v--------------+
path: | CAS to drain what's   |        | Lock mutex                 |
      |   available           |        | If waiters in list:        |
      | If still not enough:  |        |   serve directly           |
      |   CAS fast_waiter     |        | If no waiters:             |
      |     0 → ptr           |        |   add to atomic            |
      |   If CAS fails:       |        | Unlock mutex               |
      |     push to list      |        +----------------------------+
      | Unlock mutex          |
      +-----------------------+
```

Key invariant: Permits NEVER float in the atomic counter when waiters are queued. `release()` serves waiters directly; only surplus permits go to the atomic.
```mermaid
graph TD
    R["release(N)"] --> SW["swap fast_waiter → 0"]
    SW -->|non-zero| FW["Serve waiter directly\n(lock-free)"]
    FW -->|fully satisfied| SU{"Surplus\npermits?"}
    SU -->|yes| RS["releaseSurplus()\n(takes mutex)"]
    SU -->|no| WK["Wake waiter"]
    FW -->|partial| PB["Put waiter back\nin fast slot or list"]
    SW -->|zero| MX["Lock mutex"]
    MX --> CK{"Waiters\nin list?"}
    CK -->|yes| RW["releaseToWaiters()\nserve FIFO"]
    CK -->|no| FA["fetchAdd to atomic"]
    RS --> WK

    style FW fill:#22c55e,color:#000
    style MX fill:#6b7280,color:#fff
    style R fill:#3b82f6,color:#fff
```

Data Structures
The semaphore uses four components:
```zig
pub const Semaphore = struct {
    /// Available permits (modified under mutex OR via CAS).
    /// Cache-line aligned to avoid false sharing with mutex/waiters
    /// under contention — permits is CAS'd by every tryAcquire on the fast path.
    permits: std.atomic.Value(usize) align(std.atomic.cache_line),

    /// Atomic single-waiter fast slot for lock-free release.
    /// Value: 0 = empty, non-zero = pointer to Waiter.
    /// NOT the same as the has_waiters flag that deadlocked (see below).
    fast_waiter: std.atomic.Value(usize),

    /// Protects the waiter list. Taken by release() slow path and
    /// acquireWait()'s slow path. NOT taken by tryAcquire().
    mutex: std.Thread.Mutex,

    /// FIFO waiter queue (push front, serve from back)
    waiters: WaiterList,
};
```

The `permits` field is `align(std.atomic.cache_line)` because every `tryAcquire` CAS touches it on the fast path. Without alignment, the CAS would bounce the cache line containing `mutex` and `waiters`, adding ~20ns per operation under contention.
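Because field alignment propagates to the containing struct, this layout property can be checked at compile time. A small sketch, assuming Zig's `std.atomic.cache_line` constant:

```zig
// Sketch: assert the cache-line alignment survives future refactoring.
comptime {
    std.debug.assert(@alignOf(Semaphore) >= std.atomic.cache_line);
}
```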
The fast_waiter field is an atomic pointer that stores at most one waiter for lock-free release. See Fast Waiter Slot for the full design rationale and why this is fundamentally different from the has_waiters flag that deadlocked.
Each waiter tracks how many permits it still needs:
```zig
pub const Waiter = struct {
    /// Remaining permits needed (starts at requested, decrements to 0)
    state: std.atomic.Value(usize),

    /// Original request (for cancellation math)
    permits_requested: usize,

    /// Wake callback
    waker: ?WakerFn,
    waker_ctx: ?*anyopaque,

    /// Intrusive list pointers (zero allocation)
    pointers: Pointers(Waiter),
};
```

Waiters are intrusive — they are embedded in the `AcquireFuture` struct that the caller owns. No heap allocation is needed per wait operation.
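As a rough sketch of what that embedding might look like (field and initializer names here are hypothetical, not Volt's actual `AcquireFuture` API):

```zig
// Hypothetical sketch of the embedding; names are illustrative.
pub const AcquireFuture = struct {
    sem: *Semaphore,
    waiter: Waiter, // lives inside the future: no per-wait heap allocation

    pub fn init(sem: *Semaphore, permits: usize) AcquireFuture {
        return .{
            .sem = sem,
            .waiter = .{
                .state = std.atomic.Value(usize).init(permits),
                .permits_requested = permits,
                .waker = null,
                .waker_ctx = null,
                .pointers = .{}, // intrusive links start detached
            },
        };
    }
};
```

Because the waiter's storage is owned by the caller's future, it must outlive the wait; cancellation (below) is what makes early destruction safe.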
The tryAcquire Fast Path
For the uncontended case, tryAcquire is a simple lock-free CAS loop with no mutex involvement:
```zig
pub fn tryAcquire(self: *Self, num: usize) bool {
    var current = self.permits.load(.acquire);
    while (true) {
        if (current < num) return false;

        if (self.permits.cmpxchgWeak(current, current - num, .acq_rel, .acquire)) |new_val| {
            current = new_val;
            continue;
        }
        return true;
    }
}
```

This is the common case for an uncontended semaphore and compiles down to a tight CAS loop with no function calls.
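A usage sketch, assuming only the `Semaphore.init(n)` constructor shown later in the Mutex section plus `tryAcquire`/`release`:

```zig
test "uncontended fast path (usage sketch)" {
    var sem = Semaphore.init(4); // 4 permits

    try std.testing.expect(sem.tryAcquire(2)); // CAS: 4 -> 2
    try std.testing.expect(sem.tryAcquire(2)); // CAS: 2 -> 0
    try std.testing.expect(!sem.tryAcquire(1)); // no permits left, returns immediately

    sem.release(2);
    try std.testing.expect(sem.tryAcquire(1));
}
```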
The acquireWait Algorithm
When tryAcquire fails and the caller is willing to wait:
```zig
pub fn acquireWait(self: *Self, waiter: *Waiter) bool {
    const needed = waiter.state.load(.acquire);

    // === Fast path: lock-free CAS for full acquisition ===
    var curr = self.permits.load(.acquire);
    while (curr >= needed) {
        if (self.permits.cmpxchgWeak(curr, curr - needed, .acq_rel, .acquire)) |updated| {
            curr = updated;
            continue;
        }
        waiter.state.store(0, .release);
        return true; // Got all permits, no mutex needed
    }

    // === Slow path: lock BEFORE CAS ===
    self.mutex.lock();

    // Under lock: drain available permits via CAS
    var acquired: usize = 0;
    curr = self.permits.load(.acquire);
    while (acquired < needed and curr > 0) {
        const take = @min(curr, needed - acquired);
        if (self.permits.cmpxchgWeak(curr, curr - take, .acq_rel, .acquire)) |updated| {
            curr = updated;
            continue;
        }
        acquired += take;
        curr = self.permits.load(.acquire);
    }

    if (acquired >= needed) {
        waiter.state.store(0, .release);
        self.mutex.unlock();
        return true; // Got all permits under lock
    }

    // Partially fulfilled -- update waiter state and queue
    if (acquired > 0) {
        var rem = acquired;
        _ = waiter.assignPermits(&rem);
    }

    self.waiters.pushFront(waiter); // FIFO: push front, serve from back
    self.mutex.unlock();
    return false; // Caller must yield
}
```

The CAS in the slow path is still needed because `tryAcquire()` can concurrently succeed (it is lock-free). But the critical property is that we hold the mutex when we decide to queue the waiter. Since `release()` also holds the mutex when checking the waiter queue, the race window is closed.
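For context, a hypothetical sketch of how a future's poll loop might drive `acquireWait` (the real driver is `AcquireFuture`; names here are illustrative):

```zig
const Poll = enum { ready, pending };

// Hypothetical driver sketch, not Volt's actual poll implementation.
fn pollAcquire(sem: *Semaphore, waiter: *Waiter) Poll {
    if (sem.acquireWait(waiter)) {
        return .ready; // all permits held; proceed immediately
    }
    // The waiter was queued while holding the mutex, so any release() that
    // runs from now on is guaranteed to see it. Yield to the scheduler;
    // a releaser will serve the waiter directly and invoke its waker.
    return .pending;
}
```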
The acquireWaitDirect Algorithm
acquireWaitDirect is called by AcquireFuture.poll(.init), which already called tryAcquire() and failed. It skips the lock-free CAS fast path (which would almost never succeed nanoseconds after tryAcquire just failed) and goes straight to the mutex slow path.
The key addition: after the mutex slow path determines the waiter must wait, it tries the fast waiter slot before falling back to the linked list:
```zig
fn acquireWaitDirect(self: *Self, waiter: *Waiter) bool {
    const needed = waiter.state.load(.acquire);

    self.mutex.lock();

    // Under lock: CAS to drain available permits
    // (same as acquireWait slow path)
    // ...

    // Try single-waiter fast slot (avoids mutex on wake path)
    if (self.fast_waiter.cmpxchgStrong(0, @intFromPtr(waiter), .release, .monotonic) == null) {
        self.mutex.unlock();
        return false;
    }

    // Fast slot occupied — fall back to linked list
    self.waiters.pushFront(waiter);
    self.mutex.unlock();
    return false;
}
```

The CAS `0 → ptr` succeeds only if the fast slot is empty. If another waiter already occupies it, this waiter goes to the linked list. This guarantees at most one waiter in the fast slot at any time.
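The single-occupancy guarantee can be illustrated with two waiters racing for the slot (a sketch; `sem`, `waiter_a`, and `waiter_b` are assumed in scope, and the CASes run sequentially here for clarity):

```zig
// Only one CAS from 0 can win; the guarantee also holds under real races.
const a_won = sem.fast_waiter.cmpxchgStrong(0, @intFromPtr(waiter_a), .release, .monotonic) == null;
const b_won = sem.fast_waiter.cmpxchgStrong(0, @intFromPtr(waiter_b), .release, .monotonic) == null;
std.debug.assert(!(a_won and b_won)); // the loser must go to the linked list
```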
The release Algorithm
release() uses a two-tier approach: try the fast waiter slot first (lock-free), then fall back to the mutex.
Tier 1: Fast waiter slot (lock-free)
```zig
pub fn release(self: *Self, num: usize) void {
    if (num == 0) return;

    // Try to serve fast_waiter without mutex
    const fast_ptr = self.fast_waiter.swap(0, .acq_rel);
    if (fast_ptr != 0) {
        const w: *Waiter = @ptrFromInt(fast_ptr);

        // CRITICAL: Copy waker BEFORE modifying state.
        // AcquireFuture may deinit after isComplete() returns true.
        const waker_fn = w.waker;
        const waker_ctx = w.waker_ctx;
        const remaining = w.state.load(.monotonic);

        if (num >= remaining) {
            // Fully satisfy the waiter
            w.state.store(0, .release);
            const surplus = num - remaining;
            if (surplus > 0) {
                self.releaseSurplus(surplus);
            }
            // Wake AFTER state is visible
            if (waker_fn) |wf| if (waker_ctx) |ctx| wf(ctx);
            return;
        }

        // Partial: assign what we can, put waiter back
        w.state.store(remaining - num, .release);
        if (self.fast_waiter.cmpxchgStrong(0, @intFromPtr(w), .release, .monotonic) == null) {
            return; // Back in fast slot
        }
        // Fast slot taken by new waiter — push to list
        self.mutex.lock();
        self.waiters.pushFront(w);
        self.mutex.unlock();
        return;
    }

    // Fast slot empty: fall through to the mutex slow path (Tier 3)
    // ...
}
```

The swap atomically extracts the waiter pointer. If non-zero, the releaser owns the waiter and can serve it directly without any mutex. The “copy before modify” pattern prevents use-after-free: once `w.state.store(0, .release)` executes, the polling task may see `isComplete() == true` and deinit the future immediately.
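To spell out the hazard the copy avoids, here is the bad interleaving as a comment sketch:

```zig
// Why copy-before-modify matters: with the order reversed, this interleaving
// is a use-after-free (sketch, not real code):
//
//   releaser thread                    acquiring task
//   ---------------                    --------------
//   w.state.store(0, .release);
//                                      isComplete() == true
//                                      future deinits; memory holding `w` gone
//   const wf = w.waker;                // UAF: reads freed memory
//
// Copying waker/waker_ctx first means the releaser never touches `w` again
// after the store that publishes completion.
```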
Tier 2: releaseSurplus (mutex, for remaining list waiters)
When the fast waiter was fully satisfied but there are surplus permits:
```zig
fn releaseSurplus(self: *Self, surplus: usize) void {
    self.mutex.lock();
    if (self.waiters.back() == null) {
        _ = self.permits.fetchAdd(surplus, .release);
        self.mutex.unlock();
        return;
    }
    self.releaseToWaiters(surplus); // releaseToWaiters unlocks the mutex
}
```

Tier 3: Mutex slow path (linked list)
When the fast slot was empty (swap returned 0), fall back to the standard mutex-protected path:
```zig
// (continued from release)
self.mutex.lock();

if (self.waiters.back() == null) {
    // No waiters -- surplus goes to atomic
    _ = self.permits.fetchAdd(num, .release);
    self.mutex.unlock();
    return;
}

// Serve queued waiters from the linked list
self.releaseToWaiters(num);
```

The `releaseToWaiters` function serves waiters from the back (oldest first, FIFO), collects wakers into a `WakeList`, releases the mutex, then wakes all collected waiters outside the lock:
```zig
fn releaseToWaiters(self: *Self, num: usize) void {
    var wake_list: WakeList = .{};
    var rem = num;

    while (rem > 0) {
        const waiter = self.waiters.back() orelse break;
        const remaining = waiter.state.load(.monotonic);

        if (remaining == 0) {
            _ = self.waiters.popBack();
            // Collect waker for batch wake after unlock
            // ...
            continue;
        }

        const assign = @min(rem, remaining);
        waiter.state.store(remaining - assign, .release);
        rem -= assign;

        if (remaining - assign == 0) {
            _ = self.waiters.popBack();
            // Collect waker
        } else {
            break; // Partially satisfied, no more permits
        }
    }

    if (rem > 0) {
        _ = self.permits.fetchAdd(rem, .release);
    }

    self.mutex.unlock();
    wake_list.wakeAll(); // Wake AFTER releasing the lock
}
```

Important details:
- FIFO ordering: Waiters are pushed to the front and served from the back, ensuring fairness.
- Batch waking: Wakers are collected into a `WakeList` and invoked after the mutex is released. This prevents holding the lock during potentially expensive wake operations.
- Partial fulfillment: A waiter requesting N permits may receive them incrementally across multiple `release()` calls. The `waiter.state` field tracks the remaining needed permits (see the worked sequence below).
- Direct handoff: Permits go directly from the releaser to the waiter without passing through the atomic counter. This is what prevents starvation.
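A worked sequence tying these together (hypothetical driver code; waiter setup elided):

```zig
var sem = Semaphore.init(0);      // no permits available
// `waiter` requests 5 permits (state == 5); initialization elided

_ = sem.acquireWait(&waiter);     // false: queued, still needs 5

sem.release(2);                   // direct handoff: waiter.state 5 -> 3;
                                  // the permits counter stays at 0 (nothing floats)
sem.release(3);                   // waiter.state 3 -> 0; waker fires;
                                  // the waiter now holds all 5 permits
```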
Cancellation
When a future is cancelled (e.g., via select or timeout), any partially acquired permits must be returned. Cancellation tries the fast waiter slot first (lock-free), then falls back to the mutex:
```zig
pub fn cancelAcquire(self: *Self, waiter: *Waiter) void {
    if (waiter.isComplete()) return;

    // Try lock-free removal from fast slot
    if (self.fast_waiter.cmpxchgStrong(@intFromPtr(waiter), 0, .acq_rel, .monotonic) == null) {
        // Removed from fast slot — return partial permits
        const remaining = waiter.state.load(.monotonic);
        const acquired = waiter.permits_requested - remaining;
        if (acquired > 0) {
            self.release(acquired);
        }
        return;
    }

    // Not in fast slot — take mutex, remove from list
    self.mutex.lock();
    if (waiter.isComplete()) {
        self.mutex.unlock();
        return;
    }

    self.waiters.remove(waiter);
    waiter.pointers.reset();

    const remaining = waiter.state.load(.monotonic);
    const acquired = waiter.permits_requested - remaining;
    self.mutex.unlock();

    // Return partial permits (may wake other waiters!)
    if (acquired > 0) {
        self.release(acquired);
    }
}
```

The CAS `ptr → 0` succeeds only if the waiter is still in the fast slot. If `release()` already swapped it out, the CAS fails and we fall through to the mutex path. The partial-permit return calls `release()`, which may wake other waiters that were blocked.
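The partial-permit arithmetic, spelled out with arbitrary example values:

```zig
// Cancellation math (sketch): a waiter that requested 5 permits with 3 still
// outstanding must give back the 2 it already received.
//
//   permits_requested = 5
//   remaining         = waiter.state.load(...) == 3
//   acquired          = 5 - 3 == 2
//
// cancelAcquire therefore calls self.release(2), which may in turn serve
// and wake other queued waiters.
```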
The Mutex as Semaphore(1)
The Volt Mutex is implemented as a single-permit semaphore:
```zig
pub const Mutex = struct {
    sem: Semaphore,

    pub fn init() Mutex {
        return .{ .sem = Semaphore.init(1) };
    }

    pub fn tryLock(self: *Self) bool {
        return self.sem.tryAcquire(1);
    }

    pub fn unlock(self: *Self) void {
        self.sem.release(1);
    }
};
```

This means the Mutex inherits all of the Semaphore's correctness guarantees: starvation prevention, FIFO ordering, safe cancellation, zero-allocation waiters, and the lock-free fast waiter slot.
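A usage sketch exercising only the non-blocking API shown above (the async lock path goes through the semaphore's wait machinery):

```zig
test "mutex usage (sketch; async lock path elided)" {
    var m = Mutex.init();

    try std.testing.expect(m.tryLock());
    try std.testing.expect(!m.tryLock()); // the single permit is taken
    m.unlock();
    try std.testing.expect(m.tryLock());
    m.unlock();
}
```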
Performance Characteristics
Section titled “Performance Characteristics”| Operation | Path | Uncontended | Contended |
|---|---|---|---|
| tryAcquire | CAS loop | Single CAS (~5-10ns) | CAS retry loop |
| acquireWait | Fast path (CAS) | Single CAS (~5-10ns) | Falls to slow path |
| acquireWait | Slow path (mutex) | Mutex + CAS (~50ns) | Mutex + queue (~100ns) |
| acquireWaitDirect | Mutex + fast slot CAS | Mutex + CAS (~50ns) | Mutex + queue (~100ns) |
| release | Fast waiter slot | Swap (~10ns) | Swap + serve (~30ns) |
| release | No fast waiter (mutex, no waiters) | Mutex + fetchAdd (~30ns) | Same |
| release | Mutex + list waiters | Mutex + direct handoff (~50ns) | Mutex + wake (~100ns) |
| cancelAcquire | Fast slot CAS | CAS (~5ns) | Falls to mutex path |
The two-tier release adds a single atomic swap to every release() call (~2ns when the slot is empty). In exchange, the common contended case (one waiter) avoids the mutex entirely, saving ~30-50ns per operation. Under the contended semaphore benchmark (8 tasks, 2 permits), this optimization reduced latency from ~220 ns/op to ~139 ns/op.
Source Files
- Semaphore: `src/sync/Semaphore.zig`
- Mutex (built on Semaphore): `src/sync/Mutex.zig`
- Fast waiter slot design: Fast Waiter Slot
References
Section titled “References”- Edsger W. Dijkstra, “Cooperating Sequential Processes,” Programming Languages, NATO Advanced Study Institute, 1968. https://www.cs.utexas.edu/~EWD/transcriptions/EWD01xx/EWD123.html
- Tokio semaphore implementation: `tokio/src/sync/batch_semaphore.rs`
- Maurice Herlihy and Nir Shavit, The Art of Multiprocessor Programming, Morgan Kaufmann, 2012 — Chapter 8 (monitors and blocking synchronization).