Performance Tuning
Volt is designed to be fast by default, but real workloads benefit from tuning. This guide covers the knobs available and when to turn them.
Worker Count
The num_workers config controls how many OS threads run the async scheduler. The default (0) auto-detects based on CPU count.
```zig
const volt = @import("volt");

// Auto-detect (one worker per CPU core)
try volt.runWith(allocator, .{}, myServer);

// Explicit: 4 workers
try volt.runWith(allocator, .{ .num_workers = 4 }, myServer);

// Manual runtime setup
var io = try volt.Io.init(allocator, .{
    .num_workers = 8,
    .max_blocking_threads = 256,
});
defer io.deinit();
```

Guidelines
| Workload | Recommended Workers |
|---|---|
| I/O-bound (web server, proxy) | CPU count (default) |
| Mixed I/O + compute | CPU count - 1 (leave room for blocking pool) |
| Compute-heavy with I/O | CPU count / 2 (offload compute to blocking pool) |
| Testing / debugging | 1 (deterministic execution) |
More workers is not always better. Each worker adds memory overhead (local queue, stack, LIFO slot) and increases contention on the global queue.
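As a sketch of the mixed-workload row, the worker count can be derived at startup. std.Thread.getCpuCount() is Zig's standard CPU-count query; the minus-one policy is just the guideline above, not a Volt requirement, and the setup otherwise mirrors the runWith example earlier.

```zig
// Sketch for the "Mixed I/O + compute" row: one worker per core, minus one
// core left for the blocking pool.
const cpu_count = try std.Thread.getCpuCount();
const workers = @max(1, cpu_count - 1);
try volt.runWith(allocator, .{ .num_workers = @intCast(workers) }, myServer);
```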
LIFO Slot and Cache Locality
Each worker has a LIFO slot — a single-task fast path that bypasses the local queue entirely. When a task wakes another task on the same worker, the woken task goes into the LIFO slot and runs next.
This matters because:
- The woken task likely accesses the same cache lines as the waker (temporal locality).
- Skipping the queue reduces latency for ping-pong patterns (mutex lock/unlock, channel send/recv).
- The scheduler caps LIFO polls at MAX_LIFO_POLLS_PER_TICK = 3 to prevent starvation of queued tasks.
When LIFO Helps
- Mutex contention: unlock() wakes the next waiter. If that waiter runs immediately on the same core, the mutex’s memory is still hot in L1 cache.
- Channel ping-pong: Producer sends, consumer runs on same worker, consumes, producer runs again.
- Barrier release: Leader wakes all participants; the first one runs via LIFO.
When LIFO Hurts
If many tasks constantly wake each other, LIFO can cause a small set of tasks to monopolize workers while other tasks starve. The MAX_LIFO_POLLS_PER_TICK cap exists for this reason.
You cannot disable LIFO via config — it is always active. If you see starvation in profiling, redesign the wake pattern (e.g., batch work into fewer tasks).
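One way to batch is sketched below, reusing the bounded channel and trySend/tryRecv calls shown later in this guide; the slice payload type, the capacity, and the process() helper are illustrative assumptions.

```zig
// Sketch: send a whole slice per message so the consumer is woken once per
// batch instead of once per item, keeping the LIFO slot from ping-ponging
// between two hot tasks.
var batch_ch = try volt.channel.bounded([]const u64, allocator, 64);

fn produceBatch(items: []const u64) void {
    _ = batch_ch.trySend(items); // one send, one wake, many items
}

fn drainBatches() void {
    while (true) {
        switch (batch_ch.tryRecv()) {
            .value => |batch| for (batch) |item| process(item), // process() is hypothetical
            .empty => return,
            .closed => return,
        }
    }
}
```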
Cooperative Budgeting
The scheduler enforces a budget of 128 polls per tick (BUDGET_PER_TICK). After 128 task polls, the worker performs maintenance:
- Check the global queue for new tasks
- Process I/O completions from the backend
- Fire expired timers
- Update the adaptive global queue interval
This prevents a single long-running future chain from starving I/O and timers.
Implications for Your Code
- Futures that call many sub-futures in a single poll() consume budget rapidly.
- If your future does O(1000) work per poll, consider yielding manually (see the sketch below).

volt.task.yield() is an engine internal — it is a no-op hint that the scheduler may use to preempt.
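A minimal sketch of spreading that work across polls, assuming the hand-written future pattern of a state struct with a poll function. The waker parameter and its wake() call are illustrative assumptions, not Volt's documented API; the point is to do a bounded chunk of work per poll.

```zig
// Sketch: bound the work done per poll so a long computation cannot burn the
// whole 128-poll budget in one go.
const Crunch = struct {
    items: []const u64,
    index: usize = 0,
    sum: u64 = 0,

    const CHUNK = 256; // work per poll (assumed tuning value)

    fn poll(self: *Crunch, waker: anytype) ?u64 {
        const end = @min(self.index + CHUNK, self.items.len);
        while (self.index < end) : (self.index += 1) {
            self.sum +%= self.items[self.index] *% 31; // stand-in for real work
        }
        if (self.index < self.items.len) {
            waker.wake(); // ask to be polled again (assumed waker API)
            return null; // still pending
        }
        return self.sum; // done
    }
};
```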
Adaptive Global Queue Interval
The scheduler uses an EWMA (exponentially weighted moving average) to estimate average task poll duration. It adjusts how often it checks the global queue:
- Fast tasks (< 1us each) → check every ~128 polls
- Slow tasks (> 100us each) → check every ~8 polls
This self-tuning happens automatically. You do not need to configure it.
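For intuition only, the shape of such an estimate is sketched below. The smoothing factor and the middle tier are assumptions, not Volt's constants; only the ~128 and ~8 poll intervals come from this guide.

```zig
// Illustrative EWMA update and interval choice; alpha and the middle tier
// are assumed values.
fn updatePollEstimate(avg_ns: f64, sample_ns: f64) f64 {
    const alpha = 0.25; // assumed smoothing factor
    return alpha * sample_ns + (1.0 - alpha) * avg_ns;
}

fn globalQueueInterval(avg_ns: f64) u32 {
    if (avg_ns < 1_000) return 128; // fast tasks (< 1us): check rarely
    if (avg_ns > 100_000) return 8; // slow tasks (> 100us): check often
    return 32; // assumed middle ground
}
```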
Reducing Contention
Contention is the primary bottleneck in async runtimes. Here is how to minimize it.
Choose the Right Primitive
| Pattern | Wrong Choice | Right Choice |
|---|---|---|
| Shared config | Mutex protecting a struct | Watch channel |
| Request/response | Channel with capacity 1 | Oneshot |
| Rate limiting | Mutex + counter | Semaphore |
| One-time init | Mutex + bool flag | OnceCell |
| Read-heavy data | Mutex | RwLock |
See the Choosing a Primitive guide for a full decision tree.
Avoid Shared State When Possible
Channels move data between tasks without sharing. If you can model your problem as message passing instead of shared state, do it.
```zig
// WORSE: Shared state with mutex
var mutex = volt.sync.Mutex.init();
var shared_counter: u64 = 0;

fn incrementCounter() void {
    if (mutex.tryLock()) {
        defer mutex.unlock();
        shared_counter += 1;
    }
}

// BETTER: Channel-based accumulation
var ch = try volt.channel.bounded(u64, allocator, 1024);

fn sendIncrement() void {
    _ = ch.trySend(1);
}

fn accumulator() void {
    var total: u64 = 0;
    while (true) {
        switch (ch.tryRecv()) {
            .value => |v| total += v,
            .empty => return,
            .closed => return,
        }
    }
}
```

Partition State Across Workers
Instead of one Mutex-protected map, use N maps (one per worker) and route by key hash:
```zig
const NUM_SHARDS = 16;
var shards: [NUM_SHARDS]struct {
    mutex: volt.sync.Mutex,
    data: SomeMap,
} = undefined;

fn getShard(key: u64) *@TypeOf(shards[0]) {
    return &shards[key % NUM_SHARDS];
}
```
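A hypothetical usage of the shards, reusing the tryLock()/unlock() calls from the earlier example; SomeMap.put() and the value type are assumptions.

```zig
// Writers whose keys hash to different shards never touch the same mutex,
// so they do not contend with each other.
fn put(key: u64, value: u64) void {
    const shard = getShard(key);
    if (shard.mutex.tryLock()) {
        defer shard.mutex.unlock();
        shard.data.put(key, value) catch {}; // SomeMap.put() is assumed
    }
}
```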
Memory Allocation in Hot Paths
Allocation is the hidden enemy of async performance. Each std.heap.page_allocator.alloc() is a syscall. In hot paths:
- Pre-allocate buffers before entering the event loop.
- Use arena allocators for request-scoped data (see the arena sketch below).
- Pool long-lived objects (connections, task contexts).
- Avoid ArrayList.append in poll functions — it may reallocate.
```zig
// Pre-allocate a buffer pool before the event loop
var pool: [256][4096]u8 = undefined;
var free_list: std.ArrayList(usize) = .empty;
try free_list.ensureTotalCapacity(allocator, 256);
for (0..256) |i| free_list.appendAssumeCapacity(i);
```

The blocking pool (io.concurrent) is the right place for allocation-heavy work. It runs on separate threads that will not starve the async scheduler.
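For the arena bullet above, a sketch of request-scoped allocation: std.heap.ArenaAllocator is Zig's standard arena, while the handler signature and the request_len parameter are hypothetical.

```zig
// Everything the handler allocates is freed in one shot when the arena is
// deinitialized, so the hot path never frees objects individually.
fn handleRequest(gpa: std.mem.Allocator, request_len: usize) !void {
    var arena = std.heap.ArenaAllocator.init(gpa);
    defer arena.deinit(); // one bulk free per request
    const alloc = arena.allocator();

    const body = try alloc.alloc(u8, request_len); // no individual free needed
    _ = body; // ... parse and handle the request ...
}
```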
Blocking Pool Tuning
The blocking pool spawns OS threads on demand for CPU-intensive or blocking I/O work.
```zig
var io = try volt.Io.init(allocator, .{
    .max_blocking_threads = 512, // Max concurrent blocking tasks
    .blocking_keep_alive_ns = 10 * std.time.ns_per_s, // Idle thread timeout
});
```

- max_blocking_threads: Upper bound on OS threads. Default 512. Lower this if memory is constrained.
- blocking_keep_alive_ns: How long idle blocking threads survive before exiting. Default 10 seconds. Increase for bursty workloads; decrease to reclaim memory faster.
Benchmark-Driven Optimization Workflow
Never optimize without measuring. Follow this workflow:
- Establish a baseline using zig build bench or a custom benchmark (see the sketch after this list).
- Profile with perf record (Linux) or Instruments (macOS).
- Identify the bottleneck — is it lock contention? Allocation? Syscalls? Cache misses?
- Make one change and re-benchmark.
- Run correctness tests (zig build test-all) after every optimization.
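For the baseline step, a minimal custom-benchmark sketch: std.time.Timer is Zig's standard monotonic timer, and op() stands in for whatever you are measuring.

```zig
// Hand-rolled baseline: average nanoseconds per call of the code under test.
fn benchOp(iterations: usize) !u64 {
    var timer = try std.time.Timer.start();
    for (0..iterations) |_| {
        op(); // placeholder for the operation under test
    }
    return timer.read() / iterations;
}
```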
Built-in Benchmarks
Section titled “Built-in Benchmarks”# Full benchmark suite (sync, channel, async, task scheduling)zig build bench
# Compare against Tokio (Rust) baselineszig build compareWhat to Look For
| Metric | Healthy | Concerning |
|---|---|---|
| Uncontended mutex | < 15ns | > 50ns |
| Channel send/recv roundtrip | < 10ns | > 100ns |
| Semaphore acquire/release | < 15ns | > 50ns |
| Contended mutex (4 threads) | < 200ns | > 1000ns |
| Task spawn + await | < 10,000ns | > 50,000ns |
The benchmarks under bench/ compare Volt against Tokio. Volt wins 17/21 benchmarks. Tokio leads in contended semaphore (1.2x), MPMC channel (2.2x), and blocking pool spawn (2.2x). Volt’s architecture is derived from Tokio’s design.