Backpressure for the impatient: channels vs. semaphores vs. tokens

A pipeline has backpressure when the consumer can make the producer slow down, or refuse the work, or both. A bigger buffer does neither. It just defers the same problem, with more memory held hostage in the meantime. The two get confused a lot, and the difference shows up unmistakably in the tail latency numbers, so this post does the comparison directly.

When load exceeds capacity you have three sane options:

Block the caller until a slot frees. Throughput pins to whatever the consumer sustains; tail latency degrades cleanly.
Reject the caller so they retry, fall back, or shed. Latency stays low for accepted work; some work never happens.
Rate-limit the caller so they never get close to overloading you. Predictable throughput, the rest is queued or dropped.

Go has a direct idiom for each: a bounded channel for option 1, the golang.org/x/sync/semaphore package for option 1 with per-item weighting, and a time.Ticker token bucket for option 3 (and trivially option 2). They are not interchangeable. They produce three very different latency curves under the same workload, which is what we will see below.

The workload

I built one harness and reused the same request schedule against all three patterns. Knobs:

2000 requests at a steady 1000 req/s burst (so the arrival stream completes in 2 seconds).
Each request does ~100 ms of work (time.Sleep with a small jitter so the consumer is the bottleneck, not the harness).
16 worker slots for the blocking patterns: 160 req/s sustained against a 1000 req/s burst, a 6.25x overload.
Token bucket at 160 tokens/s, burst 32, matched to the consumer’s actual capacity.
Latency measured from the moment the producer attempts to submit to the moment the work completes. That is what the caller sees.

Hardware: 12 logical CPUs, Go 1.20, GOMAXPROCS=12. Each pattern ran twice with different RNG seeds and the numbers below reproduced to within a couple of milliseconds.

Pattern A: bounded channel

The simplest possible thing that works. A buffered chan is the slot pool. A fixed number of workers reads from it. Producers send blocks when the channel is full. That send blocking is the backpressure.

func runBounded(reqs []time.Time, workers int) {
    jobs := make(chan int, workers)
    var wg sync.WaitGroup
    wg.Add(workers)
    for i := 0; i < workers; i++ {
        go func() {
            defer wg.Done()
            for idx := range jobs {
                doWork(idx) // ~100ms
            }
        }()
    }
    for i := range reqs {
        // pace producer to the arrival schedule, then submit
        jobs <- i // blocks when full -> caller waits
    }
    close(jobs)
    wg.Wait()
}

Capacity is the channel buffer plus the in-flight workers, which is why I set the buffer equal to workers. Make it bigger and you are just preallocating queue depth in front of the same consumer.

Fair (FIFO on the same channel), cheap, no imports, no goroutine per request.

Pattern B: weighted semaphore

golang.org/x/sync/semaphore.Weighted is what you reach for when your work items have different costs. Image resizing, encoding, an “import a 10 MB CSV” handler next to a “lookup by id” handler. You do not want one fat job hogging a slot another job could share.

import "golang.org/x/sync/semaphore"

func runSemaphore(reqs []time.Time, capacity int64) {
    sem := semaphore.NewWeighted(capacity)
    var wg sync.WaitGroup
    for i := range reqs {
        wg.Add(1)
        go func(idx int) {
            defer wg.Done()
            if err := sem.Acquire(ctx, 1); err != nil {
                return
            }
            defer sem.Release(1)
            doWork(idx)
        }(i)
    }
    wg.Wait()
}

Big difference from Pattern A: the producer is not blocked. It spawns a goroutine per request and the goroutines park in Acquire. That gives you per-item weighting (sem.Acquire(ctx, 5) for a heavy job) and context cancellation for free, but nothing throttles the rate at which goroutines accumulate. If the producer outruns the consumer by 6x, you get 6x more parked goroutines than the bounded channel would. Sounds harmless. It is not.

Pattern C: token bucket with `time.Ticker`

The two patterns above are admission-based: the consumer decides who gets in based on whether there is a free slot. A token bucket is rate-based: somebody drips tokens at a fixed rate, and you must hold one to be admitted. No tokens, no admission.

func runTokenBucket(reqs []time.Time, ratePerSec, burst int) {
    tokens := make(chan struct{}, burst)
    for i := 0; i < burst; i++ {
        tokens <- struct{}{}
    }
    go func() {
        t := time.NewTicker(time.Second / time.Duration(ratePerSec))
        defer t.Stop()
        for range t.C {
            select {
            case tokens <- struct{}{}:
            default: // bucket full, drop new token
            }
        }
    }()

    // worker pool (sized generously; admission is what shapes the curve)
    jobs := make(chan int, 1024)
    for i := 0; i < 64; i++ {
        go func() {
            for idx := range jobs {
                doWork(idx)
            }
        }()
    }
    for i := range reqs {
        select {
        case <-tokens:
            jobs <- i
        default:
            // shed: caller's problem now
        }
    }
}

Classic time.Ticker + buffered channel. The buffer is the burst budget, the ticker rate is the steady-state ceiling. The select on the producer side decides whether to block (omit default) or shed (default returns immediately). I picked shed because that makes the contrast with the first two patterns clearest.

The worker pool is intentionally oversized: the bucket, not the workers, is the throttle. Throttle in two places and you have two regulators arguing.

One workload, three answers

Identical 2000-request burst at 1000 req/s. Each consumer slot does ~100ms of work. Latency is end to end (submission to completion). All numbers below are from real runs on the same box.

Pattern	Served	Dropped	p50	p95	p99	Wall
bounded	2000	0	206 ms	224 ms	242 ms	12.66 s
semaphore	2000	0	5346 ms	10077 ms	10502 ms	12.63 s
tokenbkt	351	1649	101 ms	105 ms	106 ms	2.10 s

Latency by backpressure pattern (log scale)

Three very different stories.

Bounded channel: graceful queue, predictable tail

p50 of 206 ms is what you would expect. 16 workers at 100 ms per job is 160 jobs/s steady-state. The arrival burst piles up in the channel-plus-producer wait, and the average wait grows linearly with queue depth. p99 of 242 ms is the queue length at the worst moment. Memory stayed flat: heap allocation delta was 0.2 MiB for the whole run.

The producer’s send blocks. That blocking propagates up the call stack and naturally rate-limits whatever generated the arrivals. No goroutine accumulates.

Semaphore: same throughput, dramatically worse tail

This one surprises people. Same workers, same work, same wall time, but p50 of 5.3 seconds and p99 above 10 seconds.

The reason is in the code. Pattern B spawns a goroutine per request and parks it in Acquire. The arrival schedule is 1000 req/s for 2 seconds, so by the time the burst ends, roughly 1840 goroutines are waiting on the semaphore. They drain at 160 req/s. The goroutine that arrives at t=1.9 s waits for ~1800 jobs ahead of it. That is your p99.

The bounded channel never has more than 16 jobs queued because the producer is blocked on the 17th. The semaphore has all of them queued because the producer was never blocked.

The semaphore is still doing its job: concurrency capped, no OOM, same wall time. The throughput is identical; only the latency distribution is different, because everything queues in-process instead of upstream. If you have weight-asymmetric work, the trade is often worth it. If you do not, a bounded channel is strictly better.

(In a real server the difference is worse, because each parked goroutine holds the request object, headers, body buffers, and so on. 1840 of those is not free. The harness uses trivial state, so memory delta was only 1.6 MiB.)

Token bucket: low latency for accepted work, lots of drops

p99 of 106 ms, which is essentially just the work time. The token bucket admitted 351 of 2000 requests and shed the other 1649. There is no queueing on the accepted-work side because admission was already paced to the consumer’s rate.

Wall time is 2.10 s, which is the arrival burst duration. The consumer goes idle as soon as the arrivals stop, because nothing was queued.

This is the opposite extreme from the semaphore. There you queue everything in-process. Here you queue nothing and just say no.

Decision matrix

You generally want a bounded channel. The interesting question is when to reach for the other two.

If you…	Reach for
have homogeneous work and one producer	bounded channel
have weight-asymmetric work (CPU, memory, payload size)	semaphore
need per-context cancellation while waiting	semaphore
must protect a downstream system at a fixed rate	token bucket
can afford to drop, would rather shed than queue	token bucket
have nothing to push back on the producer (e.g. HTTP)	token bucket + 429

A pattern I have used in production: a token bucket at the edge, shedding 429s to clients during sudden overload, with a bounded channel inside the service to make sure that even if the bucket admits a burst, the internal pipeline never grows past a fixed size. Two layers, each doing the thing it is best at.

The semaphore is a specialist. Use it when you mean it: items truly differ in cost, or you need context.Context plumbing through the wait.

What I would not do

Buffering. Specifically, an unbounded channel, or a chan with capacity “big enough that this never blocks.” That is not backpressure; that is deferring the problem until the heap runs out, which on a managed runtime means a long GC pause, then OOM, then a page.

If you must buffer (bursty arrivals with a smoothing window that makes sense), bound it explicitly, and decide ahead of time what happens at the bound. Block? Drop? Both are fine. Silently growing is not.

If you want this productionized

The above is intentionally minimal. Real codebases will want golang.org/x/time/rate for the token bucket. Same algorithm, plus per-key limits, Wait(ctx) that respects deadlines, and Reserve for “tell me how long until I would be allowed”. I just wanted to show what the patterns underneath the library look like, and why picking the wrong one quietly puts 10 seconds onto your tail.

The point is not the absolute numbers. It is that the same workload produces three completely different latency distributions depending on which knob you turn, and “I added a bigger buffer” is not on the list.

Segflow