Bounds-check elimination in Go: making the prover happy

I had a hot loop I could not get any faster. Plain for i := 0; i < n; i++, two slice reads, one add, return. On paper that is three or four x86 instructions per iteration. In practice it was running at about half the throughput I expected, and perf kept pointing at the same two lines.

When I finally dumped the assembly, the answer was sitting right there: a CMPQ followed by a conditional jump to runtime.panicIndex on every read. The compiler was leaving the runtime bounds check in. The “work” of the loop was 8 bytes of useful instructions, plus 6 bytes of “is this index still safe”, every single iteration.

This is the fourth post in what was supposed to be a trilogy on the Go compiler: after following an int-from-string lookup down into the compiler itself, budgeting inlining, and reading escape-analysis output, it would be rude to skip the one where rewriting two lines wins back 22% on the same hardware. Quartet it is.

All numbers below are real, on go1.21.13 linux/amd64, AMD Ryzen 5 9600X, benchstat on 15 runs at -cpu=1. The scratch project is a handful of files long.

The flag that tells you what is going on

The bounds-check elimination pass (BCE) lives in the SSA backend. It has a debug flag that prints every IsInBounds and IsSliceInBounds the prover could not discharge:

go build -gcflags="-d=ssa/check_bce/debug=1" ./...

Each line of output is one runtime bounds check that survived into the generated code. Zero lines means every index in the file was proved safe at compile time. The flag is older than module support and it is still the fastest way to find loops worth tightening.

The second tool is just -gcflags="-S". Read the disassembly. If you see a CALL runtime.panicIndex in the trailer of your hot function, the prover gave up somewhere.

Patterns the prover already handles

Before showing off the wins, the honest part. The prover on 1.21 is good enough that a lot of the “tricks” you find on older blog posts no longer buy anything. These three loops all generate the same hot path with zero bounds checks:

// 1
for i := 0; i < len(s); i++ { sum = sum*31 + uint32(s[i]) }

// 2
for _, b := range s { sum = sum*31 + uint32(b) }

// 3
n := len(s)
_ = s[n-1]            // the "classic" hoist hint
for i := 0; i < n; i++ { sum = sum*31 + uint32(s[i]) }

The first one already gets BCE because i is bounded by len(s) in the exact form the prover understands. The third one used to be the canonical recipe; on 1.21 it is just clutter. If a hot loop matches one of these shapes, leave it alone.

The patterns the prover still misses fall into two buckets I keep hitting in real code.

Pattern 1: look-ahead, s[i+1]

The textbook one. We are computing a sum of adjacent pairs, which shows up in delta encoders, simple stencils, and any “compare with neighbour” loop:

func SumNextNaive(s []uint32) uint32 {
    var sum uint32
    for i := 0; i < len(s)-1; i++ {
        sum += s[i] + s[i+1]
    }
    return sum
}

s[i] is fine. s[i+1] is not. The loop guard says i < len(s)-1, but the prover does not propagate that into a “i+1 < len(s)” fact for the second read. So we get this in the body:

ADDL   4(AX)(CX*4), DI      ; s[i+1], speculatively
ADDL   DI, DX                ; sum += ...
MOVQ   SI, CX                ; i = i+1
LEAQ   -1(BX), SI            ; len(s)-1
CMPQ   CX, SI
JGE    out
LEAQ   1(CX), SI             ; i+1
MOVL   (AX)(CX*4), DI        ; s[i]
CMPQ   BX, SI
JHI    body                  ; bounds check for s[i+1]
JMP    panic
...
CALL   runtime.panicIndex(SB)

Two compares per iteration, plus a tail that calls into the runtime. ssa/check_bce/debug=1 flags exactly one IsInBounds here, on the s[i+1] access. The prover is honest about what it could not prove.

A 3-second CPU profile of just this benchmark looks like a single tall stack:

CPU flamegraph of SumNextNaive showing bounds-check overhead

pprof -top reports 99.15% of samples inside bce.SumNextNaive and essentially nothing anywhere else, no runtime.panicIndex frame, no scheduler. That is exactly what you would expect: the extra compare and the conditional jump to the panic trailer are inlined into the function itself, they only show up as wider per-iteration cost on the same frame. The point of the flamegraph is the shape: one function, one block, all the time. There is no kernel or runtime to blame, the cycles really are being burned on bounds-checking arithmetic.

The fix is to re-slice the input once, up front, into the two views the loop actually wants:

func SumNextHint(s []uint32) uint32 {
    if len(s) < 2 {
        return 0
    }
    n := len(s) - 1
    head := s[:n:n]
    tail := s[1 : n+1 : n+1]
    var sum uint32
    for i := 0; i < n; i++ {
        sum += head[i] + tail[i]
    }
    return sum
}

The 3-index slice expressions (s[:n:n] and s[1:n+1:n+1]) are doing the real work. They tell the compiler “this is a slice whose length is exactly n”, and the slice operation itself is checked once on entry. Inside the loop, head[i] and tail[i] with i < n are both proved safe in one shot. The debug flag reports a single IsSliceInBounds outside the loop and zero inside it.

The body collapses to five instructions:

MOVL   (AX)(DX*4), SI       ; head[i]
ADDL   4(AX)(DX*4), SI      ; + tail[i]   (same backing array, +4 byte offset)
INCQ   DX
ADDL   SI, BX               ; sum += ...
CMPQ   DX, CX
JLT    body

No panic trailer, no second compare. benchstat, 15 runs each:

name        old time/op    new time/op    delta
SumNext     1.04µs ± 8%    0.85µs ± 3%   -18.13%  (p=0.000 n=15+14)

name        old speed      new speed      delta
SumNext     15.8GB/s ± 7%  19.2GB/s ± 3%  +22.07%  (p=0.000 n=15+14)

Reading 4 KiB of uint32s goes from 15.8 GB/s to 19.2 GB/s. The CPU was not the bottleneck; the redundant branch was.

Pattern 2: length passed as an argument

This one bit me in real code. Imagine a helper that some caller has already decided to walk only n of the elements of s:

func SumExternalNaive(s []uint32, n int) uint32 {
    var sum uint32
    for i := 0; i < n; i++ {
        sum += s[i]
    }
    return sum
}

It looks fine. The caller is well behaved. But the prover sees n as a black box int parameter, unrelated to len(s). So every s[i] is checked, and the loop carries the panic tail:

MOVL   (AX)(CX*4), SI
INCQ   CX
ADDL   SI, DX
CMPQ   DI, CX
JLE    out
CMPQ   BX, CX               ; the actual bounds check
JHI    body
JMP    panic

The fix is one line:

func SumExternalHint(s []uint32, n int) uint32 {
    s = s[:n]
    var sum uint32
    for i := 0; i < n; i++ {
        sum += s[i]
    }
    return sum
}

The re-slice teaches the prover “from now on, len(s) == n”. Indexing s[i] with i < n is now obvious. The check moves from inside the loop to a single panicSliceAcap on entry, executed once per call. The hot body drops to four instructions plus the back-edge compare. Numbers:

name        old time/op    new time/op    delta
SumExternal  883ns ± 9%     803ns ± 3%    -9.12%  (p=0.000 n=15+13)

Ten percent on a loop that was already as tight as I knew how to make it. Side by side, both patterns:

Throughput: naive vs BCE-hinted (higher is better)

Things the prover still does not get

While I had the SSA debug flag turned on I went looking for what else it flags. Two patterns that come up over and over:

  • Indexing two slices of the same length without re-slicing. if len(a) != len(b) { ... } is not enough; the prover does not carry that invariant into the body. Re-slice b = b[:len(a):len(a)] and the check on b[i] disappears.
  • Index arithmetic with a constant offset other than +1. s[i-1], s[i+2], s[i*4] all trip it. The same re-slice trick applies, you just slice out s[1:] or s[2:] or whatever the offset is.

Both of these are mechanical rewrites. The cost is two or three extra lines of slice juggling at the top of the function; the win is a tighter inner loop and, often, the compiler suddenly deciding the loop is small enough to inline somewhere else (which is the connection back to the inlining-budget post, where every byte of generated code matters).

The honest caveat

Two things to remember.

First: I have caught past versions of myself “optimising” away a bounds check that the prover had already eliminated. When I bisect across Go releases, hand-tuned hot loops occasionally regress because the prover got smarter and my workaround now looks like dead code that the compiler cannot quite eliminate. Always re-run -d=ssa/check_bce/debug=1 after a Go upgrade. If the only IsInBounds lines are in cold init code, your hint is now noise.

Second: bounds-check elimination is a peephole optimisation in the truest sense. It moves a compare out of the inner loop. It does not vectorise your code, it does not change cache behaviour, it does not let the compiler reorder anything fundamental. The wins are 10 to 25 percent on loops that were already CPU-bound and tiny, and zero on loops that were bound by memory or doing real work per iteration. The biggest practical benefit, in my experience, is that the loop body becomes small enough to inline, which then unlocks the gains from the escape-analysis and inlining work.

So the recipe is: -d=ssa/check_bce/debug=1 first, then -S to confirm, then benchstat. If the prover is already happy, do nothing and go work on something that matters. If it is not, a 3-index slice expression is almost always the smallest change that fixes it.

comments powered by Disqus