Performance on Segflow

Finding a needle in a 4 GB haystack: from 0.75 GB/s to 49 GB/s in Go

Fri, 29 May 2026 00:40:00 +0200

I had a 4 GiB file that’s almost entirely zeros. Exactly one non-zero int64 is hiding at offset Size - 8 (the last aligned slot). The task: find that offset, as fast as possible, in Go on Linux.

It’s a deliberately silly problem. There’s no parsing, no indexing, no cleverness on the algorithm side. The only thing it measures is how much data we can pull through a CPU per second. Exactly the kind of micro-task that exposes every layer of the stack: the Go runtime, the standard library, the kernel, the page cache, the memory hierarchy, and SIMD, including Go 1.26’s brand-new simd/archsimd package that lets you write AVX-512 in pure Go.

Starting from the most obvious os.ReadFile + for range, we get 0.75 GB/s. Thirteen variants later we’re at 49 GB/s, a 66× speedup, and we’ll know exactly which wall we hit and why.
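For orientation, here is a minimal sketch of what an os.ReadFile + for range baseline might look like. It is not the code from the post: the function name findNonZero and the choice to round the hit down to its 8-byte slot are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// findNonZero (hypothetical name) scans data one byte at a time and returns
// the offset of the 8-byte-aligned slot containing the first non-zero byte,
// or -1 if every byte is zero.
func findNonZero(data []byte) int64 {
	for i, b := range data {
		if b != 0 {
			return int64(i &^ 7) // round down to the start of the 8-byte slot
		}
	}
	return -1
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: scan <file>")
		return
	}
	// Reads the whole file into memory before scanning a single byte.
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(findNonZero(data))
}
```

The shape is the point: one big allocation, one full copy out of the page cache, then a byte-at-a-time loop.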

Zero-copy in Go: sendfile, splice, and the cost of io.Copy

Mon, 22 Jan 2024 10:00:00 +0100

A small file-serving service of mine slowed to a crawl one afternoon after a “harmless” middleware change. CPU on the server box doubled, throughput roughly halved. The diff was a single line: instead of handing a *os.File to io.Copy, somebody had wrapped it in a tiny logging reader to count bytes.

That one wrap quietly turned off sendfile(2).

This post is about that fast path: what Go does for you for free, how to see it actually fire, and the surprisingly easy ways to lose it.
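To make the failure mode concrete, here is a small sketch of the kind of wrapper that causes it. The countingReader type and the demo function are hypothetical, not the middleware from the incident. The mechanism, as I understand it: io.Copy hands the source to the destination's ReadFrom when one exists, and on Linux a TCP connection can only use sendfile(2) if it recognises that source as an *os.File. A wrapper that exposes only Read hides the concrete type, so the copy falls back to a userspace buffer loop.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// countingReader is a hypothetical byte-counting wrapper. It implements only
// io.Reader, so whatever concrete type sits underneath is invisible to io.Copy.
type countingReader struct {
	r io.Reader
	n int64
}

func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

// demo copies a small in-memory source through the wrapper and returns the
// bytes written and the bytes counted. With a real *os.File source and a TCP
// destination, this is the arrangement that loses the sendfile fast path.
func demo() (written, counted int64, err error) {
	cr := &countingReader{r: strings.NewReader("hello, sendfile")}
	written, err = io.Copy(io.Discard, cr)
	return written, cr.n, err
}

func main() {
	w, c, err := demo()
	fmt.Println(w, c, err)
}
```

The wrapper does its job: the counts match. What it costs only shows up in a syscall trace, which is why this kind of change passes review so easily.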

Bounds-check elimination in Go: making the prover happy

Sat, 08 Jul 2023 13:30:00 +0100

I had a hot loop I could not get any faster. Plain for i := 0; i < n; i++, two slice reads, one add, return. On paper that is three or four x86 instructions per iteration. In practice it was running at about half the throughput I expected, and perf kept pointing at the same two lines.

When I finally dumped the assembly, the answer was sitting right there: a CMPQ followed by a conditional jump to runtime.panicIndex on every read. The compiler was leaving the runtime bounds check in. The “work” of the loop was 8 bytes of useful instructions, plus 6 bytes of “is this index still safe”, every single iteration.

This is the fourth post in what was supposed to be a trilogy on the Go compiler: after following an int-from-string lookup down into the compiler itself, budgeting inlining, and reading escape-analysis output, it would be rude to skip the one where rewriting two lines wins back 22% on the same hardware. Quartet it is.
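The actual two-line rewrite is in the post itself. As a generic illustration of the idea, here is one common bounds-check-elimination idiom, sketched with a hypothetical sumPairs function: reslice the second slice to the length of the first before the loop, so the compiler can prove that every index valid for a is also valid for b.

```go
package main

import "fmt"

// sumPairs (hypothetical) returns the sum of a[i]+b[i] over every index of a.
// It panics if b is shorter than a.
func sumPairs(a, b []int64) int64 {
	if len(b) < len(a) {
		panic("sumPairs: b is shorter than a")
	}
	// After this reslice len(b) == len(a), which gives the prover what it
	// needs to treat b[i] as in range whenever a[i] is.
	b = b[:len(a)]

	var total int64
	for i := range a {
		total += a[i] + b[i]
	}
	return total
}

func main() {
	fmt.Println(sumPairs([]int64{1, 2, 3}, []int64{10, 20, 30, 40}))
}
```

Whether the checks really disappear depends on the compiler version, so verify rather than assume: building with -gcflags=-d=ssa/check_bce/debug=1 lists the bounds checks that remain.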

Backpressure for the impatient: channels vs. semaphores vs. tokens

Sun, 19 Feb 2023 16:00:00 +0100

A pipeline has backpressure when the consumer can make the producer slow down, or refuse the work, or both. A bigger buffer does neither. It just defers the same problem, with more memory held hostage in the meantime. The two get confused a lot, and the difference shows up unmistakably in the tail latency numbers, so this post does the comparison directly.

When load exceeds capacity you have three sane options:

1. Block the caller until a slot frees. Throughput pins to whatever the consumer sustains; tail latency degrades cleanly.
2. Reject the caller so they retry, fall back, or shed. Latency stays low for accepted work; some work never happens.
3. Rate-limit the caller so they never get close to overloading you. Predictable throughput, the rest is queued or dropped.

Go has a direct idiom for each: a bounded channel for option 1, the golang.org/x/sync/semaphore package for option 1 with per-item weighting, and a time.Ticker token bucket for option 3 (and trivially option 2). They are not interchangeable. They produce three very different latency curves under the same workload, which is what we will see below.
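As a rough sketch of the first two options using only the standard library, a buffered channel can act as a pool of slots: a plain send blocks when the pool is full, and a select with a default case rejects instead. The limiter type, its method names, and errBusy are hypothetical names for this illustration. The weighted semaphore and the token bucket are not shown.

```go
package main

import (
	"errors"
	"fmt"
)

// errBusy is a hypothetical sentinel returned when no slot is free.
var errBusy = errors.New("limiter: at capacity")

// limiter uses a buffered channel as a fixed pool of slots.
type limiter struct {
	slots chan struct{}
}

func newLimiter(n int) *limiter {
	return &limiter{slots: make(chan struct{}, n)}
}

// acquire blocks the caller until a slot frees (block the caller).
func (l *limiter) acquire() { l.slots <- struct{}{} }

// tryAcquire never waits: it returns errBusy when the pool is full
// (reject the caller).
func (l *limiter) tryAcquire() error {
	select {
	case l.slots <- struct{}{}:
		return nil
	default:
		return errBusy
	}
}

// release returns a slot to the pool.
func (l *limiter) release() { <-l.slots }

func main() {
	l := newLimiter(2)
	l.acquire()
	l.acquire()
	fmt.Println(l.tryAcquire()) // pool is full, so this is rejected
	l.release()
	fmt.Println(l.tryAcquire()) // a slot is free again
}
```

The same channel gives both behaviours; the only difference is whether the caller is willing to wait, and that choice is what shapes the latency curve.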

What strace -c taught me about a fast CLI

Mon, 05 Sep 2022 11:20:00 +0100

The CLI was fast. I had benchmarked it on my laptop, on a fresh clone of the repo, and it finished in well under a second. Then a coworker pointed it at a real monorepo, the kind with 30,000 files spread across a few thousand directories, and the thing crawled. Same code, same machine class, just more files. The user-visible work had not changed. The wall clock had.

This is the story of the half hour I spent figuring out why, what strace -c showed me, and why I now reach for it before any profiler when something “feels slow” on Linux.

My first instinct was wrong, by the way. I assumed disk. The repo was big, the laptop has an NVMe drive but it is not magic, and “more files” sounds like “more IO.” So I ran the program twice in a row, expecting the second run to be fast off the page cache. It was not. Both runs took roughly the same time. Whatever was slow, it was not waiting on the disk.

Escape analysis, demystified by 6 tiny examples

Sun, 17 Apr 2022 14:45:00 +0100

A while back I had a hot loop that profiled like a heap-allocation machine gun. pprof blamed runtime.mallocgc. The code looked innocent. The fix turned out to be a one-line signature change, and the compiler had been quietly screaming at me about it the whole time via -gcflags=-m.

This post is the second half of a tour I started in My journey optimizing the Go Compiler and continued in What actually fits in Go’s inlining budget. Inlining decides where code runs; escape analysis decides where data lives. Six 10-line examples cover roughly 90% of what you’ll see in real Go.
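To give a flavour of what such an example looks like, here is a sketch of the classic pair: returning a pointer to a local versus returning the value itself. This is not one of the six examples from the post, and it is not the signature change from the anecdote. The point type and both constructors are made up for illustration.

```go
package main

import "fmt"

type point struct{ x, y int }

// newPointPtr returns the address of a local variable. The value has to
// outlive this call, so the compiler's escape analysis reports that p is
// moved to the heap.
func newPointPtr(x, y int) *point {
	p := point{x, y}
	return &p
}

// newPointVal returns the struct by value. Nothing here needs to outlive the
// call, so the result can stay on the caller's stack.
func newPointVal(x, y int) point {
	return point{x, y}
}

func main() {
	a := newPointPtr(1, 2)
	b := newPointVal(3, 4)
	fmt.Println(a.x+a.y, b.x+b.y)
}
```

Building with go build -gcflags=-m prints the compiler's decisions for each function. Note that inlining can change the outcome at a particular call site, so the diagnostics for your own code are the thing to trust.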

Inlining budgets, and why your one-liner stayed slow

Mon, 14 Sep 2020 10:00:00 +0100

After my map-lookup contribution to the Go compiler back in April, I kept poking the toolchain whenever a benchmark surprised me. Last week it surprised me again, and the lesson is short enough to fit in one post: Go inlines aggressively, but not infinitely, and the shape of your function matters more than its length.

The story starts with a three-line helper that I was sure the compiler would inline. pprof disagreed.
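The helper from the post is not reproduced here. As a hypothetical illustration of "shape over length", compare two functions of similar size that compute the same sum. My expectation is that the compiler reports the first as inlinable and declines the second because of the defer, but that is an assumption about the toolchain: the output of go build -gcflags=-m for your Go version is the authority.

```go
package main

import "fmt"

// add is a small leaf function: no calls, no closures, no defer.
func add(a, b int) int {
	return a + b
}

// addDeferred computes the same sum in a deliberately awkward way. The
// return statement stores a in sum, then the deferred closure adds b before
// the function returns. The defer is the "shape" that I would expect to
// block inlining.
func addDeferred(a, b int) (sum int) {
	defer func() { sum += b }()
	return a
}

func main() {
	fmt.Println(add(2, 3), addDeferred(2, 3))
}
```

Adding a second -m (-gcflags='-m -m') makes the compiler print more detail about why a function was or was not inlined, which is usually faster than guessing.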