Let's pwn!
Zero-copy in Go: sendfile, splice, and the cost of io.Copy
22 January 2024
A small file-serving service of mine slowed to a crawl one afternoon after a
“harmless” middleware change. CPU on the server box doubled, throughput
roughly halved. The diff was a single line: instead of handing a *os.File
to io.Copy, somebody had wrapped it in a tiny logging reader to count
bytes.
That one wrap quietly turned off sendfile(2).
This post is about that fast path: what Go does for you for free, how to see it actually fire, and the surprisingly easy ways to lose it.
I had a hot loop I could not get any faster. Plain for i := 0; i < n; i++,
two slice reads, one add, return. On paper that is three or four x86
instructions per iteration. In practice it was running at about half the
throughput I expected, and perf kept pointing at the same two lines.
When I finally dumped the assembly, the answer was sitting right there: a
CMPQ followed by a conditional jump to runtime.panicIndex on every read.
The compiler was leaving the runtime bounds check in. The “work” of the
loop was 8 bytes of useful instructions, plus 6 bytes of “is this index
still safe”, every single iteration.
This is the fourth post in what was supposed to be a trilogy on the Go
compiler: after following an int-from-string lookup down into the
compiler itself, budgeting
inlining, and reading escape-analysis
output, it would be rude to skip the one
where rewriting two lines wins back 22% on the same hardware. Quartet it
is.
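A sketch of the shape of the problem, assuming a loop like the one described: two slice reads and one add per iteration. This is not the code from the post, and the reslicing hint is one common idiom rather than necessarily the two-line rewrite the post refers to. sumPairs and sumPairsHinted are hypothetical names.

```go
package main

import "fmt"

// sumPairs indexes both slices with a separately supplied n, so the
// compiler generally cannot prove i is in range and keeps a bounds
// check on each read.
func sumPairs(a, b []int, n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += a[i] + b[i]
	}
	return total
}

// sumPairsHinted ties the loop bound to len(a) and reslices b to the
// same length up front. The reslice panics immediately if b is too
// short, and after it the compiler can usually prove both reads safe.
func sumPairsHinted(a, b []int) int {
	b = b[:len(a)]
	total := 0
	for i := range a {
		total += a[i] + b[i]
	}
	return total
}

func main() {
	a := []int{1, 2, 3}
	b := []int{10, 20, 30}
	fmt.Println(sumPairs(a, b, len(a)), sumPairsHinted(a, b))
}
```

Building with -gcflags=-d=ssa/check_bce/debug=1 lists the bounds checks the compiler left in, which is a quicker way to confirm the effect than reading the assembly. Whether a given check is eliminated depends on the compiler version.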
A pipeline has backpressure when the consumer can make the producer slow down, or refuse the work, or both. A bigger buffer does neither. It just defers the same problem, with more memory held hostage in the meantime. The two get confused a lot, and the difference shows up unmistakably in the tail latency numbers, so this post does the comparison directly.
When load exceeds capacity you have three sane options:
- Block the caller until a slot frees. Throughput pins to whatever the consumer sustains; tail latency degrades cleanly.
- Reject the caller so they retry, fall back, or shed. Latency stays low for accepted work; some work never happens.
- Rate-limit the caller so they never get close to overloading you. Predictable throughput, the rest is queued or dropped.
Go has a direct idiom for each: a bounded channel for option 1, the
golang.org/x/sync/semaphore package for option 1 with per-item
weighting, and a time.Ticker token bucket for option 3 (and
trivially option 2). They are not interchangeable. They produce three
very different latency curves under the same workload, which is what we
will see below.
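A minimal sketch of the first two options on a bounded channel, assuming jobs are plain ints. submitBlocking and trySubmit are hypothetical helpers, and rejecting via select with a default case is a variation added here for illustration, not the mapping the post itself uses for option 2.

```go
package main

import "fmt"

// submitBlocking is option 1: the send blocks until the consumer
// frees a slot, so the producer slows to the consumer's pace.
func submitBlocking(queue chan<- int, job int) {
	queue <- job
}

// trySubmit is option 2: if the queue is full the job is rejected
// immediately and the caller decides whether to retry or shed it.
func trySubmit(queue chan<- int, job int) bool {
	select {
	case queue <- job:
		return true
	default:
		return false
	}
}

func main() {
	queue := make(chan int, 2) // capacity is the backpressure limit

	for job := 1; job <= 3; job++ {
		fmt.Printf("job %d accepted: %v\n", job, trySubmit(queue, job))
	}

	<-queue                  // the consumer takes one job
	submitBlocking(queue, 4) // a slot is free, so this returns at once
	fmt.Println("queued:", len(queue))
}
```

With no consumer running, the third trySubmit call is refused rather than buffered, which is the difference from simply making the channel bigger.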
What strace -c taught me about a fast CLI
5 September 2022
The CLI was fast. I had benchmarked it on my laptop, on a fresh clone of the repo, and it finished in well under a second. Then a coworker pointed it at a real monorepo, the kind with 30,000 files spread across a few thousand directories, and the thing crawled. Same code, same machine class, just more files. The user-visible work had not changed. The wall clock had.
This is the story of the half hour I spent figuring out why, what strace -c
showed me, and why I now reach for it before any profiler when something
“feels slow” on Linux.
My first instinct was wrong, by the way. I assumed disk. The repo was big, the laptop has an NVMe drive but it is not magic, and “more files” sounds like “more IO.” So I ran the program twice in a row, expecting the second run to be fast off the page cache. It was not. Both runs took roughly the same time. Whatever was slow, it was not waiting on the disk.
Escape analysis, demystified by 6 tiny examples
17 April 2022
A while back I had a hot loop that profiled like a heap-allocation
machine gun. pprof blamed runtime.mallocgc. The code looked
innocent. The fix turned out to be a one-line signature change, and
the compiler had been quietly screaming at me about it the whole
time via -gcflags=-m.
This post is the second half of a tour I started in My journey optimizing the Go Compiler and continued in What actually fits in Go’s inlining budget. Inlining decides where code runs; escape analysis decides where data lives. Six 10-line examples cover roughly 90% of what you’ll see in real Go.
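A small sketch of "where data lives", not one of the six examples from the post. It assumes a hypothetical point type and a package-level sink variable, and uses testing.AllocsPerRun from the standard library to count heap allocations per call.

```go
package main

import (
	"fmt"
	"testing"
)

type point struct{ x, y int }

// sink is package-level, so anything stored in it must outlive the
// function that created it.
var sink *point

// sumByValue never lets p leave the function, so p can stay on the stack.
func sumByValue(x, y int) int {
	p := point{x, y}
	return p.x + p.y
}

// storePointer publishes &p through sink, so p escapes to the heap.
func storePointer(x, y int) {
	p := point{x, y}
	sink = &p
}

func main() {
	stackAllocs := testing.AllocsPerRun(100, func() { _ = sumByValue(1, 2) })
	heapAllocs := testing.AllocsPerRun(100, func() { storePointer(1, 2) })
	fmt.Println("sumByValue allocs per call:  ", stackAllocs)
	fmt.Println("storePointer allocs per call:", heapAllocs)
}
```

Building the same file with -gcflags=-m should make the compiler report that p in storePointer is moved to the heap, which is the diagnostic the post refers to.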