Let's pwn!

Zero-copy in Go: sendfile, splice, and the cost of io.Copy
22 January 2024

A small file-serving service of mine slowed to a crawl one afternoon after a “harmless” middleware change. CPU on the server box doubled, throughput roughly halved. The diff was a single line: instead of handing a *os.File to io.Copy, somebody had wrapped it in a tiny logging reader to count bytes.

That one wrap quietly turned off sendfile(2).

This post is about that fast path: what Go does for you for free, how to see it actually fire, and the surprisingly easy ways to lose it.
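
To make the failure mode concrete, here is a minimal sketch of what such a wrap looks like. The names countingReader, fastPathEligible, and check are hypothetical, and the type assertion only approximates the check the standard library makes before it picks sendfile(2); the real logic lives inside the net and os packages and varies by Go version and platform.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// countingReader is a hypothetical byte-counting wrapper, the kind of
// "harmless" logging reader described above.
type countingReader struct {
	r io.Reader
	n int64
}

// Read forwards to the wrapped reader and tallies the bytes seen.
func (c *countingReader) Read(p []byte) (int, error) {
	n, err := c.r.Read(p)
	c.n += int64(n)
	return n, err
}

// fastPathEligible approximates the question the standard library asks:
// is the source still recognizably an *os.File? This is an illustration,
// not the actual check in the net package.
func fastPathEligible(src io.Reader) bool {
	_, ok := src.(*os.File)
	return ok
}

// check compares a bare file with the same file behind the wrapper.
// os.Stdin is used only because it is an *os.File that always exists.
func check() (bare, wrapped bool) {
	f := os.Stdin
	return fastPathEligible(f), fastPathEligible(&countingReader{r: f})
}

func main() {
	bare, wrapped := check()
	fmt.Println("bare *os.File eligible:", bare)
	fmt.Println("wrapped reader eligible:", wrapped)
}
```

Once the source no longer looks like a file, the copy goes back through a userspace buffer with ordinary read and write calls, which matches the extra CPU described above.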

Bounds-check elimination in Go: making the prover happy
8 July 2023

I had a hot loop I could not get any faster. Plain for i := 0; i < n; i++, two slice reads, one add, return. On paper that is three or four x86 instructions per iteration. In practice it was running at about half the throughput I expected, and perf kept pointing at the same two lines.

When I finally dumped the assembly, the answer was sitting right there: a CMPQ followed by a conditional jump to runtime.panicIndex on every read. The compiler was leaving the runtime bounds check in. The “work” of the loop was 8 bytes of useful instructions, plus 6 bytes of “is this index still safe”, every single iteration.

This is the fourth post in what was supposed to be a trilogy on the Go compiler: after following an int-from-string lookup down into the compiler itself, budgeting inlining, and reading escape-analysis output, it would be rude to skip the one where rewriting two lines wins back 22% on the same hardware. Quartet it is.
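
As a rough illustration of the kind of loop described above, here is a sketch with hypothetical function names. The re-slicing hint in the second version is a commonly used way to give the prover the fact it needs; it is an assumption on my part and not necessarily the two-line rewrite the post goes on to describe. Building with -gcflags=-d=ssa/check_bce/debug=1 should report which index expressions still carry a check on your compiler version.

```go
package main

import "fmt"

// sumPairs is the plain loop: the compiler cannot relate n to len(a) or
// len(b), so each read may keep its bounds check.
func sumPairs(a, b []int, n int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += a[i] + b[i]
	}
	return total
}

// sumPairsHinted re-slices both inputs to length n before the loop.
// Each re-slice is checked once up front; afterwards len(a) == len(b) == n,
// which lets the compiler reason that i < n implies i is in range.
// Note the semantic difference: this panics before the loop if n exceeds
// a slice's capacity, and it can read past len up to cap.
func sumPairsHinted(a, b []int, n int) int {
	a = a[:n]
	b = b[:n]
	total := 0
	for i := 0; i < n; i++ {
		total += a[i] + b[i]
	}
	return total
}

func main() {
	a := []int{1, 2, 3}
	b := []int{10, 20, 30}
	fmt.Println(sumPairs(a, b, 3), sumPairsHinted(a, b, 3))
}
```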

Go compiler performance SSA BCE

Backpressure for the impatient: channels vs. semaphores vs. tokens
19 February 2023

A pipeline has backpressure when the consumer can make the producer slow down, or refuse the work, or both. A bigger buffer does neither. It just defers the same problem, with more memory held hostage in the meantime. The two get confused a lot, and the difference shows up unmistakably in the tail latency numbers, so this post does the comparison directly.

When load exceeds capacity you have three sane options:

1. Block the caller until a slot frees. Throughput pins to whatever the consumer sustains; tail latency degrades cleanly.
2. Reject the caller so they retry, fall back, or shed. Latency stays low for accepted work; some work never happens.
3. Rate-limit the caller so they never get close to overloading you. Throughput is predictable; the rest is queued or dropped.

Go has a direct idiom for each: a bounded channel for option 1, the golang.org/x/sync/semaphore package for option 1 with per-item weighting, and a time.Ticker token bucket for option 3 (and trivially option 2). They are not interchangeable. They produce three very different latency curves under the same workload, which is what we will see below.
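
A minimal sketch of the blocking, rejecting, and rate-limiting options, using only the standard library. The names enqueue, tryEnqueue, and newTokenBucket are hypothetical, and the semaphore variant is left out because it needs the external golang.org/x/sync module. Treat this as an illustration of the shapes involved, not as the code behind the measurements.

```go
package main

import (
	"fmt"
	"time"
)

// enqueue blocks the caller: the send waits until the bounded channel
// has a free slot.
func enqueue(q chan<- int, job int) {
	q <- job
}

// tryEnqueue rejects the caller: it reports false immediately when the
// channel is full instead of waiting.
func tryEnqueue(q chan<- int, job int) bool {
	select {
	case q <- job:
		return true
	default:
		return false
	}
}

// newTokenBucket rate-limits the caller: a time.Ticker adds one token
// per interval, up to burst. Callers receive from the returned channel
// before doing work. Call stop to end the refill goroutine.
func newTokenBucket(interval time.Duration, burst int) (<-chan struct{}, func()) {
	tokens := make(chan struct{}, burst)
	done := make(chan struct{})
	ticker := time.NewTicker(interval)
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				select {
				case tokens <- struct{}{}:
				default: // bucket already full, discard this token
				}
			case <-done:
				return
			}
		}
	}()
	return tokens, func() { close(done) }
}

func main() {
	q := make(chan int, 2)
	fmt.Println(tryEnqueue(q, 1), tryEnqueue(q, 2), tryEnqueue(q, 3))

	<-q           // the consumer frees a slot
	enqueue(q, 4) // so the blocking send returns at once

	tokens, stop := newTokenBucket(10*time.Millisecond, 1)
	defer stop()
	<-tokens // wait for the first token before proceeding
	fmt.Println("token acquired")
}
```

A receive from the token channel blocks when the bucket is empty; wrapping that receive in a select with a default case turns the same bucket into a rejecting limiter.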

Go Concurrency Backpressure Performance

What strace -c taught me about a fast CLI
5 September 2022

The CLI was fast. I had benchmarked it on my laptop, on a fresh clone of the repo, and it finished in well under a second. Then a coworker pointed it at a real monorepo, the kind with 30,000 files spread across a few thousand directories, and the thing crawled. Same code, same machine class, just more files. The user-visible work had not changed. The wall clock had.

This is the story of the half hour I spent figuring out why, what strace -c showed me, and why I now reach for it before any profiler when something “feels slow” on Linux.

My first instinct was wrong, by the way. I assumed disk. The repo was big, the laptop has an NVMe drive but it is not magic, and “more files” sounds like “more IO.” So I ran the program twice in a row, expecting the second run to be fast off the page cache. It was not. Both runs took roughly the same time. Whatever was slow, it was not waiting on the disk.

Go Linux strace syscalls performance

Escape analysis, demystified by 6 tiny examples
17 April 2022

A while back I had a hot loop that profiled like a heap-allocation machine gun. pprof blamed runtime.mallocgc. The code looked innocent. The fix turned out to be a one-line signature change, and the compiler had been quietly screaming at me about it the whole time via -gcflags=-m.

This post is the second half of a tour I started in “My journey optimizing the Go Compiler” and continued in “What actually fits in Go’s inlining budget”. Inlining decides where code runs; escape analysis decides where data lives. Six 10-line examples cover roughly 90% of what you’ll see in real Go.
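
For a taste of what “where data lives” means, here is a small sketch in the spirit of those examples. The point type and both constructors are hypothetical, and this is not necessarily the signature change mentioned above. The noinline directives are there only to keep the compiler's decision easy to observe; building with -gcflags=-m should report the composite literal in the pointer-returning version as escaping to the heap.

```go
package main

import "fmt"

// point is a hypothetical small struct used only for illustration.
type point struct{ x, y int }

// newPointPtr returns a pointer to a value created inside the function.
// The value must outlive the call, so it cannot stay on this stack frame.
//
//go:noinline
func newPointPtr(x, y int) *point {
	return &point{x, y}
}

// newPointVal returns the struct by value. Nothing outlives the call,
// so the result can be copied to the caller without a heap allocation.
//
//go:noinline
func newPointVal(x, y int) point {
	return point{x, y}
}

func main() {
	p := newPointPtr(1, 2)
	v := newPointVal(1, 2)
	fmt.Println(*p == v)
}
```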

Go compiler escape analysis performance

Segflow
