Linux on Segflow

Finding a needle in a 4 GB haystack: from 0.75 GB/s to 49 GB/s in Go

Fri, 29 May 2026 00:40:00 +0200

I had a 4 GiB file that’s almost entirely zeros; exactly one non-zero int64 is hiding at offset Size - 8 (the last aligned slot). The task: find that offset, as fast as possible, in Go on Linux.

It’s a deliberately silly problem. There’s no parsing, no indexing, no cleverness on the algorithm side. The only thing it measures is how much data we can pull through a CPU per second. Exactly the kind of micro-task that exposes every layer of the stack: the Go runtime, the standard library, the kernel, the page cache, the memory hierarchy, and SIMD, including Go 1.26’s brand-new simd/archsimd package that lets you write AVX-512 in pure Go.

Starting from the most obvious os.ReadFile + for range, we get 0.75 GB/s. Thirteen variants later we’re at 49 GB/s, a 66× speedup, and we’ll know exactly which wall we hit and why.

Zero-copy in Go: sendfile, splice, and the cost of io.Copy

Mon, 22 Jan 2024 10:00:00 +0100

A small file-serving service of mine slowed to a crawl one afternoon after a “harmless” middleware change. CPU on the server box doubled, throughput roughly halved. The diff was a single line: instead of handing a *os.File to io.Copy, somebody had wrapped it in a tiny logging reader to count bytes.

That one wrap quietly turned off sendfile(2).

This post is about that fast path: what Go does for you for free, how to see it actually fire, and the surprisingly easy ways to lose it.

What strace -c taught me about a fast CLI

Mon, 05 Sep 2022 11:20:00 +0100

The CLI was fast. I had benchmarked it on my laptop, on a fresh clone of the repo, and it finished in well under a second. Then a coworker pointed it at a real monorepo, the kind with 30,000 files spread across a few thousand directories, and the thing crawled. Same code, same machine class, just more files. The user-visible work had not changed. The wall clock had.

This is the story of the half hour I spent figuring out why, what strace -c showed me, and why I now reach for it before any profiler when something “feels slow” on Linux.

My first instinct was wrong, by the way. I assumed disk. The repo was big, the laptop has an NVMe drive but it is not magic, and “more files” sounds like “more IO.” So I ran the program twice in a row, expecting the second run to be fast off the page cache. It was not. Both runs took roughly the same time. Whatever was slow, it was not waiting on the disk.

Tuning a Go TCP server toward 1M idle connections on a laptop

Mon, 22 Mar 2021 18:30:00 +0100

I had been telling people for months that Go can “trivially” hold a million idle TCP connections. The runtime uses epoll, goroutines are cheap, what could go wrong? Then a colleague asked me to actually do it, and I realised I had never tried. So I sat down with my laptop, a fresh net.Listen, and a client that just wants to open a lot of sockets.

The first wall I hit was 1024 file descriptors. After that came five more walls in quick succession, none of them in user code. This post is the log of every wall I walked into and how to move it. Code, logs, and scripts are in the scratch directory.