<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Performance on Segflow</title><link>https://segflow.github.io/tags/performance/</link><description>Recent content in Performance on Segflow</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 29 May 2026 00:40:00 +0200</lastBuildDate><atom:link href="https://segflow.github.io/tags/performance/index.xml" rel="self" type="application/rss+xml"/><item><title>Finding a needle in a 4 GB haystack: from 0.75 GB/s to 49 GB/s in Go</title><link>https://segflow.github.io/post/fast-file-search-go/</link><pubDate>Fri, 29 May 2026 00:40:00 +0200</pubDate><guid>https://segflow.github.io/post/fast-file-search-go/</guid><description>&lt;p&gt;I had a 4 GiB file that&amp;rsquo;s almost entirely zeros; exactly &lt;strong&gt;one&lt;/strong&gt; non-zero
&lt;code&gt;int64&lt;/code&gt; is hiding at offset &lt;code&gt;Size - 8&lt;/code&gt; (the last aligned slot). The task: find
that offset, as fast as possible, in Go on Linux.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a deliberately silly problem. There&amp;rsquo;s no parsing, no indexing, no
cleverness on the algorithm side. The only thing it measures is how much data
we can pull through a CPU per second. Exactly the kind of micro-task that
exposes every layer of the stack: the Go runtime, the standard library, the
kernel, the page cache, the memory hierarchy, and SIMD, including Go 1.26&amp;rsquo;s
brand-new &lt;code&gt;simd/archsimd&lt;/code&gt; package that lets you write AVX-512 in pure Go.&lt;/p&gt;
&lt;p&gt;Starting from the most obvious &lt;code&gt;os.ReadFile&lt;/code&gt; + &lt;code&gt;for range&lt;/code&gt; we get &lt;strong&gt;0.75
GB/s&lt;/strong&gt;. Thirteen variants later we&amp;rsquo;re at &lt;strong&gt;49 GB/s&lt;/strong&gt;, a 66× speedup, and
we&amp;rsquo;ll know exactly which wall we hit and why.&lt;/p&gt;</description></item><item><title>Zero-copy in Go: sendfile, splice, and the cost of io.Copy</title><link>https://segflow.github.io/post/zero-copy-sendfile-splice/</link><pubDate>Mon, 22 Jan 2024 10:00:00 +0100</pubDate><guid>https://segflow.github.io/post/zero-copy-sendfile-splice/</guid><description>&lt;p&gt;A small file-serving service of mine slowed to a crawl one afternoon after a
&amp;ldquo;harmless&amp;rdquo; middleware change. CPU on the server box doubled, throughput
roughly halved. The diff was a single line: instead of handing a &lt;code&gt;*os.File&lt;/code&gt;
to &lt;code&gt;io.Copy&lt;/code&gt;, somebody had wrapped it in a tiny logging reader to count
bytes.&lt;/p&gt;
&lt;p&gt;That one wrap quietly turned off &lt;code&gt;sendfile(2)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This post is about that fast path: what Go does for you for free, how to see
it actually fire, and the surprisingly easy ways to lose it.&lt;/p&gt;</description></item><item><title>Bounds-check elimination in Go: making the prover happy</title><link>https://segflow.github.io/post/bounds-check-elimination/</link><pubDate>Sat, 08 Jul 2023 13:30:00 +0100</pubDate><guid>https://segflow.github.io/post/bounds-check-elimination/</guid><description>&lt;p&gt;I had a hot loop I could not get any faster. Plain &lt;code&gt;for i := 0; i &amp;lt; n; i++&lt;/code&gt;,
two slice reads, one add, return. On paper that is three or four x86
instructions per iteration. In practice it was running at about half the
throughput I expected, and &lt;code&gt;perf&lt;/code&gt; kept pointing at the same two lines.&lt;/p&gt;
&lt;p&gt;When I finally dumped the assembly, the answer was sitting right there: a
&lt;code&gt;CMPQ&lt;/code&gt; followed by a conditional jump to &lt;code&gt;runtime.panicIndex&lt;/code&gt; on every read.
The compiler was leaving the runtime bounds check in. The &amp;ldquo;work&amp;rdquo; of the
loop was 8 bytes of useful instructions, plus 6 bytes of &amp;ldquo;is this index
still safe&amp;rdquo;, every single iteration.&lt;/p&gt;
&lt;p&gt;This is the fourth post in what was supposed to be a trilogy on the Go
compiler: after &lt;a href="../../post/go-compiler-optimization/"&gt;following an &lt;code&gt;int&lt;/code&gt;-from-&lt;code&gt;string&lt;/code&gt; lookup down into the
compiler itself&lt;/a&gt;, &lt;a href="../../post/inlining-budgets/"&gt;budgeting
inlining&lt;/a&gt;, and &lt;a href="../../post/escape-analysis-examples/"&gt;reading escape-analysis
output&lt;/a&gt;, it would be rude to skip the one
where rewriting two lines wins back 22% on the same hardware. Quartet it
is.&lt;/p&gt;</description></item><item><title>Backpressure for the impatient: channels vs. semaphores vs. tokens</title><link>https://segflow.github.io/post/backpressure-patterns/</link><pubDate>Sun, 19 Feb 2023 16:00:00 +0100</pubDate><guid>https://segflow.github.io/post/backpressure-patterns/</guid><description>&lt;p&gt;A pipeline has backpressure when the consumer can make the producer
slow down, or refuse the work, or both. A bigger buffer does neither.
It just defers the same problem, with more memory held hostage in the
meantime. Backpressure and buffering get confused a lot, and the difference shows up
unmistakably in the tail latency numbers, so this post does the
comparison directly.&lt;/p&gt;
&lt;p&gt;When load exceeds capacity you have three sane options:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Block the caller&lt;/strong&gt; until a slot frees. Throughput pins to whatever
the consumer sustains; tail latency degrades cleanly.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reject the caller&lt;/strong&gt; so they retry, fall back, or shed. Latency
stays low for accepted work; some work never happens.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rate-limit the caller&lt;/strong&gt; so they never get close to overloading you.
Throughput stays predictable; the rest is queued or dropped.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Go has a direct idiom for each: a &lt;strong&gt;bounded channel&lt;/strong&gt; for option 1, the
&lt;strong&gt;&lt;code&gt;golang.org/x/sync/semaphore&lt;/code&gt;&lt;/strong&gt; package for option 1 with per-item
weighting, and a &lt;strong&gt;&lt;code&gt;time.Ticker&lt;/code&gt; token bucket&lt;/strong&gt; for option 3 (and
trivially option 2). They are not interchangeable. They produce three
very different latency curves under the same workload, which is what we
will see below.&lt;/p&gt;</description></item><item><title>What strace -c taught me about a fast CLI</title><link>https://segflow.github.io/post/strace-c-fast-cli/</link><pubDate>Mon, 05 Sep 2022 11:20:00 +0100</pubDate><guid>https://segflow.github.io/post/strace-c-fast-cli/</guid><description>&lt;p&gt;The CLI was fast. I had benchmarked it on my laptop, on a fresh clone of the
repo, and it finished in well under a second. Then a coworker pointed it at a
real monorepo, the kind with 30,000 files spread across a few thousand
directories, and the thing crawled. Same code, same machine class, just more
files. The user-visible work had not changed. The wall clock had.&lt;/p&gt;
&lt;p&gt;This is the story of the half hour I spent figuring out why, what &lt;code&gt;strace -c&lt;/code&gt;
showed me, and why I now reach for it before any profiler when something
&amp;ldquo;feels slow&amp;rdquo; on Linux.&lt;/p&gt;
&lt;p&gt;My first instinct was wrong, by the way. I assumed disk. The repo was big,
the laptop has an NVMe drive but it is not magic, and &amp;ldquo;more files&amp;rdquo; sounds
like &amp;ldquo;more IO.&amp;rdquo; So I ran the program twice in a row, expecting the second
run to be fast off the page cache. It was not. Both runs took roughly the
same time. Whatever was slow, it was not waiting on the disk.&lt;/p&gt;</description></item><item><title>Escape analysis, demystified by 6 tiny examples</title><link>https://segflow.github.io/post/escape-analysis-examples/</link><pubDate>Sun, 17 Apr 2022 14:45:00 +0100</pubDate><guid>https://segflow.github.io/post/escape-analysis-examples/</guid><description>&lt;p&gt;A while back I had a hot loop that profiled like a heap-allocation
machine gun. &lt;code&gt;pprof&lt;/code&gt; blamed &lt;code&gt;runtime.mallocgc&lt;/code&gt;. The code looked
innocent. The fix turned out to be a one-line signature change, and
the compiler had been quietly screaming at me about it the whole
time via &lt;code&gt;-gcflags=-m&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;This post is the third stop on a tour I started in &lt;a href="../../post/go-compiler-optimization/"&gt;My journey
optimizing the Go Compiler&lt;/a&gt; and
continued in &lt;a href="../../post/inlining-budgets/"&gt;What actually fits in Go&amp;rsquo;s inlining
budget&lt;/a&gt;. Inlining decides &lt;em&gt;where&lt;/em&gt; code runs;
escape analysis decides &lt;em&gt;where data lives&lt;/em&gt;. Six 10-line examples
cover roughly 90% of what you&amp;rsquo;ll see in real Go.&lt;/p&gt;</description></item><item><title>Inlining budgets, and why your one-liner stayed slow</title><link>https://segflow.github.io/post/inlining-budgets/</link><pubDate>Mon, 14 Sep 2020 10:00:00 +0100</pubDate><guid>https://segflow.github.io/post/inlining-budgets/</guid><description>&lt;p&gt;After &lt;a href="../../post/go-compiler-optimization/"&gt;my map-lookup contribution to the Go compiler&lt;/a&gt;
back in April, I kept poking the toolchain whenever a benchmark surprised me.
Last week it surprised me again, and the lesson is short enough to fit in one
post: Go inlines aggressively, but not infinitely, and the &lt;em&gt;shape&lt;/em&gt; of your
function matters more than its length.&lt;/p&gt;
&lt;p&gt;The story starts with a three-line helper that I was sure the compiler would
inline. &lt;code&gt;pprof&lt;/code&gt; disagreed.&lt;/p&gt;</description></item></channel></rss>