What strace -c taught me about a fast CLI

The CLI was fast. I had benchmarked it on my laptop, on a fresh clone of the repo, and it finished in well under a second. Then a coworker pointed it at a real monorepo, the kind with 30,000 files spread across a few thousand directories, and the thing crawled. Same code, same machine class, just more files. The user-visible work had not changed. The wall clock had.

This is the story of the half hour I spent figuring out why, what strace -c showed me, and why I now reach for it before any profiler when something “feels slow” on Linux.

My first instinct was wrong, by the way. I assumed disk. The repo was big, the laptop has an NVMe drive but it is not magic, and “more files” sounds like “more IO.” So I ran the program twice in a row, expecting the second run to be fast off the page cache. It was not. Both runs took roughly the same time. Whatever was slow, it was not waiting on the disk.

The suspect

The CLI is small. It walks a directory tree, counts non-empty lines in every text file, and prints the total. The Go reads like the kind of code you write once and never look at again:

err := filepath.Walk(root, func(p string, info os.FileInfo, err error) error {
    if err != nil || info.IsDir() {
        return nil
    }
    if !parentsOK(p) {
        return nil
    }
    n, err := countLines(p)
    if err == nil {
        total += n
    }
    return nil
})

parentsOK walks up from the file, calling os.Stat on every ancestor, because two years ago a bug report blamed a missing parent directory and somebody added a belt-and-braces check. Nobody ever took it out.

countLines opens the file and uses a bufio.Scanner with a 1 KiB scratch buffer, because that is what the example in the docs uses.

On a synthetic tree of 30,600 files in 1,861 directories, totalling 224 MiB, the CLI takes about 320 ms with a warm cache. That sounds fine. It is not fine, and strace -c is going to tell us why.

The 30-second tool

strace -c does not print every syscall. It runs your program to completion and prints a histogram of which syscalls were made, how often, and how much time the kernel spent in each. The crucial flag for Go is -f, “follow child threads,” because the Go runtime is multi-threaded and without -f you only see the main thread’s syscalls.

$ strace -cf ./slow /tmp/many >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 81.19  154.531378        6883     22448      1839 futex
 11.21   21.329266         189    112511           nanosleep
  2.92    5.565586          25    216061           newfstatat
  2.53    4.819692          24    194731           read
  0.56    1.071126          32     32462           openat
  0.47    0.901634         168      5364         5 epoll_pwait
  0.42    0.807935          24     32462           close
  0.37    0.697828          21     32462     32461 epoll_ctl
  ...
------ ----------- ----------- --------- --------- ------------------
100.00  190.331611         286    664447     34393 total

A few notes on reading this table before we panic at the numbers. The seconds column is heavily inflated by strace itself; ptrace is expensive and the wall time you see here is not the wall time of the unstraced program. The columns I actually trust are calls, errors, and the relative shape of the histogram. Always cross-check wall time with /usr/bin/time outside of strace.

With that caveat, here is what jumps out.

Look at the rightmost columns first. calls is how many times the kernel was entered, period. It does not lie, it is not warped by ptrace overhead, and it is the number you can reason about. The errors column is the count of calls that returned a non-zero error code; a huge value there usually means the program is using errors as control flow, which is its own kind of slow.

The smoking gun

For 30,600 files, the program made 216,061 newfstatat calls. That is roughly seven stat per file. Where do seven stats per file come from?

parentsOK("/tmp/many/grp00/sub00/file_00.txt")
  stat("/tmp/many/grp00/sub00/file_00.txt")
  stat("/tmp/many/grp00/sub00")
  stat("/tmp/many/grp00")
  stat("/tmp/many")
  stat("/tmp")
  stat("/")

Six stats from parentsOK, plus one lstat from filepath.Walk as it descends into each entry. Six plus one is seven. The math closes on the dot.

The second worst row is read: 194,731 calls, which is about 6.3 reads per file for files that average 6 KiB. The bufio.Scanner is doing its job, but its scratch buffer is so small that even modest files come back in five or six chunks. Each chunk is a separate syscall, each syscall is a separate trip into the kernel and back, and for tiny files that round trip is the dominant cost.

openat and close come in at 32,462 each. That is one open and one close per file, plus a couple of thousand for directories. There is nothing to fix there. You cannot read a file without opening it. But the ratio is telling: read is 6× the file count, stat is 7× the file count. Every extra syscall is overhead the program does not need.

Also worth noting: futex is at the top with 81% of the (inflated) seconds. That is the Go runtime’s scheduler talking to itself across goroutines. On a ptrace-slowed run it always looks dramatic, because every context switch becomes a futex storm of its own. Without strace, the runtime barely shows up. This is one of those places where you have to know what your language emits as syscalls before you can read its histogram.

Five syscalls you do not want to see in the top three

Before the fix, a short rule of thumb that has held up for me:

stat, lstat, newfstatat: usually means you are checking the same paths over and over. Cache it, or trust the DirEntry you already have.
read with a huge call count and a small usecs/call: your buffer is too small or you are reading one byte at a time. Bump the buffer.
openat / close pairs hugely exceeding the file count: you are opening the same file more than once. Open once, pass the fd.
getdents64 running into the thousands for a flat directory: you are listing the same directory repeatedly. Cache the entries.
futex dominating wall time with hundreds of usecs/call: lock contention. Profile with pprof, not strace.

The fix

Two changes. The first deletes parentsOK entirely. filepath.Walk already gives me an os.FileInfo per entry and WalkDir does even better by handing me a fs.DirEntry that was populated from getdents64 without a stat. If the user mis-typed the root, I will find out at os.Open time.

The second swaps the bufio.Scanner with a tiny buffer for a bufio.Reader sized at 64 KiB. Almost every text file in this tree fits in a single read.

err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
    if err != nil || d.IsDir() {
        return nil
    }
    n, err := countLines(p) // now uses bufio.NewReaderSize(f, 64*1024)
    if err == nil {
        total += n
    }
    return nil
})

Same input, same output (2,447,945 non-empty lines either way). New strace -c summary:

$ strace -cf ./fast /tmp/many >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 78.04   43.335523        2863     15135      1625 futex
 13.32    7.394392         247     29903           nanosleep
  2.60    1.442270          23     61203           read
  1.81    1.004006          30     32462           openat
  1.42    0.786105          24     32462           close
  1.17    0.651540          20     32462     32461 epoll_ctl
  ...
  0.00    0.000102         102         1           newfstatat
------ ----------- ----------- --------- --------- ------------------
100.00   55.532420         255    217161     34134 total

Top syscalls before vs after (log scale)

newfstatat went from 216,061 to 1. That single stat is Go’s runtime checking the root directory I passed on the command line. The whole rest of the walk runs off getdents64 and the d_type byte. read dropped from 194,731 to 61,203, which is closer to two reads per file: one to grab the contents, one to hit EOF and stop. Total syscalls dropped from 664,447 to 217,161, a 3× reduction. Wall time outside strace dropped from a steady 320 ms to a steady 270 ms.

The wall-time win is real but modest, and that is worth saying plainly. With a warm page cache, a syscall is cheap; the kernel’s stat fast path is a handful of hundred nanoseconds. The reason this matters anyway is two fold. First, syscall count scales with input size. On a tree ten times bigger the slow version would have made 2.16 million stats and the fast version still just one; the gap would not have stayed at fifty milliseconds. Second, on a cold cache, every stat that misses the dentry cache turns into a directory read on disk, and that is where this style of code falls off a cliff. I cannot drop the page cache on this machine to demo it, but I have been bitten by it on production CI runners more than once.

What `strace -c` is and is not

strace -c is not a profiler. It is a histogram. It will not tell you that 80% of your CPU is in a regex compiler or that a map lookup is going quadratic. For that, go test -cpuprofile and pprof are still the right tools.

What strace -c is good at is one specific question: is my program spending its life in syscalls it does not need? That is a question CPU profilers answer badly, because most of the time hides inside the kernel and barely registers in a user-space profile. A 30-second strace -c run gives you a syscall histogram, and once you see the histogram, the obvious calls jump out: the stat storm, the tiny read, the redundant openat.

Two warnings before you go.

First, strace uses ptrace. Ptrace adds a context switch on every syscall, which is why my wall times inside strace looked like 190 seconds for a program that runs in 0.3 seconds. Never quote strace wall times as if they were real. Use it for the call counts and ratios, then measure wall time separately.

Second, on a production box, strace is the wrong tool. It pauses every syscall and can wreck latency. For production triage, reach for perf trace --summary, which uses tracepoints and is far cheaper. But on my laptop, on a Friday afternoon, with a CLI that “feels slow,” strace -cf is still the first thing I run.

The fix to my slow CLI took ten minutes once I had the histogram. Finding the histogram took thirty seconds. That ratio is why I keep coming back to strace -c.

Segflow