Tuning a Go TCP server toward 1M idle connections on a laptop

I had been telling people for months that Go can “trivially” hold a million idle TCP connections. The runtime uses epoll, goroutines are cheap, what could go wrong. Then a colleague asked me to actually do it, and I realised I had never tried. So I sat down with my laptop, a fresh net.Listen, and a client that just wants to open a lot of sockets.

The first wall I hit was 1024 file descriptors. After that came five more walls in quick succession, none of them in user code. This post is the log of every wall I walked into and how to move it. Code, logs, and scripts are in the scratch directory.

The setup

An idle connection is one that has finished the three-way handshake and is just sitting there. In a vanilla Go server that means one parked goroutine in Conn.Read, one file descriptor, one entry in the kernel TCP table, and a pair of socket buffers. Each costs something, and one runs out first.

For numbers I trust: /proc/$PID/status for VmRSS, runtime.MemStats for StackInuse, ss -s for the live TCP table. My laptop has 12 logical CPUs, 16 GiB of RAM, and Go 1.16 pinned (the runtime details I lean on are version-specific).

Wall 1: 1024 file descriptors

The server is ~100 lines of net.Listen plus goroutine-per-conn. First run, default ulimit -n of 1024:

$ ulimit -n
1024
$ ./server -addr :9100 &
$ ./client -addr 127.0.0.1:9100 -n 5000 -c 64
...
done dialing: ok=1018 fail=3982 in 139ms

And the server log fills with:

accept: accept tcp4 0.0.0.0:9100: accept4: too many open files

EMFILE. 1018 is suspiciously close to 1024 minus the handful of fds the process already holds (listener, stdin/out/err, the netpoller’s epoll fd). Bump ulimit -n to 1048576 and this wall moves. The harder caps are /proc/sys/fs/nr_open (the ceiling for setrlimit) and /proc/sys/fs/file-max (system-wide); on this box both are already plenty.
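The soft limit can also be raised from inside the process instead of the shell. The sketch below assumes Linux and uses the standard syscall package; clampSoft and raiseNoFile are hypothetical helper names, and the request is capped at the hard limit because an unprivileged process cannot go past it.

```go
package main

import (
	"fmt"
	"syscall"
)

// clampSoft returns the soft limit to request: want, capped at the hard limit.
func clampSoft(want, hard uint64) uint64 {
	if want > hard {
		return hard
	}
	return want
}

// raiseNoFile raises the soft RLIMIT_NOFILE toward want (never lowering it)
// and returns the resulting soft limit.
func raiseNoFile(want uint64) (uint64, error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	target := clampSoft(want, lim.Max)
	if lim.Cur >= target {
		return lim.Cur, nil // already high enough
	}
	lim.Cur = target
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, err
	}
	return lim.Cur, nil
}

func main() {
	got, err := raiseNoFile(1 << 20) // 1048576, the value used in this post
	if err != nil {
		fmt.Println("setrlimit:", err)
		return
	}
	fmt.Println("soft fd limit now", got)
}
```

If the hard limit is still 4096, this only gets you to 4096; the hard limit has to be raised outside the process.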

Wall 2: ephemeral ports (the surprising one)

Raised ulimit, asking for 50 000 connections:

$ cat /proc/sys/net/ipv4/ip_local_port_range
44620   48715
$ ./client -addr 127.0.0.1:9300 -n 50000 -c 512
...
ok=4089 fail=30885 ...
Not 50 000. The server is happy; the client is hitting the wall. ip_local_port_range is 44620-48715, 4096 ports inclusive, which is the pool the kernel picks from when connect(2) is called without a bound source port. Once those are all in use from one source IP against the same destination IP and port, connect returns EADDRNOTAVAIL. This is the same reason “stress from one box” benchmarks plateau around 28 000 on a stock kernel (default range 32768-60999).
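The size of the pool is just the inclusive width of the range. A small sketch of that arithmetic, where portCount is a hypothetical helper that parses the two-field format of ip_local_port_range:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// portCount parses the contents of ip_local_port_range ("low high")
// and returns how many ports the inclusive range contains.
func portCount(s string) (int, error) {
	f := strings.Fields(s)
	if len(f) != 2 {
		return 0, fmt.Errorf("want 2 fields, got %d", len(f))
	}
	lo, err := strconv.Atoi(f[0])
	if err != nil {
		return 0, err
	}
	hi, err := strconv.Atoi(f[1])
	if err != nil {
		return 0, err
	}
	if hi < lo {
		return 0, fmt.Errorf("high %d is below low %d", hi, lo)
	}
	return hi - lo + 1, nil
}

func main() {
	// This box, the stock default, and the widened range.
	for _, r := range []string{"44620\t48715", "32768\t60999", "1024\t65535"} {
		n, err := portCount(r)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%q -> %d ports\n", r, n)
	}
}
```

The three ranges come out to 4096, 28232, and 64512 ports, which is the per-source-IP ceiling against a single destination.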

Two fixes:

  1. Widen the range: sysctl -w net.ipv4.ip_local_port_range="1024 65535". Gets you ~64 000 ports against one destination, no more.
  2. Add source IPs. The 4-tuple is (srcIP, srcPort, dstIP, dstPort), so each extra source IP buys you a fresh port pool. On Linux the entire 127.0.0.0/8 block is loopback with no aliasing required, so 127.0.0.2, 127.0.0.3, … just work.
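The second fix comes down to setting LocalAddr on the dialer. A minimal sketch, assuming a round-robin assignment of source IPs; dialerFor is a hypothetical helper and not necessarily how the client's -src flag is implemented:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialerFor returns a dialer bound to srcs[i % len(srcs)]. The port is left
// at 0 so the kernel still picks the ephemeral port, but from that IP's pool.
func dialerFor(i int, srcs []string) *net.Dialer {
	ip := net.ParseIP(srcs[i%len(srcs)])
	return &net.Dialer{
		LocalAddr: &net.TCPAddr{IP: ip},
		Timeout:   5 * time.Second,
	}
}

func main() {
	srcs := []string{"127.0.0.2", "127.0.0.3"}
	for i := 0; i < 4; i++ {
		d := dialerFor(i, srcs)
		fmt.Println(i, d.LocalAddr)
		// conn, err := d.Dial("tcp", "127.0.0.1:9300") would connect from that source IP.
	}
}
```

Each distinct source IP gives the kernel a separate set of (srcIP, srcPort) pairs to hand out against the same destination.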

Binding each dialer to one of 127.0.0.2..127.0.0.31 (30 IPs):

$ ./client -addr 127.0.0.1:9300 -n 30000 -c 256 \
    -src 127.0.0.2,127.0.0.3,...,127.0.0.31
...
ok=23659 fail=0 elapsed=4m2s

23 000 conns, zero failures. The remaining gap is not ports, it is the listen backlog (somaxconn=4096) throttling the accept queue.

Wall 3: per-goroutine stacks

I expected this to be the biggest wall. It was the smallest.

Folklore says “8 KiB × 1M = 8 GiB”, which would not fit on this laptop alongside the socket buffers. I measured. Smallest possible test: park N goroutines on a channel, no sockets, read runtime.MemStats.

$ for N in 10000 50000 100000 500000 1000000; do ./goroutines -n $N; done
n=10000    StackInuse=20.2 MiB    Sys=71.3 MiB
n=50000    StackInuse=98.4 MiB    Sys=138.2 MiB
n=100000   StackInuse=196.1 MiB   Sys=272.0 MiB
n=500000   StackInuse=977.4 MiB   Sys=1336.2 MiB
n=1000000  StackInuse=1954.0 MiB  Sys=2601.5 MiB

Almost perfectly linear. Per goroutine: 2 KiB of stack. Not 8. That is the Go 1.4-and-later minimum: each goroutine starts with a 2 KiB stack and grows on demand. A goroutine that only ever blocks inside the netpoller never grows.

A million bare goroutines cost ~1.9 GiB of stack plus ~650 MiB of other runtime memory, for a Sys of ~2.5 GiB. Workable on a 16 GiB box. The 8 KiB myth predates the 2014 move to copy-and-grow stacks and refuses to die.
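The measurement itself is small enough to sketch. This is not the ./goroutines program from the scratch directory, just one way to get the same two numbers; park is a hypothetical helper.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// park starts n goroutines that block on a channel receive, then reads
// StackInuse and Sys from runtime.MemStats while they are still alive.
// The returned release func lets the goroutines exit.
func park(n int) (stackInuse, sys uint64, release func()) {
	stop := make(chan struct{})
	var started sync.WaitGroup
	started.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			started.Done()
			<-stop // parked here until release is called
		}()
	}
	started.Wait()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.StackInuse, m.Sys, func() { close(stop) }
}

func main() {
	const n = 100000
	const mib = 1 << 20
	stack, sys, release := park(n)
	defer release()
	fmt.Printf("n=%d StackInuse=%.1f MiB Sys=%.1f MiB\n",
		n, float64(stack)/mib, float64(sys)/mib)
}
```

Dividing StackInuse by n gives the per-goroutine figure; the exact value depends on the Go version.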

The same scaling, but with each goroutine owning a real TCP conn:

Server-side conns   RSS        Sys         Per-conn RSS
10 000              38.8 MiB   73.5 MiB    3.9 KiB
23 000              83.0 MiB   144.4 MiB   3.7 KiB

3.7-3.9 KiB of RSS per idle conn. About 2 KiB is goroutine stack; the rest is the fd slot, the net.TCPConn object, a netpoller entry, and the faulted-in slice of the kernel socket buffers. Extrapolating from the slope: 1M conns ≈ 3.8 GiB of server RSS, plus whatever the kernel charges for sk_buffs and TCP control blocks (those do not show up in process RSS but they count against MemTotal).

[Figure: memory scaling, goroutine stacks vs real TCP connections (log x-axis)]

Wall 4: kernel socket buffers

Each TCP socket has send and receive buffers in kernel memory:

$ cat /proc/sys/net/ipv4/tcp_rmem
4096    131072  6291456
$ cat /proc/sys/net/ipv4/tcp_wmem
4096    16384   4194304

Min, default, max. The default is what a fresh socket gets before autotuning. 128 KiB rcv + 16 KiB snd = 144 KiB per socket. At 1M sockets that is roughly 140 GiB. We do not have 140 GiB.
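The worst-case arithmetic, spelled out as a sketch. worstCaseBytes is a hypothetical helper, and it assumes every socket really fills both buffers, which idle sockets do not.

```go
package main

import "fmt"

// worstCaseBytes returns conns * (rcv + snd): the memory needed if every
// socket actually filled buffers of the given sizes.
func worstCaseBytes(rcv, snd, conns int64) int64 {
	return (rcv + snd) * conns
}

func main() {
	const gib = 1 << 30
	const conns = 1000000

	atDefault := worstCaseBytes(131072, 16384, conns) // tcp_rmem[1] + tcp_wmem[1]
	atFloor := worstCaseBytes(4096, 4096, conns)      // tcp_rmem[0] + tcp_wmem[0]

	fmt.Printf("defaults: %.1f GiB\n", float64(atDefault)/gib)
	fmt.Printf("minimums: %.1f GiB\n", float64(atFloor)/gib)
}
```

That works out to about 137 GiB at the defaults and about 7.6 GiB at the minimums, which is why the per-socket buffer size is the knob that matters.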

The kernel autotunes idle buffers downward and has a global ceiling in tcp_mem, so the worst case is theoretical, but as soon as you push a byte the buffers grow. The realistic move is to shrink them per socket before they grow:

if tc, ok := c.(*net.TCPConn); ok {
    _ = tc.SetReadBuffer(2048)
    _ = tc.SetWriteBuffer(2048)
}

Those calls map to setsockopt with SO_RCVBUF and SO_SNDBUF. The kernel clamps to its tcp_rmem[0] minimum (4096, doubled internally for bookkeeping, so ~8 KiB real), but that is an order of magnitude smaller than the default. The server in the scratch directory does this unconditionally.
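Rather than trusting what the kernel should do with the request, you can read the value back with getsockopt. A minimal sketch, assuming Linux and a loopback connection; rcvBuf and demo are hypothetical helpers, and the numbers you see will depend on the kernel.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// rcvBuf reports the SO_RCVBUF value the kernel currently holds for tc.
func rcvBuf(tc *net.TCPConn) (int, error) {
	raw, err := tc.SyscallConn()
	if err != nil {
		return 0, err
	}
	var size int
	var serr error
	err = raw.Control(func(fd uintptr) {
		size, serr = syscall.GetsockoptInt(int(fd), syscall.SOL_SOCKET, syscall.SO_RCVBUF)
	})
	if err != nil {
		return 0, err
	}
	return size, serr
}

// demo opens one loopback connection and returns SO_RCVBUF before and
// after asking for a 2048-byte read buffer.
func demo() (int, int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, 0, err
	}
	defer ln.Close()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, 0, err
	}
	defer c.Close()

	tc := c.(*net.TCPConn)
	before, err := rcvBuf(tc)
	if err != nil {
		return 0, 0, err
	}
	if err := tc.SetReadBuffer(2048); err != nil {
		return 0, 0, err
	}
	after, err := rcvBuf(tc)
	return before, after, err
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("SO_RCVBUF before=%d after=%d\n", before, after)
}
```

The "after" value is what the kernel settled on, not what was requested, so it is the number to plug into any per-conn memory estimate.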

Walls 5 and 6: lurking just past where my laptop gave up

conntrack, the kernel connection-tracking table, gets loaded whenever iptables/nftables rules ask for it. Default nf_conntrack_max here is 262144. Past that, packets for new conns are dropped and dmesg shows nf_conntrack: table full, dropping packet. Fix: bump nf_conntrack_max and nf_conntrack_buckets, or skip conntrack if you have no firewall rules.

The listen backlog. net.Listen calls listen(fd, somaxconn) where somaxconn is read from /proc/sys/net/core/somaxconn (stock 4096, older distros 128). If accept() falls behind, the accept queue fills, the kernel drops SYNs, and the client retries with backoff. Bump somaxconn to 65535.

The checklist

Everything I touched, in one place:

Knob                        File / command                    Default-ish    Set to
Per-process fd soft limit   ulimit -n                         1024           1048576
Per-process fd hard limit   ulimit -Hn                        4096           1048576
System fd ceiling           /proc/sys/fs/nr_open              1048576        1048576
Ephemeral port range        net.ipv4.ip_local_port_range      32768 60999    1024 65535
Multiple source IPs         client-side                       1              N (use 127.0.0.0/8 for tests)
Listen backlog              net.core.somaxconn                4096           65535
SYN backlog                 net.ipv4.tcp_max_syn_backlog      1024           65535
TCP rcv buffer              tcp_rmem middle / SO_RCVBUF       131072         4096 (per-socket)
TCP snd buffer              tcp_wmem middle / SO_SNDBUF       16384          4096 (per-socket)
Conntrack table             net.netfilter.nf_conntrack_max    262144         2000000 (or disable)
TIME_WAIT reuse             net.ipv4.tcp_tw_reuse             2              1

Ten or so knobs. None of them are in Go.
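Since every knob lives under /proc/sys, a pre-flight check is easy to sketch. The paths below are the sysctl names from the checklist with dots turned into slashes; meets is a hypothetical helper, and treating "at least the target" as a pass is my assumption, not a rule from the kernel.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// meets reports whether the first integer field of a /proc/sys value
// is at least want.
func meets(raw string, want int) (bool, error) {
	f := strings.Fields(raw)
	if len(f) == 0 {
		return false, fmt.Errorf("empty value")
	}
	got, err := strconv.Atoi(f[0])
	if err != nil {
		return false, err
	}
	return got >= want, nil
}

func main() {
	knobs := []struct {
		path string
		want int
	}{
		{"/proc/sys/fs/nr_open", 1048576},
		{"/proc/sys/net/core/somaxconn", 65535},
		{"/proc/sys/net/ipv4/tcp_max_syn_backlog", 65535},
		{"/proc/sys/net/netfilter/nf_conntrack_max", 2000000},
	}
	for _, k := range knobs {
		b, err := os.ReadFile(k.path)
		if err != nil {
			// For example, conntrack is absent when the module is not loaded.
			fmt.Printf("%-42s unreadable: %v\n", k.path, err)
			continue
		}
		ok, err := meets(string(b), k.want)
		if err != nil {
			fmt.Printf("%-42s unparsable: %v\n", k.path, err)
			continue
		}
		fmt.Printf("%-42s want>=%d ok=%v\n", k.path, k.want, ok)
	}
}
```

Running this before a load test catches the case where a reboot quietly reset a sysctl.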

How Go’s runtime helps you here

The reason any of this works is that Go’s netpoller uses edge-triggered epoll, not one OS thread per connection. When a goroutine calls Conn.Read with nothing to read, the runtime parks it, registers the fd with epoll, and returns the thread to the scheduler. The goroutine is not on a thread, not on a CPU, not in a syscall. It is a ~200-byte struct pointed at by an epoll watch.

The OS sees GOMAXPROCS threads, one epoll fd, and a million TCP fds, not a million threads. That alone is what makes 1M conns possible on a laptop. To go below 2 KiB per conn you would drop goroutine-per-conn and drive epoll yourself, the way libraries like gnet and evio do. For most servers, trading a much harder programming model to save 1.5 KiB per conn is not worth it. For a chat fan-out of millions of mostly-idle long polls, it might be.

The honest extrapolation

I did not literally hit 1M conns. The largest run I sustained was around 24 000 established loopback conns, capped by my accept-rate / dial-rate dance on a small somaxconn and by patience. What I did do, end to end and reproducibly:

  • Demonstrated the EMFILE wall at 1018 conns, just under the 1024-fd limit.
  • Demonstrated ephemeral-port exhaustion at ~4 090 conns per source IP (a 4 096-port range), then broke through with multi-IP binding.
  • Measured stack cost at 10k, 50k, 100k, 500k, and 1M parked goroutines and confirmed a linear 2 KiB-per-goroutine slope all the way to 1M goroutines using 1 954 MiB (~1.9 GiB) of stack.
  • Measured per-conn RSS at 10k and 23k real conns: 3.7-3.9 KiB, essentially flat.

The ~1.9 GiB of stack at 1M goroutines is measured. The 3.8 GiB total RSS at 1M conns is honest extrapolation from the per-conn slope. The most uncertain piece is kernel-side socket buffers: leave them at defaults and that one knob alone will OOM your laptop, not anything in Go.

When the conns are not idle

Idle is the easy case. The moment each conn ticks a heartbeat, even once a minute, the picture shifts: every wake-up is a goroutine going runnable, a trip through the scheduler, possibly a syscall. A 1M-conn server with a 1 Hz heartbeat is doing 1M timer events per second, well past what one box should be asked to do.

The realistic shape of that server is not “1M goroutines on time.NewTicker”. It is one ticker fanning out to N batches, or a single epoll loop with a sorted heartbeat heap. That is a different post. The point of this one: with ten boring knobs, the Go runtime is not the bottleneck. Linux is.
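For a flavour of the "one ticker, N batches" shape, here is a minimal sketch of the bookkeeping only. The wheel type and its methods are hypothetical; in a real server a single time.Ticker with period interval/N would call tick, and the returned connections would get their heartbeat written.

```go
package main

import "fmt"

// wheel spreads connection IDs over a fixed number of batches so that one
// ticker can service one batch per tick instead of one timer per conn.
type wheel struct {
	batches [][]int
	next    int
}

func newWheel(n int) *wheel {
	return &wheel{batches: make([][]int, n)}
}

// add places a connection ID (assumed non-negative) into a batch.
func (w *wheel) add(id int) {
	b := id % len(w.batches)
	w.batches[b] = append(w.batches[b], id)
}

// tick returns the batch due now and advances to the next one.
func (w *wheel) tick() []int {
	due := w.batches[w.next]
	w.next = (w.next + 1) % len(w.batches)
	return due
}

func main() {
	w := newWheel(3)
	for id := 0; id < 6; id++ {
		w.add(id)
	}
	for i := 0; i < 4; i++ {
		fmt.Println(w.tick()) // cycles through the three batches, then wraps
	}
}
```

With N batches and a heartbeat interval T, each conn is still visited once per T, but the runtime carries one timer instead of a million.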
