I had been telling people for months that Go can “trivially” hold a million
idle TCP connections. The runtime uses epoll, goroutines are cheap, what
could go wrong. Then a colleague asked me to actually do it, and I realised
I had never tried. So I sat down with my laptop, a fresh net.Listen, and
a client that just wants to open a lot of sockets.
The first wall I hit was 1024 file descriptors. After that came five more walls in quick succession, none of them in user code. This post is the log of every wall I walked into and how to move it. Code, logs, and scripts are in the scratch directory.
The setup
An idle connection is one that has finished the three-way handshake and
is just sitting there. In a vanilla Go server that means one parked
goroutine in Conn.Read, one file descriptor, one entry in the kernel
TCP table, and a pair of socket buffers. Each costs something, and one
runs out first.
For numbers I trust: /proc/$PID/status for VmRSS, runtime.MemStats
for StackInuse, ss -s for the live TCP table. My laptop has 12
logical CPUs, 16 GiB of RAM, and Go 1.16 pinned (the runtime details
I lean on are version-specific).
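For reference, the two process-side numbers can be read from inside the process itself. This is a minimal sketch, not the scratch directory's tooling: the helper name vmRSSKiB is made up for illustration, and it assumes a Linux /proc layout.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// vmRSSKiB (hypothetical helper) returns the VmRSS value from a
// /proc/<pid>/status-style file. The kernel labels the unit "kB",
// but the value is in units of 1024 bytes.
func vmRSSKiB(path string) (uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "VmRSS:") {
			continue
		}
		// The line looks like: "VmRSS:     12345 kB"
		fields := strings.Fields(line)
		if len(fields) < 2 {
			return 0, fmt.Errorf("malformed VmRSS line: %q", line)
		}
		return strconv.ParseUint(fields[1], 10, 64)
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("VmRSS not found in %s", path)
}

func main() {
	rss, err := vmRSSKiB("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, "vmrss:", err)
		os.Exit(1)
	}
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("VmRSS=%d KiB StackInuse=%d KiB Sys=%d KiB\n",
		rss, ms.StackInuse/1024, ms.Sys/1024)
}
```

Reading both in one place makes it easy to log them next to the connection count while a run is in progress.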
Wall 1: 1024 file descriptors
The server is ~100 lines of net.Listen plus goroutine-per-conn. First
run, default ulimit -n of 1024:
$ ulimit -n
1024
$ ./server -addr :9100 &
$ ./client -addr 127.0.0.1:9100 -n 5000 -c 64
...
done dialing: ok=1018 fail=3982 in 139ms
And the server log fills with:
accept: accept tcp4 0.0.0.0:9100: accept4: too many open files
EMFILE. 1018 is suspiciously close to 1024 minus the five fds the
runtime holds (listener, stdin/out/err, the netpoller’s epoll fd). Bump
ulimit -n to 1048576 and this wall moves. The harder caps are
/proc/sys/fs/nr_open (the ceiling for setrlimit) and fs/file-max
(system-wide); on this box both are already plenty.
Wall 2: ephemeral ports (the surprising one)
Raised ulimit, asking for 50 000 connections:
$ cat /proc/sys/net/ipv4/ip_local_port_range
44620 48715
$ ./client -addr 127.0.0.1:9300 -n 50000 -c 512
...
ok=4089 fail=30885 ...
Not 50 000. The server is happy; the client is hitting the wall.
ip_local_port_range is 44620-48715, exactly 4096 port numbers counting both ends, which is the pool the kernel picks from when connect(2) is called without a bound source port. Once those are all in use against the same destination IP and port, connect returns EADDRNOTAVAIL. This is the same reason “stress from one box” benchmarks plateau around 28 000 on a stock kernel (default range 32768-60999).
Two fixes:
- Widen the range: sysctl -w net.ipv4.ip_local_port_range="1024 65535". Gets you ~64 000 ports against one destination, no more.
- Add source IPs. The 4-tuple is (srcIP, srcPort, dstIP, dstPort), so each extra source IP buys you a fresh port pool. On Linux the entire 127.0.0.0/8 block is loopback with no aliasing required, so 127.0.0.2, 127.0.0.3, … just work.
Binding each dialer to one of 127.0.0.2..127.0.0.31 (30 IPs):
$ ./client -addr 127.0.0.1:9300 -n 30000 -c 256 \
-src 127.0.0.2,127.0.0.3,...,127.0.0.31
...
ok=23659 fail=0 elapsed=4m2s
23 000 conns, zero failures. The remaining gap is not ports, it is the
listen backlog (somaxconn=4096) throttling the accept queue.
Wall 3: per-goroutine stacks
I expected this to be the biggest wall. It was the smallest.
Folklore says “8 KiB × 1M = 8 GiB”, which would not fit on this laptop
alongside the socket buffers. I measured. Smallest possible test: park N
goroutines on a channel, no sockets, read runtime.MemStats.
$ for N in 10000 50000 100000 500000 1000000; do ./goroutines -n $N; done
n=10000 StackInuse=20.2 MiB Sys=71.3 MiB
n=50000 StackInuse=98.4 MiB Sys=138.2 MiB
n=100000 StackInuse=196.1 MiB Sys=272.0 MiB
n=500000 StackInuse=977.4 MiB Sys=1336.2 MiB
n=1000000 StackInuse=1954.0 MiB Sys=2601.5 MiB
Almost perfectly linear. Per goroutine: 2 KiB of stack. Not 8. That is the Go 1.4-and-later minimum: each goroutine starts with a 2 KiB stack and grows on demand. A goroutine that only ever blocks inside the netpoller never grows.
A million bare goroutines cost ~1.9 GiB of stack plus ~650 MiB of
everything else the runtime maps, for Sys of ~2.5 GiB. Workable on a
16 GiB box. The 8 KiB myth predates the 2014 move to copy-and-grow
stacks and refuses to die.
The same scaling, but with each goroutine owning a real TCP conn:
| Server-side conns | RSS | Sys | Per-conn RSS |
|---|---|---|---|
| 10 000 | 38.8 MiB | 73.5 MiB | 3.9 KiB |
| 23 000 | 83.0 MiB | 144.4 MiB | 3.7 KiB |
3.7-3.9 KiB of RSS per idle conn. About 2 KiB is goroutine stack; the
rest is the fd slot, the net.TCPConn object, a netpoller entry, and the
faulted-in slice of the kernel socket buffers. Extrapolating from the
slope: 1M conns ≈ 3.8 GiB of server RSS, plus whatever the kernel
charges for sk_buffs and TCP control blocks (those do not show up in
process RSS but they count against MemTotal).
Wall 4: kernel socket buffers
Each TCP socket has send and receive buffers in kernel memory:
$ cat /proc/sys/net/ipv4/tcp_rmem
4096 131072 6291456
$ cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 4194304
Min, default, max. The default is what a fresh socket gets before autotuning. 128 KiB rcv + 16 KiB snd = 144 KiB per socket. At 1M sockets that is about 137 GiB. We do not have 137 GiB.
The kernel autotunes idle buffers downward and has a global ceiling in
tcp_mem, so the worst case is theoretical, but as soon as you push a
byte the buffers grow. The realistic move is to shrink them per socket
before they grow:
// Shrink both kernel buffers before any data flows on the conn.
if tc, ok := c.(*net.TCPConn); ok {
	_ = tc.SetReadBuffer(2048)
	_ = tc.SetWriteBuffer(2048)
}
That maps to setsockopt with SO_RCVBUF and SO_SNDBUF. Per socket(7), the
kernel doubles the requested value to leave room for bookkeeping, so
asking for 2048 yields roughly 4 KiB per direction, still more than an
order of magnitude below the default. The server in the scratch repo
does this unconditionally.
Walls 5 and 6: lurking just past where my laptop gave up
conntrack, the kernel connection-tracking table, gets loaded whenever
iptables/nftables rules ask for it. Default nf_conntrack_max here
is 262144. Past that, new conns silently drop with nf_conntrack: table full in dmesg. Fix: bump nf_conntrack_max and nf_conntrack_buckets,
or skip conntrack if you have no firewall rules.
The listen backlog. net.Listen calls listen(fd, somaxconn) where
somaxconn is read from /proc/sys/net/core/somaxconn (stock 4096,
older distros 128). If accept() falls behind, the accept queue fills,
the kernel drops new SYNs, the client retries with backoff. Bump
somaxconn to 65535.
The checklist
Everything I touched, in one place:
| Knob | File / command | Default-ish | Set to |
|---|---|---|---|
| Per-process fd soft limit | ulimit -n | 1024 | 1048576 |
| Per-process fd hard limit | ulimit -Hn | 4096 | 1048576 |
| Per-process fd ceiling (cap for setrlimit) | /proc/sys/fs/nr_open | 1048576 | 1048576 |
| Ephemeral port range | net.ipv4.ip_local_port_range | 32768 60999 | 1024 65535 |
| Multiple source IPs | client-side | 1 | N (use 127.0.0.0/8 for tests) |
| Listen backlog | net.core.somaxconn | 4096 | 65535 |
| SYN backlog | net.ipv4.tcp_max_syn_backlog | 1024 | 65535 |
| TCP rcv buffer | tcp_rmem middle / SO_RCVBUF | 131072 | 4096 (per-socket) |
| TCP snd buffer | tcp_wmem middle / SO_SNDBUF | 16384 | 4096 (per-socket) |
| Conntrack table | net.netfilter.nf_conntrack_max | 262144 | 2000000 (or disable) |
| TIME_WAIT reuse | net.ipv4.tcp_tw_reuse | 2 | 1 |
Ten or so knobs. None of them are in Go.
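Before a run it helps to dump what the box is actually set to. A minimal preflight sketch, assuming Linux and Go 1.16 or later (for os.ReadFile); readKnob is an illustrative helper, and the conntrack entry is simply reported as unavailable when the module is not loaded.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// readKnob (hypothetical helper) returns the contents of a /proc/sys entry
// given its dotted sysctl name, e.g. "net.core.somaxconn".
func readKnob(name string) (string, error) {
	path := "/proc/sys/" + strings.ReplaceAll(name, ".", "/")
	b, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	// Multi-value knobs are tab-separated; normalise to single spaces.
	return strings.Join(strings.Fields(string(b)), " "), nil
}

func main() {
	knobs := []string{
		"fs.nr_open",
		"fs.file-max",
		"net.ipv4.ip_local_port_range",
		"net.core.somaxconn",
		"net.ipv4.tcp_max_syn_backlog",
		"net.ipv4.tcp_rmem",
		"net.ipv4.tcp_wmem",
		"net.netfilter.nf_conntrack_max",
		"net.ipv4.tcp_tw_reuse",
	}
	for _, k := range knobs {
		v, err := readKnob(k)
		if err != nil {
			// For example, conntrack is absent unless the module is loaded.
			fmt.Printf("%-34s (unavailable: %v)\n", k, err)
			continue
		}
		fmt.Printf("%-34s %s\n", k, v)
	}
}
```

The fd limits are not sysctls, so ulimit -n and ulimit -Hn still need checking separately.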
How Go’s runtime helps you here
The reason any of this works is that Go’s netpoller uses
edge-triggered epoll, not one OS thread per connection. Each fd is
registered with epoll once, when the conn is created. When a goroutine
calls Conn.Read with nothing to read, the runtime parks it and returns
the thread to the scheduler. The goroutine is not on a thread, not on a
CPU, not in a syscall. It is a ~200-byte struct, plus its 2 KiB stack,
pointed at by an epoll watch.
The OS sees GOMAXPROCS threads, one epoll fd, and a million TCP fds,
not a million threads. That alone is what makes 1M conns possible on a
laptop. To go below 2 KiB per conn you would drop goroutine-per-conn and
drive epoll yourself, the way libraries like gnet and evio do. For
most servers, trading a much harder programming model to save 1.5 KiB
per conn is not worth it. For a chat fan-out of millions of mostly-idle
long polls, it might be.
The honest extrapolation
I did not literally hit 1M conns. The largest run I sustained was around
24 000 active loopback conns, capped by my accept-rate / dial-rate dance
on a small somaxconn and by patience. What I did do, end to end and
reproducibly:
- Demonstrated the EMFILE wall at 1018 conns under the default ulimit -n of 1024.
- Demonstrated ephemeral-port exhaustion at 4089 conns from a single source IP (the whole 44620-48715 range), then broke through with multi-IP binding.
- Measured stack cost at 10k, 50k, 100k, 500k, and 1M parked goroutines and confirmed a linear 2 KiB-per-goroutine slope all the way to 1M goroutines using 1954 MiB (about 1.91 GiB) of stack.
- Measured per-conn RSS at 10k and 23k real conns: 3.7-3.9 KiB, essentially flat.
The 1954 MiB (about 1.91 GiB) of stack at 1M goroutines is measured. The 3.8 GiB total RSS at 1M conns is an honest extrapolation from the per-conn RSS slope. The most uncertain piece is kernel-side socket buffers: leave them at defaults and that one knob alone will OOM your laptop, not anything in Go.
When the conns are not idle
Idle is the easy case. The moment each conn ticks a heartbeat, even once a minute, the picture shifts: every wake-up is a goroutine going runnable, a trip through the scheduler, possibly a syscall. A 1M-conn server with a 1 Hz heartbeat is doing 1M timer events per second, well past what one box should be asked to do.
The realistic shape of that server is not “1M goroutines on
time.NewTicker”. It is one ticker fanning out to N batches, or a single
epoll loop with a sorted heartbeat heap. That is a different post. The
point of this one: with ten boring knobs, the Go runtime is not the
bottleneck. Linux is.
