Tricky parts of Golang. GOMAXPROCS, CPU Pinning, and the Scheduler You Can't See

March 12, 2025

#golang

You set GOMAXPROCS=16 on a 16-core machine because you read that matching the CPU count improves throughput. Your CPU-bound service gets slower. The goroutine count on the dashboard rises steadily. Nothing in the logs explains it, and the Go documentation does not mention this case.

This article explains what GOMAXPROCS actually controls, how the Go scheduler interacts with the Linux CFS scheduler underneath it, and how containerised deployments make all of this subtly wrong by default.

What `GOMAXPROCS` Actually Controls

The Go runtime multiplexes goroutines onto OS threads using a work-stealing scheduler built around three concepts: G (goroutines), M (OS threads, “machines”), and P (logical processors).

┌─────────────────────────────────────────────┐
│                Go Scheduler                  │
│                                             │
│  G G G G ──► P ──► M ──► OS core            │
│  G G G G ──► P ──► M ──► OS core            │
│  G G G G ──► P ──► M ──► OS core            │
│                                             │
│  GOMAXPROCS controls the number of Ps       │
└─────────────────────────────────────────────┘

GOMAXPROCS sets the number of Ps — the number of goroutines that can execute Go code simultaneously. It does not set the number of OS threads (Ms), which can be much larger and is bounded separately by runtime/debug.SetMaxThreads.

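A P holds a local run queue of goroutines. When a goroutine blocks in a syscall, the runtime detaches the P from the blocked M and hands it to another M, so the P can keep running other goroutines. When a goroutine blocks on a channel operation, no handoff is needed: the goroutine is parked and the same M picks the next runnable goroutine from its P. This is why Go can handle millions of goroutines with far fewer OS threads.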
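To see the three quantities side by side, a minimal sketch like the following can help. The `snapshot` helper is a hypothetical name; the thread figure comes from the `threadcreate` profile, which counts threads the runtime has created so far and is only a rough proxy for live Ms.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

// snapshot reports the number of Ps, the number of live goroutines, and the
// number of OS threads the runtime has created so far (not a live count).
func snapshot() (procs, goroutines, threadsCreated int) {
	procs = runtime.GOMAXPROCS(0) // an argument of 0 queries without changing the value
	goroutines = runtime.NumGoroutine()
	threadsCreated = pprof.Lookup("threadcreate").Count()
	return procs, goroutines, threadsCreated
}

func main() {
	p, g, m := snapshot()
	fmt.Printf("Ps=%d Gs=%d Ms created=%d\n", p, g, m)
}
```

In a service that makes many blocking syscalls, the thread figure can grow well past the number of Ps while GOMAXPROCS stays fixed.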

When More Ps Makes Things Worse

The problem is that a P is a resource the Linux kernel does not know about. The kernel schedules M (OS threads), not Ps. If you create 8 Ps but the kernel has only given your process time slices for 4 cores, those 8 Ps compete for 4 execution slots.

The result:

More context switches between OS threads than necessary
More cache invalidation, since threads migrate between cores
Higher scheduler overhead on the Go side (more work stealing, more failed steal attempts on empty queues)
Longer tail latencies, as runnable goroutines wait for the OS thread behind their P to get CPU time

This is most visible in CPU-bound workloads. For I/O-bound workloads, goroutines spend most of their time blocked, so extra Ps are less costly — they are idle most of the time anyway.
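One way to observe this yourself is to time the same CPU-bound work under different GOMAXPROCS values. The sketch below is an assumption-laden micro-experiment, not a benchmark: `spin` and `timeWith` are hypothetical helpers, and the numbers only become interesting when you run it under a real CPU limit (for example, inside a container started with --cpus=2).

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// spin stands in for CPU-bound work: n rounds of integer arithmetic.
func spin(n int) uint64 {
	var x uint64 = 1
	for i := 0; i < n; i++ {
		x = x*6364136223846793005 + 1442695040888963407
	}
	return x
}

// timeWith runs `workers` CPU-bound goroutines under the given GOMAXPROCS and
// returns the wall-clock time. The previous setting is restored on return.
func timeWith(procs, workers, iters int) time.Duration {
	prev := runtime.GOMAXPROCS(procs)
	defer runtime.GOMAXPROCS(prev)

	results := make([]uint64, workers)
	var wg sync.WaitGroup
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = spin(iters) // store the result so the loop is not optimised away
		}(w)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	cores := runtime.NumCPU()
	for _, procs := range []int{1, cores, 4 * cores} {
		d := timeWith(procs, 4*cores, 5_000_000)
		fmt.Printf("GOMAXPROCS=%d elapsed=%v\n", procs, d)
	}
}
```

On an unconstrained machine the larger settings usually cost little. The gap tends to show up once the process is allowed less CPU time than it has Ps.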

The Container Trap

The problem becomes acute in containerised environments, which is where most Go services run in 2025.

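When you run a container with --cpus=2 (or the Kubernetes equivalent resources.limits.cpu: 2), the limit is enforced as a CFS bandwidth quota; it does not hide any cores. The container still sees every core on the host, say 64 on a cloud VM. Before Go 1.25, the runtime sets its default GOMAXPROCS from runtime.NumCPU(), which on Linux is derived from the process's CPU affinity mask (sched_getaffinity) and ignores the quota, so it picks 64.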

Your container is allowed 2 CPUs of time. The Go runtime has created 64 Ps to run on them.

Host: 64 cores
Container CPU quota: 2.0 (200% of one core)
Default GOMAXPROCS: 64  ← host core count, not the quota (Go < 1.25)

Result: 64 Ps fighting for ~2 cores worth of scheduler time

The overhead of 64 Ps on 2 effective cores is measurable. In latency-sensitive services, this can cause p99 latencies to double or triple versus running with GOMAXPROCS=2.
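A rough way to reason about the worst case is to work out how quickly the threads can burn through the quota. The sketch below is a simplified model of CFS bandwidth control, assuming the usual 100 ms period and assuming every busy thread runs continuously on its own host core; `throttleWindow` is a hypothetical helper, and real workloads will land somewhere between the extremes.

```go
package main

import (
	"fmt"
	"time"
)

// throttleWindow estimates, for a single CFS period, how long the container
// runs before its quota is exhausted and how long it is then throttled.
// Assumption: busyThreads threads run in parallel without pause.
func throttleWindow(quotaCores float64, period time.Duration, busyThreads int) (runFor, throttledFor time.Duration) {
	if busyThreads < 1 {
		busyThreads = 1
	}
	budget := time.Duration(quotaCores * float64(period)) // CPU time allowed per period
	runFor = budget / time.Duration(busyThreads)
	if runFor >= period {
		return period, 0 // the quota is never exhausted
	}
	return runFor, period - runFor
}

func main() {
	period := 100 * time.Millisecond // common default CFS period
	for _, threads := range []int{2, 8, 64} {
		run, stall := throttleWindow(2.0, period, threads)
		fmt.Printf("threads=%d run=%v throttled=%v\n", threads, run, stall)
	}
}
```

Under these assumptions, 64 busy threads on a quota of 2 cores use up the period's budget in about 3.1 ms and then sit throttled for the remaining 96.9 ms, whereas 2 threads are never throttled.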

Diagnosing It

`runtime/trace`

The Go execution tracer shows you exactly what your Ps are doing:

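import (
    "log"
    "os"
    "runtime/trace"
)

f, err := os.Create("trace.out")
if err != nil {
    log.Fatal(err) // cannot trace without an output file
}
defer f.Close()
trace.Start(f) // returns an error if tracing is already enabled
defer trace.Stop()
// ... your workload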

go tool trace trace.out

Open the trace timeline (labelled "View trace" or "View trace by proc", depending on the Go version), which has one row per P. If you see many Ps with long gaps between the goroutines they run, you have more Ps than effective cores. The scheduler latency profile shows how long goroutines wait between becoming runnable and actually running; high values here confirm the problem.
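If capturing a trace is inconvenient, the runtime/metrics package exposes a scheduler latency histogram that you can sample in-process. The following is a minimal sketch: `schedLatencyQuantile` is a hypothetical helper, and it returns the upper edge of the histogram bucket that contains the quantile, so treat the result as an approximation.

```go
package main

import (
	"fmt"
	"math"
	"runtime/metrics"
)

// schedLatencyQuantile returns an approximate quantile (0 < q <= 1) of the time
// goroutines spent runnable before running, in seconds, since process start.
// The second result is false if the metric is unavailable or has no samples.
func schedLatencyQuantile(q float64) (float64, bool) {
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)
	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		return 0, false // metric not supported by this Go version
	}
	h := samples[0].Value.Float64Histogram()

	var total uint64
	for _, c := range h.Counts {
		total += c
	}
	if total == 0 {
		return 0, false
	}

	target := uint64(math.Ceil(q * float64(total)))
	var seen uint64
	for i, c := range h.Counts {
		seen += c
		if seen >= target {
			// Buckets holds len(Counts)+1 boundaries; the last may be +Inf.
			return h.Buckets[i+1], true
		}
	}
	return h.Buckets[len(h.Buckets)-1], true
}

func main() {
	if p99, ok := schedLatencyQuantile(0.99); ok {
		fmt.Printf("approx p99 scheduler latency: %.6fs\n", p99)
	} else {
		fmt.Println("no scheduler latency samples yet")
	}
}
```

Exporting this value alongside your request latency makes it easier to tell scheduling delay apart from genuine load.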

`runtime.NumCPU()` vs actual quota

You can log the mismatch at startup:

import (
    "log"
    "runtime"
)

func main() {
    log.Printf("GOMAXPROCS=%d, NumCPU=%d",
        runtime.GOMAXPROCS(0), runtime.NumCPU())
    // ...
}

If NumCPU vastly exceeds your actual CPU quota, you are in the trap.
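To put the actual quota next to those two numbers, you can read it from the cgroup filesystem. The sketch below assumes cgroup v2 and assumes the container's own cgroup is mounted at /sys/fs/cgroup, which is common but not guaranteed; `parseCPUMax` is a hypothetical helper, and cgroup v1 uses different files.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseCPUMax parses the contents of a cgroup v2 cpu.max file, which has the
// form "<quota> <period>" in microseconds, or "max <period>" when unlimited.
// It returns the quota in cores and whether a limit is set.
func parseCPUMax(contents string) (cores float64, limited bool, err error) {
	fields := strings.Fields(contents)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max format: %q", contents)
	}
	if fields[0] == "max" {
		return 0, false, nil
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, false, err
	}
	if period <= 0 {
		return 0, false, fmt.Errorf("invalid period in cpu.max: %q", contents)
	}
	return quota / period, true, nil
}

func main() {
	// Assumed path; adjust if your cgroup is mounted elsewhere.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		fmt.Println("could not read cgroup v2 cpu.max:", err)
		return
	}
	cores, limited, err := parseCPUMax(string(data))
	if err != nil {
		fmt.Println("could not parse cpu.max:", err)
		return
	}
	fmt.Printf("GOMAXPROCS=%d NumCPU=%d limited=%v quota=%.2f cores\n",
		runtime.GOMAXPROCS(0), runtime.NumCPU(), limited, cores)
}
```

A line such as "GOMAXPROCS=64 NumCPU=64 limited=true quota=2.00 cores" is the signature of the trap.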

The Fix: `automaxprocs`

Uber’s automaxprocs library reads the Linux CFS bandwidth quota from /sys/fs/cgroup and sets GOMAXPROCS to match. It is a one-line import:

import _ "go.uber.org/automaxprocs"

That import’s init() function runs at startup, reads the real CPU quota, and calls runtime.GOMAXPROCS with the correct value. No configuration, no code changes beyond the import.

If no CPU quota is set (for example, outside a container), it leaves GOMAXPROCS at the runtime default, which is runtime.NumCPU(), so it is safe to add unconditionally.

go get go.uber.org/automaxprocs

For environments where you cannot add dependencies, set GOMAXPROCS explicitly in your deployment manifest:

# Kubernetes
env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu

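Note that the downward API exposes the limit in whole cores and rounds up (the default divisor is 1), so limits.cpu: 1500m sets GOMAXPROCS=2. automaxprocs rounds fractional quotas down by default (1.5 cores → 1 P, never below 1). Rounding up leaves more Ps than quota and risks throttling; rounding down leaves part of the quota unused. For fractional limits, choose one deliberately.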

Go 1.25 and later: The runtime now reads container CPU limits from cgroups by default and sets GOMAXPROCS accordingly, with periodic updates if the limit changes. If your go.mod targets Go 1.25+, you may not need automaxprocs — but explicit tuning or the library is still useful when you want fractional-CPU rounding control or run on older toolchains.

When to Tune Beyond the Default

For most services, matching GOMAXPROCS to your effective CPU count is all you need. There are two cases where you might deliberately diverge:

Mixed CPU- and I/O-bound work. If your service does heavy I/O (database calls, HTTP requests) alongside CPU-bound processing, a few extra Ps can help absorb bursts: an idle P costs little, and spare Ps let CPU-bound goroutines start without queueing. Goroutines parked on network I/O do not hold a P, so the gain depends on the workload. A modest overshoot of 1.5× effective cores is sometimes beneficial here; benchmark it under your real CPU limit before adopting it.

Thread-local requirements. runtime.LockOSThread() binds a goroutine to a specific OS thread (M), not to a logical processor (P). Use it when you need thread-local state — certain C libraries, OpenGL, or other APIs that require a stable OS thread identity. It does not reserve a P, prevent other goroutines from running, or stop asynchronous preemption of the locked goroutine.

func threadLocalWork() {
    runtime.LockOSThread()
    defer runtime.UnlockOSThread()
    // this goroutine always runs on the same OS thread
}

Use this sparingly. A locked thread cannot be reused by the scheduler for other goroutines, so if the goroutine blocks, that OS thread sits idle rather than running other work.
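A common way to keep the cost contained is to dedicate a single locked goroutine to the thread-bound API and send work to it over a channel. The following is a sketch of that pattern under the assumption that calls can be serialised; `startPinnedWorker` is a hypothetical helper, not part of the standard library.

```go
package main

import (
	"fmt"
	"runtime"
)

// startPinnedWorker starts one goroutine locked to its OS thread. The returned
// run function executes a job on that thread and waits for it to finish; stop
// shuts the worker down and releases the thread.
func startPinnedWorker() (run func(job func()), stop func()) {
	jobs := make(chan func())
	done := make(chan struct{})

	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		defer close(done)
		for job := range jobs {
			job() // every job runs on the same OS thread
		}
	}()

	run = func(job func()) {
		finished := make(chan struct{})
		jobs <- func() {
			job()
			close(finished)
		}
		<-finished
	}
	stop = func() {
		close(jobs)
		<-done
	}
	return run, stop
}

func main() {
	run, stop := startPinnedWorker()
	defer stop()

	total := 0
	for i := 1; i <= 3; i++ {
		n := i
		run(func() { total += n })
	}
	fmt.Println(total) // prints 6
}
```

Only one thread is tied up this way, and the rest of the program keeps the scheduler's usual freedom to move goroutines between threads.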

The Mental Model That Prevents the Bug

GOMAXPROCS should equal the number of CPU cores the OS actually gives you — not the number of cores on the machine.

The Go scheduler is designed for the assumption that each P maps to a real core it can run on without preemption. When that assumption breaks — because the kernel is scheduling your threads onto fewer cores than you have Ps — the scheduler’s work-stealing and queue management become overhead rather than optimisation.

On Go versions before 1.25, containerised deployments should use automaxprocs or set GOMAXPROCS explicitly — the default reads the host core count, not the cgroup quota. Go 1.25+ fixes this by default, but verifying runtime.GOMAXPROCS(0) against your actual quota at startup is still worth doing.

Summary

Scenario	Default behaviour	Correct setting
Bare-metal, all cores available	`GOMAXPROCS = NumCPU` ✓	No change needed
Container with CPU limits (Go < 1.25)	`GOMAXPROCS = host NumCPU` ✗	Match CPU quota
Container with CPU limits (Go ≥ 1.25)	Matches cgroup quota by default ✓	Verify at startup
Kubernetes with fractional CPUs	Downward API env var rounds up (1500m → 2); `automaxprocs` rounds down	Choose the rounding deliberately or set an explicit value
Thread-local C/API requirements	Goroutine may migrate across threads	`LockOSThread()` when API requires it
I/O-heavy with some CPU work	Accurate match may under-utilise	Modest overshoot (1.5×)

The GOMAXPROCS trap is invisible because the mismatched setting does not produce an error; it just makes your service slower in a way that looks like load, not configuration. Once you know to look at the gap between NumCPU and your actual CPU quota, it becomes one of the first things to check in any performance investigation.