Self-Hosted Runners Are a Maintenance Trap

Hayssem Vazquez-Elsayed, product

The costs that don't show up in the spreadsheet
The breakeven math at three team sizes
The failure mode nobody talks about
What managed runners handle for you
When self-hosted is genuinely the right call
The real comparison: your team's time vs. a vendor's bill

The pitch for self-hosted runners is simple: you control the hardware, you skip GitHub's per-minute charges, and you get to customize everything. What the pitch leaves out is everything that happens after day one.

I've watched teams spin up self-hosted runner fleets with genuine enthusiasm, only to find themselves debugging runner registration failures at 2 AM six months later. The compute is cheap. The maintenance is not. And for most engineering organizations above roughly 20 people, the total cost of ownership tips decisively toward managed runners.

The costs that don't show up in the spreadsheet

GitHub's documentation is upfront about what self-hosted runners give you: control over hardware, OS, and software. It's also upfront about what they don't give you: "you are responsible for updating the operating system and all other software." That single sentence hides a surprising amount of work.

Fleet management. Someone has to provision new runners when load increases, decommission stale ones, and monitor whether they're actually accepting jobs. GitHub's Actions Runner Controller (ARC) on Kubernetes helps with orchestration, but ARC itself is another system to maintain, upgrade, and debug. You're trading one maintenance surface for another.
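One small slice of the "are they actually accepting jobs" check can be sketched as follows. It assumes you have already fetched the JSON from GitHub's REST endpoint that lists self-hosted runners, where each entry carries a `status` of "online" or "offline" and a `busy` flag. The function name and the sample payload are illustrative, not taken from any particular tool.

```python
from typing import Any


def summarize_fleet(payload: dict[str, Any]) -> dict[str, list[str]]:
    """Group runner names by whether they can take work right now."""
    summary: dict[str, list[str]] = {"offline": [], "busy": [], "idle": []}
    for runner in payload.get("runners", []):
        if runner.get("status") != "online":
            summary["offline"].append(runner["name"])
        elif runner.get("busy"):
            summary["busy"].append(runner["name"])
        else:
            summary["idle"].append(runner["name"])
    return summary


# Hypothetical sample shaped like the list-runners response.
sample = {
    "total_count": 3,
    "runners": [
        {"id": 1, "name": "runner-07", "status": "online", "busy": True},
        {"id": 2, "name": "runner-12", "status": "offline", "busy": False},
        {"id": 3, "name": "runner-15", "status": "online", "busy": False},
    ],
}

print(summarize_fleet(sample))
```

Even a check this small needs somewhere to run, credentials to rotate, and an alert route, which is the maintenance surface in miniature.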

Image freshness. GitHub-hosted runners ship with updated images weekly. When you self-host, you own that update cycle. That means tracking which versions of Node, Python, Docker, and every other build dependency are installed, testing image changes against your entire workflow matrix, and rolling them out without breaking in-flight jobs. Miss a cycle and your runners drift from what developers expect. Miss enough cycles and you're debugging failures caused by stale OpenSSL versions instead of shipping features.
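Tracking drift usually comes down to comparing what a runner has installed against what workflows expect. A minimal sketch, assuming both sides are available as simple tool-to-version mappings (the manifests and version strings below are made up for illustration):

```python
from typing import Optional


def find_drift(
    expected: dict[str, str], installed: dict[str, str]
) -> dict[str, tuple[str, Optional[str]]]:
    """Return tools whose installed version differs from the expected one.

    Each value is (expected_version, installed_version); installed_version
    is None when the tool is missing from the runner entirely.
    """
    drift = {}
    for tool, want in expected.items():
        have = installed.get(tool)
        if have != want:
            drift[tool] = (want, have)
    return drift


# Hypothetical manifests; the versions are illustrative only.
expected = {"node": "20.11.1", "python": "3.12.2", "docker": "25.0.3"}
runner_12 = {"node": "18.19.0", "python": "3.12.2"}

print(find_drift(expected, runner_12))
```

The comparison is the easy part. Producing an accurate `installed` mapping for every runner, and deciding who owns `expected`, is where the hours go.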

Security patching. Every self-hosted runner is an attack surface. It has network access, it pulls code from repositories, and it executes arbitrary commands defined in workflow files. A CVE in the runner's OS kernel or in a pre-installed tool doesn't patch itself. Your security team now has another fleet of machines to include in their vulnerability management program.

Capacity planning. Monday morning at 9 AM, your entire engineering org pushes code. Friday at 6 PM, the fleet sits idle. Self-hosted runners don't scale to zero unless you build or buy an autoscaler. You're either over-provisioning (paying for idle machines) or under-provisioning (queuing jobs and blocking developers).
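The over- versus under-provisioning trade-off can be made concrete with a toy model. The sketch below assumes an hourly demand curve (how many runners jobs would like to use in each hour) and a fixed fleet size; the demand numbers are invented, and queued work is not carried into the next hour, so real backlog would be worse than shown.

```python
def fleet_utilization(demand: list[int], fleet_size: int) -> tuple[int, int]:
    """Return (idle_runner_hours, queued_job_hours) for a fixed-size fleet.

    demand[i] is the number of runners that jobs would like to use in hour i.
    Queued work is counted per hour and not carried over, which understates
    real backlog; this is a deliberate simplification.
    """
    idle = sum(max(fleet_size - d, 0) for d in demand)
    queued = sum(max(d - fleet_size, 0) for d in demand)
    return idle, queued


# Hypothetical weekday: quiet overnight, a spike at 9 AM, tapering by evening.
demand = [0] * 8 + [12, 10, 8, 6, 6, 8, 8, 6, 4, 2] + [0] * 6

for size in (4, 8, 12):
    idle, queued = fleet_utilization(demand, size)
    print(f"{size} runners: {idle} idle runner-hours, {queued} queued job-hours")
```

Under these assumptions no fixed size avoids both problems: the fleet that never queues a job is also the one that sits idle most of the day.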

On-call incident response. When a runner goes dark mid-CI, someone gets paged. Not a vendor's SRE team. Your team. The failure mode is particularly nasty because it often presents as a stuck workflow rather than an explicit error, so developers wait 10 or 15 minutes before realizing the job isn't slow, it's dead.

The breakeven math at three team sizes

Let's put rough numbers on this. The compute cost of a self-hosted runner is straightforward: an AWS m5.xlarge (4 vCPU, 16 GB RAM) runs about $140/month on-demand, or around $85/month with a one-year reserved instance. That's the number people compare against GitHub's $0.008/minute for a 2-core Linux runner.
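As a rough check on that comparison, here is the break-even volume using only the figures quoted above. It deliberately ignores that the m5.xlarge has 4 vCPUs while the hosted runner has 2 cores, and it ignores included minutes, so treat it as a back-of-the-envelope sketch rather than a pricing model.

```python
def breakeven_minutes(monthly_instance_cost: float, hosted_rate_per_minute: float) -> float:
    """CI minutes per month at which a flat-rate instance matches per-minute billing."""
    return monthly_instance_cost / hosted_rate_per_minute


HOSTED_RATE = 0.008  # $/minute for a 2-core Linux runner (figure from the text)

for label, monthly_cost in (("on-demand", 140.0), ("1-year reserved", 85.0)):
    minutes = breakeven_minutes(monthly_cost, HOSTED_RATE)
    print(f"{label}: {minutes:,.0f} minutes/month ({minutes / 60:,.0f} hours)")
```

A 730-hour month has 43,800 minutes, so the on-demand break-even of 17,500 minutes means the instance only pays for itself on compute if it is busy roughly 40% of the time, around the clock.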

But compute is maybe half the real cost. The rest is people.

10 engineers. At this size, CI usage is low enough that GitHub's included minutes (3,000 on Team, 50,000 on Enterprise Cloud) often cover the bill. A self-hosted fleet of two runners might save $50-100/month in compute but costs 5-10 hours/month in maintenance from an engineer whose time is worth $80-150/hour. That's $400-1,500/month of labor spent to save at most $100. The math doesn't work.

50 engineers. Now you're burning through included minutes faster and running 8-12 concurrent runners. The compute savings start to look real: maybe $1,000-2,000/month versus GitHub-hosted. But the maintenance scales too. You need dedicated Terraform or Ansible configs, a CI for your CI (building and testing runner images), monitoring dashboards, and someone who can troubleshoot ARC pod scheduling issues. Conservatively, that's 20-30 hours/month of DevOps time across the team. At $100/hour fully loaded, you're spending $2,000-3,000/month on labor, which at best cancels out the compute savings and at worst exceeds them.

200 engineers. This is where the picture gets complicated. Compute savings are substantial, potentially $8,000-15,000/month. But so is the operational overhead. At this scale, you're likely dedicating one to two full-time engineers to runner infrastructure. You're dealing with multi-region deployments, GPU runners for ML workloads, and compliance audits that want to know exactly what's running on those machines. The break-even point depends heavily on whether you already have a mature platform engineering team. If you do, the marginal cost of adding runner management is lower. If you don't, you're building one just to save on CI compute.
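The first two scenarios give enough numbers to compute a net figure. The sketch below subtracts maintenance labor from compute savings and reports a worst-to-best range, using only the figures stated for 10 and 50 engineers. The 200-engineer case is left out because the text gives headcount rather than an hourly cost, and any salary figure plugged in here would be an assumption.

```python
from dataclasses import dataclass


@dataclass
class Range:
    low: float
    high: float


def net_savings_range(compute_savings: Range, hours: Range, rate: Range) -> Range:
    """Monthly compute savings minus maintenance labor, as a (worst, best) range."""
    worst = compute_savings.low - hours.high * rate.high
    best = compute_savings.high - hours.low * rate.low
    return Range(worst, best)


# Figures taken from the 10- and 50-engineer scenarios above.
scenarios = {
    "10 engineers": (Range(50, 100), Range(5, 10), Range(80, 150)),
    "50 engineers": (Range(1000, 2000), Range(20, 30), Range(100, 100)),
}

for name, (savings, hours, rate) in scenarios.items():
    net = net_savings_range(savings, hours, rate)
    print(f"{name}: net ${net.low:,.0f} to ${net.high:,.0f} per month")
```

With these inputs the best case at 50 engineers is exactly break-even, and every other combination is a loss.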

The failure mode nobody talks about

GitHub's own documentation acknowledges that self-hosted runners "don't need to have a clean instance for every job execution." That's a feature framed as flexibility. In practice, it's a trap.

Persistent runners accumulate state between jobs. A build writes to /tmp and the next job inherits those files. An npm install leaves a node_modules directory that gets picked up by a later workflow that didn't expect it. A Docker build leaves cached layers that make the next build succeed on that runner but fail on any other machine.

The result is non-reproducible builds. Your CI passes on runner-07 but fails on runner-12 because of leftover state. Developers start adding "clean workspace" steps to every workflow. Those steps add minutes to every run. Eventually someone writes a cron job to wipe the runners nightly, which introduces its own failure modes when a long-running job gets killed at midnight.

This is also a security problem. If one job is compromised, for example by a malicious pull request that triggers a workflow, the attacker's payload can persist on the runner and affect subsequent jobs. Ephemeral runners eliminate this entire category of risk by destroying the environment after every job completes.
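The difference between the two models is easy to reproduce in miniature. In the sketch below a temporary directory stands in for a runner's workspace, which is only an analogy: a real ephemeral runner also discards processes, caches, and credentials, not just files. The toy `run_job` function and the marker file name are hypothetical.

```python
import tempfile
from pathlib import Path


def run_job(workspace: Path) -> bool:
    """Toy 'build': report whether it found state left by an earlier job,
    then leave state of its own behind."""
    marker = workspace / "node_modules" / ".leftover"
    found_leftovers = marker.exists()
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text("written by a previous job")
    return found_leftovers


def persistent_runner(jobs: int) -> list[bool]:
    """Every job shares one workspace, as on a long-lived runner."""
    with tempfile.TemporaryDirectory() as shared:
        return [run_job(Path(shared)) for _ in range(jobs)]


def ephemeral_runner(jobs: int) -> list[bool]:
    """Every job gets a fresh workspace that is deleted afterwards."""
    results = []
    for _ in range(jobs):
        with tempfile.TemporaryDirectory() as fresh:
            results.append(run_job(Path(fresh)))
    return results


print("persistent:", persistent_runner(3))
print("ephemeral: ", ephemeral_runner(3))
```

On the persistent runner every job after the first sees what its predecessor left behind. On the ephemeral one, none do.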

What managed runners handle for you

A managed runner service takes the entire operational surface described above and makes it someone else's problem. Not in a hand-wavy way, but specifically:

- Provisioning and scaling happen automatically. Runners spin up when jobs are queued and spin down when they're done. No capacity planning spreadsheets.
- Image updates are the vendor's responsibility. Security patches, new tool versions, dependency updates: all handled without your team filing Jira tickets against themselves.
- Ephemeral isolation means every job gets a clean VM. No state leakage, no non-reproducible builds, no cross-job contamination.
- Security hardening is baked into the platform. The attack surface of an ephemeral, vendor-managed VM is categorically smaller than a long-lived machine sitting in your AWS account with SSH keys your team rotates "eventually."

Tenki's managed runners are a concrete example. They run on bare-metal infrastructure owned by Tenki, not shared cloud VMs. Every job executes in an ephemeral VM that's destroyed after completion, so there's zero state leakage between runs. The migration path is a single line change in your workflow YAML: swap the runs-on label and your existing workflows keep working. Tenki's published benchmarks show builds running 30-67% faster than GitHub-hosted runners at $0.002 per core/minute for x64, compared to GitHub's $0.003 per core/minute for standard Linux runners.
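To show how small a runs-on swap is, here is a sketch that rewrites the label in a workflow file's text. The replacement label `example-managed-runner` is a placeholder, not a real Tenki label; check the vendor's documentation for the actual value. The function only handles the plain `runs-on: <label>` form and leaves matrix expressions, label arrays, and lines with trailing comments untouched.

```python
def swap_runs_on(workflow_text: str, old_label: str, new_label: str) -> str:
    """Replace `runs-on: old_label` with `runs-on: new_label`, line by line."""
    out = []
    for line in workflow_text.splitlines(keepends=True):
        # Only rewrite lines that are exactly `runs-on: <old_label>`.
        if line.strip() == f"runs-on: {old_label}":
            line = line.replace(old_label, new_label, 1)
        out.append(line)
    return "".join(out)


workflow = """\
name: ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test
"""

NEW_LABEL = "example-managed-runner"  # hypothetical placeholder label

print(swap_runs_on(workflow, "ubuntu-latest", NEW_LABEL))
```

Everything except the one `runs-on` line comes back byte-for-byte identical, which is the point: the steps themselves do not change.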

When self-hosted is genuinely the right call

I don't think self-hosted runners are always wrong. There are real scenarios where they're the only viable option.

Air-gapped environments. If your compliance requirements forbid CI workloads from touching the public internet, no managed runner service can help you. Defense contractors, certain healthcare organizations, and financial institutions with strict data residency rules fall into this category. You need runners inside your network perimeter. Full stop.

Specialized hardware. If your build process requires FPGAs, specific GPU models for inference testing, or physical devices connected via USB for embedded firmware validation, managed runners won't have that hardware. You're going to need your own machines.

Extreme-volume cost optimization. At truly massive scale (thousands of engineers, hundreds of thousands of CI minutes per month), the compute savings of self-hosted can be large enough to justify a dedicated platform team. But "we're big enough to afford the team" is a very different argument than "self-hosted is cheaper." You're not eliminating the maintenance cost; you're amortizing it across enough volume that the per-minute rate drops below managed alternatives.

For everyone else, the question isn't whether you can run your own fleet. It's whether you should.

The real comparison: your team's time vs. a vendor's bill

The self-hosted vs. managed debate usually gets framed as a compute cost comparison. That framing is wrong. The correct comparison is between the fully loaded cost of maintaining runner infrastructure (compute + engineering time + opportunity cost of what those engineers aren't building) and the per-minute price of a managed service.

When you factor in image maintenance, security patching, capacity planning, on-call rotations, and the developer productivity lost to non-reproducible builds on stateful runners, managed runners aren't just competitive. For most teams, they're cheaper.

The DIY runner ecosystem (RunsOn, Ubicloud, and the various open-source autoscalers) reduces some of the maintenance overhead. These are legitimate tools. But they don't eliminate it. You're still responsible for the underlying infrastructure, the images, and the operational response when things break. You've automated the provisioning; you haven't outsourced the responsibility.

If your team is above 20 engineers, doesn't operate in an air-gapped environment, and doesn't need exotic hardware, start with managed runners. Save the infrastructure engineering for problems that are actually unique to your business.