
Scaling CI for Autonomous Coding Agents: Lessons from Stripe's Minions
Stripe's autonomous coding agents now produce over 1,300 pull requests per week. Every one of those PRs is human-reviewed but contains zero human-written code. The system, called Minions, operates across a codebase that processes over $1 trillion in annual payment volume. That's not a research project. It's production at scale.
Most engineering teams aren't thinking about what happens when agent-generated PRs outnumber human ones by 5x or 10x. But that's the trajectory. And the bottlenecks that emerge at that volume aren't in the agent itself. They're in everything around it: CI runners, test queues, merge conflict resolution, and code review bandwidth.
Minions aren't code completion tools. They're fully unattended agents that take a task description, execute it end-to-end, and deliver a finished pull request. An engineer types a request in Slack, and the agent handles everything: reading the codebase, writing code, running linters, iterating against tests, and submitting the PR for review.
The system started as an internal fork of Block's open-source agent Goose, then was heavily customized for Stripe's infrastructure. Each Minion run launches on a dedicated cloud development environment (a "devbox") that spins up in about 10 seconds with the full monorepo, warm caches, and pre-loaded services. That isolation is critical: agents can run with full permissions because they can't touch production.
Stripe orchestrates Minions using what they call blueprints: state machines that interleave deterministic steps (run linters, push to branch) with agentic steps (implement the task, fix failing tests). This hybrid approach means the LLM handles the creative parts while deterministic code guarantees that required steps like linting always happen.
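A minimal sketch of that blueprint pattern, with hypothetical step names and toy logic (this is not Stripe's actual code, just the shape of the idea): deterministic steps always execute, so required gates like linting can't be skipped by the model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]   # takes and returns pipeline state
    agentic: bool                 # True if an LLM drives this step

def run_blueprint(steps: list[Step], state: dict) -> dict:
    """Execute steps in order. Deterministic steps run as plain code,
    so the required ones are guaranteed to happen regardless of what
    the agentic steps produce."""
    for step in steps:
        state = step.run(state)
    return state

# Hypothetical pipeline: the LLM implements and fixes; plain code lints and pushes.
blueprint = [
    Step("implement_task", lambda s: {**s, "diff": "..."}, agentic=True),
    Step("run_linters", lambda s: {**s, "lint_ok": True}, agentic=False),
    Step("fix_failures", lambda s: {**s, "tests_ok": True}, agentic=True),
    Step("push_branch", lambda s: {**s, "pushed": True}, agentic=False),
]

result = run_blueprint(blueprint, {"task": "upgrade dependency"})
```

The value of the state-machine framing is that the orchestrator, not the model, owns control flow: an agent can fail to fix a test, but it cannot decide to skip the lint step.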
A team of 50 engineers might push 200 PRs per week. Add autonomous agents, and that number can jump to 1,000 or 2,000 without adding a single person. The infrastructure implications are real and specific.
Every agent PR triggers CI. If your runners are sized for human-speed PR volume, agent-speed volume will saturate them. GitHub's hosted runners already have concurrency limits, and self-hosted pools sized for 200 PRs per week won't absorb 1,300. Stripe solves this with their devbox infrastructure, but most teams don't have pre-warmed cloud environments sitting in a pool.
The cost math changes too. Stripe deliberately limits Minions to at most two CI runs per task because "CI runs cost tokens, compute, and time, and there are diminishing marginal returns if an LLM is running against indefinitely many rounds of a full CI loop." That constraint only works because they've invested heavily in local pre-push validation. Without that investment, agents will burn through CI minutes iterating on failures that could have been caught earlier.
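The control flow described above can be sketched as a simple loop: iterate cheaply against local checks until they pass, then spend at most two full CI rounds. All three callables here are placeholders for your own tooling, not Stripe's implementation.

```python
MAX_CI_RUNS = 2  # cap full CI rounds per task: diminishing returns past that

def attempt_task(local_checks, run_agent_fix, trigger_ci) -> bool:
    """Iterate locally until pre-push validation passes, then allow at
    most MAX_CI_RUNS full CI rounds before escalating to a human."""
    while not local_checks():
        run_agent_fix("local check failure")  # cheap iteration, no CI spend
    for _ in range(MAX_CI_RUNS):
        if trigger_ci():                      # expensive: tokens, compute, time
            return True
        run_agent_fix("CI failure")
    return False                              # out of budget; hand off to a human

# Toy run: local checks fail once, agent fixes them, then CI passes first try.
state = {"local_ok": False}
merged = attempt_task(
    local_checks=lambda: state["local_ok"],
    run_agent_fix=lambda reason: state.update(local_ok=True),
    trigger_ci=lambda: True,
)
```

The key design choice is where failures get caught: every iteration that happens inside the `while` loop is one that never reaches a runner.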
When 20 agents push branches simultaneously, merge queues back up. Each PR that lands changes the base, potentially invalidating the CI results of PRs still waiting. At human pace, this happens occasionally. At agent pace, it's constant. Stripe handles this by giving each agent its own isolated devbox with a clean checkout, but the merge problem still exists at the point where branches converge into the main line.
Stripe has over three million tests. Running the full suite for every agent PR would be absurd, so they use selective test execution: CI runs only the tests relevant to the changes. This is the kind of infrastructure investment that most teams haven't made yet. If your test suite takes 40 minutes end-to-end and you're running it 1,300 times a week, you need either very fast runners, very good test selection, or both.
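A toy version of test selection, assuming a simple directory-ownership map (real systems like Stripe's derive the mapping from build graphs or coverage data rather than a hand-written dict):

```python
from pathlib import PurePosixPath

def select_tests(changed_files: list[str],
                 test_map: dict[str, list[str]]) -> set[str]:
    """Pick only the test suites whose source directories overlap the diff.
    test_map keys are source dirs; values are the suites covering them."""
    selected: set[str] = set()
    for f in changed_files:
        for src_dir, suites in test_map.items():
            if PurePosixPath(f).is_relative_to(src_dir):
                selected.update(suites)
    return selected

# Hypothetical monorepo layout:
test_map = {
    "payments/core": ["tests/payments_unit", "tests/payments_integration"],
    "billing": ["tests/billing_unit"],
    "docs": [],  # doc-only changes trigger nothing
}

tests = select_tests(["payments/core/charge.py", "docs/readme.md"], test_map)
# Only the payments suites run; everything else is skipped.
```

Even this crude heuristic changes the cost curve: a doc-only agent PR consumes zero test minutes instead of a full suite run.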
Stripe reviews every agent-generated PR with a human. That's a deliberate choice for a codebase handling payments at their scale. But reviewing 1,300 additional PRs per week is a significant load, and it only works because of two things: the PRs tend to be well-scoped tasks (config changes, dependency upgrades, minor refactors), and the agents produce code that already passes CI before the human sees it.
For most teams, though, reviewing AI-generated code at this volume means rethinking how review time is allocated.
Stripe also feeds agent-generated PRs through their existing quality gates: Minions use the same linters, rule files, and CI checks as human engineers. Cameron Bernhardt, an engineering manager at Stripe, noted that "the agents are increasingly producing changes end-to-end" while maintaining this review standard.
If you're planning to deploy autonomous coding agents, you need to model the CI impact before you feel it. Here's a rough framework.
Start with your current baseline. How many PRs per week do your engineers create? What's the average CI time per PR? What's your current runner utilization during peak hours?
Apply a multiplier. Stripe's baseline is roughly 1,300 agent PRs per week across their engineering org. For a smaller team, even 3-5x your current PR volume is a reasonable planning assumption once agents are running. And remember Stripe's two-CI-run limit: each agent PR might trigger CI twice, so your effective multiplier on runner minutes is higher than the PR count suggests.
Factor in the burst pattern. Human PRs follow workday patterns. Agent PRs don't. Stripe engineers routinely spin up multiple Minions in parallel, and agents can run around the clock. Your peak CI load may shift from Tuesday at 2 PM to "always."
Invest in local validation. Stripe's approach of catching lint and type errors before pushing to CI is the single biggest cost reducer at agent scale. Their pre-push hooks resolve common issues in under a second. Every failure caught locally is a CI run saved.
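The framework above reduces to simple arithmetic. A rough weekly load model, with illustrative numbers and planning assumptions baked into the defaults (agent PRs capped at two CI rounds; humans re-running CI about half the time):

```python
def ci_minutes_per_week(human_prs: int,
                        agent_multiplier: float,
                        ci_minutes_per_run: float,
                        ci_runs_per_agent_pr: float = 2.0,
                        ci_runs_per_human_pr: float = 1.5) -> float:
    """Estimate weekly runner-minute demand once agents are deployed.
    Every parameter is a planning assumption to replace with your own data."""
    agent_prs = human_prs * agent_multiplier
    human_minutes = human_prs * ci_runs_per_human_pr * ci_minutes_per_run
    agent_minutes = agent_prs * ci_runs_per_agent_pr * ci_minutes_per_run
    return human_minutes + agent_minutes

# Example: 200 human PRs/week, agents add 3x that volume, 15-minute CI runs.
total = ci_minutes_per_week(human_prs=200, agent_multiplier=3, ci_minutes_per_run=15)
# 200 * 1.5 * 15 + 600 * 2 * 15 = 4,500 + 18,000 = 22,500 runner-minutes/week
```

Divide the result by your runner pool's weekly capacity to see how close to saturation you'd be, and remember that agent load spreads around the clock rather than clustering at human work hours.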
Stripe built an entire developer productivity organization around making engineers faster. Minions inherited that investment: pre-warmed environments, selective test execution, a centralized MCP server with nearly 500 tools, conditional agent rule files scoped to subdirectories. Most companies don't have any of this.
That gap matters. If your CI runners are slow, agents will amplify that slowness. If your test suite isn't selective, agents will waste compute running irrelevant tests. If your review process doesn't differentiate between a config tweak and an architectural change, agent PRs will flood the queue and slow everything down.
The takeaway from Stripe's experience is that autonomous coding agents don't just need a good LLM. They need the surrounding infrastructure to be fast, elastic, and smart about what it runs. "What's good for humans is good for agents" is how Stripe puts it, and they're right. But the bar for "good" goes up substantially when your agent fleet is generating pull requests 24 hours a day.
You don't need to be Stripe-sized to hit these bottlenecks. A 30-person team running five or six concurrent agents will generate enough PR volume to stress a typical CI setup.
The shift from experimental agent usage to production-scale deployment is already happening. Stripe's 1,300 PRs per week is today's number, and it's growing. For platform engineers and CI/CD architects, the question isn't whether you'll face this volume. It's whether your infrastructure will be ready when you do.