
Scaling CI for Autonomous Coding Agents: Lessons from Stripe's Minions
Stripe's autonomous coding agents now produce over 1,300 pull requests per week. Every one of those PRs is human-reviewed but contains zero human-written code. The system, called Minions, operates across a codebase that processes over $1 trillion in annual payment volume. That's not a research project. It's production at scale.
Most engineering teams aren't thinking about what happens when agent-generated PRs outnumber human ones by 5x or 10x. But that's the trajectory. And the bottlenecks that emerge at that volume aren't in the agent itself. They're in everything around it: CI runners, test queues, merge conflict resolution, and code review bandwidth.
Minions aren't code completion tools. They're fully unattended agents that take a task description, execute it end-to-end, and deliver a finished pull request. An engineer types a request in Slack, and the agent handles everything: reading the codebase, writing code, running linters, iterating against tests, and submitting the PR for review.
The system started as an internal fork of Block's open-source agent Goose, then was heavily customized for Stripe's infrastructure. Each Minion run launches on a dedicated cloud development environment (a "devbox") that spins up in about 10 seconds with the full monorepo, warm caches, and pre-loaded services. That isolation is critical: agents can run with full permissions because they can't touch production.
Stripe orchestrates Minions using what they call blueprints: state machines that interleave deterministic steps (run linters, push to branch) with agentic steps (implement the task, fix failing tests). This hybrid approach means the LLM handles the creative parts while deterministic code guarantees that required steps like linting always happen.
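A minimal sketch of that blueprint pattern, with hypothetical step names and toy logic (this is not Stripe's actual code, just the shape of the idea): deterministic steps always execute, so required gates like linting can't be skipped by the model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]   # takes and returns pipeline state
    agentic: bool                 # True if an LLM drives this step

def run_blueprint(steps: list[Step], state: dict) -> dict:
    """Execute steps in order. Deterministic steps run as plain code,
    so the required ones are guaranteed to happen regardless of what
    the agentic steps produce."""
    for step in steps:
        state = step.run(state)
    return state

# Hypothetical pipeline: the LLM implements and fixes; plain code lints and pushes.
blueprint = [
    Step("implement_task", lambda s: {**s, "diff": "..."}, agentic=True),
    Step("run_linters", lambda s: {**s, "lint_ok": True}, agentic=False),
    Step("fix_failures", lambda s: {**s, "tests_ok": True}, agentic=True),
    Step("push_branch", lambda s: {**s, "pushed": True}, agentic=False),
]

result = run_blueprint(blueprint, {"task": "upgrade dependency"})
```

The value of the state-machine framing is that the orchestrator, not the model, owns control flow: an agent can fail to fix a test, but it cannot decide to skip the lint step.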
A team of 50 engineers might push 200 PRs per week. Add autonomous agents, and that number can jump to 1,000 or 2,000 without adding a single person. The infrastructure implications are real and specific.
Every agent PR triggers CI. If your runners are sized for human-speed PR volume, agent-speed volume will saturate them. GitHub's hosted runners already have concurrency limits, and self-hosted pools sized for 200 PRs per week won't absorb 1,300. Stripe solves this with their devbox infrastructure, but most teams don't have pre-warmed cloud environments sitting in a pool.
The cost math changes too. Stripe deliberately limits Minions to at most two CI runs per task because "CI runs cost tokens, compute, and time, and there are diminishing marginal returns if an LLM is running against indefinitely many rounds of a full CI loop." That constraint only works because they've invested heavily in local pre-push validation. Without that investment, agents will burn through CI minutes iterating on failures that could have been caught earlier.
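The control flow described above can be sketched as a simple loop: iterate cheaply against local checks until they pass, then spend at most two full CI rounds. All three callables here are placeholders for your own tooling, not Stripe's implementation.

```python
MAX_CI_RUNS = 2  # cap full CI rounds per task: diminishing returns past that

def attempt_task(local_checks, run_agent_fix, trigger_ci) -> bool:
    """Iterate locally until pre-push validation passes, then allow at
    most MAX_CI_RUNS full CI rounds before escalating to a human."""
    while not local_checks():
        run_agent_fix("local check failure")  # cheap iteration, no CI spend
    for _ in range(MAX_CI_RUNS):
        if trigger_ci():                      # expensive: tokens, compute, time
            return True
        run_agent_fix("CI failure")
    return False                              # out of budget; hand off to a human

# Toy run: local checks fail once, agent fixes them, then CI passes first try.
state = {"local_ok": False}
merged = attempt_task(
    local_checks=lambda: state["local_ok"],
    run_agent_fix=lambda reason: state.update(local_ok=True),
    trigger_ci=lambda: True,
)
```

The key design choice is where failures get caught: every iteration that happens inside the `while` loop is one that never reaches a runner.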
When 20 agents push branches simultaneously, merge queues back up. Each PR that lands changes the base, potentially invalidating the CI results of PRs still waiting. At human pace, this happens occasionally. At agent pace, it's constant. Stripe handles this by giving each agent its own isolated devbox with a clean checkout, but the merge problem still exists at the point where branches converge into the main line.
Stripe has over three million tests. Running the full suite for every agent PR would be absurd, so they use selective test execution: CI runs only the tests relevant to the changes. This is the kind of infrastructure investment that most teams haven't made yet. If your test suite takes 40 minutes end-to-end and you're running it 1,300 times a week, you need either very fast runners, very good test selection, or both.
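A toy version of test selection, assuming a simple directory-ownership map (real systems like Stripe's derive the mapping from build graphs or coverage data rather than a hand-written dict):

```python
from pathlib import PurePosixPath

def select_tests(changed_files: list[str],
                 test_map: dict[str, list[str]]) -> set[str]:
    """Pick only the test suites whose source directories overlap the diff.
    test_map keys are source dirs; values are the suites covering them."""
    selected: set[str] = set()
    for f in changed_files:
        for src_dir, suites in test_map.items():
            if PurePosixPath(f).is_relative_to(src_dir):
                selected.update(suites)
    return selected

# Hypothetical monorepo layout:
test_map = {
    "payments/core": ["tests/payments_unit", "tests/payments_integration"],
    "billing": ["tests/billing_unit"],
    "docs": [],  # doc-only changes trigger nothing
}

tests = select_tests(["payments/core/charge.py", "docs/readme.md"], test_map)
# Only the payments suites run; everything else is skipped.
```

Even this crude heuristic changes the cost curve: a doc-only agent PR consumes zero test minutes instead of a full suite run.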
Stripe reviews every agent-generated PR with a human. That's a deliberate choice for a codebase handling payments at their scale. But reviewing 1,300 additional PRs per week is a significant load, and it only works because of two things: the PRs tend to be well-scoped tasks (config changes, dependency upgrades, minor refactors), and the agents produce code that already passes CI before the human sees it.
For most teams, though, reviewing AI-generated code at this volume means rethinking how review time is allocated.
Stripe also feeds agent-generated PRs through their existing quality gates: Minions use the same linters, rule files, and CI checks as human engineers. Cameron Bernhardt, an engineering manager at Stripe, noted that "the agents are increasingly producing changes end-to-end" while maintaining this review standard.
If you're planning to deploy autonomous coding agents, you need to model the CI impact before you feel it. Here's a rough framework.
Start with your current baseline. How many PRs per week do your engineers create? What's the average CI time per PR? What's your current runner utilization during peak hours?
Apply a multiplier. Stripe's baseline is roughly 1,300 agent PRs per week across their engineering org. For a smaller team, even 3-5x your current PR volume is a reasonable planning assumption once agents are running. And remember Stripe's two-CI-run limit: each agent PR might trigger CI twice, so your effective multiplier on runner minutes is higher than the PR count suggests.
Factor in the burst pattern. Human PRs follow workday patterns. Agent PRs don't. Stripe engineers routinely spin up multiple Minions in parallel, and agents can run around the clock. Your peak CI load may shift from Tuesday at 2 PM to "always."
Invest in local validation. Stripe's approach of catching lint and type errors before pushing to CI is the single biggest cost reducer at agent scale. Their pre-push hooks resolve common issues in under a second. Every failure caught locally is a CI run saved.
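The framework above reduces to simple arithmetic. A rough weekly load model, with illustrative numbers and planning assumptions baked into the defaults (agent PRs capped at two CI rounds; humans re-running CI about half the time):

```python
def ci_minutes_per_week(human_prs: int,
                        agent_multiplier: float,
                        ci_minutes_per_run: float,
                        ci_runs_per_agent_pr: float = 2.0,
                        ci_runs_per_human_pr: float = 1.5) -> float:
    """Estimate weekly runner-minute demand once agents are deployed.
    Every parameter is a planning assumption to replace with your own data."""
    agent_prs = human_prs * agent_multiplier
    human_minutes = human_prs * ci_runs_per_human_pr * ci_minutes_per_run
    agent_minutes = agent_prs * ci_runs_per_agent_pr * ci_minutes_per_run
    return human_minutes + agent_minutes

# Example: 200 human PRs/week, agents add 3x that volume, 15-minute CI runs.
total = ci_minutes_per_week(human_prs=200, agent_multiplier=3, ci_minutes_per_run=15)
# 200 * 1.5 * 15 + 600 * 2 * 15 = 4,500 + 18,000 = 22,500 runner-minutes/week
```

Divide the result by your runner pool's weekly capacity to see how close to saturation you'd be, and remember that agent load spreads around the clock rather than clustering at human work hours.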
Stripe built an entire developer productivity organization around making engineers faster. Minions inherited that investment: pre-warmed environments, selective test execution, a centralized MCP server with nearly 500 tools, conditional agent rule files scoped to subdirectories. Most companies don't have any of this.
That gap matters. If your CI runners are slow, agents will amplify that slowness. If your test suite isn't selective, agents will waste compute running irrelevant tests. If your review process doesn't differentiate between a config tweak and an architectural change, agent PRs will flood the queue and slow everything down.
The takeaway from Stripe's experience is that autonomous coding agents don't just need a good LLM. They need the surrounding infrastructure to be fast, elastic, and smart about what it runs. "What's good for humans is good for agents" is how Stripe puts it, and they're right. But the bar for "good" goes up substantially when your agent fleet is generating pull requests 24 hours a day.
You don't need to be Stripe-sized to hit these bottlenecks. A 30-person team running five or six concurrent agents will generate enough PR volume to stress a typical CI setup.
The shift from experimental agent usage to production-scale deployment is already happening. Stripe's 1,300 PRs per week is today's number, and it's growing. For platform engineers and CI/CD architects, the question isn't whether you'll face this volume. It's whether your infrastructure will be ready when you do.