
Copilot Now Batch-Fixes Its Own Reviews. Here's the Gate It Skips.
GitHub published a guide on May 7 that every engineering lead should read. "Agent pull requests are everywhere. Here's how to review them." lays out what goes wrong when agents start flooding your PR queue: duplicated logic, CI gaming, hallucinated correctness, scope creep that looks like a clean diff. The patterns are specific, well-observed, and actionable.
The problem is what the guide asks you to do about it: read more carefully. Follow a checklist. Trace every critical path by hand. Scan for new utilities and search the repo for duplicates yourself.
That works at five PRs a day. It falls apart at fifty. And fifty is already here for teams running parallel coding agents.
GitHub's own numbers tell the story. Copilot code review has processed over 60 million reviews, growing 10x in under a year. More than one in five code reviews on GitHub now involve an agent. The post opens with: "The traditional loop—request review, wait for code owner, merge—breaks down when one developer can kick off a dozen agent sessions before lunch."
That's an accurate diagnosis. But the prescription is a ten-minute manual checklist per PR. The guide presents a timed review workflow, with each step assigned a window on a ten-minute clock: scan and classify (1-2 min), check CI changes (2-3 min), scan for new utilities (3-5 min), trace a critical path (5-8 min), check security boundaries (8-9 min), require evidence (9-10 min).
Ten minutes times fifty PRs is over eight hours. That's a full workday spent reviewing, with no time to write code, plan architecture, or do anything else.
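The arithmetic is easy to check. A minimal sketch, where the function name and the fixed per-PR budget are illustrative assumptions rather than anything from the guide:

```python
def review_hours(prs_per_day: int, minutes_per_pr: float = 10.0) -> float:
    """Hours of manual review per day at a fixed per-PR time budget."""
    return prs_per_day * minutes_per_pr / 60.0


for prs in (5, 50):
    print(f"{prs} PRs/day -> {review_hours(prs):.1f} hours of review")
```

At five PRs a day the checklist costs under an hour. At fifty it costs 500 minutes, which is more than a standard eight-hour day.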
This is the part that matters. Look at the five red flags the guide identifies and ask: which of these requires a human?
CI gaming. Did coverage thresholds change? Were tests removed or skipped? Are CI steps newly gated behind conditions? These are diff-level pattern matches. An automated review tool can flag every one of them before a human opens the PR.
Code reuse blindness. The guide says "for every new helper or utility in an agent PR, do a quick search." That's a repo-wide semantic search. GitHub frames it as manual work. It's the exact kind of analysis a context-aware review tool runs automatically: compare new functions against the existing codebase, surface duplicates, flag reimplemented logic.
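A simplified illustration of the idea, assuming Python sources and a purely structural comparison. The function names and the 0.9 threshold are made up for this sketch, and it is far cruder than the semantic search a production reviewer would run: renaming a variable is enough to lower the score.

```python
import ast
from difflib import SequenceMatcher


def function_shapes(source: str) -> dict[str, str]:
    """Map each function name to a structural dump of its body (names included)."""
    shapes = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            shapes[node.name] = "".join(ast.dump(stmt) for stmt in node.body)
    return shapes


def likely_duplicates(new_source: str, existing_source: str, threshold: float = 0.9):
    """Pair new functions with existing ones whose bodies are near-identical."""
    existing = function_shapes(existing_source)
    matches = []
    for new_name, new_shape in function_shapes(new_source).items():
        for old_name, old_shape in existing.items():
            matcher = SequenceMatcher(None, new_shape, old_shape, autojunk=False)
            score = matcher.ratio()
            if score >= threshold:
                matches.append((new_name, old_name, round(score, 2)))
    return matches
```

Each match becomes a reviewer-facing note of the form "this new helper looks like that existing one", so the human decides whether the duplication is intentional instead of hunting for it.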
Hallucinated correctness. Off-by-one errors, missing permission checks, validation that short-circuits under edge cases. The guide's advice is to "trace it, don't just scan it." That's good advice for the judgment call. But the initial detection is different: an AI reviewer can flag suspicious boundary conditions, unchecked external values, and missing validation branches at scale, across every PR, not just the one you have time to trace.
Untrusted input in workflows. Is user input interpolated into prompts without sanitization? Is GITHUB_TOKEN write-scoped when it only needs read? These are structural checks on YAML and workflow files. Fully automatable.
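As a rough sketch of what "structural check" means here, the following scans workflow text line by line. The function name and both regexes are assumptions made for illustration. The list of attacker-controlled context fields is deliberately incomplete, and a real check would parse the YAML rather than match lines.

```python
import re

# Event fields an outside contributor controls (an illustrative, incomplete list).
UNTRUSTED_CONTEXT = re.compile(
    r"\$\{\{\s*github\."
    r"(event\.(issue|pull_request|comment)\.(title|body)|head_ref)"
    r"\s*\}\}"
)
# Either the blanket setting or any single scope granted write access.
WRITE_PERMISSION = re.compile(r"^\s*(permissions:\s*write-all|[\w-]+:\s*write)\s*$")


def audit_workflow(workflow_text: str) -> list[str]:
    """Flag untrusted-context interpolation and write-scoped token permissions."""
    findings = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        if UNTRUSTED_CONTEXT.search(line):
            findings.append(f"line {number}: untrusted event data interpolated")
        if WRITE_PERMISSION.match(line):
            findings.append(f"line {number}: write-scoped permission, confirm it is needed")
    return findings
```

The output is a list of line-level findings. Whether a given write scope is justified is still a human call, but finding it is not.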
Agentic ghosting. Large PRs with no structured plan that go quiet after review feedback. The guide recommends checking PR history and requesting breakdowns before investing review time. An automated gate can flag oversized PRs with no implementation plan and require decomposition before a reviewer is even assigned.
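One way such a gate could look, as a sketch only. The 400-line threshold, the `PullRequest` shape, and the convention that a plan lives under a "Plan" or "Implementation plan" heading in the PR description are all assumptions, not something the guide specifies.

```python
import re
from dataclasses import dataclass

MAX_CHANGED_LINES = 400  # hypothetical threshold; pick one that fits your team
PLAN_HEADING = re.compile(
    r"^#+\s*(implementation\s+)?plan\b", re.IGNORECASE | re.MULTILINE
)


@dataclass
class PullRequest:
    changed_lines: int
    description: str


def needs_decomposition(pr: PullRequest) -> bool:
    """True when a PR is oversized and its description carries no plan section."""
    oversized = pr.changed_lines > MAX_CHANGED_LINES
    has_plan = bool(PLAN_HEADING.search(pr.description))
    return oversized and not has_plan
```

A PR that trips this check can be bounced back to the agent for a breakdown before it ever reaches the review queue.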
None of these require human judgment to detect. They require human judgment to resolve. That's the distinction the guide misses.
GitHub's guide ends with this: "What doesn't shrink is the context you carry. The things you know about your system that aren't written down anywhere. That's what makes your review valuable, and it's the part that doesn't get automated."
That's exactly right. Institutional knowledge, architectural intent, incident history, the reason a particular abstraction exists even though it looks redundant. No automated tool replaces that.
But the guide buries this insight under six minutes of work that a machine should handle. If a reviewer spends the first six minutes on CI checks, utility deduplication, and security scanning, they've burned most of their review budget before they get to the part only they can do.
Flip the workflow. Let an automated review pass handle detection. Give the human reviewer a pre-triaged PR with findings already categorized: here are the CI changes, here are the potential duplicates, here's a suspicious boundary condition in the payment handler. The reviewer starts at the judgment layer instead of the scanning layer.
Buried toward the end of the guide is a section titled "Let Copilot review it first." The advice: use automated review for the mechanical stuff before a human has to look at it. Treat it as a prerequisite, not a replacement.
That's the right architecture. But it contradicts the structure of the rest of the guide, which walks humans through the mechanical steps as if they should do them personally. The ten-minute checklist and the "let Copilot go first" recommendation can't both be the answer. If the automated pass catches CI gaming, style inconsistencies, and missing error handling, why does the human reviewer's checklist still budget time for CI changes?
The guide also suggests codifying your review checklist using the Copilot SDK to run checks automatically against diffs. That's a step further. But it assumes every team builds and maintains its own tooling, with the reliability costs that come with it.
The workflow GitHub should be recommending looks like this: an agent opens a PR. Before any human sees it, an automated review tool runs. It analyzes the diff with full codebase context. It produces structured findings: CI regressions, duplicated utilities, suspicious logic patterns, security boundary issues, missing test coverage for claimed fixes.
Some of those findings are hard blockers. A test suite with removed coverage thresholds doesn't need human judgment; it needs to be rejected. Others are signals for human review: "this new utility looks similar to formatCurrency() in src/utils/money.ts" is context a reviewer can act on in seconds rather than discovering through manual search.
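The split between hard blockers and reviewer signals can be sketched as a small triage step. The category names and the blocking policy below are hypothetical. Which findings reject a PR outright is a decision each team has to make for itself.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCKER = "blocker"  # reject automatically, no reviewer time spent
    SIGNAL = "signal"    # surface to the human reviewer as context


@dataclass(frozen=True)
class Finding:
    category: str
    message: str


# Hypothetical policy: which categories block outright is a per-team decision.
BLOCKING_CATEGORIES = {"coverage_threshold_removed", "tests_deleted"}


def triage(findings: list[Finding]) -> dict[Severity, list[Finding]]:
    """Split findings into automatic rejections and reviewer-facing signals."""
    result = {Severity.BLOCKER: [], Severity.SIGNAL: []}
    for finding in findings:
        blocking = finding.category in BLOCKING_CATEGORIES
        severity = Severity.BLOCKER if blocking else Severity.SIGNAL
        result[severity].append(finding)
    return result
```

If the blocker list is non-empty, the PR is rejected before assignment. Otherwise the signals travel with the PR as the reviewer's starting context.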
Tenki's code reviewer runs exactly this way. It integrates as a GitHub Actions step, analyzes diffs with codebase context, and posts structured findings directly on the PR. At $0.50 per review, it processes every agent-generated PR without burning reviewer hours on detection work. The human reviewer sees the findings and goes straight to judgment calls: is this duplication intentional? Does this edge case actually matter for our use case?
The guide's framing assumes reviewers scale to match agent throughput. Be more disciplined. Follow the checklist. Spend your ten minutes wisely. It treats reviewer capacity as a constant that can be optimized through better technique.
But PR volume doesn't grow linearly with agent adoption. A team that adopts parallel coding agents doesn't see 2x more PRs. It can see 10x or 20x. The "More Code, Less Reuse" study GitHub cites found that agent-generated code introduces more redundancy per change. That means more surface area to review per PR, not less. More PRs with more issues each.
The only way to handle that is to put the automated gate before the human queue. Not as a nice-to-have. As the first step in every agent PR workflow.
Read GitHub's guide. The red flag taxonomy is genuinely useful. CI gaming, code reuse blindness, hallucinated correctness, agentic ghosting, untrusted input in workflows. These are the categories your review process needs to cover.
Then treat them as signal categories for your automated review tool, not items on a personal checklist. Configure your tooling to detect them. Let the machine handle scanning so your reviewers can handle thinking.
GitHub got the problem statement exactly right. The gap is in the solution architecture. Checklists don't scale. Automated gates do.