AI Agents
May 2026

AI Code Creates 1.7x More Review Issues

Eddie Wang
Engineering


Your team shipped 40% more pull requests last quarter. Your incident rate also went up. These two facts are probably connected.

In December 2025, CodeRabbit published its State of AI vs Human Code Generation Report, analyzing 470 open-source GitHub pull requests. The headline finding: AI-generated PRs produce 1.7x more review issues than human-written code. Not different kinds of issues. The same kinds, just more of them, more often, at higher severity.

That tracks with what we see at Tenki. Teams using AI coding agents — Cursor, Claude Code, Codex, Copilot — are pushing more code through review than ever before. And their reviewers are drowning.

The numbers behind the gut feeling

CodeRabbit's study compared 320 AI-co-authored PRs against 150 human-only PRs using a structured issue taxonomy. AI-authored changes averaged 10.83 issues per PR, compared to 6.45 for human-written ones. The breakdown is worse than the headline suggests:

  • Readability issues spiked 3x. AI code looks consistent on the surface but violates local patterns for naming, clarity, and structure. It doesn't read like your codebase.
  • Security vulnerabilities were up to 2.74x higher. Improper password handling and insecure object references were the most common patterns. Not novel vulnerabilities, just old mistakes repeated at scale.
  • Logic and correctness errors were 75% more common. Business logic mistakes, incorrect dependencies, flawed control flow. The expensive kind of bugs.
  • Error handling gaps doubled. Omitted null checks, missing early returns, incomplete exception logic. The stuff that causes 2 AM pages.
  • Excessive I/O operations were 8x more common. AI defaults to simple, readable patterns over efficient ones. Great for tutorials, bad for production hot paths.
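
That last gap is easy to picture. Here's a minimal sketch of the pattern, using an in-memory sqlite3 table as a stand-in for a production database (the schema and names are invented for illustration): the per-item version agents tend to emit, next to the batched query a reviewer would expect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bea"), (3, "cal")])

# The pattern AI often emits: one query per id, so N round trips for N ids.
def get_names_per_item(ids):
    return [
        conn.execute("SELECT name FROM users WHERE id = ?", (i,)).fetchone()[0]
        for i in ids
    ]

# The batched alternative: a single query for the whole set.
def get_names_batched(ids):
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT name FROM users WHERE id IN ({placeholders})", list(ids)
    ).fetchall()
    return [name for (name,) in rows]
```

In-memory SQLite hides the cost. Against a networked database on a hot path, the first version turns one request into N round trips.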

A separate Cortex benchmark report corroborates the trend: while PRs per author increased 20% year-over-year thanks to AI assistance, incidents per pull request went up 23.5%. More output, more problems.

And it gets worse when you look beyond automated testing. METR, an AI safety research org, published a study in March 2026 showing that roughly half of AI-generated patches that pass automated test suites would be rejected by actual repository maintainers. The patches passed the grader but failed the human. Code quality, broken side effects, and core functionality failures were the top rejection reasons.

The vibe merge problem

Level Up Coding's Code Review Bench 2026 analysis coined a term for what happens when AI-generated PRs hit human reviewers: vibe merging. The code looks plausible. The diff is huge. The reviewer skims, sees nothing obviously wrong, approves. PR review times are reportedly up 91%, and approval quality is trending the opposite direction.

Vibe merging is the natural result of two forces colliding. AI agents produce code that's syntactically correct and superficially clean, which is exactly the kind of code humans are bad at scrutinizing. Meanwhile, the volume of AI-generated changes overwhelms the review queue, pushing reviewers toward faster approvals just to keep the backlog manageable.

The result is a quiet degradation loop. Agents generate more code. Humans approve more of it with less scrutiny. Bugs reach production. The team adds more agents to fix the bugs faster. Repeat.

Why AI code fails differently than human code

CodeRabbit's analysis identifies five root causes behind the quality gap, and they all come back to the same structural limitation: AI models infer code patterns statistically, not semantically.

No local business logic. An agent doesn't know that your billing service treats zero-dollar invoices as errors, or that a particular enum was deprecated last sprint. It generates plausible code based on statistical patterns, not your team's actual rules.
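
To make that concrete, here's a hypothetical sketch (the function and the billing rule are invented for illustration). The rule lives in reviewers' heads and nowhere in the diff, so the agent's plausible version quietly accepts the exact case the team treats as an error:

```python
# Plausible agent output: compiles, tests fine, happily bills $0.00.
def bill_invoice(amount_cents: int) -> str:
    return f"billed {amount_cents} cents"

# What the team's unwritten rule actually requires.
def bill_invoice_with_house_rule(amount_cents: int) -> str:
    if amount_cents <= 0:
        raise ValueError("zero-dollar invoices are treated as errors here")
    return f"billed {amount_cents} cents"
```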

Surface-level correctness. AI-generated functions compile and often pass basic tests, but skip control-flow protections and misuse dependency ordering. The code works in the happy path and breaks everywhere else.
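
A minimal illustration of the shape of this failure, with a hypothetical function: the happy path passes the obvious test, and the first empty input crashes it.

```python
# Passes a happy-path test like average_latency([10.0, 20.0]) == 15.0 ...
def average_latency(samples: list[float]) -> float:
    return sum(samples) / len(samples)  # ZeroDivisionError on []

# ... but the guarded version is what survives production traffic.
def average_latency_guarded(samples: list[float]) -> float:
    if not samples:
        return 0.0
    return sum(samples) / len(samples)
```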

Repo idiom drift. Naming patterns, architectural norms, and formatting conventions drift toward generic defaults. The code is "correct" but doesn't belong in your repository. A reviewer familiar with the codebase spots this instantly. An overloaded reviewer doesn't.

Degraded security patterns. Without explicit constraints, models recreate legacy patterns from training data. They'll use deprecated crypto APIs, skip input validation, or handle credentials inline instead of through your team's approved helpers.
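
For example, here's a sketch of one such legacy pattern next to a standard-library alternative (illustrative only, not a recommendation to roll your own password storage):

```python
import hashlib
import os

# Legacy pattern models reproduce from training data: unsalted MD5.
def hash_password_insecure(password: str) -> str:
    return hashlib.md5(password.encode()).hexdigest()

# Safer stdlib alternative: salted PBKDF2 with a high iteration count.
def hash_password(password: str) -> bytes:
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt + digest
```

In practice the right fix is usually "use the team's approved helper," which is exactly the convention an agent without repository context won't know about.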

Clarity over efficiency. Models default to simple loops, repeated I/O calls, and unoptimized data structures. Readable, sure. But when that code sits in a request handler processing 10,000 calls per second, readable isn't enough.

Better prompts won't fix this

The instinctive response to "AI code has more issues" is to improve the prompts. Write more detailed system instructions. Feed the agent your style guide. Stuff the context window with examples.

That helps at the margins, but it doesn't solve the fundamental problem. The generating agent and the reviewing process share context. The agent produced the code based on its understanding of the task. If you ask the same agent (or a similarly configured one) to review the output, it'll confirm its own assumptions. It's the LLM equivalent of grading your own homework.

METR's study demonstrates this gap clearly. Patches that passed automated test suites (the benchmark's "grader") were rejected by maintainers at dramatically higher rates. The automated grader and the generating agent share the same blind spot: they evaluate whether the code does what the tests expect, not whether it does what the project needs.

What you need is independent review that doesn't share context with the generating agent. A reviewer that evaluates code against your repository's actual patterns, conventions, and constraints, without being primed by the same prompt that produced the code in the first place.

Treating agent PRs as higher-risk by default

This is the approach we've taken with Tenki's code reviewer. When an agent opens a PR, the review shouldn't extend the same level of trust it would to a PR from a senior engineer who's been on the team for three years. The signals that a PR is agent-authored are usually clear: bot accounts, co-authored-by trailers, specific branch naming patterns, PR templates generated by tools like Cursor or Codex.

Tenki's reviewer is context-aware, built to understand your codebase and apply your team's custom rules. You can configure it to apply stricter checks to bot-authored or agent-authored PRs. That means elevated scrutiny for exactly the categories CodeRabbit's data flags as problematic: error handling coverage, naming consistency, security patterns, and adherence to local conventions.

The key difference from asking the generating agent to self-review: Tenki's reviewer operates independently. It evaluates the diff against your repository's patterns without being biased by the prompt or context that produced the code. It's a second opinion that doesn't already agree with the first one.

A practical review architecture for agent-heavy teams

Based on CodeRabbit's data and what we've seen in Tenki's own review patterns, here's what actually moves the needle for teams where a significant chunk of PRs come from AI agents.

1. Tag and triage by author type

Establish a clear way to identify agent-authored PRs. Most agents leave obvious signals (GitHub Apps, specific user accounts, co-author metadata). Use these signals to route agent PRs through a review path with higher default scrutiny. In Tenki, this means setting custom context rules that trigger stricter severity thresholds when the PR author matches a known bot pattern.
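
As a rough sketch, here's what that classification can look like using fields from the GitHub REST API's pull request and commit objects. The bot-name hints are illustrative, not exhaustive, and this is independent of any particular review tool's rule syntax:

```python
AGENT_LOGIN_SUFFIX = "[bot]"
AGENT_COAUTHOR_HINTS = ("copilot", "cursor", "codex", "claude")

def is_agent_authored(pr: dict, commit_messages: list[str]) -> bool:
    """Classify a PR as agent-authored from common GitHub signals."""
    user = pr.get("user", {})
    if user.get("type") == "Bot" or user.get("login", "").endswith(AGENT_LOGIN_SUFFIX):
        return True
    # Co-authored-by trailers that coding agents leave in commit messages.
    for message in commit_messages:
        for line in message.lower().splitlines():
            if line.startswith("co-authored-by:") and any(
                hint in line for hint in AGENT_COAUTHOR_HINTS
            ):
                return True
    return False
```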

2. Focus automated review on AI's weak spots

CodeRabbit's data tells us exactly where to look. Configure your review tool to pay extra attention to error handling paths, security-sensitive patterns (credential handling, input validation), naming consistency against existing conventions, and performance-critical sections where AI's "clarity over efficiency" defaults can hurt. These aren't arbitrary rules. They're the categories where AI generates 2-8x more issues.

3. Enforce CI gates that catch what agents miss

Linters and formatters catch the formatting gap (2.66x in CodeRabbit's data) and part of the naming inconsistency problem. But you also need correctness rails: require tests for non-trivial control flow changes, mandate type assertions and nullability checks, and standardize exception-handling patterns. Don't rely on agents to add these. Treat missing safety checks as CI failures.
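
As a sketch of one such gate, assuming a conventional src/ and tests/ layout (adapt the heuristic to your repository), a pre-merge script can fail the build when source changes arrive without test changes:

```python
import subprocess
import sys

def changed_files(base: str = "origin/main") -> list[str]:
    """List files changed on this branch relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]

def main() -> int:
    files = changed_files()
    touched_source = any(f.startswith("src/") and f.endswith(".py") for f in files)
    touched_tests = any(f.startswith("tests/") for f in files)
    if touched_source and not touched_tests:
        print("CI gate: source changed with no test changes; add or update tests.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```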

4. Decouple generation from review

This is the most important architectural decision. Your review tool should not share context, prompts, or configuration with the generating agent. If Cursor writes the code, Cursor shouldn't review it. If Codex generates a PR, the review should come from an independent system that evaluates the diff against your repository's ground truth. Tenki's reviewer operates as that independent layer: it reads the diff and your codebase, not the prompt that produced the change.
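
Stated as an interface, with hypothetical types: the reviewer's inputs are the diff and a repository checkout, and the generation prompt is deliberately not representable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewInput:
    diff: str        # the PR diff under review
    repo_path: str   # checkout of the repository, for local conventions
    # Deliberately no `prompt` field: generation context cannot leak in.

def review(inputs: ReviewInput) -> list[str]:
    """Return review findings from the diff and repository alone (stub)."""
    findings: list[str] = []
    # ... evaluate inputs.diff against conventions found at inputs.repo_path
    return findings
```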

5. Use AI-aware checklists for human reviewers

When a human does review an agent-authored PR, they need different questions than for human-authored code. Not "does this look right?" but specific checks: Are error paths covered? Are concurrency primitives correct? Are configuration values validated? Are credentials handled through the approved helper? These target the exact failure modes that CodeRabbit's data shows are most amplified in AI output.

Review is the bottleneck now

Here's the counterintuitive conclusion from all this data: AI coding agents make code review more important, not less.

For the past decade, the bottleneck in most engineering teams was writing code. Getting features built, bugs fixed, migrations done. AI agents have largely solved that bottleneck. A team of five with good agent tooling can produce the PR volume of a team of fifteen.

But the bottleneck has shifted. Writing speed is no longer the constraint. The constraint is review capacity: can your team verify that all this generated code actually meets your standards before it reaches production? With 1.7x more issues per AI PR, the review load doesn't just scale linearly with output. It compounds.

Teams that recognize this shift are investing in review infrastructure, not just generation tools. They're treating code review as a first-class engineering system that needs its own tooling, automation, and quality standards. The ones that don't are the teams whose incident dashboards will tell the story for them.

If your team is generating more code with agents, your review process needs to keep pace. Tenki's code reviewer starts at $0.50 per review, with 10 free reviews to try it out. It takes about two minutes to set up via the GitHub App.

Tags

#ai-coding-agents #ai-developer-productivity #ai-code-quality
