Stop Counting Comments: Action Rate Is the AI Code Review Metric That Matters

Hayssem Vazquez-Elsayed

Every AI code review tool on the market wants you to know how many comments it generates. CodeRabbit's homepage proudly displays 75 million defects found. That number is supposed to impress you. It should worry you instead.

Comment volume tells you one thing: the tool talks a lot. It tells you nothing about whether anyone listened.

The metric that actually predicts whether engineers trust an AI reviewer, keep it enabled, and let it shape their code is action rate: the percentage of comments that result in a developer changing code before merge. A tool generating 10 focused comments with an 80% action rate outperforms one generating 100 comments with a 10% action rate. The first tool drove 8 code changes. The second drove 10 but burned through 90 ignored interruptions to get there.

Comment volume is a production metric, not a quality metric

When a vendor shows you a dashboard with total comments generated, they're showing you throughput. A comment that says "consider renaming this variable" and a comment that catches a race condition in your database transaction both count as one. The dashboard doesn't distinguish between them, and neither does the marketing copy.

This framing tricks you into evaluating the wrong thing. Generating 80 comments on a 200-line PR isn't thoroughness. It's a failure mode dressed up as a feature. The reviewer is commenting on nearly every other line, which means it's either flagging things that don't matter or repeating the same observation in slightly different words across multiple files.

Think about it from the receiving end. You open a PR, see 47 AI comments, and you haven't even started reading them yet. What's your first instinct? For most engineers, it's to scroll past them. The volume itself signals that the tool isn't discriminating, so you stop discriminating too.

Why vendors optimize for volume

The incentive structure here is straightforward. Comment volume is easy to measure, hard to dismiss in a sales meeting, and looks impressive on a landing page. "75 million defects found" reads better than "our comments have a 40% action rate," even though the second claim is dramatically more useful.

Volume also creates a survivorship illusion in demos. A prospect watches the tool review a sample PR and sees 30 comments appear. Some of them are genuinely useful. The prospect remembers the three good ones and forgets the 27 they'd ignore in practice. The demo worked exactly as designed.

There's also a technical reason: tuning for precision is hard. Reducing false positives means the model needs better context about the codebase, the team's conventions, and the intent behind the change. That's expensive to build. It's much cheaper to lower the confidence threshold and let everything through. More comments, more "value."

The real signal: action rate

Action rate answers a simple question: when the AI reviewer leaves a comment, does the developer change their code?

This isn't a fuzzy sentiment metric. You can measure it concretely. Look at the diff between when a comment was posted and when the PR was merged. Did the lines the comment referenced change? If so, the comment drove action. If the lines stayed the same, the developer either disagreed, didn't read it, or didn't think it was worth the effort.
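A minimal sketch of that measurement, assuming you have already extracted two things from your Git host: the line range each comment was anchored to, and the set of lines that changed between the commit the comment was posted on and the merge commit. The names here (ReviewComment, changed_lines, action_rate) are made up for illustration and don't correspond to any vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewComment:
    """One AI review comment anchored to a line range in a file (hypothetical shape)."""
    path: str
    start_line: int
    end_line: int  # inclusive


def comment_was_actioned(comment, changed_lines):
    """Return True if any line the comment referenced changed before merge.

    changed_lines maps file path -> set of line numbers (numbered as of the
    commit the comment was posted on) that were modified by merge time.
    """
    touched = changed_lines.get(comment.path, set())
    return any(
        line in touched
        for line in range(comment.start_line, comment.end_line + 1)
    )


def action_rate(comments, changed_lines):
    """Fraction of comments whose referenced lines changed before merge."""
    if not comments:
        return 0.0
    actioned = sum(comment_was_actioned(c, changed_lines) for c in comments)
    return actioned / len(comments)


# Illustrative data: four comments, two of which overlap a later change.
comments = [
    ReviewComment("db/tx.py", 40, 44),
    ReviewComment("db/tx.py", 90, 90),
    ReviewComment("api/views.py", 12, 15),
    ReviewComment("api/views.py", 200, 201),
]
changed = {"db/tx.py": {42, 43}, "api/views.py": {13}}

print(action_rate(comments, changed))  # 0.5
```

Line overlap is a proxy, not proof. A referenced line can change for unrelated reasons, and a developer can act on a comment by editing code somewhere else. Treat the number as a trend to watch rather than an exact count.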

Action rate compounds in a way that volume doesn't. A tool with a high action rate builds developer trust over time. Engineers start reading comments carefully because they've learned the tool only speaks up when it matters. A tool with a low action rate does the opposite: it trains engineers to ignore review comments entirely, including the occasional good one buried in the noise.

Here's the math that should be in every vendor pitch but never is:

Tool A: 100 comments per PR, 10% action rate = 10 code changes, 90 interruptions wasted
Tool B: 10 comments per PR, 80% action rate = 8 code changes, 2 interruptions wasted

Tool A produced slightly more code changes in absolute terms. But it wasted 45x as many interruptions (90 versus 2), and developers had to read 8x as many comments per code change (10 versus 1.25). Over hundreds of PRs per month, that attention cost is devastating.
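If you want to run the same arithmetic on your own numbers, here is a small sketch. The function name and the returned fields are illustrative, and it assumes every comment costs one interruption regardless of length.

```python
def attention_cost(comments, acted_on):
    """Summarize what a batch of review comments cost in developer attention.

    comments: total comments posted
    acted_on: comments that led to a code change (must be > 0)
    """
    return {
        "wasted": comments - acted_on,            # interruptions with no resulting change
        "read_per_change": comments / acted_on,   # comments read per code change
    }


tool_a = attention_cost(comments=100, acted_on=10)  # 10% action rate
tool_b = attention_cost(comments=10, acted_on=8)    # 80% action rate

print(tool_a)  # {'wasted': 90, 'read_per_change': 10.0}
print(tool_b)  # {'wasted': 2, 'read_per_change': 1.25}

# How much worse Tool A is on each measure.
print(tool_a["wasted"] / tool_b["wasted"])                    # 45.0
print(tool_a["read_per_change"] / tool_b["read_per_change"])  # 8.0
```

The two ratios answer different questions. The first compares raw wasted interruptions. The second compares how many comments a developer has to read to get one code change.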

The cry-wolf problem at CI

Developer attention is finite and non-renewable within a work session. A comment that doesn't warrant action doesn't just waste the time spent reading it. It actively degrades the developer's willingness to read the next one.

This is the cry-wolf problem applied to CI. If your AI reviewer flags 40 things on every PR and most of them are style nitpicks or obvious non-issues, developers will start collapsing the review section without reading it. They might even set up automation to auto-resolve AI comments. At that point, you're paying for a tool that produces text nobody reads.

The damage goes further. Once engineers are conditioned to ignore AI review comments, they'll ignore them even when the tool catches something critical. A genuine security vulnerability sitting in a wall of 35 "consider using const here" comments gets the same treatment as the rest: dismissed.

High comment volume doesn't just fail to help. It actively destroys the value of the signal when the signal is real.

What a precision-first reviewer looks like

Tenki's code reviewer is built around this exact principle. Rather than flooding PRs with everything the model notices, it applies severity thresholds and noise filtering as first-class configuration. The product page puts it plainly: "Rather than interrupting your workflow, Tenki code reviewer only flags critical issues."

That's a design choice with real consequences. It means the review layer is tuned for action rate, not volume. When a Tenki comment appears on your PR, it's been filtered through severity classification and codebase-aware context. The tool understands the difference between a genuine bug and a style preference, and it's configured to stay quiet about the latter unless you explicitly ask for it.

Verbosity and severity are both configurable. Teams that want more coverage can lower the threshold. Teams drowning in noise from other tools can set it high and only see critical and high-severity findings. The point is that the team decides the tradeoff, not the vendor's marketing department.
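To make the tradeoff concrete, here is a generic sketch of a severity threshold. It is not Tenki's configuration schema or implementation. The Severity levels, the Finding type, and filter_findings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; higher means more urgent."""
    INFO = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Finding:
    message: str
    severity: Severity


def filter_findings(findings, min_severity):
    """Keep only findings at or above the team's chosen threshold."""
    return [f for f in findings if f.severity >= min_severity]


findings = [
    Finding("consider using const here", Severity.INFO),
    Finding("consider renaming this variable", Severity.LOW),
    Finding("possible null dereference", Severity.HIGH),
    Finding("race condition in database transaction", Severity.CRITICAL),
]

# A team drowning in noise sets the bar high.
for f in filter_findings(findings, Severity.HIGH):
    print(f.severity.name, f.message)
# HIGH possible null dereference
# CRITICAL race condition in database transaction

# A team that wants more coverage lowers it.
print(len(filter_findings(findings, Severity.LOW)))  # 3
```

The filter itself is trivial. The hard part, as noted earlier, is assigning the severity accurately in the first place.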

On Tenki's benchmark of 122 real production bugs scored by a 3-LLM judge panel, Tenki detected 69% of bugs compared to CodeRabbit's 29% and GitHub Copilot's 25%. Fewer comments, higher hit rate.

What to ask vendors before buying

If you're evaluating AI code review tools right now, two questions will separate the tools optimizing for the right thing from the ones optimizing for their homepage stats:

What is your average action rate across customers? If the vendor can't answer this, they don't track it. And if they don't track it, they're optimizing for something else. The number itself matters less than whether they have it at all. A vendor that measures action rate has aligned their engineering incentives with your outcomes. One that measures comment volume has aligned their incentives with their marketing.
How do you let me configure comment volume? A tool that doesn't let you dial down noise is a tool that doesn't think noise is a problem. You need severity thresholds, category filters, and verbosity controls at minimum. If the only configuration options are "on" and "off," the tool is treating your team's attention as an externality.

A few more things worth probing during evaluation:

Does the tool understand your codebase, or does it review each file in isolation? Context-free reviews produce generic comments. Generic comments get ignored.
Can you see which comments led to code changes and which were dismissed? Without this feedback loop, neither you nor the vendor can improve the tool's precision over time.
Does the tool separate style suggestions from bug reports? These have fundamentally different urgency levels and should be handled differently in the review interface.
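The last two questions combine naturally: once you can see which comments led to changes, you can break action rate down by comment type. A rough sketch, assuming you can export (category, was_actioned) pairs from your tool or your own tracking. The category labels are illustrative.

```python
from collections import defaultdict


def action_rate_by_category(outcomes):
    """Compute action rate per comment category.

    outcomes: iterable of (category, was_actioned) pairs.
    Returns a dict mapping category -> fraction of comments acted on.
    """
    totals = defaultdict(int)
    actioned = defaultdict(int)
    for category, was_actioned in outcomes:
        totals[category] += 1
        if was_actioned:
            actioned[category] += 1
    return {category: actioned[category] / totals[category] for category in totals}


# Illustrative data, not real measurements.
outcomes = [
    ("bug", True), ("bug", True), ("bug", True), ("bug", False),
    ("style", False), ("style", True), ("style", False), ("style", False),
]

print(action_rate_by_category(outcomes))  # {'bug': 0.75, 'style': 0.25}
```

If one category sits far below the others, that is the first candidate for a stricter filter.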

The metric your team already understands

Action rate isn't a novel concept. It's the same principle behind every effective feedback system: signal density matters more than signal volume. A smoke detector that goes off once a year when there's actual smoke is infinitely more valuable than one that goes off every time you make toast. The second detector ends up disconnected and sitting in a drawer.

AI code review is heading for exactly that outcome at a lot of organizations. Teams adopt a high-volume tool, get excited during the trial, then gradually stop reading the comments as the novelty fades and the noise remains. Six months later, the tool is still running, still generating hundreds of comments per week, and nobody's looking at them. The subscription renews because nobody bothers to cancel it, not because it's delivering value.

If you want AI code review that actually works, stop asking how many comments the tool generates. Start asking how many of those comments change code.