Tenki catches 2x more real bugs than any other AI reviewer. Here's the benchmark.

Hayssem Vazquez-Elsayed (product)

The question that matters about an AI code reviewer is unflattering, simple, and easy to dodge: how often does it actually catch the bug? Not how thoughtful its tone is. Not how nice the dashboards look. The bug. The one a human would have caught on a careful read of the diff. Did the reviewer find it, point at the right line, and explain why it matters?

So we built a benchmark to answer that, head to head, against the other AI code review tools people are actually using. This post walks through how it works, what it found, and why we sort the leaderboard by F1 instead of the per-PR catch rate most published benchmarks lead with.

TLDR: Tenki caught 84 out of 122 real bugs across 50 pull requests. The next-best reviewer caught 44. That is roughly twice the recall of the second-place tool, and it holds up across five different languages and five different codebases. The full leaderboard and per-PR breakdown are here. The rest of this post is why I trust that number.

The corpus

50 pull requests, each one a real, merged bug fix from a public codebase, replayed in reverse so the bug is back in the diff. Five repositories, five languages:

cal.com (TypeScript)
sentry (Python)
grafana (Go)
keycloak (Java)
discourse (Ruby)

We pulled the PRs into a clean fork under the public codereview-benchmark GitHub org, where the corpus, judge prompts, scoring rubric, and raw verdicts are all open for inspection. The bug-introducing diff lives on a branch. Every reviewer sees the same code at the same SHA, with full repository context and default configuration. No custom rules, no per-repo tuning, no prompt engineering on our side to favor Tenki. We measure each tool as it ships, including ours.

Across those 50 PRs there are 122 distinct ground-truth findings. Each finding is a real bug in the diff with a written description: where it is, what triggers it, why it breaks. Some PRs have one bug. Some have eight or nine. The biggest is cal.com-1, which introduces nine separate regressions in a cache-key change. Each of those nine is scored independently.

How the data actually got produced

This was not a single tidy command-line run. It was an automated pipeline with agents running in parallel against different PRs, each handling its own retries, polling, and rate limits.

For every PR in the corpus, an agent did roughly this:

Picked up the next benchmark PR from the queue.
Replayed the bug-introducing diff into a fresh branch on the fork.
Opened a pull request against the fork's default branch.
Waited for each reviewer's bot to comment, polling for a while, then giving up gracefully if a tool went silent.
Pulled every review comment off the PR, attributed by bot login.
Asked three independent LLM judges, one finding at a time, whether each tool's comments caught that finding.
Wrote the verdicts back to a shared store.
Started over with the next PR.

Because the agents ran in parallel against different PRs, each one had to absorb its own GitHub rate-limit pressure and retry tools that crashed or rate-limited. This is an agentic loop rather than a procedural script because every tool in the benchmark behaves a little differently: response latency is bimodal for some, idempotency is shaky for others, and a few of them just silently no-op on certain languages. A rigid script would have failed on the long tail. An agent with a goal ("get a clean review from every tool on this PR") works around it.
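The "poll for a while, then give up gracefully" step can be sketched roughly as below. This is a minimal illustration, not the benchmark's actual agent code: `wait_for_review`, `fetch_comments`, and the timeout and interval defaults are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Optional


def wait_for_review(
    fetch_comments: Callable[[], list[str]],
    timeout_s: float = 900.0,   # hypothetical default, not from the benchmark
    interval_s: float = 30.0,   # hypothetical default, not from the benchmark
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[list[str]]:
    """Poll until a reviewer bot has commented, or give up after timeout_s.

    fetch_comments is a caller-supplied function returning the bot's
    comments so far. Returns the comments, or None if the tool stayed silent.
    """
    deadline = clock() + timeout_s
    while True:
        comments = fetch_comments()
        if comments:
            return comments
        if clock() >= deadline:
            return None  # tool went silent: record "no review" and move on
        sleep(interval_s)
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting. A tool that silently no-ops on a language simply comes back as `None` instead of stalling the whole run.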

The judge step is per-finding, not per-PR. For each of the 122 findings, three different LLMs read the tool's comments and vote, independently, on whether that finding was caught. Majority wins. A finding counts as caught only when a line-level comment explicitly identifies the faulty code and explains the impact. Drive-by comments, generic "consider testing this" remarks, and comments on the wrong file do not count.
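As a rough sketch of the majority rule, assuming each judge returns a plain yes/no verdict per finding: the function names and the finding IDs below are hypothetical, and the real judge prompts and verdict format live in the public benchmark repo.

```python
def finding_caught(votes: list[bool]) -> bool:
    """Majority vote over independent judge verdicts for ONE finding."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of judges so a majority always exists")
    return sum(votes) > len(votes) // 2


def count_caught(verdicts: dict[str, list[bool]]) -> int:
    """Count the findings one tool caught, given judge votes keyed by finding ID."""
    return sum(finding_caught(v) for v in verdicts.values())


# Hypothetical verdicts for one tool on two findings (three judges each).
example = {
    "finding-1": [True, True, False],   # 2 of 3 judges say caught
    "finding-2": [True, False, False],  # 1 of 3 judges say caught
}
print(count_caught(example))  # 1
```

Scoring each finding separately is what keeps a single lucky comment from crediting a whole PR.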

Why F1, not "did the tool catch one bug"

This is the part of the methodology worth spending time on. The easiest thing to do is publish a leaderboard that says "Tool A caught 32 of 50 PRs, Tool B caught 30, Tool C caught 4," and that is a common shape for the headline number on a vendor benchmark page. It is also, on its own, almost meaningless.

Here is the problem. A PR with nine real bugs in it can be "caught" by a reviewer that flagged any one of them. A reviewer that posts thirty drive-by line comments per PR will, by chance, land on the right line often enough to claim a high catch rate even if its comments are mostly noise. A reviewer that catches the easy bug and misses the eight harder ones gets the same credit on that metric as a reviewer that found all nine.

You can see this in the data. If we score only "did the tool comment on at least one ground-truth bug per PR":

Tool	PRs caught (binary)
Tenki	32 / 50
Greptile	31 / 50
Devin	30 / 50
Cursor	30 / 50
CodeRabbit	24 / 50
Copilot	24 / 50
Graphite	4 / 50

Read that way, Tenki, Greptile, Devin, and Cursor look almost interchangeable. A skeptical reader would walk away thinking the top four reviewers are essentially the same tool.

Now score by ground-truth finding, not by PR. Out of 122 individual bugs:

Tool	Findings caught
Tenki	84 / 122
Greptile	44 / 122
Devin	44 / 122
Cursor	39 / 122
CodeRabbit	35 / 122
Copilot	30 / 122
Graphite	4 / 122

Tenki catches roughly 1.9 times as many real bugs as the next reviewer in the field. That gap is invisible in the per-PR view. It is unambiguous in the per-finding view. The reason is exactly what you would expect: Tenki is consistent across the multiple bugs in a single PR, whereas the other tools tend to find one and stop. On a 9-bug PR like cal.com-1, Tenki caught all 9. The next-best reviewer in the field caught zero on that same PR.
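The difference between the two views is easy to reproduce on toy data. The sketch below uses made-up PR and bug IDs (not benchmark results) to show two reviewers that tie on the per-PR metric while differing fivefold per finding.

```python
def per_pr_catches(caught_by_pr: dict[str, set[str]]) -> int:
    """Binary view: a PR counts once if the tool caught at least one of its bugs."""
    return sum(1 for caught in caught_by_pr.values() if caught)


def per_finding_catches(caught_by_pr: dict[str, set[str]]) -> int:
    """Per-finding view: every ground-truth bug is scored independently."""
    return sum(len(caught) for caught in caught_by_pr.values())


# Toy data: "pr-a" has nine ground-truth bugs, "pr-b" has one.
thorough = {"pr-a": {f"bug-{i}" for i in range(1, 10)}, "pr-b": {"bug-1"}}
one_and_done = {"pr-a": {"bug-1"}, "pr-b": {"bug-1"}}

print(per_pr_catches(thorough), per_pr_catches(one_and_done))            # 2 2
print(per_finding_catches(thorough), per_finding_catches(one_and_done))  # 10 2
```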

That is the recall story. Recall alone is also not enough, because a reviewer that posts a comment on every single line will achieve perfect recall and be unusable. So the benchmark also tracks how many of each tool's line comments do not correspond to a ground-truth finding. We call those false positives, with one caveat that I will come back to. Combining catches with false positives gives precision and recall, and F1 is their harmonic mean.

If those terms are not muscle memory: recall is the share of real bugs the reviewer caught, precision is the share of the reviewer's comments that pointed at a real bug, and F1 is the harmonic mean of the two. Harmonic mean matters because it refuses to let a reviewer trade one for the other linearly: 0.9 precision with 0.3 recall scores about 0.45, not 0.6. To do well on F1, a reviewer has to be respectable at both.
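For readers who want to check the arithmetic, here is a small sketch of the three metrics. One assumption is baked in: precision is computed as caught findings divided by caught findings plus false positives. The post does not spell out the exact formula, but with Tenki's published counts (84 caught, 122 findings, 197 false positives) this definition gives 0.30, 0.69, and 0.42 after rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(caught: int, total_findings: int, false_positives: int):
    """Return (precision, recall, F1) from raw counts.

    Assumption: each caught finding counts as one true positive, so
    precision = caught / (caught + false_positives).
    """
    flagged = caught + false_positives
    precision = caught / flagged if flagged else 0.0
    recall = caught / total_findings
    return precision, recall, f1_score(precision, recall)


# The trade-off example from the text: harmonic vs arithmetic mean.
print(round(f1_score(0.9, 0.3), 2))  # 0.45, whereas (0.9 + 0.3) / 2 = 0.6

# Tenki's counts as published in this post.
p, r, f1 = precision_recall_f1(caught=84, total_findings=122, false_positives=197)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.3 0.69 0.42
```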

Tool	Recall	Precision	F1
Tenki	0.69	0.30	0.42
Devin	0.36	0.47	0.41
Cursor	0.32	0.51	0.39
CodeRabbit	0.29	0.25	0.27
Greptile	0.36	0.16	0.22
Copilot	0.25	0.19	0.21
Graphite	0.03	0.50	0.06

F1 is the right metric because it punishes both failure modes. A reviewer that catches every bug by commenting on every line gets a brutal precision penalty. A reviewer that comments rarely but is always right gets a brutal recall penalty. The metric forces both numbers to be respectable.

By F1, Tenki leads the field. The recall gap is so wide (0.69 vs 0.36 for the next reviewer) that even our middling precision lands us at the top. Devin and Cursor sit close behind Tenki in F1 because they trade recall for precision and end up at a similar score from the other end. Below them, the cliff: CodeRabbit, Greptile, Copilot, and Graphite are not competitive on this corpus by this metric.

By severity

Splitting the 50 PRs into severity buckets based on the worst bug in each, and counting a PR as caught when the tool flagged at least one of its ground-truth bugs:

Tool	Critical (4)	High (26)	Medium (20)
Tenki	2 / 4	19 / 26	11 / 20
Devin	2 / 4	17 / 26	11 / 20
Greptile	2 / 4	15 / 26	14 / 20
Cursor	2 / 4	15 / 26	13 / 20
Copilot	2 / 4	13 / 26	9 / 20
CodeRabbit	2 / 4	12 / 26	10 / 20
Graphite	0 / 4	2 / 26	2 / 20

The critical bucket is small enough that I would not put weight on it: four PRs, six of seven tools tied at half. The real story is the high-severity bucket, which has enough samples (26) to mean something. Tenki tops it with 19 of 26. The next tool gets 17. The dropoff after that is steep.
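One sanity check worth running on a table like this: each tool's three bucket counts should add up to its total in the binary per-PR table earlier in the post. The sketch below copies both tables by hand; `mismatched_rows` is a hypothetical helper name.

```python
# Bucket sizes and per-tool hits: (critical, high, medium), from the severity table.
BUCKET_SIZES = (4, 26, 20)
BY_SEVERITY = {
    "Tenki": (2, 19, 11),
    "Devin": (2, 17, 11),
    "Greptile": (2, 15, 14),
    "Cursor": (2, 15, 13),
    "Copilot": (2, 13, 9),
    "CodeRabbit": (2, 12, 10),
    "Graphite": (0, 2, 2),
}
# Totals from the "PRs caught (binary)" table.
PER_PR = {
    "Tenki": 32, "Greptile": 31, "Devin": 30, "Cursor": 30,
    "CodeRabbit": 24, "Copilot": 24, "Graphite": 4,
}


def mismatched_rows(by_severity, per_pr, bucket_sizes):
    """Return tools whose bucket counts exceed a bucket or miss the per-PR total."""
    bad = []
    for tool, hits in by_severity.items():
        over = any(h > n for h, n in zip(hits, bucket_sizes))
        if over or sum(hits) != per_pr[tool]:
            bad.append(tool)
    return bad


print(mismatched_rows(BY_SEVERITY, PER_PR, BUCKET_SIZES))  # []
```

An empty list means the two tables agree, which also confirms that the severity table counts PRs, not individual findings.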

Across the seven tools, the critical bucket averages about 43% caught (twelve hits out of twenty-eight tries in the table above). Tenki sits at 50%, level with five other tools. That is the headline we lead with on the public leaderboard, and it is fair, but the much more durable finding is the lead in the high-severity bucket.

About those false positives

Tenki has 197 false positives in this benchmark, second only to Greptile (233). That number deserves some honesty.

A false positive in our scoring is a line-level comment that does not map to any ground-truth finding. Some of those are genuinely bad comments: nitpicks, hallucinated issues, style notes that no human would write. Many of them are not. Tenki has a habit of finding adjacent issues that the merged bug fix did not cover: a one-line correctness concern in a nearby helper, a missing edge case in a sibling function, a subtle race that is not the headline bug. Those count against us in this benchmark.

If we built a "plausible-but-unverified" bucket and put those comments there instead of in "false positives," the precision numbers would move noticeably in our favor and a small amount in everyone else's. We have not done that work yet because it requires hand-labeling, and the strict definition is the more defensible one to publish. We would rather under-claim precision than over-claim it.
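To make the direction of that effect concrete, here is a purely hypothetical sensitivity sketch. It assumes reclassified comments would be dropped from the precision denominator, which is one possible design for such a bucket, and the reclassified shares in the loop are invented for illustration because no hand-labeling has been done. Only the strict baseline (84 caught, 197 false positives) comes from this post.

```python
def precision_after_reclassifying(
    caught: int, false_positives: int, reclassified: int
) -> float:
    """Precision if `reclassified` strict false positives were moved into a
    separate 'plausible-but-unverified' bucket and excluded from the denominator.
    """
    if not 0 <= reclassified <= false_positives:
        raise ValueError("reclassified must be between 0 and false_positives")
    return caught / (caught + false_positives - reclassified)


# Strict baseline from the post, then two HYPOTHETICAL reclassification shares.
for share in (0.0, 0.25, 0.5):
    moved = int(197 * share)
    print(share, round(precision_after_reclassifying(84, 197, moved), 2))
```

The first printed row is the strict precision the leaderboard reports. The others show only how sensitive the number is to labeling, not what hand-labeling would actually find.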

Greptile posts 283 line comments to our 289 and lands 44 catches to our 84. Same comment volume, half the catches. That is what the precision gap is measuring.

What you should take from this: a reviewer that talks a lot can either be informative or noisy. Tenki's volume is the cost of catching twice as many bugs. The Pareto frontier is real, and we sit at the high-recall end of it. If your team values precision over recall, Cursor and Devin sit at the other end of the same frontier.

The takeaway

Several published code-reviewer benchmarks use a per-PR catch rate as the headline. That metric is generous to every tool, and it makes the field look closer than it is. Switch to per-finding scoring and the differences are striking: the top reviewer catches roughly twice as many real bugs as the runner-up. Add precision into the picture and the field reorders again: the tools that comment a lot pay for it, and the tools that comment carefully come up.

Tenki catches 84 of 122 real bugs across 50 PRs from cal.com, Sentry, Grafana, Keycloak, and Discourse. The next reviewer catches 44. That gap is the answer to the question I started with.