Code Review Benchmarks
How well does Tenki catch real bugs compared with other AI code reviewers?
Last updated: May 20, 2026
TL;DR
Tenki is the #1 reviewer based on finding-level scoring.
In this independent 2026 benchmark of 7 leading AI code review tools, Tenki caught 84 of 122 real production bugs across 50 pull requests from cal.com, Sentry, Grafana, Keycloak, and Discourse, graded per finding by a 3-judge LLM panel. That is 1.9× the next-best reviewer (44 of 122).
84/122 Real Bugs Caught
68.9% Catch Rate
3 LLM Judges
7 Tools
5 Repositories
350 Code Reviews
Methodology
How we benchmarked AI code reviewers.
Every pull request contains a real, merged bug-fix from an open-source codebase. We replay the bug-introducing diff into a clean fork and let each tool review it with its default configuration. Reviews are then graded by a 3-LLM judge panel using majority vote. No synthetic bugs, no repo-specific tuning.
Sources
50 bug-fix PRs from 5 open-source repositories
Real merged fixes from cal.com (TypeScript), sentry (Python), grafana (Go), keycloak (Java), and discourse (Ruby). The bug set spans five server-side languages.
Replay
Real bugs reintroduced, not invented
The pre-fix diff is replayed against every tool with default settings: no custom rules, no repo-specific tuning, and full repository context for every reviewer. Every tool sees the same code at the same point in history.
Scoring
Per-finding 3-judge LLM majority vote
Bugs are scored individually, not per-PR (which would over-credit drive-by comments). A bug counts as caught only if a line-level comment pinpoints the faulty code and explains its impact, and at least two of three independent LLM judges agree.
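The caught/missed rule above can be sketched as follows. This is a minimal illustration, not the benchmark's actual grading code: the `JudgeVote` type, its field names, and the `is_caught` function are hypothetical, and the sketch assumes each judge returns a yes/no verdict on whether a comment pinpoints the faulty code and explains its impact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVote:
    """One LLM judge's verdict on one reviewer comment (hypothetical shape)."""
    pinpoints_faulty_code: bool
    explains_impact: bool

    @property
    def accepts(self) -> bool:
        # A judge accepts a comment only if both criteria hold.
        return self.pinpoints_faulty_code and self.explains_impact


def is_caught(votes: list[JudgeVote], required: int = 2) -> bool:
    """Return True when at least `required` of the three judges accept."""
    if len(votes) != 3:
        raise ValueError("expected exactly three judge votes")
    return sum(vote.accepts for vote in votes) >= required


votes = [
    JudgeVote(True, True),
    JudgeVote(True, True),
    JudgeVote(True, False),  # this judge found the impact explanation lacking
]
print(is_caught(votes))  # two of three judges accept, so the bug counts as caught
```

Requiring both criteria per judge before counting the vote keeps a vague drive-by comment near the right line from being credited as a catch.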
Results are scored per bug, not per pull request, and sorted by F1 within each group. All tools ran with their default configurations. Higher-precision tools post fewer comments overall; Tenki's higher comment volume drives both its higher recall and its lower precision. See methodology for how each metric is computed.
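| Reviewer | Recall | Precision | F1 |
| --- | --- | --- | --- |
| Tenki | 68.9% (84/122 bugs) | 29.9% | 41.7 |
| CodeRabbit | 28.7% (35/122 bugs) | 25.0% | 26.7 |
| Greptile | 36.1% (44/122 bugs) | 15.9% | 22.1 |
| Copilot | 24.6% (30/122 bugs) | 18.9% | 21.4 |
| Graphite | 3.3% (4/122 bugs) | 50.0% | 6.2 |

Coding agents

| Reviewer | Recall | Precision | F1 |
| --- | --- | --- | --- |
| Devin | 36.1% (44/122 bugs) | 47.3% | 40.9 |
| Cursor | 32.0% (39/122 bugs) | 51.3% | 39.4 |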
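The F1 column is consistent with the standard harmonic mean of precision and recall. The sketch below recomputes it from the published percentages. It assumes that standard definition, since the formula is not spelled out here, and the function name `f1_score` is illustrative.

```python
def f1_score(recall_pct: float, precision_pct: float) -> float:
    """Harmonic mean of recall and precision, both given as percentages."""
    if recall_pct + precision_pct == 0:
        return 0.0
    return 2 * recall_pct * precision_pct / (recall_pct + precision_pct)


# (recall %, precision %) as published in the results table
published = {
    "Tenki": (68.9, 29.9),
    "CodeRabbit": (28.7, 25.0),
    "Greptile": (36.1, 15.9),
    "Copilot": (24.6, 18.9),
    "Graphite": (3.3, 50.0),
    "Devin": (36.1, 47.3),
    "Cursor": (32.0, 51.3),
}

for reviewer, (recall, precision) in published.items():
    print(f"{reviewer}: F1 = {f1_score(recall, precision):.1f}")

# Recall is caught bugs over total bugs, e.g. 84 of 122 for Tenki.
print(f"Tenki recall = {84 / 122:.1%}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two inputs, which is why Graphite's 50.0% precision still yields a low score alongside 3.3% recall.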
By Severity
Do AI code reviewers catch the bugs that actually matter?
Catch rate broken down by the severity of the individual finding. Critical bugs cause outages, data loss, or auth bypass. High-severity bugs break major user-facing flows. Medium bugs degrade behavior without breaking it. The severity that matters most for production reliability is the first column.
| Reviewer | Catch rate |
| --- | --- |
| Tenki | 69% (84/122 bugs) |
| Devin | 36% (44/122 bugs) |
| Greptile | 36% (44/122 bugs) |
| Cursor | 32% (39/122 bugs) |
| CodeRabbit | 29% (35/122 bugs) |
| Copilot | 25% (30/122 bugs) |
| Graphite | 3% (4/122 bugs) |
By Repository
How each reviewer performs across five production codebases.
Per-repository recall: the share of real bugs each AI code reviewer caught in each codebase.
| Reviewer | Recall |
| --- | --- |
| Tenki | 91% (30/33 bugs) |
| Devin | 33% (11/33 bugs) |
| Cursor | 30% (10/33 bugs) |
| Greptile | 30% (10/33 bugs) |
| CodeRabbit | 21% (7/33 bugs) |
| Copilot | 21% (7/33 bugs) |
| Graphite | 0% (0/33 bugs) |
Case Library
Every real bug, every reviewer verdict.
One row per real bug: 122 findings across 50 production pull requests. Each reviewer's verdict on a given defect (caught or missed) is decided by 3-LLM majority vote. Reviewers compared: Tenki, CodeRabbit, Copilot, Cursor, Devin, Graphite, Greptile.

| Bug | Severity | Repository |
| --- | --- | --- |
| deleteCacheHandler throws generic Error → tRPC surfaces as 500 instead of 403/404 | Medium | Cal.com |
| Checkbox fires onChange twice per click via redundant onClick + onCheckedChange handlers | Low | Cal.com |
| `getPendingActions` never shows confirm button for paid payment-enabled bookings | High | Cal.com |
| Missing `revalidateTag('team-features')` leaves settings layout cache stale after role update | Medium | Cal.com |
| Past pending-unconfirmed bookings lose cancel/edit actions | High | Cal.com |
| Functional regression: 'Check for recordings' action replaced with disabled button | Medium | Cal.com |
| afterNthPaintCycle fires callback after n+1 frames, not n frames | Low | Cal.com |
| no_show action always included in afterEventActions, shown disabled for upcoming bookings | Medium | Cal.com |
| Charge card action hidden for recurring bookings in recurring tab | Medium | Cal.com |
forEach with async callback fire-and-forgets calendar/video deletions