Code Review

May 2026

Feedback Sensors for Coding Agents: Wiring Quality Gates Into Self-Correction Loops

Eddie Wang, Engineering

What "Trial" actually means here
The four layers of feedback
1. Compiler and type checker output
2. Linters and static analysis
3. Test suites
4. Higher-order validators
Implementation patterns that work
The in-loop check
Pre-commit hooks as guardrails
Companion watcher processes
Mutation testing: the sensor that tests your tests
Fuzz testing and code health analysis
How this changes the reviewer's job
Anti-patterns: when feedback loops go wrong
Wiring it together: a practical configuration
The feedback flywheel connection
Where to start

A coding agent writes a function, generates tests, and opens a pull request. The tests pass. Coverage looks fine. A human reviewer spends twenty minutes reading the diff, spots a type coercion bug on line 47, and leaves a comment. The agent didn't know about the bug because nothing in its workflow told it to check.

That's the gap feedback sensors are designed to close. ThoughtWorks elevated this technique to Trial in Technology Radar Vol 34 (April 2026), recognizing that deterministic quality gates (compilers, linters, type checkers, test suites) wired directly into agentic workflows catch more defects than post-hoc human review alone. The idea isn't new. Developers have long run commands like npm run check before pushing. What's changed is that agents can now consume that output programmatically, fix violations, and re-run the checks in a tight loop, all before a human ever sees the PR.

What "Trial" actually means here

ThoughtWorks uses four rings: Adopt, Trial, Assess, and Hold. Trial means the technique has been used successfully in production by multiple teams and is worth pursuing with managed risk. For feedback sensors, the Radar's language is specific: these checks "reduce routine steering work for the human in the loop" and should "run during the coding session and report clean results before a commit is made, rather than relying on post-commit checks."

The distinction matters. Post-commit CI runs catch errors, but by then the agent's context window has moved on. Fixing the error requires re-loading context, re-reading the diff, and generating a new patch. In-session feedback skips that round trip, so it is typically far cheaper in both tokens and wall-clock time.

The four layers of feedback

Not all feedback sensors are created equal. They vary in speed, signal quality, and how easily they can be wired into an agent's correction loop.

1. Compiler and type checker output

The fastest, most reliable sensor. A TypeScript agent runs tsc --noEmit after each edit, gets error output with file, line, and column, and can fix violations immediately. Rust's cargo check goes further because the borrow checker rejects entire categories of memory-safety bugs that would slip past a linter. The key property: the signal is deterministic and unambiguous. If the compiler rejects the code, it will not build until the agent changes it, so there is nothing to argue about.
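To make the "file, line, and column" point concrete, here is a minimal sketch of how an agent harness might turn compiler output into structured diagnostics. It assumes tsc is run with `--pretty false`, which prints one `file(line,col): error TSxxxx: message` line per diagnostic. The `Diagnostic` class and `parse_tsc_output` function are hypothetical names used for illustration.

```python
import re
from dataclasses import dataclass

# Matches tsc's plain (non-pretty) diagnostic lines, for example:
#   src/app.ts(47,12): error TS2322: Type 'string' is not assignable to type 'number'.
TSC_LINE = re.compile(
    r"^(?P<file>.+?)\((?P<line>\d+),(?P<col>\d+)\): "
    r"error (?P<code>TS\d+): (?P<message>.*)$"
)


@dataclass(frozen=True)
class Diagnostic:
    file: str
    line: int
    col: int
    code: str
    message: str


def parse_tsc_output(output: str) -> list[Diagnostic]:
    """Turn raw tsc output into structured diagnostics; unmatched lines are skipped."""
    diagnostics = []
    for raw in output.splitlines():
        match = TSC_LINE.match(raw.strip())
        if match:
            diagnostics.append(
                Diagnostic(
                    file=match["file"],
                    line=int(match["line"]),
                    col=int(match["col"]),
                    code=match["code"],
                    message=match["message"],
                )
            )
    return diagnostics
```

Summary lines such as "Found 1 error" do not match the pattern and are dropped, so the agent only sees entries it can act on.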

2. Linters and static analysis

ESLint, Clippy, Ruff, and Semgrep catch style violations, common bug patterns, and security issues that compilers miss. They're slightly noisier than compilers (some rules are opinionated), but agents handle that fine. The trick is running them with the project's actual configuration, not defaults, and confirming that each fix leaves the build intact. An agent that clears a "no-unused-vars" warning by deleting an import a later function still needs has made things worse, not better.

3. Test suites

Running the test suite after changes is the most common sensor, and the most dangerous to get wrong. A passing suite doesn't mean the code works — it means the tests the agent wrote (or that already existed) don't catch the problem. This is where most teams stop, and where the more advanced sensors earn their keep.

4. Higher-order validators

Mutation testing, fuzz testing, and code health analysis go beyond "does it compile and pass." They ask whether the tests are actually guarding the right behavior. More on these below.

Implementation patterns that work

The Radar describes two implementation shapes: a reviewer agent that runs checks and triggers corrections, or a companion process that runs in parallel and agents can query. In practice, most teams land on one of these concrete patterns.

The in-loop check

The agent's instruction set (an AGENTS.md or equivalent) tells it to run a check command after each file edit and fix any errors before moving on. This is the simplest pattern and it works surprisingly well. A typical instruction might say: "After code changes, run npm run check. Fix all errors before committing." The agent gets structured output, applies fixes, re-runs, and iterates until clean.
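The run, fix, re-run cycle described above can be sketched as a small bounded loop. This is an illustration under assumptions, not a prescribed implementation: `apply_fix` is a hypothetical callback that hands the check output back to the agent, and the attempt limit of 3 is an arbitrary default.

```python
import subprocess
from typing import Callable


def run_check(command: list[str]) -> tuple[bool, str]:
    """Run a check command; return (passed, combined stdout and stderr)."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def correction_loop(
    check: Callable[[], tuple[bool, str]],
    apply_fix: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Re-run `check` until it passes or the attempt budget is spent."""
    for _ in range(max_attempts):
        passed, output = check()
        if passed:
            return True
        # Hypothetical hook: give the error output to the agent so it can patch the code.
        apply_fix(output)
    # One final check so a fix applied on the last attempt still counts.
    passed, _ = check()
    return passed
```

A caller might wire it up as `correction_loop(lambda: run_check(["npm", "run", "check"]), send_to_agent)`, where `send_to_agent` stands in for whatever mechanism the agent framework provides.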

Pre-commit hooks as guardrails

If your agent's commit gets rejected by a pre-commit hook that runs formatters and linters, the agent gets immediate feedback and can retry. This is a simple safety net for agents that skip in-loop checks, but it's a blunt instrument — the feedback arrives late, after the agent thinks it's done.

Companion watcher processes

A background process watches the filesystem and continuously runs the type checker, linter, and a targeted subset of tests. The agent queries it via a local API or reads its output file. This avoids blocking the agent while checks run and gives near-real-time feedback. Some teams expose this through an MCP server, letting the agent ask "what's broken right now?" as a tool call.
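One way to picture the "reads its output file" variant is a shared JSON status file: the watcher merges each check result into it, and the agent reads back only the failures. The file layout and the function names `publish_status` and `whats_broken` are assumptions made for this sketch, and the filesystem-watching part is left out.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def publish_status(status_path: Path, check_name: str, passed: bool, output: str) -> None:
    """Watcher side: merge one check result into the shared status file."""
    current = json.loads(status_path.read_text()) if status_path.exists() else {}
    current[check_name] = {"passed": passed, "output": output, "updated_at": time.time()}
    # Write to a temp file and rename it so the agent never reads a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=status_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(current, handle)
    os.replace(tmp_name, status_path)


def whats_broken(status_path: Path) -> dict[str, str]:
    """Agent side: return {check name: output} for every failing check."""
    if not status_path.exists():
        return {}
    status = json.loads(status_path.read_text())
    return {name: result["output"] for name, result in status.items() if not result["passed"]}
```

The same `whats_broken` function could sit behind a tool call instead of a file read; the file is simply the lowest-friction transport.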

Mutation testing: the sensor that tests your tests

The Radar also elevated mutation testing to Trial in the same volume, and the connection to feedback sensors is direct. Tools like cargo-mutants (Rust), Stryker (JS/C#), and Pitest (Java) inject deliberate bugs into source code and check whether the tests catch them. A surviving mutant means a test gap.

This is particularly valuable for AI-generated code. An agent can write a function and tests that achieve 100% line coverage while missing entire categories of logical errors. The tests pass, coverage is green, and the code is still wrong. Mutation testing breaks that illusion. As ThoughtWorks puts it: "If a mutation goes undetected, it reveals a gap in validation rather than just a lack of coverage."

The feedback loop works like this: the agent writes code and tests, then cargo-mutants injects a bug (say, replacing Some(val) with None), and if the tests still pass, that surviving mutant gets fed back to the agent as a prompt: "your tests didn't catch this change." The agent strengthens the test, and the cycle repeats until the kill rate meets a threshold.
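The loop above boils down to two small steps: compute the kill rate, then turn the survivors into a prompt. The sketch below assumes the mutation tool's report has already been parsed into a list of records. The `Mutant` shape is hypothetical and does not mirror the actual report format of cargo-mutants, Stryker, or Pitest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mutant:
    file: str
    line: int
    description: str  # e.g. "replace Some(val) with None"
    killed: bool      # True if at least one test failed against this mutant


def kill_rate(mutants: list[Mutant]) -> float:
    """Fraction of mutants caught by the tests; 1.0 when there is nothing to mutate."""
    if not mutants:
        return 1.0
    return sum(m.killed for m in mutants) / len(mutants)


def survivor_feedback(mutants: list[Mutant], threshold: float = 0.8) -> Optional[str]:
    """Build a prompt for the agent, or return None when the threshold is met."""
    rate = kill_rate(mutants)
    if rate >= threshold:
        return None
    lines = [f"Mutation kill rate is {rate:.0%}, below the {threshold:.0%} target."]
    lines.append("Your tests did not catch these changes:")
    lines += [f"- {m.file}:{m.line}: {m.description}" for m in mutants if not m.killed]
    return "\n".join(lines)
```

Returning `None` at or above the threshold gives the surrounding loop a clear stopping condition.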

Fuzz testing and code health analysis

Two other tools fit naturally into the feedback sensor model.

WuppieFuzz is an open-source, coverage-guided REST API fuzzer developed by TNO. It parses OpenAPI specs and generates request sequences, using coverage feedback to prioritize mutations that explore deeper code paths. For teams building APIs, running WuppieFuzz against agent-generated endpoints can surface crashes, unhandled edge cases, and security issues that unit tests typically miss. The output is concrete — a specific request that caused a 500 — and agents can consume it directly.

CodeScene takes a different angle. Rather than testing behavior, it analyzes code health — complexity hotspots, coupling, code duplication, and change frequency patterns. It runs in CI/CD pipelines as a quality gate and can flag when an agent's changes degrade the health score of a file that's already a hotspot. CodeScene now offers an MCP server, which means agents can query code health data directly during a session. An agent about to refactor a file with a health score of 3/10 can adjust its approach based on what CodeScene reports about that file's history.

How this changes the reviewer's job

When feedback sensors work properly, the PR that lands in a reviewer's queue has already been compiled, linted, type-checked, and tested. The agent has iterated on failures until everything is clean. So what's left for the human?

The stuff that matters most. Architectural decisions: does this abstraction belong here? Is this the right data model? Will this pattern scale? These are judgment calls no linter can make. Reviewers stop spending time on formatting, unused imports, and obvious type errors — the feedback sensors handle that — and focus on whether the code solves the right problem in the right way.

ThoughtWorks' related blip on measuring collaboration quality reinforces this. They recommend tracking first-pass acceptance rate, iteration cycles per task, and post-merge rework as better metrics than raw throughput. Fewer failed builds and shorter feedback cycles indicate that sensors are doing their job.
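As a rough illustration, the three metrics named above can be computed from per-task records. The `TaskOutcome` fields are assumptions about what a team might log. The Radar does not prescribe a schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    iterations: int             # fix-and-rerun cycles before the PR was opened
    accepted_first_pass: bool   # reviewer approved without requesting changes
    reworked_after_merge: bool  # a follow-up fix was needed after merging


def collaboration_metrics(outcomes: list[TaskOutcome]) -> dict[str, float]:
    """Summarize a batch of completed agent tasks; all values are 0.0 for an empty batch."""
    total = len(outcomes)
    if total == 0:
        return {
            "first_pass_acceptance_rate": 0.0,
            "mean_iterations": 0.0,
            "post_merge_rework_rate": 0.0,
        }
    return {
        "first_pass_acceptance_rate": sum(o.accepted_first_pass for o in outcomes) / total,
        "mean_iterations": sum(o.iterations for o in outcomes) / total,
        "post_merge_rework_rate": sum(o.reworked_after_merge for o in outcomes) / total,
    }
```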

Tenki's code reviewer fits into this model as an additional feedback layer. It provides context-aware reviews on pull requests at $0.50 per review, catching issues that deterministic tools miss while still running before a human reviewer picks up the PR. Used alongside compiler and linter feedback, it fills the gap between "does it compile" and "does it make sense."

Anti-patterns: when feedback loops go wrong

Feedback sensors aren't risk-free. A few patterns consistently cause problems.

Infinite correction spirals. An agent fixes a linter error by introducing a type error. Fixing the type error triggers a new linter violation. Without a retry cap, the agent burns tokens in an endless loop. Every feedback sensor integration needs a maximum iteration count and a bail-out path that flags the issue for human review.
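A plain iteration cap stops the spiral eventually, but a harness can often bail out sooner by noticing that a previously seen set of errors has come back. The sketch below is one possible approach, not an established recipe: it fingerprints the error set on each pass and stops as soon as a fingerprint repeats. `collect_errors` and `apply_fix` are hypothetical callbacks.

```python
import hashlib
from typing import Callable


def fingerprint(errors: list[str]) -> str:
    """Order-independent digest of the current error set."""
    joined = "\n".join(sorted(set(errors)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def run_until_clean_or_stuck(
    collect_errors: Callable[[], list[str]],
    apply_fix: Callable[[list[str]], None],
    max_iterations: int = 5,
) -> str:
    """Return "clean", "spiral" (an earlier error set came back), or "budget_exhausted"."""
    seen: set[str] = set()
    for _ in range(max_iterations):
        errors = collect_errors()
        if not errors:
            return "clean"
        digest = fingerprint(errors)
        if digest in seen:
            return "spiral"  # same errors as an earlier pass: flag for human review
        seen.add(digest)
        apply_fix(errors)
    # Check once more so a fix applied on the final iteration still counts.
    return "clean" if not collect_errors() else "budget_exhausted"
```

Both non-clean results are the bail-out path: the harness stops spending tokens and escalates to a human along with the last error set.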

Green tests masking design problems. An agent writes tests that achieve high coverage but test implementation details rather than behavior. Everything is green, mutation testing kills 90% of mutants, and the code is still poorly designed. Feedback sensors validate correctness, not design quality. That's still a human job.

Sensor overload. Running the full test suite, mutation testing, fuzz testing, and code health analysis after every file edit would bring any agent to a crawl. The right approach is tiered: fast checks (compiler, linter) run after each edit, the test suite runs after a logical unit of work, and slow validators (mutation testing, fuzzing) run before the PR is opened.

Suppressing warnings to pass. Agents optimized for "make the checks pass" will occasionally add // eslint-disable-next-line or @ts-ignore rather than fix the underlying issue. Your agent instructions should explicitly forbid this pattern. A good rule: "Never suppress warnings. Fix the root cause."
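Instructions alone can be ignored, so it may help to back the rule with a check of its own. The sketch below scans a unified diff for newly added suppression comments. The marker list covers the two named above plus `@ts-expect-error` and `@ts-nocheck`, which is an assumption about what a team would want to block, and `added_suppressions` is a hypothetical helper name.

```python
import re

# Suppression markers to reject. Extend this for the linters a project actually uses.
SUPPRESSION = re.compile(r"eslint-disable|@ts-ignore|@ts-expect-error|@ts-nocheck")


def added_suppressions(unified_diff: str) -> list[str]:
    """Return added diff lines (prefixed with '+') that contain a suppression marker."""
    hits = []
    for line in unified_diff.splitlines():
        # Skip the '+++ b/path' file header, which is not an added line.
        is_added = line.startswith("+") and not line.startswith("+++")
        if is_added and SUPPRESSION.search(line):
            hits.append(line[1:].strip())
    return hits
```

Removed suppressions (lines starting with `-`) are deliberately ignored, since deleting a disable comment is the behavior the rule is meant to encourage.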

Wiring it together: a practical configuration

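Here's what a feedback sensor setup might look like in an AGENTS.md for a TypeScript project. The test command assumes a Vitest-style runner that accepts --run, and {changed_files} is a placeholder the agent fills in: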

# Feedback Sensors

## After every file edit
- Run `npx tsc --noEmit` and fix all type errors
- Run `npx eslint --no-warn-ignored {changed_files}`
- Fix violations at the source. Never add disable comments.

## After completing a logical change
- Run `npm test -- --run` for the full suite
- If tests fail, fix the code (not the tests) unless
  the test itself is wrong
- Max 3 fix-and-rerun cycles. If still failing,
  stop and report the issue.

## Before opening a PR
- Run `npx stryker run --mutate src/{changed_files}`
- Target: 80% mutation kill rate minimum
- Strengthen tests for surviving mutants

The tiered structure is intentional. Fast checks run constantly. Slower checks run at natural breakpoints. The retry cap ("Max 3 fix-and-rerun cycles") prevents infinite spirals. And the mutation testing threshold gives the agent a concrete target rather than an open-ended "improve the tests."

The feedback flywheel connection

The Radar's Assess ring includes a related blip called the feedback flywheel — a meta-technique where teams review agent session outcomes and use them to improve the agent's instructions and sensors over time. It's the same idea as retrospectives applied to agent workflows.

The sensors and the flywheel work together. If you notice an agent keeps hitting the same linter rule and wasting cycles fixing it, you add a more specific instruction to prevent the pattern. If mutation testing consistently finds the same class of surviving mutant, you add a custom structural test that catches it earlier. The sensor provides the signal; the flywheel turns that signal into a permanent improvement.
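Spotting "the same linter rule" across sessions is a counting problem. As a minimal sketch, assuming each session log has been reduced to a list of rule identifiers (a hypothetical format), the recurring offenders can be surfaced like this:

```python
from collections import Counter


def recurring_findings(sessions: list[list[str]], min_sessions: int = 3) -> list[tuple[str, int]]:
    """Rules that fired in at least `min_sessions` distinct sessions, most frequent first."""
    counts: Counter = Counter()
    for rules_hit in sessions:
        # Count each rule once per session so one noisy session does not dominate.
        counts.update(set(rules_hit))
    return [(rule, n) for rule, n in counts.most_common() if n >= min_sessions]
```

Each rule that crosses the threshold is a candidate for a more specific instruction in the agent's configuration.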

OpenAI's engineering team describes a similar concept as "garbage collection" for codebases — even with strong early feedback loops, entropy accumulates. Architecture drift reduction with LLMs, another Radar blip, extends feedback sensors into later stages of the delivery lifecycle by combining deterministic analysis tools with LLM-powered evaluation to detect structural violations.

Where to start

If your coding agents aren't already consuming compiler and linter output, start there. Add a single line to your agent instructions: "Run the check command after changes and fix all errors before committing." That alone should catch many of the most common agent-generated defects, such as type errors and unused imports, before a reviewer ever sees them.

Once that's stable, add test suite feedback with a retry cap. Then consider mutation testing for your core domain logic — not everything, just the code where a subtle bug would actually hurt. Layer on code health analysis and fuzzing when you're ready for them.

The goal isn't to automate review away. It's to make every review count. When the agent handles the mechanical checks, the human reviewer can focus on the questions that actually require judgment — and that's a better use of everyone's time.
