Articles
May 4, 2026

AI Made the Code, but Who’s Actually Reviewing It?

AI made code generation fast. Code review didn't get the memo.

TL;DR

Generation went from "write the code" to "describe what you want." Review didn't follow. The result is the LGTM reflex — reviewers approving diffs they didn't really read, because the volume made deep review impractical. The bottleneck moved, and most tooling is still pointing at the wrong end of the pipeline.

We built Contextur because we kept hitting this on real projects. Here's the framing and what the tool actually does about it.

The development process changed under our feet

Three years ago, the slowest part of shipping a feature was someone typing it. That stopped being true around the time Cursor and Claude Code became table stakes. Recent industry studies put generation throughput somewhere between 2.3x and 5x human baseline in controlled environments, and on a good day it feels higher.

What didn't change at the same rate:

  • The time it takes a senior engineer to reason about a diff.
  • The cost of fixing a defect that slipped through review and reached production.
  • The cognitive load of holding a system's invariants in your head while reading 800 lines of generated code.

The metric worth watching isn't lines of code per day. It's lines of code per reviewer-hour, and that ratio has gone the wrong way for almost everyone using these tools seriously.

Review is now the bottleneck — and most teams haven't noticed

The failure mode has a name in the recent literature: the LGTM reflex. When PR volume outruns reviewer attention, reviewers stop reading diffs. They scan filenames, check that CI is green, and approve. DORA's stability numbers in AI-accelerated codebases bear this out — change failure rates trend up, and most of the growth comes from defects that an attentive review three years ago would have caught.

The part people get wrong: the answer isn't more reviewers, and it isn't smarter reviewers. It's structurally cheaper review.

The naive AI response is to throw an LLM at the diff and ask "review this PR." That tends to make things worse, not better. You get:

  • Generic findings that mistake "different from this codebase" for "wrong."
  • False-positive rates high enough to train reviewers to ignore the bot.
  • No memory of project-specific decisions, so the same architectural advice keeps reappearing.
  • One opaque output blob that is itself fatiguing to read.

Now the reviewer is reading the diff and the LLM's review of the diff. Reviewer attention has gotten more expensive, not less.

What Contextur does differently

Contextur is a CLI you drop into a repo. It runs locally, doesn't call any LLM API directly, and produces a structured review prompt that your existing agentic tool (Cursor, Claude Code, Codex) executes.

The piece that matters is the architecture. Contextur runs a three-stage pipeline for every review:

Stage 1 — Specialists. Up to ten focused reviewers run independently against the diff: correctness, security, architecture, testing, operability, performance, api-contract, data-migration, ci-release, maintainability. Each has a tight, explicit scope and a strict output schema. Critically, each one is required to quote the offending code at path:line. No quote, no finding.
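
To make the "no quote, no finding" rule concrete, here is a minimal sketch of what a specialist's output contract could look like. The type and field names are illustrative assumptions, not Contextur's actual schema.

```typescript
// Illustrative sketch only -- names and shapes are assumptions, not Contextur's real schema.
type Severity = "critical" | "high" | "medium" | "low";

interface SpecialistFinding {
  specialist: string;   // e.g. "security", "data-migration"
  severity: Severity;
  title: string;
  explanation: string;
  quote: {
    path: string;       // file touched by the diff
    line: number;       // line of the offending code
    snippet: string;    // verbatim code being flagged
  };
}

// A finding that can't point at concrete code is dropped before anything else runs.
function isGrounded(f: SpecialistFinding): boolean {
  return f.quote.path.length > 0 && f.quote.line > 0 && f.quote.snippet.trim().length > 0;
}
```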

Stage 2 — The Challenger. Every finding marked critical or high is sent through an adversarial pass that issues one of three verdicts: CONFIRMED, DOWNGRADED, or REJECTED. The Challenger's job is to kill false positives before they reach the human. In my experience this is the single biggest unlock for reviewer attention.
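
A hedged sketch of the challenger gate, building on the same illustrative types: the three verdict names come from the description above, and only critical and high findings are escalated. Everything else here is assumed.

```typescript
// Illustrative sketch -- the verdict names are from the article; the rest is assumed.
type Verdict = "CONFIRMED" | "DOWNGRADED" | "REJECTED";

// Only the findings that would most mislead a reviewer get the adversarial pass.
function needsChallenge(f: SpecialistFinding): boolean {
  return f.severity === "critical" || f.severity === "high";
}
```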

Stage 3 — The Synthesizer. A final pass deduplicates findings across specialists, applies the Challenger's verdicts, and produces one developer-facing report sorted by severity, capped at twenty entries.
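
A minimal sketch of the synthesis step under the same illustrative types. Only the behavior described above is taken from the article (drop REJECTED findings, apply downgrades, sort by severity, cap at twenty); the deduplication key and the downgrade rule are assumptions.

```typescript
// Illustrative sketch of synthesis; dedup key and downgrade rule are assumptions.
const SEVERITY_ORDER: Severity[] = ["critical", "high", "medium", "low"];
const MAX_REPORT_ENTRIES = 20;

function downgrade(s: Severity): Severity {
  const i = SEVERITY_ORDER.indexOf(s);
  return SEVERITY_ORDER[Math.min(i + 1, SEVERITY_ORDER.length - 1)];
}

function synthesize(
  findings: Array<SpecialistFinding & { verdict?: Verdict }>
): SpecialistFinding[] {
  const seen = new Set<string>();
  const report: SpecialistFinding[] = [];

  for (const f of findings) {
    if (f.verdict === "REJECTED") continue;              // killed by the Challenger
    const adjusted =
      f.verdict === "DOWNGRADED" ? { ...f, severity: downgrade(f.severity) } : f;
    const key = `${adjusted.quote.path}:${adjusted.quote.line}`; // crude dedup across specialists
    if (seen.has(key)) continue;
    seen.add(key);
    report.push(adjusted);
  }

  report.sort(
    (a, b) => SEVERITY_ORDER.indexOf(a.severity) - SEVERITY_ORDER.indexOf(b.severity)
  );
  return report.slice(0, MAX_REPORT_ENTRIES);
}
```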

The reviewer never sees the raw output of the ten specialists. They see one synthesized report with hallucinations pruned and over-classified findings already downgraded.

How this actually helps reviewers go faster

Speed in code review isn't really measured in seconds per diff. It's measured in whether the reviewer keeps enough attention budget to think hard about the things that matter.

Contextur helps three concrete ways:

  1. Severity-ordered, deduplicated output. The reviewer reads one ranked list, not ten parallel reports they have to mentally merge.
  2. False-positive suppression by design. The Challenger stage means most of the noise that would erode reviewer trust never reaches the report. The bot stays credible, which is what makes it useful at all.
  3. Repo-resident standards. The reviewer prompts live in .contextur/, version-controlled with the code. When architecture decisions change, the reviewers update with them. Reviews stay aligned with how the project actually works, not how it worked six months ago.

The point isn't that the reviewer reads faster. It's that the reviewer is still willing to read carefully by the third PR of the day.

What it isn't

Contextur is an early MVP, not yet on npm. It doesn't replace human review — it makes the human's review cheaper.

It doesn't fix toxic debt that's already in your codebase; it just slows the rate at which new toxic debt is added. And the quality of the output still depends on the agent doing the reasoning underneath.

But the framing matters more than the tool. Most AI tooling for development so far has been pointed at the generation half of the loop, where the productivity gains are loud and the failure modes are quiet. The next round of useful tools will be pointed at the validation half, where the bottleneck has been hiding in plain sight.

That's the bet behind Contextur, and so far it's holding up.

Written by Ignacio Vallarino