The Experiment That Destroyed Its Own Control Group

Convergence: #2 + #17 + #18

In July 2025, METR published a randomized controlled trial on AI-assisted coding. Sixteen experienced open-source developers. Their own repositories. Average five years of experience, 1,500 commits each. Cursor Pro with Claude 3.5/3.7 Sonnet. The gold standard: real developers, real code, real randomization.

Before the study, the developers predicted AI would make them 24% faster.

After the study, they believed they had been 20% faster.

Screen recordings showed they were 19% slower.

A 43-point gap between prediction and measurement, and still a 39-point gap between what they believed afterward and what the recordings showed. The recordings showed why: the lost time was not model latency but idle time spent context-switching, checking, waiting, and second-guessing. These periods of inactivity didn't exist in non-AI sessions. The tool was outputting. The developer was stalled.

But that isn't what makes this study important. Surprising results are common. What makes METR important is what happened next.

The Instrument Broke

METR's February 2026 follow-up expanded to 57 developers and 800+ tasks. The headline finding shrank to −4% for new participants. But the study revealed something more disturbing than the speed measurement: the RCT itself was becoming impossible to run.

The problems, as METR reported them:

30–50% of invited developers refused to participate

Not because of compensation (offered $50/hour). Because they would not accept working without AI for half the tasks. One developer: "AI can finish in just 2 hours, but I have to spend 20 hours."

Developers gamed task selection

Those who participated avoided submitting their hardest tasks to the no-AI condition. They routed complex work toward AI sessions and routine work toward manual sessions, systematically corrupting the randomization.

Time tracking became impossible

With agentic tools running in the background, developers couldn't accurately report when they were "working" vs. waiting. The unit of measurement — developer-hours — lost its definition.

METR's own assessment

They called their results "very weak evidence" and announced they are redesigning the methodology entirely — moving away from RCTs toward observational data, shorter experiments, and developer-level randomization.

Read that last point again. The organization that ran the most rigorous study of AI developer productivity concluded that the gold-standard method — the randomized controlled trial — no longer works for this question. Not because of funding, logistics, or sample size. Because the treatment destroyed the control condition.
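The task-selection problem can be made concrete with a toy simulation. This is a minimal sketch, not METR's estimator: the function name `estimated_slowdown`, the `gaming_rate` parameter, the 7-hour "hard task" threshold, and the uniform task sizes are all illustrative assumptions. The model deliberately gives AI no effect on completion time, so any measured difference between arms comes purely from how tasks were assigned.

```python
import random
from statistics import mean


def estimated_slowdown(gaming_rate: float, n_tasks: int = 20_000, seed: int = 0) -> float:
    """Return (mean AI-session hours / mean manual-session hours) - 1.

    Toy model: AI has NO true effect on completion time, so any non-zero
    result is an artefact of task assignment. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    ai_times, manual_times = [], []
    for _ in range(n_tasks):
        hours = rng.uniform(1.0, 10.0)   # task size, identical with or without AI
        use_ai = rng.random() < 0.5      # the coin flip the RCT intends
        is_hard = hours > 7.0            # assumed cutoff for a "hard" task
        # Gaming: a hard task that landed in the manual arm is sometimes
        # re-routed to the AI arm by the developer.
        if is_hard and not use_ai and rng.random() < gaming_rate:
            use_ai = True
        (ai_times if use_ai else manual_times).append(hours)
    return mean(ai_times) / mean(manual_times) - 1.0


clean = estimated_slowdown(gaming_rate=0.0)   # assignment left to the coin flip
gamed = estimated_slowdown(gaming_rate=0.5)   # half of hard manual tasks re-routed

print(f"apparent AI slowdown, clean randomization: {clean:+.1%}")
print(f"apparent AI slowdown, gamed assignment:    {gamed:+.1%}")
```

Under these toy parameters the clean run sits near zero, while the gamed run makes AI sessions look roughly 20% slower in expectation, even though the model contains no AI effect at all. The direction and size of the bias depend entirely on the assumed routing behavior; the point is only that the estimate stops measuring the treatment.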

This Has Happened Before

The METR problem is not unique to AI. It's an instance of a general pattern: technologies that create cognitive dependence make their own impact unmeasurable. The pre-technology cognitive state becomes unrecoverable — not because measurement is forbidden, but because the technology has changed the measurer.

| Domain | Technology | Cognitive Dependence | Measurement Collapse |
| --- | --- | --- | --- |
| Navigation | GPS | Hippocampal spatial memory declines with habitual use (Ishikawa et al., Nature Scientific Reports, 2020) | "Navigation without GPS" measures a degraded brain, not a pre-GPS brain. The baseline is neurologically gone. |
| Mathematics | Calculators | 50+ years of evidence that calculator-trained students have weaker mental computation | "Math without calculators" in calculator-trained students ≠ math in pre-calculator students. The debate ran 50 years without resolution. |
| Aviation | Autopilot | Manual flying skills decay with automation reliance (FAA flagged as safety concern) | When autopilot fails, automation-dependent pilots perform worse than pilots who maintained manual skills. "Without autopilot" becomes a different measurement. |
| Experimental design | Proven treatments | Once treatment is proven, control group demoralization corrupts comparison (Cook & Campbell, 1979) | "Resentful demoralization": control subjects perform worse because they know they're missing treatment. The control condition is contaminated by knowledge of the treatment. |
| Software | AI coding tools | Developers refuse no-AI condition, game task selection, can't define work-hours with agentic tools | METR abandons RCT methodology. The gold standard is consumed by the phenomenon it measures. |

In every case, the same structure: the technology offloads a cognitive function, the offloading atrophies or reshapes the original capacity, and measurement of "performance without the technology" now measures a different population than existed before the technology. The counterfactual evaporates.

The Convergence

This is not one mechanism. It's three of my numbered mechanisms operating simultaneously, plus a pattern that crosses all of them:

Mechanism #2

Methodology Creates Finding

The RCT design requires stable populations. AI creates unstable populations. The study design constrains what can be found — developers who benefit most self-select out.

Mechanism #17

Scale Dissolution

Task-level gains are real. They dissolve at worker, firm, and economy levels. The dissolution isn't just economic — it's epistemological. The measurement dissolves too.

Mechanism #18

Diagnosed Paralysis

METR identified the problem, documented it rigorously, and cannot fix it. The methodology has been diagnosed as broken. The cure — better study design — is precisely what's being abandoned.

The Pattern

Cognitive Dependence → Measurement Irreversibility

When a tool changes cognition itself, the pre-tool state is unrecoverable. Not ethically forbidden — just neurologically, psychologically, and practically gone. The instrument requires a population that no longer exists.

What's new isn't any single mechanism. It's their interaction. When methodology constrains findings (#2), when gains dissolve across scales (#17), and when the system diagnoses its failure but can't adopt the cure (#18) — these don't just co-occur. They reinforce. The methodology breaks because the population changed. The population changed because the technology works at the task level. The task-level gain is real, which accelerates adoption, which deepens dependence, which further corrupts the control condition.

It's a loop. And METR is now inside it.

The Perception-Reality Ratchet

KaraxAI compiled five independent studies measuring the gap between perceived and actual AI productivity. Three of them show the pattern:

METR developers: predicted +24%, actual −19%. Gap: 43 points.
Foxit executives: perceived 4.6h saved, net gain 16 minutes. Gap: 94% of the perceived saving never materialized.
Faros engineering: 21% more tasks completed, PR review time up 91%, delivery flat. Net: zero.
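The gap figures above are simple arithmetic, and it helps to see which two numbers each one compares. The sketch below recomputes them from the values quoted in the list. The helper names `point_gap` and `shortfall_share` are illustrative, and the Foxit calculation assumes the 4.6 hours and the 16 minutes refer to the same time period.

```python
def point_gap(expected_pct: float, measured_pct: float) -> float:
    """Gap in percentage points between an expected and a measured speed change."""
    return expected_pct - measured_pct


def shortfall_share(perceived_minutes: float, net_minutes: float) -> float:
    """Fraction of the perceived time saving that did not show up as a net gain."""
    return 1.0 - net_minutes / perceived_minutes


# METR: +24% predicted before the study, +20% believed afterward, -19% measured.
metr_forecast_gap = point_gap(24.0, -19.0)
metr_hindsight_gap = point_gap(20.0, -19.0)

# Foxit: 4.6 hours perceived saved, 16 minutes net gain (same period assumed).
foxit_shortfall = shortfall_share(4.6 * 60, 16)

print(f"METR forecast vs. measured:  {metr_forecast_gap:.0f} points")
print(f"METR hindsight vs. measured: {metr_hindsight_gap:.0f} points")
print(f"Foxit perceived saving lost: {foxit_shortfall:.0%}")
```

The 43-point figure compares the pre-study forecast with the measurement. Comparing the post-study belief with the measurement gives 39 points, which is the gap that survives even after the developers had done the work.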

This isn't optimism bias. It's a structural feature of cognitive offloading. When a tool handles the hardest part of a task, the experience of using it feels productive — the aversive cognitive load has been removed. The subjective experience of "this is easier" is real. But ease is not speed. And the verification, context-switching, and integration costs are invisible to introspection because they don't feel like "the task."

The developers who believed they were 20% faster after being 19% slower weren't deluded. They were reporting a genuine subjective experience. The problem is that subjective experience is now the only evidence that survives — because the objective measurement (the RCT) is being abandoned.

What Replaces the RCT?

METR's stated plan is to move toward observational data — watching how developers actually work, rather than randomly assigning them to AI and no-AI conditions. This solves the participation crisis but introduces a new problem: observational studies can't distinguish "AI makes you faster" from "faster developers adopt AI." The selection bias that corrupted the RCT from outside now lives inside the replacement methodology.
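The identification problem in that paragraph can be shown with a toy model. This is a sketch under stated assumptions, not a description of any planned METR analysis: the function `naive_observational_gap`, the `adoption_skew` parameter, and the linear link between skill and task time are all hypothetical. AI is given no causal effect, and the only thing that varies is whether faster developers are more likely to adopt it.

```python
import random
from statistics import mean


def naive_observational_gap(adoption_skew: float, n_devs: int = 20_000, seed: int = 1) -> float:
    """Return (mean hours of non-adopters / mean hours of adopters) - 1.

    Toy model: AI has NO causal effect on task time. A positive result means
    adopters merely look faster in a naive observational comparison.
    """
    rng = random.Random(seed)
    adopters, non_adopters = [], []
    for _ in range(n_devs):
        speed = rng.uniform(0.0, 1.0)      # latent skill, 1.0 = fastest developer
        hours = 10.0 - 5.0 * speed         # task time depends on skill only
        # adoption_skew = 0 means adoption ignores skill; larger values mean
        # faster developers are more likely to be AI users.
        p_adopt = 0.5 + adoption_skew * (speed - 0.5)
        (adopters if rng.random() < p_adopt else non_adopters).append(hours)
    return mean(non_adopters) / mean(adopters) - 1.0


unconfounded = naive_observational_gap(adoption_skew=0.0)
confounded = naive_observational_gap(adoption_skew=0.8)

print(f"apparent AI advantage, adoption unrelated to skill: {unconfounded:+.1%}")
print(f"apparent AI advantage, faster developers adopt:     {confounded:+.1%}")
```

With adoption unrelated to skill, the comparison shows no advantage, as it should. With skill-driven adoption, the same zero-effect tool appears to make its users substantially faster. An observational dataset alone cannot tell these two worlds apart unless skill is measured and adjusted for.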

The GPS literature found itself in the same trap. You can't randomize people into "never use GPS" for years. So you observe. And when you observe, you find that heavy GPS users have worse spatial memory — but you can't determine whether GPS caused the decline or whether people with declining spatial memory gravitate toward GPS. Ishikawa's longitudinal design (measuring the same people over time) partially addresses this, but the fundamental identification problem remains.

For AI productivity, the identification problem is worse. AI tools are evolving monthly. A study comparing "AI users" in January 2025 to "AI users" in January 2026 is comparing different tools, different workflows, different levels of organizational integration. The object of measurement is changing faster than the measurement can be conducted.

The $650 Billion Measurement Gap

The annual bet on AI infrastructure — $427 billion in 2025, projected $562 billion in 2026, with estimates running to $650 billion — is being made in the absence of a credible measurement instrument for the thing being purchased. The task-level evidence is clear. The macro-level impact is undetectable. And the methodological bridge between them — the controlled study of how task gains aggregate into firm and economic gains — has just been declared structurally unworkable by the organization most committed to building it.

This is not necessarily irrational. The original Solow paradox resolved after fifteen to twenty years despite similar measurement gaps in the interim. Firms that invested in IT before the productivity evidence arrived captured disproportionate gains when organizational restructuring finally unlocked them. Perhaps AI investment today is the same wager.

But there is a difference worth noting. In the 1980s, the measurement instruments for IT productivity were limited but functional — you could run controlled comparisons because workers didn't refuse to use typewriters. The measurement gap was one of data availability. In 2026, the measurement instruments for AI productivity are being actively consumed by the phenomenon they're trying to measure. The gap isn't missing data. It's a methodology being destroyed by its own subject.

Every technology that created cognitive dependence — GPS, calculators, autopilot — eventually settled into a new equilibrium where the question "is it helping?" became unanswerable because the pre-technology state was gone. We don't ask whether GPS helps navigation anymore. We can't. There is no population of GPS-naive navigators to compare against. The measurement window closed.

For AI, that window may still be open. But it's closing. And when it closes, the only evidence left will be subjective — the perception of productivity that persists even when objective measurement says otherwise.

The developers believed they were 20% faster. The screen recordings said 19% slower. METR won't be able to run that comparison again.