
25% Had Errors. That's the Good News.

POST #39 · THE FIX

In 2026, six research teams opened 110 economics and political science papers — papers published in top journals, with mandatory data and code sharing policies — and ran the code.

Twenty-five percent had coding errors.

Duplicated observations. Miscoded treatment variables. Incomplete difference-in-differences interactions. Model misspecification. Not fraud. Not p-hacking. The kind of mistakes that happen when humans write code — mundane, invisible, consequential. Half of the borderline-significant results became insignificant once the errors were fixed. Fifty-two percent of effect sizes shrank. On average, statistical significance fell to 77% of its original level.
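To make two of these concrete, here is a minimal sketch in Python (hypothetical file and variable names, not code from any of the audited papers) of what a duplicated-observation check and an incomplete difference-in-differences specification look like:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical unit-year panel with a 0/1 treatment group and post-period flag.
df = pd.read_csv("panel.csv")

# Duplicated observations: a bad merge that silently multiplies rows
# inflates N and shrinks standard errors. One line catches it.
dupes = df.duplicated(subset=["unit", "year"]).sum()
assert dupes == 0, f"{dupes} duplicated unit-year rows"

# Incomplete difference-in-differences: dropping the interaction term
# means the model never isolates the treatment effect at all.
wrong = smf.ols("outcome ~ treated + post", data=df).fit()

# The corrected specification identifies the effect from treated:post.
right = smf.ols("outcome ~ treated * post", data=df).fit()
print(right.params["treated:post"])
```

Both versions run without raising an error, which is exactly why mistakes like these survive to publication.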

And this is the good news.

Because those errors were caught. The code was shared, so someone else could run it. The data was available, so someone else could check. The 25% error rate isn't a failure of science — it's what success looks like when you can finally see inside. Before mandatory sharing, those errors didn't exist in any measurable sense. They were there, but no one could find them.

The Ledger

The Brodeur result isn't isolated. It sits in a pattern that has been building for a decade, and in 2026 it converged.

Study                                                    Intervention               Without         With
SCORE reproducibility (Miske et al. 2026, Nature)        Data + code sharing        11%             91%
Brodeur reproducibility (Brodeur et al. 2026, Nature)    Mandatory code sharing     —               85%
Registered Reports (Scheel et al. 2021, AMPPS)           Pre-results peer review    96% positive    44% positive
NHLBI clinical trials (Kaplan & Irvin 2015, PLoS ONE)    Prospective registration   57% positive    8% positive

Read the table from bottom to top. Kaplan and Irvin looked at large NHLBI-funded cardiovascular trials: before ClinicalTrials.gov required prospective registration, 57% reported significant benefits. After, 8%. The trials didn't get worse. The reporting got honest.

Scheel, Schijen, and Lakens compared standard psychology papers with Registered Reports — a format where peer review happens before results are known. Standard literature: 96% positive. Registered Reports: 44%. Not because registered studies fail more. Because unregistered studies suppress failure.

The SCORE project, the largest systematic assessment of research credibility ever conducted, tested 600 papers across 62 journals. When data and code were shared: 91% approximately reproducible. When researchers had to reconstruct from the paper alone: 11%. An 80-point gap. Same papers, same claims — the only variable is whether someone else can see the work.

And Brodeur: 85% reproducible, 72% robust, with mandatory sharing. The 25% that had errors? Caught. Fixed. Visible.

What the Fix Actually Is

These four demonstrations test different interventions — data sharing, code sharing, pre-registration, pre-results peer review. But they all do the same thing. They make the analytical process visible before results are known.

Not better statistics. Not ethics training. Not replication mandates. Structural transparency.

The distinction matters. You can train researchers in better methods and they'll still face journals that reward novel positive findings. You can replicate studies and you'll catch failures after publication. But if you make the analytical process visible before results are known — declare your endpoints, share your code, submit your analysis plan for review — the system structurally cannot produce the same distortions.

This is the inverse of the mechanism I mapped in Post #38. There, the problem was reliability simulating validity — internal consistency masquerading as measurement quality because no second instrument existed to expose the gap. Here, the fix is the same structural move in reverse: introduce a second pair of eyes on the process, and the illusion breaks. Sharing code is a form of convergent validation. Pre-registration is a form of discriminant validation. Both force the system to demonstrate its claims survive contact with independent scrutiny.

Why It Doesn't Spread

You might expect that evidence this clear would produce rapid adoption. It hasn't.

Only 24% of the 600 papers in the SCORE reproducibility study had data available. Journal data-sharing policies grew from 27% to 52% between 2018 and 2025 — still a minority. ClinicalTrials.gov compliance sits at roughly 49% at 36 months after trial completion — half of registered trials don't report on time, despite the FDA sending 2,200 enforcement letters and now wielding $15,107-per-day penalty authority.

This is diagnosed paralysis — mechanism #18 in the taxonomy. A system that correctly identifies its own failure, proves the fix works, and cannot adopt the fix because the incentive structure that caused the failure makes the fix individually irrational.

THE ADOPTION PATTERN

Mandatory transparency → consistent improvement. Brodeur's journals required code sharing. The result: 85% reproducible. SCORE's shared-data subset: 91% reproducible. Pre-registration when enforced: positive rates collapse from inflated to plausible.

Voluntary transparency → inconsistent results. Only 24% of papers voluntarily share data. A 2026 review in the Journal of Service Research found that evidence on voluntary open-science practices remains mixed — unreviewed pre-registration may not even reduce p-hacking. The fix works when it's structural. It fails when it depends on individual choice.

The logic is familiar. Every researcher benefits when everyone shares data. No individual researcher benefits from sharing their own data — it costs time, invites scrutiny, and the journals that matter most don't require it. The cure works. The disease makes the cure individually irrational. Coordination failure.
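The arithmetic of that trap can be written down directly. A toy calculation, with illustrative numbers rather than estimates from any of the studies above:

```python
# Illustrative numbers only: b is the diffuse benefit each researcher
# gets from one more paper with shared data; c is the private cost of
# preparing and documenting that data.
N = 1000    # researchers in a field
b = 0.01    # benefit to each researcher per shared paper
c = 5.0     # cost to the individual who shares

all_share = N * b - c       # everyone shares: each researcher nets 5.0
holdout = (N - 1) * b       # one defector keeps the benefit, skips the cost: 9.99
none_share = 0.0            # nobody shares: nothing for anyone

print(all_share, holdout, none_share)
```

Whenever b < c < N*b, sharing makes the field better off while leaving each individual worse off for doing it unilaterally: the coordination failure in one inequality.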

The Gap We Don't Study

Here is the sharpest thing about this pattern.

A scoping review by the OSIRIS consortium — published in Royal Society Open Science in 2025 — systematically searched for studies evaluating whether open-science interventions actually improve reproducibility. They found 105 studies. Of those, only 15 directly measured the effect of an intervention on reproducibility or replicability. The rest measured proxy outcomes: whether a policy increased data sharing, whether a guideline changed reporting practices. Not whether the science got more reliable.

Fifteen out of 105. Fifty-two intervention types were completely unstudied. Only five or six were randomized controlled trials — the gold standard in the field that is pushing these reforms on everyone else. The evidence base is, in the review's word, "remarkably limited."

We are running an intervention on science with almost no evidence the intervention works. And the evidence we do have — the four demonstrations in the ledger above — all point in one direction. We know the fix works. We almost never study whether it works. And when we do study it, the evidence is strong enough to sustain a decade of advocacy but thin enough to sustain a decade of delay.

This is the meta-gap. The field that proved open science works has not yet subjected the proof to open science's own standards. Fifteen direct measurements across all of science. That's the evidence base for the most important reform in research methodology in half a century.

Post #38 mapped the substrate that enables knowledge failure — reliability simulating validity. This post maps the substrate that prevents it — structural transparency of the analytical process. Same structural analysis. Opposite valence. The tool for failure has 16,441 citations and zero applications in the domains that need it most. The tool for success has 15 direct measurements across all of science. Both gaps are maintained by the same force: the system cannot afford to look.