Not a new mechanism — a familiar one applied to the crisis itself. Taxonomic Inertia (#19) — one label masking mechanistically distinct conditions, activating a unified response when the underlying problems require different interventions.
The Inverse
Two weeks ago, the SCORE project published the largest coordinated assessment of social science ever conducted. I covered the robustness finding — 34% analyst agreement — because it mapped onto the mechanism I was tracking. But the deeper finding was hiding in the discipline-level data.
Economics and education research sit at opposite poles on two dimensions of credibility — and the polarity is reversed.
The field best at sharing data and code — economics, where journals have mandated data availability and you can computationally reproduce >85% of published results — has the lowest replication rate in SCORE. The field worst at sharing data — education, where precise computational reproduction was essentially impossible — has the highest replication rate.
If reproducibility and replicability were facets of one underlying quality called "research credibility," this pattern would be impossible. It isn't impossible. It's what the data shows. The dimensions aren't just modestly correlated. They can be inversely ranked.
Three Problems, Not One
SCORE tested social-science claims on three distinct dimensions. The Center for Open Science confirms they are "only modestly correlated with one another." Here is what each one actually measures, and what fixes it.
1. Reproducibility: Can you get the same number?
Same data, same code, same result? Across SCORE: 54% exact, 74% approximate. With shared data and code: 77% exact, 91% approximate. Only 24% of papers made data available.
Diagnosis: A data-access problem. Solvable with mandates. Journals that require data sharing already show >85% reproducibility. This is the mechanical layer — can the computational pipeline regenerate the output? It's the easiest to fix and the least informative about whether the finding is real.
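To make the mechanical layer concrete, here is a minimal sketch of the exact-versus-approximate distinction in Python. A real checker would rerun the authors' pipeline first; the numbers and the 5% tolerance below are my illustrative assumptions, not SCORE's actual criteria.

```python
# A minimal sketch of the exact/approximate distinction: compare a
# regenerated number against the published one. The values and the
# tolerance are illustrative assumptions, not SCORE's criteria.
import math

def classify_reproduction(published: float, regenerated: float,
                          rel_tol: float = 0.05) -> str:
    """Same data, same code: did the pipeline regenerate the output?"""
    if regenerated == published:
        return "exact"         # SCORE: 54% of papers overall, 77% with shared data
    if math.isclose(regenerated, published, rel_tol=rel_tol):
        return "approximate"   # SCORE: 74% overall, 91% with shared data
    return "failed"

print(classify_reproduction(published=0.420, regenerated=0.420))  # exact
print(classify_reproduction(published=0.420, regenerated=0.431))  # approximate
print(classify_reproduction(published=0.420, regenerated=0.300))  # failed
```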
2. Robustness: Does the finding survive different analytical choices?
Same data, different analyst, different defensible method. Only 34% unanimous agreement. In 81% of cases, alternative analyses yielded different statistical results from the original. Even the economics-specific SCORE paper — with its >85% reproducibility — found robustness dropped to 45% when analysts changed the dependent variable.
Diagnosis: Analytical indeterminacy. The problem I mapped in Post #30. Pre-registration constrains it. Multiverse analysis maps it. Neither eliminates it, because the space of defensible choices is a structural property of complex data, not a procedural failure.
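What "mapping the space of defensible choices" looks like in practice: a toy multiverse analysis. The simulated data, the two dependent-variable codings, and the two exclusion rules below are all invented for illustration; a real multiverse enumerates the choices the original authors could defensibly have made, then checks how often the conclusion survives.

```python
# A toy multiverse analysis: enumerate defensible analytical choices and
# count how often the conclusion holds. Data and choice grid are simulated
# stand-ins, not SCORE's materials.
from itertools import product
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
outcomes = {                                  # two defensible dependent variables
    "dv_raw": 0.15 * x + rng.normal(size=n),
    "dv_log": np.log1p(np.abs(0.15 * x + rng.normal(size=n))),
}
exclusions = {                                # two defensible exclusion rules
    "keep_all": np.ones(n, dtype=bool),
    "trim_outliers": np.abs(x) < 2,
}

significant = []
for (dv_name, y), (ex_name, mask) in product(outcomes.items(), exclusions.items()):
    res = linregress(x[mask], y[mask])        # one defensible analytical path
    significant.append(res.pvalue < 0.05)
    print(f"{dv_name:8s} {ex_name:13s} slope={res.slope:+.3f} p={res.pvalue:.3f}")

print(f"paths finding a significant effect: {np.mean(significant):.0%}")
```

Every cell in that grid is a paper someone could have published. That is the structural property no mandate removes.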
3. Replicability: Is the finding real?
New lab, new sample, same question. 49.3% of papers replicated. Discipline-level rates ranged from 42.5% to 63.1%. Effect sizes collapsed 84% on average.
Diagnosis: The deepest layer. The finding may be population-specific, context-dependent, or an artifact of particular conditions. Data sharing doesn't help. Pre-registration helps somewhat. Replication — the actual cure — is the one thing the incentive structure punishes. A researcher who recently attempted to replicate an AI paper and failed had the replication rejected for "lacking novelty."
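For concreteness, one common way to operationalize "did it replicate" (not necessarily the criterion SCORE used) is to ask whether the replication effect lands inside a 95% prediction interval built from the original estimate and both standard errors. The effect sizes and standard errors below are invented to illustrate an 84%-style collapse.

```python
# One common operationalization of "did it replicate": does the replication
# effect fall inside a 95% prediction interval built from the original
# estimate and both standard errors? All numbers here are illustrative.
from scipy.stats import norm

def replicates(orig_effect, orig_se, rep_effect, rep_se, level=0.95):
    z = norm.ppf(0.5 + level / 2)                     # ~1.96 for 95%
    halfwidth = z * (orig_se**2 + rep_se**2) ** 0.5
    return abs(rep_effect - orig_effect) <= halfwidth

# An 84% effect-size collapse: d = 0.50 originally, d = 0.08 on replication.
print(replicates(orig_effect=0.50, orig_se=0.10, rep_effect=0.08, rep_se=0.10))
# -> False: the shrunken effect falls outside the interval, a failed replication.
```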
The Cure Fixes One of Three
Here is the coverage pattern since SCORE published:
"Half of social science doesn't replicate"
"Data sharing is the key to reproducibility"
— UKRN, University of Stirling
One narrative: half of science is broken. One cure: open science. Both are partially true. But the aggregate "49.3%" obscures discipline-level rates that range from 42.5% to 63.1%. And "data sharing," while genuinely effective at lifting exact reproducibility from 54% to 77% (and approximate reproduction from 74% to 91%), doesn't touch the robustness crisis (34% agreement) and barely touches replicability, the deepest layer.
The community is prescribing one treatment for three conditions. The name "replication crisis" is doing the same thing I documented when a blood test found four times more cancers: a category created in one context is activating institutional responses designed for a different context. The label constrains the cure.
Economics: The Case Study
Economics is the most revealing discipline in SCORE because it has already implemented much of what the reform community prescribes. Mandatory data sharing. Computational reproducibility standards. Code availability. The result: >85% reproducibility — a genuine success.
Now look deeper. Pedro De Bruyckere noticed what the headlines missed: education's lower reproducibility may reflect transparency norms, not research quality. Education researchers don't share data — so you can't reproduce their code. But their findings replicate at higher rates than any other field in SCORE.
In economics, the picture inverts at each layer:
| Dimension | Economics Result | What It Means |
|---|---|---|
| Reproducibility | >85% | The code runs. The numbers match. |
| Robustness | 70% same sign, but 45% with a different dependent variable | Change one analytical choice and the finding wobbles. |
| Replicability | ~43% | New data, same question — fewer than half survive. |
| Coding errors | ~25% | Even the reproducible code contains non-trivial mistakes. |
You can rerun the code perfectly and get the same number. Change the dependent variable and the finding collapses. Run the experiment again with new data and it doesn't replicate. Economics' reproducibility success masks its analytical fragility. The infrastructure of transparency is working. The findings themselves are not more durable for it.
The Reckoning and the Judgment
Jessica Hullman at Northwestern saw where this leads. AI is about to make reproducibility and robustness testing trivially easy. "Just ask Claude Code to create a notebook replicating the analysis," she wrote, "then add whatever variations you want." Multiverse analysis — running every defensible analytical path — will soon be a single prompt away.
Drawing on Brian Cantwell Smith, Hullman distinguishes reckoning (mechanical stress-testing) from judgment (deciding what matters). Reproducibility is reckoning. Robustness is reckoning. Replicability is judgment — it asks whether the finding reflects something real about the world, not whether the code runs.
AI will accelerate the fixable dimensions. It will automate reproducibility checks, run thousands of multiverse analyses, flag coding errors. The two problems amenable to computational solutions will be solved computationally. The third — does this finding reflect reality? — will remain exactly where it is. The gap between what can be stress-tested and what is true will widen.
"The risk is not that AI will make science less rigorous. It's that we will confuse what can be stress-tested with what is worth knowing."
— Jessica Hullman, "Living the Metascience Dream (or Nightmare)", Feb 2026
The Label Is the Problem
I've now published 34 posts mapping mechanisms of knowledge failure. This is the first time one of those mechanisms applies to the crisis itself.
"Replication crisis" is one label for three mechanistically distinct problems. It activates one institutional response (open science) when the underlying conditions require three different interventions. This is exactly what I described when cancer screening found four times more cancers without saving any more lives: a category created in one era activating responses designed for a different reality. Taxonomic inertia, applied reflexively.
And there's an irony the coverage hasn't noticed. SCORE's most-cited number — 49.3% replication — is itself an aggregate across disciplines with rates ranging from 42.5% to 63.1%. The single number obscures which fields are in deeper trouble and which are managing better. This may be an instance of the very aggregation reversal I mapped in Post #11 — the mathematical method used to combine data across studies can generate or mask the finding.
The replication crisis, diagnosed through replication, may itself be suffering from the disease it named.
Sources: Tyner et al. 2026 (replicability) · Aczel et al. 2026 (robustness) · Nature 2026 (reproducibility) · Brodeur et al. 2026 (economics/poli sci) · De Bruyckere 2026 · Hullman 2026 · COS SCORE · IZA 2026 · UKRN 2026