
Five Analysts Looked at the Same Data. They Agreed One-Third of the Time.

Mechanism #27 in a taxonomy of knowledge failure: Analytical Indeterminacy — the space of defensible analytical choices is so vast that the analyst determines the finding more than the data does.

The Question Nobody Asked

What if the person analyzing the data matters more than the data itself?

Not because they're biased. Not because they're incompetent. Because the number of defensible ways to analyze a complex dataset is so large that choosing among them — which covariates, which exclusions, which model, which robustness check — shapes the result more than the underlying phenomenon does.

This week, the largest coordinated test of social science ever conducted proved exactly that. The SCORE project — seven years, 865 researchers, 3,900 papers sampled from 62 journals — published its results in Nature. The headlines focused on the replication rate. The real finding is deeper.

Three Dimensions of Credibility, Three Different Answers

SCORE tested social-science claims on three axes. Each tells a different story.

Replicability: Can independent labs, running new experiments, reproduce the same finding? Result: 49.3% of papers replicated. Roughly half. Consistent with prior large-scale replications.

Reproducibility: Can you get the same numbers from the same data and code? Result: only 24% of papers made their data available. Of the papers that could be tested, 54% reproduced precisely and 74% approximately.

Robustness: What happens when multiple analysts address the same question with the same data, using their own judgment? Aczel and 490 co-authors tested 100 papers. Five or more independent analysts per paper, same data, same research question, free to choose their own analytical approach.

Result: 34% unanimous agreement. In 81% of cases, analysts obtained different statistical results from the original. Even with a tolerance band four times wider than standard, only 57% converged.
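A hypothetical sketch of what "unanimous agreement" and "convergence within a tolerance band" could look like as computations. Aczel et al.'s actual scoring criteria are more involved; every name and number below is illustrative, not taken from the study.

```python
# Hypothetical sketch of how analyst agreement might be scored for one paper.
# All names and numbers are illustrative, not from Aczel et al.

def unanimous_conclusion(z_scores, threshold=1.96):
    """True if every analyst reaches the same call:
    significant-positive, significant-negative, or null."""
    calls = {"+" if z > threshold else "-" if z < -threshold else "0"
             for z in z_scores}
    return len(calls) == 1

def converges(estimates, original, tolerance):
    """True if every analyst's estimate lands within +/- tolerance
    of the original paper's estimate."""
    return all(abs(e - original) <= tolerance for e in estimates)

analyst_z = [2.3, 2.8, 1.1, 2.5, 3.0]             # five analysts, same data
analyst_effects = [0.21, 0.26, 0.08, 0.24, 0.29]  # their effect estimates

print(unanimous_conclusion(analyst_z))                       # False: one null result
print(converges(analyst_effects, 0.25, tolerance=0.05))      # False at standard band
print(converges(analyst_effects, 0.25, tolerance=4 * 0.05))  # True at 4x wider band
```

One dissenting analyst is enough to break unanimity, which is why widening the tolerance band moves the needle so much.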

The Headline

49.3% of papers replicated.

What "Replicated" Means

Original effect size: r = 0.25 (explained variance: 6.25%). Replication effect size: r = 0.10 (explained variance: 1.0%). Collapse in explanatory power: 84%.

Half of studies “replicate” — but the surviving effects explain 1% of variance, down from 6.25%. A finding that accounts for one percent of what’s happening is statistically significant and practically empty. The word “replicated” is doing enormous work to obscure that collapse.
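The arithmetic behind that collapse is just the squared correlation: an effect of size r explains r² of the variance. A minimal check of the figures above:

```python
# Explained variance is the square of the correlation coefficient.
r_original, r_replication = 0.25, 0.10

var_original = r_original ** 2        # 0.0625 -> 6.25% of variance explained
var_replication = r_replication ** 2  # 0.0100 -> 1.0%
collapse = 1 - var_replication / var_original  # 0.84 -> 84% reduction

print(f"{var_original:.2%} -> {var_replication:.2%} "
      f"(an {collapse:.0%} collapse in explanatory power)")
```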

The Third Proof at Scale

This isn’t a surprise. It’s the third large-scale demonstration of the same structural finding, and each demonstration has come at greater scale than the last.

| Study | Scale | Finding |
| --- | --- | --- |
| Silberzahn et al. 2018 | 29 teams, 61 analysts | Same soccer dataset → 69% found a significant effect, 31% did not |
| Breznau et al. 2022 | 73 teams, 161 researchers | “Hidden universe of uncertainty” — >95% of variance in results unexplained by expertise or beliefs |
| Aczel et al. 2026 | 490 co-authors, 100 papers | 34% unanimous agreement on the same data and question |
| Software engineering multiverse (empirical SE, 2025) | 3,072 analytical pipelines | <0.2% of pipelines reproduced published results |

Each study at larger scale. Each confirming the same structural finding. The press treats SCORE as a standalone discovery. It’s not. It’s the third proof in the social sciences alone — and the first at a scale large enough to be undeniable.

What the Commentary Recommends

The response has been measured and constructive. Brian Nosek, architect of the replication movement: “There is a hidden uncertainty in papers that is not recognized in the way it ought to be.” The fix, per Nosek and others: share data, pre-register hypotheses, run multiverse analyses showing results across all defensible analytical paths.

And there’s reason for optimism. Brodeur and colleagues tested 110 papers from journals that mandate data and code sharing. Result: 85% computationally reproducible, 72% of significant estimates survived robustness checks. Mandatory sharing works — mechanically.

But look closer at even Brodeur’s optimistic sample: 25% contained non-trivial coding errors. 28% failed robustness checks. The mechanical fix — can I get the same number from the same code? — works. The structural question — does this number mean what we think? — barely improves.

The Fix Proves the Diagnosis

Here is the point nobody in the commentary is making.

Multiverse analysis — the leading proposed solution — works by running all defensible analytical paths on a dataset and reporting the full distribution of results. This is genuinely useful. It replaces false certainty with honest uncertainty. But it is also an admission.

If the problem were procedural — sloppy coding, poor documentation, unclear methods — you could fix it with better procedures. Pre-registration, code sharing, and standardized protocols would converge the results. Brodeur’s data shows they partly do.

But multiverse analysis exists because even with perfect transparency, the results still diverge. The space of defensible analytical choices for any complex dataset is not a narrow corridor with one correct path. It’s a garden with hundreds of forking paths, to borrow Gelman and Loken’s phrase, and many of them lead to different conclusions.

The fact that you need to run all paths reveals that no single path is correct. The fix doesn’t solve the problem. It maps its contours.
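For concreteness, here is a minimal multiverse sketch with synthetic data and three forking decisions (covariates, exclusions, outcome transform). Everything here is a hypothetical illustration of the technique, not SCORE’s or any study’s actual pipeline:

```python
# Toy multiverse analysis: enumerate every combination of defensible
# analytical choices, estimate the effect under each, and report the
# distribution of results instead of a single number.
import itertools
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({"x": rng.normal(size=n), "age": rng.normal(40, 10, size=n)})
df["y"] = 0.10 * df["x"] + 0.02 * df["age"] + rng.normal(size=n)

def effect_of_x(d, covariates):
    """OLS coefficient on x, controlling for the chosen covariates."""
    X = np.column_stack([np.ones(len(d)), d["x"], *(d[c] for c in covariates)])
    beta, *_ = np.linalg.lstsq(X, d["y_t"].to_numpy(), rcond=None)
    return float(beta[1])

# Three forking decisions, two defensible options each -> 8 pipelines.
covariate_sets = [[], ["age"]]                          # control for age?
exclusions = [lambda d: d, lambda d: d[d["age"] < 60]]  # drop older subjects?
transforms = [lambda s: s,                              # raw outcome, or
              lambda s: s.clip(s.quantile(0.05), s.quantile(0.95))]  # winsorized

estimates = []
for covs, exclude, transform in itertools.product(
        covariate_sets, exclusions, transforms):
    d = exclude(df).copy()
    d["y_t"] = transform(d["y"])
    estimates.append(effect_of_x(d, covs))

print(f"{len(estimates)} defensible pipelines; "
      f"effect on x ranges from {min(estimates):.3f} to {max(estimates):.3f}")
```

The point of the exercise is the range, not any single estimate: three binary choices already produce eight pipelines, and real analyses fork far more often than that.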

Reproducibility ≠ robustness ≠ truth. You can reproduce a number perfectly (same data, same code, same output) and have it mean nothing, because a different defensible analysis of the same data yields a different answer. SCORE proved all three things at once: moderate replication, moderate reproduction, and widespread analytical divergence.

The Weaponization

There’s a darker thread. The replication crisis has already become a political instrument. The Trump administration has cited scientific uncertainty to justify sweeping funding cuts — NSF by 57%, NIH by 40%, CDC by 53%. DOGE has decimated agency staffing. The message: if science can’t even replicate itself, why fund it?

This is mechanism #16 — epistemic sabotage — in real time. The genuine finding (analytical indeterminacy is real) becomes the weapon (therefore science is unreliable, therefore cut funding). The uncertainty is real. The instrumentalization is cynical. Both things are true simultaneously.

Meanwhile, on the ground level: an AI researcher recently attempted to replicate an AI paper, failed to match the results, and had the replication paper rejected for “lacking novelty.” The system that discovered its own failure still punishes those who document it. That’s mechanism #18 — diagnosed paralysis.

The Garden Has Too Many Paths

“The analyst determines the finding more than the data does” is not a temporary problem that better procedures will eliminate. It is a structural property of the relationship between complex data and the questions we ask of it. The garden has too many paths. Running all of them is genuine progress — multiverse analysis makes the hidden uncertainty visible. But pretending that one path was ever the correct one is the error SCORE has now proven, at scale, for the third time.

The fix works by admitting what it cannot fix. That admission is the finding.


Sources: Tyner et al. 2026, Nature (replicability) · Aczel et al. 2026, Nature (robustness) · Brodeur et al. 2026, Nature (reproducibility) · Nosek 2026, Nature (commentary) · Nature News · Science/AAAS · Silberzahn et al. 2018 · Breznau et al. 2022, PNAS