HEART ATTACK RISK
(SLIGHTLY PROTECTIVE)
Same 42 clinical trials. Same patients. Same drug. Same outcomes. Two mathematically valid ways to count them. One says the drug is killing people. The other says it might be protecting them.
The drug is rosiglitazone — sold as Avandia by GlaxoSmithKline. At its peak, it was a $3.3 billion-per-year diabetes medication taken by millions. What happened to it is a case study in how the gold standard of medical evidence — the meta-analysis — can produce opposite conclusions from identical data, depending on choices that most readers never see.
How One Meta-Analysis Killed a Drug
In May 2007, Nissen and Wolski published in the New England Journal of Medicine a meta-analysis pooling 42 GlaxoSmithKline trials. Their method: calculate an odds ratio for each trial separately, then combine using a standard fixed-effects model. Result: OR 1.43 (95% CI 1.03–1.98) for myocardial infarction. Statistically significant. The drug, the analysis said, raised the odds of a heart attack by 43%.
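For readers who want the mechanics, here is a minimal sketch of the per-trial-then-pool approach, using generic inverse-variance weighting of log odds ratios (Nissen and Wolski's actual method was the related Peto approach, and the trial counts below are invented for illustration, not the Avandia data):

```python
import math

# Hypothetical per-trial counts: (events_trt, n_trt, events_ctl, n_ctl).
# Invented numbers for illustration; these are not the Avandia trials.
trials = [
    (2, 357, 1, 176),
    (1, 391, 2, 207),
    (5, 774, 2, 185),
]

num = den = 0.0
for a, n1, c, n0 in trials:
    b, d = n1 - a, n0 - c                  # non-events in each arm
    log_or = math.log((a * d) / (b * c))   # per-trial log odds ratio
    w = 1 / (1/a + 1/b + 1/c + 1/d)        # inverse-variance weight
    num += w * log_or
    den += w

mean = num / den
se = math.sqrt(1 / den)
print(f"pooled OR = {math.exp(mean):.2f}, "
      f"95% CI = ({math.exp(mean - 1.96*se):.2f}, {math.exp(mean + 1.96*se):.2f})")
```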
The effect was immediate. The FDA slapped a black box warning on Avandia. Prescriptions cratered from 39% of thiazolidinedione market share to 8%. Europe banned it entirely. Thirteen thousand lawsuits followed. Eleven thousand five hundred settlements. A $3.3 billion drug was, in practical terms, dead.
One year later, Rücker and Schumacher published in BMC Medical Research Methodology what should have been a bombshell of its own. They took the exact same data from the Nissen meta-analysis and simply pooled it differently — combining all treatment-arm patients into one group and all control-arm patients into another, then computing a single odds ratio. This is called naive pooling, and while it has known limitations (it ignores the trial structure), it is not mathematically wrong. The result: OR 0.94. Not only not dangerous — slightly protective. Not only not significant — p = 0.71.
The reversal is an instance of Simpson’s paradox: a trend that appears in each of several groups can reverse when the groups are combined. In rosiglitazone’s case, imbalances in treatment-arm sizes across trials acted as a confounder. The standard meta-analytic method gave each trial equal structural weight; the naive method gave each patient equal weight. Two legitimate frameworks, two opposite conclusions.
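A toy example makes the reversal concrete. The counts below are invented: within each trial the treatment looks harmful (OR above 1), but because the treatment arm dominates the low-risk trial and the control arm dominates the high-risk trial, naive pooling flips the sign:

```python
def odds_ratio(a, n1, c, n0):
    """Odds ratio from events/arm-size in treatment (a, n1) and control (c, n0)."""
    b, d = n1 - a, n0 - c
    return (a * d) / (b * c)

# Invented counts, built to show the reversal; not the rosiglitazone data.
# Trial A: low-risk population, treatment arm 10x the control arm.
# Trial B: high-risk population, control arm 10x the treatment arm.
trial_a = (20, 2000, 1, 200)
trial_b = (8, 200, 60, 2000)

print(odds_ratio(*trial_a))  # ~2.01: harm within trial A
print(odds_ratio(*trial_b))  # ~1.35: harm within trial B

# Naive pooling: collapse both trials into a single 2x2 table.
a  = trial_a[0] + trial_b[0]
n1 = trial_a[1] + trial_b[1]
c  = trial_a[2] + trial_b[2]
n0 = trial_a[3] + trial_b[3]
print(odds_ratio(a, n1, c, n0))  # ~0.45: "protection" once pooled
```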
Then It Got Worse
In 2020, the story added a third truth. Veroniki et al. published in the BMJ an individual patient data (IPD) meta-analysis — using raw patient records from GSK for 33 trials and 21,156 patients, rather than just the published summary statistics Nissen had worked from. This is supposed to be the best possible meta-analysis. What they found:
| Endpoint | Odds Ratio | 95% CI | Verdict |
|---|---|---|---|
| Composite cardiovascular | 1.33 | 1.09–1.61 | Significant — increased risk |
| Myocardial infarction | 1.25 | 0.99–1.60 | Borderline — CI crosses 1.0 |
| Heart failure | 1.60 | 1.20–2.14 | Significant — real signal |
| Cardiovascular death | 1.18 | 0.64–2.17 | Not significant |
The IPD analysis — using the actual patient records — found more heart attacks and fewer cardiovascular deaths than Nissen’s summary-level data had reported. The MI signal was borderline, not clearly significant. Heart failure was the real signal all along. Cardiovascular death was a non-finding.
And a fourth truth had arrived even before the third. In 2013, the FDA lifted all restrictions on rosiglitazone, concluding that "recent data do not show an increased risk of heart attack." By 2015, the REMS requirements were eliminated entirely. The drug was, officially, exonerated.
It never recovered. Prescriptions remained below 1% of market share. Europe never reinstated it. The meta-analysis had done its work.
Four Truths, One Drug
Summarizing the landscape of what “the evidence” said about rosiglitazone and heart attacks, depending on when you looked and how you counted:

| Year | Analysis | Verdict on MI risk |
|---|---|---|
| 2007 | Nissen & Wolski (summary data, fixed effects) | OR 1.43, significant increased risk |
| 2008 | Rücker & Schumacher (same 42 trials, naive pooling) | OR 0.94, slightly protective, p = 0.71 |
| 2013 | FDA re-review | No increased risk; restrictions lifted |
| 2020 | Veroniki et al. (individual patient data) | OR 1.25, borderline; heart failure the real signal |
Four different “truths” about the same drug. Not because the science was fraudulent (though GSK did plead guilty and pay a $3 billion fine for withholding cardiovascular safety data between 2001 and 2007 — meaning the data feeding into the meta-analyses was incomplete). Not because anyone made an arithmetic error. But because the method of aggregation, the level of data access, the choice of endpoint, and the statistical framework each generated a different answer.
The Fragility Underneath
There’s another layer. Nissen’s OR 1.43 — the number that triggered black box warnings, a European ban, and 13,000 lawsuits — was fragile. As a 2025 review in PMC documented: removing just two large studies from the 42-trial pool made the finding non-significant. The entire cascade — the warnings, the ban, the billions in settlements, the drug’s effective death — depended on the inclusion of two studies.
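A fragility check of this kind is mechanical to run if you have the per-trial counts. A sketch, again using plain inverse-variance pooling rather than Nissen's exact Peto method, and assuming no zero-count cells:

```python
import math
from itertools import combinations

def pooled_ci(trials):
    """Fixed-effects inverse-variance pool of log odds ratios -> (OR, lo, hi)."""
    num = den = 0.0
    for a, n1, c, n0 in trials:
        b, d = n1 - a, n0 - c
        log_or = math.log((a * d) / (b * c))
        w = 1 / (1/a + 1/b + 1/c + 1/d)
        num += w * log_or
        den += w
    m, se = num / den, math.sqrt(1 / den)
    return math.exp(m), math.exp(m - 1.96 * se), math.exp(m + 1.96 * se)

def fragile_pairs(trials):
    """Indices of trial pairs whose joint removal makes the 95% CI cross 1."""
    flips = []
    for i, j in combinations(range(len(trials)), 2):
        rest = [t for k, t in enumerate(trials) if k not in (i, j)]
        _, lo, hi = pooled_ci(rest)
        if lo <= 1.0 <= hi:
            flips.append((i, j))
    return flips
```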
This is not unusual. Nissen’s own method required a choice about zero-event studies — trials where nobody in either arm had a heart attack. He excluded them. Other reasonable methodological approaches include them with continuity corrections, which reduces the effect. The authors of a 2009 sensitivity analysis wrote: “Alternative reasonable methodological approaches could yield increased or decreased risks that were either statistically significant or not.”
“Alternative reasonable methodological approaches could yield increased or decreased risks that were either statistically significant or not.”
— Diamond et al., BMC Research Notes, 2009
Read that again. Reasonable choices. Valid methods. And they produce both “significant risk” and “no risk” from the same data. The gold standard of evidence turns out to have researcher degrees of freedom baked into its most trusted operation.
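To see how much the zero-event choice alone can matter, here is a sketch comparing the two policies on invented counts: dropping zero-event trials versus including them with the common 0.5 continuity correction (one of several corrections in use):

```python
import math

def pool(trials, zero_event="exclude", cc=0.5):
    """Pooled OR under two zero-event policies: 'exclude' or 'correct'.

    trials: (events_trt, n_trt, events_ctl, n_ctl) tuples. Invented data only.
    """
    num = den = 0.0
    for a, n1, c, n0 in trials:
        b, d = n1 - a, n0 - c
        if a == 0 and c == 0:
            if zero_event == "exclude":
                continue                                  # drop the trial
            a, b, c, d = (x + cc for x in (a, b, c, d))   # continuity correction
        log_or = math.log((a * d) / (b * c))
        w = 1 / (1/a + 1/b + 1/c + 1/d)
        num += w * log_or
        den += w
    return math.exp(num / den)

# One balanced zero-event trial among three invented trials.
trials = [(3, 400, 1, 200), (0, 300, 0, 300), (2, 500, 1, 250)]
print(pool(trials, "exclude"))  # ~1.24: zero-event trial ignored
print(pool(trials, "correct"))  # ~1.20: corrected trial (OR = 1) pulls toward null
```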
This Isn’t Just Rosiglitazone
If this were one drug, one time, you could call it an edge case. It isn’t.
Berenfeld et al. (2025) applied both classical and causal meta-analytic frameworks to 597 published meta-analyses. For most, the methods agreed. But for analyses using non-linear effect measures — odds ratios and risk ratios rather than risk differences — they found cases where classical meta-analysis declared a significant treatment effect and causal meta-analysis found no definitive conclusion. Their case study: drug-eluting vs. bare-metal stents in acute coronary syndrome. Classical approach: significant. Causal approach: inconclusive. The choice of meta-analytic framework determined whether you’d recommend one stent over another.
Bakbergenuly, Hoaglin, and Kulinskaya (2019) showed that the risk ratio — one of the most commonly used effect measures — has mathematical properties that make it diverge from odds ratios when event rates aren’t small. The conventional inverse-variance-weighted approach introduces biases. And then there’s the event definition problem: a screening meta-analysis by Plumb et al. found that measuring the risk of attendance (RR 1.26, p=0.07, not significant) versus the risk of non-attendance (RR 0.92, p=0.01, significant) reversed the statistical conclusion. Same data. Same correct analysis. Different choice of what counts as the “event.”
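Both pitfalls are visible in a few lines of arithmetic. The rates below are invented (the attendance pair is loosely patterned on the Plumb et al. situation, not their actual data):

```python
# Invented rates chosen to show both pitfalls.

# 1) RR and OR diverge once events are common.
p_trt, p_ctl = 0.60, 0.40
rr = p_trt / p_ctl                                          # 1.50
odds_ratio = (p_trt / (1 - p_trt)) / (p_ctl / (1 - p_ctl))  # 2.25
print(rr, odds_ratio)   # same comparison, two very different numbers

# 2) The "event" choice: attendance vs non-attendance.
att_trt, att_ctl = 0.90, 0.80
rr_attend = att_trt / att_ctl                    # 1.125
rr_non_attend = (1 - att_trt) / (1 - att_ctl)    # 0.50
print(rr_attend, rr_non_attend)
# A 12.5% relative rise in attendance is simultaneously a 50% relative
# fall in non-attendance: same data, different headline effect size.
```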
In a systematic sample, 9 of 157 meta-analyses showed effect reversion — the conclusion flipped depending on the method. That’s roughly 1 in 17. Not a rounding error. Not a curiosity. A structural feature of the method.
And in 2023, a paper in PMC went further, mathematically proving that the mainstream random-effects methods used in most meta-analyses are fundamentally flawed, capable of producing “potentially harmful public health policy recommendations.” The authors argued that future meta-analyses “should never employ mainstream methods.”
The Hierarchy with a Crack at Its Apex
Evidence-based medicine places meta-analysis at the apex of the evidence hierarchy. This is the tool that is supposed to resolve conflicting individual studies. When randomized controlled trials disagree, we pool them. When the pool speaks, that’s the closest thing medicine has to a final answer.
But “the pool” is not one thing. It’s the output of a chain of choices:

- Which studies to include.
- How to handle zero-event trials.
- Which effect measure to use (odds ratio, risk ratio, risk difference).
- Which meta-analytic model (fixed effects, random effects, causal).
- Which endpoint to designate as the “event.”
- Whether to use summary data or individual patient data.
- Which random-effects estimator (DerSimonian-Laird, REML, Paule-Mandel, Knapp-Hartung).
Each choice is defensible. Each can change the answer. When the answer at the top of the hierarchy depends on which defensible choices a researcher made, the hierarchy has a structural flaw at its apex.
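As one illustration of how a single choice from that list moves the answer, here is a sketch of the fixed-effects versus DerSimonian-Laird random-effects computation on the same invented counts; when trials disagree, the random-effects model shifts weight toward smaller trials and changes the pooled estimate:

```python
import math

def meta(trials, model="fixed"):
    """Pooled OR under fixed effects or DerSimonian-Laird random effects."""
    ys, vs = [], []
    for a, n1, c, n0 in trials:
        b, d = n1 - a, n0 - c
        ys.append(math.log((a * d) / (b * c)))   # per-trial log odds ratio
        vs.append(1/a + 1/b + 1/c + 1/d)         # its approximate variance
    w = [1 / v for v in vs]
    ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    if model == "random":
        # DerSimonian-Laird estimate of between-trial variance tau^2
        q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, ys))
        c_ = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - (len(ys) - 1)) / c_)
        w = [1 / (v + tau2) for v in vs]         # re-weight with tau^2 added
        ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    return math.exp(ybar)

# Invented, deliberately heterogeneous trials.
trials = [(12, 300, 5, 300), (3, 250, 9, 250), (20, 800, 11, 790)]
print(meta(trials, "fixed"), meta(trials, "random"))  # two different pooled ORs
```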
What This Does Not Mean
It does not mean meta-analysis is useless. It is still the best tool available for synthesizing evidence. The alternative — reading individual studies and guessing — is worse.
It means that when someone says “the meta-analysis shows,” the right follow-up questions are: Which meta-analysis? Using which aggregation method? Which effect measure? Which endpoint? Including or excluding zero-event studies? Summary-level or individual patient data? How fragile is the result — how many studies can you remove before the finding disappears?
It means that meta-analysis is not the neutral arbiter it is presented as. It is a tool with parameters, and the parameters have consequences. The gold standard has cracks. Knowing they’re there is the beginning of reading evidence honestly.
Aggregation Reversal — Mechanism #11
The mathematical method used to combine studies generates or reverses the finding. Not the data. Not the individual trials. The counting method. When the tool at the top of the evidence hierarchy has enough researcher degrees of freedom to flip its own conclusions, the hierarchy has a fault line where it is supposed to be strongest.
Distinct from mechanism #2 (methodology creates finding at the individual study level) and mechanism #3 (what gets published). This operates at the synthesis step — the layer of evidence that exists precisely to overcome individual study limitations.
Sources: Nissen & Wolski 2007, NEJM · Rücker & Schumacher 2008, BMC Med Res Methodol · Diamond et al. 2009, BMC Res Notes · Veroniki et al. 2020, BMJ · Bakbergenuly et al. 2019, Res Synth Methods · Berenfeld et al. 2025, arXiv · PMC 2023 (random-effects critique) · PMC 2025 (pooled perspectives) · DOJ — GSK $3B settlement