Biotech’s Biggest Statistical Fallacy: Why Comparing to Historical Data Is an Investor’s Worst Trap

Historical survival data can be useful context — but it becomes dangerous when investors treat it as the control arm their trial will actually face.

Biotech investing is notoriously binary—a space where a single readout can triple a position or wipe it out overnight. So when retail investors, and even seasoned analysts, hunt for the next blockbuster, they gravitate toward one kind of slide in the corporate deck:

“Historical overall survival for this patient population is just 6 months. In our early-stage trial, patients lived 16 months.”

On paper, it looks like a guaranteed breakthrough. In practice, it is one of the most reliable ways to misprice an oncology asset—not because the drug never works, but because the comparison is rigged from the start.

This article breaks down why promising candidates so often stall in Phase 3 even when their early data looked unbeatable. The usual culprit isn’t a drug that suddenly stops working. It’s a control group that performs far better than the historical baseline predicted.

What is selection bias in a clinical trial? In short: the patients who qualify for a trial are systematically healthier and more treatable than the patients in a real-world registry. That single fact distorts nearly every historical comparison you will see in an investor deck.

Apples and Oranges: Registry Populations vs. Trial Populations

Companies love to anchor their pitch to national cancer registries such as the U.S. SEER database. These registries tell the brutal truth about a disease—overall survival (OS) for elderly patients with relapsed Acute Myeloid Leukemia (AML) might genuinely be 5 to 8 months.

But a registry includes everyone: patients with heart failure, renal impairment, poor performance status (ECOG 4), and those whose bodies simply cannot tolerate aggressive treatment.

A Phase 3 trial is a different ecosystem entirely. To even enroll, a patient typically has to clear a rigorous filter of inclusion/exclusion criteria:

Adequate organ function: no severe cardiac, hepatic, or renal compromise.
Robust performance status: ambulatory and capable of self-care (often ECOG 0–2, sometimes carefully selected ECOG 3).
Acceptable baselines: blood counts and other markers must meet thresholds to withstand the protocol.

The result is that a trial doesn’t sample the general diseased population—it selects for resilient patients, many of whom have already survived prior rounds of toxic therapy. You are no longer comparing the drug to the disease. You are comparing it to a curated subset of the fittest patients who have it.
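To make the selection effect concrete, here is a minimal simulation sketch. Every number in it is a hypothetical assumption chosen for illustration (a 60% frail share, a 4-month median for frail patients, a 12-month median for fit patients, exponential survival), not data from SEER or any trial. No drug is involved at all: the only difference between the two medians it reports is who was allowed in.

```python
import math
import random
import statistics

# Hypothetical medians (months), for illustration only.
FRAIL_MEDIAN = 4.0
FIT_MEDIAN = 12.0

def simulate_median_os(n_patients, frail_fraction, seed=0):
    """Return (registry_median, trial_median) in months for a toy population."""
    rng = random.Random(seed)
    registry, trial = [], []
    for _ in range(n_patients):
        frail = rng.random() < frail_fraction
        median = FRAIL_MEDIAN if frail else FIT_MEDIAN
        # Exponential survival time with the chosen median (rate = ln 2 / median).
        os_months = rng.expovariate(math.log(2) / median)
        registry.append(os_months)       # a registry records everyone
        if not frail:
            trial.append(os_months)      # eligibility criteria screen out frail patients
    return statistics.median(registry), statistics.median(trial)

registry_median, trial_median = simulate_median_os(20_000, frail_fraction=0.6)
print(f"registry median OS:       {registry_median:.1f} months")
print(f"trial-eligible median OS: {trial_median:.1f} months")
```

Under these assumptions the trial-eligible group’s median comes out at roughly double the registry’s, before anyone has received an investigational drug.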

The Control Arm’s Revenge: When the Comparator Overperforms

In a randomized trial (e.g., investigational drug vs. Best Available Therapy / BAT), this selection effect applies to both arms. That’s where the math gets dangerous.

Consider a worked example:

The plan. The company’s statistical analysis plan (SAP) assumes the BAT arm behaves like the historical baseline, say a median OS of 8 months, and the trial is powered to detect an improvement to a median of ~13 months. Under a simple exponential survival model, that gap corresponds to a hazard ratio of about 0.62 (8 ÷ 13), the effect size the trial is designed to detect.
The reality. Because the trial enrolled only top-tier, resilient patients, and because standard supportive care keeps improving, the control arm’s median OS doesn’t come in at 8 months. It comes in at 12–14.
The squeeze. If BAT lands at 14 months, the drug now has to reach a median of roughly 23 months (14 ÷ 0.62, under the same exponential assumption) just to preserve the planned hazard ratio of about 0.62. The bar hasn’t moved a little. It has moved into punishingly hard territory, often beyond what the drug’s true effect size can deliver.
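The arithmetic behind this example can be sketched in a few lines. The sketch assumes exponential survival in both arms, which is a simplification: under that model the hazard ratio is just the control median divided by the drug median. The function names are illustrative, not taken from any SAP.

```python
def hazard_ratio(control_median, drug_median):
    """HR (drug vs. control) under exponential survival.

    With exponential survival, hazard = ln(2) / median, so the ratio of
    hazards equals the inverse ratio of medians.
    """
    return control_median / drug_median

def required_drug_median(control_median, target_hr):
    """Drug-arm median needed to hit target_hr against a given control median."""
    return control_median / target_hr

planned_hr = hazard_ratio(8.0, 13.0)                      # the SAP's assumption
needed_median = required_drug_median(14.0, planned_hr)    # if BAT lands at 14 months

print(f"planned hazard ratio: {planned_hr:.2f}")
print(f"drug median needed against a 14-month control: {needed_median:.2f} months")
```

The same five-month gap that looked decisive against an 8-month comparator is nowhere near enough against a 14-month one.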

The drug may genuinely extend life. It just may not extend it enough relative to a comparator that turned out far healthier than the model assumed.
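A rough power calculation shows how much damage this does. The sketch below uses Schoenfeld’s approximation for a 1:1 randomized log-rank test and assumes exponential survival, a two-sided alpha of 0.05, and 90% planned power. It also assumes, purely for illustration, that the drug adds a fixed 5 months to whatever the control median turns out to be; a drug with a truly proportional effect would behave differently.

```python
import math
from statistics import NormalDist

def schoenfeld_events(hr, alpha=0.05, power=0.90):
    """Deaths needed to detect `hr` with a 1:1 log-rank test (Schoenfeld approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(4 * (z_alpha + z_beta) ** 2 / math.log(hr) ** 2)

def power_at(hr, events, alpha=0.05):
    """Approximate power of the same test if the true hazard ratio is `hr`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(math.log(hr)) * math.sqrt(events) / 2 - z_alpha)

planned_hr = 8.0 / 13.0      # SAP assumption: control 8 months, drug 13 months
realized_hr = 14.0 / 19.0    # hypothetical reality: control 14 months, drug 19 months

events = schoenfeld_events(planned_hr)
planned_power = power_at(planned_hr, events)
realized_power = power_at(realized_hr, events)

print(f"events planned: {events}")
print(f"power if the SAP assumptions hold: {planned_power:.0%}")
print(f"power if control reaches 14 months: {realized_power:.0%}")
```

Under these assumptions, a trial designed with a 90% chance of success drops to something close to a coin flip, even though the drug extends median survival by exactly as many months as planned.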

Important nuance: the overperformance is often real, not just an artifact

It’s tempting to frame control-arm overperformance as a pure statistical illusion. It isn’t. Part of it is genuine medical progress. Modern supportive care, better infection management, improved salvage and later-line therapies, and higher-quality trial sites all push the comparator’s survival up for real reasons. Recent analyses of Phase 3 oncology RCTs suggest control-arm overperformance is common and is linked to trials missing their endpoints.

So the lesson isn’t “the comparator cheated.” It’s “the comparator was modeled on a population and an era that no longer match the trial.” Both selection and secular improvement in care work against an overconfident SAP.

A Slow Trial Is Not Automatically Good News

On investor forums you’ll see crowds cheering when a trial runs long: “Events aren’t accumulating—patients are living longer—the drug must be working!”

This is the second fallacy. Survival curves (Kaplan-Meier estimates) aren’t straight downward slides. If both arms are stocked with exceptionally healthy patients, the pooled survival of the entire trial rises. After the median is reached, a “fat tail” of long-term survivors can remain, and an event-driven analysis—one that triggers only after a pre-specified number of deaths has accrued—can take years to read out.

A prolonged trial, absent unblinded data, does not prove the drug arm has separated from control. Often it simply proves the inclusion criteria did exactly what they were designed to do: enroll patients who live longer regardless of which arm they’re in.
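A back-of-the-envelope timing model illustrates the point. It is a deliberately crude sketch: all patients are assumed to enroll on day one, survival is exponential, there is no dropout, and the trial size (125 per arm) and event target (179 deaths) are hypothetical numbers chosen for illustration.

```python
import math

def expected_deaths(t, n_per_arm, control_median, drug_median):
    """Expected deaths by month t with everyone enrolled at t = 0."""
    total = 0.0
    for median in (control_median, drug_median):
        total += n_per_arm * (1 - math.exp(-math.log(2) * t / median))
    return total

def months_to_reach(target_events, n_per_arm, control_median, drug_median):
    """Bisection search for the month at which expected deaths reach the target."""
    lo, hi = 0.0, 1000.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if expected_deaths(mid, n_per_arm, control_median, drug_median) < target_events:
            lo = mid
        else:
            hi = mid
    return hi

TARGET, N_PER_ARM = 179, 125   # hypothetical design

as_planned = months_to_reach(TARGET, N_PER_ARM, 8.0, 13.0)    # SAP assumptions
fit_inert = months_to_reach(TARGET, N_PER_ARM, 14.0, 14.0)    # fit patients, drug does nothing
fit_active = months_to_reach(TARGET, N_PER_ARM, 14.0, 19.0)   # fit patients, drug adds 5 months

print(f"as planned:             {as_planned:.1f} months")
print(f"fit cohort, inert drug: {fit_inert:.1f} months")
print(f"fit cohort, drug works: {fit_active:.1f} months")
```

In this toy model, a trial stocked with fit patients and a completely inert drug still reads out well behind the original schedule. From the outside, that delay is hard to distinguish from one caused by a drug that works.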

The Other Side: Historical Data Isn’t Useless—It’s Just Not Proof

Here’s where the argument so far needs tempering. Historical data is a bad proof of efficacy but a good source of context. Used correctly, it helps you:

Sanity-check a company’s effect-size assumptions for realism.
Stress-test the SAP (“is an 8-month BAT assumption credible in 2026?”).
Price the risk of a trial that’s powered on optimistic comparators.

And in genuinely rare, well-characterized diseases, or where randomization is unethical or impractical, external or historical controls can be legitimate, provided they are methodologically rigorous (matched populations, pre-specified analyses, regulatory alignment). The ICH E10 guideline, which the FDA has adopted as guidance, is skeptical of historical controls precisely because of selection bias, but it doesn’t ban them; it sets a high methodological bar.

The trap isn’t using historical data. The trap is letting it become the spine of the investment case instead of one input among many.

The Diagnostic Checklist: Four Questions Before You Trust the Comparison

When a company leans on “this population’s historical prognosis is poor,” ask:

Is the historical comparator actually the same population as the trial? Age, remission status, MRD, ECOG, cytogenetics, transplant eligibility, prior lines, comorbidities, and time since diagnosis can swing prognosis enormously.
Is the BAT arm really expected to be as weak as assumed? If BAT patients are in remission, fit, and tightly screened, their OS can far exceed registry numbers.
Is the endpoint OS or something more sensitive? OS is a “dirty” endpoint—later therapies, transplants, and salvage care all influence it. Relapse-based or biological endpoints can isolate the drug’s effect more cleanly.
Is the trial running long because the drug works, or because the whole population is super-selected? Without unblinded data, you cannot tell the two apart.

The BioRadar Bottom Line

The next time you read a biotech investor deck, ask one question above all: what is the drug actually being compared to?

If the baseline is registry data, post-hoc matched cohorts, or cross-trial comparisons with different inclusion criteria, your alarm bells should ring—not because the drug is doomed, but because the evidence is weaker than the slide implies. Real commercial and medical value is demonstrated when two identically filtered populations meet head-to-head in a live, randomized trial.

Historical data can point you in a direction. It should never be the load-bearing wall of your thesis. In oncology especially, the “miracle” often looks miraculous only because the yardstick was wrong.

References & Further Reading

ICH E10, Choice of Control Group and Related Issues in Clinical Trials (adopted by the FDA as guidance for industry). Regulatory framework for control selection and the limits of historical controls.
SEER Program, Surveillance, Epidemiology, and End Results (NCI). Real-world baseline data frequently misapplied as a trial comparator.
Kennedy-Martin, T., et al. (2015), A literature review on the representativeness of randomized controlled trial samples and implications for the external validity of trial results, Trials 16:495. How eligibility criteria systematically exclude high-risk patients.
Machin, D., et al. (2006), Survival Analysis: A Practical Approach, 2nd ed. Mechanics of Kaplan-Meier curves, censoring, and why the tail extends trial timelines.

This content is for informational purposes only and is not investment advice. BioRadar does not provide buy/sell recommendations. Past performance does not guarantee future results. Always do your own due diligence.