
Musings On Benchmarks and Evaluations in Healthcare

by Engy Ziedan, Ph.D.

AI today is constrained by a lack of evaluations. DataLab aims to be part of the solution.

(If you’re in the market to buy benchmark data after reading this post, I suggest checking out this page.)

AI today is constrained by a lack of evaluations. In particular, the absence of robust evaluations is stalling the last-mile delivery of AI into healthcare systems and clinical settings. Healthcare is both extremely complex and high-risk. It is also a setting where medical information interacts deeply with socioeconomic context, making predictions harder. Finally, it is a domain where clear, closed-form solutions rarely exist.

Randomized controlled trials cannot be the only answer. Much like drug evaluations, online randomized controlled trials are difficult to execute on large, representative populations, especially before deployment and during model development. Even among sites that volunteer to trial new models, the very act of selection into a trial risks compromising external validity. As such, innovation in offline evaluations, those built from retrospective real-world data, is essential for AI development.

DataLab recognized this challenge and is actively working on it.

Why it matters:

  • For researchers: AI foundation-model developers and clinical diagnostic teams are eager for a suite of benchmarks and evaluations that can be used both during model development and training and after deployment.
  • During development: Evaluations measure the marginal impact of daily experiments, which calls for hard, unsaturated evaluations, i.e., ones that current models have not yet maxed out.
  • During deployment: Evaluations validate incremental model improvements over time, a prerequisite for FDA approval of nearly all clinical diagnostic AI applications.

The lab, along with medical sociologists, statisticians, and physicians in the broader academic community, began asking: “How do you evaluate AI?”

Each of us had something different to contribute. Given the lab’s range of backgrounds, from academic policy evaluators to clinicians to statisticians, each field raised an essential point:

  • For policy evaluators: Causal inference has standard axioms that are highly relevant to AI evaluations.
  • For medical sociologists: AI introduces complexity beyond traditional bias studies. Patients often share with AI what they do not share with physicians, which raises the question of whether medical notes are a meaningful counterfactual for benchmarking in the first place. Numerous studies also show that what gets recorded in medical notes is itself a choice by physicians, sometimes conditional on the outcome. Medical sociologists urged us to benchmark real-world data through multi-annotator rubrics to ensure diverse perspectives per case (see the first sketch after this list).
  • For statisticians: A critical insight was that scaling data does not eliminate bias! It merely shrinks the variance term of mean-squared error, which decomposes as bias squared plus variance. The result is a precise, but still biased, prediction (see the second sketch after this list). The algorithmic-bias literature has emphasized this for years. Thus, even with massive pre-training datasets, we will always need reliable evaluations.
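
To make the multi-annotator recommendation concrete, here is a minimal sketch of how per-case rubric scores from annotators with different backgrounds might be aggregated and high-disagreement cases flagged for review. The annotator roles, scores, and threshold are all hypothetical illustrations, not DataLab’s actual pipeline.

```python
from statistics import mean, stdev

# Hypothetical rubric scores (1-5) per case from annotators with
# different backgrounds; none of these names or values are real data.
case_scores = {
    "case_001": {"physician": 4, "sociologist": 2, "patient_advocate": 3},
    "case_002": {"physician": 5, "sociologist": 5, "patient_advocate": 4},
}

DISAGREEMENT_THRESHOLD = 0.75  # hypothetical cutoff on score spread

for case_id, scores in case_scores.items():
    values = list(scores.values())
    avg, spread = mean(values), stdev(values)
    status = "needs review" if spread > DISAGREEMENT_THRESHOLD else "ok"
    print(f"{case_id}: mean={avg:.2f} spread={spread:.2f} ({status})")
```

Surfacing the spread, not just the average, is the point: a case where the physician and the sociologist score very differently is exactly the kind of case a single-annotator benchmark would paper over.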
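
And to see the statisticians’ point in numbers: since mean-squared error decomposes into bias squared plus variance, adding data only shrinks the variance term. The simulation below, using purely hypothetical rates, shows the estimate tightening around the wrong value as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)

true_rate = 0.30      # hypothetical true outcome rate in the population
recorded_rate = 0.38  # hypothetical rate among documented (selected) cases

# More documented cases concentrate the estimate on the biased
# recorded_rate; the gap to the true rate never closes.
for n in (1_000, 100_000, 10_000_000):
    estimate = rng.binomial(1, recorded_rate, size=n).mean()
    print(f"n={n:>10,}  estimate={estimate:.4f}  bias={estimate - true_rate:+.4f}")
```

As n grows, the estimate converges, but to the documented rate rather than the true one; no amount of scale repairs the selection.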

In response to these concerns and the urgent need for more benchmarks for healthcare AI development, we summarized the axioms that must guide benchmark data design in a newly released memo that you can read here. The rest of DataLab and I benefited personally from conversations with my academic colleagues Coady Wing and Jen Silva, examples from independent physicians (hired via Office Hours), and feedback from our customers.

DataLab Memo: Curating Benchmark Data for Healthcare AI: Four Axioms for Success and New Datasets

About DataLab at Protege: We are a team of interdisciplinary researchers with one unified mission: pushing the frontier of safe and useful data available for AI training. You can read more about DataLab here, in my Substack post from earlier this year.

We’re also continuing to build out our roster of full-time researchers and academic advisors; check out our open positions here.

About Coady Wing: Wing specializes in the economic analysis of public policies and regulations, with a particular interest in effects on health behaviors, health outcomes, and the delivery of health care services. He conducts research on methodological topics related to causal inference and program evaluation.

About Jennifer Silva: Silva is a medical sociologist, funded by the Russell Sage Foundation for a project that integrates women’s electronic health records and in-depth interviews to uncover social determinants of health and barriers to well-being among women living in a disadvantaged rural community. Her upcoming book, “Seen But Not Heard: What Medical Data Doesn’t Tell Us About Women’s Health,” was the subject of a talk she gave at UT Austin and a piece she published here.