DataLab at Protege

Evaluating Medical AI: What Social Science Already Knows

by Engy Ziedan, Ph.D.

Five Randomized Controlled Trials from social science that offer valuable lessons in evaluating AI's effects on health

There is a lot of concern today about how we evaluate the effect of medical AI on patient outcomes. Many conversations begin with the observation that our current benchmarks are not sufficient. They are often single-turn, narrow, and disconnected from real-world data. It is wonderful and refreshing to see that the builders of these tools are deeply concerned about robust evaluations.

I sometimes think about this the following way: evaluating medical AI using a single benchmark can feel like watching one random second of a movie and deciding whether the movie is good.

You can see how limited this assessment quickly becomes. Even if you add more benchmarks, what you are after is the true intended and unintended consequences of the technology on real-world health.

So the conversation quickly moves to the idea that what we really need are real-world evaluations. Multi-turn settings. Clinical workflows. Actual patient environments. We also want credible evaluations: random assignment to treatment and control.

We should run randomized controlled trials (RCTs).

That instinct is correct. RCTs have been the backbone of credible causal inference in medicine and increasingly in economics as well.

But before we rush to run them, it is worth pausing for a moment.

Social scientists have been studying large interventions on patients, health systems, and human behavior for more than fifty years. Entire literatures exist around what happens when we introduce new technologies, incentives, or policy interventions into complex systems like healthcare.

Some of those experiments confirmed our intuitions.

Many of them did not. Less so because the RCTs were faulty, and more so because research is just hard.

There is a rich body of work here that we should draw from before we begin designing the next generation of AI evaluations.

What follows is how I think about the problem, along with examples of several of the largest and most well-known randomized controlled trials. I did not select these because they produced the most surprising findings. I selected them because they are among the most influential. But as you will see, many of these evaluations ended up producing results that carried an unexpected twist.

The Health Production Function

If we think about healthcare globally, we can frame it using a human capital model.

It is worth revisiting the classic models that formalized this idea: Michael Grossman's 1972 model of health capital and Gary Becker's work on the allocation of time.

Health is produced through two broad channels:

  • Traditional medicine — hospitals, physicians, drugs, and clinical interventions
  • Individual investments in health — time, diet, behavior, and other personal inputs

Traditional models also emphasize that education and other forms of human capital influence health outcomes. There is also the role of biological endowments (birthweight, for example, is very well studied).
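As a rough sketch (following Grossman's framing, with notation of my own choosing), health capital evolves with depreciation and gross investment, where investment is produced from medical care and the individual's own time:

```latex
% Health capital accumulation, Grossman-style (notation is illustrative):
% H_t = health stock, \delta_t = depreciation rate,
% I_t = gross investment, produced from medical care M_t and own time T^H_t,
% with education E shifting the efficiency of health production.
H_{t+1} = H_t\,(1 - \delta_t) + I_t, \qquad I_t = I\!\left(M_t,\, T^H_t;\, E\right)
```

The two channels below map directly onto the two arguments of the investment function: medical care and the individual's own inputs.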

AI can affect both sides of this production function.

AI within medicine

  • diagnostics
  • triage
  • clinical decision support
  • imaging interpretation

AI outside medicine

  • tools that help individuals produce health themselves
  • lowering the time cost of information
  • improving knowledge or health literacy

There is also an interaction between the two sides, which may be the most interesting part.

Cost–Benefit Thinking

Any serious evaluation ultimately comes down to marginal benefit versus marginal cost.

Marginal benefits

  • better detection
  • faster treatment
  • improved outcomes

Marginal costs

  • implementation costs
  • workflow disruption
  • new risks or unintended consequences

In economic research, these are often framed as:

  • Intended consequences
  • Unintended consequences

Intended consequences test whether the technology improves outcomes.

Unintended consequences capture the broader equilibrium effects that may arise once a technology is widely deployed.

Example:

  • AI deployment in radiology may improve detection rates.
  • At the same time, it may affect labor markets, including potential displacement of radiologists.

Both sides must be studied. The more global or seismic a change is, the more likely it is to produce general equilibrium effects, as other markets beyond the one targeted begin to shift as well.

Learning from Large RCTs: Not Drug Trials, but Healthcare-Related Experiments

Health economists have been studying large interventions for more than half a century. Many of these experiments produced surprising results that changed how we think about healthcare.

Below are several examples.

1. RAND Health Insurance Experiment

RAND overview:

https://www.rand.org/health/projects/hie.html

The Health Insurance Experiment began in 1971 and ran for roughly fifteen years. It remains the largest health policy experiment ever conducted in the United States.

Thousands of individuals were randomly assigned to insurance plans with different prices. Some paid 5% of their medical bills and others paid 95%. Researchers then followed them over time to observe how healthcare utilization and health outcomes changed.

Experimental Framework

  • Randomized individuals into insurance plans with different cost-sharing levels
  • Followed participants over many years
  • Tested more than 40 health outcomes

Key Results

  • Price elasticity of healthcare demand ≈ 0.2
  • A 1% increase in price reduced utilization by roughly 0.2%

Healthcare demand was inelastic, but clearly responsive to prices.
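To make the magnitude concrete, here is a small sketch of the arithmetic an elasticity of about 0.2 (in absolute value) implies. The numbers are illustrative, not taken from the RAND data.

```python
# Illustrative arithmetic only: what a demand elasticity of roughly 0.2
# (in absolute value) implies for utilization when prices change.

ELASTICITY = -0.2  # approximate RAND estimate, with the sign made explicit

def utilization_change_pct(price_change_pct: float) -> float:
    """Percent change in utilization implied by a percent change in price."""
    return ELASTICITY * price_change_pct

# A 10% price increase implies roughly a 2% drop in utilization.
print(utilization_change_pct(10.0))  # -2.0
```

Inelastic, but not zero: moving coinsurance from 5% to 95% of the bill is a very large price change, so even this modest elasticity translated into meaningful differences in utilization.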

Surprising Findings

  • Changes in healthcare utilization had very little effect on measured health outcomes
  • Cost sharing generally had no adverse effects on health

There were exceptions:

  • free care improved hypertension outcomes
  • improvements in dental health
  • improvements in vision
  • improvements in certain serious symptoms

Importantly, these improvements were concentrated among the sickest and poorest patients.

This result raised a profound question:

Why does more healthcare not necessarily produce better health?

Two explanations emerged:

  • Flat-of-the-curve medicine — some healthcare is simply less productive than we assume
  • Ex-post moral hazard — behavioral responses once individuals know treatment is available

More reading:

https://www.aeaweb.org/articles?id=10.1257/jep.27.1.197

2. Moving to Opportunity

HUD overview:

https://www.huduser.gov/portal/mto.html

Research paper:

https://hendren.scholars.harvard.edu/sites/g/files/omnuum9171/files/hendren/files/mto_paper.pdf

Between 1994 and 1998, families living in high-poverty public housing in five U.S. cities were randomly assigned to different housing voucher programs.

Starting in 1994, roughly 4,600 families participated.

Families were randomly assigned to:

  • a voucher restricted to low-poverty neighborhoods
  • a traditional Section 8 voucher
  • a control group

Experimental Framework

  • Random assignment of housing mobility opportunities
  • Long-run follow-up of households
  • Outcomes tracked over 10–15 years

Surprising Findings

  • No economic gains for adults who moved to better neighborhoods
  • Moving to lower-poverty neighborhoods before age 13 significantly increased lifetime earnings
  • Large gender differences in outcomes
  • Neighborhood effects were extremely local

The result strongly suggested that places matter, but that timing—particularly childhood exposure—matters even more.

3. The Hot-Spotting Experiment (Remote Nurses and Readmissions)

NEJM paper:

https://www.nejm.org/doi/full/10.1056/NEJMsa1906848

Hospital readmissions are a major concern in healthcare systems.

Roughly one in five Medicare patients is readmitted within 30 days of discharge.

A widely held hypothesis was that readmissions were driven by poor discharge management.

Intervention

High-risk patients were assigned:

  • intensive nurse follow-up
  • remote support after hospital discharge

Expectation

Better follow-up should reduce readmissions.

Result

  • No measurable reduction in readmissions

From the paper:

“The 180-day readmission rate was 62.3% in the treatment group and 61.7% in the control group.”

The intervention had no statistically significant effect.

Interpretation

Reducing readmissions likely depends on more than discharge behavior. Many of these patients are extremely ill, and additional monitoring alone may not change outcomes.

4. Medical Debt Relief (If we clear people’s medical debt, what happens to their spending and mental health?)

Commentary:

https://siepr.stanford.edu/news/study-finds-medical-debt-relief-doesnt-always-work

Paper:

https://academic.oup.com/qje/article/140/2/1187/7933321

Researchers ran an RCT relieving hospital medical debt for a group of individuals.

Experimental Framework

  • Treatment group: debt forgiven
  • Control group: no change

Researchers then measured mental health and financial outcomes.

Surprising Result

The treatment group reported worse mental health on average.

From the paper:

“We estimate a statistically insignificant 3.2 percentage point average worsening of depression.”

Again, a very counterintuitive finding.

5. Minimum Guaranteed Income (Yup, think Andrew Yang-type of policy)

Open access paper:

https://www.nber.org/system/files/working_papers/w32711/w32711.pdf

In this experiment:

  • 1,000 low-income adults received $1,000 per month
  • 2,000 control participants received $50 per month

The program ran from 2020 to 2023.

Experimental Framework

  • Randomized cash transfers
  • Three-year follow-up
  • Multiple health and financial outcomes measured

Results

The transfers produced:

  • short-lived improvements in stress
  • improvements in food security
  • increased healthcare utilization

However:

  • no measurable improvements in physical health

The authors conclude that even small improvements in health can largely be ruled out over the three-year horizon studied.

What These Experiments Teach Us

These papers are not fringe work. They sit at the frontier of empirical economics. Almost all of them were published in top-five economics or medicine journals by leading economists and social scientists.

We see a few patterns that translate to how AI is evaluated today:

  • Intuitive policies often produce surprising results. All the time.
  • Effects are often smaller than expected. This may or may not be the case for AI, but it raises an important question about how we will measure its effects. Relative to what baseline? The pre-AI paradigm? It is worth recognizing that patients today, even without any AI intervention inside the hospital, are already using AI tools on their own for health information and decision support. In other words, the “control” group may already contain some level of AI exposure.
  • Credibility will be difficult. There is a tremendous amount of commercial activity at stake. Results that skew one way or another will be tempting to highlight. Academic journal publications move slowly, while news travels quickly. This raises the question: who will ultimately regulate the credibility of some of these findings?

How We Should Evaluate Medical AI

If we are going to run RCTs to test AI in real-world healthcare, there are several lessons social science has already learned. Adopting these teachings is a good thing; abandoning these axioms could waste a lot of time and undermine the credibility revolution in AI evaluations.

1. Register the Trial. If you are going to evaluate an AI tool, tell the world ex-ante all the outcomes you will measure.

Top journals now require pre-registration. We should hold the same standard for any evaluation.

Researchers should declare in advance:

  • the intervention
  • the outcomes
  • the hypotheses

This prevents selective reporting, or "burying results in the file cabinet," as my advisor used to say.

If researchers test 100 outcomes but only report the 10 positive ones, the scientific record becomes distorted.
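A minimal simulation makes the point: even when there is no true effect at all, testing many outcomes at the conventional 0.05 threshold reliably produces "significant" findings. (Under the null hypothesis, p-values are uniformly distributed on [0, 1], which is all the simulation below relies on.)

```python
import random

random.seed(42)

N_OUTCOMES, ALPHA, N_SIMS = 100, 0.05, 2000

# Under the null, p-values are uniform on [0, 1]. Draw p-values for 100
# outcomes with NO true effect and count how many clear alpha = 0.05 by
# chance alone, averaged over many simulated "studies".
counts = []
for _ in range(N_SIMS):
    p_values = [random.random() for _ in range(N_OUTCOMES)]
    counts.append(sum(p < ALPHA for p in p_values))

avg = sum(counts) / N_SIMS
print(f"On average {avg:.1f} of {N_OUTCOMES} truly null outcomes look significant")
```

Roughly five false positives per hundred null outcomes, every time. Pre-registration does not eliminate this, but it forces the other ninety-five results into the open.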

A useful overview of experimental methodology is John List’s work:

https://voices.uchicago.edu/jlist/research/methodology/

List was chief economist at Uber, then Lyft, and now Walmart. His book The Voltage Effect also makes an important point: interventions that appear promising in small studies often behave very differently once deployed at scale.

Relatedly, and intended or not, the term "anthropic bias" actually refers to the bias that arises because the evidence we observe is filtered by the precondition that we exist as observers.

This is very much on point here. We tend to observe and report only the experiments that produce interesting results. Registering your trial ex ante will help shed light on where you saw the model fail or produce insignificant results. The "back room" is no longer hidden, so to speak.

2. Ensure Adequate Statistical Power

Underpowered trials frequently produce null results simply because the sample size is too small.

Researchers sometimes attempt to rescue these studies by arguing that some outcomes were powered while others were not. It makes for dull reading, but more importantly, the news runs with the results, and readers assume everything that was tested was in fact significant!

Essentially, if a study cannot detect the effect size it claims to study, interpretation becomes difficult.
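To see why power matters so much for the outcomes discussed in this post, here is a sketch of the standard normal-approximation sample-size formula for comparing two proportions. The 62%-to-57% example is illustrative, loosely inspired by the readmission rates above, not a claim about any specific trial's design.

```python
import math
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per arm to detect a change from rate p1 to p2
    with a two-sided two-proportion z-test (normal-approximation formula)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, two-sided test
    z_b = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Detecting a drop in a 62% readmission rate to 57% (an illustrative
# 5-percentage-point effect) requires roughly 1,500 patients per arm.
print(n_per_arm(0.62, 0.57))
```

Halve the detectable effect and the required sample roughly quadruples, which is why small pilots so often report null results that tell us very little.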

3. Test Outcomes That Actually Matter

This may be the most important point.

Many AI evaluations rely on proxy metrics designed by select groups such as:

  • Model grading of quality against a rubric designed by a small number of physicians.
  • The humans who generate rubrics can themselves be a select group. Ask yourself: if your spine surgery could be performed by (a) a professional AI expert grader who designs rubrics full time, or (b) a full-time surgeon with no interest in grading, which would you choose?

The outcome choices may not correspond to the real-world outcomes we actually want to measure. Good real-world outcomes include measures of actual health. It is also important to understand that healthcare and health are not the same thing. Health is mortality and other biological attributes. Healthcare is utilization: admissions and readmissions, for example, are measures of healthcare utilization and only a proxy for health. More readmissions could mean less health or more!

Some suggested outcomes:

  • mortality (30-, 60-, and 90-day)
  • hospital readmissions (<7 days, 15+ days, 30+ days)
  • disease detection rates (though this has its own can of worms due to sample iteration)
  • long-term health outcomes (loved and desired but conflicting mechanisms begin to kick in if you go too far out)

The problem is that these outcomes are harder to measure.

The intersection of plausible, useful, and easy to test is rarely available.
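Even the "easy" outcomes require careful definitions. As a small sketch, here is how a readmission indicator might be constructed from admission records; the record layout and dates are hypothetical, and real claims data involve many more edge cases (transfers, planned readmissions, death as a competing outcome).

```python
from datetime import date, timedelta

def readmitted_within(discharge: date, next_admit: date, days: int) -> bool:
    """True if the next admission falls strictly after discharge and
    within `days` days of it."""
    gap = next_admit - discharge
    return timedelta(0) < gap <= timedelta(days=days)

# Hypothetical example: discharged Jan 5, readmitted Jan 30. This counts
# as a 30-day readmission but not a 7-day readmission.
d, r = date(2024, 1, 5), date(2024, 1, 30)
print(readmitted_within(d, r, 30))  # True
print(readmitted_within(d, r, 7))   # False
```

Note that the same patient flips between "event" and "non-event" depending on the window chosen, which is exactly why the window should be pre-registered rather than picked after seeing the data.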

4. Sample Attrition Matters More Than We Think

Another lesson that comes up repeatedly in large experiments is sample attrition—who enters the experiment, who actually receives the treatment, and who remains observable by the time outcomes are measured.

A well-known example is the Oregon Health Insurance Experiment, which emerged from a lottery that expanded access to Medicaid.

Link:

https://www.nber.org/programs-projects/projects-and-centers/oregon-health-insurance-experiment/oregon-health-insurance-experiment-publications

The structure of the experiment itself illustrates how quickly attrition enters even very carefully designed studies.

Experimental setup

  • Oregon created a lottery to allow additional individuals to enroll in Medicaid.
  • Some people won the lottery and were given the opportunity to enroll.
  • Others lost the lottery and remained uninsured.

This produced a natural treatment and control group.

But immediately several layers of attrition appeared.

First layer of attrition

Not everyone who won the lottery actually enrolled in Medicaid. Only about 30% successfully completed the application, met eligibility requirements, and enrolled in the Oregon Health Plan (OHP).

This does not affect internal validity as much (the comparison is still lottery winners versus losers), but it does limit external validity.
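Partial take-up also changes what the estimate means. The raw winner-versus-loser comparison is an intent-to-treat (ITT) effect; under standard instrumental-variable assumptions, a Wald estimator rescales the ITT by the difference in take-up rates to recover the effect on those who actually enrolled. The numbers below are made up for illustration.

```python
# Illustrative numbers only. With partial take-up, the intent-to-treat
# (ITT) effect and the effect on actual enrollees differ. Under standard
# instrumental-variable assumptions, the Wald estimator rescales the ITT
# by the difference in take-up between winners and losers.

def wald_late(itt_effect: float, takeup_treat: float,
              takeup_control: float = 0.0) -> float:
    """Effect on compliers implied by an ITT estimate and take-up rates."""
    return itt_effect / (takeup_treat - takeup_control)

# A hypothetical 3-point ITT effect with 30% take-up among winners (and
# none among losers) implies an effect of about 10 points for enrollees.
print(wald_late(3.0, 0.30))
```

The same logic applies to AI deployments: if only a fraction of clinicians assigned an AI tool actually use it, the assignment-level effect will understate the effect of use itself.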

Second layer of attrition

Those who ultimately received Medicaid began using more healthcare services than the control group. This introduces a subtle measurement problem.

If one group is seeing doctors more often, illnesses are more likely to be detected and recorded. That group may appear sicker, even if their underlying health is not worse.

In other words, differences in healthcare utilization can affect how health itself is measured.

To address this, researchers deployed a large-scale survey to both groups, asking individuals directly about their health rather than relying only on healthcare utilization records.

But the survey introduced another form of attrition:

  • Not everyone responded; the response rate was about 73%.
  • Survey participation may differ across groups.

More subtly, individuals in the treatment group—who were now seeing doctors more frequently—may have had greater knowledge about their own health conditions. That alone could influence how they reported their health relative to the control group.

The researchers themselves discuss these issues extensively in the study.

So yes, RCTs are great. But what we need are RCTs done carefully.

The question is not whether we should evaluate AI in medicine.

We absolutely should.

The question is whether we will do it in a way that learns from decades of work in the social sciences.

Because if the past fifty years of experiments teach us anything, it is this:

RCTs, while extremely valuable, will not scale quickly. If we run many small trials on narrow cohorts and produce a large set of weak experiments, we may end up doing more harm than good in terms of understanding the true effects.

What I would ultimately like to see are real-world investigations of the effects of AI on outcomes such as mortality, labor force productivity, and other pivotal economic measures, studied in large populations and real clinical environments.

To do this, RCTs alone may not be sufficient. Quasi-experiments—situations where random variation arises naturally, for example from differences in deployment timing across hospitals or regions—could be extremely valuable. Detecting and studying these natural experiments may allow us to measure the effects of AI at the scale where they actually matter.