Abstract
One of the main blockers to democratizing artificial intelligence in healthcare worldwide is access to global healthcare training data. Our long-standing goal has been to build a dataset that captures the heterogeneity of international healthcare without introducing systematic selection bias due to participation or digitization. Obtaining international data is one challenge; ensuring representation without over-sanitization is another. We describe what we mean by over-sanitization below.
While AI penetration is highest in the US, there is a clear and pressing need for data from other countries. India is relevant on scale alone, and Brazil and India rank among the top three countries globally in LLM adoption per capita (Anthropic Economic Index). More fundamentally, models trained only on US data risk learning a narrow view of “normal” versus “abnormal” health patterns—what medical anthropologists refer to as disease theory systems (Aslan and Neberai, 2025).
Despite the clear need, international healthcare data poses a deceptively hard problem: how do we distinguish true contextual differences—those driven by local resource constraints, disease burden, and standard practice—from actual clinical errors that should not be learned or reinforced by AI systems?
This distinction matters because clinical decision-making is not optimized in a vacuum. A smart AI model implicitly solves an optimization problem: it learns which actions maximize patient outcomes subject to real-world constraints. In healthcare, those constraints vary dramatically by geography. Diagnostic access, medication availability, follow-up capacity, and care pathways differ not just between countries, but within them.
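To make this explicit (our notation, not a formal model from this work): write the local constraint set as $C$, covering which diagnostics, drugs, and follow-up pathways are actually available. A clinical policy then selects the best feasible action for a presentation $x$:

$$
\pi^*_C(x) = \arg\max_{a \in \mathcal{A}(C)} \ \mathbb{E}\big[\,\text{outcome} \mid x,\, a,\, C\,\big]
$$

Because both the feasible action set $\mathcal{A}(C)$ and the outcome distribution depend on $C$, two settings can produce different optimal actions for the same presentation without either being an error.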
Late last year, our team began evaluating large-scale international healthcare data to separate low-intensity medical care that is appropriate for its context from suboptimal clinical care.
Clinical recommendations are only as feasible as the infrastructure that supports them. Before debating guideline adherence or model calibration, it is worth grounding the discussion in a simple fact: access to healthcare resources, such as advanced diagnostics, varies dramatically across countries [1, 2].
International Healthcare Data Evaluation
Most healthcare AI evaluation norms were built around a familiar pattern: train on data from a small number of health systems or large US datasets, report performance on an internal test set, and, if available, add a benchmark. But across clinical ML and clinical NLP, a consistent message has emerged: models often degrade when moved across institutions, populations, and care environments, and a single external validation is rarely enough to establish real-world reliability [3, 4, 5, 6].
This generalization challenge is closely related to the concept of external validity in causal inference. Across multiple domains, systematic reviews show that model performance can deteriorate when deployed in underrepresented populations or new settings, reflecting differences in disease prevalence, clinical workflows, measurement practices, and resource availability [7, 8, 9, 10]. Similar failures of generalization have already occurred in biomedical research, and we hope to avoid repeating them in the era of AI. In genome-wide association studies (GWAS), decades of work eventually revealed that models trained on highly homogeneous biobanks do not generalize well to broader populations. The lack of diversity in these genomic datasets led to biased research findings, limited the effectiveness of genetic risk prediction, and ultimately reduced confidence in GWAS analyses more broadly.
But, back to ML approaches in healthcare. Most existing approaches fall into two buckets:
- Exclusionary approaches: International data is filtered out or down-weighted due to concerns about heterogeneity, dataset shift, or perceived quality risk—often because teams lack a principled way to separate contextual variation from true error. More broadly, the literature on deployment highlights that distribution shifts are common in clinical AI and can substantially harm performance if not explicitly handled.
- Homogenizing approaches: Data is normalized to “standard” conventions (units, guideline expectations, documentation structure), which can improve superficial comparability but risks erasing the very signals that matter for global usefulness—namely local practice constraints, disease burden, and workflow differences. Work on “scaling across clinical contexts” increasingly emphasizes that deploying models across settings requires explicit accommodation of local systems rather than assuming one universal baseline [11].
What is missing is a framework that:
- Treats cross-country variation as a hypothesis to test, not an error to correct (i.e., explicitly expects dataset shift and interrogates its causes).
- Uses clinicians to judge plausibility, not just guideline adherence.
- Separates population differences from practice differences statistically (e.g., reweighting / decomposition) to avoid conflating different case-mix distributions with “quality gaps”; a minimal sketch of such reweighting follows this list.
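As a concrete illustration of the reweighting idea, the sketch below matches a country's case-mix to a reference group before comparing review scores. Column names (age_band, acuity, diagnostic_correctness) are assumptions for illustration, not fields from our actual dataset.

```python
import pandas as pd

# Minimal case-mix reweighting sketch (illustrative; column names are assumptions).
# Idea: before comparing annotation scores across countries, reweight each
# country's cases so their case-mix (here: age band x acuity) matches the
# reference distribution. Residual score gaps are then closer to "practice"
# differences than to "population" differences.

def casemix_reweighted_mean(df: pd.DataFrame, ref: pd.DataFrame,
                            strata=("age_band", "acuity"),
                            score_col="diagnostic_correctness") -> float:
    """Mean score after reweighting df's strata to match ref's case-mix."""
    ref_mix = ref.groupby(list(strata)).size() / len(ref)    # target case-mix
    obs_mix = df.groupby(list(strata)).size() / len(df)      # observed case-mix
    weights = (ref_mix / obs_mix).dropna().rename("w").reset_index()  # drop strata absent from either group
    dfw = df.merge(weights, on=list(strata), how="inner")
    return float((dfw[score_col] * dfw["w"]).sum() / dfw["w"].sum())

# Usage (hypothetical frames):
# us = pd.DataFrame(...); country2 = pd.DataFrame(...)
# print(casemix_reweighted_mean(country2, ref=us))
```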
Observed Issues in Healthcare AI Applications
Core Issue #1: Contextual Clinical Decisions Are Mislabeled as Errors
Many international records show fewer labs, fewer imaging studies, or less aggressive treatment escalation. Naive evaluations interpret this as under-testing or substandard care.
What this misses:
- Patients may present later and sicker: Later-stage presentation and delays in care are well-documented in many LMIC contexts, which can change the appropriate diagnostic strategy and what “standard” care looks like at the time of presentation.
- Follow-up care may be less accessible: When reliable follow-up is uncertain, clinicians may make different trade-offs (e.g., prioritizing immediate, low-cost actions vs multi-step workups that require return visits) [12, 13].
- High-cost diagnostics may be unavailable or impractical: Imaging scarcity is a structural constraint in many LMIC settings; access gaps are especially pronounced for advanced modalities like MRI [1, 14].
- “Differential ordering” is only half the story: pace and timing matter too. Even when a test is clinically warranted, the time-to-result and the feasibility of completing the workup within the patient’s care pathway can shift decision-making (e.g., staged testing vs immediate escalation). This is a key reason “U.S.-style completeness” is not always the correct benchmark.
- Not everything is captured in the EHR: and missingness is often systematic. In many LMIC implementations, EHR adoption and record-keeping infrastructure are uneven, and missing or fragmented documentation can reflect workflow and system constraints rather than clinical omission [15, 16].
Bottom line: Without context, models learn the wrong lesson: that more intervention is always better. In reality, the observed pattern may reflect rational adaptation to constraints and documentation realities rather than unsafe care.
Core Issue #2: Country-Level Signals Leak into Data Quality Judgments
Even after geographic information is removed from a clinical record, records often retain implicit country-level signals. Both clinicians and models can infer geographic origin from subtle but systematic cues embedded in the data itself.
Common examples include:
- Units of measurement: Laboratory values may be reported using different unit conventions (e.g., mmol/L vs mg/dL), temperature in Celsius rather than Fahrenheit, or weight in kilograms without conversion. While clinically equivalent, these signals can immediately cue reviewers or models to a non-U.S. setting.
- Medication naming conventions: The same drug may appear under different brand names, generic spellings, or regional formulations. For example, medications commonly referenced by brand in one country may appear only under generic names in another, or follow country-specific formulary conventions that implicitly reveal origin.
- Documentation structure and clinical workflow artifacts: Note length, section ordering, templating, and narrative style can differ substantially across health systems and countries. Some records emphasize structured problem lists and templated assessments, while others rely more heavily on brief narrative summaries or episodic documentation tied to visit-based care. These differences often reflect workflow and infrastructure—not quality—but can still act as strong geographic signals.
These residual cues create two distinct risks:
- Evaluation bias: When reviewers (at the expert labeling and annotation stage) infer country of origin, they may unconsciously shift their expectations about what constitutes “appropriate” care based on perceived setting.
- Data handling risk: Country cues that persist undermine efforts to fairly compare data across settings.
The challenge, therefore, is not simply de-identification in the narrow sense, but reducing country-specific signals for the purpose of fair evaluation and comparison, while preserving the clinical information needed to judge plausibility, safety, and reasoning.
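To make the idea of cue reduction concrete, here is a minimal sketch of harmonizing two of the strongest implicit signals, unit conventions and brand names. This is illustrative only, not the exact pipeline used in this work; the brand-to-generic map is hypothetical, and a real pipeline would rely on a drug ontology such as RxNorm or ATC.

```python
# Hypothetical brand-to-generic map (illustrative entries only).
BRAND_TO_GENERIC = {
    "glucophage": "metformin",
    "panadol": "paracetamol",
    "tylenol": "paracetamol",
}

def harmonize_glucose(value: float, unit: str) -> float:
    """Report glucose in mmol/L regardless of the source convention."""
    unit = unit.strip().lower()
    if unit == "mmol/l":
        return value
    if unit == "mg/dl":
        return round(value / 18.0, 2)   # approximate mg/dL -> mmol/L conversion
    raise ValueError(f"unrecognized glucose unit: {unit}")

def harmonize_drug_name(name: str) -> str:
    """Map brand names to generics so formulary conventions do not reveal origin."""
    return BRAND_TO_GENERIC.get(name.strip().lower(), name.strip().lower())

print(harmonize_glucose(126, "mg/dL"), harmonize_glucose(7.0, "mmol/L"))  # 7.0 7.0
print(harmonize_drug_name("Glucophage"))                                  # metformin
```

After this kind of harmonization, a glucose of 126 mg/dL and one of 7.0 mmol/L look identical to a reviewer, and formulary branding no longer hints at the country of origin.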
Our Approach Towards This Problem
One physician's underutilization is another physician's appropriate care. We recruited experienced clinicians and ran a blinded chart-review study across:
- Country 1: A high-income, OECD health system, used as a reference setting
- Country 2: A lower–middle-income health system with relatively mature EMR infrastructure, proxied using indicators such as GDP per capita, population coverage, and documented digital health adoption
- Country 3: A lower–middle-income health system with more constrained diagnostic access and heterogeneous documentation, proxied using similar macro-level indicators but representing a distinct care environment
To reduce bias, we did not present raw clinical notes to reviewers. Instead, each case was standardized into a structured clinical snapshot that removed stylistic and geographic cues while preserving the information necessary to evaluate diagnostic reasoning and treatment decisions. Reviewers were explicitly instructed to assess plausibility and safety in context, recognizing that resource availability, formularies, and practice norms vary across healthcare environments and that high-quality care may look different under different constraints.
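For intuition, a structured clinical snapshot might look roughly like the sketch below. The field names are our illustrative assumptions, not the study's exact schema; the point is what a snapshot keeps (presentation, vitals, harmonized labs, diagnosis, treatment) and what it deliberately omits (narrative style, locale, templates, brand names).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LabResult:
    name: str          # e.g., "glucose"
    value: float       # harmonized units (see earlier sketch)
    unit: str          # single convention across all countries

@dataclass
class ClinicalSnapshot:
    age_band: str                     # coarse bands rather than exact dates
    sex: str
    presenting_complaint: str         # brief, style-normalized summary
    vitals: dict                      # e.g., {"temp_c": 38.4, "hr": 104}
    labs: List[LabResult] = field(default_factory=list)
    working_diagnosis: Optional[str] = None
    treatment_plan: List[str] = field(default_factory=list)   # generic drug names
    # Intentionally absent: country, hospital, note templates, brand names.
```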
What clinicians were asked to judge:
- Diagnostic correctness: Does the diagnosis plausibly follow from the available labs, vitals, and presentation?
- Treatment appropriateness: Is the prescribed treatment consistent with the diagnosis and basic principles of clinical safety?
- Errors of omission vs. commission: If something is missing, does it reflect a warranted omission given constraints, or an unwarranted action that contradicts clinical reasoning?
Table 1: Cross-country comparison of annotation scores. Diagnostic correctness and treatment agreement are scored on a 0–5 scale. Columns report means, standard deviations (SD), and sample sizes (n). The rightmost column shows Welch t-test p-values comparing each country to the U.S.; the U.S. rows list NA because the U.S. is the reference group.
| Country | Variable | Mean | SD | n | p-value (Welch t-test vs USA) |
|---|---|---|---|---|---|
| USA | Diagnostic Correctness | 2.72 | 1.51 | 95 | NA |
| Country 2 | Diagnostic Correctness | 2.42 | 1.37 | 89 | 0.160 |
| Country 3 | Diagnostic Correctness | 2.12 | 1.62 | 88 | 0.0119 |
| USA | Treatment Agreement | 2.55 | 1.54 | 95 | NA |
| Country 2 | Treatment Agreement | 2.27 | 1.35 | 89 | 0.195 |
| Country 3 | Treatment Agreement | 1.98 | 1.53 | 88 | 0.0130 |
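Comparisons like these can be checked directly from the reported summary statistics. The snippet below is a sketch using Table 1's numbers, not the study's analysis code; small discrepancies from the reported p-values can arise from rounding of the means and SDs.

```python
from scipy import stats

def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Welch's t-test computed from means, SDs, and sample sizes."""
    t, p = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)
    return t, p

# Diagnostic correctness: USA vs Country 3 (values taken from Table 1)
t, p = welch_from_summary(2.72, 1.51, 95, 2.12, 1.62, 88)
print(round(p, 4))  # expected to land in the neighborhood of the reported 0.0119
```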
Key insight: Most deviations were omissions tied to resource constraints, not unsafe actions.
- The clinical performance data suggests that Country 2’s scores are comparable to those of the United States across both clinical metrics: the differences in Diagnostic Correctness (mean 2.42 for Country 2 vs. 2.72 for USA) and Treatment Agreement (mean 2.27 vs. 2.55) showed p-values of 0.160 and 0.195, respectively, against the US baseline.
- In contrast, performance metrics in Country 3 appear statistically different from the US baseline, most visibly in lower agreement with the treatments US physicians would favor, as reflected in the Treatment Agreement score (mean 1.98 vs. USA 2.55; p = 0.0130).
- Although the data shows a statistically significant difference in Diagnostic Correctness for Country 3 (mean 2.12 vs. US 2.72; p = 0.0119), the overall finding is that Country 3’s diagnostic decisions remain broadly in line with those of US physicians.
- We asked experts to provide their reasoning; the vast majority indicated that they would “do more,” and that the deviations stemmed largely from “omission” rather than “commission.”
Limitations to This and Where There is Headroom for Improvement
We learned that much of what appears to be lower-intensity care in international settings is not deficiency, but adaptation. Differences in laboratory ordering, imaging utilization, and treatment escalation often reflect rational decision-making under constraint rather than unsafe practice. That being said, our experiment has much more room for improvement. Below we highlight a few limitations:
- The ability to distinguish “contextual adaptation” from “clinical error” is the main objective, but how that distinction should be operationalized remains an open question we are still working to answer. In the current study, reviewers evaluate cases using structured clinical snapshots with stylistic and geographic cues removed so that the country of origin is not revealed. At the same time, they are asked to assess appropriateness “in context” and identify where underprovision of care is likely. One hypothesis is that if reviewers were given direct information about the constraints under which the patient is receiving care, they could better judge clinical appropriateness. However, doing so would also introduce bias, because it would reveal the geography or origin of care.
- A second limitation we aim to address is that the assessment was conducted mainly by U.S. physicians. We argue that global data must be made usable without collapsing it into a U.S.-centric standard, yet much of what is considered high-end clinical care is itself defined through a U.S.-centric lens. As such, we are expanding the experiment to include international reviewers.
- As with any expert annotation study, the number of reviewing clinicians, the use of double-review processes, rater calibration, and quantitative inter-rater reliability could all be improved. In future experiments, we plan to identify areas with lower inter-rater reliability and selectively increase the number of experts in those cases where agreement is weaker; a minimal sketch of one such reliability check follows this list. More broadly, validation is a difficult exercise, and one that is becoming harder as AI increasingly attempts to solve more complex and debated cases in medicine.
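Because the annotation scores are ordinal (0–5), a weighted kappa is a natural reliability check; cases or strata with low kappa are candidates for additional expert review. The ratings below are hypothetical and the snippet is only a sketch of the kind of check we have in mind.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical diagnostic-correctness scores from two reviewers on the same cases.
rater_a = [3, 2, 4, 1, 5, 2, 3, 0, 4, 2]
rater_b = [3, 3, 4, 1, 4, 2, 2, 1, 4, 2]

# Quadratic weighting penalizes large disagreements more than near-misses,
# which suits an ordinal 0-5 scale better than exact-agreement rates.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"quadratic-weighted kappa: {kappa:.2f}")
```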
References
[1] World Health Organization. (2026). "Medical devices: magnetic resonance imaging units (per million population), total density." https://www.who.int/data/gho/data/indicators/indicator-details/GHO/total-density-per-million-population-magnetic-resonance-imaging
[2] Statista. (2024). "Number of magnetic resonance imaging (MRI) units in selected countries as of 2024." https://www.statista.com/statistics/282401/density-of-magnetic-resonance-imaging-units-by-country/
[3] Choi, Youngwon and Yu, Wenxi and Nagarajan, Mahesh B. and Teng, Pangyu and Goldin, Jonathan G. and Raman, Steven S. and Enzmann, Dieter R. and Kim, Grace Hyun J. and Brown, Matthew S.. (2023). "Translating AI to Clinical Practice: Overcoming Data Shift with Explainability." RadioGraphics.
[4] Youssef, Alexey and Pencina, Michael and Thakur, Anshul and Zhu, Tingting and Clifton, David and Shah, Nigam H.. (2023). "External Validation of AI Models in Health Should Be Replaced with Recurring Local Validation." Nature Medicine.
[5] Goetz, Lea and Seedat, Nabeel and Vandersluis, Robert and van der Schaar, Mihaela. (2024). "Generalization--A Key Challenge for Responsible AI in Patient-Facing Clinical Applications." npj Digital Medicine.
[6] Zitnik, Marinka. (2026). "Contextual Errors Limit Real-World Performance of Medical AI." https://www.news-medical.net/news/20260203/Contextual-errors-limit-real-world-performance-of-medical-AI.aspx
[7] Ratwani, Raj M. and Sutton, Karey and Galarraga, Jessica E.. (2024). "Addressing AI Algorithmic Bias in Health Care." JAMA.
[8] Yang, Yuzhe and Zhang, Haoran and Gichoya, Judy W. and Katabi, Dina and Ghassemi, Marzyeh. (2024). "The Limits of Fair Medical Imaging AI in Real-World Generalization." Nature Medicine.
[9] World Economic Forum. (2025). "AI in Healthcare Risks Could Exclude 5 Billion People — Here’s What We Can Do About It." https://www.weforum.org/stories/2025/10/ai-in-healthcare-risks-could-exclude-5-billion-people-here-s-what-we-can-do-about-it/
[10] Celi, Leo Anthony and Cellini, Jacqueline and Charpignon, Marie-Laure and Dee, Edward Christopher and Dernoncourt, Franck and Eber, Rene and Mitchell, William Greig and Moukheiber, Lama and Schirmer, Julian and Situ, Julia and Paguio, Joseph and Park, Joel and Gichoya, Judy Wawira and Yao, Seth. (2022). "Sources of Bias in Artificial Intelligence That Perpetuate Healthcare Disparities—A Global Review." PLOS Digital Health.
[11] Li, Michelle M. and Reis, Ben Y. and Rodman, Adam and Cai, Tianxi and Dagan, Noa and Balicer, Ran D. and Loscalzo, Joseph and Kohane, Isaac S. and Zitnik, Marinka. (2026). "Scaling Medical AI Across Clinical Contexts." Nature Medicine.
[12] Joiner, Anjni Patel and Tupetz, Anna and Peter, Timothy Antipas and Raymond, Julius and Macha, Victoria Gerald and Vissoci, João Ricardo Nickenig and Staton, Catherine. (2022). "Barriers to Accessing Follow Up Care in Post-Hospitalized Trauma Patients in Moshi, Tanzania: A Mixed Methods Study." PLOS Global Public Health.
[13] Frijters, Elise M. and Hermans, Lucas E. and Wensing, Annemarie M.J. and Devillé, Walter L.J.M. and Tempelman, Hugo A. and De Wit, John B.F.. (2020). "Risk Factors for Loss to Follow-Up from Antiretroviral Therapy Programmes in Low-Income and Middle-Income Countries." AIDS (London, England).
[14] Silverberg, Melissa. (2024). "How Radiologists Overcome Barriers to Provide Imaging in Low to Middle Income Countries." https://www.rsna.org/news/2024/july/imaging-in-lmics
[15] Abdul-Rahman, Toufik and Ghosh, Shankhaneel and Lukman, Lawal and Bamigbade, Gafar B. and Oladipo, Oluwaseyifunmi V. and Amarachi, Ogbonna R. and Olanrewaju, Omotayo F. and Soyemi, Toluwalashe and Awuah, Wireko A. and Aborode, Adbdullahi T. and Lizano-Jubert, Ileana and Audah, Kholis A. and Teslyk, T.P.. (2023). "Inaccessibility and Low Maintenance of Medical Data Archive in Low-Middle Income Countries: Mystery Behind Public Health Statistics and Measures." Journal of Infection and Public Health.
[16] Gurupur, Varadraj and Hooshmand, Sahar and Fernandes, Prabhu, Deepa and Trader, Elizabeth and Salvi, Sanket. (2025). "Incompleteness of Electronic Health Records: An Impending Process Problem Within Healthcare." Healthcare.