One of the cool parts of my job is hearing our customers describe how useful our research approach is when we assess data. This is probably my main area of focus within Protege. As we look to grow our data lab and enhance our research capability, I wanted to describe how I see the role of causal inference in curating data for AI.
As a disclaimer, this is my attempt to boost our hiring and recruitment efforts for our data lab—if you care about any of the concepts described below (broadly in any applied science) and feel like this is table stakes for the future of AI, then please reach out. I describe how economists add value to AI data curation below, but you can extrapolate this to subdomains within machine learning, computational sociology, data science, bioinformatics, computer science, and many other applied subjects.
Economists, ML folks, and AI:
Economists are trained to name, measure, and fix bias. We obsess over clean identification. We argue about assumptions. And we design quasi-experiments that attempt to identify the unbiased effects of a treatment (be it a policy, a drug, an information nudge). My friend Atheen Venkataramani and his co-author give a good overview of this approach here. In focusing on natural experiments and causal inference, we have become the social scientists who design formal approaches to measure and test for bias.
In machine learning, you can ship a state-of-the-art model that silently inherits all your data choices and their biases—a large literature describes this in several scenarios—the most dangerous cases are in healthcare and policing. If you change the data, you change the model. Period. Data curation for AI is therefore a product and its own science. At Protege I always worried that latent bias and curation assumptions in the data—not discovered by the AI researchers consuming the data early on—can have significant downstream effects. My colleagues call it my anti-sell. But really, it is what I often warn against internally and externally. It is also the goal of our data lab generally: to optimize data curated for each use case.
Of course, the value of economists and applied data scientists in tech is not new. For economists at least, our causal inference tools led us to some of the largest tech companies—Susan Athey and Michael Luca wrote in 2019 a summary on why tech companies hire economists: among many reasons, it’s their causal reasoning, experimentation with data, and policy thinking that connects prediction to decision. Read it here.
Economists who pioneered in tech include, for example, Hal Varian, who helped build the ad auctions that power Google Ads bidding, and the late Pat Bajari at Amazon, who helped unleash the power of over 300 applied scientists and PhDs on data to maximize value for producers and consumers on the site.
Hal Varian has an excellent 1989 piece on the value of economic theory (here), and a paraphrased idea from it has become my go-to opening line for my Economics 1010 class I teach: economics is to policy what physics is to engineering and biology is to medicine. The question then is what use economics—or, in particular, causal inference broadly—has for AI data curation. I see econometrics playing a vital role. Perhaps an undiscovered one.
First, I offer a (very) quick taxonomy of types of economists:
- Theorists: build theoretical frameworks. My favorites include models of the insurance market, trade models, models of the labor market, and job search theory. They pioneer building a framework that guides empirical assessments. Econometric theory gave us approaches like randomized experiments (known colloquially in tech as A/B testing), difference-in-differences, synthetic control matching, and instrumental variables, to name a few…
- Structural: theory-led empirical work. Here economists are less agnostic about the model and guide the empirical framework and data with theory. Think of it as a sequence of equations: one parameter is estimated, then plugged in to estimate the next, and so on.
- Reduced form: let quasi-experiments lead the way, staying mostly agnostic about mechanisms and letting the data speak. I’m firmly in the reduced-form camp—my day job is stress-testing whether “X causes Y.” I like to think I might have become a structural economist, but honestly, it was just too hard.
The core belief within DataLab (and by extension, Protege):
- Training data is a choice variable, like architecture or hyperparameters. Your selection function for data can harm or benefit models. At first this felt like a known fact, but no one was acting on it or quantifying it broadly across domains. I described it to a colleague the other day with the Bill Maher line: “I do not know it for a fact… I just know it’s true.” Though I do not watch much TV, that was the closest I could get to describing knowledge without broad evidence. Not only are actionable approaches lacking; this is also an often-ignored aspect of data curation.
- Making choices for AI researchers is a big responsibility. A transparent, statistically principled approach is therefore necessary. The limitless ways to curate data necessitate that choices are communicated clearly and come from a scientifically sound approach. Retesting data corpora again and again (even after data delivery) is essential. Basically, admitting to shortfalls, highlighting gains without overselling, and pointing to where data is lacking is the best we can do and should attempt to do.
Below are two quick examples of how small data curation decisions had significant effects on the characteristics of data chosen and trained on. I chose two healthcare cases here, but at Protege we systematically test for bias and fidelity across multimodal data in any vertical.
Example 1: De-identification can tilt the data

Use case. Build a clinical recommendation model that helps pair individuals with the right Rx. High stakes. Lots of potential public-health upside because the diagnosis in question is dominant among vulnerable populations and very common. Importantly, the affected population is generally young, so harm can translate into a significant loss of statistical life-years.
The trap. K-anonymity-style de-identification redacts rare combinations. In practice, that can hit minorities in specific ZIPs or age×race cells. Suddenly clinical severity looks “lighter” for exactly the people who had redactions due to privacy (to make data HIPAA compliant). The model then learns the wrong treatment intensity by subgroup—because observed outcomes were altered by privacy logic, not medicine. Essentially, the model learns: to lead a Hispanic male to the same health outcome as a white male—in a predominantly white location—you should double the Rx dose relative to his white counterpart. What is censored is that the Hispanic male has far more severe diagnoses—redacted to prevent identification.
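A minimal sketch of the mechanism, with toy records and hypothetical field values (not our production pipeline): suppressing any quasi-identifier cell with fewer than k members disproportionately removes rare subgroup cells, and clinical severity goes with them.

```python
from collections import Counter

K = 5  # k-anonymity threshold

# Toy records: (zip3, race, severity); zip3 and race are the quasi-identifiers.
records = (
    [("021", "white", "mild")] * 5
    + [("021", "white", "severe")] * 5
    + [("021", "hispanic", "severe")] * 2   # rare cell: only 2 records
)

# Size of each quasi-identifier cell.
cell_sizes = Counter((z, r) for z, r, _ in records)

# Suppress any record whose cell has fewer than K members.
released = [rec for rec in records if cell_sizes[(rec[0], rec[1])] >= K]

def severe_share(rows):
    return sum(1 for _, _, sev in rows if sev == "severe") / len(rows)

print(f"severe share, full corpus: {severe_share(records):.2f}")   # 0.58
print(f"severe share, released:    {severe_share(released):.2f}")  # 0.50
# Every suppressed record was a severe Hispanic case, so the released
# corpus understates severity for exactly that subgroup.
```

Privacy logic, not medicine, moved the severity distribution.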
What we did:
- Ran covariate balance checks between the full corpus and the de-identified sample. Imbalance showed up immediately.
- Predicted who drops out under the de-ID rules. Race, age, and county-size interactions lit up—exactly what a k-anonymity mechanism would imply. That just meant the de-identification was working.
- We explained the issue to the researchers. In the end, we swapped finer geography variables (one of the main re-identification risk factors) for coarser location plus richer clinical context, keeping the signal we actually need for health rather than the signal that drives re-identification risk. That meant swapping the data source entirely: to gain richer clinical context at coarser geography, we moved from claims to electronic medical records alone.
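The first two checks above can be sketched along these lines (toy numbers and hypothetical subgroup counts; an absolute standardized mean difference above roughly 0.1 is a common imbalance flag):

```python
from statistics import mean, stdev

def smd(a, b):
    """Standardized mean difference with a pooled SD (balance-check staple)."""
    pooled = ((stdev(a) ** 2 + stdev(b) ** 2) / 2) ** 0.5
    return (mean(a) - mean(b)) / pooled

# Hypothetical severity scores (0-10) before and after de-identification;
# the de-ID step dropped the most severe, rare-cell records.
full_corpus  = [2, 3, 4, 5, 6, 7, 8, 9, 9, 10]
deidentified = [2, 3, 4, 5, 6, 7, 8]

print(f"severity SMD: {smd(full_corpus, deidentified):.2f}")  # well above 0.1

# Who drops out? Tabulate suppression rates by subgroup (hypothetical counts).
kept    = {"white": 50, "hispanic": 4}
dropped = {"white": 1,  "hispanic": 8}
for group in kept:
    rate = dropped[group] / (kept[group] + dropped[group])
    print(f"{group}: {rate:.0%} of records suppressed")
```

A real run would loop the SMD over every covariate the clinicians flag, and replace the cross-tab with a model predicting dropout from subgroup interactions.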
Lesson. If a single high-leverage feature (like 5-digit ZIP) forces you to censor clinical history, you’re trading a great predictive feature for structural bias. Econometrics 101: check balance before doing anything with the data.
Example 2: Who makes the cut for healthcare pretraining data?
At Protege we have gathered a very large healthcare corpus: >100M individuals and billions of clinical notes (not to mention all other modalities). You don’t need a third of the country to pretrain an LLM on healthcare contexts. So the real problem is whom to include.
The tempting nonparametric route
One approach is to fine-tune an open-source model on many stratified random samples from the 100M; pick the sample that boosts benchmark scores the most. Then go get more of these subgroups for training.
This approach is fast. But it bakes in at least three assumptions:
- Your 100M are truly population-representative. They may not be.
- Your benchmark mirrors real use and is valid. Benchmarks are a separate topic I shall dissect one day—but most benchmarks have issues (to name a few: who annotated the data, who built the rubric…).
- You assume that over-selecting the group that boosts the score is harmless, even if another group is under-measured due to censoring or incompleteness. Subgroup A scores well and subgroup B does not, so we favor A; but B may simply have low-fidelity, censored records while being, a priori, a group we should include.
Those are strong assumptions hiding in plain sight.
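The tempting route can be sketched roughly as follows; the corpus, subgroup labels, and `benchmark_score` stub (standing in for a full fine-tune-and-evaluate cycle) are all hypothetical:

```python
import random

random.seed(0)  # deterministic toy run

# Hypothetical corpus: each record tagged with a subgroup label.
corpus = [{"subgroup": random.choice("ABC"), "note": f"clinical note {i}"}
          for i in range(10_000)]

def benchmark_score(sample):
    """Stub standing in for 'fine-tune an open model, then evaluate'.
    It (arbitrarily) rewards subgroup A, mimicking a benchmark that
    happens to favor one well-measured group."""
    return sum(r["subgroup"] == "A" for r in sample) / len(sample)

best_sample, best_score = None, -1.0
for trial in range(20):
    draw = random.sample(corpus, 1_000)  # simple draw standing in for a stratified one
    score = benchmark_score(draw)
    if score > best_score:
        best_sample, best_score = draw, score

# The winner is whichever draw over-represents A. The loop never asks
# whether B scored poorly because its records were censored or low-fidelity.
print(f"winning sample's share of subgroup A: {best_score:.3f}")
```

The loop is fast and looks rigorous, but every one of the three assumptions above is baked into the `benchmark_score` call.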
The disciplined (parametric + demographic) route
I often favor anchoring to external population structure (demography has a lot to offer AI). The idea is to first identify a representative benchmark dataset and work from there to build a matching training dataset (with some caveats—e.g., wanting to oversample rarer cohorts). In U.S. health, that means using the Medical Expenditure Panel Survey (MEPS; nationally representative care/use/cost) or the NIH All of Us program (which explicitly oversamples under-represented groups). We match our sample distributions to these references, then audit utilization and outcomes inside subgroups. It is very parametric in that you have to lean on clinical input to specify what matters (which covariates to test balance on). You could also make the assessment a little less parametric: estimate mortality or subsequent health events with (nonparametric) models on the data you selected versus the out-of-sample data, in the hope that observable covariates (or the predicted outcomes) are balanced across the two groups.
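A minimal sketch of the re-weighting step, assuming hypothetical age-band cells with made-up shares (real targets would come from published MEPS or All of Us tables, over much finer cells):

```python
# Hypothetical shares for illustration only.
target_share = {"18-44": 0.45, "45-64": 0.33, "65+": 0.22}   # reference population
sample_share = {"18-44": 0.60, "45-64": 0.25, "65+": 0.15}   # our corpus

# Post-stratification: up- or down-weight each cell to hit the target marginal.
weights = {g: target_share[g] / sample_share[g] for g in target_share}

for g in sorted(weights):
    print(f"{g}: weight {weights[g]:.2f}")

# Sanity check: weighted sample shares now reproduce the targets.
reweighted = {g: sample_share[g] * weights[g] for g in weights}
```

With many covariates the cells get sparse, so in practice you would rake over marginals instead of weighting each joint cell directly; the logic is the same.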
How we operationalize it:
- Build targets from MEPS/All of Us and weight to match. Most of these surveys also come with their own weights, so make sure you create a representative benchmark first.
- Run domain checks clinicians care about: follow-up windows, lab and vitals distributions, outcomes of events. Get clinical guidance and use the literature to build a broad list of checks.
- Report effect sizes (Cohen’s d) instead of worshiping tiny p-values on n ≈ millions. Every small difference between a cohort of millions and another cohort of millions will come out statistically significant. I will never forget my PhD student reporting a statistically significant difference of 0.2 grams in birthweight (average 3,000 grams) between the treatment and control arms.
- Control family-wise error when running many correlated checks, using methods like Holm–Bonferroni rather than stacking unadjusted p-values. This keeps us from over-correcting for bias that may not exist.
- Finally, iterate and stress-test the data in various ways. Bias is one element, but data fidelity, broadly, is a multifaceted concept. I often worry that the term “high fidelity” is misused.
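The effect-size and multiple-testing points above can be sketched in a few lines, using the birthweight anecdote (the 500 g standard deviation is an assumed, if typical, value) and a hand-rolled Holm step-down:

```python
def cohens_d(mean_a, mean_b, sd_pooled):
    """Effect size in pooled-SD units; scale-free, unlike a p-value."""
    return (mean_a - mean_b) / sd_pooled

# The birthweight anecdote: a 0.2 g gap on a ~3,000 g mean.
d = cohens_d(3000.2, 3000.0, 500.0)
print(f"Cohen's d = {d:.5f}")   # negligible, whatever the p-value says

def holm_bonferroni(pvals, alpha=0.05):
    """Step-down Holm correction: returns a reject flag per p-value."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            reject[i] = True
        else:
            break   # once one test fails, every larger p-value fails too
    return reject

checks = [0.001, 0.012, 0.030, 0.040]   # p-values from four correlated checks
print(holm_bonferroni(checks))          # [True, True, False, False]
```

Note that 0.030 passes an unadjusted 0.05 threshold but not the Holm-adjusted one: exactly the kind of “bias” flag we do not want to over-react to.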
The global takeaway is that to curate data you are making decisions under constraints. You cannot bring back all the missing elements for censored individuals. You cannot, with 100% certainty, distinguish rare events in the data from erroneous entries. You cannot, with 100% certainty, say that your data is unbiased. But you can improve on random sampling (because your data starting point is likely not random) and stress-test your data curation methods to reduce bias.
So we are hiring—and if this resonated, then come build the guardrails.
DataLab at Protege is looking for full-time researchers and inviting academic collaborators broadly (Apply Here). You can read about Protege’s recent fundraise with a16z here.