What Should Agentic AI Actually Be Trained to Do?

The problem is not just automation. It is knowledge.

While there is ongoing debate around whether AI improves productivity in the short run, a recent paper by Daron Acemoglu, Dingwen Kong, and Asuman Ozdaglar asks a deeper question. Their focus is the relationship between context-specific decision-making and the long-run accumulation of general knowledge in society. In their framework, successful decisions depend on combining shared community-level knowledge with individual, context-specific knowledge. These two inputs are complements, and human effort produces both: a private amount of knowledge and a thinner public signal that contributes to the broader knowledge stock of society.

Agentic AI can substitute for that effort by producing context-specific recommendations at lower marginal cost. But if it does so too well, it may reduce the incentive for humans to do the work that generates general knowledge in the first place. One of my colleagues at Indiana University recently visited my office and asked, "Why would we write papers in the future if AI gets so good at research?" I argued that we would write more papers, not fewer. Just faster science.

This new take on knowledge spillovers is a powerful idea because it reframes the question. The issue is not only whether an agent performs a task accurately today. It is also whether the set of tasks we choose to automate will preserve, deepen, or erode the human knowledge base that future decisions will depend on. This concept is being referred to as "deskilling" — and its consequences have yet to be fully realized.

A related but not contrary view is one that is much more optimistic. This perspective takes any occupation facing automation over time and models its work as a group of tasks. As automation occurs, the task set changes, and with it the composition of the expertise level needed to do the task and the composition of individuals who do the occupation also change. This research in itself is not new. Researchers who study the Bureau of Labor Statistics (BLS) or ONET datasets on jobs have traced changes in expertise and tasks within jobs over time.

Tasks are bundled into occupations, and automation changes expertise

David Autor and Neil Thompson emphasize that occupations are aggregations of tasks, and that the effect of automation depends on what kinds of tasks are removed and what kinds remain. In their framework, automating some tasks can make the remaining work more expert, while automating other tasks can make the remaining work less expert. The US economy is 150 million jobs and almost all of them have had some shifts in tasks and expertise over time.

There are intuitive examples of this. Consider text editing. In an earlier era, many editorial tasks were relatively low-expert: checking spelling, grammar, and mechanical correctness. As those tasks became automated, the editor's job shifted upward toward style, judgment, structure, and narrative coherence. Automation increased the expertise of the remaining work.

But expert knowledge can flow in the opposite direction. Driving a taxi in London historically required memorizing the city map and passing a demanding test without a GPS. GPS substantially reduced this knowledge burden. The task became easier — and the required expertise fell.

These examples matter because they suggest that “using AI in work” is not a single question. Automation changes the composition of what remains. Some automations deepen human judgment. Others hollow it out. That is exactly why task selection matters. We should not only ask whether a task can be automated, but also ask what kind of human capability remains after automation — and whether the automated task still contributes to a broader stock of socially useful knowledge.

This leads to two practical questions

Taking both papers at face value, two practical questions follow.

What should we actually be working on? Which tasks are socially valuable to make agentic, and which tasks should remain human-centered because they generate learning, expertise, and/or spillovers into the broader stock of knowledge?
How do we capture such tasks in data? Many real-world tasks are not a single decision node. They are trees of decisions. If the data only contains the final artifact or no clear lineage between inputs and outputs, then they do not contain the task.

This second point is crucial. The data problem is severe here. Tasks generated by humans to simulate real jobs lack realism and have potentially reached "flat of the curve" in imagination. If real data is the source, then it is still difficult to obtain all nodes in the decision-making process. For example, in healthcare or finance we may obtain the the outcome of a process, be it balance sheets or physician notes, but not the process itself.

Take a medical note as an example: this is the result of an interaction between physician and patient. It records what was decided - or at least what was documented. It does not fully reveal the back-and-forth that led to the decision, the options that were considered but dropped, the resource constraints that shaped the decision, or the uncertainty that may have changed over time.

The same dynamic appears in financial settings. If one only observes a tax filing or a balance sheet, one does not observe the multi-step decision tree that produced the artifact.

We chose healthcare as the first experimentation ground for generating training tasks

Healthcare is an especially useful domain for thinking about agentic tasks because it can be understood as a production function. Healthcare produces health, which in turn has spillover effects on other forms of human capital, such as education and labor market productivity.

In focusing on health, one of the most famous models is by Grossman (1972) (incidentally, I have an “Erdős number” of 3 to this work: me $\rightarrow$ my advisor Robert Kaestner $\rightarrow$ his advisor Michael Grossman). Grossman, who was trained by Gary Becker and Jacob Mincer, articulated a simple framework for health capital accumulation and the demand for health that has stood the test of time. In this model, health can be viewed as a durable capital stock. It depreciates over time, but it can be increased through investment. That investment is produced using both market inputs—especially medical care—and time. In shorthand,

H_{t+1} - H_t = I_t - \delta_t H_t,

where investment in health depends on medical care and other inputs:

I_t = I_t(M_t, TH_t, E_t).

This framing helps us identify a few ways AI can push the health production possibility frontier:

Increase the marginal benefit of medical care. This includes tasks that improve diagnostic precision, treatment matching, triage, and evidence synthesis.
Reduce the effective cost of medical care. This includes administrative workflows such as prior authorization, denial rebuttal, coding, scheduling, trial operations, and document consolidation.
Reduce time frictions in the system. This includes tasks that shorten queues, improve coordination, reduce handoff failures, and make it easier for patients and clinicians to move through care pathways.

Put differently, healthcare is not one task. It is a bundle of tasks that can increase the productivity of care, lower its cost, or reduce the time burden required to obtain care. Below I summarize some of the tasks our lab has been working towards in the last few months. This includes aggregating data and studying data grounding, and lineage.

Where we began in Clinical Diagnostics RL Tasks

Given data extraction speed and healthcare system constraints, we want to start where (i) the welfare stakes are large, (ii) the work burden is real, and (iii) model gains would plausibly spill over to adjacent clinical workflows. As such, we have formulated suggestions around task selection in clinical diagnostics.

Healthcare spending is extremely skewed

U.S. healthcare is not evenly distributed across patients. A small share of patients accounts for the majority of spend (in our internal framing: roughly 5% of patients drive $>$ 60% of spending). This concentration implies a practical starting point for RL task design: focus on high-intensity episodes where drug discovery and treatment matching can move QALYs and reduce avoidable downstream utilization. This is particularly relevant in chronic conditions and diseases impacting the elderly.

Figure 1: Concentration of annual spending across patients (Pareto-style). A small share of patients accounts for the majority of spend, motivating targeted diagnostic and workflow support. This is calculated from near 60% of all US claims held within Protege in 2023. Total health spending was 5.3 trillion and appx 3.4 trillion is represented in the data. — **Figure 1**: Concentration of annual spending across patients (Pareto-style). A small share of patients accounts for the majority of spend, motivating targeted diagnostic and workflow support. This is calculated from near 60% of all US claims held within Protege in 2023. Total health spending was 5.3 trillion and appx 3.4 trillion is represented in the data.

High-intensity disease categories are where QALY gains can be large

Beyond patient-level concentration, spending is also concentrated across disease groups. A small set of diagnosis categories can explain a large share of total inpatient spend, which gives a natural shortlist for “where to begin” when we want tasks that are economically meaningful and clinically consequential.

Figure 2: Total inpatient spend by disease group (Pareto-style). A small number of disease categories explain a large share of spend. This figure was computed from a cohort of 50M patients with EMR from one aggregator within Protege. This source includes detailed unstructured notes. — **Figure 2**: Total inpatient spend by disease group (Pareto-style). A small number of disease categories explain a large share of spend. This figure was computed from a cohort of 50M patients with EMR from one aggregator within Protege. This source includes detailed unstructured notes.

We then layer in severity. Some categories are not only expensive, but also associated with high years of life lost (YLL). These are the regimes where better diagnostic reasoning, earlier correct treatment selection, and fewer missed criteria can plausibly translate into meaningful QALY gains, not just operational throughput.

Figure 3: Average episode cost vs. average years of life lost (YLL) by disease group. This helps separate “expensive” from “expensive and severe,” which is where diagnostic improvements can matter most for welfare. This was estimated from individuals deceased in 2023 from a group of 50M patients. Years of life lost are calculated by subtracting life expectancy for that birth cohort from age at death. — **Figure 3**: Average episode cost vs. average years of life lost (YLL) by disease group. This helps separate “expensive” from “expensive and severe,” which is where diagnostic improvements can matter most for welfare. This was estimated from individuals deceased in 2023 from a group of 50M patients. Years of life lost are calculated by subtracting life expectancy for that birth cohort from age at death.

Expert labor is a binding constraint and creates diagnostic choke points

Even when the right tests exist, the system bottlenecks on expert time. Radiology and pathology in particular sit at high-volume choke points: many patients per specialist, heavy longitudinal review requirements, and high downstream dependence. Everything waits on reads, staging, and interpretation. In pathology, this is a widely-cited reason behind the average delay of roughly 180 days between diagnosis and treatment.

Figure 4: Clinical labor intensity varies widely by condition. Distribution of the mean number of clinicians involved per encounter (aggregated by primary diagnosis ICD-10 3-character) in 2023, illustrating how some diagnoses induce substantially more multi-clinician coordination and downstream handoffs. — **Figure 4**: Clinical labor intensity varies widely by condition. Distribution of the mean number of clinicians involved per encounter (aggregated by primary diagnosis ICD-10 3-character) in 2023, illustrating how some diagnoses induce substantially more multi-clinician coordination and downstream handoffs.

Together these patterns help narrow the space. We want tasks that sit at high-cost or high-severity bottlenecks, involve real expert judgment, and have enough structure that learning on them can plausibly transfer to neighboring workflows.

Examples: Clinical Diagnostic Agentic Tasks

Multi-disciplinary team (MDT) case conference task

We start with a concrete, ground-truth sequential case. We take one inpatient heart-failure patient and treat the chart as a relay: multiple clinicians contribute partial, time-ordered evidence, and each step changes what the next specialist does. This is not a single-shot task where the model reads everything and answers once. The agent must act at multiple decision points as new artifacts arrive.

One-patient snapshot (verbatim from the ground-truth record).

This patient has a Total of 89 notes.
He was last seen in clinic on 5/26 by Herby Abraham, she admitted him to Jefferson Regional Medical Center for acute on chronic combined heart failure.
He was diuresed, and had a repeat echo that showed severe tricuspid regurgitation with hepatic vein flow reversal during systole, moderate mitral regurgitation and mild to moderate aortic stenosis.
He has been seen by Dr. Helen Hashimoto on 6/21 who ordered transesophageal echo, the TEE showed severe aortic valve stenosis.

Why MDT is the right RL shape.

Even in this single example, care is shared across multiple teams (e.g., advanced HF/transplant cardiology, electrophysiology, structural heart imaging, dermatology, physical therapy). The value is in (i) ordering and filtering evidence across time, (ii) carrying forward what prior teams already established, and (iii) making the next-best clinical or workflow move under uncertainty.

Task framing: the agent must act at multiple decision points.

At each step, the agent receives only the information that exists up to that time (notes, echo snippets, orders, results), and must choose an action:

After admission note (5/26): summarize the HF presentation; identify missing objective evidence needed to justify the inpatient plan (e.g., echo recency, volume status signals); decide what to request next.
After TTE result (5/27): extract the key structural findings (TR/MR/AS) and decide whether to route to a structural pathway vs optimize HF meds/diuresis only; propose next test(s) if the severity picture is incomplete.
After structural consult (6/21): given the ordered TEE, pre-register what would change management depending on the TEE outcome (e.g., severe AS thresholds; low-flow/low-gradient pattern).
After TEE result (6/21): update the problem representation; decide whether the pathway becomes “TAVR evaluation” vs alternative; generate the checklist of downstream prerequisites.
After follow-up clinic note (7/1): reconcile symptoms, weight, and prior findings; propose the next operational steps (who to see next, what decision is pending, what evidence is still missing).

Figure 5: MDT as a relay: time-ordered notes and tests produce a sequence of state updates and decision points (multi-turn, multi-artifact). — **Figure 5**: MDT as a relay: time-ordered notes and tests produce a sequence of state updates and decision points (multi-turn, multi-artifact).

Examples: Healthcare Administration Agentic Tasks

Insurance agentic tasks

Insurance workflows are also RL-suitable because they are multi-step, require strong constraint-following, and have crisp verification signals. These tasks are operationally common in U.S. healthcare. Medical denials impact a very large share of claims, and denial rebuttals remain low-performance tasks despite repeated attempts to wrap general models around coding, billing, and prior authorization workflows.

Table 1: Catalog of administrative agentic tasks: descriptions, sources of difficulty, and verification signals. — **Table 1**: Catalog of administrative agentic tasks: descriptions, sources of difficulty, and verification signals.

Example: extracting evidence to refute a denial

To make the evaluation concrete, we show model performance on a denial-adjacent variant of Highlight Generation: extracting the exact chart evidence needed to rebut a denial. Each case has three independent judge scores to illustrate heterogeneity across cases and model behavior. Examples of failure included confusing episodes of care across time periods, adding evidence not present in the case, and omitting relevant facts.

Figure 6: Model performance of a student model judged by three other models.[1] — **Figure 6**: Model performance of a student model judged by three other models.[1]

Things we are experimenting with

Businesses across the economy are trying to embed AI into real workflows—whether in healthcare, finance, construction, or real estate. But one of the central data challenges is that we usually only observe scattered events within a workflow, or a small node within a much larger decision tree that unfolds across people, systems, and time.

That creates a serious risk in how tasks are selected. It is very easy to bias dataset construction toward tasks that are easier, more streamlined, or already close to solvable by current models. When that happens, we overestimate what these systems can really do. We mistake performance on narrow, clean slices of work for competence. Even worse, very narrowed down tasks can degrade model capabilities over time.

These are exactly the challenges our lab is trying to tackle. The question is how to build realistic, economically-meaningful task datasets that capture relatively complex, multi-turn, dynamic, multi-actor workflows across sectors of the economy. The data must be grounded, factual, safe to use, and privacy-preserving. But it must also be rich enough to teach models something deeper: the subtlety of human decision-making and mimic the multifaceted firm-specific and industry-specific knowledge pools.

Footnotes

1. Protege DataLab internal evaluation ( $\sim$ 30 cases). Coverage is computed as

\texttt{coverage\_score\_percent} = \left(\frac{\texttt{covered\_count}}{\texttt{total\_required\_elements}}\right)\times 100,

and the median (max score) observed in this batch is 11.67.

Contents