Reflections on 2025
As I write reflections on the year 2025, it feels important to do two things at once. First, to reflect on 2025 alone—as a discrete year. But second, and maybe more importantly, to reflect on a slightly longer arc stretching from 2024 to just yesterday, the end of 2025. The arc of my learnings through Protege and DataLab.
From around the start of the pandemic through early 2024—until Protege really began to take shape—I carried a persistent feeling that something in my ability to work on hard problems was stagnating. I thought a lot about Tyler Cowen’s ideas on stagnation: how to detect it, what it looks like at a personal level, whether I was living inside it. I was probably having a low-grade existential crisis.
Then came 2024. I got to build fast—really fast—working on what would become a rocket ship in 2025 through Protege. By 2025, my role had evolved into something I never could have fully anticipated. In any given week, I was speaking with dozens of companies—probably hundreds over the course of the year—each with the most peculiar, fascinating datasets that could be valuable for teaching AI. Another significant share of my time was spent talking to researchers at foundation model labs, people working on problems that are genuinely hard.
When I look back at these conversations in aggregate, they are probably worth more than a decade of formal learning. It still mesmerizes me as a scientist that every morning I wake up, join a call, and learn something new about the future—something I hadn’t thought about carefully before. I already think a lot about the future, so the fact that the learning curve remains this steep is surprising in the best possible way, and a reminder of how thrilling this moment at Protege and DataLab has been.
What follows are some things from 2025 that didn’t surprise me—and some things that absolutely did.
Things That Didn’t Surprise Me
1. AI becomes what we want it to become.
Just as in the digital revolution, innovation is steered far more by demand and market size than by invention. Invention rarely leads; demand does. A good paper to be reminded of is Acemoglu and Linn (2004) - access here.
The idea is that where the market for lucrative, rapid AI adoption is largest is where we will see the most innovation (and invention). People have taken up AI much faster than places (institutions) have. The consumer AI market overtook enterprise; though enterprise is rising rapidly, it will only succeed if workers (people) champion it. As such, most of the conversations I have about innovation in use cases center on the micro, individual-level embedding of AI in daily life, at home or at work.
2. People use AI as a companion.
Most people use ChatGPT, Gemini, Claude, or similar tools as a kind of friend. They talk to it, ask it questions, write documents for their boss, solve homework problems. I now use it to run empirical experiments faster. I use it as a college professor—to create rubrics that grade exams (to the TA’s lack of amusement) or to generate alternate versions of tests for students who need to take the test early because, say, their dad is getting married. These are genuinely useful things.
3. But indispensability is still elusive.
Looking forward, it’s clear how hard it is for these technologies to find their digital place where they become indispensable, like the internet itself. Outside of San Francisco and super-utilizing firms, how will everyday people use AI the way they use the internet or computers broadly?
4. Evaluating AI is fundamentally a measurement problem—and measurement theory is hard.
One way to evaluate AI, or even talk about its performance, is to pick a y-axis. You can say this task used to take y hours and now it takes fewer. That’s one axis. You could measure accuracy on the task. That’s another y-axis. But once you move beyond that, everything becomes murky very fast. What are we actually measuring? True intelligence? Performance on benchmarks? Test-taking ability?
In the RL environments world, it’s common to have a higher-compute model grade a lower-compute model. But is that the right way to evaluate? It’s unclear. Are human experts—who are themselves bounded in rationality and intelligence—the right evaluators either? Model evaluation is genuinely hard.
That didn’t surprise me. People have spent decades studying measurement theory, survey methods, and value-added models of human capital. Economists have been thinking about this since at least the 1940s. It’s a very difficult problem. You can say a model is better because it performs well on specific benchmarks, and you can hope that if it’s strong across many local maxima—say, 50 different healthcare benchmarks—then the global maximum improves too. I aim to keep spending a lot of time on this question.
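To make the y-axis point concrete, here is a minimal sketch (all numbers invented, both models hypothetical) of how the same two models can trade places depending on whether you measure hours saved or accuracy:

```python
# A minimal sketch (all numbers invented) of the "pick a y-axis" problem:
# the same two hypothetical models rank differently depending on the metric.
BASELINE_HOURS = 3.0  # assumed human time per task

results = {
    "model_a": {"hours": [1.0, 2.0, 0.5], "correct": [1, 0, 1]},
    "model_b": {"hours": [0.5, 0.5, 0.5], "correct": [1, 0, 0]},
}

for name, r in results.items():
    hours_saved = sum(BASELINE_HOURS - h for h in r["hours"])  # y-axis 1: time
    accuracy = sum(r["correct"]) / len(r["correct"])           # y-axis 2: accuracy
    print(f"{name}: hours saved = {hours_saved:.1f}, accuracy = {accuracy:.0%}")

# model_b wins on hours saved (7.5 vs 5.5); model_a wins on accuracy (67% vs 33%).
# Neither axis alone tells you which model is "better" -- that is the measurement problem.
```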
5. Agentic AI is shaping up to be the next real frontier. That is not surprising.
In 2026, I’m especially interested in economically productive agentic applications—what kinds of data could teach systems to expand expert-level work, democratize skills, and allow people from weaker educational or labor markets to compete globally. Productive applications also include any time-saving or utility-maximizing actions more generally; agentic does not mean just enterprise.
I can’t wait for AI as a true research assistant. Right now I use Claude Code, Refine.ink, Research Rabbit–esque tools, and so on, but there’s clearly an opportunity to combine all of this into a unified agent (Deep Research mode, but better). I know peers in bench science struggle with this the most—half of the lab protocol is discussed in meetings, the other half lives in a PhD student’s notebook (not .py but actual paper!), and so on.
6. Healthcare was always going to be a major use case.
People love talking about personal problems. They upload images of their bodies, discuss skin issues, and talk openly about mental health. It was never surprising to me that healthcare would be one of AI’s strongest domains. Mental health AI conversations are definitely undercounted.
7. I was not surprised to find myself using LLMs to read emotions. I’m confident I’m not alone.
I often upload Slack messages from colleagues into ChatGPT or Gemini and ask, “What is this person really trying to say? Can you read between the lines?” And it’s remarkable—and a little ironic—that I ask a large language model to do that for me.
8. Pre-training gains are plateauing—where pre-training already happened.
This wasn’t surprising. It means innovation in reinforcement learning—or whatever comes after RL—is essential. This was the theme at NeurIPS and maybe the theme of 2025.
9. But some domains still have not unlocked data at scale.
There are domains and use cases where new data at scale can still drive meaningful pre-training gains. The challenge is not that data is scarce—it’s that useful data is scarce. Finding useful data matters more than ever.
10. Interdisciplinarity matters. Science is at its best then.
Academia always pushed me to think about why interdisciplinary work matters. It didn’t surprise me how much AI benefits from interdisciplinary collaboration. At DataLab at Protege, the best work happens when a program-evaluation teammate works alongside an ML researcher—and you throw an oncologist into the mix. This year I learned a ton from conversations with peers in sociology, film and the liberal arts, and medicine, and I met my first primate researcher (now a colleague at Indiana University)!
There’s good work on this idea in the literature (medicine, in particular, has long embraced interdisciplinary approaches). I often return to the question “At what stage of their careers are scholars most creative?”, which Weinberg and Galenson use to argue that experimental innovators work inductively, accumulating knowledge from experience, while conceptual innovators work deductively, applying abstract principles.
Because AI advances need to be rapid (a steep, piecewise function rather than a smooth concave arc), I’m a firm believer in conceptual innovators, especially when they’re surrounded by abstract ideas from nearby domains. You probably need both, but without conceptual innovation it’s hard to see invention happening. I’m looking forward to thinking more about interdisciplinary AI work in 2026 and to hosting interdisciplinary events.
Now… Things That Surprised Me
1. How fast AI reached everywhere, even if the reach is broad rather than deep.
Every day I’m surprised by new use cases. This summer in Cairo, a driver told me he uses ChatGPT with his five-year-old daughter to do math—in Arabic. I didn’t think it would work that well. He said it was excellent. One day my mom texted me: “Can you make me look like a pharaonic princess with AI?” A follow-up message asked what use I am if I work in AI and can’t make that happen. I suddenly realized that Nano Banana had made it to Cairo. It came through with a very vivid image of my mother as Cleopatra. She loved it.
2. The intensity of competition.
I didn’t expect the field to feel this much like the Manhattan Project. The labs feel like the U.S. versus Russia. Ideas are guarded. Conversations are fragile. Science bleeds easily—ideas leak, and so everyone’s guard is up. And rightly so. The stakes are so, so high, but the level of intensity surprised me.
In 2026, I want to be more intentional about how I communicate with research friends—how to protect ideas while still making progress. Competition is healthy. It propels science forward. But the pressure on AI researchers—many of them extraordinarily gifted and working punishing hours—is enormous. Mental health cannot be an afterthought.
If part of my role at Protege and DataLab is to be a data catalyst through research, I also want it to be to support research conversations that genuinely enable innovation, not hype.
3. Audio datasets that are truly useful are hard to come by.
Previous research has shown this, and one of the early signals we saw before launching our audio vertical was just how much data-efficiency research is still needed in this area. Whether it’s speech-to-text or text-to-speech models (the latter with even more stringent data requirements), it’s clear that de-identifying audio while maximizing data efficiency is difficult and requires innovation in de-identification approaches. Layer on top of that a language spoken by only 30 million people worldwide—the narrower the group, the harder de-identification becomes within Cartesian products of demographic attributes—and the problem becomes even more challenging. I’m excited to see our team build here.
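As a back-of-the-envelope illustration of why narrow populations are hard to de-identify, here is a minimal sketch; the attribute categories and their counts are hypothetical, and only the 30-million-speaker figure comes from above:

```python
# Minimal sketch (hypothetical attribute counts): each demographic attribute
# multiplies the number of cells in the Cartesian product, shrinking the
# average number of speakers who could plausibly "hide" in any one cell.
population = 30_000_000  # speakers of the language, per the post
attribute_levels = [("age band", 8), ("gender", 2), ("region", 25), ("dialect", 10)]

cells = 1
for attr, levels in attribute_levels:
    cells *= levels
    print(f"+ {attr:8s}: {cells:>5} cells, ~{population / cells:>12,.0f} speakers per cell")

# By the full product there are 4,000 cells and ~7,500 speakers per cell on
# average -- and real populations are far more skewed, so many cells are much smaller.
```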
4. The video generation problem necessitates specialization (at least for now).
Many of our video data consumers are models at startups outside the large labs. By focusing early on specific use cases, they’ve gained real ground, whether through specialization in dubbing, ad creation, sports editing, or similar applications. There’s ongoing debate about the future trajectory of specialized models versus world models, and about when each will prove more powerful. In media in 2025, I was surprised by the level of innovation coming from independent startup labs.
5. How much my students hate AI.
This surprised me deeply. I taught a 100-person Intro Micro class at Tulane last semester. On day one, I asked students how they felt about AI. One student said, “Doesn’t it use a ton of water just to answer a stupid question?” Another asked, “What happens to internships when AI does them all?” The anxiety was palpable.
Higher education is nervous right now, and rightly so. What do we educate for? What do we train for?
In 2026, I want to think seriously about how AI can help people become more skilled than AI, not less. How do we accelerate learning trajectories by 50×? How do we prepare students for roles like agent supervisors—or jobs that don’t exist yet? I am really looking forward to this.
6. Learning to code now matters more, not less.
Even with vibe-coding chat windows and tools like Claude Code and Lovable, understanding code architecture is more important than ever. You need to know when something is good enough, when it isn’t, and when human intervention is required. If you can’t code, you can’t do that.
I vibe code all the time. I build MVPs and pass them to engineers. That doesn’t make me an engineer—it makes me a terrible one. But it reinforces how far behind higher education is in teaching these skills. We were already behind in data and coding literacy, and now it is time to call a code red.
7. New tasks will come on the scene that humans were never able to do anyway.
AI will enable tasks humans could never do: reading genomic data at scale, matching mutations to trials instantly. These aren’t replacements; physicians cannot pick up a FASTQ/BAM file of tumor DNA and simply start reading and diagnosing from it. These are expansions of what’s possible. How we will obtain data and evaluate AI usefulness in domains with no ground truth is a hard problem, and one we are all thinking about.
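To see why raw sequencing output is unreadable on its face, here is a minimal sketch of the standard four-line FASTQ record layout; the file name is hypothetical:

```python
# Minimal sketch: what "reading" a FASTQ file actually looks like. Each record
# is four lines (@id, bases, '+', per-base quality); a tumor sample holds
# millions of these, and no single record carries clinical meaning on its face.
def first_reads(path, n=3):
    with open(path) as f:
        for _ in range(n):
            record = [f.readline().rstrip() for _ in range(4)]
            if not record[0]:  # end of file
                return
            yield record[0], record[1], record[3]  # id, sequence, quality

for read_id, bases, quality in first_reads("tumor_sample.fastq"):  # hypothetical file
    print(read_id, bases[:40], "...")
# Diagnosis requires alignment, variant calling, and annotation downstream --
# exactly the kind of scale work AI can do and a human reader cannot.
```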
8. I am more hopeful than ever.
This surprised me the most. I’ve never been more hopeful about the future.
The productivity gains from LLMs are unlike anything I’ve seen. Summarizing documents, checking code, understanding datasets, adversarially testing ideas—it’s a constant catalyst. I have a hard time not believing that GDP will rise significantly because of AI.
Administrative friction has slowed progress for decades. AI removes it. I genuinely believe this technology will reshape humanity more than the internet, the printing press, or the steam engine.
Looking Ahead
It’s been a big year at Protege. And a big life year. I moved from New Orleans to Indiana. I’m writing this from an Airbnb, waiting to sign a lease. I’ll start teaching my first class at Indiana University on January 12th. I’m looking forward to research dinners, wild datasets, foundation-lab conversations, and all the strange, funny data warts DataLab will uncover.
Most of all, I’m grateful—to be learning this fast from colleagues and research friends, to be part of this moment, and to feel hopeful about the future.
Here is a random, by no means exhaustive subset of readings that I either shared or that were shared with me and that I liked. They are in the random order they sit in on my phone, not in order of preference:
- https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law
- I enjoyed this foundation models for biology seminar series
- https://s3.amazonaws.com/fieldexperiments-papers2/papers/00748.pdf
- https://gottlieb.ca/papers/HealthCareJobs.pdf
- https://www.nature.com/articles/d41586-025-01739-z
- My go to justification for why I am not working on just economics on any given day: https://x.com/MarcODeGirolami/status/1941821397451370973?s=20
- https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4754678
- Say what you want about RL and the future, but “how do we make AI as indispensable as the internet?” is a good framework.
- Multimodal AI in medicine of course
- When this came out I was a bit surprised but also not surprised
- What our government’s AI plan is
- Towards understanding and representing misalignment in AI
- Lots was published on medical conversational AI in 2025, including the new AMIE paper
- Went down an LLM as a judge bias rabbit hole for a bit and thought what experiments and tools can I setup as guardrails
- The Med-PaLM 2 paper was published in January 2025 (it feels like early innovation given how far progress has come)
- I cooked this recipe a lot because the sauce has a funny name (wasakaka sauce)
- Who am I if I did not read “The Beautiful Dataset”
- The team at Scale’s “Rubrics as Rewards”; I generally went down a rabbit hole on rubric design.
- Models show the same type of risk-taking behavior as humans (prospect theory from Kahneman and Tversky). Who does not love some Thinking, Fast and Slow.
- https://openai.com/index/introducing-gpt-5/#livestream-replay
- I told my students what the unicorn benchmark is. I am a LaTeX PDF addict and they found out that day lol…
- This NBER paper intrigued me, mainly because one of the bottlenecks to wide-scale adoption of AI is malpractice risk
- https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf
- Good book; I also dabbled in evolutionary biology.
- Learned from this collaboration and the published results they put out.
- https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4aaa5/why-language-models-hallucinate.pdf
- https://cookbook.openai.com/examples/realtime_prompting_guide
- Followed Benchling closely; generally, AI in drug discovery and science is exciting beyond measure
- Google DeepMind x Epoch AI report
- From the team at Microsoft on benchmarks for hard diagnostic issues like wound care and derm
- https://arxiv.org/abs/2509.14448
- https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills
- On agents and tasks and humans in the loop by Ivan Zhao
