I have been thinking a lot about artificial intelligence (AI) lately. Not just because I’m an instructor in AcademyHealth’s upcoming course on AI, where attendees will learn how to leverage data and AI for high-impact research. Not just because others are talking about it more than ever. And not just because it is influencing so much more than we have seen before. I am thinking more about it because lately I’ve been doing a lot of thinking with AI. And that is an important change across the nearly 30 years I have been in the field.
My first real experiences with AI were all about games. I was in graduate school, taking my first course on AI with a bunch of computer science students. At the time, there was a lot of excitement about the chess match between Deep Blue and Garry Kasparov. Kasparov won that match, but it would be the last time; a year later the AI system won the rematch, and humans were no longer the best at chess.
The Deep Blue program, while impressive at the time, was three generations of AI away from where we are currently. It was an expert-based system, meaning that its logic was created by experts writing rules for how it should “think,” with thousands or millions of rules refined by testing against examples. And it was a grandmaster’s lifetime away from our current era of generative AI, ushered in by large language models (LLMs) such as ChatGPT.
It would be difficult to find a time since that AI chess victory when the excitement and hype around AI were higher than they are today, which makes AI particularly difficult to evaluate. But evaluate it we must, especially in applications in health care and health services, where our first responsibility is to do no harm.
Recently, we attempted to evaluate LLMs on the simple task of extracting information from clinical notes. This is a task that has been studied for decades. In 1994, researchers demonstrated that a rule-based AI system using natural language processing (NLP) could extract findings from notes as well as physicians. Later, I extended this same study and tried to automatically generate the rules. For three decades, researchers in natural language processing have continued to build and refine systems, with modest improvements. These systems are not perfect but, depending on the task, they can be as good as or better than human readers at identifying concepts or meaning in text. When we evaluated LLMs, we compared them against these NLP systems and found that the LLMs were still not as good as NLP.
I was surprised by this result for two reasons. First, I had expected LLMs to perform better. I had used them for simple tasks, and was amazed by how well they seemed to “understand” the language I was using with them or the words they produced. Second, I was surprised that I considered the result of the study almost irrelevant.
Only two months after ChatGPT was released in November 2022, it had an estimated 100 million users, making it the fastest-growing consumer application in history at that point. Since then, millions of people have been trying to use these new technologies in increasingly creative ways. By contrast, thirty years after studies showed that NLP systems could work with medical text reports, those systems are still considered more something we study than something we use. NLP may have outperformed LLMs on a specific task in a research setting, but LLMs were already being used in real-world settings in ways that NLP never was.
The greatest change the current AI boom brings to health care may not be the technology behind it, but the broader acceptance of it. In this environment, evaluations of AI technology should be based on its actual impact, not on measures of performance for specific tasks in a lab. Regardless of how well NLP has performed in lab settings, we still don’t regularly use computers to extract information from narrative text in health care. LLMs are more likely than NLP to change this, not because they perform better on individual tasks, but because they are more likely to be used. And ultimately, that is what makes the difference.
AI is transforming health research, but are we fully leveraging its potential? In AcademyHealth’s upcoming course, Leveraging AI for High-Impact Health Research, participants will explore how AI, machine learning, and large language models (LLMs) are changing the field, and how they can apply these advancements in their work. Learn more and register here.