
Aug 29, 2025
From Prompts to Practice: What Multi-Institutional Studies Teach
Radiology has no shortage of data. Every day, millions of CTs, MRIs, and X-rays are captured and reported on. The real challenge isn’t volume; it’s curation. How do we turn narrative radiology reports into structured, labeled data that can drive model training, clinical decision support, and patient-facing tools?
A recent multi-institutional study led by Mayo Clinic and colleagues across UCSF, MGH, Emory, UCI, and Moffitt Cancer Center set out to test whether large language models (LLMs) could be used for this task. Instead of training custom models, the team focused on prompt engineering—creating carefully structured instructions to guide off-the-shelf LLMs in annotating reports.
Why Prompt Engineering Matters
Traditional natural language processing (NLP) approaches often fail when confronted with the variability of radiology reports. Some institutions rely heavily on structured templates, while others lean toward free-text narratives. Even within structured environments, radiologists frequently insert personal style or nuance that complicates downstream analysis.
LLMs, by contrast, are inherently better at adapting to this variability. The study showed that with a well-optimized prompt, models like Llama 3.1 70B could achieve impressive accuracy across sites. In some cases, performance approached human-level accuracy, particularly for well-defined findings such as cervical spine fractures.
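To make that concrete, here is a minimal sketch of what such a prompt-engineered annotation call could look like. The prompt wording, the cervical spine fracture question, the local Ollama-style endpoint at localhost:11434, and the model tag are illustrative assumptions, not the study’s actual configuration.

```python
# A minimal sketch of a prompt-engineered yes/no annotation call.
# The prompt wording, the local Ollama-style endpoint, and the model tag
# are illustrative assumptions, not the study's actual setup.
import json
import urllib.request

PROMPT_TEMPLATE = """You are annotating radiology reports.
Answer with exactly one word, "yes" or "no", and nothing else.
Question: Does this report describe an acute cervical spine fracture?

Report:
{report}

Answer:"""

def annotate(report_text: str, model: str = "llama3.1:70b") -> str:
    """Send a single report to a locally hosted model and return its raw reply."""
    payload = json.dumps({
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(report=report_text),
        "stream": False,
    }).encode("utf-8")
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",  # assumed local inference server
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"].strip()

print(annotate("FINDINGS: Nondisplaced fracture of the C5 vertebral body ..."))
```

Pinning the output to a single word is what makes downstream scoring automatic; it is also, as described below, where the models most often slipped.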
What stood out was how much the nature of the finding affected performance. Conditions that hinge on clear radiologic evidence, like fractures, were reliably captured. More ambiguous findings, like pneumonia—where radiologists often hedge with “possible” or “probable”—were less consistent. This highlights that the limitation isn’t just the model, but also the inherent uncertainty in medical communication.
Structured Reporting vs Narrative Style
One of the study’s side observations was that structured reporting does help, but it isn’t a silver bullet. Centers with strong template use tended to show higher accuracy, but radiologists often still added narrative phrasing that threw off results. This suggests that while templates improve extractability, they won’t fully solve variability.
Interestingly, prompts engineered at one site sometimes performed better at another. This reflects how local style and institutional context play as much of a role as the model itself.
Lessons on AI Errors
Another important finding was the behavior of models when they got things wrong. Instead of simply answering “yes” or “no” as instructed, models sometimes generated paragraphs of explanation. Even when the content was technically correct, these outputs were counted as failures because they ignored the prompt format.
The team also noted that chat-based models struggled more than instruction-tuned versions. Chat interfaces often aim to be conversational, which led to irrelevant elaborations instead of concise answers. Instruction-tuned prompts yielded far more reliable outputs.
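Here is a rough sketch, under the assumption that the prompt asked for a one-word answer, of how that strict scoring might be applied; the helper name and example replies are invented for illustration.

```python
# A minimal sketch of strict format checking, assuming the prompt asked for a
# one-word "yes"/"no" answer. Replies that deviate from that format are treated
# as failed annotations, even when a longer explanation happens to be correct.
from typing import Optional

def parse_binary_answer(raw_reply: str) -> Optional[str]:
    """Return 'yes' or 'no' when the reply follows the instructed format, else None."""
    answer = raw_reply.strip().lower().rstrip(".")
    if answer in ("yes", "no"):
        return answer
    return None  # verbose, conversational, or off-topic reply: discarded

print(parse_binary_answer("Yes."))                                    # -> yes
print(parse_binary_answer("The opacity could represent pneumonia."))  # -> None
```

The same check covers the hallucination cases discussed next: when a model drifts off topic, its reply simply never parses, so the annotation is discarded rather than trusted.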
The Hallucination Problem
One of the most persistent challenges the researchers faced was hallucination. When uncertain, the models often abandoned the simple yes/no format and produced verbose, off-topic replies. These outputs couldn’t be trusted, so they were discarded from the results.
This is a reminder that LLMs don’t always “fail gracefully.” Instead of signaling uncertainty, they can overcompensate with confident but irrelevant text. For clinical applications, this kind of behavior is not just inconvenient—it’s potentially dangerous.
In the podcast conversation, Dr. Mana Moassefi explains in more detail how her team managed these hallucinations.
Why This Matters
The implications go beyond annotation. If LLMs can reliably label reports, they can create massive, curated datasets that fuel the next generation of AI imaging models. They could also be extended into patient portals, offering lay-friendly summaries of findings while flagging when clinical correlation is required.
Looking further ahead, the concept of agentic AI—where multiple specialized models collaborate, such as one extracting diagnoses, another quantifying uncertainty, and another communicating risk—could reshape how radiology findings are shared with both clinicians and patients.
Transparency will remain a challenge. AI reasoning doesn’t mirror human logic, and sometimes validation will matter more than interpretability. But as this study shows, careful design can move us closer to trustworthy, reproducible outcomes.
The Future of AI in Radiology
Radiologists often worry about AI as a replacement. A more constructive framing is that AI will shift the field from simply diagnosing toward improving screening, consistency, and patient communication. Humans remain essential, but tools like LLMs can extend their reach.
This work demonstrates that big insights can come not from building bigger models, but from designing better prompts, running multi-institutional collaborations, and tackling real-world data variability head-on.
If you want to hear a deeper dive into the study and its broader implications, you can check out my conversation with Dr. Mana Moassefi on Imaging Informatics Unplugged.