AI system curbs hallucinations in automated chest X-ray reports

Chest X-Ray of congested lungs.

By Keith Page

Release Date: March 23, 2026

BUFFALO, N.Y. – A research team at the University at Buffalo is developing an artificial intelligence framework to improve the accuracy and clinical reliability of automated chest X-ray analysis used by radiologists. The method tackles a long‑standing problem in medical imaging: AI vision‑language models tend to drift away from the image as they write. The new framework, Category‑Wise Contrastive Decoding (CWCD), keeps them anchored to what’s actually visible on the X‑ray.

Current multimodal large language models (MLLMs) generate full radiology reports from chest X‑rays in a single step. As the report grows, the model increasingly relies on its own output rather than the image, a pattern that can produce clinically incorrect pairings. For example, since cardiomegaly and pulmonary edema frequently appear together in congestive heart failure, the model may predict pulmonary edema whenever it sees cardiomegaly, even if the X‑ray shows no evidence of it.

The UB research, “CWCD: Category-Wise Contrastive Decoding for Structured Medical Report Generation,” introduces a method that generates radiology findings one anatomical region at a time and uses a novel contrastive decoding technique to reduce common AI errors such as hallucinated or spurious pathology findings. The approach is designed to improve the overall efficiency of chest X-ray analysis so radiologists can devote more time to complex cases.

“Chest X-rays are among the most frequently performed diagnostic procedures worldwide, yet they can be difficult to interpret because overlapping anatomy and subtle disease patterns can obscure important details,” said Mingchen Gao, PhD, associate professor of computer science and engineering at UB and a co-author of the study. “Although recent multimodal AI models have made progress, many still generate reports in a single pass. CWCD breaks the report into smaller, image-focused pieces, which reduces errors and strengthens clinical performance. It’s an early step, but it shows real potential to support radiologists by making their workflow more efficient, acting as a second reader to flag discrepancies and helping to triage high-risk patients.”

The research team also includes David Doermann, PhD, SUNY Empire Innovation Professor and chair of UB’s Department of Computer Science and Engineering, along with UB PhD student Mahesh Bhosale and UB MS student Shantam Srivastava. Their work has been accepted as an oral presentation at the 2026 Medical Imaging with Deep Learning Conference, an honor reserved for top-rated papers. The conference will be held in Taipei, Taiwan, in July.

Focusing on one region of the X-ray at a time

CWCD divides the report into eight clinically meaningful regions – heart, lungs, abdomen and other key anatomical areas – and produces findings for each by contrasting the original X‑ray with a version in which that region is masked. If the model predicts the same finding even when the relevant region is hidden, CWCD treats that as a sign the prediction may be driven by language patterns rather than visual evidence.

“MLLMs often learn to assume that if one disease appears, another must also be present simply because the two tend to show up together in the training data,” Srivastava said. “CWCD counters that behavior by structuring the generation process around anatomical categories and using contrastive decoding between full and masked contexts. This helps isolate the information that truly comes from the image and prevents the model from inserting findings based on learned pairing patterns rather than visual evidence.”
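The contrastive step described above can be sketched in a few lines. The following toy example is an illustration only, not the paper's implementation: the function names, the two-token vocabulary, the logit values, and the weighting factor `alpha` are all assumptions made for demonstration. The idea it shows is the one the researchers describe: if the model assigns a finding high probability even when the relevant region is hidden, that probability is likely coming from learned language patterns, and subtracting the masked-image logits suppresses it.

```python
import numpy as np

def contrastive_decode(logits_full, logits_masked, alpha=1.0):
    # Subtract masked-region logits from full-image logits.
    # Findings the model still predicts with the region hidden are
    # likely driven by co-occurrence priors, not visual evidence,
    # so the subtraction down-weights them. `alpha` (hypothetical
    # here) controls how strongly the prior is penalized.
    return logits_full - alpha * logits_masked

# Toy vocabulary: index 0 = "no edema", index 1 = "edema".
# With the full image, the model slightly favors "edema".
logits_full = np.array([1.0, 1.4])
# With the lung region masked, it STILL favors "edema" —
# a sign the prediction rests on language patterns.
logits_masked = np.array([0.2, 1.3])

scores = contrastive_decode(logits_full, logits_masked)
print(scores)  # [0.8, 0.1] — "no edema" now wins the decode
```

In this toy case the raw model would emit "edema" (1.4 > 1.0), but after contrasting against the masked context the score flips (0.8 > 0.1), which is exactly the hallucinated-pairing behavior CWCD is built to counter.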

Improved clinical accuracy across the board

The research team evaluated CWCD using public radiology datasets that contain hundreds of thousands of chest X‑ray images paired with clinical reports. CWCD consistently outperformed LLaVA‑Rad, MAIRA‑2 and other leading radiology AI models on both language quality and clinical accuracy. It produced fewer irrelevant or incorrect findings and generated reports that more closely matched expert‑written references.

The team conducted a follow‑up ablation study to better understand what was actually driving the improvements. It showed that while every component improved accuracy on its own, the largest gains occurred when all were combined. According to Srivastava, this demonstrates that CWCD works because it reinforces visual grounding at multiple points in the generation process.

“Our evaluations found that CWCD not only improved clinical validity, but also captured relevant abnormalities without overpredicting,” Srivastava said. “Radiologists need AI systems that are both fluent and clinically trustworthy. We believe CWCD moves us closer to that goal by reducing the kinds of errors that undermine confidence in automated reporting.”

The authors note that while CWCD marks an important step forward for foundational radiology MLLMs, it still has limitations. Since the model analyzes eight regions and makes multiple passes for every word it generates, the process remains slow and computationally heavy. They also point out that the structured reports used for training were produced by reformatting free‑text reports with a language model, a process that can lead to subtle inconsistencies or bias. These challenges underscore the need for additional refinement before CWCD can be used in clinical settings.

“We still consider this an exploratory project. There’s more work ahead to ensure the framework can operate in clinical environments without significant computational costs, including the time and processing power required to run the model,” Gao said. “Our next steps are to add medical knowledge, strengthen the model’s reasoning and make the framework interpretable so we can explain why it arrives at a particular prediction.”

Media Contact Information

Media Relations (University Communications)
330 Crofts Hall (North Campus)
Buffalo, NY 14260-7015
Tel: 716-645-6969
ub-news@buffalo.edu