Evaluating the performance of Generative Pre-trained Transformer-4 (GPT-4) in standardizing radiology reports

Amir M Hasani; Shiva Singh; Aryan Zahergivar; Beth Ryan; Daniel Nethala; Gabriela Bravomontenegro; Neil Mendhiratta; Mark Ball; Faraz Farhadi; Ashkan Malayeri

doi:10.1007/s00330-023-10384-x

Eur Radiol. 2023 Nov 8. doi: 10.1007/s00330-023-10384-x. Online ahead of print.

Affiliations

¹ Laboratory of Translation Research, National Heart, Lung, and Blood Institute, NIH, Bethesda, MD, USA.
² Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA.
³ Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA.
⁴ Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA. Ashkan.Malayeri@nih.gov.

PMID: 37938381

Abstract

Objective: Radiology reporting is an essential component of clinical diagnosis and decision-making. With the advent of advanced artificial intelligence (AI) models like GPT-4 (Generative Pre-trained Transformer 4), there is growing interest in evaluating their potential for optimizing or generating radiology reports. This study aimed to compare the quality and content of radiologist-generated and GPT-4 AI-generated radiology reports.

Methods: In this comparative study, 100 anonymized radiology reports were randomly selected and analyzed. GPT-4 processed each report to generate a corresponding AI-generated report. Quantitative and qualitative analyses were used to assess similarities and differences between the two sets of reports.

Results: The AI-generated reports showed comparable quality to radiologist-generated reports in most categories. Significant differences were observed in clarity (p = 0.027), ease of understanding (p = 0.023), and structure (p = 0.050), favoring the AI-generated reports. AI-generated reports were more concise, with 34.53 fewer words and 174.22 fewer characters on average, but had greater variability in sentence length. Content similarity was high, with an average Cosine Similarity of 0.85, Sequence Matcher Similarity of 0.52, BLEU Score of 0.5008, and BERTScore F1 of 0.8775.
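
As an illustration of the kind of quantitative comparison reported above, the following is a minimal sketch, not the authors' pipeline. The abstract does not state how the metrics were computed, so the choices here are assumptions: cosine similarity is taken over simple bag-of-words count vectors, the "Sequence Matcher" score is assumed to correspond to the ratio from Python's `difflib.SequenceMatcher`, and sentence-length variability is measured as a population standard deviation. The two report strings are hypothetical examples, and BLEU and BERTScore are omitted because they require third-party packages.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher
from statistics import pstdev


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors (an assumed representation)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sequence_similarity(a, b):
    """Character-level similarity ratio in [0, 1] from difflib.SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def length_stats(text):
    """Word count, character count, and spread of sentence lengths (in words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    return {
        "words": len(tokenize(text)),
        "chars": len(text),
        "sentence_length_sd": pstdev(lengths) if lengths else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical report texts, for illustration only.
    radiologist = "The liver is normal in size. There is a simple cyst in the right kidney."
    generated = "Liver: normal size. Right kidney: simple cyst."

    print("cosine:", round(cosine_similarity(radiologist, generated), 3))
    print("sequence:", round(sequence_similarity(radiologist, generated), 3))
    print("radiologist:", length_stats(radiologist))
    print("generated:", length_stats(generated))
```

Averaging such per-pair scores over all report pairs would give summary values of the kind quoted in the Results, although the exact numbers depend heavily on the tokenization and vectorization chosen.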

Conclusion: The results of this proof-of-concept study suggest that GPT-4 can be a reliable tool for generating standardized radiology reports, offering potential benefits such as improved efficiency, better communication, and simplified data extraction and analysis. However, limitations and ethical implications must be addressed to ensure the safe and effective implementation of this technology in clinical practice.

Clinical relevance statement: The findings of this study suggest that GPT-4 (Generative Pre-trained Transformer 4), an advanced AI model, has the potential to significantly contribute to the standardization and optimization of radiology reporting, offering improved efficiency and communication in clinical practice.

Key points:
• Large language model-generated radiology reports exhibited high content similarity and moderate structural resemblance to radiologist-generated reports.
• Performance metrics highlighted the strong matching of word selection and order, as well as high semantic similarity between AI-generated and radiologist-generated reports.
• The large language model demonstrated potential for generating standardized radiology reports, improving efficiency and communication in clinical settings.

Keywords: Artificial intelligence; Digital health; Machine learning; Natural language processing.