Large Language Models in Clinical Oncology: A Comparative Evaluation

A comparative evaluation of five large language models in clinical oncology reveals their strengths and weaknesses and highlights the need for continued research into their clinical use.

The growing use of large language models (LLMs) for medical information retrieval has sparked interest in their potential applications in clinical oncology. A recent comparative evaluation tested five publicly available LLMs on 2044 oncology questions spanning a comprehensive range of topics in the field. The responses were compared with a human benchmark, offering insight into the capabilities and limitations of LLMs in clinical oncology.

The Study

Rydzewski and colleagues compared the performance of five LLMs on a set of multiple-choice clinical oncology questions against two references: a random-guess algorithm and the performance of radiation oncology trainees. Each LLM was asked to provide an answer, a confidence score, and an explanation for each of 2044 unique questions, and the full question set was run in three independent replicates. The authors assessed the models' accuracy, their self-appraised confidence, and the consistency of their responses across the replicates.
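To make these measures concrete, the short Python sketch below computes accuracy per replicate, a consistency rate across replicates, and the expected accuracy of uniform random guessing. It is an illustration only: the function names, the toy answer key, and the definition of consistency used here (all replicates give the same answer) are assumptions, not the authors' published method.

```python
# Illustrative sketch only: names, data, and metric definitions are assumptions.

def accuracy(answers, key):
    """Fraction of questions answered correctly in one replicate."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def consistency(replicates):
    """Fraction of questions on which every replicate gave the same answer."""
    per_question = list(zip(*replicates))
    return sum(len(set(q)) == 1 for q in per_question) / len(per_question)

def random_guess_expected_accuracy(option_counts):
    """Expected accuracy of uniform random guessing, given options per question."""
    return sum(1 / n for n in option_counts) / len(option_counts)

# Toy data: four questions, three replicates of one hypothetical model.
key = ["A", "C", "B", "D"]
replicates = [
    ["A", "C", "D", "D"],
    ["A", "B", "D", "D"],
    ["A", "C", "D", "B"],
]

mean_accuracy = sum(accuracy(r, key) for r in replicates) / len(replicates)
consistency_rate = consistency(replicates)
baseline = random_guess_expected_accuracy([4, 4, 4, 4])

print(f"mean accuracy: {mean_accuracy:.3f}")
print(f"consistency:   {consistency_rate:.3f}")
print(f"random guess:  {baseline:.3f}")
```

Note that the third question in the toy data is answered consistently but incorrectly in every replicate, which shows why consistency alone is not a measure of correctness.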

The Results

The study found that only one of the five LLMs (GPT-4) scored higher than the 50th percentile when compared with human trainees, even though all five showed high self-appraised confidence. The remaining LLMs had much lower accuracy, in some cases similar to that of the random-guess strategy. LLMs scored higher on foundational topics and lower on clinical oncology topics, especially those related to female-predominant malignancies. The authors found that combining model selection, self-appraised confidence, and output consistency helped identify more reliable outputs.
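One way to picture how confidence and consistency might be combined is the Python sketch below, which flags a question's output as reliable only when all replicates agree and the mean self-appraised confidence clears a threshold. The `Response` class, the 0-to-1 confidence scale, the unanimity rule, and the 0.8 threshold are all hypothetical choices made for illustration; the article does not state the rule the authors actually used.

```python
# Illustrative sketch only: the scale, threshold, and rule are assumptions.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Response:
    answer: str
    confidence: float  # self-appraised confidence, assumed to lie in [0, 1]

def assess(responses, min_confidence=0.8):
    """Return (majority answer, reliable flag) for one question's replicates.

    An output is flagged reliable only if every replicate gives the same
    answer and the mean self-appraised confidence meets the threshold.
    """
    counts = Counter(r.answer for r in responses)
    top_answer, top_count = counts.most_common(1)[0]
    unanimous = top_count == len(responses)
    mean_conf = sum(r.confidence for r in responses) / len(responses)
    return top_answer, unanimous and mean_conf >= min_confidence

stable = [Response("B", 0.9), Response("B", 0.95), Response("B", 0.85)]
unstable = [Response("B", 0.9), Response("C", 0.9), Response("B", 0.9)]
unsure = [Response("A", 0.5), Response("A", 0.5), Response("A", 0.5)]

stable_result = assess(stable)      # consistent and confident
unstable_result = assess(unstable)  # confident but inconsistent
unsure_result = assess(unsure)      # consistent but not confident
```

A filter like this says nothing about correctness on its own; model selection, the third factor the authors combined, would be applied before it by restricting attention to the better-performing model.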

Clinical oncology topics pose a significant challenge for LLMs

Implications and Future Directions

This study demonstrated a need to assess the safety of implementing LLMs in clinical settings and suggested the presence of training bias, in the form of medical misinformation related to female-predominant malignancies. The results highlight the importance of evaluating LLM performance in clinical oncology and of identifying strategies to improve accuracy and reliability.

“The study’s findings underscore the need for continued research into the applications and limitations of LLMs in clinical oncology.” - Rydzewski et al.
