Can Generative AI Enhance Doctors' Diagnostic Skills? The Future of Clinical Reasoning

Published Date: 02/07/2024

New studies compare the clinical reasoning abilities of large language models to those of physicians, showing promise for AI-assisted diagnosis.

Background

Daniel Restrepo, MD, a hospital medicine specialist in the Department of Internal Medicine at Massachusetts General Hospital, has published two research papers comparing the clinical reasoning abilities of large language models (LLMs) to those of physicians.

Massachusetts General Hospital is a teaching hospital of Harvard Medical School and a biomedical research facility located in Boston, Massachusetts.

Large language models (LLMs) are a form of artificial intelligence (AI) that can process large amounts of information from sources such as the internet and generate answers to questions that read like a conversation. How well these models perform specific tasks has become a growing area of study across the healthcare field.

Clinical reasoning refers to the thought processes that allow doctors to reach a diagnosis, and it is perhaps the most important procedure physicians perform on a daily basis. Cognitive errors in clinical reasoning can lead to misdiagnosis, which is unfortunately quite common and affects patients worldwide.

Study Methods 

Two studies were conducted. The first was a live comparison of how a human doctor and an LLM approached a diagnostic mystery. The second compared the reasoning skills of human doctors to those of an LLM known as GPT-4.

In the live comparison study, the physician and the LLM were each presented with the case of a 35-year-old man referred to the emergency department with low blood pressure and a fast heart rate. Both were asked to explain their reasoning, and each step of one's reasoning was compared with the output of the other.

In the second study, 21 resident and 18 attending physicians assessed clinical cases divided into segments of information. They were asked to verbalize their reasoning and differential diagnosis as they progressed through each set of clinical information. The answers were graded by experts in clinical reasoning who were blinded to whether each response came from a human doctor or the LLM.
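To make the segmented-case protocol concrete, here is a minimal, hypothetical sketch of how sequential case segments might be presented to an LLM through a chat-style API. The case text, prompt wording, and model name are illustrative assumptions for this sketch, not the study's actual materials.

```python
# Hypothetical sketch: revealing a clinical case to an LLM one segment at a
# time and asking for updated reasoning and a differential after each step.
# Assumes the OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY set
# in the environment; the case details below are illustrative only.
from openai import OpenAI

client = OpenAI()

case_segments = [
    "Triage note: 35-year-old man referred to the ED with hypotension "
    "and tachycardia.",
    "History: two weeks of fevers, sinus congestion, and bloody nasal "
    "discharge.",
    "Labs: elevated creatinine; urinalysis shows red blood cell casts.",
]

# The conversation grows as each new segment is revealed, mirroring how
# the study released clinical information to physicians in stages.
messages = [{
    "role": "system",
    "content": "You are assisting with a diagnostic reasoning exercise. "
               "After each new piece of information, explain your "
               "reasoning and give a ranked differential diagnosis.",
}]

for segment in case_segments:
    messages.append({"role": "user", "content": segment})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    print(f"--- After: {segment}\n{answer}\n")
```

A sketch like this only shows the elicitation side; in the study itself, the responses at each step were graded by blinded experts in clinical reasoning.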

Results 

The live head-to-head demonstration with an internal medicine physician yielded interesting observations. Both the physician and LLM came up with the correct diagnosis of granulomatosis with polyangiitis, a rare inflammatory disease, but the two went about it in very different ways. The physician relied on clinical reasoning and diagnostic schemas for inflammatory disease categories, whereas the LLM was more focused on matching the patient’s pattern of symptoms to a diagnosis.

The second study found that GPT-4 performed comparably to both resident and attending physicians in certain measures of clinical reasoning. Specifically, GPT-4 suggested the correct diagnosis about 40% of the time, and the correct diagnosis was included in its initial list of differential diagnoses 67% of the time. However, GPT-4 had more frequent instances of incorrect clinical reasoning (~14%) compared to residents (~3%) and attendings (12.5%).

Conclusion 

The studies suggest that while AI may have shortcomings in its reasoning capabilities, it can still augment the abilities of diagnosticians and help keep patients safe. However, significant further study is required to address considerations such as bias, “hallucinations” (false information generated by the chatbots), and data safety and privacy.

The future of diagnosis is not diagnosticians versus AI, but rather diagnosticians alongside AI, with the technology augmenting but not replacing the clinical reasoning process.

FAQs:

Q: What is the purpose of the studies?

A: The studies aimed to learn whether AI could improve a physician’s ability to diagnose patients by comparing the clinical reasoning abilities of large language models to those of physicians.

Q: What is clinical reasoning?

A: Clinical reasoning refers to the thought processes that allow doctors to reach a diagnosis and is perhaps the most important procedure that physicians perform on a daily basis.

Q: What are large language models?

A: Large language models (LLMs) are a form of artificial intelligence (AI) that can process large amounts of information from sources like the internet and generate answers to questions that read like a conversation.

Q: What did the studies find?

A: The studies found that AI may have shortcomings in reasoning capabilities, but it can still augment the abilities of diagnosticians and help keep patients safe.

Q: Can AI be used for diagnosing patients?

A: While AI shows promise, it is premature to offer these tools in patient care. Significant further study is required to address considerations such as bias, “hallucinations” (false information generated by the chatbots), and data safety and privacy concerns.
