OpenAI's ChatGPT failed to achieve a passing score on a self-assessment practice test from the American College of Gastroenterology (ACG), researchers reported.
Using questions from the ACG's 2021 and 2022 multiple-choice practice assessments, the GPT-3.5 and GPT-4 versions scored 65.1% (296 of 455 questions) and 62.4% (284 of 455 questions), respectively, according to Arvind Trindade, MD, of Northwell Health's Feinstein Institutes for Medical Research in Manhasset, New York, and co-authors.
Both versions of the artificial intelligence (AI) chatbot failed to achieve the required 70% grade to pass the exams, they reported in the American Journal of Gastroenterology.
"We were shocked to see that the benchmark is on the lower side, but it also provides framework in terms of improvement," Trindade said. "We know it's lower, so what do we need to do to improve it?"
"It really doesn't have an intrinsic understanding of a topic or issue, which a lot of people think it does," Trindade added. "For medicine, you want something that's going to be giving you accurate information, whether it's for trainees or even for patients that are looking at this, and you would want a threshold of 95% or more."
To conduct the testing, the investigators copied and pasted each question with its potential answers directly into ChatGPT. After the AI chatbot generated a response with an explanation, the authors selected the corresponding answer on the ACG's web-based assessment.
Each annual version of the assessment consists of 300 multiple-choice questions that include real-time feedback. The assessments are designed to mirror a test-taker's performance on the American Board of Internal Medicine's gastroenterology board examination.
In total, Trindade and team used 455 questions for each version of ChatGPT. They excluded 145 questions because of an image requirement. They used the GPT-3.5 version available on March 11, and reran the testing with the GPT-4 version when it became available on March 25.
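The arithmetic behind these figures can be checked in a few lines. The following is a minimal Python sketch for illustration only, not code from the study; the variable and function names are assumptions, and the numbers are those reported in this article.

```python
# Illustrative check of the figures reported in the article.
# All names here are hypothetical; only the numbers come from the text.
QUESTIONS_PER_YEAR = 300        # each annual ACG assessment
YEARS = 2                       # 2021 and 2022 assessments
EXCLUDED_IMAGE_QUESTIONS = 145  # dropped because they required an image
PASSING_THRESHOLD = 70.0        # percent needed to pass

# Questions actually posed to each ChatGPT version.
total_used = QUESTIONS_PER_YEAR * YEARS - EXCLUDED_IMAGE_QUESTIONS

# Correct answers reported for each version.
correct = {"GPT-3.5": 296, "GPT-4": 284}


def percent_correct(n_correct, n_total):
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * n_correct / n_total, 1)


scores = {model: percent_correct(n, total_used) for model, n in correct.items()}
passed = {model: score >= PASSING_THRESHOLD for model, score in scores.items()}

for model, score in scores.items():
    status = "pass" if passed[model] else "fail"
    print(f"{model}: {score}% of {total_used} questions ({status})")
```

Under these assumptions, the two assessments minus the image-based questions yield the 455 questions cited, and both scores fall below the 70% mark.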
While the researchers set 70% accuracy as the benchmark for this study, Trindade noted that the medical community should have much higher standards. He said the recent flood of papers showing ChatGPT passing other medical assessments might be overshadowing the fact that this technology is not ready for regular clinical use.
"You can define a threshold how you want it and say [ChatGPT] passed it, but is passing good enough for medicine?" Trindade said. "I would argue it's not. You want it to ace the exam."
"It's important for the medical community to understand it's not ready for prime time yet," he added. "And just because it passes the test doesn't mean that we should be using it."
Trindade acknowledged that this technology is moving at an incredible pace, and he has seen many people in medical settings using it. While the technology is here to stay, he said, medical professionals should be thinking about ways to optimize it for clinical use.
"From generation to generation, the way we learn and the way we're accessing data and information -- whether it's for educational purposes or even to answer a question pertinent to patient care with the patient in front of us -- the paradigm's shifting in how people are accessing information," he said.
The study is another example of research testing the performance of AI models on medical credentialing tests, which has become a way to represent the technology's capabilities as a medical tool.
These efforts had a breakthrough moment in December 2022 when Google researchers showed the company's medically trained AI model, Med-PaLM, achieved 67.6% accuracy and surpassed the common threshold for passing scores on a series of questions from the U.S. Medical Licensing Examination (USMLE). Those researchers went a step further in March, when Google announced that Med-PaLM 2, an updated version of this AI model, achieved 85% accuracy and performed at "expert" physician levels on a similar practice assessment using USMLE questions.
For its part, ChatGPT has been no stranger to showing it can pass accuracy thresholds for medical exams, such as a recent study showing it achieved 80.7% accuracy on a radiology board-style assessment. In another recent study, the AI chatbot was even found to beat physicians in answering patient-generated questions. That study showed evaluators preferred ChatGPT's responses more than 75% of the time when compared with real physician answers during a blinded evaluation.
This gastroenterology exam performance is the most recent example showing that AI models, especially those without any specific medical information and training, are not perfect tools for clinical use, according to Trindade.
"As these AI models and these platforms are coming out -- that make it so easy to type in a question and spit back an answer -- it's attractive because we're so busy these days," he said. "What we need to do is just take a step back, and I think [papers] like this will help establish that it's not ready for prime time."
Disclosures
Trindade reported consulting for Pentax Medical, Boston Scientific, Lucid Diagnostics, and Exact Sciences, and research support from Lucid Diagnostics.
Primary Source
American Journal of Gastroenterology
Suchman K, et al "ChatGPT fails the multiple-choice American College of Gastroenterology self-assessment test" Am J Gastroenterol 2023; DOI: 10.14309/ajg.0000000000002320.