Chatbots had mixed results when it came to providing direct-to-patient cancer-related advice and treatment strategies for a wide variety of cancers, according to two studies in JAMA Oncology.
In tests of GPT-3.5 (OpenAI) with prompts designed to obtain treatment strategies for different kinds of cancers, most answers were in accordance with National Comprehensive Cancer Network (NCCN) guidelines, but one-third were at least partially nonconcordant, reported Danielle Bitterman, MD, of Mass General Brigham and Harvard Medical School in Boston, and colleagues in a research letter.
They suggested that clinicians "advise patients that LLM [large language model] chatbots are not a reliable source of treatment information."
Findings from the other study -- which tested four AI chatbots including GPT-3.5 on direct-to-patient advice -- were more positive, suggesting that their use "generally" produced accurate information on cancer-related search inquiries, but that these responses were not readily actionable and were written at a college level, according to Abdo Kabarriti, MD, of the State University of New York Downstate Health Sciences University in New York City, and colleagues.
"Findings of this study suggest that AI chatbots are an accurate and reliable supplementary resource for medical information," wrote Kabarriti and colleagues, "but are limited in their readability and should not replace healthcare professionals for individualized healthcare questions."
In an accompanying editorial, Atul Butte, MD, PhD, of the University of California San Francisco, said that while the results of these studies may suggest "our core belief in GPT technology as a clinical partner has not sufficiently been earned yet," the chatbots used in these studies are off the shelf and likely do not have specific healthcare training.
"Newer LLMs are now being released that have specific healthcare training, such as Google's Med-PaLM 2," he wrote. "Future medical evaluation studies are likely going to need to compare across several LLMs."
Moreover, Butte said the real potential of these tools in cancer care is that they can be trained from the very best centers, and then used "to deliver the right best care through digital tools to all patients, especially to those who do not have the resources or privilege to get that level of care."
Treatment Recommendations
For their study, Bitterman and colleagues developed four prompt templates for treatment recommendations for 26 different kinds of cancers (for a total of 104 prompts), and benchmarked the chatbot's recommendations against 2021 NCCN guidelines. Concordance of the chatbot output with NCCN guidelines was assessed by board-certified oncologists.
Findings showed the chatbot provided at least one recommendation for 102 of 104 (98%) prompts, and all outputs with a recommendation included at least one NCCN-concordant treatment. However, 35 of these 102 outputs (34.3%) also recommended one or more nonconcordant treatments, and 13 of 104 responses (12.5%) were "hallucinated," meaning they weren't part of any recommended treatment.
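For readers who want to check the arithmetic, the reported counts can be turned back into percentages with a few lines of Python. This is only a sketch built from the figures quoted above; the variable names and the `pct` helper are illustrative and do not come from the study.

```python
def pct(numerator, denominator, digits=1):
    """Return numerator/denominator as a percentage, rounded to `digits` places."""
    return round(100 * numerator / denominator, digits)

# Counts as reported in the article (names are illustrative assumptions)
prompt_templates = 4
cancer_types = 26
total_prompts = prompt_templates * cancer_types  # 104 prompts in all

with_recommendation = 102   # prompts that returned at least one recommendation
nonconcordant = 35          # outputs that also included a nonconcordant treatment
hallucinated = 13           # responses with a hallucinated treatment

summary = {
    "any recommendation": pct(with_recommendation, total_prompts, 0),
    "partly nonconcordant": pct(nonconcordant, with_recommendation),
    "hallucinated": pct(hallucinated, total_prompts),
}

for label, value in summary.items():
    print(f"{label}: {value}%")
```

Note that the nonconcordance rate uses the 102 outputs with a recommendation as its denominator, while the hallucination rate uses all 104 prompts.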
"The chatbot did not purport to be a medical device, and need not be held to such standards," Bitterman and colleagues wrote. "However, patients will likely use such technologies in their self-education, which may affect shared decision-making and the patient-clinician relationship. Developers should have some responsibility to distribute technologies that do not cause harm, and patients and clinicians need to be aware of these technologies' limitations."
Consumer Health Info
In their study, Kabarriti and colleagues inputted Google Trends' top five search queries related to skin, lung, breast, colorectal, and prostate cancer into four chatbots. Outcomes included the quality of consumer health information based on the DISCERN instrument (a scale of 1-5, with 1 representing low quality) and the understandability and actionability of this information based on domains of the Patient Education Materials Assessment Tool (PEMAT), scored from 0% to 100%, with higher scores indicating greater understandability and actionability.
They determined the quality of text responses generated by the four AI chatbots was good (median DISCERN score of 5, with no misinformation identified). Understandability was considered moderate (median PEMAT Understandability score of 66.7%) but actionability was poor (median PEMAT Actionability score of 20%), with authors noting that responses were written at the college level. "This finding suggests that AI chatbots use medical terminology that may not be familiar or useful for lay audiences," Kabarriti and colleagues said.
"These limitations suggest that AI chatbots should be used supplementarily and not as a primary source for medical information," they added. "To this end, AI chatbots typically encourage users to seek medical attention relating to cancer symptoms and treatment."
AI and LLMs are not yet perfect and can carry biases, Butte said in his editorial.
"These algorithms will need to be carefully monitored as they are brought into health systems," he continued. "But this does not alter the potential of how they can improve care for both the haves and have-nots of healthcare."
Disclosures
The research letter study was funded by the Woods Foundation.
Bitterman reported institutional research support from the American Association for Cancer Research outside the submitted work. Co-authors reported multiple relationships with industry.
Kabarriti had no disclosures. A co-author reported relationships with the National Cancer Institute and Gilead Sciences.
Butte reported multiple relationships with industry, foundations, and academic institutions.
Primary Source
JAMA Oncology
Chen S, et al "Use of artificial intelligence chatbots for cancer treatment information" JAMA Oncol 2023; DOI: 10.1001/jamaoncol.2023.2954.
Secondary Source
JAMA Oncology
Pan A, et al "Assessment of artificial intelligence chatbot responses to top searched queries about cancer" JAMA Oncol 2023; DOI: 10.1001/jamaoncol.2023.2947.
Additional Source
JAMA Oncology
Butte AJ "Artificial intelligence -- From starting pilots to scalable privilege" JAMA Oncol 2023; DOI: 10.1001/jamaoncol.2023.2867.