14th International Trakya Family Medicine Congress

24-27 April 2025, Balkan Congress Center, Edirne

Evaluation of Responses Provided by ChatGPT and Gemini Artificial Intelligence Models to Hypertension-Related Queries

Merve Tocoglu, Elif Aktı, Hüsna Sarıca Çevik

Keywords: Artificial Intelligence, Chatbots, ChatGPT, Google Gemini, Hypertension

Aim:

Advances in artificial intelligence (AI) technologies have led to the rise of chatbot platforms such as ChatGPT and Gemini, which are now widely used in the medical field. However, both models have limitations regarding the accuracy, validation, and currency of their data. In this study, we aimed to compare the quality, understandability, accuracy, and readability of AI-generated information on hypertension.

Method:

Twenty questions were compiled in English in line with the European Society of Hypertension and American Heart Association guidelines. Each question was submitted to ChatGPT version 3.5 and Google Gemini. The modified DISCERN instrument (m-DISCERN) was used to assess quality, and the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) was used to evaluate understandability and actionability. The Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores were calculated to determine readability. An independent researcher evaluated the answers and categorised each as accurate or inaccurate.
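For readers unfamiliar with the two readability indices, the standard FRE and FKGL formulas can be sketched as follows. This is a minimal illustration, not the tool used in the study: the function names `count_syllables` and `readability` are hypothetical, and the syllable counter is a rough vowel-group heuristic, so its scores will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count runs of consecutive vowels and drop a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for an English text using the standard formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Flesch Reading Ease: higher scores mean easier text.
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level: approximate US school grade.
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl
```

Both indices depend only on average sentence length and average syllables per word, so long sentences and polysyllabic medical terms lower FRE and raise FKGL.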

Results:

ChatGPT responses had a mean quality score of 26.05±1.9, similar to that of Gemini (28.9±2.8). Mean understandability and actionability scores were lower for ChatGPT (66% and 35%, respectively) than for Gemini (76% and 38%, respectively). The mean readability scores for ChatGPT were 13.8±2.34 (FKGL) and 29.5±12.7 (FRE), corresponding to a college reading level (advanced) and a college-graduate level (very difficult to read), respectively. For Gemini, these scores were 9.5±1.8 and 51.5±10.2, corresponding to a high-school level (average) and a 10th-12th-grade level (fairly difficult to read), respectively. Both platforms' answers were accurate but lacked depth in certain areas.
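The FRE interpretations reported here match the conventionally cited Flesch score bands. A minimal sketch of that mapping, assuming the commonly used band labels (the function name `fre_band` is hypothetical, and the study may have worded its bands differently):

```python
def fre_band(score: float) -> str:
    """Map a Flesch Reading Ease score to its conventional difficulty band."""
    bands = [
        (90, "very easy (5th grade)"),
        (80, "easy (6th grade)"),
        (70, "fairly easy (7th grade)"),
        (60, "standard (8th-9th grade)"),
        (50, "fairly difficult (10th-12th grade)"),
        (30, "difficult (college)"),
    ]
    # Bands are checked from easiest to hardest; the first lower bound met wins.
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "very difficult (college graduate)"


print(fre_band(29.5))  # ChatGPT mean FRE -> very difficult (college graduate)
print(fre_band(51.5))  # Gemini mean FRE -> fairly difficult (10th-12th grade)
```

Note that the ChatGPT mean of 29.5 sits just below the 30-point boundary between the "difficult" and "very difficult" bands.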

Conclusions:

These findings suggest that although the information provided by ChatGPT in response to hypertension queries is more difficult to read, of lower quality, and more difficult to understand than that provided by Gemini, both platforms offer accurate information that needs to be enhanced in certain areas.
