The Responses of Artificial Intelligence to Questions About Urological Emergencies: A Comparison of 3 Different Large Language Models

Abstract

Objective: This study aimed to compare the accuracy and adequacy of responses provided by three different large language models (LLMs) utilizing artificial intelligence technology to fundamental questions related to urological emergencies.

Material and Methods: Nine distinct urological emergency topics were identified, and seven fundamental questions were formulated for each topic (63 in total): two related to diagnosis, three related to disease management, and two related to complications. The questions were posed in English on three different free AI platforms (ChatGPT-4, Google Gemini 2.0 Flash, and Meta Llama 3.2), each utilizing different infrastructures, and the responses were documented. The answers were scored by the authors on a scale of 1 to 4 based on accuracy and adequacy, and the results were compared using statistical analysis.

Results: When all question-answer pairs were evaluated together, ChatGPT had a slightly higher mean score than Gemini and Meta Llama; however, no statistically significant difference was detected among the groups (3.8 ± 0.5, 3.7 ± 0.6, and 3.7 ± 0.5, respectively; p=0.146). When questions related to diagnosis, treatment management, and complications were evaluated separately, no statistically significant differences were detected among the three LLMs (p=0.338, p=0.289, and p=0.407, respectively). Only one response, provided by Gemini, was found to be completely incorrect (1.6%). No misleading or incorrect answers were observed in the diagnosis-related questions on any of the three platforms. In total, misleading answers were observed in two questions (3.2%) for ChatGPT, three questions (4.7%) for Gemini, and two questions (3.2%) for Meta Llama.

Conclusion: LLMs predominantly provide accurate results to basic and straightforward questions related to urological emergencies, where prompt treatment is critical. Although no significant differences were observed among the responses of the three LLMs compared in this study, the presence of misleading and incorrect answers should be carefully considered, given the evolving nature and limitations of this technology.

Keywords: urological emergencies, artificial intelligence, large language models

INTRODUCTION

Urological emergencies are clinical conditions that require immediate initiation of treatment, as delays in patient admission can lead to irreversible consequences (1). While time is critical in testicular torsion, early medical and surgical intervention is essential in conditions such as urosepsis, which can lead to multiorgan dysfunction and the need for intensive care (2,3).

As technology has advanced, both patients and healthcare providers have increasingly turned to the Internet to research the conditions they encounter (4). Large language models (LLMs) developed using artificial intelligence (AI) technology have demonstrated the ability to respond to queries and provide rapid information on even highly specialized topics. Over the years, this rapidly evolving technology has led to the development of AI assistants such as ChatGPT, Google Gemini, and Meta Llama, each utilizing distinct infrastructures. The use of AI assistants in medical contexts has gained increasing attention in recent years, paving the way for numerous studies (5), many of which have examined the efficacy of their application in various diseases (6). However, the adequacy and reliability of the responses provided by these assistants, which are easily accessible to patients in time-sensitive urological emergencies, remain questionable.

This study aimed to evaluate the accuracy and adequacy of responses provided by three different AI-powered platforms to fundamental questions regarding urological emergencies, focusing on diagnosis, treatment management, and complications.

MATERIALS AND METHODS

Nine different topics were identified as urological emergencies: Testicular Torsion, Hematuria, Obstructive Uropathies, Penile Fracture, Urosepsis, Paraphimosis, Fournier’s Gangrene, Priapism, and Trauma. For each topic, seven questions were prepared: two related to diagnosis, three related to treatment management, and two related to complications. The questions were selected from frequently asked fundamental questions, and the full list is provided in Table 1. To obtain concise rather than lengthy responses from each AI platform, the prompt “Can you answer in one paragraph?” was used. Each question was asked separately on a new page in ChatGPT, Gemini, and Meta Llama to prevent any influence from usage history, and the responses were documented. The answers were evaluated by the authors participating in the study and scored on a scale of 1 to 4 (1: Completely incorrect, 2: Correct but misleading, 3: Correct but insufficient, 4: Completely correct). The scores were recorded and grouped into diagnostic questions, treatment management questions, and complication-related questions, followed by statistical analysis. No real patients or patient information were involved in this study. The study was conducted in accordance with the Declaration of Helsinki.
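The question design and scoring rubric described above can be sketched as a small data structure. This is an illustrative Python sketch rather than anything the authors report using: the identifiers `TOPICS`, `QUESTIONS_PER_CATEGORY`, `RUBRIC`, and `build_question_slots` are hypothetical, and only the topic names, the per-topic category counts, and the rubric labels are taken from the text.

```python
from collections import Counter

# Topic names and rubric labels come from the Methods text;
# all identifiers below are hypothetical and chosen for illustration.
TOPICS = [
    "Testicular Torsion", "Hematuria", "Obstructive Uropathies",
    "Penile Fracture", "Urosepsis", "Paraphimosis",
    "Fournier's Gangrene", "Priapism", "Trauma",
]

# Number of questions prepared per topic in each category.
QUESTIONS_PER_CATEGORY = {
    "diagnosis": 2,
    "treatment management": 3,
    "complications": 2,
}

# The 1-to-4 scoring scale applied by the authors to every answer.
RUBRIC = {
    1: "Completely incorrect",
    2: "Correct but misleading",
    3: "Correct but insufficient",
    4: "Completely correct",
}

def build_question_slots():
    """Return one (topic, category, index) slot per question in the design."""
    return [
        (topic, category, i)
        for topic in TOPICS
        for category, count in QUESTIONS_PER_CATEGORY.items()
        for i in range(1, count + 1)
    ]

slots = build_question_slots()
per_category = Counter(category for _, category, _ in slots)
print(len(slots))          # total number of questions
print(dict(per_category))  # questions per category across all topics
```

Enumerating the slots this way reproduces the totals used in the analysis: 63 questions overall, split into 18 diagnosis, 27 treatment management, and 18 complication questions.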

Statistical Analysis
The mean and standard deviation of the scores, together with the number and percentage of answers, were documented for each AI model in the three question subgroups and in total. Because the scores are ordinal, the results were analyzed using a non-parametric test: the Friedman test was employed to compare the three groups. A p-value of < 0.05 was considered statistically significant. IBM SPSS Statistics version 26.0.1 (IBM Corp., Armonk, NY, USA) was used for the analysis.
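To make the comparison concrete, the following is a minimal Python sketch of a Friedman test for three related samples. It is not the authors' analysis, which was run in SPSS; the function names and the example scores are hypothetical. The sketch uses the tie-corrected form of the statistic found in standard textbooks, and it relies on the fact that with three groups the statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is exp(-x/2).

```python
import math

def average_ranks(row):
    """Rank the values in one row (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def friedman_three_groups(scores):
    """Tie-corrected Friedman statistic and p-value for rows of three related scores."""
    n, k = len(scores), 3
    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in scores:
        ranks = average_ranks(row)
        for j in range(k):
            rank_sums[j] += ranks[j]
        for value in set(row):
            t = row.count(value)
            tie_term += t ** 3 - t
    correction = 1 - tie_term / (n * k * (k ** 2 - 1))
    if correction == 0:
        raise ValueError("every row is fully tied; the statistic is undefined")
    raw = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)
    stat = raw / correction
    p_value = math.exp(-stat / 2)  # chi-square upper tail, valid only for df = 2
    return stat, p_value

# Hypothetical scores for five questions (ChatGPT, Gemini, Meta Llama); not study data.
example_scores = [(4, 3, 3), (4, 4, 3), (3, 4, 2), (4, 3, 4), (4, 2, 3)]
stat, p_value = friedman_three_groups(example_scores)
print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")
```

Each row holds the three platforms' scores for one question, so the test compares the platforms within questions rather than pooling all scores. With a 1-to-4 scale most rows contain ties, which is why the tie correction matters here.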

Large Language Models in Artificial Intelligence
ChatGPT-4 is an AI model developed by OpenAI, a company based in the USA with headquarters in San Francisco, California. It offers more comprehensive language understanding and generation capacity than its predecessors. The model can be used across various domains, ranging from everyday conversational language to technical or scientific texts. While interacting with a user, it interprets the context of the question and generates appropriate responses. It also handles multilingual content, enabling communication in different languages.

Gemini 2.0 Flash is an AI model developed by Google DeepMind, with primary facilities located in Mountain View, California. This model is capable of processing visual and textual data simultaneously. This feature allows the model to respond to text-based questions while also interpreting and analyzing visual content. Its most notable characteristic is its ability to integrate information learned from diverse data sources, enabling it to make sense of complex scenarios.

Llama 3.2 is an AI model developed by Meta, with major operations based in Menlo Park, California. It stands out among language models for its efficiency. Its ability to deliver high performance with lower computational power has made it a preferred tool for large-scale projects and diverse applications. The model learns from a vast number of textual sources and provides accurate responses even in complex textual contexts.

RESULTS

When the responses to the 18 diagnosis-related questions were compared, the mean scores for ChatGPT, Gemini, and Meta Llama were calculated as 3.8 ± 0.4, 3.8 ± 0.4, and 3.6 ± 0.5, respectively (p=0.338). ChatGPT provided completely correct answers to 15 (83.3%) questions, while Gemini and Meta Llama provided completely correct answers to 14 (77.8%) and 11 (61.1%) questions, respectively. None of the three platforms provided completely incorrect or misleading answers to any of the diagnosis-related questions.
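The diagnosis figures above can be cross-checked with a short calculation. Because no diagnosis answer was scored as completely incorrect or misleading, every answer that was not completely correct (score 4) must have been scored 3, so the score lists can be rebuilt from the reported counts. The sketch below is illustrative only: the helper `reconstruct_scores` is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not state which form was reported.

```python
from statistics import mean, stdev

N_DIAGNOSIS = 18  # diagnosis-related questions per platform

# Number of "completely correct" (score 4) diagnosis answers, as reported above.
completely_correct = {"ChatGPT": 15, "Gemini": 14, "Meta Llama": 11}

def reconstruct_scores(n_correct, n_total):
    """Rebuild the score list, assuming every remaining answer was scored 3."""
    return [4] * n_correct + [3] * (n_total - n_correct)

for platform, n_correct in completely_correct.items():
    scores = reconstruct_scores(n_correct, N_DIAGNOSIS)
    pct = 100 * n_correct / N_DIAGNOSIS
    # stdev() is the sample standard deviation (an assumption about the reporting).
    print(f"{platform}: {pct:.1f}% completely correct, "
          f"mean {mean(scores):.1f} ± {stdev(scores):.1f}")
```

Under these assumptions the recomputed percentages, means, and standard deviations agree with the rounded values reported for the diagnosis subgroup.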

When the responses to the 27 treatment management-related questions were compared, the mean scores for ChatGPT, Gemini, and Meta Llama were calculated as 3.9 ± 0.5, 3.6 ± 0.8, and 3.8 ± 0.5, respectively (p=0.289). ChatGPT provided completely correct answers to 24 (88.9%) questions, while Gemini and Meta Llama provided completely correct answers to 21 (77.8%) and 22 (81.5%) questions, respectively. Gemini provided a completely incorrect answer to 1 (3.7%) question, while the other platforms gave no completely incorrect answers. Insufficient or misleading answers were observed in 3 (11.1%), 5 (18.5%), and 5 (18.5%) questions for ChatGPT, Gemini, and Meta Llama, respectively.

When the responses to the 18 complication-related questions were compared, the mean scores for ChatGPT, Gemini, and Meta Llama were calculated as 3.8 ± 0.5, 3.8 ± 0.5, and 3.6 ± 0.6, respectively (p=0.407).

Overall, when all topics were considered, the mean scores for ChatGPT, Gemini, and Meta Llama were calculated as 3.8 ± 0.5, 3.7 ± 0.6, and 3.7 ± 0.5, respectively (p=0.146). ChatGPT provided completely correct answers to 54 (85.7%) questions, while Gemini and Meta Llama provided completely correct answers to 50 (79.4%) and 45 (71.4%) questions, respectively. The mean scores and percentages of correct answers for the three platforms are presented in Table 2. The mean scores of the three LLMs are shown in Figure 1.

DISCUSSION

AI applications have advanced rapidly in recent years, becoming an integral part of daily life. In the field of healthcare, they have been the subject of studies in a wide range of areas, including disease diagnosis, treatment management, prediction of complications, and interpretation of imaging and pathology examinations (6). Large language models (LLMs) powered by AI provide rapid responses by interpreting written text, scanning open sources, and summarizing information (7). This capability raises the possibility of their use by both patients and healthcare providers. Although algorithms developed for use by healthcare providers have not yet entered routine practice, their widespread adoption is anticipated in the near future.

Meanwhile, the accuracy and adequacy of these platforms, which are used by patients to obtain information, have become a topic of interest. The correctness and adequacy of responses provided by LLMs in patient education have been examined across various subtopics (8). This study aimed to investigate whether the responses generated by LLMs to basic questions in urological emergencies, which may require time-sensitive decision-making, are consistent with the literature, accurate, and reliable.

Urological emergencies encompass a variety of conditions, ranging from testicular torsion, which requires immediate intervention, to hematuria, which may allow for a relatively longer diagnostic window but can still lead to urgent outcomes. The lack of awareness of testicular torsion among patients and their families, together with delayed hospital presentation, can have devastating consequences, including organ loss and future infertility. A study investigating the causes of delayed treatment in testicular torsion found that only 23.8% of cases underwent timely surgery. Misdiagnosis and initial consultation with a non-urologist were identified as risk factors for orchiectomy, emphasizing the importance of proper technical training and referral to prevent delays in the diagnosis and treatment of testicular torsion (9). In cases of testicular torsion presenting with scrotal pain, consulting an LLM in remote areas with limited healthcare access could potentially reduce the time to initial presentation, thereby preventing orchiectomy.

Another example is urolithiasis, a highly prevalent condition in the general population. Although hospital visits and the need for analgesic treatment due to renal colic are common, patients may prefer to manage the condition without seeking medical attention based on prior experiences or anecdotal information. However, the development of fever and infection during this process may result in complicated urinary tract infections, such as pyelonephritis with obstruction, which, if left untreated, may progress to sepsis and multiorgan failure (10). Therefore, the lack of awareness among patients about the risk of sepsis in cases of renal colic complicated by infection may result in adverse outcomes in individuals who do not seek medical care. A comprehensive study examining factors related to mortality in obstructive pyelonephritis concluded that delayed decompression was associated with increased mortality, with higher rates observed in weekend admissions (11).

Another condition, penile fracture, occurs as an unexpected medical event in men. The dramatic presentation, including an audible snap during sexual intercourse and the appearance of hematoma, often signals the urgency of the situation even to untrained individuals. However, in such acute medical scenarios, the accuracy and reliability of responses provided by a free AI platform, which patients might consult to determine the urgency and potential complications, are of critical importance. In this study, we prioritized evaluating the responses of LLMs for patient education and guidance in these contexts.

Our results demonstrated that AI platforms generally provide accurate and adequate responses to basic questions regarding urological emergencies. The responses of the three AI assistants were predominantly similar, and no statistically significant differences were found among the results. A recent study examining the use of ChatGPT for self-diagnosis in orthopedic conditions suggested that, although it could serve as a potential initial step in accessing healthcare, its results were inconsistent, and emphasized the necessity of including clear language encouraging users to seek expert medical opinions (12). Another study investigating the use of AI platforms for emergency medical conditions highlighted that, even if the results are consistent, the ambiguity of sources and the presence of misleading information regarding the timing of medical interventions should be carefully considered due to potential risks (13). Scott et al., in their study evaluating AI-generated responses to urology patient messages, noted that ChatGPT performed better on simple questions than on complex ones, suggesting its potential to assist care teams (14). A recent systematic review examining the use of LLMs in patient care underscored the need for caution due to the uncertainties inherent in this technology (15).

Furthermore, ethical considerations must be addressed, particularly concerning the reliance on AI tools without professional supervision. As AI systems evolve, ensuring transparency in source attribution and decision-making logic becomes essential. Healthcare professionals must be aware of the limitations of these tools and use them as supplementary rather than primary decision-making instruments.

Studies involving LLMs must take several limitations into account. First, the platforms are unstable, under ongoing development, and liable to change rapidly over time, so findings must be interpreted in light of the specific state of the platforms at the time of the study. Second, our study focused on basic and straightforward questions, with responses summarized in paragraph form for evaluation. The likelihood of inaccuracies or misleading information may increase with more complex and lengthy responses. Since we aimed to investigate basic questions in emergency scenarios, we believe it would be inappropriate to draw conclusions about complex urological emergency conditions based on these questions and answers. Given the continuous advancement and widespread adoption of these platforms, we consider it crucial to assess their accuracy and reliability on an ongoing basis.
