DISCUSSION
As generative AI becomes increasingly embedded in clinical informatics, evaluating its reliability in domain-specific contexts such as urology is essential. This study provides a systematic evaluation of four widely used LLMs—ChatGPT-4.0, Gemini 1.5 Pro, Copilot (GPT-4-based), and Perplexity Pro—in the context of nocturia and nocturnal polyuria, two highly prevalent and distressing lower urinary tract conditions frequently encountered in urological practice. While all four models successfully produced responses to expert-formulated clinical questions, their overall performance varied substantially across thematic domains and quality criteria. To the best of our knowledge, this is the first study to systematically evaluate the performance of LLMs in addressing clinical content specifically related to nocturia and nocturnal polyuria.
Consistent with prior research evaluating LLMs in urology-related topics such as urolithiasis management (18), our findings revealed that ChatGPT-4.0 and Perplexity Pro consistently outperformed Gemini and Copilot in key areas such as diagnostic clarity, clinical accuracy, and procedural explanation. In particular, ChatGPT achieved the highest average score across all five quality criteria (relevance, clarity, structure, utility, and factual accuracy), while Copilot scored the lowest, often failing to provide guideline-based or adequately detailed responses. Gemini performed comparably to ChatGPT and Perplexity Pro in all thematic domains except “General Understanding”, where it scored significantly lower, suggesting that while its content accuracy is largely consistent, its introductory clarity and foundational summarization may require improvement. This domain-specific inconsistency is critical, given that nocturnal polyuria and nocturia often require nuanced diagnostic differentiation and personalized treatment planning.
These findings reinforce earlier reports in the literature demonstrating ChatGPT’s high accuracy in specialty-specific medical contexts. For example, Zhu et al. compared five large language models by posing 22 questions on prostate cancer, and ChatGPT achieved the highest accuracy rate among them (19). Similarly, Caglar et al. found that ChatGPT maintained a guideline adherence rate exceeding 90% in pediatric urology, highlighting its potential in medical education and patient counseling (20). Hacıbey and Halis further supported these results by showing that ChatGPT outperformed other LLMs in addressing clinically relevant questions regarding onabotulinumtoxinA and sacral neuromodulation (SNM) in the treatment of overactive bladder (15). Consistent with these studies, our evaluation showed that ChatGPT achieved near-perfect scores in the “General Understanding” and “Diagnostic Work-Up” domains.
Interestingly, Gemini exhibited high scores in the “Etiology and Pathophysiology” category, suggesting a potential strength in conceptual reasoning. However, both Gemini and Copilot showed limitations in domains requiring the synthesis of clinical guidelines and nuanced patient-centered reasoning. Copilot consistently scored the lowest across all evaluated domains, with particularly poor performance in factual accuracy and utility. While some of these shortcomings may stem from inherent architectural limitations or reliance on a general-purpose training corpus, other contributing factors likely include insufficient exposure to domain-specific medical content, lack of clinical fine-tuning, and potential dataset bias. These deficits are particularly critical in clinical communication contexts, where precision, guideline adherence, and applicability are essential. The findings underscore the necessity for future LLMs to be trained on structured, peer-reviewed clinical corpora and to undergo post-hoc validation aligned with specialty-specific standards. Supporting this, a recent evaluation of the Me-LLaMA model demonstrated that LLMs with access to curated clinical datasets significantly outperformed those trained primarily on unfiltered web-based content (21).
From a clinical utility standpoint, these findings carry significant implications. Nocturia and nocturnal polyuria are associated with sleep disturbances, falls, cardiovascular morbidity, and reduced quality of life—especially in older adults (2,3). Providing patients and clinicians with accurate, easily digestible information is essential for safe and effective management.
While LLMs generally demonstrated strong linguistic fluency, our results highlight that this does not always ensure clinical reliability. Copilot and, to a lesser extent, Gemini frequently produced responses lacking clinical precision, especially in diagnostic and management-related areas. Similar concerns have been echoed in recent literature, including studies evaluating AI in radiology (22), oncology (23), and urology (24), where model outputs sometimes conflicted with current standards of care.
Recent studies have demonstrated both the potential and the limitations of AI in clinical urology and broader healthcare. For example, Shah et al. reported that AI models have achieved promising results in the detection and grading of prostate cancer and the prediction of kidney stone composition. However, they cautioned that clinical integration requires large-scale validation and careful management of ethical concerns (25). Similarly, de Hond et al. reviewed the development and validation of AI-based prediction models, emphasizing that many published models lack sufficient external validation and are often built on data that do not fully represent real-world clinical diversity, thereby limiting their generalizability (26). Saraswat et al. further highlighted that the lack of explainability in “black-box” AI models creates barriers to clinical trust, citing specific cases where clinicians were reluctant to accept algorithmic recommendations without clear, interpretable reasoning (27). Our findings resonate with these prior observations: while advanced LLMs such as ChatGPT and Perplexity Pro performed well on structured, guideline-based questions, they were less reliable in nuanced, case-based scenarios, underscoring the continued need for explainable, validated, and context-aware AI tools in clinical practice.
The implications of these findings are particularly relevant given the increasing reliance on generative AI for patient counseling, academic learning, and even clinical triage. Although advanced LLMs show promising performance and may serve as supportive tools in clinical education and communication, their use in diagnostic or therapeutic decision-making should be approached with caution (28). Importantly, none of the models evaluated in this study disclosed uncertainty levels or cited peer-reviewed sources, features that are essential for safe clinical integration. Nevertheless, several practical pathways exist for integrating LLMs into clinical and educational workflows in urology: deployment within clinical decision support systems, patient-facing triage tools, and automated guideline consultation platforms. For example, AI-powered chatbots could provide initial guidance for patients reporting nocturia symptoms, assist clinicians in reviewing complex cases, or streamline documentation by generating summaries and templated clinical notes; a schematic example of such a patient-facing tool is sketched below. In training programs, LLMs may serve as interactive educational companions, simulating patient scenarios and reinforcing guideline-based reasoning. Successful integration will require rigorous validation, clear scope definition, and ongoing human oversight to ensure patient safety and high-quality care.
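As a purely illustrative sketch (not a tool developed or validated in this study), a patient-facing helper of the kind described above might wrap a general-purpose LLM API behind a fixed, guarded system prompt. The model name, prompt wording, and the triage_reply helper below are hypothetical assumptions, and any real deployment would require the validation, scope definition, and human oversight noted above.

```python
# pip install openai  -- uses the OpenAI Python SDK (v1.x); assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# Guarded, patient-facing instructions: plain language, explicit sourcing,
# explicit uncertainty, and a consistent referral to a clinician.
SYSTEM_PROMPT = (
    "You are an educational assistant for patients reporting nocturia. "
    "Give general, guideline-consistent information in plain language, "
    "name the guideline or source you are drawing on, say clearly when you are "
    "unsure, and always advise the patient to discuss their symptoms with a "
    "clinician. Do not provide a diagnosis or prescribe treatment."
)

def triage_reply(patient_message: str) -> str:
    """Return a guarded, patient-facing reply to an initial nocturia question."""
    response = client.chat.completions.create(
        model="gpt-4o",      # illustrative model identifier
        temperature=0.2,     # low temperature favours conservative wording
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": patient_message},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(triage_reply("I wake up three times a night to urinate. Should I be worried?"))
```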
In the context of growing clinical reliance on AI, the ethical and regulatory landscape for LLMs remains underdeveloped. As noted above, none of the evaluated models provided explicit uncertainty estimates or confidence scores alongside their responses. This lack of “uncertainty calibration” poses a significant risk: users may assume an AI-generated answer is fully reliable, even when the underlying model is uncertain or operating outside its domain of expertise. Furthermore, the absence of source attribution (the models do not cite peer-reviewed guidelines, original studies, or medical authorities) makes it difficult for clinicians and patients to verify the validity of the information provided. These limitations heighten the risk of misinformation, misinterpretation, and over-reliance on AI in clinical settings. For LLMs to be safely integrated into healthcare, robust frameworks for uncertainty communication, mandatory source citation, and continuous safety oversight by human experts will be essential. Developers and regulatory bodies must prioritize the inclusion of these features to ensure transparency, accountability, and the ethical use of generative AI in medicine.
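To make the idea of such metadata concrete, the following minimal sketch (our illustration, not an existing standard or any vendor’s API) shows how an LLM answer could carry a model-reported confidence value, explicit citations, and a human-review flag before being displayed to a patient or clinician; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Citation:
    source: str      # e.g. a named guideline, DOI, or PubMed ID
    statement: str   # the specific claim in the answer that this source supports

@dataclass
class AnnotatedAnswer:
    answer_text: str
    confidence: float                      # model-reported confidence in [0, 1]
    citations: list[Citation] = field(default_factory=list)
    reviewed_by_clinician: bool = False    # human-oversight flag

    def is_safe_to_display(self, min_confidence: float = 0.7) -> bool:
        """Gate display on calibration, sourcing, and human review."""
        return (
            self.confidence >= min_confidence
            and len(self.citations) > 0
            and self.reviewed_by_clinician
        )

# Example: an answer without citations or clinician review is withheld by default.
draft = AnnotatedAnswer(answer_text="Nocturnal polyuria is defined as ...", confidence=0.55)
assert draft.is_safe_to_display() is False
```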
This study has several strengths. The use of a standardized, thematically organized question set enabled structured comparisons across five clinically relevant domains. Independent scoring by two blinded expert evaluators yielded high inter-rater reliability (ICC = 0.91), and the multidimensional evaluation system provided a robust and nuanced performance profile for each AI model.
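For reproducibility, the agreement statistic can be computed directly from the two raters’ score matrix. The sketch below assumes a two-way random-effects, absolute-agreement, single-rater form, i.e. ICC(2,1); the manuscript does not state which ICC form was used, so this choice, like the demo ratings, is purely illustrative.

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: an (n_responses, n_raters) matrix, one row per rated answer and
    one column per blinded evaluator.
    """
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)   # per-response means
    col_means = scores.mean(axis=0)   # per-rater means

    # Two-way ANOVA decomposition without replication.
    ss_rows = k * ((row_means - grand_mean) ** 2).sum()
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()
    ss_error = ((scores - grand_mean) ** 2).sum() - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1).
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical example: five responses scored by two blinded raters on a 1-5 scale.
demo = np.array([[5, 5], [4, 4], [3, 4], [5, 4], [2, 2]], dtype=float)
print(f"ICC(2,1) = {icc_2_1(demo):.2f}")   # ~0.86 for these illustrative ratings
```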
Future research should explore the integration of LLMs into real-time clinical scenarios, comparing AI-assisted versus physician-led decision-making. Additionally, incorporating patient perspectives and evaluating user trust will be essential to determining the acceptability of these technologies in clinical environments. Developers of LLMs should also prioritize embedding up-to-date clinical guidelines, integrating source attribution, and designing models that can flag uncertain or lower-confidence responses.
Study Limitations
This study has several limitations. First, the use of static, one-shot prompting does not reflect dynamic clinical questioning. Second, the models were evaluated without real-world patient interactions and without access to browsing-enabled features, which may limit the depth and currency of their responses. Third, in the context of increasing regulatory scrutiny over generative AI in healthcare (e.g., the EU AI Act), the absence of transparent traceability and confidence calibration mechanisms in LLM outputs remains a critical barrier to clinical adoption (29). None of the evaluated models provided explicit uncertainty estimates or cited peer-reviewed sources to support their answers, which may increase the risk of misinformation and over-reliance on AI-generated content. Until future LLMs can reliably communicate their confidence and directly attribute recommendations to established clinical guidelines, their use in unsupervised clinical decision-making should be approached with extreme caution and subject to ongoing human oversight; future implementations in clinical decision support should include metadata layers that communicate uncertainty and cite sources, in line with ethical standards of medical practice. Fourth, the relatively limited sample size (25 questions) and the use of only two expert evaluators, although consistent with similar benchmarking studies, may restrict the generalizability of our results and reduce the ability to detect smaller differences between models; larger and more diverse question sets, as well as additional expert reviewers, will be needed to validate and extend these findings. Finally, although mean scores and standard deviations were reported for ease of interpretation and comparison with previous studies, Likert-type scale data are ordinal in nature; medians and interquartile ranges may therefore be more appropriate summary measures, as they represent central tendency and variability without assuming equal intervals between response categories.
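As an illustration of the ordinal-data point above, the short sketch below (using hypothetical ratings rather than study data) contrasts mean ± SD with median [IQR] summaries and applies a rank-based comparison that does not assume equal intervals between Likert categories.

```python
import numpy as np
from scipy import stats

# Hypothetical 1-5 Likert ratings for two models on the same question set.
model_a = np.array([5, 4, 5, 3, 4, 5, 4, 4, 2, 5])
model_b = np.array([3, 3, 4, 2, 3, 4, 3, 2, 3, 4])

# Interval-scale summary (as reported in the study for comparability).
print(f"Model A: mean +/- SD = {model_a.mean():.2f} +/- {model_a.std(ddof=1):.2f}")

# Ordinal-scale summary (median and interquartile range).
q1, q3 = np.percentile(model_a, [25, 75])
print(f"Model A: median [IQR] = {np.median(model_a):.1f} [{q1:.1f}, {q3:.1f}]")

# Rank-based comparison between two models (no equal-interval assumption).
u_stat, p_value = stats.mannwhitneyu(model_a, model_b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")
```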