DISCUSSION
The use of AI models in medicine is rapidly increasing, and various studies have examined prognostic prediction in prostate cancer (11). In the existing literature, AI is reported to show promising results in prostate cancer diagnosis and staging by combining imaging, pathology and clinical data (11,14). However, studies that directly compare AI chatbots with clinical risk nomograms and examine performance differences, especially in complex patient groups, are limited. Our study is an important step towards filling this knowledge gap and emphasizes the need for careful validation before AI is used clinically. Given that traditional nomograms have undergone years of validation against specific clinical parameters, AI tools must be tested with similar rigor.
This study compared the predictions of ChatGPT-4o, an AI-based chatbot, with those of the MSKCC nomogram, a tool commonly used in clinical practice for preoperative risk prediction in prostate cancer. Our findings revealed that ChatGPT-4o's predictions were highly correlated with the nomogram's overall but exhibited significant inconsistencies for certain prediction endpoints, especially in the high-risk and locally advanced patient groups. These results are critical to understanding both the potential and the current limitations of AI-based tools in clinical practice.
The overall analysis of our study showed high, significant positive correlations between ChatGPT-4o and the MSKCC nomogram for OCD, ECE, SVI and LNI. The correlation was strongest for the OCD predictions, and the ECE, SVI and LNI predictions likewise showed high, significant positive correlations overall. This suggests that ChatGPT-4o, owing to its capacity to learn from large datasets, can produce outputs similar to those of traditional methods in complex clinical decision support tasks such as prostate cancer risk prediction. The strong correlations observed in the low- and intermediate-risk patient groups support this potential: apart from the LNI prediction in the low-risk group, all predictions in these groups were significantly correlated. The most striking findings of our study, however, are the discrepancies in the high-risk and locally advanced patient groups. In the high-risk group, there was no statistically significant correlation for the OCD, SVI and LNI estimates; likewise, no significant correlation was found for the OCD and ECE predictions in the locally advanced group. This suggests that ChatGPT-4o may not produce predictions as reliable as those of traditional nomograms, especially when the disease is more advanced and complex.
The discrepancies observed in the high-risk and locally advanced groups may be explained by several factors. Large language models such as ChatGPT-4o are trained primarily on general internet-based sources rather than curated, domain-specific medical datasets, so their ability to accurately represent rare or complex clinical scenarios remains limited. Nomograms, in contrast, are derived from large patient cohorts with detailed clinical and pathological annotations, allowing them to model the heterogeneity of advanced disease more precisely. In these groups, tumor biology is often more aggressive and unpredictable, with greater variability in features such as extracapsular spread patterns, seminal vesicle involvement and nodal dissemination. Subtle distinctions in staging parameters (e.g., between cT3a and cT3b disease) may translate into markedly different risk profiles, yet such nuances are difficult for a language-based model to capture without access to structured radiological, pathological or molecular data. Furthermore, while ChatGPT-4o generates probability estimates by identifying linguistic patterns, it lacks true comprehension of the underlying pathophysiological mechanisms. Together, these limitations help explain the reduced concordance with nomogram predictions in the most clinically complex patient groups.
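To make the comparison concrete, the minimal sketch below shows how paired probability estimates from the two tools can be compared with a rank-based correlation such as Spearman's; the variable names and values are hypothetical illustrations and do not reproduce the study data.

```python
# Minimal sketch: comparing paired probability estimates (%) from a nomogram
# and an AI chatbot with Spearman's rank correlation. All values are
# hypothetical illustrations, not the study data.
from scipy.stats import spearmanr

# Hypothetical organ-confined disease (OCD) estimates for 10 scenarios
nomogram_ocd = [85, 72, 60, 44, 30, 78, 55, 41, 25, 66]
chatgpt_ocd = [80, 75, 58, 50, 35, 70, 60, 38, 28, 62]

rho, p_value = spearmanr(nomogram_ocd, chatgpt_ocd)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```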
Our study's use of synthetic patient scenarios representing each risk group provides a controlled comparison: it eliminates the variability inherent in real patient data and allows direct comparison of ChatGPT-4o and nomogram outputs. Furthermore, benchmarking against the MSKCC nomogram, a validated tool widely used in clinical practice, strengthens the clinical relevance of the results. On the other hand, the study has some limitations. Synthetic patient scenarios may not fully reflect the heterogeneity and clinical nuances of real-world patient populations. The use of a single AI chatbot (ChatGPT-4o) and a single nomogram (MSKCC) may limit the generalizability of the results. Finally, although 40 patient scenarios were sufficient for the overall statistical analyses, the small number of cases in the subgroups (10 scenarios per risk group) may have prevented smaller correlations from reaching statistical significance, as illustrated below.
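To illustrate this statistical power constraint, the following sketch computes the approximate minimum correlation needed to reach two-tailed significance at alpha = 0.05 with only 10 paired observations, using the standard t-approximation for a correlation coefficient; this is an illustrative calculation, not part of the study's analysis.

```python
# Illustrative calculation: the smallest correlation that can reach two-tailed
# significance at alpha = 0.05 with n = 10 paired observations, using the
# t-approximation t = r * sqrt(n - 2) / sqrt(1 - r**2).
from math import sqrt
from scipy.stats import t

n = 10
df = n - 2
t_crit = t.ppf(0.975, df)            # two-tailed critical t at alpha = 0.05
r_crit = t_crit / sqrt(t_crit**2 + df)
print(f"n = {n}: |r| must exceed about {r_crit:.2f} to reach p < 0.05")
# Prints roughly 0.63: under this approximation, even moderately strong
# agreement cannot reach significance in subgroups of this size.
```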
In conclusion, our findings suggest that ChatGPT-4o may be a promising tool for prostate cancer risk prediction, but it exhibits significant inconsistencies compared with existing nomograms, especially in complex scenarios such as high-risk and locally advanced disease. These findings emphasize the need for extensive validation and development studies on larger, real patient cohorts before AI can be widely used in clinical practice. Future research should focus on training AI models specifically with medical data and on integrating them as decision support tools for physicians.