Biostatistics is an essential component of medical research, serving as the foundation for data interpretation and inference in healthcare.1 It is crucial for evidence-based clinical decisions, offering tools to interpret and evaluate research data accurately. It helps clinicians uncover relevant information, compare treatment efficacy, and predict patient outcomes. Applying proper statistical methodologies to ensure adequate analysis and valid results helps reduce bias and confirm the significance of findings.2 The increasing demand for biostatistics expertise at academic healthcare centers requires novel solutions.3
Historically, Statistical Package for the Social Sciences (SPSS), Statistical Analysis System (SAS), and Stata4 have been the leading statistical software packages in biostatistics. SAS is known for its powerful features, but it comes at a higher cost and requires more expertise. In contrast, both SPSS and Stata strike a good balance between intuitive interfaces and affordable pricing, catering to a broad spectrum of users. Among free open-source options, R is the best-known programming language for statistical computing and graphics. However, its lack of a user-friendly interface can be challenging for beginners. Promising tools such as rBiostatistics.com offer the computational power of R on a free cloud-based platform while providing a user-friendly interface.5 Regardless of the software chosen, using credible tools for biostatistical analysis is vital: it supports the accuracy and reproducibility of research outcomes, so that the conclusions drawn are reliable and scientifically sound.
The rise of artificial intelligence, especially tools such as OpenAI’s Chat Generative Pre-trained Transformer 4 (ChatGPT-4), is transforming various industries. The ChatGPT-4 Data Analyst tool, previously known as Code Interpreter and Advanced Data Analysis,6,7 presents a new frontier for biostatistical computations.8 It allows users both to discuss the biostatistical tests used and to execute data analysis by chatting. The prompts are converted into Python code, which is then run to perform the analysis. This union of conversational AI and analytics offers a potentially powerful platform for researchers, as demonstrated in fields including COVID-19 surveillance and bioinformatics.6,7,9
Currently, the literature on ChatGPT-4 Data Analyst is limited, with some authors questioning the ethical use of ChatGPT-4 for data analysis.10 We aimed to assess the reliability of the ChatGPT-4 Data Analyst tool, and its improvement over time, by comparing its analysis results with those of the standard R package accessed through the rBiostatistics.com user interface.
METHODS
This is an experimental, comparative study conducted at the Analytics Data Unit of the Organ Transplant Center of Excellence, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia, in two periods: 5 to 10 October 2023 and 5 to 10 March 2024. The dataset, which included 2159 patients, and its ethics approval were obtained from the International Liver Surgery Outcomes Study (LiverGroup.org).11
We selected five variables for analysis: Age, representing a normally distributed continuous variable; Sex, a binary variable; Hospital-stay, a non-normally distributed continuous variable representing length of stay; Income group, an ordinal variable categorizing income levels; and Mortality, a binary outcome. The analysis was conducted with two statistical tools: the R package, accessed through the rBiostatistics.com GUI platform as the reference standard, and OpenAI’s ChatGPT-4 Data Analyst tool, an advanced large language model with statistical computation capabilities that uses Python.
Statistical tests were applied to the data using R as follows. Descriptive statistics included the mean, standard deviation, median, and specific percentiles. Chi-squared tests and Fisher’s Exact Test were used to assess associations between categorical variables. Independent samples T-tests were used to compare group means. Mann-Whitney U tests and Kruskal-Wallis tests were used to compare groups on non-normally distributed data. ANOVA was used for mean comparisons across multiple groups. A box-and-whisker plot showing hospital stay by income group and Kaplan-Meier survival curves were generated with the R package and then requested from ChatGPT-4.
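To make two of the tests listed above concrete, the following is a minimal sketch in standard-library Python of the descriptive summary and of the Mann-Whitney U statistic. It is not the code run by either tool in this study. The function names and the sample values are hypothetical, and the choice of the "inclusive" quantile method is an assumption made here because percentile conventions differ between software packages.

```python
import statistics


def describe(values):
    """Return mean, sample SD, median and quartiles for a list of numbers."""
    # method="inclusive" is an assumption of this sketch; other quantile
    # conventions give slightly different quartiles on small samples.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


def mann_whitney_u(x, y):
    """Return the Mann-Whitney U statistic for sample x versus sample y.

    U counts the pairs (xi, yi) with xi > yi; ties contribute 0.5.
    """
    u = 0.0
    for xi in x:
        for yi in y:
            if xi > yi:
                u += 1.0
            elif xi == yi:
                u += 0.5
    return u


# Hypothetical hospital-stay values (days) for two outcome groups.
stay_survived = [1, 2, 3]
stay_died = [2, 3, 4]

summary = describe([1, 2, 3, 4, 5])
u_survived = mann_whitney_u(stay_survived, stay_died)
u_died = mann_whitney_u(stay_died, stay_survived)
```

The two U statistics always sum to the product of the group sizes, which gives a quick sanity check when results from two programs are compared.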
In parallel, an identical series of analyses was conducted through ChatGPT-4 Data Analyst “version 25 September 2023” and “version Feb 2024”. Our interaction with ChatGPT-4 encompassed two methods. In the first, Holistic Analysis (all-at-once request), a comprehensive directive instructed ChatGPT-4 to execute all the tests in one prompt. The conversation flow was as follows: “I want you to perform all the descriptive and inferential tests” (Table 1 followed by Table 2), written in a single prompt. ChatGPT-4 suggested deleting some cases because of a few missing values, to which we responded “no, proceed with the analysis”. ChatGPT-4 conducted the tests in parts; after every few tests, it asked whether it should proceed to the next ones, to which we responded yes. The whole analysis was performed using a total of nine prompts. The second method, Segmented Analysis (one-by-one request), was conducted with version 25 September 2023 only. In this more focused approach, each test was prompted one after another (Table 2) to compare the accuracy and consistency of ChatGPT-4’s output between the two approaches. ChatGPT-4 version Feb 2024 ran all the tests in the holistic approach correctly; hence, the segmented approach was deemed unnecessary.
The output from ChatGPT-4 was compared with the results obtained from R to assess consistency and identify any variations. The outputs were then compared between the two time points to assess whether ChatGPT-4 improved over a 5-month period without announced updates to the Data Analyst tool. The comparison extended across both methods of analysis, enabling a thorough assessment of the AI model’s capacity to handle complex statistical computation in a research context. While rBiostatistics.com was available at no charge, ChatGPT-4 cost 20 US dollars per month between 25 September 2023 and 14 February 2024.
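The comparison step described above can be sketched as follows. This is an illustration only: the study does not state how agreement was scored, so the 1% relative tolerance, the three status labels, and all names and numbers below are assumptions introduced for the example.

```python
import math


def compare_outputs(reference, candidate, rel_tol=0.01):
    """Label each reference statistic as 'match', 'mismatch' or 'omitted'.

    reference -- dict of statistic name -> value from the reference tool
    candidate -- dict of statistic name -> value from the tool under test
    rel_tol   -- assumed relative tolerance (1% here, hypothetical)
    """
    report = {}
    for name, ref_value in reference.items():
        cand_value = candidate.get(name)
        if cand_value is None:
            report[name] = "omitted"
        elif math.isclose(cand_value, ref_value, rel_tol=rel_tol):
            report[name] = "match"
        else:
            report[name] = "mismatch"
    return report


def agreement_rate(report):
    """Return the percentage of statistics labelled 'match'."""
    matches = sum(1 for status in report.values() if status == "match")
    return 100.0 * matches / len(report)


# Hypothetical values, not taken from the study tables.
r_results = {"age_mean": 50.0, "age_sd": 10.0, "p_value": 0.04}
chat_results = {"age_mean": 50.2, "age_sd": 12.0}

report = compare_outputs(r_results, chat_results)
rate = agreement_rate(report)
```

Counting omitted tests separately from mismatched ones keeps the two failure modes reported in the Results distinguishable.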
RESULTS
Descriptive analysis through ChatGPT-4
Descriptive results for the demographic variables, including sex, age, and income group distributions, were consistent between ChatGPT-4 and R, with minor variations in the no-mortality and mortality percentages in the October ChatGPT-4 Holistic Analysis only. For the “Segmented Analysis” method, individual tests were requested sequentially from ChatGPT-4, and each outcome was compared with its corresponding R result. The specific details are presented in Table 1.
Inferential Statistical Outcomes for Holistic Analysis
In the October version of ChatGPT-4, the comparison of Chi-square, T-tests, and non-parametric tests against R showed limited consistency. Some tests were omitted (e.g., Crosstabulations 2x2 or 3x2 and Fisher’s Exact Test interquartile range (IQR) for Odds Ratio on sex vs. mortality, and the T-Test and Mann-Whitney U Test on age and hospital stay vs. mortality, respectively), while others showed inconsistent outcomes, such as the T-Test standard deviation and P-value for age vs. no mortality and the Kruskal-Wallis Test P-value, median, and IQR percentiles for hospital stay across income levels (Table 2).
In contrast, certain tests demonstrated reliable results, including the Wilcoxon Rank Sum P-value, the Mann-Whitney U Test on hospital stay IQR, the Pearson’s Chi-squared P-value for income group, and Fisher’s Exact Tests on sex and mortality, as well as T-Tests on age. Overall, agreement between ChatGPT-4 and R on the inferential outcomes occurred in 72% of all tests at an accuracy rate of 99%, and the results matched those of R by 82%.
Inferential Statistical Outcomes for Segmented Analysis
Regarding the inferential statistical outcomes, certain tests conducted by ChatGPT-4 in October did not match the accuracy of R. These tests, particularly the non-parametric ones assessing the relationship of hospital stay with mortality and with income level, showed inconsistencies. Specifically, the Mann-Whitney U and Kruskal-Wallis Tests across income groups and the Wilcoxon Rank Sum Test showed inconsistent results. Additionally, the T-Test for age versus mortality and the upper confidence limit of Fisher’s Exact Test for sex versus mortality did not match R’s results.
Despite these discrepancies, the majority of the other tests, about two-thirds, were conducted by ChatGPT-4 with an accuracy of 99% or higher, closely matching the outcomes provided by R, as shown in Table 2. This demonstrates the potential of ChatGPT-4 to conduct a range of biostatistical analyses, although it also highlights the need for further refinement of its capabilities for certain types of analysis.
In contrast, the March version of ChatGPT-4 demonstrated high accuracy, as shown in Table 2. All the requested tests were performed with 100% accuracy except for the Fisher’s exact test odds ratio and the Levene’s test p-value. The March version initially did not perform these tests; on a second request, it acknowledged its limitations and stated that the results might not be accurate because it could not run the exact test.
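For readers unfamiliar with the exact test mentioned above, the following standard-library Python sketch computes a two-sided Fisher's exact p-value and the sample odds ratio for a 2x2 table. It is an illustration with hypothetical counts, not the procedure used by R or ChatGPT-4 in this study. Note also that R's fisher.test reports a conditional maximum-likelihood estimate of the odds ratio, so the simple cross-product ratio shown here is not expected to reproduce that value exactly.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of observing k in the top-left cell
        # with all row and column totals held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the
    # observed one; the small factor guards against rounding error.
    return sum(
        p
        for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-7)
    )


def sample_odds_ratio(a, b, c, d):
    """Cross-product odds ratio; undefined (ZeroDivisionError) if b*c == 0."""
    return (a * d) / (b * c)


# Hypothetical 2x2 table of counts, not taken from the study data.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
odds_ratio = sample_odds_ratio(3, 1, 1, 3)
```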
Figure Creation
The accuracy of the figures produced by the March 2024 version of ChatGPT-4 was compared with that of the R package. The box-and-whisker plot generated with the R package (Figure 1A) matched the plot generated by a command to ChatGPT-4 (Figure 1B). A Kaplan-Meier survival curve was then generated and showed high consistency between R (Figure 2A) and the March 2024 version of ChatGPT-4 (Figure 2B), although the upper 95% CI did not match.
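The Kaplan-Meier estimate underlying such survival curves can be sketched in standard-library Python as shown below. This is a minimal illustration with hypothetical follow-up data, not the code used by either tool. Confidence intervals are deliberately omitted: software packages differ in the transformation they apply when computing them, which is one general way in which confidence limits can diverge, although this study did not identify the cause of the mismatch.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  -- follow-up time for each patient
    events -- 1 if the event (death) occurred at that time, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Patients who died or were censored at t leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up times and event indicators.
follow_up = [1, 2, 2, 3, 4]
died = [1, 1, 0, 1, 0]
curve = kaplan_meier(follow_up, died)
```

Comparing the step values of two curves at each event time, rather than comparing the images, makes any disagreement between tools easier to locate.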
DISCUSSION
In this comparative analysis of ChatGPT-4 and R for biostatistical computations, we aimed to understand how reliable ChatGPT-4 Data Analyst results are. Our findings highlight the significant improvement of ChatGPT-4 in March 2024 compared with the October 2023 version, particularly in performing inferential tests. Given its high potential and rapid advancement, ChatGPT-4 Data Analyst may come to be relied on for statistical analysis. This is not only about numerical results; it reflects the future role AI may hold in biostatistics, a field that demands accuracy, especially in figure creation.
The results showed that the performance of the October 2023 version of ChatGPT-4 varied depending on the method used, whereas the March 2024 version overcame these obstacles. This aligns with the findings of Huang et al. (2024), who reported that ChatGPT-4 Data Analyst has limited accuracy in epidemiological studies when more advanced statistical methods are applied. This may be attributed to the limited set of Python libraries that ChatGPT-4 uses for analysis. When tasked with multiple analyses at once in the holistic approach, ChatGPT-4 produced less consistent results than when it processed each test individually in the segmented approach. This is significant because, unlike R, which is free, ChatGPT-4 comes at a cost. This cost-accuracy trade-off is particularly important for researchers with limited funds.
Beyond cost and accuracy, our study opens a broader conversation about the role of AI in research. The integration of AI like ChatGPT-4 into research is exciting, with potential uses including academic writing, outlining research studies, interpreting statistical data, and guiding the research process.12 Even though ChatGPT-4 Data Analyst is less time-consuming, requires less effort and less foundational knowledge of biostatistics, and can provide more insight into the data, our findings suggest that a measured approach is needed. Its current role in biostatistical analysis is still limited, and journals need to reach a consensus on its use in biostatistics. We believe that traditional software is currently still the more reliable tool.
Developments that improve the accessibility, cost, and usability of traditional biostatistical software such as SPSS and R are needed to accommodate the increasing demand for biostatistical expertise.3 While R is free, it requires high-level programming skills and lacks a user-friendly interface.13 SPSS, the most commonly used statistical software,4 has a more user-friendly interface but comes at a monthly subscription cost. One tool that addresses these issues is rBiostatistics.com. This cloud-based, free, open-access platform5 can help beginners and those with limited resources bridge the gap in performing biostatistics, especially with its integrated e-learning experience.
For now, AI is best utilized as a supplementary tool, as it was when used to support decision-making processes addressing the spread and impact of pandemics.9 Gerli et al.9 utilized ChatGPT-4 to generate R code for models predicting the daily number of COVID-19 deaths in Italy. This utility offers support for, but not a replacement of, established statistical software. AI companies like OpenAI should state in their release notes how reliable tools like ChatGPT-4 Data Analyst are with each update. Future research should concentrate on refining AI’s capabilities in statistical analysis to ensure comprehensive performance and intuitive result interpretation. The aim is to create AI systems that complement researchers, enhancing their expertise and enabling more efficient and innovative research methodologies.
This study, like any other, has limitations. Specifically, it compares ChatGPT with R, which is considered the industry standard but offers no writing capabilities or help in answering questions about the analysis. Additionally, the results may be affected by frequent updates to ChatGPT. The quality of the output provided by ChatGPT also depends on the user’s skill in requesting data analysis. However, the study’s strength lies in its unique cohort of over 2000 LT patients. The data from these patients had already undergone statistical work-up and can be used as a gold standard. Our study demonstrates a high level of accuracy when compared with this statistical work-up, which is encouraging for the future.
As we consider future directions, it is crucial to explore the capability of ChatGPT-4 to guide researchers in selecting appropriate statistical tests, its proficiency in choosing the right variables for multivariate analysis, and its ability to interpret such analysis. Future research should focus on enhancing the capabilities of AI in statistical analysis, ensuring that it can not only perform a wide range of tests but also provide clear explanations and interpretations of the results. The goal should be to develop AI systems that can work alongside researchers, augmenting their expertise and allowing for more efficient and innovative research methodologies.
CONCLUSIONS
As AI systems like ChatGPT-4 advance rapidly, their use for autonomous biostatistical analysis needs to be addressed by the research community. Currently, expert human input is required, and researchers should continue to employ established statistical tools like R, particularly for complex analysis. Despite the advanced capabilities and promise of ChatGPT-4, the effective integration of AI into research is in its early stages, and companies like OpenAI and the research community should address its current and future role in biostatistics. Even though ChatGPT-4 may have many advantages over traditional software that may be valuable for intermediate biostatisticians, researchers aiming for dependable biostatistical analysis should still rely on biostatisticians using established software.
Data availability
The data were obtained on request from LiverGroup.org.
Disclosure of interest
The authors completed the ICMJE Disclosure of Interest Form (available upon request from the corresponding author) and disclose no relevant interests.
Correspondence to
King Faisal Specialist Hospital & Research Center
7790, Al Maather, 2602, Riyadh 12713, Saudi Arabia
Email: Dimitri.raptis@gmail.com