1. Introduction
Artificial intelligence (AI) has become a transformative tool in healthcare, particularly in medical image analysis for early disease detection [1]. Its application has been especially successful in diabetic retinopathy (DR), a leading cause of blindness among working-age adults. DR affects around one in three diabetic individuals [2], and the age-standardized global prevalence of blindness due to DR increased from 14.9% to 18.5% between 1990 and 2020, imposing significant healthcare and socioeconomic burdens [3]. In this context, AI-based systems have shown high accuracy and efficiency in detecting DR from retinal images [4,5]. By facilitating early detection and timely intervention, AI-based DR screening strategies hold the potential to improve patient care, expand access to expertise in remote areas, and address the global health burden posed by DR [6].
Despite the promising performance of AI-based DR screening as software as a medical device, its value and feasibility for widespread implementation in real-world clinical settings require careful assessment. Health-economic evaluations, such as cost-effectiveness analysis, are essential for understanding the potential benefits of AI and for guiding policymaking and resource allocation [7,8]. Over the past decade, numerous studies have evaluated the economic costs and health outcomes of AI-based DR screening in countries such as Brazil, China and the United States. However, their findings were inconsistent, partly due to differences in study focus, context and methodology [9–13]. As a result, decision-makers often face overwhelming amounts of information and conflicting conclusions, which complicates the formation of clear policy directives. Some evaluations provide only descriptive summaries without standardized effect measures, and this lack of quantitative synthesis makes it difficult to integrate results or pinpoint sources of heterogeneity. Moreover, disparities in income levels, healthcare systems, and research perspectives further hinder the generalizability of conclusions across different settings, especially in countries where context-specific economic evaluations are not available.
To address these gaps, we performed a systematic review and meta-analysis to quantify the economic costs and health outcomes of AI-based DR screening, providing a quantitative assessment of its performance from a health-economic perspective. By pooling incremental net benefit (INB) estimates for various AI-based comparisons and stratifying analyses according to the identified heterogeneity, our study aims to provide robust evidence to inform policy-making and assist the future guidance of AI-enabled DR screening programs worldwide.
2. Methods
This systematic review and meta-analysis was performed following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Protocols[14] and was registered at PROSPERO (No. CRD42024583940).
Data Sources and Search Strategy
A systematic search of the literature was conducted on 1 September 2024 across multiple databases, including PubMed, Scopus, Embase, the Cochrane Library, the National Health Service Economic Evaluation Database, and the Cost-Effectiveness Analysis Registry. The search strategy combined the keywords "artificial intelligence" OR "AI" OR "deep learning" OR "machine learning"; "diabetic retinopathy" OR "DR" OR "diabetic macular edema" OR "diabetic macular oedema"; and "economic outcomes" OR "economic evaluation" OR "cost-effectiveness" OR "cost-utility". Additionally, the reference lists of eligible studies and relevant reviews were reviewed to retrieve other potentially relevant studies. No language restriction was applied. Full search strategies are shown in Supplementary Table 1.
Study Selection
Two researchers independently screened the titles and abstracts of the literature. Full articles were reviewed if a decision could not be made based on the abstract. Any disagreements were discussed and resolved with a third researcher. Studies were included if they met the following criteria: (1) conducted among adult diabetic populations, including both type 1 diabetes mellitus (T1DM) and type 2 diabetes mellitus (T2DM); (2) compared an AI-based DR screening strategy with non-AI screening (including manual screening or no screening), regardless of the specific screening strategy (e.g., AI used independently or as an assistive tool for human decision-making); and (3) performed a cost-effectiveness analysis and reported at least one health-economic outcome, including the incremental cost-effectiveness ratio (ICER), incremental net benefit (INB), incremental cost (ΔC), or incremental effectiveness (e.g., quality-adjusted life years (QALYs)). Review studies were excluded. Furthermore, studies with insufficient data for pooling were excluded from the meta-analysis.
Data Extraction
Data were extracted by two researchers separately and recorded in a structured spreadsheet. Any disagreements were resolved through discussion with a third researcher. The data extraction was conducted based on the Consolidated Health Economic Evaluation Reporting Standard (CHEERS) statement[15,16], the structured abstracts of economic evaluations in the NHS Economic Evaluation Database (NHS EED)[17], and the Centre for Reviews and Dissemination (CRD) guidance[18].
The data extraction form included the following five components:
1) General article information, including date of data extraction, study ID, first author and corresponding author, and journal and year of publication.
2) General study characteristics, including country of the study, country income level, type of economic evaluation, modelling approach, study perspective, and setting level.
3) General characteristics of participants and intervention/comparison, including study participants, age distribution of participants, screening strategies, status quo, and diagnostic performance of AI and human graders.
4) Study methods of economic evaluation, including time horizon, cycle length, discount rate applied to costs and health outcomes, base cost year and currency, type of outcome measures, cost-effectiveness threshold, and sensitivity analyses performed.
5) Health economic outcomes, including mean and incremental cost (ΔC), mean and incremental effectiveness (ΔE), and ICERs. Standard deviations (SD) or 95% confidence intervals (CI) for these parameters were also extracted where available. Data for pooling, including the mean cost or outcome with its dispersion, were extracted.
Where health economic outcomes were not explicitly reported in numerical values, data were extracted from cost-effectiveness plane graphs, when available. Additionally, cost-effectiveness thresholds or willingness-to-pay (WTP) thresholds were recorded. If WTP thresholds were not reported, they were estimated based on three times the per-capita gross domestic product (GDP) for the country in the publication year, following the World Health Organization (WHO)’s recommendation.
Risk of bias assessment
For both model- and trial-based economic evaluations, risk of bias was assessed using the Bias in Economic Evaluation (ECOBIAS) checklist, which consists of 22 items. Each item was rated as yes, no, partly, unclear, or not applicable, depending on the study's adherence to methodological standards.
Interventions and economic outcomes
The interventions of interest were AI-based strategies for DR screening. The comparator was non-AI screening (manual screening) or no screening. The primary economic outcome was the ICER, defined as the difference in cost between two interventions divided by the difference in their effect in terms of health outcomes such as QALYs.
Data Preparation
Since individual studies used different currencies and base years, all costs were converted to a 2023 cost metric using the consumer price index (CPI) and expressed in United States dollars (US$). The primary economic outcome measure was the INB [19–21], calculated as INB = K·ΔE − ΔC, where K is the cost-effectiveness threshold or WTP, and ΔE and ΔC are the differences in QALYs and cost between intervention and comparator.
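To make the standardization concrete, the CPI-based cost conversion and the INB formula above can be sketched in Python. The CPI values, exchange rate and threshold in the usage comment are hypothetical placeholders, not data from any included study.

```python
def to_2023_usd(cost, cpi_base_year, cpi_2023, usd_per_local_2023):
    """Inflate a cost from its base year to 2023 via the CPI ratio,
    then convert the local currency to US$ at a 2023 exchange rate."""
    return cost * (cpi_2023 / cpi_base_year) * usd_per_local_2023


def inb(k, delta_e, delta_c):
    """Incremental net benefit: INB = K * ΔE − ΔC.
    k: willingness-to-pay threshold (US$/QALY)
    delta_e: incremental effectiveness (QALYs)
    delta_c: incremental cost (US$)."""
    return k * delta_e - delta_c


# Hypothetical example: a 2018 local-currency cost inflated to 2023 US$,
# then an INB at a WTP of US$30,000/QALY.
cost_2023 = to_2023_usd(100.0, cpi_base_year=100.0, cpi_2023=120.0,
                        usd_per_local_2023=1.0)
example_inb = inb(k=30000.0, delta_e=0.05, delta_c=800.0)
```

A positive `example_inb` would indicate cost-effectiveness at the chosen threshold, mirroring the decision rule used in the pooled analysis.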
We selected INB rather than ICER as the effect measure because its sign directly indicates cost-effectiveness (positive) or non-cost-effectiveness (negative), and its linearity facilitates straightforward statistical analysis. In contrast, a negative ICER may indicate either lower cost with higher effectiveness or higher cost with lower effectiveness, which introduces interpretive ambiguity. For studies reporting ICERs, these were converted to INB as INB = ΔE·(K − ICER). The variance of INB was calculated using the following formulas: Var(INB) = K²·σ²(ΔE) + σ²(ICER), or Var(INB) = K²·σ²(ΔE) + σ²(ΔC) − 2K·ρ·σ(ΔC)·σ(ΔE). However, due to variation in reporting across economic evaluation studies, INB and its variance were estimated under five scenarios, following previous recommendations [22] (Supplementary Table 2).
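A minimal sketch of the ICER-to-INB conversion and the two variance formulas above, under the assumption that the reported dispersions are variances; all numeric inputs in the comments are hypothetical.

```python
import math


def inb_from_icer(delta_e, icer, k):
    """Convert a reported ICER to INB: INB = ΔE * (K − ICER)."""
    return delta_e * (k - icer)


def var_inb_from_icer(k, var_delta_e, var_icer):
    """Var(INB) = K^2 * σ²(ΔE) + σ²(ICER)."""
    return k ** 2 * var_delta_e + var_icer


def var_inb_from_costs(k, var_delta_e, var_delta_c, rho):
    """Var(INB) = K^2 * σ²(ΔE) + σ²(ΔC) − 2Kρ σ(ΔC) σ(ΔE),
    where rho is the correlation between ΔC and ΔE."""
    return (k ** 2 * var_delta_e + var_delta_c
            - 2 * k * rho * math.sqrt(var_delta_c) * math.sqrt(var_delta_e))


# Hypothetical study: ΔE = 0.1 QALY, ICER = 20,000, WTP K = 30,000.
study_inb = inb_from_icer(delta_e=0.1, icer=20000.0, k=30000.0)
```

When the ΔC–ΔE correlation ρ was unreported, the five estimation scenarios in Supplementary Table 2 govern which formula and which imputed inputs apply.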
Statistical analysis
Statistical analysis was conducted to evaluate the pooled INB across studies. Meta-analyses were stratified by country income level, classified according to the World Bank classification (high-income countries (HICs) and upper- or lower-middle-income countries (U/LMICs)), and by study perspective (healthcare system/payer and societal). For studies reporting results on multiple populations, the weighted average estimate of INB and its variance was used in the main analysis.
Fixed-effect modeling using the inverse-variance method was applied when no significant heterogeneity was detected; otherwise a random-effects model was applied [23]. The intervention was considered cost-effective if the pooled INB was positive (i.e., favoring the intervention); otherwise the new intervention was considered not cost-effective. Heterogeneity was assessed by Cochran's Q test, with a p value < 0.1 indicating significant heterogeneity, and the I² statistic was used to quantify the degree of heterogeneity. Subgroup analyses were conducted to explore potential sources of heterogeneity, such as differences in country income levels or comparison methods. In addition, a 95% CI was calculated to estimate whether the pooled INB would remain cost-effective in other settings. Publication bias was assessed using funnel plots and Egger's test. If asymmetry was identified, contour-enhanced funnel plots were used to differentiate potential causes. A series of pre-specified sensitivity analyses was performed by excluding studies with the following conditions: (1) time horizon < 10 years; (2) high risk of bias; and (3) scenario 5 (imputing variance using absolute values borrowed from similar studies). Heterogeneity was further explored through meta-regression. Data pooling was undertaken using Microsoft® Excel version 2019 and analyses were performed using STATA® version 16. A two-sided p value < 0.05 was considered statistically significant.
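The fixed-effect inverse-variance pooling and heterogeneity statistics described above can be sketched as follows. This is an illustrative re-implementation with hypothetical study values; the published analysis was run in STATA.

```python
import math


def pool_fixed_effect(inbs, variances):
    """Inverse-variance fixed-effect pooling of study-level INBs.
    Returns the pooled INB, its 95% CI, Cochran's Q, and I^2 (%)."""
    weights = [1.0 / v for v in variances]          # w_i = 1 / Var(INB_i)
    pooled = sum(w * e for w, e in zip(weights, inbs)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    ci = (pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, inbs))
    df = len(inbs) - 1
    # I^2 = max(0, (Q - df) / Q) * 100, the share of variability
    # attributable to heterogeneity rather than chance.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, ci, q, i2


# Hypothetical INBs (2023 US$) and variances from three studies.
pooled, ci, q, i2 = pool_fixed_effect([600.0, 650.0, 580.0],
                                      [900.0, 1600.0, 1200.0])
```

A positive pooled INB with a CI excluding zero corresponds to the "cost-effective" decision rule; a large Q (p < 0.1) or high I² would trigger the switch to a random-effects model.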
3. Results
Of the 600 identified studies, 9 were eligible for the meta-analysis [9,11–13,24–28] (Figure 1), including 11 comparisons: 9 between AI screening and manual screening, and 2 between AI screening and no screening.
Study characteristics
Geographically, three studies were conducted in high-income countries, while six were conducted in middle-income countries. All studies evaluated manual tiered screening strategies, with two also assessing no-screening scenarios. Markov models were used in all studies. The healthcare provider or health system perspective was the most commonly adopted analytical perspective, featured in seven studies, while four adopted a societal perspective. The sensitivity and specificity of AI-based screening strategies ranged from 0.80 to 0.98 and 0.91 to 0.99, respectively. In comparison, the sensitivity and specificity of the status quo (non-AI screening) ranged from 0.73 to 1.00 and 0.92 to 1.00, respectively. Discount rates for costs and effects were between 3% and 3.5%, and time horizons varied from one year to lifetime. Details of these studies are shown in Table 1.
Risk-of-bias assessment
We used the ECOBIAS checklist to evaluate the risk of bias. The biases identified included dissemination bias, limited time horizon bias, data identification and incorporation bias, limited sensitivity analysis and scope bias, and bias related to internal consistency. Two studies were classified as having low risk of bias (2/9, 22.2%), five as moderate (5/9, 55.6%), and two as high (2/9, 22.2%). Results of the risk of bias assessment are described in Supplementary Table 3.
Pooled INBs based on healthcare system/payer perspective
Seven studies, including eight comparisons of AI-based screening versus the status quo, were analyzed. Among these, three were conducted in HICs and the remaining four in U/LMICs. Overall, the pooled INB showed that AI-based DR screening was significantly and robustly cost-effective compared with conventional manual screening from the healthcare system/payer perspective (INB = 615.77, 95% CI: 558.27, 673.27). Heterogeneity was noted, but publication bias was not detected (Egger's test, P = 0.32). We therefore performed subgroup analyses by country income level. AI-based DR screening was found to be cost-effective in both HICs (INB = 613.62, 95% CI: 556.06, 671.18) and U/LMICs (INB = 1739.97, 95% CI: 423.13, 3056.82), with low to moderate heterogeneity (HICs, I² = 28.9%, P = 0.25; U/LMICs, I² = 54.4%, P = 0.07).
Pooled INBs based on societal perspective
Four studies with six comparisons of AI screening versus the status quo were analyzed from a societal perspective. The pooled INB indicated that AI-based DR screening was cost-effective overall (INB = 5102.33, 95% CI: -815.47, 11020.13), but the result did not reach statistical significance. Substantial heterogeneity was observed in the funnel plot, yet Egger's test indicated no evidence of publication bias (P = 0.18). Since all of these studies were conducted in U/LMICs, we stratified them by comparator strategy (manual screening or no screening). AI-based DR screening appeared cost-effective in comparison with both strategies (manual screening: INB = 1506.87, 95% CI: -1986.74, 5000.48; no screening: INB = 11906.00, 95% CI: -910.58, 24722.59), although neither subgroup analysis reached statistical significance and substantial heterogeneity persisted (manual screening, I² = 64.5%; no screening, I² = 93.2%). The 95% CIs indicated that the true effect in future settings could be either null or aligned with the direction of the pooled INB.
Sensitivity analysis and sources of heterogeneity
From a healthcare system perspective, AI-based DR screening showed modest and robust INBs, particularly in HICs, where the INB was 615.77 (95% CI: 558.27, 673.27) with low heterogeneity (I² = 25.3%). When the analysis was limited to studies with time horizons of > 5 years, or when high-income-country studies using scenario 5 were excluded, the INB increased slightly to 620.99 (95% CI: 562.65, 679.32) and heterogeneity was completely eliminated (I² = 0.0%). Excluding studies at high risk of bias also yielded similar, statistically significant results, further supporting the robustness of the findings (INB = 609.95, 95% CI: 551.59, 668.32). In U/LMICs, while AI screening was cost-effective overall, the results were not statistically robust when studies with short time horizons, high risk of bias, or specific scenarios were excluded, suggesting that the cost-effectiveness of AI screening in U/LMICs is more sensitive to study design and quality (Table 2). In contrast, under the societal perspective, sensitivity analyses revealed substantial heterogeneity across most comparisons, indicating variability among the included studies. Excluding studies at high risk of bias or those using scenario 5 eliminated the heterogeneity in the manual screening comparisons (I² = 0%). In these cases, however, the INBs indicated that AI screening was not cost-effective, as no statistical significance was observed (INB = -302.15, 95% CI: -1211.81, 607.50) (Table 3). Overall, AI screening was generally cost-effective in HICs, while heterogeneity related to time horizon and study quality highlights variability across studies, particularly in U/LMICs and from a societal perspective. In addition, our meta-regression analysis showed that AI specificity was positively correlated with INB from a healthcare system perspective (P = 0.031), as was the status quo comparator type (P = 0.031 and P = 0.011 from the societal and healthcare perspectives, respectively) (Table 4).
4. Discussion
As AI continues to gain recognition for its potential in DR screening, questions around the cost-effectiveness of AI-based screening strategies are increasingly critical. This systematic review and meta-analysis included 9 studies with 11 comparisons assessing the cost-effectiveness of AI-based screening versus manual or no screening. Overall, AI-based DR screening was generally cost-effective from a healthcare system perspective, particularly in HICs. From a societal perspective, AI-based screening also appeared cost-effective but lacked statistical significance, and significant heterogeneity was observed, highlighting the need for cautious interpretation when generalizing findings across different settings.
Our findings indicated that, from a healthcare system or payer perspective, AI-based DR screening demonstrated strong cost-effectiveness, with higher INBs in U/LMICs than in HICs. Two studies in the US and one in Australia indicated that AI screening was more cost-effective than human graders [12,13,27]. One predominant reason was that the costs of manual grading were relatively high in HICs, making AI systems a more cost-saving alternative. However, the cost-effectiveness of AI screening involved a more complex trade-off in U/LMICs, as it depended on the balance between the added value of automation and the local cost dynamics of traditional screening methods. On one hand, traditional screening methods in U/LMICs were often constrained by limited infrastructure and a shortage of trained personnel [29], and AI screening can address these gaps through automation and telemedicine systems, enabling earlier and more accurate diagnoses [8]. On the other hand, low labor costs in U/LMICs may reduce the cost-saving advantage of AI screening, which may explain why studies in Brazil and Thailand found AI algorithms to be less cost-effective than human grading [24,28]. Despite these nuances, our meta-analysis indicated that AI-based screening was cost-effective overall, with statistically significant pooled estimates, suggesting its promise for addressing systemic healthcare gaps in both HICs and U/LMICs.
From a societal perspective, AI-based DR screening also appeared cost-effective, though with less robustness and higher variability. Most studies on this topic have been conducted in U/LMICs, where the implementation of AI-based screening faces challenges. For example, the cost advantage of AI was less apparent in these regions due to relatively high deployment costs, including technical maintenance and equipment upgrades, which further limit its scalability and adoption potential. In HICs, DR screening is typically provided through national systematic programs with insurance reimbursement, whereas in U/LMICs it is often provided opportunistically [29]. Although some regions offer free screening initiatives, extending such programs nationwide is often unsustainable for countries with limited healthcare budgets. Moreover, prior research indicated that even minimal out-of-pocket payments can substantially reduce patient participation in DR screening in U/LMICs. Therefore, reducing these costs while increasing screening uptake and referral adherence is critical for ensuring early diagnosis and treatment of sight-threatening conditions. In this context, leveraging low-cost telemedicine networks can be highly effective. For example, in Thailand, deep learning-based AI software is deployed in primary care centers, where non-physician staff capture and transmit retinal images for remote evaluation [24]. By utilizing existing infrastructure and telemedicine technology, overall costs are minimized. Additionally, public-private partnerships can further reduce initial investments, with governments providing the primary care framework and private partners offering AI solutions and technical support.
High heterogeneity was observed in AI-based DR screening, especially from the societal perspective. One reason might be the different assumed values for input parameters. Wide variations in patient compliance (50.4–100%) among the included studies significantly influenced the pooled INB estimates. Notably, a one-year cost-effectiveness analysis in a pediatric diabetes population showed that, compared with traditional screening, AI-based DR screening was the preferred strategy only when at least 23% of patients adhered to screening [30]. Additionally, cost composition varied considerably among studies. Given that AI cost calculations have not been standardized, we recommend including implementation and maintenance costs over a 10-year lifespan to capture the true initial capital investment [31]. In the sensitivity analysis, we found that studies with short time horizons, or those lacking dispersion metrics or a cost-effectiveness (CE) plane, could introduce heterogeneity into the analysis, suggesting that standardizing research methods is essential for accurate estimation of the value of AI-based DR screening. Furthermore, in the meta-regression analysis, we found that higher AI specificity was positively correlated with INB from a healthcare perspective. Higher specificity lowers false-positive rates, thereby reducing unnecessary referrals and additional diagnostic tests in the healthcare system. It is also noteworthy that cost-effectiveness analyses can differ by perspective: while the healthcare system perspective emphasizes direct costs (e.g., referral expenses and subsequent treatment fees), the societal perspective considers long-term impacts including blindness. Although improving AI performance is crucial to balancing reduced up-front screening costs against potential downstream expenses, the most accurate AI model may not necessarily be the most cost-effective [32].
Further research is needed to enhance or balance the sensitivity and specificity of AI algorithms to achieve optimal cost-effectiveness.
This study has several limitations. First, the included studies showed high heterogeneity, which may not have been fully addressed despite the sensitivity analyses performed. Second, although many studies assumed the same compliance rate for AI-based and traditional screening, the immediate feedback provided by AI could realistically boost referral adherence beyond that of traditional telemedicine approaches, which often entail a one- to two-week delay. For instance, Liu et al. demonstrated that AI-based screening raised compliance from 18.7% to 55.4% [33]. Failing to consider this potential improvement in compliance may underestimate the cost-effectiveness of AI screening. Third, many studies lacked consistent reporting of key parameters, such as the variance of costs and outcomes, potentially obscuring the true impact of important factors. Future studies are encouraged to use standardized methods and provide detailed reporting of critical parameters to improve the reliability and comparability of results.
5. Conclusion
AI-based DR screening is generally cost-effective from a healthcare system perspective, particularly in HICs. Given its advantages in reducing healthcare disparities and optimizing resource allocation, AI has the potential to become a powerful tool for DR screening. However, heterogeneity introduced by different assumed values (e.g., compliance, cost, time horizon), status quo and AI performance remains a significant challenge in the comprehensive interpretation of economic evaluation of AI-based screening strategies. To address these challenges, future research should focus on standardized methodologies and reporting critical parameters in detail. Such efforts should improve the generalizability of findings and provide stronger evidence to guide policy-making and implementation strategies for AI-based DR screening.
Correction notice
None
Acknowledgement
We thank InnoHK and the HKSAR Government for their valuable support. The research work described in this paper was mainly conducted in the JC STEM Lab of Innovative Light Therapy for Eye Diseases, funded by The Hong Kong Jockey Club Charities Trust.
Author Contributions
(Ⅰ) Conception and design: Xiaotong Han, Mingguang He
(Ⅱ) Administrative support: Mingguang He
(Ⅲ) Provision of study materials or patients: Yueye Wang, Keyao Zhou, Jian Zhang
(Ⅳ) Collection and assembly of data: Yue Wu, Yueye Wang
(Ⅴ) Data analysis and interpretation: Yue Wu
(Ⅵ) Manuscript writing: All authors
(Ⅶ) Final approval of manuscript: Xiaotong Han, Mingguang He
Funding
The study was supported by the Global STEM Professorship Scheme (P0046113), and Henry G. Leong Endowed Professorship in Elderly Vision Health.
Conflict of Interests
None of the authors has any conflicts of interest to disclose. All authors have completed the ICMJE uniform disclosure form.
Patient consent for publication
None
Ethics approval and consent to participate
None
Data availability statement
None
Open access
This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license).