Automated real-time assessment of intracranial hemorrhage detection AI using an ensembled monitoring model (EMM)

npj Digital Medicine volume 8, Article number: 608 (2025)
Artificial intelligence (AI) tools for radiology are commonly unmonitored once deployed. The lack of real-time case-by-case assessments of AI prediction confidence requires users to independently distinguish between trustworthy and unreliable AI predictions, which increases cognitive burden, reduces productivity, and potentially leads to misdiagnoses. To address these challenges, we introduce Ensembled Monitoring Model (EMM), a framework inspired by clinical consensus practices using multiple expert reviews. Designed specifically for black-box commercial AI products, EMM operates independently without requiring access to internal AI components or intermediate outputs, while still providing robust confidence measurements. Using intracranial hemorrhage detection as our test case on a large, diverse dataset of 2919 studies, we demonstrate that EMM can successfully categorize confidence in the AI-generated prediction, suggest appropriate actions, and help physicians recognize low confidence scenarios, ultimately reducing cognitive burden. Importantly, we provide key technical considerations and best practices for successfully translating EMM into clinical settings.
The landscape of healthcare has rapidly evolved in recent years, with an exponential increase in FDA-cleared artificial intelligence (AI) software as medical devices, especially in radiology¹. Despite the number of AI applications available, the clinical adoption of radiological AI tools has been slow due to safety concerns regarding potential increases in misdiagnosis, which can erode overall trust in AI systems². Such inaccurate predictions force meticulous verification of each AI result to avoid cognitive pitfalls such as automation and confirmation biases^2,3, ultimately adding to the user’s cognitive workload rather than fulfilling AI’s promise to enhance clinical efficiency⁴. These cognitive influences on clinical decision making, along with potential mismatches between expected and actual AI performance, indicate a clear need for real-time monitoring to inform physicians on a case-by-case basis about how confident they can be in each prediction. Such real-time monitoring alongside a physician’s image interpretation also goes hand-in-hand with the latest guidance issued by the FDA focusing on total life-cycle management of AI tools, rather than the status-quo of pre-deployment validation⁵. However, there are currently limited guidelines or best practices for real-time monitoring to communicate reduced confidence in an AI model’s prediction.
Current monitoring of radiological AI devices is performed retrospectively based on concordance between AI model outputs and manual labels, which require laborious radiologist-led annotation⁶. Due to the resource-intensive nature of generating these labels, the vast majority of retrospective evaluations are limited to small data subsets, providing only a partial view of real-world performance⁷. While recent advances in large language models (LLMs) have shown promise in the analysis of clinical reports^8,9,10, including automated extraction of diagnosis labels from radiology reports^11,12,13, this solution remains retrospective. Moreover, with report-based monitoring, regardless of the extraction technique, the “quality control” mechanism for algorithm performance remains a manual task. An automated quality monitoring solution may help decrease the user’s cognitive burden and provide additional objective information regarding the performance of the AI model, including performance drift.
In response to these limitations, various real-time monitoring techniques have been proposed, including methods that can directly predict model confidence using the same training dataset used to develop the AI model being monitored^{14,15,16,17,18,19}. Common classes of confidence estimation rely on methods such as SoftMax probability calibration^20,21,22, Bayesian neural networks^23,24,25,26, and Monte Carlo dropout^27,28. Deep ensemble approaches have also emerged to evaluate prediction reliability by utilizing groups of models derived from the same parent model with varying augmentations^{29,30,31,32,33}. However, these methods require access to either the training dataset, model weights, or intermediate outputs, which is not practical when monitoring commercially available models. Since all FDA-cleared radiological AI models that are deployed clinically are black-box in nature, there currently exist no techniques to monitor such models in production in real time.
Thus, there remains a critical need for a real-time monitoring system to automatically characterize model confidence at the point-of-care (i.e., when the radiologists review the images and the black-box AI prediction in question). To address this need, we developed the Ensembled Monitoring Model (EMM) approach, which is inspired by clinical consensus practices, where individual opinions are validated through multiple expert reviews. EMM comprises multiple sub-models trained for an identical task and estimates model confidence in the black-box AI prediction based on the level of agreement seen between the sub-models and the black-box primary model. This approach enables prospective real-time case-by-case monitoring, without requiring ground-truth labels or access to internal AI model components, making it deployable for black-box systems. Here, we demonstrate the effectiveness of the EMM approach in characterizing the confidence of intracranial hemorrhage (ICH) detection AI systems (one FDA-cleared and one open-source) operating on head computed tomography (CT) imaging. In this clinically significant application requiring high reliability, we show how EMM can monitor AI model performance in real time and inform subsequent actions in cases flagged for decreased accuracy. The complementary use of a primary model with EMM can improve accuracy and user trust in the AI model, while potentially reducing the cognitive burden of interpreting ambiguous cases. We further investigate and provide key considerations for translating and implementing the EMM approach across different clinical scenarios.
Emulating how a group of clinical experts reaches consensus, the EMM framework was developed to estimate consensus among a group of models. Here, we refer to the model being monitored as the “primary model”. In this study, the EMM comprised five sub-models with diverse architectures trained for the identical task of detecting the presence of ICH (Fig. 1). Each sub-model within the EMM independently processed the same input to generate its own binary prediction (e.g., the input image is positive or negative for ICH), in parallel to the primary ICH detection model. Each of the five EMM outputs was compared to the primary model’s output using unweighted vote counting to measure agreement in discrete 20% increments from 0% (no EMM sub-models agreed) to 100% (all five sub-models agreed). This level of agreement can translate into increased, similar, or reduced confidence in the primary output.
Each sub-model within the EMM is trained to perform the same task as the primary ICH detection model. The independent sub-model outputs are then used to compute the level of agreement between the primary ICH detection model and EMM, helping quantify confidence in the reference prediction and suggesting an appropriate subsequent action.
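The unweighted vote counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `emm_agreement` and the example predictions are hypothetical, and binary predictions are assumed to be encoded as 1 (ICH-positive) and 0 (ICH-negative).

```python
from typing import Sequence


def emm_agreement(primary_pred: int, sub_model_preds: Sequence[int]) -> float:
    """Return the fraction of EMM sub-models whose binary prediction
    matches the primary model's prediction (unweighted vote counting)."""
    if not sub_model_preds:
        raise ValueError("at least one sub-model prediction is required")
    matches = sum(1 for pred in sub_model_preds if pred == primary_pred)
    return matches / len(sub_model_preds)


# Hypothetical case: the primary model predicts ICH-positive (1)
# and three of the five sub-models agree with it.
agreement = emm_agreement(1, [1, 1, 0, 1, 0])
print(f"EMM agreement: {agreement:.0%}")
```

With five sub-models, the returned value can only take the six levels 0, 20, 40, 60, 80, and 100%, matching the discrete increments described in the text.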
To identify the features most commonly associated with high EMM agreement, we visually examined all 2919 CT scans for two complementary primary models, an FDA-cleared model with higher specificity, precision and accuracy, and an open-source model^34,35 with higher sensitivity (Supplementary Fig. 1).
Results of the FDA-cleared primary model are shown in Fig. 2a, and the results of the open-source primary model are shown in Supplementary Fig. 2. The FDA-cleared primary model and EMM demonstrated 100% agreement and correct classifications in 1479 cases (51%, 632 ICH-positive, 847 ICH-negative), primarily in cases with obvious hemorrhage or clearly normal brain anatomy. EMM showed partial agreement with the FDA-cleared primary model in 848 cases (29%, 151 ICH-positive, 697 ICH-negative) when the FDA-cleared primary model was correct. EMM also showed partial agreement in 454 cases (16%, 415 ICH-positive, 39 ICH-negative) when the FDA-cleared primary model was incorrect. Visual examination revealed that the cases with partial agreement typically presented with subtle ICH or contained imaging features that mimicked hemorrhage (e.g., hyperdensity, such as calcification or tumor). These cases of partial agreement provide an opportunity for further radiologist review. Finally, in 138 cases (4%, 117 ICH-positive, 21 ICH-negative), EMM demonstrated 100% agreement with the FDA-cleared primary model, but EMM failed to detect that the FDA-cleared primary model’s prediction was wrong. These cases predominantly involved either extremely subtle hemorrhages or CT features that strongly mimicked hemorrhage patterns, confusing both the FDA-cleared primary model and EMM.
a Example cases for which EMM showed different levels of agreement with the FDA-cleared primary model. Cases with full EMM agreement typically showed clear presence or absence of ICH, while cases with partial agreement often displayed subtle ICH or features mimicking hemorrhage. b Quantitative analysis on the importance of features affecting EMM agreement with the FDA-cleared primary model in ICH-positive and ICH-negative cases. The normalized weight of importance for all features sums to 100%.
We then quantitatively examined which features affected EMM agreement using Shapley analysis³⁶. This analysis was performed on a data subset (N = 281) with a comprehensive set of features manually labeled by radiologists spanning multiple categories, including pathology, patient positioning, patient information, image acquisition, and reconstruction parameters. In ICH-positive cases, hemorrhage volume emerged as the dominant feature for high EMM agreement, with larger volumes strongly corresponding to higher agreement (Fig. 2b), as seen in our visual analysis. For ICH-negative cases, the predictive features for EMM agreement were more balanced. The top predictors for EMM agreement were brain volume, patient age, and image rotation. Some directional relationships between feature values and EMM agreement were also identified (Supplementary Fig. 2). For ICH-positive cases, high hemorrhage volume and multi-compartmental hemorrhages resulted in higher EMM agreement. In ICH-negative cases, the presence of features that mimicked hemorrhages led to lower EMM agreement.
The level of EMM agreement with the primary ICH detection model and the resulting degree of confidence also enables radiologists to make different decisions on a case-by-case basis. By setting stratification thresholds (Fig. 3) based on EMM agreement levels that are higher than, similar to, or lower than the primary AI’s baseline performance (Supplementary Fig. 3), primary predictions can be grouped into three categories, indicating scenarios where the radiologist may have increased, similar, or decreased confidence. This stratification allows radiologists to adjust their actions accordingly for each image read. For example, the primary model’s prediction might not be used in cases with decreased confidence, and these cases should be reviewed following a radiologist’s conventional image interpretation protocol. As we show in the following paragraph, such optimization may potentially improve radiologist efficiency and reduce cognitive load.
The stratification thresholds were selected by evaluating the primary model’s performance at different levels of EMM agreement, specifically, those where performance was higher than, similar to, or lower than its baseline (Supplementary Fig. 3). This categorization enables radiologists to make tailored decisions based on the confidence level derived from the EMM agreement levels. For ICH-positive predictions, confidence was stratified using EMM agreement thresholds of 100% (increased), 60, 80% (similar), and 0, 20, 40% (decreased). For ICH-negative predictions, thresholds were 100% (increased), 20, 40, 60, 80% (similar), and 0% (decreased).
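As a sketch of how the reported thresholds map an agreement level to a confidence category, the following Python function encodes the values given above for five sub-models. The function name and the string labels are illustrative assumptions; the thresholds themselves are specific to this study and would likely need to be re-derived for another primary model, disease, or prevalence.

```python
def stratify_confidence(primary_pred: int, n_agree: int) -> str:
    """Map the number of agreeing sub-models (0 to 5) to a confidence
    category, using the thresholds reported for this study.

    primary_pred: 1 for an ICH-positive primary prediction, 0 for ICH-negative.
    n_agree: number of the five EMM sub-models that agree with the primary model.
    """
    if primary_pred not in (0, 1):
        raise ValueError("primary_pred must be 0 or 1")
    if not 0 <= n_agree <= 5:
        raise ValueError("n_agree must be between 0 and 5")

    if n_agree == 5:  # 100% agreement
        return "increased"
    if primary_pred == 1:
        # ICH-positive: 60 or 80% is similar; 0, 20, or 40% is decreased.
        return "similar" if n_agree >= 3 else "decreased"
    # ICH-negative: 20 to 80% is similar; only 0% is decreased.
    return "similar" if n_agree >= 1 else "decreased"


print(stratify_confidence(primary_pred=1, n_agree=2))
print(stratify_confidence(primary_pred=0, n_agree=2))
```

The same agreement level (here, two of five) yields different categories for positive and negative primary predictions, which reflects the asymmetric thresholds in the caption.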
Following the EMM thresholds and suggested actions outlined in Fig. 3, we then evaluated the overall accuracy of the primary model together with EMM for cases classified as increased, similar, and decreased confidence for the FDA-cleared primary model (Fig. 4a). This evaluation was also performed across three different prevalences of 30, 15, and 5%, which are close to the prevalences observed at our institution across emergency, in-patient, and out-patient settings. As expected, overall accuracy was highest for cases in which the primary model and EMM showed high agreement, and overall accuracy was lowest when EMM showed a lower agreement level with the primary model. This was observed for both ICH-positive and ICH-negative primary predictions and across all prevalence levels. Of the cases analyzed, most were classified as increased confidence based on EMM thresholds, followed by similar confidence, and lastly decreased confidence (Fig. 4b).
a Cases stratified by EMM agreement levels demonstrated increased (green), similar (yellow), or decreased (red) accuracies compared to the baseline accuracy of the primary model without EMM (gray). b Distribution (%) of cases classified as increased (green), similar (yellow), or decreased (red) confidence based on the EMM agreement thresholds. c For the cases in which EMM indicated decreased confidence, a more detailed radiologist review was called for. Cases flagged for decreased confidence, but for which the primary model’s prediction was correct, were defined as false alarms. Substantial relative gains over baseline accuracy using only the primary model were observed across all prevalence levels for ICH-positive primary model predictions, outweighing the burden of false alarms. For ICH-negative primary model predictions, however, this favorable balance between relative accuracy gains and false alarm burden was only observed at 30% prevalence. *, **, ***, and NS indicate p < 0.05, <0.01, <0.001, and not significant, respectively.
To assess the practical value of the EMM suggested actions, we analyzed the relative gains of the model compared to the cognitive load and loss of trust associated with incorrect classifications. Among the cases flagged for decreased confidence, those for which the primary model’s prediction remained correct despite low EMM agreement (and thus the decreased confidence classification) were considered false alarms. As shown in Fig. 4c, alerting radiologists to potentially incorrect positive outputs from the primary model enabled them to correct these cases, substantially improving relative accuracy and outweighing the false-alarm rate across all prevalence levels (relative accuracy improvements of 4.66, 11.28, and 38.57% versus false-alarm rates of 0.89, 0.45, and 0.14% at prevalence levels of 30, 15, and 5%, respectively). However, this net benefit was only observed for cases with ICH-negative primary model predictions at 30% prevalence (false-alarm rate of 1.05% versus relative accuracy improvement of 3.35%). At lower prevalence levels (15 and 5%), the already-high baseline accuracy of the primary model for ICH-negative cases (i.e., accuracy = 0.93 and 0.98, respectively) meant that the relative accuracy gains (1.37 and 0.41%) did not clearly outweigh the burden of false alarms (1.27 and 1.40%). Similar results were also observed for the open-source primary model across all prevalences (Supplementary Fig. 4).
To enable broader application and adoptability, we conducted a comprehensive analysis of how three key factors influence EMM performance and assessed its data efficiency and generalizability: (i) amount of training data used: 100% of the dataset (N = 18,370), 50% of the dataset (N = 9185), 25% of the dataset (N = 4592), and 5% of the dataset (N = 918), (ii) number of EMM sub-models (1–5), and (iii) EMM sub-model size in relation to training data volume. EMM performance was measured by its ability to detect errors made by the primary model using error detection sensitivity-PPV area under curve (ED-SPAUC) and specificity-NPV area under curve (ED-SNAUC) across prevalences, as these metrics consider the overall error detection performance regardless of the agreement level threshold applied.
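The exact computation of ED-SPAUC is not given in this section, so the following Python sketch shows only one plausible reading, under stated assumptions: a case is flagged as a suspected primary-model error when the number of agreeing sub-models is at or below a cutoff, sensitivity and PPV for error detection are computed at every cutoff, and the area under the resulting sensitivity-PPV curve is taken with the trapezoidal rule. The function names and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def error_detection_curve(
    n_agree: Sequence[int],
    primary_correct: Sequence[bool],
    n_sub_models: int = 5,
) -> List[Tuple[float, float]]:
    """Return (sensitivity, PPV) points for detecting primary-model errors,
    one point per agreement cutoff (flag a case if n_agree <= cutoff)."""
    total_errors = sum(1 for correct in primary_correct if not correct)
    points = []
    for cutoff in range(n_sub_models + 1):
        # True entries are flagged cases where the primary model was wrong.
        flagged = [not c for a, c in zip(n_agree, primary_correct) if a <= cutoff]
        if not flagged or total_errors == 0:
            continue  # PPV or sensitivity is undefined at this cutoff
        true_pos = sum(flagged)
        points.append((true_pos / total_errors, true_pos / len(flagged)))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given in order of increasing cutoff.
    Sensitivity never decreases as the cutoff grows, so no sorting is needed."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data: agreement counts and whether the primary model was correct.
toy_agree = [0, 1, 5, 5, 3, 2]
toy_correct = [False, False, True, True, True, False]
curve = error_detection_curve(toy_agree, toy_correct)
print(area_under_curve(curve))
```

A specificity-NPV counterpart (ED-SNAUC) could be built the same way by swapping in the corresponding rates. Because the area is integrated over all cutoffs, the metric does not depend on any single agreement threshold, which is the property the text emphasizes.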
As illustrated in Fig. 5a, EMM’s ED-SPAUC for the FDA-cleared primary model generally decreased as the training data were reduced from 100 to 5% of the original dataset across all three prevalences. This suggested that EMM generally improves with increased training data volume, though the benefits begin to saturate after approximately 4,600 studies (i.e., 25% of data). This trend was also observed in the ED-SNAUC and the open-source primary model (Supplementary Fig. 5a).
a Training data volume. b Number of sub-models. c Sub-model sizes. Error detection sensitivity-PPV area under curve (ED-SPAUC) and specificity-NPV area under curve (ED-SNAUC) were measured across prevalences; higher values are desirable for both. S-Net represents small sub-model networks for EMM and L-Net represents large sub-model networks for EMM. Similar results for the open-source primary model are shown in Supplementary Fig. 5.
As shown in Fig. 5b, EMM’s ED-SPAUC increased as the number of sub-models increased from 1 to 4, before generally stabilizing at 5 across all three prevalence levels. Conversely, ED-SNAUC consistently improved as the number of models increased from 1 to 5, across all prevalences. Similarly, both error detection metrics showed consistent improvement as the number of networks increased for the open-source primary model (Supplementary Fig. 5b). These results suggest that EMM performance generally increases with more sub-models, with 4 or 5 sub-models serving as an effective starting point for future applications.
We next investigated how combinations of EMM sub-model size and training data volume affected EMM’s performance in monitoring the primary model. We examined four scenarios: (i) ensemble of small networks trained with 5% of the dataset (S-Net 5%-Data), (ii) ensemble of small networks trained with 100% of the dataset (S-Net 100%-Data), (iii) ensemble of large networks trained with 5% of the dataset (L-Net 5%-Data), and (iv) ensemble of large networks trained with 100% of the dataset (L-Net 100%-Data). As shown in Fig. 5c, EMM generally achieved the best ED-SPAUC and ED-SNAUC with large networks and 100% of the training data and worst performance with a large network and 5% of training data, suggesting that a larger training dataset could benefit EMM performance. In a 5% prevalence setting, an ensemble of small networks and 5% of the training data achieved the highest ED-SPAUC and ED-SNAUC values. Similar findings were also observed for the open-source primary model (Supplementary Fig. 5c).
Additionally, we analyzed EMM performance across gender, age and race groups and found discrepancies in some subgroups, potentially due to either underrepresentation in the EMM training data or the distribution mismatch between the EMM training data and our internal testing dataset for these subgroups (Supplementary Fig. 6).
Taken together, these results provide insights into how the EMM approach can be developed and tailored for various real-time monitoring applications.
In this paper, we introduce EMM, a framework that monitors black-box clinical AI systems in real-time without requiring manual labels or access to the primary model’s internal components. Using an ensemble of independently trained sub-models that mirror the primary task, our framework measures confidence in AI predictions through agreement levels between the EMM sub-models and the primary model on a case-by-case basis. We also showed that the EMM agreement level can be used to stratify cases by confidence in the primary model’s prediction and suggest a subsequent action. For example, reviewing cases where EMM showed decreased confidence led to substantial relative accuracy gains that outweighed the burden of false alarms. Taken together, these results show that EMM is a valuable tool that fills the critical gap in real-time, case-by-case monitoring for FDA-cleared black-box AI systems, which would otherwise remain unmonitored. Finally, we explored how EMM can be generalized for future applications and demonstrated how EMM performance varies with different amounts of training data, as well as the number and size of sub-models, providing insights into how EMM can be adapted for various resource settings.
Our approach enables quantification of confidence through EMM agreement levels with the primary model’s predictions. By applying appropriate thresholds to the level of agreement between the EMM and primary model, radiologists can distinguish predictions they can be confident in from those that warrant closer review, thereby optimizing their attention allocation and cognitive load. For example, EMM indicated high confidence in over half of all cases interpreted in our test dataset. Notably, EMM also reliably identified cases of low confidence, allowing for focused review of these cases and greatly improving overall ICH detection accuracy. In our testing, EMM only failed alongside the primary model in a small percentage of cases (4%). These cases, in which both EMM and the primary model were incorrect, involved small ICH volumes or ICH-mimicking features.
The stratification of confidence levels based on the EMM agreement levels also enables radiologists to make tailored decisions for each case. The thresholds for defining the three accuracy groups in this study were established based on expert radiologist assessment and the primary model’s performance at different EMM agreement levels (Supplementary Fig. 3), with separate analyses for ICH-positive and ICH-negative primary model predictions. The thresholds to indicate increased and decreased confidence were specifically designated so that the overall ICH detection accuracy would be significantly higher or lower, respectively, than that with only the primary model (baseline). However, suboptimal EMM agreement thresholds (resulting in too many cases categorized as decreased confidence) can create an unfavorable trade-off where the burden of further reviewing false alarms, and the associated loss in trust in the EMM, outweighs the relative gains in accuracy. This inefficiency particularly impacts low-prevalence settings, where radiologists may waste valuable time reviewing cases that the primary model had already classified correctly (Supplementary Fig. 7). This illustrates that although the overall EMM framework can be applied to broad applications, the agreement levels and thresholds may need use-case-specific definitions depending on the disease and its prevalence level.
In our analysis of features associated with high EMM agreement levels, we also identified some less intuitive directional relationships. For instance, higher EMM agreement levels in ICH-negative cases were associated with smaller brain volume and reduced rotation and translation from image registration, such as in the MNI152 template space³⁷. These associations may reflect underlying confounding factors. For example, a smaller brain volume may cause a hemorrhage to appear proportionally larger relative to the total brain size, making it more easily detectable by the primary model. Younger age may also be associated with fewer ICH-mimicking features that may confuse the primary model, such as calcifications³⁸. Similarly, higher confidence in images with minimal translation and rotation may suggest that the primary model training dataset is biased toward orientations close to the MNI152 template. This may be due to a known limitation that many CNN models lack rotational equivariance³⁹. Although other confounders may be present and future studies should further investigate how these features directly relate to EMM’s capability to estimate confidence, EMM was still able to estimate confidence and stratify cases accordingly.
Varying the technical parameters of EMM also revealed insight into the best practices for applying EMM to other clinical use cases. Our ablation study revealed that, as expected, larger datasets, a larger number of sub-models, and larger sub-models generally improve the EMM’s capability to detect errors in the primary model. We observed that EMM, when trained on only 25% of the data, achieved near-optimal performance at disease prevalences of 30 and 15%, indicating EMM’s strong generalizability in data-scarce settings with relatively high prevalence. We also observed that at 5% prevalence, large sub-models trained with 25% of the data or small models trained with 5% of the data achieved optimal performance. This behavior can be explained by the relationship between model complexity and data volume. Specifically, large sub-models trained on the full dataset (with 41% prevalence) likely became overfitted to that specific prevalence distribution, causing suboptimal performance when tested on data with a significantly different prevalence (5%)⁴⁰. Training large sub-models with 25% of the data or small sub-models with 5% of the data may help the EMM balance the bias-variance tradeoff by learning meaningful patterns for generalizability, while not overfitting to the training prevalence. The differences observed in optimal dataset size and sub-model size across different prevalence levels can help inform the best technical parameters to start developing an EMM for a different use case, promoting greater adoptability across diseases.
Beyond using EMM to improve case-by-case primary model performance, as shown in this study, another potential application of EMM can be monitoring longitudinal changes in primary model performance. As the EMM agreement levels are tracked over time, perturbations in the expected ranges can be identified over daily, weekly, or monthly periods. For example, any significant drifts in EMM agreement level distribution may signal changes in primary model performance due to shifts in patient demographics, image acquisition parameters, or clinical workflows. In this manner, the EMM approach can provide another dimension to the current radiology statistical process/quality control pipelines for continuous background monitoring^41,42, in addition to reporting concordance.
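One way such longitudinal tracking might look in practice is sketched below in Python. This is an illustrative assumption rather than a method from the study: the distribution of agreement levels in a recent window is compared with a baseline distribution using total variation distance, and the alert threshold of 0.2 is a hypothetical placeholder that would need to be tuned per site.

```python
from collections import Counter
from typing import List, Sequence


def agreement_distribution(n_agree_values: Sequence[int], n_sub_models: int = 5) -> List[float]:
    """Return the share of cases at each agreement count (0..n_sub_models)."""
    if not n_agree_values:
        raise ValueError("at least one case is required")
    counts = Counter(n_agree_values)
    total = len(n_agree_values)
    return [counts.get(level, 0) / total for level in range(n_sub_models + 1)]


def total_variation_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


ALERT_THRESHOLD = 0.2  # hypothetical value, to be calibrated on local data

# Toy agreement counts for a baseline period and a recent week.
baseline = agreement_distribution([5, 5, 5, 4, 3, 5, 5, 2, 5, 4])
this_week = agreement_distribution([2, 3, 1, 5, 2, 4, 0, 3, 5, 2])

drift = total_variation_distance(baseline, this_week)
if drift > ALERT_THRESHOLD:
    print(f"Agreement distribution shifted (distance = {drift:.2f}); review recommended")
```

Total variation distance is used here only because it is simple and bounded; a control chart or a formal statistical test over the same distributions would be an equally reasonable choice.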
While the EMM approach demonstrates several advantages in case-by-case AI monitoring, some limitations persist. Although the EMM does not require labels to perform its monitoring task after deployment, a key constraint is the need for labeled use-case-specific datasets when training the EMM for each clinical application, which could potentially limit broader adoption across diverse clinical institutions with different labeling and computing resources. Similarly, the extent of retraining, recalibration, or additional data required as the primary model, prevalence, and imaging protocol evolve remains an open question, and future work should aim to systematically evaluate the extent of adaptation needed to maintain EMM performance in such scenarios. With the recent maturity of LLMs and self-supervised model training techniques, these data limitations may be largely overcome. For example, labels can now be automatically extracted from existing radiology reports using LLMs^11,12,13. Self-supervised training^43,44 also enables large foundation model training without manual annotation^45,46, and only a small amount of labeled data would be required to further fine-tune the model for each use case. These recent developments enable periodic updates to EMM, allowing it to adapt to changes in patient populations, scanners, and imaging protocols, thereby maintaining consistent and robust performance over time.
In this study, our ground-truth labels were generated independently by two radiologists without a secondary inter-reader review (one radiologist annotated the ICH types, while the other performed the ICH segmentation). Although both annotators have several years of experience (six and three years, respectively), labeling errors may still occur due to human error. In addition, although the RSNA 2019 dataset prohibits commercial use, it is unclear whether the FDA-cleared primary model was trained on this dataset. Nonetheless, even with the possibility of an overlap in the training data between the primary model and EMM, our results derived from our private testing dataset indicate that the EMM approach maintains strong performance despite this potential limitation.
While we explored EMM performance across different primary models, varying amounts of training data, different numbers and sizes of sub-models, and a range of disease prevalences, all evaluations were conducted on a single internal dataset, which may be subject to biases specific to our patient population or our ICH use case. For example, we observed discrepancies in EMM performance in gender, age, and race subgroups (Supplementary Fig. 6). This may be due to dataset bias commonly observed in healthcare AI⁴⁴, potentially resulting from underrepresentation in the EMM training dataset or a mismatch between the training and testing subgroup distributions. To address these limitations, future work is needed not just to investigate the generalizability of the EMM framework to applications beyond ICH and across institutions with diverse patient populations, scanner types, and imaging protocols, but also to develop methods that can enhance its generalizability in these varied settings. In addition, clinician-centered evaluations are needed to assess how EMM’s case-specific confidence impacts radiologists’ workflow, efficiency, trust in AI, and cognitive load. The alignment between EMM’s confidence and radiologists’ interpretations may vary based on individual traits and levels of experience, underscoring the importance of human-centered assessment⁴⁷.
Another limitation is EMM’s susceptibility to similar failure patterns as the primary model being monitored, such as in cases involving small, low-contrast hemorrhages or ICH-mimicking pathologies in this study. Of particular concern are instances where EMM fails simultaneously with the primary model while indicating complete consensus, as this could instill false confidence in clinicians and potentially increase misdiagnosis risk. This risk might be mitigated by training EMM in the future using synthetic datasets⁴⁸ with artificially generated difficult cases, such as those with less obvious hemorrhages, with ICH-mimicking features, representing a diverse patient population, or with various artifacts. As AI technology rapidly develops, many of the limitations currently facing EMM may be quickly overcome, presenting greater opportunities not only to improve EMM performance but also to reduce the resources required to implement the EMM approach itself.
In conclusion, our EMM framework represents a significant advancement in black-box clinical AI monitoring, enabling case-by-case confidence estimation without requiring access to primary model parameters or intermediate outputs. By leveraging ensemble agreement levels, EMM provides actionable insights, potentially enhancing diagnostic confidence while reducing cognitive burden. As AI continues to integrate into clinical workflows, approaches like EMM that provide transparent confidence measures will be essential for maintaining trust, ensuring quality, and ultimately improving patient outcomes in resource-constrained environments.
We evaluated EMM’s monitoring capabilities on two distinct ICH detection AI systems: an FDA-cleared commercial product and an open-source model that secured second place in the RSNA 2019 ICH detection challenge^35,49.
The FDA-cleared primary model is a black-box system with undisclosed training data and architecture that provides binary labels for the presence of ICH and identifies suspicious slices. We monitored this model using EMM trained on the complete (100%, N = 18,370) RSNA 2019 ICH Detection Challenge dataset. While this dataset’s license restricts usage to academic and non-commercial purposes, we do not have access to information regarding whether the FDA-cleared primary model utilized this dataset during its development.
The open-source RSNA 2019 Challenge second-place model³⁴ employs a 2D ResNeXt-101⁵⁰ network for slice-level feature extraction, followed by two levels of Bidirectional LSTM networks for feature summarization and ICH detection. We selected the second-place model rather than the first-place winner because retraining the top model would require extensive time while offering only marginal performance improvement (≤2.3%) based on the leaderboard. Although the original open-source primary model was trained on the complete RSNA Challenge dataset, we retrained it using only 50% of the data and reserved the remaining 50% for EMM training. This simulates real-world deployment scenarios where the primary model and EMM are trained on different datasets. For both the FDA-cleared and the open-source primary models, preprocessing is already built into the software; therefore, no additional image preprocessing was performed, and the original DICOM was sent as the input.
EMM consisted of five independently trained 3D convolutional neural networks (CNNs) and was built in two versions with different numbers of trainable parameters: a large version utilizing ResNet⁵¹ 101 and 152, and DenseNet⁵² 121, 169, and 201; and a small version employing ResNet 18, 34, 50, 101, and 152. These networks were initialized using 2D ImageNet⁵³ pre-trained weights and adapted to 3D via the Inflated 3D (I3D)⁵⁴ method, which has shown success previously⁴⁶. EMM sub-models were trained using the open-source RSNA 2019 ICH Detection Challenge^35,49 dataset and were evaluated using an independent dataset collected at our institution. We trained the models on different subsets of the RSNA dataset to investigate EMM’s data efficiency across different amounts of training data, including 18,370 (100%), 9185 (50%), 4592 (25%), and 918 (5%) studies. All subsets had an ICH prevalence of about 41%. EMM training details and parameters are shown in Supplementary Table 1.
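The I3D adaptation mentioned above can be illustrated with a minimal sketch. It assumes the commonly described inflation rule (repeat a 2D filter along the new depth axis and divide by the depth) and uses a single hypothetical 3 × 3 filter in plain Python rather than the actual ResNet or DenseNet weights. Under that rule, a stack of identical slices produces the same response as the original 2D filter on one slice.

```python
def inflate_kernel(kernel_2d, depth):
    """Inflate a 2D kernel (rows x cols) to 3D (depth x rows x cols).

    The 2D weights are repeated `depth` times and divided by `depth`,
    so the total weight mass is unchanged.
    """
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def conv2d_response(patch, kernel):
    """Response of one 2D filter at one location (sum of elementwise products)."""
    return sum(p * w for prow, krow in zip(patch, kernel) for p, w in zip(prow, krow))


def conv3d_response(volume, kernel_3d):
    """Response of one 3D filter over a stack of 2D patches."""
    return sum(conv2d_response(sl, k) for sl, k in zip(volume, kernel_3d))


# Hypothetical "pre-trained" 2D filter and image patch, for illustration only.
kernel_2d = [[1.0, 0.0, -1.0],
             [2.0, 0.0, -2.0],
             [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0],
         [4.0, 1.0, 2.0],
         [5.0, 9.0, 2.0]]

depth = 3
kernel_3d = inflate_kernel(kernel_2d, depth)
static_volume = [patch] * depth  # the same slice repeated along depth

response_2d = conv2d_response(patch, kernel_2d)
response_3d = conv3d_response(static_volume, kernel_3d)
print(response_2d, response_3d)
```

Preserving the 2D response on a static volume is what lets the inflated network start from useful ImageNet features before fine-tuning on CT volumes.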
We evaluated the EMM using an independent dataset of 2919 CT studies collected at our institution, with no overlap with the EMM training data. The dataset included 1315 ICH-positive and 1604 ICH-negative cases (45% ICH prevalence), featured a balanced gender distribution (50.1% male and 49.8% female), and spanned a wide age range (0.16–104.58 years; median: 67.13 years; interquartile range: 49.65–80.00 years). Since AI model performance is known to vary with disease prevalence^55,56, we evaluated both the primary model and EMM performance across different prevalence levels. A recent internal evaluation at our institution covering 8935 studies between July and November 2024 revealed ICH prevalences of 34.77% for in-patient, 9.09% for out-patient, and 6.52% for emergency units, with an overall average prevalence of 16.70%. Based on these observations, we selected three representative prevalence levels for evaluation: 30, 15, and 5%. This study was approved by the Stanford Institutional Review Board (IRB-58903) with a waiver of informed consent due to the use of retrospective data.
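To make the dependence on prevalence concrete, the sketch below applies Bayes' rule to compute PPV and NPV at the evaluation set's 45% prevalence and at the three selected levels. The sensitivity and specificity values are hypothetical placeholders, not results from this study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test at a given disease prevalence."""
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


SENSITIVITY = 0.90  # hypothetical value, for illustration only
SPECIFICITY = 0.90  # hypothetical value, for illustration only

results = {
    p: predictive_values(SENSITIVITY, SPECIFICITY, p)
    for p in (0.45, 0.30, 0.15, 0.05)
}
for p, (ppv, npv) in results.items():
    print(f"prevalence {p:.0%}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

With sensitivity and specificity held fixed, PPV falls and NPV rises as prevalence decreases, which is why the same model can look quite different in emergency, out-patient, and in-patient settings.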
To prepare input data for the EMM, we preprocessed all non-contrast axial head CT DICOM images using the Medical Open Network for AI (MONAI) toolkit⁵⁷. The preprocessing pipeline consisted of several standardization steps: reorienting images to the “left-posterior-superior” (LPS) coordinate system, normalizing the in-plane resolution to 0.45 mm, and resizing (either cropping or padding, depending on the matrix size) the in-plane matrix dimensions to 512 × 512 pixels using PyTorch’s adaptive average pool method, while preserving the original slice resolution. During training, we employed random cropping in the slice dimension, selecting a contiguous block of 30 slices. For testing, we used a sliding window of 30 slices and averaged the ICH SoftMax probabilities across overlapping windows to generate the final prediction.
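The test-time sliding-window averaging described above can be sketched as follows. The stride of 15 slices, the handling of studies shorter than one window, and the `toy_model` stand-in are assumptions for illustration; the text only fixes the window length of 30 slices and the averaging of probabilities across windows.

```python
def window_starts(num_slices, window=30, stride=15):
    """Start indices of contiguous windows that together cover every slice.

    The stride is an assumption; a final window is added so that the last
    slices are always included.
    """
    if num_slices <= window:
        return [0]
    starts = list(range(0, num_slices - window + 1, stride))
    if starts[-1] != num_slices - window:
        starts.append(num_slices - window)
    return starts


def study_probability(num_slices, predict_window, window=30, stride=15):
    """Average the per-window ICH probability over all windows of a study."""
    starts = window_starts(num_slices, window, stride)
    probs = [predict_window(s, min(s + window, num_slices)) for s in starts]
    return sum(probs) / len(probs)


def toy_model(start, stop):
    """Hypothetical sub-model: confident only if the window contains slice 40."""
    return 0.9 if start <= 40 < stop else 0.1


starts = window_starts(70)
prob = study_probability(70, toy_model)
print(starts, prob)
```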
To comprehensively analyze features that drive high EMM agreement, we manually annotated a smaller dataset (N = 281), including ICH segmentation, volume measurements, and identification of mimicking imaging features. This curated dataset comprised 210 ICH-positive and 71 ICH-negative subjects and their associated studies. The ICH-positive cases span seven distinct ICH subtypes: subdural (SDH, N = 35), subarachnoid (SAH, N = 50), epidural (EDH, N = 15), intraparenchymal (IPH, N = 19), intraventricular (IVH, N = 2), diffuse axonal injury (DAI, N = 1), and multi-compartmental hemorrhages (Multi-H, N = 88). Among the 71 ICH-negative cases, 43 cases were specifically selected to include features that mimic hemorrhages (e.g., hyperdensity such as calcification or tumor), while 28 were from normal subjects. A neuroradiology fellow with 6 years of experience reviewed and validated all clinical labels to ensure accurate ground truth for our analysis.
In Shapley analysis, we prepared a comprehensive list of features including pathology-related metrics (ICH volume and type), patient characteristics (brain volume, age, and gender), positioning parameters (rotation and translation), image acquisition parameters (pixel spacing, slice thickness, kVp, X-ray tube current, and CT scanner manufacturer), and image reconstruction parameters (reconstruction convolution kernel and filter type). It is worth noting that these features are not used by EMM to make predictions, but rather represent real-world factors that may correlate with the EMM agreement levels.
To elucidate the features contributing to the high level of agreement between EMM sub-models and the primary model, we conducted Shapley analysis³⁶ using the Python “shap” package (v0.46.0). This analysis employed an XGBoost⁵⁸ (v2.1.1) classifier to learn the relationship between feature values and EMM agreement and to evaluate how much each feature contributes to high EMM agreement, with contributions expressed on a probability scale between 0 and 1. Higher Shapley values indicate features important for 100% EMM agreement.
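For readers unfamiliar with Shapley values, the following sketch computes them exactly by brute force for a toy value function. It does not call the shap or XGBoost packages used in the study, and the feature names and probability increments are hypothetical. It only illustrates the quantity that those tools estimate: each feature's average marginal contribution to the predicted probability of full agreement.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of coalitions of size k that do not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_agreement_probability(present):
    """Hypothetical P(100% EMM agreement) given which features are available."""
    p = 0.5  # baseline probability with no feature information
    if "ich_volume" in present:
        p += 0.3
    if "slice_thickness" in present:
        p += 0.1
    if "ich_volume" in present and "age" in present:
        p += 0.05  # interaction term, shared between the two features
    return p


phi = shapley_values(["ich_volume", "slice_thickness", "age"], toy_agreement_probability)
print(phi)
```

The values sum to the difference between the full-information output and the baseline, which is the additivity property that makes Shapley values interpretable on a probability scale.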
To evaluate whether ICH volume influences EMM monitoring performance, we implemented a systematic protocol for ICH volume estimation. First, we employed Viola-UNet⁵⁹, the winning model from the INSTANCE 2022 ICH Segmentation Challenge^60,61, to generate initial ICH segmentations. A radiology resident with 3 years of experience reviewed these segmentations and marked any errors directly on the images. A trained researcher then manually corrected the marked discrepancies using 3D Slicer software (version 5.6.2) to ensure accurate hemorrhage delineation. Finally, we calculated ICH volumes using the corrected ICH masks and image resolution data from the DICOM headers.
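The final step, converting a corrected mask to a volume, amounts to counting labeled voxels and multiplying by the voxel size. The sketch below assumes a binary mask stored as nested lists and uses the spacing between slices as the third voxel dimension. Both the function name and that choice are illustrative, not taken from the study code.

```python
def ich_volume_ml(mask, pixel_spacing_mm, slice_spacing_mm):
    """Volume of a binary mask in millilitres.

    mask: nested list indexed [slice][row][col] with values 0 or 1.
    pixel_spacing_mm: (row spacing, column spacing) in mm, as in the DICOM header.
    slice_spacing_mm: distance between adjacent slices in mm (assumed here to
    be the appropriate through-plane voxel size).
    """
    voxel_count = sum(v for sl in mask for row in sl for v in row)
    voxel_mm3 = pixel_spacing_mm[0] * pixel_spacing_mm[1] * slice_spacing_mm
    return voxel_count * voxel_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Toy mask: two 2 x 2 slices with five hemorrhage voxels in total.
toy_mask = [
    [[1, 1], [0, 1]],
    [[1, 0], [0, 1]],
]
volume = ich_volume_ml(toy_mask, pixel_spacing_mm=(0.5, 0.5), slice_spacing_mm=5.0)
print(volume)
```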
Since hemorrhage detection can be challenging in brains of different sizes or certain brain orientations, we analyzed brain volume and orientation as potential factors affecting EMM performance, alongside the previously mentioned features. Using the FMRIB Software Library⁶² (FSL 6.0.7.13), we developed an automated pipeline following an established protocol⁶³ to extract brain masks and estimate brain volumes. We then employed FSL FLIRT (FMRIB’s Linear Image Registration Tool) to perform 9-degree-of-freedom brain registration, aligning each image to the MNI 2019b non-symmetrical T1 brain template³⁷. The resulting rotation, translation, and scaling parameters were incorporated into our Shapley analysis as quantitative measures of brain orientation.
When the decreased confidence group in Fig. 3 is further reviewed by radiologists, some cases may actually be found to be labeled correctly by the primary model; we consider these cases to be false alarms. The false-alarm rate is defined as the percentage of unnecessary reviews of correctly labeled cases. After further reviewing the cases in the decreased confidence group, we assumed that the radiologists would always correctly label the cases, improving overall accuracy. While this assumption may not always hold, especially in complex borderline cases, this analysis demonstrates the maximum accuracy improvement that can be gained from using the EMM framework. We define relative improvement in accuracy as the percentage increase in accuracy after reviewing the decreased confidence group compared to the baseline accuracy of the primary model, i.e.,
$$\text{Relative improvement in accuracy} = \frac{\text{Accuracy}_{\text{after review}} - \text{Accuracy}_{\text{primary}}}{\text{Accuracy}_{\text{primary}}} \times 100\%$$
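The two quantities defined in this paragraph can be computed as in the sketch below. It follows the stated assumption that every reviewed case ends up correctly labeled. The denominator of the false-alarm rate is an assumption here (the number of cases sent for review), since the prose does not pin it down, and the toy data are hypothetical.

```python
def review_metrics(primary_correct, flagged):
    """Baseline accuracy, post-review accuracy, false-alarm rate, relative gain.

    primary_correct[i]: True if the primary model labeled case i correctly.
    flagged[i]: True if case i falls in the decreased confidence group.
    """
    n = len(primary_correct)
    baseline_acc = sum(primary_correct) / n

    # Assumption from the text: radiologist review always yields the correct label.
    after_correct = [c or f for c, f in zip(primary_correct, flagged)]
    after_acc = sum(after_correct) / n

    # False alarms: flagged cases the primary model had already labeled correctly.
    false_alarms = sum(1 for c, f in zip(primary_correct, flagged) if c and f)
    n_flagged = sum(flagged)
    # Assumed denominator: number of flagged (reviewed) cases.
    false_alarm_rate = 100.0 * false_alarms / n_flagged if n_flagged else 0.0

    relative_improvement = 100.0 * (after_acc - baseline_acc) / baseline_acc
    return baseline_acc, after_acc, false_alarm_rate, relative_improvement


# Toy example: 10 cases, 8 correct; cases 7, 8, and 9 are flagged for review.
primary_correct = [True] * 8 + [False] * 2
flagged = [False] * 7 + [True] * 3
baseline, after, far, gain = review_metrics(primary_correct, flagged)
print(baseline, after, far, gain)
```

In this toy case one of the three reviews is a false alarm and accuracy rises from 0.8 to 1.0, a relative improvement of 25%.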
To assess the reliability of both primary models and EMM’s performance metrics, we calculated 95% confidence intervals (CIs) using bootstrapping. We conducted 1000 random draws across studies with replacement from the set of ground-truth labels and corresponding model predictions. To create an evaluation dataset at target prevalence levels (30, 15, and 5%) different from the original distribution (45%), we down-sampled ICH-positive and resampled ICH-negative cases. For example, to create datasets with a controlled 30% prevalence of ICH-positive cases, we performed random sampling with replacement from our original dataset. Specifically, we randomly selected 0.3 × N_n ICH-positive cases and N_n ICH-negative cases (where N_n represents the total number of ICH-negative cases in the original dataset). After each draw, we computed key performance metrics such as sensitivity, positive predictive value (PPV), specificity, and negative predictive value (NPV). We then determined the 95% confidence intervals by identifying the 2.5th and 97.5th percentiles of these metrics across all bootstrap iterations.
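A minimal sketch of the percentile bootstrap described above is shown here in plain Python. The toy labels and predictions are hypothetical, the prevalence-adjustment step is omitted, and the percentile interpolation rule is an implementation choice rather than something specified in the text.

```python
import random


def confusion_metrics(labels, preds):
    """Sensitivity, specificity, PPV, and NPV from binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)

    def ratio(a, b):
        # NaN if a draw happens to contain no cases of the required kind.
        return a / b if b else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def percentile(sorted_vals, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def bootstrap_ci(labels, preds, n_boot=1000, seed=0):
    """95% percentile CIs from resampling studies with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    draws = {}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        metrics = confusion_metrics([labels[i] for i in idx], [preds[i] for i in idx])
        for name, value in metrics.items():
            draws.setdefault(name, []).append(value)
    return {
        name: (percentile(sorted(vals), 2.5), percentile(sorted(vals), 97.5))
        for name, vals in draws.items()
    }


# Toy data: 40 positives (36 detected) and 60 negatives (54 correctly cleared).
labels = [1] * 40 + [0] * 60
preds = [1] * 36 + [0] * 4 + [0] * 54 + [1] * 6
ci = bootstrap_ci(labels, preds)
print(ci)
```

Labels and predictions are resampled together by index, so each draw keeps a study's ground truth paired with its own prediction.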
To test the significance of differences in metrics between groups, we also applied a percentile-bootstrap paired-samples test to estimate the p value, with the null hypothesis that there is no difference between the two paired groups. This test was performed using a customized Python script mirroring the computations in the R boot.paired.per function (https://rdrr.io/cran/wBoot/man/boot.paired.per.html). For the sex, age, and race subgroup analysis, we performed a non-parametric bootstrap ANOVA test for ED-SPAUC and ED-SNAUC. Following bootstrap resampling and variance comparison, we computed an observed F-statistic from the bootstrapped samples and compared it to a null distribution generated by permuting group labels⁶⁴.
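The permutation step of the subgroup test can be sketched as a generic one-way permutation F-test. This is not the authors' script: the per-subgroup metric values are hypothetical, the bootstrap resampling that produces them is omitted, and the add-one p value correction is a common convention assumed here.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F-statistic: between-group over within-group mean squares."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand_mean = sum(all_vals) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_f_test(groups, n_perm=2000, seed=0):
    """Compare the observed F to a null built by shuffling group labels."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


# Hypothetical metric values for three subgroups (illustration only).
subgroups = [
    [0.90, 0.91, 0.92, 0.89, 0.93],
    [0.80, 0.81, 0.79, 0.82, 0.78],
    [0.70, 0.72, 0.69, 0.71, 0.68],
]
observed_f, p_value = permutation_f_test(subgroups)
print(observed_f, p_value)
```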
The EMM training dataset is based on the RSNA 2019 ICH detection challenge and can be found at https://www.rsna.org/rsnai/ai-image-challenge/rsna-intracranial-hemorrhage-detection-challenge-2019. Internal validation data are under internal review and will be published through Stanford University.
EMM code and model weights are available publicly on GitHub (https://github.com/stanfordaide/ICH_EMM).
Joshi, G. et al. FDA-approved artificial intelligence and machine learning (AI/ML)-enabled medical devices: an updated landscape. Electronics 13, 498 (2024).
Challen, R. et al. Artificial intelligence, bias and clinical safety. BMJ Qual. Saf. 28, 231–237 (2019).
Khera, R., Simon, M. A. & Ross, J. S. Automation bias and assistive AI: risk of harm from AI-driven clinical decision support. JAMA 330, 2255–2257 (2023).
Del Gaizo, A. J., Osborne, T. F., Shahoumian, T. & Sherrier, R. Deep learning to detect intracranial hemorrhage in a national teleradiology program and the impact on interpretation time. Radiol. Artif. Intell. 6, e240067 (2024).
Center for Devices and Radiological Health. Blog: a lifecycle management approach toward delivering safe, effective AI-enabled health care. FDA (2025).
Allen, B. et al. Evaluation and real-world performance monitoring of artificial intelligence models in clinical practice: try It, buy It, check It. J. Am. Coll. Radiol. 18, 1489–1496 (2021).
Chow, J., Lee, R. & Wu, H. How do radiologists currently monitor AI in radiology and what challenges do they face? An interview study and qualitative analysis. J. Digit. Imaging Inform. Med. https://doi.org/10.1007/s10278-025-01493-8 (2025).
Larson, D. B. et al. Assessing completeness of clinical histories accompanying imaging orders using adapted open-source and closed-source large language models. Radiology 314, e241051 (2025).
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30, 1134–1142 (2024).
Li, L. et al. A scoping review of using large language models (LLMs) to investigate electronic health records (EHRs). Preprint at https://doi.org/10.48550/arXiv.2405.03066 (2024).
Le Guellec, B. et al. Performance of an open-source large language model in extracting information from free-text radiology reports. Radiol. Artif. Intell. 6, e230364 (2024).
Reichenpfader, D., Müller, H. & Denecke, K. Large language model-based information extraction from free-text radiology reports: a scoping review protocol. BMJ Open 13, e076865 (2023).
Reichenpfader, D., Müller, H. & Denecke, K. A scoping review of large language model based approaches for information extraction from radiology reports. npj Digit. Med. 7, 222 (2024).
Lambert, B., Forbes, F., Doyle, S., Dehaene, H. & Dojat, M. Trustworthy clinical AI solutions: a unified review of uncertainty quantification in deep learning models for medical image analysis. Artif. Intell. Med. 150, 102830 (2024).
Gawlikowski, J. et al. A survey of uncertainty in deep neural networks. Artif. Intell. Rev. 56, 1513–1589 (2023).
Kiyasseh, D., Cohen, A., Jiang, C. & Altieri, N. A framework for evaluating clinical artificial intelligence systems without ground-truth annotations. Nat. Commun. 15, 1808 (2024).
Ramalho, T. & Miranda, M. In International Workshop on Engineering Dependable and Secure Machine Learning Systems 84–96 (2020).
Raghu, M. et al. Direct uncertainty prediction for medical second opinions. In Proc. 36th International Conference on Machine Learning 5281–5290 (2019).
Malinin, A. & Gales, M. in Advances in Neural Information Processing Systems 31 (2018).
Kull, M. et al. in Advances in Neural Information Processing Systems Vol. 32 (Curran Associates, Inc., 2019).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Proc. 34th International Conference on Machine Learning 1321–1330 (2017).
Kumar, A., Liang, P. S. & Ma, T. Verified uncertainty calibration. In Advances in Neural Information Processing Systems (2019).
Louizos, C. & Welling, M. Multiplicative normalizing flows for variational Bayesian neural networks. In Proc. 34th International Conference on Machine Learning 2218–2227 (2017).
Ritter, H., Botev, A. & Barber, D. A scalable Laplace approximation for neural networks. In Proc. 6th International Conference on Learning Representations (ICLR) 6 (2018).
Welling, M. & Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proc. 28th International Conference on Machine Learning 681–688 (2011).
Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24 (2011).
Gal, Y. & Ghahramani, Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proc. 33rd International Conference on Machine Learning 1050–1059 (2016).
Lemay, A. et al. Improving the repeatability of deep learning models with Monte Carlo dropout. npj Digit. Med. 5, 1–11 (2022).
Egele, R. et al. AutoDEUQ: automated deep Ensemble with uncertainty quantification. In Proc. 26th International Conference on Pattern Recognition (ICPR) 1908–1914 (2022).
Mehrtash, A., Wells, W. M., Tempany, C. M., Abolmaesumi, P. & Kapur, T. Confidence calibration and predictive uncertainty estimation for deep medical image segmentation. IEEE Trans. Med. Imaging 39, 3868–3878 (2020).
Wenzel, F., Snoek, J., Tran, D. & Jenatton, R. Hyperparameter Ensembles for robustness and uncertainty quantification. In Advances in Neural Information Processing Systems 6514–6527 (2020).
Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep Ensembles. In Advances in Neural Information Processing Systems 30 (2017).
Kwon, Y., Won, J.-H., Kim, B. J. & Paik, M. C. Uncertainty quantification using Bayesian neural networks in classification: application to biomedical image segmentation. Computat. Stat. Data Anal. 142, 106816 (2020).
Hanley, D. RSNA intracranial hemorrhage detection. Second place winner (2024).
Flanders, A. E. et al. Construction of a machine learning dataset through collaboration: the RSNA 2019 brain CT hemorrhage challenge. Radiol. Artif. Intell. 2, e190211 (2020).
Lundberg, S. M. & Lee, S. I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (2017).
Fonov, V., Evans, A., McKinstry, R., Almli, C. & Collins, D. Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. Neuroimage 47, S102 (2009).
Saade, C. et al. Intracranial calcifications on CT: an updated review. J. Radiol. Case Rep. 13, 1–18 (2019).
Winkels, M. & Cohen, T. S. Pulmonary nodule detection in CT scans with equivariant CNNs. Med. Image Anal. 55, 15–26 (2019).
Mutasa, S., Sun, S. & Ha, R. Understanding artificial intelligence based radiology studies: What is overfitting? Clin. Imaging 65, 96–99 (2020).
Feng, J. et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digit. Med. 5, 1–9 (2022).
Larson, D. B. A vision for global CT radiation dose optimization. J. Am. Coll. Radiol. 21, 1311–1317 (2024).
Huang, S.-C. et al. Multimodal foundation models for medical imaging – a systematic review and implementation guidelines. Preprint at https://doi.org/10.1101/2024.10.23.24316003 (2024).
Huang, S.-C. et al. Self-supervised learning for medical image classification: a systematic review and implementation guidelines. npj Digit. Med. 6, 1–16 (2023).
Chen, Z. et al. A vision-language foundation model to enhance efficiency of chest X-ray interpretation. Preprint at https://doi.org/10.48550/arXiv.2401.12208 (2024).
Blankemeier, L. et al. Merlin: a vision language foundation model for 3D computed tomography. Preprint at https://doi.org/10.48550/arXiv.2406.06512 (2024).
Küper, A. & Krämer, N. Psychological traits and appropriate reliance: factors shaping trust in AI. Int. J. Hum.–Comput. Interact. 41, 4115–4131 (2025).
Bluethgen, C. et al. A vision–language foundation model for the generation of realistic chest X-ray images. Nat. Biomed. Eng. https://doi.org/10.1038/s41551-024-01246-y (2024).
RSNA intracranial hemorrhage detection challenge. https://www.rsna.org/rsnai/ai-image-challenge/rsna-intracranial-hemorrhage-detection-challenge-2019 (2019).
Xie, S., Girshick, R., Dollar, P., Tu, Z. & He, K. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 5987–5995 (2017).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2261–2269 (2017).
Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2016).
Carreira, J. & Zisserman, A. Quo Vadis, action recognition? A new model and the kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 4724–4733 (2017).
Godau, P. et al. Navigating prevalence shifts in image analysis algorithm deployment. Med. Image Anal. 102, 103504 (2025).
Park, S. H. & Han, K. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology 286, 800–809 (2018).
Cardoso, M. J. et al. MONAI: an open-source framework for deep learning in healthcare. Preprint at https://doi.org/10.48550/ARXIV.2211.02701 (2022).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (2016).
Liu, Q. et al. Voxels intersecting along orthogonal levels attention U-net for intracerebral haemorrhage segmentation in head CT. In IEEE 20th International Symposium on Biomedical Imaging (ISBI) 1–5 (2023).
Li, X. et al. Hematoma expansion context guided intracranial hemorrhage segmentation and uncertainty estimation. IEEE J. Biomed. Health Inform. 26, 1140–1151 (2022).
Li, X. et al. The state-of-the-art 3D anisotropic intracranial hemorrhage segmentation on non-contrast head CT: the INSTANCE challenge. Preprint at https://doi.org/10.48550/arXiv.2301.03281 (2023).
Smith, S. M. et al. Advances in functional and structural MR image analysis and implementation as FSL. Neuroimage 23, S208–S219 (2004).
Muschelli, J. et al. Validated automatic brain extraction of head CT images. Neuroimage 114, 379–385 (2015).
Maris, E. & Oostenveld, R. Nonparametric statistical testing of EEG- and MEG-data. J. Neurosci. Methods 164, 177–190 (2007).
Stanford Department of Radiology; Stanford 3D and Quantitative Imaging Laboratory (3DQ Lab).
These co-senior authors contributed equally: Akshay S. Chaudhari, David B. Larson.
Department of Radiology, School of Medicine, Stanford University, Stanford, CA, 94304, USA
Zhongnan Fang, Andrew Johnston, Lina Y. Cheuy, Hye Sun Na, Magdalini Paschali, Camila Gonzalez, Bonnie A. Armstrong, Arogya Koirala, Derrick Laurel, Andrew Walker Campion, Michael Iv, Akshay S. Chaudhari & David B. Larson
AI Development and Evaluation Laboratory (AIDE), School of Medicine, Stanford University, Stanford, CA, 94304, USA
Zhongnan Fang, Andrew Johnston, Lina Y. Cheuy, Hye Sun Na, Magdalini Paschali, Camila Gonzalez, Bonnie A. Armstrong, Arogya Koirala, Akshay S. Chaudhari & David B. Larson
3D and Quantitative Imaging Laboratory (3DQ), School of Medicine, Stanford University, Stanford, CA, 94304, USA
Derrick Laurel
Department of Biomedical Data Science, School of Medicine, Stanford University, Stanford, CA, 94304, USA
Akshay S. Chaudhari
Manuscript drafting and manuscript revision for important intellectual content, all authors; Study concepts and design: Z.F., A.S.C., and D.B.L.; Data/statistical analysis: Z.F.; Data collection: D.L., A.W.C., and M.I.; Data cleaning and annotation: Z.F., A.J., H.S.N., M.P., C.G., A.K., D.L., and A.W.C.; Literature research: Z.F., L.Y.C., A.S.C., M.P., and C.G.
Correspondence to Zhongnan Fang.
Z.F. Stock option holder of LVIS Corp. A.J. No relevant relationships. L.Y.C. No relevant relationships. H.S.N. No relevant relationships. M.P. No relevant relationships. C.G. No relevant relationships. B.A.A. No relevant relationships. A.K. No relevant relationships. D.L. No relevant relationships. A.W.C. No relevant relationships. M.I. No relevant relationships. D.B.L. Member of the Board of Chancellors of the American College of Radiology and Board of Trustees of the American Board of Radiology, shareholder in Bunkerhill Health; receives research funding from the Gordon and Betty Moore Foundation. A.S.C. receives research support from NIH grants R01 HL167974, R01HL169345, R01 AR077604, R01 EB002524, R01 AR079431, P41 EB027060; ARPA-H grants AY2AX000045 and 1AYSAX0000024-01; and NIH contracts 75N92020C00008 and 75N92020C00021. Unrelated to this work, A.S.C. receives research support from GE Healthcare, Philips, Microsoft, Amazon, Google, NVIDIA, Stability; has provided consulting services to Patient Square Capital, Chondrometrics GmbH, and Elucid Bioimaging; is co-founder of Cognita; has equity interest in Cognita, Subtle Medical, LVIS Corp, and Brain Key.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Fang, Z., Johnston, A., Cheuy, L.Y. et al. Automated real-time assessment of intracranial hemorrhage detection AI using an ensembled monitoring model (EMM). npj Digit. Med. 8, 608 (2025). https://doi.org/10.1038/s41746-025-02007-0
Received: 16 May 2025
Accepted: 15 September 2025
Published: 16 October 2025
Version of record: 16 October 2025
DOI: https://doi.org/10.1038/s41746-025-02007-0
npj Digital Medicine (npj Digit. Med.)
ISSN 2398-6352 (online)
© 2026 Springer Nature Limited