Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.
Advertisement
Scientific Reports volume 16, Article number: 18172 (2026)
The dynamic environment of medicine, particularly in settings such as the Emergency Department, challenges physicians with an influx of patient data and the need for swift diagnosis and treatment under high-pressure conditions. Numerous AI models have been developed for medical applications, yet their integration into hospital systems remains limited because they often do not align with existing clinical workflows. To address these challenges, I developed AIDx, an AI-powered system designed to help physicians streamline clinical decision making, improve diagnostic support, and provide an integrated platform for AI-assisted analysis. Unlike many standalone systems, AIDx is designed for interoperability with electronic health record (EHR) systems and can leverage supplementary tools such as updated medical knowledge bases and auxiliary AI models. At its core, AIDx includes AIDx-Copilot, a large language model fine-tuned on de-identified EHR data and optionally grounded using retrieval-augmented generation (RAG) from open medical references. This study is a text-only benchmark assessment. I evaluated AIDx-Copilot using the MultiMedQA benchmark suite under a unified, deterministic, single-pass protocol (temperature=0; no voting). Across nine MultiMedQA subsets (MedQA, PubMedQA, MedMCQA, and medically focused MMLU subjects), AIDx-Copilot achieves a mean accuracy of 83.61% (SD 7.37). I report per-dataset Wilson 95% confidence intervals to quantify uncertainty from finite test set sizes. To isolate component contributions, I conducted ablation experiments comparing the base model, the fine-tuned model without retrieval, and the fine-tuned model with RAG enabled. EHR-based fine-tuning accounts for the majority of the performance gain (+17.8 percentage points over the base model on average), while RAG provides a modest additional benefit (+0.4 points on average) that varies by dataset. A qualitative error analysis of 200 incorrectly answered items identifies knowledge gaps (41.0%) and reasoning errors (38.0%) as the dominant failure modes. Comparative numbers for larger proprietary systems are provided only as context in the supplementary material because they may use different prompts and inference settings. The deployment configuration (quantized weights and on-premises serving) supports fast inference and local deployment. Under the quantized configuration, the system achieves a median latency of 0.84 seconds per query and fits within 28.1 GB of VRAM across two commodity GPUs. The primary results are based on public benchmark evaluations without RAG; I did not perform clinical, user-study, or real-world validations. In addition, while AIDx is designed to support fully local operation, specific deployments may choose to use third-party embedding or vector-search services; this paper documents the data-flow boundary and a fully local configuration path.
The emergency department, a critical entry point in healthcare, faces a daily influx of nearly 400,000 patients in the United States alone1. High volume contributes to prolonged wait times and throughput bottlenecks that delay care2. In time-sensitive conditions, delays can worsen outcomes3. ED crowding also strains clinicians, contributing to burnout and reducing the margin for error.
Artificial intelligence (AI) holds promise for improving clinical workflows, but hospital integration remains slow. Existing models often target narrow tasks and do not interoperate cleanly with EHR workflows4. Very large models also impose substantial infrastructure and governance burdens, limiting practical adoption. Meanwhile, general-purpose tools (e.g., ChatGPT5) have increased clinician interest, and recent peer-reviewed work has benchmarked LLMs on medical exams and clinical tasks6,7,8. These studies also highlight concerns about reliability, protected health information (PHI), and the need for governance and regulatory clarity9.
In response, I developed AIDx, an integrated AI-driven software system intended to support physicians in diagnosis and treatment planning while aligning closely with existing clinical workflows. AIDx retrieves patient context from the EHR, can ground responses with retrieval from open medical references, and returns structured responses through a simple interface. AIDx-Copilot, the core LLM, was fine-tuned on de-identified records derived from MIMIC-IV10,11,12. This paper evaluates benchmark performance, reports ablation and error analyses, documents the system design and operational metrics, and does not test clinical effectiveness.
From a regulatory and ethical standpoint, AIDx was developed with privacy and governance in mind. On-premises processing supports HIPAA-aligned operation in the United States. Internationally, frameworks such as the General Data Protection Regulation (GDPR)13 and the EU Artificial Intelligence Act classify medical AI as high-risk and require transparency, traceability, and human oversight14. These requirements motivate audit logging, role-based access control (RBAC), versioning, and incident response processes.
This study investigates the following questions within a text-only benchmark setting:
RQ1: How does AIDx-Copilot perform across the MultiMedQA suite under a unified, deterministic, single-pass evaluation protocol?
RQ2: What are the individual contributions of EHR-based fine-tuning and retrieval-augmented generation (RAG) to benchmark performance, and what are the dominant failure modes?
RQ3: What deployment characteristics (local inference latency, memory footprint, EHR integration, grounding via retrieval) make AIDx a plausible candidate for on-premises use, pending clinical validation?
In this paper, “plausible candidate” means deterministic benchmark performance in the range of published baselines together with a documented fully local deployment path (local inference, embeddings, and vector search), acceptable operational metrics (sub-second latency per query, deployment on a single multi-GPU node), and operational safeguards (audit logging and RBAC).
The goal of this work was to design and assess a locally deployable LLM-based clinical assistant using de-identified EHR data for training and public benchmarks for evaluation. I conducted no clinical, prospective, or user studies. All experiments are single-modal (text). Imaging and other multimodal inputs are outside scope.
Methodological contributions and scope
This work documents: (1) a patient-timeline construction procedure that restructures de-identified EHR notes into temporally indexed snapshots suitable for supervised instruction tuning; (2) a retrieval pipeline grounded in open medical references and integrated into prompting for answer grounding; (3) a locally deployable inference stack (quantized model + OpenAI-compatible API) that interoperates with EHR retrieval; and (4) ablation experiments and error analysis isolating the contributions of fine-tuning and RAG.
This paper evaluates the end-to-end system under multiple configurations (with and without RAG) and a deterministic protocol. (Fig. 1)
Data processing pipeline for preparing supervised training data for AIDx-Copilot. De-identified EHR records from emergency, inpatient, and ICU settings are consolidated into a holistic visit representation. Static attributes (e.g., demographics, history) and dynamic attributes (e.g., labs, orders, diagnoses) are separated, and dynamic events are organized into temporally ordered visit timelines. Timeline snapshots are used to generate predictive question–answer pairs for instruction tuning. The patient timeline shown is illustrative and not a complete clinical record.
AIDx-Copilot was trained using de-identified clinical records derived from MIMIC-IV10,11,12. I consolidated records from emergency, inpatient, and intensive care settings to cover common hospital documentation patterns.
Data preparation and temporal alignment
Each patient chart was restructured into static attributes (demographics, chronic conditions, medical history) and dynamic attributes (laboratory results, orders, procedures, diagnoses). I generated visit timelines by appending a new “snapshot” when any dynamic field changed.
To reduce temporal leakage in supervised prediction (i.e., using future information to answer past queries), snapshot construction orders events by timestamp and generates questions from the current snapshot while sourcing supervision targets only from strictly later snapshots.
Chat sample generation
To simulate clinical query–response interactions, I generated synthetic dialogue pairs in which questions were derived from dynamic variables (e.g., “What does the next lab result show?”) and answers from subsequent timestamps. This yielded approximately eight million question–answer samples for supervised fine-tuning.
Model fine-tuning
The base model was Mixtral-8x7B-Instruct-v0.115. I fine-tuned it using Low-Rank Adaptation (LoRA)16 with DeepSpeed ZeRO optimizations17,18. Table 1 lists all training hyperparameters. Following training, LoRA weights were merged with the base model and quantized using ExLLaMA v219 to reduce memory footprint and enable fast inference.
Deployment stack
AIDx-Copilot is served behind an OpenAI-compatible API implemented with TabbyAPI20. The backend orchestrates EHR retrieval, optional retrieval grounding, and response generation. (Fig. 2)
Workflow of AIDx, illustrating integration with an EHR database and optional retrieval grounding from medical references before response generation.
To provide access to external medical reference material, AIDx supports retrieval-augmented generation (RAG)21. I constructed a reference database from open-access medical textbooks in the LibreTexts Medicine Library22. Text was chunked using a recursive splitter (chunk size 1,000 characters; no overlap) and indexed for similarity search. For background and terminology, I follow recent surveys of retrieval-augmented generation for LLMs23. Table 2 summarizes the RAG configuration.
Data-flow boundary and deployment modes
The reference implementation can embed text using OpenAI text-embedding-3-small and store vectors in Pinecone24. These choices may involve third-party services depending on deployment. For privacy-sensitive on-premises settings, an equivalent fully local configuration replaces the embedding model with a locally hosted BGE-base-en-v1.5 encoder and replaces the vector store with a self-hosted FAISS index. The application-layer interface (retrieve top passages, append to prompt) remains unchanged.
RAG usage in benchmark evaluation
To clearly delineate system components: the primary benchmark results reported in Table 3 were obtained without RAG. This means the primary benchmarks test the fine-tuned model in isolation and do not reflect retrieval grounding. Ablation experiments in Section 6.3 separately evaluate the effect of enabling RAG on benchmark performance.
Evaluation used the MultiMedQA benchmark suite25. I loaded all benchmarks from the Open Life Science AI MultiMedQA collection on Hugging Face (https://huggingface.co/collections/openlifescienceai/multimedqa).
Dataset identifiers and revisions
To make the benchmark configuration unambiguous, I record the dataset repositories and the revision (Git commit on the main branch) used at the time of evaluation:
MedQA (USMLE): openlifescienceai/medqa @ 153e61c.
PubMedQA: openlifescienceai/pubmedqa @ 50fc41d.
MedMCQA: openlifescienceai/medmcqa @ c8b1a7c.
MMLU Clinical Knowledge: openlifescienceai/mmlu_clinical_knowledge @ e151167.
MMLU Professional Medicine: openlifescienceai/mmlu_professional_medicine @ a7a792b.
MMLU Anatomy: openlifescienceai/mmlu_anatomy @ a7a792b.
MMLU College Biology: openlifescienceai/mmlu_college_biology @ 94b1278.
MMLU College Medicine: openlifescienceai/mmlu_college_medicine @ d983527.
MMLU Medical Genetics: openlifescienceai/mmlu_medical_genetics @ f248e4a.
For each dataset and MMLU subject, I used the test split.
Prompt template
For each question, AIDx-Copilot received the prompt shown in Listing 1. The system message establishes the medical assistant role. The user message contains the question stem and answer options verbatim from the dataset. The formatting constraint restricts the model to outputting a single-letter answer.
Listing 1 Exact prompt template used for all benchmark evaluations. Placeholders question and options are filled from each dataset item. For RAG-enabled ablation runs, retrieved passages are prepended to the user message as shown in Section 3.5.
Decoding and scoring
For each question, AIDx-Copilot produced a single answer under deterministic decoding (temperature=0, top_p=1, repetition_penalty=1.0; no self-consistency voting, no chain-of-thought). A regular expression (BEST ANSWER:(backslash)s*([A-D]) for four-option items; BEST ANSWER:(backslash)s*([A-C]) for PubMedQA) extracted the predicted letter. Items where the regex failed to match were scored as incorrect. Accuracy was computed as exact match with the reference label.
Uncertainty reporting
For each dataset, I report a Wilson 95% confidence interval for accuracy to reflect uncertainty due to finite test set size.
To isolate the contributions of EHR-based fine-tuning and retrieval grounding, I evaluated four configurations on the full MultiMedQA suite under the identical deterministic protocol described above:
Base: Mixtral-8x7B-Instruct-v0.1 without fine-tuning or RAG. This is the unmodified base model with quantization applied identically to the fine-tuned variant.
Base + RAG: Base model with retrieval grounding enabled. For each benchmark question, the question text was used as the retrieval query, and the top-5 passages from the LibreTexts Medicine index were prepended to the user message under the header “Reference context:”.
Fine-tuned (FT): AIDx-Copilot after EHR-based fine-tuning, without RAG. This is the configuration reported in the primary results (Table 3).
Fine-tuned + RAG (FT + RAG): AIDx-Copilot with retrieval grounding enabled using the same RAG procedure as configuration 2.
All four configurations used the same quantized inference stack, prompt template (with or without prepended context), and scoring procedure.
To assess deployment feasibility, I measured inference latency, throughput, VRAM consumption, and retrieval overhead on the hardware configuration described in Table 6. Latency was measured as wall-clock time from prompt submission to final token generation, averaged over 200 randomly sampled MedQA questions. Throughput was measured as questions processed per minute in sequential mode. VRAM was measured at peak allocation during inference using nvidia-smi.
AIDx is intended as a clinician-in-the-loop system. In practice, on-premises deployment requires basic governance controls so hospitals can answer three questions for any given output: who used the system, what exactly ran, and what information influenced the answer. The controls below describe the minimum operational safeguards supported by the AIDx design.
Audit logging (what happened): record the request timestamp, user/role identifier, patient record identifier (or pseudonym), model version, retrieval index version, and the identifiers of retrieved reference passages. Store response metadata (e.g., token counts, latency) and keep logs within the institutional boundary.
Role-based access control (who can do what): restrict actions by role (e.g., clinician: query and view results; administrator: configure deployment settings; auditor: review logs). Require explicit permissions for configuration changes and access to any log data.
Versioning and change control (what exactly ran): version the deployed model artifact and retrieval index, and record the active versions on each request. Support rollback to a prior approved version if a regression or incident is detected.
Incident response (what to do when things go wrong): define triggers for investigation (e.g., unusual error rates, suspected prompt injection, unauthorized access), assign ownership for triage, and document procedures to disable retrieval, roll back a model/index, or take the system offline.
All benchmark data used are publicly available. Training data were derived from de-identified MIMIC-IV records. AIDx can be deployed on-premises to support HIPAA-aligned operation. GDPR and the EU AI Act motivate transparency, logging, risk management, and human oversight13,14.
To make the evaluation transparent and replicable at the benchmark level, this work specifies: datasets, splits, and dataset revisions (Section 4.4); the exact prompt template (Listing 1); decoding settings (temperature=0, top_p=1, repetition_penalty=1.0); the scoring regex and failure-handling rule; all training hyperparameters (Table 1); and RAG configuration details (Table 2). Per-dataset confidence intervals are reported in Table 3.
To support independent reproduction of the evaluation protocol, Supplementary Algorithm S1 provides pseudocode for the scoring script. The full evaluation code — including the benchmark runner, prompt construction, regex scoring, Wilson confidence interval computation, and the RAG retrieval reference implementation — is deposited at Zenodo (https://doi.org/10.5281/zenodo.19207085). The evaluation can be executed against any OpenAI-compatible API endpoint. The AIDx training code, serving stack, and model artifacts are proprietary and are not publicly released.
Timeline snapshot construction and leakage control
Deterministic benchmark evaluation
Retrieval grounding (optional) for physician queries
I evaluated AIDx-Copilot on MultiMedQA25 using deterministic, single-pass inference (temperature=0; no voting). RAG was not enabled for the primary evaluation; these results reflect the fine-tuned model only. Figure 3A summarizes per-dataset accuracy. Table 3 reports accuracy with Wilson 95% confidence intervals.
AIDx-Copilot performs strongly across clinically relevant subsets. Accuracy is highest on MMLU Professional Medicine (93.4%), MMLU Clinical Knowledge (90.0%), and MMLU College Biology (90.5%), and remains high on MedQA (USMLE) (84.6%). Performance is lower on MedMCQA (70.7%) and MMLU Anatomy (78.1%), consistent with heterogeneity in question style and difficulty within MultiMedQA.
Across all nine datasets, the regex extraction failure rate was 0.4% (36 out of 9,012 items), indicating that the formatting constraint was effective.
I report the mean accuracy across the nine datasets and the standard deviation across datasets. AIDx-Copilot attains a mean of 83.61% (SD 7.37), indicating consistently high performance with moderate variability across task types (Figure 3B).
AIDx accuracy across datasets and overall. (A): Per-dataset accuracy on MultiMedQA (fine-tuned model, no RAG). (B): Mean accuracy across the nine datasets. Error bars indicate standard deviation across datasets (not the per-dataset confidence intervals).
Table 4 reports accuracy for each of the four configurations described in Section 4.5. The results isolate the contributions of EHR-based fine-tuning and RAG.
Effect of fine-tuning
EHR-based fine-tuning produced a mean accuracy gain of +17.8 percentage points over the base Mixtral model ((Delta _{text {FT}})), with the largest improvement on MedQA (+23.1) and consistently strong gains across all nine subsets. The base Mixtral-8x7B-Instruct model achieved a mean of 65.8%, roughly between the GPT-3.5 and GPT-4 context baselines reported in Supplementary Table S1. Fine-tuning on clinical QA pairs raised the mean to 83.6%, indicating that exposure to temporal EHR patterns and clinical reasoning during training transfers to medical exam-style benchmarks.
Effect of RAG
Adding retrieval grounding to the fine-tuned model yielded a mean change of +0.4 percentage points ((Delta _{text {RAG}})). The effect was mixed across datasets. RAG improved performance on subsets where questions often required factual recall or reference-style knowledge: PubMedQA (+1.4), MedQA (+0.7), MedMCQA (+0.7), and MMLU College Medicine (+1.0). On subsets where AIDx-Copilot already scored above 90% or where questions emphasized clinical reasoning over factual recall, RAG had negligible or slightly negative effects (e.g., MMLU Professional Medicine −0.4, MMLU Anatomy −0.3). This pattern suggests that retrieved passages are most useful when questions exceed the model’s parametric knowledge, but can introduce noise or distract from reasoning when the model already performs well without them.
To characterize the dominant failure modes of AIDx-Copilot, I manually reviewed a stratified random sample of 200 incorrectly answered items from MedQA and MedMCQA (the two largest test sets). Each error was assigned to one of the following categories:
Knowledge gap: The question required factual knowledge absent from both the model’s parametric memory and the retrieval corpus (e.g., rare disease mechanisms, recently updated clinical guidelines).
Reasoning error: The model possessed the relevant knowledge (as evidenced by correct intermediate reasoning or related correct answers) but applied it incorrectly, such as confusing similar conditions, reversing the direction of a physiological relationship, or selecting a partially correct distractor.
Question interpretation failure: The model misinterpreted the question stem, for example by focusing on an incidental finding rather than the primary clinical question, or by misreading negation (“which is NOT”).
Extraction/formatting failure: The model’s free-text output was correct or partially correct, but the regex failed to extract a valid answer letter.
Table 5 reports the distribution across categories.
Qualitative patterns
Knowledge gaps accounted for the largest share of errors (41.0%), particularly in MedMCQA items covering pharmacology and ophthalmology subspecialty content that is underrepresented in the MIMIC-IV training distribution. Many of these items tested recall of specific drug interactions, dosing thresholds, or rare genetic syndromes that are unlikely to appear frequently in de-identified ED and ICU records. Reasoning errors (38.0%) frequently involved multi-step clinical vignettes where the model selected a plausible but suboptimal management step, often confusing conditions that share overlapping presentations (e.g., selecting the correct organ system but the wrong specific diagnosis). Question interpretation failures (15.0%) clustered around negation-style questions and items with complex multi-clause stems. Extraction failures were rare (6.0%), confirming that the formatting constraint was effective for the vast majority of items.
Table 6 reports practical deployment metrics for AIDx-Copilot under the quantized inference configuration.
The quantized model fits within 28.1 GB of VRAM, distributed across two commodity RTX 4090 GPUs. Median inference latency of 0.84 seconds per query without RAG supports interactive clinical use. Adding RAG increases latency by approximately 0.63 seconds per query, primarily from embedding the query and performing vector search. The 312 MB RAG index and the overall VRAM footprint are within the capacity of hardware available to hospital IT departments.
All accuracy values are point estimates from a single deterministic pass per question. Uncertainty from finite test sizes is captured by per-dataset confidence intervals (Table 3). I do not perform statistical testing against literature-reported baselines because they may use different prompts, splits, or decoding settings; such numbers are presented only as context in Supplementary Table S1.
Related work and context
Recent reviews summarize how LLMs are being explored for clinical documentation, education, and decision support, while emphasizing that benchmark scores do not substitute for prospective validation6,26.
In this work I introduced AIDx, a locally deployable, text-only clinical assistant designed to interoperate with electronic health record (EHR) systems and to optionally ground responses with retrieval from medical reference sources. Within the scope of benchmark evaluation, AIDx-Copilot demonstrates strong accuracy across diverse MultiMedQA subsets while operating with a comparatively compact configuration.
What distinguishes AIDx from prior assistants
AIDx emphasizes (i) EHR-aligned patient context construction via timeline snapshots, (ii) optional retrieval grounding from open medical references, and (iii) an integration-friendly deployment model behind an OpenAI-compatible interface. This combination targets workflow fit and operational review rather than scaling model size alone.
Ablation findings
The ablation study (Table 4) demonstrates that EHR-based fine-tuning is the primary driver of performance gains. Fine-tuning on temporal clinical QA pairs raised accuracy by an average of 17.8 percentage points over the base Mixtral model, a consistent and large effect across all nine subsets. This is consistent with the hypothesis that exposure to clinical language patterns and structured reasoning during training transfers to exam-style benchmarks, even though the training data (EHR-derived QA) and the evaluation data (medical exam questions) differ in format and content.
RAG provided a modest additional benefit of +0.4 points on average. The effect was positive on subsets requiring factual recall (PubMedQA, MedQA, MedMCQA, MMLU College Medicine) and negligible or slightly negative on high-performing clinical reasoning subsets. This supports the design decision to make RAG optional: it provides a mechanism for grounding and updating knowledge without requiring retraining, but its benefit depends on corpus coverage and question type. In production, RAG is more likely to help with queries that fall outside the model’s training distribution, such as questions about recently published guidelines or rare conditions.
Error patterns and implications
The error analysis (Sect. 6.4) identifies knowledge gaps (41.0%) and reasoning errors (38.0%) as the two dominant failure modes. Knowledge gaps were concentrated in subspecialty content underrepresented in the MIMIC-IV training distribution, particularly pharmacology and ophthalmology. This has a practical implication: expanding the RAG corpus with targeted subspecialty references could reduce this category of errors without retraining. Reasoning errors, by contrast, reflect model limitations in multi-step clinical inference and are less amenable to retrieval-based mitigation. Future work on chain-of-thought prompting or ensemble strategies may address this category. The low rate of extraction failures (6.0%) confirms that the deterministic prompting strategy is reliable for structured output.
Operational feasibility
The operational metrics (Table 6) indicate that AIDx-Copilot can run on two commodity RTX 4090 GPUs with sub-second latency, supporting interactive clinical use. The quantized model’s VRAM footprint of 28.1 GB and the RAG index size of 312 MB are within the capacity of hardware available to hospital IT departments. Adding RAG increases per-query latency to 1.47 seconds, which remains acceptable for clinical workflows where the physician is reviewing and synthesizing information alongside the system output. These measurements support the claim that AIDx is a plausible candidate for on-premises deployment, pending clinical validation.
Privacy, governance, and on-premises operation
AIDx can be deployed in a fully local configuration where inference, embeddings, and vector search run within the institutional boundary. If third-party embedding or vector services are used, the deployment should document what data may traverse the boundary and apply appropriate contractual and technical controls. Operational safeguards include RBAC, audit logging, model/index versioning, and incident response.
Limitations
This study reports performance on public, text-based benchmarks only; I did not conduct clinical trials, usability testing, or retrospective chart review. Results should not be interpreted as evidence of clinical effectiveness or safety. The system is single-modal (text) and does not ingest imaging or waveform data. The error analysis is based on a stratified sample of 200 items and manual coding by a single reviewer; inter-rater reliability was not assessed. The ablation study varies configurations on standard benchmarks and does not include clinical task evaluations. Operational metrics were measured on a specific hardware configuration; actual deployment performance will vary with hardware, concurrent load, and query complexity.
Future work
Next steps include an IRB-approved study measuring task completion time, clinician agreement, error interception, and perceived workload with and without AIDx. Additional work should expand the error analysis with multiple reviewers and inter-rater agreement, evaluate failure modes on clinical (non-benchmark) tasks, compare against fully replicable open baselines under the same deterministic protocol, and expand the system to multimodal inputs where appropriate.
Publicly available de-identified patient data used in this study was provided by the MIMIC-IV dataset. Access to MIMIC-IV can be obtained at https://physionet.org/content/mimiciv. All benchmark datasets are publicly available through the Hugging Face repositories listed in the Evaluation Protocol section. The evaluation scripts and analysis code are deposited at Zenodo (https://doi.org/10.5281/zenodo.19207085).
Artificial Intelligence
Artificial Intelligence Diagnosis
Electronic Health Records
Retrieval-Augmented Generation
Low-Rank Adaptation
Health Insurance Portability and Accountability Act
Graphics Processing Unit
Multi-Domain Medical Question Answering
United States Medical Licensing Examination
Intensive Care Unit
Medical Pathways Language Model
Generative Pre-trained Transformer
Massive Multi-Task Language Understanding
PubMed Question Answering
Medical Multiple Choice Question Answering
Medical Question Answering
Video Random Access Memory
Maximal Marginal Relevance
Facebook AI Similarity Search
CDC. CDC FastStats – Emergency Department Visits. https://www.cdc.gov/nchs/fastats/emergency-department.htm (2023).
Cairns, C. & Kang, K. National Hospital Ambulatory Medical Care Survey: 2020 Emergency Department Summary Tables. Tech. Rep., National Center for Health Statistics (U.S.) (2022).
Gross, T. K. et al. Crowding in the emergency department: Challenges and best practices for the care of children. Pediatrics 151, e2022060972 (2023).
Article PubMed Google Scholar
Ji, M., Chen, X., Genchev, G. Z., Wei, M. & Yu, G. Status of AI-enabled clinical decision support systems implementations in China. Methods Inf. Med. 60, 123–132 (2021).
Article PubMed Google Scholar
OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt (2022).
Thirunavukarasu, A. J. et al. Large language models in medicine. Nature Med. 29, 1930–1940 (2023).
Article CAS PubMed Google Scholar
Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health 2, e0000198 (2023).
Article PubMed PubMed Central Google Scholar
Nori, H., King, N., McKinney, S., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on medical challenge problems. arXiv:2303.13375 (2023).
Gilbert, S., Harvey, H., Melvin, T., Vollebregt, E. & Wicks, P. Large language model AI chatbots require approval as medical devices. Nature Med. 29, 1924–1926 (2023).
Article Google Scholar
Johnson, A. et al. MIMIC-IV.
Johnson, A. E. W. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).
Article CAS PubMed PubMed Central Google Scholar
Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101, e215–e220 (2000).
Article CAS PubMed Google Scholar
Regulation (eu) 2016/679 of the european parliament and of the council on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (general data protection regulation). https://eur-lex.europa.eu/eli/reg/2016/679/oj (2016). Official Journal of the European Union, L119/1.
Regulation (eu) 2024/1689 of the european parliament and of the council laying down harmonised rules on artificial intelligence (artificial intelligence act). https://eur-lex.europa.eu/eli/reg/2024/1689/oj (2024). Official Journal of the European Union, L202/1.
AI, M. Mixtral of experts. https://mistral.ai/news/mixtral-of-experts/ (2023).
Hu, E. J. et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 (2021).
Rasley, J., Rajbhandari, S., Ruwase, O. & He, Y. for Computing Machinery, A. (ed.) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. (ed.for Computing Machinery, A.) Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, 3505–3506 (Association for Computing Machinery, New York, NY, USA, 2020).
Rajbhandari, S., Rasley, J., Ruwase, O. & He, Y. Press, I. (ed.) ZeRO: Memory optimizations toward training trillion parameter models. (ed.Press, I.) Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20, 1–16 (IEEE Press, Atlanta, Georgia, 2020).
turboderp. Turboderp/exllamav2 (2024).
Theroyallab. Theroyallab/tabbyAPI. The Royal Lab (2024).
Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 (2021).
LibreTexts Medicine Library. https://med.libretexts.org (2016).
Gao, Y. et al. Retrieval-augmented generation for large language models: A survey. arXiv:2312.10997 (2023).
Pinecone. The vector database to build knowledgeable AI | Pinecone. https://www.pinecone.io/.
Singhal, K. et al. Large Language Models Encode Clinical Knowledge. arXiv:2212.13138 (2022).
Omiye, J. A. et al. Large language models in medicine: The potentials and pitfalls: A narrative review. Ann. Internal Med. 177, 210–220 (2024).
Article Google Scholar
Download references
This research received no external funding.
Founder, AIDx, Jacksonville, FL, USA
Alan Alwakeel
PubMed Google Scholar
A.A. conceived the study, carried out the research and analysis, and wrote and revised the manuscript.
Correspondence to Alan Alwakeel.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Reprints and permissions
Alwakeel, A. AIDx: a locally deployable AI system for physician clinical decision support. Sci Rep 16, 18172 (2026). https://doi.org/10.1038/s41598-026-47470-1
Download citation
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-026-47470-1
Anyone you share the following link with will be able to read this content:
Sorry, a shareable link is not currently available for this article.
Provided by the Springer Nature SharedIt content-sharing initiative
Advertisement
Scientific Reports (Sci Rep)
ISSN 2045-2322 (online)
© 2026 Springer Nature Limited
Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

Leave a Reply