In the first installment of this series, we explored a set of prominent frameworks that have shaped how we evaluate and govern AI in healthcare. Since then, a number of new resources—spanning global health organizations, academic collaborations, and policy groups—have contributed critical perspectives to the conversation. In this second part, we build on our initial review by highlighting a new set of frameworks and research that help advance responsible AI governance across the healthcare lifecycle.
1. WHO Framework for AI-Based Medical Devices
The World Health Organization’s “Generating Evidence for Artificial Intelligence Based Medical Devices” provides a detailed framework for evaluating AI-based tools across training, validation, and clinical use. It emphasizes contextual relevance, clinical performance, and regulatory preparedness. This resource is especially valuable for developers working at the intersection of AI and medical device regulation.
2. CRAFT-MD: Conversational Reasoning Assessment Framework for Testing in Medicine
CRAFT-MD focuses on evaluating AI systems—particularly large language models—in medical reasoning tasks. It introduces a novel assessment protocol that centers on clinical reasoning quality, offering a more structured and clinically meaningful alternative to traditional accuracy metrics. This is a critical step forward for generative AI governance in healthcare.
3. HAIRA: A Maturity Model for AI Governance
The HAIRA framework proposes a governance maturity model tailored to healthcare AI systems. Developed through a systematic review, HAIRA offers a staged approach to governance, allowing organizations to benchmark progress across policy, transparency, stakeholder engagement, and risk management.
4. TPLC: Total Product Lifecycle
Adapted from FDA guidance, the Total Product Lifecycle framework recognizes that AI systems in healthcare are not static. TPLC provides a structure for managing performance, safety, and oversight across the pre-market, deployment, and post-market phases. It aligns well with the needs of continuously learning AI systems and supports adaptive regulatory mechanisms.
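To make the post-market phase concrete, here is a minimal sketch of lifecycle-style performance surveillance in the spirit of TPLC. The function name, the AUC metric, and the tolerance value are illustrative assumptions, not part of any FDA guidance: the idea is simply that a deployed model's recent performance is compared against its pre-market baseline, and a sustained drop triggers a review.

```python
def monitor_performance(baseline_auc, recent_aucs, tolerance=0.05):
    """Flag a deployed model for review if post-market performance drifts.

    baseline_auc: performance established during pre-market validation.
    recent_aucs:  performance measured over recent post-deployment windows.
    tolerance:    allowed drop before review is triggered (assumed policy value).
    """
    floor = baseline_auc - tolerance
    breaches = [auc for auc in recent_aucs if auc < floor]
    # Require a sustained drop (two or more breaches), not a one-off dip.
    return {"needs_review": len(breaches) >= 2, "breaches": breaches}
```

In practice the evaluation windows, metric, and escalation policy would be set by the organization's governance process; the point of the sketch is that "continuous oversight" can be reduced to a small, auditable check that runs on a schedule.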
5. AIGG: Aotearoa New Zealand’s Approach to AI Governance
AIGG serves as a national case study in implementing AI governance in healthcare systems. It focuses on principles like Māori data sovereignty, equity, and transparency, demonstrating how cultural context can be embedded into governance frameworks in meaningful and actionable ways.
6. SPIRIT-AI: Clinical Trial Design for AI Interventions
The SPIRIT-AI extension provides detailed guidelines for clinical trial protocols involving AI-based interventions. As clinical validation becomes a requirement for many healthcare AI tools, SPIRIT-AI offers essential direction for designing trials that meet regulatory and scientific standards.
7. OPTICA: Evaluation in Health Organizations
OPTICA offers a real-world evaluation framework focused on how AI solutions are integrated, adopted, and assessed within healthcare organizations. It considers not only technical performance but also organizational readiness, workflow impact, and clinician trust.
8. SALIENT: End-to-End Clinical AI Implementation
SALIENT stands for “Systematic Approaches to Learning, Implementation, Evaluation and Translation.” This framework fills a critical gap by offering an end-to-end implementation guide for clinical AI, from early development through to scaling and post-deployment monitoring.
9. JAMA Principles for Addressing Algorithmic Bias
In a 2022 article, JAMA authors proposed a set of guiding principles to address how algorithmic bias may reinforce racial and ethnic disparities in care. These include incorporating equity audits, data diversity assessments, and community engagement into the AI development lifecycle.
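As a rough illustration of what an equity audit can look like in code, the sketch below compares a model's error rates across demographic subgroups and flags large disparities. The function, the record format, and the 0.05 disparity threshold are all illustrative assumptions, not part of the JAMA principles themselves.

```python
def equity_audit(records, threshold=0.05):
    """Compare error rates across subgroups and flag large disparities.

    records:   list of (group, y_true, y_pred) tuples for audited predictions.
    threshold: maximum acceptable gap between subgroup error rates (assumed).
    """
    counts = {}  # group -> (total, errors)
    for group, y_true, y_pred in records:
        total, errors = counts.get(group, (0, 0))
        counts[group] = (total + 1, errors + (y_true != y_pred))
    rates = {g: errors / total for g, (total, errors) in counts.items()}
    disparity = max(rates.values()) - min(rates.values())
    return {"error_rates": rates, "disparity": disparity,
            "flagged": disparity > threshold}
```

A real audit would use richer fairness metrics and statistical tests, but even this simple subgroup comparison surfaces the kind of disparity the principles ask teams to look for.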
10. Health Equity and Racial Justice Integration (Project MUSE)
This framework, available via Project MUSE, proposes embedding health equity and racial justice principles throughout the AI lifecycle—from data selection to algorithmic impact assessment. It offers a compelling argument for redefining success metrics beyond technical accuracy.
11. npj Digital Medicine: Prospective AI Evaluation
Coombs et al. (2022) demonstrate a machine learning framework that supports real-time clinical decision-making in oncology. Their approach shows how AI evaluation can be structured prospectively, enabling tighter alignment between model predictions and patient outcomes in practice.
12. Guidelines for Prediction Models (de Hond et al.)
A comprehensive scoping review by de Hond et al. outlines quality criteria for AI-based prediction models in healthcare. These guidelines cover everything from dataset construction to performance reporting and validation standards, helping raise the bar for transparency and reproducibility.
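To give a flavor of the performance-reporting side of such guidelines, here is a minimal validation report computing two standard quantities for a binary prediction model: discrimination (the c-statistic, via pairwise concordance) and calibration-in-the-large (mean predicted probability minus observed event rate). The function is a hedged sketch, not the checklist itself; real reporting would cover far more, including confidence intervals and external validation.

```python
def validation_report(y_true, y_prob):
    """Minimal performance report: discrimination and calibration-in-the-large.

    y_true: binary outcomes (0/1).
    y_prob: predicted probabilities for the positive outcome.
    """
    pairs = concordant = 0.0
    for i, (t_i, p_i) in enumerate(zip(y_true, y_prob)):
        for t_j, p_j in zip(y_true[i + 1:], y_prob[i + 1:]):
            if t_i != t_j:  # only pairs with different outcomes count
                pairs += 1
                pos = p_i if t_i == 1 else p_j  # probability of the actual event
                neg = p_j if t_i == 1 else p_i
                concordant += 1.0 if pos > neg else 0.5 if pos == neg else 0.0
    c_stat = concordant / pairs if pairs else None
    calibration = sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true)
    return {"c_statistic": c_stat, "calibration_in_the_large": calibration}
```

Reporting both numbers side by side matters: a model can discriminate well while being badly miscalibrated, and the guidelines treat the two as separate quality criteria.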
13. EDI in AI Lifecycle (Nyariro et al.)
This scoping review protocol from BMJ Open provides a blueprint for how to incorporate Equity, Diversity, and Inclusion (EDI) principles throughout the AI lifecycle. From stakeholder representation to algorithmic fairness, it offers a structured path for making healthcare AI more inclusive.
14. Do No Harm Roadmap (Wiens et al.)
Finally, Wiens et al.’s widely cited paper from Nature Medicine introduces a “Do No Harm” roadmap for responsible ML in healthcare. The roadmap provides pragmatic guidance for bias mitigation, model robustness, and post-deployment monitoring—anchoring responsible AI development in clinical realities.
Where Do We Go From Here?
As the governance landscape continues to evolve, these new frameworks signal a shift toward more holistic, lifecycle-aware, and equity-driven approaches. Whether you're a developer, clinician, policymaker, or researcher, there is growing consensus that healthcare AI must be safe, effective, and just, not only in principle but in practice. The June 2025 edition of the Pacific AI Policy Suite consolidates the recommendations from all of the frameworks above into an actionable set of controls and best practices you can apply today. We will continue to survey developments and publications in this space, so that the policies and tools you adopt always reflect the cumulative published knowledge of the AI governance community.