MedHELM and the Next Phase of Open Medical AI Evaluation

As foundation models transition from research to clinical decision support, there is a critical need for rigorous evaluation frameworks tailored to real-world clinical tasks. This keynote describes the evolution of the open-source MedHELM (Medical Holistic Evaluation of Language Models) project, built by the Stanford Center for Research on Foundation Models (CRFM). Unlike traditional benchmarks that rely on narrow accuracy metrics, MedHELM provides a transparent, reproducible, and holistic framework for assessing large language models and multimodal systems across clinical reasoning, safety, and reliability. Since its inception, MedHELM has established itself as a premier open-source standard, evaluating models across over 25 diverse medical tasks and providing benchmarks for the industry’s most prominent foundation models.

This session details the next milestone for the library, as Pacific AI assumes stewardship of the MedHELM project to ensure its long-term sustainability and accessibility for the open-source healthcare community. We outline a technical roadmap focused on simplifying the installation and deployment of MedHELM, including:

  • Refactoring the library to remove unnecessary dependencies
  • Optimizing for air-gapped research environments
  • Lowering the barrier to entry for health system researchers, making it easier to contribute new clinical datasets and specialized test tasks
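For context on the installation path the roadmap aims to simplify: MedHELM builds on the HELM framework, which is distributed as the `crfm-helm` Python package and driven by the `helm-run` CLI. A minimal sketch is below; the run entry name is a hypothetical placeholder, and the exact benchmark identifiers depend on the MedHELM release you install.

```shell
# Install the HELM framework that MedHELM extends (PyPI package: crfm-helm).
pip install crfm-helm

# Run an evaluation suite. The run entry below is a hypothetical placeholder;
# actual MedHELM benchmark names are listed in the project's documentation.
helm-run \
  --run-entries "medical_task:model=openai/gpt-4o" \
  --suite my-medhelm-suite \
  --max-eval-instances 10

# Aggregate the raw results into leaderboard-style summaries.
helm-summarize --suite my-medhelm-suite
```

Trimming the dependency tree behind `pip install crfm-helm` and making this workflow runnable without network access are exactly the kinds of changes the refactoring and air-gapped items above target.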

By moving from a centralized research project to a community-driven infrastructure, MedHELM aims to remain the definitive, open-source compass for the safe and effective evaluation of medical AI.


About the speakers
Suhana Bedi
PhD Candidate at Stanford University

Suhana Bedi is a PhD candidate in Biomedical Data Science at Stanford University, advised by Nigam Shah and Sanmi Koyejo. Her research focuses on trustworthy evaluation of medical AI systems and how to make large language models more useful in real clinical settings.

Miguel Fuentes
Research Engineer at Stanford Medicine

Miguel Fuentes is a Research Engineer at Stanford Medicine and holds an MS in Computer Science from Stanford, advised by Nigam Shah and Sneha Jain. His work focuses on the implementation and evaluation of AI systems at Stanford Health Care.

Scaling Vendor Due Diligence: Automating AI Risk Assessment from Policy to Contract

Evaluating third-party AI systems requires rigorous, standardized questionnaires to generate consistent risk scores. However, the process of mapping these questions to a vendor's broad, scattered, or sometimes non-existent documentation is...