As foundation models transition from research to clinical decision support, there is a critical need for rigorous evaluation frameworks tailored to real-world clinical tasks. This keynote describes the evolution of the open-source MedHELM (Medical Holistic Evaluation of Language Models) project, built by the Stanford Center for Research on Foundation Models (CRFM). Unlike traditional benchmarks that rely on narrow accuracy metrics, MedHELM provides a transparent, reproducible, and holistic framework for assessing large language models and multimodal systems across clinical reasoning, safety, and reliability. Since its inception, MedHELM has established itself as a premier open-source standard, evaluating models across more than 25 diverse medical tasks and benchmarking the industry's most prominent foundation models.
The session details the next milestone for the library: Pacific AI is assuming stewardship of the MedHELM project to ensure its long-term sustainability and accessibility for the open-source healthcare community. We outline a technical roadmap focused on simplifying the installation and deployment of MedHELM, including:
- Refactoring the library to remove unnecessary dependencies
- Optimizing for air-gapped research environments
- Lowering the barrier to entry for health-system researchers to contribute new clinical datasets and specialized test tasks
By moving from a centralized research project to community-driven infrastructure, MedHELM aims to remain the definitive open-source compass for the safe and effective evaluation of medical AI.