MedHELM and The Next Phase of Open-Source Medical AI Evaluation

In the current gold rush of clinical LLMs, the industry has hit a paradoxical wall. We have models that can pass the USMLE with flying colors, yet health systems remain hesitant to deploy them for actual clinical workflows. The reason is simple: a multiple-choice exam is a poor proxy for the messy reality of summarizing a 2,000-page patient record or the nuanced requirements of a tumor board.

To bridge this gap, the Medical Holistic Evaluation of Language Models (MedHELM) was created. It was made possible by a unique collaboration between the Center for Research on Foundation Models, Technology and Digital Solutions at Stanford Healthcare, and Microsoft Healthcare and Life Sciences, in partnership with faculty in the Departments of Medicine, Computer Science, Anesthesiology, Dermatology, Pediatrics, and Biomedical Data Science, as well as trainees from the MCiM program at the Clinical Excellence Research Center.

MedHELM has emerged as the definitive open-source standard for clinical AI validation – used routinely to evaluate the clinical capabilities of novel LLMs and Generative AI systems.

We are thrilled to announce a major milestone: MedHELM is transitioning to an independent, community-led open-source project. Pacific AI has assumed technical stewardship of the codebase, providing the engineering resources to transform MedHELM from a research project into a production-grade infrastructure for the global medical community.

MedHELM is – and will always remain – a fully open-source project under the Apache 2.0 license. Pacific AI’s role is not ownership but service: we are committed to supporting releases, upgrades, and technical maintenance for the benefit of the entire healthcare ecosystem.

Why MedHELM? Moving Beyond the Leaderboard Flex

For a data scientist, the limitation of traditional benchmarks like MedQA or PubMedQA is their narrow construct validity. Most existing medical benchmarks are essentially closed-book knowledge retrieval tests.

MedHELM is fundamentally different. It is a holistic framework designed to measure what actually matters in a clinical environment, based on real-world tasks built with clinicians:

  1. Taxonomy-Driven Coverage: While USMLE tests general knowledge, MedHELM evaluates 121 distinct tasks across 22 subcategories such as Clinical Note Generation, Patient Communication, and Diagnostic Support.
  2. Real-World Scenarios: MedHELM utilizes a mix of 31 diverse datasets, including 12 private and 6 gated-access collections, reflecting authentic EHR data rather than curated exam questions.
  3. Multi-Metric Evaluation: MedHELM doesn’t just look at “accuracy.” It measures calibration, robustness to prompt variation, and writing style.
  4. LLM-as-a-Jury: For open-ended tasks (like summarizing a discharge note), MedHELM uses a sophisticated ensemble of three strong LLMs (the “Jury”) to rate outputs on accuracy, completeness, and clarity. Research shows this approach achieves a higher agreement with expert clinicians (ICC = 0.47) than traditional NLP metrics like ROUGE or BERTScore.
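To make the jury idea concrete, here is a minimal sketch of how ratings from a three-judge ensemble might be aggregated into per-dimension scores. The judge names, rubric dimensions, and averaging scheme are illustrative assumptions, not MedHELM's actual implementation:

```python
from statistics import mean

# Hypothetical ratings from a three-judge LLM ensemble, each scoring one
# model-generated discharge summary on a 1-5 rubric. Judge names and
# rubric dimensions are illustrative only.
jury_ratings = {
    "judge_a": {"accuracy": 4, "completeness": 3, "clarity": 5},
    "judge_b": {"accuracy": 5, "completeness": 4, "clarity": 4},
    "judge_c": {"accuracy": 4, "completeness": 4, "clarity": 4},
}

def aggregate_jury(ratings: dict) -> dict:
    """Average each rubric dimension across all judges."""
    dimensions = next(iter(ratings.values())).keys()
    return {
        dim: mean(judge[dim] for judge in ratings.values())
        for dim in dimensions
    }

scores = aggregate_jury(jury_ratings)
print(scores)  # accuracy ≈ 4.33, completeness ≈ 3.67, clarity ≈ 4.33
```

Averaging across judges dampens any single model's idiosyncratic grading, which is one reason an ensemble correlates better with clinician ratings than a lone judge or surface-overlap metrics.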

Technical Contributions by Pacific AI

The transition to Pacific AI stewardship marks a shift toward Ease of Use and Scalability. Our engineering team has already completed several foundational updates to the MedHELM library, available openly on the MedHELM GitHub repo, to make it accessible to every medical data science team.

1. Decoupling from HELM Core

Historically, MedHELM was a module within the broader Stanford HELM repository. This created some “dependency hell” for medical researchers who only needed clinical benchmarks.

MedHELM is now a standalone library (uv pip install medhelm). It retains the rigorous HELM core engine but eliminates thousands of lines of unnecessary code related to general-purpose or vision-language models, making it faster and lighter.

2. Simplified Installation and “One-Command” Evaluation

We have refactored the installation process to support modern package managers like uv. You can now run a quick test on models like GPT-4o or Qwen in seconds:


uv run medhelm-run --run-entries "pubmed_qa:model=openai/gpt-4o" \
  --suite test_suite --max-eval-instances 2

We also introduced a modular installation system. Users can choose medhelm[standard] for simple QA, or medhelm[summarization] which includes other libraries like bert-score and nltk only when needed for clinical NLP tasks.

3. Air-Gapped and Local-First Architecture

Privacy is the north star of healthcare. We’ve optimized MedHELM to run against local inference servers, like vLLM or Ollama. This allows health systems to evaluate models on sensitive patient data without a single packet leaving their secure, private cloud.
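As a sketch of the local-first workflow, the commands below stand up an OpenAI-compatible endpoint with vLLM and verify it responds locally; the model name is illustrative, and the exact way MedHELM is pointed at the endpoint depends on your deployment configuration (consult the repo docs):

```shell
# Launch an OpenAI-compatible inference server on localhost with vLLM.
# The model checkpoint here is illustrative; any local model works.
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Sanity-check the endpoint -- no traffic leaves the machine:
curl http://localhost:8000/v1/models
```

Once the server is up, MedHELM evaluations can be configured to target `http://localhost:8000/v1`, so prompts containing sensitive patient data never traverse the public internet.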

4. Expansion of Test Suites

We have added support for more nuanced clinical tests, including MedCalc-Bench (for medical calculations) and MedHallu (specifically designed to catch hallucinations in clinical reasoning). The roadmap includes integrating OpenAI’s HealthBench benchmarks, as well as a suite of Agentic AI benchmarks that evaluate how well AI can navigate a multi-step clinical workflow.

How to Get Started with MedHELM

MedHELM is built by the community, for the community. We invite you to use it as your internal “source of truth” for model selection.

Installation

You can install the stable release directly from PyPI; Python 3.12 is required:


uv pip install medhelm
# Or, for the full clinical NLP suite:
uv pip install "medhelm[summarization]"

Running Your First Benchmark

After installing, you can run a benchmark, summarize the results, and view them in a local web dashboard:

1. Run:

uv run medhelm-run --run-entries "med_qa:model=..." \
  --suite my_first_eval

2. Summarize:

uv run helm-summarize --suite my_first_eval

3. Visualize:

uv run helm-server --suite my_first_eval

Open http://localhost:8000 to inspect individual prompts, model responses, and scoring logic.

A Call to the Open-Source Community

As an Apache 2.0 project, the future of MedHELM depends on your contributions. Pacific AI is here to maintain the infrastructure, but the clinical soul of the project comes from researchers like you. There are multiple ways for you to contribute:

  • New Scenarios: Have a specialized clinical dataset? Follow the Scenario Definition guide on medhelm.org to add it to the library.
  • Refining Metrics: Improve the “LLM-Jury” prompts to better capture nuances in specialty-specific documentation.
  • Feature Requests: Report bugs or suggest new evaluation capabilities on the GitHub Issue tracker.
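For contributors considering a new scenario, the pattern is roughly "wrap your dataset as a list of evaluation instances." The sketch below uses simplified stand-in classes; the real `Instance`, `Reference`, and `Scenario` types live in HELM's `helm.benchmark.scenarios` package, and the Scenario Definition guide on medhelm.org is the authoritative reference for their actual signatures:

```python
from dataclasses import dataclass, field

# Simplified stand-ins for HELM's Reference/Instance classes -- the real
# API differs; see the Scenario Definition guide for actual signatures.
@dataclass
class Reference:
    output: str
    tags: list = field(default_factory=list)

@dataclass
class Instance:
    input: str
    references: list
    split: str = "test"

class MyClinicalScenario:
    """Hypothetical scenario wrapping a specialized clinical dataset."""
    name = "my_clinical_scenario"  # illustrative scenario name

    def get_instances(self) -> list:
        # A real scenario would load your dataset from disk here.
        rows = [
            ("Patient reports chest pain radiating to the left arm...",
             "Consider acute coronary syndrome; obtain ECG and troponin."),
        ]
        return [
            Instance(input=q, references=[Reference(output=a, tags=["correct"])])
            for q, a in rows
        ]

instances = MyClinicalScenario().get_instances()
print(len(instances))  # 1
```

The key design choice to preserve from HELM is the separation between loading data (the scenario) and scoring outputs (the metrics), so a new dataset automatically works with every existing metric.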

The healthcare AI community is moving away from proprietary, black-box evaluations. In their place, we are building a transparent, reproducible, and scientifically rigorous standard that belongs to the world.

Explore the code, run your benchmarks, and join the conversation at medhelm.org.

