MedATLAS-Bench

A Comprehensive and Diverse Multi-modal Medical Benchmark for Large Language Models

Xiaotian Ma*, Anand R. Mysorekar*, Xiaomin Liang, Yu-Chun Hsu
Saber Malekmohammadi, Xiaoqian Jiang, Shayan Shams

McWilliams School of Biomedical Informatics, UTHealth Houston

*Equal contribution.

MedATLAS-Bench evaluates multimodal large language models across diverse clinical inputs, including structured text, 2D images, 3D volumetric data, and video, to support robust assessment in realistic medical scenarios.

The benchmark comprises 430 samples in total and spans multiple clinical tasks, such as classification, generation, and localization, across varied datasets and conditions.

Overall Leaderboard

Ranked model performance with average score across all samples.

Model

Score†

No results found

† Average score for all samples

* Results for GPT models exclude surgical videos due to the HIPAA-compliant Azure content policy.

- All models are evaluated using Pass@4.
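
The page does not spell out how Pass@4 is computed or how per-sample results are combined into the leaderboard score. The following is a minimal sketch under two assumptions: that a sample counts as passed if at least one of its four attempts is judged correct, and that the leaderboard score is the unweighted mean of those per-sample results. The function names `pass_at_k` and `leaderboard_score` are hypothetical and are not taken from the benchmark's code.

```python
from typing import Sequence


def pass_at_k(attempts: Sequence[bool], k: int = 4) -> float:
    """Return 1.0 if any of the first k attempts is correct, else 0.0.

    Assumption: "Pass@4" means a sample passes when at least one of
    four attempts is judged correct. The benchmark may define it differently.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return 1.0 if any(attempts[:k]) else 0.0


def leaderboard_score(per_sample_attempts: Sequence[Sequence[bool]], k: int = 4) -> float:
    """Average the per-sample pass@k values over all samples (unweighted mean)."""
    if not per_sample_attempts:
        raise ValueError("at least one sample is required")
    total = sum(pass_at_k(attempts, k) for attempts in per_sample_attempts)
    return total / len(per_sample_attempts)


# Illustrative data only: three samples, four attempts each.
example_attempts = [
    [False, False, True, False],   # passes on the third attempt
    [False, False, False, False],  # never passes
    [True, True, True, True],      # passes on every attempt
]
example_score = leaderboard_score(example_attempts)
```

With the illustrative data above, two of the three samples pass, so `example_score` is 2/3. Tasks scored on a continuous scale (for example, generation or localization quality) would need a different per-sample rule, such as taking the best score among the four attempts; that variant is not shown here.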

Acknowledgments

Funding

This work was supported by funding from the National Institutes of Health (NIH) (1R01NS138765-01), Google LLC, and the Ovarian Cancer Research Alliance (OCRA) (CRDGAI-2023–3-1002).

Collaborators

We are grateful to the following collaborators for their support in data acquisition and clinical expertise: Santiago Aristizabal Ortiz, MD; Mario E. Mahecha, MD; Laura A. Ocasio; Roy F. Riascos-Castaneda, MD; Elaine Stur, PhD; Anil Sood, MD; and Sunil Sheth, MD.

Data Access

For questions or requests regarding access to private datasets, please contact Shayan Shams, PhD, at [email protected].