GPT-4.1 Hits 24.65% Dermatology Accuracy on Real Cases vs 42.25% on Benchmarks
DEV.to AI · May 7, 2026
A new study finds that multimodal large language models (LLMs) such as GPT-4.1 diagnose real hospital dermatology cases far less accurately than they do public benchmark cases. The research covered 5,811 cases and found that GPT-4.1 reached 24.65% accuracy in real clinical settings versus 42.25% on benchmarks, a gap of 17.6 percentage points.