Abstract
Background: Multimodal large language models (MLLMs) show promise for automated fracture detection, yet their diagnostic stability under repeated inference remains underexplored. This study evaluates the diagnostic accuracy, stability, and intra-model consistency of four MLLMs in detecting hand fractures on plain radiographs.
Methods: Hand radiographs from 65 adult patients with confirmed hand fractures (30 phalangeal, 30 metacarpal, 5 scaphoid) were evaluated by four MLLMs: GPT-5 Pro, Gemini 2.5 Pro, Claude Sonnet 4.5, and Mistral Medium 3.1. Each image was independently analyzed five times per model using identical zero-shot prompts (65 images × 5 runs × 4 models = 1,300 inferences). Diagnostic accuracy, inter-run reliability (Fleiss' κ), case-level agreement profiles, and subgroup performance were assessed.
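For illustration, the sketch below shows one way the two headline metrics could be computed for a single model: Fleiss' κ treating the five repeated runs as five "raters" per case, and per-inference accuracy against ground truth. This is not the study's published code; the run matrix is synthetic placeholder data, and the variable names (runs, truth) are assumptions made for the example. The statsmodels helpers shown are real library functions.

```python
# Minimal sketch: inter-run reliability (Fleiss' kappa) and per-inference
# accuracy for one model, using synthetic stand-in data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
n_cases, n_runs = 65, 5

# Hypothetical outputs: rows = cases, columns = the 5 repeated runs,
# entries = model verdict (1 = fracture detected, 0 = no fracture).
runs = rng.integers(0, 2, size=(n_cases, n_runs))
truth = np.ones(n_cases, dtype=int)  # every case has a confirmed fracture

# Fleiss' kappa over the 5 runs per case: first build the
# cases x categories count table, then compute kappa.
table, _ = aggregate_raters(runs)
kappa = fleiss_kappa(table, method="fleiss")

# Accuracy pooled over all 5 runs (325 inferences for this model).
accuracy = (runs == truth[:, None]).mean()
print(f"kappa = {kappa:.2f}, accuracy = {accuracy:.1%}")
```

Computing both metrics on the same run matrix makes the paper's central distinction concrete: a model that always returns the same wrong verdict scores κ near 1 with low accuracy, while an unstable model scores low on both.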
Results: GPT-5 Pro achieved the highest accuracy (64.3%) and consistency (κ = 0.71), followed by Gemini 2.5 Pro (56.9%, κ = 0.57). Mistral Medium 3.1 exhibited high agreement (κ = 0.88) despite low accuracy (38.5%), indicating systematic error. Claude Sonnet 4.5 showed both low accuracy (33.8%) and low consistency (κ = 0.33), reflecting instability. Phalangeal fractures were detected reliably by the top-performing models; scaphoid fractures remained challenging for all four.
Conclusions: Diagnostic accuracy and consistency are distinct performance dimensions; high intra-model agreement does not imply correctness. MLLMs should be regarded as experimental diagnostic reasoning systems rather than reliable standalone tools for clinical fracture detection.