RESEARCH30
Robust Reasoning Benchmark
arXiv CS.LG · April 13, 2026
This study proposes a perturbation pipeline for evaluating the robustness of LLM reasoning and applies it to the AIME 2024 dataset. Frontier models remain resilient, while open-weight models suffer catastrophic accuracy drops, exposing structural fragility and possible weaknesses in working memory or mechanical parsing.