RESEARCH30

Robust Reasoning Benchmark

arXiv CS.LG·April 13, 2026

This study proposes a new perturbation pipeline to evaluate the robustness of LLM reasoning, applying it to the AIME 2024 dataset. While frontier models show resilience, open-weight models suffer catastrophic accuracy drops, exposing structural fragility and potential issues with working memory or mechanical parsing.

robustness LLMs Model Evaluation Reasoning Benchmarks

Read original ↗