How Far Can a Small Coding Model Go With a Better Harness?
DEV.to AI · May 20, 2026
The article examines how far a small coding model (GPT-5.1-Codex-Mini) can go on Terminal-Bench 2.0, where it reaches a score of 61.6% after its "harness" is optimized rather than the model being swapped for a larger one. It argues that the wrapper around the model plays a crucial role in performance, and that this is most visible with smaller models, where harness mistakes have a greater impact.