← heapsort

How Far Can a Small Coding Model Go With a Better Harness?

DEV.to AI · May 20, 2026

The article examines how far a small coding model (GPT-5.1-Codex-Mini) can go on Terminal-Bench 2.0. It reaches a 61.6% score by optimizing the "harness", the wrapper around the model, rather than by swapping in a larger model. The harness plays a crucial role in performance, and its effect is most visible with smaller models, where harness mistakes have a greater impact.
