RESEARCH · arXiv CS.AI · 5/4/2026
AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?
This work introduces AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, to evaluate tool-use abilities in AI models. Results indicate that small and mid-sized open-weight models are sufficient for much of the short-horizon, structured tool-use work prevalent in real agent pipelines.
