AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?

arXiv cs.AI · May 4, 2026

This work introduces AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, to evaluate the tool-use abilities of AI models. The results indicate that small and mid-sized open-weight models are sufficient for much of the short-horizon, structured tool-use work that is prevalent in real agent pipelines.

Open-Weight Models · LLMs · Benchmarking · Tool Use · AI Agents
