Was looking at an ICLR 2025 Oral paper and I am shocked it got an oral [D]
A user expresses shock at an ICLR 2025 Oral paper, criticizing how it evaluates SQL code generated by LLMs. The paper reportedly scored the generated SQL with natural language metrics rather than execution-based metrics, resulting in a false positive rate of roughly 20%.
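
To make the complaint concrete, here is a rough sketch of how a text-similarity score and an execution check can disagree on the same SQL pair. This is not the paper's metric or setup: the `users` table, both queries, and the helper names `text_similarity` and `execution_match` are made up for illustration, and `difflib.SequenceMatcher` is only a stand-in for whatever natural language metric was actually used.

```python
import sqlite3
from difflib import SequenceMatcher


def text_similarity(pred: str, gold: str) -> float:
    """Token-level similarity in [0, 1]; a stand-in for a natural language metric."""
    return SequenceMatcher(None, pred.lower().split(), gold.lower().split()).ratio()


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """True only if both queries return the same rows (order ignored)."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        # A prediction that fails to run cannot be correct.
        return False
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)


# Hypothetical toy database, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (name TEXT, age INTEGER);
    INSERT INTO users VALUES ('ann', 17), ('bob', 18), ('cy', 30);
    """
)

gold = "SELECT name FROM users WHERE age >= 18"
pred = "SELECT name FROM users WHERE age > 18"  # one character off, drops 'bob'

sim = text_similarity(pred, gold)
exec_ok = execution_match(conn, pred, gold)

print(f"text similarity: {sim:.3f}")
print(f"execution match: {exec_ok}")
```

The two queries differ by a single operator, so the text score comes out above 0.8, yet they return different rows and the execution check fails. A metric that accepts the prediction on text overlap alone would count this pair as a false positive.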