One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage. [D]
The author expresses frustration that benchmark performance often fails to predict whether an AI workflow will succeed in real production usage. This is due to factors like ambiguous user intent and messy contexts, suggesting evaluation still prioritizes clean-task optimization over behavioral robustness.
