RESEARCH
ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
arXiv cs.AI · April 23, 2026
ThermoQA is a new three-tier benchmark of 293 open-ended engineering thermodynamics problems for evaluating thermodynamic reasoning in LLMs. Leading models such as Claude Opus 4.6 and GPT-5.4 achieve high scores, but their performance degrades across tiers, confirming that property memorization does not imply thermodynamic reasoning. The dataset and code are open-source.