
ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

arXiv cs.AI · April 23, 2026

ThermoQA is a three-tier benchmark of 293 open-ended engineering thermodynamics problems for evaluating thermodynamic reasoning in LLMs. Leading models such as Claude Opus 4.6 and GPT-5.4 achieve high scores, but their performance degrades across tiers, confirming that property memorization does not imply thermodynamic reasoning. The dataset and code are open-source.

Dataset · Benchmarking · large language models · AI evaluation
