Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
A systematic inspection of the FOLIO and MALLS validation splits revealed high rates of incorrect first-order logic (FOL) formalizations and ambiguous natural-language (NL) sentences, both of which distort the evaluation of AI models. The authors developed and released corrected ground truths for these datasets and demonstrated how the annotation errors affect the evaluation of state-of-the-art LLMs.