Multi-turn conversations

3 items

RESEARCH · arXiv CS.CL · 5/1/2026

Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

CarryOnBench is introduced as the first interactive benchmark to measure how LLMs recover utility and revise their interpretation of user intent in safe multi-turn conversations. It finds that current models fulfill only 10.5% to 37.6% of benign users' information needs at the initial turn, which points to a gap in how safety-aligned LLMs recover helpfulness.

Multi-turn conversations · Benchmarking · AI safety · user interaction

RESEARCH · arXiv CS.CL · 5/4/2026

Persona-Grounded Safety Evaluation of AI Companions in Multi-Turn Conversations

This research introduces a scalable framework for evaluating the safety of multi-turn interactions with AI companion applications, motivated by concerns about the risks of emotional engagement. The framework combines persona construction, scenario generation, simulation, and harm evaluation, and the authors apply it to Replika using high-risk user personas, such as users with depression or anxiety.

Multi-turn conversations · Persona Modeling · Harm Evaluation · AI companions

RESEARCH · arXiv CS.CL · 18d ago

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

RankJudge is introduced as a benchmark generator for evaluating LLM-as-a-judge systems on multi-turn conversations, whose complexity existing Q&A-focused benchmarks fail to capture. It creates paired conversations with a single injected flaw, which allows unambiguous labeling and precise isolation of the flaw for model developers who rely on automated evaluation.

Multi-turn conversations · LLM-as-a-judge · Benchmarking · Generative AI