RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
RankJudge is a benchmark generator for evaluating LLM-as-a-judge in multi-turn conversations, a setting whose complexity existing Q&A-focused benchmarks fail to capture. It creates paired conversations with a single injected flaw per pair, which allows unambiguous labeling and precise isolation of the flaw under test, for model developers who rely on automatic evaluation.
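The pairing idea can be illustrated with a minimal Python sketch. It is not RankJudge's implementation: the names `Turn`, `Pair`, `inject_flaw`, and `differing_turns`, the flaw label `"arithmetic_error"`, and the example dialogue are all hypothetical, and the sketch assumes a flaw is injected by rewriting exactly one assistant turn of an otherwise identical conversation.

```python
import copy
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


@dataclass
class Pair:
    clean: list      # original conversation (list of Turn)
    flawed: list     # copy with exactly one rewritten assistant turn
    flaw_turn: int   # index of the rewritten turn
    flaw_type: str   # hypothetical flaw category label


def inject_flaw(conversation, turn_index, flawed_text, flaw_type):
    """Return a Pair whose flawed copy differs from the original in one turn."""
    if conversation[turn_index].role != "assistant":
        raise ValueError("flaws are injected into assistant turns only")
    flawed = copy.deepcopy(conversation)  # leave the clean conversation untouched
    flawed[turn_index].text = flawed_text
    return Pair(clean=conversation, flawed=flawed,
                flaw_turn=turn_index, flaw_type=flaw_type)


def differing_turns(pair):
    """List the turn indices where the two conversations disagree."""
    return [i for i, (a, b) in enumerate(zip(pair.clean, pair.flawed))
            if a.text != b.text]


# Hypothetical example dialogue, used only to exercise the sketch.
conversation = [
    Turn("user", "What is 12 * 12?"),
    Turn("assistant", "144."),
    Turn("user", "And that divided by 6?"),
    Turn("assistant", "24."),
]
pair = inject_flaw(conversation, 3, "26.", "arithmetic_error")
```

Because the two conversations differ in exactly one turn, the preferred side of each pair is known by construction, and a judge's mistake on that pair can be attributed to the one injected flaw type.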