
Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

arXiv cs.CL · May 12, 2026

Magis-Bench is a new benchmark for evaluating Large Language Models (LLMs) on magistrate-level legal tasks, built from 74 questions drawn from recent Brazilian judicial competitive examinations. The study evaluates 23 state-of-the-art LLMs with an LLM-as-a-judge methodology and reports strong inter-judge agreement.

LLMs, Legal AI, Judicial tasks, Benchmarks, AI evaluation
