
Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

arXiv cs.CL · May 15, 2026

This replication study evaluates how effectively DExperts, an inference-time mitigation technique, reduces toxicity in large language models. It establishes baseline toxicity measurements, applies DExperts to mitigate explicit toxicity, and stress-tests the method against implicit hate speech.

DExperts, security, toxicity, large language models, replication study
