Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study
arXiv CS.CL · May 15, 2026
This replication study evaluates how effectively DExperts, an inference-time mitigation technique, reduces toxicity in large language models. It establishes baseline toxicity measurements, applies DExperts to mitigate explicit toxicity, and stress-tests the method against implicit hate speech.