Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM Annotations

arXiv cs.CL · April 14, 2026

This paper studies how large-scale unlabelled web data and LLM-based synthetic annotations can improve cross-lingual hate speech detection. It reports that continued pre-training of BERT models on web data and fine-tuning with synthetic labels generated by an ensemble of LLMs significantly boosts performance, especially in low-resource settings.
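
The summary does not say how the ensemble's annotations are combined into a single synthetic label. As a minimal sketch of one common option, the snippet below assumes simple majority voting over per-model labels and drops ties as abstentions. The function names, the label strings, and the example votes are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_label(votes, tie_label="abstain"):
    """Return the most frequent label among the votes, or tie_label on a tie.

    Majority voting is an assumption made for illustration; the paper's
    actual aggregation rule is not described in this summary.
    """
    if not votes:
        return tie_label
    counts = Counter(votes).most_common()
    # A tie exists when the two most frequent labels have equal counts.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


def aggregate(annotations):
    """Map a list of per-text vote lists to one synthetic label per text."""
    return [majority_label(votes) for votes in annotations]


# Hypothetical votes from an ensemble of LLM annotators on three web texts.
example_votes = [
    ["hateful", "hateful", "not_hateful"],
    ["not_hateful", "not_hateful", "not_hateful"],
    ["hateful", "not_hateful"],
]
labels = aggregate(example_votes)
print(labels)
```

Texts that end in a tie could be filtered out before fine-tuning, so that only labels the ensemble agrees on are used as training targets.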
