RESEARCH28

NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus

arXiv CS.CL·May 4, 2026

NorBERTo is a new ModernBERT model trained on a 331 billion token Brazilian Portuguese corpus (Aurora-PT), designed for long-context support and efficient attention mechanisms. It achieves state-of-the-art results among evaluated encoder models on semantic similarity, textual entailment, and classification tasks using datasets like ASSIN 2 and PLUE.

AI models BERT Portuguese NLP large language models

Read original ↗