Low-resource languages

9 items

RESEARCH · arXiv CS.CL · 20h ago

Retrieval Augmented Generation Framework for the Nepali Legal Domain Question Answering

This study presents the first application of a Retrieval Augmented Generation (RAG) model for Nepali legal question answering, addressing data scarcity in low-resource languages. Using BM25 on chunked documents, the RAG pipeline achieved high precision and truthfulness, demonstrating its effectiveness in the Nepali legal domain.

Retrieval Augmented Generation · Legal AI · Question Answering · natural language processing

RESEARCH · arXiv CS.CL · 5/7/2026

Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages

Nsanku is a systematic benchmark evaluating the zero-shot machine translation performance of 19 LLMs across 43 Ghanaian languages. It utilizes Bible sentences and metrics like BLEU and chrF, with gemini-2.5-flash achieving the highest overall average score.

LLMs · Benchmarking · machine translation · Low-resource languages

RESEARCH · arXiv CS.CL · 4/22/2026

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

This paper introduces a novel in-context learning approach for low-resource Coptic to English machine translation, augmenting inputs with syntactic information from Universal Dependencies parses. Combining this syntactic data with dictionary-based glosses achieves significant gains and sets a new state-of-the-art.

universal-dependencies · natural language processing · machine translation · in-context learning

RESEARCH · arXiv CS.CL · 4/9/2026

Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models

This paper proposes a theoretical framework for studying cross-lingual transfer and parameter-efficient adaptation in multilingual LLMs for the Turkic language family. It aims to address the under-representation of low-resource languages in these models, such as Azerbaijani, Kazakh, Uzbek, Turkmen, and Gagauz.

LLMs · Turkic languages · cross-lingual transfer · Parameter-efficient adaptation

RESEARCH · arXiv CS.CL · 4/24/2026

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

This paper introduces AFRILANGDICT, a collection of African language-English dictionary entries, and AFRILANGEDU, a dataset. These resources are used to train AI models, called AFRILANGTUTOR, for language tutoring in low-resource African languages, addressing the scarcity of AI systems for local languages on the African continent.

LLMs · language education · Africa · Low-resource languages

RESEARCH · arXiv CS.CL · 29d ago

MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

MultiSoc-4D is a new Bengali social media dataset benchmark designed to diagnose LLM behavior in closed-set annotation. The research identifies "instruction-induced label collapse," a phenomenon where LLMs systematically prefer fallback labels, leading to under-detection of minority categories.

LLMs · natural language processing · Data Annotation · Benchmarks

RESEARCH · arXiv CS.CL · 20d ago

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

Low-resource natural language processing has experienced explosive growth, but its evaluation faces a critical challenge: the scarcity of sociolinguistic expertise needed to assess complex generative systems. This creates an "Annotation Scarcity Paradox," where the technical capacity to scale models vastly outpaces the human infrastructure required for authentic evaluation.

machine learning · NLP · Low-resource languages · AI evaluation

RESEARCH · arXiv CS.CL · 12d ago

Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models

This research addresses the Stability-Expressivity Gap in Spoken Language Models (SLMs) for low-resource languages, caused by the extensive use of synthetic data. While synthetic data improves phonetic accuracy, it degrades prosodic expressivity, a phenomenon termed Synthetic Erosion. The paper introduces self-alignment frameworks to recover expressivity.

synthetic data · speech synthesis · spoken language models · Low-resource languages

RESEARCH · arXiv CS.CL · 4/6/2026

An Empirical Study of Many-Shot In-Context Learning for Machine Translation of Low-Resource Languages

This empirical study investigates many-shot in-context learning (ICL) for machine translation from English into ten low-resource languages. The findings show that ICL becomes more effective as the number of examples increases, and that BM25-based retrieval substantially improves data efficiency.

LLMs · Many-Shot Learning · NLP · machine translation