
Natural Language Processing

168 items

RESEARCH · arXiv CS.CL · 4/20/2026

DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation

DALM (Domain-Algebraic Language Model) is proposed to address knowledge interference in LLMs by replacing unconstrained generation with structured denoising over a domain lattice. It uses a three-phase generation path (domain, relation, concept uncertainty) under algebraic constraints, requiring a domain lattice, relation typing, and fiber partition to prevent cross-domain contamination.

27
RESEARCH · arXiv CS.CL · 4/17/2026

Hierarchical Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text

This paper introduces H-TechniqueRAG, a novel hierarchical Retrieval-Augmented Generation (RAG) framework designed to improve the annotation of adversarial techniques in Cyber Threat Intelligence (CTI) text. It addresses the limitation of flat RAG approaches by incorporating the inherent tactic-technique taxonomy of the MITRE ATT&CK framework through a two-stage retrieval mechanism.

27
RESEARCH · arXiv CS.CL · 4/22/2026

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

This paper introduces a novel in-context learning approach for low-resource Coptic-to-English machine translation, augmenting inputs with syntactic information from Universal Dependencies parses. Combining this syntactic data with dictionary-based glosses achieves significant gains and sets a new state of the art.

27
RESEARCH · arXiv CS.CL · 4/13/2026

Uncertainty Estimation for the Open-Set Text Classification systems

This paper focuses on accurate uncertainty estimation for open-set text classification (OSTC) systems, where text samples can be classified into existing classes or rejected as unknown. It adapts the Holistic Uncertainty Estimation (HolUE) method for the text domain to capture text and gallery uncertainties, and proposes a new OSTC benchmark.

27
RESEARCH · arXiv CS.CL · 21d ago

SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

SKG-Eval addresses the challenge of evaluating multi-turn dialogue systems by modeling dialogue as an evolving Semantic Knowledge Graph (SKG). This framework incrementally updates the graph through structured triple extraction to detect long-range issues like contradiction and inconsistency, offering improved evaluation beyond turn-isolated representations.

27
RESEARCH · arXiv CS.CL · 7d ago

Cognitive-Linguistic Indicators of Depression in Online Communities: Analysed by DistilBERT and Holographic Reduced Representation

This paper investigates whether combining cognitively grounded linguistic features with transformer-based embeddings improves automated detection of depression in online text. The study compares a TF-IDF baseline with a hybrid DistilBERT-HRR model and reports that the hybrid achieves a significantly higher macro F1 score of 0.94.

27
RESEARCH · DEV.to AI · 26d ago

Generative Simulation Benchmarking for heritage language revitalization programs for extreme data sparsity scenarios

This post discusses the challenge of building language models for critically endangered heritage languages under extreme data sparsity. The author recounts their experience working with a very small dataset for a language like Halkomelem and highlights the need for novel approaches in such cases.

27
CASE · AWS Machine Learning Blog · 12d ago

Training Azerbaijani language models on Amazon SageMaker AI

Azercell Telecom partnered with the AWS Generative AI Innovation Center to develop an Azerbaijani large language model (LLM) on Amazon SageMaker AI. This six-week collaboration established a production-ready framework for telecom use cases and a customer-facing chatbot, overcoming data scarcity and linguistic complexity challenges.

27