
Natural Language Processing

168 items

RESEARCH · arXiv CS.CL · 4/17/2026

Decoupling Scores and Text: The Politeness Principle in Peer Review

This study investigates the difficulty of interpreting peer review feedback by comparing how well numerical scores and review text predict acceptance. Score-based models are significantly more accurate (91%) than text-based models (81%, even with LLMs), indicating that review text is a considerably less reliable signal of the decision.

RESEARCH · arXiv CS.CL · 5/8/2026

Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets

This paper proposes an evidence-based model to generate queries from query-free summarization datasets, addressing the challenge of finding suitable datasets for Query-Focused Summarization (QFS). Experimental results indicate that summaries generated using these evidence-based queries achieve competitive ROUGE scores, supporting their effectiveness for the QFS task.

RESEARCH · arXiv CS.CL · 5/8/2026

AdaGATE: Adaptive Gap-Aware Token-Efficient Evidence Assembly for Multi-Hop Retrieval-Augmented Generation

AdaGATE is a training-free evidence controller for multi-hop Retrieval-Augmented Generation (RAG) designed to address noisy or redundant retrieved evidence in limited contexts. It frames evidence selection as a token-constrained repair problem, combining entity-centric gap tracking and targeted micro-query generation to balance coverage, corroboration, and novelty.

RESEARCH · arXiv CS.CL · 4/20/2026

Applied Explainability for Large Language Models: A Comparative Study

This paper presents a comparative study of three explainability techniques (Integrated Gradients, Attention Rollout, and SHAP) on a fine-tuned DistilBERT model for sentiment classification. The study concludes that gradient-based attribution provides more stable and intuitive explanations, while attention-based methods are computationally efficient but less aligned with prediction-relevant features.

RESEARCH · arXiv CS.CL · 4/24/2026

Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

This paper introduces Hierarchical Policy Optimization (HPO) for Simultaneous Speech Translation (SST) using LLMs, addressing challenges like high computational cost and imperfect supervised fine-tuning data. HPO employs a hierarchical reward to balance translation quality and latency, demonstrating substantial improvements in COMET and MetricX scores.

RESEARCH · arXiv CS.CL · 4/21/2026

Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

This research evaluates cross-family speculative decoding for Polish LLMs on Apple Silicon, extending the MLX-LM framework with Universal Assisted Generation (UAG) for cross-tokenizer compatibility. Experiments show that context-aware token translation significantly improves acceptance rates for Bielik 11B on Polish language datasets.

RESEARCH · arXiv CS.LG · 4/24/2026

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

Transformers incur high computational and memory costs on long sequences, while alternatives lose long-tail dependencies. Absorber LLM proposes self-supervised causal synchronization, which absorbs historical context into the model parameters so that a contextless model matches the original full-context model on future generations.

RESEARCH · arXiv CS.CL · 4/21/2026

LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?

LiFT is a new instruction fine-tuning framework designed to improve in-context learning for large language models on longitudinal NLP tasks, which require reasoning over temporally ordered text. It uses a curriculum that progressively increases temporal difficulty and incorporates few-shot structure and temporal conditioning. LiFT consistently outperforms base models across various datasets and parameter sizes.

RESEARCH · arXiv CS.CL · 26d ago

Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

This paper introduces Derivation Prompting, a novel prompting technique for the Retrieval-Augmented Generation (RAG) framework. The method aims to reduce hallucinations and erroneous reasoning in Large Language Models (LLMs) by systematically applying predefined rules to derive conclusions. A case study demonstrated a significant reduction in unacceptable answers compared to traditional RAG methods.

RESEARCH · arXiv CS.CL · 5/7/2026

FMI_SU_Yotkova_Kastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals

This paper details participation in SemEval-2026 Task 13, focusing on lightweight detection of LLM-generated code using stylometric signals. The approach employs ratio-based features, parsing engines, and language classifiers, proving computationally efficient with near-instant inference time.

RESEARCH · arXiv CS.CL · 5/11/2026

Can LLMs Take Retrieved Information with a Grain of Salt?

This paper evaluates the ability of large language models (LLMs) to adapt their responses to the certainty of retrieved information, revealing systematic limitations. It proposes an interaction strategy combining prior reminders, certainty recalibration, and context simplification to enhance LLM reliability. This approach reduces obedience errors by 25% without modifying model weights.

RESEARCH · arXiv CS.CL · 22d ago

Exploring Lightweight Large Language Models for Court View Generation

This research explores the capabilities of lightweight Large Language Models (LLMs) in Criminal Court View Generation (CVG) and their impact on charge prediction within Legal AI. It systematically investigates the effects of architecture and model size, compares the models with deep neural networks, and introduces the CVGEvalKit framework for evaluation.

RESEARCH · arXiv CS.CL · 5/11/2026

MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

MultiSoc-4D is a new benchmark of Bengali social media data designed to diagnose LLM behavior in closed-set annotation. The research identifies "instruction-induced label collapse," a phenomenon in which LLMs systematically prefer fallback labels, leading to under-detection of minority categories.

RESEARCH · arXiv CS.CL · 22d ago

A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research

This research introduces a scalable computational approach to measure manner and result verbs, a crucial distinction for developmental language studies. It leverages large language models for sentence annotations and trains a RoBERTa-based classifier, demonstrating promising performance on evaluation datasets.
