LLMs


RESEARCH · arXiv CS.CL · 4/24/2026

TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

This paper introduces TRACES, a lightweight framework that tags the reasoning steps of Large Reasoning Models (LRMs) in real time. The tags enable adaptive, cost-efficient early stopping of LRM inference, addressing the models' inefficiency and their tendency to over-generate verification steps.

LLMs early stopping Reasoning inference optimization

RESEARCH · arXiv CS.AI · 5/4/2026

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

This paper investigates minimal, local, causal explanations for the success of jailbreak attacks in large language models (LLMs). The research addresses the current lack of robust understanding regarding LLM susceptibility to these attacks, which enable harmful responses despite safety training.

LLMs jailbreak security AI safety

RESEARCH · arXiv CS.CL · 4/24/2026

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

This paper introduces AFRILANGDICT, a collection of African language-English dictionary entries, and AFRILANGEDU, a dataset. These resources are used to train AI models, called AFRILANGTUTOR, for language tutoring in low-resource African languages, addressing the scarcity of AI systems for local languages on the African continent.

LLMs language education Africa Low-resource languages

RESEARCH · arXiv CS.CL · 5/4/2026

Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues

New research addresses the gap in evaluating cultural reasoning in LLMs, introducing ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries. Experiments indicate that models perform worse on cultural reasoning, translation, and generation tasks in dialectal setups compared to Modern Standard Arabic.

LLMs Arabic dialects cultural reasoning benchmarking

RESEARCH · arXiv CS.AI · 18d ago

MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

MindLoom is a framework for synthesizing frontier-level reasoning data, addressing the challenge of limited diversity and unstable difficulty in existing methods. It achieves this by decomposing problem solutions into "thought mode chains" and training a retrieval model to guide the reasoning process.

data synthesis Thought Modes LLMs AI frameworks

RESEARCH · arXiv CS.LG · 18d ago

Harnesses for Inference-Time Alignment over Execution Trajectories

This research investigates harness engineering as an inference-time technique for large language model (LLM) agents, focusing on improving long-term performance via task decomposition and guided execution. It quantifies how design elements like workflow granularity and guidance impact performance, revealing common failure modes such as over-decomposition and hallucinated execution.

inference LLMs machine learning Task Decomposition

RESEARCH · arXiv CS.CL · 4/21/2026

Multimodal Claim Extraction for Fact-Checking

This work introduces the first benchmark for multimodal claim extraction from social media posts, essential for automated fact-checking. It evaluates state-of-the-art MLLMs and proposes MICE, an intent-aware framework, to address challenges in modeling rhetorical intent and contextual cues.

multimodal AI LLMs social media misinformation

RESEARCH · arXiv CS.CL · 4/21/2026

LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?

LiFT is a new instruction fine-tuning framework designed to improve in-context learning for large language models on longitudinal NLP tasks, which require reasoning over temporally ordered text. It uses a curriculum that progressively increases temporal difficulty and incorporates few-shot structure and temporal conditioning. LiFT consistently outperforms base models across various datasets and model sizes.

LLMs temporal reasoning Natural Language Processing in-context learning

RESEARCH · arXiv CS.CL · 26d ago

PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

This paper introduces PEML, a method for parameter-efficient multi-task learning with optimized continuous prompts for Large Language Models. It addresses the shortcomings of existing PEFT methods like LoRA and Prefix Tuning by enabling more efficient fine-tuning across multiple tasks and facilitating resource consolidation.

Resource efficiency multi-task learning LLMs Prompt tuning

RESEARCH · arXiv CS.CL · 26d ago

Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

This paper introduces Derivation Prompting, a novel prompting technique for the Retrieval-Augmented Generation (RAG) framework. The method aims to reduce hallucinations and erroneous reasoning in Large Language Models (LLMs) by systematically applying predefined rules to derive conclusions. A case study demonstrated a significant reduction in unacceptable answers compared to traditional RAG methods.

LLMs RAG Prompting Natural Language Processing

RESEARCH · arXiv CS.LG · 4/24/2026

Reinforcing privacy reasoning in LLMs via normative simulacra from fiction

This paper proposes a method to enhance privacy reasoning in LLMs by extracting normative simulacra from novels. The approach fine-tunes LLMs with supervised learning followed by GRPO reinforcement learning, using a composite reward function to align information-handling practices with user privacy expectations.

LLMs privacy security machine learning
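The GRPO stage mentioned in the summary above can be illustrated with the group-relative advantage that GRPO is commonly described as using: each sampled completion's reward is normalized against the other completions for the same prompt. This is a minimal sketch under stated assumptions, not the paper's implementation. The function name, the `eps` constant, the use of the population standard deviation, and the example reward values are all hypothetical, and the paper's composite reward is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. `eps` (an assumption) avoids division by zero when all rewards
    in the group are identical.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to form the baseline.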

RESEARCH · arXiv CS.CL · 5/7/2026

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

This research introduces Adaptive Power-Mean Policy Optimization (APMPO) to improve Large Language Model (LLM) reasoning capabilities within Reinforcement Learning with Verifiable Rewards (RLVR). APMPO combines a generalized power-mean objective and feedback-adaptive clipping to enhance learning dynamics and performance, addressing limitations of static optimization schemes.

Policy optimization LLMs reinforcement learning machine learning
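For readers unfamiliar with the "power-mean" in APMPO's name, the generic power (generalized) mean of positive values is sketched below. The summary does not say how APMPO applies it to the policy objective or how the exponent is adapted, so this shows only the standard mathematical definition; the function name and the example values are hypothetical.

```python
import math


def power_mean(values, p):
    """Generalized (power) mean of positive values with exponent p.

    p = 1 gives the arithmetic mean, p = -1 the harmonic mean, and the
    limit p -> 0 is the geometric mean, handled as a special case.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    n = len(values)
    if p == 0:
        # Limit case: geometric mean, computed in log space.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical values: a larger exponent weights large entries more heavily.
sample = [1.0, 2.0, 3.0, 4.0]
means = [power_mean(sample, p) for p in (-1, 0, 1, 2)]
```

The power mean is non-decreasing in `p`, which is why a single exponent can interpolate between objectives that emphasize the smallest entries and objectives that emphasize the largest.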

RESEARCH · arXiv CS.CL · 8d ago

ART: Attention Run-time Termination for Efficient Large Language Model Decoding

Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth consumed by the Key-Value (KV) cache. This paper proposes Attention Run-time Termination (ART), a lightweight mechanism that optimizes KV cache access and yields 20% higher generation throughput.

LLMs memory management decoding performance

RESEARCH · arXiv CS.CL · 5/11/2026

Can LLMs Take Retrieved Information with a Grain of Salt?

This paper evaluates the ability of large language models (LLMs) to adapt their responses to the certainty of retrieved information, revealing systematic limitations. It proposes an interaction strategy combining prior reminders, certainty recalibration, and context simplification to enhance LLM reliability. This approach reduces obedience errors by 25% without modifying model weights.

LLMs context certainty Natural Language Processing AI reliability

RESEARCH · arXiv CS.CL · 4/24/2026

DWTSumm: Discrete Wavelet Transform for Document Summarization

This research proposes a Discrete Wavelet Transform (DWT)-based framework to enhance document summarization, particularly for long, domain-specific texts where LLMs struggle. The method creates compact representations that improve semantic similarity, grounding, and factual consistency compared to a GPT-4o baseline.

LLMs wavelet transform NLP Document Summarization
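As background for the DWTSumm summary above, a single level of the Haar transform, the simplest discrete wavelet transform, is sketched below. The summary does not state which wavelet the paper uses or what signal it transforms, so the Haar choice, the function names, and the example input are assumptions made purely for illustration.

```python
import math


def haar_dwt(signal):
    """One level of the orthonormal Haar DWT on an even-length sequence.

    Returns (approx, detail): scaled pairwise sums capture the coarse
    trend, scaled pairwise differences capture local variation.
    """
    if len(signal) == 0 or len(signal) % 2 != 0:
        raise ValueError("signal must have a positive, even length")
    s = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    approx = [(signal[i] + signal[i + 1]) / s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) / s for i in pairs]
    return approx, detail


def haar_idwt(approx, detail):
    """Invert haar_dwt, reconstructing the original sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


# Hypothetical input: four per-sentence scores from a document.
approx, detail = haar_dwt([4.0, 4.0, 2.0, 2.0])
restored = haar_idwt(approx, detail)
```

The approximation coefficients are half the length of the input, which is the sense in which a wavelet transform gives a compact representation; dropping small detail coefficients trades fidelity for further compression.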

RESEARCH · arXiv CS.CL · 5/11/2026

Reflections and New Directions for Human-Centered Large Language Models

This work introduces a framework for Human-Centered Large Language Models (HCLLMs), integrating perspectives from NLP, HCI, and responsible AI. It argues for prioritizing human concerns, preferences, and values rigorously at every stage of LLM development, rather than as a mere post-training consideration.

LLMs HCI NLP AI ethics

RESEARCH · arXiv CS.LG · 26d ago

Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning

The paper addresses the challenge of training large language models (LLMs) on private, distributed data, especially in regulated sectors like healthcare and finance. It proposes a practical approach to leverage this valuable, yet unsharable, non-IID data, aiming for LLMs with deeper domain expertise.

LLMs private data privacy benchmarking

RESEARCH · arXiv CS.CL · 5/11/2026

MultiSoc-4D: A Benchmark for Diagnosing Instruction-Induced Label Collapse in Closed-Set LLM Annotation of Bengali Social Media

MultiSoc-4D is a new benchmark dataset of Bengali social media posts designed to diagnose LLM behavior in closed-set annotation. The research identifies "instruction-induced label collapse," a phenomenon in which LLMs systematically prefer fallback labels, leading to under-detection of minority categories.

LLMs Natural Language Processing Data Annotation benchmarks

RESEARCH · arXiv CS.CL · 5/7/2026

Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa

This paper evaluates open-weight and domain-adapted Large Language Models (LLMs) for conflict event classification in West Africa. The study reveals that open-weight models exhibit a "False Illegitimation" bias, while domain-adapted models achieve directional neutrality but retain an actor-based selection bias.

LLMs Model Evaluation Conflict Monitoring Humanitarian Accountability

RESEARCH · arXiv CS.CL · 5/7/2026

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

FREIA is a reinforcement learning algorithm designed to enhance unsupervised reasoning in LLMs, addressing the lack of adaptability in existing methods. It employs a Free Energy-Driven Reward (FER) to balance consensus and exploration, and Adaptive Advantage Shaping (AAS) to adjust learning signals. FREIA outperforms unsupervised baselines across various reasoning tasks, particularly mathematical reasoning.

LLMs reinforcement learning AI algorithms Reasoning