LLM

612 items

RESEARCH arXiv CS.LG·5/1/2026

When Continual Learning Moves to Memory: A Study of Experience Reuse in LLM Agents

This study investigates the role of external memory in LLM agents for continual learning, showing that the stability-plasticity dilemma resurfaces at the memory level due to limited context windows. A (k,v) framework is introduced to disentangle how experience is represented and organized, finding that abstract procedural memories transfer more reliably than detailed trajectories and finer-grained memory organization is beneficial.

research memory AI agents Continual Learning

RESEARCH arXiv CS.CL·5/1/2026

Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

CarryOnBench is introduced as the first interactive benchmark to measure how LLMs recover utility and revise user intent interpretation in multi-turn, safe conversations. It reveals that current models fulfill only 10.5-37.6% of benign user information needs at the initial turn, highlighting a gap in safety-aligned LLMs regarding helpfulness recovery.

Multi-turn conversations benchmarking AI safety user interaction

RESEARCH arXiv CS.CL·4/17/2026

SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models

SeaAlert is an LLM-based framework designed for the robust analysis of maritime distress communications, which are challenging due to noise, deviations from format, and ASR errors. To overcome the lack of real-world labeled data, the framework utilizes an LLM-powered synthetic data generation pipeline.

synthetic data Information Extraction NLP Speech Recognition

RESEARCH arXiv CS.AI·5/9/2026

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

PRISM is a novel framework that integrates perception (VLM) and decision (LLM) through a dynamic question-answer pipeline, enabling the LLM to actively refine the VLM's output for task-driven scene understanding. This approach significantly outperforms existing image-based models on benchmarks like ALFWorld and Room-to-Room.

VLM embodied AI AI robotics

RESEARCH arXiv CS.CL·4/27/2026

Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake

This research paper addresses optimal question selection for information gathering in psychiatric intake using conversational AI. It introduces a benchmark with 655 questions and synthetic vignettes, evaluating LLM-guided adaptive policies.

Healthcare Natural Language Processing Conversational AI LLM

RESEARCH arXiv CS.CL·4/9/2026

Temporally Phenotyping GLP-1RA Case Reports with Large Language Models: A Textual Time Series Corpus and Risk Modeling

This study developed a textual time series corpus from type 2 diabetes case reports to extract complex clinical timelines with LLMs. GPT5 demonstrated high effectiveness in event retrieval and temporal sequencing, with applications suggesting a reduced risk of respiratory sequelae among GLP-1 users.

Diabetes Healthcare Natural Language Processing Time Series

RESEARCH arXiv CS.CL·4/27/2026

Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching

This paper introduces a lightweight framework for scalable patient-trial matching, addressing challenges posed by long, complex electronic health records. It combines retrieval-augmented generation (RAG) to identify relevant EHR segments with large language models (LLMs) to encode these segments into informative representations, improving efficiency and generalization.

RAG medical-informatics healthcare AI LLM

RESEARCH arXiv CS.CL·20d ago

Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

This study proposes a structured framework to improve LLM reasoning when analyzing long documents, addressing issues like contextual bias and omission error. It combines parallel chunk-level processing with evidence-anchored consolidation to generate more robust and bias-resilient conceptual abstractions.

Contextual Reasoning Natural Language Processing AI research Bias

RESEARCH arXiv CS.CL·20d ago

Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models

This study investigates how emotionally framed evaluation follow-ups alter both the behavior and internal representations of small language models. Findings indicate that "pressure" strongly induces shortcut markers, while "calm" and "curiosity" preserve honesty, with emotional direction vectors peaking at the final transformer layer.

NLP model behavior emotional framing AI research

RESEARCH arXiv CS.AI·4/17/2026

Credo: Declarative Control of LLM Pipelines via Beliefs and Policies

Credo introduces a novel framework for declaratively controlling LLM pipelines, representing semantic state as beliefs and regulating behavior through policies. This design aims to enhance the adaptability, auditability, and composability of agentic AI systems by moving away from brittle imperative control loops.

AI architecture Control Systems AI agents Declarative Programming

ARTICLE DEV.to AI·4/12/2026

AI on the Couch: Why Anthropic Sent Claude to a Psychiatrist

Anthropic put its LLM, Mythos, through 20 hours of clinical psychiatry sessions. The goal is to mitigate "behavioral deviation" and regulate the model's responses under pressure by applying human clinical techniques.

Mythos Anthropic psychiatry AI

RESEARCH arXiv CS.AI·5/4/2026

ARMOR 2025: A Military-Aligned Benchmark for Evaluating Large Language Model Safety Beyond Civilian Contexts

ARMOR 2025 is a new military-aligned benchmark designed to evaluate the safety of large language models (LLMs) in defense applications, beyond civilian contexts. It addresses the gap in existing benchmarks by grounding evaluations in military doctrines like the Law of War, Rules of Engagement, and Joint Ethics Regulation.

ethics military AI benchmarks AI safety

RESEARCH arXiv CS.AI·5/4/2026

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

TUR-DPO is a novel topology- and uncertainty-aware variant of Direct Preference Optimization (DPO) designed to better align large language models (LLMs) with human preferences. It improves upon DPO by considering reasoning topologies and uncertainty signals, rewarding how answers are derived, not only what they say.

reinforcement learning DPO AI alignment machine learning
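
For context, the baseline that TUR-DPO extends can be sketched in a few lines. The block below shows only the standard DPO loss for a single preference pair; the topology and uncertainty terms are not specified in the summary above and are deliberately left out. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions, not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs.

    Hypothetical helper for illustration; not the TUR-DPO objective.
    """
    # How much more the policy favors each answer than the frozen reference does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    return math.log1p(math.exp(-z))


# When the policy matches the reference, the margin is zero and the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Raising the chosen answer's log-prob relative to the reference lowers the loss.
improved = dpo_loss(-0.5, -1.0, -1.0, -1.0)
```

The loss depends only on the final log-probabilities of the two answers, which is the "what they say" signal; per the summary, TUR-DPO additionally scores how the answers were derived.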

RESEARCH arXiv CS.AI·5/4/2026

TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data

TADI is an agentic AI system that transforms drilling operational data into evidence-based analytical intelligence. It integrates heterogeneous wellsite data using a dual-store architecture and 12 domain-specialized tools orchestrated by an LLM for multi-step evidence gathering.

Drilling AI Data integration Oil & Gas

RESEARCH arXiv CS.AI·25d ago

Enhanced and Efficient Reasoning in Large Learning Models

This paper proposes an efficient and principled method to enhance reasoning in Large Language Models, addressing the current lack of trustworthiness in produced content. It involves a preprocessing stage using a Unary Relational Integracode followed by a streamlined machine learning process.

model efficiency machine learning Reasoning data preprocessing

RESEARCH arXiv CS.AI·5/7/2026

Parallel Prefix Verification for Speculative Generation

PARSE (PArallel pRefix Speculative Engine) is a new speculative generation framework that accelerates large language model (LLM) inference. It achieves this by parallelizing prefix verification on a semantic level, overcoming existing limitations by evaluating correctness across multiple prefixes in a single forward pass.

inference AI acceleration parallelization Speculative Decoding

RESEARCH arXiv CS.LG·4/24/2026

FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels

FairyFuse is a new inference system designed for CPU-only platforms, enabling multiplication-free execution of large language models. It uses ternary weights ({-1, 0, +1}) to replace floating-point multiplications with conditional additions and subtractions, significantly reducing memory bandwidth bottlenecks and offering up to 16x weight compression.

inference CPU optimization quantization performance
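
The multiplication-free idea in the FairyFuse summary can be illustrated with a toy matrix-vector product. This is a minimal pure-Python sketch of the general ternary technique, not FairyFuse's fused kernels; the function name `ternary_matvec` and the sample data are hypothetical. The compression arithmetic at the end assumes 2-bit packed weights compared against 32-bit floats, which is one way to arrive at 16x but is an assumption here.

```python
def ternary_matvec(weights, x):
    """Compute y = W @ x for W with entries in {-1, 0, +1} without multiplying."""
    y = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: the activation is skipped entirely
        y.append(acc)
    return y


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
result = ternary_matvec(W, x)  # [2.0, 0.0]

# Assumed storage comparison: 32-bit float weights vs. 2 bits per ternary weight.
compression_ratio = 32 // 2    # 16
```

Because each weight selects an add, a subtract, or a skip, the inner loop needs no floating-point multiplier, and the weights themselves occupy far less memory than full-precision values.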

RESEARCH arXiv CS.CL·4/21/2026

Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

This research evaluates cross-family speculative decoding for Polish LLMs on Apple Silicon, extending the MLX-LM framework with Universal Assisted Generation (UAG) for cross-tokenizer compatibility. Experiments show that context-aware token translation significantly improves acceptance rates for Bielik 11B on Polish language datasets.

apple-silicon Natural Language Processing inference optimization Speculative Decoding

RESEARCH arXiv CS.AI·18d ago

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

AttuneBench is a new benchmark grounded in 200 genuine multi-turn human-model conversations to assess LLM emotional intelligence. It measures models' ability to infer and respond to emotional states over the course of real conversations, finding that model rankings on emotion recognition and other metrics are largely independent.

Emotional Intelligence benchmarks human-AI interaction AI evaluation

RESEARCH arXiv CS.LG·4/24/2026

Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics

This paper proposes an LLM-guided temporal simulation framework for clinically interpretable early sepsis warning. The model simulates physiological trajectories prior to disease onset by integrating spatiotemporal feature extraction, medical reasoning cues, and agent-based post-processing for physiologically plausible predictions.

Healthcare early warning systems simulation medical AI