LLMs

RESEARCH · arXiv CS.AI · 29d ago

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

This research proposes distinguishing between capability elicitation and capability creation in large language model post-training. It argues that elicitation reweights existing behaviors within a model's accessible support, while creation changes that support itself, developing this through a free-energy view.

RESEARCH · arXiv CS.AI · 5/11/2026

CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

This paper introduces Deployment-Time Learning (DTL) as a new stage for LLMs, allowing them to continually adapt from experience post-training without modifying core parameters. It presents CASCADE, a framework that uses an explicit, evolving episodic memory for LLM agents, formalizing experience reuse as a contextual bandit problem with no-regret guarantees.

RESEARCH · arXiv CS.AI · 5/11/2026

GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning

This paper introduces GraphDC, a Divide-and-Conquer multi-agent system designed to enhance graph algorithm reasoning in Large Language Models (LLMs). It improves performance by decomposing large graphs into smaller subgraphs for specialized agents, with a master agent integrating the results, leading to better scalability and robustness.

RESEARCH · arXiv CS.AI · 23d ago

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

This paper introduces a new paradigm for interactively evaluating Theory of Mind (ToM) improvements in Large Language Models (LLMs) for human-AI interactions. Empirical findings from real-world datasets and a user study reveal that ToM enhancements on static benchmarks do not always translate to benefits in dynamic human-AI interactions.

RESEARCH · arXiv CS.CL · 26d ago

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

This research evaluates large language models (LLMs) on biomedical question answering, focusing on their reliability when the retrieved evidence is conflicting or incomplete. It finds that accuracy drops significantly and predictions flip when the order of correct and contradictory documents is reversed, pointing to order effects and the need for conflict-aware abstention.

RESEARCH · arXiv CS.AI · 23d ago

CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation

This paper introduces CAX-Agent, a lightweight agent harness designed to enhance the reliability of large language models (LLMs) in MAPDL finite-element simulations. It addresses issues like inconsistent outputs and task failures through structured execution control, tool encapsulation, and robust fault recovery mechanisms, evaluating various recovery strategies.

RESEARCH · arXiv CS.CL · 21d ago

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

This paper argues that current Uncertainty Quantification (UQ) methods for LLMs are essentially unsupervised clustering algorithms, measuring internal consistency rather than external correctness. Consequently, these methods fail to detect "confident hallucinations" and may create a deceptive sense of safety when deploying LLMs in high-stakes domains.

RESEARCH · arXiv CS.CL · 27d ago

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

Multilingual large language models (MLLMs) often exhibit inconsistent behavior regarding cultural identity when the prompt's language changes. Researchers introduce a new metric, Singleton Fleiss's "k_S", and a consensus-driven alignment framework, C-3PO, to mitigate these cross-lingual cultural inconsistencies, achieving significant improvements.
