LLMs

722 items

RESEARCH · arXiv CS.LG · 5/7/2026

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

This research introduces EdgeRazor, a lightweight framework designed to deploy Large Language Models on resource-constrained devices. It leverages mixed-precision quantization-aware distillation to convert full-precision models into lower-bit formats, overcoming limitations of previous quantization methods.

LLMs deep learning quantization model optimization

RESEARCH · arXiv CS.AI · 29d ago

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

MemQ integrates TD($\lambda$) eligibility traces with memory Q-values, propagating credit backward through a provenance DAG to account for memory dependencies. This approach significantly improves LLM agents' ability to accumulate and retrieve experience, achieving high success rates across various benchmarks.

memory systems LLMs machine learning Q-learning

RESEARCH · arXiv CS.AI · 18d ago

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

The paper introduces SMDD-Bench, a challenging new multi-turn benchmark of 502 guaranteed-solvable tasks for evaluating LLM agents on real-world small molecule drug design. It aims to standardize evaluation across diverse chemistries and targets, and its tasks require strong chemical, biological, and 3D intuition.

LLMs Scientific Discovery benchmarks drug design

RESEARCH · arXiv CS.AI · 29d ago

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

This research proposes distinguishing between capability elicitation and capability creation in large language model post-training. It argues that elicitation reweights existing behaviors within a model's accessible support, while creation changes that support itself, developing this through a free-energy view.

LLMs AI capabilities Machine Learning Theory learning

RESEARCH · arXiv CS.LG · 8d ago

A Shared Valence Axis Across Modern LLMs and Human EEG: The Saturation Regularity

This research explores whether LLMs can serve as a lens for understanding neural representations of emotional valence in the human brain, focusing on EEG. It builds a valence axis from LLMs and demonstrates its mapping onto human neural activity, suggesting a shared representation.

LLMs emotion Neuroscience Cognition

RESEARCH · arXiv CS.AI · 5/11/2026

CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

This paper introduces Deployment-Time Learning (DTL) as a new stage for LLMs, allowing them to continually adapt from experience post-training without modifying core parameters. It presents CASCADE, a framework that uses an explicit, evolving episodic memory for LLM agents, formalizing experience reuse as a contextual bandit problem with no-regret guarantees.

LLMs adaptation machine learning AI deployment

RESEARCH · arXiv CS.AI · 18d ago

Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs

This research introduces MOOD, a benchmark designed to study the detection of out-of-distribution (OOD) alignment failures in large language models (LLMs) using monitoring pipelines. It proposes combining guard models with OOD detectors to improve the generalization of safety classifiers, which often fail in OOD scenarios.

Model Monitoring OOD Detection LLMs benchmarking

RESEARCH · arXiv CS.AI · 5/11/2026

GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning

This paper introduces GraphDC, a Divide-and-Conquer multi-agent system designed to enhance graph algorithm reasoning in Large Language Models (LLMs). It improves performance by decomposing large graphs into smaller subgraphs for specialized agents, with a master agent integrating the results, leading to better scalability and robustness.

LLMs scalable AI AI Reasoning multi-agent systems

RESEARCH · arXiv CS.LG · 18d ago

Predicting Performance of Symbolic and Prompt Programs with Examples

This research paper introduces a coin-flip model to predict the performance of symbolic and prompt-based LLM programs using a few in-domain examples and a performance prior. It finds that symbolic programs exhibit an "all or nothing" performance prior, while prompt programs have a diffuse prior.

LLMs prompt-engineering Symbolic AI machine learning

RESEARCH · arXiv CS.AI · 29d ago

Belief or Circuitry? Causal Evidence for In-Context Graph Learning

This paper investigates how LLMs learn in-context, using a graph random-walk task to explore whether they pattern-match or infer latent structure. It reveals that neither account alone is sufficient, presenting evidence of simultaneous encoding of graph topologies and causal interventions.

LLMs learning interpretability graph learning

RESEARCH · arXiv CS.AI · 21d ago

AgentNLQ: A General-Purpose Agent for Natural Language to SQL

This study introduces AgentNLQ, a new multi-agent method for Natural Language to SQL (NL2SQL) conversion, achieving 78.1% semantic accuracy on the BIRD benchmark. It leverages LLMs in an optimized orchestrator for planning, reflection, and self-correction to generate accurate SQL queries from enriched schemas and business rules.

LLMs benchmarking NL2SQL database

RESEARCH · arXiv CS.AI · 23d ago

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

This paper introduces a new paradigm for interactively evaluating Theory of Mind (ToM) improvements in Large Language Models (LLMs) for human-AI interactions. Empirical findings from real-world datasets and a user study reveal that ToM enhancements on static benchmarks do not always translate to benefits in dynamic human-AI interactions.

LLMs evaluation human-AI interaction empirical study

RESEARCH · arXiv CS.CL · 26d ago

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

This research evaluates large language models (LLMs) on biomedical question answering, focusing on their reliability when the retrieved evidence is conflicting or incomplete. It finds that accuracy drops significantly, and predictions flip, when the order of correct and contradictory documents is reversed, which points to order effects and the need for conflict-aware abstention.

LLMs evaluation Reliability Biomedical AI

RESEARCH · arXiv CS.CL · 5/11/2026

Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas

This research paper presents an atlas of domain-level metacognitive monitoring across 33 frontier LLMs, analyzing 1,500 MMLU items across six domains. It reveals significant within-model variation, with Applied/Professional knowledge being the easiest and Formal Reasoning/Natural Science the hardest domains to monitor.

LLMs Metacognition cognitive AI benchmarks

RESEARCH · arXiv CS.AI · 23d ago

CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation

This paper introduces CAX-Agent, a lightweight agent harness designed to enhance the reliability of large language models (LLMs) in MAPDL finite-element simulations. It addresses issues like inconsistent outputs and task failures through structured execution control, tool encapsulation, and robust fault recovery mechanisms, evaluating various recovery strategies.

LLMs simulation automation fault tolerance

RESEARCH · arXiv CS.CL · 21d ago

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

This paper argues that current Uncertainty Quantification (UQ) methods for LLMs are essentially unsupervised clustering algorithms, measuring internal consistency rather than external correctness. Consequently, these methods fail to detect "confident hallucinations" and may create a deceptive sense of safety when deploying LLMs in high-stakes domains.

LLMs uncertainty quantification hallucinations AI safety

RESEARCH · arXiv CS.LG · 8d ago

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

This paper studies tool-calling in large language model (LLM) agents, examining its effectiveness and efficiency. It analyzes evaluation pipelines, showing results are sensitive to implementation choices, and identifies computational waste in reinforcement learning training.

LLMs evaluation reinforcement learning tool-calling

RESEARCH · arXiv CS.CL · 27d ago

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

Multilingual large language models (MLLMs) often exhibit inconsistent behavior regarding cultural identity when the prompt's language changes. Researchers introduce a new metric, Singleton Fleiss's $\kappa_S$, and a consensus-driven alignment framework, C-3PO, to mitigate these cross-lingual cultural inconsistencies, achieving significant improvements.

Multilingual AI LLMs AI alignment Cultural Bias

RESEARCH · arXiv CS.CL · 27d ago

ToolWeave: Structured Synthesis of Complex Multi-Turn Tool-Calling Dialogues

ToolWeave is a structured framework for synthesizing realistic multi-turn tool-calling dialogues, essential for LLMs to function as autonomous agents. It addresses challenges in existing synthetic data generation by supporting realistic multi-step workflows and reducing parameter hallucination.

data synthesis LLMs tool-calling dialogue systems

CASE · DEV.to AI · 4/28/2026

I Built a 24/7 AI Agent System on a $6/Month VPS — Here's the Stack

This post describes how the author built a 24/7 autonomous AI agent system on a $6/month VPS, using OpenClaw, DeepSeek V4 Pro, and Playwright for automation. The system manages social media posts, Dev.to articles, and a Gumroad store, and is presented as an example of low-cost AI automation.

LLMs DevOps Cost Optimization automation