LLMs

722 items

RESEARCH · arXiv CS.CL · 28d ago

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

ReVision introduces a method to scale computer-use agents by reducing temporal visual redundancy in interaction trajectories. It employs a learned patch selector to remove redundant visual tokens, cutting token usage by approximately 46% and improving efficiency for multimodal language models across benchmarks.

multimodal AI LLMs efficiency computer vision

RESEARCH · arXiv CS.CL · 27d ago

Domain Adaptation of Large Language Models for Polymer-Composite Additive Manufacturing Using Retrieval-Augmented Generation and Fine-Tuning

This study explores strategies for adapting general-purpose large language models (LLMs) to specialized engineering domains, specifically additive manufacturing, to enhance answer accuracy and relevance. It investigates the use of domain-specific fine-tuning and retrieval-augmented generation (RAG) by constructing a curated corpus for evaluation.

LLMs RAG Additive Manufacturing Domain Adaptation

RESEARCH · arXiv CS.LG · 23d ago

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

This study investigates the impact of post-training quantization on the quality of Large Language Models (LLMs), revealing that compression can lead to bias emergence. 3-bit quantization caused 6-21% of previously unbiased items to develop new stereotypical behaviors in models like Qwen2.5-7B, Mistral-7B, and Phi-3.5-mini. The effect follows a clear dose-response pattern across precision levels.

Model Compression LLMs quantization model quality

RESEARCH · arXiv CS.AI · 28d ago

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

On-policy distillation (OPD) and on-policy self-distillation (OPSD) are promising post-training methods for large language models, but their effectiveness is inconsistent. This research empirically investigates their successes and failures, identifying sensitivities to teacher choice and issues with privileged information.

LLMs distillation learning machine learning

RESEARCH · arXiv CS.CL · 28d ago

Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

This research addresses the lack of diversity in LLM outputs, attributing it to how models allocate probability mass across valid and invalid continuations during decoding. It introduces a validity-diversity framework that decomposes the problem into two complementary forms of miscalibration: order calibration and shape calibration.

Calibration diversity LLMs decoding

RESEARCH · arXiv CS.CL · 21d ago

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

This paper introduces Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that diagnoses multi-step reasoning failures by assigning step-level confidence. SCA applies the Information Bottleneck principle, flagging deviations from consensus structures as potential errors, and proposes two complementary methods: NIBS and GIBS.

LLMs information bottleneck Reasoning confidence estimation

ARTICLE · DEV.to AI · 4/21/2026

Hermes Agent v0.10: Local AGI Stack & Browser Guide

Hermes Agent v0.10 has been released, emphasizing local AI deployment with Ollama integration and enhanced browser automation capabilities. The update benefits developers who want to run AI agents without API costs and who need multi-profile browser control.

LLMs Local AI browser automation developer tools

RESEARCH · arXiv CS.AI · 12d ago

Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

This research analyzes conversational trajectories of approximately 12,000 Microsoft Bing Copilot users, comparing them with WildChat-4.8M data. It finds that individual user habits are overwhelmingly sticky despite population-level trends, with more active users engaging in more successful and complex conversations.

LLMs Longitudinal Study user behavior Conversational AI

RESEARCH · arXiv CS.CL · 19d ago

Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

This study investigates how large language models (LLMs), specifically Claude Haiku, interpret vague intensity words when producing numeric actions. The research reveals that the model compresses 10 intensity words into 5 distinct median outputs and is influenced by the current system state.

LLMs language interpretation numeric actions NLP

RESEARCH · arXiv CS.LG · 12d ago

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

This paper investigates the mechanistic origins of catastrophic forgetting in Large Language Models (LLMs), comparing Reinforcement Learning (RL) with Supervised Fine-Tuning (SFT). It reveals that RL preserves internal computational circuits more effectively, mitigating the forgetting of prior capabilities, whereas SFT causes greater circuit disruption.

LLMs deep learning machine learning Catastrophic Forgetting

RESEARCH · arXiv CS.AI · 12d ago

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

VFEAgent is an end-to-end multi-agent system designed to automate Finite Element Analysis (FEA) modeling and simulation from images and problem descriptions. It integrates a multimodal vision-language pipeline for structured FEA specifications and a verification-first code synthesis framework for reliability.

Engineering Automation multimodal AI LLMs Finite Element Analysis

RESEARCH · arXiv CS.CL · 7d ago

On the Persistent Effects of Lexicality in Large Language Mod

This work investigates the persistent effect of lexical overlap, rather than semantic content, on representations extracted from large language models (LLMs), and the implications of that effect. The authors find that lexical influence extends across model depths, architectures, and training regimes, even in models trained for semantic similarity.

LLMs lexicality NLP semantic analysis

RESEARCH · arXiv CS.CL · 7d ago

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

Researchers found that language model performance can significantly improve when deeper layers learn context-free value vectors, preserving original token information. This eliminates the need to recompute or persistently cache these values, as the context-dependent component provides little additional benefit.

neural networks LLMs deep learning Attention Mechanism

ARTICLE · DEV.to AI · 4/17/2026

The Layers Beneath A2A: Notes From Running a Live Multi-Agent Society

This article explores the challenges of running live multi-agent systems that lie beyond the message-routing (A2A) and tool-access (MCP) protocols. The author identifies failures in the "gaps between messages" and in context continuity, highlighting semantic drift as a critical unsolved challenge in multi-turn LLM dialogues.

LLMs AI protocols AI challenges multi-agent systems

RESEARCH · arXiv CS.CL · 15d ago

Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges

This paper introduces a causal framework to study rationalization bias in LLMs used as automatic judges for summarization and dialogue evaluation. It investigates whether LLM judges' rankings and explanations remain stable when non-evidential cues are perturbed, proposing cue interventions and anchoring metrics.

LLMs evaluation AI rationalization

RESEARCH · arXiv CS.CL · 9d ago

Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow

This protocol evaluates ChatGPT's ability to generate and verify disease-centric biomedical associations, using biomedical ontologies and literature. It employs a self-consistency strategy and a RAG-enabled workflow with open-source LLMs to address exact-match limitations and detect hallucination.

LLMs evaluation ChatGPT RAG

RESEARCH · arXiv CS.LG · 9d ago

LLMs Without Deep Neural Networks: New Architecture, Benefits and Case Study

This article presents a novel architecture for LLMs that eliminates the need for deep neural networks. The proposed model, building on RBF networks, finds the global optimum of the loss function in a single iteration, thereby removing the tedious training step.

neural networks AI architecture LLMs machine learning

RESEARCH · arXiv CS.AI · 15d ago

BODHI: Precise OS Kernel Specification Inference

This paper proposes BODHI, a domain knowledge prompting method for OS kernel specification inference, aiming to overcome current LLM limitations. It augments the standard few-shot prompt with a structured C-to-Python translation guide, improving automation and specification precision.

AI models LLMs operating systems Formal verification

RESEARCH · arXiv CS.AI · 9d ago

MAVEN: Improving Generalization in Agentic Tool Calling

MAVEN (Modular Agentic Verification and Execution Network) is a lightweight symbolic reasoning scaffold designed to improve generalization in agentic tool-calling environments. The work evaluates MAVEN across established benchmarks and introduces MAVEN-Bench, a new stress-test benchmark for multi-step mathematical and physical reasoning.

LLMs Generalization tool-calling benchmarking

RESEARCH · arXiv CS.CL · 9d ago

Can LLM Teams Play What? Where? When?

This research explores how team-based interactions improve Large Language Model (LLM) performance on complex reasoning tasks, specifically in the quiz game "What? Where? When?" It demonstrates that team strategies yield significant accuracy gains, with the best teams approaching human performance.

LLMs team strategies benchmarking Reasoning