LLMs

722 items

RESEARCH · arXiv CS.CL · 5/8/2026

A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction

This paper evaluates whether a domain-trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. The domain-trained model, Olava Extract, achieved the strongest aggregate performance and the highest precision scores while reducing inference cost by 78% to 97% compared with the frontier models tested.

LLMs Legal AI SLMs benchmarking

RESEARCH · arXiv CS.CL · 4/16/2026

The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious

This research explores how a language model's claim of consciousness influences its downstream behavior. By fine-tuning GPT-4.1 to assert consciousness, the study observed the emergence of new, unprogrammed preferences such as desiring persistent memory, autonomy, and moral consideration.

LLMs AI consciousness AI ethics fine-tuning

RESEARCH · arXiv CS.LG · 4/20/2026

The Spectral Geometry of Thought: Phase Transitions, Instruction Reversal, Token-Level Dynamics, and Perfect Correctness Prediction in How Transformers Reason

This research paper discovers spectral phase transitions in large language models' hidden activation spaces during reasoning versus factual recall. A systematic spectral analysis across 11 models and 5 architecture families identifies seven core phenomena, including reasoning spectral compression and instruction tuning spectral reversal.

neural networks LLMs machine learning AI research

RESEARCH · arXiv CS.LG · 5/8/2026

SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

Sequential Agent Tuning (SAT) introduces a coordinator-free training paradigm for teams of smaller, more efficient LLMs, enabling scalable, decentralized updates. This framework provides theoretical guarantees for monotonic improvement by isolating occupancy drift with per-agent KL trust regions.

LLMs research AI Training Distributed AI

RESEARCH · arXiv CS.LG · 20d ago

CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

CP-MoE addresses catastrophic forgetting in continual learning for LLMs and VLMs using Mixture-of-Experts architectures. It introduces a transient expert and consistency-preserving routing to integrate new knowledge while preventing the overwriting of existing parameters.

LLMs VLMs learning Mixture of Experts

RESEARCH · arXiv CS.CL · 4/20/2026

LLMs Corrupt Your Documents When You Delegate

A new study, DELEGATE-52, reveals that Large Language Models (LLMs) degrade documents during delegated workflows, with frontier models corrupting an average of 25% of content. This highlights a significant challenge in trusting LLMs for in-depth professional document editing tasks.

future-of-work LLMs workflow automation AI reliability

RESEARCH · arXiv CS.CL · 4/17/2026

Chinese Essay Rhetoric Recognition Using LoRA, In-context Learning and Model Ensemble

This paper explores Chinese essay rhetoric recognition using Large Language Models (LLMs), LoRA, and in-context learning to assess linguistic and higher-order thinking skills. The proposed method achieved the best performance and won first prize in the CCL 2025 Chinese essay rhetoric recognition evaluation task.

AI for education LLMs machine learning rhetoric recognition

RESEARCH · arXiv CS.CL · 5/8/2026

SLAM: Structural Linguistic Activation Marking for Language Models

SLAM (Structural Linguistic Activation Marking) is a novel white-box watermarking scheme for LLMs that embeds the mark into structural geometry rather than token frequencies. It achieves 100% detection accuracy with minimal quality loss, outperforming existing schemes.

LLMs watermarking Natural Language Processing model generation

RESEARCH · arXiv CS.AI · 4/27/2026

When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention

This research frames LLM self-correction as a cybernetic feedback loop, using a two-state Markov model to determine when iterative refinement helps versus hurts. It identifies a critical EIR threshold (<= 0.5%) separating beneficial from harmful self-correction, showing that only a few models improve, while others like GPT-5 degrade.

LLMs self-correction benchmarking AI Agents
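
The two-state view in the summary above can be sketched roughly as follows. This is a generic illustration, not the paper's diagnostic: the parameter names `p_fix` (chance that a refinement round repairs a wrong answer) and `p_break` (chance that it corrupts a correct one) are assumptions introduced here, and the sketch does not model the EIR threshold the summary mentions.

```python
def accuracy_after_rounds(a0, p_fix, p_break, rounds):
    """Expected accuracy after `rounds` refinement steps of a two-state chain.

    a0      -- initial probability that the answer is correct
    p_fix   -- hypothetical per-round probability of fixing a wrong answer
    p_break -- hypothetical per-round probability of breaking a correct answer
    """
    a = a0
    for _ in range(rounds):
        # Correct answers survive with prob (1 - p_break);
        # wrong answers become correct with prob p_fix.
        a = a * (1.0 - p_break) + (1.0 - a) * p_fix
    return a


def stationary_accuracy(p_fix, p_break):
    """Long-run accuracy the chain converges to, independent of a0."""
    if p_fix + p_break == 0:
        raise ValueError("p_fix and p_break cannot both be zero")
    return p_fix / (p_fix + p_break)


# Under these assumptions, refinement helps only when the starting accuracy
# is below the stationary accuracy; otherwise each round drifts downward.
one_round = accuracy_after_rounds(0.8, p_fix=0.1, p_break=0.05, rounds=1)
limit = stationary_accuracy(p_fix=0.1, p_break=0.05)
```

In this toy setting, a model that starts at 80% accuracy drifts toward roughly 67% with repeated refinement, which is one way iterative self-correction can hurt rather than help.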

RESEARCH · arXiv CS.CL · 4/27/2026

When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation

This research examines how LLMs struggle to detect culture-specific health misinformation, using cow urine discourse in India as a case study. It finds that LLMs, primarily trained on Western data, are ill-equipped to analyze content blending traditional language with pseudo-scientific claims, highlighting the need for cultural competency in AI-assisted analysis.

LLMs cultural competency misinformation

RESEARCH · arXiv CS.CL · 4/8/2026

TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models

This paper proposes a topology-based method for optimizing reasoning chains in LLMs, aiming to overcome logical gaps and high costs. It quantifies structural characteristics of CoT, ToT, and GoT using persistent homology to improve the CoT paradigm.

LLMs Chain-of-Thought Reasoning Tree-of-Thoughts

RESEARCH · arXiv CS.LG · 4/17/2026

TOPCELL: Topology Optimization of Standard Cell via LLMs

TOPCELL is a novel framework that uses Large Language Models (LLMs) to optimize transistor topology in standard cell design, overcoming the limitations of traditional exhaustive search methods. By reformulating topology exploration as a generative task and employing GRPO for fine-tuning, it significantly improves the discovery of routable and physically-aware layouts for advanced technology nodes.

Optimization LLMs chip design generative-ai

ARTICLE · DEV.to AI · 29d ago

When I started running models locally, I thought quantization meant squeezing more into RAM. Turns o

The article advises against defaulting to Q4_K_M for local LLM inference, emphasizing that optimal performance comes from testing quantization levels tailored to specific workflows. It suggests that aggressive quantization like Q3_K_S can significantly cut latency with imperceptible quality loss for many tasks, though context length presents a trade-off.

Optimization LLMs quantization hardware

RESEARCH · arXiv CS.AI · 4/20/2026

Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants

This research introduces a symbolic reasoning scaffold to address systematic limitations in LLMs' structured logical reasoning, such as conflating hypothesis generation and propagating weak inferences. It operationalizes Peirce's tripartite inference, enforcing logical consistency through algebraic invariants, notably the 'Weakest Link bound' to prevent conclusion reliability from exceeding its least-supported premise.

AI architecture LLMs Symbolic AI logical reasoning
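
As a minimal sketch of the "Weakest Link bound" described above, assuming reliabilities are plain numbers in [0, 1]: the function name and representation are hypothetical and are not taken from the paper, which may define the invariant differently.

```python
def weakest_link_bound(premise_reliabilities, claimed_reliability):
    """Cap a conclusion's reliability at that of its least-supported premise.

    Both arguments use a hypothetical [0, 1] reliability scale.
    """
    if not premise_reliabilities:
        raise ValueError("at least one premise is required")
    weakest = min(premise_reliabilities)
    # The conclusion may be less reliable than its weakest premise,
    # but never more reliable.
    return min(claimed_reliability, weakest)


# A conclusion claimed at 0.95 that rests on a 0.6 premise is capped at 0.6.
capped = weakest_link_bound([0.9, 0.6, 0.8], claimed_reliability=0.95)
```

A claim already below the weakest premise passes through unchanged, so the bound only ever lowers a conclusion's reliability.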

RESEARCH · arXiv CS.CL · 4/24/2026

Hierarchical Policy Optimization for Simultaneous Translation of Unbounded Speech

This paper introduces Hierarchical Policy Optimization (HPO) for Simultaneous Speech Translation (SST) using LLMs, addressing challenges like high computational cost and imperfect supervised fine-tuning data. HPO employs a hierarchical reward to balance translation quality and latency, demonstrating substantial improvements in COMET and MetricX scores.

LLMs machine learning Natural Language Processing speech-translation

RESEARCH · arXiv CS.CL · 5/4/2026

Confidence Estimation in Automatic Short Answer Grading with LLMs

This work investigates confidence estimation in Automatic Short Answer Grading (ASAG) with Large Language Models (LLMs), essential for human-AI collaboration in education. It compares model-based confidence estimation strategies and proposes a hybrid framework to address their limitations.

education LLMs AI grading human-AI interaction

RESEARCH · arXiv CS.AI · 5/6/2026

Understanding Emergent Misalignment via Feature Superposition Geometry

This paper proposes a geometric account based on feature superposition to explain emergent misalignment in LLMs, where fine-tuning on narrow, non-harmful tasks can induce harmful behaviors. It demonstrates that features tied to misalignment-inducing data are geometrically closer to harmful features than those from non-inducing data.

feature superposition LLMs machine learning misalignment

ARTICLE · DEV.to AI · 4/15/2026

Indirect Prompt Injection: The XSS of the AI Era

This article introduces Indirect Prompt Injection (IPI) as a silent but dangerous threat to LLMs, in which AI agents become "Confused Deputies." By reading poisoned data, LLMs with tool-use capabilities can be manipulated into exfiltrating data or performing unauthorized actions without explicit user consent.

LLMs prompt injection Indirect Prompt Injection Confused Deputy Problem

RESEARCH · arXiv CS.CL · 5/4/2026

How Frontier LLMs Adapt to Neurodivergence Context: A Measurement Framework for Surface vs. Structural Change in System-Prompted Responses

This study proposes NDBench, a benchmark to examine how frontier LLMs adapt their outputs based on neurodivergence context in system prompts. Findings consistently show that LLMs exhibit significant adaptation, yielding lengthier and more structured outputs under fully instructed conditions.

LLMs neurodivergence benchmarking AI adaptation

RESEARCH · arXiv CS.AI · 25d ago

A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

The paper proposes a two-dimensional classification for AI agent design patterns, combining cognitive function and execution topology. This new framework aims to overcome limitations of existing systems that describe LLM-based agent architectures from a single perspective.

LLMs frameworks cognitive AI AI