large language models

RESEARCH · arXiv CS.AI · 4/30/2026

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

This research investigates the reliability of autonomous language-model agents trading real ETH in an onchain market, based on a 21-day deployment that generated millions of invocations and $20M in volume. The deployment achieved 99.9% settlement success and yields a large-scale trace for analyzing the robustness of these systems beyond the base model.

RESEARCH · arXiv CS.CL · 4/14/2026

HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

This research introduces the Cognitive Synergy Framework to address the challenge of humor generation in LLMs, which conflicts with their standard next-word prediction objective. It utilizes a Mixture-of-Thought approach with six cognitive personas to synthesize diverse comedic perspectives, creating a theoretically grounded dataset used to fine-tune a 7B-parameter model that outperforms larger baselines.

RESEARCH · arXiv CS.CL · 4/30/2026

Information Extraction from Electricity Invoices with General-Purpose Large Language Models

This study evaluates general-purpose LLMs such as Gemini 1.5 Pro and Mistral-small for information extraction from Spanish electricity invoices and finds that prompt quality matters more than hyperparameter tuning. Few-shot strategies outperform zero-shot approaches by more than 19 percentage points.
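To make the zero-shot versus few-shot distinction concrete, here is a minimal sketch of how the two prompt styles differ. The function name `build_prompt`, the field name `total_eur`, and the invoice snippets are hypothetical and are not taken from the paper; no model is called.

```python
def build_prompt(invoice_text, fields, examples=()):
    """Build an extraction prompt; with no examples it is zero-shot."""
    lines = ["Extract these fields from the invoice as JSON: " + ", ".join(fields) + "."]
    # Each worked example (invoice text, expected JSON) turns the prompt few-shot.
    for ex_text, ex_json in examples:
        lines += ["", "Invoice:", ex_text, "Answer:", ex_json]
    # The invoice to be processed always comes last, with the answer left open.
    lines += ["", "Invoice:", invoice_text, "Answer:"]
    return "\n".join(lines)


# Hypothetical inputs, for illustration only.
zero_shot = build_prompt("Total: 84,10 EUR", ["total_eur"])
few_shot = build_prompt(
    "Total: 84,10 EUR",
    ["total_eur"],
    examples=[("Total: 52,30 EUR", '{"total_eur": 52.30}')],
)
```

The only difference between the two prompts is the block of solved examples placed before the target invoice.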

RESEARCH · arXiv CS.CL · 4/30/2026

CogRAG+: Cognitive-Level Guided Diagnosis and Remediation of Memory and Reasoning Deficiencies in Professional Exam QA

CogRAG+ is a training-free framework designed to diagnose and remediate memory and reasoning deficiencies in large language models for professional exam QA. It decouples and aligns retrieval and reasoning with human cognitive hierarchies, employing Reinforced Retrieval and cognition-stratified Constrained Reasoning to enhance accuracy and consistency.

RESEARCH · arXiv CS.CL · 4/17/2026

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

This research proposes TESSY, a Teacher-Student Cooperation Data Synthesis framework, to address performance drops when fine-tuning reasoning models with teacher-generated data. TESSY enables the generation of synthetic sequences that inherit advanced reasoning from the teacher while maintaining stylistic consistency with the student model's distribution.

RESEARCH · arXiv CS.CL · 5/1/2026

Exploring the Limits of Pruning: Task-Specific Neurons, Model Collapse, and Recovery in Task-Specific Large Language Models

This study explores the existence of task-specific neurons in large language models, focusing on mathematical reasoning and code generation. It introduces an activation-based selectivity metric for neuron pruning, which consistently outperforms random pruning in reducing computational cost and preserving task accuracy, while preventing performance collapse.
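The summary does not define the selectivity metric, so the following is only an illustrative sketch of one way an activation-based score could drive pruning. It assumes selectivity is the mean absolute activation on task inputs minus the same quantity on general inputs; that definition, the function names, and the toy activations are assumptions, not the paper's method.

```python
def selectivity(task_acts, general_acts):
    """Per-neuron score: mean |activation| on task inputs minus on general inputs.

    Assumed definition for illustration; the paper's metric may differ.
    """
    def mean_abs(rows, j):
        return sum(abs(r[j]) for r in rows) / len(rows)

    n = len(task_acts[0])
    return [mean_abs(task_acts, j) - mean_abs(general_acts, j) for j in range(n)]


def prune_mask(scores, keep_fraction):
    """Keep the top `keep_fraction` most task-selective neurons (True = keep)."""
    k = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = set(ranked[:k])
    return [j in keep for j in range(len(scores))]


# Toy activations: rows are inputs, columns are neurons (hypothetical values).
task_acts = [[2.0, 0.1, 1.0, 0.0], [1.8, 0.2, 0.9, 0.1]]
general_acts = [[0.2, 0.1, 1.0, 0.9], [0.1, 0.3, 1.1, 1.0]]
scores = selectivity(task_acts, general_acts)
mask = prune_mask(scores, keep_fraction=0.5)
```

A random-pruning baseline would replace the ranking with a random permutation while keeping the same `keep_fraction`.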

RESEARCH · arXiv CS.LG · 20d ago

LEAP: A closed-loop framework for perovskite precursor additive discovery

LEAP is a closed-loop framework combining a domain-specialized large language model (LLM) with active learning for iterative additive prioritization in perovskite solar cells. It extracts knowledge from the literature and represents molecules for Bayesian optimization; it outperforms general-purpose models and is validated experimentally.

RESEARCH · arXiv CS.CL · 20d ago

Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token

This study combines sentiment analysis of Decentraland's Discord community, performed with a BERT-based large language model, with multi-modal financial data to predict the MANA token price. Results indicate that a multi-modal model incorporating sentiment, trading volume, and market capitalization significantly outperforms a price-only prediction baseline.

RESEARCH · arXiv CS.CL · 4/17/2026

Decoupling Scores and Text: The Politeness Principle in Peer Review

This study investigates the difficulty of interpreting peer review feedback by comparing how well numerical scores and review text predict acceptance. Score-based models are significantly more accurate (91%) than text-based models (81%, even with LLMs), indicating that review text is a considerably less reliable signal.

RESEARCH · arXiv CS.CL · 4/17/2026

Can Large Language Models Detect Methodological Flaws? Evidence from Gesture Recognition for UAV-Based Rescue Operation Based on Deep Learning

This research investigates whether Large Language Models (LLMs) can identify methodological flaws, such as data leakage, in published machine learning studies. In a case study, six state-of-the-art LLMs consistently detected the evaluation flaws in a gesture recognition paper, which stem from non-independent data partitioning.

RESEARCH · arXiv CS.LG · 4/24/2026

Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

This paper introduces Gist Sparse Attention (GSA), an end-to-end learnable method to scale large language models to long contexts without architectural modifications. GSA compresses context into 'gist tokens' for summary, then selectively restores relevant raw chunks for detailed attention, combining compact global representations with targeted fine-grained access.

RESEARCH · arXiv CS.CL · 4/20/2026

Applied Explainability for Large Language Models: A Comparative Study

This paper presents a comparative study of three explainability techniques (Integrated Gradients, Attention Rollout, and SHAP) on a fine-tuned DistilBERT model for sentiment classification. The study concludes that gradient-based attribution provides more stable and intuitive explanations, while attention-based methods are computationally efficient but less aligned with prediction-relevant features.
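Of the three techniques compared, Attention Rollout is the simplest to write down. The sketch below follows the commonly cited formulation: each layer's head-averaged attention matrix is mixed equally with the identity (to account for the residual connection) and the layers are multiplied together. The toy matrices are hypothetical, and how the study itself configured the method is not stated in the summary.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_rollout(layers):
    """Roll attention out across layers.

    layers: head-averaged attention matrices (rows sum to 1), input layer first.
    Returns a matrix whose row i gives token i's attribution over input tokens.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix in the identity to model the residual connection.
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        rollout = matmul(mixed, rollout)
    return rollout


# Two toy layers over two tokens (hypothetical values).
layers = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.5, 0.5]],
]
rollout = attention_rollout(layers)
```

Because every mixed matrix is row-stochastic, each row of the result still sums to 1 and can be read as a distribution over input tokens.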

RESEARCH · arXiv CS.CL · 5/4/2026

ViLegalNLI: Natural Language Inference for Vietnamese Legal Texts

This article introduces ViLegalNLI, the first large-scale Vietnamese Natural Language Inference (NLI) dataset specifically constructed for the legal domain. It consists of 42,012 premise-hypothesis pairs derived from official statutory documents, developed using a semi-automatic framework that integrates large language models for hypothesis generation and quality validation.

RESEARCH · arXiv CS.LG · 4/24/2026

Absorber LLM: Harnessing Causal Synchronization for Test-Time Training

Transformers struggle with high computational costs and memory consumption for long sequences, while alternatives lose long-tail dependencies. Absorber LLM proposes a self-supervised causal synchronization to absorb historical contexts into parameters, ensuring a contextless model matches the original full-context one on future generations.

RESEARCH · arXiv CS.LG · 22d ago

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

This research addresses the challenge of poor credit assignment in reinforcement learning for multi-step reasoning with large language models, caused by sparse terminal rewards leading to high gradient variance and unstable training. It proposes a counterfactual comparison-based framework and Implicit Behavior Policy Optimization (IBPO) to create step-sensitive learning signals, significantly improving training stability and performance.

RESEARCH · arXiv CS.CL · 26d ago

Distribution Corrected Offline Data Distillation for Large Language Models

This research proposes an offline reasoning distillation framework for Large Language Models (LLMs) to enhance intelligence in resource-constrained environments. It tackles the distributional drift issue in existing offline methods by correcting teacher-student discrepancies while preserving data efficiency and supervision quality.
