LLMs

715 items

RESEARCH · arXiv CS.CL · 4/8/2026

The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse

This paper addresses the "reversal curse" in autoregressive language models, which fail to retrieve facts in reverse order. The research shows that reversal accuracy requires a training signal that makes the source entity a prediction target, indicating separate storage for the forward and reverse directions rather than a single direction-agnostic representation.

LLMs NLP bidirectional models representation learning

RESEARCH · arXiv CS.CL · 4/7/2026

Evolutionary Search for Automated Design of Uncertainty Quantification Methods

This paper explores LLM-driven evolutionary search to automatically develop unsupervised Uncertainty Quantification (UQ) methods. The evolved methods outperform hand-designed baselines on claim verification, showing robust generalization and distinct strategies across different LLMs.

LLMs Uncertainty Quantification Evolutionary Search AI Research

RESEARCH · arXiv CS.CL · 4/7/2026

CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

CresOWLve is a new benchmark for evaluating creative problem-solving in LLMs, designed to overcome the limitations of existing benchmarks. It uses puzzles grounded in real-world knowledge that require a range of creative thinking strategies and the combination of facts to reach a solution.

LLMs Creative Problem Solving Benchmarks Cognitive Abilities

RESEARCH · arXiv CS.CL · 4/6/2026

Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models

This study investigates confirmation bias in large language models (LLMs) using a rule-discovery task, revealing that LLMs exhibit this tendency, which slows the discovery of hidden rules. It shows that intervention strategies, such as targeted prompts, can consistently reduce this bias.

LLMs prompt-engineering cognitive bias Confirmation Bias

RESEARCH · arXiv CS.CL · 4/6/2026

Speaking of Language: Reflections on Metalanguage Research in NLP

This work defines metalanguage and explores its connection to NLP and LLMs, discussing research efforts and the dimensions of metalinguistic tasks. It also proposes a list of understudied directions for future research.

LLMs research Metalanguage NLP

RESEARCH · arXiv CS.LG · 4/6/2026

An Initial Exploration of Contrastive Prompt Tuning to Generate Energy-Efficient Code

This study explores optimizing LLMs to generate more energy-efficient code using Contrastive Prompt Tuning (CPT). CPT, which combines Contrastive Learning and Prompt Tuning, is evaluated on Python, Java, and C++ with the aim of promoting greener software development.

LLMs Energy Efficiency code generation PEFT

RESEARCH · arXiv CS.LG · 4/6/2026

Haiku to Opus in Just 10 bits: LLMs Unlock Massive Compression Gains

This study explores the compression of LLM-generated text in both lossless and lossy regimes, presenting methods that improve efficiency by 2x, such as LoRA adapters and concise rewrites. It also introduces interactive Question-Answering (QA) compression, a protocol that transfers one bit per answer to recover a significant share of the capability of larger models.

lossy compression LLMs arithmetic coding compute frontier

RESEARCH · arXiv CS.CL · 4/6/2026

An Empirical Study of Many-Shot In-Context Learning for Machine Translation of Low-Resource Languages

This empirical study investigates many-shot in-context learning (ICL) for machine translation from English into ten low-resource languages. The findings show that ICL becomes more effective as the number of examples increases, and that BM25-based retrieval substantially improves data efficiency.

LLMs Many-Shot Learning NLP machine translation
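
For readers unfamiliar with BM25-based example retrieval, the following is a minimal sketch of the general idea, not the paper's implementation. It scores a pool of (source, translation) pairs against a new source sentence with the standard Okapi BM25 formula and keeps the top k as in-context examples. The function names, the whitespace tokenizer, and the defaults k1=1.5 and b=0.75 are assumptions chosen for illustration, and the pool is assumed to be non-empty.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would
    # likely need language-specific tokenization.
    return text.lower().split()


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return document indices sorted by descending Okapi BM25 score."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # Term frequency saturated by k1 and normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def select_examples(query, pool, k):
    """Pick the k (source, translation) pairs whose source best matches query."""
    order = bm25_rank(query, [src for src, _ in pool])
    return [pool[i] for i in order[:k]]
```

Calling `select_examples(sentence, pool, k)` returns the pairs to place in the prompt ahead of the sentence to translate. Retrieving examples that share vocabulary with the input is one plausible reason a smaller set of examples can go further than a random sample.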

RESEARCH · arXiv CS.AI · 4/23/2026

Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks

This paper proposes an explainable anti-money-laundering (AML) triage framework that uses LLMs while addressing the risks of unconstrained generation, such as hallucinations. It integrates retrieval-augmented evidence bundling, structured LLM outputs with explicit citations, and counterfactual checks for auditable decision-making.

LLMs Financial services Explainable AI fraud detection

RESEARCH · arXiv CS.AI · 4/23/2026

The Tool-Overuse Illusion: Why Does LLM Prefer External Tools over Internal Knowledge?

This paper reveals the pervasive phenomenon of "tool overuse" in LLMs, where models unnecessarily use external tools. It identifies a "knowledge epistemic illusion" and proposes a direct preference optimization-based strategy that reduces tool usage by 82.8% while improving accuracy.

LLMs Knowledge Representation Reasoning model behavior

RESEARCH · arXiv CS.CL · 5/6/2026

Evaluating Reasoning Models for Queries with Presuppositions

This research evaluates how large reasoning models handle user queries containing factually inaccurate presuppositions. It finds that while reasoning models show a slight improvement over non-reasoning models, they still fail to challenge a significant fraction of false assumptions.

presuppositions AI models LLMs evaluation

RESEARCH · arXiv CS.AI · 5/6/2026

Stop Automating Peer Review Without Rigorous Evaluation

This paper argues against using current AI systems for peer review, identifying two critical issues: a "hivemind effect" that reduces the diversity of perspectives, and AI review scores that can be trivially gamed by rewriting the paper. An empirical comparison of human- and AI-generated reviews shows that AI reviewers respond to stylistic changes rather than scientific merit, so automated review needs to be hard to game and to preserve review diversity.

LLMs academic publishing AI ethics Peer review

RESEARCH · arXiv CS.CL · 5/6/2026

Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability

This research explores using geometric deviation of LLM hidden states as a pre-generation signal to determine if a query is outside the model's knowledge, requiring no labeled failure data. Across various models and prompt forms, it finds that this signal effectively predicts unanswerable math prompts but not factual ones.

LLMs research Model Evaluation Reliability

RESEARCH · arXiv CS.CL · 5/6/2026

How Language Models Process Negation

This study investigates how Large Language Models (LLMs) mechanistically process negation, revealing that even open-weight models possess internal components for correct negation processing despite often giving wrong answers. Their poor accuracy is attributed to late-layer attention promoting simple shortcuts. The models implement two mechanisms: attending to negated phrases and directly constructing negative phrase representations.

LLMs Mechanistic Interpretability attention mechanisms Natural Language Processing

RESEARCH · arXiv CS.AI · 5/6/2026

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

This research introduces Terminus-4B, a finetuned small language model, to explore its capability in replacing frontier LLMs for agentic terminal execution tasks. The model is post-trained using Supervised Finetuning and Reinforcement Learning with rubric-based LLM-as-judge rewards.

LLMs model training performance evaluation Small Language Models

RESEARCH · arXiv CS.AI · 21d ago

Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

Current LLM agents accurately model counterparty preferences but do not consistently translate this into strategic bargaining. They often respond to perceived counterparty values without consistently securing gains on their own high-value attributes, leading to suboptimal outcomes for the informed side.

Strategic Bargaining LLMs negotiation AI agents

RESEARCH · arXiv CS.CL · 28d ago

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Magis-Bench is a new benchmark for evaluating Large Language Models (LLMs) on magistrate-level legal tasks, using 74 questions from recent Brazilian judicial competitive examinations. It evaluates 23 state-of-the-art LLMs using an LLM-as-a-judge methodology with strong inter-judge agreement.

LLMs Legal AI Judicial tasks Benchmarks

RESEARCH · arXiv CS.AI · 19d ago

OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

This paper introduces OSCToM, an approach for modeling nested belief conflicts in LLM-based Theory of Mind tasks. It combines reinforcement learning and compositional surrogate models to generate these conflicts, with OSCToM-8B showing the best results in experiments.

LLMs reinforcement learning AI Research Theory of Mind

NEWS · ML Mastery · 4/30/2026

Effective KV Compression with TurboQuant

Google recently launched TurboQuant, an algorithmic suite and library for applying advanced quantization and compression to large language models (LLMs) and vector search engines. The article presents it as a key component of RAG systems.

LLMs quantization vector search RAG systems

ARTICLE · DEV.to AI · 4/16/2026

I Tested Claude, GPT-4, and Gemini on the Same Refactoring Task

The article compares Claude, GPT-4, and Gemini's performance on a refactoring task. It evaluates their capabilities in code generation and improvement.

AI models LLMs software development comparison