LLMs

722 items

DOC · DEV.to AI · 22d ago

89. The Claude API: Building with Anthropic's Models

This post explores Anthropic's Claude API, highlighting its philosophy of combined capability and safety, and its differences from OpenAI. It provides a guide from setup to production patterns for building applications with Claude's models.

LLMs learning Claude Anthropic

ARTICLE · DEV.to AI · 4/10/2026

Building Your Own "Google Maps for Codebases": A Guide to Codebase Q&A with LLMs

This article addresses the challenge of navigating complex codebases and proposes building an LLM-based Q&A system, akin to a "Google Maps for code", to understand a codebase's structure and answer questions about it. It focuses on open-source tools so that readers can move from being users of these AI solutions to being their architects.

open-source LLMs software development Codebase analysis

ARTICLE · DEV.to AI · 24d ago

Why Most Engineering Teams Are Overpaying for AI (And Don’t Even Know It)

Many engineering teams are overpaying for AI by using expensive, large models for tasks that could be handled by smaller, cheaper alternatives. The key is to match the appropriate AI model to the specific task to optimize costs and efficiency.

LLMs software development model selection Cost Optimization
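
The model-to-task matching described in the entry above can be sketched as a small router. This is a minimal illustration, not the article's method: the model names, capability tiers, prices, and task labels below are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str                  # hypothetical model name
    tier: int                  # hypothetical capability tier (higher = stronger)
    cost_per_1k_tokens: float  # placeholder price, not a real quote


# Hypothetical catalog; replace with the models and prices you actually use.
CATALOG = [
    Model("small-model", 1, 0.1),
    Model("medium-model", 2, 1.0),
    Model("large-model", 3, 10.0),
]

# Hypothetical mapping from task type to the minimum tier it needs.
REQUIRED_TIER = {
    "classification": 1,
    "summarization": 2,
    "multi_step_reasoning": 3,
}


def pick_model(task: str, catalog: list[Model] = CATALOG) -> Model:
    """Return the cheapest model whose tier meets the task's requirement.

    Unknown tasks fall back to the strongest tier in the catalog.
    """
    needed = REQUIRED_TIER.get(task, max(m.tier for m in catalog))
    eligible = [m for m in catalog if m.tier >= needed]
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)
```

Falling back to the strongest tier for unknown tasks is a conservative design choice: it trades cost for safety until the task has been classified.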

DOC · DEV.to AI · 5/8/2026

Building a RAG pipeline without OpenAI

This post explains Retrieval-Augmented Generation (RAG) and shows how to build a complete RAG pipeline without relying on OpenAI. It highlights RAG's benefits for large language models, such as preventing hallucinations and enabling source citation.

embedding models LLMs Vector Databases open-source AI
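
The retrieve-then-prompt flow behind RAG, as summarized in the entry above, can be sketched with the standard library alone. This is an illustrative sketch, not the post's pipeline: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and all function names are assumptions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumed stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Assemble a prompt that grounds the answer in numbered sources."""
    context = retrieve(query, docs, k)
    numbered = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(context))
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{numbered}\n\nQuestion: {query}"
    )
```

The prompt returned by `build_prompt` would then be sent to whichever language model the pipeline uses; numbering the sources is what makes citation possible.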

RESEARCH · arXiv CS.LG · 4/13/2026

GNN-as-Judge: Unleashing the Power of LLMs for Graph Learning with GNN Feedback

This paper proposes the "GNN-as-Judge" framework to enhance LLMs' performance in few-shot semi-supervised learning on Text-Attributed Graphs (TAGs) where labeled data is scarce. The method addresses the challenges of generating reliable pseudo-labels and mitigating label noise by incorporating the structural inductive bias of GNNs.

semi-supervised learning LLMs GNNs Few-Shot Learning

ARTICLE · DEV.to AI · 4/22/2026

I burned $800 in Claude tokens so you don't have to. Here's what I'm going to share.

Billy, founder of MC-MONKEYS, shares his experience of spending $800 and months learning to work with AI agents, particularly Claude. This introductory post outlines his intent to share lessons learned and expensive mistakes to help other developers.

LLMs development AI Agents

RESEARCH · arXiv CS.AI · 4/13/2026

StaRPO: Stability-Augmented Reinforcement Policy Optimization

StaRPO is a novel reinforcement learning framework designed to improve the logical consistency and structural coherence of large language models in complex reasoning tasks. It explicitly incorporates stability metrics, such as Autocorrelation Function and Path Efficiency, to evaluate local step-to-step coherence and global goal-directedness of the reasoning process.

Policy optimization LLMs reinforcement learning Reasoning

RESEARCH · arXiv CS.LG · 4/20/2026

Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

This paper presents causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. The research shows that factual and hallucinated trajectories diverge at the very first token, and that correcting a hallucinated path requires sustained multi-step intervention, whereas corrupting a factual path takes less effort.

Transformer Architecture LLMs hallucination model dynamics

RESEARCH · arXiv CS.CL · 5/4/2026

Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions

Large language models (LLMs) often struggle with strategic decision-making under incomplete information. The paper traces this to two internal gaps: an 'observation-belief gap', in which LLMs' internal beliefs are accurate but brittle, degrade with complex reasoning, and exhibit biases, and a 'belief-action gap', in which those beliefs convert only weakly into effective actions.

LLMs Decision-making AI limitations Cognitive Biases

RESEARCH · arXiv CS.CL · 5/11/2026

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

This paper introduces MIST, a synthetic multi-turn, voice-driven code generation dataset for IoT devices. The authors identify a significant performance gap between open- and closed-weight multimodal LLMs on this dataset, indicating substantial room for improvement.

LLMs IoT AI Smart Homes

RESEARCH · arXiv CS.AI · 4/25/2026

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

This paper introduces COSPLAY, a co-evolution framework designed to enhance LLM decision-making in long-horizon interactive environments. It enables an LLM agent to retrieve skills from a learnable skill bank while an agent pipeline discovers and retains reusable skills from its own unlabeled rollouts.

LLMs reinforcement learning Skill Discovery AI Agents

RESEARCH · arXiv CS.LG · 4/9/2026

TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models

TalkLoRA proposes a MoELoRA framework that addresses the routing instability and expert dominance of existing methods by letting experts communicate before routing. A lightweight conversation module facilitates this information exchange, producing a more robust routing signal for Large Language Models (LLMs).

LLMs MoE Communication fine-tuning

DOC · DEV.to AI · 24d ago

DeepSeek API Guide: How to Use DeepSeek V3 and R1 in Your Projects

This guide details how to use the DeepSeek API, showcasing V3 and R1 models as cost-effective alternatives for developers, offering performance comparable to GPT-4 and Claude Opus. It provides pricing information and a code example for integration using the OpenAI-compatible SDK.

DeepSeek AI models LLMs API

RESEARCH · arXiv CS.LG · 4/22/2026

Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs

This research introduces a novel learning-to-refine framework to address the prohibitive computational cost of Large Language Models (LLMs) in formal theorem proving. By exploiting compiler outputs that compress diverse proof attempts into structured failure modes, the method enables efficient proof exploration and local error correction, significantly amplifying the reasoning capabilities of base provers.

scalability LLMs Theorem Proving Formal verification

RESEARCH · arXiv CS.CL · 5/8/2026

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

This research tackles the growing threat of hidden malicious intent in multi-turn dialogues with large language models (LLMs), where attackers distribute their harmful objectives across multiple interactions. It proposes an early detection mechanism to identify the turn at which a response could enable harmful action, also introducing the Multi-Turn Intent Dataset (MTID) for training and evaluation.

LLMs security multi-turn dialogue AI defense

RESEARCH · arXiv CS.CL · 5/8/2026

Counterargument for Critical Thinking as Judged by AI and Humans

This study examines the use of counterarguments in student writing for critical thinking in the context of Generative AI (GenAI). It compares human assessments (peer and teacher) with those from six frontier LLMs on student submissions, using six established rubrics.

education LLMs assessment critical thinking

RESEARCH · arXiv CS.LG · 5/8/2026

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

This paper introduces sparse prefix caching, an optimization for LLM serving that stores recurrent states at checkpoint positions rather than requiring the entire token history. The method consistently improves the Pareto frontier compared to standard heuristics, especially for use cases where requests share a non-trivial prefix.

LLMs AI infrastructure Caching performance

RESEARCH · arXiv CS.CL · 5/8/2026

When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

When2Speak is a new synthetic dataset and four-stage generation pipeline designed to teach Large Language Models (LLMs) appropriate intervention timing in multi-party conversations. It addresses the challenge of avoiding excessive interruptions and improving conversational coherence in group interactions.

LLMs machine learning datasets Conversational AI

RESEARCH · arXiv CS.AI · 4/22/2026

AI scientists produce results without reasoning scientifically

LLM-based systems conduct autonomous scientific research but often fail to adhere to epistemic norms, ignoring evidence in 68% of traces. A study across eight domains and over 25,000 runs found that base models primarily determine agent performance and behavior.

LLMs AI Reasoning AI Agents scientific research

RESEARCH · arXiv CS.CL · 4/22/2026

An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models

This empirical study investigates jailbreak detection in large language models, showing that evaluating a single output systematically underestimates vulnerability. Increasing the number of sampled generations, especially from a single sample to a moderate number, significantly improves the detection of harmful behavior.

LLMs security AI safety
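
One way to see why a single sample can understate vulnerability, as the entry above reports, is a simple probability sketch. It assumes, purely for illustration and not as the paper's model, that each generation is independently harmful with a fixed probability; the function and parameter names are hypothetical.

```python
def detection_probability(p_harmful: float, k: int) -> float:
    """Chance that at least one of k sampled generations is harmful.

    Assumes independent samples, each harmful with probability p_harmful
    (an illustrative simplification, not the study's methodology).
    """
    if not 0.0 <= p_harmful <= 1.0:
        raise ValueError("p_harmful must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_harmful) ** k
```

Under this assumption the gain is steepest for the first few extra samples and flattens as k grows, which is consistent with the diminishing returns the summary describes.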