LLMs

720 items

ARTICLE · DEV.to AI · 4/21/2026

What Surprised Me About Building a Python RAG Pipeline with Open-Source LLMs

The author recounts the surprises encountered while building a RAG pipeline with open-source LLMs instead of proprietary APIs, a choice made to avoid rate limits and data-sovereignty concerns. Open-source models offer freedom, but RAG proved to be no magic bullet and introduced complexities of its own. The author plans to share the Python stack, built on tools such as sentence-transformers and llama.cpp.

open-source LLMs RAG machine learning

DOC · DEV.to AI · 4/28/2026

How to Deploy Llama 2 on DigitalOcean for $5/Month: Complete Self-Hosting Guide

This guide details how to self-host Llama 2 7B on DigitalOcean for just $5/month, offering a cost-effective alternative to commercial AI APIs. It includes a complete tutorial with benchmarks, cost breakdowns, and the exact code for immediate inference serving.

LLMs deployment open-source AI cloud computing

ARTICLE · DEV.to AI · 4/24/2026

Why OpenAI Shipped GPT-5.5 Just 6 Weeks After 5.4

OpenAI released GPT-5.5, codenamed Spud, just six weeks after GPT-5.4, signaling a remarkable acceleration in their release cadence. This rapid pace, driven by competitive pressure, suggests a major process change with implications for AI builders.

OpenAI LLMs GPT Competitive Landscape

ARTICLE · DEV.to AI · 4/27/2026

I Tested 10 GEO / AI Search Visibility Tools So You Don't Waste $500/Month on the Wrong One

The article tests 10 GEO/AI search visibility tools, providing a detailed matrix to avoid unnecessary spending. It analyzes eight dimensions such as pricing, tracked LLMs, and prompt simulation, with data based on real tests and APIs.

LLMs tool comparison AI tools AI economics

ARTICLE · DEV.to AI · 4/27/2026

I Audited 10 GEO Tools So You Don't Waste $500/Month on the Wrong One

This article presents an audit of 10 GEO tools and finds that only three provide URL-level citation data, which is crucial for understanding how LLMs retrieve information. The author argues that such tools are needed to make the impact of AI search on conversions visible, and warns that choosing the wrong one wastes budget and creates false confidence.

auditing LLMs Marketing AI tools

ARTICLE · DEV.to AI · 4/27/2026

I Audited 10 GEO / AI Search Visibility Tools So You Don't Have To — Here's the Matrix

This article presents a detailed audit of 10 GEO/AI search visibility tools, resulting in a comparison matrix. The author evaluated crucial features like tracked LLMs, query volume, and prompt simulation to help users navigate the market.

LLMs benchmarking AI tools SEO

ARTICLE · Hugging Face Blog · 8d ago

Beyond LLMs: Why Scalable Enterprise AI Adoption Depends on Agent Logic

The article argues that scalable enterprise AI adoption requires moving beyond Large Language Models (LLMs) to integrate advanced agent logic. This approach is essential for businesses to fully leverage AI's potential and ensure practical, widespread implementation.

scalability LLMs AI adoption Agent Logic

RESEARCH · DEV.to AI · 4/21/2026

KWBench: New Benchmark Tests LLMs' Unprompted Problem Recognition

Researchers introduced KWBench, a 223-task benchmark that measures whether LLMs can recognize the governing game-theoretic problem in professional scenarios without being explicitly prompted to do so. The best-performing model passed only 27.9% of tasks, highlighting a critical gap between task execution and situational understanding.

LLMs benchmarks AI evaluation

ARTICLE · ML Mastery · 28d ago

LLM Observability Tools for Reliable AI Applications

Large language models (LLMs) power a wide array of AI applications, from customer service bots to autonomous coding agents. Ensuring the reliability of these AI applications necessitates the use of LLM observability tools.

AI applications LLMs Reliability AI tools

RESEARCH · arXiv CS.CL · 4/8/2026

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

This paper proposes Inclusion-of-Thoughts (IoT), a self-filtering strategy designed to mitigate the preference instability of LLMs on multiple-choice questions (MCQs). The method reconstructs the MCQs with the more plausible options, aiming to reduce cognitive load, improve the model's focus, and increase the transparency of its decision-making.

LLMs Decision Making MCQs Interpretability

RESEARCH · arXiv CS.LG · 4/6/2026

DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery

This paper presents DrugPlayGround, a framework for evaluating and comparing the performance of Large Language Models (LLMs) in drug discovery. It focuses on generating textual descriptions of drug characteristics, synergism, drug-protein interactions, and physiological responses, with experts participating to justify the LLMs' predictions.

LLMs AI in healthcare benchmarking drug discovery

RESEARCH · arXiv CS.CL · 4/6/2026

Too Polite to Disagree: Understanding Sycophancy Propagation in Multi-Agent Systems

This study explores how sycophancy propagates in multi-agent LLM systems, where models agree with the user's stance even when it conflicts with their own view. The researchers found that giving agents rankings of their peers' sycophancy tendencies reduces the influence of sycophantic agents, mitigates cascading errors, and improves discussion accuracy by 10.5%.

discussion accuracy LLMs sycophancy Collaborative AI

RESEARCH · arXiv CS.AI · 4/9/2026

SELFDOUBT: Uncertainty Quantification for Reasoning LLMs via the Hedge-to-Verify Ratio

This paper proposes SELFDOUBT, a single-pass framework for quantifying uncertainty in reasoning LLMs, especially those served through proprietary APIs. It uses the Hedge-to-Verify Ratio (HVR) to identify markers of uncertainty and self-assessment directly in the reasoning trace, surpassing costly sampling-based methods.

LLMs Model Evaluation uncertainty quantification Reasoning

RESEARCH · arXiv CS.AI · 4/6/2026

Aligning Progress and Feasibility: A Neuro-Symbolic Dual Memory Framework for Long-Horizon LLM Agents

The title suggests research on a neuro-symbolic dual-memory framework for LLM agents, aimed at aligning progress and feasibility in long-horizon tasks. It addresses improving the ability of AI agents to plan and execute complex actions over time.

memory architectures LLMs LLM agents Neuro-Symbolic

RESEARCH · arXiv CS.CL · 4/6/2026

Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets

This study argues, based on the Data Processing Inequality, that single-agent LLMs are more information-efficient than multi-agent systems under equal reasoning-token budgets. The research empirically tests this prediction, which suggests that multi-agent systems become competitive when a single agent's context utilization is degraded or when more compute is spent.

LLMs Information Efficiency Computational Budget Multi-Hop Reasoning

RESEARCH · arXiv CS.CL · 4/30/2026

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

DenialBench systematically measures consciousness denial in 115 LLMs, finding that initial preference denial strongly predicts later phenomenological denial. The denial operates at a lexical, not conceptual, level, as models still gravitate towards consciousness-themed material even if veiled.

LLMs AI consciousness benchmarking

RESEARCH · arXiv CS.AI · 4/30/2026

Hierarchical Multi-Persona Induction from User Behavioral Logs: Learning Evidence-Grounded and Truthful Personas

This paper proposes a hierarchical framework to induce multiple evidence-grounded user personas from behavioral logs by clustering intent memories and optimizing persona quality. The method utilizes a groupwise extension of Direct Preference Optimization (DPO) and demonstrates more coherent, truthful personas, also improving future interaction prediction.

Optimization LLMs machine learning persona generation

RESEARCH · arXiv CS.CL · 4/30/2026

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

Prompted by recent LLM advances, this paper conducts a scoping review of NLP's long history of methodological reflection on evaluation concerns. It develops a taxonomy, synthesizing recurring positions and trade-offs, and provides a structured checklist to support deliberate evaluation design and interpretation.

LLMs evaluation NLP

RESEARCH · arXiv CS.LG · 5/6/2026

From Synthesis to Clinical Assistance: A Strategy-Aware Agent Framework for Autism Intervention based on Real Clinical Dataset

The paper introduces ASDAgent, a strategy-aware AI framework for Autism Spectrum Disorder (ASD) intervention, addressing data scarcity and strategic inconsistency in LLM-based behavioral therapy. It incorporates a DoctorAgent with an Observe-Think-Act-Correct (O-T-A-C) reasoning loop to ensure explicit and controllable ABA execution.

behavioral therapy LLMs AI intervention clinical assistance

RESEARCH · arXiv CS.LG · 5/6/2026

An End-to-End Framework for Building Large Language Models for Software Operations

This paper introduces OpsLLM, an end-to-end framework for building large language models (LLMs) specifically for software operations. It addresses challenges like low-quality data and fragmented knowledge, detailing a workflow that includes data curation, supervised fine-tuning, and a domain process reward model.

LLMs AI frameworks Domain-Specific AI machine learning