LLMs

720 items

ARTICLE·DEV.to AI·4/8/2026

I Built a Tool to Test Whether Multiple LLMs Working Together Can Beat a Single Model

Occursus Benchmark is an open-source benchmarking platform that tests whether multiple collaborating LLMs can outperform a single model. The tool evaluates 22 orchestration strategies across four LLM providers, using double-blind judging to score output quality.

multi-model AI performance evaluation Orchestration LLMs

RESEARCH·arXiv CS.AI·6d ago

StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

StepPRM-RTL is a novel framework that enhances LLM-based RTL code generation by combining stepwise trajectory modeling, process-reward modeling (PRM), and retrieval-augmented fine-tuning (RAFT). It uses dense feedback from a PRM to guide reinforcement-style updates and Monte Carlo Tree Search (MCTS) to enrich the training dataset.

LLMs reinforcement learning code generation RTL Synthesis

ARTICLE·DEV.to AI·4/11/2026

Why Chunking Is the Biggest Mistake in RAG Systems

This article criticizes chunking in RAG systems, highlighting its problems with context loss and errors on structured documents such as clinical records. It proposes structure-aware indexing and summarization as more effective methods for handling complex data.

chunking LLMs RAG Document Intelligence

ARTICLE·DEV.to AI·4d ago

This article delves into cost-effective alternatives to GPT-4o, revealing how other AI models can offer significant savings for developers. It provides direct cost comparisons, highlighting options like DeepSeek V4 Flash and Qwen3-32B.

LLMs API Management development Cost Optimization

DOC·ML Mastery·5d ago

Using Scikit-LLM with Open-Source LLMs

This article provides a tutorial on integrating locally hosted open-source large language models such as Mistral, Gemma, and Llama 3 for language tasks like text classification. It demonstrates how to achieve this for free using Ollama and the Scikit-LLM Python library.

open-source LLMs learning Python

RESEARCH·arXiv CS.CL·5/8/2026

ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis

ReaComp compiles LLM reasoning into symbolic program synthesizers to overcome the inefficiency and unreliability of LLMs on hard program synthesis tasks. These standalone solvers achieve higher accuracy and efficiency, outperforming LLMs and significantly reducing token usage in neuro-symbolic hybrid settings.

program synthesis LLMs Symbolic AI AI Efficiency

RESEARCH·arXiv CS.LG·5/7/2026

Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning

Research reveals that single-position intervention fails to transfer tasks in Llama-3.2-3B despite high probing accuracy, suggesting distributed task encoding. However, multi-position intervention achieves up to 96% transfer, pinpointing for the first time the causal locus of in-context learning task identity.

LLMs Mechanistic Interpretability in-context learning causal importance

RESEARCH·arXiv CS.AI·27d ago

CHAL: Council of Hierarchical Agentic Language

CHAL (Council of Hierarchical Agentic Language) is a new multi-agent dialectic framework for optimizing beliefs in defeasible domains. It addresses limitations of current multi-agent debate for LLM reasoning by treating defeasible argumentation as an engine for belief optimization.

dialectic frameworks LLMs belief optimization AI Reasoning

ARTICLE·DEV.to AI·4d ago

The Limits of AI Models: What LLMs Still Can't Do (And Why)

This article explores the inherent limitations of AI models, particularly Large Language Models (LLMs), stressing the importance of understanding these boundaries for robust product development. It details hallucination as a key limitation, explaining that LLMs generate plausible, not necessarily true, text without an internal fact-checker.

AI models LLMs hallucination AI limitations

RESEARCH·arXiv CS.CL·4/22/2026

Two-dimensional early exit optimisation of LLM inference

This paper introduces a two-dimensional early exit strategy for LLM classification tasks, coordinating layer-wise and sentence-wise exiting. The method achieves multiplicative computational savings and speed-ups of 1.4-2.3x over optimal layer-wise early exit for simpler tasks, applicable across various state-of-the-art LLMs.

LLMs Computational Efficiency inference optimization

RESEARCH·arXiv CS.LG·4/22/2026

Towards Understanding the Robustness of Sparse Autoencoders

This research explores the robustness implications of Sparse Autoencoders (SAEs) against jailbreak attacks on Large Language Models (LLMs). Integrating pretrained SAEs at inference time significantly reduces jailbreak success rates by up to 5x and decreases cross-model attack transferability across various LLM families.

LLMs security machine learning

DOC·DEV.to AI·5/2/2026

🤖 The AI SaaS Playbook (Practical Edition)

This practical playbook guides developers in building AI-core SaaS products, detailing essential changes and new considerations. It covers architectural patterns, LLM integration, agent development, cost control, testing, safety, and multi-tenancy, offering actionable advice for rapid deployment.

AI architecture SaaS LLMs best practices

NEWS·DEV.to AI·4/19/2026

llama.cpp Speculative Checkpointing, Ollama Multimodal Tool, MLX vs GGUF for Gemma 4

Today's top stories feature the merge of speculative checkpointing into llama.cpp to accelerate local LLM inference and a new Ollama multimodal tool for local audio/video analysis. A detailed comparison of MLX and GGUF is also provided for optimizing Gemma 4 deployment on consumer hardware.

LLMs Ollama llama.cpp model inference

ARTICLE·DEV.to AI·8d ago

AI Governance and Security: Why Enterprise LLMs Need a Defense-in-Depth Approach

As enterprises adopt large language models, robust AI governance and security are essential to prevent data leaks, regulatory penalties, and reputational damage. A defense-in-depth approach is crucial to mitigate threats like prompt injection and data contamination, ensuring compliance with regulations such as GDPR and the EU AI Act.

LLMs data privacy security compliance

RESEARCH·arXiv CS.CL·4/27/2026

Shared Lexical Task Representations Explain Behavioral Variability In LLMs

This research investigates LLM prompt sensitivity by comparing instruction-based and example-based prompting styles. It finds that, despite performance variation, LLMs share common underlying mechanisms, specifically "lexical task heads": attention heads that literally describe the task and trigger answer production.

model interpretability LLMs prompt-engineering Attention Mechanisms

RESEARCH·arXiv CS.CL·4/9/2026

Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

This paper presents CGD-PD, a lightweight layer for large language models (LLMs) that improves three-way logical question answering (True/False/Unknown). It addresses recurring failures such as negation inconsistency and epistemic 'Unknown' predictions, using consistent decisions and proof-based disambiguation for higher accuracy.

LLMs Question Answering consistency NLP

RESEARCH·arXiv CS.CL·5/7/2026

Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages

Nsanku is a systematic benchmark evaluating the zero-shot machine translation performance of 19 LLMs across 43 Ghanaian languages. It utilizes Bible sentences and metrics like BLEU and chrF, with gemini-2.5-flash achieving the highest overall average score.

LLMs benchmarking machine translation Low-resource languages

RESEARCH·arXiv CS.LG·18d ago

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

The paper introduces HealthCraft, a public reinforcement-learning environment designed to evaluate the safety of frontier language models in emergency medicine. It focuses on trajectory-level safety, tool misuse, and clinical pressure, built on a FHIR R4 world state and offering 195 tasks for comprehensive assessment.

LLMs evaluation reinforcement learning medical AI

RESEARCH·arXiv CS.CL·8d ago

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

This paper proposes SENSE (Semantic Embedding Navigation with Soft-gated Evaluation) to enhance Retrieval-based Speculative Decoding (RSD) for LLMs. SENSE addresses RSD's rigid lexical dependencies by using robust semantic alignment and a soft-gated evaluation module to validate semantic equivalence.

LLMs NLP inference optimization Speculative Decoding

RESEARCH·arXiv CS.CL·9d ago

Knowledge Graph-Enhanced Zero-Shot Topic Classification: A Multi-Strategy Comparative Study

This paper presents a zero-shot multi-label topic classification framework, systematically investigating how per-article knowledge graph augmentation affects its performance. The authors test eight methods across fifteen LLMs and eight multi-label datasets, finding that keyword-enhanced classification is the best performing method in the base framework.

Multi-label Classification LLMs Knowledge Graph Zero-Shot Topic Classification