LLM

611 items

ARTICLE·DEV.to AI·4/16/2026

How to run Qwen3.6-35B-A3B locally — the coding MoE that beats models 10x its active size

Qwen has released Qwen3.6-35B-A3B, a new Mixture-of-Experts model that delivers big-model quality at small-model speed with vision capabilities. It outperforms models 10x its active size on coding benchmarks like SWE-bench and Terminal-Bench, and also excels in science reasoning and frontend generation.

multimodal AI AI Benchmarks coding AI MoE

DOC·DEV.to AI·16d ago

Building a Terminal AI Agent (v2)

This practical guide teaches developers how to build and optimize terminal-based AI agents, leveraging local LLMs for real-time code support. It details the setup of platforms like Aider and Ollama, and includes an example CLI agent with function calling capabilities.

terminal Ollama development AI agent

ARTICLE·DEV.to AI·4/23/2026

Your Customer Service Bot Is Slow Because It's Single-Threaded

This article argues that single-threaded customer service bots are slow because they make LLM calls sequentially, causing up to 12 seconds of latency. It proposes a parallel sub-agent pattern with LangGraph and LangSmith that runs research tasks concurrently, cutting response times to around 6.5 seconds.

LangGraph customer service AI Performance optimization AI Agents

ARTICLE·DEV.to AI·4/18/2026

Building an MCP Server for Prop Trading: How I Gave Claude + ChatGPT Live Access to 20+ Prop Firm Deals

The author developed an MCP server that grants Claude and ChatGPT live access to proprietary trading firm deals and challenges. This addresses the problem of LLM hallucinations with stale data, providing a single, live source of truth for traders.

Model Context Protocol Prop trading API Integration AI Agents

DOC·DEV.to AI·16d ago

96. LoRA: Fine-Tune a Billion-Parameter Model on a Laptop

This article explains how the LoRA (Low-Rank Adaptation) technique enables fine-tuning billion-parameter language models on consumer hardware like laptops. Instead of updating all parameters, LoRA adds tiny trainable modules, drastically reducing GPU memory requirements.

GPU memory fine-tuning LoRA HuggingFace

DOC·DEV.to AI·4/23/2026

Build AI Voice Bots That Remember: Persistent Context Across Multiple Calls

This guide addresses a common problem with AI voice bots: they lack memory, so every call starts from scratch and frustrates users. It shows how to build persistent caller context that survives across sessions.

AI voice bots persistent context caller identity Conversational AI

ARTICLE·DEV.to AI·5/6/2026

Bun Is Porting from Zig to Rust — Here's Why That Matters If You Run LLM Workloads

Bun is migrating its core components from Zig to Rust, a strategic technical move that enhances its speed and ease of contribution. The acquisition of Bun by Anthropic in 2025 raises questions about vendor dependency structures for teams running LLM workloads in production.

JavaScript runtime Bun Anthropic Rust

ARTICLE·DEV.to AI·5/7/2026

Firecrawl vs Crawl4AI: Web Scraping for RAG

Building reliable Retrieval-Augmented Generation (RAG) pipelines requires web scraping to shift from traditional selectors to converting the DOM into semantic Markdown. Firecrawl and Crawl4AI are key tools for this translation layer, and this post evaluates them on architectural fit, extraction quality, performance, and AI workflow integration.

RAG AI tools web-scraping LLM

RESEARCH·arXiv CS.LG·23d ago

AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

This work investigates the time, token, and energy overhead of locally deployed LLM-based AI agents on consumer hardware. It reveals that while local agents address privacy and cost concerns, their iterative reasoning and tool use substantially increase resource consumption, leading to higher GPU power draw and battery drain.

consumer devices Energy Efficiency local deployment AI Agents

RESEARCH·arXiv CS.AI·5d ago

Synthetic Contrastive Reasoning for Multi-Table Q&A

This paper introduces a synthetic contrastive reasoning-trace dataset for multi-table question answering (MMQA), addressing the lack of reasoning supervision in existing resources. Open-weight LLMs fine-tuned with Contrastive Preference Optimization (CPO) using this dataset achieved significant performance improvements, highlighting the benefits of heterogeneous trace generators.

Question Answering machine learning NLP datasets

RESEARCH·arXiv CS.CL·5d ago

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

LANTERN is a lightweight memory layer for LLMs that archives conversation turns and restores relevant details after context compaction via hybrid retrieval. It recovers 78.3% of verifiable facts lost to compaction, outperforming LLM-driven approaches with significantly lower inference cost and zero LLM calls.

memory Long-Context Processing Retrieval Networks Conversational AI

DOC·DEV.to AI·11d ago

How to Deploy Llama 2 on DigitalOcean App Platform for $5/Month

This guide details how to deploy a production-ready Llama 2 inference server on DigitalOcean's App Platform for just $5/month. It offers a cost-effective alternative to AI APIs, eliminating rate limits and vendor lock-in.

Llama-2 deployment Ollama DigitalOcean

DOC·DEV.to AI·11d ago

How to Deploy Qwen2.5 72B with vLLM + AWQ Quantization on a $24/Month DigitalOcean GPU Droplet: Multilingual Reasoning at 1/110th Claude Opus Cost

This guide details how to deploy Qwen2.5 72B with vLLM and AWQ quantization on a DigitalOcean GPU Droplet for just $24/month. It demonstrates significant cost reduction compared to commercial AI APIs like Claude Opus, offering enterprise-grade multilingual reasoning at a fraction of the price.

deployment quantization Cost Optimization DigitalOcean

DOC·DEV.to AI·16d ago

Local LLM Setup Guide (v8)

This guide provides a practical roadmap for developers to set up and operate local LLM environments, highlighting benefits like fast inference and data privacy. It details system requirements and compares frameworks such as llama.cpp, Ollama, and vLLM for various use cases.

Machine Learning Tools local LLM Development Guide AI Setup

RESEARCH·DEV.to AI·4/23/2026

LLM Leaderboard: Best AI Models Ranked (April 2026)

As of April 2026, the LLM leaderboard shows no single best model, with performance fracturing by task. Claude Opus 4.7 leads in coding and LM Arena, while tying with Gemini 3.1 Pro Preview and GPT-5.4 on the Artificial Analysis Intelligence Index, and DeepSeek V3.2 offers optimal pricing.

AI models ranking benchmarking LLM

ARTICLE·DEV.to AI·25d ago

Your LLM cost estimate is fine. Your rate-limit math is what pages you at 2am.

This article argues that while LLM cost estimates are a minor concern, rate limits are the dominant failure mode for LLM applications in production. Unlike small cost discrepancies, rate-limit saturation leads to cascading failures, and planning tools often overlook it.

rate limits Production API AI engineering

NEWS·OpenAI Blog·4/23/2026

Introducing GPT-5.5

Introducing GPT-5.5, our smartest model yet, designed to be faster and more capable for complex tasks such as coding, research, and data analysis across various tools.

announcement GPT new features AI Model

ARTICLE·DEV.to AI·4/9/2026

Karpathy called it context engineering > prompt engineering. I built a tool that does it automatically for codebases.

The article discusses Karpathy's emphasis on "context engineering" over "prompt engineering" for LLMs, stressing that AI performance depends crucially on the context provided. It points out that LLMs repeatedly consume many tokens to understand the context of a codebase, which led the author to build a tool that automates this process.

prompt-engineering Context Engineering AI codebase

RESEARCH·arXiv CS.AI·4/7/2026

VERT: Reliable LLM Judges for Radiology Report Evaluation

The paper proposes VERT, a new LLM-based metric for evaluating radiology reports. It compares VERT with existing metrics across several models and datasets, analyzing their correlation with expert ratings to determine the best LLM configurations for radiology judges.

Large Language Models AI Radiology Model Evaluation

RESEARCH·arXiv CS.CL·4/9/2026

SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams

SensorPersona is an LLM-based system that continually infers user personas from multimodal data collected unobtrusively from mobile sensors. It deepens personalization by extracting physical patterns, psychosocial traits, and life experiences, overcoming the limitations of inference based solely on chat history.

personalization multimodal AI mobile sensors persona extraction