LLMs

720 items

RESEARCH · arXiv CS.AI · 7d ago

Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

This paper introduces GAMBLe, an analytical framework for AI-Driven Research Systems (ADRS). It decomposes ADRS behavior into four parameters and an effective landscape, showing how distinct generator-assessor pairs induce structurally different optimization landscapes.

LLMs research frameworks AI

RESEARCH · arXiv CS.LG · 9d ago

QASM-Eval: A Dataset to Train and Evaluate LLMs on OpenQASM-3 Beyond Quantum Circuits

QASM-Eval is a new dataset designed to train and evaluate Large Language Models (LLMs) on OpenQASM-3 programs that use advanced hardware-oriented features. It addresses a gap in LLMs' ability to handle quantum programs that go beyond gate-sequence circuit specification.

Quantum Computing LLMs datasets OpenQASM-3

RESEARCH · arXiv CS.LG · 15d ago

LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

LLM-AutoSciLab proposes a closed-loop framework for scientific discovery, moving beyond static inference by actively coupling hypothesis generation with experiment selection and mechanism refinement. It iteratively suggests plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence.

LLMs research active experimentation Scientific Discovery

RESEARCH · arXiv CS.CL · 15d ago

SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

This research introduces SLAP, a novel batch-aware data selection framework designed to improve the data efficiency of instruction tuning for LLMs. SLAP optimizes learning by evaluating entire batch compositions, ensuring comprehensive data distribution coverage and maximizing intra-batch diversity to achieve lossless performance with reduced training costs.

Instruction Tuning LLMs machine learning model optimization

RESEARCH · arXiv CS.CL · 7d ago

Translating Classical Poetry into Modern Prose

Padyam2Gadyam is a new dataset for poem-to-prose translation that pairs 13th- to 17th-century classical Telugu poetry with contemporary Telugu and English prose. An evaluation of five Large Language Models on this dataset found that their overall performance leaves significant room for improvement.

poetry LLMs Translation Natural Language Processing

RESEARCH · arXiv CS.CL · 7d ago

Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

This study investigates how conversational context affects LLM answers, especially in high-stakes scenarios. It finds that conversation topics are the main predictors of LLM-generated advice and that they contribute to disparities in outcomes.

conversational context LLMs linguistic features sociodemographics

RESEARCH · arXiv CS.CL · 7d ago

Adaptive Latent Agentic Reasoning

This research introduces Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework designed to enhance the efficiency of LLM agents. ALAR uses compact latent reasoning for routine tasks and escalates to explicit chain-of-thought when deeper deliberation is required, leading to comparable or better task accuracy with substantial efficiency gains.

LLMs machine learning efficiency Reasoning

RESEARCH · arXiv CS.AI · 14d ago

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

OmniToM is a new benchmark designed to evaluate Theory of Mind in LLMs by explicitly modeling belief structures. This approach moves beyond end-point question answering, allowing for a deeper analysis of mental-state representations, including divergent or mistaken beliefs.

LLMs Social Reasoning benchmarking AI evaluation

RESEARCH · arXiv CS.AI · 14d ago

Can LLMs Introspect? A Reality Check

A new study questions whether large language models (LLMs) can truly introspect, arguing that current conclusions might be premature. It suggests that apparent success could stem from general anomaly detection rather than genuine introspection, drawing lessons from human metacognition research.

LLMs cognitive science Metacognition Introspection

RESEARCH · arXiv CS.AI · 13d ago

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

This research proposes a multi-agent architecture for autonomous insight discovery in real-time data streams, addressing the limitations of reactive analytics systems. It employs a continuous loop of hypothesis generation, analytics compilation, validation, and visualization, leveraging technologies like Kafka, Flink, and large language models.

LLMs stream processing data analysis real-time analytics

RESEARCH · arXiv CS.CL · 14d ago

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

This paper proposes a framework for evaluating and intervening in cultural value alignment within Large Language Models (LLMs), addressing their often homogenized cultural perspectives. It uses scenario-based behavioral probing and implicit token probabilities to map latent cultural values, and introduces activation steering to shift these alignments without retraining.

LLMs Cultural Alignment AI ethics Value Systems

ARTICLE · DEV.to AI · 4/25/2026

DeepSeek V4 vs GPT-5.5 vs Claude Opus 4.7: Model Guide

This guide analyzes the latest major AI model releases, including OpenAI's GPT-5.5, DeepSeek V4, and Claude Opus 4.7, highlighting their capabilities amidst a rapidly evolving competitive landscape. It aims to provide developers with data and a decision framework for selecting the best model for specific tasks.

AI models LLMs benchmarking developer guide

CASE · DEV.to AI · 4/25/2026

I Built a 24/7 AI Agent System on a $6/Month VPS — Here's the Stack

An AI enthusiast built a 24/7 autonomous AI agent system on a $6/month VPS using OpenClaw, DeepSeek V4 Pro, Playwright, and Docker. This system automates content posting, article publishing, store management, and promotions, offering a cost-effective alternative to more expensive LLMs like Claude.

LLMs infrastructure Cost Optimization automation

ARTICLE · DEV.to AI · 4/24/2026

I Built a Multi-LLM Debate Engine That Fact-Checks Itself in Real Time

This article describes building a multi-LLM debate engine that fact-checks itself in real time to counter LLMs' tendency toward sycophancy and hallucination. It proposes a structured debate between agents with distinct roles, including a dedicated fact-checker agent that steps in mid-debate.

AI models LLMs hallucination multi-agent systems

ARTICLE · DEV.to AI · 4/16/2026

"The Hidden Cost of AI Agent Hype: Why Most Fail and What Actually Works" — a br

Most AI agent startups from 2023 have failed or are struggling because builders are solving the wrong problem and optimizing for demo-ability over reliability. Real-world tasks are messy, requiring human-level judgment that current LLMs often botch.

LLMs hype cycle startups AI failure

RESEARCH · DEV.to AI · 4/18/2026

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified, efficient framework for fine-tuning more than 100 language models. It aims to streamline the process of adapting a diverse range of large language models.

LLMs AI frameworks machine learning large language models

CASE · DEV.to AI · 4/25/2026

I Built a 24/7 AI Agent System on a $6/Month VPS — Here's the Stack

This post details building a 24/7 autonomous AI agent system on a low-cost VPS ($6/month) using the OpenClaw framework and DeepSeek V4 Pro. The system manages various online tasks, such as posting content and selling digital products, and is presented as more efficient and cost-effective than other solutions.

LLMs VPS Cost Optimization automation

DOC · DEV.to AI · 4/21/2026

How to Install Ollama on Linux and Windows: Complete Setup Guide

This guide details how to install and configure Ollama on Linux and Windows systems, a tool that simplifies running and managing large language models (LLMs) locally. It covers system requirements, the step-by-step installation process, and how to run your first model, such as Llama3.

installation LLMs tutorials Ollama

ARTICLE · DEV.to AI · 4/20/2026

What 19 GB of Memory Compression Taught Me About MLX on M1 Max

The author describes encountering 19 GB of memory compression while running a large LLM with MLX on an M1 Max, initially mistaking it for a leak. The fix involved a single MLX API call to properly manage macOS unified memory for large models idling between inferences.

LLMs apple-silicon memory management Performance optimization

ARTICLE · DEV.to AI · 4/9/2026

Choosing Between GPT-5.4 and Claude Sonnet 4.6 in Real Workflows

The article compares the performance of GPT-5.4 and Claude Sonnet 4.6 in real workflows, noting that although 80% of tasks are similar, GPT-5.4 stands out in the 20% of situations that require multi-step reasoning, tool use, and structured outputs. The analysis emphasizes that in production environments, criteria such as consistency, speed, cost, and fit with the workflow matter more than correctness alone.

LLMs GPT Workflow model comparison