reinforcement learning

154 items

ARTICLE · ↑ trending · Hacker News (AI) · 15h ago

Rich Sutton on AI creativity and discovery

Rich Sutton discusses the concepts of creativity and discovery within artificial intelligence, particularly in the context of reinforcement learning. He explores how AI systems can develop novel solutions and insights, pushing the boundaries of machine intelligence.

Rich Sutton AI creativity reinforcement learning discovery

DOC · AWS Machine Learning Blog · 21h ago

Scale Robot Reinforcement Learning with NVIDIA Isaac Lab on Amazon SageMaker AI

This post demonstrates how to train robot policies for the Unitree H1 humanoid using NVIDIA Isaac Lab on Amazon SageMaker AI. It explores two compute options: Amazon SageMaker HyperPod and Amazon SageMaker Training Jobs.

reinforcement learning learning robotics NVIDIA

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/15/2026

Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! [P]

The author successfully trained a Qwen2.5-0.5B-Instruct model for Reddit post summarization using GRPO, achieving an average rollout length of 64 tokens with combined quality and length rewards. The experiment, run on a Mac Mini cluster, uses an LLM-as-a-Judge (GPT-5) for evaluation and plans future iterations with adjusted reward functions.

reinforcement learning Qwen2.5 GRPO Reddit
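
For readers unfamiliar with GRPO, the two ingredients named in the summary above (a combined quality and length reward, and group-relative advantages) can be sketched roughly as follows. This is a minimal illustration of the commonly described GRPO normalization, not the author's PyTorch code; the function names, the `target_len` of 64 tokens, and the `length_weight` value are hypothetical choices for the example.

```python
from statistics import mean, pstdev


def combined_reward(quality, num_tokens, target_len=64, length_weight=0.01):
    """Quality score minus a penalty for deviating from a target length.

    target_len and length_weight are hypothetical values for illustration.
    """
    return quality - length_weight * abs(num_tokens - target_len)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    GRPO samples several completions per prompt and uses the group's
    mean and standard deviation in place of a learned value baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt: (judge quality score, token count).
rollouts = [(0.8, 60), (0.6, 120), (0.9, 64), (0.4, 30)]
rewards = [combined_reward(q, n) for q, n in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average get a positive advantage and are reinforced; the rest are pushed down, so no separate critic network is needed.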

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/10/2026

Started a video series on building an orchestration layer for LLM post-training [P]

The author has started a video series on building an orchestration layer for LLM post-training. He describes his efforts to improve the `verl` framework for RL training at scale, focusing on modernizing packages and removing irrelevant dependencies.

reinforcement learning post-training orchestration frameworks

RESEARCH · arXiv CS.CL · 4/23/2026

PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

PR-CAD introduces a progressive refinement framework that unifies text-to-CAD generation and editing, overcoming limitations of disjoint approaches. It leverages a high-fidelity interaction dataset and a reinforcement learning-enhanced reasoning framework tailored for LLMs to enable controllable and faithful CAD modeling.

LLMs reinforcement learning CAD modeling text-to-CAD

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/16/2026

Training Qwen2.5-0.5B-Instruct on Reddit posts summarization tasks with length constraint on my 3xMac Minis with GRPO - evals update [P]

The author trained Qwen2.5-0.5B-Instruct for Reddit post summarization using two reward strategies, finding that a combination of quality and length penalties yielded significantly better results. Evaluation was conducted using LLM-as-a-Judge and DeepEval tools for metrics such as conciseness and clarity.

evaluation reinforcement learning AI training summarization

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/16/2026

Why dynamically routing multi-timescale advantages in PPO causes policy collapse (and a simple decoupled fix) [R]

An undergrad AI researcher discovered why fusing multi-timescale advantages in PPO Actor-Critic architectures leads to policy collapse. This occurs due to surrogate objective hacking and the router's preference for short-term horizons because of lower temporal uncertainty.

Optimization Actor-Critic reinforcement learning PPO

RESEARCH · arXiv CS.CL · 1d ago

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

The paper introduces TinyJudge, a framework that uses an ensemble of specialized tiny language models (0.6B) to provide lightweight and high-precision rewards for soft, unverifiable constraints in LLM instruction following. This approach addresses the bottlenecks of reward hacking and high computational overhead found in traditional LLM-as-a-judge methods for constraint alignment.

Tiny Models Model Alignment LLMs reinforcement learning

RESEARCH · arXiv CS.LG · 1d ago

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

Offline reinforcement learning offers a promising path for developing plasma controllers from historical tokamak data. This paper introduces RL4F, a benchmark for offline reinforcement learning in nuclear fusion plasma control, evaluating various baselines and finding that model-based RL methods perform best.

AI Benchmarks reinforcement learning Plasma Control Tokamak

ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/9/2026

Studying Sutton and Barto's RL book and its connections to RL for LLMs (e.g., tool use, math reasoning, agents, and so on)? [D]

A mathematics graduate seeks guidance on studying reinforcement learning (RL) and its connections to LLMs, especially for applications in mathematics. He questions the relevance of the Sutton and Barto book in a modern LLM context and asks for help focusing on more recent topics and algorithms such as PPO and GRPO.

Sutton and Barto LLMs AI for Mathematics reinforcement learning

RESEARCH · arXiv CS.CL · 2d ago

Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

This research introduces PolyFact, a multilingual factual QA dataset, to address cross-lingual factual inconsistency in LLMs. It finds that reinforcement learning via GRPO consistently improves cross-lingual factual recall and generalization compared to supervised fine-tuning.

Multilingual AI LLMs reinforcement learning Machine Learning

RESEARCH · arXiv CS.LG · 2d ago

MacArena: Benchmarking Computer Use Agents on an Online macOS Environment

MacArena is a new benchmark for computer-use agents (CUAs) operating graphical user interfaces (GUIs) on macOS, addressing the platform's underserved benchmarking landscape. It offers 421 verified tasks across 50 applications, running natively on Apple Silicon, to challenge CUAs beyond Linux-based benchmarks.

Computer-use agents reinforcement learning benchmarking macOS

ARTICLE · Hugging Face Blog · 2d ago

The Open Source Community is backing OpenEnv for Agentic RL

The open-source community is endorsing OpenEnv for agentic Reinforcement Learning development. This initiative highlights collaborative efforts in advancing AI.

open-source reinforcement learning OpenEnv AI development

RESEARCH · arXiv CS.LG · 4/16/2026

Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

This paper introduces STOMP, a novel offline reinforcement learning algorithm for multi-objective optimization using smooth Tchebysheff scalarization. It addresses the limitation of linear scalarization in recovering non-convex Pareto fronts, crucial for aligning large language models and other real-world applications with conflicting rewards.

reinforcement learning multi-objective optimization AI alignment Machine Learning
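
As background for the summary above, the difference between a hard Tchebysheff scalarization and a smoothed one can be sketched as below. This assumes the commonly used log-sum-exp smoothing of the weighted maximum; it is not taken from the STOMP paper, and the function names, the smoothing parameter `mu`, and the example numbers are hypothetical.

```python
import math


def tchebysheff(losses, weights, ideal):
    """Hard scalarization: the worst weighted gap to the ideal point."""
    return max(w * (f - z) for f, w, z in zip(losses, weights, ideal))


def smooth_tchebysheff(losses, weights, ideal, mu=0.1):
    """Log-sum-exp smoothing of the max above (assumed form).

    Smaller mu tracks the hard max more closely; the result stays
    within mu * log(n) of it for n objectives.
    """
    terms = [w * (f - z) / mu for f, w, z in zip(losses, weights, ideal)]
    m = max(terms)  # subtract the max for numerical stability
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))


# Two conflicting objectives, written as losses to minimize (hypothetical values).
losses, weights, ideal = [1.0, 2.0], [0.5, 0.5], [0.0, 0.0]
hard = tchebysheff(losses, weights, ideal)
soft = smooth_tchebysheff(losses, weights, ideal)
```

Unlike a weighted sum, the max-based form can reach points on non-convex parts of a Pareto front, and the smoothing makes it differentiable everywhere.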

RESEARCH · arXiv CS.LG · 4/16/2026

Automated co-design of high-performance thermodynamic cycles via graph-based hierarchical reinforcement learning

This study introduces a graph-based hierarchical reinforcement learning approach for the automated co-design of high-performance thermodynamic cycles. It encodes cycles as graphs, uses a deep learning surrogate for decoding, and employs a hierarchical RL framework for structural evolution and parameter optimization.

Energy Systems deep learning reinforcement learning Graph Neural Networks

RESEARCH · arXiv CS.LG · 4/21/2026

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

This research introduces a rubric-based Generative Reward Model (GRM) to enhance Reinforced Fine-Tuning (RFT) for LLM Agents in Software Engineering (SWE) tasks. By providing richer learning signals beyond binary terminal rewards, this approach shapes intermediate behaviors and significantly improves the quality of the resolution process.

reinforcement learning fine-tuning software engineering AI agents

RESEARCH · arXiv CS.LG · 4/22/2026

Discrete Tilt Matching

Discrete Tilt Matching (DTM) is a novel likelihood-free method for fine-tuning masked diffusion large language models (dLLMs), addressing the intractability of sequence-level marginal likelihoods in RL. It recasts fine-tuning as state-level matching, using a weighted cross-entropy objective with control variates for stability, and achieves strong results on various tasks like Sudoku and Countdown.

Diffusion Models LLMs reinforcement learning Machine Learning

RESEARCH · arXiv CS.AI · 20d ago

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

The COSMO-Agent framework uses tool-augmented reinforcement learning to teach LLMs to bridge the CAD-CAE semantic gap, enabling closed-loop optimization in industrial design. It leverages an interactive RL environment for CAD generation, CAE solving, result parsing, and geometry revision, guided by a multi-constraint reward for feasibility and robustness.

LLMs CAD/CAE reinforcement learning Industrial design

RESEARCH · arXiv CS.LG · 20d ago

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

This paper introduces GROW, an RL framework for open-world VLM agents, addressing limitations of existing Supervised Fine-Tuning methods. It proposes a novel approach for Group Relative Policy Optimization (GRPO) by decomposing trajectories into state-action samples rather than full entities.

VLM Agents Policy optimization Open-world AI reinforcement learning

RESEARCH · DEV.to AI · 4/13/2026

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

This content explores a novel approach to improve Reinforcement Learning for Large Language Model (LLM) reasoning by focusing on "high-entropy minority tokens". It proposes that these less frequent yet highly informative tokens are key drivers for effective learning, challenging the conventional 80/20 rule.

Token Analysis reinforcement learning Natural Language Processing LLM reasoning