reinforcement learning

154 items

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/15/2026

Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! [P]

The author successfully trained a Qwen2.5-0.5B-Instruct model for Reddit post summarization using GRPO, achieving an average rollout length of 64 tokens with combined quality and length rewards. The experiment, run on a Mac Mini cluster, uses an LLM-as-a-Judge (GPT-5) for evaluation and plans future iterations with adjusted reward functions.

44
RESEARCH · arXiv CS.CL · 4/23/2026

PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

PR-CAD introduces a progressive refinement framework that unifies text-to-CAD generation and editing, overcoming limitations of disjoint approaches. It leverages a high-fidelity interaction dataset and a reinforcement learning-enhanced reasoning framework tailored for LLMs to enable controllable and faithful CAD modeling.

43
RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/16/2026

Training Qwen2.5-0.5B-Instruct on Reddit posts summarization tasks with length constraint on my 3xMac Minis with GRPO - evals update [P]

The author trained Qwen2.5-0.5B-Instruct for Reddit post summarization using two reward strategies, finding that combining a quality reward with a length penalty yielded significantly better results. Evaluation used LLM-as-a-Judge and DeepEval for metrics such as conciseness and clarity.

42
RESEARCH · arXiv CS.CL · 1d ago

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

The paper introduces TinyJudge, a framework that uses an ensemble of specialized tiny language models (0.6B) to provide lightweight and high-precision rewards for soft, unverifiable constraints in LLM instruction following. This approach addresses the bottlenecks of reward hacking and high computational overhead found in traditional LLM-as-a-judge methods for constraint alignment.

40
ARTICLE · ↑ trending · Reddit r/MachineLearning · 4/9/2026

Studying Sutton and Barto's RL book and its connections to RL for LLMs (e.g., tool use, math reasoning, agents, and so on)? [D]

A mathematics graduate seeks guidance on studying reinforcement learning (RL) and its connections to LLMs, especially for applications in mathematics. He questions the relevance of the Sutton and Barto book in the modern LLM context and asks for help focusing on more recent topics and algorithms such as PPO and GRPO.

38
RESEARCH · arXiv CS.LG · 4/16/2026

Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

This paper introduces STOMP, a novel offline reinforcement learning algorithm for multi-objective optimization using smooth Tchebysheff scalarization. It addresses the limitation of linear scalarization in recovering non-convex Pareto fronts, crucial for aligning large language models and other real-world applications with conflicting rewards.

31
RESEARCH · arXiv CS.LG · 4/16/2026

Automated co-design of high-performance thermodynamic cycles via graph-based hierarchical reinforcement learning

This study introduces a graph-based hierarchical reinforcement learning approach for the automated co-design of high-performance thermodynamic cycles. It encodes cycles as graphs, uses a deep learning surrogate for decoding, and employs a hierarchical RL framework for structural evolution and parameter optimization.

31
RESEARCH · arXiv CS.LG · 4/21/2026

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

This research introduces a rubric-based Generative Reward Model (GRM) to enhance Reinforced Fine-Tuning (RFT) for LLM Agents in Software Engineering (SWE) tasks. By providing richer learning signals beyond binary terminal rewards, this approach shapes intermediate behaviors and significantly improves the quality of the resolution process.

31
RESEARCH · arXiv CS.LG · 4/22/2026

Discrete Tilt Matching

Discrete Tilt Matching (DTM) is a novel likelihood-free method for fine-tuning masked diffusion large language models (dLLMs), addressing the intractability of sequence-level marginal likelihoods in RL. It recasts fine-tuning as state-level matching, using a weighted cross-entropy objective with control variates for stability, and achieves strong results on various tasks like Sudoku and Countdown.

30
RESEARCH · arXiv CS.AI · 20d ago

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

The COSMO-Agent framework uses tool-augmented reinforcement learning to teach LLMs to bridge the CAD-CAE semantic gap, enabling closed-loop optimization in industrial design. It leverages an interactive RL environment for CAD generation, CAE solving, result parsing, and geometry revision, guided by a multi-constraint reward for feasibility and robustness.

30
RESEARCH · DEV.to AI · 4/13/2026

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

The article explores improving reinforcement learning for large language model (LLM) reasoning by focusing on "high-entropy minority tokens". It proposes that these less frequent yet highly informative tokens are the key drivers of effective learning, challenging the conventional 80/20 rule.

29