← heapsort
RESEARCH↑ trending44

Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! [P]

Reddit r/MachineLearningΒ·April 15, 2026
Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! [P]

The author successfully trained a Qwen2.5-0.5B-Instruct model for Reddit post summarization using GRPO, achieving an average rollout length of 64 tokens with combined quality and length rewards. The experiment, run on a Mac Mini cluster, uses an LLM-as-a-Judge (GPT-5) for evaluation and plans future iterations with adjusted reward functions.

Read original β†—