Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?
This paper investigates the mechanistic origins of catastrophic forgetting in Large Language Models (LLMs) by comparing Reinforcement Learning (RL) with Supervised Fine-Tuning (SFT). It finds that RL preserves internal computational circuits more effectively, which mitigates the forgetting of prior capabilities, whereas SFT causes greater circuit disruption.