When training large language models (LLMs) with reinforcement learning from verifiable rewards (RLVR), one of the most compelling questions is how to overcome…
When training large language models (LLMs) with reinforcement learning from verifiable rewards (RLVR), one of the most compelling questions is how to overcome performance plateaus. The previous NVIDIA Research solution, Prolonged Reinforcement Learning (ProRL), showed that adding more reinforcement learning (RL) steps during prolonged training could expand the reasoning boundaries of LLMs.
