Training large language models (LLMs) with a reinforcement learning (RL) objective often outperforms a supervised learning (SL) objective because RL optimizes for sequence-level reward rather than token-by-token imitation of fixed input-output pairs. When the dataset lacks intermediate steps, SL struggles: it needs a reference output to match at every token, while the task may only provide a judgment of the final outcome (for example, whether a generated answer is correct). RL, by contrast, lets the model learn from the consequences of its own sampled outputs, reinforcing whole sequences that earn high reward instead of fitting fixed labels. This is particularly effective in complex tasks where the path to a successful outcome is not unique, allowing the model to discover and refine response strategies that a single reference trajectory under SL would never reward.
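
To make the contrast concrete, here is a minimal, hedged sketch of the two losses on toy data: token-level cross-entropy against a fixed reference (SL) versus a REINFORCE-style loss that weights the log-probability of a sampled sequence by a single outcome-level reward (RL). The shapes, rewards, and logits are made up for illustration; this is not any specific training recipe.

```python
# Toy contrast between an SL loss and an RL loss for sequence generation.
# All tensors here are random placeholders standing in for model outputs.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 2, 5, 10

# Model outputs: one logit vector per generated token position.
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)

# --- Supervised learning: token-level cross-entropy ---
# Requires a reference token at every position; the loss pushes the model
# toward that exact sequence even if other sequences would also solve the task.
reference_tokens = torch.randint(0, vocab, (batch, seq_len))
sl_loss = F.cross_entropy(logits.reshape(-1, vocab), reference_tokens.reshape(-1))

# --- Reinforcement learning: REINFORCE-style sequence-level loss ---
# Sample a whole sequence from the model, score it with one scalar reward
# (e.g. "did the final answer check out?"), and weight the sequence
# log-probability by that reward. No per-token labels are needed.
dist = torch.distributions.Categorical(logits=logits)
sampled_tokens = dist.sample()                  # (batch, seq_len)
log_probs = dist.log_prob(sampled_tokens)       # (batch, seq_len)
reward = torch.tensor([1.0, 0.0])               # hypothetical outcome-level feedback per sequence
rl_loss = -(reward.unsqueeze(1) * log_probs).sum(dim=1).mean()

print(f"SL loss: {sl_loss.item():.3f}  RL loss: {rl_loss.item():.3f}")
```

The key difference visible in the sketch is where the supervision attaches: the SL loss needs `reference_tokens` at every position, while the RL loss only needs the scalar `reward` for each sampled sequence, so any generation path that earns reward gets reinforced.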