Reinforcement learning is a machine learning technique in which a computer agent learns to perform a task through repeated trial-and-error interactions with a dynamic environment. Models like GPT-4 and Claude 3.5 were pretrained on trillions of tokens of data, but as AI companies run out of fresh data to feed their LLMs, they are leaning harder on reinforcement learning: rewarding models when they make the right decisions. The approach has driven a sharp jump in capabilities.

One episode illustrates the competitive stakes. Microsoft says it found evidence that the Chinese startup DeepSeek swiped OpenAI's proprietary data without permission (even as Microsoft added DeepSeek's R1 model to its own cloud offerings). Users can pay for access to some of OpenAI's outputs, but the claim is that a mysterious group with ties to DeepSeek grabbed far more than its terms allowed sometime last fall. The problem, if the allegation holds, is that DeepSeek got to skip past the hard parts, putting the finishing touches on a model that OpenAI spent at least millions of dollars developing. It may have used a method called distillation, in which you feed the outputs of a large model into a much smaller one to train it at a much faster pace.
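The distillation idea mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions: the "teacher" here is a stand-in function, not a real LLM, and real distillation of language models matches probability distributions over tokens rather than scalar outputs. All names are illustrative.

```python
import random

def teacher(x):
    # Stand-in for a large pretrained model's output (hypothetical).
    return 2.0 * x + 1.0

def distill(steps=2000, lr=0.05, seed=0):
    """Train a tiny 'student' model on the teacher's outputs, not raw labels."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0  # student parameters
    for _ in range(steps):
        x = rng.uniform(-1, 1)
        target = teacher(x)        # the teacher's output serves as the label
        pred = w * x + b
        err = pred - target
        # Gradient step on squared error between student and teacher.
        w -= lr * err * x
        b -= lr * err
    return w, b

w, b = distill()
print(round(w, 2), round(b, 2))  # student roughly recovers w=2.0, b=1.0
```

The point of the sketch is that the student never sees ground-truth data at all; the teacher's outputs are cheap to generate in bulk, which is why distillation trains small models so quickly.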
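Stepping back to the reinforcement-learning idea at the top of the piece, the reward loop can be sketched with a classic toy problem, the multi-armed bandit: an agent repeatedly tries actions and drifts toward the ones that earn higher rewards. This is a minimal sketch of the reward principle only; frontier labs use far more sophisticated methods (e.g. reward models and policy-gradient training), and every name below is illustrative.

```python
import random

def train_bandit(true_rewards, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-known action, sometimes explore."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_rewards)  # running reward estimate per action
    counts = [0] * len(true_rewards)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(true_rewards))   # explore
        else:
            action = max(range(len(true_rewards)),      # exploit
                         key=lambda a: estimates[a])
        # Noisy reward signal for the chosen action.
        reward = true_rewards[action] + rng.gauss(0, 0.1)
        counts[action] += 1
        # Incremental average: nudge the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates

est = train_bandit([0.2, 0.8, 0.5])
print(max(range(3), key=lambda a: est[a]))  # the agent settles on action 1, the best payer
```

Nothing tells the agent which action is correct; it discovers the best one purely from the reward signal, which is the same basic feedback loop, vastly scaled up, that labs now use to sharpen LLM behavior.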