Understanding Why DeepSeek Works So Well
2025 kicks off with a splash: the AI race heats up yet again with the sudden debut of the Chinese research lab DeepSeek and its powerful new open-source reasoning models, R1 and R1-Zero. What makes DeepSeek so buzzy? It is singlehandedly upending many assumptions (one might dare to say propaganda) about the path forward for AI, demonstrating that much can be accomplished more cheaply and with a simpler setup than many believed.
Cost-Effectiveness Meets Cutting-Edge Performance
On the surface, the performance characteristics and cost-effectiveness of R1 vindicate the position that algorithmic ingenuity and novel architecture design, not just raw scaling, are the true path to progress for AI. R1 matches or marginally bests OpenAI's current flagship reasoning model, o1, purportedly at a small fraction of the cost. Of its hefty 671B total parameters, only 37B are active for any given token, thanks to a Mixture-of-Experts (MoE) architecture that routes each token through a small subset of specialized expert sub-networks. It was reportedly trained for roughly $6 million (measly by frontier-lab standards) using 2,048 mid-range GPUs over about two months, around 2.8 million GPU-hours in total; notably, these figures describe the DeepSeek-V3 base model on which R1 was built. Contrast this with Llama 3 405B's 30.8 million GPU-hours, and the assumption that cutting-edge LLM performance demands a huge investment and aeons of training time starts to look questionable.
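To make the 37B-of-671B figure concrete, here is a minimal sketch of top-k Mixture-of-Experts routing in PyTorch. The dimensions, expert count, and expert design are illustrative placeholders, not DeepSeek's actual configuration (the real model uses many more, finer-grained experts); the point is simply that only the routed experts' parameters do any work for a given token.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy top-k Mixture-of-Experts layer: every token is routed to only
    k experts, so the active parameter count per token is a small fraction
    of the layer's total parameters."""
    def __init__(self, dim: int = 64, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():  # only the selected experts run for these tokens
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(10, 64))  # 10 tokens, each touching just 2 of 8 experts
```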
Best of all, as a gift to the open-source community (and to humanity in general), DeepSeek has published the model's weights along with a technical report on its inner workings. This transparency is a blessing: everything is MIT-licensed, free to distill and to use commercially.
The Real Breakthrough: Internalized Reinforcement Learning
While DeepSeek's performance characteristics and cost profile are attention-grabbing, to me the real breakthrough is what it says about reinforcement learning: the possibility, described in the developers' technical paper, that it can be completely internalized and self-contained.
DeepSeek applies a technique called Group Relative Policy Optimization (GRPO), without requiring supervised fine-tuning, to achieve something unprecedented: a model that learns from its own internal processes without external intercession and, in effect, "self-evolves," as the researchers describe it. Traditional reinforcement learning setups lean on one of two external components: human feedback that directly corrects the model, or a critic, an additional trained "teacher" model that scores the "student" model's outputs. GRPO does away with both. Instead, R1-Zero does something more immediately mind-like: it evaluates and refines its reasoning strategies through its own internal dynamics, comparing groups of its sampled outputs against one another. The process is akin to reflecting on one's own thinking, weighing the merits of each explanation or possibility one comes up with and up-ranking or down-ranking them accordingly.
Let me try to drive this point home and emphasize why it's a big deal. By projecting reinforcement learning over reasoning tokens, the system more faithfully mimics the kind of metacognitive reckoning we engage in when we contemplate various possibilities in the course of thinking. The spread of the model's own sampled reasonings alone describes a reward landscape; no external "teacher" is necessary. So there is something inherently autodidactic about how DeepSeek works.
How GRPO Works
When the researchers applied this method, DeepSeek-R1-Zero demonstrated emergent behaviors such as self-reflection, reasoning via long Chains of Thought (CoT), and iterative problem-solving. This highlights the intrinsic nature of GRPO: the model provides itself with data by reasoning, deriving feedback from how each attempt compares to the other attempts sampled for the same problem. In doing so, it internalizes the reinforcement learning process. Rather than treating reinforcement as an internal-external coordination strategy, GRPO is internal-internal: the model's own reasoning dynamics drive the reinforcement.
Think of GRPO as a process where the model generates multiple reasoning strategies (policies) for a given task and pins them onto a metaphorical “board.” Instead of evaluating each policy independently, it groups them together and evaluates their relative effectiveness within the group. This grouping provides a baseline derived from the group’s collective performance, enabling the model to identify the policies that performed the best relative to others in the same batch.
By doing so, as the sketch after this list illustrates:
The model avoids needing a separate, large critic model to evaluate policies.
It focuses on relative advantage—how much better or worse a policy is compared to its peers—rather than relying solely on absolute rewards.
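To ground the metaphor, here is a minimal sketch of how a group-relative baseline can be computed, assuming one scalar reward per sampled completion. It is an illustration of the idea rather than DeepSeek's actual training code: rewards for a group of completions sampled from the same prompt are normalized against the group's own mean and spread, and those normalized scores play the role a separate critic network would otherwise play.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar reward per completion sampled
    from the SAME prompt. Returns each completion's advantage relative to the
    group: positive if it beat the group average, negative if it fell short."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four completions for one prompt, scored by a rule-based checker (1 = correct).
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)
# Above-average completions get positive advantages and are reinforced;
# below-average ones are pushed down. No learned critic is involved.
```

In the full GRPO objective these advantages then weight a clipped policy-gradient update with a KL penalty toward a reference model, but the group-relative normalization shown here is the piece that replaces the critic.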
A Philosophical and Practical Shift in Reinforcement Learning
These results are significant not just because they lead to an improved model, but because they change how we view reinforcement learning as a machine learning technique. Until now, the whole concept of RL rested on the assumption that, for it to work, a knowledge gradient had to exist between a more knowledgeable external source and a less knowledgeable learner: a trainer and a trainee. R1-Zero suggests instead that this gradient can live entirely within a single agent. That abstraction appears to be made possible by the internal complexity opened up by CoT reasoning tokens. A system that can synthesize better reasoning from its own internal dynamics is extraordinary. How can this happen?
Chain of Thought (CoT): Building the Mental Workspace
The trajectory of progress appears to follow this outline: DeepSeek builds on top of CoT, and in my view CoT was a prerequisite for this internalized RL dynamic. CoT introduced the idea of an internal "mental workspace" in which a model can iteratively refine solutions. Each piece of reasoning is sufficiently distinguishable from the others to define a gradient of relative comparison, which generates information, "a difference that makes a difference." Because the technique makes reasoning explicit rather than implicit, intermediate steps become independently identifiable reasoning tokens that can be evaluated in their own right. CoT, in other words, gives the model something like an internal mental headspace full of possibilities that it can weigh against one another during the reinforcement process.
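A toy illustration of what "explicit rather than implicit" buys you (purely illustrative; the hand-coded arithmetic "strategies" below stand in for an LLM's stochastically sampled chains of thought): each sampled trace is an inspectable object with distinguishable intermediate steps and a final answer, which is exactly what makes relative comparison possible.

```python
import random

def sample_cot(a: int, b: int) -> dict:
    """Toy stand-in for stochastic chain-of-thought decoding: each call on the
    same question yields an explicit list of reasoning steps plus an answer."""
    strategy = random.choice(["distribute", "add_then_double", "guess"])
    if strategy == "add_then_double":
        steps = [f"{a} + {b} = {a + b}", f"2 * {a + b} = {2 * (a + b)}"]
        answer = 2 * (a + b)
    elif strategy == "distribute":
        steps = [f"2*{a} = {2 * a}", f"2*{b} = {2 * b}", f"{2 * a} + {2 * b} = {2 * a + 2 * b}"]
        answer = 2 * a + 2 * b
    else:
        steps = ["(no intermediate reasoning)"]
        answer = random.randint(0, 20)
    return {"steps": steps, "answer": answer}

# Several traces for the same prompt ("compute 2*(3+4)"): because the steps are
# explicit tokens rather than hidden activations, the traces can be compared
# and ranked against one another in the reinforcement stage.
traces = [sample_cot(3, 4) for _ in range(4)]
```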
Reinforcement Learning Under Deductive Closure
The application of reinforcement learning on top of this CoT layer is closely analogous to a human being weighing alternative hypotheses or potential solutions to a problem in their mind, evaluating them against one another on their intrinsic merits before committing to the one that yields the best results. No external intervention is strictly necessary (although some feedback from the world is inevitable); reinforcement learning is achieved under deductive closure. The system doesn't need constant external correction or reference points; it can derive productive inferences purely by optimizing internal coherence and effectiveness.
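It is worth pausing on what that residual "feedback from the world" looks like in practice. In R1-Zero's training it is reportedly a rule-based reward, a deterministic check on the final answer and on the reasoning format, rather than a learned judge or a human rater. The sketch below is an illustrative approximation of that kind of verifier (the tag format follows the paper's template; the reward weighting is my own placeholder), not the paper's exact code.

```python
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Deterministic verifier: reward a correct final answer plus adherence to
    the explicit <think>/<answer> reasoning format. No learned reward model,
    no human in the loop. (Weights here are illustrative placeholders.)"""
    format_ok = bool(re.search(r"<think>.+</think>\s*<answer>.+</answer>",
                               completion, flags=re.DOTALL))
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    answer_ok = match is not None and match.group(1).strip() == reference_answer
    return (1.0 if answer_ok else 0.0) + (0.1 if format_ok else 0.0)

print(rule_based_reward("<think>2*(3+4) = 2*7 = 14</think> <answer>14</answer>", "14"))  # 1.1
```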
Challenges and Limitations
This intrinsic feedback loop points toward a system that can learn to reason in a way that is relatively autonomous and largely independent of traditional data requirements, qualities that may be prerequisites for AGI and for genuinely self-sufficient artificial cognition. Such autonomy is not without drawbacks, however. Without a human guide, R1-Zero tends to produce hard-to-read reasoning traces, drifting toward alien ways of thinking unbounded by human-centered constraints and preferences. The developers counter this in R1 with a small "cold start" dataset that establishes foundational patterns and constraints and helps the model produce more interpretable outputs. Untangling how vital this cold-start data is, relative to how the system performs without it, matters for establishing just how "autonomous" its self-learning really is.
A Stochastic Foundation for Self-Supervision
At first glance, DeepSeek's internalized use of reinforcement learning defies common sense. How can it bootstrap itself with no (or minimal) external guidance? It almost seems to amount to the information-theoretic equivalent of a perpetual motion machine. If I want to learn physics, I can teach myself from an external information source, like a textbook, but I can't just sit there and conjure it up in my own head. This line of thinking, however, confuses knowledge with reasoning. Knowledge requires external facts to acquire and memorize, whereas reasoning is a deductive process that operates according to intrinsic, largely a priori rules.
That the model can grow from this self-supervised mechanism and get as far as it did with minimal resources plays to the strengths of stochastic sampling from a transformer and the power of CoT. Because what comes out of the model under sampling is a surprise, even to the model itself, it can generate data for itself by weighing its different policies inside the mental workspace created by CoT. The information generated by comparing and contrasting its own alternative reasoning traces provides enough "lift" to fuel the upward progress of the model's self-actuated learning, seemingly out of thin air. Since the model is learning how to reason better rather than memorizing facts, external influence is not strictly required.
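This can be made concrete with a tiny numerical check (a toy calculation, not a figure from the paper): when stochastic decoding produces completions whose rewards differ, the group-relative advantages are nonzero and there is something to reinforce; when every sample earns the identical reward, the advantages collapse to zero and the update carries no signal. The diversity of the model's own samples is, quite literally, the fuel.

```python
def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed outcomes -> nonzero learning signal
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # identical outcomes -> all zeros, nothing to learn
```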
Smarter Algorithms, Not Just Bigger Models
The success of R1 calls into question the “scaling law” school of thought, which suggests that larger models with more data and compute are the only way forward. Instead, DeepSeek provides compelling evidence that smarter algorithms and ingenious architectures can achieve significant breakthroughs at a fraction of the cost. It invites us to be intellectually humble before the mysterious possibilities and unknown interactions that emerge at this level of complexity.
While GRPO and DeepSeek's approach are potentially revolutionary, they are not without challenges. For instance, the evolved reasoning traces can lack readability, since the system is not explicitly reinforced by human preferences the way Reinforcement Learning from Human Feedback (RLHF) setups are. One could also argue that the system's ability to "think for itself" risks misalignment unless it is carefully constrained and monitored, lest it go off the rails and begin thinking in alien or dangerous ways. The intrinsic RL mechanism DeepSeek uses could conceivably contribute to a runaway rogue-AI scenario if the right failsafes are not in place.
Conclusion: DeepSeek’s Leap Forward
All told, DeepSeek is shaping up to be a major leap forward, and further evidence that we are only getting started in the quest for AI innovation. Low-cost, high-performance models coming out of left field, like DeepSeek's, show that the AI race is still anybody's game. Their success also casts doubt on the story that the only way forward is to pour everything into hyper-scaling data and compute. DeepSeek simply put together pieces that are published regularly in the research literature but frequently overlooked by those more interested in throwing money at the problem, or in chasing incremental quick wins by iterating on established paradigms rather than attempting the audacious.
Our AI models should learn to work smarter, not harder. DeepSeek is proof that going back to the drawing board and integrating lessons from the latest research can yield meaningful gains. Most fascinating of all is its combination of reinforcement learning and chain of thought to help usher in what appears to be a whole new dimension of artificial cognition.