Probably the largest piece of scientific mythology in AI is the so-called neural scaling law, which, in crudest terms, says "bigger equals better." I've been skeptical of it almost since the day I first learned of it. Instead, I've advocated for the same position that motivated Liang Wenfeng, founder of DeepSeek: to "study new model structures to realize stronger model capability with limited resources," as he put it.
Making the scaling story seem even more dated, Stanford researchers recently trained a reasoning model comparable to OpenAI's o1 using less than $50 in compute credits!
Now that public perception is turning against scaling laws, thanks to disruptors like DeepSeek showing there is often a more efficient way to do things, the industry is learning the hard way that scale isn't everything. So the time is ripe to press my attack.
The purpose of this article is not to gloat about having been right. Nor is it to simply express an opinion. I want to get to the bottom of this so-called scaling law: demystify it and explain it from sound principles once and for all. It irritates me to see irrationalities run rampant, and it is inherently irrational to bet billions of dollars on a phenomenon we don't understand.
To be clear, nobody who abides by an evidence-based epistemology, and that includes me, denies the scaling effect as an empirical observation. What cries out for deconstruction and critique is the psychological process and the perverse economic incentives that transformed this limited observation into a spurious universal "law." An odd bit of linguistic prestidigitation has occurred in the rebranding of the scaling effect, a legitimate if insufficiently explained empirical result, as the scaling law, an unfounded dogma that deserves serious scrutiny.
This post will get into the details of what neural scaling is and why we shouldn't bet everything on it. It will then offer a scientifically grounded, yet novel, explanation for the scaling effect, and close with some recommendations for investors and researchers looking to prosper in the shifted AI landscape. I intend this to be a definitive guide to the scaling law, one that views it from all sides.
But if you want to cut to the chase, the conclusion I draw is this: a critical blunder was made across science and industry, an inductive fallacy compounded into a hasty generalization. This blunder took one way of doing things right and turned it into the way of doing things, at the cost of significant capital and missed opportunities. The scaling law became a simple, too simple, story that also conveniently justified tremendous investments. It became a way for vested interests to depict AI development as the exclusive preserve of a small clique of privileged companies and institutions, when the reality is quite different.
The picture I am about to paint tells a story in which an industry, seduced by steady incremental gains and committed to a fixed approach, was bewitched by the scaling effect masquerading as a scientific law and drawn away from fundamental research. Meanwhile, a scrappy lab in China that until very recently few had heard of decided to approach things differently.
We have quite a bit of ground to cover. So please strap in. We will have to begin with the science on the scaling effect.
Evidence for the Scaling Effect
Let's begin by reviewing the science behind scaling.
The term neural scaling law entered common circulation thanks to the landmark study by Kaplan et al. (2020), which documented various scaling relationships in transformers, focusing on language models, and demonstrated that the cross-entropy loss scales as a power law with model size, dataset size, and the amount of compute used for training. (Although more scaling dimensions have since been discovered, including a possible big fourth, these largely remain the "big three.") They found that other architectural details, such as network width or depth, have minimal effects within a wide range, and they report model size to be the most impactful scaling factor. In other words, the bigger the better.
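To make the shape of this result concrete, here is a minimal sketch of fitting that power-law form, L(N) = (N_c / N)^alpha, to a handful of loss measurements. The data points and the fitted constants are invented for illustration, not Kaplan et al.'s actual figures:

```python
# Minimal sketch: fitting the Kaplan-style power law L(N) = (N_c / N)**alpha
# in log space (log L is linear in log N). The data points below are
# illustrative placeholders, not figures from the paper.
import numpy as np

# Hypothetical (model size, final loss) measurements from a training sweep.
sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
losses = np.array([5.8, 4.6, 3.7, 3.0, 2.4])

# log L = -alpha * log N + alpha * log N_c  ->  ordinary least squares on logs.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
alpha = -slope
n_c = np.exp(intercept / alpha)
print(f"alpha ~ {alpha:.3f}, N_c ~ {n_c:.2e}")

# The "law" is then used to extrapolate loss for a model 10x larger than any
# point in the sweep -- precisely the inductive leap questioned later on.
predicted = (n_c / 1e11) ** alpha
print(f"predicted loss at 1e11 params: {predicted:.2f}")
```

Note how little machinery is involved: the "law" is a straight line on a log-log plot, extended past the data.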
Zhai et al. (2022) confirmed these results for vision transformers, finding that increasing both model and dataset sizes leads to improved performance following predictable scaling laws, and that larger models are more sample-efficient, achieving better results with fewer training examples.
The paper “Explaining Neural Scaling Laws” by Bahri et al. (2024) investigates why larger datasets and bigger models improve performance in deep learning. The authors identify two scaling mechanisms, each applying separately to dataset size and model size for four regimes in total: (1) variance-limited scaling, where performance improves predictably as model or dataset size increases due to reduced variance in learning; and (2) resolution-limited scaling, where performance is constrained by how well the model can resolve the structure of the data manifold. The study argues that these scaling patterns arise from the statistical properties of neural network feature learning and from how networks interpolate between data points. Empirical experiments across different architectures confirm the theoretical predictions, showing that as models grow they systematically refine their representation of the data. The paper provides a taxonomy of scaling behaviors and bridges theoretical insights with empirical scaling laws in deep learning.
All of this is well and good. Once again, I’m not here to contest these findings. I dispute two things only:
That these findings are robust and universal enough to amount to a proper scientific law
That scaling is the only way to make progress in frontier AI model development, and that frontier development should therefore remain the preserve of a select group of well-heeled, exclusive companies (this is what they want you to believe)
Both of these contentions will be addressed later. First, let’s make sure we’ve covered everything we know about scaling thus far.
Identifying Scaling Dimensions:
So far several dimensions of scale have been discovered, and it's helpful to identify and describe each. These different scaling dimensions can interact in complex and nonlinear ways, so you generally want to vary them together, e.g., increasing both data and parameters simultaneously, or improving compute scaling to enable memory scaling.
Note: these are the major forms of scaling identified so far; others may remain to be discovered.
Data Scaling:
What it is: This is perhaps the simplest type of scaling to understand. Data scaling simply means providing more data for the model to learn from.
Why it works: The more data you feed the model, the more it will learn, just as a student who reads more books acquires more knowledge, although there is reason to believe this relationship becomes nonlinear as parameter count increases. In general, as the model's embedding space is enriched, it becomes better able to draw connections and detect parallels across disparate domains. Quantity becomes a quality all its own.
Shortcomings: Data that is both high quality and high quantity is in finite supply, and you generally need both. More fundamentally, we can only provide data for things we already understand or problems we have already solved, so if we want AGI we'll need a model that can reason beyond its data rather than just memorize and parrot established facts. More data doesn't necessarily translate into improved "thinking ability." Novelty matters too: once the model has converged on memorizing a fact, re-exposing it to that same fact is of little use.
Compute Scaling:
What it is: Increasing the total floating-point operations (FLOPs) expended during training, typically by allocating more GPUs to the task.
How it works: Intuitively, the more computation you give a model during training, the more it can learn in a given amount of time. Performance follows Kaplan's scaling law, but there is an optimal allocation strategy: the Chinchilla results (Hoffmann et al., 2022) found that model size and dataset size must be balanced for efficient compute usage (a rough numerical sketch follows below).
Shortcomings: Simply throwing more compute at training yields diminishing returns unless the run is carefully optimized, and GPUs and other specialized AI chips are expensive and energy intensive to operate.
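Here is the rough sketch of what "balanced" means, under two commonly cited approximations: training compute C ≈ 6·N·D and the Chinchilla rule of thumb of roughly 20 training tokens per parameter. Both are simplifications assumed here for illustration, not the paper's full fitted scaling law:

```python
# Rough sketch of Chinchilla-style compute-optimal allocation, assuming the
# common approximations C ~= 6 * N * D (training FLOPs) and D ~= 20 * N
# (tokens per parameter). These are rules of thumb, not the fitted law.
import math

def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget between parameters (N) and training tokens (D)."""
    # C = 6 * N * D and D = k * N  =>  N = sqrt(C / (6 * k)), D = k * N
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):  # hypothetical training budgets in FLOPs
    n, d = compute_optimal_allocation(budget)
    print(f"C = {budget:.0e} FLOPs -> ~{n:.2e} params, ~{d:.2e} tokens")
```

The point of the exercise: for a fixed budget, making the model bigger means training it on less data, so "more compute" only helps if the split is sensible.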
Parameter Scaling:
What it is: Parameter scaling concerns the size of the model itself: the number of learnable weights, or parameters, it has. Experience has consistently shown that bigger models tend to perform better, although smaller, newer models have been seen to outperform older, bigger ones by implementing recently discovered efficiencies and best practices. (A bigger model implementing those same efficiencies would likely perform better still.)
Why it works: Parameter scaling makes sense on an intuitive level: the model has more brain power to crunch the numbers. More parameters provide a wider "surface area" for the model to fit complex, large datasets.
Shortcomings: Technical caveats such as overfitting and vanishing gradients mean that bigger isn't always better. Bigger models naturally take up more space and require more energy to train and deploy. It's also known that adding more layers doesn't always translate into better results; past a certain depth it can even make things worse, since training takes longer and becomes harder to optimize. Additionally, it's being discovered that for any given computation a model doesn't use all of its neurons, so big models may be more wasteful than we realize.
Test-time scaling:
What it is: Test-time scaling comes in after a model has already been pre-trained and takes place at inference time, that is, when the model is activated in real time and working on a problem. A candidate fourth member of the major scaling laws, test-time compute scaling has shown great promise: by giving the model more time and more resources to process a task, outcomes tend to improve. Test-time compute scaling is behind the success of reasoning models such as OpenAI's o1, Gemini 2.0, and DeepSeek's R1.
Why it works: Intuitively, if someone gave you more time, a calculator, and a scratchpad to solve a math problem, you'd do better than if you had to solve it in one go off the top of your head. Test-time compute scaling pairs well with a technique called chain of thought (CoT), which involves breaking a problem into parts, planning a solution, and solving each piece step by step. CoT requires time to compute: spreading a problem over several steps gives a mind, or an AI model, more room to solve it. (A toy numerical illustration follows below.)
Shortcomings: Interestingly, we don't yet know the limits of test-time compute scaling, beyond the obvious facts that we can't run these models forever and that more processing time and compute can grow unsustainably costly. Otherwise, this scaling method gives the model time to refine its reasoning. I would caution against attributing everything successful about this method to scaling, however. Chain of thought is more accurately classified as an algorithm, rather than the simple increase of a numerical quantity we'd normally associate with the term scaling. DeepSeek R1's impressive use of reinforcement learning during this phase is likewise an algorithm; the scaling of time and compute merely gives the algorithm room to express itself. (In another disingenuous move, the industry has taken to attributing almost any progress to scaling even when it isn't.)
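Here is the toy illustration promised above. It contains no real model; each reasoning attempt is simulated as correct with some fixed probability, which is enough to show why sampling more attempts at inference time and aggregating them with a majority vote (the "self-consistency" idea) tends to improve accuracy:

```python
# Toy illustration of one form of test-time scaling: sample several independent
# reasoning attempts and take a majority vote. There is no real model here --
# each attempt is simulated as correct with probability p -- but the trend
# (more inference-time samples, higher accuracy) is the point.
import random
from collections import Counter

def majority_vote_accuracy(p_correct: float, n_samples: int, n_trials: int = 10_000) -> float:
    """Estimate accuracy when the final answer is the most common of n_samples attempts."""
    wins = 0
    for _ in range(n_trials):
        # Each attempt returns the right answer with probability p_correct,
        # otherwise one of a few plausible wrong answers.
        attempts = [
            "right" if random.random() < p_correct else random.choice(["wrong_a", "wrong_b", "wrong_c"])
            for _ in range(n_samples)
        ]
        if Counter(attempts).most_common(1)[0][0] == "right":
            wins += 1
    return wins / n_trials

random.seed(0)
for k in (1, 5, 15, 45):  # increasing test-time compute
    print(f"{k:>2} samples -> ~{majority_vote_accuracy(0.4, k):.2%} accuracy")
```

The gains come from the aggregation procedure, an algorithmic choice; extra compute is what lets the procedure run, which is exactly the distinction drawn above.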
Memory (Context) Scaling
What it is: Increasing the context length a model can handle at once (e.g., longer sequences in transformers).
Why it works: Performance improves with longer context windows. The more information a model can ingest in one go, the more it can correlate and associate at once.
Shortcomings: For transformers, the quadratic cost of attention makes long contexts expensive, and performance also tends to degrade as contexts grow (a back-of-the-envelope sketch follows).
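A back-of-the-envelope sketch of that quadratic cost, under the simplifying assumption that the full attention score matrix is materialized in fp16 (modern kernels such as FlashAttention avoid exactly this, but the underlying computation still grows quadratically with context length):

```python
# Why context scaling gets expensive: the self-attention score matrix grows
# quadratically with sequence length. Head count and precision below are
# illustrative assumptions, not a full transformer memory model.
def attention_score_memory_gb(seq_len: int, n_heads: int = 32, bytes_per_elem: int = 2) -> float:
    """Memory for the (seq_len x seq_len) score matrices across heads, in GB, for one layer."""
    return n_heads * seq_len * seq_len * bytes_per_elem / 1e9

for seq_len in (2_048, 8_192, 32_768, 131_072):
    print(f"context {seq_len:>7,}: ~{attention_score_memory_gb(seq_len):8.1f} GB of scores per layer")
```

Each 4x increase in context multiplies this term by 16, which is why long-context support is an engineering problem in its own right rather than a free knob to turn.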
Multi-Modal Scaling:
What it is: Expanding models to multiple modalities (e.g., text, images, video, audio). Arguably a subcategory of data scaling.
Why it works: Multimodal scaling inherits from data scaling: it stands to reason that furnishing a model with a richer comparative and contrastive frame of reference will let it do more and perform better. Larger models generalize better across modalities. Interestingly, performance scales differently across modalities (vision models follow different laws than language models), and joint training (e.g., vision-language models like CLIP) follows different scaling curves than unimodal training.
Shortcomings: Alignment across modalities is challenging. Different modalities have fundamentally different statistical properties, structures, and representations, and ensuring training stability, and that modalities interact meaningfully, efficiently, and coherently without one dominating or interfering with the learning of another, is nontrivial. Multimodality also inherits data scaling's potential limitations.
Architecture Scaling:
What it is: Changing network topology rather than just size (e.g., depth vs. width, Mixture of Experts, sparse models).
Why it works: Performance improvements depend on parameter efficiency, not just parameter count. Researchers have found that scaling a model's depth (number of layers) or width (number of units per layer) can enhance performance under specific circumstances (a toy parameter-count sketch follows below).
Shortcomings: There are tradeoffs. Wider networks are more parallelizable (better for GPUs), while deeper networks excel at hierarchical feature extraction, so scaling the two dimensions must be balanced to optimize training efficiency.
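Here is the toy parameter-count sketch mentioned above, contrasting a dense feed-forward block with a Mixture-of-Experts block. The dimensions are hypothetical and not drawn from any particular model; the point is that architecture, not raw size, determines how many parameters are actually exercised per token:

```python
# Toy parameter counts for a dense feed-forward block vs. a Mixture-of-Experts
# block, to illustrate "parameter efficiency": an MoE layer can hold many more
# total parameters while activating only a fraction of them per token.
# Dimensions are hypothetical.
def dense_ffn_params(d_model: int, d_ff: int) -> int:
    return 2 * d_model * d_ff  # up-projection + down-projection weight matrices

def moe_ffn_params(d_model: int, d_ff: int, n_experts: int, top_k: int):
    total = n_experts * dense_ffn_params(d_model, d_ff)
    active = top_k * dense_ffn_params(d_model, d_ff)  # experts routed per token
    return total, active

d_model, d_ff = 4096, 16384
print(f"dense FFN:            {dense_ffn_params(d_model, d_ff):,} params")
total, active = moe_ffn_params(d_model, d_ff, n_experts=64, top_k=2)
print(f"MoE FFN (64 experts): {total:,} total, {active:,} active per token")
```

Two models with the same headline parameter count can thus have very different effective compute per token, which is why "bigger" by itself is an underspecified claim.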
Cost scaling:
What it is: This entry is tongue in cheek. But what the powers that be in AI want you to believe is that model performance is directly proportional to capital expenditure.
How it works: Data and compute aren’t free. The more you can pay for it, the more you can power development.
Shortcomings: Money alone can’t buy you the right ideas. Scaling inefficient ideas will lessen the value of each marginal performance increase per dollar.
As you can see, scaling is not a simple thing. And all of it hinges on one critical point: the model structure you're using. Almost all of the generalizations above have only been thoroughly confirmed, empirically, for transformer architectures within certain implicit bounds. The tricky part is that we are dealing with artificial systems where anything is possible: at any point, it may become possible to do things differently under a completely different set of assumptions. Assuming that scaling the transformer architecture is the be-all and end-all is to adopt a false objectivity.
DeepSeek's ingenious use of reinforcement learning over reasoning tokens is one such example, but it's not the only one. Last year, Boston-based startup Liquid unveiled a high-performing, efficient model built on a state-space architecture. In a sense, Liquid proved before DeepSeek did that it's not all about scale. It's about discovering architectural and algorithmic innovations and then scaling those to their uppermost extent.
Why Does Scaling Work?
One thing we sorely lack is a complete and uncontroversial explanation for neural scaling. To most people it just looks like magic, which is an awful situation to be in. Here's the thing: machine learning as a field is not a mature science with a well-founded general theory. It is very much an in-the-trenches, results-oriented empirical field, and to my knowledge very few researchers have stepped back to take a God's-eye view of the entire phenomenon and attempt to deduce the empirical findings from theoretical foundations. So a theoretically principled explanation for the neural scaling law remains out of reach. Because nobody really knows why machine learning works, nobody really knows why scaling works.
Although we presently lack a theoretical framework for understanding scaling, I still think it's possible to take a crack at it. This is because I believe we can have a science of machine learning, but that's a much longer story. So here is my scientifically and mathematically grounded explanation of neural scaling. Let me preface it by saying this explanation is not yet widely accepted in the scientific community. (But it should be!) I also need more space than I'm willing to allot myself to do it complete justice, so here I'll just give you the gist.
To understand why scaling works, we have to define a learning space. A learning space is the set of values that it is possible for the model to learn: the haystack that contains our needle. All we know about the learning space initially is that it contains elements that have not yet been proven false, which is another way of saying they could be true. (Here true/false really means better or worse for the model's loss.) Mathematically speaking, the distribution over the learning space that is most objectively unbiased, given everything known about it, is the maximum entropy distribution. Elsewhere I've gone to great lengths to demonstrate that the learning space of any possible initial distribution must be the maximum entropy distribution. (A small numerical illustration of the principle follows.)
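For readers who want a concrete anchor, here is a small numerical illustration of the maximum entropy principle itself. The setup is generic (a finite set of outcomes and one known constraint) and has nothing to do with neural networks specifically; it just shows what "most objectively unbiased" means in practice:

```python
# Numerical illustration of the maximum entropy principle: over a finite set of
# outcomes, with only a mean-value constraint, the least-biased distribution is
# the exponential-family (Gibbs) form. Values and constraint are arbitrary.
import numpy as np
from scipy.optimize import minimize

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
target_mean = 2.0  # the only thing we "know" about the distribution

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(p * np.log(p))  # minimizing this maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},           # normalization
    {"type": "eq", "fun": lambda p: p @ values - target_mean},  # known mean
]
p0 = np.full(len(values), 1.0 / len(values))
result = minimize(neg_entropy, p0, method="SLSQP",
                  bounds=[(0, 1)] * len(values), constraints=constraints)

print("max-entropy distribution:", np.round(result.x, 3))
# The solution takes a Gibbs form p_i ~ exp(-beta * x_i): no structure is
# assumed beyond what the constraint forces.
```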
What scaling does is allow the model to narrow in on the target values and therefore converge on its maximum entropy distribution. Consider rotational invariance in convolutional neural networks (CNNs). When training a CNN, you want to expose the model to various rotations and deformations of your target image so that it can capture the image under as many aspects as you'd expect to see in the real world. Every time you train the model on a new rotational variant, you've exhausted another permutation of state within the learning space. The learning space is like a configuration space of all the possibilities your image could be in, both seen and unseen by the model; the more of those possibilities you expose the model to, the better it can approximate the totality. This is common sense: expose the model to more seen than unseen variants, and it will get better at predicting the shrinking unseen remainder as it accumulates experience.
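In code, that exposure-to-variants process is just data augmentation. Here is a sketch using torchvision transforms on a synthetic stand-in dataset so the snippet runs anywhere; in a real setup you would point it at your own images:

```python
# Sketch of the rotation-augmentation idea described above, using torchvision
# transforms on a synthetic stand-in dataset (FakeData) so the snippet is
# self-contained. Swap in your own dataset for real training.
import torch
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),   # random rotations up to +/- 30 degrees
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.FakeData(size=256, image_size=(3, 64, 64), transform=augment)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

for images, labels in loader:
    # Each batch contains randomly rotated/flipped variants, so over many epochs
    # the CNN samples an ever larger share of the pose "configuration space".
    print(images.shape)  # forward pass, loss, and backward pass would go here
    break
```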
No matter which variable you scale, it serves this purpose of giving the model a better means of reaching the optima of its learning space. Better chips mean it can calculate the possibilities faster. More data means it can fill in the blanks more completely. More parameters mean it has more infrastructure to support larger and more complex learning spaces, and so on. All of it is in service of reaching the point where the model saturates the learning space.
Here's the problem: saturating the learning space is a point of no return. Once a model has mastered the learning space, our model constraints have been satisfied, and if we don't have an architecture that can do more interesting things, it doesn't matter how much further we scale it. There is more in heaven and earth than exists in our philosophy, so to speak: the learning space we've mastered is not representative of all possibilities, so we've essentially become the big fish in a small pond. Hence scaling ultimately hits diminishing returns. Not only does progress slow with additional exposure, since informative signals naturally become rarer and more redundant as the model learns, but if the model is not architecturally capable of supporting certain additional discoveries, it will never learn what it is incapable of learning. Essentially, we've been barking up the wrong tree the whole time.
Many traditional explanations for scaling, such as overparameterization, implicit regularization, and capacity control, are often treated as separate effects. However, a maximum entropy framework unifies them under a single principle.
1. Overparameterization as Entropic Expansion
Overparameterization: This refers to the presence of far more parameters in a model than necessary to fit a dataset.
◦ Large models appear to generalize well despite excessive parameters because they explore more of the possible learning space.
◦ Training forces them to collapse toward a maximum entropy distribution, filtering out solutions that don’t minimize loss effectively.
2. Implicit Regularization as Entropy-Constrained Optimization
Implicit Regularization: This phenomenon describes how models, despite lacking explicit regularization, tend to favor solutions that generalize well.
◦ Implicit regularization emerges as models navigate the learning space, naturally selecting solutions that balance maximum expressiveness with stability.
◦ Scaling aids this process by allowing models to explore the entropy landscape more efficiently.
3. Capacity Control and the Limits of Scaling
Capacity Control: This involves determining how much model capacity is needed to optimize performance.
◦ Each additional parameter contributes to a broader hypothesis space only if the existing information density supports it.
◦ If a model saturates its learning space, adding more parameters fails to improve generalization, explaining diminishing returns in scaling.
In this view, scaling works only to the extent that it enables models to approach the entropy-saturating limit of their hypothesis space. Beyond that point, brute-force scaling becomes ineffective, and architectural innovation becomes necessary.
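One way to picture that saturation is to add an irreducible floor to the power law from earlier, loss(N) = L_floor + (N_c / N)^alpha: once the reducible term shrinks toward the floor, each further 10x in scale buys almost nothing. All constants below are made up purely for illustration:

```python
# Illustrative saturating scaling curve: loss(N) = L_FLOOR + (N_C / N)**ALPHA.
# Once the reducible term shrinks toward the floor, further scaling is wasted:
# the "learning space" of this architecture is effectively exhausted.
# All constants are invented for illustration.
L_FLOOR, N_C, ALPHA = 1.8, 1e9, 0.3

def loss(n_params: float) -> float:
    return L_FLOOR + (N_C / n_params) ** ALPHA

prev = loss(1e6)
for n in (1e7, 1e8, 1e9, 1e10, 1e11, 1e12):
    cur = loss(n)
    print(f"N = {n:.0e}: loss {cur:.3f}  (gain from last 10x: {prev - cur:.3f})")
    prev = cur
```

The curve never crosses the floor; only a different architecture, with a lower floor, can.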
That, at least, is my mathematically and theoretically grounded take on why scaling works when it does, and why it only succeeds up to a limit.
We Still Need GPU Scaling
Regardless of whether recent events lead to a democratization of AI development, if you believe in the technology, we're still going to need significant investment in infrastructure. The fastest, best models will naturally generate the most demand, which will require infrastructure to support their use. (DeepSeek has recently become a victim of its own success in this regard, experiencing frequent service interruptions for lack of infrastructure.) Besides, what the scaling effect tells us is that anything done well with minimal cost can be done better scaled up; that's the whole point. The entire basis of my critique is that we have no idea which base architecture is the very best.
The Scaling Law as Epistemic Tarpit
From a philosophical point of view, the scaling law commits the basic error of an inductive fallacy: assuming that because something happened in the past, it will keep happening in the future. Scientifically, there is no reason to generalize from the available evidence that scaling can continue indefinitely, and certainly no observation entails that scaling is the only way to get results; both logically and empirically, that is demonstrably false. Nor do we know that, beyond current known architectural constraints, there does not exist a whole different set of architectural assumptions with different scaling behavior. Almost all the evidence for scaling concerns the transformer architecture, which works well, but we don't know the full extent of architectural possibilities beyond it.
Faith in scaling commits an additional fallacy of assuming that because it works, it’s the only thing that can work. This hasty generalization is not justified.
Exposing the Motivated Reasoning Behind Scale Mania
Just as science and political power don't mix well, business and science often don't either. For the same reason we should be skeptical of tobacco-industry-funded studies asserting no relationship between smoking and cancer, we should take research from self-interested private entities with a grain of salt. The deeper one investigates, the more the abuse of the scaling effect begins to look less like a sincere intellectual mistake and more like an anti-competitive price-fixing scheme.
(Pause as I wait for the shocked gasps in the audience to subside.)
Quite a bit of money is infused into the scientific narratives surrounding cutting edge machine learning research, which is why we should be cautious about studies published by big companies. That’s not to disparage the work being done by these firms, only to point out the obvious: there are conflicts of interest.
It’s a bit too rich that the mainstream story being told to us by extremely wealthy, highly advantaged companies is that only extremely wealthy, highly advantaged companies can innovate in AI, and that therefore you should invest in them, and only them. If anything in AI is a bubble it’s this deception, not the technology.
Recommendations for Researchers & Investors
For investors, the real moral of the story is that you don't need to give all your money to the same handful of companies. The scaling story would have you believe that the path forward for AI is straight and narrow; in actuality there are many potentially promising paths, and much remains unknown and yet to be discovered.
Investing in AI infrastructure is necessary, but the real winners will be companies with cutting-edge ideas, making strides with novel architectures, and doing so cheaply. The powers that be would have you think it's a done deal, and that the only thing we need to do to get to AGI (and the promised land it will supposedly bring) is simple scaling of transformer base architectures with minimal additions. That's totally wrong.
Recent news in the industry should help promote a flourishing AI startup ecosystem where more ideas can be tried with less monopolization. That's healthier for the economy and for the truth. Anyone with good ideas and a genuine interest in the technology can now contribute. That's good news for everyone except a few, who will manage to get rich anyway.
For researchers, the decline of the scaling-law story should be welcome news. It means the picture is more complicated than was once widely believed, and that your efforts are more needed than ever. The potential for a relatively obscure name to invent something profound, cheaply, and see their name in lights is definitely there.
Conclusion
The decline of the neural scaling law as an unquestioned paradigm marks a pivotal moment in AI research and investment. While scaling has undeniably driven progress, it was never a universal law—just an empirical trend within a specific architectural framework. The future of AI will not be dictated by sheer scale alone but by breakthroughs in architecture, optimization, and novel learning paradigms. The industry must now shift from a mindset of brute-force expansion to one of strategic innovation. For investors, this means diversifying beyond the incumbents betting on infinite scaling. For researchers, this is an opportunity to redefine AI’s trajectory, exploring alternative models and techniques that scale more intelligently. The lesson here is simple: AI’s future belongs not to those who merely scale, but to those who think beyond scale.