
Scaling Laws Revisited: What Actually Determines LLM Performance?

Brixnex Editorial

March 16, 2026


The Current State of Scaling Laws in 2026

There's a lot of noise around this topic, and most of the coverage I read falls into one of two failure modes: uncritical enthusiasm that glosses over real limitations, or reflexive scepticism that misses genuine progress. What I want to do here is give you an honest picture of where things actually stand in 2026, based on working with these systems rather than reading press releases about them.

Progress in understanding what actually drives LLM performance over the past eighteen months has been real. It is not the overnight revolution that some headlines suggest, but a steady accumulation of improvements that, taken together, add up to something meaningfully different from what existed two years ago. Knowing which improvements are substantive and which are incremental helps you make better decisions about where to invest time and money.

What Has Actually Changed

The most significant recent developments in Chinchilla compute-optimal training (Hoffmann et al., 2022), emergent capabilities, data quality scaling, and the limits of naive parameter scaling share a common thread: the gap between controlled demonstration and real-world deployment has narrowed. Systems that worked well in research settings two years ago now have the reliability and tooling support to run in production. That's a different kind of progress than raw capability improvement, and in many ways it matters more for practitioners who need things to work.

At the same time, the challenges that were hard two years ago remain largely hard. Context and consistency at scale, hallucination in low-confidence domains, and evaluation that reflects real-world performance rather than benchmark performance — the field has made progress on all of these, but none of them are solved. The teams doing the best work are the ones who are clear-eyed about both the progress and the remaining gaps.

The key empirical finding that has modified the Chinchilla framework is that the compute-optimal token count depends on the intended deployment context. Chinchilla optimised for minimum training compute given a performance target. If your goal is instead minimum total cost over the model's deployment lifetime, inference included, the optimal point shifts significantly toward smaller models trained on more data. A model that serves trillions of tokens of inference amortises additional training cost very efficiently, which justifies training far beyond the Chinchilla compute-optimal frontier.
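To make the compute-optimal baseline concrete, here is a minimal Python sketch of how a training budget splits between parameters and tokens. It assumes the common approximation C ≈ 6ND for training FLOPs and a ratio of roughly 20 training tokens per parameter, the figure usually quoted from the Chinchilla paper. The function names and the budget value are illustrative and not taken from any library.

```python
import math


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs (assumes C ≈ 6 * N * D)."""
    return 6.0 * n_params * n_tokens


def compute_optimal_allocation(budget_flops: float, tokens_per_param: float = 20.0):
    """Split a training budget into (parameters, tokens).

    Assumes C = 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(budget_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


budget = 5.76e23  # illustrative training budget in FLOPs
n, d = compute_optimal_allocation(budget)
print(f"params ≈ {n / 1e9:.1f}B, tokens ≈ {d / 1e12:.2f}T")
# prints: params ≈ 69.3B, tokens ≈ 1.39T
```

With that budget the split lands close to the 70B-parameter, 1.4T-token configuration that Chinchilla itself used. A model trained on more tokens than this for its size sits past the compute-optimal point, which is the regime the inference argument is about.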

Llama 3 and Gemma 2 both trained their smaller models far past this point. Llama 3 8B, for example, was trained on roughly 15 trillion tokens, on the order of 100× the Chinchilla compute-optimal recommendation for its parameter count (about 160 billion tokens at 20 tokens per parameter). The resulting models punch significantly above their weight in inference-constrained deployments: a well-trained 7B model on 15T tokens can match the quality of undertrained 30B models on many practical tasks, at roughly one-quarter the inference cost. This has fundamental implications for which organisations can deploy capable AI: inference efficiency, not training capability, is the relevant bottleneck for most users.
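The trade-off between training cost and inference cost can be sketched in a few lines of Python. This is a rough model under stated assumptions: training costs about 6ND FLOPs, inference costs about 2N FLOPs per served token, the 7B and 30B models are taken to be of equal quality as the text claims, and the 30B model is assumed to have been trained at 20 tokens per parameter. All function names are hypothetical.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute (assumes about 6 FLOPs per parameter per token)."""
    return 6.0 * n_params * n_tokens


def inference_flops(n_params: float, served_tokens: float) -> float:
    """Approximate inference compute (assumes about 2 FLOPs per parameter per token)."""
    return 2.0 * n_params * served_tokens


def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Training plus inference compute over the deployment lifetime."""
    return training_flops(n_params, train_tokens) + inference_flops(n_params, served_tokens)


def break_even_served_tokens(small, large) -> float:
    """Served tokens above which the small model has lower lifetime compute.

    `small` and `large` are (n_params, train_tokens) tuples.
    """
    n_s, d_s = small
    n_l, d_l = large
    if n_l <= n_s:
        raise ValueError("`large` must have more parameters than `small`")
    extra_training = training_flops(n_s, d_s) - training_flops(n_l, d_l)
    saving_per_token = 2.0 * (n_l - n_s)
    return max(extra_training, 0.0) / saving_per_token


small = (7e9, 15e12)        # 7B parameters, 15T training tokens
large = (30e9, 20 * 30e9)   # 30B parameters, assumed 20 tokens per parameter

overtraining = small[1] / (20 * small[0])
break_even = break_even_served_tokens(small, large)
print(f"7B model is trained at about {overtraining:.0f}x the 20-tokens-per-parameter ratio")
print(f"break-even at about {break_even / 1e12:.1f}T served tokens")
```

Under these assumptions the smaller model only wins on total compute once it has served on the order of ten trillion tokens. Below that volume the extra training spend is not recovered, so the deployment context really does decide which point on the curve is optimal.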

The Technical Foundations

Understanding what actually drives LLM performance at a practical level requires familiarity with a few foundational concepts. This is not about having a PhD-level understanding. It's about having enough grounding to evaluate claims, understand tradeoffs, and make informed decisions about when and how to apply these techniques in real work.

The key insight that changes how you think about Chinchilla compute-optimal training, emergent capabilities, data quality scaling, and the limits of naive parameter scaling is this: performance depends heavily on the interaction between the model's capabilities, the quality of the data or context it's working with, and how the task is framed. Changing any one of these can shift the outcome dramatically. This is why benchmark results and real-world results diverge so often: the conditions differ in ways that matter. For background, see Kaplan et al. (2020).

Scaling laws as studied by Kaplan et al. (2020) and Hoffmann et al. (Chinchilla, 2022) are empirically fitted power laws relating loss to model parameters (N), training tokens (D), and compute (C ≈ 6ND for transformer training). The functional form implies smooth, predictable improvement: double the compute budget and the reducible part of the loss falls by a predictable factor. The constants in the power laws, however, depend on the architecture and the training setup: improvements in attention mechanisms, normalisation approaches, learning rate schedules, and data quality have shifted the empirical scaling curves substantially between 2020 and 2026.
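One way to see what "predictable factor" means is to evaluate a fitted form directly. The sketch below uses the parametric form from the Chinchilla paper, L(N, D) = E + A / N^α + B / D^β. The constants are approximately the values reported for that paper's fit, so treat them as illustrative assumptions rather than numbers that transfer to your own architecture or data.

```python
# Illustrative constants, approximately those reported for the Chinchilla
# parametric fit. They are assumptions here, not universal values.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss: irreducible term plus model-size and data-size terms."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


base = predicted_loss(70e9, 1.4e12)
double_params = predicted_loss(140e9, 1.4e12)
double_tokens = predicted_loss(70e9, 2.8e12)

print(f"baseline loss:       {base:.3f}")
print(f"with 2x parameters:  {double_params:.3f}")
print(f"with 2x tokens:      {double_tokens:.3f}")
print(f"doubling N scales the model-size term by {2 ** -ALPHA:.3f}")
print(f"doubling D scales the data term by {2 ** -BETA:.3f}")
```

Each doubling shrinks its own term by a fixed factor, but the loss never drops below E. That floor, and the fact that the fitted constants move when the architecture or data changes, is why these curves are guides and not guarantees.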

Scaling laws operate on next-token prediction loss (cross-entropy) on a held-out evaluation set. The relationship between this loss and downstream task performance is not uniform across tasks: some capabilities exhibit near-linear improvement with loss reduction, others show emergent step-function behaviour where a threshold loss level is required before the capability appears at all. This divergence between aggregate scaling metrics and task-specific capability trajectories is why headline parameter counts and compute budgets are imperfect predictors of real-world model utility on specific applications.

Where It Works Well

The use cases where current approaches deliver reliable value share some common characteristics: tasks where the domain is well-defined, where errors are recoverable, where there's a human in the loop for high-stakes decisions, and where you have a reasonable evaluation strategy to measure whether the system is actually working. These constraints sound limiting, but they cover a lot of practical use cases.

Teams that have deployed successfully share a pattern: they started with a narrow, well-defined use case rather than trying to solve everything at once. They built evaluation infrastructure before they built the product. They treated the first deployment as a learning exercise, not a finished product. And they had explicit plans for what good enough looked like before they started building.

Where It Still Struggles

The honest limitations of current approaches are worth naming directly. Open-ended tasks with no clear success criteria are hard to evaluate and hard to improve. Tasks requiring sustained consistency over long sessions still see degradation. Anything where the cost of a confident wrong answer is high needs human review, not autonomous action. And any task where the training distribution differs significantly from your deployment distribution will produce surprises.

None of these are reasons to avoid using AI in these areas — they're reasons to deploy thoughtfully, with appropriate safeguards and evaluation, rather than assuming the demo performance will hold in production. The teams that get burned by AI disappointments are almost always teams that deployed without this kind of evaluation in place.

Practical Guidance for Getting Started

Based on working with these systems across several different contexts: spend the first two weeks on evaluation before you spend any time on building. Understand what success looks like, build a dataset that lets you measure it, and use that to calibrate how much capability you actually need before writing a line of production code.
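An evaluation-first setup does not need much machinery. The sketch below is one minimal way to do it in Python: a small labelled dataset, a scoring function, and an explicit "good enough" bar agreed before any product code exists. Everything here is hypothetical, including the stub model, the example prompts, and the 0.9 threshold. In practice `model_fn` would wrap whichever model or API you are evaluating.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input prompt, expected answer)


def exact_match(predicted: str, expected: str) -> bool:
    """Case-insensitive exact match; swap in a task-specific scorer as needed."""
    return predicted.strip().lower() == expected.strip().lower()


def evaluate(
    model_fn: Callable[[str], str],
    dataset: Sequence[Example],
    scorer: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of examples the model answers correctly."""
    if not dataset:
        raise ValueError("evaluation dataset is empty")
    hits = sum(scorer(model_fn(prompt), expected) for prompt, expected in dataset)
    return hits / len(dataset)


# Hypothetical stand-in for a real model call.
CANNED = {"2 + 2 =": "4", "Capital of France?": "Paris"}


def stub_model(prompt: str) -> str:
    return CANNED.get(prompt, "unknown")


DATASET = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "paris"),
    ("Largest planet?", "Jupiter"),
]
GOOD_ENOUGH = 0.9  # the bar you commit to before building

accuracy = evaluate(stub_model, DATASET)
print(f"accuracy = {accuracy:.2f}, ship = {accuracy >= GOOD_ENOUGH}")
# prints: accuracy = 0.67, ship = False
```

The value is in the habit, not the code: once the dataset and the threshold exist, every later change to prompts, models, or scope can be measured against the same bar.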

Then start small. The teams that ship successful AI products nearly always start with a narrower scope than they originally planned, get that working reliably, and expand from there. The temptation to build the comprehensive version first is strong and almost always produces systems that are impressive in demos and frustrating in production. Discipline about scope is not a constraint on ambition. It's how ambitious projects actually succeed.

Looking Ahead

The trajectory over the next year points toward continued improvement in reliability, better tooling for evaluation and deployment, and increasingly capable models that are cheaper to run than current-generation equivalents. The competitive dynamics are pushing costs down and capability up across the board, which is good for teams building on top of these systems.

What is less certain: which specific approaches will win out, whether the current capability trajectory will continue at the same pace, and how regulatory developments will affect what is permissible in different markets. The teams best positioned for these uncertainties are the ones building on solid evaluation infrastructure and avoiding over-dependence on any single model or provider. Flexibility and measurement are the two most durable competitive advantages in this space right now.

References & Further Reading

Kaplan et al. (2020). Scaling Laws for Neural Language Models. Original OpenAI scaling laws paper establishing power-law relationships.
Hoffmann et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). DeepMind paper revising optimal compute-data ratios.
Gao et al. (2022). Scaling Laws for Reward Model Overoptimization. Extension of scaling laws to RLHF reward models.
Sardana et al. (2023). Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Analysis of how inference costs change optimal training ratios.

Frequently Asked Questions

What are scaling laws in AI?

Scaling laws describe the predictable relationships between model performance and three key variables: model size (parameters), training compute (FLOPs), and training data size (tokens). The landmark 2020 Kaplan et al. paper from OpenAI showed that loss decreases as a power law as these quantities increase. The 2022 Chinchilla paper from DeepMind refined these relationships, showing most models were significantly undertrained relative to their size.

Do scaling laws still hold in 2026?

The core scaling relationships still hold, but their implications have become more nuanced. Pre-training on naturally occurring text is running into data limits, because much of the high-quality public internet text has already been used. Synthetic data, multi-modal data, and longer training runs are extending the scaling curve. Architecture innovations like mixture-of-experts (MoE) decouple total parameter count from per-token compute, complicating simple scaling law predictions. The consensus is that scaling continues but requires more creative approaches.

What is the Chinchilla scaling law?

The Chinchilla paper (Hoffmann et al., 2022) found that, for a given training compute budget, models should be trained on approximately 20 tokens of data per parameter for optimal efficiency. This is the 'Chinchilla-optimal' ratio. It showed that GPT-3 and similar models were undertrained: a smaller model trained on more data would have achieved better performance for the same compute. Many modern models now deliberately train well past this ratio, because the extra training tokens buy a smaller model that is cheaper to serve.

Is there a limit to AI scaling?

Researchers debate this actively. Physical limits (memory bandwidth, energy costs) and data limits (quantity of high-quality human-generated text) are real constraints. However, synthetic data generation, multi-modal scaling, longer context, and architectural improvements are all extending the scaling frontier. The prevailing view in 2026 is that scaling will continue to yield improvements, but the gains per dollar of compute have slowed compared to 2019–2022, requiring more innovation per capability increment.
