GPT-4, released in 2023, is estimated to have been trained by running about 25,000 GPUs on Microsoft Azure for months. The compute poured into that single training run was roughly 2×10²⁵ floating-point operations (FLOPs). Even the older GPT-3's training alone burned about 1,287 MWh of electricity — more than a century's worth of power for an average household, spent to build just one model. Behind the casual "hey, summarize this" we type lies a world of physics and stacks of cash.

This article digs into how an LLM (large language model) actually works, from three directions: mechanism, power, and money. Specifically: (1) why can an LLM produce language out of a collection of knobs called "weights (parameters)," (2) how much electricity does one question or one training run consume, and (3) is the claim that "frontier LLM development is a money fight" true? The short answer to the third: for the absolute frontier it is essentially true, but a counter-current in which "cash alone doesn't win" has grown stronger in 2026.

My stance up front: an LLM's "intelligence" is neither magic nor consciousness. It is the result of beating a giant probability-prediction machine into shape with electricity. Understanding the mechanism dissolves both excessive hype and excessive fear. This article goes into intermediate-level depth. If you're starting from "what even is an LLM," read "what is an LLM (primer)" first; for context length see "the context window"; for pricing see "AI API for beginners."

HOW LLMs WORK · WEIGHTS × POWER × CASH

Dissecting an LLM From Three Directions: what intelligence is made of, the power it burns, the cash it costs

Mechanism: Weights predict the next word. Hundreds of billions to 1T+ knobs just computing probabilities.
Power: One query ≈ 0.4–33 Wh. One training run = 100+ household-years of power.
Cash: $200–500M at the frontier. By 2027, training runs of $1–3B are projected.

An LLM's smarts are no magic. They are the result of beating a giant probability machine into shape with power and cash.
Know the mechanism, and both hype and fear dissolve.

1. An LLM Just Keeps Guessing "the Next Word"

It may sound surprising, but ChatGPT, Claude, and Gemini all essentially do one thing: given the text so far, compute a probability for every candidate next word (more precisely, "token"), pick one, and append it. That's it. Feed it "the cat is on the ___" and it assigns probabilities to candidates like "mat," "couch," and "floor," then emits the highest one (or one sampled according to those probabilities). It repeats this one token at a time until the text ends.
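The guess-and-pick step can be sketched in a few lines. This is a minimal illustration, not how any production model is implemented: the candidate words and their raw scores (`logits`) are made-up values, and `softmax` and `pick_next_token` are hypothetical helper names.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def pick_next_token(probs, greedy=True):
    """Pick the top token (greedy) or sample one in proportion to its probability."""
    if greedy:
        return max(probs, key=probs.get)
    tokens = list(probs)
    return random.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Made-up scores a model might assign after "the cat is on the"
logits = {"mat": 3.0, "couch": 2.0, "floor": 1.5, "moon": -1.0}
probs = softmax(logits)

next_token = pick_next_token(probs)                # greedy: highest probability
sampled = pick_next_token(probs, greedy=False)     # sampled: usually "mat", not always
print(next_token, sampled)
```

A real model repeats this step in a loop, appending the chosen token to the text and scoring the candidates again.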

Here is the question that trips many people up. "How can a mere word-guessing game summarize papers or write code?" The answer: "To truly guess the next word accurately, it has no choice but to 'understand' the structure of the world to some degree." Guessing "the capital of Japan is ___" requires geography; "3 + 5 = ___" requires arithmetic; "the cause of this bug is ___" requires programming knowledge held internally. As a byproduct of training "next-word guessing" to the extreme on enormous text, knowledge and reasoning emerge. That is the strange and essential nature of LLMs.

So what is computing that "next-word probability"? As foreshadowed, the lead actor is a staggering pile of numbers called "weights (parameters)." The next chapter reveals what they are.

2. What Are "Weights"? — A Trillion Knobs Make Intelligence

To put the inside of an LLM in one analogy: "a giant computation device with hundreds of billions to over a trillion 'knobs.'" Each knob is a "weight (parameter)," and when the signal of an input word passes to the next layer, it decides "which signals to strengthen or weaken, and by how much." GPT-3 had about 175 billion; the latest frontier models are said to exceed a trillion. The setting of these vast knobs is exactly what the model's learned "knowledge" is.
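To get a feel for the scale of these knobs, here is some rough arithmetic on how much storage the weights alone would need. It assumes 2 bytes per weight (16-bit numbers), which is an assumption added here and not a figure from this article; `weights_size_gb` is a hypothetical helper name.

```python
BYTES_PER_WEIGHT = 2  # assumption: 16-bit numbers; real deployments vary

def weights_size_gb(num_parameters, bytes_per_weight=BYTES_PER_WEIGHT):
    """Storage needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_weight / 1e9

gpt3_gb = weights_size_gb(175e9)     # GPT-3: about 175 billion weights
trillion_gb = weights_size_gb(1e12)  # a hypothetical one-trillion-weight model

print(f"GPT-3: {gpt3_gb:.0f} GB, 1T model: {trillion_gb:.0f} GB")
```

Under that assumption the weights of a trillion-knob model occupy terabytes, which is one reason such models are spread across many GPUs.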

WEIGHTS

How "weights" turn into language

① Tokenize: split text into word fragments (tokens) and convert them to numeric vectors
② Pass through weights: dozens of Transformer layers transform the signals by multiplying them by weights
③ Attention: weights judge which words in the sentence to focus on
④ Output probabilities: compute the probability distribution of the next token and pick one

"Learning" is the work of turning these trillion knobs little by little toward the right answer.
The finished knob settings (weights) = the model's "knowledge" itself.

The Transformer, which appeared in 2017, is the foundation of modern LLMs. Its heart is the "Attention" mechanism, which dynamically judges by weights "which word in the sentence matters to the current word." Whether "bank" in "saw the river in front of the bank" means a financial institution or a riverbank is decided by weighting its relationship to the other words in context — and this "context-dependent weighting" is exactly why an LLM can return coherent responses even over long passages. When people say "something about weighting," they mean precisely this Attention and the trillions of multiplications behind it.
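The "context-dependent weighting" idea can be sketched as scaled dot-product attention for a single word. This is a toy under stated assumptions: the 2-dimensional vectors are invented by hand, whereas in a real Transformer the query, key, and value vectors are produced by learned weight matrices. `attend` is a hypothetical helper name.

```python
import math

def softmax(xs):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # how much to focus on each context word
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors, invented for illustration only
words = ["river", "front", "saw"]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
values = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
query_for_bank = [2.0, 0.0]  # "bank" asking: which context words matter to me?

weights, output = attend(query_for_bank, keys, values)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these invented numbers "river" receives the largest weight, so the output vector for "bank" leans toward the riverbank sense.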

The crucial point: these weights were not set by hand. At first they are a blob of random numbers, meaningless. Meaning is instilled through "learning." So how does that learning happen?

3. Two Stages of Learning — Pre-training and Post-training (RLHF)

An LLM's learning splits broadly into two stages — the process by which the previous chapter's "random knobs" become "smart knobs."

Stage 1: Pre-training. Feed it internet-scale text (books, the web, code) and have it relentlessly "guess the next word." Each time it errs, all parameters are adjusted by a tiny amount in the direction that shrinks the error (this adjustment algorithm is the famous "backpropagation + gradient descent"). Repeat this over trillions of tokens, and the foundations of grammar, knowledge, and reasoning get carved into the knobs. Pre-training eats most of the compute, most of the power, and most of the cash. The astronomical ~2×10²⁵ FLOPs of a GPT-4-class model are burned here.
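The "adjust every knob a little in the direction that shrinks the error" loop can be sketched on a toy scale. The sketch below assumes a model with just three knobs, one raw score per candidate word, and uses the closed-form gradient of the cross-entropy loss; in a real network backpropagation computes the equivalent gradients through many layers. The learning rate and step count are arbitrary choices.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(weights, target_index):
    """Cross-entropy loss for next-word guessing and its gradient w.r.t. each weight."""
    probs = softmax(weights)
    loss = -math.log(probs[target_index])
    grad = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return loss, grad

vocab = ["mat", "couch", "floor"]
weights = [0.0, 0.0, 0.0]    # untrained: every word equally likely
target = vocab.index("mat")  # the word that actually came next in the text
learning_rate = 0.5          # assumption: arbitrary step size

losses = []
for step in range(20):
    loss, grad = loss_and_grad(weights, target)
    losses.append(loss)
    # nudge every weight slightly in the direction that shrinks the error
    weights = [w - learning_rate * g for w, g in zip(weights, grad)]

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Pre-training is this same loop, run over trillions of tokens and hundreds of billions of weights instead of one example and three.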

Stage 2: Post-training. A pre-trained-only model is "knowledgeable but ill-mannered." So RLHF (reinforcement learning from human feedback) and similar techniques teach it "helpful, safe ways to answer." Furthermore, from 2025 onward, the share of post-training that drills long reasoning (thinking carefully), tool use, and agentic behavior has surged, to the point that for the Claude, GPT, and Gemini families, post-training is now estimated to take up roughly 15–25% of total compute. The reason recent models "think before answering" so much is the evolution of this post-training. Multi-agent behavior is also instilled here.

4. Inference — The Moment Your Question Becomes Electricity

If training is "the construction work of setting the knobs," then inference is "the operation of actually producing answers using the finished knobs." Every time you type a question into ChatGPT, trillions of multiplications run through nearly a trillion knobs, and tokens are generated one at a time. We've seen how heavy training is — but across society as a whole, it is inference, not training, that eats the power.

The reason is simple: training runs basically once per model, but inference runs hundreds of millions of times a day worldwide. By some estimates, inference accounts for 80–90% of all AI compute, and by 2030, 75% of AI power demand is projected to be inference. "One question is hardly any electricity" — true, one is tiny. But "tiny × hundreds of millions × every day" stacks up into a nation-scale power problem. Let's look at concrete numbers next.
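A back-of-the-envelope calculation shows how quickly "tiny × hundreds of millions × every day" adds up. Every input below except the training figure is an assumption made for illustration: a dense model with one trillion weights, the common rule of thumb of about 2 floating-point operations per weight per generated token, 500-token answers, and 500 million answers per day. It ignores sparsity tricks and the extra cost of long contexts.

```python
PARAMETERS = 1e12              # assumption: dense model with one trillion weights
FLOPS_PER_PARAM_PER_TOKEN = 2  # rule of thumb: one multiply and one add per weight
TOKENS_PER_ANSWER = 500        # assumption: length of a typical answer
ANSWERS_PER_DAY = 500e6        # assumption: "hundreds of millions" of queries a day
TRAINING_FLOPS = 2e25          # GPT-4-class training compute quoted in this article

flops_per_token = PARAMETERS * FLOPS_PER_PARAM_PER_TOKEN
flops_per_answer = flops_per_token * TOKENS_PER_ANSWER
answers_to_match_training = TRAINING_FLOPS / flops_per_answer
days_to_match_training = answers_to_match_training / ANSWERS_PER_DAY

print(f"{flops_per_token:.0e} FLOPs per token")
print(f"inference matches the training run after about {days_to_match_training:.0f} days")
```

Under these assumptions, serving the model overtakes the compute of its own training run within weeks, and then keeps going for as long as the model is in service.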

5. Power — How Much Electricity Does an LLM Eat?

"AI eats power" is often said, but how much exactly? Here are the representative figures published as of 2026.

ELECTRICITY

LLM power consumption in numbers

One query (short): 0.43 Wh (GPT-4o class, one short question)
One heavy reasoning query: 33 Wh+ (long-thinking model, ~70x the light version)
Training GPT-3: 1,287 MWh, 550 t+ CO2 (an old generation)
Global data-center (DC) power: 415 → 945 TWh (2024 → 2030 forecast)

Even one short query (0.43 Wh), scaled to 700M/day, equals the power of ~35,000 U.S. households.
One data-center rack draws up to 10x the old norm; a dedicated AI DC eats 20 MW–1 GW.

What stands out is that power consumption per query differs by orders of magnitude between models. A short question to a lightweight model is under 0.5 Wh, but throwing a heavy question at a long-thinking reasoning model (the type that mulls before answering) consumes 33 Wh or more, about 70x the light version. As touched on in "the token-consumption-as-workload trap," "just do everything on the top model" is a luxury in both power and cost. Sending light errands to a light model is kind to both the planet and your wallet. Global data-center power hit 415 TWh in 2024 (about 1.5% of the world total) and is projected to more than double to 945 TWh by 2030, with AI as the main driver of that growth.
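The figures in this chapter can be combined with simple arithmetic. The sketch below uses only numbers quoted above and assumes, for simplicity, that all 700 million daily queries are short ones; the variable names are illustrative.

```python
WH_PER_SHORT_QUERY = 0.43   # Wh, short-query figure quoted above
QUERIES_PER_DAY = 700e6     # daily query volume quoted above
GPT3_TRAINING_MWH = 1287.0  # MWh, GPT-3 training figure quoted above

# Daily inference energy if every query were a short one (Wh -> MWh)
daily_mwh = WH_PER_SHORT_QUERY * QUERIES_PER_DAY / 1e6

# Days of such traffic needed to equal one GPT-3 training run
days_to_match_training = GPT3_TRAINING_MWH / daily_mwh

# Data-center forecast: 415 TWh (2024) -> 945 TWh (2030)
growth_factor = 945 / 415
annual_growth = growth_factor ** (1 / 6) - 1  # compound rate over six years

print(f"{daily_mwh:.0f} MWh per day; training matched in {days_to_match_training:.1f} days")
print(f"data-center power x{growth_factor:.2f}, about {annual_growth:.1%} per year")
```

Even with the lightest per-query figure, a few days of traffic at that volume uses as much electricity as training GPT-3 did.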

6. Is "Development Is a Money Fight" True?

Here is the question you were most curious about. "Is frontier LLM development a money fight?" The verified conclusion first: "Limited to the frontier's pre-training, it is essentially true." The numbers back it up.

MONEY FIGHT

Frontier training-cost trajectory

GPT-3 (2020): ~3×10²³ FLOPs. Off the charts for its time
GPT-4 (2023): ~2×10²⁵ FLOPs. ~25,000 GPUs
2026 frontier: 10²⁶–10²⁷ FLOPs / $200–500M
2027 forecast: a single run reaching $1–3B

Frontier training compute long grew at 4–10x per year.
One GPT-5 / Gemini Ultra-class training run = $200–500M: a money fight indeed.
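The GPT-3 figure in this trajectory can be sanity-checked with a widely used rule of thumb: training compute is roughly 6 FLOPs per parameter per training token. The roughly 300 billion training tokens used below is an outside figure, not one from this article, and `training_flops` is a hypothetical helper name. The same sketch also derives the yearly growth implied by the GPT-3 and GPT-4 numbers.

```python
def training_flops(parameters, tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * parameters * tokens

# GPT-3: about 175 billion parameters; ~300 billion training tokens
# (the token count is an outside figure, not from this article)
gpt3_flops = training_flops(175e9, 300e9)

# Yearly growth implied by GPT-3 (2020, ~3e23) and GPT-4 (2023, ~2e25)
growth_2020_to_2023 = 2e25 / 3e23
yearly_growth = growth_2020_to_2023 ** (1 / 3)

print(f"GPT-3 estimate: {gpt3_flops:.2e} FLOPs")
print(f"GPT-3 to GPT-4: x{growth_2020_to_2023:.0f} total, x{yearly_growth:.1f} per year")
```

The estimate lands near 3×10²³ FLOPs, and the implied growth of about 4x per year sits at the lower end of the 4–10x range.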

Concretely, training one GPT-5 / Gemini Ultra-class model once is estimated at $200–500 million, and some forecasts put the late-2027 frontier at $1–3 billion per run. And this is just "one successful run": behind it sit failed trial-and-error runs, data preparation, salaries, and inference infrastructure. On top of that, each data-center GPU costs tens of thousands of dollars, and running tens of thousands of them for months racks up an enormous electricity bill. A wall of money that "a bright idea" or "a clever algorithm" alone can never clear stands at the entrance to the frontier. In this sense, "money fight" is no exaggeration; it's fact. That's why only a handful of players that secured enormous capital (OpenAI, Google, Anthropic, Meta, xAI) can fight at the very front.
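To see why tens of thousands of GPUs end up running "for months," the GPT-4-class numbers can be turned into a rough duration and compute bill. Three inputs below are assumptions added for illustration and are not from this article: an A100-class peak of 312 TFLOP/s per GPU, 35% of that peak actually achieved, and a rental price of $2 per GPU-hour.

```python
TOTAL_FLOPS = 2e25           # GPT-4-class training compute quoted in this article
NUM_GPUS = 25_000            # GPU count quoted in this article
PEAK_FLOPS_PER_GPU = 312e12  # assumption: A100-class peak 16-bit throughput
UTILIZATION = 0.35           # assumption: fraction of peak actually achieved
USD_PER_GPU_HOUR = 2.0       # assumption: illustrative rental price

effective_flops_per_second = NUM_GPUS * PEAK_FLOPS_PER_GPU * UTILIZATION
training_days = TOTAL_FLOPS / effective_flops_per_second / 86_400
gpu_hours = NUM_GPUS * training_days * 24
compute_cost_usd = gpu_hours * USD_PER_GPU_HOUR

print(f"about {training_days:.0f} days, {gpu_hours / 1e6:.0f}M GPU-hours")
print(f"compute alone: about ${compute_cost_usd / 1e6:.0f}M")
```

Changing any of the three assumptions moves the result considerably, but the order of magnitude, months of wall-clock time and a compute bill in the tens to hundreds of millions of dollars, is hard to escape.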

7. But Cash Alone Doesn't Win — The Efficiency Backflow

The previous chapter said "the money fight is real." But ending the story there misreads the reality of 2026. It is by no means true that "with enough cash you win"; if anything, a counter-current has strengthened. An honest answer has to cover this other side too.

The symbolic case is China's DeepSeek, which released a series of models approaching the frontier on a relatively small budget and was said to have "reset the cost floor." Techniques for building the same performance orders of magnitude cheaper, such as efficient architectures, Mixture of Experts (MoE), distillation (transferring a big model's knowledge into a small one), and careful data-quality work, have been demonstrated one after another, driving a wedge into the "huge capital = victory" formula. In fact, frontier compute growth is projected to decelerate from as much as 10x per year to roughly 3–4x from 2026 onward, and industry attention is shifting from "just go bigger" to "how to deliver the same performance cheaper and with less power."
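Of the techniques listed, Mixture of Experts is the easiest to sketch: a small router scores a set of expert sub-networks for each token, and only the top few actually run. The example below shows one common variant (top-k selection with the gate weights normalized over the chosen experts). The router scores are made up, and `top_k_route` is a hypothetical helper, not any specific model's implementation.

```python
import math

def top_k_route(router_scores, k=2):
    """Pick the k highest-scoring experts and normalize their gate weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_scores[i] for i in chosen)
    exps = [math.exp(router_scores[i] - peak) for i in chosen]
    total = sum(exps)
    return {expert: e / total for expert, e in zip(chosen, exps)}

# Made-up router scores for one token across 8 hypothetical experts
router_scores = [0.1, 2.3, -0.5, 1.9, 0.0, -1.2, 0.4, 0.7]
gates = top_k_route(router_scores, k=2)

# Only the chosen experts run, so most expert weights sit idle for this token
active_fraction = len(gates) / len(router_scores)
print(gates, active_fraction)
```

Here only 2 of 8 experts do any work for the token, which is how a model can hold a very large number of weights while spending far less compute per token.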

So the accurate picture is this: "The race to update the frontier's 'peak performance' is a money fight. But the race to deliver 'good-enough performance' cheaply is a contest of wits and efficiency." Most models we use day to day benefit from the latter, getting cheaper, faster, and more power-efficient year by year. As written in "how far you can go on the free tier," by 2026 even free tiers reached a practical level: fruit handed to users by the efficiency backflow.

8. What's Next — The "Power and Physics" Wall After Cash

So can you scale forever just by stacking cash? No, and that's the new wall that began to appear in 2026. Above roughly 10²⁷ FLOPs, the bottleneck stops being "the budget to buy GPUs." Instead, three things block the way:

  • Power: can you continuously supply gigawatt-scale electricity in one place? Now a problem of power plants and grids
  • Interconnect: the bandwidth to synchronize tens to hundreds of thousands of GPUs without latency. There is a physical ceiling on what one giant training job can handle
  • Data: high-quality training text is itself running dry (there is a limit to how much good writing humanity has produced)

What comes after "the money fight" is "a fight of power, physics, and wits." That's why companies are now shifting toward investing in nuclear power, developing their own dedicated chips, leveraging synthetic data, and researching efficient architectures. The era in which you could win by throwing money at the problem is, ironically, turning into an era in which money alone can't win.
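The power wall becomes concrete with a little arithmetic. The sketch below assumes about 700 W per GPU at full load (roughly an H100-class accelerator) and a 1.3x overhead for cooling, networking, and host machines; both numbers are assumptions added for illustration, and the helper names are hypothetical.

```python
WATTS_PER_GPU = 700    # assumption: roughly an H100-class accelerator at full load
OVERHEAD_FACTOR = 1.3  # assumption: cooling, networking, and host machines

def cluster_power_mw(num_gpus):
    """Continuous facility power in megawatts for a cluster of num_gpus."""
    return num_gpus * WATTS_PER_GPU * OVERHEAD_FACTOR / 1e6

def gpus_for_power(megawatts):
    """How many GPUs a given power budget could feed under the same assumptions."""
    return int(megawatts * 1e6 / (WATTS_PER_GPU * OVERHEAD_FACTOR))

power_100k = cluster_power_mw(100_000)  # a 100,000-GPU cluster
gpus_at_1gw = gpus_for_power(1_000)     # what one gigawatt could feed

print(f"100,000 GPUs: about {power_100k:.0f} MW")
print(f"1 GW feeds roughly {gpus_at_1gw:,} GPUs")
```

Under these assumptions a 100,000-GPU cluster already draws on the order of 100 MW around the clock, and a gigawatt-scale site corresponds to about a million GPUs in one place, which is why the constraint shifts from budgets to power plants and grids.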

Summary

An LLM's true nature is "a giant prediction device where hundreds of billions to over a trillion 'weights' keep computing the probability of the next word." The Transformer's Attention handles "context-dependent weighting," and pre-training (which eats most of the compute, power, and cash) plus post-training (RLHF, reasoning training) make the knobs smart. The smarts are no magic — they are a byproduct of drilling "next-word guessing" to the extreme on enormous text.

On power: one short query ≈ 0.43 Wh, heavy reasoning 33 Wh+ (about 70x the light version), and GPT-3's training alone 1,287 MWh. Across society, inference accounts for an estimated 80–90% of AI compute, and global data-center power is projected to more than double to 945 TWh by 2030. "Do everything on the top model" is a luxury in both power and cost; the smart move is to pick the model by the weight of the task.

And the core question: "is LLM development a money fight?" The answer is "essentially true, limited to the frontier's pre-training" ($200–500M per GPT-5-class run; $1–3B projected for 2027). But the "cash alone doesn't win" backflow is strong too (DeepSeek's floor reset, efficiency, distillation). Updating peak performance is a cash fight; delivering practical performance cheaply is a wits fight. This two-layer structure is the reality of 2026. And next comes the physical wall of power, interconnect, and data scarcity. Understanding an LLM not as a "magic box" but as an "electricity-powered probability machine" keeps you from being swept up in either hype or fear. To learn more, see "what is an LLM (primer)," "the context window," and "the free-tier comparison."

FAQ

Q. Are more parameters (weights) always smarter?
A. "Bigger was smarter" once held almost universally, but in 2026 it's not that simple. Even at the same parameter count, performance varies greatly with data quality, post-training, and architectural ingenuity. Small-but-smart models (products of distillation and efficient design) have multiplied, and "parameter count = intelligence" no longer holds. We've entered an era of "how it's trained" over "how many."

Q. Does an LLM really "understand," or is it rote memorization?
A. Even experts disagree; it's a hard question. What's certain is that "it shows generalization that rote memorization can't explain" (it solves problems that were not in its training data). Whether that's "the same meaning-understanding as humans" is a separate question with no clear answer. Practically, treat it as "an extremely advanced prediction device that behaves as if it understands." That's exactly why it can err so confidently (hallucination).

Q. Can I build my own LLM?
A. "Frontier-class" is impossible for an individual (it needs hundreds of millions of dollars and tens of thousands of GPUs). But training a small model, or fine-tuning an existing open model, is feasible even for individuals. Moreover, most practical needs are met by using existing models via the API. There's almost no need to "build everything yourself."

Q. Is AI's power consumption a serious problem for the planet?
A. It's a fact that the scale is becoming non-negligible (data-center power is about 1.5% of the world's, projected to double by 2030). But efficiency is also advancing furiously in parallel; "power per token" is dropping year by year. The problem is less "the efficiency of one query" than "the explosive growth of total volume × frequency." How much renewables, nuclear, and dedicated chips can offset that is the future focus.

Q. In the end, what's worth knowing as a user?
A. Three things. (1) The model is a "probability predictor," so it errs even in a confident tone (verify important info). (2) Heavy questions are costly in power and money, so pick the model by the task's weight (light errands to light models). (3) "Peak performance" is a money fight, but "practical performance" gets cheaper and more power-efficient every year (waiting for free/cheap models to evolve is also smart). The more you know the mechanism, the more cheaply and cleverly you can use AI.