The Revolution of Logical Architectures

How we moved from a universal CPU to specialized processors — and why this fracture is power

CPU: Sixty years of stability, ten years of revolution

From 1945 to 2005, computing was dominated by a single fundamental architecture: the CPU (Central Processing Unit) based on the stored-program model. For sixty years, every improvement came from doing the same thing better and faster: more transistors, higher frequencies, larger caches. This model worked. It colonized the world.

Then, between 2005 and 2015, something irreversible happened: the emergence of radically different architectures. GPUs that process thousands of operations in parallel. TPUs built for matrix multiplication. NPUs optimized for edge inference. LPUs that eliminate non-determinism to reach extreme speed.

This isn’t a story of linear progress. It’s a story of crisis: when old solutions stop working, the system reinvents itself — and in that reinvention, power concentrates.


The era of the universal CPU (1945–2005)

The forgotten inventors of the CPU: Eckert and Mauchly

The story begins in 1945 at the Moore School of Electrical Engineering at the University of Pennsylvania. J. Presper Eckert and John Mauchly had just completed ENIAC: the first large-scale, general-purpose electronic digital computer, the size of a room.

Their revolutionary idea: instead of “programming” the machine by rewiring cables, why not store programs in the same memory as data? The stored-program concept is the foundation of modern computing. John von Neumann documented these ideas in “First Draft of a Report on the EDVAC” (1945): the document circulated widely, and the architecture became “von Neumann.” A classic pattern: collective work credited to whoever holds greater prestige.

The architecture: the power and limit of traditional CPU

  • Unified memory: programs and data share the same space
  • Control unit: coordinates execution
  • ALU: performs arithmetic-logical operations
  • Input/Output: interfaces with the outside world

The separation between processor and memory, connected by a bus, is genius and disaster at the same time: total flexibility, but also dependence on data movement. This is where the bottleneck is born.

The bottleneck: the price of CPU generality

In 1977, John Backus (Turing Award) put it plainly: the problem isn’t computing — it’s moving. Computation is fast; waiting for memory is the structural tax of a universal architecture.
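Backus's point can be made quantitative with arithmetic intensity: operations performed per byte moved. When the ratio is low, the memory bus, not the ALU, sets the speed limit. A minimal sketch (function name and numbers are illustrative):

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte moved: low values mean the memory bus,
    not the ALU, bounds performance (the von Neumann bottleneck)."""
    return flops / bytes_moved

# Dot product of two float32 vectors of length n:
# 2n FLOPs (n multiplies + n adds), 8n bytes read (two 4-byte floats per step).
n = 1_000_000
print(arithmetic_intensity(2 * n, 8 * n))  # 0.25 FLOPs per byte
```

At 0.25 FLOPs per byte, a processor capable of billions of operations per second spends most of its time waiting for data to arrive.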

CPU “magic laws”: Moore and Dennard (1965–2005)

For forty years, the industry avoided collapse through technological scaling: Moore's Law (more transistors per chip) and Dennard scaling (smaller transistors draw proportionally less power, so power density stays constant). It was the "free lunch" of computing: you just had to wait.

Intel 4004 (1971): 2,300 transistors, 740 KHz · Pentium (1993): 3.1 million, 60 MHz · Pentium 4 (2000): 42 million, 1.5 GHz
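The "free lunch" fits in one formula. Dynamic switching power is roughly P = C·V²·f; under idealized Dennard scaling, shrinking a transistor by a factor k lets capacitance and voltage drop by k while frequency rises by k. A hedged back-of-the-envelope sketch (function name and the value of k are illustrative):

```python
def dennard_power(C, V, f):
    """Dynamic switching power: P = C * V^2 * f."""
    return C * V ** 2 * f

k = 1.4  # one process generation (roughly a 0.7x linear shrink)
P0 = dennard_power(C=1.0, V=1.0, f=1.0)
# Dennard scaling: capacitance and voltage shrink by 1/k, frequency rises by k,
# so power per transistor drops by 1/k^2 while density grows by k^2:
P1 = dennard_power(C=1.0 / k, V=1.0 / k, f=k)
print(P1 / P0)  # ~1/k^2: power density stays constant
```

When voltage could no longer shrink (leakage current took over), the 1/k² term vanished, and every extra transistor or megahertz came with a heat bill. That is the 2005 wall.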

2005: the crisis — the CPU “heat wall”

Around 2005, clock speeds hit a ceiling. Not for lack of ideas, but because of physics. Dennard scaling collapsed: pushing clock higher meant catastrophic heat. This is the power wall. The universal CPU model ran into a non-negotiable limit.

Infographic: from the universal CPU to the 2005 heat wall, highlighting the memory bus and the processor–memory bottleneck

The temporary fix: multi-core CPU (2005–2010)

If we can’t make one core faster, we add more cores. But parallelism isn’t free: software must be rewritten, and Amdahl’s Law places a hard theoretical cap on the speedup, set by whatever fraction of the code stays serial. And as transistor counts keep growing while power budgets don’t, dark silicon appears: parts of the chip must remain switched off because of thermal limits.
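Amdahl's Law itself is one line of arithmetic: if a fraction of the work is serial, no number of cores can remove it. A minimal sketch (function name is ours):

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: overall speedup is capped by the serial fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# Even 95%-parallel code on 64 cores gives far less than 64x:
print(round(amdahl_speedup(0.95, 64), 1))  # 15.4
# And the ceiling as cores grow without bound is 1/serial = 20x:
print(round(amdahl_speedup(0.95, 10 ** 9), 1))  # 20.0
```

This is why "just add cores" was a temporary fix: the serial 5% quietly becomes the whole bill.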


The discovery of parallelism — the GPU era (2006–2012)

2006: NVIDIA changes the game

NVIDIA makes a strategic move: it turns a graphics processor into a parallel computing engine. GPUs are architecturally opposite to CPUs: less sophisticated control, more massive throughput. A few “smart” cores versus thousands of “simple” cores.

The “aha” moment: matrix multiplication

3D graphics and deep learning share the same core: massive matrix multiplication. When the community proves GPUs can accelerate training, the point isn’t “faster” — it’s “suddenly feasible.”
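The shared core is easy to state in code: a dense neural-network layer is just a matrix multiply, and every output cell is an independent dot product, which is exactly the independence thousands of GPU cores exploit. A toy sketch (the weights are made up):

```python
def matmul(A, B):
    """Naive matrix multiply. Each output cell C[i][j] is an independent
    dot product: all of them can be computed in parallel."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

# A tiny "dense layer": 2 inputs, 3 outputs (hypothetical weights).
x = [[1.0, 2.0]]
W = [[0.5, -1.0, 0.0],
     [1.0,  0.5, 2.0]]
print(matmul(x, W))  # [[2.5, 0.0, 4.0]]
```

On a CPU this triple loop runs mostly one step at a time; on a GPU, each output cell can be assigned to its own thread.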

CUDA: the ecosystem that creates a monopoly

The decisive move is CUDA (2006): programming GPUs becomes accessible, but with a condition: the ecosystem is proprietary and runs only on NVIDIA. This is where technical advantage turns into structural rent.

2012: AlexNet — the detonation

AlexNet wins ImageNet and opens the era of practical deep learning. GPU training becomes standard. But what consolidates isn’t just a technology — it’s an industrial dependency on the software interface.


Google and the inference crisis — the TPU era (2013–2024)

2013: the calculation that scares Google

If every planet-scale service embeds neural networks, datacenters explode. GPUs are great for training, but inference (millions of requests, one at a time) is a different regime: latency, efficiency, operating cost.

TPU v1: absolute specialization

Google builds a dedicated ASIC: the TPU. The key architecture is the systolic array. The idea is brutally simple: keep data flowing between compute units, reusing it as much as possible and minimizing trips to memory during compute. It’s a direct response to the real enemy: memory, not the ALU.

Visual: a systolic array and on-chip dataflow, comparing memory access patterns in CPU/GPU versus TPU compute
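The systolic idea can be sketched functionally. This is not cycle-accurate hardware simulation; it only shows the weight-stationary accounting: weights are loaded into the array once, and every input vector streams past them, so each weight is fetched from memory a single time no matter how many inputs arrive (function name is ours):

```python
def systolic_matvec(W, xs):
    """Weight-stationary systolic sketch: cell (i, j) holds weight W[i][j]
    and performs one multiply-accumulate per input element that flows by.
    Weights are never re-fetched from memory between input vectors."""
    n = len(W)
    results = []
    for x in xs:                   # one input vector per wave
        acc = [0.0] * n
        for j in range(n):         # input element x[j] flows across column j
            for i in range(n):     # each resident weight fires once
                acc[i] += W[i][j] * x[j]
        results.append(acc)
    return results

W = [[1.0, 2.0],
     [3.0, 4.0]]
print(systolic_matvec(W, [[1.0, 1.0], [0.0, 2.0]]))  # [[3.0, 7.0], [4.0, 8.0]]
```

The contrast with a CPU/GPU is in the memory traffic: here the weight matrix crosses the memory bus once, not once per input.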

The ecosystem beats performance

Even when hardware is competitive or superior, adoption depends on the ecosystem. CUDA dominates education, frameworks, libraries, and the job market. Control of the software layer matters more than hardware.


AI at the CPU’s edge — the NPU era (2014–present)

Training in the cloud, inference everywhere

AI has to live on smartphones, laptops, IoT, cars. A discrete GPU draws too much power. The answer is the NPU: integrated acceleration, low power, low precision, real-time inference. AI becomes ubiquitous — and invisible.
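A concrete face of the low-precision trade-off is int8 quantization, a technique commonly used for edge inference: store and compute on 8-bit integers instead of 32-bit floats, cutting memory and energy at a small cost in accuracy. A minimal symmetric-quantization sketch (function names are ours, not a specific NPU API):

```python
def quantize_int8(values):
    """Symmetric int8 quantization: map floats into [-127, 127]
    using one shared scale factor."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    """Recover approximate floats from int8 codes."""
    return [v * scale for v in q]

weights = [0.1, -0.5, 0.25, 1.27]
q, scale = quantize_int8(weights)
print(q)                      # [10, -50, 25, 127]
print(dequantize(q, scale))   # close to the originals, at 1/4 the storage
```

The rounding error is the "low precision"; the 4x smaller, integer-only arithmetic is the "low power".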


The sequential problem — the LPU era (2024–present)

LLMs: one token at a time

Large Language Models generate text autoregressively: token n+1 depends on every token before it. It’s structurally sequential. On GPUs, inference becomes memory-bound: hardware waits for data more than it computes.
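The sequential structure is visible in the decoding loop itself: each step consumes the output of the previous one, so the steps cannot run in parallel, and on a GPU every step re-reads the model's weights from memory. A toy sketch (the "model" here is a hypothetical stand-in, not a real LLM):

```python
def generate(prompt, next_token, steps):
    """Autoregressive decoding: each new token depends on all previous
    tokens, so the loop is strictly sequential."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(next_token(tokens))  # step n+1 waits for step n
    return tokens

# Toy "model": the next token is the sum of the last two (illustrative only).
fib_model = lambda ts: ts[-1] + ts[-2]
print(generate([1, 1], fib_model, 5))  # [1, 1, 2, 3, 5, 8, 13]
```

No amount of parallel hardware shortens this loop; only making each iteration faster does, which is the niche the LPU targets.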

LPU: determinism as a weapon

The LPU idea: remove “intelligence” from hardware and put control into the compiler. Static scheduling, reduced non-determinism, minimized memory access. It’s extreme specialization: efficiency and latency in exchange for flexibility.


Five architectures, one division of labor

  • CPU · Optimized for: control flow, irregular workloads · Strengths: flexibility, low latency · Leaders: Intel, AMD, ARM · Where: everywhere
  • GPU · Optimized for: AI training, massive parallelism · Strengths: throughput, mature ecosystem · Leaders: NVIDIA · Where: datacenters, workstations
  • TPU · Optimized for: large-scale training, batch inference · Strengths: efficiency, stack integration · Leaders: Google · Where: cloud/internal services
  • NPU · Optimized for: edge inference, mobile AI · Strengths: low power, on-device · Leaders: Apple, Qualcomm, Samsung · Where: phones, laptops, IoT
  • LPU · Optimized for: real-time LLM inference · Strengths: determinism, low latency · Leaders: Groq (specialized stacks) · Where: inference services

Logical architectures: specialization = efficiency, but also concentration

Each transition solves a specific bottleneck. But each solution almost always shifts power: toward whoever controls the interface (software), production (foundries), and the ecosystem (education + libraries + toolchains). Specialization isn’t only engineering. It’s industrial policy.

CUDA as technological rent

Hegemony isn’t measured only in TFLOPS. It’s measured in migration costs, dependencies, and lock-in. When a platform becomes “the university,” “the default,” and “the job market,” hardware is just the visible face of power.

The manufacturing chokepoint

Advanced chips require nodes and machines that exist in only a few places on Earth. Computational infrastructure becomes a geopolitical single point of failure. This isn’t a footnote: it’s a structural condition of the digital future.

Conclusions: which future?

In seventy years we moved from a universal CPU to a fractured ecosystem of accelerators. Each fracture increases efficiency, but reduces distributed control. Knowledge is produced collectively; control tends to concentrate privately.

Decode. Resist. Reclaim.
