Week 08 · Phase 1 — The Silicon Foundation
Specialised brains for graphics and AI — and why one chip can do what no CPU ever could.
Photo · Abdullah Abid / Unsplash
The CPU is a brilliant generalist. It will run anything you throw at it — a database, a browser, an AI model, your code, a 1985 video game emulator — at extraordinary speed, one instruction at a time.
But "one at a time" turns out to be the wrong shape for two of the most important workloads of the last forty years. Drawing pixels onto a screen — millions of identical lighting calculations, all at once. Multiplying matrices for a neural network — billions of identical multiplications, all at once. The CPU does these one at a time, very fast. The GPU does them all at the same time, more slowly per operation, with thousands of small chefs in parallel.
And that — not Moore's Law, not transistor count, not algorithmic ingenuity — is the reason modern AI exists. There was always a CPU that could, in principle, do the math. There was never one that could do it before the heat death of the universe.
Two genuinely different design philosophies, sitting an inch apart on the same motherboard:
"Give me anything. I'll figure it out, fast."
"Give me ten thousand things. I'll do them all at once."
Each GPU "core" (NVIDIA calls them CUDA cores; AMD calls them stream processors; Apple calls them shader cores) is dramatically less capable than a CPU core. It runs at half the clock speed. It hates conditional branches. It can barely access memory on its own. But there are ten thousand of them, and they're glued together such that one instruction can be issued to a thousand of them at once. Same Instruction, Multiple Data — the SIMD idea from last week, taken to a maximalist extreme.
The CPU is a single chef who can roast a chicken, debone a fish, and lobby Congress. The GPU is the entire prep brigade chopping ten thousand identical cubes of carrot. You absolutely want both.
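You can feel the shape of that difference on any laptop, no GPU required. Here is a minimal sketch in Python with NumPy (the array size and the half-plus-a-constant arithmetic are arbitrary choices for illustration): the same work written once per element, and then once over the whole array.

```python
import time
import numpy as np

# A million independent values — the same tiny calculation for each one.
values = np.random.rand(1_000_000).astype(np.float32)

# The generalist way: one element at a time, in a plain Python loop.
start = time.perf_counter()
out_loop = [v * 0.5 + 0.1 for v in values]
loop_ms = (time.perf_counter() - start) * 1000

# The data-parallel way: one expression issued over the whole array at once.
# NumPy hands this to vectorised (SIMD) machine code; a GPU framework would
# hand the identical expression to thousands of cores instead.
start = time.perf_counter()
out_vec = values * 0.5 + 0.1
vec_ms = (time.perf_counter() - start) * 1000

print(f"one at a time: {loop_ms:.1f} ms   all at once: {vec_ms:.1f} ms")
```

The vectorised line is not cleverer arithmetic — it is the same arithmetic handed over in bulk, which is exactly the bargain the GPU offers.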
Eight million tiny lighting calculations, refreshed sixty times a second, on every screen you own. The original GPU job description.
Pixels, originally. In the late 1990s, gamers wanted 3D graphics — and 3D graphics is, almost embarrassingly, the same calculation done over and over for every pixel on the screen. Multiply this position vector by that camera matrix, light it from this direction, sample that texture, write the result. Two million pixels at sixty frames per second is a hundred-and-twenty million identical calculations per second, all independent of each other.
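As a toy version of that per-pixel work — a sketch only, with random stand-in matrices; a real pipeline also normalises, clips, and samples textures — here is one camera matrix applied to a couple of million positions in a single call:

```python
import numpy as np

# Illustrative only: a 1920x1080 frame, one 3D position per pixel, random data.
n_pixels = 1920 * 1080                                          # ~2 million
positions = np.random.rand(n_pixels, 4).astype(np.float32)      # homogeneous coordinates
camera = np.random.rand(4, 4).astype(np.float32)                # stand-in view-projection matrix
light_dir = np.array([0.0, 1.0, 0.0, 0.0], dtype=np.float32)    # one light direction

# "Multiply this position vector by that camera matrix" — for every pixel at once.
projected = positions @ camera.T            # two million 4x4 matrix-vector products

# "Light it from this direction" — one dot product per pixel, again all at once.
brightness = np.clip(projected @ light_dir, 0.0, 1.0)

print(projected.shape, brightness.shape)    # (2073600, 4) (2073600,)
```

No pixel's result depends on any other pixel's result — that independence is the whole reason a chip full of small dumb cores can eat this workload alive.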
NVIDIA, ATI (now AMD), and a few others built a chip whose entire shape was wrong for general computing but exactly right for this. Lots of small dumb cores. Wide buses to specialised graphics memory. A baked-in pipeline for "transform, light, rasterise, shade". Quake ran. Half-Life ran. By 2005 the GPU was a fixture of every gaming computer.
Then a strange thing happened. Around 2007 a few researchers noticed that the matrix-and-vector math at the heart of graphics was the same math at the heart of neural networks. NVIDIA released CUDA — a programming framework that let you treat the GPU as a generic parallel-math machine, not just a graphics pipeline. The first papers using GPUs to train neural networks appeared shortly after. By 2012 a network called AlexNet, trained on two consumer NVIDIA GPUs, won the ImageNet competition by a margin that broke the field. The modern AI era starts there.
The GPU did not get invented for AI. It was invented for video games, sat around for fifteen years, and then turned out to be exactly what AI needed.
How many operations per second can each chip do? The unit is FLOPS — floating-point operations per second. A typical 2024 high-end CPU sits around 1 trillion FLOPS (1 TFLOPS) using all its cores and SIMD. A high-end GPU is two orders of magnitude past that. Top-end AI accelerators add another order of magnitude.
| Chip | Year | Peak FP32 | Peak AI (FP16/BF16) | Power |
|---|---|---|---|---|
| Apple M3 Max (CPU only) | 2023 | ~1.0 TFLOPS | — | ~30 W |
| NVIDIA RTX 4090 (consumer GPU) | 2022 | ~83 TFLOPS | ~660 TFLOPS | ~450 W |
| NVIDIA H100 (data-centre GPU) | 2022 | ~67 TFLOPS | ~2,000 TFLOPS | ~700 W |
| Apple M3 Max (Neural Engine, NPU) | 2023 | — | ~18 TOPS | ~3 W |
| NVIDIA H200 (latest) | 2024 | ~67 TFLOPS | ~3,400 TFLOPS | ~700 W |
Read those AI columns slowly. The H200 does three quadrillion four hundred trillion low-precision multiply-adds per second. The CPU it's plugged into does, generously, a thousandth of that. This is why a modern data centre training run uses thousands of GPUs and barely any CPU. The CPU shovels data; the GPU computes.
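A back-of-the-envelope check makes those ratios concrete. The 10²¹-FLOP budget below is an arbitrary round number picked for illustration (roughly the scale of a modest training run), and peak numbers are never sustained in practice, so treat the results as optimistic lower bounds:

```python
# Time to churn through a fixed compute budget on each chip, using the peak
# numbers from the table above. Purely illustrative arithmetic.
BUDGET_FLOPS = 1e21  # arbitrary round number for illustration

chips_tflops = {
    "M3 Max CPU (FP32)": 1.0,
    "RTX 4090 (FP16)":   660.0,
    "H100 (FP16)":       2_000.0,
    "H200 (FP16)":       3_400.0,
}

for name, tflops in chips_tflops.items():
    seconds = BUDGET_FLOPS / (tflops * 1e12)
    print(f"{name:20s} {seconds / 3600:12.1f} hours")

# The CPU line lands around 277,000 hours (~31 years); the H200 around 82 hours.
```

Same budget, same math — the only thing that changes is how many prep cooks are chopping at once.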
If a GPU is a generalised parallel-math machine, an NPU (Neural Processing Unit) is a parallel-math machine that has further specialised. NPUs only do tensor multiplications — but they do them with tiny circuits, low precision, and brutally low power.
Apple's "Neural Engine", Google's "Tensor", Qualcomm's "Hexagon" NPU, Microsoft's "NPU" in Copilot+ PCs — these all do roughly the same thing: 8–40 trillion low-precision operations per second, on milliwatts. They cannot train a model. They are extremely good at running an already-trained one. This is why your iPhone unlocks instantly via Face ID, why your photos get auto-categorised the moment they arrive, why dictation feels live: a 0.5 W NPU is doing twenty trillion ops per second locally, while you watch.
The division of labour roughly stabilises like this: the CPU runs the program and shovels the data, the GPU trains models and serves the heavy inference in the data centre, and the NPU runs already-trained models on the device in your hand.
Photo · Growtika / Unsplash
A neural network is, structurally, an enormous chain of matrix multiplications. The shape is exactly what GPUs and NPUs were built for.
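Strip a network down to its skeleton and that structure is plain to see. A toy two-layer forward pass (sizes and weights are arbitrary; a real model simply repeats this pattern at far larger scale):

```python
import numpy as np

# Matrix multiply, apply a cheap elementwise function, repeat.
rng = np.random.default_rng(0)
x  = rng.standard_normal((32, 512)).astype(np.float32)     # a batch of 32 inputs
W1 = rng.standard_normal((512, 2048)).astype(np.float32)
W2 = rng.standard_normal((2048, 512)).astype(np.float32)

h = np.maximum(x @ W1, 0.0)   # matmul + ReLU
y = h @ W2                    # another matmul

# Almost all of the arithmetic lives in the two matmuls: roughly
# 2*32*512*2048 + 2*32*2048*512 ≈ 134 million floating-point operations
# for this toy — and every one of them is independent of its neighbours.
print(y.shape)
```

Everything else in a modern model — attention, convolutions, embeddings — reduces to more of the same: big, regular, independent multiply-adds, which is why the prep brigade wins.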
Phase 1 is now complete. You have, in your head, a working model of the chef, the kitchen, the catalogue, the cop, the streets, the road width, and the army of prep cooks. Every machine in this room — and every model running in any cloud — is some specific arrangement of those ingredients. Nothing about it is mysterious any more.
What's left is to actually tell the chef what to do. Phase 2 is where we stop talking about the kitchen and start writing recipes. We start with the language that has been the lingua franca of computing for fifty years — the language whose ghost is in every modern compiler, every operating system, every AI framework. The language of Bell Labs, 1972.
When somebody says "AI runs on silicon", they really mean: AI runs on a GPU plugged into a CPU sharing a bus with some RAM. Now you know what every word in that sentence means.
See your three chefs:
- On a Mac, run the powermetrics command in Terminal to see ANE (Apple Neural Engine) activity per second.
- On a machine with an NVIDIA card, install nvtop (or run nvidia-smi -l 1) to see GPU utilisation in real time. Run a Stable Diffusion generation and watch it pin to 100%.

The numbers are not abstract — every one of them maps onto the chef-and-prep-cooks model you now carry in your head.
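If you'd rather log those numbers than watch a dashboard, here is a small sketch for the NVIDIA case (it assumes nvidia-smi is on your PATH and a single GPU; the query flags used are the standard ones):

```python
import subprocess
import time

# Polls the GPU once a second — like `nvidia-smi -l 1`, but as data you can log.
QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,power.draw",
         "--format=csv,noheader,nounits"]

while True:
    out = subprocess.run(QUERY, capture_output=True, text=True).stdout.strip()
    util, mem, power = [field.strip() for field in out.split(",")]
    print(f"GPU {util:>3}%   {mem:>6} MiB   {power:>6} W")
    time.sleep(1)
```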
Phase 1 is done. You have the silicon. You have the kitchen. From next week, we start writing for the chef.
Week 09 is 1972 & Bell Labs — why the language we start with, fifty years on, is C. The most influential programming language of all time, written by two people in a back office for a phone company. Phase 2 begins.
All photos are free under the Unsplash license. GPU · Abdullah Abid · Pixels · Anatoly Maltsev · Neural · Growtika. Chef-vs-cooks comparison and ops table are inline CSS / SVG.