Week 29 · Phase 4 — The Toolbox
Storing data in a straight line — the foundational data structure.
Photo · Rafał Lasiewicz / Unsplash
Phases 1 to 3 were about the language. Phase 4 is about the vocabulary of working programmers — the data structures and algorithms that everyone builds with and that everyone is expected to recognise. Almost all of these structures are built on top of the simplest one of all: a row of slots, sitting next to each other in memory, each addressable by an index.
You've met this idea before — Week 12 (variables in memory), Week 17 (arrays as pointers), and Week 26 (the STL's std::vector). This week we treat the array as the centerpiece. We'll meet two flavours, one famous gotcha, and the surprising reason almost every other data structure is, secretly, an array under the hood.
An array is a fixed number of values of the same type, stored back-to-back in memory. Like this:
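A rough sketch of that layout (illustrative: the addresses are made up, and the values match the code below, assuming 4-byte ints):

// index:      [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]
// value:       42     17     99      8     23     71      5     38
// address: 0x1000 0x1004 0x1008 0x100c 0x1010 0x1014 0x1018 0x101c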
The array's "name" is, secretly, just the address of the first cell. To get cell i, the chef computes base_address + i × element_size and reads from there. One add, one multiply, one load. O(1), regardless of how big the array is. This is why arrays are fast, and why "indexing into an array" is the cheapest data-structure operation that exists.
// Plain C-style array — fixed size, lives on the stack
int arr[8] = {42, 17, 99, 8, 23, 71, 5, 38};
arr[0]; // 42
arr[7]; // 38
arr[8]; // undefined behaviour — past the end!
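You can watch the address arithmetic happen. A minimal check (the printed addresses will differ per run, but the 12-byte gap will not, assuming a 4-byte int):

#include <cstdio>

int main() {
    int arr[8] = {42, 17, 99, 8, 23, 71, 5, 38};
    std::printf("%p\n", (void*)&arr[0]);                    // the base address
    std::printf("%p\n", (void*)&arr[3]);                    // base + 3 * sizeof(int)
    std::printf("%td\n", (char*)&arr[3] - (char*)&arr[0]);  // 12 bytes apart
}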
// std::vector — growable, lives on the heap, manages its own size
std::vector<int> v = {42, 17, 99};
v.push_back(8);
v.push_back(23);
v[3]; // 8
v.size(); // 5
Always prefer std::vector. The C-style array doesn't know its own size, decays to a pointer when passed to functions, and silently allows out-of-bounds access. std::vector handles its own memory, can grow, and offers bounds checking via .at(i). It compiles to identical machine code on the happy path; it's just safer and more flexible.
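Pointer decay in action, as a sketch; sizeof is enough to expose it:

#include <cstdio>

void takes_array(int a[8]) {           // the [8] is decoration: 'a' is really int*
    std::printf("%zu\n", sizeof(a));   // size of a pointer (8 on typical 64-bit machines)
}

int main() {
    int arr[8] = {42, 17, 99, 8, 23, 71, 5, 38};
    std::printf("%zu\n", sizeof(arr)); // 32: the whole array, 8 * sizeof(int)
    takes_array(arr);                  // arr decays to a pointer at the call
}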
How std::vector grows

A vector starts with some capacity (often 0). When you push_back and the capacity is exhausted, the vector allocates a new buffer roughly twice as large, copies the existing elements across, and frees the old buffer. This is called amortised O(1) growth: most pushes are cheap, the occasional push is expensive (when a reallocation happens), but the average over many pushes is constant.
The multiplicative strategy is what makes the maths work: if you grew by a constant amount instead, n pushes would cost O(n²) in total, because every reallocation copies nearly everything. Any growth factor above 1 gives amortised O(1); real implementations typically pick 2× or 1.5× (a smaller factor wastes less memory and lets freed blocks be reused more easily). Almost every dynamic-array implementation in every language uses this trick.
You can also pre-allocate with v.reserve(1000) if you know the size in advance — that avoids any reallocations during a known-size loop. Useful for tight inner code.
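A quick way to watch the growth happen (a sketch; the exact capacity sequence varies by standard library, e.g. MSVC grows by 1.5× rather than 2×):

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;
    for (int i = 0; i < 17; ++i) {
        v.push_back(i);
        // capacity only jumps when the old buffer was full
        std::printf("size=%zu capacity=%zu\n", v.size(), v.capacity());
    }

    std::vector<int> w;
    w.reserve(17);          // one allocation up front...
    for (int i = 0; i < 17; ++i)
        w.push_back(i);     // ...so no reallocation happens in this loop
}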
Computers are good at 1-D. To represent a 2-D matrix in 1-D memory, we lay out the rows back-to-back:
// a 3x4 matrix flattened row-major
// row 0: [ 1 2 3 4 ]
// row 1: [ 5 6 7 8 ]
// row 2: [ 9 10 11 12 ]
std::vector<int> m = {1, 2,  3,  4,
                      5, 6,  7,  8,
                      9, 10, 11, 12};
// access (row, col):
int row = 1, col = 2;
int v = m[row * 4 + col]; // 4 = number of columns; here v == 7
This is exactly what NumPy and PyTorch are doing under the hood — a tensor's shape tells you the bounds; its strides tell you how to compute the flat memory offset. Everything is a 1-D array. Everything else is bookkeeping.
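Here is that bookkeeping as a sketch (my own helper names, not NumPy's or PyTorch's): in a row-major layout each stride is the product of the dimensions after it, and the flat offset is a dot product of indices with strides.

#include <cstddef>
#include <vector>

// row-major strides: strides[k] = shape[k+1] * shape[k+2] * ...
std::vector<std::size_t> row_major_strides(const std::vector<std::size_t>& shape) {
    std::vector<std::size_t> strides(shape.size(), 1);
    for (int k = (int)shape.size() - 2; k >= 0; --k)
        strides[k] = strides[k + 1] * shape[k + 1];
    return strides;
}

// flat offset: sum of idx[k] * strides[k]
std::size_t flat_offset(const std::vector<std::size_t>& idx,
                        const std::vector<std::size_t>& strides) {
    std::size_t off = 0;
    for (std::size_t k = 0; k < idx.size(); ++k)
        off += idx[k] * strides[k];
    return off;
}

// shape {3, 4} gives strides {4, 1}, so (row, col) = (1, 2) lands at 1*4 + 2*1 = 6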
Cache-wise (Week 3), this layout means walking across a row is fast (sequential bytes) and walking down a column is slow (strided jumps). Hence the row-major-vs-column-major performance difference we saw in Week 3 — same fact, different angle.
C and C++ do not check whether you're reading a valid index. If you write arr[8] on an 8-element array, the chef happily reads the byte after the array's last element — which might be the next variable, the return address, or someone else's data. This single category of bug is responsible for an enormous fraction of historical security vulnerabilities (Heartbleed, Stagefright, the original Morris worm).
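The checked and unchecked paths side by side, as a sketch. The unchecked read compiles cleanly and may even appear to work, which is exactly the danger:

#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> v = {42, 17, 99, 8, 23, 71, 5, 38};

    // v[8] would be undefined behaviour: no error, just whatever bytes sit there.

    try {
        int x = v.at(8);   // checked access throws instead of misbehaving
        std::printf("%d\n", x);
    } catch (const std::out_of_range& e) {
        std::printf("caught: %s\n", e.what());
    }
}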
Defensive practice:
- Use v.at(i) instead of v[i] in code that might receive untrusted indices — at() throws on out-of-bounds.
- Compile with -fsanitize=address in development to catch overruns at runtime.
- Prefer range-based for over manual index loops.

A neural network is, fundamentally, a graph of tensors — high-dimensional arrays. Even a 7-billion-parameter language model is, structurally, a few thousand tensors of varying shapes. Every operation (matrix multiply, attention, layer-norm) is implemented as careful index arithmetic over flat memory. Understanding arrays — and especially understanding strides and row/column-major layouts — is half the battle in numerical performance.
Open the at::Tensor source. You'll find a shape vector, a strides vector, and a pointer to a flat buffer of bytes. That's the entire core of PyTorch. Everything else is convenience layered on top.
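A toy version of that core, to make it concrete. This is my own sketch, not PyTorch's actual code; the real at::Tensor adds dtypes, reference counting, devices, and much more:

#include <cstddef>
#include <utility>
#include <vector>

// a minimal tensor "view": shape + strides + a pointer into flat storage
struct TensorView {
    std::vector<std::size_t> shape;
    std::vector<std::size_t> strides;
    float* data; // non-owning pointer to the flat buffer

    float& at(std::size_t row, std::size_t col) { // 2-D for brevity
        return data[row * strides[0] + col * strides[1]];
    }
};

// transposing is pure bookkeeping: swap shape and strides, move no data
TensorView transpose2d(TensorView t) {
    std::swap(t.shape[0], t.shape[1]);
    std::swap(t.strides[0], t.strides[1]);
    return t;
}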
Almost every "data structure" you'll meet is, underneath, an array with rules about how to access it.
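One example of those rules: a stack is an array you may only touch at the far end. A sketch over std::vector (the standard library's own std::stack wraps an underlying container the same way):

#include <vector>

template <typename T>
class Stack {
    std::vector<T> items; // the array underneath
public:
    void push(const T& x) { items.push_back(x); } // amortised O(1)
    void pop()            { items.pop_back(); }   // O(1)
    T&   top()            { return items.back(); }
    bool empty() const    { return items.empty(); }
};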
Exercises:

1. Write a function that takes a std::vector<int> and returns the largest element.
2. Call std::vector::push_back in a loop. Print v.size() and v.capacity() at each step. Watch capacity double when full.
3. Lay a matrix out in a flat std::vector<double> with row-major layout and traverse it. Time it. Try the column-major version; time the difference.

You can now build collections. The next question is: how do you tell which collection — and which algorithm — is best? That's a language with its own notation.
Week 30 is Big O Notation.
Photo free under the Unsplash license. Cubes · Rafał Lasiewicz.