Computer Memory Hierarchy Explained

Your program runs ten times faster on some days than others. You changed nothing. The difference is almost always memory.

The Workspace Analogy

Picture a carpenter. On the workbench in front of them: a handful of nails and the piece they are cutting right now. That is a register. On a small shelf directly to their right: the tools and materials for the current job. That is L1 cache. In a cabinet on the other side of the workshop: everything for this week's projects. That is RAM. In the warehouse next door: raw materials and completed work going back years. That is disk. When you order from a supplier three states away, that is the cloud.

The carpenter does not stop every five seconds to walk to the warehouse. Work flows fastest when everything needed is within arm's reach. If something is not there, time is lost fetching it. A computer works the same way. Every time your CPU needs data that is not in the nearest layer, it waits. The failed lookup is called a miss, the wait it causes is called a stall, and misses are where performance goes to die.

Here is what the hierarchy looks like concretely:

Layer	Size	Latency
Registers	Bytes	< 1 ns
L1 Cache	32–64 KB	~1 ns
L2 Cache	256 KB–1 MB	~4 ns
L3 Cache	8–32 MB	~10–40 ns
RAM	8–64 GB	~100 ns
NVMe SSD	1–4 TB	~100 μs
HDD	1–8 TB	~5 ms

The jump from RAM to NVMe is 1000x. From NVMe to HDD is another 50x. These are not small differences. They are the difference between a response in microseconds and a response that is visibly, annoyingly slow.
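
The ratios quoted above fall straight out of the table. Here is a minimal sketch that recomputes them, assuming the approximate latencies listed there. The L3 entry uses the low end of its range, which is an arbitrary choice for illustration, and `step_ratios` is a hypothetical helper name.

```python
# Approximate latencies in nanoseconds, copied from the table above.
# L3 uses the low end of its 10-40 ns range (arbitrary choice for this sketch).
LATENCY_NS = {
    "L1 Cache": 1,
    "L2 Cache": 4,
    "L3 Cache": 10,
    "RAM": 100,
    "NVMe SSD": 100_000,    # 100 microseconds
    "HDD": 5_000_000,       # 5 milliseconds
}


def step_ratios(latencies):
    """Return (faster_layer, slower_layer, slowdown) for each adjacent pair."""
    names = list(latencies)
    return [
        (a, b, latencies[b] / latencies[a])
        for a, b in zip(names, names[1:])
    ]


for faster, slower, ratio in step_ratios(LATENCY_NS):
    print(f"{faster:>8} -> {slower:<8} {ratio:>8,.0f}x slower")
```

Every step down the ladder costs a multiple of the step before it, and the multiples compound: one HDD access costs as much as millions of L1 hits.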

Why the Hierarchy Exists

The answer is physics and economics, not poor engineering.

Registers are built from the same transistors as the processor itself: flip-flops embedded directly on the chip, capable of being read in a single clock cycle. They are also tiny, expensive per bit, and generate heat in proportion to how fast they switch. You cannot make a terabyte of registers because the chip would be the size of a room and would need its own substation to power it.

DRAM, the technology behind RAM, is cheaper because each bit is one transistor and one capacitor, where the SRAM used for caches typically needs six transistors per bit. But that capacitor leaks charge and needs constant refreshing, which takes time. You get density and a low price per gigabyte in exchange for tens of nanoseconds of latency. The trade is intentional.

Disk storage is cheap and durable, but mechanical hard drives have to spin physical platters and move a read head into position. Electrical signals travel at a large fraction of the speed of light. Spinning metal does not. Even NVMe SSDs, which have no moving parts, read from flash cells that are slower than DRAM and go through a controller that adds latency RAM never has.

The hierarchy is not a limitation. It is the engineering answer to an impossible question: how do you give a CPU fast data access, large storage, low cost, and acceptable power draw simultaneously? The answer is: you do not. You build a ladder and you make data climb it.

Where This Shows Up in Production

This is not textbook theory. It runs through every system you will ever operate.

Redis and Memcached exist largely because reading from RAM is orders of magnitude faster than a database query that has to hit disk. When you cache a query result, you are explicitly managing a layer of the hierarchy. The cache hit rate in a Redis deployment is the same concept as L1 hit rate in a CPU. The scale is different. The principle is identical.
Database index design is a direct application of hierarchy thinking. A full table scan forces the database to pull every row from disk into RAM, even when the query needs only a few of them. An index is much smaller, often fits in RAM, and narrows the lookup to a handful of pages, so most of the disk reads never happen. Every time you write CREATE INDEX, you are moving a lookup up the hierarchy.
Java and Python heap sizing decisions come back to RAM. When a JVM heap is too small, the garbage collector runs constantly, tracing and moving objects in ways that thrash the CPU cache. Part of the latency spike you see in a GC pause is the application waiting on memory at the wrong layer.
Kubernetes pod memory limits cause OOMKilled events when a container exceeds its limit. That is the kernel enforcing a hard cap on the RAM layer. The process gets no warning and no chance to clean up. Your application just stops. The memory hierarchy has physical limits, and production systems hit them regularly.
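
The caching point above can be sketched in a few lines. This is a minimal cache-aside wrapper, assuming a plain dict as a stand-in for Redis or Memcached. `CacheAside` and `slow_query` are hypothetical names, and the sleep only simulates a disk-bound query, so the timings mean nothing beyond illustration.

```python
import time


class CacheAside:
    """Tiny in-process cache-aside wrapper that tracks its own hit rate."""

    def __init__(self, loader):
        self._loader = loader    # slow path, called only on a miss
        self._store = {}         # stand-in for Redis/Memcached
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._loader(key)    # go one layer down the hierarchy
        self._store[key] = value     # keep it in the fast layer for next time
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def slow_query(user_id):
    """Hypothetical database call; the sleep simulates a disk-bound query."""
    time.sleep(0.005)
    return {"id": user_id, "name": f"user-{user_id}"}


cache = CacheAside(slow_query)
for user_id in [1, 2, 1, 1, 3, 2, 1, 1]:
    cache.get(user_id)

print(f"hits={cache.hits} misses={cache.misses} hit_rate={cache.hit_rate:.3f}")
```

The sketch leaves out eviction and invalidation, which is where real caches get hard. A fast layer is small by definition, so something has to decide what gets pushed back down.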

The Common Mistake

Most engineers treat RAM as "memory" and treat everything else as either "fast memory" or "storage." This collapses the seven layers in the table above into two or three, and it is why people are confused when their program is slow despite having plenty of RAM.

Cache memory is where most of the performance action happens, and it is managed almost entirely by hardware without your direct control. In most languages you cannot explicitly put a value in L1 cache. What you can do is write code that accesses memory in patterns the cache hardware can predict and prefetch. Sequential access is cache-friendly. Random access is not. This is why iterating over a flat array is faster than chasing pointers through a linked list, even when both structures hold the same data.

The mental model to hold: RAM is not your program's fast memory. Registers and cache are your program's fast memory. RAM is the slow fallback you hit when the cache runs out of room.
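
One way to put a number on that mental model is the expected latency of a single access as it falls through the cache levels. The sketch below takes its latencies from the table earlier in the post. The hit rates are invented for illustration, and `average_access_ns` is a hypothetical helper that assumes a miss pays the latency of every level it passes through.

```python
def average_access_ns(levels, fallback_ns):
    """Expected latency of one access through a chain of cache levels.

    levels: list of (latency_ns, hit_rate), fastest level first.
    fallback_ns: latency of the final layer, which always hits.
    Simplification: a miss pays the latency of every level it passed through.
    """
    expected = 0.0
    reach_probability = 1.0    # chance an access gets this far down
    for latency_ns, hit_rate in levels:
        expected += reach_probability * latency_ns
        reach_probability *= 1.0 - hit_rate
    return expected + reach_probability * fallback_ns


RAM_NS = 100

# (latency from the table, ASSUMED hit rate) for L1, L2, L3
cache_friendly = [(1, 0.95), (4, 0.80), (10, 0.75)]
cache_hostile = [(1, 0.60), (4, 0.50), (10, 0.50)]

print(f"friendly: {average_access_ns(cache_friendly, RAM_NS):.2f} ns per access")
print(f"hostile:  {average_access_ns(cache_hostile, RAM_NS):.2f} ns per access")
```

With these made-up hit rates the friendly pattern averages roughly 1.5 ns per access and the hostile one roughly 15 ns. The same CPU and the same RAM give close to a tenfold gap, and the amount of RAM installed never enters the calculation.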

The Example

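import time

# 3,000 x 3,000 keeps memory use to a few hundred MB. Each Python int is a
# separate heap object, so 10,000 x 10,000 would need several GB.
SIZE = 3_000
matrix = [[i * SIZE + j for j in range(SIZE)] for i in range(SIZE)]

# Row-major traversal (cache-friendly: walks one row list front to back)
start = time.perf_counter()
row_total = 0
for i in range(SIZE):
    for j in range(SIZE):
        row_total += matrix[i][j]
row_major_time = time.perf_counter() - start

# Column-major traversal (cache-unfriendly: hops to a different row list on every read)
start = time.perf_counter()
col_total = 0
for j in range(SIZE):
    for i in range(SIZE):
        col_total += matrix[i][j]
col_major_time = time.perf_counter() - start

# Both loops use the same indexing expression and do the same work,
# so any timing gap comes from the order of the reads.
assert row_total == col_total

print(f"Row-major:    {row_major_time:.3f}s")
print(f"Column-major: {col_major_time:.3f}s")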

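Row-major access walks each row list from front to back, so consecutive reads land next to each other in memory. Column-major access hops to a different row list on every read, which defeats the prefetcher. On the same hardware, the column-major version runs measurably slower, although the CPython interpreter's own overhead hides part of the gap, and the same loops in a compiled language typically show a much larger difference. The only difference is traversal order.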
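
To see why order matters, it helps to picture the matrix flattened into one contiguous block, which is how C arrays store it. The sketch below uses the standard library `array` module to get that layout. It is not the list-of-lists structure from the example above, and `flat_index` is a hypothetical helper introduced only for illustration.

```python
from array import array

ROWS, COLS = 4, 5

# One contiguous block of 64-bit signed integers, laid out row after row.
flat = array("q", range(ROWS * COLS))


def flat_index(i, j, cols=COLS):
    """Position of element (i, j) in a row-major flat array."""
    return i * cols + j


# Distance in elements between consecutive reads in each traversal order.
row_major_stride = flat_index(0, 1) - flat_index(0, 0)   # next column, same row
col_major_stride = flat_index(1, 0) - flat_index(0, 0)   # next row, same column

print(f"row-major step:    {row_major_stride} element(s), "
      f"{row_major_stride * flat.itemsize} bytes")
print(f"column-major step: {col_major_stride} element(s), "
      f"{col_major_stride * flat.itemsize} bytes")
```

In row-major order every read is one element past the last, so a single cache line (64 bytes on typical x86 hardware) serves several reads in a row. In column-major order every read skips a whole row. With 10,000 eight-byte columns that is an 80,000-byte jump each time, and almost every read lands on a new cache line.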

The Connection

In Post 12, "How a CPU Works", we looked at the fetch-decode-execute cycle and the role of registers in keeping the CPU fed. Registers are the top of the hierarchy examined in this post. The memory-related pipeline stalls described in Post 12 happen because the CPU requested data that was not already sitting in a register or a nearby cache. The pipeline waits. The hierarchy is the cause of that wait. Now that both pieces are in place, you can see the full picture: a CPU that can execute billions of cycles per second sitting idle because it is waiting for memory several layers down.

Takeaway

The hierarchy exists because no single storage technology is simultaneously fast, large, and cheap. Every performance optimization you ever write is, at its core, moving work up this ladder.

Next up: Cache Memory: The Layer That Makes Modern Computing Possible
