Cache Memory Explained: The Layer That Makes CPUs Fast

Your CPU can perform billions of operations per second. Without cache, it would spend most of that time waiting.

That is not an exaggeration. A modern CPU can execute an instruction in under a nanosecond. Fetching data from RAM takes 60 to 100 nanoseconds. Do the arithmetic. If every instruction required a round trip to RAM, your processor would sit idle for the vast majority of its working life, and everything you run, every app, every query, every request, would slow to a crawl.
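The arithmetic can be made concrete with a short sketch. It assumes the worst case described above, one RAM round trip per instruction, and takes 1 ns as a generous upper bound on instruction time. `idle_fraction` is a hypothetical helper name used only for illustration.

```python
def idle_fraction(instr_ns: float, ram_ns: float) -> float:
    """Fraction of time spent waiting if every instruction needs one RAM fetch."""
    return ram_ns / (instr_ns + ram_ns)

# Figures from the text: about 1 ns per instruction (upper bound),
# 60 to 100 ns per RAM fetch.
for ram_ns in (60.0, 100.0):
    print(f"RAM at {ram_ns:.0f} ns: idle {idle_fraction(1.0, ram_ns):.1%} of the time")
```

Under these assumptions the processor spends more than 98% of its time waiting, at either end of the RAM latency range.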

Cache is the solution the industry landed on. It is fast memory, very fast, placed physically on the CPU die itself. When the processor needs data, it checks cache first. If it finds it there, it continues at full speed. If it does not, it pays the penalty of fetching from RAM. That distinction, hit or miss, is the single most important factor in whether a piece of code is genuinely fast or just looks fast in a benchmark.

Three Levels, Three Trade-offs

There is not one cache. On most modern CPUs there are three, and understanding why requires understanding the problem each one solves.

L1 cache is the smallest and fastest. Typically 32 to 64 KB per core, latency around 1 nanosecond. It lives directly on the processor core. Every core has its own L1.
L2 cache is larger, 256 KB to a few MB per core, and slightly slower, 3 to 10 nanoseconds. Still on-chip, still dramatically faster than RAM.
L3 cache is shared across all cores on the chip. Anywhere from 8 MB to 64 MB is common, with some server chips carrying considerably more. Latency is 30 to 40 nanoseconds. Slower than L1 and L2, but still roughly two to three times faster than going out to RAM.
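How the three levels combine can be sketched as an expected access time. The latencies below are picked from within the ranges above, with 80 ns for RAM. The hit rates are illustrative assumptions, not measurements, and `average_access_ns` is a hypothetical helper.

```python
def average_access_ns(levels, ram_ns):
    """Expected access time for a list of (latency_ns, hit_rate) from L1 outward.

    Each hit_rate is the fraction of requests reaching that level that it serves.
    Latencies are treated as the total time to get data from that level.
    """
    expected = 0.0
    reach = 1.0  # fraction of requests that get as far as the current level
    for latency_ns, hit_rate in levels:
        expected += reach * hit_rate * latency_ns
        reach *= 1.0 - hit_rate
    return expected + reach * ram_ns  # whatever is left goes to RAM

# Assumed hit rates: 90% in L1, 80% of the remainder in L2, 70% of the rest in L3.
levels = [(1.0, 0.90), (6.0, 0.80), (35.0, 0.70)]
print(f"With cache:    {average_access_ns(levels, ram_ns=80.0):.2f} ns")
print(f"Without cache: {average_access_ns([], ram_ns=80.0):.2f} ns")
```

With these assumed hit rates only 0.6% of requests reach RAM, which is why the average stays close to L1 latency even though RAM is far slower.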

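The hierarchy exists because of physics and cost. Cache is built from SRAM, which uses several transistors per bit, so each byte takes far more die area and power than a byte of DRAM. Size also works against speed: the larger a memory array, the farther signals must travel and the longer each lookup takes. You cannot put 64 GB of L1 cache on a chip; the economics and thermal constraints do not allow it, and a cache that large would no longer be fast. So the design is layered: tiny and extremely fast close to compute, larger and slightly slower further out. The memory hierarchy explored in Post 13 is exactly this principle playing out at the hardware level.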

How Cache Knows What to Load

Cache does not wait for you to ask. It predicts.

The hardware uses two principles to make those predictions:

Temporal locality: if you accessed a piece of data just now, you will probably access it again soon. A loop counter is the obvious example. Cache keeps recently used data warm.
Spatial locality: if you accessed address X, you will probably need addresses near X shortly. Cache loads not just the byte you asked for, but an entire cache line, typically 64 bytes, on the assumption that neighbouring data will be needed next.

These two principles are why sequential access patterns are dramatically faster than random access patterns. When you iterate through an array in order, you get spatial locality for free. Every cache line you load serves multiple subsequent reads. When you jump around memory randomly, every access is likely a miss, and you pay the full RAM penalty every single time.
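The gap between sequential and random access can be illustrated with a toy simulation. This is a minimal sketch, assuming a fully associative cache with least-recently-used eviction and 512 lines of 64 bytes (32 KB, about the size of an L1). Real caches are set-associative and add hardware prefetching, so the numbers show the principle rather than real hardware behavior. `hit_rate` is a hypothetical helper.

```python
import random
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line, as in the text

def hit_rate(addresses, num_lines=512):
    """Replay byte addresses through a toy LRU cache and return the hit fraction."""
    cache = OrderedDict()  # line number -> None, ordered oldest to newest
    hits = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = None             # a miss loads the whole line
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)

N = 100_000
sequential = [i * 8 for i in range(N)]  # 8-byte elements read in order
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)      # same addresses, random order

print(f"sequential: {hit_rate(sequential):.1%}")
print(f"random:     {hit_rate(shuffled):.1%}")
```

In sequential order, the first read of each line misses and the next seven hit, giving 87.5%. In random order the same addresses hit far less often, because the line an access needs has usually been evicted or never loaded.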

Where This Appears in Real Systems

This is not an abstract hardware concern. Cache behavior is the difference between fast production code and slow production code.

Database engines are designed to be cache-conscious. PostgreSQL and MySQL store table pages in buffer pools specifically to avoid repeated disk reads, but underneath that, the CPU cache behavior of how those pages are scanned matters too. Column-oriented databases like ClickHouse, and columnar file formats like Apache Parquet, store data by column rather than by row, which improves CPU cache utilization for analytical queries that read one column at a time.
Redis keeps its entire dataset in memory by design. Part of what makes it fast is that hot working data can stay in CPU cache across operations. A Redis server handling repetitive key lookups on a small hot dataset is essentially running from L3 cache.
Java and the JVM have a garbage collector that compacts the heap in part to improve cache performance. Fragmented objects scattered across memory produce cache misses. Compact, sequential object layout produces cache hits.
Linux kernel scheduling is cache-aware. The scheduler tries to run a thread on the same core it previously ran on. This is called cache affinity, and it exists entirely to keep the L1 and L2 warm with that thread's data.
AWS and cloud instance sizing are often misread through the lens of vCPU count and clock speed alone. Two vCPUs running a well-structured workload with good cache locality can outperform eight vCPUs running a cache-thrashing access pattern. This is a real cost and performance decision.
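The column-store point can be made concrete by counting cache lines. This is a rough sketch, assuming a hypothetical record of eight 8-byte fields (64 bytes per row) and a query that reads a single field from every record. `lines_touched` is an illustrative helper, not part of any database.

```python
LINE_SIZE = 64  # bytes per cache line

def lines_touched(num_rows, field_size, row_size):
    """Distinct cache lines loaded when reading one field from every record.

    row_size is the distance in bytes between consecutive values of that field:
    the full record width in a row store, or just field_size in a column store.
    """
    touched = set()
    for r in range(num_rows):
        start = r * row_size
        first = start // LINE_SIZE
        last = (start + field_size - 1) // LINE_SIZE
        touched.update(range(first, last + 1))
    return len(touched)

rows = 100_000
row_store = lines_touched(rows, field_size=8, row_size=64)    # field buried in each record
column_store = lines_touched(rows, field_size=8, row_size=8)  # values packed back to back
print(row_store, column_store)  # 100000 12500
```

In the row layout every cache line delivers 8 useful bytes out of 64. In the column layout every byte loaded is one the query needs, so the same scan pulls an eighth of the data through the cache.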

The Mistake Most Engineers Make

The assumption is that cache is automatic and invisible, so you do not need to think about it.

It is automatic. It is not invisible.

The hardware manages cache without your involvement, but the access patterns your code produces are entirely your responsibility. Two algorithms with identical computational complexity can differ by an order of magnitude in real execution time if one is cache-friendly and the other is not. A matrix traversal in row-major order on a row-major stored matrix is fast. The same traversal in column-major order produces a cache miss on nearly every access. Same Big O notation. Dramatically different wall-clock time.

The Example

import time
import numpy as np

N = 4096
# NumPy arrays are row-major (C order) by default: each row is contiguous in memory.
matrix = np.arange(N * N, dtype=np.float64).reshape(N, N)

# Cache-friendly: sum one contiguous row at a time
start = time.perf_counter()
total = 0.0
for i in range(N):
    total += matrix[i, :].sum()
row_time = time.perf_counter() - start

# Cache-unfriendly: sum one strided column at a time
start = time.perf_counter()
total = 0.0
for i in range(N):
    total += matrix[:, i].sum()
col_time = time.perf_counter() - start

print(f"Row-major: {row_time:.3f}s")
print(f"Col-major: {col_time:.3f}s")

Row-major traversal reads memory sequentially, so every cache line it loads is fully used. Column-major traversal jumps N x 8 bytes (32 KB here) between consecutive elements, so nearly every access lands on a different cache line. Each row or column is summed inside NumPy rather than element by element in Python, because interpreter overhead in a pure-Python inner loop would otherwise swamp the memory effect. Run this and compare the two timings; the exact ratio depends on your CPU and its cache sizes.
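The size of the jump in the column-major case follows directly from the array layout. The sketch below works it out for the 4096 x 4096 float64 matrix used above, assuming 64-byte cache lines. `offset` is a hypothetical helper.

```python
N = 4096
ITEM_SIZE = 8   # bytes per float64
LINE_SIZE = 64  # bytes per cache line

def offset(i, j):
    """Byte offset of element (i, j) in a row-major N x N array."""
    return (i * N + j) * ITEM_SIZE

row_step = offset(0, 1) - offset(0, 0)  # moving along a row
col_step = offset(1, 0) - offset(0, 0)  # moving down a column

print(f"row step: {row_step} bytes, {LINE_SIZE // row_step} elements per cache line")
print(f"column step: {col_step} bytes, {col_step // LINE_SIZE} cache lines apart")
```

Moving along a row advances 8 bytes, so eight consecutive elements share one line. Moving down a column advances 32,768 bytes, which puts consecutive accesses 512 cache lines apart.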

The Connection

In Post 13, Memory: From Registers to SSDs, Why the Hierarchy Exists, the full memory hierarchy was mapped from registers down to spinning disk. Cache sits near the top of that hierarchy, just below the registers, for a reason. It is not an add-on or an optimization afterthought. It is the layer that makes the rest of the hierarchy usable. Without it, the gap between CPU speed and RAM speed would make modern computing economically impractical.

What You Keep

Cache is not a speed bonus. It is the reason modern software runs at all.

Next up: How RAM Works: Volatile Memory and Why It Matters
