What Does CPU Cache Do (Explained) In 2024

What is CPU cache? Here is what you need to know about it. The CPU serves as the computer’s brain, processing user-inputted data, among other tasks.

But how is that information sent to the CPU for processing? Before reading from or writing to an address in the main memory, the processor first checks whether the data from that location is already in the cache.

If it is, the CPU reads from or writes to the cache instead of the main memory, which is substantially slower.

What Does CPU Cache Do?

Information is sent to the CPU for processing
Operation
Associativity
Address translation

Information is sent to the CPU for processing

A CPU cache is a hardware cache that a computer’s central processing unit (CPU) uses to lower the average cost (time or energy) of accessing data from the main memory.

A cache is a smaller, faster memory located closer to a processor core, and it keeps copies of the data from frequently used main memory locations.

Most CPUs contain a hierarchy of several cache levels (L1, L2, often L3, and in rare cases even L4), with separate instruction-specific and data-specific caches at level 1.

Before reading from or writing to an address in the main memory, the processor first checks whether the data from that location is already in the cache.

If it is, the CPU reads from or writes to the cache instead of the main memory, which is substantially slower.
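The check described above can be sketched in a few lines of Python. This is only a toy model: the `read` function, the dictionaries standing in for the cache and the main memory, and the two latency figures are all assumptions made up for illustration, not measurements of any real CPU.

```python
# Hypothetical latencies, chosen only to make the arithmetic visible.
CACHE_LATENCY_NS = 1
MEMORY_LATENCY_NS = 100


def read(address, cache, memory):
    """Return (value, cost_ns) for one read, checking the cache first."""
    if address in cache:                       # cache hit: fast path
        return cache[address], CACHE_LATENCY_NS
    value = memory[address]                    # cache miss: go to main memory
    cache[address] = value                     # keep a copy for next time
    return value, CACHE_LATENCY_NS + MEMORY_LATENCY_NS


memory = {addr: addr * 2 for addr in range(8)}  # toy main memory
cache = {}                                      # the cache starts out empty

first = read(3, cache, memory)    # miss: pays the memory latency
second = read(3, cache, memory)   # hit: served from the cache
print(first, second)
```

The second read of the same address is far cheaper than the first, which is the whole point of keeping a copy close to the core.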

Many contemporary desktop, server, and industrial CPUs feature three separate caches or more:

Instruction cache

Used to speed up the fetch of executable instructions.

Data cache

Used to speed up data fetch and store. The data cache is usually organized as a hierarchy of several cache levels (L1, L2, etc.; see also multi-level caches below).

The translation lookaside buffer (TLB)

Used to speed up virtual-to-physical address translation for both executable instructions and data.

A single TLB can serve both instructions and data, or a separate Instruction TLB (ITLB) and Data TLB (DTLB) can be provided.

The TLB, however, is part of the memory management unit (MMU) and is not directly related to the CPU caches.

Operation

On a cache miss, the cache usually has to evict one of its existing entries to make room for the new one.

The heuristic used to choose which entry to evict is called the replacement policy. The fundamental problem for any replacement policy is predicting which existing cache entry is least likely to be used in the future.

Predicting the future is difficult, especially for hardware caches, which are limited to simple rules that can be implemented in circuitry.

As a result, there is a range of replacement policies to choose from and no perfect way to decide among them. One popular policy, least recently used (LRU), replaces the entry that was accessed least recently.
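To make the LRU idea concrete, here is a minimal software sketch. Real hardware tracks recency with a few bits per set rather than an ordered dictionary, so the `LRUCache` class and its `access` method are purely illustrative names, not how any particular CPU does it.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # ordered from least to most recently used

    def access(self, address):
        """Touch an address; return the evicted address, or None."""
        if address in self.entries:
            self.entries.move_to_end(address)   # hit: now the most recent
            return None
        evicted = None
        if len(self.entries) >= self.capacity:
            evicted, _ = self.entries.popitem(last=False)   # evict the LRU entry
        self.entries[address] = True
        return evicted


lru = LRUCache(capacity=2)
evictions = [lru.access(a) for a in ["A", "B", "A", "C"]]
print(evictions)
```

Because "A" was touched again before "C" arrived, "B" is the least recently used entry and is the one that gets evicted.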

When data is written to the cache, it must at some point be written to the main memory as well. The timing of this write is governed by what is known as the write policy.

In a write-through cache, every write to the cache also causes a write to the main memory. In a write-back cache, by contrast, writes are not immediately mirrored to memory.

Instead, the cache keeps track of which locations have been written over (these locations are marked dirty).

The data in these locations is written back to the main memory when it is evicted from the cache.

As a result, servicing a miss in a write-back cache often takes two memory accesses: one to write the dirty location being evicted back to memory and another to read the new location from memory.
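The difference between the two write policies shows up clearly if you count how often the main memory is written. The sketch below does just that; the class names and the `memory_writes` counter are assumptions invented for this example.

```python
class WriteThroughCache:
    """Every write to the cache is also sent to main memory."""

    def __init__(self):
        self.lines = {}
        self.memory_writes = 0

    def write(self, address, value):
        self.lines[address] = value
        self.memory_writes += 1        # mirrored to memory immediately


class WriteBackCache:
    """Writes stay in the cache until the dirty line is evicted."""

    def __init__(self):
        self.lines = {}
        self.dirty = set()
        self.memory_writes = 0

    def write(self, address, value):
        self.lines[address] = value
        self.dirty.add(address)        # only mark the line dirty

    def evict(self, address):
        if address in self.dirty:      # dirty lines are written back once
            self.memory_writes += 1
            self.dirty.discard(address)
        self.lines.pop(address, None)


wt, wb = WriteThroughCache(), WriteBackCache()
for value in range(5):                 # five writes to the same location
    wt.write(0x40, value)
    wb.write(0x40, value)
wb.evict(0x40)
print(wt.memory_writes, wb.memory_writes)
```

Five writes to the same location cost five memory writes with write-through, but only one with write-back, paid at eviction time.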

Associativity

Remember that the replacement policy determines where a copy of a particular main memory entry will be placed in the cache.

The cache is referred to as fully associative if the replacement policy is free to select any entry in the cache to hold the copy.

At the other extreme, the cache is direct mapped if each entry in the main memory can go in only one place in the cache.

Many caches implement a compromise between the two and are described as set associative.

For instance, the level-1 data cache in an AMD Athlon is 2-way set associative, meaning that any particular location in the main memory can be cached in either of two locations in the level-1 data cache.
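A small simulation helps show why associativity matters. The `count_misses` function below is a hypothetical helper written for this article: it models a cache with a given number of lines and ways, uses LRU within each set, and counts misses. Setting `ways=1` gives a direct-mapped cache, and setting `ways` equal to the number of lines gives a fully associative one.

```python
from collections import OrderedDict


def count_misses(blocks, num_lines, ways):
    """Simulate an LRU set-associative cache and count the misses."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in blocks:
        lines = sets[block % num_sets]      # low bits of the block pick the set
        if block in lines:
            lines.move_to_end(block)        # hit: refresh recency
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)   # set is full: evict the LRU line
            lines[block] = True
    return misses


pattern = [0, 8, 0, 8, 0, 8]                # two blocks that collide when direct mapped
direct = count_misses(pattern, num_lines=8, ways=1)
two_way = count_misses(pattern, num_lines=8, ways=2)
fully = count_misses(pattern, num_lines=8, ways=8)
print(direct, two_way, fully)
```

With this access pattern the direct-mapped cache misses every time because both blocks fight over one slot, while two ways are already enough to hold both.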

If each location in the main memory can be cached in either of two locations in the cache, one reasonable question is: which two?

The simplest and most widely used scheme, as illustrated in the right-hand figure above, is to use the least significant bits of the memory location’s index as the index into the cache memory and to keep two entries for each index.

One benefit of this scheme is that the tags stored in the cache do not need to include the part of the main memory address that is implied by the cache memory’s index.

Because the cache tags have fewer bits, they take up less space and can be read and compared more quickly.
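The indexing scheme described above comes down to slicing an address into fields. The sketch below assumes a hypothetical cache geometry (64-byte lines and 256 sets, neither of which comes from the article) and adds a byte offset within the line, which real caches also need.

```python
LINE_SIZE = 64    # bytes per cache line (assumed)
NUM_SETS = 256    # number of sets (assumed)

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 6 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1     # 8 bits select the set


def split_address(address):
    """Split an address into (tag, index, offset) fields."""
    offset = address & (LINE_SIZE - 1)
    index = (address >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)   # only this part is stored
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))

# The original address can be rebuilt from the three fields.
rebuilt = (tag << (OFFSET_BITS + INDEX_BITS)) | (index << OFFSET_BITS) | offset
```

Only the tag has to be stored and compared, because the index is implied by which set the line sits in. That is why the tags end up shorter than the full address.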

Address translation

Most general-purpose CPUs implement some form of virtual memory. To summarize, each program running on the machine sees its own simplified address space, which contains code and data for that program only.

Each program places things in its own address space without regard for what other programs are doing in theirs.

To use virtual memory, the processor must translate the virtual addresses generated by the program into physical addresses in the main memory.

The part of the CPU that performs this translation is the memory management unit (MMU).

The fast path through the MMU can perform those translations that are stored in the translation lookaside buffer (TLB), which is a cache of mappings from the operating system’s page table.

Three features of address translation are important for the discussion here:

Latency: The physical address is available from the MMU some time, perhaps a few cycles, after the virtual address is available from the address generator.
Aliasing: Multiple virtual addresses can map to a single physical address. Most processors guarantee that all updates to that single physical address will happen in program order. To deliver on that guarantee, the processor must ensure that only one copy of a physical address resides in the cache at any given time.
Granularity: The virtual address space is broken up into pages. For instance, a 4 GiB virtual address space might be cut up into 1048576 pages of 4 KiB each, and each page can be mapped independently. Multiple page sizes may be supported; see virtual memory for further information.
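The page arithmetic and the role of the TLB can be checked with a short sketch. The `translate` function, the tiny `page_table` dictionary, and the frame numbers in it are hypothetical values picked for illustration. A real MMU walks multi-level page tables in hardware.

```python
PAGE_SIZE = 4 * 1024                     # 4 KiB pages
ADDRESS_SPACE = 4 * 1024 ** 3            # 4 GiB virtual address space
NUM_PAGES = ADDRESS_SPACE // PAGE_SIZE   # 1048576 pages


def translate(virtual_address, tlb, page_table):
    """Return (physical_address, tlb_hit) for one virtual address."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    if page in tlb:
        frame, hit = tlb[page], True           # fast path: mapping is cached
    else:
        frame, hit = page_table[page], False   # slow path: consult the page table
        tlb[page] = frame                      # remember the mapping
    return frame * PAGE_SIZE + offset, hit


page_table = {0: 7, 1: 3}   # hypothetical virtual page -> physical frame
tlb = {}
first = translate(0x1234, tlb, page_table)    # page 1, not yet in the TLB
second = translate(0x1ABC, tlb, page_table)   # same page, now a TLB hit
print(NUM_PAGES, first, second)
```

Both addresses fall in the same 4 KiB page, so the second translation is served from the TLB without touching the page table.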

Cache hierarchy in a modern processor

Specialized caches

Pipelined CPUs access memory from multiple points in the pipeline: instruction fetch, virtual-to-physical address translation, and data fetch.

The natural design is to use a different physical cache for each of these points, so that no single physical resource has to be scheduled to serve two points in the pipeline.

As a result, the pipeline naturally ends up with at least three separate caches (instruction, TLB, and data), each specialized for its particular role.

Victim cache

A victim cache holds blocks that were evicted from a CPU cache because of a conflict or capacity miss.

The victim cache sits between the main cache and its refill path, and it holds only blocks that were evicted from the main cache on a miss.

This technique is used to reduce the penalty a cache pays on a miss.
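Here is a rough sketch of how a victim cache can rescue blocks that keep knocking each other out of a direct-mapped cache. The `DirectMappedWithVictim` class, its sizes, and the result strings are all assumptions for illustration, and the details (such as exactly when blocks swap) vary between real designs.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Toy direct-mapped cache backed by a small victim cache."""

    def __init__(self, num_lines, victim_size):
        self.num_lines = num_lines
        self.victim_size = victim_size
        self.main = {}                 # line index -> block held there
        self.victim = OrderedDict()    # evicted blocks, oldest first

    def access(self, block):
        """Return 'hit', 'victim hit', or 'miss' for one block access."""
        index = block % self.num_lines
        resident = self.main.get(index)
        if resident == block:
            return "hit"
        result = "miss"
        if block in self.victim:
            del self.victim[block]     # block moves back into the main cache
            result = "victim hit"
        if resident is not None:
            self.victim[resident] = True            # keep the displaced block
            if len(self.victim) > self.victim_size:
                self.victim.popitem(last=False)     # drop the oldest victim
        self.main[index] = block
        return result


vc = DirectMappedWithVictim(num_lines=4, victim_size=2)
results = [vc.access(b) for b in [0, 4, 0, 4]]
print(results)
```

Blocks 0 and 4 map to the same line, so without the victim cache every access would be a full miss. With it, the later accesses are served from the victim cache instead of going out to memory.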

Trace cache

After being decoded or as they are being retired, instructions are stored in a trace cache.

Generally, instructions are added to trace caches in groups representing either individual basic blocks or dynamic instruction traces.

A basic block consists of a group of non-branch instructions ending with a branch.

A dynamic trace, also known as a “trace path,” can span several basic blocks and contains only instructions whose results are actually used.

It leaves out instructions that follow taken branches, since they are not executed. As a result, a processor’s instruction fetch unit can fetch several basic blocks without having to worry about branches in the execution flow.

Trace lines are stored in the trace cache based on the program counter of the first instruction in the trace and a set of branch predictions.

This makes it possible to store different trace paths that start at the same address, each representing different branch outcomes.

During the instruction fetch stage of a pipeline, the current program counter and a set of branch predictions are checked against the trace cache for a hit.

If there is a hit, a trace line is supplied to the fetch unit, which then does not need to go to a regular cache or to memory for these instructions.

The trace cache keeps feeding the fetch unit until the trace line ends or there is a misprediction in the pipeline. On a miss, a new trace starts to be built.
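The lookup key is the interesting part of a trace cache, and it can be mimicked with a dictionary keyed on the program counter plus the branch predictions. Everything here is a simplification: `fetch_trace` and `build_trace` are hypothetical names, and the "trace" is just a list of strings standing in for decoded instructions.

```python
trace_cache = {}   # (program counter, branch predictions) -> trace line


def build_trace(pc, predictions):
    """Hypothetical stand-in for the hardware that assembles a new trace."""
    steps = [f"block@{pc:#x}"]
    steps += ["taken" if taken else "not taken" for taken in predictions]
    return steps


def fetch_trace(pc, predictions):
    """Return (trace, hit) for the given start address and predictions."""
    key = (pc, tuple(predictions))
    if key in trace_cache:
        return trace_cache[key], True     # hit: supply the stored trace line
    trace = build_trace(pc, predictions)  # miss: a new trace is built
    trace_cache[key] = trace
    return trace, False


_, hit1 = fetch_trace(0x400, [True, False])   # first time: miss
_, hit2 = fetch_trace(0x400, [True, False])   # same start, same predictions: hit
_, hit3 = fetch_trace(0x400, [False, False])  # same start, other outcome: miss
print(hit1, hit2, hit3, len(trace_cache))
```

Two traces that begin at the same address live side by side in the cache because their branch predictions differ.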

The Levels of CPU Cache Memory: L1, L2, and L3

CPU cache memory is divided into three “levels”: L1, L2, and L3. Again, the memory hierarchy is ordered by the speed and size of each cache.

So, does performance depend on the size of the CPU cache?

L1 Cache

The L1 (Level 1) cache is the fastest memory in a computer system. It holds the data the CPU is most likely to need while completing a particular task, and it is checked first.

The size of the L1 cache depends on the CPU. Some high-end consumer CPUs, such as the Intel i9-9980XE, now carry around 1MB of L1 cache in total, although they are still costly and hard to find.

Some server chips, such as those in Intel’s Xeon family, also have 1-2MB of L1 cache in total.

There is no “standard” L1 cache size, so check the CPU specifications to find the exact L1 cache size before you buy.

The L1 cache is generally split into an instruction cache and a data cache. The instruction cache holds information about the operation the CPU must carry out, whereas the data cache holds the data on which that operation is to be performed.

L2 Cache

The L2 (Level 2) cache is larger than the L1 cache, but it is slower.

Modern L2 caches are measured in megabytes, whereas L1 caches may be measured in kilobytes.

For instance, the highly regarded AMD Ryzen 5 5600X has a 384KB L1 cache and a 3MB L2 cache (plus a 32MB L3 cache).
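The totals quoted for this chip are sums across its cores, which a quick calculation makes clear. The per-core figures below (6 cores, 32KB instruction plus 32KB data L1 per core, 512KB L2 per core) are assumptions taken from the commonly published specification rather than from this article, so check the official spec sheet before relying on them.

```python
# Assumed per-core figures for the Ryzen 5 5600X (not stated in the article).
CORES = 6
L1_PER_CORE_KB = 32 + 32   # instruction cache + data cache
L2_PER_CORE_KB = 512

total_l1_kb = CORES * L1_PER_CORE_KB          # total L1 across all cores, in KB
total_l2_mb = CORES * L2_PER_CORE_KB / 1024   # total L2 across all cores, in MB
print(total_l1_kb, total_l2_mb)
```

Under those assumptions the sums come out to the 384KB of L1 and 3MB of L2 mentioned above.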

Depending on the CPU, the size of the L2 cache might range from 256 KB to 8 MB. Most current CPUs have an L2 cache larger than 256 KB, a size that is now regarded as small.

Some of the most powerful contemporary CPUs also feature an L2 memory cache larger than 8MB.

The L2 cache is slower than the L1 cache, but it is still far quicker than your system RAM. The L1 cache is typically around 100 times faster than your RAM, while the L2 cache is around 25 times faster.

L3 Cache

On to the Level 3 cache. In the early days, the L3 memory cache was located on the motherboard.

That was a long time ago, when most CPUs had a single core. Today the L3 cache in your CPU can be enormous, with top-tier consumer CPUs offering L3 caches of up to 32MB. Some server CPUs go beyond this, with L3 caches of up to 64MB.

The L3 cache is the largest and slowest cache memory unit. Modern CPUs include the L3 cache on the CPU itself.

Whereas the L1 and L2 caches exist for each core on the chip, the L3 cache is more like a general memory pool that the entire chip can use.

How Does Data Move Between CPU Memory Caches?

The data moves from the RAM to the L3 cache, then to the L2 cache, and eventually to the L1 cache.

The L1 cache is the first place the CPU looks when it needs data to work on. If the CPU finds the data there, this is known as a cache hit. If not, it goes on to look in L2 and then L3, in that order.

If none of the memory caches holds the data, the CPU fetches it from your system memory (RAM). This is known as a cache miss.

Because it is the fastest and sits closest to the core, L1 cache memory has the lowest latency, whereas L3 has the highest.

On a cache miss, latency increases because the CPU must fetch the data from the system memory.
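The search order and its effect on latency can be modeled in a few lines. The cycle counts below are made-up round numbers for illustration only, and `lookup` is a hypothetical helper. Real latencies depend heavily on the specific CPU.

```python
# Hypothetical per-level latencies in CPU cycles, for illustration only.
LEVELS = [("L1", 4), ("L2", 12), ("L3", 40)]
RAM_LATENCY = 200


def lookup(address, caches):
    """Search L1, then L2, then L3, then RAM; return (where_found, cycles)."""
    cycles = 0
    for name, latency in LEVELS:
        cycles += latency              # pay for checking this level
        if address in caches[name]:
            return name, cycles        # cache hit at this level
    return "RAM", cycles + RAM_LATENCY # missed every cache


caches = {
    "L1": {0x10},
    "L2": {0x10, 0x20},
    "L3": {0x10, 0x20, 0x30},
}
found = [lookup(a, caches) for a in (0x10, 0x20, 0x30, 0x40)]
print(found)
```

Each level that has to be checked adds to the total, so data found only in RAM pays for all three cache lookups plus the memory access itself.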

Latency keeps falling as computers get faster and more efficient. Low-latency DDR4 RAM and fast SSDs both cut latency, making your whole system quicker than ever. The speed of your system memory matters here, too.

Final thought

The architecture of cache memory is constantly changing, especially as memory becomes cheaper, faster, and denser. I hope this article has helped you learn about CPU cache and what it does.
