While chiplets have been around for decades, their use was historically limited to specific, specialized applications. Today, however, they are at the forefront of technology, powering millions of desktop PCs, workstations, servers, gaming consoles, phones, and even wearable devices worldwide.
In a matter of a few years, most leading chipmakers have embraced chiplet technology to drive innovation. It’s now clear that chiplets are poised to become the industry standard. Let’s explore what makes them so significant and how they’re shaping the future of technology.
TL;DR: What are chiplets?
Chiplets are segmented processors. Instead of consolidating every part into a single chip (known as a monolithic approach), specific sections are manufactured as separate chips. These individual chips are then mounted together into a single package using a complex connection system.
This arrangement allows the parts that can benefit from the latest fabrication methods to be shrunk in size, improving the efficiency of the process and allowing them to fit in more components.
The parts of the chip that can’t be significantly reduced or don’t require reduction can be produced using older and more economical methods.
While the process of manufacturing such processors is complex, the overall cost is typically lower. Furthermore, it offers processor companies a more manageable pathway to expand their product range.
Silicon science
To fully understand why processor manufacturers have turned to chiplets, we must first delve into how these devices are made. CPUs and GPUs start their life as large discs made of ultra-pure silicon, typically a little under 12 inches (300 mm) in diameter and 0.04 inches (1 mm) thick.
This silicon wafer undergoes a sequence of intricate steps, resulting in multiple layers of different materials – insulators, dielectrics, and metals. These layers’ patterns are created through a process called photolithography, where ultraviolet light is shone through an enlarged version of the pattern (a mask), and subsequently shrunk via lenses to the required size.
The pattern gets repeated, at set intervals, across the surface of the wafer and each of these will ultimately become a processor. Since chips are rectangular and wafers are circular, the patterns must overlap the disc’s perimeter. These overlapping parts are ultimately discarded as they are non-functional.
Once completed, the wafer is tested using a probe applied to each chip. The electrical examination results inform engineers about the processor’s quality against a long list of criteria. This initial stage, known as chip binning, helps determine the processor’s “grade.”
For instance, if the chip is intended to be a CPU, every part should function correctly, operating within a set range of clock speeds at a specific voltage. Each wafer section is then categorized based on these test results.
Upon completion, the wafer is cut into individual pieces, or “dies,” that are viable for use. These dies are then mounted onto a substrate, akin to a specialized motherboard. The processor undergoes further packaging (for instance, with a heat spreader) before it’s ready for distribution.
The entire sequence can take weeks of manufacturing and companies such as TSMC and Samsung charge high fees for each wafer, anywhere between $3,000 and $20,000 depending on the process node being used.
“Process node” is the term used to describe the entire fabrication system. Historically, they were named after the transistor’s gate length. However, as manufacturing technology improved and allowed for ever-smaller components, the nomenclature no longer followed any physical aspect of the die and now it’s simply a marketing tool.
Nevertheless, each new process node brings benefits over its predecessor. It might be cheaper to produce, consume less power at the same clock speed (or vice versa), or have a higher density. The latter metric measures how many components can fit within a given die area. In the graph below, you can see how this has evolved over the years for GPUs (the largest and most complex chips you’ll find in a PC)…
The improvements in process nodes provide a means for engineers to increase the capabilities and performance of their products, without having to use big and costly chips. However, the above graph only tells part of the story, as not every aspect of a processor can benefit from these advancements.
Circuits inside chips can be allocated into one of the following broad categories:
- Logic – handles data, math, and decision-making
- Memory – usually SRAM, which stores data for the logic
- Analog – circuits that manage signals between the chip and other devices
Unfortunately, while logic circuits continue to shrink with every major step forward in process node technology, analog circuits have barely changed and SRAM is starting to reach a limit too.
While logic still forms the largest portion of the die, the amount of SRAM in today’s CPUs and GPUs has significantly grown in recent years. For example, AMD’s Vega 20 chip used in its Radeon VII graphics card (2019), featured a combined total of 5 MB of L1 and L2 cache. Just two GPU generations later, the Navi 21 chip powering the Radeon RX 6000 series (2020), included over 130 MB of combined cache – a remarkable 25-fold increase.
We can expect these to continue to increase as new generations of processors are developed, but with memory not scaling down as well as the logic, it will become increasingly less cost-effective to manufacture all of the circuitry on the same process node.
In an ideal world, one would design a die where analog sections are fabricated on the largest and cheapest node, SRAM parts on a much smaller one, and logic reserved for the absolute cutting-edge technology. Unfortunately, this is not practically achievable. However, there exists an alternative approach.
Divide and conquer
In 1995, Intel introduced the Pentium II, a successor to its original P5 processor. What set it apart from other processors at the time was the design hidden beneath its plastic shield: a circuit board housing two chips. The main chip contained all the processing logic and analog systems, while one or two separate SRAM modules served as Level 2 cache.
While Intel manufactured the primary chip, the cache was sourced from external suppliers. This approach became fairly standard for desktop PCs in the mid-to-late 1990s, until advances in semiconductor fabrication allowed logic, memory, and analog systems to be fully integrated into a single die.
While Intel continued to dabble with multiple chips in the same package, it largely stuck with the so-called monolithic approach for processors – i.e., one chip for everything. For most processors, there was no need for more than one die, as manufacturing techniques were proficient (and affordable) enough to keep it straightforward.
However, other companies were more interested in following a multi-chip approach, most notably IBM. In 2004, it was possible to purchase an 8-chip version of the POWER4 server CPU that comprised four processors and four cache modules, all mounted within the same body (known as a multi-chip module or MCM approach).
Around this time, the term “heterogeneous integration” started to appear, partially due to research work done by DARPA. Heterogeneous integration aims to separate the various sections of a processing system, fabricate them individually on nodes best suited for each, and then combine them into the same package.
Today, this is better known as system-in-package (SiP) and has been the standard method for equipping smartwatches with chips from their inception. For example, the Series 1 Apple Watch houses a CPU, some DRAM and NAND Flash, multiple controllers, and other components within a single structure.
A similar setup can be achieved by having different systems all on a single die (known as an SoC or system-on-a-chip). However, this approach doesn’t allow for taking advantage of different node prices, nor can every component be manufactured this way.
For a technology vendor, using heterogeneous integration for a niche product is one thing, but employing it for the majority of their portfolio is another. This is exactly what AMD did with its range of processors. In 2017, the semiconductor giant introduced its Zen architecture with the launch of the single-die Ryzen desktop CPU. Just a few months later, AMD debuted two multi-chip product lines: Threadripper and EPYC, with the latter featuring configurations of up to four dies.
With the launch of Zen 2 two years later, AMD fully embraced HI, MCM, SiP – call it what you will. They shifted the majority of the analog systems out of the processor and placed them into a separate die. These were manufactured on a simpler, cheaper process node, while a more advanced one was used for the remaining logic and cache.
And so, chiplets became the buzzword of choice.
Smaller is better
To understand exactly why AMD chose this direction, let’s examine the image below. It showcases two older CPUs from the Ryzen 5 series – the 2600 on the left, employing the so-called Zen+ architecture, and the Zen 2-powered 3600 on the right.
The heat spreaders on both models have been removed, and the photographs were taken using an infrared camera. The 2600’s single die houses eight cores, though two of them are disabled for this particular model.
This is also the case for the 3600, but here we can see that there are two dies in the package – the Core Complex Die (CCD) at the top, housing the cores and cache, and the Input/Output Die (IOD) at the bottom containing all the controllers (for memory, PCI Express, USB, etc.) and physical interfaces.
Since both Ryzen CPUs fit into the same motherboard socket, the two images are essentially to scale. On the surface, it might seem that the two dies in the 3600 have a larger combined area than the single chip in the 2600, but appearances can be deceiving.
If we directly compare the chips containing the cores, it’s clear how much space in the older model is taken up by analog circuitry – it’s all the blue-green colors surrounding the gold-colored cores and cache. However, in the Zen 2 CCD, very little die area is dedicated to analog systems; it’s almost entirely composed of logic and SRAM.
The Zen+ chip has an area of 213 mm² and was manufactured by GlobalFoundries using its 12nm process node. For Zen 2, AMD retained GlobalFoundries’ services for the 125 mm² IOD but utilized TSMC’s superior N7 node for the 73 mm² CCD.
The combined area of the chips in the newer model is smaller, and it also boasts twice as much L3 cache, supporting faster memory and PCI Express. The best part of the chiplet approach, however, was that the compact size of the CCD made it possible for AMD to fit another one into the package. This development gave birth to the Ryzen 9 series, offering 12 and 16-core models for desktop PCs.
Even better, by using two smaller chips instead of one large one, each wafer can potentially yield more dies. In the case of the Zen 2 CCD, a single 12-inch (300 mm) wafer can produce up to 85% more dies than for the Zen+ model.
The smaller the slice one takes out of a wafer, the less likely one is going to find manufacturing defects (as they tend to be randomly distributed across the disc), so taking all of this into account, the chiplet approach not only gave AMD the ability to expand its portfolio, it did so far more cost-effectively – the same CCDs can be used in multiple models and each wafer produces hundreds of them!
But if this design choice is so advantageous, why isn’t Intel doing it? Why aren’t we seeing it being used in other processors, like GPUs?
Following the lead
To address the first question, Intel has been progressively adopting chiplet technology as well. The first consumer CPU architecture they shipped using chiplets is called Meteor Lake. Intel’s approach is somewhat unique though, so let’s explore how it differs from AMD’s approach.
Using the term tiles instead of chiplets, this generation of processors split the previously monolithic design into four separate chips:
- Compute tile: Contains all of the cores and L2 cache
- GFX tile: Houses the integrated GPU
- SoC tile: Incorporates L3 cache, PCI Express, and other controllers
- IO tile: Accommodates the physical interfaces for memory and other devices
High-speed, low-latency connections are present between the SoC and the other three tiles, and all of them are connected to another die, known as an interposer. This interposer delivers power to each chip and contains the traces between them. The interposer and four tiles are then mounted onto an additional board to allow the whole assembly to be packaged.
Unlike Intel, AMD does not use any special mounting die but has its own unique connection system, known as Infinity Fabric, to handle chiplet data transactions. Power delivery runs through a fairly standard package, and AMD also uses fewer chiplets. So why is Intel’s design as such?
One challenge with AMD’s approach is that it’s not very suitable for the ultra-mobile, low-power sector. This is why AMD still uses monolithic CPUs for that segment. Intel’s design allows them to mix and match different tiles to fit a specific need. For example, budget models for affordable laptops can use much smaller tiles everywhere, while AMD only has one size chiplet for each purpose.
The downside to Intel’s system is that it’s complex and expensive to produce (which has lead to different kind of issues). Both CPU firms, however, are fully committed to the chiplet concept. Once every part of the manufacturing chain is engineered around it, costs should decrease.
When it comes to GPUs, they contain relatively little analog circuitry compared to the rest of the die. However, the amount of SRAM inside has been steadily increasing. This trend prompted AMD to leverage its chiplet expertise in the Radeon 7000 series, with the Radeon RX 7900 GPUs featuring a multi-die design. These GPUs include a single large die for the cores and L2 cache, along with five or six smaller dies, each containing a slice of L3 cache and a memory controller.
By moving these components out of the main die, engineers were able to significantly increase the amount of logic without relying on the latest, most expensive process nodes to keep chip sizes manageable. While this innovation likely helped reduce overall costs, it did not significantly expand the breadth of AMD’s graphics portfolio.
Currently, Nvidia and Intel consumer GPUs are showing no signs of adopting AMD’s chiplet approach. Both companies rely on TSMC for all manufacturing duties and seem content to produce extremely large chips, passing the cost onto consumers.
That said, it is known that both are actively exploring and implementing chiplet-based architectures in some of their GPU designs. For example, Nvidia’s Blackwell data center GPUs utilize a chiplet design featuring two large dies connected via a high-speed interlink capable of 10 terabytes per second, effectively functioning as a single GPU.
Getting Moore with chiplets
No matter when these changes occur, the fundamental truth is that they must happen. Despite the tremendous technological advances in semiconductor manufacturing, there is a definite limit to how much each component can be shrunk.
To continue enhancing chip performance, engineers essentially have two avenues – add more logic, with the necessary memory to support it, and increase internal clock speeds. Regarding the latter, the average CPU hasn’t significantly altered in this aspect for years. AMD’s FX-9590 processor, from 2013, could reach 5 GHz in certain workloads, while the highest clock speed in its current models is 5.7 GHz (with the Ryzen 9 9950X).
Intel’s highest-clocked consumer CPU is the Core i9-14900KS, featuring a maximum turbo frequency of 6.2 GHz on two cores. This “special edition” processor holds the record for the fastest out-of-the-box clock speed among desktop CPUs.
However, what has changed is the amount of circuitry and SRAM. The aforementioned AMD FX-9590 had 8 cores (and 8 threads) and 8 MB of L3 cache, whereas the 9950X boasts 16 cores, 32 threads, and 64 MB of L3 cache. Intel’s CPUs have similarly expanded in terms of cores and SRAM.
Nvidia’s first unified shader GPU, the G80 from 2006, consisted of 681 million transistors, 128 cores, and 96 kB of L2 cache in a chip measuring 484 mm2 in area. Fast forward to 2022, when the AD102 was launched, and it now comprises 76.3 billion transistors, 18,432 cores, and 98,304 kB of L2 cache within 608 mm2 of die area.
In 1965, Fairchild Semiconductor co-founder Gordon Moore observed that in the early years of chip manufacturing, the density of components inside a die was doubling each year for a fixed minimum production cost. This observation became known as Moore’s Law and was later interpreted to mean “the number of transistors in a chip doubles every two years”, based on manufacturing trends.
Moore’s Law has served as a reasonably accurate representation of the semiconductor industry’s progress for nearly six decades. The tremendous gains in logic and memory in both CPUs and GPUs have largely been driven by continuous improvements in process nodes, with components becoming progressively smaller over time. However, this trend cannot can’t continue forever, regardless of what new technology comes about.
Rather than waiting for these physical limits to be reached, companies like AMD and Intel have embraced chiplet technology, exploring innovative ways to combine these modular components to sustain the creation of increasingly powerful processors.
Decades in the future, the average PC might be home to CPUs and GPUs the size of your hand. But, peel off the heat spreader and you’ll find a host of tiny chips – not three or four, but dozens of them, all ingeniously tiled and stacked together. The dominance of the chiplet has only just begun.