Drilling Down Into The Power10 Chip Architecture - IT Jungle

August 24, 2020

Last week, we told you some general things about the future Power10 chip from IBM, based on a roadmap briefing that we got from the IBM tech team ahead of its presentation at the Hot Chips 32 chip conference. IBM was gracious enough to let us talk about the Power10 chip generally before that presentation, because we have a Monday morning deadline no matter what. And this week, we can drill down into the Power10 architectural details a bit more.

The presentation at Hot Chips was given by William Starke, the chief architect of the Power10 processor who has helped steer several generations of Power chips from IBM (along with many others), and Brian Thompto, the Power10 core architect who was given the job of starting from scratch with the Power10 core with a focus on energy efficiency. A bunch of other Power Systems people have given us insight on the new chip, including Steve Sibley, vice president and offering manager for the Cognitive Systems division. The entire Power10 team needs to be congratulated on creating a fine processor and an even more interesting system architecture. IBM may have its issues, but these people are still among the best in the world. Period. And we know a lot of chipheads, and generally, they are good folk doing the hard job of engineering in a tough environment.

In the next few weeks, we will be peeling off the technology layers of Power10, looking at the processor architecture and design itself, the system designs Power10 engenders, and the expected performance of the Power10 chip relative to Power7, Power8, and Power9. And finally, and enthusiastically, we will take a deep look at a new cross-system memory clustering technology that will prove to be pivotal in the use of Power10 processing in clouds and perhaps in all kinds of distributed computing. This last bit is the neat bit – and something we have been ruminating about in recent weeks at The Next Platform without knowing that IBM was in fact working on this. (The pieces were all there, and we felt it in the cosmic ether.) This memory clustering is evolutionary as far as IBM is concerned, being based on shared memory ideas it has had in the AS/400 line as well as in supercomputers over the decades, but it is revolutionary as far as the industry is concerned because no one else has anything like this, nor will anyone any time soon.

But this week, let’s get into the Power10 chip itself, which we have nicknamed “Cirrus” because IBM did not give it a cool codename, apparently. (That’s lame, and we are having none of that. So Cirrus it is.)

Let’s start at the die level and work our way inside. Take a look:

As we told you last week, the original plan for Power10 seemed to be to create an updated 24-core processor based generally on the Power9 design and then put two of these chips into a single socket to boost the core count per socket, riding the process shrink from 14 nanometers at the GlobalFoundries fab down to its 7 nanometer extreme ultraviolet (EUV) technologies. Somewhere along the way – probably between when GlobalFoundries spiked its 7 nanometer efforts in August 2018 and when IBM said publicly it was moving to Samsung as its foundry for Power10 – we think Big Blue went with a more aggressive plan. What we do know is that IBM went with a clean sheet redesign of the Power10 core, which Thompto confirmed, and what we can see from the die is that IBM pushed the limits a bit more thanks to the fact that it has 18 billion transistors to play with in a 602 square millimeter die.

The Power10 chip will come in versions that have either cores with eight threads per core (SMT8 in the IBM lingo) or twice as many cores with four threads per core (SMT4 mode). While we strongly suspect that IBM has designed a single chip that can operate in either mode, the mode is set in the hardware itself and cannot be reset by end users. The fact that users cannot change the mode means that IBM and its third party software suppliers know what to charge per core. IBM will dial in SMT8 or SMT4 cores based on specific customer needs, as it has done with Power8 and Power9 before it, with SMT8 cores used for commercial workloads and SMT4 cores used for HPC, hyperscaler, and cloud customers for the most part, where they want as many VMs per machine as possible – if they are pinning a VM to a core.

As you can see from the die above, there are 16 physical cores in SMT8 mode on the Power10 die, which is the version IBM was showing off at Hot Chips, but clearly there would be 32 physical cores in SMT4 mode. To improve the yield of chips on the 7 nanometer process, IBM is figuring that at least one SMT8 core and its 8 MB L3 region are going to be duds, and thus it is calling Power10 a 15-core SMT8 chip. But technically, it has one more core in SMT8 mode, or two more cores in SMT4 mode (each of which has 4 MB of associated L3 cache).

Just like with the Power8 and Power9, the L3 cache is shared across the entire chip, although there is "an affiliation" between a core and its most adjacent cache segment. With that dud core figured into the plan, there is a maximum of 120 MB of L3 cache per Power10 chip. That L3 cache is not etched with IBM's own embedded DRAM (eDRAM) technology, which was denser than standard SRAM and which had much lower power draw per bit cell. With the 7 nanometer shrink, Starke tells us that IBM could use faster SRAM for the L3 cache and not bust the transistor budget or the thermal envelope.

Obviously, as yields on the 7 nanometer process improve at the Samsung foundry, Big Blue can activate that 16th core and charge extra money for it. And we think that is precisely what will happen around late 2022 or so after Power10 has been in the field for a year.

The Power10 chip is organized into two hemispheres, each with its own 64 MB block of L3 cache and eight SMT8 cores or sixteen SMT4 cores. All of these cores are linked to each other through the L3 cache bus, and any core can talk to any L3 cache segment. Each SMT8 core has 2 MB of dedicated L2 cache and each SMT4 core has 1 MB of dedicated L2 cache, depending on how you carve it up. (Remember, the die shot above is blocked off for the SMT8 cores, but the actual transistors don't move around. SMT8 and SMT4 are just modes of the same hardware, set after etching and in the packaging, where no one can mess with them.)
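For those keeping score at home, here is a minimal sketch of the core and cache arithmetic laid out above. The figures come straight from the text; the variable names and structure are our own illustration, not anything from IBM.

```python
# Power10 core and cache arithmetic, using the figures quoted above.
# Illustrative only; the names and structure are ours, not IBM's.
SMT8_CORES_PER_DIE = 16      # physical SMT8 cores on the die
DUD_CORES = 1                # at least one core assumed bad for yield reasons
L3_PER_SMT8_CORE_MB = 8      # the L3 region affiliated with each SMT8 core
L2_PER_SMT8_CORE_MB = 2      # dedicated L2 per SMT8 core

active_smt8 = SMT8_CORES_PER_DIE - DUD_CORES                               # 15 cores sold
active_smt4 = active_smt8 * 2                                              # 30 cores in SMT4 mode
max_l3_mb = active_smt8 * L3_PER_SMT8_CORE_MB                              # 120 MB shared L3
l3_per_hemisphere_mb = (SMT8_CORES_PER_DIE // 2) * L3_PER_SMT8_CORE_MB     # 64 MB per hemisphere
l2_per_smt4_core_mb = L2_PER_SMT8_CORE_MB // 2                             # 1 MB per SMT4 core

print(active_smt8, active_smt4, max_l3_mb, l3_per_hemisphere_mb, l2_per_smt4_core_mb)
# -> 15 30 120 64 1
```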

I have no idea what is running across the top of the Power10 chip or the smaller block of circuits at the center of the bottom edge of the chip, and I was too excited by what I was seeing to remember to ask about these. That is a lot of real estate on the chip, and it must be doing something.

At the bottom of the chip are two blocks, each a PCI-Express 5.0 controller that pushes sixteen lanes of traffic at 32 GT/sec, which works out to 128 GB/sec across all those lanes (x16, as they call it) at full duplex. That's twice the bit shifting of PCI-Express 4.0, which debuted first in the world on the Power9 chip in late 2017. Having 32 lanes of PCI-Express 5.0 does not seem like a lot compared to other servers, but remember, IBM has neater stuff.
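If you want to check that 128 GB/sec figure yourself, the lane math looks roughly like this. The 128b/130b encoding efficiency is our assumption about how PCI-Express 5.0 frames its bits; the text only quotes the raw 32 GT/sec signaling rate.

```python
# Rough PCI-Express 5.0 x16 bandwidth check (our arithmetic, not IBM's disclosure).
GT_PER_SEC = 32.0                 # raw signaling rate per lane, from the text
ENCODING_EFFICIENCY = 128 / 130   # assumed 128b/130b line encoding
LANES = 16                        # one x16 controller

per_direction_gb = GT_PER_SEC * ENCODING_EFFICIENCY / 8 * LANES   # ~63 GB/sec each way
full_duplex_gb = per_direction_gb * 2                             # ~126 GB/sec both ways

print(round(per_direction_gb), round(full_duplex_gb))   # -> 63 126
```

The round 128 GB/sec that gets quoted is simply 32 GT/sec times 16 lanes, divided by 8 bits and doubled for full duplex, before that encoding overhead is taken out.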

At the four corners of the chip are the PowerAXON (formerly "BlueLink") SerDes, and there are a total of 16 PowerAXON controllers with eight lanes of traffic each (that's the x8 in the diagram) that deliver up to 32 GT/sec per lane for a combined 1 TB/sec of bandwidth. These PowerAXON links are used for NUMA interconnects for machines with two, four, eight, or sixteen sockets. They are also used for OpenCAPI links out to persistent memory, accelerators, and other I/O devices and, as we shall see, are also used for the integrated memory clustering across systems for the all-memory network.

The OpenCAPI Memory Interface (OMI) is a tweak on this SerDes design that replaces a dedicated DDR4 memory controller. Now, IBM is using a third party memory buffer chip from Microchip (which it previewed at Hot Chips last year) to allow the OMI SerDes to just talk fast out to the buffer and let the buffer decide if it is going to talk DDR4 (today) or DDR5 (in the future). This way, the processor and its memory controller do not have to be updated to support newer memory – only the buffer chip on the memory stick does. This costs a bit more in terms of money (a few bucks) and latency (under 10 nanoseconds of additional latency, or between 5 percent and 10 percent or so, depending on the physical memory used), but it means that the Power10 chip is flexible, unlike X86 or Arm server processors, which are hard coded for a specific generation of physical memory that cannot be upgraded. There are two banks of eight OMI SerDes, one bank each on the left and the right side of the chip, for a total of sixteen x8 OMI SerDes delivering 1 TB/sec of bandwidth per chip. Here is the important thing: that OMI SerDes provides 6X the bandwidth per square millimeter of chip area compared to a DDR4 memory controller etched in the same process. And IBM knows this because it has done both: DDR4 controllers in the Power9 and OMI controllers in the Power9' chip that it used to prototype the Power10 ideas.
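Here is a quick sanity check of those OMI numbers, plus what the quoted latency percentages imply about the baseline DRAM latency. The arithmetic is ours, with encoding overhead ignored, which is how the round 1 TB/sec figure falls out.

```python
# OMI bandwidth and latency sanity check, using the figures quoted above.
OMI_SERDES = 16            # two banks of eight x8 OMI SerDes
LANES_PER_SERDES = 8
GT_PER_SEC = 32.0

lanes = OMI_SERDES * LANES_PER_SERDES                 # 128 lanes
per_direction_gb = lanes * GT_PER_SEC / 8             # 512 GB/sec each way
full_duplex_tb = per_direction_gb * 2 / 1000          # ~1 TB/sec aggregate

# "Under 10 ns, or between 5 percent and 10 percent" of added latency implies a
# baseline DRAM latency on the order of 100 ns to 200 ns -- our inference, not IBM's figure.
implied_baseline_ns = (10 / 0.10, 10 / 0.05)

print(per_direction_gb, full_duplex_tb, implied_baseline_ns)
# -> 512.0 1.024 (100.0, 200.0)
```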

The Power10 chip will offer a maximum of 4 TB of DDR4 memory (presumably running at 3.2 GHz or faster) and up to 410 GB/sec of peak theoretical bandwidth per socket. The Power10 chip has a truly stunning 2 PB of physical memory addressing – the current X86 chips top out at 64 TB of physical addressing.
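Working backwards from those numbers, a 410 GB/sec peak is consistent with sixteen memory channels each behaving like a 64-bit DDR4-3200 channel, and the addressing figures map to plain address-bit counts. The channel assumptions here are ours, since the presentation did not spell out the DDIMM configuration behind the buffers.

```python
import math

# Where the 410 GB/sec and 2 PB figures plausibly come from (our back-of-envelope math).
CHANNELS = 16                 # one memory channel behind each OMI x8 SerDes (assumed)
DDR4_MT_PER_SEC = 3200        # the 3.2 GHz effective data rate cited above
BYTES_PER_TRANSFER = 8        # 64-bit DRAM channel (assumed)

peak_gb_per_sec = CHANNELS * DDR4_MT_PER_SEC * BYTES_PER_TRANSFER / 1000   # 409.6, quoted as 410

power10_addr_bits = math.log2(2 * 2**50)   # 2 PB of physical addressing -> 51 bits
x86_addr_bits = math.log2(64 * 2**40)      # 64 TB -> 46 bits

print(peak_gb_per_sec, power10_addr_bits, x86_addr_bits)   # -> 409.6 51.0 46.0
```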

As you can see, between the cores and the L3 cache blocks and the PowerAXON and OMI SerDes and PCI-Express 5.0 controllers there are two columns of circuits that provide the interface between the blocks at the center of the chip and the interfaces at the edges.

One other thing that might not be obvious. There is 1 TB/sec of bandwidth coming into the chip over PowerAXON controllers plus another 256 GB/sec over the PCI-Express 5.0 controllers, which is balanced out pretty evenly with the 1 TB/sec over the OMI SerDes into and out of memory.
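Tallied up in one place, that balance works out roughly like this. Again, this is our arithmetic on the raw signaling rates, counting both directions and ignoring encoding overhead.

```python
# Off-chip bandwidth balance for Power10, per the figures quoted above (our arithmetic).
GT = 32.0   # all three interfaces signal at 32 GT/sec

def full_duplex_tb(lanes):
    """Raw full-duplex bandwidth in TB/sec for a given lane count."""
    return lanes * GT / 8 * 2 / 1000

powerax_tb = full_duplex_tb(16 * 8)   # sixteen x8 PowerAXON SerDes -> ~1.02 TB/sec
pcie_tb = full_duplex_tb(2 * 16)      # two x16 PCI-Express 5.0 controllers -> ~0.26 TB/sec
omi_tb = full_duplex_tb(16 * 8)       # sixteen x8 OMI SerDes -> ~1.02 TB/sec

print(round(powerax_tb + pcie_tb, 2), "TB/sec over I/O vs", round(omi_tb, 2), "TB/sec to memory")
# -> 1.28 TB/sec over I/O vs 1.02 TB/sec to memory
```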

Now, let’s drill down into that core a bit:

According to Thompto, thanks to the ground-up rewrite of the Power10 core, each Power10 thread delivers on average 20 percent better performance than a Power9 thread across a variety of workloads, and because of changes to the cache hierarchy, the average core is yielding 30 percent more performance.

Here is what the SMT4 core looks like, schematically:

The SMT8 core would be twice as wide and did not fit easily into the presentation, which is why Starke and Thompto only showed the SMT4 core. There are four 128-bit execution slices at the center of the Power10 core, which has 48 KB of L1 instruction cache and 32 KB of L1 data cache. As you can see from the chart above, many of the features of the Power10 core are 2X or 4X that of the Power9 core, but there are a few that only got a modest boost.

Here is the interesting bit. IBM is getting 30 percent better instructions per clock (IPC) on Power10 versus Power9, normalized to the same 4 GHz clock speed that IBM uses as a design point for the Power7, Power8, Power9, and now Power10 chips, and the process shrink allows that work to be done in half the watts, for a factor of 2.6X better performance per watt compared to Power9. That means there is enough thermal room to lower the clock speed from 4 GHz to 3.5 GHz and get two chips into a socket – and burn no more juice than a Power9 chip did.
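The arithmetic behind those claims is simple enough to sketch out. The assumption that both chips in a dual-chip socket have the same number of active cores is ours, made only to show how the clock trade-off works.

```python
# How the 2.6X performance per watt and the dual-chip socket trade-off fall out
# of the numbers quoted above (our arithmetic).
ipc_gain = 1.30       # 30 percent better IPC than Power9 at the 4 GHz design point
power_ratio = 0.50    # the same work done in half the watts, thanks to the 7 nm shrink

perf_per_watt_vs_power9 = ipc_gain / power_ratio   # 2.6X

# Two chips per socket at 3.5 GHz instead of one at 4 GHz: roughly 1.75X the raw
# cycles per socket (assuming equal active core counts per chip), with the IPC
# gain stacking on top, all within a Power9-class power envelope.
socket_cycle_ratio = 2 * 3.5 / 4.0

print(perf_per_watt_vs_power9, socket_cycle_ratio)   # -> 2.6 1.75
```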

And this is precisely what IBM plans to do for certain Power10 machinery next year. We will talk about that in next week’s issue.

RELATED STORIES

Power To The Tenth Power

Power Systems Slump Is Not As Bad As It Looks

The Path Truly Opens To Alternate Power CPUs, But Is It Enough?

Powers Of Ten

What Open Sourcing Power’s ISA Means For IBM i Shops

IBM’s Plan For Etching Power10 And Later Chips

The Road Ahead For Power Is Paved With Bandwidth

IBM Puts Future Power Chip Stakes In The Ground

What Will IBM i Do With A Power10 Processor?

Samsung Joins The OpenPower Consortium Party
