Intel's New Chimera: Alder Lake
source link: https://www.agner.org/forum/viewtopic.php?t=79&%3Bp=187%23p187
A chimera is a monster combining parts from different animals, or an organism containing multiple different sets of DNA. I am calling Intel's new Alder Lake processor a chimera because it is a hybrid containing two different kinds of CPU cores with very different designs.
The Alder Lake processor contains from 2 to 8 cores of the 'Golden Cove' architecture, called P cores, and from 0 to 8 cores of the 'Gracemont' architecture, called E cores. The P cores (Performance cores) are high-performance CPU cores using the latest state-of-the-art technology to get maximum performance. The E cores (Efficiency cores) use the same technology as the 'Atom' series, with low power consumption and lower performance. The idea behind this design is that the P cores can give high performance for a limited number of threads, while the E cores allow the CPU to run many threads and still limit the power consumption. This may sound like a nice compromise in theory, but it causes a lot of problems when the same program or the same thread can jump arbitrarily between two very different kinds of cores.
The initial Alder Lake design had different CPUID numbers for the two kinds of cores. This gave problems with DRM software. If a program using DRM detects that the CPUID has changed, it will assume that the program has been moved to a different computer in violation of the license. This, of course, will stop the execution. Intel had to modify the Alder Lake and give it the same CPUID for all cores in order to fix this problem. Now, it is difficult for a running program to detect what kind of core it is running on.
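The failure mode described above can be illustrated with a small model. This is only a sketch of the reasoning, not the logic of any real DRM product: the `CpuSignature` fields, the `license_check_passes` rule, and the numeric values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSignature:
    """Identification values a program might read through CPUID (placeholder fields)."""
    family: int
    model: int
    stepping: int


def license_check_passes(at_activation: CpuSignature, now: CpuSignature) -> bool:
    # Naive rule: any change in the signature is treated as a different computer.
    return at_activation == now


# Made-up values, chosen only so that the two core types differ.
P_CORE = CpuSignature(family=6, model=1, stepping=0)
E_CORE = CpuSignature(family=6, model=2, stepping=0)

# The thread reads the signature twice on the same kind of core.
same_core_ok = license_check_passes(P_CORE, P_CORE)

# The thread was moved from a P core to an E core between the two reads.
# The check fails even though the computer has not changed.
migrated_ok = license_check_passes(P_CORE, E_CORE)
```

With identical CPUID values on all cores, the second comparison would pass, which is the effect of the fix described above.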
Another problem is that the P cores are designed for the latest instruction set extensions, including AVX512 and a new set of half-precision floating point instructions (AVX-512 FP16) that are useful for neural networks. The E cores only support AVX2, not the later instruction set extensions, such as AVX512. What would happen if a program that starts executing in a P core and detects that AVX512 instructions are available is moved by the operating system to an E core that doesn't support this instruction set? A smart operating system might catch the error when the program attempts to execute an AVX512 instruction and move it back to a P core. But this requires that the operating system is designed with special support for the Alder Lake. If the program is running on an older operating system, it will crash in this situation. Therefore, Intel had to disable all instructions that are not supported by the E cores. The AVX512 instructions are actually implemented in the hardware, but they are disabled. Some motherboards have a BIOS feature that makes it possible to disable the E cores and enable the AVX512 instructions. This feature is not endorsed by Intel, and it has now been disabled in a microcode update, even for the i3 models that have no E cores. Intel have actually sacrificed their flagship 512-bit instructions in order to run multiple threads in low-power cores.
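The constraint can be stated compactly: a thread that may run on any core can only rely on the instruction sets that every core supports. The sketch below models this with sets. The feature lists are abridged and illustrative, and the names `usable_features` and `select_code_path` are hypothetical helpers, not part of any real dispatching library.

```python
# Abridged, illustrative feature lists for the two core types.
P_CORE_FEATURES = frozenset({"sse2", "avx", "avx2", "avx512f", "avx512_fp16"})
E_CORE_FEATURES = frozenset({"sse2", "avx", "avx2"})


def usable_features(per_core_features):
    """Return the features that are safe on every core a thread may run on."""
    sets = [frozenset(s) for s in per_core_features]
    if not sets:
        return frozenset()
    # Only instructions supported by all cores survive the intersection.
    return sets[0].intersection(*sets[1:])


def select_code_path(features):
    """Pick a code version once at startup, as a typical CPU dispatcher does."""
    if "avx512f" in features:
        return "avx512"
    if "avx2" in features:
        return "avx2"
    return "generic"


# A dispatcher that looks only at the core it happens to start on:
path_seen_on_p_core = select_code_path(P_CORE_FEATURES)

# What is actually safe when the thread can migrate to an E core:
safe_path = select_code_path(usable_features([P_CORE_FEATURES, E_CORE_FEATURES]))
```

The gap between `path_seen_on_p_core` and `safe_path` is the crash scenario described above. Disabling AVX512 on the P cores removes that gap by making every core report the same feature set.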
It is very difficult to optimize the software execution for this hybrid system. A further complication is that a P core can run two threads in the same core so that each thread gets half of the resources. This is what Intel call hyperthreading. A program thread may run in three different configurations with different performance parameters:
- Running alone in a P core with maximum performance
- Sharing a P core with another thread, giving half the resources
- Running in a low-power E core
It is completely unrealistic to expect an application program to handle this situation in a reasonable manner and optimally allocate different threads to the different cores. Hardly any software application company can afford to make different versions of their code for every new microprocessor model and verify, maintain, and support all these versions. The Alder Lake has implemented a special hardware solution to this problem called the 'Intel Thread Director'. The Intel Thread Director is an embedded microcontroller that monitors all threads and measures the resource use of each thread. The operating system can use this information to calculate the optimal allocation of P cores and E cores to the different threads. Windows 11 has support for the Intel Thread Director. Future versions of Linux are planned to support it too, while there are no known plans to support it in macOS.
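To make the allocation problem concrete, here is a toy placement policy that maps threads onto the three configurations listed above. It is not the algorithm used by the Intel Thread Director or by Windows 11. The `demand` numbers stand in for whatever resource measurement the hardware reports, and the placement order (P core alone, then E core, then the second thread of a P core) is an assumption made for this sketch.

```python
ALONE_P = "alone in a P core"
SHARED_P = "sharing a P core"
ON_E = "in an E core"
WAITING = "waiting for a free slot"


def allocate(demand, p_cores, e_cores):
    """Place threads greedily, most demanding first (hypothetical policy)."""
    ranked = sorted(demand, key=lambda t: (-demand[t], t))
    first = ranked[:p_cores]                                 # one thread per P core
    on_e = ranked[p_cores:p_cores + e_cores]                 # then the E cores
    second = ranked[p_cores + e_cores:2 * p_cores + e_cores]  # then hyperthread slots
    waiting = ranked[2 * p_cores + e_cores:]

    placement = {}
    shared = len(second)  # number of P cores that end up running two threads
    for i, t in enumerate(first):
        # Second threads are paired with the least demanding first threads.
        placement[t] = SHARED_P if i >= len(first) - shared else ALONE_P
    for t in on_e:
        placement[t] = ON_E
    for t in second:
        placement[t] = SHARED_P
    for t in waiting:
        placement[t] = WAITING
    return placement


# Made-up demand values for four threads on a 2 P + 1 E configuration.
example = allocate({"render": 0.9, "encode": 0.8, "mail": 0.1, "sync": 0.05},
                   p_cores=2, e_cores=1)
```

Even this toy version has to guess at the relative value of an E core versus half a P core, which is the kind of decision an application cannot reasonably make for itself.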
The way that Windows 11 handles this problem is still flawed, however. The system gives high priority only to the thread that has the user focus. This ignores the behavior of many users. A user who is waiting for the computer to finish a heavy-duty task is typically not just sitting and waiting. He or she is more likely to do something else during the waiting time, for example checking email. There are various technical options that the user can use to control the prioritization of threads, but it is unreasonable to require that the user understands and masters such options when the user's attention is on a complicated calculation task rather than on the hardware details of a specific computer. It is already quite difficult to optimize for hyperthreading, as I have argued before. The hybrid design of the Alder Lake just makes the optimization an order of magnitude more complicated. It looks like the hardware designers have unrealistic expectations of how much software designs can be attuned to processor-specific peculiarities.
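As an example of the kind of manual control a user would have to master, the sketch below pins the current process to the P cores on Linux. It rests on assumptions: that the kernel exposes the P-core logical CPUs in `/sys/devices/cpu_core/cpus` (the path may differ or be absent on a given system), and that `os.sched_setaffinity` is available, which is only the case on Linux. The helper names are hypothetical.

```python
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0-3,8,10-11' into a set of integers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_p_core_cpus(path="/sys/devices/cpu_core/cpus"):
    """Return the logical CPUs on P cores, or an empty set if unknown.

    The default path is an assumption about hybrid-aware Linux kernels.
    """
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except OSError:
        return set()


def pin_to_p_cores():
    """Restrict the current process to P cores. Returns False if not possible."""
    cpus = read_p_core_cpus()
    if not cpus or not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, cpus)  # 0 means the calling process
    return True
```

A user running a long calculation could call `pin_to_p_cores()` at startup, but this requires exactly the hardware-specific knowledge that the paragraph above argues should not be expected of users.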
I have tested an Alder Lake, but I have not been able to get access to a setup that makes it possible to enable the AVX512 instructions. The performance of the P cores is improved somewhat over the Intel Ice Lake. The µop cache can hold 4k µops. The µop cache can deliver a maximum of 6 µops per clock cycle for a single thread or 3 µops per thread when running two threads. This throughput is not limited by code cache lines. The decoders can deliver a maximum of 4 µops per clock for a single thread or 2 µops per thread when running two threads. The decoders can handle a maximum of 16 bytes per clock, or 2x16 bytes when running two threads. The figures of 6 decoders and 8 µops per clock published elsewhere are not confirmed by my measurements.
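The measured front-end rates translate into a simple lower bound on the time per loop iteration. The sketch below uses only the figures quoted above and ignores every other bottleneck (execution ports, retirement, memory). It reads "2x16 bytes" as 16 bytes per clock for each of the two threads, which is an interpretation. The function name and the example loop size are hypothetical.

```python
# Measured maximum rates per thread, indexed by number of threads in the P core.
UOP_CACHE_UOPS_PER_CLOCK = {1: 6, 2: 3}
DECODER_UOPS_PER_CLOCK = {1: 4, 2: 2}
DECODER_BYTES_PER_CLOCK = 16  # per thread in both cases (assumed reading of "2x16")


def min_frontend_cycles(uops, code_bytes, from_uop_cache, threads_in_core=1):
    """Lower bound on clock cycles needed to deliver one loop iteration."""
    if from_uop_cache:
        # Throughput from the µop cache is not limited by code bytes.
        return uops / UOP_CACHE_UOPS_PER_CLOCK[threads_in_core]
    # The decoders are limited by both the µop rate and the byte rate.
    return max(uops / DECODER_UOPS_PER_CLOCK[threads_in_core],
               code_bytes / DECODER_BYTES_PER_CLOCK)


# A made-up loop body of 24 µops occupying 96 bytes of code.
cached_alone = min_frontend_cycles(24, 96, from_uop_cache=True)
cached_shared = min_frontend_cycles(24, 96, from_uop_cache=True, threads_in_core=2)
decoded_alone = min_frontend_cycles(24, 96, from_uop_cache=False)
decoded_shared = min_frontend_cycles(24, 96, from_uop_cache=False, threads_in_core=2)
```

For this example the bound is 4 cycles from the µop cache and 6 cycles from the decoders when the thread runs alone, and it doubles to 8 and 12 cycles when the core is shared with another thread.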
Instruction latencies and throughputs are similar to the Ice Lake for most instructions, but the latency for floating point addition is reduced from 4 to 2 clock cycles. I have not published instruction tables for the Alder Lake. I prefer to wait until a pure Golden Cove with all instructions enabled becomes available.
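The reduced addition latency matters most for code with a long dependency chain, such as summing an array into a single accumulator. The estimate below is a rough model under stated assumptions: it counts only the latency of the chained additions and ignores throughput limits, loads, and the final combination of partial sums. The function name is hypothetical.

```python
import math


def chain_cycles(n_additions, latency, accumulators=1):
    """Estimate cycles for n dependent additions split over independent chains."""
    # Each accumulator forms its own chain; the longest chain sets the time.
    longest_chain = math.ceil(n_additions / accumulators)
    return longest_chain * latency


# Summing 1000 numbers into one accumulator.
with_latency_4 = chain_cycles(1000, latency=4)
with_latency_2 = chain_cycles(1000, latency=2)

# The same sum split over four independent accumulators.
with_latency_2_unrolled = chain_cycles(1000, latency=2, accumulators=4)
```

Under this model a single-accumulator sum takes half as many cycles at a latency of 2 as at a latency of 4. With several accumulators the model becomes optimistic, because the throughput limit, which is not covered here, takes over.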
1. Kyle Orland: Faulty DRM breaks dozens of games on Intel’s Alder Lake CPUs. Ars Technica, 2021
2. Ian Cutress and Andrei Frumusanu: The Intel 12th Gen Core i9-12900K Review: Hybrid Performance Brings Hybrid Complexity. Anandtech, 2021
3. Xaver Amberger: Intel completely disables AVX-512 on Alder Lake after all – Questionable interpretation of “efficiency”. IgorsLab, 2021
4. Ian Cutress and Andrei Frumusanu: Intel Architecture Day 2021: Alder Lake, Golden Cove, and Gracemont Detailed. Anandtech, 2021
5. Michael Larabel: Intel HFI To Premiere In Linux 5.18 For Improving Hybrid CPU Performance/Efficiency. Phoronix, 2022
6. Andrew Cunningham: Apple may be done with Intel Macs, but Hackintoshes can still use the newest CPUs. Ars Technica, 2022
7. Agner Fog: How good is hyperthreading? Agner's CPU blog, 2009