If the time we spend in them is anything to go by, we must love queues. Whether you’re trying to check out at the tills or enter a foreign country, you can expect to have to queue. Nowhere are queues more upsetting than those occurring in a processor core. The code you want executed immediately gets stuck behind some time-consuming calculation of a square root. That’s known as a stall, and it doesn’t take too many of those for your 3 GHz processor to feel like it’s swimming against 1 GHz molasses.
Some places do it better, though: by offering fast checkouts, they separate customers with just a few items in their baskets from those with trolleys piled high. In doing so, they're using what in a processor is called out-of-order execution (OoO). It's a surprisingly old technique, first appearing in mainframe computers in the 1960s, and playing a starring role in IBM's POWER1 microprocessor in 1990, which led to the PowerPC series, as used in Macs for more than a decade from 1994 to 2006.
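To put rough numbers on the difference, here's a minimal sketch of a toy core that can issue one instruction per cycle. Everything in it is an assumption made for illustration: the `Instr` type, the `simulate` function, the five-instruction program and its latencies are all hypothetical, and bear no relation to how the M1 or any real core schedules its work. It simply compares a core that must wait for the oldest instruction with one that may pick the oldest instruction whose operands are ready.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    latency: int      # cycles from issue until the result is available
    deps: tuple = ()  # names of earlier instructions whose results are needed


def simulate(program, out_of_order):
    """Return the cycle in which the last result becomes available.

    Toy model: at most one instruction issues per cycle. In-order issue
    only considers the oldest unissued instruction; out-of-order issue
    takes the oldest unissued instruction whose operands are ready.
    Dependencies must name instructions earlier in the program.
    """
    done_at = {}             # name -> cycle in which its result is available
    pending = list(program)  # unissued instructions, in program order
    cycle = 0
    while pending:
        window = pending if out_of_order else pending[:1]
        ready = next(
            (i for i in window
             if all(d in done_at and done_at[d] <= cycle for d in i.deps)),
            None,
        )
        if ready is not None:
            done_at[ready.name] = cycle + ready.latency
            pending.remove(ready)
        cycle += 1
    return max(done_at.values())


# Hypothetical program: a slow square root, an add that needs its result,
# then a short chain of multiplies that has nothing to do with either.
PROGRAM = [
    Instr("sqrt", 20),
    Instr("add", 1, ("sqrt",)),
    Instr("mul1", 3),
    Instr("mul2", 3, ("mul1",)),
    Instr("mul3", 3, ("mul2",)),
]

if __name__ == "__main__":
    print("in-order:    ", simulate(PROGRAM, out_of_order=False), "cycles")
    print("out-of-order:", simulate(PROGRAM, out_of_order=True), "cycles")
```

In this toy model the in-order core finishes after 30 cycles, because the multiplies queue behind the add that is itself waiting for the square root, while the out-of-order core finishes after 21, having slipped the multiplies through the fast checkout in the meantime.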
Thanks to ingenious experiments by Maynard Handley and Dougall Johnson, building on earlier work by Travis Downs, Henry Wong and others, we’re now gaining more insight into the cores in the M1, and how they and the M1 chip itself are thoroughly Apple designs within an overall architecture designed by Arm.
For Apple, and everyone who uses Apple’s products, this is of signal importance. Intel processors, and our old Macs, may be great at running Windows, but they’re not designed to work hand-in-glove with macOS. Apple’s kernel engineers can’t go to Intel and get it to change how its processors work, to improve the performance of macOS. Apple may be a significant customer of Intel’s, but with less than 5% of Intel’s total sales, it has had little if any influence over its engineering.
With Apple processor cores in an Apple SoC, software and hardware can be fully integrated, and that's why its kernel engineers now work across all of Apple's operating systems, from the more modest watchOS and tvOS to the full glory of macOS. This isn't about making them all the same (far from it), but ensuring that each of the cores used works optimally with that device's kernel and its extensions. It also applies to each of the specialist co-processors in an SoC and their real-time operating system, RTKit.
If you've ever looked through some disassembled ARM64 code for a Mac or Apple device, you may have been surprised by the occasional appearance of the No Operation instruction NOP. What appears at first to be a flaw in the code generated by the build system turns out to be deliberate, probably acting as a hint to the decoder in a core to better predict which code to fetch in advance, to minimise stalls. In the ARM64 instruction set, NOP is itself just one member of the HINT family of instructions, the one with an immediate value of zero, and Apple's cores seem to use it for this purpose.
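For anyone who wants to spot these in a disassembly, here's a short sketch of how the HINT family is encoded in A64, where the instruction is `0xD503201F` with a 7-bit immediate placed in bits 5 to 11. The function names `encode_hint` and `decode_hint` are my own inventions for illustration. This shows only the architectural encoding; it says nothing about what an Apple core actually does when it meets one, which remains conjecture.

```python
HINT_BASE = 0xD503201F  # A64 HINT with a zero immediate, which is NOP
HINT_MASK = 0xFFFFF01F  # every bit except the 7-bit immediate in bits 11:5

# Architecturally named hints for the lowest immediate values.
NAMED_HINTS = {0: "NOP", 1: "YIELD", 2: "WFE", 3: "WFI", 4: "SEV", 5: "SEVL"}


def encode_hint(imm):
    """Return the 32-bit A64 encoding of HINT #imm."""
    if not 0 <= imm < 128:
        raise ValueError("HINT immediate must fit in 7 bits")
    return HINT_BASE | (imm << 5)


def decode_hint(word):
    """Return a mnemonic if word is a HINT instruction, otherwise None."""
    if (word & HINT_MASK) != HINT_BASE:
        return None
    imm = (word >> 5) & 0x7F
    return NAMED_HINTS.get(imm, f"HINT #{imm}")


if __name__ == "__main__":
    print(hex(encode_hint(0)))      # the encoding of NOP
    print(decode_hint(0xD503203F))  # HINT #1, named YIELD
```

Seen this way, a NOP isn't a special instruction so much as the hint that asks for nothing in particular.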
Those NOPs illustrate another of Apple’s deep integrations, between its hardware engineers and those responsible for code generation by Xcode and its build tools. If the occasional NOP/HINT is what’s required to help an Apple core decide which instructions to fetch in advance, then the code generator can be tailored to add them just where they’re needed.
Conspiracy theorists like using Apple's increasing integration of core and chip design with software engineering as an example of how closed Apple's products are becoming, without noticing the user benefits. The alternatives are hardly appealing, and illustrate the problem that Apple has solved: Android, with its Linux kernel, running on a bewildering variety of ARMv7, ARMv8-A and x86 processors; or Microsoft Windows running on the SQ2 SoC co-developed by Qualcomm and Microsoft with Arm-designed Cortex cores in a Microsoft Surface Pro X. Of course Android and Windows on ARM work, but I doubt they come close to the efficiency of macOS and an M1.
Next time that you’re stood waiting in a queue, fiddling with your iPhone, pause to reflect that its cores are managing themselves far better than humans can cope with customers, but whatever you do don’t try out-of-order execution.