Tune for Performance: Core types

By: hoakley
17 December 2024 at 15:30

Because all the CPU cores in an Intel Mac are identical, the best you can hope for when running apps on one is that tasks run in multiple threads, making use of all the cores available. Even single-threaded processes running on Apple silicon Macs get a choice of CPU core type, though, and this article explains the limited control you have over that.

[Image: m4coremanagement1]

In my current model of CPU core allocation by macOS, shown in part above, Quality of Service (QoS) is the factor determining which core type a thread will be allocated to. QoS is normally hard-coded into an app, and seldom exposed to the user’s control.

If a thread is assigned a minimal QoS of Background (value 9), or less, then macOS will allocate it to be run on E cores, and it won’t be run on a P core. With a higher QoS, macOS allocates it to a P core if one is available; if not, then it will be run on E cores, with that cluster running at high frequency. Thus, threads with a higher QoS can be run on either P or E cores, depending on their availability.
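
For developers, QoS is normally set when work is dispatched. As a minimal sketch in Swift (the queue labels here are hypothetical), assigning work to Dispatch queues with explicit QoS looks like this:

```swift
import Dispatch

// Work submitted at .background QoS is eligible only for E cores;
// higher QoS work prefers P cores but can spill over onto E cores.
let maintenance = DispatchQueue(label: "com.example.maintenance",
                                qos: .background)
let urgent = DispatchQueue(label: "com.example.urgent",
                           qos: .userInitiated)

maintenance.async {
    // long-running housekeeping, confined to the E cores
}
urgent.async {
    // run on a P core whenever one is available
}
```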

taskpolicy

There are times when you might wish to accelerate completion of threads normally run exclusively on E cores. For example, knowing that a particular backup might be large, you might elect to leave the Mac to get on with that as quickly as possible. There are two methods that appear intended to change the QoS of processes: the command tool taskpolicy, and its equivalent code function setpriority().

Experience with using those demonstrates that, while they can be used to demote threads to E cores, they can’t promote processes or threads already confined to E cores so that they can use both types. For instance, the command
taskpolicy -B -p 567
which should promote the process with PID 567 to run on both types of core, has no effect on processes or threads run at low QoS. taskpolicy can, though, be used to demote processes and their threads with higher QoS to use only E cores. Running the command
taskpolicy -b -p 567
does confine all threads to the E cluster, and can be reversed using the -B option for threads with higher QoS (but not those set to low QoS by the process).
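
In code, the same demotion can be performed with setpriority(). This is a sketch of the equivalents of taskpolicy -b and -B for a whole process; the two constants are copied from <sys/resource.h>, and PID 567 is again just an example:

```swift
import Darwin

// From <sys/resource.h>: PRIO_DARWIN_PROCESS applies the policy to a
// whole process; PRIO_DARWIN_BG is Darwin's background policy.
let PRIO_DARWIN_PROCESS: Int32 = 4
let PRIO_DARWIN_BG: Int32 = 0x1000

let pid: id_t = 567  // example PID, as above

// Demote, like taskpolicy -b: confine all threads to the E cores.
if setpriority(PRIO_DARWIN_PROCESS, pid, PRIO_DARWIN_BG) != 0 {
    perror("setpriority")
}

// Reverse it, like taskpolicy -B. As noted above, this can't promote
// threads that the process itself runs at low QoS.
_ = setpriority(PRIO_DARWIN_PROCESS, pid, 0)
```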

[Image: qoscores1]

That can be seen in this CPU History window from Activity Monitor. An app ran four threads, two at low QoS and two at high QoS. On the left side of each core trace, those run on their respective core types, as set by their QoS. The app’s process was then changed using taskpolicy -b and the threads run again, as seen on the right: the two threads with high QoS now run together with the two at low QoS, on the four E cores alone.

The best way to take advantage of this ability to demote high QoS threads to E cores is St. Clair Software’s excellent utility App Tamer.

Virtualisation

macOS virtual machines running on Apple silicon chips are automatically assigned a high QoS, and run preferentially on P cores. Thus, even when a VM runs its threads at low QoS, those are run within the VM’s high-QoS threads on the host’s P cores. This remains the only known method of electively running low QoS threads on P cores.

Game Mode, CPU cores

In the Info.plist of an application, the developer should assign that app to one of the standard LSApplicationCategoryTypes. If that’s one of the named game types, macOS automatically gives that app access to Game Mode. Among its benefits are “highest priority access” to the CPU cores, in particular the E cores, whose use by background threads is reduced. Other benefits include highest priority access to the GPU, and doubled Bluetooth sampling rate to reduce latency for input controllers and audio output. Game Mode is automatically turned on when the app is put into full-screen mode, and turned off when full-screen mode is stopped, although the user also has manual control through the mode’s item in the menu bar.

In practice, while this is beneficial to many games, it has little if any use for modifying core type allocation in other circumstances. As the user can’t modify an app’s Info.plist without breaking its signature and notarization, this is only of use to developers.

Summary

  • The QoS assigned by an app to its threads is used by macOS to determine which core type to allocate them to.
  • Threads with a low (Background, 9) QoS are run exclusively on E cores. Those with higher QoS are run preferentially on P cores, and normally only on E cores when no P core is available.
  • App Tamer and taskpolicy can be used to demote higher QoS threads to be run on E cores, but low QoS threads can’t be promoted to be run on P cores.
  • macOS VMs run on P cores, so low QoS threads running in a VM will be run on P cores, and can be used to accelerate Background threads.
  • Game Mode gives games priority access to CPU cores, particularly to E cores, but users can’t elect to run other apps in Game Mode, and there don’t appear to be benefits in doing so.
  • If you feel an app would benefit from user control over CPU core type allocation through access to their QoS, suggest it to the app’s developer.

Explainer

Quality of Service (QoS) is a property of each thread in a process, and normally chosen from the macOS standard list:

  • QoS 9 (binary 001001), named background and intended for threads performing maintenance, which don’t need to be run with any higher priority.
  • QoS 17 (binary 010001), utility, for tasks the user doesn’t track actively.
  • QoS 25 (binary 011001), userInitiated, for tasks that the user needs to complete to be able to use the app.
  • QoS 33 (binary 100001), userInteractive, for user-interactive tasks, such as handling events and the app’s interface.

There’s also a ‘default’ value of QoS between 17 and 25, an unspecified value, and in some circumstances you might come across others.
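
Those values are easily checked, as they correspond to the raw qos_class_t values behind the Dispatch QoS classes. A quick check in Swift:

```swift
import Dispatch

// Print the raw qos_class_t value behind each Dispatch QoS class.
for qos: DispatchQoS.QoSClass in [.background, .utility, .default,
                                  .userInitiated, .userInteractive] {
    print(qos, qos.rawValue.rawValue)
}
// background 9, utility 17, default 21, userInitiated 25, userInteractive 33
```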

Last Week on My Mac: Compression models

By: hoakley
15 December 2024 at 16:00

Over the fifty years since I started using computers, one of the greatest changes has been who does the waiting. In my early days, it was me, in some cases waiting overnight for batches to be processed on a mainframe computer. By the mid 1990s I had to wait a couple of days for some of my optimisation runs to complete on the Mac in my office. Since then, as Macs have got progressively faster, they have come to do the waiting for me. Whether I’m editing an image or writing an article, my Mac keeps pace with almost everything I do.

One notable if infrequent exception is when there’s a macOS update. On a newer Apple silicon model, installation itself is far quicker than when Apple changed to using the SSV in Big Sur, but there’s always that wait for the downloaded update to be “prepared” in the background. Much of that is thought to be decompression, which minimises the size of downloads by putting the burden on the Mac to expand and organise what needs to be installed.

Data compression is widely used in modern operating systems. Take a look in Activity Monitor’s Memory view, and at the bottom right you’ll see how much compressed memory is in use, here currently over 1 GB. It’s used to save space in macOS, where many system files are stored in compressed formats. Compression is also an integral part of many media formats, whether still or moving images, or audio. Unless they’re compressed, we’d only be able to enjoy a small fraction of the media we now consume so voraciously.

It’s understandable that compression is thus one of the best-studied tasks in computing. Over decades of intensive research and practical experience, we have learned that the most powerful methods of compressing data can also take the most processing time. The effectiveness of compression is generally expressed in terms of compression ratio, the ratio of original file size to that of the compressed file. Thus all effective compression methods should result in a compression ratio of more than 1. You’ll also come across percentage space saving, the reduction in size compared to the original. If a compression method shrinks a 10 GB file to 5 GB, then its compression ratio is 10/5 = 2, and its space saving is (1 – 5/10) x 100 = 50%.
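
Expressed in code, those two measures are trivial (sizes here in any consistent unit):

```swift
// Compression ratio, and percentage space saving, as defined above.
func compressionRatio(original: Double, compressed: Double) -> Double {
    original / compressed
}
func spaceSaving(original: Double, compressed: Double) -> Double {
    (1 - compressed / original) * 100
}

print(compressionRatio(original: 10, compressed: 5))  // 2.0
print(spaceSaving(original: 10, compressed: 5))       // 50.0
```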

One of the most efficient compression techniques, for instance, has been that in nanozip 0.09a, which can achieve a compression ratio of 5.3, or a space saving of 81%, on English text taken from Wikipedia. This illustrates the trade-offs made in such efficient techniques: to achieve that, it takes almost 128 seconds to compress 100 MB, a rate of 0.783 MB/s. That makes it of little everyday use, and unsuitable for macOS updates, where it also couldn’t achieve the same high compression ratio because the data are binary rather than English text.

Optimising compression isn’t as simple as it might appear. There’s an inverse relationship between the time spent compressing data and the amount of compressed data to be written, as there is with many similar tasks. The more sophisticated the compression and the higher the compression ratio, the less data there is to be written out, although there’s still the same amount of data to be read. Using the example of a compression ratio of 2, 10 GB of input data has to be read and only 5 GB written out, so unless write speed is less than half read speed, reading the input data is likely to be more rate-limiting.

Compression is also representative of many other types of everyday tasks, as it consists of three sub-tasks:

  1. read data from disk
  2. process it (encode, decode, transcode)
  3. write the transformed data to disk

Unless you use small file sizes, this is usually performed on data streamed from storage and back to storage, rather than on single chunks of data in memory.
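
In outline, such a streamed task might be sketched like this in Swift, with a placeholder transform standing in for whatever encoding or decoding the second sub-task performs; the chunk size and function names are illustrative assumptions:

```swift
import Foundation

// Stream a file through a transform in fixed-size chunks:
// sub-task 1 reads, sub-task 2 processes, sub-task 3 writes.
func process(from input: URL, to output: URL,
             chunkSize: Int = 16 * 1024 * 1024,
             transform: (Data) -> Data) throws {
    let reader = try FileHandle(forReadingFrom: input)
    _ = FileManager.default.createFile(atPath: output.path, contents: nil)
    let writer = try FileHandle(forWritingTo: output)
    defer {
        try? reader.close()
        try? writer.close()
    }
    while let chunk = try reader.read(upToCount: chunkSize), !chunk.isEmpty {
        try writer.write(contentsOf: transform(chunk))
    }
}
```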

To understand what limits performance of any of those everyday tasks, we need to understand how each of its sub-tasks is limited, and how those limitations interact. In some cases, such as preparation of macOS updates, time taken appears to be determined by the processing required in the second step. In the past, Apple has run all that in a single thread to minimise impact on concurrent use of other apps in the foreground. In other tasks, such as encrypting files, stage 2 processing should be fastest, making it most likely that disk access is limiting.

Although we can’t apply conclusions drawn from studying any specific compression method directly to other tasks, we can use it as a model to develop tools that we can then apply to those tasks. Over the last week, I’ve published two articles here discussing how to do that.

In the first I asked how we can investigate whether more threads will run a task faster. We know that for my compression task they do, but that may not be true for the tasks you’re most interested in accelerating. In rare cases, an app may provide a control to set how many threads to use, and it’s easy to use that to investigate. It’s more likely that your app has no such control, but one way around that is to run it in a virtual machine, where you can set the number of virtual cores that VM is given.

Timing the performance of your app with test files of known size should give a performance rate, and that can help identify which of those sub-tasks is probably limiting overall performance. If your app transcodes video, for example, time how long it takes to complete all three sub-tasks on a 10 GB video clip, and you can work out its transcoding rate in GB/s. If it takes 20 seconds to read that test file, transcode it, and write out the converted video, then its overall transcoding rate is 0.5 GB/s.

The next article considers how you might work out whether running that task on high-speed storage would improve that overall performance. Although it looks unlikely that a transcoding rate of 0.5 GB/s would be altered by running it on a Thunderbolt 3 disk with its 2.3 GB/s write speed, it’s worth comparing performance on the faster internal SSD to check that, and to decide whether a faster USB4 SSD might be better.

By the time you’ve worked through those tests, you should have better insight into which of those three sub-tasks is limiting performance, how you might be able to improve them, and how to compare current performance, as the overall transcoding rate, against any attempted improvements. As I wrote last week, the only way to do that is with objective measurements such as that transcoding rate.

Next week I’ll continue this series by considering how you can control the type of CPU core threads run on, and some related issues about sparse bundles and the performance of Time Machine backup storage.

Has Apple stopped updating EFI firmware?

By: hoakley
13 December 2024 at 15:30

If, like me, you pay close attention to the firmware updates released with macOS, you may have noticed something highly unusual, if not unique, in those that came with macOS Sequoia 15.2, Sonoma 14.7.2 and Ventura 13.7.2 this week: they could mark the end of an era.

All new Macs since Apple transitioned to using Intel processors have one of three classes of firmware:

  • Intel Macs without a T2 chip only have EFI firmware, whose version reads something like 529.140.2.0.0. These are model-specific.
  • Intel Macs with a T2 chip have firmware for both their Intel systems in EFI, and iBridge for the T2, giving them a double firmware version like 2069.40.2.0.0 (iBridge: 22.16.12093.0.0,0). All models with a T2 chip can run the same EFI and iBridge versions.
  • Apple silicon Macs have iBoot, with a version like 11881.61.3, which is common across all models.

This complexity was the reason for my first developing EFIcienC, predecessor to SilentKnight, compiling and maintaining databases of firmware versions, and trying to help those whose Macs stubbornly refused to update their EFI firmware when they should have done. This site still has long lists of the latest firmware versions for Macs running Catalina, for example.

[Image: EFIcienC01]

[Image: silentknight11]

As the number of supported Intel Macs without a T2 chip has steadily fallen, what used to be a long and complex list has shrunk to just seven models. With the release of macOS Sequoia 15.0, Sonoma 14.7 and Ventura 13.7, Apple stopped updating the EFI firmware for Intel Macs without T2 chips, which are now frozen as they were last June and July.

When Apple released Sequoia 15.2, Sonoma 14.7.2 and Ventura 13.7.2 this week, it appears to have ceased updating the EFI firmware in Intel Macs with T2 chips.

T2 models were updated to EFI 2069.0.0.0.0 and iBridge 22.16.10353.0.0,0 when Sequoia 15.0 was released on 16 September 2024. No firmware updates came in the rapid update to 15.0.1, but in 15.1 those models were updated to EFI 2069.40.2.0.0 and iBridge 22.16.11072.0.0,0.

Sequoia 15.1.1 also didn’t bring any change in firmware, and 15.2 updates T2 models to EFI 2069.40.2.0.0 and iBridge 22.16.12093.0.0,0: while the T2’s firmware has been updated, no change has been made in the EFI version. As far as I’m aware that’s the first time that has happened since the initial releases of T2 firmware at the end of 2017 and early 2018. The first record I have of their version numbers is of EFI 1037.147.1.0.0 and iBridge 17.16.16065.0.0,0, since when they have come a very long way.

While I’m sure that Apple could still update EFI firmware if necessary, I think we have seen the last planned updates, with only iBridge for the T2 and iBoot for Apple silicon Macs continuing to advance with future releases of macOS. As the T2 is also Apple silicon, that marks the end of firmware for Intel processors, after more than 18 years. The end of an era indeed, and time to pour one out for EFI firmware in Macs.

I wouldn’t like to hazard a guess at how much longer Apple will continue to support iBridge firmware for T2 chips. Firmware updates aren’t a required part of macOS updates, and most Macs cease to enjoy them well before they’re updated to their last macOS.

Tune for Performance: Do you really need a big internal SSD?

By: hoakley
12 December 2024 at 15:30

The most common economy that many make when specifying their next Mac is to opt for just 512 GB internal storage, and save the $/€/£600 or so it would cost to increase that to 2 TB. After all, if you can buy a 2 TB Thunderbolt 5 SSD for around two-thirds of the price, why pay more? And who needs the blistering speed of that expensive internal storage when your apps work perfectly well with a far cheaper SSD? This article considers whether that choice matters in terms of performance.

To look at this, I’m going to use the same compression task that I used in this previous article, my app Cormorant relying on the AppleArchive framework in macOS to do all the heavy lifting. As results are going to differ considerably when using other apps and other tasks, I’d like to make it clear that this can’t reach general conclusions that apply to every task and every app: your mileage will vary. My purpose here is to show how you can work out whether using a slower disk will affect your app’s performance.

Methods

In that previous article, I concluded that compression speed was unlikely to be determined by disk performance, as even with all 10 P cores running compression threads, that rate only reached 2.28 GB/s, around the same as write speed to a good Thunderbolt 3 SSD. For today’s tests, I therefore set Cormorant to use the default number of threads (all 14 on my M4 Pro) so it would run as fast as the CPU cores would allow when compressing my standard 15.517 GB IPSW test file.

I’m fortunate to have a range of SSDs to test, and here use the 2 TB internal SSD of my Mac mini M4 Pro, an external USB4 enclosure (OWC Express 1M2) with a 2 TB Samsung 990 Pro SSD, and an external 2 TB Thunderbolt 3 SSD (OWC Envoy Pro SX). As few are likely to have access to such a range, I included two disk images stored on the internal SSD, a 100 GB sparse bundle, and a 50 GB read-write disk image, to see if they could be used to model external storage. All tests used unencrypted APFS file systems.

The first step with each was to measure its write speed using Stibium. Unlike more popular benchmarking apps for the Mac, Stibium measures the speed across a wide range of file sizes, from 2 MB to 2 GB, providing more insight into performance. After those measurements, those test files were removed, the large test file copied to the volume, and compressed by Cormorant at high QoS with the default number of threads.

Disk write speeds

[Image: compressionbydisk1]

Results in each test followed a familiar pattern, with rapidly increasing write speeds to a peak at a file size of about 200 MB, then a steady rate or slow decline up to the 2 GB tested. The read-write disk image was a bit more erratic, though, with high write speeds at 800 and 1000 MB.

At 2 GB file size, write speeds were:

  • 7.69 GB/s for the internal SSD
  • 7.35 GB/s for the sparse bundle
  • 3.61 GB/s for the USB4 SSD
  • 2.26 GB/s for the Thunderbolt 3 SSD
  • 1.35 GB/s for the disk image.

Those are in accord with my many previous measurements of write speeds for those types of storage.

Compression rates

Times to compress the 15.517 GB test file ranked differently:

  • 5.57 s for the internal SSD
  • 5.84 s for the USB4 SSD
  • 9.67 s for the sparse bundle
  • 10.49 s for the Thunderbolt 3 SSD
  • 16.87 s for the disk image.
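
Converting those times to rates is simply a matter of dividing the 15.517 GB file size by each:

```swift
import Foundation

let size = 15.517  // GB
let times = [("internal SSD", 5.57), ("USB4 SSD", 5.84),
             ("sparse bundle", 9.67), ("Thunderbolt 3 SSD", 10.49),
             ("disk image", 16.87)]
for (disk, t) in times {
    print(disk, String(format: "%.2f GB/s", size / t))
}
// 2.79, 2.66, 1.60, 1.48 and 0.92 GB/s respectively
```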

When converted to compression rates, sparse bundle results are even more obviously an outlier, as shown in the chart below.

[Image: compressionbydisk2]

There’s a roughly linear relationship between measured write speed and compression rate in the disk image, Thunderbolt 3 SSD, and USB4 SSD, and little difference between the latter and the internal SSD. These suggest that disk write speed becomes the rate-limiting factor for compression when write speed falls below about 3 GB/s, but above that faster disks make little difference to compression rate.

Poor performance of the sparse bundle was a surprise, given how close its write speeds are to those of the host internal SSD. This is probably the result of compression writing a single very large file across its 14 threads; as the sparse bundle stores file data in a large number of band files, their overhead appears to have got the better of it. I will return to look at this in more detail in the near future, as sparse bundles have become popular largely because of their perceived superior performance.

The difference in compression rates between USB4 and Thunderbolt 3 SSDs is also surprisingly large. Of course, a Mac with fewer cores to run compression threads might not show any significant difference: a base M3 chip with 4 P and 4 E cores is unlikely to achieve a compression rate much in excess of 1.5 GB/s on its internal SSD because of its limited cores, so the restricted write speed of a Thunderbolt 3 SSD might not become the rate-limiting factor there.

Conclusions

  • The rate-limiting step in task performance will change according to multiple factors, including the effective use of multiple threads on multiple cores, and disk performance.
  • There’s no simple model you can apply to assess the effects of disk performance, and tests using disk images can be misleading.
  • You can’t predict whether a task will be disk-bound from disk benchmark performance.
  • Even expensive high-performance external SSDs can result in noticeably poor task performance. Maybe that money would be better spent on a larger internal SSD after all.

Tune for Performance: Activity Monitor’s CPU view

By: hoakley
11 December 2024 at 15:30

Activity Monitor is one of the unsung heroes of macOS. If you understand how to interpret its rich streams of measurements, it’s thoroughly reliable and has few vices. When tuning for performance, it should be your starting point, a first resort before you dig deeper with more specialist tools such as Xcode’s Instruments or powermetrics.

How Activity Monitor works

Like Xcode’s Instruments and powermetrics, the CPU view samples CPU and GPU activity over brief periods of time, displays results for the last sampling period, and updates those every 1-5 seconds. You can change the sampling period used in the Update Frequency section of the View menu, and that should normally be set to Very Often, for a period of 1 second.

This was adequate for most purposes with older M-series chips, but thread mobility in M4 chips can be expected to move threads from core to core, and between whole clusters, at intervals similar to the available sampling periods. That loses detail as to what’s going on in cores and clusters, and can sometimes give the false impression that a single thread is running simultaneously on multiple cores.

However, sampling frequency also determines how much % CPU is taken by Activity Monitor itself. While periods of 0.1 second and less are feasible with powermetrics, in Activity Monitor they would start to affect its results. If you need to see finer details, then you’ll need to use Xcode Instruments or powermetrics instead.

% CPU

Central to the CPU view, and its use in tuning for performance, is what Activity Monitor refers to as % CPU, which it defines as the “percentage of CPU capability that’s being used by processes”. As far as I can tell, this is essentially the same as active residency in powermetrics, and it’s vital to understand its strengths and shortcomings.

Take a CPU core that’s running at 1 GHz. Every second it ‘ticks’ forward one billion times. If an instruction were to take just one clock cycle, then it could execute a billion of those every second. In any given second, that core is likely to spend some time idle and not executing any instructions. If it were to execute half a billion instructions in any given second, and spend the other half of the time idle, then it has an idle residency of 50% and an active residency of 50%, and that would be represented by Activity Monitor as 50% CPU. So a CPU core that’s fully occupied executing instructions, and doesn’t idle at all, has an active residency of 100%.

To arrive at the % CPU figures shown in Activity Monitor, the active residency of all the CPU cores is added together. If your Mac has four P and four E cores and they’re all fully occupied with 100% active residency each, then the total % CPU shown will be 800%.

There are two situations where this can be misleading if you’re not careful. Intel CPUs feature Hyper-threading, where each physical core acquires a second, virtual core that can also run at another 100% active residency. In the CPU History window those are shown as even-numbered cores, and in % CPU they double the total percentage. So an 8-core Intel CPU then appears to have a total of 16 cores, and can reach 1600% when running flat out with Hyper-threading.

[Image: coremanintel]

[Image: cpuendstop]

The other situation affects Apple silicon chips, as their CPU cores can be run at a wide range of frequencies under the control of macOS. However, Activity Monitor makes no allowance for frequency. When it shows a core or total % CPU, that core could be running at a frequency as low as 600 MHz in the M1, or as high as 4,512 MHz in the M4, nine times as fast. Totalling these percentages also makes no allowance for the different processing capacity of P and E cores.

Thus an M4 chip’s CPU cores could show a total of 400% CPU when all four E cores are running at 1,020 MHz with 100% active residency, or when four of its P cores are running at 4,512 MHz with 100% active residency. Yet the P cores would have an effective throughput of as much as six times that of the E cores. Interpreting % CPU isn’t straightforward, as nowhere does Activity Monitor provide core frequency data.

[Image: tuneperf1]

In this example from an M4 Pro, the left of each trace shows the final few seconds of four test threads running on the E cores, which took 99 seconds to complete at a frequency of around 1,020 MHz, then in the right exactly the same four test threads completed in 23 seconds on P cores running at nearer 4,000 MHz. Note how lightly loaded the P cores appear, although they’re executing the same code at almost four times the speed.

Threads and more

For most work, you should display all the relevant columns in the CPU view, including Threads and GPU.

[Image: tuneperf2]

Threads are particularly important for processes to be able to run on multiple cores simultaneously, as they’re fairly self-contained packages of executable code that macOS can allocate to a core to run. Processes that consist of just a single thread may get shuffled around between different cores, but can’t run on more than one of them at a time.

Another limitation of Activity Monitor is that it can’t tell you which threads are running on each core, or even which type of core they’re running on. When there are no other substantial threads active, you can usually guess which threads are running where by looking in the CPU History window, but when there are many active threads on both E and P cores, you can’t tell which process owns them.

You can increase your chances of being able to identify which cores are running which threads by minimising all other activity, quitting all other open apps, and waiting until routine background activities such as Spotlight have gone quiet. But that in itself can limit what you can assess using Activity Monitor. Once again, Xcode Instruments and powermetrics can make such distinctions, but can easily swamp you with details.

GPU and co-processors

Percentages given for the GPU appear to be based on a similar concept of active residency without making any allowance for frequency. As less is known about control of GPU cores, that too must be treated carefully.

Currently, Activity Monitor provides no information on the neural processing unit (ANE), or the matrix co-processor (AMX) believed to be present in Apple silicon chips, and that available using other tools is also severely limited. For example, powermetrics can provide power use for the ANE, but doesn’t even recognise the existence of the AMX.

Setting up

In almost all cases, you should set the CPU view to display All processes using the View menu. Occasionally, it’s useful to view processes hierarchically, another setting available there.

Extras

Activity Monitor provides some valuable extras in its toolbar. Select a process and the System diagnostics options tool (with an ellipsis … inside a circle) can sample that process every millisecond to provide a call graph and details of binary images. The same menu offers to run a spindump, for investigating processes that are struggling, and to run sysdiagnose and Spotlight diagnostic collections. The Inspect selected process tool ⓘ provides information on memory, open files and ports, and more.

Explainers

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

[Image: multiterms1]

When an app is run, there’s a single runtime instance created from its single executable code. This is given its own virtual memory and access to system resources that it needs. This is a process, and listed as a separate entity in Activity Monitor.

Each process has a main thread, a single flow of code execution, and may create additional threads, perhaps to run in the background. Threads don’t get their own virtual memory, but share that allocated to the process. They do, though, have their own stack. On Apple silicon Macs they’re easy to tell apart as they can only run on a single core, although they may be moved between cores, sometimes rapidly. In macOS, threads are assigned a Quality of Service (QoS) that determines how and where macOS runs them. Normally the process’s main thread is assigned a high QoS, and background threads that it creates may be given the same or lower QoS, as determined by the developer.

Within each thread are individual tasks, each a quantity of work to be performed. These can be brief sections of code and are more interdependent than threads. They’re often divided into synchronous and asynchronous tasks, depending on whether they need to be run as part of a strict sequence. Because they’re all running within a common thread, they don’t have individual QoS although they can have priorities.
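
As a loose illustration of that distinction using Dispatch in Swift (the queue label is hypothetical):

```swift
import Dispatch

// One process; a queue whose work is run on a background thread at
// utility QoS; and two tasks submitted to it.
let worker = DispatchQueue(label: "com.example.worker", qos: .utility)

worker.async {
    // an asynchronous task: the caller continues without waiting
}
worker.sync {
    // a synchronous task: the caller blocks until this completes
}
```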

Tune for Performance: do more threads run faster?

By: hoakley
10 December 2024 at 15:30

One of the most distinctive features about modern Macs is that they have multiple cores, in Apple silicon models a minimum of eight. For an app to be able to make use of more than one core (or its equivalent) at a time it needs to divide its processing into threads, discrete blocks of code that can be run on different cores by macOS. If it doesn’t do that, then running that app on a high-end Pro, Max or Ultra chip is unlikely to be significantly faster than on a base model (assuming that task is CPU-bound).

Activity Monitor

You might be able to get a good idea as to how well an app makes use of multiple cores from watching it in use in Activity Monitor’s CPU History window, but in many cases that isn’t conclusive, and needs to be confirmed.

To illustrate ways to tackle this, I take the example of file compression. Some methods lend themselves to multiple threads better than others, and some may be implemented in a way that won’t accelerate when several cores are available. For the occasional user this might make little difference, but if you were to spend much of your day waiting for 10-100 GB files to compress, it merits a little exploration.

[Image: polycore1]

This CPU History window from a compression task run on an M4 Pro is typically unhelpful. A single file compression is seen in the group of 4-5 peaks in CPU in the right half of each trace. Although compression used all ten P cores, it appears to have been moved between P clusters, seen by comparing the timing of peaks on Cores 8 and 10, 9 and 11. At no time does the task appear to exceed 50% active residency on any of the cores, though. All you can do is guess as to what might be going on, and whether it might run faster on twelve or more P cores.

[Image: polycore2]

On another run, the same task is reported to have reached a peak of 500% CPU in 12 threads, but would it exceed that on more P cores, given that 10 were available?

Timing performance

As usual, I’ve been cheating a little to generate those results, by using my simple compression-decompression app Cormorant, in which I control how many threads it uses. If the app you’re trying to investigate offers a similar feature, then you can set up a standard task, here the compression of a 15.517 GB IPSW file, and time how long it takes using different numbers of threads.
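
If you’re writing your own test harness rather than using an app like Cormorant, the shape of such a test is straightforward. This sketch assumes a work(part:of:) function standing in for one thread’s share of the task:

```swift
import Foundation

// Hypothetical stand-in for one thread's share of the workload.
func work(part: Int, of total: Int) {
    // ... compress this thread's slice of the test file ...
}

// Run the same workload split across 1 to 5 threads, timing each.
for threads in 1...5 {
    let start = Date()
    DispatchQueue.concurrentPerform(iterations: threads) { i in
        work(part: i, of: threads)
    }
    let elapsed = Date().timeIntervalSince(start)
    print("\(threads) thread(s): \(String(format: "%.2f", elapsed)) s")
}
```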

[Image: polycore3]

The answers from Cormorant, which conveniently performs its own timing, are:

  • 1 thread takes 49.32 seconds
  • 2 take 26.74 s
  • 3 take 18.60 s
  • 4 take 14.29 s
  • 5 take 11.21 s.

So Cormorant’s compression can make good use of more P cores, although with diminishing returns.
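
Those diminishing returns are clearer as speed-ups relative to a single thread, computed here from the times above:

```swift
import Foundation

let times: [(threads: Int, seconds: Double)] =
    [(1, 49.32), (2, 26.74), (3, 18.60), (4, 14.29), (5, 11.21)]
for (threads, t) in times {
    let speedup = times[0].seconds / t
    print(threads, String(format: "%.2fx, efficiency %.0f%%",
                          speedup, 100 * speedup / Double(threads)))
}
// 1.00x 100%, 1.84x 92%, 2.65x 88%, 3.45x 86%, 4.40x 88%
```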

Very few apps give you this level of control, though. The only other compression utility that appears to do so is Keka.

[Image: polycore4]

In its settings, you can give its tasks a maximum number of threads, and even run them at custom Quality of Service (QoS) if you want them to be run in the background on E cores, and not interrupt your work on P cores.

Controlling threads

There is one way that you can limit the effective number of threads used by an arbitrary app, and that’s to run it in a Virtual Machine, as you control the number of virtual cores that it uses. While you can just about run a macOS VM on a single core alone, I suggest that a more workable starting point is two cores, and you can increase that number up to the total of P cores in the host without causing problems.

VMs have other virtues, including their relative lack of background processes, allowing their virtual cores to be almost entirely devoted to your test task. However, as they can’t run apps from the App Store other than Apple’s free suite of Pages, Numbers and Keynote, that could prevent you from using them for testing.

To set up a VM for thread tests, I duplicated a standard Sonoma 14.7.1 VM so I could throw it away at the end, opened it with five virtual cores, and copied over the test app Cormorant and file. I then closed that VM down, set it to use 2 virtual cores, opened it and ran my test. I repeated that with an increasing number of virtual cores up to a total of 5. Compression times are:

  • 2 vCPUs take 30.13 seconds
  • 3 take 21.36 s
  • 4 take 17.18 s
  • 5 take 13.59 s.

Those are only slightly slower than their equivalents from the host tests above.

Analysis

[Image: polycore5]

Plotting those results out using DataGraph, the lines of best fit follow power laws:

  • for real cores, time = 49.8863/(T^0.91169)
  • for virtual cores, time = 54.737/(T^0.854318)

where T is the number of threads. That explains the apparently diminishing returns with increasing numbers of threads, although the maths isn’t as simple as we might like.

A better way to look at this is by calculating the rate of compression in GB/s, simply by dividing the file size of 15.517 GB by each time. Here we end up with straight lines from linear regression, which are more amenable to thought.

[Image: polycore6]

Those regressions are:

  • for real cores, rate of compression = 0.0464 + (0.264 x T)
  • for virtual cores, rate of compression = 0.102 + (0.206 x T)

which are more generally useful.
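
For example, those regressions can be used directly to predict rates for other numbers of threads:

```swift
// Fitted rates in GB/s, where t is the number of threads.
func hostRate(_ t: Double) -> Double { 0.0464 + 0.264 * t }
func vmRate(_ t: Double) -> Double { 0.102 + 0.206 * t }

print(hostRate(10))   // 2.69 GB/s, the prediction tested below
print(0.206 / 0.264)  // ≈ 0.78: VM compression at about 78% of host speed
```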

Tuning performance

This might appear over-elaborate and of little practical use, but we now have a much better understanding of the factors limiting compression performance:

  • The more threads compression uses, the shorter time a task will take.
  • There are limits to that improvement, though, when substantially more than 5 threads are used.
  • Compression rates achieved in M4 P cores are significantly lower than read or write speeds of the internal SSD, so aren’t likely to be limited by the speed of a faster SSD (TB3 or USB4).
  • As compression appears to be CPU-bound, faster P cores would also be expected to result in shorter times.
  • Improving the efficiency of the compression code could increase performance.
  • Compression in a VM runs at about 78% of speed on the host.

To see how reliable these are, I therefore repeated the Cormorant test using all 10 P cores on the host Mac, which took 6.80 seconds, a little more than half the time for 5 cores. That’s a compression rate of 2.28 GB/s, rather less than the 2.69 GB/s predicted by the linear regression. That’s now approaching the write speed of some TB3 SSDs, and the rate-limiting step could then change to be the write performance of the storage being used, rather than CPU cores.

Inside M4 chips: CPU core management

By: hoakley
5 December 2024 at 15:30

Whether you’re a developer or user, gaining an understanding of how macOS manages the cores in an Apple silicon CPU is important. It explains what you will see in action when you open Activity Monitor, how your apps get to deliver optimal performance, and why you can’t speed up background tasks like Time Machine backups. In this series (links below) I’ve been trying to piece this together for the M4 family, and this article is my first attempt to summarise as much of the story as I know so far. I therefore welcome your comments, counter-arguments and improvements.

Scope

For the purposes of this article, I’ll consider a single thread that macOS is ready to load onto a CPU core for execution. For that to happen, five decisions are to be made:

  • which type of core, P or E,
  • which cluster to run it in,
  • which core within that cluster,
  • what frequency to run that cluster at,
  • the mobility of that thread between cores in the same cluster, and between clusters (when available).

Which type of core?

Since the early days of analysing M1 CPUs, it has been clear that the choice between P and E core types is made on the Quality of Service (QoS) assigned to the thread, availability of a core of that type, and whether the thread uses a co-processor such as the AMX. For the sake of generality and simplicity, I’ll here ignore the last of those, and consider only threads that are executed by the CPU alone.

QoS is primarily set by the process owning that thread, in the setting exposed to the programmer and user, although internally QoS is modulated by other factors including thermal environment. Threads assigned a QoS of 9 or less, designated Background, are allocated exclusively to E cores, while those with higher QoS of ‘user’ levels are preferentially allocated to P cores. When those are unavailable, they may be allocated to E cores instead.

This can be changed on the fly, reassigning higher QoS threads to run on E cores, but that’s not currently possible the other way around, so low QoS threads can’t be run on P cores. The sole exception to that is when run inside a virtual machine, when VM virtual cores are given high QoS, allowing low QoS threads within the VM to benefit from the speed of P cores.

Which cluster?

M4 Pro and Max variants have two clusters of P cores, so the next decision is which of those to run a higher QoS thread on:

  • if both clusters are shut down, one will be chosen and its frequency brought up;
  • if one cluster is already running and has sufficient idle residency (‘available residency’) to accommodate the thread, that will be chosen;
  • if one cluster already has full active residency, the other cluster will be chosen;
  • if both clusters already have full active residency, then the thread will be allocated to the E cluster instead.

This fills an active P cluster before allocating threads to the inactive one, and fills both P clusters before allocating a higher QoS thread to the E cluster.

[Image: m4coremanagement1]

Cluster allocation is of course simpler on the base M4, where there’s only one E and one P cluster. Should Apple make an M4 Ultra as expected, that would have not only four P clusters, but two E clusters. Until that happens, we can only speculate that similar rules would then apply, including to the E clusters.

Which core within the cluster?

This is perhaps the simplest decision to make. If there’s only one core with available residency in that cluster, that’s the only choice. Otherwise macOS picks an arbitrary core from those available, apparently to ensure roughly even use of cores within each cluster.

What frequency?

For the E cluster, choice of frequency appears straightforward: low QoS threads are run at the minimum E core frequency of 1,020 MHz or slightly higher, while higher QoS threads that spill over from fully occupied P clusters are run at the E core maximum frequency of 2,592 MHz.

P cluster frequency appears to be determined by the total active residency of that cluster after the new thread has been added. When that’s the only thread running in that cluster, maximum frequency of 4,512 MHz is chosen, but rising total active residency reduces that in steps down to about 3,852 MHz when all the cores in the cluster are at 100% active residency. In most cases, the big reduction in frequency occurs when going from about 200% to 300% total active residency. This currently appears to be part of a strategy to pre-emptively minimise the risk of thermal stress within the chip.

[Image: m4coremanagement2]

Thread mobility

Once the thread has been loaded into a core in the optimal cluster running at the chosen frequency, it’s likely to be moved periodically, both to any other free core within that cluster, and to another cluster, when available. While this does occur in previous M-series CPUs, it appears particularly prominent in M4 variants.

Movement of threads within a cluster can occur quite frequently, every 0.1 second or so, particularly within the E cluster. Movement between clusters occurs less frequently, about every 4-5 seconds, and would only occur when the other cluster is shut down or idle, so free to run all the threads of the current cluster. This is most probably to ensure even thermal conditions within the chip.

Summary

The whole strategy is shown in the following diagram, also available as a tear-out PDF.

[Image: m4coremanagement3]

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Inside M4 chips: Matrix processing and Power Modes
Inside M4 chips: Controlling frequency

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

How is Thunderbolt 5 doing so far?

By: hoakley
4 December 2024 at 15:30

When Apple launched its first M4 Macs just over a month ago, I was surprised that models with M4 Pro or Max chips offered Thunderbolt 5. Although there are still relatively few computers in use with support for TB5, and a dearth of peripherals, this article summarises early experience with this exotic new bus.

What TB5 peripherals are available?

As far as I’m aware, as of today there’s only one Thunderbolt 5 peripheral shipping in quantity, the Kensington SD5000T5 Thunderbolt 5 Triple Docking Station, with a total of three downstream TB5 ports. I’m just completing a full review of this, due to appear in MacFormat and MacLife magazines early next year.

SSDs have been announced by OWC, in its Envoy Ultra, and by Sabrent. The first of OWC’s have apparently started to ship, although they aren’t expected to become readily available until the New Year, when Sabrent’s are also expected.

OWC has also announced a TB5 hub, but that’s unlikely to appear until next year.

Several PCs are now available with TB5 support, although that seems to be fiddly to configure in Windows. Among those is the Razer Blade 18, which is even more expensive than a MacBook Pro with an M4 Max.

Other than those, there are lots of expensive TB5 cables, but precious little to connect to the other end.

Multiple displays

Many of those rushing to buy into TB5 are doing so because of its promised support for multiple displays. For example, the Kensington dock claims to support up to three 6K displays at 60 Hz with the M4 Max, and two with the M4 Pro. Although I have been unable to test those combinations, there are already reports that the M4 Max works well with three displays connected direct, but only two of those work when using the Kensington dock.

This has apparently taken Kensington and Apple by surprise, but until this has been addressed, I wouldn’t assume that you’ll be able to use all three displays attached to the dock.

Multiple SSDs

In early 2023, when TB4 hubs were becoming available, I wrote a whole series of articles here analysing their performance with a range of different SSDs. Links to those are given at the end. Those predated OWC’s superb Express 1M2 USB4 enclosure that now offers consistent and reliable performance for Apple silicon Macs, but not Intel models, which unfortunately lack support for USB4.

I have recently been revisiting SSD performance, both directly connected to my Mac mini M4 Pro, and working through the Kensington dock. Although some results are impressive, there are others that shock.

[Image: sysinfotb5]

As shown in System Information, this dock connects to the host at 80 Gb/s, and to each USB4 drive at the expected 40 Gb/s.

On the bright side, 1M2 enclosures that return direct read/write speeds of 3.7/3.7 GB/s read almost as fast when attached through the dock, but their write speed drops to 2.3 GB/s, similar to many TB3 SSDs attached directly. You can even connect a USB4 drive to each of the dock’s three TB5 ports to benchmark them simultaneously, and get read/write results of 2.1/2.1 GB/s on each of them. That performance represents the maximum total data transfer capacity, matching claims of 6 GB/s made of TB5 SSDs, and equating to 80 Gb/s in TB5/USB4v2 symmetric mode.

[Image: tb5tests]

Results from TB3 SSDs are more worrying. An award-winning certified Thunderbolt 3 SSD that achieves 2.9/2.2 GB/s read/write attached direct maintained a good read speed through the dock at 2.8 GB/s, but it almost ground to a halt during the write test, at 422 MB/s, which is roughly the speed you’d expect from a basic SATA SSD.

You can read similar experiences during early testing of this dock for PC World.

For the time being, TB5 performs well with USB4 and directly connected TB3 SSDs, and the dock is a good solution for those wanting high-speed access to two or three USB4 SSDs in OWC Express 1M2 enclosures. The dock does have serious problems when writing to TB3 SSDs, though, where it may fall far short of expectations. Hopefully these problems will be resolved early next year.

Recommendations

  • Although TB5 promises much, initial tests show that it currently has problems meeting that.
  • Reports indicate that it may not yet support M4 Max chips driving three 6K displays at 60 Hz from a TB5 dock.
  • Performance claimed for TB5 SSDs has not yet been confirmed in independent tests.
  • Performance of TB3 SSDs attached to a TB5 dock demonstrates some very poor write speeds.
  • Performance of USB4 SSDs attached to a TB5 dock demonstrates better and more consistent results, although their write speed also falls.
  • TB5 cables and peripherals are expensive.
  • Thunderbolt 5 is still at an experimental stage, and may take some time before it realises its potential.

Thunderbolt performance and TB4 hubs

General hub performance
Write speed throttling
How faster SSDs can impair performance of slower ones
Three SSDs on one hub
Getting best performance from Thunderbolt on Apple silicon Macs: a practical guide

Testing with Stibium

When using the ‘gold standard’ method of testing storage using my free Stibium, you don’t normally need to restart the Mac between write and read speed measurements. This has changed with the Mac mini M4 Pro, at least. If you go straight on to measure read speeds, results will be bogus because of what appears to be extensive caching of the files written during the previous write test. That results in absurdly high read speeds of more than 6 GB/s in most cases. This is surprising, as a total of just over 53 GB of files are written during the full write test, which seems far more than macOS should ever cache successfully!

For these tests on external SSDs, I therefore quit Stibium after measuring write speed, unmount the volume tested, remount it in Disk Utility, and open Stibium again to perform the read tests. This apparently clears caches reliably, and read speeds are consistent and in accord with those expected.

Interests

I bought my own Mac mini M4 Pro, Kensington SD5000T5 Thunderbolt 5 Triple Docking Station, and all the OWC Express 1M2 enclosures and their SSDs, at their regular retail prices. The only product tested here that has been provided by a manufacturer is, rather sadly, the TB3 SSD.

Managing kernel extensions

By: hoakley
3 December 2024 at 15:30

In case the message isn’t clear yet, third-party kernel extensions are on their way out, particularly in Apple silicon systems. Although macOS continues to extend the capabilities of its kernel using nearly 700 of them, for almost all purposes user apps and devices should now have switched to using modern system extensions and their equivalents. This article considers how you can clean your Mac of those old kernel extensions (kexts), particularly in preparation for a new Mac.

The problem

In the past, kexts have been used widely to support third-party hardware and features like software firewalls and security protection. Because those run at the kernel’s own highly privileged level, known as Ring 0, they have been a well-known cause of instability leading to kernel panics. They also provide malicious software with an opportunity to wreak havoc at the highest level. Replacing kexts with system extensions run at user level should improve both stability and security, although experience shows that the first of those isn’t yet guaranteed, as macOS support for system extensions hasn’t been as free of bugs as it should have been, and has sometimes caused kernel panics.

From Big Sur onwards, the previous scheme of prelinked kernels has been replaced by kext collections, which pool kexts into one of three:

  • The Boot Kext Collection (BKC), on the Sealed System Volume in /System/Library/KernelCollections (Intel Macs) or the Preboot volume (M-series Macs). This is the equivalent of the old prelinked kernel, and contains the kernel itself, and all the major system kernel extensions required for a Mac to function. This is typically about 65 MB in size on Intel.
  • The System Kext Collection (SKC), also on the Sealed System Volume of Intel Macs in /System/Library/KernelCollections, but not used on Apple silicon models. This contains other system kernel extensions, loaded after booting with the BKC, and typically around 370 MB.
  • The Auxiliary Kext Collection (AKC), stored on the Data volume in /Library/KernelCollections when it exists, is built and managed by the service kernelmanagerd. This contains all installed third-party kernel extensions, and is loaded after the other two. In Full Security mode on Apple Silicon there’s no AKC.

Cleaning up old kexts is a valuable move even if you’re not intending to migrate to an Apple silicon Mac, and has grown in importance with newer Macs. For a kext to be used on an Apple silicon Mac its boot security has to be reduced, third-party kexts must be enabled, and each one installed with authorisation through Privacy & Security settings, something you’ll probably want to avoid.

Uninstalling kexts

Before going any further, you need to check each of the kexts installed on your Mac to ensure that it can now be safely removed, either because there’s a better substitute, or because they have simply become orphaned. In many cases, on older Macs, kexts currently installed aren’t used at all, but have been abandoned there.

As you may be aware, kexts are now found in two different locations. They should have been originally installed by the app they came with, and saved first to /Library/Extensions. macOS then normally copies them from there to a folder nested inside /Library/StagedExtensions. That presents a further problem, as the latter are protected by SIP and can’t be tampered with. Cleaning up old kexts might thus appear an impossible task.

The good point about /Library/StagedExtensions is that its contents are structured to inform you of where the original kext came from: if it’s in /Library/StagedExtensions/Applications, then the kext was installed by that app, otherwise its folder path should lead you in the right direction.

For kexts installed by an app, or its installer, the responsibility for removing that kext rests with the app or its dedicated uninstaller. Apps that use kexts have to install them before use, for which most run scripts to move or copy the kext into the /Library/Extensions folder. They will then use additional scripts to update or remove that kext, that should save you the trouble. First visit that app’s support pages, and consult the procedure described there for uninstalling it. If the copy you have installed on your Mac isn’t the current version, you are likely to be advised to update to that latest version in order to uninstall it correctly.

Orphaned kexts

When a kext appears to have become orphaned from its app, or perhaps was installed independently, you’ll need to remove it manually. The steps required are:

  • Delete all apps and services that might try to use that kext, and any other kexts that might depend on it.
  • Unload the kext.
  • Delete the kext from the /Library/Extensions folder.
  • Restart to allow the auxiliary kext collection to be rebuilt.

It’s important to remove the app and any services that might try using that kext, as some might even try to reinstall it if they’re run. Others that might be run from LaunchAgents or LaunchDaemons could get into difficulties without the kext. Good uninstall scripts should deal with this thoroughly without you having to resort to looking for yourself.

Kexts are now unloaded using a command of the form
sudo kmutil unload -b [bundle-identifier]
or
sudo kmutil unload -p [bundle-path]
where bundle-identifier is a reverse URL like com.highpoint-tech.kext.HighPointRR, or the bundle-path is similar to /Library/Extensions/HighPointRR.kext. kmutil is the replacement for the old kextunload command. Although you could still use the latter, it will merely call kmutil to do its work.

With the kext unloaded, you can then delete it from its accessible location in /Library/Extensions. macOS controls its own staged copy in /Library/StagedExtensions, which it ultimately may get around to removing as well. If you want to remove all kexts from the staging directory, use the command
sudo kmutil clear-staging
although that will also clear any that you might still need, and could require those to be reinstalled.

With those cleared away, you’re ready to restart, which should trigger macOS to rebuild the auxiliary kext collection containing third-party kexts before they’re loaded during the restart. If the auxiliary kext collection isn’t rebuilt correctly, you can force that with the command
sudo kmutil rebuild
and authorising that, followed by restarting the Mac to bring those changes into effect.

You can then check to see if that kext has been loaded using the command
kmutil showloaded --collection aux
to list all loaded kexts in the auxiliary kext collection.

Migration to Apple silicon

Migration Assistant is reluctant to try installing kexts on your new Mac, and because of the requirements for their use on Apple silicon Macs, it’s unable to enable or load any. However, if you do need to use kexts on an M-series Mac, you’re going to have to enable them first using Startup Security Utility in Recovery Mode, then run the app’s install sequence and authorise them in Privacy & Security settings. That isn’t something that can happen behind your back, and without changing boot security settings no third-party extensions can be loaded. By default, your new Mac should start as clean as a whistle.

Summary

  • This is a good time to clean up old kernel extensions on your Mac.
  • Where possible, follow instructions for their removal provided on their support site.
  • Use an app’s uninstall feature, or its uninstaller, if provided.
  • Manual uninstalls require deletion of apps and code relying on that kext, unloading using kmutil, deletion from /Library/Extensions, and restarting.
  • Don’t worry about what may be left behind in /Library/StagedExtensions.
  • Old kexts can’t be enabled automatically by Migration Assistant for an Apple silicon Mac.

Further reading

Installing a kernel extension (Apple)
Installing system extensions and drivers (Apple)

Inside M4 chips: Controlling frequency

By: hoakley
2 December 2024 at 15:30

To realise best performance and energy efficiency from the big.LITTLE architecture in Apple’s M-series chips requires careful management on the part of macOS. There’s much more to it than balancing loads over conventional multi-core CPUs with a single type of core, as each execution thread needs to be run in an optimal location. When deciding where to run a CPU thread, macOS controls:

  • which type of core, P or E, primarily determined by the thread’s Quality of Service (QoS), and core availability;
  • which cluster to run it in, for chips with more than one cluster of that type, set to keep no more clusters active than necessary;
  • which core within that cluster, determined by core availability, and semi-randomised to even out core use;
  • what frequency to run that cluster at, in turn depending on the core type and the thread’s QoS;
  • mobility of that thread between cores in the same cluster, and between clusters (when available).

Over the last four years, I have explored the rules apparently used for the first two, and the choice of frequency in E cores. This article looks in more detail at how the frequency of P clusters appears to be determined in M1, M3 and particularly M4 chips.

powermetrics provides more frequency figures than you know what to do with, although most are derived and to some extent imaginary, making reconciliation difficult. In tests reported in the previous article, I used those given as Cluster HW active frequency to demonstrate distinctive patterns seen on M4 P cores running different numbers of test threads.
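
For reference, the sampling used throughout these tests can be performed with an invocation along the following lines, collecting the cpu_power samplers every 0.1 second for 50 samples and writing the results to a text file (exact options may vary between versions of macOS):
sudo powermetrics --samplers cpu_power -i 100 -n 50 -o samples.txt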

m4frequenciesByThreads

This graph shows those frequencies for the active P cluster by the number of threads, for floating point, NEON and vDSP_mmul tests detailed previously. Frequencies for the first two tests are identical, at P core maximum for a single thread, then falling sharply from 2 to 3 threads. When more threads are run, a cluster that’s fully active is run at the same frequency as that for 5 threads (P cluster size on this M4 Pro), while the other P cluster follows the same frequencies shown in the graph for the number of threads it’s running.

To examine this further I first climbed a mountain.

Climbing a mountain

For this test I used three copies of my test app to run a total of three identical threads of my in-core floating point test code in a mountain pattern. I first started powermetrics gathering data, then launched the first thread, followed by the second, and then the third. My objective was to observe an initial period when just one test thread was running, a second with the second test thread in addition, a third when all three threads would be running, and then watch the sequence reverse as each thread ended. This is shown in the results below.

m43appflopt1

This chart shows active residencies by core and cluster for the P cores in an M4 Pro, with 5 P cores in each of its two P clusters, during this test. For the first 15 sample periods (1.5 seconds), a single test thread is moved around between cores in the second P cluster (P1). That’s joined by the second thread run on another core in the same cluster, until sample 30, when the third thread is added, pushing the total active residency to 300%.

At that point, all three threads are moved to three cores in the first P cluster (P0), whose bars are shown in blues and green. The first thread completes in sample 37, leaving two threads with 200% active residency to continue in that cluster until sample 50, when the second thread completes, leaving just one running. In sample 54 (5.4 seconds after test start), that one remaining thread is moved back to complete on core P11 in the second cluster late in sample 63.

In that period of 6.3 seconds, each of the two P clusters has run 1, 2 and 3 threads.

m43appflopt2

This graph shows cluster frequencies over the same period, this time given in seconds elapsed rather than sample number. The red line and points show the frequency of cluster P1, and blue for P0. Those undergo step changes when each cluster is running test threads. The inactive cluster is normally shut down with a frequency of 0 MHz, although there are some brief spikes from that as well.

m43appflopt3

Combining active residency bars in yellow with core frequency lines, it’s clear that cluster frequency is close to core maximum at 4,500 MHz when only a single thread is running. With two threads, it’s reduced to 4,400 MHz, and down to 3,900 MHz when all three threads are running. Those changes are symmetrical for loading and unloading clusters, and show no signs of hysteresis (different values during loading and unloading).

Closer examination gives frequencies of 4,511 MHz at 100% active residency, 4,415 MHz at 200%, and 3,924 MHz at 300%. The latter is 87% of maximum frequency, a large enough reduction to be reflected in performance. Essentially identical figures are found for NEON tests as well as these for floating point.

Although this test method can give highly reproducible results, the floating point and NEON tests used don’t resemble threads seen in everyday use. The next step is to extend that by looking at thread numbers and frequency when running more normal code.

Compressing a file

Fortunately, I have already built a suitable platform for real-world testing in a one-trick pony named Cormorant, a basic compression-decompression utility using Apple Archive. Although not a patch on serious apps like Keka, Cormorant can set the number and QoS of threads to be run during compression/decompression. Because it relies on Apple’s framework, it actually runs more than just the threads set in its controls, but still provides a way to control active residency. I therefore ran a test compression (15.5 GB IPSW image file) at maximum QoS to ensure it’s dispatched to P cores, and 1-3 threads.

Time taken to compress the test file changes greatly according to the number of threads used:

  • 1 thread takes 49.6 s, at 313 MB/s;
  • 2 threads takes 26.8 s, at 578 MB/s, 191% of the throughput of the single thread;
  • 3 threads takes 18.7 s, at 829 MB/s, 265% of the single thread.

These appear to follow the pattern of frequencies observed on my in-core tests.
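
Those throughputs follow directly from the file size and times. This short Swift check, using the figures above, reproduces them:

let sizeMB = 15_500.0   // the 15.5 GB test file, expressed in MB
for (threads, seconds) in [(1, 49.6), (2, 26.8), (3, 18.7)] {
    let throughput = sizeMB / seconds
    print("\(threads) thread(s): \(Int(throughput.rounded())) MB/s")   // 313, 578 and 829 MB/s
}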

m4corm1thread1

This chart shows the opening 3 seconds of single-thread compression, with cluster frequency in the points and line, and total active residency multiplied by 10 (to scale to a common y axis) in pale blue bars. Two significant periods are shown: in samples 12-21, active residency is high, between 300-430%, and frequency is lower at around 4,000 MHz. Following that, active residency falls to about 200% and frequency rises to 4,200 MHz.

Because active residency was so variable in these tests, I pooled paired values for that and cluster frequency, gathered over 3 second periods, and plotted those.

m4corm1thread2

Although at active residencies below about 180% there’s a wide scatter, above that there’s a good linear regression, showing steady decline in frequency over active residencies ranging from 180% to 450%.

The following two graphs show equivalents for tests using 2 and 3 threads. The first of those has two outliers at total active residencies above 490%, corresponding to unusual conditions during the test. I have therefore excluded those from subsequent analyses.

m4corm2thread1

m4corm3thread1

The last step is to pool paired results from all three test conditions, and arrive at a line of best fit.

m4corm1-3Poolthread1

Between total cluster residencies of 150-500%, this works best with a quadratic curve with the equation
F = 4630.05 – (2.0899 x R) + (0.0010107 x R^2)
from which frequency F, in MHz, can be predicted for any total active residency R, in %.
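
To make that concrete, here’s a minimal Swift sketch (the function name is mine) that evaluates the fitted curve for a given residency:

func predictedFrequency(residency r: Double) -> Double {
    // fitted quadratic, valid for total active residencies of about 150-500%
    return 4630.05 - (2.0899 * r) + (0.0010107 * r * r)
}

let f = predictedFrequency(residency: 300)   // about 4,094 MHz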

My other real-world test makes use of the fact that, when virtualising macOS, the number of virtual cores on the host is specified.

Hosting virtual cores

Although virtualisation relies on frameworks run on the host, experience is that its demand on host P cores is constrained to the number of virtual cores allocated, with each of those resulting in 100% active residency, equating to the whole of a P core on the host. powermetrics started collecting sample periods immediately before a macOS 14 VM was launched, and the first 3 seconds (30 samples) were collected and analysed for VMs with 1-3 virtual cores.
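
For anyone unfamiliar with how that’s specified, this is a minimal sketch using Apple’s Virtualization framework; a usable VM also needs a platform, boot loader, storage and other devices, all omitted here:

import Virtualization

let config = VZVirtualMachineConfiguration()
config.cpuCount = 3                          // number of virtual cores allocated to the VM
config.memorySize = 8 * 1024 * 1024 * 1024   // 8 GB of host memory
try config.validate()                        // throws if the configuration is invalid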

m4vm1thread1

This shows the first of those, a VM allocated just a single virtual core, with cluster frequencies shown as red and blue lines, and total active residency multiplied by 10 in the pale blue bars. With a steady total active residency of 100%, active cluster frequency was about 4,500 MHz. Note that sample 7 included transfer of the threads from P1 to P0 in a sharp peak to a total active residency of over 500%.

Average frequencies can thus be calculated for each of the three tests, at 100-300% active residencies.

Set frequencies

I now have estimates of cluster frequencies for cluster total active residencies from:

  • in-core tests using floating point
  • in-core tests using NEON
  • compression
  • virtualisation

against which I compare a matrix multiplication test that may be run on shared matrix co-processors (AMX). These are shown in the table below.

m4pclusterfreq

Running a single thread in a cluster should result in a total active residency of 100%, for which macOS sets the cluster frequency at P core maximum, of 4,400-4,511 MHz. That for 200% is lower, at between 4,000-4,400 MHz, and falls off further to about 3,800 MHz when all 5 cores are at 100% active residency. Frequencies set for the vDSP_mmul test are significantly lower throughout, supporting the proposal that test isn’t being run conventionally in P cores, but in a co-processor.

A sixth thread would then be loaded onto the other P cluster, where cluster frequency would be set at P core maximum again, progressively reducing with additional threads until that cluster was also running at about 3,800 MHz.

Following this, I returned to the tests I have performed over the last four years on M1 and M3 P cores. Although I haven’t analysed those formally, I now believe that their frequencies are controlled by macOS as follows:

  • M1 1 core at 3,228 MHz, 2 cores 3,132 MHz, 3-6 cores 3,036 MHz.
  • M3 1 core at 3,624 MHz (below maximum of 4,056), 2-6 cores 3,576 MHz.

The range of frequencies in the M1 and M3 is narrower, resulting in less difference in performance between single- and multi-core tests. However, the M4 falls to 87% maximum frequency at 3 threads and more, which is substantial. It’s worth noting that Geekbench single-core results for the M4 are around 3,892 and would scale up to a multi-core result of 38,920 on an M4 Pro with 10 P cores, whereas the actual multi-core score is about 22,700, 58% of the scaled value. Although the effects of lower frequency can’t account for all that difference, they must surely contribute to it.

Why?

Two plausible contenders for the reason that macOS reduces P cluster frequency with increasing active residency are for thermal management, hence reliability, and when competing for a limited resource, perhaps the L2 cache shared within each cluster.

The reductions in cluster frequency seen here aren’t thermal throttling, though. Tests were intentionally kept brief in order to accommodate their results in reasonably short series of powermetrics samples. Power use was highest in the NEON and vDSP_mmul tests, and lowest in floating point, although there don’t appear to be matching differences in frequency control. As noted in the previous tests, High Power mode didn’t alter frequency control, although frequencies were reduced in Low Power mode.

It’s most likely that this frequency regulation is pre-emptive, and based not just on CPU cores, but allows for likely heat output in the rest of the Mac.

Key information

  • When running on Apple silicon Macs, macOS modulates ‘cluster HW active frequency’ of P cores, limiting frequency to below maximum when cluster total active residency exceeds 100%.
  • Although small in M1 variants, this is most prominent in M4 variants, where a total active residency of 300% may reduce cluster frequency to 87% of maximum.
  • Frequency limitation is most probably part of a pre-emptive strategy in thermal management.
  • Frequency limitation is at least partly responsible for non-linear changes in performance with increasing recruitment of P cores, as illustrated in single- and multi-core benchmarks.
  • Control of P cores by macOS is complex, particularly in M4 variants.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Inside M4 chips: Matrix processing and Power Modes

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Acknowledgements

Several of you have contributed to discussions here, but Maynard Handley has for several years provided sage advice, challenging discussion, and his personal mine of information. Thank you all.

Inside M4 chips: Matrix processing and Power Modes

By: hoakley
27 November 2024 at 15:30

For much of the last four years, CPU and GPU core performance have been of primary importance to those using Apple silicon Macs. Despite that, special interest has developed in those cores that Apple doesn’t talk about, in its matrix co-processor, AMX. Since the first M1 it has been believed that each CPU core cluster has its own AMX, and more recently it has been shown to be capable of impressive performance. With the increasing prominence of computationally intensive features including AI, this is growing in importance.

When the M4 first became available in iPads earlier this year, researchers concluded its chip has a new AMX co-processor that can now be programmed using SME matrix extensions supported by the new version of the Arm instruction set in the M4, ARMv9.2-A. The best accounts of that challenging work are given on the Friedrich Schiller University’s site Hello SME, and Taras Zakharko’s GitHub.

From early work by Dougall Johnson on the M1, it has been known that some of the functions in Apple’s vast Accelerate maths libraries can run code on the AMX. Thanks to the guidance of Maynard Handley, a year ago I concluded that one of those is the vDSP_mmul function in the vDSP sub-library. This article reports tests of that function in a Mac mini M4 Pro running Sequoia 15.1.1, leads on to an explanation of previous results using floating point and NEON tests, and considers the effects of Power Modes.

Methods

I used the same methods as described previously. These consist of running many tight loops of test code designed to be confined as much as possible to register access. The test used here calls vDSP_mmul, multiplying two 16 x 16 32-bit floating point matrices, and consists of 15 x 10^6 such loops. Its source code is given in the Appendix at the end. During each test run, the command tool powermetrics gathers core power and performance data in sampling periods of 0.1 second for a total of 50 samples. Its output is piped into a text file, which is then analysed using Numbers and DataGraph.

For comparison, I also show here my results from equivalent tests using assembly code for the NEON vector processor, as given in that previous article.

Power used by thread

The first graphs show average power use during each test with increasing numbers of threads, where error bars indicate a spread of +1 standard deviation.

m4powervdsp1

The linear relationship for 1-10 threads, run on 1-10 P cores, is a better fit for vDSP_mmul, shown above, than NEON below. Run on P cores, each vDSP_mmul thread used about 3.6 W, significantly greater than NEON at 3.0 W. However, when those high QoS threads spilt over onto E cores, that relationship for vDSP_mmul broke down, leaving the highest power use as that for 10 threads, one for each P core available in the M4 Pro chip used.

There’s no evidence of the steps seen below in NEON at 2-3 and 7-8 threads.

m4powerneon1

Execution time

m4powervdsp2

Total execution time has a strongly linear relationship too, for vDSP_mmul at about 2.6 seconds per thread. Again, that relationship broke down once test threads had spilt over onto E cores, unlike in the graph below for NEON.

m4powerneon2

m4powervdsp3

There are also obvious differences in the time required to execute each thread. vDSP_mmul (above) showed a linear increase from 1-5 threads, followed by a constant time for 5-10 threads. Once threads were running on E cores, the relationship became far looser, as seen in the red regression line. In NEON below, the relationship was weakest in the 1-5 thread range, and closer on the E cores from 11-14 threads.

m4powerneon3

Energy use

m4powervdsp4

When run on P cores, there’s a good line of fit indicating energy use of 9.2 J per thread (above), compared with NEON (below) at 7.7 J. That’s consistent with the power and time measurements above: each vDSP_mmul thread drawing about 3.6 W for about 2.6 seconds works out at roughly 9.4 J. The red line of best fit for the E core section from 11-14 cores suggests E core energy use of about 4.8 J per thread, again higher than NEON, which shows a much better fit and only about 3 J per thread.

m4powerneon4

Maximum total energy use estimated for vDSP_mmul was just over 140 J, while for NEON it was only about 90 J.

The overall picture of vDSP_mmul is thus different from those seen in floating point and NEON tests. When run on P cores alone, vDSP_mmul behaves more linearly, using significantly more power and energy. Once running threads on E cores, though, that breaks down and performance falls, rather than simply slowing.

The role of frequency

There has long been a tacit assumption that, when running on P cores, computationally intensive threads such as those used in these tests are run at a fairly constant frequency close to maximum. Looking back at my earlier results on M1 and M3 cores, though, frequencies aren’t so consistent, and in many cases not that close to maximum either.

powermetrics provides more frequency figures than you know what to do with, although most are derived and to some extent imaginary, making reconciliation difficult. Taking the best estimate of core frequency as that given as Cluster HW active frequency, patterns seen on M4 P cores are distinctive.

m4frequenciesByThreads

This graph shows those frequencies for the active P cluster by the number of threads, for floating point, NEON and vDSP_mmul tests. Frequencies for the first two are identical, at P core maximum for a single thread, then falling sharply from 2 to 3 threads. When more threads are run, the cluster that’s fully active is run at the same frequency as that for 5 threads (P cluster size on this M4 Pro), while the other P cluster follows the same frequencies shown in the graph for the number of threads it’s running.

Frequencies are controlled by macOS, and this suggests it adopts a standard pattern when running the two in-core tests, and a different one for vDSP_mmul presumably geared to performance of the AMX. Changes in frequency also account for the steps seen at 2-3 and 7-8 (= 5+2 and 5+3) threads in the NEON graphs above.

Power Modes

There has been considerable interest in the Power Mode setting available in macOS when running on M4 Pro and Max chips. To assess its effects I ran tests in 10 threads to fully occupy the P cores, at each of the three Power modes.

There was no difference between results for the default Automatic and High Power modes, as expected. This is because High Power mode doesn’t change frequencies or power use in the short term; instead, more aggressive fan use enables higher frequencies to be sustained for prolonged periods that would be throttled in Automatic mode.

Low Power does have substantial effects on core frequency, performance and power use, though. When running floating point tests in 10 threads, their cluster frequency was reduced from 3,852 to 3,624 MHz, 94% of Automatic and High Power. That reduced power use from a mean of 13.9 W to 11.2 W, and increased the time to complete threads. Time taken by floating point threads increased to 106% of Automatic and High, while that for NEON increased to 135% and vDSP_mmul to 177%. While the reduced performance for floating point threads is unlikely to be noticeable, for vector and matrix threads it’s likely to be obvious to the user.

Key information

  • vDSP_mmul matrix multiplication from the vDSP sub-library in Accelerate behaves consistently with it being performed in the AMX co-processors in M4 Pro chips.
  • vDSP_mmul threads used significantly more power than NEON, reaching a maximum of just over 36 W when fully occupying all 10 P cores.
  • When spilt over to E cores, vDSP_mmul threads were much slower and their performance erratic, consistent with the E cluster having a smaller and significantly less performant AMX.
  • In-core tests (floating point and NEON) show common frequency regulation according to the number of cores active in each P core cluster. This runs a single thread at maximum frequency, then reduces sharply from 2 to 3 threads/cores. This accounts for the deviations from linearity observed in power use and performance. That pattern doesn’t appear in vDSP_mmul threads, though.
  • High Power and Automatic modes are identical in short-term tests.
  • Low Power mode reduces P cluster frequency and power use. Although its effects are unlikely to be noticeable in floating point threads, effects on vector and matrix threads are greater, and performance reductions are likely to be obvious to the user.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance
Inside M4 chips: CPU power, energy and mystery
Finding and evaluating AMX co-processors in Apple silicon chips (M1 and M3)

Appendix: Source code

16 x 16 32-bit floating point matrix multiplication

// Assumes Accelerate has been imported, and that theReps (the loop count) is set by the enclosing test harness.
var theCount: Float = 0.0
let A = [Float](repeating: 1.234, count: 256)
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)
let IC: vDSP_Stride = 1
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
    B.withUnsafeBufferPointer { Bptr in
        C.withUnsafeMutableBufferPointer { Cptr in
            for _ in 1...theReps {
                vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB, Cptr.baseAddress!, IC, M, N, P)
                theCount += 1
            }
        }
    }
}
return theCount

Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision.” “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”

Inside M4 chips: CPU power, energy and mystery

By: hoakley
25 November 2024 at 15:30

Few comparisons or benchmarks for M-series chips take into account the reason for equipping Apple silicon chips with more than one CPU core type, according to Arm’s big.LITTLE architecture. Measuring single- or multi-core performance ignores the purpose of E cores, and estimating overall power use can’t compare those core types. This article tries to estimate the cost in terms of power and energy of running identical tests on M4 P and E cores, and thereby provide insight into some of the most distinctive features of Apple silicon, and their benefits.

Methods

To run these two in-core performance tests I use a GUI app wrapped around a series of loading tests designed to enable the CPU core to execute that code as fast as possible, and with as few extraneous influences as possible. Both tests used here are written in assembly code, and aren’t intended to be purposeful in any way, nor to represent anything that real-world code might run. Those are:

  • 64-bit floating point arithmetic, including an FMADD instruction to multiply and add, and FSUB, FDIV and FADD for subtraction, division and addition;
  • 32-bit 4-lane dot-product vector arithmetic (NEON), including FMUL, two FADDP and a FADD instruction.

Source code of the loops is given in the Appendix.

The GUI app sets the number of loops to be performed, and the number of threads to be run. Each set of loops is then put into the same Grand Central Dispatch queue for execution, at a Quality of Service of the maximum of 33. That ensures they are run preferentially on P cores, but will spill over to E cores when no P core is available, when more than 10 threads are run concurrently. Timing of thread execution is performed using Mach Absolute Time, and the time for each thread to be executed is displayed at the end of the tests.
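
In outline, that dispatch mechanism amounts to the following sketch, and not the app’s actual source; threadCount and runTestLoops() here are hypothetical stand-ins for the app’s settings and its assembly-code test loops:

import Foundation

let threadCount = 10   // number of concurrent test threads (an assumption here)
func runTestLoops() {
    // stand-in for the assembly-code test loop
    var x = 1.234
    for _ in 0..<1_000_000 { x = x * 1.000001 + 0.000001 }
    _ = x
}

let queue = DispatchQueue.global(qos: .userInteractive)   // QoS 33; .background would confine threads to E cores
let group = DispatchGroup()
for _ in 0..<threadCount {
    queue.async(group: group) {
        let start = mach_absolute_time()
        runTestLoops()
        let elapsed = mach_absolute_time() - start
        print("thread completed in \(elapsed) Mach time units")
    }
}
group.wait()   // block until every test thread has completed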

For these tests, the total number of loops to be executed in each thread was set at 5 x 10^8 for floating point, and 3.5 x 10^9 for NEON. Those values were chosen to take 2-3 seconds per thread, to ensure the whole test period was available for analysis.

Immediately before running each test, I launch powermetrics from the command line, to gather core power and performance data in sampling periods of 0.1 second for a total of 50 samples. Its output is piped into a text file, which is then analysed using Numbers and DataGraph. All tests were conducted on a Mac mini M4 Pro with 10 P and 4 E cores, running macOS 15.1.1 in standard power mode.

Each test was inspected individually, and seen to contain the following phases:

  1. small initial activity resulting from bringing the GUI app into focus, and clicking the Run button;
  2. a brief period of low activity, typically with total CPU power at below 50 mW;
  3. 1-2 sample periods when threads are loaded onto the cores;
  4. 15-21 sample periods when threads are run, whose total CPU power measurements are collected for analysis;
  5. 1-2 sample periods when threads are unloaded;
  6. a return to low activity, typically with total CPU power returning below 50 mW.

Means and standard deviations were then calculated for each series of power measurements, and pooled with times taken to execute threads.

Power used by thread

The first pair of graphs shows average power use for the number of threads run, shown here with error bars giving the range of +1 standard deviation. These show two sections: for 1-10 threads, when all were running on P cores, and for 11-14 threads, when the 10 P cores were fully committed and 1-4 threads spilt over to run on E cores at their maximum frequency. Maximum power used during testing was just short of 34 W.

m4powerflopt1

That for the floating point test above, and NEON below, have regression lines fitted, indicating that:

  • Each additional floating point thread required 1,300 mW on P cores, and 110 mW on E cores.
  • Each additional NEON thread required 3,000 mW on P cores, and 280 mW on E cores.
  • P cores thus required 11-12 times the power of E cores, or E cores used 8-9% of the power of P cores.

m4powerneon1

Although linear regressions aren’t a bad fit, there’s consistent deviation from the linear relationship seen in previous analyses on M1 and M3 cores. More remarkably, the pattern of deviation is identical between these two tests, although they run in different units in these cores. In both cases, power use was high for 2 and 7 threads, while that for 3 and 8 threads was slightly lower. The only unusual pattern seen in powermetrics output was that, when running 2 and 7 threads, thread mobility was much higher than in other tests.

Previous tests on M1 and M3 P cores found that each additional floating point thread run on those requires about 935 mW, indicating a substantial increase in power used by M4 P cores when running at their higher maximum frequency. E cores in an M1 Pro require about 100 mW each when running at maximum frequency, similar to those in the M4.

Execution time

As power is the rate of energy use over time, the next step is to examine total execution time for all the threads running concurrently, which should form a linear relationship with different gradients for P and E cores. The next two graphs demonstrate that.

m4powerflopt2

For both floating point (above) and NEON (below), there’s a tight linear relationship between total execution time and numbers of threads. Floating point demonstrates that each thread costs 2.4 seconds on P cores and 3.6 seconds on E cores, making E core execution time 150% that of P cores. NEON is similar, at 2.5 seconds on P cores and 3.4 seconds on E cores, for a ratio of 136%.

m4powerneon2

Time taken for the slowest thread to complete execution shows interesting finer detail.

m4powerflopt3

For both tests, performance falls into several sections according to the number of threads run. With fewer than 5 threads run, there’s a sharp rise in time taken per thread. From 5-10 threads, time required remains constant, before increasing from 10-14 threads, when additional threads are spilt over onto E cores.

This has implications for anyone trying to measure core performance, as it demonstrates that a single thread can run disproportionately fast, compared with 3-10 threads. Basing any conclusion or comparison on a single thread completing in little more than 2 seconds, when 5 concurrent threads would take 2.34 seconds, 117% of the single thread, could be misleading.

m4powerneon3

Energy use

Although power use determines heat production, so is an important factor in determining cooling requirements, total energy required to execute threads is equally important for Macs running from battery. Simply reducing core frequency will reduce power used, but by extending the time taken to complete tasks, it may have no effect on energy used, and battery endurance. My final two graphs therefore show estimated total energy used when running test threads on P and E cores, the ultimate test of any big.LITTLE CPU design such as that in the M4.

m4powerflopt4

Graphs for floating point (above) and NEON (below) are inevitably similar in form to those for power, with a near-linear section from 1-10 cores, when the threads are run only on P cores, and from 11-14 cores when they also spill over to E cores.

Fitted regression lines provide the energy cost for each additional thread:

  • For floating point, each thread run on a P core costs 3.1 J, and for an E core 1.5 J, making the energy used by an E core 47% that of a P core.
  • For NEON, P cores cost 7.7 J per thread, and E cores 3.0 J, making the energy used by an E core 38% that of a P core.

It’s important to remember that the E cores here aren’t being run at frequencies for high efficiency, but at their maximum so they can substitute for the P cores that are already in use.

m4powerneon4

Considering the small deviations from those linear relationships, it appears that running 2, 6 or 7 threads on P cores requires slightly more energy than predicted from the regression lines shown.

Unfortunately, assessing the energy used by E cores running at low frequencies, as they normally do when performing background tasks, is fraught with inaccuracies due to their low power use. My previous estimate for floating point tests is that a slow-running E core uses less than 45 mW per thread, and for the same task requires about 7% of the energy used by a P core running at maximum frequency, but I have lower confidence in the accuracy of those figures than in those above for higher frequencies.

Key information

  • When running the same code at maximum frequency, E cores used 8-9% of the power of P cores.
  • Power use when running 2 or 7 threads was anomalously high, possibly due to high thread mobility.
  • Execution on E cores was significantly slower than on P cores, at 136-150% of the time required on P cores.
  • Single-core performance measurements may not be accurate reflections of performance on multiple cores.
  • When running the same code at maximum frequency, energy used by an E core is expected to be 38-47% that of a P core.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores
Inside M4 chips: CPU core performance

Appendix: Source code

_fpfmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
FMOV D4, D0
FMOV D5, D1
FMOV D6, D2
LDR D7, INC_DOUBLE
fp_while_loop:
SUBS X4, X4, #1
B.EQ fp_while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
B fp_while_loop
fp_while_done:
FMOV D0, D4
LDR LR, [SP], #16
RET

_neondotprod:
STR LR, [SP, #-16]!
LDP Q2, Q3, [X0]
FADD V4.4S, V2.4S, V2.4S
MOV X4, X1
ADD X4, X4, #1
dp_while_loop:
SUBS X4, X4, #1
B.EQ dp_while_done
FMUL V1.4S, V2.4S, V3.4S
FADDP V0.4S, V1.4S, V1.4S
FADDP V0.4S, V0.4S, V0.4S
FADD V2.4S, V2.4S, V4.4S
B dp_while_loop
dp_while_done:
FMOV S0, S2
LDR LR, [SP], #16
RET

A brief history of Mac CPUs

By: hoakley
23 November 2024 at 16:00

Macs have used four different architectures for their Central Processing Units over the last 40 years. From their launch by Steve Jobs on 24 January 1984, for the first decade they used Motorola 68K CPUs, then switched to PowerPCs designed by an alliance of Apple, IBM and Motorola, which were used for 12 years. After 14 years being built around Intel processors from 2006, Macs most recently changed a third time to use Apple’s own Arm-based chips.

Over those 40 years, continuous improvements in capabilities and performance of CPUs have transformed Mac OS and the apps it supports.

Motorola 68K

CPUs execute instructions in synchrony with a clock whose frequency determines the rate of instruction execution. The Motorola 68000 processor in the original Mac 128K ambled along at a clock speed of just 8 MHz. The last 68K models featuring 68040 CPUs had raised that to 33 MHz, and added specialist Memory Management Units (MMUs) and floating point units. The latter first appeared as 68881 and 68882 maths co-processors, but were later integrated into the 68040.

MMUs were particularly important for the implementation of virtual memory. When the Macintosh II was introduced in 1987, it was the first Mac that could be fitted with Motorola’s optional 68851 paged MMU, required for it to run Apple’s A/UX port of Unix with virtual memory support. Strangely, Apple’s own MMU fitted in the standard Mac II didn’t support virtual memory. Its 68020 CPU was also the first in Macs to use 32 bits rather than the 16 of the original 68000.

AIM PowerPC

When introduced in 1994, the first Power Macs came with PowerPC 601 or 601+ CPUs running at frequencies up to 110 MHz, nearly 14 times faster than the original Mac 128K. Just over a decade later, the last Power Mac G5 raised that to dual two-core CPUs at 2,500 MHz, more than 20 times the clock frequency, and the previous model had offered dual 2,700 MHz CPUs.

PowerPCs had their origins in IBM’s high-end POWER architecture based on a reduced instruction set (RISC) intended to be run at higher frequencies. Initially, these CPUs used a 32-bit design, but progressed to 64 bits. Not only do they have integrated floating point units that were extended for Apple, but later models include AltiVec vector processing for single-precision floating point and integer operations.

High CPU frequencies bring higher power consumption and heat output. A dual-core G5 with a PowerPC 970MP CPU used a maximum of 100 W at 2,000 MHz, and some of the last G5 Macs used liquid cooling to cope with the heat generated at higher frequencies. Those didn’t prove long-lived, with coolant leaks a common and fatal failing.

Intel x86

In early 2006 Apple started releasing its new range of Macs using Intel CPUs. With the exception of a base model of Mac mini, those came with 2-core Core Duo processors running at up to 2 GHz, and were soon followed by the first Mac Pros featuring two 64-bit 2-core Xeon 5100 CPUs (Woodcrest) at up to 3 GHz. By the following year, the first 8-core Mac Pro was available.

Earlier increases in CPU frequency gradually petered out. The last Intel Mac Pro was available with cores running at 2.5-3.5 GHz, boosted to a maximum of up to 4.4 GHz. Instead, high-end models offered as many as 28 cores and drew power up to 900 W. More typical of late desktop Macs were Intel Core i9 CPUs with 6-8 cores at similar frequencies. Adding more processor cores has been an effective way to run more code at the same time. Tasks are divided into threads that can run relatively independently of one another. Those threads can then be distributed across several CPU cores.

Rising power consumption and heat output were becoming even more of a problem in MacBook Pro models.

Apple Arm

Well before Apple had joined IBM and Motorola in the AIM alliance, it had co-founded the company based in Cambridge, England, that was to become ARM (for Acorn RISC Machines). Its RISC processor was used in Apple’s Newton MessagePad of 1993, and in 2010 Apple released its first iPad and the iPhone 4, both designed around its own A4 chip, with a single-core 32-bit Arm CPU running at a cool and economical 1 GHz.

From before macOS Mojave in 2018, Apple was preparing for its next migration, to its own integrated Systems on a Chip (SoC), starting with the M1 in 2020. The first iPhone to incorporate two CPU core types was the iPhone 7 of 2016, in its A10 Fusion SoC. Rather than simply adding more cores, Apple had adopted the Arm big.LITTLE architecture, in which background threads are run on slower, more efficient CPU cores, and higher priority user processes run on faster, more performant cores.

The first two families of M1 and M2 chips have cores grouped in clusters of no more than 4, but Apple increased cluster size to 6 in the M3 and M4. While the M1 family consists of two designs, one for the base variant, and the second for both Pro and Max, and doubled in the Ultra, M3 and M4 families have distinct designs for their Pro and Max variants. For the M4, this offers a full range from 8 to 16 cores in total, with an anticipated Ultra extending to 32. In addition to CPU cores with built-in vector processing (NEON), these chips incorporate specialist co-processors such as a neural engine and a proprietary matrix co-processor, AMX.

Performance (‘big’) cores have increased in maximum frequency, from 3.2 GHz in the M1 to 4.5 GHz in the M4 four years later.

Trends

The period 1984-2007 was dominated by increasing CPU frequency, as demonstrated in the two charts below.

maccpuhistoryfreqlin

This chart uses a conventional linear Y axis to demonstrate that frequency rose rapidly during the decade from 1997. As the form of this curve is S-shaped, the chart below shows the same data with a logarithmic Y axis.

maccpuhistoryfreqexp

Since about 2007, Macs haven’t seen substantial frequency increases. Many factors limit the maximum frequency that a processor can run at, including its physical dimensions, but among the most significant in practical terms are its power requirements and heat output, hence its need for cooling. Thus, the period 2005-2017 became dominated by increasing core count.

maccpuhistorycores

This chart shows how the number of processors and cores inside Macs didn’t start rising until around 2005, just as frequencies were topping out. Thus, many of the CPU performance improvements from 2007 onwards have been the result of providing more cores. But there’s a practical limit as to how many of those cores will get used, which is where processing more data becomes important, as it has from 1998 onwards.

It’s remarkable how much of Mac OS has survived if not flourished over those 40 years that our Macs have gone from a pedestrian Motorola 68000 processor to the 12 performance cores capable of 4.5 GHz in an M4 Max chip.

Inside M4 chips: CPU core performance

By: hoakley
20 November 2024 at 15:30

There’s no doubt that the CPUs in M4 chips outperform their predecessors. General-purpose benchmarks such as Geekbench demonstrate impressive rises in both single- and multi-core results, in my experience from 3,191 (M3 Pro) to 3,892 (M4 Pro), and 15,607 (M3 Pro) to 22,706 (M4 Pro). But the latter owes much to the increase in Performance (P) core count from 6 to 10. In this series I concentrate on much narrower concepts of performance in CPU cores, to provide deeper insight into topics such as core types and energy efficiency. This article examines the in-core performance of P and E cores, and how they differ.

P core frequencies have increased substantially since the M1. If we set that as 100%, M3 P cores run at around 112-126% of that frequency, and those in the M4 at 140%.

E cores are more complex, as they have at least two commonly used frequencies, that when running low Quality of Service (QoS) threads, and that when running high QoS threads that have spilt over from P cores. Low QoS threads are run at 77% of M1 frequency when on an M3, and 105% on an M4. High QoS threads are normally run at higher frequencies of 133% on the M3 E cores (relative to the M1 at 100%), but only 126% on the M4.

Methods

To measure in-core performance I use a GUI app wrapped around a series of loading tests designed to enable the CPU core to execute that code as fast as possible, and with as few extraneous influences as possible. Of the seven tests reported here, three are written in assembly code, and the others call optimised functions in Apple’s Accelerate library from a minimal Swift wrapper. These tests aren’t intended to be purposeful in any way, nor to represent anything that real-world code might run, but simply provide the core with the opportunity to demonstrate how fast it can be run at a given frequency.

The seven tests used here are:

  • 64-bit integer arithmetic, including a MADD instruction to multiply and add, a SUBS to subtract, an SDIV to divide, and an ADD;
  • 64-bit floating point arithmetic, including an FMADD instruction to multiply and add, and FSUB, FDIV and FADD for subtraction, division and addition;
  • 32-bit 4-lane dot-product vector arithmetic (NEON), including FMUL, two FADDP and a FADD instruction;
  • simd_float4 calculation of the dot-product using simd_dot in the Accelerate library;
  • vDSP_mmul, a function from the vDSP sub-library in Accelerate, multiplies two 16 x 16 32-bit floating point matrices, which in M1 and M3 chips appears to use the AMX co-processor;
  • SparseMultiply, a function from Accelerate’s Sparse Solvers, multiplies a sparse and a dense matrix, and may use the AMX co-processor in M1 and M3 chips;
  • BNNSMatMul matrix multiplication of 32-bit floating-point numbers, here in the Accelerate library, and since deprecated.

Source code of the loops is given in the Appendix.

The GUI app sets the number of loops to be performed, and the number of threads to be run. Each set of loops is then put into the same Grand Central Dispatch queue for execution, at a set Quality of Service (QoS). Timing of thread execution is performed using Mach Absolute Time, and the time for each thread to be executed is displayed at the end of the tests.

I normally run tests at either the minimum QoS of 9, or the maximum of 33. The former are constrained by macOS to be run only on E cores, while the latter are run preferentially on P cores, but may spill over to E cores when no P core is available. All tests are run with a minimum of other activities on that Mac, although it’s not unusual to see small amounts of background activity on the E cores during test runs.

The number of loops completed per second is calculated for two thread totals for each of the three execution contexts. Those are:

  • P cores alone, based on threads run at high QoS on 1 and 10 P cores;
  • E cores at high frequency (‘fast’), run at high QoS on 10 cores (no threads on E cores) and 14 (4 threads on E cores);
  • E cores at low frequency (‘slow’), run at low QoS on 1 and 4 E cores.

Results are then corrected by removing overhead estimated as the rate of running empty loops. Finally, each test is expressed as a percentage of the performance achieved by the P cores in an M1 chip. Thus, a loop rate double that achieved by running the same test on an M1 P core is given as 200%.
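
Expressed as code, that normalisation amounts to the following sketch, in which all names are mine: testRate and overheadRate are loop rates in loops per second for the test and for empty loops, and m1Rate is the corrected rate for the same test on an M1 P core:

func relativePerformance(testRate: Double, overheadRate: Double, m1Rate: Double) -> Double {
    // remove the empty-loop overhead, then express as a percentage of the M1 P core result
    return (testRate - overheadRate) / m1Rate * 100.0
}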

P core performance

As in subsequent sections, these are shown in the bar chart below, in which the pale blue bars are for M1 P cores, dark blue bars for M3 P cores, and red for M4 P cores.

m134coreperf1

As I indicated in my preview of in-core performance, there is little difference in integer performance between M3 and M4 P cores, but a significant increase in floating point, which matches that expected from the increased frequency in the M4.

Vector performance in the NEON and simd dot tests, and matrix multiplication in vDSP mmul rise higher than would be expected by frequency differences alone, to over 160% of M1 performance. The latter two tests are executed using Accelerate library calls, so there’s no guarantee that they are executed the same on different chips, but the first of those is assembly code using NEON instructions. SparseMultiply and BNNS matmul are also Accelerate functions whose execution may differ, and don’t fare quite as well on the M4.

E core slow performance

m134coreperf2

On E cores, threads run at low QoS are universally at frequencies close to idle, as reflected in their performance, still relative to an M1 P core at much higher frequency. Frequency differences account for the relatively poor performance of M3 E cores, and improvement in results for the M4. Those are disproportionate in vector and matrix tests, which could be accounted for by the M4 E core running those at higher frequencies than would be normal for low QoS threads.

Although the best of these, NEON, is still well below M1 P core performance (73%), this suggests a design decision to deliver faster vector processing on M4 E cores, which is interesting.

E core fast performance

m134coreperf3

Ideally, when high QoS threads overspill from P cores, it’s preferable that they’re executed as fast as they would have been on a P core. Those in the M1 fall far short of that, in scalar and vector tests only delivering 40-60% of a P core, although that seemed impressive at the time. The M3 does considerably better, with vector and one matrix test slightly exceeding the M1 P core, and the M4 is even faster in vector calculations, peaking at over 130% for NEON assembly code.

Far from being a cut-down version of its P core, the M4 E core can now deliver impressive vector performance when run up to maximum frequency.

M4 P and E comparison

Having considered how P and E cores have improved against those in the M1, it’s important to look at the range of computing capacity they provide in the M4. This is shown in the chart below, where pale blue bars are P cores, red bars E cores at high QoS and frequency, and dark blue bars E cores at low QoS and frequency. Again, these are all shown relative to P core performance in the M1.

m134coreperf4

Apart from the integer test, scalar floating point, vector and matrix calculations on P cores range between 140-175% those of the M1, a significant increase on those expected from frequency increase alone. Scalar and vector (but not matrix) calculations on E cores at high frequency are slower, although in most situations that shouldn’t be too noticeable. Performance does drop off for E cores at low frequency, though, and would clearly have an impact on code run there.

Given the range of operating frequencies, P and E cores in the M4 chip deliver a wide range of performance at different power levels, and it’s power that I’ll examine in the next article in this series.

Key information

  • M4 P core maximum frequency is 140% that of the M1. That increase in frequency accounts for much of the improved P core performance seen in M4 chips.
  • E core frequency changes are more complex, and some have reduced rather than risen compared with the M1.
  • P core floating point performance in the M4 has increased as would be expected by frequency change, and vector and matrix performance has increased more, to over 160% those of the M1.
  • E core performance at low QoS and frequency has improved in comparison to the M3, and most markedly in vector and matrix tests, suggesting design improvements in the latter.
  • E core performance at high QoS and frequency has also improved, again most prominently in vector tests.
  • Across their frequency ranges, M4 P and E cores now deliver a wide range of performance and power use.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM
Inside M4 chips: E and P cores

Appendix: Source code

_intmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
int_while_loop:
SUBS X4, X4, #1
B.EQ int_while_done
MADD X0, X1, X2, X3
SUBS X0, X0, X3
SDIV X1, X0, X2
ADD X1, X1, #1
B int_while_loop
int_while_done:
MOV X0, X1
LDR LR, [SP], #16
RET

_fpfmadd:
STR LR, [SP, #-16]!
MOV X4, X0
ADD X4, X4, #1
FMOV D4, D0
FMOV D5, D1
FMOV D6, D2
LDR D7, INC_DOUBLE
fp_while_loop:
SUBS X4, X4, #1
B.EQ fp_while_done
FMADD D0, D4, D5, D6
FSUB D0, D0, D6
FDIV D4, D0, D5
FADD D4, D4, D7
B fp_while_loop
fp_while_done:
FMOV D0, D4
LDR LR, [SP], #16
RET

_neondotprod:
STR LR, [SP, #-16]!
LDP Q2, Q3, [X0]
FADD V4.4S, V2.4S, V2.4S
MOV X4, X1
ADD X4, X4, #1
dp_while_loop:
SUBS X4, X4, #1
B.EQ dp_while_done
FMUL V1.4S, V2.4S, V3.4S
FADDP V0.4S, V1.4S, V1.4S
FADDP V0.4S, V0.4S, V0.4S
FADD V2.4S, V2.4S, V4.4S
B dp_while_loop
dp_while_done:
FMOV S0, S2
LDR LR, [SP], #16
RET

func runAccTest(theA: Float, theB: Float, theReps: Int) -> Float {
    var tempA: Float = theA
    var vA = simd_float4(theA, theA, theA, theA)
    let vB = simd_float4(theB, theB, theB, theB)
    let vC = vA + vA
    for _ in 1...theReps {
        tempA += simd_dot(vA, vB)
        vA = vA + vC
    }
    return tempA
}

16 x 16 32-bit floating point matrix multiplication

// Assumes Accelerate has been imported, and that theReps (the loop count) is set by the enclosing test harness.
var theCount: Float = 0.0
let A = [Float](repeating: 1.234, count: 256)
let IA: vDSP_Stride = 1
let B = [Float](repeating: 1.234, count: 256)
let IB: vDSP_Stride = 1
var C = [Float](repeating: 0.0, count: 256)
let IC: vDSP_Stride = 1
let M: vDSP_Length = 16
let N: vDSP_Length = 16
let P: vDSP_Length = 16
A.withUnsafeBufferPointer { Aptr in
    B.withUnsafeBufferPointer { Bptr in
        C.withUnsafeMutableBufferPointer { Cptr in
            for _ in 1...theReps {
                vDSP_mmul(Aptr.baseAddress!, IA, Bptr.baseAddress!, IB, Cptr.baseAddress!, IC, M, N, P)
                theCount += 1
            }
        }
    }
}
return theCount

Apple describes vDSP_mmul() as performing “an out-of-place multiplication of two matrices; single precision.” “This function multiplies an M-by-P matrix A by a P-by-N matrix B and stores the results in an M-by-N matrix C.”

Sparse matrix multiplication

// Assumes Accelerate has been imported, and that theReps is set by the enclosing test harness.
var theCount: Float = 0.0
let rowCount = Int32(4)
let columnCount = Int32(4)
let blockCount = 4
let blockSize = UInt8(1)
let rowIndices: [Int32] = [0, 3, 0, 3]
let columnIndices: [Int32] = [0, 0, 3, 3]
let data: [Float] = [1.0, 4.0, 13.0, 16.0]
let A = SparseConvertFromCoordinate(rowCount, columnCount, blockCount, blockSize, SparseAttributes_t(), rowIndices, columnIndices, data)
defer { SparseCleanup(A) }
var xValues: [Float] = [10.0, -1.0, -1.0, 10.0, 100.0, -1.0, -1.0, 100.0]
let yValues = [Float](unsafeUninitializedCapacity: xValues.count) { resultBuffer, count in
    xValues.withUnsafeMutableBufferPointer { denseMatrixPtr in
        let X = DenseMatrix_Float(rowCount: 4, columnCount: 2, columnStride: 4, attributes: SparseAttributes_t(), data: denseMatrixPtr.baseAddress!)
        let Y = DenseMatrix_Float(rowCount: 4, columnCount: 2, columnStride: 4, attributes: SparseAttributes_t(), data: resultBuffer.baseAddress!)
        for _ in 1...theReps {
            SparseMultiply(A, X, Y)
            theCount += 1
        }
    }
    count = xValues.count
}
return theCount

Apple describes SparseMultiply() as performing “the multiply operation Y = AX on a sparse matrix of single-precision, floating-point values.” “Use this function to multiply a sparse matrix by a dense matrix.”

BNNSMatMul matrix multiplication

// Assumes Accelerate has been imported, and that theReps is set by the enclosing test harness.
var theCount: Float = 0.0
let inputAValues: [Float] = [ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
                              1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
                              1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 ]
let inputBValues: [Float] = [
    1, 2,
    3, 4,
    1, 2,
    3, 4,
    1, 2,
    3, 4]
var inputADescriptor = BNNSNDArrayDescriptor.allocate(
    initializingFrom: inputAValues,
    shape: .imageCHW(3, 4, 2))
var inputBDescriptor = BNNSNDArrayDescriptor.allocate(
    initializingFrom: inputBValues,
    shape: .tensor3DFirstMajor(3, 2, 2))
var outputDescriptor = BNNSNDArrayDescriptor.allocateUninitialized(
    scalarType: Float.self,
    shape: .imageCHW(inputADescriptor.shape.size.0,
                     inputADescriptor.shape.size.1,
                     inputBDescriptor.shape.size.1))
for _ in 1...theReps {
    BNNSMatMul(false, false, 1,
               &inputADescriptor, &inputBDescriptor, &outputDescriptor, nil, nil)
    theCount += 1
}
inputADescriptor.deallocate()
inputBDescriptor.deallocate()
outputDescriptor.deallocate()
return theCount

Inside M4 chips: E and P cores

By: hoakley
18 November 2024 at 15:30

In the two previous articles (links at the end), I explored some of the features and properties of Performance (P) cores in Apple’s latest M4 chips. This article looks at their Efficiency (E) cores by comparison.

M4 family

In the three current M4 designs, there are only two variations in terms of E cores:

  • Base M4, with 6 E cores, except for a cheaper variant with only 4 active E cores.
  • M4 Pro and Max, with 4 E cores, including ‘binned’ variants.

Apple is expected to release an Ultra variant in 2025, with two M4 Max chips in tandem, providing a total of 8 E cores. Apart from the number of cores, all E cores are the same, and different from P cores.

E core architecture

All E cores are arranged in a single cluster of 4 or 6, sharing common L2 cache, and running at the same frequency (clock speed). Analysis of M1 cores implies that, where a P core has more than one of a given processing unit, the E core has roughly half that number, giving an M1 E core roughly half the compute capacity of a P core. I haven’t seen any comparable analysis of cores in later M families, although differences in power consumption imply there remain substantial differences in processing units and compute capacity.

Frequency

Like P cores, E cores can be set to run at any of 5 values between the minimum of 1,020 MHz and maximum of 2,592 MHz (1.0-2.6 GHz). When running macOS, cluster frequency is set by macOS at a kernel level; other operating systems may offer more direct control. This range of frequencies is significantly narrower than that of E cores in the M3, which range between 744-2,748 MHz.

E cores idle at 1,020 MHz, and although they can be shut down altogether, that’s exceptional given the steady demand for macOS background threads to be run on them. Nevertheless, powermetrics still reports their ‘down’ residencies separately from idle residencies.

Instruction set

This is believed to be identical to that of the M4’s P cores, ARMv9.2-A without Scalable Vector Extension (SVE), enabling the same threads to be run on either core type.

Single thread comparisons

One way to appreciate the contrasts between core types is to compare a single intensive in-core thread run in each. For this purpose, I used a tight loop of floating point calculations, running at two different Quality of Service (QoS) settings, in macOS 15.1.

Single thread at high QoS on P cores

m4singlePflopt1

This thread was initially loaded onto P13 (red) in the second (P1) cluster, and after 3.7 seconds was moved to P5 (blue) in the first (P0) cluster. After a further 4.6 seconds running on that, it was moved back to the second (P1) cluster, to run on P11 (purple). During this run, there was almost no other activity on the two P clusters, and the inactive cluster was therefore shut down while this thread was running on the other.

m4singlePflopt2

The active cluster was run at the maximum frequency of 4,511 MHz throughout. Just before the thread was moved to a different cluster, the destination cluster was brought up and run up to maximum frequency, ready to run the thread.

m4singlePflopt3

Total CPU power remained similar throughout the period the thread was being executed, but there is a small and consistent difference according to which cluster was active: the first (P0) brought power use of about 2,520 mW, 50 mW higher than the second (P1) at about 2,470 mW. This matches the difference reported previously, and merits assessment in other M4 Pro chips to determine whether this is a general feature.

Single thread at high QoS on E cores

There are two methods of running code such as the in-core floating point loop test used here on E cores: threads can be run at low QoS (Background), so that macOS allocates them only to E cores, or high QoS threads can spill over onto E cores when there are more of them than available P cores. On an M4 Pro chip, the latter requires 11 threads, one of which is then allocated to the E cluster, as described next and sketched below.
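
A sketch of that spill-over, under the same assumptions and using the same hypothetical floatLoop() as the earlier sketch: launch one more high QoS thread than there are P cores.

import Foundation

// On a 10 P-core M4 Pro, the eleventh high QoS thread spills over onto the E cluster.
for _ in 0..<11 {
    DispatchQueue.global(qos: .userInitiated).async {
        _ = floatLoop(2_000_000_000)   // floatLoop(_:) as defined in the earlier sketch
    }
}

dispatchMain()   // keeps the process alive while the threads run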

m4singlePonEflop1

This chart shows active residency on the four E cores with a single high QoS thread spilt onto them. While cores E1, E2 and E3 appear to handle other threads over this period of more than six seconds, core E0 appears to run at 90-100% active residency executing the spilt thread, which wasn't moved between cores at any point.

E cluster frequency remained constant throughout at its maximum of 2,592 MHz. CPU power use was inevitably dominated by the ten P cores running at 100% active residency and maximum frequency, remaining at just under 14,000 mW. Unfortunately, using powermetrics it’s not possible to estimate the power use of the E cluster directly.

Single thread at low QoS on E cores

This is very different from the spilt thread at high QoS.

m4singleEflop1

There’s no evidence here that any single core in the E cluster ran a thread at 100% active residency. Instead it appears to have been moved rapidly and freely around the cores, with many 0.1 second sampling intervals spanning its execution in more than one core over that period.

m4singleEflop2

Cluster frequency was steady at 1,050-1,060 MHz, just above the minimum of 1,020 MHz, with superimposed spikes where it rose briefly to the maximum of 2,592 MHz. This suggests that the single thread would most probably have run at close to the core minimum frequency, had there not been additional threads to run.

m4singleEflop3

A similar picture is seen in power use, with spikes above a low background of about 40-45 mW required by the single thread alone.

Single thread behaviours

These can be summarised as:

  • P core (high QoS) runs at 100% active residency on a single P core at maximum frequency, and is switched between clusters irregularly (about every 3.7-4.6 seconds). Total power use is about 2,500 mW.
  • High QoS spilt over to E cores runs at 90-100% active residency on a single E core at maximum frequency, and is either not switched between cores at all, or only infrequently.
  • E core (low QoS) runs at a total of about 100% active residency, moved frequently between all E cores in the cluster, at close to minimum frequency. Total power use is about 40-45 mW.

Performance, power and efficiency

Although I’ll be returning to more detailed comparisons of performance and power use between P and E cores, I provide a single illustration here, for the in-core floating point task used above.

Running 2 x 10^9 loops in each thread, P cores at maximum frequency take 9.2-9.7 seconds per thread, and use about 2,500 mW per thread. E cores running low QoS threads at close to minimum frequency take about four times as long, at 38.5 seconds, but use less than 45 mW per thread. Total energy used to complete one thread is therefore over 23 J when run on P cores, and about 1.7 J when run on E cores. E cores therefore use only about 7% of the energy that P cores do performing the same task.
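
Those figures follow simply from energy = power x time, taking the midpoint of 9.45 seconds for a P core thread:
P core: 2.5 W x 9.45 s ≈ 23.6 J per thread
E core: 0.045 W x 38.5 s ≈ 1.7 J per thread
and 1.7 ÷ 23.6 ≈ 0.07, hence about 7%.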

Key information

  • Current M4 chips feature 4-6 CPU E cores.
  • M4 E cores are arranged in a single cluster of 4 or 6, sharing L2 cache and running at a common frequency.
  • The E core cluster can be shut down (exceptionally), idle at its minimum frequency of 1,020 MHz, or run at one of 5 set frequencies up to a maximum of 2,592 MHz, as controlled by macOS.
  • Their instruction set is the same as M4 P cores, ARMv9.2-A without its Scalable Vector Extension (SVE).
  • They use 40-45 mW when at low frequencies, but it’s not currently feasible to measure directly their maximum power use at high frequencies.
  • macOS allocates threads to E cores when their QoS is 9 (Background), and when a thread with higher QoS can’t be allocated to a P core because they are all busy. Management of frequencies and core allocation differ between those two cases.
  • High QoS threads on E cores are run at maximum frequency and appear not to move between cores.
  • Low QoS threads on E cores are run at close to minimum frequency and are highly mobile between cores.
  • Low QoS threads running on E cores run more slowly than higher QoS threads running on P cores, but E core power use is much lower, resulting in considerable saving in total energy use for the same computational task.

Previous articles

Inside M4 chips: P cores
Inside M4 chips: P cores hosting a VM

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

M4 Macs can’t virtualise older macOS

By: hoakley
14 November 2024 at 15:30

If you’ve already got a new M4 Mac and tried to run a macOS virtual machine on it, then you might have been disappointed. It seems that M4 chips can’t virtualise any version of macOS before 13.4 Ventura. So before you trade in or pass on your M1, M2 or M3 Mac, if you need access to older VMs, you might like to check whether this affects you.

I’m indebted to Csaba Fitzl for drawing my attention to this problem, and for reporting it to Apple in Feedback FB15774587. It has also been reported as affecting UTM, and I believe affects all other macOS virtualisation software for Apple silicon.

The bug

Running a macOS VM for any version before 13.4 Ventura on an M4 Mac results in a black screen, and the VM fails to boot. This is true whatever settings are used in the virtualiser, even if it's set to boot the VM in Recovery mode. It's also true when the VM has been built on that Mac: although building appears to complete successfully, when first run the VM opens as a black screen and never proceeds to personalisation and setup.

Currently the only way to run a VM with macOS prior to 13.4 Ventura is to do so on a host with an M1, M2 or M3 chip.

Can this be fixed?

Unfortunately, as this bug prevents the VM from booting, there’s no reliable way to access its log to discover what’s going wrong there. There’s no sign of the failure in the host’s log either: the host appears to initialise its Virtio and other support normally, without errors or faults. After those, virtualisation processes on the host fall silent as they wait for the VM to start, which never happens.

There is a useful clue in Activity Monitor, though: in its CPU pane, despite the VM being allocated multiple virtual cores, only one is seen to be active on the host. That implies the failure is occurring before the VM kernel boots the other cores, an event that occurs early during the kernel boot phase. Until that point, pre-boot phases and the kernel run on just a single CPU core.

macOS 13.4 updated iBoot to version 8422.121.1, so at first sight the VM could be failing when running older firmware. That doesn’t appear likely, as version 8422.121.1 was also installed in the 12.6.6 security update, so 12.7.x shouldn’t suffer this problem, but it does.

It thus appears most likely that this bug strikes in the early part of kernel boot, in which case the most feasible solution would be to fix the bug in macOS kernels prior to 13.4, and promulgate new IPSW image files for those. I suspect that’s very unlikely to happen, and as far as I’m aware it would be the first time that Apple has issued revised IPSWs.

Which macOS can you virtualise?

Support for lightweight virtualisation of macOS on Apple silicon Macs was still in progress in the first version of macOS to run on M1 chips, Big Sur. You therefore cannot create or run Big Sur VMs.

macos12

The first version that can run in VMs is macOS 12 Monterey, although prior to 12.4 it can sometimes be a bit fractious. Monterey VMs also have major limitations, such as not supporting shared folders with the host.

macos121

This is 12.1 with its old System Preferences, running happily on an M3 Pro.

macos13

macOS Ventura should run well on M1, M2 and M3 hosts, and 13.4 and later on M4 hosts too.

macos14

macOS Sonoma should run even better on all current Apple silicon Macs, and delivers a much improved display with autoscaling. However, 14.2 and 14.2.1 don’t automatically support shared folders because of a bug that was fixed in 14.3.

macos15

macOS Sequoia is fully compatible, and adds support for iCloud Drive and some other Apple Account features, although VMs still won't run App Store apps. It may also fail to install the extras required to support Apple Intelligence.

Summary

  • Currently, M4 Macs can only run VMs of macOS Ventura 13.4 and later, 14 Sonoma, and 15 Sequoia.
  • M1, M2 and M3 Macs can run VMs of macOS Monterey 12.0.1 and later.
  • macOS Big Sur 11 can’t be virtualised on Apple silicon.

Further information

Viable and virtualisation
Why macOS 14.2 & 14.2.1 VMs lose shared folders, and how to work around it

Postscript

I’m grateful for the suggestion that it might be possible to work around this problem by running the older VM on a single core. Although I did manage to do that on an M3 (the first time I have seen recent macOS running on just one core!), that fails just the same on an M4, I’m afraid.

Inside M4 chips: P cores hosting a VM

By: hoakley
13 November 2024 at 15:30

One common but atypical situation for any M-series chip is running a macOS virtual machine. This article explores how virtual CPU cores are handled on physical cores of an M4 Pro host, and provides further insight into their management, and thread mobility across P core clusters.

Unless otherwise stated, all results here are obtained from a macOS 15.1 Sequoia VM in my free virtualiser Viable, with that VM allocated 5 virtual cores and 16 GB of memory, on a Mac mini M4 Pro running macOS Sequoia 15.1 with 48 GB of memory, 10 P and 4 E cores.

How virtual cores are allocated

All virtualised threads are treated by the host as if they're running at high Quality of Service (QoS), so are preferentially allocated to P cores, even when the thread inside the guest is running at the lowest QoS. This has the side-effect of running virtual background processes considerably quicker than real background threads on the host.

In this case, the VM was given 5 virtual cores so they could all run in a single P cluster on the host. That doesn’t assign 5 physical cores to the VM, but runs VM threads up to a total of 500% active residency across all the P cores in the host. If the VM is assigned more virtual cores than are available in the host’s P cores, then some of its threads will spill over and be run on host E cores, but at the high frequency typical of host threads with high QoS.
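
For illustration, here's how that allocation is expressed in Apple's Virtualization framework, on which lightweight virtualisers like Viable are built. This is a minimal sketch with everything but the core and memory settings omitted:

import Virtualization

let config = VZVirtualMachineConfiguration()
config.cpuCount = 5                              // virtual cores, up to 500% active residency on the host
config.memorySize = 16 * 1024 * 1024 * 1024      // 16 GB
// platform, boot loader, storage and other devices omitted for brevity
try config.validate()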

Performance

There is a slight performance hit in virtualisation, but it's surprisingly small compared to that of other virtualisers. Geekbench 6.3.0 benchmarks for guest and host were:

  • CPU single core VM 3,643, host 3,892
  • CPU multi-core VM 12,454, host 22,706
  • GPU Metal VM 102,282, host 110,960, with the VM's GPU appearing as an Apple Paravirtual device.

Some tests are even closer: using my in-core floating point test, 1,000 Mloops run in the VM in 4.7 seconds, and in the host in 4.68 seconds.

Host core allocation

To assess P core allocation on the host, an in-core floating point test was run in the VM. This consisted of 5 threads with sufficient loops to fully occupy the virtual cores for about 20 seconds. In the following charts, I show results from just the first 15 seconds, as representative of the whole.

vmpcoresm4pro0

When viewed by cluster, those threads were mainly loaded first onto the first P cluster (pale blue bars), where they ran for just over 1 second before being moved to the second cluster (red bars). They were then switched regularly between the two P clusters throughout the test. Four cycles were completed in this section of the results, each taking 2.825 seconds, so threads were switched between clusters every 1.4 seconds, the same interval as I found when running threads on the host alone, as reported previously.

For most of the 15 seconds shown here, total active residency across both P clusters was pegged at 500%, as allocated to the VM in its 5 virtual cores, with small bursts exceeding that. Thus that 500% represents those virtual cores, and the small bursts are threads from the host. Although the great majority of that 500% was run on the active P cluster, a total of about 30% active residency consisted of other threads from the VM, and ran on the less active P cluster. That probably represents the VM’s macOS background processes and overhead from its folder sharing, networking, and other Virtio device use.

vmpcoresm4pro1

When broken down to individual cores within each cluster, shown in the chart above for the first cluster and that below for the second, total activity differs little across the cores in the active cluster. During its period as the active cluster, each of its cores had an active residency of 80-100%, bringing the cluster total to about 450% when most active.

vmpcoresm4pro2

In case you’re wondering whether this occurs on older Apple silicon, and it’s just a feature of macOS Sequoia, here’s a similar example of a 4-core VM running 3 floating point threads in an M1 Max with macOS 15.1, seen in Activity Monitor’s CPU History window. There’s no movement of threads between clusters.

vmm1maxtest1

P core frequencies

powermetrics, used to obtain this data, provides two types of core frequency information: for each cluster it gives a hardware active frequency, and for each core an individual frequency, which often differs between cores in the same cluster. Cores in the active P cluster were typically reported as running at a frequency of 4,512 MHz, although the cluster frequency was lower, at about 3,858 MHz. For simplicity, cluster frequencies are used here.

vmpcoresm4pro3

This chart shows reported frequencies for the two P clusters in the upper lines. Below them are total cluster active residencies to show which cluster was active during each period.

The active cluster had a steady frequency of just below 3,900 MHz, but when it became the less active one, its frequency varied greatly, from idle at 1,260 MHz up to almost 4,400 MHz, often for very brief periods. This is consistent with the active cluster running the intensive in-core test threads, and the other cluster handling other threads from both the VM and host.

CPU power

Several who have run VMs on notebooks report that they appear to drain the battery quickly. Using the previous results from the host, the floating point test used here would be expected to use a steady 7,000 mW.

vmpcoresm4pro4

This last chart shows the total CPU power use in mW over the same period, again with cluster active residency (here multiplied by 10), added to aid recognition of cluster cycles. This appears to average about 7,500 mW, only 500 mW more than expected when run on the host alone. That shouldn’t result in a noticeable increase in power usage in a notebook.

In the previous article, I remarked on how power use appeared to differ between the two clusters, and that is also reflected in these results. When the second cluster (P1) is active, power use is lower, at about 7,100 mW, and it's higher, at about 7,700 mW, when the first cluster (P0) is active. This needs to be confirmed on other M4 Pro chips before it can be interpreted.

Key information

  • macOS guests perform almost as well as the M4 Pro host, although multi-core benchmarks are proportionate to the number of virtual cores allocated to them. In particular, Metal GPU performance is excellent.
  • All threads in a VM are run as if at high QoS, thus preferentially on host P cores. This accelerates low QoS background threads running in the VM.
  • Virtual core allocation includes all overhead from the VM, such as its macOS background threads.
  • Guest threads are as mobile as those of the host, and are moved between P clusters every 1.4 seconds.
  • Although threads run in a VM incur a small penalty in additional power use, this shouldn’t be significant for most purposes.
  • Once again, evidence suggests that the first P cluster (P0) in an M4 Pro uses slightly more power than the second (P1). This needs to be confirmed in other systems.
  • powermetrics can’t be used in a VM, not unsurprisingly.

Previous article

Inside M4 chips: P cores

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Inside M4 chips: P cores

By: hoakley
11 November 2024 at 15:30

This is the first in a series diving deeper into Apple's new M4 family of chips, starting with details of their Performance (P) cores. Comparisons of their performance against cores in earlier M-series chips will follow separately when I have completed them.

M4 family

There are currently three M4 designs:

  • Base M4, with 4 P and 6 E cores, also available in a cheaper variant with only 4 active E cores, and a ‘binned’ variant for iPads with only 3 active P cores.
  • M4 Pro, with 10 P and 4 E cores, also available in a ‘binned’ variant with only 8 active P cores.
  • M4 Max, with 12 P and 4 E cores, also available in a ‘binned’ variant with only 10 active P cores.

Apple is expected to release an Ultra variant in 2025, consisting of two M4 Max chips connected and working in tandem, providing a total of 24 P and 8 E cores.

Apart from the number of cores in each design, their caches and memory, all P cores are the same, and different from E cores.

P core architecture

All CPU cores are arranged in clusters of up to 6. All cores within any given cluster share L2 cache, and are run at the same frequency (clock speed). The Base M4 has a single cluster of 4 P cores, while the Pro and Max have two clusters of 5 and 6 cores respectively.

Frequency

A prominent feature of both P and E cores is their variable frequency (clock speed). In the case of P cores, this can be set to any of 17 values between the minimum of 1,260 MHz and maximum of 4,512 MHz (1.3-4.5 GHz). When running macOS, cluster frequencies are set by macOS at a kernel level; other operating systems may offer more direct control.

P cores idle at 1,260 MHz, but can also be shut down altogether. Previous M-series chips have been reported by the powermetrics command tool as sometimes being idle at a frequency of 0 MHz, but the M4 is the first to have idle and down states reported separately, for example:
CPU 4 active residency: 0.00%
CPU 4 idle residency: 0.00%
CPU 4 down residency: 100.00%

when that core and its whole cluster are shut down rather than just idling. It’s not clear whether this is merely an administrative change, or M4 cores implement this state differently from previous cores.
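
The excerpt above comes from the CPU sampler in powermetrics. A typical invocation for gathering figures like these, run as root, is
sudo powermetrics --samplers cpu_power -i 100 -n 50
which takes 50 samples at 0.1 second intervals, reporting residencies and frequencies for each core and cluster.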

Instruction set

There’s confusion over the Instruction Set Architecture (ISA) supported by M4 cores. This is explained in the LLVM source, where it’s claimed that they’re “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE). Some might consider that’s closer to ARMv8.7-A, one version more recent than the M3’s ARMv8.6-A.

Although this is now fully supported in LLVM clang, it’s not clear how fully it’s supported by Xcode, for example.

Power

When shut down, a P core consumes no power, of course, and at idle with no active residency, it uses only 1-2 mW, according to measurements reported by powermetrics.

Maximum power consumption rises to approximately 1,400 mW when running intensive floating point calculations at 100% active residency, and to approximately 3,230 mW when running NEON vector instructions at 100% active residency.

macOS core allocation

Threads are normally allocated by macOS to an available P core when their designated Quality of Service (QoS) is higher than 9 (Background), for example when using Dispatch, formerly branded Grand Central Dispatch (GCD). Running threads may also be moved periodically between P cores in the same cluster, and between clusters. Previous M-series chips appear to move threads less frequently, often leaving them running on the same core for several seconds or until completion, but threads appear to be considerably more mobile when running on M4 P cores.

VM4core4threadPcoresARes

This bar chart shows 4 threads from 4 virtual CPUs in a VM running for 3 seconds at 100% active residency. For almost all that period, the threads remain running on the 4 physical cores of the first P cluster in this M1 Max, with the second P cluster remaining idle for much of that time.

The following charts show 4 threads of intensive in-core floating point arithmetic running on the P cores of an M4 Pro.

m4threads1clusters

When viewed by cluster, those threads are loaded first onto the second P cluster (red bars), where they run for 0.4 seconds before being moved to the first cluster (pale blue bars). After running there for 1.3 seconds, they’re moved back to the second cluster for a further 1.3 seconds, before completing on the first cluster.

The next two bar charts show each cluster separately, illustrating thread mobility within them.

m4threads2cluster1

When running on the first cluster (above), threads appear to be moved to a different core approximately every 0.3 second, as they do when on the second cluster (below).

m4threads3cluster2

m4threads4frequency

Cluster frequency matches this movement, with each cluster being run up to maximum frequency (shown here averaged across the whole cluster) to process the threads running on its cores. The black line below those for the P clusters shows the small changes in average frequency for the E cluster over this period.

m4threads5power

This last chart shows the total CPU power use in mW over the same period. Of particular interest here is the consistent difference in power use reported by powermetrics between the two P clusters: the first (P0) used a steady 6,000 mW when running these four threads, whereas the second (P1) used slightly less, at 5,700-5,800 mW. That could be the result of measurement error in powermetrics, peculiar to this particular chip, or could reflect an underlying difference between the two clusters.

Thread mobility makes interpreting CPU History in Activity Monitor difficult, as the fastest frequency of sampling available there is every second, while powermetrics was sampling every 0.1 second when gathering the data above. As groups of threads may be moved between clusters every 1.3 seconds or so, this can give the impression that threads are being run on both clusters simultaneously. Once again, great care is needed when interpreting the data shown by Activity Monitor.

Key information

  • Current M4 chips offer 4-12 CPU P cores.
  • M4 P cores are arranged in clusters of up to 6, sharing L2 cache and running at a common frequency.
  • P core clusters can be shut down, idle at their minimum frequency of 1,260 MHz, or run at one of 17 set frequencies up to a maximum of 4,512 MHz, as controlled by macOS.
  • Their instruction set is “technically” ARMv9.2-A, but without its Scalable Vector Extension (SVE).
  • They use 1-2 mW when idle, rising to peaks of 1,400 mW (floating point) or 3,230 mW (NEON vector code).
  • macOS preferentially allocates them threads at all QoS higher than 9 (Background).
  • Threads running on M4 P cores are mobile, and may be moved to another core in the same cluster frequently, and after just over a second may be transferred to a core in the other P cluster, when available.
  • Thread mobility makes interpretation of the CPU History window in Activity Monitor very difficult.

Explainer

Residency is the percentage of time a core is in a specific state. Idle residency is thus the percentage of time that core is idle and not processing instructions. Active residency is the percentage of time it isn’t idle, but is actively processing instructions. Down residency is the percentage of time the core is shut down. All these are independent of the core’s frequency or clock speed.

Last Week on My Mac: Mac mini M4 Pro first impressions, cores and more

By: hoakley
10 November 2024 at 16:00

My Mac mini M4 Pro arrived four days early, and swiftly exceeded all expectations. Everyone who has seen this new design remarks on how tiny it is, and I still keep looking at it, wondering how the smallest Mac I have ever owned is also by far the fastest. For those who collect numbers, it matched the growing collection in Geekbench’s database, with a single-core score of 3,892, a multi-core of 22,706, and Metal GPU of 110,960. Of course those new MacBook Pros with M4 Max chips are doing better with their extra two CPU cores and twice the number of GPU cores, but second best is still far superior to anything I’ve run before.

Not only is this my fastest Mac ever, but it was also the quickest and simplest to commission. It replaces my Mac Studio M1 Max, and now sits under my Studio Display using the Studio’s old keyboard and trackpad. As I’ll explain in a future article, I opted to migrate to the mini during initial setup using the Studio’s backup SSD, a sleek OWC Envoy Pro SX connected by Thunderbolt 3. Including the inevitable macOS update, the whole process took less than two hours, from the start of unboxing to the first full Time Machine backup.

With 48 GB memory and an internal SSD of 2 TB, I was keen to benchmark that once its initial backup was done. Although I have reduced confidence in these figures, there’s no doubt that this SSD is significantly quicker than the 2 TB in the Studio; how much quicker is open to debate. AmorphousDiskMark reported a write speed of 7.7 GB/s, while my own Stibium gave 10.3 GB/s across a broad range of file sizes. There was closer agreement on read speed, at around 6.8 GB/s. Both of those are comfortably above what I expect when I can finally get hold of an external Thunderbolt 5 SSD.

CPU cores

My M4 Pro is the full-performance variant, with 10 P and 4 E cores, set in three clusters: an E cluster with all four E cores, and two clusters of five P cores. Since its M3 chips, Apple has increased the maximum number of cores in a cluster from 4 (in M1 and M2 chips) to 6 (M3 and M4).

M4 P cores run at frequencies between 1,260 and 4,512 MHz (1.3-4.5 GHz), their maximum frequency being 111% that of the M3 P core, and 140% that of the M1 P core. M4 E cores have a narrower range of frequencies than those in the M3, between 1,020-2,592 MHz (1.0-2.6 GHz), running at 137% of M3 frequency for low QoS threads, and 94% of M3 maximum frequency for high QoS threads spilt over from P cores. Those are going to make comparisons between E core performance very interesting indeed.

Single-thread and single-core comparisons of P core throughput are inevitably in the M4’s favour, with significantly better performance across integer, floating point and vector performance, as shown in the chart below.

M4M3multiTests

The Y axis here gives throughput in loops per second for my four basic in-core performance tests: a tight assembly code integer math loop, another tight assembly code loop of floating point math, NEON vector processor assembly code, and a tight loop calling an Accelerate routine run in the NEON unit. Pale blue bars are results for the M1, purple for the M3, and red for the shiny new M4.

Although improvements in integer and floating point performance might appear small here, these results are for single threads. When scaled up across more P cores, the differences are magnified, as shown in the regressions below.

M4vM3floatTimevThreads

This plots the total times taken to execute multiple threads, each consisting of 10^9 (one thousand million, or one billion) loops of floating point assembly code, against the number of threads, which is the same as the number of cores used, as each core runs one thread. According to the fitted linear regression equations, each thread/core of 10^9 loops takes the same period: 6.04 seconds for the M3 P cores, and 4.69 for the M4. Those equate to 1.66 x 10^8 loops/second per thread for the M3, and 2.13 x 10^8 for the M4. On that basis, the M4 performs at 128% of the M3, significantly better than expected from the 111% increase in maximum core frequency. That requires the M4 to process faster than the M3 at the same frequency, a frequency-independent improvement.

Core allocation

One striking difference between M4 CPU cores and those of previous Apple silicon chips is their pattern of core allocation, something you may notice even in Activity Monitor, for all its faults. This is best illustrated in its CPU History window.

For these two series of tests, I waited until the Mac was idling quietly, with little happening on its CPU cores. I then ran, in rapid succession, a series of my in-core floating point tests of 10^9 loops at high QoS, starting with a single thread, then two, and so on up to six or eight threads. This results in a pattern distinct from anything you’ll see in Intel cores, as I have shown here since the early days of the M1.

m3profloptcompo

This is the series seen on my MacBook Pro M3 Pro running macOS 15.2 beta, and is typical of M1, M2 and M3 chips. Although the cores are out of order here, these are the six cores in the chip’s single P cluster. At the left is the obvious peak from 1 thread running on Core 9, then the 2 threads of the next test appear on cores 9 and 12. Three fully occupy 7, 9 and 12, with a little spilt onto 8, and so on until all six are fully occupied with 6 threads.

Activity Monitor is a little too crude to see how distinct this is, for which you have to resort to shorter sampling periods of 0.1 second, and the finer detail of active residency reported by powermetrics. Those confirm that, much of the time, these heavy in-core compute loads run in a single thread on a single core, and if they are moved around, it’s relatively infrequently. E cores are different, though.

Compare that with a similar sequence, this time for 1-8 threads, on the M4 Pro. Its ten P cores are divided into two clusters, which I’ve separated here with the blue line, although within each cluster the cores aren’t in numeric order.

m4profloptcompo

Reading again from the left, a single thread is run in cores 6 and 8, with a little in 14 in the second cluster. Two threads are run in all five cores of the second cluster, with a little at the end from those of the first cluster. Three threads are similar, with significant contributions from all ten cores, and from then on they are similar.

So, in the M4 P cores threads are no longer allocated to single cores, and it appears from the CPU History window that they may even be allocated across more than one cluster. In fact, when the detailed active residency and frequency data from powermetrics are used, while threads are often moved between two cores in the same cluster, macOS will normally avoid unnecessarily running threads on other clusters, although it will happily move all threads from one cluster to another.

The rationale behind running all threads within the same cluster is clear, as CPU core frequencies of all cores in the same cluster are the same. Constraining threads to a single cluster thus allows others to idle and use less energy. While that continues in M4 chips, it isn’t clear why threads are moved between cores so frequently, or moved together to another cluster, and whether that brings improvements in performance.

Why % CPU in Activity Monitor isn’t what you think

By: hoakley
8 November 2024 at 15:30

One of the most frequently quoted measurements in Activity Monitor is % CPU. When the fans blow, or a Mac becomes sluggish, it’s often the first figure to look at and wonder why that process is pulling over 100%. That’s the first thing you learn: here, per cent doesn’t mean out of 100, but the sum of the real percentage for each core. As this Mac has eight cores, that allows it up to 800%, until you realise that Intel processors can use Hyper-Threading to pretend that each is two cores, making a maximum of 1,600%, not bad for a single processor. Then you come completely unstuck with Apple silicon.

Apple defines % CPU as “the percentage of CPU capability that’s being used”, a phrase that doesn’t appear to have any definition either in Apple’s documentation or elsewhere. So like everyone else, you assume that 100% for a core represents its maximum processing capacity. Only you couldn’t be more incorrect: in truth, the number given as % CPU is nothing like that.

Tests

To examine this more clearly, I have an app that runs tight loops of assembly code to load CPU cores on Apple silicon chips and measure their performance. For this, I got it to run several threads, each consisting of 500 million loops of floating point calculations. By running those at different Quality of Service (QoS) settings, macOS will send them to different core types.

When I run two threads at low QoS so they’re run only on E cores of a MacBook Pro M3 Pro, each takes 10.2 seconds to complete. If I instead run eight threads at high QoS, the first six of those will be run on the P cores, but the remaining two spill over onto two of the E cores, where they complete in just 3.0 seconds each.

Yet, according to % CPU in Activity Monitor, the two threads run on the E cores in the first test come to 200%, and the two spilt onto E cores in the second test also come to 200%. You can see this in the CPU History window.

m3activitymon

Here are the 6 E cores at the top, and 6 P cores below. ① marks the first test on two E cores, and shows the total of 200% spread across all six E cores over the 10.2 seconds it took to complete. ② marks the second test, with all 6 P cores and two E cores hitting 100% each, for a total of 800%, as shown in Activity Monitor’s window.

By now I’m sure you guessing how running the same code on the E cores could vary in speed by a factor of 3.4 (10.2 against 3.0 seconds) although they’re the same % CPU: that measurement doesn’t take into account core frequency.

Frequency

Unlike traditional Intel CPUs, CPU cores in Apple silicon chips can be run at a wide range of frequencies, as set by macOS. To make this frequency control a bit simpler, cores are grouped into clusters, which also share L2 cache, and within each cluster all cores run at the same frequency. The M3 Pro chip consists of two clusters, one of 6 E cores, the other of 6 P cores.

When macOS loads two of those E cores with low QoS threads, it sets them to run at a low frequency to make the most of their energy efficiency. When it loads two E cores with threads that were intended to be run on P cores, as they have a high QoS, it runs the E cluster up to high frequency, so they perform better.

Measuring the frequency of CPU cores is straightforward using the powermetrics command tool. When those threads were being run on E cores at low QoS, frequency was held at their minimum of 744 MHz, but run at high QoS their frequency was set to the maximum of 2,748 MHz, 3.7 times faster. If full load at maximum frequency is 100% of their capability, then at low QoS they were really running at 27%, not the 100% given by Activity Monitor, and that largely accounts for the difference in the time those threads took to complete their floating point calculations.
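
That 27% is simply the ratio of the two frequencies: 744 ÷ 2,748 ≈ 0.27, so a core shown at 100% was delivering only about a quarter of its maximum processing capacity.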

What is % CPU?

What Activity Monitor actually shows as % CPU, or “percentage of CPU capability that’s being used”, is better known as the active residency of each core: the percentage of processor cycles that aren’t idle, but are actively processing threads owned by a given process. It doesn’t take into account the frequency or clock speed of the core at that time, nor the difference in throughput between P and E cores.

The next time that you open Activity Monitor on an Apple silicon Mac, to look at % CPU figures, bear in mind that while those numbers aren’t completely meaningless, they don’t mean what you might think, and what’s shown as 100% could be anything between 27-100%.

Which M4 chip and model?

By: hoakley
7 November 2024 at 15:30

In the light of recent news, you might now be wondering whether you can afford to wait until next year in the hope that Apple then releases the M4 Mac of your dreams. To help guide you in your decision-making, this article explains what chip options are available in this month’s new M4 models, and how to choose between them.

CPU core types

Intel CPUs in modern Macs have several cores, all of them identical. Whether your Mac is running a background task like indexing for Spotlight, or running code for a time-critical user task, code is run across any of the available cores. In an Apple silicon chip like those in the M4 family, background tasks are normally constrained to efficiency (E) cores, leaving the performance (P) cores for your apps and other pressing user tasks. This brings significant energy economy for background tasks, and keeps your Mac more responsive to your demands.

Some tasks are normally constrained to run only on E cores. These include scheduled background tasks like Spotlight indexing, Time Machine backups, and some encoding of media. Game Mode is perhaps a more surprising E core user, as explained below.

Most user tasks are run preferentially on P cores, when they’re available. When there are more high-priority threads to be run than there are available P cores, macOS will normally send the excess to be run on E cores instead. This also applies to a Virtual Machine (VM) using lightweight virtualisation: its threads are preferentially scheduled on P cores when they’re available, even when the code being run in the VM would normally be allocated to E cores.

macOS also controls the clock speed or frequency of cores. For background tasks running on E cores, their frequency is normally held relatively low, for best energy efficiency. When high-priority threads overspill onto E cores, they’re normally run at higher frequency, which is less energy-efficient but brings their performance closer to that of a P core. macOS goes to great lengths to schedule threads and control core frequencies to strike the best balance between energy efficiency and performance.

Unfortunately, it’s normally hard to see effects of frequency in apps like Activity Monitor. Its CPU % figures only show the percentage of cycles that are used for processing, and make no allowance for core frequency. It will therefore show a background thread running at low frequency but 100%, the same as a thread overspilt from P cores running at the maximum frequency of that E core. So when you see Spotlight indexing apparently taking 200% of CPU % on your Mac’s E cores, that might only be a small fraction of their maximum capacity if they were running at maximum frequency.

There are no differences between chips in the M4 family when it comes to each type of CPU core: each P core in a Base variant is the same as each in an M4 Pro or Max, with the same maximum frequency, and the same applies to E cores. macOS also allocates threads to different types of core using the same rules, and their frequencies are controlled the same as well. What differs between them is the number of each type of core, ranging from 4 P and 4 E in the 8-core variant of the Base M4, up to 12 P and 4 E in the 16-core variant of the M4 Max. Thus, their single-core benchmark results should be almost identical, although their multi-core results should vary according to the number of cores.

Game Mode

This mode is an exception to normal CPU and GPU core use, as it:

  • gives preferential access to the E cores,
  • gives highest priority access to the GPU,
  • uses low-latency Bluetooth modes for input controllers and audio output.

However, my previous testing didn’t demonstrate that apps running in Game Mode were given exclusive access to E cores. But for gamers, it now appears that the more E cores, the better.

GPU cores

These are also used for tasks other than graphics, such as some of the more demanding calculations required for Machine Learning and AI. However, experience so far with Writing Tools in Sequoia 15.1 is that macOS currently offloads their heavy lifting to be run off-device in one of Apple’s dedicated servers. Although having plenty of GPU cores might well be valuable for non-graphics purposes in the future, for now there seems little advantage for many.

Thunderbolt 5

M4 Pro and Max variants, but not the Base, come equipped with Thunderbolt ports that support not only Thunderbolt 3 and 4, but also Thunderbolt 5 and USB4. Thunderbolt 5 should effectively double the speed of connected TB5 SSDs, but to see that benefit you’ll need to buy a TB5 SSD. Not only are they more expensive than TB3/4 models, but at present I know of only one range that’s due to ship this year. There will also be other peripherals with TB5 support, including at least one dock and one hub, although neither is available yet. The only TB5 accessories already available are cables, and even they are expensive.

TB5 also brings increased video bandwidth and support for DisplayPort 2.1, although even the M4 Max can’t make full use of that. If you’re looking to drive a combination of high-res displays, consult Apple’s Tech Specs carefully, as they’re complicated.

Although TB5 will become increasingly important over the next few years, TB3/4 and USB4 are far from dead yet and are supported by all M4 models.

Which M4 chip?

The table below summarises key figures for each of the variants in the M4 family that have now been released. It’s likely that next year Apple will release an Ultra, consisting of two M4 Max chips joined in tandem, in case you feel the burning desire for 24 P and 8 E cores.

m4configs2

Models available next week featuring each M4 chip are shown with green rectangles at the right.

There are two variants of the Base M4, one with 4P + 4E and 8 GPU cores, the same as Base variants in the M1 to M3 families. There’s also a more capable variant, for the first time with 4P + 6E, which promises to be a better all-rounder, particularly in Game Mode. It also has an extra couple of GPU cores.

The M4 Pro also comes in two variants, this time differing in the number of P cores, 8 or 10, and GPU cores, 16 or 20. Those overlap with the M4 Max, with 10 or 12 P cores and 32 or 40 GPU cores. Thus the gap between M4 Pro and Max isn’t as great as in the M3, with the GPUs in the M4 Max being aimed more at those working with high-res video, for instance. For more general use, there’s little difference between the 14-core Pro and Max.

Memory and storage

Chips in the M4 family also determine the maximum memory and internal SSD capacity. Apple has at last eliminated base models with only 8 GB of memory, and all now start with at least 16 GB. Base M4 chips are limited to a maximum of 32 GB, while the M4 Pro can go up to 64 GB, and the 16-core Max up to 128 GB, although in its 14-core variant, the Max is only available with 36 GB (I’m very grateful to Thomas for pointing this out below).

Unfortunately, Apple hasn’t increased the minimum size of internal SSD, which remains at 256 GB for some base models. Smaller SSDs may be cheaper, but they are also likely to have shorter lives, as under heavy use their small number of blocks will be erased for reuse more frequently. That may shorten their life expectancy to much less than the normal period of up to 10 years, as was seen in some of the first M1 models. This is more likely to occur when swap space is regularly used for virtual memory. I for one would have preferred 512 GB as a starting point.

While Base M4 chips come with SSDs up to 2 TB in size, both Pro and Max can be supplied with internal SSDs of up to 8 TB.

I hope this proves useful in guiding your decision.

Last Week on My Mac: M4 incoming

By: hoakley
3 November 2024 at 16:00

Almost exactly a year after it released its first Macs featuring chips in the M3 family, Apple has replaced those with the first M4 models. Benchmarkers and core-counters are now busy trying to understand how these will change our Macs over the coming year or so. Before I reveal which model I have ordered, I’ll try to explain how these change the Mac landscape, concentrating primarily on CPU performance.

CPU cores

CPUs in the first two families, M1 and M2, came in two main designs, a Base variant with 4 Performance and 4 Efficiency cores, and a Pro/Max with 8 P and 2 or 4 E cores, that was doubled-up to make the Ultra something of a beast with its 16 P and 4 or 8 E cores. Last year Apple introduced three designs: the M3 Base has the same 4 P and 4 E CPU core configuration as in the M1 and M2 before it, but its Pro and Max variants are more distinct, with 6 P and 6 E in the Pro, and 10-12 P and 4 E cores in the Max. The M4 family changes this again, improving the Base and bringing the Pro and Max variants closer again.

As these are complicated by sub-variants and binned versions, I have brought the details together in a table.

mcorestable2024

I have set the core frequencies of the M4 in italics, as I have yet to confirm them, and there’s some confusion whether the maximum frequency of the P core is 4.3 or 4.4 GHz.

Each family of CPU cores has successively improved in-core performance, but the greatest changes are the result of increasing maximum core frequencies and core numbers. One crude but practical way to compare them is to total the maximum core frequencies in GHz for all the cores. Strictly speaking, this should take into account differences in processing units between P and E cores, but that also appears to have changed with each family, and is hard to compare. In the table, columns giving Σfn are therefore simply calculated as
(max P core frequency x P core count) + (max E core frequency x E core count)
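
For example, taking the M4 P core maximum as 4.5 GHz and the E core maximum as 2.6 GHz, the full M4 Pro gives (4.5 x 10) + (2.6 x 4) = 55.4 GHz, placing it in the highest of the classes described below.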

Plotting those sum core frequencies by variant for each of the four families provides some interesting insights.

mcoresbars2024

Here, each bar represents the sum core frequency of each full-spec variant. Those are grouped by variant type (Base, Pro, Max, Ultra), and within those in family order (M1 purple, M2 pale blue, M3 dark blue, M4 red). Many trends are obvious, from the relatively low performance expected of the M1 family (except the Ultra), to the changes between families, for example the marked differences of the M4 Pro and the M3 Max from their immediate predecessors.

Sum core frequencies fall into three classes: 20-30, 35-45, and greater than 55 GHz. Three of the four chips in the M1 family are in the lowest of those, with only the M1 Ultra reaching the highest. The M4 is the first Base variant to reach the middle class, thanks in part to its additional two E cores. Two of the M4 variants (Pro and Max) have already reached the highest class, and any M4 Ultra would reach far above the top of the chart at 128 GHz.

Real-world performance will inevitably differ, and vary according to benchmark and app used for comparison. Although single-core performance has improved steadily, apps that only run in a single thread and can’t take advantage of multiple cores are likely to show little if any difference between variants in each family.

Game Mode is also of interest for those considering the two versions of the M4 Base, with 4 or 6 E cores. This is because that mode dedicates the E cores, together with the GPU, to the game being played. It’s likely that games that are more CPU-bound will perform significantly better on the six E cores of the 10-Core version of the iMac, which also comes with a 10-core GPU and four Thunderbolt 4 ports.

Memory and GPU

Memory bandwidth is also important, although for most apps we should assume that Apple’s engineers match that with the likely demand from the CPU, GPU, neural engine, and other parts of the chip. There will always be some threads that are more memory-bound, whose performance will be more dependent on memory bandwidth than on CPU or GPU cores.

Although Apple claims successive improvements in GPU performance, the range of GPU cores has run from 8 in Base chips up to 32-40 in Max chips. Where the Max variants come into their own is support for multiple high-res displays, and challenging video editing and processing.

Thunderbolt and USB 3

The other big difference in these Macs is support for the new Thunderbolt 5 standard, available only in models with M4 Pro or M4 Max chips; Base variants still only support Thunderbolt 4. Although there are currently almost no Thunderbolt 5 peripherals available apart from an abundant supply of expensive cables, by the end of this year there should be at least one range of SSDs and one dock shipping.

As ever with claimed Thunderbolt performance, the figures given don’t tell the whole story. Although both TB4 and USB4 claim ‘up to’ 40 Gb/s transfer rates, in practice external SSD performance differs significantly, with Thunderbolt topping out at about 3 GB/s and USB4 reaching up to 3.4 GB/s. TB5 won’t deliver the whole of its claimed maximum of 120 Gb/s to a single storage device, and current reports are that it will only achieve disk transfers at 6 GB/s, or twice TB4. However, in use that’s close to the expected performance of internal SSDs in Apple silicon Macs, and should make booting from a TB5 external SSD almost indistinguishable in terms of speed.

As far as external ports go, this widens the gap between the M4 Pro Mac mini’s three TB5 ports, which should now deliver 3.4 GB/s over USB4 or 6 GB/s over TB5, and its two USB-C ports that are still restricted to USB 3.2 Gen 2 at 10 Gb/s, equating to 1 GB/s, the same as in M1 models from four years ago.

My choice

With a couple of T2 Macs and a MacBook Pro M3 Pro, I’ve been looking to replace my original Mac Studio M1 Max. As it looks likely that an M4 version of the Studio won’t be announced until well into next year, I’m taking the opportunity to shrink its already modest size to that of a new Mac mini. What better choice than an M4 Pro with 10 P and 4 E cores and a 20-core GPU, and the optional 10 Gb Ethernet? I seldom use the fourth Thunderbolt port on the Studio, and have already ordered a Kensington dock to deliver three TB5 ports from one on the Mac, and I’m sure it will drive my Studio Display every bit as well as the Studio has done.

If you have also been tempted by one of the new Mac minis, I was astonished to discover that three-year AppleCare+ for it costs less than £100, two-thirds of what I pay each year for AppleCare+ on my MacBook Pro.

I look forward to diving deep into both my new Mac and Thunderbolt 5 in the coming weeks.

Check Writing Tools using AIR

By: hoakley
29 October 2024 at 15:30

Apple has made great play over the privacy provided in its new AI tools. If you’ve just updated your Apple silicon Mac to Sequoia 15.1 and are wondering how you can check on this for Writing Tools, this article explains how.

aisettings1

When running on a capable Mac with an M-series chip, macOS captures details of all AI use in its Apple Intelligence Report (AIR). You can control and access that from its new entry in Privacy & Security settings, where you’ll find it towards the end, just above the final Security section. Open that, and you’ll see you can set the Report Duration to 15 minutes, 7 days, or turn it off altogether. As report sizes can grow quickly with a little use of Writing Tools, I suggest you start off with 15 minutes, or you might get overwhelmed.

aisettings2

When you want to browse a report, simply click on the button to Export Activity, and save the AIR report.

Apple Intelligence Reports are written out to JSON files that can be viewed using a text editor if you don’t have a specialist JSON editor. They’re usually bulky, and much of their content may be encoded binary that’s of little meaningful use. However, at the start you’ll see a series of modelRequests.

Each modelRequest begins with the timestamp of the request, given in decimal seconds since 1970. That’s followed by a UUID, information on the prompt template used, and shortly after that is the text that was extracted and used by Writing Tools. For longer passages of text, you may see that it’s divided up into a series of shorter sections that match the paragraphs given in a summary.

After that input text, the language localisation is given, currently en_US as other variants and languages won’t be available until macOS 15.2 later this year. Next, the response is provided, as inserted into the Writing Tools or text window. That section ends with:

  • model, the name of the AI model used, such as com.apple.fm.language.instruct_server_v1.text_summarizer, and the version.
  • clientIdentifier, such as com.apple.WritingTools.xpc.WritingToolsViewService for normal use of Writing Tools in an app.
  • executionEnvironment, currently expected to be PrivateCloudCompute, which tells you where the AI processing took place.
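
Put together, a single modelRequest has roughly the following shape. This is an abbreviated illustration with hypothetical key names and placeholder values, not a verbatim record:

{
  "timestamp" : 1730217600.123,
  "identifier" : "8F2A4C6E-…",
  "text" : "…the text extracted for Writing Tools…",
  "locale" : "en_US",
  "response" : "…the summary or rewrite returned…",
  "model" : "com.apple.fm.language.instruct_server_v1.text_summarizer",
  "clientIdentifier" : "com.apple.WritingTools.xpc.WritingToolsViewService",
  "executionEnvironment" : "PrivateCloudCompute"
}

In Swift, those decimal timestamps convert with Date(timeIntervalSince1970: 1730217600.123).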

After the list of modelRequests, you’ll probably see a long series of privateCloudComputeRequests full of incomprehensible data for sepAttestations and provisioningCertificateChains, part of the validation information for use of PrivateCloudCompute. If this all seems a little long-winded, try looking in the logs when Writing Tools are in use!

I’m very grateful to Tim, who has drawn my attention to these reports, and points out that use of PrivateCloudCompute appears confined to macOS at the moment. A similar report is also available for iOS 18.1, but iPhones don’t appear to rely on PrivateCloudCompute in the same way.

We must remember that, while Apple considers Writing Tools now ready for general use, it’s still officially a beta-release, and over the coming months is likely to undergo significant change. This poses the question of whether Writing Tools will run on-device in the future, something only Apple can answer. What appears to happen at present is that the only local processing that takes place is tokenisation of text to prepare it for remote processing using Apple’s PrivateCloudCompute service, which actually performs the heavy lifting before returning its results to the Mac. However, macOS also appears to wake up the slumbering Neural Engine (ANE) for most Writing Tools services. Why that happens remains a mystery.

If you want to watch progress as AI features develop in macOS, you may find Apple Intelligence Reports a useful way to track that. If you do come across entries that seem to have used on-device services instead of PrivateCloudCompute, please let us know.

Last Week on My Mac: Writing Tools

By: hoakley
27 October 2024 at 16:00

Early this coming week those fortunate enough to have an Apple silicon Mac will be able to update Sequoia to version 15.1 and try out Writing Tools at last. I’ve been testing and using them since the first beta of 15.1 back when the daylight lasted well into the night. I’ve been surprised at how lukewarm some have been at their arrival, particularly as most of those comments have come from those who make a living from their writing.

Perhaps the best way to show what Writing Tools can do is with an example, using the text from yesterday’s article A brief history of Mac firmware, a little over a thousand fairly technical words. I concentrate here on Writing Tools’ summary modes rather than its rewriting features, although they’re also thoroughly sound and useful in the right context.

Summary

Producing a succinct single paragraph summary is a challenge for even the most literate of humans. I’m impressed with Writing Tools’ 62 word offering of:
Mac firmware has undergone significant changes, from the Macintosh ROM in Classic Macs to Open Firmware in PowerPC models and EFI in Intel Macs. EFI, later evolving to UEFI, provides hardware access and boot services, while Apple silicon Macs employ a verified chain of trust for enhanced security. Firmware updates are now integrated into macOS installers and updaters, ensuring security and accessibility.

Key points

With so much information in the original, a key point summary should provide a better reflection of my original thousand words. Rather than reproducing the full 250 words generated by Writing Tools, here’s an excerpt covering one of the more challenging sections:

  • Firmware Security Issue: Researchers demonstrated proof-of-concept attacks on Mac firmware, highlighting potential vulnerabilities.
  • Firmware Update Policy Change: Apple transitioned from separate firmware updates to delivering them through system upgrades and updates.
  • Firmware Security Concern: In 2017, Duo Labs researchers discovered many Macs running outdated firmware, raising security concerns.
  • Apple’s Response: Apple hired security researchers and developed eficheck, a tool to check and report firmware versions to Apple.

I don’t think I could better that.

List

The third type of summary suitable for that article is a detailed list. This strips everything else out to leave terse factual sentences. For the whole text, the list runs to 833 words, a reflection of how dense the original's information is. You're unlikely to want to use this with longer passages, but here's the same section given as key points above:

  • In March 2015, two security researchers from LegbaCore demonstrated proof-of-concept attacks on the BIOS of several computers.
  • The attacks could have been used to implant malicious code.
  • Later that year, Kovah and Trammell Hudson turned their attention to Macs.
  • They demonstrated a firmware worm named Thunderstrike 2.
  • For the first nine years of Intel Macs, Apple had provided EFI firmware updates separately from updates to OS X.
  • In 2015, Apple changed the way that it supplied firmware, delivering it only as part of system upgrades and updates.
  • Although older separate firmware updates are still available, those were the last.
  • In 2017, Rich Smith and Pepijn Bruienne of Duo Labs discovered that many Macs were running outdated firmware.
  • Their concern was about the security risk posed by outdated firmware.
  • Apple had already been busy hiring Xeno Kovah and Corey Kallenberg who started work there in November 2015, and Nikolaj Schlej, another firmware security researcher, who joined them the following August.
  • They developed a new tool eficheck, released in High Sierra on 25 September 2017.
  • eficheck checked current firmware against a local database of versions known to be ‘good’, and with the user’s permission sent a report to Apple in the event that it found discrepancies.

Table

The fourth summary option is to generate a table. Unfortunately, my example wouldn’t produce a useful table without substantial additional knowledge. However, I’ve found this useful on long passages from fiction, where it can summarise relationships between different characters, and similar tasks.

On device and on target

Once Sequoia 15.1 has been released and I’ve had a chance to explore the internals of Writing Tools further, I’ll look at its processing and energy costs. Two important features distinguish it from other contemporary AI tools: all data remains on-device throughout, and it’s primarily using your text rather than a large language model built from vast quantities of text garnered from around the internet.

Privacy doesn’t generally worry me particularly, as much of what I write on Macs is destined in some way or another to be published, whether it’s in an article here, one in the magazines that I write for, or source code that will be built into apps. However, I do take exception to others making money out of my labours without my express consent, so I’ll generally be only too happy to keep my AI on-device.

I also think it's important to draw a clear distinction between what Writing Tools offers, and the likes of ChatGPT. Now that I'm testing the Sequoia 15.2 beta, I've been looking at that contrast. While you can't ask Writing Tools questions (why would you want to when you have the whole text and its summaries?), I thought I'd see how ChatGPT answered one of my stock test questions for AI: what is the SSV?

At my first asking, ChatGPT didn’t have sufficient context, and told me that it’s a side-by-side vehicle, so I refined my question to what is the SSV in macOS?

Although much of its answer was correct and informative, the second sentence stated with complete confidence that the SSV was introduced in macOS Catalina, which is of course incorrect: Catalina has a read-only System volume, but not the Signed System Volume introduced in Big Sur. You'd only spot that serious factual error if you already knew the answer, though.

Give me Writing Tools and my own fact-checking, thank you.

How Macs boot securely, or can’t

By: hoakley
24 October 2024 at 14:30

Earlier this week, I explained how the Signed System Volume (SSV), Data volume and cryptexes are integrated into the boot volume group, to support a secure boot process. This article outlines how modern Macs tackle the problem of booting securely.

The aim of a secure boot process is to ensure that every step from the Boot ROM to the operating system is verified to be free of unauthorised change, and that the code loaded and run is as intended. A simple operating system might achieve that by running only code contained in a boot ROM, but that's woefully inadequate for any modern general-purpose operating system such as macOS, which also needs to be updated and upgraded during a Mac's lifetime. Thus the great bulk of macOS has to be loaded and run from mutable storage, now SSDs. That storage, and a great deal else, requires specialised cores with their own firmware, and features like the Secure Enclave. This is achieved in a cascade, where each step provides access to more of the Mac's hardware, until many of Sequoia's 670 kernel extensions are loaded and ready.
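If you're curious how many kernel extensions your own Mac ends up with, kmutil can list those currently loaded, in Big Sur and later. Counting lines gives a rough figure, as the output includes a header:
kmutil showloaded | wc -l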

Intel Mac without T2 chip

Older models of Macs without a T2 chip follow a classic and insecure process when booting. Their Boot ROM loads UEFI firmware, and that in turn loads boot.efi, the macOS booter, without performing any verification. The macOS booter then loads the prelinked kernel from disk, again without verifying it. When the kernel opens the SSV, any checks on that can only be cursory, as Recovery for these Macs doesn’t offer controls in the form of a Startup Security Utility.

In High Sierra (2017), Apple introduced eficheck to run periodic checks on the version and integrity of UEFI firmware, although those checks didn't take place during the boot process, and eficheck was discontinued in macOS 14 Sonoma.
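On systems that still included it, eficheck could also be run manually. This is from memory, so treat both the path and the option as assumptions to verify on an older Mac:
/usr/libexec/firmwarecheckers/eficheck/eficheck --integrity-check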

Intel Mac with T2 chip

These are the first Macs to support Secure Boot, thanks to their T2 chip, which is based on a variant of Apple’s A10 chip, dating back to 2016; the first model featuring a T2 chip was released at the end of 2017. As shown in the diagram at the end of this section, these Macs start their boot process with their Boot ROM verifying the iBoot ‘firmware’ for the T2 chip. That in turn verifies the kernel and its extensions for the T2, and that verifies the UEFI for the Intel side of the Mac.

Booting the Intel chipset proceeds similarly to older Intel Macs, but each step verifies the code to be run by the next, until its immutable kernel is loaded and boots the rest of macOS. Early in that stage, the kernel verifies the SSV before proceeding any further. Failure in any of those verifications halts the boot process, leaving the Mac, if you're lucky, in Recovery or T2 DFU mode.

SecureBoots1

This diagram compares boot processes in the three modern Mac architectures.

Apple silicon Mac

In the absence of any Intel chipset, Apple decided to implement its own Secure Boot, although there are options that could have allowed it to remain with UEFI. M-series chips tackle this in four steps:

  1. Boot ROM, which verifies the Low Level Bootloader (LLB).
  2. LLB, sometimes described as the first stage, which loads and boots some auxiliary cores, loads security policy, and verifies the second stage, iBoot.
  3. iBoot, which continues validations and verifications, including signatures and root hash of the SSV, before handing over to the kernel.
  4. The kernel, which boots macOS.

Apple silicon chips contain many specialist cores responsible for implementing hardware features such as Thunderbolt. Firmware for each of those has to be verified and loaded to boot those cores, a task performed by LLB and iBoot.

Security policy for each boot volume group is set in its LocalPolicy, and has to be loaded and validated by LLB. The SSV is verified by iBoot prior to handing over to the kernel, to ensure the file system has been checked before it’s mounted.

When running in Full Security, the only kernel extensions to be loaded are those supplied in macOS, forming the standard Boot Kernel Collection. If the user has set that boot volume group to Reduced Security and opted for it to load third-party kernel extensions, those are contained in the Auxiliary Kernel Collection, and validated by iBoot. Once the kernel and extensions collection have been loaded, the latter is locked in memory with SCIP (System Coprocessor Integrity Protection) prior to iBoot handing over to the kernel to boot.
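On an Apple silicon Mac, you can review the security mode and LocalPolicy currently in force using bputil's display option; as a sketch, assuming it's run from an admin account:
sudo bputil -d
Its output should report whether that boot volume group is set to Full or Reduced Security, and whether an Auxiliary Kernel Collection is permitted.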

As with T2 Macs, any failure of verification during Secure Boot should leave that Mac either in Recovery mode, or in DFU mode ready to be connected to another Mac for its firmware to be refreshed, or restored from scratch.

External boot disks

T2 Secure Boot doesn't support booting from an external disk, which is only allowed by reducing the security setting in Startup Security Utility. When designing its M-series Macs, Apple wanted them to benefit from Secure Boot when starting up from an external disk, and incorporated that from the outset.

This is implemented by the Mac always starting the boot process internally, with the LLB and iBoot being run from internal storage. Bootable external disks must have an ‘owner’ to associate them with a LocalPolicy loaded by LLB. That enables iBoot to validate the Boot Kernel Collection, SSV and other components in the external boot volume group, then to hand over to its kernel to boot macOS from that disk, instead of the internal SSD.

It took a few versions of Big Sur before this worked reliably, but it should now be robust, provided that it's set up correctly by the user. Despite that, it's often incorrectly claimed that Apple silicon Macs can only start up from external disks by reducing security.

Further reading

Apple’s Platform Security Guide:
Boot process for an Intel-based Mac
Boot process for a Mac with Apple silicon
Signed system volume security

This blog:
Booting an M1 Mac from hardware to kexts: 1 Hardware
Booting an M1 Mac from hardware to kexts: 2 LLB and iBoot
Booting an M1 Mac from hardware to kexts: 3 XNU, the kernel
Make a Ventura bootable external disk for an Apple silicon Mac
Booting macOS on Apple silicon: LocalPolicy
Ownership of Apple silicon Macs matters: how it can stop external bootable disks
Booting macOS on Apple silicon: Multiple boot disks

A brief history of FileVault

By: hoakley
19 October 2024 at 15:00

Encrypting all your data didn’t become a thing until well after the first release of Mac OS X. Even then, the system provided little support, and most of us who wanted to secure private data relied on third-party products like PGP (Pretty Good Privacy).

pgp2003

FileVault 1

Apple released the first version of FileVault, now normally referred to as FileVault 1 or Legacy FileVault, in Mac OS X 10.3 Panther in 2003. Initially, that only encrypted a user's Home folder into a sparse disk image; in 10.5 Leopard it changed to using sparse bundles instead. Those caused problems with Time Machine backups, which also arrived in Leopard, and FileVault 1 proved so easy to crack that in 2006 Jacob Appelbaum and Ralf-Philipp Weinmann released a tool, VileFault, to decrypt its disk images.

filevault2004

FileVault 1 was controlled in the Security pane of System Preferences, shown here in 2004.

newuser2004

Each new user added in the Accounts pane could have their Home folder stored in an encrypted disk image. Encryption keys were based on the user’s password, with a master password set for all accounts on the same Mac.

FileVault 2

FileVault 2 was introduced in Mac OS X 10.7 Lion in 2011, and at last provided whole-volume encryption based on the user password. Encryption was performed by the CPU, using the XTS-AES mode of AES with a 256-bit key. At that time, more recent Intel processors had instructions to make this easier and quicker, but all data written to an encrypted volume had to be encrypted before it was written to disk, and all data read from it had to be decrypted before it could be used. This imposed an overhead of around 3%, which was more noticeable on slower storage such as hard disks, and with slower Macs.

Apple didn’t implement this by modifying the HFS+ file system to add support for encryption, but by adding encryption support to CoreStorage, the logical volume manager. In theory this would have enabled it to encrypt other file systems, but I don’t think that was ever done.

Turning FileVault on and off was quite a pain, as the whole volume had to be encrypted or decrypted in the background, a process that could take many hours or even days. As a result, most users tried to avoid doing this too often, so while FileVault 2 was secure and effective, it wasn't as widely used as it should have been.
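FileVault 2 could also be managed from the command line with fdesetup, which if I recall correctly arrived in OS X 10.8 Mountain Lion, and remains available today:
fdesetup status
sudo fdesetup enable
The first reports whether FileVault is on or off; the second starts the lengthy initial encryption described above, and returns a recovery key.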

These screenshots step through the process of enabling FileVault in 2017.

lockratsec

Control was in the FileVault tab in System Preferences.

filevault01

iCloud Recovery was added as an alternative to the original recovery key.

filevault02

Encryption began following a restart, and then proceeded in the background for however long it took. Shrewd users enabled FileVault when a minimum had been installed to the startup volume, to minimise time taken for encryption.

filevault03

With a minimal install, it was possible to complete initial encryption in less than an hour. With full systems, it could take days if you were unlucky.

Although FileVault has had a few security glitches, it has done its job well. Perhaps its greatest threat came in the early days of macOS Sierra, when Ulf Frisk developed a simple method for retrieving the FileVault password for any Mac with a Thunderbolt port. An attacker could connect a special Thunderbolt device to a sleeping or locked Mac, force a restart, then read the password off within 30 seconds. This exploited a vulnerability in the handling of DMA, and was addressed by enabling VT-d in EFI, in Sierra 10.12.2 and 10.12.4.

Hardware encryption

The next big leap forward came at the end of 2017, with the release of the first Macs with T2 chips, as intermediates on the road to Apple silicon. One of Apple's goals in T2 and Apple silicon chips was to make encrypted volumes the default. To achieve that, T2 and M-series chips incorporate a Secure Enclave and perform encryption and decryption in hardware, rather than using CPU cycles.

The Secure Enclave incorporates the storage controller for the internal SSD, so all data transferred between CPU and SSD passes through an encryption stage in the enclave. When FileVault is disabled, data on protected volumes is still encrypted using a volume encryption key (VEK), in turn protected by a hardware key and a xART key that protects against replay attacks.

filevaultpasswords1

When FileVault is enabled, the same VEK is used, but it's protected by a key encryption key (KEK), and the user password is required to unwrap that KEK, so protecting the VEK used to perform encryption/decryption. This means that the user can change their password without the volume having to be re-encrypted, and allows the use of special recovery keys in case the user password is lost or forgotten. Keys are only handled in the Secure Enclave.
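Those wrappers are reflected in the list of 'crypto users' able to unlock a given APFS volume, which you can inspect with diskutil; disk3s5 here is a hypothetical identifier standing in for your Data volume:
diskutil apfs listCryptoUsers disk3s5
Expect to see entries such as a local Open Directory user and, where enabled, a personal or institutional recovery key.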

Securely erasing an encrypted volume, also performed when 'erasing all content and settings', results in the Secure Enclave deleting its VEK and the xART key, rendering the residual volume data inaccessible even to the Secure Enclave itself. This ensures that there's no need to delete or overwrite any residual data from an encrypted volume: once the volume's encryption key has been deleted, its previous contents are immediately unrecoverable.

eacas

Coverage of boot volumes by encryption varies according to the version of macOS. Prior to macOS Catalina, where macOS has a single system volume, the whole of that is encrypted; in Catalina, both System and Data volumes are encrypted; in Big Sur and later, the Signed System Volume (SSV) isn’t encrypted, nor are Recovery volumes, but the Data volume is.
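diskutil shows which volumes in each container are encrypted, and whether FileVault is enabled on them:
diskutil apfs list
In Big Sur and later, expect the Data volume to be reported as encrypted when FileVault is on, while the System volume isn't.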

External disks

Hardware encryption and FileVault’s ingenious tricks aren’t available for external disks, but APFS was designed to incorporate software encryption from the outset. As with internal SSDs, the key used to encrypt the volume contents isn’t exposed, but accessed via a series of wrappers, enabling the use of recovery keys if the user password is lost or forgotten. This involves a KEK and VEK in a similar manner to FileVault on internal SSDs. As the file system on the volume is also encrypted, after the KEK and VEK have been unwrapped, the next task in accessing an encrypted volume is to decrypt the file system B-tree using the VEK.
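For an external disk, the equivalent of turning FileVault on can also be performed with diskutil. This is a sketch, assuming a volume mounted at /Volumes/External, and using the 'disk' crypto user so that a passphrase protects the volume itself rather than being tied to a macOS user account:
sudo diskutil apfs enableFileVault /Volumes/External -user disk -stdinpassphrase
Initial encryption then proceeds in the background, just as it does for an internal Data volume.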

Enabling FileVault has been streamlined in recent years, as shown here in System Settings last year for an external SSD, which therefore doesn't use hardware encryption.

filevault1

FileVault control has moved to Privacy & Security in System Settings.

filevault2

The choice of iCloud Recovery or a recovery key remains.

filevault3

Because only the Data volume is now encrypted, enabling FileVault before populating the Home folder allows encryption on an external disk to be almost instantaneous.

Virtual machines

The most recent enhancement to FileVault protection extends support to Sequoia virtual machines running on Sequoia hosts. Apple hasn’t yet explained how that one works, although I suspect the word exclave is likely to appear in the answer.

If your Mac has a T2 or Apple silicon chip and you haven’t enabled FileVault, then you’re missing one of the Mac’s best features.
